Backing Up Cosmos DB (Azure Data Factory Edition)

Now that ADF is a trusted service, I wanted to document the state of my current solution, since I’ve been able to dump the hack-y PowerShell script I put together. I haven’t been able to get the level of abstraction I’d really like to see, but overall I’m pretty happy with the solution (and I still get to include a hack-y PowerShell script). My solution consists of

  • a control pipeline,
  • a notification pipeline and
  • 1 pipeline for every Cosmos DB service I want to work with. (This is because I wasn’t able to figure out a way to abstract the data source connection for the Copy Data task.)

We’ll start at the backup pipeline and then back out to the larger solution:

[Screenshot: the Cosmos DB backup pipeline tasks]

The backup pipeline is relatively simple: I set a date string that ends up being part of the backup filename, then use the Copy Data task to pull the contents of the Cosmos DB container into blob storage. We then clean up the backup files (we are keeping 2 weeks of backups) and then clean up the logs generated by that delete step. If either of the two main steps fails (the backup or the backup cleanup) we use an Execute Pipeline task to kick off a notification pipeline. (ADF doesn’t natively send emails, but it was very easy to leverage a logic app to send any necessary notifications: see this mssqltips article for details on how to set that up.) This pipeline is parameterized with the name of the Cosmos DB service, the database name and the container name. These are used to create the folder structure and file name for the backup in blob storage (formatted as [service name][database name][container name]_[YYYYMMDD_HHMM].json). This was where I couldn’t seem to abstract away the Cosmos DB service itself: I have 1 pipeline per known Cosmos DB service. Any time I need to add a new service I define a single data source and clone an existing pipeline to point at the new data source.
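If you want to kick one of these backup pipelines off by hand for testing, passing in the same parameters the control pipeline would, something roughly like this works with the Az.DataFactory module. This is just a sketch: the factory, pipeline and parameter names below are placeholders and may not match what you deployed.

# Hypothetical example: manually trigger one service's backup pipeline with the
# same parameters the control pipeline would pass in. All names are placeholders.
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName '<data factory resource group>' `
    -DataFactoryName '<data factory name>' `
    -PipelineName '<backup pipeline for cosmos-db-service-1>' `
    -Parameter @{ ResourceName  = 'cosmos-db-service-1';
                  DatabaseName  = 'myFirstDatabase';
                  ContainerName = 'myFirstContainer' }

The cmdlet returns the pipeline run ID, which you can then watch in the ADF monitoring blade.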

Each backup pipeline is called by my Dynamic Backup control pipeline, which has two tasks:

[Screenshot: the tasks from the Dynamic Backup control pipeline]

The Lookup Backuplist task grabs a JSON-formatted list of Cosmos DB services, along with any associated databases and containers, from a blob storage data source. This is where I had to return to a hack-y PowerShell script. There isn’t a handy way for ADF to get a token to talk to the Cosmos DB services and query for a list of databases or containers. Sure, I could try to do it via REST calls and the Webhook activity, but that was going to be a huge headache, plus our Cosmos DB is already IP restricted, so I wasn’t 100% sure whether things were failing because I had written them terribly or because the communication just wasn’t going to happen. So I now have a PowerShell script that I run manually every so often; it picks up any new databases and containers that were deployed to known services since the last time I checked and adds them to the list. This was mostly just a bit of CYA: when I went to deploy the ADF backups I discovered there was a new container in one of the databases that no one had thought to notify me of. While I’m not necessarily in the loop for new databases or containers on existing Cosmos DB services, it seems a safe bet that I’ll be informed of any new Cosmos DB services themselves. Here is the PS script:

<#
Author: Josh Smith
Created: 2020-03-02
Purpose: Will pull all databases and containers from known CosmosDB resources.
Cosmos db services must be known and added to the array below (paired with the
associated resource group).
#>
<# add cosmos db resource and groups here as needed: #>
$cosmosDBServices = @( @{resource = '<cosmos db service 1>'; group = '<resource group name>'}, `
    @{resource = '<cosmos db service 2>'; group = '<resource group name>'})
$connectAz = Read-Host "Did you already connect to Azure [Y/N]?"
# Sadly the need to authenticate with Azure means this script needs to be run manually.
if ($connectAz -eq 'N'){Connect-AzAccount -subscriptionName "<name of your Azure subscription>"}
$dbResourceType = "Microsoft.DocumentDb/databaseAccounts/apis/databases"
$containerResourceType = "Microsoft.DocumentDb/databaseAccounts/apis/databases/containers"
$BackupStorageName = "<Azure storage account name>"
$StorageResourceGroupName = "<resource group for azure storage>"
$apiVersion = "2015-04-08"
$finalJson = ""
foreach($c in $cosmosDBServices)
{
    $ResourceName = $c.resource + "/sql/"
    $resourceGroup = $c.group
    $databases = Get-AzResource -ResourceType $dbResourceType -ApiVersion $apiVersion `
        -ResourceGroupName $resourceGroup -Name $ResourceName | `
        Select-Object -ExpandProperty Properties | Select-Object id
    foreach($d in $databases)
    {
        $databaseName = $d.id
        $db = $ResourceName + $databaseName
        $containers = Get-AzResource -ResourceType $containerResourceType `
            -ApiVersion $apiVersion -ResourceGroupName $resourceGroup `
            -Name $db | Select-Object -ExpandProperty Properties | Select-Object id
        foreach($con in $containers)
        {
            $container = $con.id
            $con = $null
            $json = $null
            # build a { ContainerName, DatabaseName, ResourceName } object for this container
            $conObject = New-Object -TypeName psobject
            $conObject | Add-Member -MemberType NoteProperty -Name ContainerName -Value $container
            $conObject | Add-Member -MemberType NoteProperty -Name DatabaseName -Value $databaseName
            # trim the trailing "/sql/" back off of the resource name
            $conObject | Add-Member -MemberType NoteProperty -Name ResourceName -Value $ResourceName.Substring(0, $ResourceName.Length - 5)
            $json = $conObject | ConvertTo-Json
            $finalJson = $finalJson + $json + ",
" # I'm a sucker for lazy line breaks
        }
    }
}
# lose the last line break and comma, wrap the objects in brackets so the file is a valid JSON array, then output:
$finalJson = $finalJson.Substring(0, $finalJson.Length - 2)
$finalJson = "[`n" + $finalJson + "`n]"
$finalJson | Out-File -FilePath '.\CDBBackup.json' -Force
# connect to the blob storage and then push the file up:
$storageAccount = Get-AzStorageAccount -ResourceGroupName $StorageResourceGroupName `
-Name $BackupStorageName
$storageContext = $storageAccount.Context
Set-AzStorageBlobContent -File '.\CDBBackup.json' -Container 'backuplist' `
-Blob 'CDBBackups.json' -Context $storageContext -Force
# don't need that file hanging around here:
Remove-Item .\CDBBackup.json

Which will give me output like this:

[
    {
        "ContainerName" : "myFirstContainer",
        "DatabaseName" : "myFirstDatabase",
        "ResourceName" : "cosmos-db-service-1"
    },
    {
        "ContainerName" : "mySecondContainer",
        "DatabaseName" : "myFirstDatabase",
        "ResourceName" : "cosmos-db-service-1"
    },
    {
        "ContainerName" : "myFirstContainerMarkII",
        "DatabaseName" : "mySecondDatabase",
        "ResourceName" : "cosmos-db-service-2"
    }
]

[Screenshot: a single Switch task]

For each returned item in the backup list, I simply run it through a Switch task based on the name of the Cosmos DB service from my collection. If none of the names match we simply don’t do anything with it (the default case). Once I find a service that is “known” (i.e. I have a data source for it) I call the pipeline I deployed for that service via an Execute Pipeline task and pass the database name and container name in as parameters (along with the service name for consistency’s sake; it could be hard coded into the pipeline, but I’m certain I’ll find a way to make this more dynamic in the future).

So that’s where we’re at: ideally either ADF will get some built-in methods to pull back metadata about the service (webhooks and REST really sounds like a problem that only needs to be solved once) or I’ll find a way to run that PowerShell script in an automated fashion to populate the list.
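If I do go the automation route, the most likely shape is something like an Azure Automation runbook using a managed identity, so the interactive Connect-AzAccount prompt goes away. A rough sketch of just the login portion (this assumes the Automation account has a system-assigned managed identity with read access to the Cosmos DB accounts and write access to the storage account, which is not something the script above sets up for you):

# Hypothetical Azure Automation runbook header: replaces the Read-Host / Connect-AzAccount
# section of the script above. Assumes a system-assigned managed identity with reader rights
# on the Cosmos DB accounts and contributor rights on the backup storage account.
Connect-AzAccount -Identity | Out-Null
Set-AzContext -Subscription "<name of your Azure subscription>" | Out-Null
# ...the rest of the backup-list script runs unchanged from here...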

Also, a very appreciative shout out to Cathrine Wilhelmsen (b|t): without her Christmas-time ADF blog marathon I would have spent much longer fumbling around than I ultimately did.

Backing Up A Cosmos DB with the Cosmos DB Migrator Tool

Cosmos DB really is an amazing datastore and, even better (you might be thinking), Microsoft handles the backups for you. Which is true. They take backups every four hours and keep the last two. If you need anything recovered from the database you’d better hope that you notice within that window *and* get a ticket open with Microsoft to get it fixed. This being the case, Microsoft helpfully recommends that, in addition to the default backups that come with the Cosmos DB service, you export your data to a secondary location as needed to meet your organization’s SLA. Data Factory to the rescue, right? Again, almost.

Unfortunately, if you are restricting access to your Cosmos DB service based on IP address (a reasonable security measure), then Data Factory won’t work as of this writing, as Azure Data Factory doesn’t operate like a trusted Azure service and presents an IP address from somewhere in the data center where it is spun up. Thankfully they are working on this. In the meantime, however, the next best thing is to use the Cosmos DB migration tool (scripts below) to dump the contents to a location where they can be retained as long as needed. Be aware that, in addition to the RU cost of reading the data, if you bring these backups out of the data center where the Cosmos DB lives you’ll also incur egress charges on the data.

The script reads from a custom json file that contains the Cosmos DB service(s), as well as the databases and collections that need to be backed up. This file will have the read-only keys to your Cosmos DB services in it, so it should be encrypted on disk to limit the number of people who can access it (there’s a rough sketch of one way to do that after the sample file below).

[
    {
        "_comment" : ["This is a sample object that should be modified for deployment.",
                      "Connect strings will need to be inserted and correct service, database and collection",
                      "names included as well as setting the database backup flag to true as needed."],
        "service" : {
            "name" : "<name of your cosmos db service>",
            "connectString" : "<read-only connect string here>",
            "databases" : [
                {"name" : "database1",
                 "backupFlag" : true,
                 "collections" : [{"name" : "collection1"},
                                  {"name" : "collection2"}]
                },
                {"name" : "database2",
                 "backupFlag" : false,
                 "collections" : [{"name" : "collection1"}]
                },
                {"name" : "database3",
                 "backupFlag" : true,
                 "collections" : [{"name" : "collection1"}]
                }
            ]
        }
    },
    {
        "service" : {
            "name" : "<second cosmos db service>",
            "connectString" : "<second service read-only key>",
            "databases" : [
                {"name" : "database1",
                 "backupFlag" : false,
                 "collections" : [{"name" : "collection1"}]
                }
            ]
        }
    }
]
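One option for encrypting the file at rest is PowerShell’s CMS cmdlets, assuming a certificate with the Document Encryption EKU is available to whoever (or whatever) runs the backups. This is just a sketch (the certificate subject and paths are placeholders), and the backup script below would need its Get-Content line swapped for Unprotect-CmsMessage:

# Hypothetical: encrypt the config at rest with a document-encryption certificate.
Protect-CmsMessage -Path 'C:\temp\CosmosBackup.json' -To 'cn=<document encryption cert subject>' `
    -OutFile 'C:\temp\CosmosBackup.json.cms'
Remove-Item 'C:\temp\CosmosBackup.json'

# The backup script would then read and decrypt it like this instead of Get-Content -Raw:
$backups = Unprotect-CmsMessage -Path 'C:\temp\CosmosBackup.json.cms' | ConvertFrom-Json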

Once the config file is in place, the following PowerShell will read the file, back up the appropriate services, databases and collections, and remove any old backups that are no longer needed.

<#
This script will call the cosmos db migration tool with the correct parameters based
on a list of databases that need to be backed up. It depends on a json param file that
contains the list of cosmos db services that have databases that require backup.
This script has a couple of dependencies:
(1) the dt.exe that it runs (the Cosmos DB migration tool, along with the associated files/dlls
from the compiled folder) needs to be locally available.
(2) A configured json file to list out the cosmos services and databases that require backups.
Care should be taken (encrypt the file and provide access to the keys to a limited set of users)
as the read-only keys for the cosmos-db service will be stored here.
#>
$retentionDays = 14
$backupList = "C:\temp\CosmosBackup.json" # point these at the appropriate folders
$backupPath = 'C:\temp\'
$pathtoEXE = 'C:\temp\drop\dt.exe'
$backups = Get-Content -Raw -Path $backupList | ConvertFrom-Json
foreach($s in $backups.service)
{
    $sName = $s.name
    Write-Output "Processing service $sName..."
    $cosmosConnectString = $s.connectString
    foreach($d in $s.databases)
    {
        $database = $d.name
        if ($d.backupFlag -eq $true)
        {
            Write-Output "  Backing up collections in $database..."
            foreach($c in $d.collections)
            {
                $collection = $c.name
                Write-Output "    Backing up collection $collection."
                <# configure export arguments #>
                $connectString = "$cosmosConnectString;Database=$database"
                $date = Get-Date
                $dateString = $date.ToString('yyyyMMdd')
                $dateString = $dateString + '_' + $date.ToString('HHmm') # 24-hour time to match the [YYYYMMDD_HHMM] naming convention
                $targetFile = "$collection`_$dateString.json"
                # build the dt.exe argument string (use $dtArgs rather than the automatic $args variable)
                $dtArgs = "/ErrorLog:" + "$backupPath\backups\$sName\$database\$collection`_$dateString`_log.csv /OverwriteErrorLog"
                $dtArgs = $dtArgs + " /ErrorDetails:Critical /s:DocumentDB /s.ConnectionString:$connectString"
                $dtArgs = $dtArgs + " /s.ConnectionMode:Gateway /s.Collection:$collection /s.Query:`"SELECT * FROM c`""
                $dtArgs = $dtArgs + " /t:JsonFile /t.File:$backupPath\backups\$sName\$database\$targetFile /t.Prettify /t.Overwrite"
                <# now we are all configured: run the collection backup #>
                Start-Process -FilePath $pathtoEXE -ArgumentList $dtArgs -Wait
            }
        }
        else {
            Write-Output "  Skipping $database, backupFlag <> true."
        }
    }
}
$purgeDate = (Get-Date).AddDays(-1 * ($retentionDays + 1))
<# remove old logs and backups #>
Get-ChildItem -Path $backupPath -Recurse -Include *.json -Exclude *cosmosBackup.json | Where-Object {$_.CreationTime -lt $purgeDate} | Remove-Item
Get-ChildItem -Path $backupPath -Recurse -Include *.csv | Where-Object {$_.CreationTime -lt $purgeDate} | Remove-Item

While this is not ideal, if you need to start backing up your Cosmos DBs immediately this will do the trick until Microsoft finishes incorporating Data Factory into their trusted services.
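If you want the script to run on a schedule, a plain Windows scheduled task is one option; here is a rough sketch, where the task name, paths, account and schedule are all placeholders:

# Hypothetical scheduled task to run the backup script nightly; adjust paths, account and schedule as needed.
$action  = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-NoProfile -ExecutionPolicy Bypass -File C:\temp\CosmosDBBkup.ps1'
$trigger = New-ScheduledTaskTrigger -Daily -At 2am
Register-ScheduledTask -TaskName 'CosmosDB Backup' -Action $action -Trigger $trigger `
    -User '<service account>' -Password '<service account password>'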

[Edited to add 10/3/2019:] Just yesterday it looks like MS updated their timeline for adding the needed functionality to ADF.

Partial Backups With Dell Litespeed

We currently use Dell’s Litespeed for SQL Server for our SQL backup compression and encryption needs. It has an intuitive GUI, manages log shipping, and we are generally pretty happy with it. I’ve got a couple of problem databases where I don’t want the vanilla, out-of-the-box full backups. A couple of our data warehouse dbs are now over a terabyte in size and continuing to grow, and we’re implementing partitioning to help with index and statistics maintenance as well as backups. The downside is that there is no GUI for partial backups with Litespeed, so I’ve had to fumble through it with the help of their support. Here, then, are the working scripts for taking partial backups (and restores) of databases using the Litespeed stored procedures. Our database is in simple recovery mode and we’ll be backing up a read-only filegroup as well as all our active files.

First, back up your read-only filegroup. You only need to back this up once (or at least until an end user has you set it to read-write and updates records; when that happens, be sure to set the filegroup back to read-only and then back it up again).

EXEC master.dbo.xp_backup_database @database = '<database name>',
@filename = 'Y:\SQL_BACKUPS\<name of backup file>', @filegroup = '<filegroup to backup>'
, @compressionlevel = 4, @cryptlevel = 8, @encryptionkey = N'IMMAENCRYPTCHYU'
, @verify = 1, @returndetails = 1, @init = 1, @with = 'CHECKSUM';

The xp_backup_database procedure also has a @read_write_filegroups parameter we’ll use later. It’s important to note that if you include it here with a value of 0 (which seems an intuitive thing to do) you will essentially take a full backup rather than a backup of just the intended filegroup. Including @returndetails is optional but will instruct Litespeed to return a row of information about the backup as it was processed.

Once you have your read-only files backed up you can back up the read-write file groups daily (or hourly or whatever works for your SLA) with the following code. Notice we’ve removed the specific filegroup parameter and added the @read_write_filegroups parameter.

EXEC master.dbo.xp_backup_database @database = '<database name>', @filename = 'Y:\SQL_BACKUPS\<name of backup file>'
, @compressionlevel = 4, @cryptlevel = 8, @encryptionkey = N'IMMAENCRYPTCHYU'
, @read_write_filegroups = 1, @verify = 1, @returndetails = 1, @init = 1, @with = 'CHECKSUM';

Once you’ve taken a full backup you can do differential backups of the read-write file groups as well by including an additional @with:

EXEC master.dbo.xp_backup_database @database = '', @filename = 'Y:\SQL_BACKUPS\'
, @compressionlevel = 4, @cryptlevel = 8, @encryptionkey = N'IMMAENCRYPTCHYU'
, @read_write_filegroups = 1, @verify = 1, @returndetails = 1
, @init = 1, @with = 'CHECKSUM', @with = 'DIFFERENTIAL';

Restore scripts look very similar to the backup scripts. If you do not have differential backups of your read-write filegroups, change 'NORECOVERY' to 'RECOVERY'.

EXEC master.dbo.xp_restore_database @database = '', @filename = ''
, @encryptionkey = N'IMMAENCRYPTCHYU', @read_write_filegroups = 1
, @with = 'NORECOVERY', @returndetails = 1;

Restoring a read-write differential backup (RECOVERY should be changed to NORECOVERY if there are multiple differentials to apply):

EXEC master.dbo.xp_restore_database @database = '', @filename = ''
, @encryptionkey = N'IMMAENCRYPTCHYU', @read_write_filegroups = 1
, @with = 'RECOVERY', @returndetails = 1;

Finally, the read-only filegroups can be restored once the database has been brought online by restoring all of the read-write filegroup backups:

EXEC master.dbo.xp_restore_database @database = '', @filename = ''
, @filegroup = '', @encryptionkey = N'IMMAENCRYPTCHYU', @with = 'RECOVERY'
, @with = 'MOVE '''' TO ''''', @returndetails = 1