Backing Up A Cosmos DB with the Cosmos DB Migrator Tool

Cosmos DB really is an amazing datastore, and even better (you might be thinking), Microsoft handles the backups for you. Which is true. They take backups every four hours and keep the last two. If you need anything recovered from the database, you’d better hope that you notice within that window *and* get a ticket open with Microsoft to get it fixed. This being the case, Microsoft helpfully recommends that, in addition to the default backups that come with the Cosmos DB service, you export your data to a secondary location as needed to meet your organization’s SLA. Data Factory to the rescue, right? Again, almost.

Unfortunately, if you are restricting access to your Cosmos DB service based on IP address (a reasonable security measure), then Data Factory won’t work as of this writing: Azure Data Factory doesn’t operate like a trusted Azure service and instead presents an IP address from somewhere in the data center where it is spun up. Thankfully they are working on this. In the meantime, however, the next best thing is to use the Cosmos DB migration tool (scripts below) to dump the contents to a location where they can be retained as long as needed. Be aware that, in addition to the RU cost of reading the data, you’ll also incur egress charges if you move these backups out of the data center where the Cosmos DB lives.

The script reads from a custom JSON file that lists the Cosmos DB service(s), as well as the databases and collections that need to be backed up. This file will have the read-only keys to your Cosmos DB services in it, so it should be encrypted on disk and access to it limited to as few people as possible.

[
  {
    "_comment": [
      "This is a sample object that should be modified for deployment.",
      "Connect strings will need to be inserted and correct service, database and collection",
      "names included as well as setting the database backup flag to true as needed."
    ],
    "service": {
      "name": "<name of your cosmos db service>",
      "connectString": "<read-only connect string here>",
      "databases": [
        {
          "name": "database1",
          "backupFlag": true,
          "collections": [{"name": "collection1"}, {"name": "collection2"}]
        },
        {
          "name": "database2",
          "backupFlag": false,
          "collections": [{"name": "collection1"}]
        },
        {
          "name": "database2",
          "backupFlag": true,
          "collections": [{"name": "collection1"}]
        }
      ]
    }
  },
  {
    "service": {
      "name": "<second cosmos db service>",
      "connectString": "<second service read-only key>",
      "databases": [
        {
          "name": "database1",
          "backupFlag": false,
          "collections": [{"name": "collection1"}]
        }
      ]
    }
  }
]
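Before pointing a backup script at a config like this, it can help to list exactly which collections the flags select. The following is a minimal sketch, not part of the original script: `Get-BackupTarget` is a hypothetical helper name, and the inline JSON is a trimmed stand-in for the real file with placeholder values.

```powershell
# Hypothetical helper (not part of the original script): flattens the parsed
# config into one object per collection whose database has backupFlag = true.
function Get-BackupTarget {
    param(
        [Parameter(Mandatory)]
        [object[]] $Config   # output of ConvertFrom-Json on the config file
    )
    foreach ($entry in $Config) {
        $service = $entry.service
        foreach ($db in $service.databases) {
            # Skip databases that are not flagged for backup.
            if ($db.backupFlag -ne $true) { continue }
            foreach ($coll in $db.collections) {
                [pscustomobject]@{
                    Service    = $service.name
                    Database   = $db.name
                    Collection = $coll.name
                }
            }
        }
    }
}

# Trimmed stand-in for the config file; names and connect strings are placeholders.
$sampleJson = @'
[
  { "service": { "name": "svc1", "connectString": "<read-only connect string here>",
      "databases": [
        { "name": "database1", "backupFlag": true,
          "collections": [ { "name": "collection1" }, { "name": "collection2" } ] },
        { "name": "database2", "backupFlag": false,
          "collections": [ { "name": "collection1" } ] } ] } }
]
'@

$targets = @(Get-BackupTarget -Config ($sampleJson | ConvertFrom-Json))
$targets | Format-Table -AutoSize
```

Running this against the real file (via `Get-Content -Raw` and `ConvertFrom-Json`) gives a quick read-only check that no RUs will be spent on databases you meant to leave out.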

Once the config file is in place, the following PowerShell will read the file, back up the flagged services, databases and collections, and remove any old backups that are no longer needed.

<#
This script will call the cosmos db migration tool with the correct parameters based
on a list of databases that need to be backed up. It depends on a json param file that
contains the list of cosmos db services that have databases that require backup.
This script has a couple of dependencies:
(1) the dt.exe that it runs (the cosmos db migration tool and we assume the associated files/dlls
in the compiled folder) needs to be locally available.
(2) A configured json file to list out the cosmos services and databases that require backups.
Care should be taken (encrypt the file and provide access to the keys to a limited set of users)
as the read-only keys for the cosmos-db service will be stored here.
#>
$retentionDays = 14
$backupList = "C:\temp\CosmosBackup.json" # point these at the appropriate folders
$backupPath = 'C:\temp\'
$pathtoEXE = 'C:\temp\drop\dt.exe'
$backups = Get-Content -Raw -Path $backupList | ConvertFrom-Json
foreach ($s in $backups.service)
{
    $sName = $s.name
    Write-Output "Processing service $sName..."
    $cosmosConnectString = $s.connectString
    foreach ($d in $s.databases)
    {
        $database = $d.name
        if ($d.backupFlag -eq $true)
        {
            Write-Output " Backing up collections in $database..."
            foreach ($c in $d.collections)
            {
                $collection = $c.name
                Write-Output " Backing up collection $collection."
                <# configure export arguments #>
                $connectString = "$cosmosConnectString;Database=$database"
                $date = Get-Date
                $dateString = $date.ToString('yyyyMMdd')
                <# HH is the 24-hour clock; hh would give AM and PM runs the same file name #>
                $dateString = $dateString + '_' + $date.ToString('HHmm')
                $targetFile = "$collection`_$dateString.json"
                <# use $dtArgs rather than $args, which is a PowerShell automatic variable #>
                $dtArgs = "/ErrorLog:" + "$backupPath\backups\$sName\$database\$collection`_$dateString`_log.csv /OverwriteErrorLog"
                $dtArgs = $dtArgs + " /ErrorDetails:Critical /s:DocumentDB /s.ConnectionString:$connectString"
                $dtArgs = $dtArgs + " /s.ConnectionMode:Gateway /s.Collection:$collection /s.Query:`"SELECT * FROM c`""
                $dtArgs = $dtArgs + " /t:JsonFile /t.File:$backupPath\backups\$sName\$database\$targetFile /t.Prettify /t.Overwrite"
                <# now we are all configured: run the collection backup #>
                Start-Process -FilePath $pathtoEXE -ArgumentList $dtArgs -Wait
            }
        }
        else {
            Write-Output " Skipping $database backupFlag <> true."
        }
    }
}
$purgeDate = (Get-Date).AddDays(-1 * ($retentionDays + 1))
<# remove old logs and backups #>
Get-ChildItem -Path $backupPath -Recurse -Include *.json -Exclude *cosmosBackup.json | Where-Object {$_.CreationTime -lt $purgeDate} | Remove-Item
Get-ChildItem -Path $backupPath -Recurse -Include *.csv | Where-Object {$_.CreationTime -lt $purgeDate} | Remove-Item
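The purge at the end of the script deletes files as soon as it runs, so it may be worth previewing what the retention rule would select first. Here is a minimal sketch under stated assumptions: `Select-ExpiredBackup` is a hypothetical helper, it filters on the same `CreationTime` rule the script uses, and the sample items are made-up objects rather than real files.

```powershell
# Hypothetical helper: passes through any pipeline item whose CreationTime is
# older than the purge date, using the same rule as the script
# (retention window plus one day).
function Select-ExpiredBackup {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        $Item,
        [int] $RetentionDays = 14,
        [datetime] $Now = (Get-Date)
    )
    begin { $purgeDate = $Now.AddDays(-1 * ($RetentionDays + 1)) }
    process {
        if ($Item.CreationTime -lt $purgeDate) { $Item }
    }
}

# Made-up stand-ins for the file objects Get-ChildItem would return.
$now = [datetime]'2019-10-03T12:00:00'
$sample = @(
    [pscustomobject]@{ Name = 'collection1_20190901_1200.json'; CreationTime = $now.AddDays(-32) }
    [pscustomobject]@{ Name = 'collection1_20190925_1200.json'; CreationTime = $now.AddDays(-8) }
    [pscustomobject]@{ Name = 'collection1_20190918_1200.json'; CreationTime = $now.AddDays(-15) }
)

# Only items strictly older than the purge date (15 days back) are selected.
$expired = @($sample | Select-ExpiredBackup -RetentionDays 14 -Now $now)
$expired.Name
```

Against real files, the same filter could sit between a `Get-ChildItem` call and `Remove-Item -WhatIf`, which reports what would be deleted without removing anything.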

While this is not ideal, if you need to start backing up your Cosmos DBs immediately, this will do the trick until Microsoft finishes incorporating Data Factory into their trusted services.

[Edited to add 10/3/2019:] It looks like just yesterday MS updated their timeline for adding the needed functionality to ADF.