Backing Up Cosmos DB (Azure Data Factory Edition)

Now that ADF is a trusted service, I wanted to document the state of my current solution, since I’ve been able to dump the hack-y PowerShell script I put together. I haven’t been able to get the level of abstraction I’d really like to see, but overall I think I’m pretty happy with the solution (and I still get to include a hack-y PowerShell script). My solution consists of

  • a control pipeline,
  • a notification pipeline and
  • 1 pipeline for every Cosmos DB service I want to work with. (This is because I wasn’t able to figure out a way to abstract the data source connection for the Copy Data task.)

We’ll start at the backup pipeline and then back out to the larger solution:

Screenshot of the Cosmos DB backup pipeline tasks

The backup pipeline is relatively simple: I set a date string that ends up being part of the backup filename, then use the Copy Data task to pull the contents of the Cosmos DB container into blob storage. We then clean up the backup files (we are keeping two weeks of backups) and finally clean up the logs generated by the delete step. If either of the first two steps fail (backup or backup cleanup) we use an Execute Pipeline task to kick off a notification pipeline. (ADF doesn’t natively send emails, but it was very easy to leverage a Logic App to send any necessary notifications: see this mssqltips article for details on how to set that up.)

This pipeline is parameterized with the name of the Cosmos DB service, the database name and the container name. These are used to create the folder structure for the backup in blob storage (formatted as [service name][database name][container name]_[YYYYMMDD_HHMM].json). This is where I couldn’t seem to abstract away the Cosmos DB service itself: I have one pipeline per known Cosmos DB service, and any time I need to add a new service I define a single data source and clone an existing pipeline to point at the new data source.
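
If the naming convention is easier to read as code, here’s a rough PowerShell illustration of what a single backup blob ends up being called. The values are the hypothetical ones from the sample backup list later in the post, and the folder layout is illustrative rather than exact:

$serviceName   = 'cosmos-db-service-1'   # pipeline parameter: Cosmos DB service name
$databaseName  = 'myFirstDatabase'       # pipeline parameter: database name
$containerName = 'myFirstContainer'      # pipeline parameter: container name
$dateString    = (Get-Date).ToUniversalTime().ToString('yyyyMMdd_HHmm')   # the date string set at the top of the pipeline

# the blob path the Copy Data task writes to (folder split shown here is illustrative):
"$serviceName/$databaseName/$($containerName)_$dateString.json"
# e.g. cosmos-db-service-1/myFirstDatabase/myFirstContainer_20200302_1400.json

# the cleanup step keeps two weeks of these, so anything older than this cutoff gets deleted:
$cutoff = (Get-Date).ToUniversalTime().AddDays(-14)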

Each backup pipeline is called by my Dynamic Backup control pipeline which has two tasks:

Screenshot of the tasks from the dynamic backup control pipeline

Lookup Backuplist grabs a JSON-formatted list of Cosmos DB services, along with any associated databases and containers, from a blob storage data source. This is where I had to return to a hack-y PowerShell script. There isn’t a handy way for ADF to get a token to talk to Cosmos DB services and query for a list of databases or containers. Sure, I could try to do it via REST calls and the Webhooks action, but that was going to be a huge headache, plus our Cosmos DB is already IP restricted, so I wasn’t 100% sure whether things would fail because I had written them terribly or because the communication just wasn’t going to happen. So I now have a PowerShell script that I run manually every so often; it picks up any new databases and containers that were deployed to known services since the last time I checked and adds them to the list. This was mostly just a bit of CYA: when I went to deploy the ADF backups I discovered there was a new container in one of the databases that no one had thought to notify me of. While I’m not necessarily in the loop for new databases or containers on existing Cosmos DB services, it seems a safe bet that I’ll be informed of any new Cosmos DB services themselves. Here is the PS script:

<#
Author: Josh Smith
Created: 2020-03-02
Purpose: Will pull all databases and containers from known CosmosDB resources.
Cosmos db services must be known and added to the array below (paired with the
associated resource group).
#>
<# add cosmos db resource and groups here as needed: #>
$cosmosDBServices = @( @{resource = '<cosmos db service 1>'; group = '<resource group name>'}, `
    @{resource = '<cosmos db service 2>'; group = '<resource group name>'})
$connectAz = Read-Host "Did you already connect to Azure [Y/N]?"
# Sadly the need to authenticate with Azure means this script needs to be run manually.
if ($connectAz -eq 'N'){Connect-AzAccount -subscriptionName "<name of your Azure subscription>"}
$dbResourceType = "Microsoft.DocumentDb/databaseAccounts/apis/databases"
$containerResourceType = "Microsoft.DocumentDb/databaseAccounts/apis/databases/containers"
$BackupStorageName = "<Azure storage account name>"
$StorageResourceGroupName = "<resource group for azure storage>"
$apiVersion = "2015-04-08"
$finalJson = ""
foreach ($c in $cosmosDBServices)
{
    $ResourceName = $c.resource + "/sql/"
    $resourceGroup = $c.group
    $databases = Get-AzResource -ResourceType $dbResourceType -ApiVersion $apiVersion `
        -ResourceGroupName $resourceGroup -Name $ResourceName | `
        Select-Object -ExpandProperty Properties | Select-Object id

    foreach ($d in $databases)
    {
        $databaseName = $d.id
        $db = $ResourceName + $databaseName
        $containers = Get-AzResource -ResourceType $containerResourceType `
            -ApiVersion $apiVersion -ResourceGroupName $resourceGroup `
            -Name $db | Select-Object -ExpandProperty Properties | Select-Object id

        foreach ($con in $containers)
        {
            $container = $con.id
            $con = $null
            $json = $null
            $conObject = New-Object -TypeName psobject
            $conObject | Add-Member -MemberType NoteProperty -Name ContainerName -Value $container
            $conObject | Add-Member -MemberType NoteProperty -Name DatabaseName -Value $databaseName
            $conObject | Add-Member -MemberType NoteProperty -Name ResourceName -Value $ResourceName.Substring(0, $ResourceName.Length - 5)
            $json = $conObject | ConvertTo-Json
            $finalJson = $finalJson + $json + ",
" # I'm a sucker for lazy line breaks
        }
    }
}
# lose the trailing comma and line break, then wrap everything in brackets so the
# file is a valid JSON array (matching the sample output below):
$finalJson = "[`n" + $finalJson.TrimEnd().TrimEnd(',') + "`n]"
$finalJson | Out-File -FilePath '.\CDBBackup.json' -Force
# connect to the blob storage and then push the file up:
$storageAccount = Get-AzStorageAccount -ResourceGroupName $StorageResourceGroupName `
-Name $BackupStorageName
$storageContext = $storageAccount.Context
Set-AzStorageBlobContent -File '.\CDBBackup.json' -Container 'backuplist' `
-Blob 'CDBBackups.json' -Context $storageContext -Force
# don't need that file hanging around here:
Remove-Item .\CDBBackup.json

Which will give me output like this:

[
    {
        "ContainerName" : "myFirstContainer",
        "DatabaseName" : "myFirstDatabase",
        "ResourceName" : "cosmos-db-service-1"
    },
    {
        "ContainerName" : "mySecondContainer",
        "DatabaseName" : "myFirstDatabase",
        "ResourceName" : "cosmos-db-service-1"
    },
    {
        "ContainerName" : "myFirstContainerMarkII",
        "DatabaseName" : "mySecondDatabase",
        "ResourceName" : "cosmos-db-service-2"
    }
]

A screenshot of a single Switch task

For each returned item in the backup list I simply run it through a Switch task based on the name of the Cosmos DB service from my collection. If none of the names match we simply don’t do anything with it (the default case). Once I find a service that is “known” (i.e. I have a data source for it) I can call the pipeline I deployed for that service via an Execute Pipeline task and pass the database name and container in as parameters (along with the service name for consistency’s sake; it could be hard-coded into the pipeline, but I’m certain I’ll find a way to make this more dynamic in the future).
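
If it helps to see that control flow written out, here’s a rough PowerShell equivalent of what the ForEach and Switch combination is doing (the service names are the hypothetical ones from the sample list above; in the real pipeline each case is an Execute Pipeline task rather than a script block):

# sketch of the control pipeline's logic, not something ADF actually runs:
$backupList = Get-Content '.\CDBBackups.json' -Raw | ConvertFrom-Json
foreach ($item in $backupList)
{
    switch ($item.ResourceName)
    {
        'cosmos-db-service-1' { <# Execute Pipeline: backup pipeline for service 1, passing $item.DatabaseName and $item.ContainerName #> }
        'cosmos-db-service-2' { <# Execute Pipeline: backup pipeline for service 2 #> }
        default { <# not a known service: do nothing #> }
    }
}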

So that’s where we’re at: ideally either ADF will get some built-in methods to pull back metadata about the service (webhooks and REST really sound like a problem that only needs to be solved once) or I’ll find a way to run that PowerShell script in an automated fashion to populate the list.
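
If I do go the automation route, the only real change is the login at the top of the script: an Azure Automation runbook running under a managed identity could skip the Read-Host prompt entirely. A rough, untested sketch (the identity would also need read access to the Cosmos DB resource groups and write access to the backup storage account):

# sketch only: the runbook flavor of the script authenticates with the
# Automation account's managed identity instead of an interactive login
Connect-AzAccount -Identity | Out-Null
Set-AzContext -Subscription "<name of your Azure subscription>" | Out-Null
# ...everything from $dbResourceType on down stays the same,
# and the runbook gets a schedule so the backup list maintains itself.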

Also, a very appreciative shout-out to Cathrine Wilhelmsen (b|t): without her Christmastime ADF blog marathon I would have spent much longer fumbling around than I ultimately did.

DB Deployments: Using A Custom Filter

So I’ve been rolling out the deployment pipelines as I was able, and I ran into an issue with our third one: we have a helper database that lives on all of our SQL instances, and while we’d eventually like to keep any customization to it in source control, we only want the dbo schema deployed automatically via the pipeline. The Redgate “Deploy From Package” step has a handy option to supply an alternative filter for the SQL Compare engine to use, so it seemed really straightforward: I fired up SQL Compare, generated a filter file that I named DeployFilter.scpf, and threw that file name into the job step!

Screenshot of the job step with the DeployFilter.scpf file name entered

Unfortunately I got stuck here, because this does not work. The NuGet packages don’t end up in the working directory for the deployment, and I couldn’t find any documentation about where they might end up. I ran a few deployments so I could watch the working folder as the build ran to see what showed up, with no luck. I looked for articles that might have mentioned configuring the path, to no avail. Eventually I hit up the Redgate channel on the SQL Slack. Alex Yates (b|t) had the answer in his head all along: the previous step (self-deploy package) unpacks the NuGet package on the Tentacle, and I have to get it to tell me where the package is (and then append the path to the filter within the package itself). This is all a very long way of saying that if you want to use a custom filter (a) it should be in source control with the rest of your code and (b) the variable is #{Octopus.Action[<name of package deployment step>].Output.Package.InstallationDirectoryPath}\<path to filter file in nuget package>. And I am putting it here for the next person who finds themselves googling how to make it work (which may still be me: my buffer is notoriously overworked).
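
To make that concrete: if the package deployment step were named “Self-deploy package” and DeployFilter.scpf sat at the root of the package, the value to drop into the filter field would be #{Octopus.Action[Self-deploy package].Output.Package.InstallationDirectoryPath}\DeployFilter.scpf (adjust the step name and the relative path to wherever the filter actually lives in your package).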