Backing Up Cosmos DB (Azure Data Factory Edition)

Now that ADF is a trusted service I wanted to document the state of my current solution since I’ve been able to dump the hack-y PowerShell script I put together. I haven’t been able to get the level of abstraction I’d really like to see but overall I think I’m pretty happy with the solution (and I still get to include a hack-y PowerShell script). My solution consists of

  • a control pipeline,
  • a notification pipeline and
  • 1 pipeline for every Cosmos DB service I want to work with. (This is because I wasn't able to figure out a way to abstract the data source connection for the Copy Data task.)

We’ll start at the backup pipeline and then back out to the larger solution:

Screen shot of the CosmosDB pipeline tasks. The pipeline process is described in the following paragraph.

The backup pipeline is relatively simple: I set a date string that ends up being part of the backup filename, then use the Copy Data task to pull the contents of the Cosmos DB container into blob storage. We then clean up old backup files (we keep 2 weeks of backups) and finally clean up the logs generated by the delete step. If either of the first two steps fails (the backup or the backup cleanup) we use an Execute Pipeline task to kick off a notification pipeline. (ADF doesn't natively send emails, but it was very easy to leverage a logic app to send any necessary notifications; see this mssqltips article for details on how to set that up.) This pipeline is parameterized with the name of the Cosmos DB service, the database name and the container name. These are used to create the folder structure and filename for the backup in blob storage (formatted as [service name][database name]_[container name]_[YYYYMMDD_HHMM].json). This is where I couldn't seem to abstract away the Cosmos DB service itself: I have 1 pipeline per known Cosmos DB service, and any time I need to add a new service I define a single data source and clone an existing pipeline to point at it.
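
For anyone wiring up something similar, the date string and the backup path boil down to a couple of ADF expressions. The JSON below is only a sketch: the variable name (backupDate), the parameter names (serviceName, databaseName, containerName) and the dataset name are stand-ins for whatever you call them in your own pipeline, and I'm assuming the service name ends up as the folder with the rest forming the file name.

{
  "name": "Set Backup Date",
  "type": "SetVariable",
  "typeProperties": {
    "variableName": "backupDate",
    "value": {
      "value": "@formatDateTime(utcnow(), 'yyyyMMdd_HHmm')",
      "type": "Expression"
    }
  }
}

The Copy Data task's sink then points at a parameterized blob dataset and passes the folder and file name the same way:

{
  "referenceName": "BlobBackupJson",
  "type": "DatasetReference",
  "parameters": {
    "folderPath": {
      "value": "@pipeline().parameters.serviceName",
      "type": "Expression"
    },
    "fileName": {
      "value": "@concat(pipeline().parameters.databaseName, '_', pipeline().parameters.containerName, '_', variables('backupDate'), '.json')",
      "type": "Expression"
    }
  }
}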

Each backup pipeline is called by my Dynamic Backup control pipeline which has two tasks:

The screen shot of tasks from the dynamic backup control pipeline. There are two steps: lookup backup list followed by a foreach container.

Lookup Backuplist grabs a JSON-formatted list of Cosmos DB services, along with their associated databases and containers, from a blob storage data source. This is where I had to return to a hack-y PowerShell script. There isn't a handy way for ADF to get a token to talk to Cosmos DB services and query for a list of databases or containers. Sure, I could try to do it via REST calls and the Webhook activity, but that was going to be a huge headache; plus our Cosmos DB is already IP restricted, so I wasn't 100% sure whether a failure would mean I had written something terribly or the communication just wasn't going to happen. So I now have a PowerShell script that I run manually every so often; it picks up any new databases and containers deployed to known services since the last time I checked and adds them to the list. This was mostly just a bit of CYA: when I went to deploy the ADF backups I discovered there was a new container in one of the databases that no one had thought to notify me of. While I'm not necessarily in the loop for new databases or containers on existing Cosmos DB services, it seems a safe bet that I'll be informed of any new Cosmos DB services themselves. Here is the PS script:

<#
    Author:     Josh Smith
    Created:    2020-03-02

    Purpose:    Will pull all databases and containers from known CosmosDB resources.
    Cosmos db services must be known and added to the array below (paired with the
    associated resource group).

#>
<# add cosmos db resource and groups here as needed: #>
$cosmosDBServices = @( @{resource = '<cosmos db service 1>'; group = '<resource group name>'}, `
    @{resource = '<cosmos db service 2>'; group = '<resource group name>'})


$connectAz = Read-Host "Did you already connect to Azure [Y/N]?"
# Sadly the need to authenticate with Azure means this script needs to be run manually.
if ($connectAz -eq 'N'){Connect-AzAccount -subscriptionName "<name of your Azure subscription>"}

$dbResourceType = "Microsoft.DocumentDb/databaseAccounts/apis/databases"
$containerResourceType = "Microsoft.DocumentDb/databaseAccounts/apis/databases/containers"
$BackupStorageName = "<Azure storage account name>"
$StorageResourceGroupName = "<resource group for azure storage>"
$apiVersion = "2015-04-08"
$finalJson = ""

foreach($c in $cosmosDBServices)
{
    $ResourceName = $c.resource + "/sql/"
    $resourceGroup = $c.group

    $databases = Get-AzResource -ResourceType $dbResourceType -ApiVersion $apiVersion `
        -ResourceGroupName $resourceGroup -Name $ResourceName | `
        Select-Object -ExpandProperty Properties | Select-Object id
    
    foreach($d in $databases)
    {
        $databaseName = $d.id

        $db = $ResourceName + $databaseName
        
        $containers = Get-AzResource -ResourceType $containerResourceType `
            -ApiVersion $apiVersion -ResourceGroupName $ResourceGroup `
            -Name $db | Select-Object -ExpandProperty Properties | Select-Object id
        
        foreach($con in $containers)
        {
            $container = $con.id

            $con = $null
            $json = $null
            $conObject = New-Object -TypeName psobject
            $conObject | Add-Member -MemberType NoteProperty -Name ContainerName -Value $container
            $conObject | Add-Member -MemberType NoteProperty -Name DatabaseName -Value $databaseName
            $conObject | Add-Member -MemberType NoteProperty -Name ResourceName -Value $resourceName.Substring(0, $resourceName.Length - 5)

            $json = $conObject | ConvertTo-Json

            $finalJson = $finalJson + $json + ",
"    # I'm a sucker for lazy line breaks
        }
    }
}

# lose the last comma and line break, then wrap everything in brackets so the
# file is a valid JSON array (matching the sample output below):
$finalJson = "[" + $finalJson.TrimEnd().TrimEnd(',') + "]"
$finalJson | Out-File -FilePath '.\CDBBackup.json' -Force

# connect to the blob storage and then push the file up:
$storageAccount = Get-AzStorageAccount -ResourceGroupName $StorageResourceGroupName `
    -Name $BackupStorageName 

$storageContext = $storageAccount.Context

Set-AzStorageBlobContent -File '.\CDBBackup.json' -Container 'backuplist' `
    -Blob 'CDBBackups.json' -Context $storageContext -Force

# don't need that file hanging around here:
Remove-Item .\CDBBackup.json

Which will give me output like this:

[{
  "ContainerName" : "myFirstContainer",
  "DatabaseName" : "myFirstDatabase",
  "ResourceName" : "cosmos-db-service-1"
 },
 {
  "ContainerName" : "mySecondContainer",
  "DatabaseName" : "myFirstDatabase",
  "ResourceName" : "cosmos-db-service-1"
 },
 {
  "ContainerName" : "myFirstContainerMarkII",
  "DatabaseName" : "mySecondDatabase",
  "ResourceName" : "cosmos-db-service-2"
 }]
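
On the ADF side, the plumbing between those two control-pipeline tasks is a single expression: the ForEach's items property points at the Lookup's output array (which means "First row only" has to be unchecked on the Lookup). A minimal sketch, with the activity names approximated from the screenshot above:

{
  "name": "ForEach Container",
  "type": "ForEach",
  "dependsOn": [
    {
      "activity": "Lookup Backuplist",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('Lookup Backuplist').output.value",
      "type": "Expression"
    },
    "activities": []
  }
}

Inside the ForEach, each item then exposes the three properties from the file above as @item().ResourceName, @item().DatabaseName and @item().ContainerName.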
A screenshot of a single Switch task

For each returned item in the backup list I simply run it through a Switch task based on the name of the Cosmos DB service. If none of the names match we simply don't do anything with it (the default case). Once I find a service that is "known" (i.e. I have a data source for it) I call the pipeline I deployed for that service via an Execute Pipeline task and pass in the database name and container name as parameters (along with the service name for consistency's sake; it could be hard coded into the pipeline, but I'm certain I'll find a way to make this more dynamic in the future).
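
Conceptually the Switch plus Execute Pipeline combination looks like the sketch below. The case value comes from my sample list above, while the pipeline reference name and parameter names are placeholders for however you've named yours:

{
  "name": "Switch On Service",
  "type": "Switch",
  "typeProperties": {
    "on": {
      "value": "@item().ResourceName",
      "type": "Expression"
    },
    "cases": [
      {
        "value": "cosmos-db-service-1",
        "activities": [
          {
            "name": "Backup cosmos-db-service-1",
            "type": "ExecutePipeline",
            "typeProperties": {
              "pipeline": {
                "referenceName": "PL_Backup_CosmosDbService1",
                "type": "PipelineReference"
              },
              "waitOnCompletion": true,
              "parameters": {
                "serviceName": "@item().ResourceName",
                "databaseName": "@item().DatabaseName",
                "containerName": "@item().ContainerName"
              }
            }
          }
        ]
      }
    ],
    "defaultActivities": []
  }
}

Adding a new service means adding one more case here, on top of the new data source and the cloned pipeline.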

So that's where we're at: ideally either ADF will get some built-in methods to pull back metadata about the service (webhooks and REST really sound like a problem that only needs to be solved once), or I'll find a way to run that PowerShell script in an automated fashion to populate the list.

Also, a very appreciative shout-out to Cathrine Wilhelmsen (b|t): without her Christmastime ADF blog marathon I would have spent much longer fumbling around than I ultimately did.