
chef microsoft sql server backup

posted Dec 17, 2011, 1:56 PM by Steve Craig   [ updated Dec 18, 2011, 5:37 PM ]
Recipes for backup and restore of Microsoft SQL Server 2005 on Windows Server 2003 with Chef data bags

Backups in general are a critical operations task.  With databases in particular, both the data contained inside the database and the transaction logs must be backed up, or else point-in-time recovery is not possible.  With a full backup and an unbroken transaction log chain, however, the ability to recover a database to a recent point-in-time after a disaster is near-trivial.

The technologies used here are Windows Server 2003 running Microsoft SQL Server 2005 with Ruby 1.8.7 and Chef 0.10, yet most of the concepts will apply to newer Microsoft technologies ... with the notable exception of scheduled tasks.

Basic MSSQL database backup and restore/refresh activities should be fully automatic so that human intervention is unnecessary on a day-to-day basis.  Furthermore, "refresh" (importing databases from one environment {production} to another {qa}) should utilize essentially the same functionality as basic "restore" (recovering a database from a backup copy).  Again, minimal-to-no human interaction should be required for specific restore scenarios.  Leverage simple batch files, text lock files, and Windows scheduling to enable Chef to perform the grunt work.

This post focuses on two things: how to perform daily full database and hourly transaction log backups under Microsoft's "full recovery model", and how to recover the database from those backups.  There are two main paradigms for backup and restore of Microsoft SQL Server: full and simple.  If you are unaware or unsure of the purpose of transaction logs, the full recovery model is not for you.  Use the simple recovery model, take a full backup of your database nightly, and call it a day.  However: if you will ever need the ability to recover your database to a particular point-in-time, or need to shrink the potential data-loss window after a disaster to a minimal amount of time (minutes, not hours), then you will need the full recovery model.  Read on.
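For reference, switching a database between the two recovery models is a single transact-sql statement (the database name here is a placeholder):

```sql
-- full recovery model: translog backups possible, point-in-time restore
ALTER DATABASE YOURDBNAME SET RECOVERY FULL;

-- simple recovery model: log truncates automatically, full backups only
ALTER DATABASE YOURDBNAME SET RECOVERY SIMPLE;
```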

The goal:
Enable Chef to perform full MSSQL backups under the "full recovery model" on a daily basis with transaction log backups every hour.  Compress the backups and move them offsite.  Enable Chef to create empty databases and then perform MSSQL refresh and restore operations in an automated, scheduled fashion.

Left unsaid:
This post will gloss over certain bedrock Chef-related items such as installing and configuring Chef agents on your servers and writing basic Chef recipes, along with the creation of templates and data bags.  The wiki is instructive in these matters, and I have also written blog posts on some of the basics of setting up Chef HERE.  It also skips over the details of many Microsoft-related items, such as command line utilities for executing transact-sql and basic philosophies for MSSQL database configuration.

Key information:
Before one begins automating MSSQL backup and restore functionality under the "full recovery model", note the single most important concept and the two primary "gotchas" that go along with it:

- The single most important concept to understand regarding "full recovery model" MSSQL backups is that the LSN (Log Sequence Number) is EVERYTHING.  Microsoft documentation is wordy and obtuse on this point.  Do not get caught up in the particulars of Microsoft's musty documentation.  Understand only that you MUST be able to query the LSN of the backup files that you are generating and know how to sequence them such that there is an unbroken transaction log chain.  If the translogs are out of order or missing, you will only be able to restore up to the break in the LSN chain.

THIS post has a summary of LSN, as well as the technical specifics of the transact-sql necessary to query LSN information from backup files.  These transact-sql commands are basic and important.  Do not fail to read the article HERE.  There are two methods of querying for LSN, depending on whether you have access to the original database that spawned the backup files or only to the files themselves.  Understand how to use both methods.  Literally, if you are attempting to code your own solution while referring to my post, take a break and get cozy with LSN now.  It WILL save you stress later.

1. if you have access to the MSSQL server that spawned the backups, use this transact-sql:

select database_name, type, first_lsn, last_lsn, checkpoint_lsn, database_backup_lsn from msdb..backupset where database_name = 'YOURDBNAME'

2. if you have access to only the backup files themselves, use these transact-sql commands to grab the information from the most recent full database backup, and then the translog backups afterwards:

-- investigate the headers of your full backup file ".bak" (disk paths are placeholders)
restore headeronly from disk = 'D:\backups\YOURDBNAME.bak'

-- investigate the headers of your translog backup files ".log"
restore headeronly from disk = 'D:\backups\YOURDBNAME_13.log'
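The sequencing rule itself is mechanical: each translog backup must pick up exactly where the previous backup left off, with no gap in the LSN values.  A minimal Ruby sketch of that check (the LSN values below are made up for illustration; the hash keys mirror the first_lsn/last_lsn columns from msdb..backupset):

```ruby
# Check that an ordered list of backups forms an unbroken LSN chain.
def unbroken_chain?(backups)
  backups.each_cons(2).all? do |prev, curr|
    # the next backup must start exactly where the previous one ended
    curr[:first_lsn] == prev[:last_lsn]
  end
end

chain = [
  { type: :full,     first_lsn: 1000, last_lsn: 1200 },
  { type: :translog, first_lsn: 1200, last_lsn: 1350 },
  { type: :translog, first_lsn: 1350, last_lsn: 1500 },
]

puts unbroken_chain?(chain)   # => true
chain.delete_at(1)            # lose an hourly translog backup...
puts unbroken_chain?(chain)   # => false: restore stops at the break
```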

Now that you know how important LSN is to the full recovery model, and how to query LSN information from either the source database or the backup files, here are the two primary GOTCHAS when working with LSN inside the full recovery model.

1. The most recent, uncompressed full backup file must be available when translog backups run.  So, if you want your backup routine to "cleanup" after itself by compressing the backups and moving them offsite, you will need to ensure that you leave the most recent uncompressed full backup behind.

2. By default, Microsoft MSSQL backup functionality concatenates backups onto an existing backup file.  While working inside the "full recovery model" for the first time, the default concatenate behaviour will normally surface as strange translog backup file sizes: i.e., translog files that are uniformly large and seem to contain all the changes since the last full backup, rather than simply the changes since the previous translog backup.  For those of us with a rigorous compress-and-move-offsite backup strategy, this default concatenate behaviour is unwanted and unnecessary {IT'S CRAP, MICROSOFT!}.  Ensure you utilize the "WITH FORMAT" and/or "WITH INIT" transact-sql options when creating your backups to prevent this crazy concatenation.
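Concretely, the daily full and hourly translog backup statements end up looking something like this (database name and disk paths are placeholders, not lifted from my templates); WITH INIT overwrites the backup set in the target file instead of appending to it:

```sql
-- daily full backup: INIT/FORMAT overwrite any existing backup set in the file
BACKUP DATABASE YOURDBNAME
  TO DISK = 'D:\backups\YOURDBNAME.bak'
  WITH INIT, FORMAT;

-- hourly translog backup: again INIT, so each ".log" file holds only
-- the changes since the previous translog backup
BACKUP LOG YOURDBNAME
  TO DISK = 'D:\backups\YOURDBNAME_13.log'
  WITH INIT;
```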

So, what does all that look like on my windows servers? Onwards, opschef recipes! #devops, ho!

createdb.rb recipe:
createdb.rb attribute:
createdb.rb template1:
createdb.rb template2:

First, if we are going to take the time to back up and restore our databases in an automated fashion, it makes sense to enable creation of databases from scratch as well.  If databases are going to be created by Chef, one natural location (although there are many) for their configuration information is a data bag.

The "createdb.rb" recipe utilizes information from two JSON data bags to create MSSQL databases on a target host.  It runs unless a lockfile exists.  This is to prevent round-trip data bag searches from occurring on each Chef run, not to prevent MSSQL funkiness - MSSQL will fail to create a new database if one already exists with the requested name.  The first data bag, "running_database", contains a simple list of the names of the databases that the host should run, and the second data bag, "database_information", contains the settings for each database.  "createdb.rb" is not currently fully-baked in some straightforward ways; an in-depth discussion of its current shortcomings is beyond the scope of this blog post.  Any suggestions to clean/improve it are welcome.

After grabbing the necessary configuration information from the data bags, the "createdb.rb" recipe utilizes a dual-template method that I come back to in many of my windows recipes: one template contains the batch file to execute the necessary commands, and another template contains the payload information.
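In outline, the lockfile-guarded dual-template pattern looks something like the sketch below.  This is heavily abbreviated and the paths, data bag keys, and template names are illustrative, not the actual production recipe:

```ruby
# createdb-style pattern: skip all data bag round-trips if the lockfile exists
unless File.exist?("C:/chef/locks/createdb.lock")
  db_names = data_bag_item("running_database", node.name)["databases"]

  db_names.each do |db|
    settings = data_bag_item("database_information", db)

    # template1: the batch wrapper that executes the transact-sql
    template "C:/chef/scripts/createdb_#{db}.bat" do
      source "createdb.bat.erb"
      variables(:db => db)
    end

    # template2: the payload -- the actual CREATE DATABASE transact-sql
    template "C:/chef/scripts/createdb_#{db}.sql" do
      source "createdb.sql.erb"
      variables(:settings => settings)
    end

    execute "createdb-#{db}" do
      command "C:/chef/scripts/createdb_#{db}.bat"
    end
  end

  file "C:/chef/locks/createdb.lock"
end
```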

backupdb.rb recipe:
backupdb.rb attribute:
backupdb.rb template1:
backupdb.rb template2:
backupdb.rb template3:

"backupdb.rb" is the main recipe inside of a larger "backupdb" role.  This role also contains a "store-registry.rb" recipe (not included here), which inserts ssh keys into the node's registry and enables putty to transfer files across the innertubes via pscp, as well as an "rsyncdb.rb" recipe that compresses the resulting backup files and rsyncs them to other nodes inside the same datacenter (referenced below).  The "rsyncdb.rb" recipe in turn depends on the "deltacopy.rb" recipe for the installation and configuration of a windows rsync daemon.  The "backupdb" role currently runs on the production database server.

Naturally, the "backupdb.rb" recipe utilizes the database configuration information located inside the previously-mentioned data bags, a few other settings inside application-specific attribute files, and some recipes that install supporting software that no windows server should be without (sevenzip, putty, deltacopy).  "backupdb.rb" will back up all the databases listed inside the running_database data bag for the node it runs on.

The paradigm is simple and similar to the "createdb.rb" recipe: only run if no lockfile exists.  Each of the database backup types (full backup and translog backup) has its own lockfile (the log of the database backup).  These lockfiles are deleted on an appropriately scheduled basis (once a day for the full backup lockfile, and hourly for the translog backup lockfile), which permits chef to cook the recipe when the lockfile is missing.  The flow for each database backup type: first, grab the necessary database settings from the data bags if there is no lockfile; second, ensure the two templated files (the batch file to execute the transact-sql statements as well as the specific backup sql to be run) exist and are up to date; finally, execute them.
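The scheduled lockfile deletion is plain Windows plumbing.  Something along these lines (task names and paths are illustrative) registers the deletes with Server 2003's schtasks:

```bat
schtasks /create /tn "clear-translog-lockfile" /sc HOURLY ^
  /tr "cmd /c del C:\chef\locks\backupdb_translog.log"

schtasks /create /tn "clear-fullbackup-lockfile" /sc DAILY /st 01:00:00 ^
  /tr "cmd /c del C:\chef\locks\backupdb_full.log"
```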

The output of this is the database backup file and logging information (which is used as a lockfile).  In the interest of simplicity, "backupdb.rb" is concerned only with getting a valid db dump to disk: no post-processing or other after-the-fact work is done here.  Tasks such as compressing the database dump and moving it offsite are handled by other recipes.  It is of course possible for the database backup to fail for any number of reasons and still create a log/lockfile with this method, which would prevent the backup routine from running until the next scheduled run.  This is sub-optimal in some ways; however, if the backup report is logged to a database/emailed/tweeted/posted to a ticket and reviewed, operators can control for the risk.

rsyncdb.rb recipe:
"rsyncdb.rb" again utilizes the same data bags for necessary settings and only runs if its lockfile does not exist.  This lockfile is deleted via a scheduled task, as above.  Absent this lockfile, the node will gather the necessary settings and stamp the backup files such that they can be restored later.  The full database backup is always labeled "zero" and each translog backup is stamped with the hour that it was created.  This makes restoring the database a snap.  As previously mentioned, database backups are rsync'd from the primary dbserver to any number of other locations within the same facility.  In my case this is primarily to enable nightly refreshes from prod to qa.
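The stamping scheme is easy to reproduce.  A sketch in plain Ruby (the exact naming convention here is assumed for illustration - "00" standing in for "zero" - and is not lifted from the actual recipe):

```ruby
require "time"

# Stamp a backup file: full backups are always "zero" (00),
# translog backups get the hour they were taken.
def stamped_name(db, type, taken_at = Time.now)
  suffix = (type == :full) ? "00" : taken_at.strftime("%H")
  ext    = (type == :full) ? "bak" : "log"
  "#{db}_#{suffix}.#{ext}"
end

puts stamped_name("YOURDBNAME", :full)
# a translog backup taken at 13:05 becomes YOURDBNAME_13.log
puts stamped_name("YOURDBNAME", :translog, Time.parse("2011-12-17 13:05"))
```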

restoredb.rb recipe:
restoredb.rb attribute:
restoredb.rb template1:
restoredb.rb template2:

The "restoredb.rb" recipe can run on any database server.  Its current primary function (in addition to enabling database restores if necessary on production) is to enable nightly database refreshes on qa.  The "restoredb.rb" recipe is the first recipe inside a "restoredb" role that also includes a "postrestoredb.rb" recipe.

The "restoredb.rb" recipe contains a list of databases that should be restored/refreshed in-line.  This is sub-optimal.  It should utilize the data bags that the other recipes use.  Like everything else, it runs unless a lockfile exists.  This lockfile is deleted every 24 hours on QA, and also contains a reference to an internal application version number.  This enables a database refresh to occur after a different version of code is pushed to the app servers.
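The restore side is just the LSN chain replayed in order: the full backup first, then each translog backup, all WITH NORECOVERY, then bring the database online.  A transact-sql sketch (database name and paths are placeholders):

```sql
-- restore the most recent full backup, leave the db in the restoring state
RESTORE DATABASE YOURDBNAME
  FROM DISK = 'D:\backups\YOURDBNAME_00.bak'
  WITH NORECOVERY, REPLACE;

-- replay each translog backup in LSN order, still NORECOVERY
RESTORE LOG YOURDBNAME
  FROM DISK = 'D:\backups\YOURDBNAME_13.log'
  WITH NORECOVERY;

-- (for point-in-time recovery, add STOPAT = '...' on the final log restore)

-- bring the database online
RESTORE DATABASE YOURDBNAME WITH RECOVERY;
```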

postrestoredb.rb recipe:
postrestoredb.rb template1:
This recipe is contained inside the "restoredb" role, after "restoredb.rb", and is mainly utilized to bring the QA database that was just refreshed with production data up to speed with schema changes that support the newer code running on QA.  The meat of this recipe is similar to the others: "postrestoredb.rb" runs unless a lockfile exists.  Right now, it contains specific references to a database name, mainly because only one particular database requires post-restore scripts to roll it forward to the current qa schema, which is slightly ahead of production.


- templated batch file to wrap around the arbitrary T-SQL execution:
- templated T-SQL script to create databases:

- templated batch file to wrap around arbitrary T-SQL execution:
- templated T-SQL script to do full database backups:
- templated T-SQL script to do translog database backups:

- utilizes much functionality not covered here: sevenzip, deltacopy, and a twitter library.  Chef takes care of those for me as separate matters!

- templated batch file to wrap around arbitrary T-SQL execution:
- templated T-SQL script to do database restores:

- utilizes functionality not covered here: svn, emailing reports, and updating my trac tickets.  Chef takes care of these for me as separate matters!
- templated batch file to wrap around arbitrary T-SQL execution:

Hope this was useful!
