DevOps for databases: “DataOps”
DevOps, a set of practices that combines software development (Dev) and information-technology operations (Ops), has become a very popular way to shorten the systems development life cycle and provide continuous delivery of applications (“software”). Applying continuous delivery and DevOps practices to data analytics has been termed DataOps, which is the topic of this blog.
Databases are more difficult to manage than applications from a development perspective. Applications, generally, do not concern themselves with state. For any given “release” or build, an application can simply be deployed over the previous version without needing to maintain any portion of the previous application. Databases are different: it’s much harder to deploy the next version of your database when you need to be concerned with maintaining “state” in the database.
So what is the “state” you need to be concerned with maintaining?
Lookup data is the simplest example. Almost every database has tables that are used for allowable values, lookup data, and reference data. If you need to change that data for a new release, how do you do that? What happens if the customer or user has already changed that data? How do you migrate that data?
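A common way to handle this is an idempotent post-deployment script that seeds or updates the reference data with a MERGE, so the same script can safely run on every release. This is just a minimal sketch; the dbo.OrderStatus table and its columns are hypothetical.

```sql
-- Hypothetical reference table dbo.OrderStatus (StatusCode, StatusName).
-- MERGE makes the script idempotent: missing rows are inserted and changed
-- rows are updated, so running it on every release has no side effects.
MERGE dbo.OrderStatus AS target
USING (VALUES
    ('NEW',  'New order'),
    ('SHIP', 'Shipped'),
    ('CANC', 'Cancelled')
) AS source (StatusCode, StatusName)
    ON target.StatusCode = source.StatusCode
WHEN MATCHED AND target.StatusName <> source.StatusName THEN
    UPDATE SET StatusName = source.StatusName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (StatusCode, StatusName)
    VALUES (source.StatusCode, source.StatusName);
```

Note that this particular sketch overwrites any edits the customer made to matched rows, which is exactly the kind of decision you have to make explicitly when lookup data carries state.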
Another example: a table undergoes a major schema migration. New columns are added and the table is split and normalized among new tables. How do we write the migration code to ensure it runs exactly once or runs multiple times without side effects (using scripts that are “idempotent”)?
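One common technique is to guard every DDL and data-movement step with an existence check so the script can run any number of times. The sketch below assumes a hypothetical dbo.Customer table whose address columns are being split out into a new dbo.CustomerAddress table.

```sql
-- Idempotent migration sketch: each step checks whether it has already run.
IF OBJECT_ID('dbo.CustomerAddress', 'U') IS NULL
BEGIN
    CREATE TABLE dbo.CustomerAddress
    (
        CustomerAddressId INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId        INT NOT NULL,
        AddressLine1      NVARCHAR(100) NOT NULL,
        City              NVARCHAR(50)  NOT NULL
    );
END;

-- Copy and drop the old columns only if they still exist, i.e. the
-- migration has not yet completed on this database.
IF COL_LENGTH('dbo.Customer', 'AddressLine1') IS NOT NULL
BEGIN
    BEGIN TRANSACTION;

    INSERT INTO dbo.CustomerAddress (CustomerId, AddressLine1, City)
    SELECT CustomerId, AddressLine1, City
    FROM dbo.Customer;

    ALTER TABLE dbo.Customer DROP COLUMN AddressLine1, City;

    COMMIT TRANSACTION;
END;
```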
Other objects that require state to be considered during an upgrade:
- Indexes: what happens if an index is renamed or an included column is added? What happens if the DBA adds a new emergency index? Will your DevOps tool remove it since it isn’t in an official build? (A guarded index script that tolerates this is sketched after this list.)
- Keys: if you change a primary key, will that change require the PK to be dropped and recreated? If so, what happens to the foreign keys?
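In a migration script, index changes typically get the same guard treatment: check the system catalog before creating (or dropping) an index, so a re-run, or an index a DBA already added by hand, does not make the script fail. The table and index names below are hypothetical.

```sql
-- Create the index only if it does not already exist (for example because
-- a DBA created it directly in production as an emergency fix).
IF NOT EXISTS (
    SELECT 1
    FROM sys.indexes
    WHERE name = 'IX_Orders_CustomerId'
      AND object_id = OBJECT_ID('dbo.Orders')
)
BEGIN
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
        ON dbo.Orders (CustomerId)
        INCLUDE (OrderDate);
END;
```

A state-based tool, by contrast, will typically want to drop an index it does not find in the source model unless you configure the comparison to leave such objects alone.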
In most cases, database objects like functions, views, and stored procedures have no state considerations and can be re-deployed during every release.
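Since these objects carry no state, the same script can simply re-create them on every release; on SQL Server 2016 SP1 and later, CREATE OR ALTER makes that trivially idempotent. The procedure below is a hypothetical example.

```sql
-- Safe to deploy on every release: CREATE OR ALTER either creates the
-- procedure or replaces its definition, and there is no data to preserve.
CREATE OR ALTER PROCEDURE dbo.GetCustomerOrders
    @CustomerId INT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT OrderId, OrderDate, TotalDue
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
END;
```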
So how do you overcome these “state” difficulties, especially if you are aiming towards frequent releases and agile, collaborative development?
The first step is to make a major decision when including databases in your DevOps processes, and that is how you will store the data model. There are two options:
Migration-based deployment: Sometimes called transformation-based deployment, this is the most common option today and is a very traditional way to work with databases during development. At some point you create an initial database (a “seed” database that is a single migration script stored inside source control), and after that you keep every script that’s needed to bring the database schema up to the current point (you can use SQL Server Management Studio to create the scripts). Those migration scripts will have an incremental version number and will often include data fixes or new values for reference tables, along with the Data Definition Language (DDL) required for the schema changes. So basically you are migrating the database from one state to another; a sketch of what such a versioned migration script might look like follows the list below. The system of truth in a migration-based approach is the database itself. There are a few problems with this option:
- Deployments keep taking longer as more and more scripts need to be applied when upgrading a database. A way around this is to create new seed databases on a regular basis to avoid starting with the very first database
- A lot of time can be wasted with large databases when dealing with, for example, the design of an index. If the requirements keep changing, a large index can be added to the database, then deleted, then reapplied slightly differently (e.g. adding a new column to it), and this can be repeated many times
- There is no data model that shows what the database should really look like. The only option is to look at the freshly updated database
- Upgrade scripts can break if schema drift occurs. This can happen when a patch is made directly to a production server and those changes either don’t make it back to the development environment or are implemented there differently than they were in production
- Upgrade scripts can also break if not run in the correct order
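Migration tools typically solve the “run exactly once, in order” problem with a journal table in the target database: each script records its version number once it succeeds and is skipped on later deployments. A hand-rolled sketch of that pattern might look like the following; the dbo.SchemaVersions table, version number, and column change are all hypothetical, and tools such as SQL Change Automation or DbUp handle this bookkeeping for you.

```sql
-- Migration 0042: add a LoyaltyPoints column to dbo.Customer.
-- Assumes a dbo.SchemaVersions journal table already exists; the guard
-- ensures each migration executes exactly once per database, in order.
IF NOT EXISTS (SELECT 1 FROM dbo.SchemaVersions WHERE VersionNumber = 42)
BEGIN
    ALTER TABLE dbo.Customer
        ADD LoyaltyPoints INT NOT NULL
            CONSTRAINT DF_Customer_LoyaltyPoints DEFAULT (0);

    INSERT INTO dbo.SchemaVersions (VersionNumber, AppliedOn)
    VALUES (42, SYSUTCDATETIME());
END;
```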
State-based deployment: With this option you store the data model by taking a snapshot of the current state of the database and putting it in source control, then use comparison tools to figure out what needs to be deployed (i.e. doing a schema compare between your repository and the target database). Every table, stored procedure, view, and trigger is saved as a separate .sql file, which is the real representation of the state of your database objects. This is a much faster option, as the only changes deployed are those needed to move from the current state to the required state (usually via a DACPAC). This is what SQL Server Data Tools (SSDT) for Visual Studio does with its database projects, which include schema comparison and data comparison tools, or you can use a product like SQL Compare from Redgate. Using the example above of creating an index, with this option you simply create the final index instead of creating and modifying it multiple times. In a state-based approach the system of truth is the source code itself. Another benefit is that you do not have to deal with ALTER scripts: the schema/data compare tool takes care of generating the ALTER scripts and runs them against the target database without any manual intervention. So the developer just needs to keep the database structure up to date and the tools do the rest. The end result is much less work compared to the migration-based deployment.
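With SSDT, for example, the compare-and-deploy step usually comes down to publishing the project’s compiled DACPAC with SqlPackage.exe, which computes the difference against the target and applies it (or just scripts it out for review). The file, server, and database names below are placeholders.

```bat
REM Deploy directly: SqlPackage diffs the DACPAC against the target database
REM and applies only the changes needed to reach the desired state.
SqlPackage.exe /Action:Publish ^
    /SourceFile:"WideWorldImporters.dacpac" ^
    /TargetServerName:"myserver.database.windows.net" ^
    /TargetDatabaseName:"WideWorldImporters"

REM Or generate the upgrade script for review instead of running it.
SqlPackage.exe /Action:Script ^
    /SourceFile:"WideWorldImporters.dacpac" ^
    /TargetServerName:"myserver.database.windows.net" ^
    /TargetDatabaseName:"WideWorldImporters" ^
    /OutputPath:"upgrade.sql"
```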
While it may seem state-based deployment is always the way to go, migration-based deployment may make more sense in scenarios where you need more fine-grained control over the scripts, since with state-based deployment you are not able to modify the difference script. Having control over the scripts also lets you write better scripts than you believe the schema compare tool would generate. Other reasons: by making each change a first-class artifact, you can “build once, deploy often” (as opposed to generating something new before each deployment); you encourage small, incremental changes (in keeping with the Agile/DevOps philosophy); and it’s much easier to support parallel development strategies with migrations, in part because the migrations themselves are small, incremental changes (e.g. deploying different features or development branches to different target environments, such as staging and production).
Once you figure out which deployment method you will use, the next step is to learn the options for version control and an automated build process. Check out these blogs for help: Automating Builds from Source Control for the WideWorldImporters Database (state-based approach), Database Development in Visual Studio using SQL Change Automation: Getting Started (migration-based approach), Deploying Database changes with Redgate SQL Change Automation and Azure DevOps (lab), Introduction to SQL Server database continuous integration, Basic Database Continuous Integration and Delivery (CI/CD) using Visual Studio Team Services (VSTS), Continuous database deployments with Azure DevOps, Why You Should Use a SSDT Database Project For Your Data Warehouse.
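To give a rough idea of what the automated build and deployment can look like for the state-based approach, here is a minimal Azure Pipelines sketch that builds an SSDT database project into a DACPAC and publishes it with SqlPackage. The project file, server variable, and database name are placeholders, and the posts linked above walk through real setups in much more detail.

```yaml
trigger:
  - main

pool:
  vmImage: 'windows-latest'

steps:
  # Build the SSDT database project; the output is a .dacpac containing the model.
  - task: VSBuild@1
    inputs:
      solution: 'WideWorldImporters.sqlproj'
      configuration: 'Release'

  # Publish the DACPAC: SqlPackage compares it to the target database and
  # applies only the changes required to reach the desired state.
  - script: >
      SqlPackage.exe /Action:Publish
      /SourceFile:"bin\Release\WideWorldImporters.dacpac"
      /TargetServerName:"$(TargetServer)"
      /TargetDatabaseName:"WideWorldImporters"
    displayName: 'Deploy database changes'
```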
One last point of clarification: DataOps focuses on keeping track of database objects such as tables, stored procedures, views, and triggers, while DevOps focuses on source control for application code. These are usually done separately (i.e. separate projects and pipelines in Azure DevOps). Also usually done separately are Azure Databricks (see CI/CD with Databricks and Azure DevOps), Power BI (for the PBI service, see The future of content lifecycle management in Power BI; for PBI desktop, see Power BI Source Control and Version control for Power BI files and Power BI Governance, Good Practices, Part 2: Version Control with OneDrive, Teams and SharePoint Online; and with the new XMLA endpoint support, you can use SSDT/DevOps to track models, see Use cases of the XMLA endpoint), Azure Data Factory (see Continuous integration and delivery (CI/CD) in Azure Data Factory, Azure data Factory –DevOps, and Azure DevOps Pipeline Setup for Azure Data Factory (v2)), and Azure Analysis Services (see Azure Analysis Services deploy via DevOps and You, Me, & CI/CD: Deploying Azure Analysis Services Tabular Models through Azure Pipelines). There is also a whole separate category for machine learning operations (MLOps); see the MLOps workshop by Dave Wentzel (who helped with this blog). If you have any interest in these topics, please comment below and based on the interest I will follow up with additional blog posts.
Finally, check out this excellent DataOps workshop/demo by Lace Lofranco here.
More info:
Managing Schemas And Source Control For Databases
DevOps: Why Don’t Database Developers Use Source Control?
DevOps: Should databases use migration-based or state-based deployment?
Database Delivery – State based vs Migration based
Migration-base vs State-based database development approach
DataBase DevOps Migration-Based vs. State-Based
A strategy for implementing database source control
Why I like the ‘Release Flow’ branching strategy with Git for database DevOps
DataOps is NOT Just DevOps for Data
How SQL Server Professionals can benefit from using version control
Video DataDevOps for the Modern Data Warehouse on Microsoft Azure – Lace Lofranco (artifacts here)
In my opinion the way forward is a combination of the two: idempotent pre- and post-deployment scripts sandwiching a state-based deployment. There are actions that will fail due to state if you rely on state-based deployment only.
Tuan, I’m trying to understand your comment. Could you give an example of when an idempotent script would _not_ work for a state-based deployment? Which actions will fail due to state if you rely on state-based deployments only?
It seems like one should be able to find a creative way to make _any_ SQL script idempotent, which means definitionally there should be no side effects if I press F5 a second time accidentally.
I love this post for so many reasons. IMO, state-based is the way to go… especially for team development!
Each member of the team should have their own DEV instance and be able to build/deploy the current copy of the db. Shared DEV instances are frustrating with multiple developers stepping on each other’s toes.
In some cases, testing can be tricky and special considerations are required for large datasets or regulatory privacy issues (e.g. HIPAA/GDPR/etc). Source control policies (e.g. gated checkins) can help ensure no member of the team “breaks the build” (is this possible with migration-based scripts?)
As for deployment, yes, that can also be tricky with state-based… but I’d argue the same for migration-based. One important thing to point out is that with state-based deployment (e.g. SSDT/SqlPackage.exe) you can generate a deployment script, review/modify it as needed, and finally test again in a separate non-production (e.g. Test or Integration) environment before final (automated) deployment to prod. This is very helpful with large databases where data movement is a concern and some state-based tools don’t generate the most efficient code.
It seems like both approaches suffer when there is drift. In the state-based deployment, if an index is added to the database after a snapshot is captured, then the deployment would presumably drop that index to enforce the state in source control. In the migration-based deployment, drift seems like it might only matter in cases where a migration refers to another object (e.g. a new foreign key to a pre-existing table and column) that’s been deleted. In the state-based case you would likely notice drift on the first deployment after the drift occurs. In the migration-based case you could go for hundreds of deployments before someone writes a new migration that relies on a deleted referent.
As in continuous delivery of applications, it seems that potentially the most critical change management activity is weaning teams off of problem solving in production.