SQL Data Warehouse reference architectures
With so many product options to choose from when building a big data solution in the cloud, such as SQL Data Warehouse (SQL DW), Azure Analysis Services (AAS), SQL Database (SQL DB), and Azure Data Lake (ADL), there are many ways to combine the products, each with its own pros, cons, and cost profile. Since many customers are looking at SQL DW, I wanted to describe the reference architectures I have seen, ordered from highest cost to lowest cost:
- Do staging, data refinement, and reporting all in SQL DW. You can scale compute power up when needed (i.e. during staging, data refinement, or heavy reporting activity) and down to save costs (i.e. nights and weekends when user reporting is low). The pros of this option are that by reducing the number of technologies you build a simpler solution and keep fewer copies of the data. The cons are that since everything is done on SQL DW you can have performance issues (i.e. doing data refinement while users are reporting), can hit the SQL DW concurrent query limit, and can have a higher cost since SQL DW is the highest-cost product, especially if you are not able to pause it. Pausing reduces your compute cost to zero, leaving only storage charges (see Azure SQL Data Warehouse pricing), but no one can use SQL DW while it is paused.
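Scaling compute up and down as described above can be done with a single T-SQL statement run against the master database. A minimal sketch ("MyDW" and the DW service levels here are placeholder values):

```sql
-- Scale SQL DW up before a heavy staging/refinement window
-- (run from the master database; MyDW is a placeholder name)
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW1000');

-- Scale back down once the heavy work is done, to save costs
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW200');
```

Note that pausing (as opposed to scaling) is not done through T-SQL; it is done through the Azure portal, PowerShell, or the REST API.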
- Do staging and data refinement in a Hadoop data lake, and then copy all or part of the data to SQL DW for reporting. This saves costs in SQL DW by offloading data refinement, and gives you the benefits of a data lake (see What is a data lake? and Why use a data lake?). You save costs by not having to scale up SQL DW to do the data refinement (scaling up would minimize the impact on reporting performance and refine data more quickly) and by not having to store as much data in SQL DW. You also save costs by archiving data in the data lake and using PolyBase to access it (be aware these queries can be slow, as PolyBase does not support query pushdown in SQL DW). A con of this architecture is having an extra copy of the data along with the extra ETL needed.
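Accessing archived data lake files from SQL DW via PolyBase means defining an external data source, file format, and external table. A sketch, where the storage account, credential, object names, and file layout are all assumptions:

```sql
-- Point SQL DW at the archive location in storage
-- (assumes a database-scoped credential named AzureStorageCredential already exists)
CREATE EXTERNAL DATA SOURCE AzureStorageArchive
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://archive@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- Describe how the archived files are laid out
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- Expose the files as a queryable table
CREATE EXTERNAL TABLE dbo.SalesArchive (
    SaleDate    DATE,
    CustomerKey INT,
    Amount      DECIMAL(18,2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = AzureStorageArchive,
      FILE_FORMAT = PipeDelimitedText);
```

Queries against dbo.SalesArchive read the files from storage at query time, which is why they can be slower than queries against internal tables.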
- Do staging and data refinement in SQL DW, and copy some or all data to one or more data marts (in SQL DB or SQL Server in a VM) and/or one or more cubes (in AAS or SSAS in a VM) for reporting, which is considered a “Hub-and-Spoke” model. Scale down SQL DW after data refinement and use it for a limited number of big queries. This overcomes the SQL DW concurrent query limit by having users query the data mart/cube, and saves costs by querying less expensive options. You also get the benefits that come with a cube, such as a semantic layer and row-level security, that are not available in SQL DW (see Why use a SSAS cube?). This architecture can also be combined with the previous architecture to add in a data lake. A con of this architecture is having extra copies of the data along with the extra ETL needed.
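One common way to build the curated table that gets copied out to a spoke (data mart or cube) is CTAS (CREATE TABLE AS SELECT) in SQL DW. A sketch, with all table and column names being illustrative:

```sql
-- Build a curated fact table from staged data using CTAS,
-- taking advantage of SQL DW's distributed compute
CREATE TABLE dbo.FactSalesCurated
WITH (
    DISTRIBUTION = HASH(CustomerKey),   -- distribute on a common join key
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT s.SaleDate, s.CustomerKey, s.ProductKey, s.Amount
FROM   stg.Sales s
JOIN   dbo.DimCustomer c ON c.CustomerKey = s.CustomerKey
WHERE  s.Amount IS NOT NULL;
```

The resulting table is then what gets copied to the data mart or processed into the cube.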
- Do staging and data refinement in SQL DW, and copy all data to a data mart (SQL DB or SQL Server in a VM) and/or a cube (AAS or SSAS in a VM) for reporting. Pause SQL DW after staging and data refinement are done. This is used when giving users access to SQL DW would impact ELT, when user queries would not be as responsive as needed, or when cost is a top priority (you only pay for storage while SQL DW is paused). The cons of this architecture are having extra copies of the data along with the extra ETL needed, and not having SQL DW available for big queries.
More info:
Using SQL Server to Build a Hub-and-Spoke Enterprise Data Warehouse Architecture
Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation
Common ISV application patterns using Azure SQL Data Warehouse
Azure SQL Data Warehouse Workload Patterns and Anti-Patterns
#Azure #SQLDW, the cost benefits of an on-demand data warehousing
Hi James,
I see a big push towards SQL DW. But what I’m missing in scenarios 1, 3, and 4 is the ETL component. You mention ‘Do staging, data refinement in SQL DW’, but not which tools to use for doing that. Of course we can write SQL stored procedures, but nobody likes having too many stored procedures, and they are a nightmare to maintain.
We could think of Azure Data Factory, but there we have to program our own transformations and execute them on some kind of host (say HDInsight).
What do you see at clients that have shifted to SQL DW?
Hi Niek,
Great question! You can use SSIS or Azure Data Factory for ELT, but they should be thought of as orchestration tools (no data transformation within the tool); otherwise they become an ETL solution. With SQL DW I recommend you do the transformations within SQL DW via stored procedures for scenarios 1, 3, and 4, so you use the power of SQL DW to do the transformations quickly. Most clients use stored procedures, and if they want to do transformations outside SQL DW, that is where scenario 2 and the data lake come into play (a future blog will talk about the compute options when using Azure Data Lake Store).
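To make the stored-procedure approach concrete, here is a minimal sketch of the common ELT pattern in SQL DW: transform with CTAS, then swap tables with RENAME OBJECT. All object names here are illustrative:

```sql
CREATE PROCEDURE dbo.RefineSales
AS
BEGIN
    -- Build the refined table using the full power of SQL DW
    CREATE TABLE dbo.FactSales_New
    WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
    AS
    SELECT CustomerKey, SaleDate, SUM(Amount) AS Amount
    FROM   stg.Sales
    GROUP  BY CustomerKey, SaleDate;

    -- Swap the new table in with minimal disruption, then drop the old one
    RENAME OBJECT dbo.FactSales TO FactSales_Old;
    RENAME OBJECT dbo.FactSales_New TO FactSales;
    DROP TABLE dbo.FactSales_Old;
END;
```

An orchestration tool like Azure Data Factory or SSIS would then simply call dbo.RefineSales on a schedule, keeping the transformation logic inside SQL DW.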
So many options these days. I would add:
* CosmosDB to dump (stage) raw data – easier to work with than Hadoop in my opinion and lower cost entry point. Particularly good if source data structure is not so reliable.
* SQL Server on VM for ETL and Master Data Management (non-transactional) – hope to see this functionality as PaaS one day
* Curated data stored in SQL DW (particularly for larger datasets)
* Azure SSAS for OLAP (Tabular)
Thanks for the comment Ian! This blog post was just about SQL DW, but a future blog post will dive more into other choices as you pointed out.
Thanks for putting all this info together, James.