Where should I clean my data?
As a follow-up to my blogs What product to use to transform my data? and Should I load structured data into my data lake?, I wanted to talk about where you should clean your data when building a modern data warehouse in Azure. As an example, let’s say I have an on-prem SQL Server database and I want to copy one million rows from a few tables to a data lake (ADLS Gen2) and then to Azure SQL DW, where the data will be used to generate Power BI reports (for background on a data lake, check out What is a data lake?, Why use a data lake?, and Data lake details). There are five places where you could clean the data:
- Clean the data, and optionally aggregate it, as it sits in the source system. The tool used for this would depend on the source system that stores the data (e.g. if SQL Server, you would use stored procedures). The only benefit of this option is that if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to the source system if you needed it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning, which could affect end users using the system; it could take a long time to clean the data, as the source system may not have fast performance; and you would not be able to use other tools (e.g. Hadoop, Databricks) to clean it. Strongly advise against this option.
- Clean the data as it goes from the source to the data lake. The products that could be used for this include SSIS, Azure Data Factory (ADF) Data Flows (renamed ADF Mapping Data Flows), Databricks, or Power BI Dataflow. Note that ADF Mapping Data Flows can only be used against four data sources: Azure SQL DB, Azure SQL DW, Parquet (from ADLS & Blob), and Delimited Text (from ADLS & Blob). Think of these four data sources as staging areas for the data to be used by ADF Mapping Data Flows. The disadvantages of this option are: it will take additional time to pull the data from the source system, which could affect performance for end users using the system; and you won’t have the raw data in the data lake. If using SSIS or Power BI Dataflow, it would perform nonparallel row-by-row transformations (see Difference between ETL and ELT), which could be very slow when cleaning many rows of data. See below for how SSIS compares to ADF Mapping Data Flows and Databricks. Strongly advise against this option.
- Clean the data after it has landed in the data lake. You land the data in a raw area in the data lake, clean it, then write it to a cleaned area in the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW via Polybase, all of which can be orchestrated by ADF. This is the preferred option, as it results in having the raw data in the data lake, minimizes the time spent hitting the source system, and saves costs compared to cleaning the data in a data warehouse (see Reduce Costs By Adding A Data Lake To Your Cloud Data Warehouse). It is also the fastest way to clean the data. ADF makes it really easy to move data from a source system (it supports 79 sources) to the data lake (ADLS Gen2) by creating an ADF pipeline that uses the Copy Activity with a Self-Hosted Integration Runtime connected to your on-prem SQL Server. You could land the data in the data lake in Parquet format with a high Data Integration Units (DIU) setting to make it super-fast. ADF also makes it really easy via the Copy Activity to copy the cleaned data in the data lake to 23 destinations (“sinks”) such as SQL DW. In addition, you could copy the cleaned data in the data lake to an MDM solution to master it, then write it back to the data lake in another layer called the mastered layer. The cleaned area is also where you could join the data from various files together or split them apart for security reasons (e.g. splitting a file with rows from different countries into a separate file per country). Products that could be used to clean the data: ADF Mapping Data Flows, Databricks, or HDInsight (see the sketch after this list). See below for how ADF Mapping Data Flows compares to Databricks.
- Clean the data as it moves from the raw area in the data lake to SQL DW. The disadvantages of this option are: you won’t have the cleaned data in the data lake (meaning other products that need to access the cleaned data would have to query SQL DW, impacting user queries there); it would do row-by-row transformations, which are slow; and some products don’t use Polybase, so it would be slow to load the data into SQL DW. Also, it would take longer before an end user could access the cleaned data compared to having it in the data lake, due to the delay of creating ETL and metadata code. Products that can be used for this include ADF Mapping Data Flows or Databricks using the SQL DW Connector (which uses Polybase). Strongly advise against this option.
- Copy the data from the raw area in the data lake to SQL DW, then clean it via stored procedures in SQL DW. The disadvantages of this option are: it is more costly, as SQL DW costs more than Databricks; the data transformations can affect users running queries on SQL DW; and the cleaned data won’t be available in the data lake for others to use (meaning other products that need to access the cleaned data would have to query SQL DW, impacting user queries there). Also, it would take longer before an end user could access the cleaned data compared to having it in the data lake, due to the delay of creating ETL and metadata code. You would use ADF to copy the data from the raw area into staging tables in SQL DW, which would use Polybase, then execute stored procedures that clean the data and copy it to production tables in SQL DW. Strongly advise against this option.
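To make the preferred option a bit more concrete, here is a minimal PySpark sketch (as you might run it in Databricks) of reading data that ADF landed in a raw layer in ADLS Gen2, applying a few cleaning rules, and writing it to a cleaned layer. The storage account, paths, and column names are hypothetical placeholders; ADF would orchestrate the copy into the raw layer and the Polybase load from the cleaned layer into SQL DW.

```python
# Minimal sketch of the raw -> cleaned pattern in option 3, assuming hypothetical
# paths and column names. Run as a Databricks notebook orchestrated by ADF.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

lake = "abfss://datalake@mystorageaccount.dfs.core.windows.net"

# Read the raw Parquet files that the ADF Copy Activity landed in the raw layer
raw_df = spark.read.parquet(f"{lake}/raw/sales/")

# Example cleaning rules: trim strings, standardize dates, drop rows missing a key
cleaned_df = (
    raw_df
    .withColumn("customer_name", F.trim(F.col("customer_name")))
    .withColumn("order_date", F.to_date(F.col("order_date"), "yyyy-MM-dd"))
    .filter(F.col("order_id").isNotNull())
)

# Write to the cleaned layer for Polybase loading into SQL DW (and other consumers)
cleaned_df.write.mode("overwrite").parquet(f"{lake}/cleaned/sales/")
```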
Be aware this is not “one size fits all”: if you are building a solution that moves lots of data from many different data sources, there may be a handful of use cases within that solution where it is faster and/or easier to clean the data as it moves to or from the data lake.
A bit about how ADF Mapping Data Flows works “under the covers”. ADF Mapping Data Flows is a big deal, as it brings GUI-based design together with scale. It uses the Azure Databricks compute engine under the covers: the ADF JSON code is converted to the appropriate Scala code, which is prepared, compiled, and executed in Azure Databricks, which will automatically scale out as needed. The distribution of data across worker/compute nodes in Databricks is where the optimizations come in: each of the transformations in the data flow has an “Optimize” tab that lets you control how the data gets partitioned across the worker nodes. If your source is Azure SQL DB, it will use the out-of-the-box JDBC driver, which means the Databricks workers access Azure SQL DB directly. There are slight exceptions: if Azure SQL Data Warehouse is the source/destination, the data can be landed temporarily in blob storage (not the hierarchical namespace / ADLS Gen2) before being Polybased in. If the source/destination can use Polybase, you set a staging linked service/container in the Data Flow activity settings.
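As a hedged illustration of that staging behavior, here is how a cleaned DataFrame could be written from Databricks to SQL DW with the SQL DW connector, which stages the data in blob storage and then Polybases it in. The JDBC URL, storage container, and table name are placeholders, and authentication details are omitted.

```python
# Sketch: write a cleaned DataFrame to Azure SQL DW using the Databricks SQL DW
# connector, which stages the data in blob storage and loads it via Polybase.
# Connection string, container, and table names below are hypothetical;
# SQL DW user/password or AAD settings are omitted for brevity.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cleaned_df = spark.read.parquet(
    "abfss://datalake@mystorageaccount.dfs.core.windows.net/cleaned/sales/"
)

(cleaned_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://mydw.database.windows.net:1433;database=MyDW")
    .option("tempDir", "wasbs://staging@mystorageaccount.blob.core.windows.net/tempdir")  # blob staging area
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.CleanedSales")
    .mode("overwrite")
    .save())
```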
How does ADF Mapping Data Flows compare to using SSIS? In SSIS, you would create an SSIS Data Flow Task to do, for example, a Lookup and a Data Conversion on the one million rows. This could be slow because it would have to copy the data from SQL Server in batches (usually 10,000 rows) to the SSIS server (so you’re dependent on the size of the SSIS server, see How Much Memory Does SSIS Need?), then go row by row updating the data, and land each batch in a cleaned layer in ADLS Gen2 before doing the next one. With ADF Mapping Data Flows, you create an ADF pipeline that uses the Copy Activity to copy the one million rows from SQL Server to a raw area in ADLS Gen2, then create a Data Flow activity in the ADF pipeline to do the transformations (see Azure Data Factory Data Flow), which behind the scenes fires up Databricks, puts the data in a Spark in-memory DataFrame across the workers, and runs Scala code on the DataFrame to update the rows. This is very fast because the reads from the data lake into memory are done in parallel by the Databricks worker engines (you can be reading as many extents of that file as you have worker cores), and because Databricks can be scaled for the transformations (it is still doing row-by-row iterations, but it is doing them in parallel according to the number of workers/cores and the way the data is distributed across those workers). After the transformations are done, you would use the ADF Mapping Data Flow Sink Transformation to save this data in a cleaned layer in ADLS Gen2. The end result is that ADF Mapping Data Flows is likely to be much faster than SSIS.
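For a sense of what those DataFrame transformations look like in code, here is a hedged PySpark sketch of the same Lookup and Data Conversion work expressed as set-based operations that the workers execute in parallel (the paths and column names are made up for illustration):

```python
# Sketch: the SSIS "Lookup" becomes a join and the "Data Conversion" becomes a cast,
# both running in parallel across the Databricks workers. Names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
lake = "abfss://datalake@mystorageaccount.dfs.core.windows.net"

orders = spark.read.parquet(f"{lake}/raw/orders/")
customers = spark.read.parquet(f"{lake}/raw/customers/")

transformed = (
    orders
    # Lookup: enrich each order with its customer surrogate key
    .join(customers.select("customer_id", "customer_key"), on="customer_id", how="left")
    # Data Conversion: cast the amount to a consistent decimal type
    .withColumn("order_amount", F.col("order_amount").cast("decimal(18,2)"))
)

# Equivalent of the Mapping Data Flow sink: land the result in the cleaned layer
transformed.write.mode("overwrite").parquet(f"{lake}/cleaned/orders/")
```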
What is the difference between using Databricks and ADF Mapping Data Flows? ADF Mapping Data Flows is much easier to use because of its drag-and-drop interface, while Databricks is all code-based. However, use Databricks if you like notebooks, want to use ML tools, and want more monitoring. Performance when using ADF Mapping Data Flows would be the same as using Databricks separately, or the Databricks Notebook activity in ADF, unless you used a different partitioning scheme (controlled by the “Optimize” tab in ADF Mapping Data Flows) or cluster size. The default for ADF Mapping Data Flows is 8 cores of general compute, where the driver node is assigned 4 cores and the rest are assigned to workers; that is part of the definition of the default Azure IR. If you wish to set your own configuration, you create a new Azure IR (which can have up to 256 cores) and choose that named IR in your Data Flow activity. Performance also depends on the startup time for Databricks when using an ADF Mapping Data Flow (which can take five minutes, but this can be greatly reduced if you turn on Mapping Data Flow Debug Mode or use the time-to-live setting, which will be available soon), compared to using a separate Databricks cluster that is already running (which is instantaneous if you keep it running, but that can be very expensive). At best you can expect at least a one-minute set-up time when using ADF Mapping Data Flows. That’s the normal overhead of job execution in a distributed big data system like Spark, which is what ADF Mapping Data Flows uses under the hood. The good news is that as you scale the workload up to large data volumes, that one-minute set-up time won’t increase with your workload, so your experienced performance will improve with data volume (see Mapping data flows performance and tuning guide). Finally, keep in mind that Databricks allows you to build a generic solution that could clean data in many files, while ADF Mapping Data Flows requires a package per object.
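To illustrate that last point, here is a hedged sketch of what a “generic” Databricks cleaning routine might look like: one parameterized function looped over many tables, versus authoring a Mapping Data Flow per object. The table names, keys, and rules are hypothetical.

```python
# Sketch: one parameterized cleaning routine applied to many raw tables,
# illustrating the "generic solution" point above. Names and rules are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
lake = "abfss://datalake@mystorageaccount.dfs.core.windows.net"

def clean_table(table_name: str, key_column: str) -> None:
    """Read one raw table, apply generic rules, and write it to the cleaned layer."""
    df = spark.read.parquet(f"{lake}/raw/{table_name}/")
    cleaned = (
        df.dropDuplicates([key_column])           # remove duplicate keys
          .filter(F.col(key_column).isNotNull())  # drop rows missing the key
    )
    cleaned.write.mode("overwrite").parquet(f"{lake}/cleaned/{table_name}/")

# One notebook run cleans every table in the list
for table, key in [("orders", "order_id"), ("customers", "customer_id"), ("products", "product_id")]:
    clean_table(table, key)
```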
More info:
ADF Mapping Data Flow Patterns
Comparing Mapping and Wrangling Data Flows in Azure Data Factory
Three Ways to Use Power BI Dataflows
Doing Power BI the Right Way: 4. Power Query in Dataflows or Power BI Desktop
Hello James,
I am surprised by your recommendation of keeping the data in SQL DW, as the number of concurrent queries that can be executed is pretty low compared with Exadata, Teradata, and SQL Server. Didn’t you face these issues at any of your clients?
I usually recommend having both a SQL DW and a cube as I explained at https://www.jamesserra.com/archive/2018/07/the-need-for-having-both-a-dw-and-cubes/
Thanks for the article, James! Can you point me to resources that give examples of how to do Type 1 and Type 2 dimensional modeling using Azure Databricks and Azure Data Lakes? I understand the value of using Azure Databricks for the type of data wrangling that is often necessary for data science work, but I don’t understand how to use it to perform the ETL tasks that I currently do using SQL-based tools like MERGE statements and SSIS to populate data warehouses. I am trying to get a better understanding of the “modern data warehouse”.
Hi Ryan,
There is documentation on SCD Type 2 with Databricks here: https://docs.azuredatabricks.net/spark/latest/spark-sql/language-manual/merge-into.html
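In case it helps, here is a minimal two-pass sketch of the SCD Type 2 pattern using Delta Lake MERGE (run via spark.sql), assuming a hypothetical dim_customer Delta table with is_current/start_date/end_date columns and a staged_customer_updates table of incoming changes; the linked documentation is the authoritative reference.

```python
# Hedged two-pass SCD Type 2 sketch with Delta Lake MERGE. Table and column
# names (dim_customer, staged_customer_updates, is_current, ...) are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pass 1: close out the current row for customers whose tracked attribute changed
spark.sql("""
  MERGE INTO dim_customer AS t
  USING staged_customer_updates AS s
  ON t.customer_id = s.customer_id AND t.is_current = true
  WHEN MATCHED AND t.address <> s.address THEN
    UPDATE SET is_current = false, end_date = current_date()
""")

# Pass 2: insert a new current row for changed customers and for brand-new customers
spark.sql("""
  MERGE INTO dim_customer AS t
  USING staged_customer_updates AS s
  ON t.customer_id = s.customer_id AND t.is_current = true
  WHEN NOT MATCHED THEN
    INSERT (customer_id, address, is_current, start_date, end_date)
    VALUES (s.customer_id, s.address, true, current_date(), null)
""")
```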
Thanks James!
Hi James
Great article as always.
I’ve found that ADFv2 is exorbitantly expensive. Any advice on how to implement incremental load patterns using the modern DW architecture? For example, how/where to surface LastModifiedDateTime or LastIncrementalExecutionDateTime metadata for each table load – CSV in ADLv2? Good ol’ SQL Server? Databricks Delta? HDFS? I’d love to see a detailed article.
Also, is there any plan for Power BI Dataflow integration with ADFv2 (via a component) or Databricks (via a jar/egg)? This is currently a big disconnect, e.g., Databricks cannot natively read/write the model.json metadata file associated with PBI Data Flows.
Hi Simon,
Incremental Data Load for ADF is explained here: https://docs.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-multiple-tables-portal
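For a rough idea of the pattern (expressed in PySpark rather than the ADF pipeline from that tutorial), here is a hedged high-watermark sketch; the JDBC connection, table names, and watermark location are placeholders.

```python
# Hedged sketch of a high-watermark incremental load in PySpark. The connection
# details, table names, and watermark location are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
lake = "abfss://datalake@mystorageaccount.dfs.core.windows.net"
watermark_path = f"{lake}/control/orders_watermark/"

# 1. Read the last watermark (stored as a tiny Parquet file in the lake)
last_modified = spark.read.parquet(watermark_path).collect()[0]["last_modified"]

# 2. Push a filtered query down to the source so only changed rows are transferred
query = f"(SELECT * FROM dbo.Orders WHERE LastModifiedDateTime > '{last_modified}') AS delta"
delta_df = spark.read.jdbc(
    url="jdbc:sqlserver://myserver.database.windows.net:1433;database=SourceDb",
    table=query,
    properties={"user": "etl_user", "password": "<secret>"},
)

# 3. Append the delta to the raw layer and advance the watermark
#    (in production, keep the old watermark if no new rows arrived)
delta_df.write.mode("append").parquet(f"{lake}/raw/orders/")
delta_df.agg(F.max("LastModifiedDateTime").alias("last_modified")) \
    .write.mode("overwrite").parquet(watermark_path)
```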
Just wondering: is building SCD2 in Databricks faster than on-premises SQL Server?
FYI, I tested the SCD2 merge SQL in SQL Server: for a dimension with 20 million rows, supposing half of them got updated, the SCD2 process runs in ~4 minutes. I got 30 million records after the dimension processing.
I know for a fact I could not get the same performance on an on-premises data lake with Spark.
@Will, are you sure you have partitioned your data optimally in the on-premises data lake? Also make sure that your Spark jobs are not starved of resources on the nodes, so that you are working from an apples-to-apples comparison.
We have done a test on Databricks Delta and have found it faster.
James,
Excellent summary. As I’m still learning these technologies, I wanted to try and test each one of the scenarios mentioned and compare them.
Are there any sample data sets or tutorials that I can use to compare each scenario? For example, if I had to copy two tables, do an aggregation, and then report on structured data, as well as the same with unstructured data sets. Hope that makes sense.