What product to use to transform my data?
If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product to provide the compute that will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also handle the Extract and Load parts):
- SSIS – The product used most often for on-prem ETL still has use cases when moving to the cloud, especially when migrating a solution from on-prem to the cloud. If you have hundreds of SSIS packages, it may make the most sense to continue to use those packages and just change the destination adapter (i.e. from your on-prem SQL Server to Azure SQL Database) and have them execute under the Azure-SSIS integration runtime (IR) in Azure Data Factory (ADF). Keep in mind that if a package accesses local resources (like storing temp files to a local path or using custom controls), some re-work will be required. A concern is packages that move a ton of data, as executing transformations row-by-row (ETL) can be much slower than batch transformations (ELT) – see Difference between ETL and ELT. Also, the data has to be moved to the location of the IR, and the compute power is limited by the size of the IR cluster. For those SSIS packages that are too slow, you can replace them with one of the products below.
- Azure Data Factory (ADF) – Now that ADF has a new feature called Data Flow, it can transform data, so it is more than just an orchestration tool. Behind the scenes, the ADF JSON code that is created when you build a solution is converted to the appropriate Scala code, then prepared, compiled, and executed in Azure Databricks. This means Data Flow operates in an ELT manner: it loads the data into a place where Databricks can access it, performs the transformations, and then moves it to the destination. ADF provides a native ETL scheduler so that you can automate data transformation and movement, either through visual data flows or via script activities that execute in Databricks or other execution engines (so, like with SSIS, data flows are row-by-row transformations, and for large amounts of data it may be faster to execute a batch transformation via a script in Databricks). My thoughts on when to use ADF: obviously if you are already using it, or if your skillset lies in SSIS, as it’s pretty easy to learn ADF with an SSIS background.
- Databricks – A Spark-based analytics platform, which makes it great to use if you like to work with Spark, Python, Scala, and notebooks. When choosing between Databricks and ADF, what I’ve noticed is that it depends highly on the customer personas and their capabilities. There are plenty of Data Engineers and Data Scientists who want to get deep into Python or Scala and sling some code in Databricks notebooks (see the notebook sketch after this list for the kind of transformation I mean). But the larger audience, who want to focus on building business logic to clean customer/address data, for example, don’t want to learn Python libraries and will use the ADF visual data flow designer. Many of those are also Data Engineers and Data Scientists, but then we start to move up the value stack to include Data Analysts and Business Analysts, which is where we start to overlap with Power BI Dataflow (see below). Either way, when you want to orchestrate these cleaning routines with schedules, triggers, and monitors, you want that to be through ADF. Keep in mind that if you code your transformations in Databricks notebooks, you will be responsible for maintaining that code, troubleshooting it, and scheduling those routines.
- HDInsight (HDI) – Databricks is the preferred product over HDI, unless the customer has a mature Hadoop ecosystem already established. But more and more I find that the majority of workloads are Spark, so Databricks is a better option. In terms of pure Spark workloads, Databricks greatly outperforms HDI. Although Databricks clusters can be up and running indefinitely, they’re intended to be created and terminated as necessary – so if you wish to use other Apache Hadoop data transformation tools and have them available 24/7, then HDI may be a better option than Databricks.
- PolyBase – Azure SQL Data Warehouse (SQL DW) supports PolyBase, so if you are using SQL DW you can pull data from the data lake into SQL DW using T-SQL. One option is to use PolyBase to import the raw data from the data lake into SQL DW and then clean it there using stored procedures, especially if you want to stick with T-SQL and don’t want to deal with Spark or Hive or other more-difficult technologies (a rough sketch of this load pattern appears after the next paragraph). But the benefit of cleaning the data while it sits in the data lake is that you off-load the cleaning process so you are not affecting user queries, and you reduce the storage space needed in SQL DW. Cleaning the data in the data lake can also be less expensive, and there are more tools available for data manipulation than just T-SQL. Another benefit of cleaning the data in the data lake is that you make it available to data scientists and power users, who don’t have to wait until it’s in SQL DW.
- Power BI Dataflow – In short, Dataflow integrates data lake and data prep technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of the Power BI service, not just Power BI Desktop, and is called Power Query Online) can create, customize, and manage data within their Power BI experience (think of it as self-service data prep). So even though the main use case for Dataflow is an individual solution or a small workload, you can think of Power Query as the “compute” part that transforms data that lands in the data lake, which can then be used as part of an enterprise solution.
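To make the Databricks option a bit more concrete, here is a minimal PySpark sketch of the kind of cleaning notebook mentioned in the Databricks bullet above. It is only an illustration: the storage account, container, folder, and column names are invented, and `spark` is the SparkSession that a Databricks notebook provides for you.

```python
# Minimal sketch of a Databricks notebook that cleans customer data in the lake.
# The paths and column names below are placeholders -- substitute your own.
# `spark` (a SparkSession) is predefined in a Databricks notebook.
from pyspark.sql import functions as F

raw_path     = "abfss://raw@mydatalake.dfs.core.windows.net/customers/"      # hypothetical
cleaned_path = "abfss://cleaned@mydatalake.dfs.core.windows.net/customers/"  # hypothetical

# Read the raw CSV export from the source system
customers = (spark.read
             .option("header", "true")
             .csv(raw_path))

# Basic cleanup: trim strings, standardize casing, drop duplicate customers
cleaned = (customers
           .withColumn("customer_name", F.trim(F.col("customer_name")))
           .withColumn("state", F.upper(F.col("state")))
           .dropDuplicates(["customer_id"]))

# Write the cleaned data back to the lake (Parquet) for downstream loads
cleaned.write.mode("overwrite").parquet(cleaned_path)
```

ADF can then run a notebook like this on a schedule via its Databricks Notebook activity, which is how the orchestration point above usually plays out.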
In addition to what products to use to clean the data, another consideration is where to clean the data. Say I have an on-prem SQL Server database and I’m exporting some tables into CSV files, copying them to the data lake (ADLS Gen2), and then loading them into Azure SQL DW. And I not only have to clean the data but also join these CSV files with other files already in the data lake. Should I clean the data as it moves from the data source to the data lake via ADF, or always land the raw data in the data lake and then clean it, join it, and land it back into the data lake (via ADF or Databricks)? Or clean it as it moves from the data lake to SQL DW via ADF or Databricks? In most cases you will want to do all the transforming in the data lake. So copy the raw data into the data lake, transform it and write it back into the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW, all of which can be orchestrated by ADF. But this is not one-size-fits-all: if you are building a solution that moves lots of data from many different data sources, there may be a handful of use cases within that solution where it is faster and/or easier to clean the data as it moves to or from the data lake.
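To make the last leg of that flow concrete, here is a rough sketch of a PolyBase load from the cleaned layer into SQL DW, driven from Python via pyodbc. The server, database, credentials, table, and lake path are placeholders, and the external data source (`MyDataLake`) and file format (`ParquetFormat`) are assumed to have been created beforehand.

```python
# Rough sketch: load the cleaned files from the data lake into SQL DW with
# PolyBase (T-SQL executed from Python via pyodbc). All names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mydw-server.database.windows.net;DATABASE=mydw;"  # hypothetical
    "UID=loader;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# External table over the cleaned layer in the data lake. The external data
# source (MyDataLake) and file format (ParquetFormat) must already exist.
cursor.execute("""
CREATE EXTERNAL TABLE ext.Customer (
    CustomerId   INT,
    CustomerName NVARCHAR(200),
    State        NVARCHAR(2)
)
WITH (
    LOCATION    = '/cleaned/customers/',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
)
""")

# CTAS pulls the external data into a distributed SQL DW table in one batch
cursor.execute("""
CREATE TABLE dbo.Customer
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.Customer
""")
```

In practice ADF would trigger this step (or an equivalent stored procedure) after the cleaning job in the data lake finishes.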
I’m surprised that you left off Azure Data Lake Analytics.
The U-SQL set-based workflow integrated with C# makes for quite a robust tool set in the Transform arena. ADLA has many of the same features as the systems listed above (integration with Azure Storage and ADLS, scale-out/clustering, etc.).
Is ADLA not dead already? I see people moving workloads to Spark.
ADLA is an option for customers looking for a C#-friendly analytics engine, but we are seeing more interest in open-source solutions. Just be aware that ADLA does not work with ADLS Gen2.
I ran a comparison of Azure Data Lake Analytics and Spark (Databricks). Data Lake Analytics has some serious problems that will be too slow to overcome in order to catch up with Spark, which has already been through the pain and is maturing… I can’t see that ADLA will survive, and Microsoft doesn’t need it to – it should focus on the services that bring something to the table, e.g. Power BI, Data Lake Store, Data Factory, SQL Server.
1. The string size limit is a serious problem in ADLA, and the workaround is a lot of frustrating coding and maintenance which can still hit the binary field limit. Compare that with how easy it is to load data in Spark.
2. Type inference. Having intelligent type inference is really important, especially with large, unwieldy data sets. You have to get the data in somehow to start with, to profile it, before you can define the best schema or understand the data. This is really easy in Spark, which was designed for it from the ground up; I found it to be a huge pain in ADLA. It’s strongly typed, even on nulls – lots of coding for simple things that work easily in Spark.
3. Ad-hoc query capability – it’s just quicker, smoother, and nicer to look at in Spark.
4. Transactions and performance – Databricks has some pretty impressive black magic going on in Delta with snapshot transaction isolation, allowing quite an impressive lambda architecture in the same physical tables. Also performance enhancements – they’ve been at it for a while!
5. Spark supports arrays and complex types natively, and they can be inferred directly from JSON with very little typing. Handling JSON in Spark is pleasurable, and because the table types natively support equivalent structures, fully shredding the data isn’t necessary – which can be useful for multi-value attributes in a dimension, for example (see the small sketch after this list). ADLA support for JSON was pretty poor.
6. Full Parquet – ADLA brought in Parquet, but again, using it was a pain and it doesn’t have complex types.
7. The built-in function library in Spark is massive – there isn’t much you can’t do. The custom API flexibility is huge… it’s a bit of a wrestle in ADLA.
8. Spark has functional language support (Scala) as well as OO. It also supports Python and R, with some performance hits. ADLA has Python, but it’s not as good, and there is no functional language.
9. All the database integration sinks and driver capabilities in Spark are quite comprehensive – ADLA tables are isolated… you can’t connect to them with anything!
10. Spark is a one-stop shop for streaming and batch – the transformation code (DataFrames) is identical, or “unified” as they call it.
11. ADLA – I did like the job visualisation and heat map. That was a nice feature, and it was cheap. But I had so many failures on that platform that maybe it’s not as cheap in the long run (pay per job).
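As a small illustration of points 2 and 5 (schema inference and native complex types), here is a minimal PySpark sketch – the path and field names are invented for the example:

```python
# Minimal illustration of Spark's schema inference and native complex types.
# The path and field names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-inference-demo").getOrCreate()

# Spark infers the full schema -- including nested structs and arrays --
# directly from the JSON files, with no up-front type declarations.
orders = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/orders/")
orders.printSchema()

# Nested fields and arrays are first-class: no need to fully shred the data.
(orders
 .select("order_id",
         F.col("customer.name").alias("customer_name"),  # nested struct field
         F.explode("line_items").alias("line_item"))      # array of structs
 .show(truncate=False))
```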
It may seem like I’m a Spark fanboy… but I wasn’t. I used to be very enthusiastic about ADLA because I’m familiar with C#. But the scars I came away with from ADLA still hurt; Databricks just blew me away with how easy things were in comparison and with the potential of what you can do with it… I’m fully converted and can’t get enough. If you can code, you can code… C#, Scala, Java… coding is coding… having a head for data, though, and for distributed coding, is where it all comes together.
I have found some issues with Gen2 and Databricks at the moment… but I’m still looking into it. That could be me, not Gen2.
PS I don’t work for Databricks! That was a genuine comparative study.
Great summary, thank you, James! I also find that doing transformations on-prem is sometimes worth considering, since that is compute capacity you already own and pay for. That contrasts with cloud resources, where you pay as you go for compute. It only works, of course, if the data fits the on-prem scale, and it only makes sense when you have already paid for the machine.
Good point, Brent! I see a lot of companies aggregating data on-prem in order to reduce the amount of data that is copied to the cloud.
Hi James,
Thanks for providing some much-appreciated clarity on the latest offerings in this area.
Hi,
In ADF dataflow, do you know if the source is able to read directly from gzip files?
Right now this is possible with the Copy Activity, so I’m hoping it will be available in Data Flow too.
Also, will the Data Flow source read all files in blob storage without having to create any looping logic, again like the Copy Activity?
thanks