Azure Data Factory and SSIS compared
I see a lot of confusion when it comes to Azure Data Factory (ADF) and how it compares to SSIS. It is not simply “SSIS in the cloud”. See What is Azure Data Factory? for an overview of ADF, and I’ll assume you know SSIS. So how are they different?
SSIS is an Extract-Transform-Load (ETL) tool, but ADF is an Extract-Load tool: it does not do any transformations itself. Instead, ADF calls something else to do them, for example a stored procedure on a SQL Server, a Hive job, or a U-SQL job in Azure Data Lake Analytics. Think of it more as an orchestration tool. SSIS has the added benefit of doing transformations, but keep in mind that the performance of any transformation depends on the power of the server that SSIS is installed on, as the data to be transformed will be pushed to that SSIS server. Other major differences:
- ADF is a cloud-based service (authored via the ADF editor in the Azure portal) and, since it is a PaaS tool, requires no hardware or installation. SSIS is a desktop tool (via SSDT) and requires a good-sized server that you have to manage, with SQL Server and SSIS installed on it
- ADF uses JSON scripts for its orchestration (coding), while SSIS uses drag-and-drop tasks (no coding)
- ADF is pay-as-you-go via an Azure subscription, while SSIS is licensed as part of SQL Server
- ADF can fire up HDInsight clusters and run Pig and Hive scripts. SSIS can do this too, via the Azure Feature Pack for Integration Services (SSIS)
- SSIS has a powerful GUI, IntelliSense, and debugging. ADF has a basic editor and no IntelliSense or debugging
- SSIS is administered via SSMS, while ADF is administered via the Azure portal
- SSIS has a wider range of supported data sources and destinations
- SSIS has a programming SDK, automation via BIML, and third-party components. ADF has no programming SDK and no third-party components, but it can be automated via PowerShell
- SSIS has error handling. ADF does not
- ADF has “data lineage”, tagging and tracking the data from different sources. SSIS does not have this
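To make the ETL versus Extract-Load distinction described above concrete, here is a minimal sketch. It is not ADF or SSIS code: Python's built-in `sqlite3` stands in for the target SQL Server, and the table names, sample rows, and function names are all made up for illustration. The first function transforms the rows in the tool's own process before loading (the SSIS approach); the second loads the raw rows and then asks the target engine to do the transformation (what ADF does when it calls a stored procedure).

```python
import sqlite3

# Hypothetical source rows: (customer name, amount as text).
SOURCE_ROWS = [("alice", "10.50"), ("bob", "3.25")]


def etl_style(conn, source_rows):
    """SSIS-like: transform in the tool's own process, then load the result."""
    # The transformation runs here, so it uses this machine's CPU and memory.
    transformed = [(name.upper(), float(amount)) for name, amount in source_rows]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", transformed)


def extract_load_then_transform(conn, source_rows):
    """ADF-like: load the raw rows, then have the target engine transform them."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", source_rows)
    # Stand-in for "call a stored procedure": the database does the work.
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT UPPER(name) AS name, CAST(amount AS REAL) AS amount "
        "FROM sales_raw"
    )


conn = sqlite3.connect(":memory:")
etl_style(conn, SOURCE_ROWS)
extract_load_then_transform(conn, SOURCE_ROWS)

etl_result = conn.execute(
    "SELECT name, amount FROM sales_etl ORDER BY name"
).fetchall()
elt_result = conn.execute(
    "SELECT name, amount FROM sales_elt ORDER BY name"
).fetchall()
print(etl_result == elt_result)
```

Both routes end with the same data. What differs is where the transformation runs, and so which machine needs the horsepower.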
Think of ADF as a complementary service to SSIS, with its main use case confined to inexpensively dealing with big data in the cloud.
Note that moving to the cloud requires you to think differently when it comes to loading a large amount of data, especially when using a product like SQL Data Warehouse (see Azure SQL Data Warehouse loading patterns and strategies).
James, ADF might not be as inexpensive as it’s sold. ADF is priced per activity. An activity can move data from only one source table (dataset) to one destination table (dataset). A high-frequency activity (one that executes more than once a day) will cost you ONLY $1 a month. That looks cheap, right? Well, for most large companies, implementing ADF across the enterprise would result in thousands if not tens of thousands of activities. Guess what, that $1 just ballooned into more than $10,000/month for ADF alone. And in addition to ADF, you are still going to pay for the Apache Spark/SQL DW/ADLA compute resources where the data is transformed and further analyzed. Questions linger on whether it’s worth taking the ADF route.
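A back-of-the-envelope version of the arithmetic in the comment above, as a sketch only. The $1 per activity per month figure is the commenter's number, not verified pricing, and the function name and activity counts are made up for illustration.

```python
def monthly_adf_activity_cost(activity_count, price_per_activity=1.00):
    """Estimate monthly ADF cost under a flat per-activity price.

    price_per_activity defaults to the commenter's $1/month figure for
    high-frequency activities; it is an assumption, not official pricing.
    Compute costs (Spark, SQL DW, ADLA) are billed separately and excluded.
    """
    return activity_count * price_per_activity


# Hypothetical activity counts, from a single pipeline to enterprise scale.
for count in (1, 1_000, 10_000):
    print(f"{count:>6} activities: ${monthly_adf_activity_cost(count):,.2f}/month")
```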
Right. ETL without the T 🙂
Great post, James. I just wanted to note that you can use Biml to automate ADF as well. We are doing it on my current project, copying data from on-prem SQL Server to ADLS, and it has worked well.
I think it is worth pointing out that ETL is in fact:
Extract – Transform – Load
But you also get ELT tools (e.g. Oracle Data Integrator), where the data is extracted from the source, loaded into the target, and then transformed.
As such, I think what you are saying is that SSIS is an ETL tool whereas ADF is an ELT tool, amongst other differences.