Azure Synapse and Delta Lake
Many companies see the value in collecting data to help them make better business decisions. When building a solution in Azure to collect that data, nearly everyone uses a data lake. A majority of those also use Delta Lake, which is essentially a storage layer on top of a data lake that adds extra features. I have yet to see anyone in Azure using competing technologies to Delta Lake, such as Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi and Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption).
The reason most are using Delta Lake is the following features it provides over just using a data lake (with support for the MERGE statement being the biggest one); a short Spark SQL sketch follows the list:
- ACID transactions
- Time travel (data versioning enables rollbacks and audit trails)
- Streaming and batch unification
- Schema enforcement
- Support for the DELETE, UPDATE, and MERGE commands
- Performance improvements
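To make the MERGE and time-travel features concrete, here is a minimal sketch in Spark SQL as you might run it on a Synapse Spark pool or Databricks (on an engine that supports Delta's time-travel SQL syntax); the table and column names are hypothetical:

```sql
-- Upsert staged changes into a Delta table as one ACID transaction
MERGE INTO silver.customers AS t
USING staged_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: query the table as it looked at an earlier version
SELECT * FROM silver.customers VERSION AS OF 0;
```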
Fortunately, most Azure products now support Delta Lake, such as:
- Mapping data flows in Azure Data Factory/Azure Synapse (see Transform data in delta lake using mapping data flows)
- Azure Synapse serverless SQL pool via OPENROWSET (see Query Delta Lake files using serverless SQL pool in Azure Synapse Analytics and How to query your Delta Lake with Azure Synapse SQL pool); a sample query follows this list. A serverless SQL pool can read Delta Lake files that are created using Apache Spark, Azure Databricks, or any other producer of the Delta Lake format. However, be aware of the limitations and known issues that you might see in delta lake support in serverless SQL pools
- Azure Synapse Spark pool
- Power BI: Reading Delta Lake tables natively in PowerBI
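As an example of the serverless OPENROWSET option above, a minimal query over a Delta folder might look like this (the storage account and path are hypothetical):

```sql
-- Serverless SQL pool: read the current version of a Delta table
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/lake/silver/customers/',
    FORMAT = 'DELTA'
) AS rows;
```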
However, some products or features do not support delta lake (at least not yet), so I wanted to make you aware of those:
- Copy activity in Azure Data Factory/Azure Synapse Pipelines, unless you use a Databricks cluster (see Copy data to and from Azure Databricks Delta Lake using Azure Data Factory or Azure Synapse Analytics)
- Azure Synapse dedicated SQL pool when using external tables and PolyBase (see Use external tables with Synapse SQL)
- Azure Synapse database templates
Serverless SQL pools do not support updating Delta Lake files. Use Azure Databricks or Apache Spark pools in Azure Synapse Analytics to update Delta Lake, as in the sketch below.
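For instance, DML like the following has to run from a Spark engine (a Synapse Spark pool or Databricks), since the serverless pool is read-only over Delta; the table name is hypothetical:

```sql
-- Spark SQL: update and delete rows in a Delta table in place
UPDATE silver.customers
SET country = 'United Kingdom'
WHERE country = 'UK';

DELETE FROM silver.customers
WHERE is_test = true;
```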
Within Power BI, there is a connector for Synapse (called “Azure Synapse Analytics SQL”) that can connect to an Azure Synapse serverless SQL pool, which can have a view that queries a delta table. However, you are limited to the compute offered by the serverless pool, and if that does not give you the performance you need, or if you want direct control over scaling, you might want to instead use the “Azure Databricks” connector, which will give you more compute (see Connecting Power BI to Azure Databricks). Note there is a new “Azure Synapse Analytics workspace (Beta)” connector in Power BI that can also query a delta table (see Supercharge BI insights with the new Azure Synapse Analytics workspace connector for Power Query and Azure Synapse Analytics workspace (Beta)), but that also uses serverless SQL pool compute, not Spark pool compute.
Note that an Azure Synapse serverless SQL pool can access data in a data lake, Delta Lake, and data in a Spark table, called a Lake database (but only if the Lake database is in Parquet or CSV format and NOT in Delta Lake format – see Azure Synapse Analytics shared metadata tables – Update 12/19/22: Delta Lake format is now in public preview). An Azure Synapse Spark pool can access data in a data lake, Delta Lake, and a Lake database (any format, including Delta Lake). So if you are using a Lake database that is built on the Delta Lake format, you would not be able to use an Azure Synapse serverless SQL pool to query it, only an Azure Synapse Spark pool (see the sketch below). Which also means that if you are using the “Azure Synapse Analytics workspace (Beta)” connector in Power BI, you won’t see Lake database tables built on the Delta Lake format available to connect to.
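To illustrate, a Lake database table created in Delta format from a Spark pool, as below, is queryable from Spark but (prior to the preview noted above) would not be surfaced to the serverless SQL pool through the shared metadata; the names are hypothetical:

```sql
-- Spark SQL on a Synapse Spark pool: Lake database table in Delta format
CREATE DATABASE IF NOT EXISTS lakedb;

CREATE TABLE lakedb.customers (
    customer_id INT,
    name STRING
)
USING DELTA;
```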
More info:
Exploring Delta Lake in Azure Synapse Analytics
HOW INTERCHANGEABLE ARE DELTA TABLES BETWEEN AZURE DATABRICKS AND AZURE SYNAPSE ANALYTICS?
With Delta Lake, Databricks sparks an open-source nerd war and customer confusion
James, great article and spot on explanation. Please don’t forget that Microsoft Azure Databricks has phenomenal Delta Lake support, together with the Photon engine enabling blazing fast queries on Delta Lake directly on ADLS. Azure Databricks is first party and integrates with the whole Azure ecosystem of ADF, AAD, PBI, ADLS, Synapse, etc. If anyone is interested in Delta Lake they should definitely check out Microsoft Azure Databricks.
Thanks for the feedback Ali! I was hoping I did not make any mistakes in my understanding of using Delta Lake 🙂
It’s spot on! You know your stuff and I always enjoy reading what you write. We were bummed when you left the “data world” for a while, but glad to have you back! 🙂
Unfortunately the performance of Databricks is much slower than that of Synapse.
Hi James,
Great blog as always. Thank you for the post.
The claim that Delta Lake ‘supports ACID transactions’ needs qualification.
Unlike ACID transaction processing in an RDBMS such as SQL Server, Delta Lake’s ACID support is scoped to a single Delta table. Executing DML transactions over more than one Delta table in a Delta Lake will not guarantee ACID integrity for the overall transaction.
Hence, Delta Lake is not necessarily a replacement for RDBMS Data Warehousing.
Great point, thanks Satinder!
Hi James,
I follow your blogs, webinars, YouTube videos, LinkedIn, etc. They have helped me a lot in the Big Data solution space. Thanks so much for your contribution to the Big Data community.
I’m planning to design a Big Data solution in Azure (for a large organization) which includes a Modern Data Warehouse as well as Advanced Analytics, and I’d like to have your comment on my solution architecture.
The summary of my solution architecture is as follows:
1. Bring all structured, unstructured, and streaming data into ADLS as staging (Bronze version)
2. From ADLS’s Bronze version, use an Azure Databricks notebook (executed from a Synapse pipeline) to clean/transform data and load it in Delta file/table format in the Delta Lake (Silver version) as the “Single Source of Truth”
3. From the Delta Lake (Silver version), use a Databricks notebook (executed from a Synapse pipeline) to a) load data into a Synapse dedicated SQL pool using the SQL endpoint, and b) prepare data (from Silver) for Advanced Analytics (AI and ML) and load it into the Delta Lake as the Gold version
4. In the Synapse dedicated SQL pool, use CTAS and finally load into a star schema (sketched below)
5. a) For BI reports, connect to the Synapse dedicated SQL pool using Power BI, and b) for Advanced Analytics reports, use the Delta Lake Gold version
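To make step 4 concrete, the CTAS pattern I have in mind looks roughly like this (table names and distribution choices are hypothetical):

```sql
-- Dedicated SQL pool: build a star-schema fact table with CTAS
CREATE TABLE dbo.FactSales
WITH (
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT s.OrderId, d.CustomerKey, s.OrderDate, s.Amount
FROM stg.Sales AS s
JOIN dbo.DimCustomer AS d
  ON d.CustomerId = s.CustomerId;
```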
Does the above architecture look good to you? Please feel free to advise me.
Your advice would be highly appreciated 🙂
Thank you
Hey James,
with regards to Power BI, you may also want to add the official Delta Lake connector for Power BI to your list, which allows you to read a Delta Lake table from any storage supported by Power BI without needing a Databricks cluster or Synapse serverless SQL pool in between!
https://delta.io/integrations/
https://github.com/delta-io/connectors/tree/master/powerbi
-gerhard
Done! Thanks Gerhard!
Very useful, thank you.
James and Gerhard, as the Power BI connector for Delta Lake is already available, I was wondering if we really need any Synapse or SQL layer on top of the Delta Lake Gold layer for BI consumption. Is there any performance aspect we know of when using Databricks vs. Synapse or any other SQL layer for Power BI consumption, or any other BI reporting consumption?
Hi Shakthi,
I am also having the same confusion: if Delta tables satisfy the BI needs, why do we need Azure Synapse?
In most projects they are using something like:
ADF -> ADB -> Delta Lake -> Synapse
My question is why we are loading data again into Synapse. What are the use cases that are not solved by the Delta engine?
Hi James,
great article! I always read your blog.
Just a note regarding the Copy activity and Delta Lake: only the sink side is not supported, because you can still use a Delta Lake as a source by means of a Synapse serverless SQL view.
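A sketch of such a view over the serverless pool’s OPENROWSET (storage path and names hypothetical):

```sql
-- Serverless SQL pool view that a Copy activity can then read as a source
CREATE VIEW dbo.vw_customers
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/lake/silver/customers/',
    FORMAT = 'DELTA'
) AS rows;
```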
Thank you so much
–silvano
Hi James,
Great article, I especially appreciate your outline of limitations with regard to Delta and related Synapse features.
I’d love to get your thoughts on this. We know we can put a T-SQL layer over Delta that will use the serverless SQL pool. However, let’s assume production performance requirements and features such as native row-level security/column masking are a necessity. This rules out the serverless SQL pool.
However, unless I’ve missed something, there seems to be a lack of Synapse options to bulk insert the lake Delta data into a dedicated pool table, which offers improved performance with no spin-up, and true RLS. An integration dataset cannot be created for a lakehouse Delta table. The only way I’ve been able to get lake Delta data into a dedicated SQL pool table is via a Spark notebook, using the Scala Synapse dedicated pool connector:
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export
Any thoughts?
Great article, James.
It gives a good overview of the Delta Lake support in Synapse.
Interesting that the only way to write to a Delta Lake is by spinning up a Spark cluster or Databricks cluster. Are you aware of any libraries or other functionality that can write a Delta file without spinning up a cluster?
It seems a bit of overkill to me to spin up a cluster simply to write/update a number of .parquet + .json files.
Thanks
– Lars
Good read as always, James! Here is one also for Synapse with Apache Hudi:
https://www.onehouse.ai/blog/apache-hudi-on-microsoft-azure
Hi James,
I’m hoping you can give me an answer to this question. I was told by the Product Manager last May that they were working on Delta Lake support for Lake databases. This would be very helpful as we craft our strategy. Do you know if this was completed? If so, can you point me to any documentation?
Thanks,
Gary
Hi Gary,
Are you referring to database templates? If so, that should be available soon.