Azure Synapse and Delta Lake
Many companies see the value in collecting data to help them make better business decisions. When building a solution in Azure to collect that data, nearly everyone uses a data lake. A majority of those also use Delta Lake, which is essentially a storage layer on top of a data lake that adds extra features. I have yet to see anyone in Azure using competing technologies to Delta Lake, such as Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi and Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption).
The reason most are using Delta Lake is the following features it provides over just using a data lake (with support for the MERGE statement being the biggest one); a short Spark SQL sketch follows the list:
- ACID transactions
- Time travel (data versioning enables rollbacks and audit trails)
- Streaming and batch unification
- Schema enforcement
- Support for the DELETE, UPDATE, and MERGE commands
- Performance improvements
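To make the MERGE and time-travel features concrete, here is a minimal sketch in Spark SQL as you might run it on a Synapse Spark pool or Databricks (on an engine that supports Delta's time-travel SQL syntax); the table and column names are hypothetical:

```sql
-- Upsert staged changes into a Delta table as one ACID transaction
MERGE INTO silver.customers AS t
USING staged_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: query the table as it looked at an earlier version
SELECT * FROM silver.customers VERSION AS OF 0;
```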
Fortunately, most Azure products now support Delta Lake, such as:
- Mapping data flows in Azure Data Factory/Azure Synapse (see Transform data in delta lake using mapping data flows)
- Azure Synapse serverless SQL pool via OPENROWSET (see Query Delta Lake files using serverless SQL pool in Azure Synapse Analytics and How to query your Delta Lake with Azure Synapse SQL pool); a sample query follows this list. A serverless SQL pool can read Delta Lake files that are created using Apache Spark, Azure Databricks, or any other producer of the Delta Lake format. However, be aware of the limitations and known issues that you might see in delta lake support in serverless SQL pools
- Azure Synapse Spark pool
- Power BI: Reading Delta Lake tables natively in PowerBI
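As an example of the serverless OPENROWSET option above, a minimal query over a Delta folder might look like this (the storage account and path are hypothetical):

```sql
-- Serverless SQL pool: read the current version of a Delta table
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/lake/silver/customers/',
    FORMAT = 'DELTA'
) AS rows;
```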
However, some products or features do not support delta lake (at least not yet), so I wanted to make you aware of those:
- Copy activity in Azure Data Factory/Azure Synapse Pipelines, unless you use a Databricks cluster (see Copy data to and from Azure Databricks Delta Lake using Azure Data Factory or Azure Synapse Analytics)
- Azure Synapse dedicated SQL pool when using external tables and PolyBase (see Use external tables with Synapse SQL)
- Azure Synapse database templates
Serverless SQL pools do not support updating Delta Lake files. Use Azure Databricks or Apache Spark pools in Azure Synapse Analytics to update Delta Lake, as in the sketch below.
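For instance, DML like the following has to run from a Spark engine (a Synapse Spark pool or Databricks), since the serverless pool is read-only over Delta; the table name is hypothetical:

```sql
-- Spark SQL: update and delete rows in a Delta table in place
UPDATE silver.customers
SET country = 'United Kingdom'
WHERE country = 'UK';

DELETE FROM silver.customers
WHERE is_test = true;
```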
Within Power BI, there is a connector for Synapse (called “Azure Synapse Analytics SQL”) that can connect to an Azure Synapse serverless SQL pool, which can have a view that queries a delta table. However, you are limited to the compute offered by the serverless pool, and if that does not give you the performance you need, or if you want direct control over scaling, you might want to instead use the “Azure Databricks” connector, which will give you more compute (see Connecting Power BI to Azure Databricks). Note there is a new “Azure Synapse Analytics workspace (Beta)” connector in Power BI that can also query a delta table (see Supercharge BI insights with the new Azure Synapse Analytics workspace connector for Power Query and Azure Synapse Analytics workspace (Beta)), but that also uses serverless SQL pool compute, not Spark pool compute.
Note that an Azure Synapse serverless SQL pool can access data in a data lake, Delta Lake, and data in a Spark table, called a Lake database (but only if the Lake database is in Parquet or CSV format and NOT in Delta Lake format – see Azure Synapse Analytics shared metadata tables – Update 12/19/22: Delta Lake format is now in public preview). An Azure Synapse Spark pool can access data in a data lake, Delta Lake, and a Lake database (any format, including Delta Lake). So if you are using a Lake database that is built on the Delta Lake format, you would not be able to use an Azure Synapse serverless SQL pool to query it, only an Azure Synapse Spark pool (see the sketch below). Which also means that if you are using the “Azure Synapse Analytics workspace (Beta)” connector in Power BI, you won’t see Lake database tables built on the Delta Lake format available to connect to.
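To illustrate, a Lake database table created in Delta format from a Spark pool, as below, is queryable from Spark but (prior to the preview noted above) would not be surfaced to the serverless SQL pool through the shared metadata; the names are hypothetical:

```sql
-- Spark SQL on a Synapse Spark pool: Lake database table in Delta format
CREATE DATABASE IF NOT EXISTS lakedb;

CREATE TABLE lakedb.customers (
    customer_id INT,
    name STRING
)
USING DELTA;
```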
More info:
Exploring Delta Lake in Azure Synapse Analytics
HOW INTERCHANGEABLE ARE DELTA TABLES BETWEEN AZURE DATABRICKS AND AZURE SYNAPSE ANALYTICS?
With Delta Lake, Databricks sparks an open-source nerd war and customer confusion
James, great article and spot on explanation. Please don’t forget that Microsoft Azure Databricks has phenomenal Delta Lake support, together with the Photon engine enabling blazing fast queries on Delta Lake directly on ADLS. Azure Databricks is first party and integrates with the whole Azure ecosystem of ADF, AAD, PBI, ADLS, Synapse, etc. If anyone is interested in Delta Lake they should definitely check out Microsoft Azure Databricks.
Thanks for the feedback Ali! I was hoping I did not make any mistakes in my understanding of using Delta Lake 🙂
It’s spot on! You know your stuff and I always enjoy reading what you write. We were bummed when you left the “data world” for a while, but glad to have you back! 🙂
Unfortunately the performance of Databricks is much slower than that of Synapse.
Hi James,
Great blog as always. Thank you for the post.
The claim that Delta Lake ‘supports ACID transactions’ needs qualification.
Unlike ACID transaction processing in an RDBMS such as SQL Server, Delta Lake’s ACID support is scoped to a single Delta table. Executing DML transactions over more than one Delta table in a Delta Lake will not guarantee ACID integrity for the overall transaction.
Hence, Delta Lake is not necessarily a replacement for RDBMS Data Warehousing.
Great point, thanks Satinder!
Hi James,
I follow your blogs, webinars, YouTube videos, LinkedIn, etc. They have helped me a lot in the Big Data solution space. Thanks so much for your contribution to the Big Data community.
I’m planning to design a Big Data solution in Azure (for a large organization) which includes a Modern Data Warehouse as well as Advanced Analytics, and I’d like to have your comment on my solution architecture.
The summary of my solution architecture is as follows:
1. Bring all structured, unstructured, and streaming data into ADLS as staging (Bronze version)
2. From ADLS’s Bronze version, use an Azure Databricks notebook (executed from a Synapse pipeline) to clean/transform data and load it in Delta file/table format in the Delta Lake (Silver version) as the “Single Source of Truth”
3. From the Delta Lake (Silver version), use a Databricks notebook (executed from a Synapse pipeline) to a) load data into a Synapse dedicated SQL pool using the SQL endpoint, and b) prepare data (from Silver) for Advanced Analytics (AI and ML) and load it into the Delta Lake as the Gold version
4. In the Synapse dedicated SQL pool, use CTAS and finally load into a star schema (sketched below)
5. a) For BI reports, connect to the Synapse dedicated SQL pool using Power BI, and b) for Advanced Analytics reports, use the Delta Lake Gold version
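To make step 4 concrete, the CTAS pattern I have in mind looks roughly like this (table names and distribution choices are hypothetical):

```sql
-- Dedicated SQL pool: build a star-schema fact table with CTAS
CREATE TABLE dbo.FactSales
WITH (
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT s.OrderId, d.CustomerKey, s.OrderDate, s.Amount
FROM stg.Sales AS s
JOIN dbo.DimCustomer AS d
  ON d.CustomerId = s.CustomerId;
```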
Does the above architecture look good to you? Please feel free to advise me.
Your advice would be highly appreciated 🙂
Thank you
Hey James,
with regards to Power BI, you may also want to add the official Delta Lake connector for Power BI to your list, which allows you to read a Delta Lake table from any storage supported by Power BI without needing a Databricks cluster or Synapse serverless SQL pool in between!
https://delta.io/integrations/
https://github.com/delta-io/connectors/tree/master/powerbi
-gerhard
Done! Thanks Gerhard!
Very useful, thank you.
James and Gerhard, as the Power BI connector for Delta Lake is already available, I was wondering if we really need any Synapse or SQL layer on top of the Delta Lake Gold layer for BI consumption. Is there any performance aspect we know of when using Databricks vs. Synapse or any other SQL layer for Power BI consumption, or any other BI reporting consumption?
Hi Shakthi,
I am also having the same confusion: if Delta tables satisfy the BI needs, why do we need Azure Synapse?
In most projects they are using something like:
ADF -> ADB -> Delta Lake -> Synapse
My question is why we are loading data again into Synapse. What are the use cases that are not solved by the Delta engine?
Hi James,
great article! I always read your blog.
Just a note regarding the Copy activity and Delta Lake: only the sink side is not supported, because you can still use a Delta Lake as a source by means of a Synapse serverless SQL view.
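A sketch of such a view over the serverless pool’s OPENROWSET (storage path and names hypothetical):

```sql
-- Serverless SQL pool view that a Copy activity can then read as a source
CREATE VIEW dbo.vw_customers
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/lake/silver/customers/',
    FORMAT = 'DELTA'
) AS rows;
```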
Thank you so much
–silvano
Hi James,
Great article, I especially appreciate your outline of limitations with regard to Delta and related Synapse features.
I’d love to get your thoughts on this. We know we can put a T-SQL layer over Delta that will use the serverless SQL pool. However, let’s assume production performance requirements and features such as native row-level security/column masking are a necessity. This rules out the serverless SQL pool.
However, unless I’ve missed something, there seems to be a lack of Synapse options to bulk insert the lake Delta data into a dedicated pool table, which offers improved performance with no spin-up, and true RLS. An integration dataset cannot be created for a lakehouse Delta table. The only way I’ve been able to get lake Delta data into a dedicated SQL pool table is via a Spark notebook, using the Scala Synapse dedicated pool connector:
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export
Any thoughts?
Great article, James.
It gives a good overview of the Delta Lake support in Synapse.
Interesting that the only way to write to a Delta Lake is by spinning up a Spark cluster or Databricks cluster. Are you aware of any libraries or other functionality that can write a Delta file without spinning up a cluster?
It seems a bit of overkill to me to spin up a cluster simply to write/update a number of .parquet + .json files.
Thanks
– Lars
Good read as always, James! Here is one also for Synapse with Apache Hudi:
https://www.onehouse.ai/blog/apache-hudi-on-microsoft-azure
Hi James,
I’m hoping you can give me an answer to this question. I was told by the Product Manager last May that they were working on Delta Lake support for Lake databases. This would be very helpful as we craft our strategy. Do you know if this was completed? If so, can you point me to any documentation?
Thanks,
Gary
Hi Gary,
Are you referring to database templates? If so, that should be available soon.