Data Lakehouse & Synapse
I am starting to see this relatively new phrase, “Data Lakehouse”, being used in the data platform world. It’s the combination of “Data Lake” and “Data Warehouse”. In this post I’ll give my thoughts on it, and how the next version of Azure Synapse Analytics that is in public preview fits right in with the Data Lakehouse.
First, I want to clear up a bit of confusion regarding Azure Synapse Analytics. If you log in to the Azure portal and do a search, you will see:
Notice that Azure SQL Data Warehouse (SQL DW) is now called Azure Synapse Analytics. This product has been generally available (GA) for a while; the change is simply a renaming. It contains new and soon-to-be-released features that I blogged about here, but these are separate from the features in the public preview version, which I will call “Synapse PP” (see the new docs). I’ll call the current GA version “Synapse GA”, which I blogged about here.
I like to think of a major feature of a Data Lakehouse as the ability to use T-SQL to query data the same way no matter whether the data is in a data lake (storage such as ADLS Gen2) or a data warehouse (a relational database such as Synapse GA).
Here are some of the major new Synapse PP features that support a Data Lakehouse:
- Azure Synapse Studio
- Collaborative workspaces
- Apache Spark
- On-demand T-SQL, see my blog on this at SQL on-demand in Azure Synapse Analytics
- ADF integration
- Power BI integration
- Distributed T-SQL (over ADLS Gen2)
There are some great benefits to having a data lake that I blogged about here. And there are many reasons to also have a relational data warehouse along with your data lake that I blogged about here. Up until now, the data lake and the relational data warehouse were each sort of on their own island, and you had to jump to a different product to interface with each. For example, your data lake could be in ADLS Gen2 and you could use Azure Databricks to query it with SparkSQL (first setting up a connection to ADLS Gen2), while your relational data warehouse could be in Synapse GA and you would use SSMS to query it with T-SQL. Having to use Azure Data Factory or Power BI means opening another tab in your browser, logging into those products, and connecting to ADLS Gen2. So it can be time consuming and costly to create an environment to query both, and also confusing to query using two different versions of SQL.
You can now get the best of both worlds all under one roof via Azure Synapse Studio: using the relational data warehouse when you need fast query performance, high user concurrency, enhanced security, or just prefer working in a relational environment; or using the data lake when you have huge data volumes, need instant access to the data via schema-on-read, or need to deal with semi-structured data. You never have to leave your workspace no matter which tool you want to use, and you can use T-SQL whether the data is in the data lake or in the relational data warehouse. And querying a file in the data lake is simply a matter of right-clicking the file (more on this in my next blog).
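To give a feel for that right-click experience, SQL on-demand generates a T-SQL query over the file using OPENROWSET. Here is a minimal sketch; the storage account, container, and folder path are hypothetical placeholders:

```sql
-- Query parquet files in ADLS Gen2 directly with SQL on-demand.
-- The storage URL below is a hypothetical placeholder.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/2020/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```

No cluster needs to be provisioned first; you pay only for the data processed while the query runs.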
So at a high level you can view Azure Synapse Studio as supporting business intelligence, data science, T-SQL plus other languages (Python, Scala, C#, SparkSQL), schema flexibility, ACID compliance, any data type, petabytes of data, streaming and batch, with high performance and user concurrency.
So as you can see, Synapse PP combines the use of a data lake and a relational database to make it a Data Lakehouse. It is not really a “new” technology, but just a merging of existing technologies to make it easier to gain insights from data in order to make better business decisions.
With these new features, especially SQL on-demand and T-SQL against ADLS, I am starting to see a few use cases where you may not need a relational database anymore in your modern data warehouse, which goes against my previous way of thinking (see Is the traditional data warehouse dead?). I still feel you will want to have a relational database in your modern data warehouse architecture a large majority of the time, but there will be exceptions.
For example, I can land a bunch of parquet files into the data lake and create a T-SQL view on top of that data, where the view is stored in SQL on-demand. Then I call that view from Power BI, making it appear to the end-user that they are using a relational database, while paying only when the query is run. That leads to compute cost savings, as well as not having to copy the data from the data lake to a relational database. This means less time is needed to build a solution, a reduction in the complexity of the solution, and therefore additional cost savings. Below are some of the concerns that arise from skipping the relational database and just using a data lake, along with the new data/delta lake features that “counter” some of those concerns:
- Speed: Queries against relational storage will always be faster than against a data lake (roughly 5X) because of features the data lake lacks, such as statistics, query plans, result-set caching, materialized views, in-memory caching, SSD-based caches, indexes, and the ability to design and align data and tables. Counter: DirectParquet, CSV 2.0, query acceleration, predicate pushdown, and SQL on-demand auto-scaling are some of the features that can make queries against ADLS nearly as fast as against a relational database. Then there are features like Delta Lake and the ability to use statistics for external tables that can add even more performance. Plus you can also import the data into Power BI, use Power BI aggregation tables, or import the data into Azure Analysis Services to get even faster performance. Another thing to keep in mind affecting query performance is that Synapse is a massively parallel processing (MPP) technology with features such as replicated tables for smaller tables (i.e. dimension tables) and distributed tables for large tables (i.e. fact tables), with the ability to control how they are distributed across storage (hash, round-robin). This can provide much greater performance compared to a data lake that uses HDFS, where large files are chunked across the storage.
- Security: Row-level security (RLS), column-level security, dynamic data masking, and data discovery & classification are security-related features that are not available in a data lake. Counter: Use RLS in Power BI, or RLS on external tables instead of RLS on a database table, which then allows you to use result-set caching in Synapse.
- Complexity: Schema-on-read (ADLS) is more complex to query than schema-on-write (a relational database). Schema-on-read means the end-user must define the metadata, whereas with schema-on-write the metadata is stored along with the data. Then there is the difficulty of querying in a file-based world compared to a relational database world. Counter: Create a SQL relational view on top of files in the data lake so the end-user does not have to create the metadata, which will make it appear to the end-user that the data is in a relational database. Or you could import the data from the data lake into Power BI, creating a star schema model in a Power BI dataset. But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources. Having the metadata along with the data in a relational database allows everyone to be on the same page as to what the data actually means, versus more of a wild west with a data lake.
- Missing features: Auditing, referential integrity, ACID compliance, updating/deleting rows of data, data caching, Transparent Data Encryption (TDE), workload management, full support of T-SQL – all are not available in a data lake. Counter: some of these features can be accomplished with Delta Lake, Apache Hudi, or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi), but they will not be as easy to implement as in a relational database and you will be locked into using Spark. Also, features being added to Blob Storage (see More Azure Blob Storage enhancements) can be used instead of resorting to Delta Lake, such as blob versioning as a replacement for time travel in Delta Lake.
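The view-over-parquet pattern described above can be sketched in a couple of T-SQL statements against SQL on-demand (the database, view, and storage path names are hypothetical placeholders):

```sql
-- Create a logical view over parquet files in the data lake using SQL on-demand.
-- View name and storage URL are hypothetical placeholders; the view must live
-- in a user database, not master.
CREATE VIEW dbo.vwSales AS
SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [sales];
```

Power BI then connects to the SQL on-demand endpoint and queries dbo.vwSales as if it were an ordinary relational table, with compute billed only while queries run.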
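The replicated/distributed table design mentioned in the Speed point above can be sketched as follows in a Synapse dedicated SQL pool (table and column names are hypothetical):

```sql
-- Large fact table: hash-distribute rows across the MPP distributions
-- so joins and aggregations on CustomerId avoid data movement.
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(CustomerId));

-- Small dimension table: replicate a full copy to every compute node
-- so it can be joined locally.
CREATE TABLE dbo.DimCustomer
(
    CustomerId   INT,
    CustomerName NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE);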
Some additional reasons for using a relational database include helping to break up a large problem into smaller pieces (think of the relational database as an additional step in the data journey from the data lake to the end-user), and the need to handle slowly changing dimensions (SCD) (see Create Generic SCD Pattern in ADF Mapping Data Flows). Using time travel in Delta Lake or creating daily folders for each table in the data lake are options for handling SCD, but they are much more complex than in a relational database world (see Slowly changing data (SCD) Type 2 operation into Delta tables). I also see some customers whose source systems are relational databases and who, in the early stages of building a modern data warehouse in Azure, bypass the data lake and copy the relational data right into a relational data warehouse (especially if they have many existing SSIS packages from their on-prem data warehouse that they want to continue using for the short-term). They do this for quick wins and will eventually land the data in the data lake.
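To show why SCD handling in the lake is more complex, here is a simplified Spark SQL sketch of the SCD Type 2 expiry step against a Delta table (table and column names are hypothetical; a complete pattern, like the one in the linked article, also needs a second step to insert the new version of each changed row):

```sql
-- SCD Type 2 sketch against a Delta table: close out the current row
-- when an attribute changes. Names are hypothetical placeholders.
MERGE INTO dim_customer AS target
USING customer_updates AS source
  ON target.customer_id = source.customer_id
 AND target.is_current = true
WHEN MATCHED AND target.address <> source.address THEN
  UPDATE SET target.is_current = false,
             target.end_date   = current_date();
```

In a relational data warehouse the same logic is typically a well-worn pattern in an ETL tool or stored procedure; in the lake you also have to manage the Delta table layout and the insert-new-version pass yourself.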
A strong reason for skipping a relational database is that, for the short term, SQL on-demand is not able to access a relational database, so you can only use this feature against a data lake. So if you want to save costs by using SQL on-demand, it has to be against data in the data lake. This could be especially true if you have a “small” data warehouse.
There can even be a reason not to copy a large amount of data to a relational database because of the large cost savings in storage: a dedicated SQL pool costs at least $135 per TB per month (some regions cost more); for ADLS Gen2, the hot access tier is at least $19 per TB per month, the cool access tier is at least $10 per TB per month, and the archive access tier is at least $2 per TB per month (some regions cost more), with an additional cost per transaction (which varies greatly, so check the Azure calculator). But keep in mind the tradeoffs listed above. [UPDATE 12/4/20: Dedicated SQL pool costs have dropped 83% to $23 per TB per month!]
I’m still thinking through this, but these are my initial thoughts and it will be interesting to have discussions with customers on this topic when Synapse PP goes GA. Please comment below on your thoughts!
I’ll be presenting and doing a demo of the public preview of Azure Synapse Analytics at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/13 (session info), and Big Data Conference Europe on 11/25 (session info).
More info:
The Data Lakehouse – Dismantling the Hype
Data LakeHouse — Paradigm of the Decade
Eckerson Group Webinar -Shop Talk – the Data Lakehouse – April 17, 2020
All Hail, the Data Lakehouse! (If Built on a Modern Data Warehouse)
An Architect’s View of the Data Lakehouse: Perplexity and Perspective
Data Lakehouses Hold Water (thanks to the Cloud Data Lake)
The Rise of Data Lakehouse with Data Virtualization
The Future of Analytics: Leveraging Data Lakes and Data Warehouses
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
SQL on-demand with a logical warehouse on top of a data lake is great! Now when can we have it in GA?
Excellent article as always. I like the ‘counter’ part that shows how to remove hurdles.
Now the big question: it looks like SQL On-Demand supports CSV, Parquet, … but not yet ‘Delta Parquet’.
To me it’s a big showstopper because, as you mention, building a nice usable data lake will likely require going ‘delta’, but then it soon becomes impossible to do the queries.
Do you have more insight on this?
I am currently mulling this question. We implemented our first dimensional model in Azure SQL (fed from a SQL Server ERP – on-prem). We land delimited data into a data lake using ADF and then load it into Azure SQL. I need to do data masking, but I don’t see much talk of that with SQL on-demand.
How is it a Lakehouse if it is not providing ACID compliance, updating/deleting of rows, etc.? In that case it’s simply like Hadoop 10 years ago: Hive on an HDFS data lake, best effort, but without the warehousing advantages on the data lake.
Hi Michael…There are two ways to get ACID compliance and updating/deleting of rows: via the relational database part of the data lakehouse, or via Delta Lake.
Hi James,
I have been working on Azure Synapse (formerly SQL DW) on a business use case. In the last 3 months, most Microsoft sessions on Synapse have been centered around Azure Synapse Analytics using workspaces. I have also seen a situation where I am unable to add Azure Synapse Analytics (former SQL DW) into a workspace SQL pool; it seems like workspaces need their own SQL pools. Do we have any timelines for the GA of Azure Synapse Analytics (workspaces)?
Inside a Synapse workspace, you can restore a Synapse database (SQL DW) that was created outside the workspace. Sorry, I don’t know the GA timelines.
Hi James,
Do you know, or know where to find, a roadmap for Synapse PP?
I have some great use cases, but I would like to know an estimate of when it will go GA.
Hi Pepjin, the Synapse product team said they are hoping to GA this calendar year during a conference, but that is the only guidance I have heard.