Data Lakehouse defined
As a follow-up to my blog Data Lakehouse & Synapse, I wanted to talk about the various definitions I am seeing of what a data lakehouse is, including one from a recent paper by Databricks.
Databricks uses the term “Lakehouse” in their paper (see Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics), which argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse. Instead of the two-tier data lake + relational data warehouse model, you will just need a data lake, which is made possible by implementing data warehousing functionality over open data lake file formats.
While I agree there may be some use cases where technical designs allow Lakehouse systems to completely replace relational data warehouses, I believe those use cases are much more limited than this paper suggests.
It’s funny how, when Hadoop first came out, I heard many say that relational data warehouses were finished and that you should just use a data lake for everything. This was clearly a mistake (see Is the traditional data warehouse dead?). Now we’re here once again!
For simplicity I’ll break down a data lakehouse into two types of architectures: one-tier, which is just a data lake (using schema-on-read storage), which I’ll call NoEDW, and two-tier, which is a data lake plus a relational database (in the form of an enterprise data warehouse, or EDW), which I’ll call ProEDW. While Databricks touts NoEDW via Delta Lake and SQL Analytics, Microsoft touts ProEDW with Azure Synapse Analytics.
For NoEDW, my thought process is, if you are trying to make a data lake work like a relational database, why not just use a relational database (RDBMS)? Then have the data lake do what it is good at, and the RDBMS do what it is good at?
The extra cost, complexity, and time to value of incorporating a relational database into a data lakehouse is worth it for many reasons. One is that a relational database combines the metadata with the data, making self-service BI much easier than with a data lake, where the metadata is often separated from the data. This becomes even more apparent when you are dealing with data from many different sources.
I can certainly see some cases where you could be fine with a NoEDW: if you have a small amount of data, if your users are all data scientists (and hence have advanced technical skills), if you just want to build a POC, or if you want a quick win with a report/dashboard. Cost savings, another reason previously cited, matter much less now that the storage cost for Synapse has dropped around 80%, making it about the same cost as data lake storage.
But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources. Having the metadata along with the data in a relational database allows everyone to be on the same page as to what the data actually means, versus more of a wild west with a data lake. And a ProEDW gives you the additional benefits of speed, security, and features that I mentioned in Data Lakehouse & Synapse.
In addition, the NoEDW option requires using Delta Lake, which adds a layer of complexity and requires every tool that uses the data lake to support Delta Lake. Also note that Delta Lake does not support cross-table transactions, and that Databricks does not have a pay-per-query model like Synapse serverless does. The paper also does not discuss how master data management (MDM) fits in, which is almost always a relational database solution.
You definitely do need a data lake (see reasons). The bottom line is you can try to get by with just a NoEDW, but it is very likely that you will run into issues and will need to have some of the data in the data lake copied to a relational database.
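To make the NoEDW approach concrete, here is a minimal Spark SQL sketch of what using Delta Lake looks like (storage path, table, and column names are hypothetical): you create a Delta table over files in the data lake, then upsert changes with MERGE. Note that the MERGE is transactional only within this single table, illustrating the cross-table transaction limitation mentioned above.

```sql
-- Create a Delta table over files in the data lake (path is hypothetical)
CREATE TABLE sales
USING DELTA
LOCATION 'abfss://datalake@mystorage.dfs.core.windows.net/delta/sales';

-- Upsert changed rows; atomic for this one table only
-- (Delta Lake has no cross-table transactions)
MERGE INTO sales AS tgt
USING sales_updates AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED THEN
  UPDATE SET tgt.amount = src.amount
WHEN NOT MATCHED THEN
  INSERT (order_id, amount) VALUES (src.order_id, src.amount);
```

Every tool that reads this data now has to understand the Delta transaction log, not just plain Parquet files, which is the added complexity referred to above.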
Federated queries
What I’m seeing customers do is adopt a lakehouse architecture that goes beyond the data lake and the data warehouse. They do this by using federated queries to integrate data across the data lake, data warehouses, and any purpose-built data services being used.
Part of this architecture is making it easy to query data in multiple sources by building out a semantic layer using a distributed query engine like Presto or OPENROWSET in a serverless pool in Synapse. With Synapse you can do a federated query over ADLS Gen2, Spark Tables, and Cosmos DB, and eventually others such as Synapse dedicated pools, SQL Database and SQL Managed Instance. It uses T-SQL and is pay-per-query, so simply create SQL views over the data and they can be called from any tool that supports T-SQL, such as Power BI (see Synapse and federated queries).
A federated query gives you this ad-hoc view: if you want to see what you’ve got in multiple places, you can write a query and get a result right now. Then think of a SQL view as materializing that result on a continuous basis. Part of thinking about data movement is not just getting data from A to B; it’s also being able to query data wherever it might live and get a result back in one place.
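As a sketch of what this looks like in a Synapse serverless SQL pool (the storage account, container, and column names are hypothetical), OPENROWSET gives you the ad-hoc query, and wrapping it in a view makes it reusable from any T-SQL tool:

```sql
-- Ad-hoc federated query over Parquet files in ADLS Gen2,
-- run from a Synapse serverless SQL pool (pay-per-query)
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/datalake/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;

-- Wrap the same query in a view (run in its own batch) so any tool
-- that speaks T-SQL, such as Power BI, can query it by name
CREATE VIEW dbo.vwSales AS
SELECT *
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/datalake/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```

The view becomes the semantic layer: consumers query dbo.vwSales without knowing where the underlying files live.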
While using federated queries removes the ETL complexity and staleness, there are many cons, so make sure to read Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement. Querying distributed data falls into the Data Mesh category, a distributed data architecture, the subject of my next blog.
Unified analytics platform
I see Synapse as a great solution for a ProEDW, a unified analytics platform approach, where it incorporates a data lake, a relational data warehouse, Spark tables, and tools such as Azure Data Factory, Power BI, and soon Azure Purview, all under one roof called Azure Synapse Studio. It also supports the federated queries mentioned above via a serverless model.
And eventually Synapse will support multiple serverless pools that can query a relational database, and the ability to have multiple dedicated pools all access the same relational database, allowing for unlimited compute over a relational database just like you currently have with unlimited compute over a data lake.
Synapse also makes supporting machine learning (ML) easier via automated ML, both for data in a data lake (see Tutorial: Train a machine learning model without code, Tutorial: Sentiment analysis with Cognitive Services (Preview), and Tutorial: Anomaly detection with Cognitive Services (Preview)) and even for relational data (see Tutorial: Machine learning model scoring wizard (preview) for dedicated SQL pools).
All these options allow for many ways to transform and move data, so here is my list of those options, ordered from easiest to hardest to build and from fewest to most features:
Transforming data options:
- *Power BI Data flows
- Azure Data Factory Wrangling Data flows
- Azure Data Factory Mapping Data flows
- T-SQL in serverless or dedicated SQL pools
- Synapse Spark
- *Databricks
ETL options:
- *Power BI Data flows
- Azure Data Factory
- T-SQL in serverless or dedicated SQL pools (COPY INTO)
- T-SQL in serverless or dedicated SQL pools (CETAS/CTAS)
- Synapse Spark
- *Databricks
* = done outside Synapse
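To illustrate the two T-SQL ETL options in the list (table, storage, and external object names are hypothetical): COPY INTO loads data lake files into a dedicated SQL pool table, while CETAS transforms data with T-SQL and writes the result back to the lake as external files.

```sql
-- COPY INTO: load Parquet files from the data lake into a
-- dedicated SQL pool table (path is hypothetical)
COPY INTO dbo.Sales
FROM 'https://mystorage.blob.core.windows.net/datalake/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET');

-- CETAS: transform with T-SQL and write the result back to the lake
-- (assumes the external data source and file format were created earlier)
CREATE EXTERNAL TABLE dbo.SalesAggregated
WITH (
    LOCATION = '/curated/sales_agg/',
    DATA_SOURCE = MyDataLake,        -- pre-created external data source
    FILE_FORMAT = ParquetFileFormat  -- pre-created external file format
)
AS
SELECT Region, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY Region;
```

COPY INTO is the "lake to warehouse" direction; CETAS is the "warehouse (or serverless query) back to lake" direction, which is handy for persisting curated results.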
More info:
Data Alone Is Not Enough: The Evolution of Data Architectures
DATA LAKE ARCHITECTURE by Bill Inmon
Advancing Spark – In Response to Bill Inmon
Doing Power BI the Right Way: 4. Power Query in Dataflows or Power BI Desktop
Azure Mythbusters: I don’t need a Data Warehouse now that I have a Data Lake
Just wondering if you can respond or comment on this blog post by Paul Andrew?
https://mrpaulandrew.com/2021/01/29/is-azure-synapse-analytics-ready-for-production/
Paul seems to make some interesting points, especially around why someone would want to use a Data Lakehouse, e.g., to avoid copying data into Synapse.
Paul brings up some good points on areas where Synapse needs to improve, but I strongly disagree that Synapse is not ready for production. Just because it has a few areas where it is weak does not mean it should not be used. There are tens of thousands of companies using Synapse in production where the weak areas he lists are not used, not important, or at least not show stoppers, and the benefits far outweigh the few limitations.
As for not using a SQLDW, I see only a few use cases from customers where I don’t think a SQLDW is needed. At least 90% of customers do use a SQLDW, for the reasons listed in this blog and in Data Lakehouse & Synapse. Many times the extra cost and ETL of copying data to a SQLDW is well worth it.
Curious that there seems to be so little discussion about how primitive Synapse is in terms of pipelines, or about supporting what we already do today, where we are often quite happy where we are. Would we like to take advantage of new tech? Of course. That’s why we’re in this business. But…
Let’s talk about the very common scenario so many of us live in, shall we?
Our team supports over 10K SSIS packages. We cross databases with joins ALL the time. We load a lot of different sources using different methods and then feed data to many other downstream applications. The majority of sources are direct connections, such as scraping deltas from Dynamics. The flat files are delimited, such as CSV or XML, not Parquet. We generate a lot of packages using BIML. We have load and query performance challenges within SQL Server with fact tables up to 1 trillion rows, but it’s manageable. The thought of federating those queries is just sheer lunacy, not just from a performance standpoint but also from an integration viewpoint. Makes more sense to take cattle de-worming medication based on internet rumors…did I say that out loud?
So what in Synapse can someone who already has a sizable EDW utilize? In other words, as the proverbial US 1980s Wendy’s commercial asked, “where’s the beef?” We need the horsepower of the MPP in the dedicated SQL pool (no one at Microsoft can think of a better name, really?), but the idea that we could migrate thousands of SSIS packages into…into what…Synapse pipelines? Then I ask what you been smokin 😉 ADF-IR is likely our only realistic option for getting closer to a PaaS version of SSIS, but I question the cost versus benefit even for that, especially since Microsoft is swinging away from ADF now.
Bottom line is this: many of us have really significant investments in existing technology, and it may be working quite well. We want to prepare for the future and get fully PaaS, no argument there. But we need a reasonable roadmap that both delivers business value and takes on a reasonable level of effort (LOE).
I’d love to hear ideas on this. I’m leaning toward just doing prep work, such as getting to SQL Server 2019, and waiting for Synapse Gen 3 and support for multiple databases.