Using a data lakehouse
As I mentioned in my Data Mesh, Data Fabric, Data Lakehouse presentation, the data lakehouse architecture, where you use a data lake with Delta Lake as a software layer on top and skip using a relational data warehouse, is becoming more and more popular. For some customers, I recommend “Use a data lake until you can’t”. What I mean by this is to take the following steps when building a new data architecture in Azure with Azure Synapse Analytics:
- Create a data lake in Azure and start by just using that – no Delta Lake or Synapse dedicated SQL pool. Set it up using best practices – see Best practices for using Azure Data Lake Storage Gen2 and The best practices for organizing Synapse workspaces and lakehouses. This may be all you need if your data size is on the smaller end and you don’t have complicated queries. Just be aware of the tradeoffs of not using a dedicated SQL pool (see Data Lakehouse & Synapse). You will be using a Synapse serverless SQL pool to access the data in the data lake, so be sure to read Best practices for serverless SQL pool in Azure Synapse Analytics. Also, learn ways to adjust the folder structure in your data lake to improve performance – see Synapse Serverless SQL Pool – Performance and cost optimization with partitioning (a sample partition-pruning query appears after this list). You might want to do a few POCs to see if you can get by without Delta Lake and/or a Synapse dedicated SQL pool.
- If you find performance is not satisfactory or you need some of the features of a delta lake (see Azure Synapse and Delta Lake), then update all your code to write the data into your data lake in Delta Lake format (for example, see Transform data in delta lake using mapping data flows; a Spark sketch also appears after this list). Delta Lake gives you many benefits, but it comes with the tradeoff of more cost and complexity. In my experience, the majority of customers are using Delta Lake.
- If you are using Power BI for reporting against data in the data lake, there are a number of features you can turn to if you are not getting the performance you need. Using Import mode is usually the best option for fast performance, but there are other options – see Power BI Performance Features.
- If your performance is still suffering, then it’s time to start using a Synapse dedicated SQL pool. Just realize you only need to copy the data that is causing problems into the dedicated SQL pool, not all the data in the data lake (a sample COPY statement appears after this list). Be aware that dedicated SQL pools have a lot of knobs to turn to improve performance (see Best practices for dedicated SQL pools in Azure Synapse Analytics).
- One option to get better performance with a dedicated SQL pool is to duplicate the data from one SQL pool to another that has different indexes on it, so queries and reports go against whichever pool gives the best performance. Of course, the tradeoff is the cost and complexity of keeping two pools updated.
- Another option for further performance is to convert your data to a star schema, which can be built in the Synapse dedicated SQL pool or as a dataset in Power BI (see Power BI and a Star Schema), or you could even land the star schema in your data lake (a sketch appears after this list).
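To make the serverless step concrete, here is a minimal sketch of a partition-pruning query against a Synapse serverless SQL pool, issued from Python via pyodbc. The workspace endpoint, database, storage account, folder layout, and column names are all placeholders I made up for illustration; swap in your own.

```python
# A minimal sketch, assuming the lake is organized as .../sales/year=YYYY/*.parquet.
# The connection string values below are placeholders, not real endpoints.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"  # serverless endpoint (placeholder)
    "Database=mylakedb;"                                  # a database in the serverless pool
    "Authentication=ActiveDirectoryInteractive;"
    "UID=user@contoso.com;"
)

# filepath(1) returns whatever matched the first wildcard (year=*), so the
# WHERE clause prunes folders instead of scanning the whole lake.
query = """
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/year=*/*.parquet',
        FORMAT = 'PARQUET'
     ) AS rows
WHERE rows.filepath(1) = '2023';
"""

for row in conn.cursor().execute(query):
    print(row.row_count)
```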
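For the Delta Lake step, here is a minimal PySpark sketch of landing data in the lake in Delta format, partitioned by a column that queries commonly filter on. The paths and the partition column are assumptions for illustration; in practice you could also do this with mapping data flows as noted above.

```python
# A minimal sketch run from a Synapse Spark pool (or any Spark setup with delta-spark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw files as they arrive (Parquet here; could be CSV/JSON).
raw = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/")

# Write to the curated zone in Delta format, partitioned so that both Spark
# and serverless SQL can prune folders at query time.
(raw.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("sale_year")
    .save("abfss://curated@mydatalake.dfs.core.windows.net/sales_delta/"))
```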
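When it is time to move just the problem data into a dedicated SQL pool, one way is the T-SQL COPY statement. Below is a rough sketch issued from Python via pyodbc; the server name, database, target table, and file path are placeholders, the target table is assumed to already exist with a matching schema, and the source data is assumed to be Parquet.

```python
# A minimal sketch, assuming dbo.FactSales already exists in the dedicated pool.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;"  # dedicated pool endpoint (placeholder)
    "Database=mydedicatedpool;"
    "Authentication=ActiveDirectoryInteractive;"
    "UID=user@contoso.com;",
    autocommit=True,
)

copy_stmt = """
COPY INTO dbo.FactSales
FROM 'https://mydatalake.dfs.core.windows.net/curated/sales/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
"""
conn.cursor().execute(copy_stmt)
```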
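Finally, a sketch of shaping a star schema in Spark and landing it back in the data lake in Delta format; the same tables could instead be loaded into a dedicated SQL pool or modeled as a Power BI dataset. The column names and the surrogate-key approach are assumptions for illustration, not a prescription.

```python
# A minimal sketch: split a wide sales table into one dimension and one fact table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.read.format("delta").load(
    "abfss://curated@mydatalake.dfs.core.windows.net/sales_delta/")

# Dimension: one row per customer, with a simple surrogate key.
dim_customer = (sales
    .select("customer_id", "customer_name", "customer_region")
    .dropDuplicates(["customer_id"])
    .withColumn("customer_key", F.monotonically_increasing_id()))

# Fact: measures plus the foreign key to the dimension.
fact_sales = (sales
    .join(dim_customer.select("customer_id", "customer_key"), "customer_id")
    .select("customer_key", "sale_date", "quantity", "amount"))

base = "abfss://curated@mydatalake.dfs.core.windows.net/star/"
dim_customer.write.format("delta").mode("overwrite").save(base + "dim_customer/")
fact_sales.write.format("delta").mode("overwrite").save(base + "fact_sales/")
```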
A question I often get in data lakehouse discussions is: what is the difference between importing data directly from the data lake into Power BI versus going through Synapse serverless? There are two blog posts on this by Chris Webb from Microsoft that I refer customers to:
- Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless
- Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless, Part 2: Transformations
I’ll be discussing the data lakehouse architecture in detail (as well as other architectures such as data mesh and data fabric) in my upcoming book, which will be published by O’Reilly. I’m at least halfway through writing the draft and hope to have a couple of chapters available soon that you can view via the O’Reilly early release program, with a couple more chapters added every few weeks. The early release program is a great way to start reading the chapters well before the entire book is officially published. I’ll post on my blog when it becomes available.