Azure Synapse Analytics & Power BI performance
With two new relational database features (Result-set caching and Materialized Views) having just reached general availability in Azure Synapse Analytics (formerly called Azure SQL Data Warehouse), there are some very compelling reporting performance options when combined with Power BI. In this blog I’ll discuss the different ways you can make Power BI queries run faster, and whether you still need Azure Analysis Services or if the tabular model (i.e. “cube”) within Power BI is enough. Note there is a separate preview version of Azure Synapse Analytics that adds workspaces and new features such as Apache Spark, Azure Synapse Studio, serverless on-demand queries, and in which the relational database engine and relational storage are part of a “SQL Analytics” pool. Everything in this blog also applies to the SQL Analytics pool. To avoid confusion, in the rest of this blog I will use “SQL DW” to refer to both the current version of Azure Synapse Analytics and the SQL Analytics pool that is in preview.
First a review of options available within Power BI:
Import: The selected tables and columns from the data source (i.e. SQL DW) are imported into Power BI Desktop and into the computer’s memory. As you create or interact with a visualization, Power BI Desktop uses the imported data and never touches the data source (under the covers Power BI stores the data in an Analysis Services engine in-memory cache). You must refresh the data, which imports the full data set again (or use the Power BI Premium feature incremental refresh), to see any changes that occurred to the underlying data since the initial import or the most recent refresh (so it’s not real-time). Imported datasets in the Power BI service have a 10GB dataset limitation for the Power BI Premium version (with 400GB in preview, which is what Azure Analysis Services supports) and a 1GB limitation for the Power BI free version. Note data is heavily compressed when imported into memory, so you can import much larger datasets than these limits suggest. See Data sources in Power BI Desktop.
DirectQuery: No data is imported or copied into Power BI Desktop. Instead, as you create or interact with a visualization, Power BI Desktop queries the underlying data source (i.e. SQL DW), which means you’re always viewing the latest data in SQL DW (i.e. real-time). DirectQuery lets you build visualizations over very large datasets where it would otherwise be infeasible to first import and aggregate all of the data (although with support for 400GB datasets and with aggregation tables, the need to use DirectQuery just because the dataset won’t fit into memory goes away in many cases, and DirectQuery is needed only if real-time results are required). See Data sources supported by DirectQuery.
Composite model: Allows a single report to seamlessly combine data from one or more DirectQuery sources, and/or combine data from a mix of DirectQuery sources and imported data. So this means you can combine multiple DirectQuery sources with multiple Import sources.
Dual Storage Mode: Dual tables can act as either cached (imported) or not cached, depending on the context of the query that’s submitted to the Power BI dataset. In some cases, you fulfill queries from cached data. In other cases, you fulfill queries by executing an on-demand query (DirectQuery) to the data source.
Aggregations: Create an aggregated table (which will be in-memory if set to Import mode) from an underlying detail table (which is set to DirectQuery, meaning the detailed data is kept at the data source and not imported). If the user query asks for data that can be answered from the aggregated table, it will be answered from the in-memory table. Otherwise, Power BI will use a DirectQuery to the underlying detail table. You can create multiple aggregation tables with different summarizations off of one detail table. Think of aggregation tables as mini-cubes, or a performance optimization technique similar to the way you use indexes in SQL databases to help speed up SQL queries. Be aware it could take some time to create the aggregated table (it’s similar to processing a cube) and the data is not real-time (it’s as of the last refresh). See Aggregations in Power BI and Creative Aggs Part I : Introduction.
You may ask: are aggregation tables still needed now that Power BI supports large models (400GB) in Import mode? In most cases, yes. For one, large model support is only in Power BI Premium, and if you are not using that you are limited to 10GB models. Also, importing a large detail table imports all of the detailed records with no aggregations being done (if it’s a 26 billion row table, you are importing 26 billion rows into memory). So really large tables won’t be able to be imported even with a 400GB limit, whereas aggregation tables allow you to create aggregations, hence the name (a 26 billion row table may reduce to only 10 million aggregated rows, which easily fits in memory). So you need to replicate much less data into memory. And if a query wants an aggregation that it can’t get from the in-memory aggregation table, no problem: Power BI will send a DirectQuery to the detail table. In summary, aggregation tables can unlock massive datasets that would not otherwise fit in memory, save costs by letting you use a smaller SKU, and avoid having to manage replicating all that data into memory.
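To make this concrete, here is a minimal T-SQL sketch of the kind of query that could feed an Import-mode aggregation table in Power BI. The table and column names (dbo.FactStoreSales, StoreKey, SalesYear, SalesAmount) are hypothetical and purely illustrative; they are not from the Ignite demo or a specific Microsoft sample.

```sql
-- Hypothetical source query for a Power BI aggregation table: a multi-billion-row
-- detail table collapses into a few million pre-summarized rows.
SELECT
    StoreKey,
    SalesYear,
    SUM(SalesAmount) AS TotalSalesAmount,
    COUNT_BIG(*)     AS SalesRowCount
FROM dbo.FactStoreSales
GROUP BY StoreKey, SalesYear;
```

In the Power BI model, the result of a query like this would be imported as the aggregation table, dbo.FactStoreSales itself would stay in DirectQuery mode, and the aggregation mappings would let queries at the store/year grain be answered entirely from memory.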
Two new performance features in Azure Synapse Analytics that are a big deal:
- Result-set caching: Automatically caches query results in the user database for repetitive use. This allows subsequent query executions to get results directly from the persisted cache, so recomputation is not needed. Result-set caching improves query performance (down to milliseconds) and reduces compute resource usage. In addition, queries using cached result sets do not use any concurrency slots in Azure Synapse Analytics and thus do not count against existing concurrency limits.
- Materialized Views: A view that pre-computes, stores, and maintains its data in SQL DW just like a table. There’s no recomputation needed each time a materialized view is used. Queries that use all or a subset of the data in materialized views can get faster performance. Even better, queries can use a materialized view without making direct reference to it, so there’s no need to change application code (see the T-SQL sketch after this list).
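As a rough illustration of how these two features are set up, here is a minimal T-SQL sketch against a SQL DW dedicated pool. The database name MySqlDw is a placeholder, and the materialized view reuses the illustrative dbo.FactStoreSales table from the aggregation example above.

```sql
-- Result-set caching is off by default and is enabled per database; run this
-- while connected to the master database of the server. It can also be toggled
-- per session with SET RESULT_SET_CACHING ON/OFF.
ALTER DATABASE [MySqlDw] SET RESULT_SET_CACHING ON;

-- A materialized view pre-computes and stores an aggregation. SQL DW requires a
-- distribution option, and the definition must contain aggregation with GROUP BY.
CREATE MATERIALIZED VIEW dbo.mvStoreSalesByYear
WITH (DISTRIBUTION = HASH(StoreKey))
AS
SELECT StoreKey,
       SalesYear,
       SUM(SalesAmount) AS TotalSalesAmount,
       COUNT_BIG(*)     AS SalesRowCount
FROM dbo.FactStoreSales
GROUP BY StoreKey, SalesYear;
```

Once the view exists, the optimizer can rewrite matching queries to use it automatically, which is why the Power BI report never needs to reference it by name.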
So, if we look at all the layers where a query can access data when Power BI uses Azure Synapse Analytics as the data source, they stack up roughly as follows, from fastest to slowest: an imported or aggregation table in Power BI’s in-memory cache, the result-set cache in SQL DW, a materialized view in SQL DW, and finally the base tables in SQL DW.
As an example of the speed of each layer, during an Ignite session (view here), a Power BI query was run against 26 billion rows, returning the sum of store sales by year. The same query was run three times, each time using a different layer:
- Using a DirectQuery against tables in SQL DW took 8 seconds
- Using a DirectQuery against a materialized view in SQL DW took 2.4 seconds. Note you don’t have to reference the materialized view in the query; the SQL DW optimizer determines on its own whether it can use it (see the EXPLAIN sketch after this list)
- Using an aggregation table that is imported into Power BI took 0 milliseconds
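If you want to confirm that the optimizer really is substituting a materialized view, SQL DW’s EXPLAIN statement returns the estimated plan as XML. A minimal sketch, reusing the hypothetical dbo.FactStoreSales and dbo.mvStoreSalesByYear names from above:

```sql
-- The query references only the base fact table; if the returned plan mentions
-- dbo.mvStoreSalesByYear, the optimizer transparently used the materialized view.
EXPLAIN
SELECT SalesYear, SUM(SalesAmount) AS TotalSalesAmount
FROM dbo.FactStoreSales
GROUP BY SalesYear;
```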
Keep in mind this is all hidden from the user – they just create the report. If they run a query against a table not in memory in Power BI, it will do a DirectQuery against the data source, which could take a while. However, due to SQL DW result-set caching, repeat DirectQueries can be very fast (in the Ignite session they demoed a DirectQuery that took 42 seconds the first time the query was run, and just 154 milliseconds the second time, when it used result-set caching).
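To see whether repeat DirectQueries are actually being served from the result-set cache, SQL DW exposes a result_cache_hit column on the requests DMV. A minimal sketch (the LIKE filter on the hypothetical fact table name is just one way to narrow down the relevant requests):

```sql
-- result_cache_hit = 1 means the request was answered from the result-set cache,
-- 0 means a cache miss, and other values indicate why the cache was not used.
SELECT request_id, submit_time, total_elapsed_time, result_cache_hit, command
FROM sys.dm_pdw_exec_requests
WHERE command LIKE '%FactStoreSales%'
ORDER BY submit_time DESC;
```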
So by using features such as result-set caching and materialized views, you may achieve the results you are looking for without having to load the data into Power BI.
One thing to note: In the short term, Synapse won’t leverage materialized views for queries that use outer joins (which are the types of queries Power BI sends by default). So Power BI users will need to enable the “Assume referential integrity” setting on their relationships, so Power BI generates inner joins, for materialized views to be useful (this won’t be required when Synapse goes GA). Also note that materialized views can be queried directly, so users can create a materialized view and then build Power BI aggregation tables that reference it directly.
These new features bring into question whether you still need Azure Analysis Services (AAS) or if the tabular model (i.e. “cube”) within Power BI is enough. Or whether you don’t need either – instead just use DirectQuery within Power BI.
For the first question, the goal is that soon there won’t be a reason to use AAS, assuming you already have Power BI Premium (especially now with XMLA endpoint support). Eventually Power BI Premium will be a superset of AAS. But AAS will be supported for the foreseeable future for the lift-and-shift scenario: AAS has modeling parity with SSAS on-prem, so customers that haven’t yet looked at the full modernization benefits of Power BI (see those benefits below) may choose to move to AAS, perhaps as a stop-gap until they’re ready to commit to Power BI (eventually lifting and shifting SSAS to Power BI will be supported). And of course AAS would still be needed if you wanted to use a 3rd-party product against it (a reporting tool, for example) and were not able to use Power BI Premium (because your company had standardized on a different reporting tool, for example). A scenario I currently see is customers spending money training a small group of people to build AAS cubes and making them available to a large group of users who can then do self-service reporting against the cubes.
Those benefits of Power BI over AAS include data protection based on Microsoft Cloud App Security using Microsoft Information Protection labels, shared datasets, aggregations and composite models over DWs for big data, simplified management of incremental refresh, lineage view of Power BI artifacts across workspaces, much better connectivity to big data sources, paginated reports, dataflows, automated ML and cognitive services, Power BI Desktop model authoring, Bring Your Own Key (for encryption), and multi-model memory management.
Note there are still some features in AAS that are not yet in Power BI: the ability to deploy from Visual Studio to Power BI, XMLA write endpoint, query scale-out, some 3rd-party tool integration, backup/restore, firewalls, multiple tiers, Log Analytics integration, asynchronous processing REST API, Perspectives, Key Performance Indicators (KPIs) within the model, Ragged Hierarchies, object level security, and metadata translations.
The good news is that it should be very easy to deploy an Analysis Services model to Power BI Premium in the future, so you won’t lose any work if you use Analysis Services now. You’ll simply change the value of the deployment server property in Visual Studio from your existing SSAS or Azure AS server to the XMLA endpoint of the Power BI Premium workspace.
Some excellent blogs comparing Power BI Premium and Azure Analysis Services: Power BI Premium and Azure Analysis Services, Choosing Azure Analysis Services for large datasets, Power BI Large Datasets: The Good, the Bad, and the Ugly, ANALYSIS SERVICES VS POWER BI PREMIUM MODEL FEATURE MATRIX, and Migrating from Azure Analysis Services to Power BI Premium.
For the second question, on whether you can just use DirectQuery against SQL DW and avoid using both AAS and a cube within Power BI, the answer is ‘yes’ if you are certain the queries will always hit the result-set cache in SQL DW (see what’s not cached here and when cache results are used here). But if some won’t, especially when you are using a Power BI dashboard where you need millisecond response time for all queries, it would be best to use a cube or aggregations in Power BI for the dashboards and use SQL DW for ad-hoc queries. Also be aware that there are DAX limitations when using DirectQuery in Power BI against SQL DW that may point you to using AAS or a cube within Power BI instead of DirectQuery. These limitations include time-intelligence capabilities (see the full list here). Additionally, keep in mind that a cube gives you benefits you will miss out on if you don’t use one: a semantic layer, no need for joins or relationships, hierarchies, KPIs, advanced time calculations, and higher concurrency limits.
More info:
Power BI new feature: Composite models
Choosing Azure Analysis Services for large datasets
Great summary on where we are at with Azure Synapse and how to use it with Power BI.
Question about using DirectQuery against SQL DW. There are DAX limitations on using DirectQuery, such as time intelligence. Wouldn’t that also be a consideration when choosing DirectQuery over AAS or the cubes within Power BI?
Good point Ibrahim! I have updated the blog to reflect your comment. Thanks!
Good timing for this article. Need to understand Power BI and Synapse for an upcoming demo.
What a well-written summary article! I really like the section where you list the highlighted Synapse features that help query performance.
Hi James, about the 400GB storage: I know the Power BI gateway has a 5-hour refresh limit, and Power BI cannot import 400GB of data within 5 hours, right? So, because of the 5-hour gateway refresh limit, the practical dataset size limit is equal to whatever can be refreshed in 5 hours, right?
Great article James. I’ve been scouring Microsoft documentation for this information 🙂 Thank you