Synapse and federated queries
As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about a part of the demo that shows a feature in the public preview of Synapse that is, frankly, very cool: the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB, and join it all together with one T-SQL statement using SQL On-demand (also called SQL serverless), making it a federated query (also known as data virtualization or a logical data warehouse). The beauty of this is that you don't have to first write ETL to collect all the data into a relational database in order to query it together, and you don't have to provision a SQL pool, saving costs. Further, because you are using T-SQL to query all of those data sources, you can use a reporting tool like Power BI (or any tool that supports T-SQL) to see the results. Synapse is the tool to easily create that virtual layer on top of the data.
The queries are also fast, as SQL On-demand will push down queries from the front-end node to the back-end nodes (which each contain a SQL Server engine) that sit next to the storage (this is done via POLARIS: The Distributed SQL Engine in Azure Synapse, which is used for SQL On-demand pools and eventually will be used for SQL Provisioned pools). Note there is no cache yet in SQL On-demand, so the queries won't run faster after the first run.
My demo queries tweets stored in parquet files in ADLS Gen2. Users or cities can be banned, and the query has to return only the tweets from non-banned entities. The list of banned users is in a Spark table and the list of banned cities is in a Cosmos DB container. Now on to my demo:
Working inside Azure Synapse Studio, first, I wrote a T-SQL view that queried all the parquet files in an ADLS Gen2 folder (the key is the OPENROWSET function):
CREATE VIEW [dbo].[vFQ_ADLSTable] AS
SELECT
    ADLSTable.*
FROM
    OPENROWSET(
        BULK 'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet',
        FORMAT = 'PARQUET'
    ) AS ADLSTable
Notice how I am using a wildcard in the file path to query all parquet files in the folder instead of just one.
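If you also want to know which file each row came from, the serverless SQL pool exposes the filename() and filepath() metadata functions on the OPENROWSET alias. A quick sketch against the same folder as above:

```sql
SELECT TOP 10
    ADLSTable.filename() AS SourceFile,  -- name of the parquet file the row came from
    ADLSTable.*
FROM
    OPENROWSET(
        BULK 'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet',
        FORMAT = 'PARQUET'
    ) AS ADLSTable
```

This is handy for debugging the wildcard: you can confirm exactly which files were picked up by the path pattern.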
Then I wrote a T-SQL view that queried a Spark table called bannedusers in the Spark database users. Even though it is a Spark table and the Spark cluster is not running, I am able to query this table using SQL On-demand because of a feature in Synapse that copies the metadata for the Spark table to SQL On-demand so you can use it outside of Spark (see Azure Synapse Analytics shared metadata). Plus, when a Spark table is created it actually stores its data in the Synapse primary storage area (in my case the folder asaexpworkspacejs1234/tempdata/synapse/workspaces/serrademoworkspace/warehouse/users.db/bannedusers), so there is no need to use the Spark engine, as SQL On-demand can directly query that storage location:
CREATE VIEW [dbo].[vFQ_SparkTable] AS
SELECT
    SparkTable.*
FROM
    users.dbo.bannedusers AS SparkTable
In both views above the data is read directly from storage, in parallel, and filtering is done at the SQL engine nodes themselves. In some cases the amount of data read is also reduced: some files can be eliminated early on via partition elimination, and in the case of Parquet some column segments can be skipped where possible; in the future, technologies such as Query Acceleration for ADLS will help here too.
Then I wrote a T-SQL view that queries a database in Cosmos DB called CosmosIOTDemo and the container bannedcities using Azure Synapse Link for Cosmos DB (see demo here):
CREATE VIEW [dbo].[vFQ_CosmosTable] AS
SELECT
    CosmosTable.*
FROM
    OPENROWSET(
        'CosmosDB',
        N'account=synapselinkdemoserra;database=CosmosIOTDemo;region=eastus;key=xxx',
        bannedcities
    ) WITH (
        id varchar(36),
        city varchar(1),
        cityname varchar(30),
        banned varchar(1)
    ) AS CosmosTable
When using Azure Synapse Link for Cosmos DB, the SQL On-demand compute engine pushes down predicates to the remote columnar store of Cosmos DB (aka the Analytical Store) and returns the results to the Synapse compute layer.
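To see what that means in practice, you can filter directly against the Cosmos DB view defined above; the predicate on the banned column can be evaluated against the Analytical Store rather than after all rows reach Synapse (a minimal sketch, assuming the view above exists):

```sql
-- The filter on banned can be pushed down to the Cosmos DB Analytical Store,
-- so only matching rows travel back to the Synapse compute layer
SELECT cityname
FROM [dbo].[vFQ_CosmosTable]
WHERE banned = 'Y'
```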
Finally, I created a view that joins the data from all three views above. The parquet files in ADLS Gen2 are twitter feeds, and the Spark table and Cosmos DB container are reference tables that list the users and cities in the twitter feeds that should be excluded from the query (an entity has to be in the reference table with the banned field set to Y to be excluded). The twitter feed was only about 4,000 rows, and this query took just 2-3 seconds (you may have to wait an additional 2-3 seconds for the SQL On-demand pool to warm up the first time):
CREATE VIEW [dbo].[vFQ_Combined] AS
SELECT
    t1.*,
    t2.banned AS UserBanned,
    t3.banned AS CityBanned
FROM
    vFQ_ADLSTable t1
    LEFT JOIN vFQ_SparkTable t2
        ON (t1.UserName COLLATE SQL_Latin1_General_CP1_CI_AS = t2.username)
    LEFT JOIN vFQ_CosmosTable t3
        ON (t1.City = t3.cityname)
WHERE (t2.banned IS NULL OR t2.banned = 'N')
  AND (t3.banned IS NULL OR t3.banned = 'N')
Then I went to the Develop hub and, under Power BI, chose "Power BI datasets" (first you have to link to a Power BI workspace), chose "New Power BI dataset", chose my SQL On-demand pool, and downloaded the Power BI Desktop Source (.pbids) file that has the connection info to the SQL On-demand database. Once I clicked on that downloaded file, it opened Power BI Desktop and let me choose the combined query vFQ_Combined. You can choose to import the data or use DirectQuery to get the results in real time: any file that is dropped into the ADLS Gen2 folder alongside the other parquet files will be read.
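For reference, a .pbids file is just a small JSON document that tells Power BI Desktop which server and database to connect to. A sketch of what the downloaded file contains (the address and database names below are placeholders, not the actual values from my workspace):

```json
{
  "version": "0.1",
  "connections": [
    {
      "details": {
        "protocol": "tsql",
        "address": "myworkspace-ondemand.sql.azuresynapse.net",
        "database": "mydatabase"
      },
      "mode": "DirectQuery"
    }
  ]
}
```

You could also hand-write one of these files to point colleagues at the SQL On-demand endpoint without them needing access to Synapse Studio.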
It appears to the end user that the data is stored in a relational database because the views create an abstraction layer with the metadata (and because of automatic schema inference you don't even need to specify the field layouts for a file in ADLS Gen2 or a Spark table).
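If you do want control over the field layouts (for example, to trim data types for better performance instead of relying on inference), you can add a WITH clause to the OPENROWSET call. A sketch against the same folder as the first view; the Tweet column name and all the types here are illustrative, since the actual file schema isn't shown in this post (UserName and City are the columns used in the combined view's joins):

```sql
SELECT
    ADLSTable.*
FROM
    OPENROWSET(
        BULK 'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet',
        FORMAT = 'PARQUET'
    ) WITH (
        UserName varchar(50),   -- illustrative names and types, not the real schema
        City     varchar(30),
        Tweet    varchar(280)   -- hypothetical column
    ) AS ADLSTable
```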
Data in ADLS Gen2 can be moved to different access tiers to save costs. Eventually, when SQL On-demand supports querying the relational SQL Provisioned pool, you will be able to archive older relational data out to ADLS Gen2 and combine a query on that storage with a query on the relational database, saving costs and improving performance within the relational database.
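Once that support arrives, the pattern would look something like the sketch below: a UNION ALL of the hot relational table and the archived parquet files (the table name, columns, and path are all hypothetical):

```sql
-- Hypothetical: recent data in a relational table,
-- older data archived to parquet files in ADLS Gen2
SELECT OrderId, OrderDate, Amount
FROM dbo.Orders                        -- hot relational data
UNION ALL
SELECT OrderId, OrderDate, Amount
FROM
    OPENROWSET(
        BULK 'https://asaexpdatalakejs1234.dfs.core.windows.net/archive/orders/*.parquet',
        FORMAT = 'PARQUET'
    ) AS ArchivedOrders                -- cold archived data
```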
As I said, very cool!
More info:
Direct Query your Data Lake Files using Azure Synapse Analytics and Power BI