Query options in Azure Synapse Analytics
(updated 12/8/20)
The public preview version of Azure Synapse Analytics has three compute options and can access four types of storage (mentioned in my blog at SQL on-demand in Azure Synapse Analytics), giving twelve possible combinations for querying data. Not all of these combinations are currently supported, and some have quirks, which I list below.
(NOTE: I’ll demo these features at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/13 (session info), and Big Data Conference Europe on 11/25 (session info). I hope you can join!)
Compute \ Storage | Relational Database | ADLS Gen2 | Spark Table | Cosmos DB |
---|---|---|---|---|
Dedicated SQL pool | Y | Y (*1) | N (*2) | N (*6) |
Serverless SQL pool | N (*6) | Y | Y (*3) | N (*4) |
Apache Spark pool | Y (*7) | Y | Y | Y (*5) |
*1: This is available via an external table, which uses PolyBase technology and does not push down queries, so it can be slow. A feature called fast parquet, available after Azure Synapse Analytics goes GA, will speed up queries over external tables mapped to parquet files (the underlying technology is the same one used for SQL on-demand)
*2: This feature will be available soon, but will only support Spark tables created using Parquet as the input format
*3: This feature only supports Spark tables created using Parquet as the input format (explained here)
*4: This feature will soon be available via Azure Synapse Link
*5: This feature is available via Azure Synapse Link
*6: This feature will be available after Azure Synapse Analytics goes GA
*7: Note that this option uses the compute of the Dedicated SQL pool, so you cannot run queries against the relational database from Spark unless the Dedicated SQL pool for that relational database is running
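To illustrate the Serverless SQL pool reading from ADLS Gen2 (the "Y" in the table above), here is a minimal T-SQL sketch using OPENROWSET; the storage account, container, and folder path are hypothetical placeholders, and the query assumes you have access to the storage account.

```sql
-- Minimal sketch: ad-hoc query over parquet files in ADLS Gen2 from a Serverless SQL pool.
-- The storage account, container, and folder path are hypothetical placeholders.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/mycontainer/sales/2020/*.parquet',
    FORMAT = 'PARQUET'
) AS [rows];
```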
If you have data in ADLS Gen2 stored in Delta Lake format (via methods such as a Spark table or Azure Data Factory), that data can be read via the Apache Spark pool (whether the Delta Lake was built with open-source Spark or Databricks), will soon be readable via Serverless SQL (a workaround for now is here), but cannot be read via a Dedicated SQL pool. Note that Power BI can read the Delta Lake format, but it requires the Power BI Databricks connector. Also see How interchangeable are Delta tables between Azure Databricks and Azure Synapse Analytics?
If you have data in ADLS Gen2 in the Common Data Model (CDM) format (via methods such as the Dynamics 365 CDS export to ADLS Gen2, direct storage, or Azure Data Factory), Serverless SQL is not yet able to read it, but the Apache Spark pool can (see the Spark CDM connector). Note that Power BI can read CDM if you use Power BI dataflows.
A huge benefit of the public preview version of Azure Synapse Analytics is the ability to query a file in the data lake by simply right-clicking it. When you do, a menu pops up giving you three choices for how to query the data. Currently this works on three different file types: parquet, CSV, and JSON (note that you can always query additional file types using a Spark notebook). The three query choices are listed below, and for these file types all but one combination is currently supported:
Right-click object-type: | Preview | New SQL script -> Select TOP 100 rows | New notebook -> Load to DataFrame |
---|---|---|---|
Parquet | N (*3) | Y (*5) | Y |
CSV | Y | Y (*1) | Y |
JSON | Y | Y (*2) | Y |
Relational Tbl | N | Y | Y |
External Tbl | N | Y | Y |
Spark Tbl | N | Y | Y |
Cosmos DB | N | Y (*4) | Y |
“Preview” opens a pop-up window with the contents of the file. “Select TOP 100 rows” opens a tab with a T-SQL SELECT statement using SQL on-demand, which you can run to show the results (a sketch of the generated script appears after the notes below). “New notebook” opens a Spark notebook with PySpark code that you can run to load the data into a Spark in-memory DataFrame and display the rows.
*1: This feature was just added and has a limitation: the column headers do not show correctly (a feature is being worked on to let the query infer the column names when the file has a header row)
*2: Replace “varchar(8000)” with “varchar(MAX)” in the SELECT statement if you receive an error when running the query
*3: A work-around is to right-click the file, choose New SQL script -> Bulk load, click the Continue button and you will see a “Preview data” option
*4: Need to replace “key” with the Cosmos DB key in the generated SQL
*5: Because of automatic schema inference you don’t need to specify the field layouts for a file in a query
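For reference, here is a sketch along the lines of the script that “Select TOP 100 rows” generates for a JSON file (the exact script may differ by version, and the file path is a hypothetical placeholder); per note *2, switch varchar(8000) to varchar(MAX) if the query errors on larger documents.

```sql
-- Sketch of a generated "Select TOP 100 rows" script for a JSON file (path is a placeholder).
-- The file is read as a single text column that you can then parse with OPENJSON/JSON_VALUE.
SELECT TOP 100 jsonContent
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/mycontainer/data/file1.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b'
) WITH (jsonContent varchar(8000)) AS [rows];
```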
As for writing to ADLS Gen2 via a Serverless SQL pool, that can be done using CETAS (CREATE EXTERNAL TABLE AS SELECT); a sketch is below. Note that dropping an external table does not delete the underlying data. You can also use the Copy activity in ADF to read data via views in SQL on-demand and write it to the data lake, but the write uses ADF compute.
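Here is a minimal CETAS sketch; the external data source (MyDataLake), external file format (ParquetFileFormat), table name, and location are hypothetical placeholders that must be created beforehand in the serverless SQL database.

```sql
-- Minimal CETAS sketch: write query results back to ADLS Gen2 from a Serverless SQL pool.
-- MyDataLake, ParquetFileFormat, and the paths/names are hypothetical placeholders.
CREATE EXTERNAL TABLE dbo.SalesSummary
WITH (
    LOCATION = 'curated/sales-summary/',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFileFormat
)
AS
SELECT ProductId, SUM(Quantity) AS TotalQuantity
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/mycontainer/sales/2020/*.parquet',
    FORMAT = 'PARQUET'
) AS [rows]
GROUP BY ProductId;

-- Dropping dbo.SalesSummary later removes only the table metadata, not the files it wrote.
```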
More info:
Analyze and explore data with T-SQL in Azure Synapse Analytics