External tables vs T-SQL views on files in a data lake
A question that I have been hearing recently from customers using Azure Synapse Analytics (the public preview version) is what is the difference between using an external table versus a T-SQL view on a file in a data lake?
Note that a T-SQL view and an external table pointing to a file in a data lake can be created in both a SQL Provisioned pool as well as a SQL On-demand pool.
Here are the differences that I have found:
- Overall summary: views are generally faster and have more features such as OPENROWSET
- Virtual functions (filepath and filename) are not supported with external tables which means users cannot do partition elimination based on FILEPATH or complex wildcard expressions via OPENROWSET (which can be done with views)
- External tables can be shareable with other computes, since their metadata can be mapped to and from Spark and other compute experiences, while views are SQL queries and thus can only be used by SQL On-demand or SQL Provisioned pool
- External tables can use indexes to improve performance, while views would require materialized views for that
- Sql On-demand automatically creates statistics both for a external table and views using OPENROWSET. You can also explicitly create/update statistics on files on OPENROWSET. Note that automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV files statistics is supported
- Views give you more flexibility in the data layout (external tables expect the OSS Hive partitioning layout for example), and allow more query expressions to be added
- External tables require an explicit defined schema while views can use OPENROWSET to provide automatic schema inference allowing for more flexibility (but note that an explicitly defined schema can provide faster performance)
- If you reference the same external table in your query twice, the query optimizer will know that you are referencing the same object twice, while two of the same OPENROWSETs will not be recognized as the same object. For this reason in such cases better execution plans could be generated when using external tables instead of views using OPENROWSETs
- Row-level security (Polybase external tables for Azure Synapse only) and Dynamic Data Masking will work on external tables. Row-level security is not supported with views using OPENROWSET
- You can use both external tables and views to write data to the data lake via CETAS (this is the only way either option can write data to the data lake)
- If using SQL On-demand, make sure to read Best practices for SQL on-demand (preview) in Azure Synapse Analytics
I often get asked what is the difference in performance when it comes to querying using an external table or view against a file in ADLS Gen2 vs. querying against a highly compressed table in a SQL Provisioned pool (i.e. managed table). It’s hard to quantify without understanding more about each customers scenario, but you will roughly see a 5X performance difference between queries over external tables and views vs. managed tables (obviously, depending on the query, that will vary but that’s a rough number – could be more than 5X in some scenarios). A few things that contribute to that: in-memory caching, SSD based caches, result-set caching, and the ability to design and align data and tables when they are stored as managed tables. You can also create materialized views for managed tables which typically bring lots of performance improvements as well. If you are querying Parquet data, that is in a columnstore file format with compression so that would give you similar data/column elimination as what managed SQL clustered columnstore index (CCI) would give, but if you are querying non-Parquet files you do not get this functionality. Note that for managed tables, on top of performance, you also get a granular security model, workload management capabilities, and so on (see Data Lakehouse & Synapse).
Hi James. Thanks for the post. Quick question, I have always though that OPENROWSET was a single threaded operation that needed to move data through the control node and not directly to the compute nodes. So by saying that views (which use OPENROWSET) are generally faster, does that mean the OPENROWSET implementation in Synapse Workspace is NOT single threaded (like in Synapse Analytics) and enjoys the MPP benefits of reading data from a data lake straight to the compute nodes and not through the control node?
Hi James, OPENROWSET works over files in the data lake and the “compute” used to query the files is not MPP. MPP is just used with SQL provisioned pools on a relational database.
Pingback:External Tables vs T-SQL Views in Synapse – Curated SQL
James,
Nice break down of hot conversation. One more blog entry I will be referencing to my friends and customers..
> Note that a T-SQL view and an external table pointing to a file in a data lake can be created in both a
> SQL Provisioned pool as well as a SQL On-demand pool.
what’s the syntax to create a view referencing a external parquet file in dedicated? I need the virtual functions on dedicated. Everything i’m reading points to it only being available on serverless.
It is explained here: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=azure-sqldw-latest&preserve-view=true&tabs=dedicated