PolyBase use cases clarified
I previously talked about PolyBase and its enhancements (see PASS Summit Announcements: PolyBase enhancements). There is some confusion on PolyBase use cases as they are different depending on whether you are using PolyBase with Azure SQL Data Warehouse (SQL DW) or SQL Server 2016, as well as the sources you are using it against. The three main use cases for using PolyBase are: Loading data, federating querying, and aging out data. Here is the support for those three uses cases in SQL DW and SQL Server 2016:
PolyBase in: | Parallelize Data Load (Blob and ADLS) | Federated Query (push down) HDInsights | Federated Query (push down) HDP/Cloudera (local or blob) | Federated Query (push down) five new sources* | Age Out Data |
SQL DW | Yes | N/A | N/A | No support for on-prem sources | Maybe |
SQL Server 2016 | Yes via scale-out groups. Blob, not ADLS | N | Y (Creates MapReduce job) | Y | Maybe |
* = Teradata, Oracle, SQL Server, MongoDB, generic ODBC (Spark, Hive, Impala, DB2)
For federated queries: “N” requires all data from the source to be copied into SQL Server 2016 and then filtered. For “Y”, the query is pushed down into the data source and only the results are returned back, which can be much faster for large amounts of data.
I mention “Maybe” for age out data in SQL DW as you can use PolyBase to access the aged-out data in blob or Azure Data Lake Storage (ADLS), but it will have to import all the data so may have slower performance (which is usually ok for accessing data that is aged-out). For SQL Server 2016, it will have to import the data unless you use HDP/Cloudera, in which case the creation of the MapReduce job will add overhead.
Here are details on what PolyBase supports for each product:
PolyBase (works with) |
Azure Blob Store (WASB) |
Push Down |
Azure Data Lake Store (ADLS) |
Push Down |
HDI |
Push Down |
Cloudera (CDH) |
Push Down |
Horton Works (HDP) |
Push Down |
SQL 2016 |
Yes |
N/A |
No |
N/A |
No |
No |
Yes |
Yes |
Yes |
Yes |
Azure SQL DW |
Yes |
N/A |
Yes |
N/A |
No |
No |
No |
No |
No |
No |
APS |
Yes |
N/A |
No |
N/A |
Yes |
Yes (internal region)
|
Yes |
Yes |
Yes |
Yes |
Here are some important notes:
- The file types that PolyBase supports: UTF-8 and UTF-16 encoded delimited text, RC File, ORC, Parquet, gzip, zlib, Snappy. Not supported: extended ASCII, fixed-file format, WinZip, JSON, and XML
- Azure SQL Database does not support PolyBase
- SQL DW recently added PolyBase support for ADLS but does not support compute pushdown
- ADLS in only in two regions (East US 2, Central US)
- PolyBase supports row sizes up to 1MB
- PolyBase can do writes to blob/ADLS and HDFS (using CETAS)
- PolyBase requires the CREATE EXTERNAL TABLE command
- PolyBase offers ability to create statistics on tables (but they are not auto-created or auto-updated)
PolyBase parallelized reads for data loading:
- Supported: in SQL using CTAS or INSERT INTO
- Not supported: BCP, Bulk Insert, SQLBulkCopy
- Not supported: SSIS (unless used to call stored procedure containing CTAS or use the Azure SQL DW Upload Task)
- Supported: ADF
- If source compatible with PolyBase, will directly copy
- If source not compatible with PolyBase, will stage to Blob
- If source is ADLS, will still stage to Blob (having to stage to blob will be removed week of 8/20, so PolyBase will copy from ADLS directly to target)
The bottom line is, for SQL DW, think of PolyBase as a mechanism for data loading. For SQL Server 2016, think of PolyBase for federated querying.
More info:
Pingback:Polybase Use Cases – Curated SQL
Pingback:U-SQL Defined | James Serra's Blog
Pingback:U-SQL Defined – Cloud Data Architect