PolyBase explained

PolyBase is a new technology that integrates Microsoft’s MPP product, SQL Server Parallel Data Warehouse (PDW), with Hadoop. It is designed to enable queries across relational data stored in PDW and in non-relational Hadoop data that is stored in the Hadoop Distributed File System (HDFS), bypassing Hadoop’s MapReduce distributed computing engine that is typically used to read data from HDFS. You can create an external table in PDW that references Hadoop data (kinda like a linked server) and you can then query it with SQL, in essence adding structure to un-structured data. So you can: 1) retrieve data from HDFS with a PDW query that will even allow that data to be joined to native PDW relational tables so that Hadoop and SQL PDW can be queried in tandem, with result sets that integrate data from each source (seamlessly joining structured and semi-structured data); 2) you can import data from HDFS to PDW; and 3) you can export data from PDW to HDFS (for example, as a backup strategy).

The biggest benefit with PolyBase is you don’t need to understand HDFS or MapReduce (typically written in Java) to access Hadoop, and there is no ETL needed. And you can quickly and easily use a tool such as Power Pivot to connect to PDW and pull in data from PDW tables and external Hadoop tables.

Microsoft Technical Fellow David Dewitt is one of the principals behind PolyBase. Some things to note:

When selecting data in Hadoop, the data is not stored in PDW – it uses a ShuffleMove/BroadcastMove/Round Robin to temporarily bring the data into PDW into temporary tables
PolyBase only works within PDW for now, but later it might be added to SQL Server (but there are no plans for that). PolyBase relies on the Data Movement Service (DMS) in PDW, and DMS does not exist in SQL Server
It does not support DML operations
It may in the future be able to access other storage systems besides Hadoop
It only works for delimited text files
It requires Java RunTime environment (Oracle JRE)
It can connect to Hortonworks Data Platform (HDP) on Windows Server, HDP on Linux, Cloudera (CHD) on Linux
Soon PDW will have the ability to add a Hadoop scale-unit (compute nodes and storage) right into the PDW rack

In a future version of PolyBase the query optimizer will be able make a cost-based decision, when referencing data in an HDFS, to determine whether it should transform the query into a MapReduce job to be performed on the Hadoop cluster or if it should just process using the SQL server instances on the PDW. Also, the optimizer will have the ability to move the workload of a query involving only PDW data to the Hadoop cluster. This intelligence within the optimizer will allow it to split the workload between the two platforms and thus leverage the true capabilities of the Hadoop cluster.

So in summary, the main features of PolyBase are:

Simplicity: You can query data in Hadoop via regular SQL
Performance: Parallelized data reading and writing into Hadoop
Openness: Supports various Hadoop distributions
Integration: Works with Microsoft BI tools such as Power Pivot, Power View, SSRS, SSAS

More info:

Seamless insights on structured and unstructured data with SQL Server 2012 Parallel Data Warehouse

Polybase: Hadoop Integration in SQL Server PDW V2

Microsoft’s PolyBase mashes up SQL Server and Hadoop

Insight through Integration: SQL Server 2012 Parallel Data Warehouse

PASS talk: Polybase: What, Why, How

PolyBase in APS – Yet another SQL over Hadoop solution?

Comments

PolyBase explained — 26 Comments

Andrew Peterson on February 24, 2014 at 9:31 am said:

I watched the joint Hortonworks and Microsoft webinar a few days ago on how PDW has incorporated Hortonworks. When they got to the section where the presenter showed how a user can easily join existing PDW data with Hadoop Hive/HCatalog tables it was impressive. Really the future of how easy it will be to merge unstructured data in Hadoop with traditionally structured data. Now, how soon will we be able to get that with the basic SQL Server BI stack?

Reply ↓
Pingback:Introduction to Hadoop - SQL Server - SQL Server - Toad World
Pingback:Introduction to Hadoop | James Serra's Blog
Pingback:What is HDInsight? | James Serra's Blog
Pingback:Parallel Data Warehouse (PDW) benefits made simple | James Serra's Blog
Pingback:Parallel Data Warehouse (PDW) benefits made simple - SQL Server - SQL Server - Toad World
Pingback:Parallel Data Warehouse (PDW) AU1 released - SQL Server - SQL Server - Toad World
Pingback:What is the Microsoft Analytics Platform System (APS)? - SQL Server - SQL Server - Toad World
Pingback:Non-obvious APS/PDW benefits - SQL Server - SQL Server - Toad World
Pingback:Non-obvious APS/PDW benefits | James Serra's Blog
Pingback:The Modern Data Warehouse - SQL Server - SQL Server - Toad World
Pingback:The Modern Data Warehouse | James Serra's Blog
Pingback:Hadoop and Data Warehouses | James Serra's Blog
Pingback:What is a data lake? - SQL Server - SQL Server - Toad World
Pingback:Azure SQL Data Warehouse | James Serra's Blog
Pingback:Azure SQL Data Warehouse - SQL Server - SQL Server - Toad World
Pingback:What is a data lake? | James Serra's Blog
Pingback:Azure Data Lake | James Serra's Blog
Pingback:Azure Data Lake - SQL Server - SQL Server - Toad World
Pingback:SQL Server 2016 public preview arriving this summer - SQL Server - SQL Server - Toad World
Romeo Valencia on May 27, 2015 at 4:01 pm said:

We have determined in our product/solution to revise the existing strategy and architecture as far as data is concerned. We have come to the point where we needed a common gateway or endpoint where structured and unstructured data can seamlessly be accessible and can be homogenized. Our backend data processing are all HADOOP based infrastructure and our OLTP/DWH database are currently under SQL SERVER 2008 R2 platform. According to our research, the new version of SQL SERVER and that is SQL SERVER 2014, offers a specific feature that is suitable to our requirements. The specific feature that we are looking at is the “POLYBASE”. Our main objective for the re-achitecuring are as follows: 1. Very minimal to zero revisions/impacts to our front-end application; 2. Optimal performance gain in accessing (including data manipulation) data for both structured/unstructured; 3. Scalability and maintainability; 4. Re-utilization of an existing skillset; What we are looking at or expectation: 1. Some success stories implementing such feature and if need be some presentation; 2. Provide and guide us in the implementation of such feature; 3. Any other feature as an extension of SQL Server to Hadoop.

Reply ↓
Pingback:SQL Server 2016 public preview arriving this summer | James Serra's Blog
Andrew Peterson on September 4, 2015 at 11:50 am said:

This is a follow-up to my first comment. We’ll, it appears that PolyBase will be coming to the enterprise edition of SQL Server 2016. I’ve reviewed CTP 2.2 and PolyBase is a big part of CTP 2.2. This is great news. If anyone is interested, I’ve created a series of posts on how to setup, and get started with PolyBase in SQL Server 2016.
You can find the posts here: http://realizeddesign.blogspot.com/

Reply ↓
Cyed on October 9, 2015 at 3:08 pm said:

Hi Sir,

Great article sir, I have been waiting for this kind of stuff, which can help us to fetch data from hadoop without MR,Yarn,PIG,HIVE etc.
Thank you for posting

Reply ↓
Pingback:Why use a data lake? - SQL Server - SQL Server - Toad World
Pingback:Why use a data lake? | James Serra's Blog

James Serra's Blog

Big Data and Data Warehousing

Comments

PolyBase explained — 26 Comments

Leave a Reply Cancel reply

Share:

Leave a Reply Cancel reply