SQL Server 2019 Big Data Clusters
At the Microsoft Ignite conference, Microsoft announced that SQL Server 2019 is now in preview and that SQL Server 2019 will include Apache Spark and Hadoop Distributed File System (HDFS) for scalable compute and storage. This new architecture that combines together the SQL Server database engine, Spark, and HDFS into a unified data platform is called a “big data cluster”, deployed as containers on Kubernetes. Big data clusters can be deployed in any cloud where there is a managed Kubernetes service, such as Azure Kubernetes Service (AKS), or in on-premises Kubernetes clusters, such as AKS on Azure Stack. The SQL Server 2019 relational database engine in a big data cluster leverages an elastically scalable storage layer that integrates SQL Server and HDFS to scale to petabytes of data storage. The Spark engine is now part of SQL Server:
While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer. The virtual data layer allows users to query data from many sources through a single, unified interface. Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date. Storage costs and data governance complexity are minimized. See the pro’s and con’s of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.
SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:
In SQL Server 2019 big data clusters, the SQL Server engine has gained the ability to natively read HDFS files, such as CSV and parquet files, by using SQL Server instances collocated on each of the HDFS data nodes that can filter and aggregate data locally in parallel across all of the HDFS data nodes.
Performance of PolyBase queries in SQL Server 2019 big data clusters can be boosted further by distributing the cross-partition aggregation and shuffling of the filtered query results to “compute pools” comprised of multiple SQL Server instances that work together (this is similar to a PolyBase scale-out group).
When you combine the enhanced PolyBase connectors with SQL Server 2019 big data clusters data pools, data from external data sources can be partitioned and cached across all the SQL Server instances in a data pool, creating a “scale-out data mart”. There can be more than one scale-out data mart in a given data pool, and a data mart can combine data from multiple external data sources and tables, making it easy to integrate and cache combined data sets from multiple external sources. This will also be a great solution for importing IoT data.
SQL Server 2019 big data clusters make it easier for big data sets to be joined to the data stored in the enterprise relational database, enabling people and apps that use SQL Server to query big data more easily. The value of the big data greatly increases when it is not just in the hands of the data scientists and big data engineers but is also included in reports, dashboards, and applications used by regular end users. At the same time, the data scientists can continue to use big data ecosystem tools against HDFS while also utilizing easy, real-time access to the high-value data in SQL Server because it is all part of one integrated, complete system.
Azure Data Studio (previously released under the name of SQL Operations Studio) is an open-source, multi-purpose data management and analytics tool for DBAs, data scientists, and data engineers. New extensions for Azure Data Studio integrate the user experience for working with relational data in SQL Server with big data. The new HDFS browser lets analysts, data scientists, and data engineers easily view the HDFS files and directories in the big data cluster, upload/download files, open them, and delete them if needed. The new built-in notebooks in Azure Data Studio are built on Jupyter, enabling data scientists and engineers to write Python, R, or Scala code with Intellisense and syntax highlighting before submitting the code as Spark jobs and viewing the results inline. Notebooks facilitate collaboration between teammates working on a data analysis project together. Lastly, the External Table Wizard, which uses PolyBase connectors, simplifies the process of creating external data sources and tables, including column mappings (it’s much easier than the current way of creating external tables).
There will also be a management service that will provision a bunch of agents on each pod that will collect monitoring data and the logs that can be seen via a browser-based cluster admin portal, which will also provide managed services for HA, backup/recovery, security, and provisioning.
In summary, SQL Server 2019 Big Data Clusters improves the 4 V’s of Big Data with these features:
More info:
Introducing Microsoft SQL Server 2019 Big Data Clusters
What are SQL Server 2019 big data clusters?
SQL Server 2019 Big Data Clusters white paper
Video New Feature Offerings in SQL Server 2019
Video SQL Server vNext meets AI and Big Data
Video The future of SQL Server and big data
Video Deep dive on SQL Server and big data
Big Data Clusters for the Absolute Beginner
Choosing the Right Infrastructure for A SQL Server 2019 Big Cluster
Pingback:Big Data Clusters In SQL Server 2019 – Curated SQL