Query Acceleration for ADLS
Just announced: Query Acceleration for Azure Data Lake Storage Gen2 (ADLS) and Blob Storage. This new capability enables applications and analytics frameworks to dramatically optimize data processing by retrieving only the data they need from storage to perform a given operation. This reduces the time and processing power required to query stored data.
For example, if an application executes a SELECT statement that filters columns and rows from a CSV file, instead of pulling the entire CSV file over the network into the application and then filtering the data, the filtering happens at the time the data is read from disk, so that only the filtered data is transferred over the network to the application. So if you have a CSV file with 50 columns and 1 million rows, but the filters limit the data to 5 columns and 1,000 rows, then only those 5 columns and 1,000 rows are retrieved from disk and sent over the network to the application.
It accomplishes this by pushing down predicates and column projections so they can be applied at the time data is first read from disk, sparing all downstream data handling the cost of filtering and processing unneeded data. This improves network latency and compute cost (one analysis showed that 80% of data is needlessly transferred across the network, parsed, and filtered by applications). Moreover, the CPU load required to parse and filter unneeded data forces your application to provision more, and larger, VMs to do its work. By shifting this compute load to query acceleration, applications can realize significant cost savings.
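To see why the savings can be so large, the 50-column / 1-million-row example above can be quantified with some back-of-the-envelope arithmetic. This is a rough sketch only; the 100-bytes-per-field figure is an assumption for illustration, not a measured value.

```python
# Rough sketch: bytes crossing the network with and without pushdown,
# using the 50-column / 1-million-row example from the text.
BYTES_PER_FIELD = 100  # assumed average size of one CSV field (illustrative)

def transfer_bytes(columns, rows, bytes_per_field=BYTES_PER_FIELD):
    """Approximate bytes sent over the network for a given projection."""
    return columns * rows * bytes_per_field

full_file = transfer_bytes(columns=50, rows=1_000_000)  # whole CSV pulled to the client
accelerated = transfer_bytes(columns=5, rows=1_000)     # only the filtered result returned

print(f"full file:   {full_file / 1e9:.1f} GB")      # 5.0 GB
print(f"accelerated: {accelerated / 1e6:.2f} MB")    # 0.50 MB
print(f"reduction:   {full_file // accelerated:,}x") # 10,000x
```

Even with generous assumptions, the reduction is four orders of magnitude for this query shape, which is why the savings show up in both network transfer and the CPU time the client would otherwise spend parsing discarded data.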
You can use SQL to specify the row filter predicates and column projections in a Query Acceleration request. A request processes only one file. Therefore, advanced relational features of SQL, such as joins and the GROUP BY aggregate, aren’t supported. Query acceleration supports CSV and JSON formatted data as input to each request.
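The following is a minimal conceptual sketch of what the service does with such a request: rows are filtered and columns projected where the data lives, so only the reduced result reaches the caller. The CSV contents and column names are invented for illustration, and this Python code is a stand-in for the service's behavior, not its implementation or API.

```python
import csv
import io

# Toy CSV standing in for a blob; the columns and values are invented.
blob = """id,region,amount,status
1,west,250,ok
2,east,900,error
3,west,120,ok
4,north,700,error
"""

def server_side_filter(raw_csv, wanted_cols, predicate):
    """Conceptual stand-in for predicate/projection pushdown: rows are
    filtered and columns projected at the storage side, so only the
    reduced result is 'returned' to the caller."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [{c: row[c] for c in wanted_cols}
            for row in reader if predicate(row)]

# Roughly the effect of: SELECT id, amount FROM BlobStorage WHERE status = 'error'
result = server_side_filter(blob, ["id", "amount"],
                            lambda r: r["status"] == "error")
print(result)  # [{'id': '2', 'amount': '900'}, {'id': '4', 'amount': '700'}]
```

The single-blob scope is visible here too: the sketch operates on one file's contents, which is why cross-file features like joins don't apply.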
The following diagram illustrates how a typical application uses Query Acceleration to process data:
- The client application requests file data by specifying predicates and column projections
- Query Acceleration parses the specified query and distributes work to parse and filter data
- Processors read the data from disk, parse it using the appropriate format, and then filter it by applying the specified predicates and column projections. Azure Storage already contains a non-trivial amount of compute to implement its storage functionality (e.g., serving requests, encrypting/decrypting, attaching JBODs), and Query Acceleration is simply allocated a quota of this resource to do its work
- Query Acceleration combines the response shards and streams them back to the client application
- The client application receives and parses the streamed response. The application doesn’t need to filter any additional data and can apply the desired calculation or transformation directly
Query acceleration supports an ANSI SQL-like language for expressing queries over blob contents. The query acceleration SQL dialect is a subset of ANSI SQL, with a limited set of supported data types, operators, etc., but it also expands on ANSI SQL to support queries over hierarchical semi-structured data formats such as JSON.
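To illustrate the hierarchical aspect, here is a sketch of the kind of query you could express over JSON data, again simulated in Python rather than the actual service. The field names and records are invented, and the SQL shown in the comment is only an approximation of the dialect; see the SQL language reference for the real syntax.

```python
import json

# Toy JSON-lines blob; the fields and values are invented for illustration.
blob = "\n".join([
    json.dumps({"user": {"name": "ana", "tier": "gold"}, "spend": 120}),
    json.dumps({"user": {"name": "bo",  "tier": "free"}, "spend": 5}),
    json.dumps({"user": {"name": "cy",  "tier": "gold"}, "spend": 80}),
])

def query_json_lines(raw, project, predicate):
    """Conceptual stand-in for a hierarchical query along the lines of:
    SELECT user.name FROM BlobStorage WHERE user.tier = 'gold'
    The predicate and projection reach into nested structure, which is
    what the dialect adds beyond flat ANSI SQL."""
    rows = [json.loads(line) for line in raw.splitlines() if line]
    return [project(r) for r in rows if predicate(r)]

names = query_json_lines(blob,
                         project=lambda r: r["user"]["name"],
                         predicate=lambda r: r["user"]["tier"] == "gold")
print(names)  # ['ana', 'cy']
```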
Right now this works via Java and .NET; in the future it will also work with tools such as Python and Azure Synapse Analytics (Microsoft is actively working with OSS and commercial partners to integrate Query Acceleration into these frameworks, as appropriate).
Due to the increased compute load within the ADLS service, the pricing model for using query acceleration differs from the normal ADLS transaction model. Query acceleration charges a cost for the amount of data scanned as well as a cost for the amount of data returned to the caller.
Despite this change to the billing model, query acceleration's pricing is designed to lower the total cost of ownership for a workload, given the reduction in the much more expensive VM costs.
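To make the two-part billing model concrete, here is a sketch of the scanned-plus-returned calculation. Every number below (the per-GB rates, the workload size, and the VM figures) is invented for illustration only; consult the pricing page for actual rates.

```python
# Hypothetical cost comparison. All rates and figures below are invented
# for illustration; they are not real Azure prices.
SCAN_RATE = 0.002    # assumed $/GB scanned by query acceleration
RETURN_RATE = 0.0007 # assumed $/GB returned to the caller

def query_accel_cost(gb_scanned, gb_returned):
    """Two-part charge: one rate for data scanned, another for data returned."""
    return gb_scanned * SCAN_RATE + gb_returned * RETURN_RATE

# Example workload: scan 1 TB of blobs, return 50 GB after filtering.
qa_cost = query_accel_cost(1024, 50)

# Without pushdown, the client downloads and filters the full 1 TB itself,
# which (in this invented scenario) keeps a VM busy for 10 extra hours.
vm_hourly = 0.50  # assumed VM price per hour
client_side_cost = 10 * vm_hourly

print(f"query acceleration: ${qa_cost:.2f}")          # $2.08
print(f"extra VM time:      ${client_side_cost:.2f}") # $5.00
```

The point of the model is the second line item: because you pay per GB returned, the more selective your query, the cheaper the request, on top of the VM time you no longer need.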
To find out more about Query Acceleration for Azure Data Lake Storage you can:
- Sign up for the Azure Data Lake Storage Query Acceleration Public Preview program. The query acceleration feature is in public preview, and is available in the Canada Central and France Central regions
- Read the Azure Data Lake Storage query acceleration documentation
- Learn how to use Query Acceleration for Java and .NET
- Learn how to Filter data by using Azure Data Lake Storage query acceleration
- Learn about Query acceleration SQL language reference
- Understand the pricing model for Query Acceleration. Public preview pricing for Query Acceleration is offered at a 50% discount
Very cool new capability! Since Power BI Dataflows uses ADLS on the backend, do you know if that platform will also see the benefits of Query Acceleration at some point (despite the in-progress work on the enhanced compute engine, which leverages a SQL-cache)?
Hi James,
just for clarification: you mention in your article that “advanced relational features of SQL, such as joins and group by aggregates” aren’t supported. But then on the official site it seems like aggregate expressions are supported:
https://docs.microsoft.com/en-us/azure/storage/blobs/query-acceleration-sql-reference
Thank you.
Regards,
Gapy
Hi Gapy,
I meant the GROUP BY is not supported. I updated the blog to clarify. Thanks!
Hi James,
We often have many small, identically structured files in a directory tree that we want to extract data from.
Will it be possible to query a bunch of such files across a directory and its subdirectories, like in Spark
(Ex: df = spark.read.option("multiline", "true").json("/mnt/jsonfiles/*.json"))
?
Br Dag
I came across Query Acceleration recently, and it improved some of my data pulls from tens of minutes to tens of seconds. I wonder if this could be further improved by indexing JSON / CSV files the way Azure Cognitive Search does. It would be great to get query times down to tens of milliseconds!