Databricks Delta Lake
Introduced in April 2019, Databricks Delta Lake is, in short, a transactional storage layer that runs on top of cloud storage such as Azure Data Lake Storage (ADLS) Gen2 and adds a layer of reliability to organizational data lakes by enabling features such as ACID transactions, data versioning, and rollback. It does not replace your storage system. It is a Spark-specific extension designed for cloud storage, and it has been open sourced; the code can be found here. Full documentation is at the Delta Lake Guide.
It’s really easy to use: when writing code in Databricks, instead of using “parquet” for the format property, just use “delta”. So instead of having data land in your cloud storage in its native format, it lands in Parquet format along with a transaction log, which enables the features described below. See the Delta Lake Quickstart.
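As a minimal sketch in PySpark (the mount paths are hypothetical, and `spark` is the SparkSession a Databricks notebook provides), the only change from writing plain Parquet is the format string:

```python
# A minimal sketch; the paths below are hypothetical.
df = spark.range(0, 1000)

# Writing plain Parquet -- no transaction log, no ACID guarantees
df.write.format("parquet").save("/mnt/datalake/events_parquet")

# Writing Delta -- same Parquet files underneath, plus a _delta_log directory
df.write.format("delta").save("/mnt/datalake/events_delta")

# Reading it back works the same way
events = spark.read.format("delta").load("/mnt/datalake/events_delta")
```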
Some of the top features are:
- ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data via the implementation of a transaction log, which includes checkpoint support. See Diving Into Delta Lake: Unpacking The Transaction Log
- Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease
- Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink, providing a solution for a Lambda architecture but going one step further, since both batch and real-time data land in the same sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box
- Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion. See Diving Into Delta Lake: Schema Enforcement & Evolution
- Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments. This allows queries to pull data as it was at a point in the past, using either a version number or a timestamp. In both cases, Delta guarantees that the query will return the complete dataset as it was at that version or timestamp. It can be used for data recovery following an unwanted change or job run, and it can help identify all the changes that occurred to a table during a time interval. In a disaster recovery scenario, Delta time travel can be used to replicate the data from the DR site back to the primary site and reduce the work and time until the business is back to normal (a read sketch is shown after this list). See Introducing Delta Time Travel for Large Scale Data Lakes
- Upserts and deletes: Supports merge, update, and delete operations to enable complex use cases like change data capture, slowly changing dimension (SCD) operations, streaming upserts, etc. (see the merge sketch after this list). See Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs
- Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet
- 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine
- Performance: Delta boasts query performance 10 to 100 times faster than Apache Spark on Parquet. It accomplishes this via data skipping (Delta maintains file statistics so that only relevant portions of the data are read in a query), compaction (Delta manages the file sizes of the underlying Parquet files for the most efficient use), and data caching (Delta automatically caches highly accessed data to improve run times for commonly run queries), as well as other optimizations
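To make the time travel feature concrete, here is a rough sketch in PySpark; the path, the registered table name `events`, and the dates are all hypothetical:

```python
# A sketch of time travel reads; path, table name, and dates are placeholders.
# Read the table as of a specific version number...
v5 = (spark.read.format("delta")
      .option("versionAsOf", 5)
      .load("/mnt/datalake/events_delta"))

# ...or as of a timestamp
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2019-10-01")
            .load("/mnt/datalake/events_delta"))

# SQL equivalents, assuming a table named "events" is registered
spark.sql("SELECT * FROM events VERSION AS OF 5")
spark.sql("SELECT * FROM events TIMESTAMP AS OF '2019-10-01'")
```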
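And for upserts, a sketch of a merge using the Delta Lake Python API; the target path, the join key `id`, and `updates_df` (a DataFrame of incoming changes) are assumptions for illustration:

```python
from delta.tables import DeltaTable

# A sketch of an upsert into an existing Delta table (path is hypothetical)
target = DeltaTable.forPath(spark, "/mnt/datalake/events_delta")

(target.alias("t")
       .merge(updates_df.alias("u"), "t.id = u.id")  # updates_df is assumed
       .whenMatchedUpdateAll()      # update rows that match on the key
       .whenNotMatchedInsertAll()   # insert rows that do not
       .execute())
```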
Bronze/Silver/Gold are just layers in your data lake. Bronze is raw ingestion, Silver is the filtered and cleaned data, and Gold is business-level aggregates. This is just a suggestion on how to organize your data lake, with each layer having various Delta Lake tables that contain the data (a rough pipeline sketch is shown below). Delta Lake tables are a combination of Parquet-based storage, a Delta transaction log, and Delta indexes (so updating the indexes and ACID support will slow down ingestion performance a bit). The tables can only be written/read by a Delta cluster. See Productionizing Machine Learning with Delta Lake.
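A rough sketch of that layering in PySpark; all the paths, columns, and cleansing rules here are hypothetical:

```python
# Bronze: land the raw files as-is into a Delta table
raw = spark.read.json("/mnt/landing/iot/")
raw.write.format("delta").mode("append").save("/mnt/datalake/bronze/iot")

# Silver: filter and clean the bronze data
bronze = spark.read.format("delta").load("/mnt/datalake/bronze/iot")
silver = (bronze.dropDuplicates(["deviceId", "eventTime"])
                .filter("temperature IS NOT NULL"))
silver.write.format("delta").mode("append").save("/mnt/datalake/silver/iot")

# Gold: business-level aggregates
gold = silver.groupBy("deviceId").avg("temperature")
gold.write.format("delta").mode("overwrite").save("/mnt/datalake/gold/iot_avg_temp")
```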
Having a Delta Lake does not mean you don’t need a relational data warehouse in your modern data warehouse solution. While Delta Lake can store and process data faster and more easily than a relational data warehouse and can scale better, it is not a replacement for a data warehouse, as it is not as robust and performant, among other reasons (see Is the traditional data warehouse dead?).
I usually see customers using Delta Lake for the staging and processing of data, supporting both streaming and batch. As for the final product, I see it used for drill-down, infrequent, or large data queries, and for real-time dashboards.
One big clarification: the transaction log in Delta Lake sounds a lot like the one you would find in a relational database, which leads customers to ask if Delta Lake can support OLTP. The answer is a big NO. This is not the domain of the product and I would not recommend it. You can stream a large number of inserts to Delta (a sketch is shown below), but updates and deletes will slow down the operation, as they involve file copies and versioning.
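For reference, a sketch of append-only streaming ingestion into a Delta table; the source format, schema, paths, and checkpoint location are all hypothetical:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# An assumed schema for the incoming JSON events
event_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
])

# Continuously append incoming events to a Delta table
(spark.readStream
      .format("json")
      .schema(event_schema)
      .load("/mnt/landing/events/")
      .writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/mnt/datalake/_checkpoints/events")
      .start("/mnt/datalake/bronze/events"))
```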
Another thing to note is that Delta Lake does not support true data lineage (the ability to track when and how data has been copied within the Delta Lake), but it does have auditing and versioning (storing the old schemas in the metadata). Also, Databricks has partnered with Immuta and Privacera, which provide row/column-level security in Delta Lake like you find with Apache Ranger in HDFS.
One limitation is that you can’t access Delta Lake tables outside of the Databricks Runtime, as ingestion requires using the Delta ACID API (via Spark Streaming) and running queries requires the Delta JDBC connector (with the one exception of static data sets, but it’s not easy – see Can I access Delta Lake tables outside of Databricks Runtime?). So most use of Delta Lake is within the Databricks ecosystem, as you can’t copy data from the Delta Lake to downstream products like Azure SQL Data Warehouse, but expect this to change as other 3rd-party products, along with Hive and Presto, build native readers for Delta Lake. Details on how to connect Power BI to Delta Lake can be found here.
More info:
Databricks Delta Lake vs Data Lake ETL: Overview and Comparison
A standard for storing big data? Apache Spark creators release open-source Delta Lake
A Deep Dive Into Databricks Delta
Stream IoT sensor data from Azure IoT Hub into Databricks Delta Lake
Video Simplify and Scale Data Engineering Pipelines with Delta Lake
Video Delta Architecture, A Step Beyond Lambda Architecture
Video Making Apache Spark™ Better with Delta Lake
Video Delta Lake – Open Source Reliability for Data Lakes
Azure Databricks: Delta Lake, Part 1
Diving Into Delta Lake: DML Internals (Update, Delete, Merge)
Hi James,
Can you explain this paragraph:
“It’s really easy to use: when writing code in Databricks, instead of using “parquet” for the format property, just use “delta”. So instead of having data land in your cloud storage in its native format, it lands in Parquet format along with a transaction log, which enables the features described below.”
It’s explained at: https://docs.databricks.com/delta/quick-start.html
Making Apache Spark™ Better with Delta Lake
The above video link does not exist. Please repost.
Fixed! Thanks for pointing it out!
Delta Lake is getting pretty wide adoption now. I’m trying to think about Synapse scenarios where I DON’T want to use Delta Lake and just stick to Parquet. Maybe a read-only Raw layer is better in vanilla Parquet?