Data lake details
I have blogged before about data lakes (see What is a data lake? and Why use a data lake?), and wanted to provide more details on this popular technology, some of which I cover in my presentation “Big data architectures and the data lake”. In short, the data lake is a storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
The data lake should be the center of just about any big data solution for these major reasons:
- Inexpensively store unlimited data
- Centralized place for multiple subjects (single version of the truth)
- Collect all data “just in case” (data hoarding). The data lake is a good place for data that you “might” use down the road
- Easy integration of differently-structured data
- Store data with no modeling – “Schema on read”
- Complements enterprise data warehouse (EDW)
- Frees up expensive EDW resources for queries by offloading transformations from the EDW (avoiding user contention)
- Lets you use technologies/tools (e.g. Databricks) that refine/filter data quicker and better than your EDW
- Quick user access to data for power users/data scientists (allowing for faster ROI)
- Data exploration to see if the data is valuable before writing ETL and a schema for a relational database, or before using it for a one-time report
- Allows use of Hadoop tools for ETL and extreme analytics
- Place to land IoT streaming data
- On-line archive or backup for data warehouse data
- With Hadoop/ADLS, high availability and disaster recovery built in
- It can ingest large files quickly and provide data redundancy
- ELT jobs on the EDW are taking too long because of increasing data volumes and an increasing rate of ingestion (velocity), so offload some of them to the Hadoop data lake
- Have a backup of the raw data in case you need to load it again due to an ETL error (and not have to go back to the source). You can keep a long history of raw data
- Allows for data to be used many times for different analytic needs and use cases
- Cost savings and faster transformations: storage tiers with lifecycle management; separation of storage and compute resources allowing multiple instances of different sizes working with the same data simultaneously vs scaling data warehouse; low-cost storage for raw data saving space on the EDW
- Extreme performance for transformations by having multiple compute options each accessing different folders containing data
- A data lake has the benefit of applying any number of different compute types to the data (e.g. Databricks, HDInsight, etc.), whereas with a relational database you have only one choice for compute (see the sketch after this list)
- The ability for an end-user or product to easily access the data from any location
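To make the separation of storage and compute concrete, here is a minimal sketch of two different compute engines working against the same lake folder at the same time. The storage account, container and column names are made up, and it assumes a Spark environment (e.g. Databricks or HDInsight) plus the adlfs package for the pandas read:

```python
# Hypothetical ADLS path for illustration only
LAKE_PATH = "abfss://lake@mystorageacct.dfs.core.windows.net/raw/sales/"

# Compute option 1: a Spark cluster (Databricks, HDInsight, etc.) refines the data
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-refinement").getOrCreate()
sales = spark.read.parquet(LAKE_PATH)                   # schema read from the Parquet files
daily_totals = sales.groupBy("order_date").agg(F.sum("amount").alias("total"))

# Compute option 2: a small single-node pandas job reads the very same files,
# sized and billed independently of the Spark cluster above
import pandas as pd

sample = pd.read_parquet(
    LAKE_PATH,
    storage_options={"account_name": "mystorageacct"},  # handled by adlfs/fsspec
)
```

Because the data sits in the lake rather than inside a single database engine, each workload can pick the engine and cluster size that fits it instead of competing for one data warehouse's resources.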
If you are wondering if structured data should also be copied to a data lake, check out Should I load structured data into my data lake?
The data lake introduces a paradigm shift in data analysis:
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
This allows you to avoid a lot of up-front work before you are able to analyze the data. With the old way, you have to know the questions to ask up front. The new way supports situations where you don’t yet know the questions to ask.
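As a rough illustration of “schema on read”, here is a hedged PySpark sketch of the new flow; the paths and column names are invented and assume clickstream JSON already landed in the raw layer:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Ingest: the raw JSON files were landed untouched; no modeling happened up front
raw = spark.read.json("abfss://lake@myacct.dfs.core.windows.net/raw/clickstream/")

# Analyze: the schema is inferred at read time, so exploration can start immediately
raw.printSchema()
raw.groupBy("page").count().orderBy(F.desc("count")).show(10)

# Structure: only once the data proves valuable do we project a curated shape
curated = raw.select(
    "user_id",
    "page",
    F.to_timestamp("event_time").alias("event_ts"),
)
curated.write.mode("overwrite").parquet(
    "abfss://lake@myacct.dfs.core.windows.net/cleansed/clickstream/"
)
```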
This addresses the two biggest reasons why many EDW projects fail:
- Too much time spent modeling when you don’t know all of the questions your data needs to answer
- Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value
Data Lake layers (a sketch of one possible folder layout follows this list):
- Raw data layer – Raw events are stored for historical reference. Also called the staging layer or landing area
- Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize the way files are stored in terms of encoding, format, data types and content (e.g. strings). Also called the conformed layer
- Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. a DW application, an advanced analytics process, etc.). This layer goes by many other names: workspace, trusted, gold, secure, production-ready, governed
- Sandbox data layer – Optional layer to “play” in. Also called the exploration layer or data science workspace
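One hypothetical way to map these layers onto lake folders and promote data between them is sketched below with PySpark; the folder names, columns and business rule are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical zone folders within the lake
RAW         = "abfss://lake@myacct.dfs.core.windows.net/raw/customers/"
CLEANSED    = "abfss://lake@myacct.dfs.core.windows.net/cleansed/customers/"
APPLICATION = "abfss://lake@myacct.dfs.core.windows.net/application/customer-360/"

spark = SparkSession.builder.appName("zone-promotion").getOrCreate()

# Raw -> Cleansed: standardize encoding, data types and string content; keep every row
raw_df = spark.read.option("encoding", "UTF-8").csv(RAW, header=True)
cleansed_df = (
    raw_df
    .withColumn("customer_id", F.col("customer_id").cast("long"))
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("signup_date", F.to_date("signup_date", "yyyy-MM-dd"))
)
cleansed_df.write.mode("overwrite").parquet(CLEANSED)

# Cleansed -> Application: apply business logic so the result is directly consumable
active_customers = spark.read.parquet(CLEANSED).filter(F.col("signup_date") >= "2015-01-01")
active_customers.write.mode("overwrite").parquet(APPLICATION)
```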
Be aware that you still need data governance so your data lake does not turn into a data swamp! Having a data lake does not lessen the data governance that you would normally apply when building a relational data warehouse. In fact, it usually requires more data governance. So look at tools such as the Azure Data Catalog to keep track of all your data.
Data typically lands in products such as Hadoop Distributed File System (HDFS) or the Azure Data Lake Store (ADLS). Since these technologies are just storage, you need other technologies for the compute part. For example, here are ways to clean data in ADLS:
- HDInsight with Hive or Pig or MapReduce
- Hortonworks/Cloudera/MapR IaaS. Available in the Azure Marketplace
- U-SQL in Azure Data Lake Analytics (ADLA)
- PolyBase in SQL Data Warehouse (SQL DW) or PolyBase in SQL Server 2016 IaaS
Since you will typically have both a Data Lake and a relational data warehouse when building a big data solution, it’s important to understand the roles of each:
Data Lake/Hadoop (staging and processing environment)
- Batch reporting
- Data refinement/cleaning
- ETL workloads
- Store historical data
- Sandbox for data exploration
- One-time reports
- Data scientist workloads
- Quick results
Data Warehouse/RDBMS (serving and compliance environment)
- Low latency
- High number of users
- Additional security
- Large support for tools
- Easily create reports (self-service BI): A data lake is just a glorified file folder with data files in it – how many end users can accurately create reports from it? Very few, hence a major reason to use an RDBMS
A question I get frequently is whether all relational data should be copied to the data lake, especially if some of that data is not needed for reporting or analytics in the data lake (just in the data warehouse). There still could be reasons to copy the data to the data lake, such as for backup purposes, to use low-cost storage for raw data and save space on the data warehouse, to use Hadoop tools, or to offload the refining of the data from the data warehouse (especially if ETL jobs on the data warehouse are taking too long). But in some cases you may want to skip the data lake, especially if you have many existing SSIS packages, as you can minimize the changes to them (i.e. just changing the destination). Also keep in mind the extra time needed to export data out of a relational database into a flat file before copying it to the data lake.
One issue to be aware of is when you have to move a lot of data each day from a relational database to the data lake. For high volumes of data you may want to look at 3rd-party Change Data Capture (CDC) products, such as HVR and Attunity, to get updates from your source system into the data lake. Performing updates to rows of data in files sitting in a Hadoop/HDFS data lake can be very slow compared to appends.
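If a full CDC product is not an option, a simple watermark-based incremental extract is one way to offload daily changes into the lake as appends rather than in-place updates. The sketch below is hedged: the server, table, column and folder names are invented, and it assumes Spark with a SQL Server JDBC driver available:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-extract").getOrCreate()

# Normally read from a control table/file; hard-coded here for illustration
last_watermark = "2023-06-01 00:00:00"

changed_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=Sales")
    .option("dbtable", f"(SELECT * FROM dbo.Orders WHERE ModifiedDate > '{last_watermark}') AS src")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .load()
)

# Append the changed rows as new files partitioned by load date; rewriting existing
# files to apply row-level updates in HDFS/ADLS is far slower than appending
(
    changed_rows
    .withColumn("load_date", F.current_date())
    .write.mode("append")
    .partitionBy("load_date")
    .parquet("abfss://lake@myacct.dfs.core.windows.net/raw/orders/")
)
```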
More info:
Data Lake Use Cases and Planning Considerations
Video: A Deep Dive Into Data Lakes
FAQs About Organizing a Data Lake
Unlock the value of data faster through Modern Data Warehousing
The Fast-Shifting Data Landscape: Data Lakes and Data Warehouses, Working in Tandem