Data lake details
I have blogged before about data lakes (see What is a data lake? and Why use a data lake?), and wanted to provide more details on this popular technology, some of which I cover in my presentation “Big data architectures and the data lake”. In short, the data lake is a storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
The data lake should be the center of just about any big data solution for these major reasons:
- Inexpensively store unlimited data
- Centralized place for multiple subjects (single version of the truth)
- Collect all data “just in case” (data hoarding). The data lake is a good place for data that you “might” use down the road
- Easy integration of differently-structured data
- Store data with no modeling – “Schema on read”
- Complements enterprise data warehouse (EDW)
- Frees up expensive EDW resources for queries by offloading transformations from the EDW (avoiding user contention)
- Lets you use technologies/tools (e.g. Databricks) that refine/filter data quicker and better than your EDW
- Quick user access to data for power users/data scientists (allowing for faster ROI)
- Data exploration to see if the data is valuable before writing ETL and a schema for a relational database, or before using it for a one-time report
- Allows use of Hadoop tools for ETL and extreme analytics
- Place to land IoT streaming data
- On-line archive or backup for data warehouse data
- With Hadoop/ADLS, high availability and disaster recovery built in
- It can ingest large files quickly and provide data redundancy
- ELT jobs on the EDW are taking too long because of increasing data volumes and an increasing rate of ingestion (velocity), so offload some of them to the Hadoop data lake
- Have a backup of the raw data in case you need to load it again due to an ETL error (and not have to go back to the source). You can keep a long history of raw data
- Allows for data to be used many times for different analytic needs and use cases
- Cost savings and faster transformations: storage tiers with lifecycle management; separation of storage and compute resources allowing multiple instances of different sizes working with the same data simultaneously vs scaling data warehouse; low-cost storage for raw data saving space on the EDW
- Extreme performance for transformations by having multiple compute options each accessing different folders containing data
- A data lake has the benefit of applying any number of different compute types to the data (e.g. Databricks, HDInsight, etc.), whereas with a relational database you have only one choice for compute (see the sketch after this list)
- The ability for an end-user or product to easily access the data from any location
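To make the separation of storage and compute concrete, here is a minimal sketch of two different compute engines working against the same lake folder at the same time. The storage account, container and column names are made up, and it assumes a Spark environment (e.g. Databricks or HDInsight) plus the adlfs package for the pandas read:

```python
# Hypothetical ADLS path for illustration only
LAKE_PATH = "abfss://lake@mystorageacct.dfs.core.windows.net/raw/sales/"

# Compute option 1: a Spark cluster (Databricks, HDInsight, etc.) refines the data
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-refinement").getOrCreate()
sales = spark.read.parquet(LAKE_PATH)                   # schema read from the Parquet files
daily_totals = sales.groupBy("order_date").agg(F.sum("amount").alias("total"))

# Compute option 2: a small single-node pandas job reads the very same files,
# sized and billed independently of the Spark cluster above
import pandas as pd

sample = pd.read_parquet(
    LAKE_PATH,
    storage_options={"account_name": "mystorageacct"},  # handled by adlfs/fsspec
)
```

Because the data sits in the lake rather than inside a single database engine, each workload can pick the engine and cluster size that fits it instead of competing for one data warehouse's resources.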
If you are wondering if structured data should also be copied to a data lake, check out Should I load structured data into my data lake?
The data lake introduces a paradigm shift in data analysis:
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
This allows you to avoid a lot of up-front work before you are able to analyze the data. With the old way, you have to know the questions to ask up front. The new way supports situations where you don’t yet know the questions to ask.
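As a rough illustration of “schema on read”, here is a hedged PySpark sketch of the new flow; the paths and column names are invented and assume clickstream JSON already landed in the raw layer:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Ingest: the raw JSON files were landed untouched; no modeling happened up front
raw = spark.read.json("abfss://lake@myacct.dfs.core.windows.net/raw/clickstream/")

# Analyze: the schema is inferred at read time, so exploration can start immediately
raw.printSchema()
raw.groupBy("page").count().orderBy(F.desc("count")).show(10)

# Structure: only once the data proves valuable do we project a curated shape
curated = raw.select(
    "user_id",
    "page",
    F.to_timestamp("event_time").alias("event_ts"),
)
curated.write.mode("overwrite").parquet(
    "abfss://lake@myacct.dfs.core.windows.net/cleansed/clickstream/"
)
```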
This addresses the two biggest reasons why many EDW projects fail:
- Too much time spent modeling when you don’t know all of the questions your data needs to answer
- Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value
Data Lake layers (a sketch of one possible folder layout follows this list):
- Raw data layer – Raw events are stored for historical reference. Also called the staging layer or landing area
- Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize the way files are stored in terms of encoding, format, data types and content (e.g. strings). Also called the conformed layer
- Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. a DW application, an advanced analytics process, etc.). This layer goes by many other names: workspace, trusted, gold, secure, production-ready, governed
- Sandbox data layer – Optional layer to “play” in. Also called the exploration layer or data science workspace
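One hypothetical way to map these layers onto lake folders and promote data between them is sketched below with PySpark; the folder names, columns and business rule are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical zone folders within the lake
RAW         = "abfss://lake@myacct.dfs.core.windows.net/raw/customers/"
CLEANSED    = "abfss://lake@myacct.dfs.core.windows.net/cleansed/customers/"
APPLICATION = "abfss://lake@myacct.dfs.core.windows.net/application/customer-360/"

spark = SparkSession.builder.appName("zone-promotion").getOrCreate()

# Raw -> Cleansed: standardize encoding, data types and string content; keep every row
raw_df = spark.read.option("encoding", "UTF-8").csv(RAW, header=True)
cleansed_df = (
    raw_df
    .withColumn("customer_id", F.col("customer_id").cast("long"))
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("signup_date", F.to_date("signup_date", "yyyy-MM-dd"))
)
cleansed_df.write.mode("overwrite").parquet(CLEANSED)

# Cleansed -> Application: apply business logic so the result is directly consumable
active_customers = spark.read.parquet(CLEANSED).filter(F.col("signup_date") >= "2015-01-01")
active_customers.write.mode("overwrite").parquet(APPLICATION)
```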
Be aware that you still need data governance so your data lake does not turn into a data swamp! Having a data lake does not lessen the data governance that you would normally apply when building a relational data warehouse. In fact, it usually requires more data governance. So look at tools such as the Azure Data Catalog to keep track of all your data.
Data typically lands in products such as Hadoop Distributed File System (HDFS) or the Azure Data Lake Store (ADLS). Since these technologies are just storage, you need other technologies for the compute part. For example, here are ways to clean data in ADLS:
- HDInsight with Hive or Pig or MapReduce
- Hortonworks/Cloudera/MapR IaaS. Available in the Azure Marketplace
- U-SQL in Azure Data Lake Analytics (ADLA)
- PolyBase in SQL Data Warehouse (SQL DW) or PolyBase in SQL Server 2016 IaaS
Since you will typically have both a Data Lake and a relational data warehouse when building a big data solution, it’s important to understand the roles of each:
Data Lake/Hadoop (staging and processing environment)
- Batch reporting
- Data refinement/cleaning
- ETL workloads
- Store historical data
- Sandbox for data exploration
- One-time reports
- Data scientist workloads
- Quick results
Data Warehouse/RDBMS (serving and compliance environment)
- Low latency
- High number of users
- Additional security
- Large support for tools
- Easily create reports (self-service BI): A data lake is just a glorified file folder with data files in it – how many end users can accurately create reports from it? Very few, hence a major reason to use an RDBMS
A question I get frequently is whether all relational data should be copied to the data lake, especially if some of that data is not needed for reporting or analytics in the data lake (just in the data warehouse). There still could be reasons to copy the data to the data lake, such as for backup purposes, to use low-cost storage for raw data and save space on the data warehouse, to use Hadoop tools, or to offload the refining of the data from the data warehouse (especially if ETL jobs on the data warehouse are taking too long). But in some cases you may want to skip the data lake, especially if you have many existing SSIS packages, as you can minimize the changes to them (i.e. just changing the destination). Also keep in mind the extra time needed to export data out of a relational database into a flat file before copying it to the data lake.
One issue to be aware of is when you have to move a lot of data each day from a relational database to the data lake. For high volumes of data you may want to look at 3rd-party Change Data Capture (CDC) products, such as HVR and Attunity, to get updates from your source system into the data lake. Performing updates to rows of data in files sitting in a Hadoop/HDFS data lake can be very slow compared to appends.
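If a full CDC product is not an option, a simple watermark-based incremental extract is one way to offload daily changes into the lake as appends rather than in-place updates. The sketch below is hedged: the server, table, column and folder names are invented, and it assumes Spark with a SQL Server JDBC driver available:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-extract").getOrCreate()

# Normally read from a control table/file; hard-coded here for illustration
last_watermark = "2023-06-01 00:00:00"

changed_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=Sales")
    .option("dbtable", f"(SELECT * FROM dbo.Orders WHERE ModifiedDate > '{last_watermark}') AS src")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .load()
)

# Append the changed rows as new files partitioned by load date; rewriting existing
# files to apply row-level updates in HDFS/ADLS is far slower than appending
(
    changed_rows
    .withColumn("load_date", F.current_date())
    .write.mode("append")
    .partitionBy("load_date")
    .parquet("abfss://lake@myacct.dfs.core.windows.net/raw/orders/")
)
```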
More info:
Data Lake Use Cases and Planning Considerations
Video: A Deep Dive Into Data Lakes
FAQs About Organizing a Data Lake
Unlock the value of data faster through Modern Data Warehousing
The Fast-Shifting Data Landscape: Data Lakes and Data Warehouses, Working in Tandem