Hadoop and Data Warehouses
I see a lot of confusion when it comes to Hadoop and its role in a data warehouse solution. Hadoop should not be a replacement for a data warehouse, but rather should augment/complement a data warehouse. Hadoop and a data warehouse will often work together in a single information supply chain: Hadoop excels at handling raw, non-relational, and complex data with vast programming flexibility; data warehouses, on the other hand, manage relational data, integrating subject areas and providing interactive performance through BI tools. As an example in manufacturing, let’s look at using Hadoop alongside a data warehouse running on a Parallel Data Warehouse (PDW) appliance:
There are three main use cases for Hadoop with a data warehouse, with the above picture being an example of use case 3:
- Archiving data warehouse data to Hadoop (move)
  Hadoop as cold storage/long-term raw data archiving:
  – So you don’t need to buy a bigger PDW, move older data to a USB drive, or delete it
- Exporting relational data to Hadoop (copy)
  Hadoop as backup/DR, analysis, cloud use:
  – Export conformed dimensions to compare incoming raw data with what is already in PDW
  – Can use dimensions against older fact tables
  – Sending validated relational data to Hadoop
  – Hadoop data to WASB and have that used by other tools/products (i.e. Cloud ML Studio)
  – Incremental Hadoop load / report
- Importing Hadoop data into the data warehouse (copy)
  Hadoop as staging area (also called landing zone or data lake):
  – Great for real-time data, social networks, sensor data, log data, automated data, RFID data (ambient data)
  – Where you can capture the data and only pass the relevant data to PDW (see the PolyBase sketch after this list)
  – Can do processing of the data (“data refining”) as it sits in Hadoop (clean it, aggregate it, transform it)
  – Offloading workloads to the staging area frees resources on the core DW
  – Some processing is better done on Hadoop instead of SSIS
  – Way to keep staging data
  – Long-term raw data archiving on cheap storage that is online all the time (instead of tape) – great if you need to keep the data for legal reasons
  – Others can do analysis on it and later pull it into the data warehouse if they find something useful
  – Allows for schema flexibility, as Hadoop is “schema on read” and can ingest any data format in its original form, including evolving schemas
  – In the end, Hadoop becomes just another data store, albeit one for multi-structured data
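To make the staging-area and archiving use cases concrete, here is a minimal PolyBase sketch. It assumes a recent APS/PDW appliance update (or SQL Server 2016+ with PolyBase enabled); every object name, the Hadoop namenode address, and the HDFS paths are hypothetical placeholders:

```sql
-- Point PolyBase at the Hadoop cluster (namenode address is a placeholder).
CREATE EXTERNAL DATA SOURCE HadoopStaging
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.0.0.10:8020'
);

-- Describe the layout of the raw delimited text files.
CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- Schema on read: this is only metadata over the files in /staging/sensors/.
CREATE EXTERNAL TABLE ext.SensorReadings
(
    SensorId    INT,
    ReadingTime DATETIME2,
    Reading     FLOAT
)
WITH (
    LOCATION    = '/staging/sensors/',
    DATA_SOURCE = HadoopStaging,
    FILE_FORMAT = PipeDelimited
);

-- Pass only the relevant data on to PDW (use case 3: import/copy).
INSERT INTO dw.FactSensorReadings (SensorId, ReadingTime, Reading)
SELECT SensorId, ReadingTime, Reading
FROM   ext.SensorReadings
WHERE  ReadingTime >= '2015-01-01';

-- Going the other way (use case 1: archive cold facts out to cheap storage),
-- PDW/APS supports CREATE EXTERNAL TABLE AS SELECT:
CREATE EXTERNAL TABLE ext.FactSales_Archive
WITH (
    LOCATION    = '/archive/factsales/',
    DATA_SOURCE = HadoopStaging,
    FILE_FORMAT = PipeDelimited
)
AS
SELECT *
FROM   dw.FactSales
WHERE  SalesDate < '2010-01-01';
```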
Note there will still be a need for some Hadoop skills: loading data into Hadoop (unless it is all done through PolyBase), maintenance, cleaning up files, managing data, etc. But people who need access to the data don’t need any Hadoop skills or special connections – they can use all the skills they have today.
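That last point is worth illustrating: once the external table from the sketch above exists, querying the Hadoop data is ordinary T-SQL (same hypothetical tables), and PolyBase handles the HDFS/MapReduce work behind the scenes:

```sql
-- An analyst needs nothing beyond the T-SQL they already know: the HDFS data
-- joins to a PDW dimension like any other table.
SELECT   d.PlantName,
         AVG(s.Reading) AS AvgReading
FROM     ext.SensorReadings AS s      -- data physically in Hadoop
JOIN     dw.DimSensor       AS d      -- data physically in PDW
         ON d.SensorId = s.SensorId
GROUP BY d.PlantName
ORDER BY AvgReading DESC;
```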
Here are some of the reasons why it is not a good idea to have only Hadoop as your data warehouse:
- Hadoop does not provide for very fast query reads. HDP 1.0 is a batch processing infrastructure, not interactive/real-time. HDP 2.0 will support interactive queries but is new and still will not perform anywhere near PDW for interactive querying. This is why PolyBase is so important, as it bridges the gap between the two technologies so customers can take advantage of both the unique features of Hadoop and the benefits of an EDW. Truth be told, users won’t want to wait 10+ seconds for a MapReduce job to start up just to execute a Hive query.
- Hive is the most popular tool for querying data in HDFS, using an SQL-like syntax called HiveQL. But HiveQL does not fully support ANSI SQL, which limits the range of users, tools, and applications that can access Hadoop data
- Hadoop lacks a sophisticated query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing strategies and therefore performs poorly for complex queries
- Hadoop does not have the ability to place “hot” and “cold” data on a variety of storage devices with different levels of performance
- Hadoop is not relational, as all the data is in files in HDFS, so there always is a conversion process to convert the data to a relational format
- Hadoop is not a database management system. It does not have functionality such as update/delete of data, referential integrity, statistics, ACID compliance, data security, and the plethora of tools and facilities needed to govern corporate data assets (see the short T-SQL contrast after this list)
- Restricted SQL support, such as missing certain aggregate functions
- There is no metadata stored in HDFS, so another tool needs to be used to store that, adding complexity and slowing performance
- Finding expertise in Hadoop is very difficult: only a small number of people understand Hadoop and all its various versions and products, versus the large number of people who know SQL
- Super complex, with lots of integration across multiple technologies needed to make it work
- Many tools/technologies/versions/vendors (fragmentation), no standards, difficult to make a corporate standard. See all the various Apache Hadoop technologies here
- Some reporting tools don’t work against Hadoop, and some reporting tools require the data to be in an OLAP format
- May require end-users to learn new reporting tools and Hadoop technologies to query the data
- The new Hadoop solutions (Tez, X, Spark, etc.) are still figuring themselves out. Customers should not take the risk of investing in one of these solutions only to have it become obsolete, as is happening with MapReduce.
- It might not save you much in costs: you still have to pay for hardware, support, licenses, training, and migration. And then there is the possibility of a big company acquiring a Hadoop company and the cheap solution becoming much more expensive
- If you need to combine relational data with Hadoop, you will need to move that relational data to Hadoop since there is no PolyBase-like technology
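To illustrate the “not a database management system” point above, here is the kind of routine work a relational engine does that plain HDFS files and early Hive could not. The tables are hypothetical and this is generic SQL Server syntax (PDW itself relaxes some constraint enforcement), so treat it as a sketch of what is taken for granted on the relational side:

```sql
-- Declared keys, referential integrity, and in-place transactional
-- updates/deletes: all routine on a relational engine.
CREATE TABLE dw.DimCustomer
(
    CustomerId INT           NOT NULL PRIMARY KEY,
    CustomerNm NVARCHAR(100) NOT NULL
);

CREATE TABLE dw.FactOrders
(
    OrderId     INT NOT NULL PRIMARY KEY,
    CustomerId  INT NOT NULL
        FOREIGN KEY REFERENCES dw.DimCustomer (CustomerId),
    OrderAmount DECIMAL(18, 2) NOT NULL
);

BEGIN TRANSACTION;
    UPDATE dw.FactOrders SET OrderAmount = 0 WHERE OrderId = 42;
    DELETE FROM dw.FactOrders WHERE OrderId = 43;
COMMIT TRANSACTION;

-- The engine also enforces integrity: this DELETE is rejected while any
-- order still references customer 7.
-- DELETE FROM dw.DimCustomer WHERE CustomerId = 7;
```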
I also wanted to mention that “unstructured” data is a bit of a misnomer. Just about all data has at least some structure, so it is better to call it “semi-structured”. I like to think of it this way: data in a text file is semi-structured until someone adds structure to it, for example by importing it into a SQL Server table. Or think of structured data as relational and unstructured data as non-relational.
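As a small illustration of “adding structure”, here is how a tab-delimited web log goes from a semi-structured text file to a structured table. The file path and table are hypothetical, and this assumes plain SQL Server and its BULK INSERT command:

```sql
-- Declaring columns and loading the file is all it takes to "add structure".
CREATE TABLE dbo.WebClicks
(
    ClickTime DATETIME2     NOT NULL,
    UserName  NVARCHAR(50)  NOT NULL,
    Url       NVARCHAR(400) NOT NULL
);

BULK INSERT dbo.WebClicks
FROM 'C:\staging\clicks.txt'
WITH (
    FIELDTERMINATOR = '\t',   -- tab-delimited
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2       -- skip the header row
);
```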
More info:
Using Hadoop to Augment Data Warehouses with Big Data Capabilities
WEBINAR Replay: Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
Video Big Data Use Case #5 – Data Warehouse Augmentation
How Hadoop works with a Data Warehouse
Design Tip #165 The Hyper-Granular Active Archive
No, Hadoop Isn’t Going To Replace Your Data Warehouse
A fair benchmark for Impala 2.0 – part 1
Big Data: What Does it Really Cost?
Cool post, James. Do you have a link that has the data to support the 20+ seconds claim? I’m involved in some R&D, so I’d like to review the research.
Thanks
Hi Time…I got that number via my own testing as well as watching some Hortonworks presentations.