Data Warehouse Architecture – Kimball and Inmon methodologies
NOTE: The subject of this blog was developed into a presentation that can be found at: Building an Effective Data Warehouse Architecture
What is the best methodology to use when creating a data warehouse? Well, first off, let’s discuss some of the reasons why you would want to use a data warehouse and not just use your operational system:
- You need to integrate many different sources of data in near real-time. This will allow for better business decisions because users will have access to more data. Plus this will save users lots of time because they won’t waste precious time retrieving data from multiple sources
- You have tons of historical data that you need to gather in one easily accessible place in which it will have common formats, common keys, common data model, and common access methods.
- You need to keep historical records, even if the source transaction systems does not
- You can restructure the data and rename tables and fields so if makes more sense to the users
- You need to use master data management to consolidate many tables, such as customers, into one table
- Users are running reports directly against operational systems, causing performance problems. Instead, create a data warehouse so users can run reports off of that. Plus, the data warehouse is optimized for read access, resulting in faster report generation
- Having an easy to use data warehouse allows users to create their own reports without having to get IT involved
- Improve data quality by cleaning up data as it is imported into the data warehouse (providing more accurate data) as well as providing consistent codes and descriptions
- Having one version of the truth, so each department will produce results that are in line with all the other departments, providing consistency
- Having a data warehouse makes it easy to create business intelligence solutions on top of it, such as SSAS cubes
- Companies that have implemented data warehouses and complementary BI systems have generated more revenue and saved more money than companies that haven’t invested in BI systems and data warehouses
Once you decide to build a data warehouse, the next step is deciding between a normalized versus dimensional approach for the storage of data in the data warehouse.
The dimensional approach, made popular by in Ralph Kimball (website), states that the data warehouse should be modeled using a Dimensional Model (star schema or snowflake). The normalized approach, also called the 3NF model, made popular by Bill Inmon (website), states that the data warehouse should be modeled using an E-R model/normalized model.
In a dimensional approach, data is partitioned into either “facts”, which are generally numeric transaction data, or “dimensions“, which are the reference information that gives context to the facts. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated. Plus, if you are used to working with a normalized approach, it can take a while to fully understand the dimensional approach and to become efficient in building one.
In the normalized approach, the data in the data warehouse are stored following database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into separate physical tables when the database is implemented. The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to join data from different sources into meaningful information and then access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.
The final step in building a data warehouse is deciding between using a top-down versus bottom-up design methodology.
Kimball is a proponent of an approach to data warehouse design described as bottom-up in which dimensional data marts are first created to provide reporting and analytical capabilities for specific business areas such as “Sales” or “Production”. These data marts are eventually integrated together to create a data warehouse using a bus architecture, which consists of conformed dimensions between all the data marts. So the data warehouse ends up being segmented into a number of logically self-contained and consistent data marts, rather than a big and complex centralized model. Business value can be returned as quickly as the first data marts can be created, and the method lends itself well to an exploratory and iterative approach to building data warehouses so that no master plan is required upfront.
Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. He has defined a data warehouse as a centralized repository for the entire enterprise. Dimensional data marts containing data needed for specific business processes or specific departments are created from the enterprise data warehouse only after the complete data warehouse has been created. In the Inmon vision the data warehouse is at the center of the “Corporate Information Factory” (CIF), and the newer DW 2.0, which provides a logical framework for delivering business intelligence (BI) and business management capabilities. The work is a long-term, construction will last a long time, but the return is expected to be a long-lasting and reliable data architecture.
Kimball is the most frequently used methodology, especially if you are using the Microsoft BI stack. It is popular because business users can see some results quickly, with the risk you may create duplicate data or may have to redo part of a design because there was no master plan. With Inmon there is a master plan and usually you will not have to redo anything, but if could be a while before you see any benefits, and the up-front cost is significant. And another risk is by the time you start generating results, the business source data has changed or there is changed priorities and you may have to redo some work anyway.
Note there are hybrid solutions consisting of the best of breed practices from both 3rd normal form and star schema such as the Data Vault.
More info:
Why & When Data Warehousing? Is it Relevant?
Top Five Benefits of a Data Warehouse
The 10 Essential Rules of Dimensional Modeling
Information architecture is a matter
LinkedIn discussion KIMBALL/INMON and ARCHITECTURE
LinkedIn discussion What formal data architectures do we have that represent a compromise between Inmon and Kimball?
Microsoft EDW Architecture, Guidance and Deployment Best Practices
Kimball vs. Inmon…or, How to build a Data Warehouse
Kimball versus Inmon: a peace offer?
Inmon vs. Kimball – An Analysis
Data Warehousing: Similarities and Differences of Inmon and Kimball
Video SQLBI Methodology
LinkedIn discussion What is the difference between Inmon and Kimble point of view of the data warehouse?
Which Data Warehouse Architecture Is Most Successful? (2006)
Kimball or Inmon in an enterprise environment
Data Warehousing Battle of the Giants: Comparing the Basics of the Kimball and Inmon Models
James, You seem to be conflating Architecture with Methodology. In my experience there’s nothing about an integrated, normalized data warehouse (Inmon CIF architecture) that means it will take longer to deliver results or cost more up-front. It may be argued that *waterfall* methodologies can take longer and cost more up-front, but that is orthogonal to what *architecture* you choose. Agile, iterative approaches are surely very popular with BI projects these days and both Inmon and Kimball architectures are often implemented using an agile approach.
An important advantage of a normalized data warehouse with dependent marts is that it supports the iterative, agile approach better than coupling each denormalized mart (a’la Kimball) directly to data sources. With a normalized warehouse it is typically easier to add new data sources and evolve the warehouse model because it is less tightly coupled to any one set of reporting requirements and because there are fewer moving parts (transformation layer) on the upstream side of the warehouse. The downstream side (between warehouse and marts) is where decision-support business logic goes and that is simplified too because it only has to consume data already validated and integrated into the data warehouse.
I agree with the advantage D points out. Having integrated the data into the normalized data warehouse also leads to much more consistency across the various data marts in terms of their data models and vocabulary. This is certainly the approach I prefer.
Also, a small correction regarding terminology. In a dimensional approach, data is partitioned into either “measures” or “dimensions“. The “fact” is the measure within the context provided by the dimensions, i.e. each row of a fact table.
Please view another alternative to the Star and Snowflake Schema’s.
Pingback:Why you need Business Intelligence | James Serra's Blog
Pingback:Why You Need a Data Warehouse | James Serra's Blog
Kimball is NOT a bottom up methodology (Inmon calls it that but Kimball disputes). Inmon offers no methodolgy for data marts. If you use Kimballs (atomic) data mart methodology with Inmons CIF you end up with 2 full copies of source transactions. Imon is subject oriented meaning all business processes for each subject (for example client) need to be modelled before the EDW can be a single version of the truth. This takes a LONG time.
Hey there! I’ve been following your web site for a while now
and finally got the courage to go ahead and give you
a shout out from Houston Texas! Just wanted to say keep up
the excellent job!
Thanks! Go Texans!
Great and informative article.
Pingback:An Architecture For Data Quality Kimball | Architecture Fan
Pingback:Corporate Information Factory Vs Bus Architecture | Easy Architecture Fan