Data Virtualization vs Data Warehouse
Data virtualization goes by a lot of different names: logical data warehouse, data federation, virtual database, and decentralized data warehouse. Data virtualization allows you to integrate data from various sources, keeping the data in place, so that you can generate reports and dashboards to create business value from the data. It is an alternative to building a data warehouse, where you collect data from various sources and store a copy of the data in a new data store.
The main advantage of data virtualization is speed-to-market: you can build a solution in a fraction of the time it takes to build a data warehouse, because you don't need to design and build the data warehouse and the ETL to copy the data into it, and you don't need to spend as much time testing. Copying the data means more hardware costs, more software licenses, more ETL flows to build and maintain, more data inconsistencies, and more data governance costs, so using data virtualization can also save you a lot of money. Other reasons to use data virtualization include rapid prototyping ahead of a batch data movement solution, self-service analytics via a virtual sandbox, and regulatory constraints on moving data.
Some of the more popular data virtualization products are Cisco Data Virtualization (previously called Composite Software), Denodo Platform for Data Virtualization, Informatica Data Virtualization, Dremio, IBM Big SQL, and Incorta.
Along the same lines as data virtualization vs data warehouse is federated queries vs data lake. Technologies such as PolyBase, Metanautix, and U-SQL in Azure Data Lake Analytics provide federated queries.
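To make the federated-query idea concrete, here is a minimal sketch using SQL Server 2019 PolyBase to query an Oracle table in place, with no data copied; the server, credential, and table names are all hypothetical:

```sql
-- Assumes SQL Server 2019+ with the PolyBase feature installed.
-- A database master key is required before creating a scoped credential:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL OracleCred
    WITH IDENTITY = 'report_reader', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE OracleSales
    WITH (LOCATION = 'oracle://oraclesrv:1521', CREDENTIAL = OracleCred);

-- The external table is just metadata; the rows stay in Oracle.
CREATE EXTERNAL TABLE dbo.OrdersExt (
    OrderId     INT,
    CustomerId  INT,
    OrderDate   DATE,
    Amount      DECIMAL(18,2)
)
WITH (LOCATION = '[XE].[SALES].[ORDERS]', DATA_SOURCE = OracleSales);

-- A single T-SQL query now joins local SQL Server data with live Oracle data.
SELECT c.CustomerName, SUM(o.Amount) AS TotalSales
FROM dbo.Customers AS c
JOIN dbo.OrdersExt AS o ON o.CustomerId = c.CustomerId
GROUP BY c.CustomerName;
```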
But there are some major drawbacks to data virtualization and federated queries, so you have to ask the following questions when you are thinking about using them:
- Speed. Is this something I could use for a Power BI dashboard where I want to slice and dice data with sub-second response times? Or is this more for operational-type reporting?
- How much will this affect the performance of the source system? Could a query consume all the resources of the server hosting a data source I'm querying against? Does it push down the query in the same way PolyBase does (see the pushdown sketch after this list)?
- Do I need to install something on each server that contains a data source I want to use?
- Does it use the indexes of each technology on the data store, or does it create its own indexes? Are the statistics from each data source used for queries?
- How is security handled for giving users access to each data source?
- How is master data management handled (e.g. the same customer in multiple data sources, but with the customer name spelled differently)?
- Where and how will the data be cleaned?
- Will reports break if the source system is changed?
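On the pushdown question, PolyBase lets you see how much it matters by forcing the computation to one side or the other. A minimal sketch, assuming a Hadoop-backed external table with a hypothetical name dbo.ClickstreamExt (the FORCE/DISABLE EXTERNALPUSHDOWN hints apply to Hadoop external tables):

```sql
-- Push the WHERE predicate down to the external source: less data moves,
-- but the source system does the filtering work.
SELECT UserId, Url
FROM dbo.ClickstreamExt
WHERE EventDate = '2021-01-15'
OPTION (FORCE EXTERNALPUSHDOWN);

-- Pull raw rows back and filter locally: the source just streams data,
-- but far more of it crosses the wire.
SELECT UserId, Url
FROM dbo.ClickstreamExt
WHERE EventDate = '2021-01-15'
OPTION (DISABLE EXTERNALPUSHDOWN);
```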
And there are some very valid reasons why a physical data warehouse is required:
- Many production systems don't keep track of historical data, yet that data must be stored somewhere for historical analysis. The physical data warehouse is, in this case, the most obvious solution
- Accessing production systems directly for reporting and analytics can lead to too much interference on those systems and to performance degradation. Note that this was the original reason physical data warehouses were developed in the first place
- Speed: A data warehouse is optimized for read access, while a source system is usually optimized for writes
- In building a data warehouse you will be restructuring, renaming, and joining data (i.e. creating star schemas, as sketched below) to make it easy for users to create reports
- A data warehouse protects users against source system upgrades
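As an illustration of that restructuring, here is a minimal star schema sketch in T-SQL; the table and column names are hypothetical:

```sql
-- Dimension: one wide, denormalized row per customer, renamed for readability.
CREATE TABLE dbo.DimCustomer (
    CustomerKey   INT IDENTITY PRIMARY KEY,  -- surrogate key
    CustomerName  NVARCHAR(200),
    City          NVARCHAR(100),
    Segment       NVARCHAR(50)
);

-- Fact: narrow rows of measures, joined to dimensions by surrogate keys.
CREATE TABLE dbo.FactSales (
    DateKey      INT NOT NULL,
    CustomerKey  INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    Quantity     INT,
    SalesAmount  DECIMAL(18,2)
);

-- Reporting becomes a simple join-and-aggregate instead of navigating
-- dozens of normalized OLTP tables.
SELECT d.Segment, SUM(f.SalesAmount) AS Sales
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d ON d.CustomerKey = f.CustomerKey
GROUP BY d.Segment;
```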
A word about views: views, such as those in SQL Server, can be thought of as a “lightweight” data virtualization solution. When users need access to operational data, views can be defined directly on the operational data store or the production databases. But views have their own issues: operations for correcting and transforming data must be added to the view definitions, because the original operations are implemented in the physical data warehouse environment and are now bypassed; they now have to be virtualized.
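As a sketch of that “lightweight” approach, a view over production tables can embed the correcting and transforming logic that a warehouse ETL would otherwise perform; the table and column names here are hypothetical:

```sql
-- Defined directly on the operational database; no data is copied.
CREATE VIEW dbo.vwCustomerOrders
AS
SELECT  o.OrderId,
        -- cleaning that ETL would normally do must live in the view:
        UPPER(LTRIM(RTRIM(c.CustomerName))) AS CustomerName,
        COALESCE(o.ShipCountry, 'UNKNOWN')  AS ShipCountry,
        CAST(o.OrderDate AS DATE)           AS OrderDate,
        o.Quantity * o.UnitPrice            AS LineTotal
FROM    Sales.Orders    AS o
JOIN    Sales.Customers AS c ON c.CustomerId = o.CustomerId
WHERE   o.IsDeleted = 0;  -- filter out soft-deleted rows
```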
More info:
- IT pros reveal benefits, drawbacks of data virtualization software
- Experts Reconsider the Data Warehouse
- Clearly Defining Data Virtualization, Data Federation, and Data Integration
- Mark Beyer, Father of the Logical Data Warehouse, Guest Post
- The Logical Data Warehouse: Smart Consolidation for Smarter Warehousing
- Data federation technology (data virtualization technology or data federation services)
- Logical Data Warehousing for Big Data
- The Many Uses of Data Federation
- Enterprise Data Management, Part 1
- How Data Federation Can Co-exist with an EDW
- Demystifying Data Federation for SOA
- Federation Supplements The Data Warehouse – Not Either/Or, Never Was
Hi James,
Thanks for the great posts about “Data Virtualization”:
1) https://www.jamesserra.com/archive/2018/02/data-virtualization-vs-data-movement/
2) https://www.jamesserra.com/archive/2017/08/data-virtualization-vs-data-warehouse/
One area of strong recent interest is leveraging the rich data semantics of a large-scale, high-performance central tabular model (AAS or a large PBI Premium dataset) as a quick-to-market, zero-data-movement data virtualization layer. With a higher appetite for a good central semantic layer (leveraging large PBI datasets), more experts are avoiding duplicating data logic in other classic data virtualization tools when richer data and semantics already exist in their large PBI datasets (the data virtualization principle of zero movement).
They see additional cost and technology efficiencies (the SOLID principle) in leveraging the rich central data semantic layer beyond the traditional BI “pattern” (reporting tools querying the tabular model). A true zero-data-movement “data virtualization” layer is now possible:
a) for upstream ETL systems consuming aggregated/augmented information (using the PBI REST API to run DAX queries)
b) via the brand-new DirectQuery for PBI datasets (up to 3 levels of composite DirectQuery datasets).
Pattern vs Anti-patterns
What makes this topic interesting is not that many champions are already doing it (considering the performance and volume limits, of course), but that many classic “architects” feel out of their comfort zone and rush to label it an “antipattern” (in the name of “performance concerns”, etc.).
So your post “Data Movement vs Data Virtualization” helps a lot in reviewing the key principles that should drive the conversation about data virtualization (pattern vs. antipattern).
The answer to this is important, as more and more organizations are seeing huge value in leveraging that rich, augmented data and semantic content as a good foundation for data virtualization (not only for visualization but for upstream transformations at the next level).
What do you think James?
Do you see any “philosophical” antipattern in using a well-designed, large-scale, high-performance analytics service (based on AAS or on PBI Large Premium) as a valid architectural solution for a fast-to-market, zero-data-movement data virtualization layer? Or should organizations duplicate the same logic in other dataflows, data warehouses, etc.?
Thank you, James!
Hi Raul,
I mentioned using PBI and DirectQuery as a lightweight virtualization layer in my blog at https://www.jamesserra.com/archive/2021/09/products-for-microsoft-gaps/. I agree with you that it's a quick solution, and I think it will become more popular.