Data Mesh: Centralized ownership vs decentralized ownership
I have done a ton of research lately on Data Mesh (see the excellent Building a successful Data Mesh – More than just a technology initiative for more details), and I have some concerns about the paradigm shifts it requires. My last blog tackled the first of those shifts, centralized vs decentralized data architecture. In this one I want to talk about centralized ownership vs decentralized ownership, along with a closely related paradigm shift (or core principle): siloed data engineering teams vs cross-functional data domain teams.
First I want to mention that there is a Data Mesh Learning Slack channel that I have spent a lot of time reading, and what is apparent is that there is a lot of confusion about exactly what a data mesh is and how to build one. I see this as a major problem: the more difficult a concept is to explain, the more difficult it will be for companies to build it successfully. So the promise of a data mesh improving the failure rates of big data projects will be difficult to achieve if we can’t all agree on exactly what a data mesh is. What’s more, the core principles of the data mesh sound great in theory but will be challenging to implement, hence my thoughts in this blog on centralized ownership vs decentralized ownership.
To review centralized vs decentralized ownership (which reminds me of the data mart arguments from the Kimball vs Inmon debates many years ago): rather than thinking in terms of pipeline stages (i.e. data source teams copying data to a central data lake to be filtered by a centralized data team in IT, who then prepare it for data consumers, hence “central ownership”), we think about data in terms of domains (e.g. HR, marketing, or finance) where the data is owned and kept within each domain (as a “data product”), hence “decentralized ownership” (also called domain or distributed ownership). From a business perspective this makes things easier, as it maps much more closely to the actual structure of your business: domains can be followed from one end of the business to the other. Each team is accountable for its data, and its processes can be scaled without impacting other teams. Each domain has its own team for implementing its domain solution (“cross-functional data domain teams”) instead of one centralized team in IT being responsible for all the implementations (“siloed data engineering teams”).
Inside a domain such as HR, the domain team manages its HR-related OLTP systems (e.g. Salesforce, Dynamics) and has created its own datasets on top of a data warehouse or data lake that combines the data from all of those OLTP systems. I have not seen clarity from the data mesh discussions on how exactly a domain handles OLTP and analytical data, so please comment below if you have a different opinion.
To be part of the data mesh, each domain must follow a set of IT guidelines and standards (“contracts”) that describe how their domain data will be managed, secured, discovered and accessed.
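To make that concrete, here is a minimal sketch of what such a contract might look like. The fields and values are my own hypothetical illustration, not an official data mesh specification:

```python
# A minimal sketch of a domain data product "contract" (hypothetical fields,
# not from any official data mesh specification).
from dataclasses import dataclass, field

@dataclass
class DataProductContract:
    domain: str                 # owning domain, e.g. "hr"
    product_name: str           # discoverable name of the data product
    output_port: str            # how consumers access it (table, API, topic)
    schema_version: str         # versioned so consumers can detect changes
    pii_columns: list[str] = field(default_factory=list)  # columns requiring masking
    freshness_sla_hours: int = 24   # how stale the data is allowed to be
    owner_email: str = ""       # who to contact when something breaks

# Example: the HR domain registering an "employees" data product.
hr_employees = DataProductContract(
    domain="hr",
    product_name="employees",
    output_port="warehouse.hr.employees_v1",
    schema_version="1.0.0",
    pii_columns=["email", "date_of_birth"],
    freshness_sla_hours=24,
    owner_email="hr-data@example.com",
)
```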
Having built database and data warehouse solutions for 35 years, I have some concerns about this approach:
- Domains will think only about their own data product and not about how to work with other products, possibly making it difficult to combine data from multiple domains
- Each product group may not have IT-like people to do the implementation, relying instead on business-like people
- Does each domain have the budget to do its own implementation?
- You may have domains not wanting to deal with data, preferring to focus on what they are good at (i.e. serving their customers) and happy to have IT handle their data
- Each domain could be using different technology, some of which could be obscure, and may not have the experience to pick the right technology
- Centralized policies in a data mesh often leave the implementation details to the individual teams, which has the potential for inconsistent implementations that may lead to performance degradations and differing cost profiles
- If implementing a Common Data Model (CDM), then you will have to get every domain to implement it
- You will have to coordinate across domains so that each domain generates unique IDs for rows when it holds the same types of data as other domains (e.g. customers); see the key-generation sketch after this list
- Domains may have their own roadmap and want to implement their use case now and/or don’t want to pay or wait for a data mesh. And what if you have dozens of domains/orgs who feel this way?
- Conformed dimensions would have to be duplicated in each domain
- You could plan on having a bunch of people with domain knowledge within each domain, but what if you already have many people in IT who understand all the domains and how to integrate the data to get more value than the separate domains of data? Wouldn’t this favor centralized ownership?
- Ideally you want deep expertise on your cross-functional teams in streaming, ETL batch processing, data warehouse design, and data visualization. So if you have many domains this means many roles to fill, and that might not be affordable. The data mesh approach assumes that each domain team has the necessary skills, or can acquire them, to build robust data products. These skills are incredibly hard to find
- How do you convince ‘business people’ in each domain to take ownership of data if it only introduces extra work for them, and could possibly cause a disruption in service?
- If each domain is building their own data transformation code, then there will be a lot of duplication of effort
- If there are already data experts within each domain, why not just have IT work closely with them if using a centralized ownership?
- The domain teams may say their data is clean and refuse to change it, whereas if the data is centralized it can be cleaned. Domains may also have different interpretations of “clean” or of how to standardize data (e.g. defining states with abbreviations or with the full state name; see the standardization sketch after this list). And what if the domains don’t have time to clean the data?
- Who scans for personally identifiable information (PII), and who fixes the issue if it turns out people are seeing PII they should not be allowed to see?
- Who coordinates if a domain changes its data model, causing problems with core data models or queries that join domain data models?
- Who handles DataOps?
- Shifting from a centralized set of individuals servicing their data requests to a self-serve approach could be very challenging for many companies
- Each domain ingesting their own data could lead to duplication of purchased data, along with many domains building similar ingestion platforms
- The problem of domains ignoring the data security standards or data quality standards, which would not happen in a centralized architecture
- You create data silos for domains that don’t want to join the data mesh or are not allowed to because they don’t follow the data mesh contract for domains
- Replacing the IT data engineers with engineers in each domain (“business engineers”) will provide the benefit of business engineers knowing the data better, but the tradeoff is they don’t have the specialized technical knowledge that IT data engineers have which could lead to less-than-optimal technical solutions
- Having multiple domains that have aggregates or copies of data from other domains for performance reasons leads to duplication of data
- A data mesh assumes that the people who are closest to the data are the best able to understand it, but that is not always true. Plus, they likely don’t understand how best to combine their data with other domains
- A data mesh touts that it reduces the “organizational complexity”, but it may actually make it worse when the teams are distributed instead of centralized and many more people are involved
- The assumption that IT data engineers don’t have business and domain knowledge is not always true in my experience. I have seen some that have more knowledge than the actual domain experts, plus they understand how to combine the data from different domains together. And if IT data engineers don’t have the domain knowledge, having them obtain that knowledge could be a better solution than a whole new way of working that comes with a data mesh (in which those people are in many cases just moved to the business group). Wouldn’t improving the communication between IT and the domains be the easiest solution?
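On the unique-ID concern above, one common mitigation (a minimal sketch under my own assumptions, not something the data mesh literature prescribes) is to namespace surrogate keys by domain, so rows from different domains can never collide:

```python
# Sketch: avoiding key collisions across domains by namespacing surrogate keys.
# The domain names and helper are hypothetical illustrations.
import uuid

def domain_key(domain: str, natural_key: str) -> str:
    """Derive a deterministic, globally unique ID from a domain + natural key.

    uuid5 hashes its input, so the same (domain, natural_key) pair always
    yields the same ID, while identical natural keys in different domains
    (e.g. customer '42' in sales and in marketing) get different IDs.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{domain}:{natural_key}"))

print(domain_key("sales", "customer-42"))      # deterministic UUID for sales
print(domain_key("marketing", "customer-42"))  # different UUID, no collision
```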
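And on the cleaning concern, the state-name example shows how two domains can both claim “clean” data yet standardize differently. A shared helper like this hypothetical sketch is the kind of logic a central team would normally own once, rather than each domain reinventing it:

```python
# Sketch: two domains may both consider their data "clean" yet standardize
# differently (abbreviations vs. full names). A shared helper is one fix.
US_STATES = {"washington": "WA", "oregon": "OR", "california": "CA"}  # truncated for brevity
ABBREVIATIONS = set(US_STATES.values())

def standardize_state(value: str) -> str:
    """Normalize a state to its two-letter abbreviation, the agreed standard."""
    cleaned = value.strip()
    if cleaned.upper() in ABBREVIATIONS:
        return cleaned.upper()                      # already an abbreviation
    return US_STATES.get(cleaned.lower(), cleaned)  # map full name, else pass through

print(standardize_state(" Washington "))  # -> 'WA'
print(standardize_state("wa"))            # -> 'WA'
```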
Finally, I have to take issue when I hear that current big data solutions don’t scale and that data mesh will solve that problem. It is trying to solve what it perceives as a major problem (a “crisis”) that in my opinion is really not major. Thousands of companies have implemented successful big data solutions, but there are very few data meshes in production. I have seen many “monolithic” architectures scale the technology and the organization very well. Sure, many big data projects fail, but for the same reasons that would have made them fail if they had tried to implement a data mesh instead (and arguably there would be an even higher failure rate trying to build a data mesh, due to its additional challenges). Technology for centralizing data has improved greatly, allowing solutions to scale, with serverless options now available to meet the needs of most big data requirements along with cost savings, and it will continue to improve. There is risk in the new architecture and organizational change that comes with a data mesh, especially compared to the centralized data warehouse, which has proven to work for many years if done right. Plus, the data mesh assumes that each source system can dynamically scale to meet the demands of its consumers, which will be particularly challenging when data assets become “hot spots” within the ecosystem.
But I want to be clear that I see a lot of positives in the data mesh architecture, and my hope is that it will be a great approach for certain use cases (mainly large, fragmented organizations). I’m just trying to point out that a data mesh is not a silver bullet, and you need to be aware of the concerns listed above before undertaking one, to make sure it’s the right approach for you so you don’t become another statistic in the failed-project column. It requires a large change in a company’s technology strategy and an even larger change in a company’s organizational strategy, which will be a huge challenge that you have to be prepared for.
More info:
Building a data mesh to support an ecosystem of data products at Adevinta
“Domains thinking of their own data product” is a key issue. The primary consumer of a data product is likely to be the product owner. Having product owners take feature requests from other domains, and scaling to meet the needs of other domains, are but two of the many organizational challenges with mesh architectures, as they were with other distributed-ownership architectures. Great article as always.
Thanks for this comment; it confirms my main concern with the data mesh: the concept assumes domains will go the extra mile to offer their data as data products for other domains to consume. But it remains unclear how these domains will be incentivized to go that extra mile.
We are in the infancy of data mesh, which will need enterprise diligence in governance. We have used enterprise data engineers and domain experts to process, clean, and govern. Our success has come from data engineering leadership that empowers their domain teams to govern their data. Many companies have built the mesh, but very few can keep it going across the enterprise. The breakdown is always in the governance.
Thank you James for the candid critique of Data Mesh, as well as your general thought leadership. When the concept is boiled down, the technical challenges only exist because we cannot delegate software/data development to business stakeholders (even if we call them “engineers”); you called this point out above. Organizations simply are not geared to maintain architect and developer capabilities within the business function, nor should they be. A better approach could be relying on business functions for data governance and stewardship, while relying on enterprise IT to create the data glue. The more recent centralized model seems to have finally gained traction with products like Synapse and Snowflake. Augmenting these products with interoperability and governance feels like a better move than “data mesh” for the time being. What is the business case for data mesh vs. other modern approaches? (sincerely asked)
Data mesh tries to solve three challenges with a centralized data lake/warehouse:
- Lack of ownership: who owns the data – the data source team or the infrastructure team?
- Lack of quality: the infrastructure team is responsible for quality but does not know the data well
- Organizational scaling: the central team becomes the bottleneck, such as with an enterprise data lake/warehouse
But as I pointed out in my blog, you can certainly have a centralized approach without these challenges, and without introducing the extra work and challenges a data mesh gives you.
Hi James
Your article is a good collation of challenges posed by Data Mesh. Federated computational governance can be challenging, as it may warrant a lot of self-discipline within each data domain.
Viewing this from an Agile team-organisation perspective, if you have Chapters comprising team members who cut across the Squads (Data Product teams), you can at least attempt to ensure all domains form a healthy, interoperable ecosystem. I think Data Mesh relies heavily on the assumption that each domain exposes its metadata (along with its operational and analytical data), which must be well defined for other domains to consume. I have made an attempt to take the Data Mesh paradigm a small step forward in my blog on Medium, using a hypothetical use case in a healthcare setup. Hope it is worth a read.
https://medium.com/capgemini-microsoft-team/data-mesh-implementation-in-a-multi-cloud-architecture-ac2a7b089789
In the past 20 years we talked about data marts, hub and spoke, and federated queries as the solution to analytics – rather than fixing the issues with a centralized analytical data store. Now it is data mesh. This too shall pass.
Hello,
I don’t see this as a big issue, and we already have a completely valid implementation available. Data Mesh is no different from the CQRS pattern (e.g. the Saga and Event Sourcing patterns). But while CQRS was introduced for API communication, Data Mesh applies the same idea to the things we call Data Warehouses/Lakes/Hubs, and now Mesh.
At my last company I tried to explain that there should be a single platform template in the form of CQRS, and that it should be integrated everywhere. The only centralized piece is the Event Store, and the way to speed it up is to create Snapshots. And you get it.
So imagine your standard 3-tier app design, e.g. front/back/storage. The next time you start developing a new app, the back end and storage are handled by the Data Mesh, so your API storage and data lake storage become one 🙂 That is the real beauty behind all this, and it is what I call microservices, or rather event-driven functions: logic is distributed as a flow of functions. Since everything is an event, it can be transformed into any kind of domain, as CQRS demonstrates in the form of Projections/Aggregates.
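To illustrate the analogy in this comment (my own minimal sketch with hypothetical event names, not the commenter’s actual platform): in CQRS-style event sourcing, every change is an immutable event in a central log, and read models are projections folded over that log, with snapshots serving as cached projections:

```python
# Minimal sketch of the CQRS / Event Sourcing idea the comment describes:
# a central event store, with read models built as projections over events.
# Event names and fields are hypothetical illustrations.

event_store = [  # the single centralized piece: an append-only event log
    {"type": "OrderPlaced",   "order_id": 1, "amount": 100},
    {"type": "OrderPlaced",   "order_id": 2, "amount": 250},
    {"type": "OrderRefunded", "order_id": 1, "amount": 100},
]

def project_revenue(events) -> int:
    """A projection: fold events into a read model (total net revenue)."""
    total = 0
    for e in events:
        if e["type"] == "OrderPlaced":
            total += e["amount"]
        elif e["type"] == "OrderRefunded":
            total -= e["amount"]
    return total

# A "snapshot" is just a cached projection, so consumers don't have to
# replay the full event log from scratch every time.
snapshot = project_revenue(event_store)
print(snapshot)  # 250
```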