When to have multiple data lakes
A question customers frequently ask me when discussing data lake architecture is “Should I use one data lake for all my data, or multiple lakes?” Ideally, you would use just one data lake, but I have seen many valid use cases where customers use multiple data lakes. Here are some of the reasons:
- Because of organizational structure, where each org keeps ownership of their own data. Typical with a data mesh
- To support multi-regional deployments, where certain regions have data residency/sovereignty requirements. For example, data in China cannot leave China
- To avoid Azure subscription or service limits, quotas, and constraints. For example, a maximum of 250 storage accounts with standard endpoints per region per subscription
- To enact different Azure policies for each data lake. For example, specifying that storage accounts should have infrastructure encryption
- Having multiple lakes, each with its own Azure subscription, makes it easier to track costs for billing purposes, especially compared to other options such as using tags
- If you have confidential or sensitive data and want to keep it separate from other less sensitive data for security reasons. Plus, you can implement more restrictive security controls on the sensitive data
- Different lakes for dev, test, and prod environments
- To improve latency – having a data lake reside in the same region as an end-user or an application querying the data, instead of users all over the world accessing data in one lake that could be located a considerable distance away
- For security purposes, to limit the scope of elevated privileges, so that a person holds those privileges only in the lake they are working in
- Having one source-aligned data lake as well as a consumer-aligned data lake
- To manage data that has different governance or compliance requirements. This can be especially important for organizations that need to comply with regulations such as GDPR or HIPAA.
- You have different teams or departments that need their own data lake for specific use cases or projects
- For better disaster recovery by having multiple data lakes in different regions with copies of the data, so you can ensure that your data is available in the event of a disaster or other disruption
- To enable the use of different data recovery and disaster recovery strategies for different types of data
- To enable different data retention policies. Organizations may have to retain data for a certain period of time due to legal or regulatory requirements, and keeping different types of data in separate data lakes makes it easier to apply a different retention policy to each
- The ability to implement different levels of service for different types of data. For example, you could use one data lake for storing and processing high-priority data that requires low-latency access and high availability, and another data lake for storing and processing lower-priority data that can tolerate higher latencies and lower availability. This can help to optimize the cost and performance of your data management infrastructure by allowing you to use less expensive storage and processing resources for lower-priority data
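As a rough illustration of how a few of the reasons above (data residency, sensitivity, and environment separation) could translate into a rule for deciding which lake a dataset belongs in, here is a minimal Python sketch. Everything in it is hypothetical: the `Dataset` fields, the `choose_lake` function, and the lake naming scheme are assumptions made for the example and are not part of any Azure SDK or naming standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    region: str       # where the data must reside, e.g. "china" or "europe"
    sensitive: bool   # True for confidential data needing stricter controls
    environment: str  # "dev", "test", or "prod"


def choose_lake(ds: Dataset) -> str:
    """Return a hypothetical lake name for a dataset.

    Assumes one lake per (region, environment) pair, plus a separate
    lake for sensitive data so tighter controls can be applied to it.
    """
    if ds.environment not in {"dev", "test", "prod"}:
        raise ValueError(f"unknown environment: {ds.environment!r}")
    tier = "restricted" if ds.sensitive else "general"
    return f"lake-{ds.region}-{ds.environment}-{tier}"


# Illustrative datasets only.
orders = Dataset("orders", region="china", sensitive=False, environment="prod")
payroll = Dataset("payroll", region="europe", sensitive=True, environment="prod")

print(choose_lake(orders))   # lake-china-prod-general
print(choose_lake(payroll))  # lake-europe-prod-restricted
```

In practice the number of dimensions you split on drives the number of lakes, which is one reason to split only where a requirement actually forces it.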
Keep in mind that using multiple data lakes can increase the complexity and cost of your data management infrastructure and require more resources and expertise to maintain, so weigh the benefits against the costs before implementing multiple data lakes (although in some cases, such as data sovereignty, you will have no choice but to have multiple data lakes). Multiple data lakes may also require additional data integration and management tools to ensure that data is properly transferred between the lakes and stays consistent across all of them. Finally, having multiple data lakes adds the performance challenge of combining the data when a query or report needs data from more than one lake.
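To make the consistency concern concrete, here is a minimal Python sketch of comparing two lakes that are supposed to hold the same data. It assumes each lake can produce a manifest that maps file paths to content checksums; how that manifest is generated is outside the sketch, and the `diff_manifests` function and the sample paths and checksums are hypothetical.

```python
def diff_manifests(source: dict[str, str], replica: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {path: checksum} manifests and report the differences."""
    # Files present in the source lake but absent from the replica.
    missing = sorted(p for p in source if p not in replica)
    # Files present in the replica that the source lake does not have.
    extra = sorted(p for p in replica if p not in source)
    # Files present in both lakes whose checksums disagree.
    mismatched = sorted(
        p for p in source if p in replica and source[p] != replica[p]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


# Illustrative manifests only.
lake_a = {"sales/2023/01.parquet": "a1f3", "sales/2023/02.parquet": "9c0e"}
lake_b = {
    "sales/2023/01.parquet": "a1f3",
    "sales/2023/02.parquet": "77b2",
    "tmp/x.parquet": "0000",
}

print(diff_manifests(lake_a, lake_b))
# {'missing': [], 'extra': ['tmp/x.parquet'], 'mismatched': ['sales/2023/02.parquet']}
```

A check like this only detects drift; repairing it still needs the integration tooling mentioned above.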
Great post. I’m particularly curious about your take on when and when not to use data mesh. It is a concept I have limited knowledge about, and what I’ve heard is that data mesh will make a bad data architecture worse because it abstracts the problems. Do you have a best practices post on data mesh, or is one (hopefully) forthcoming?
Hi James,
Glad you like the post! Your questions about data mesh are exactly what I am writing about in my book 🙂 I hope to make some chapters available in the next couple of months.