Modern Data Warehouse & Reverse ETL
An extension to the Modern Data Warehouse (MDW) that I have been hearing a bit about lately is called “Reverse ETL”. Before I describe what that is, I first want to give a quick review of a typical MDW, which consists of five stages:
- Ingest: Data is ingested from multiple sources via ELT
- Store: The ingested data is stored in the raw layer of a data lake, in the same format in which it arrived from the source
- Transform: The data is then cleaned and written to a cleaned layer in the data lake, and then joined and/or aggregated and copied into a presentation layer in the data lake (see the sketch after this list)
- Model: Some of the data is then copied into a relational database in third normal form (3NF) and/or a star schema
- Visualize: Reports and dashboards are then built on top of the data
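To make the Store and Transform stages concrete, here is a minimal PySpark-style sketch of data flowing through the three lake layers. The storage paths, container names, and columns are hypothetical placeholders; in the Azure world this logic would more likely live in mapping data flows or a Synapse/Databricks notebook.

```python
# Minimal sketch of the raw -> cleaned -> presentation flow.
# All paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mdw-transform").getOrCreate()

# Store: the raw layer holds data exactly as it arrived from the source (CSV here)
raw = spark.read.option("header", True).csv(
    "abfss://raw@mylake.dfs.core.windows.net/sales/")

# Transform step 1: clean - fix types, drop bad rows
cleaned = (raw
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .dropna(subset=["customer_id", "amount"]))
cleaned.write.mode("overwrite").parquet(
    "abfss://cleaned@mylake.dfs.core.windows.net/sales/")

# Transform step 2: join/aggregate into a presentation-layer table ready for modeling
presentation = (cleaned
    .groupBy("customer_id", F.year("order_date").alias("order_year"))
    .agg(F.sum("amount").alias("total_spend"),
         F.count("*").alias("order_count")))
presentation.write.mode("overwrite").parquet(
    "abfss://presentation@mylake.dfs.core.windows.net/sales_by_customer/")
```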
There are many variations, additions, and exceptions to these stages, and multiple products/tools can be used at each stage. In the Azure world, typically Azure Data Factory is used to ingest the data, Azure Data Lake Storage Gen2 is used to store the data, mapping data flows in Azure Data Factory are used to transform the data, Azure Synapse Analytics is used to model the data, and Power BI is used to visualize the data. Of course, there are many variations on the tools you can use. I’ll post a video in the next few weeks that will discuss this in more detail.
“Reverse ETL” is the process of moving data from a modern data warehouse into third-party systems to make the data operational. Traditionally, data stored in a data warehouse is used for analytical workloads and business intelligence (i.e., identifying long-term trends and influencing long-term strategy), but some companies are now recognizing that this data can be further utilized for operational analytics. Operational analytics helps with day-to-day decisions, with the goal of improving the efficiency and effectiveness of an organization’s operations. In simpler terms, it’s putting a company’s data to work so everyone can make better and smarter decisions about the business. For example, if your MDW ingested customer data that was then cleaned and mastered, that customer data can be copied into multiple SaaS systems such as Salesforce to make sure there is a consistent view of the customer across all systems. Customer info can also be copied to a customer support system to provide better support to that customer by having more info about that person, or copied to a sales system to give the customer a better sales experience. As a last example, you can identify at-risk customers by surfacing customer usage data in a CRM.
Companies are building key definitions in SQL on top of the data warehouse, such as Lifetime Value (LTV), Product Qualified Lead (PQL), propensity score, customer health, etc. Yes, you can easily create reports and visualizations using this data in BI tools or SQL, but these insights can be much more powerful if they drive the everyday operations of your teams across sales, marketing, finance, etc. in the tools those teams live in. Instead of training sales reps to use the BI reports you built from the MDW, the data analyst can operationalize their analysis by feeding lead scores from the data warehouse into, for example, a custom field in Salesforce. As another example, say your data science team calculates a propensity score on top of the data warehouse or data lake describing a user’s likelihood of buying a product. Using Reverse ETL, you can move the propensity score to an operational production database to serve customers personalized in-app experiences in real time.
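As a rough illustration of one of these key definitions built in SQL on top of the warehouse, the sketch below computes a simple lead score and reads it back for syncing. The table names, column names, scoring weights, and DSN are all made up for the example.

```python
# Hypothetical lead-score definition built on top of warehouse tables.
# Table/column names are placeholders; the scoring weights are illustrative only.
import pyodbc

LEAD_SCORE_SQL = """
SELECT c.customer_id,
       c.email,
       -- crude propensity-style score: recent activity plus spend
       (0.6 * COALESCE(u.logins_last_30d, 0)
      + 0.4 * COALESCE(s.total_spend, 0) / 1000.0) AS lead_score
FROM dim_customer c
LEFT JOIN fact_usage u ON u.customer_id = c.customer_id
LEFT JOIN fact_spend s ON s.customer_id = c.customer_id;
"""

conn = pyodbc.connect("DSN=synapse_dw")  # connection details are environment-specific
rows = conn.cursor().execute(LEAD_SCORE_SQL).fetchall()
# Each row can now be pushed into a custom field in a SaaS system such as Salesforce.
```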
Instead of writing your own API connectors from the data warehouse to SaaS products to pipe the data into operational systems like Salesforce, and dealing with all the mapping of fields, reverse ETL solutions have appeared that offer out-of-the-box connectors to numerous systems. They provide the mapping to the SaaS products and allow you to continuously sync, or define what triggers the syncing, between the two systems. I see reverse ETL products with extensive and easy-to-use mapping capabilities to many SaaS products such as Salesforce, HubSpot, Marketo, and Zendesk as a big advantage over trying to use Azure Data Factory and coding your own.
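Conceptually, what these products configure for you is a declarative field mapping plus a sync trigger. A hand-rolled version might look like the toy sketch below, where `FIELD_MAP` and `push_to_crm` are hypothetical stand-ins for the mapping and the actual SaaS API call.

```python
# Toy reverse-ETL sync: map warehouse columns to SaaS fields and upsert.
# FIELD_MAP, the record source, and push_to_crm are all hypothetical stand-ins
# for what a product like Hightouch or Census configures for you.
FIELD_MAP = {                      # warehouse column -> Salesforce field
    "customer_id": "External_Id__c",
    "email":       "Email",
    "lead_score":  "Lead_Score__c",
}

def to_saas_record(row: dict) -> dict:
    """Rename warehouse columns to the destination system's field names."""
    return {dst: row[src] for src, dst in FIELD_MAP.items()}

def push_to_crm(record: dict) -> None:
    """Placeholder for the real API call (e.g., a Salesforce upsert)."""
    print(f"UPSERT {record['External_Id__c']}: {record}")

def sync(rows: list[dict]) -> None:
    # A real connector would also handle batching, retries, and change
    # detection, only syncing rows that changed since the last run.
    for row in rows:
        push_to_crm(to_saas_record(row))

sync([{"customer_id": "42", "email": "pat@example.com", "lead_score": 7.3}])
```

That plumbing (batching, retries, change detection) is exactly the part the reverse ETL vendors sell, which is why the out-of-the-box connectors are attractive.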
There are now a handful of startups building reverse ETL products, including Hightouch, Census, Grouparoo (open source), Headsup, Polytomic, RudderStack, and Seekwell. It will be interesting to see how these products evolve and whether reverse ETL becomes popular.
More info:
Reverse ETL is Just Another Data Pipeline
Power BI/SSAS: posting DAX queries to SaaS is the solution. It has been used for two years in my company.
SSAS is the best business logic computing engine!
https://github.com/microsoft/azure-analysis-services-http-sample
ThoughtSpot just announced it acquired Seekwell. So BI (or DBMS) vendors may co-opt this space before it solidifies!
Good article, James, as always.
It makes sense to re-use the DW dimensional model throughout an organisation, so that all applications use a single source of truth. However, ‘reverse ETL’ seems like a slimline Master Data Management.
Good point Mark. Similar concept to Master Data Management with a new name so marketing has something to sell. And I would think the label “Reverse EL” is more appropriate here than “Reverse ETL”, since no transformations are happening.
Agree that Reverse ETL is not the best descriptor, although it is catchy and quickly conveys the general approach. ThoughtSpot wants to call it AnalyticsOps, but I can’t stomach another “Ops” designation. I’ve opted to call it “Automated Writebacks” – what do you think?