Data Platform products for Microsoft gaps
Microsoft has a ton of data platform-related products, but there are certain areas where they either don’t have a product or what they have is limited, and you need to look at a third-party product to fill that gap. At the company I work at, EY, we are building a data fabric on Azure, and I have listed below the areas where we have had to look at products outside the Microsoft realm:
- Master Data Management (MDM): Microsoft has Master Data Services (MDS), but it is for lightweight MDM needs, has not had any new features in quite a while, and requires SQL Server. Microsoft usually recommends Profisee instead. Other options: Informatica, Tamr, Boomi, Riversand, Semarchy
- Data Quality: Microsoft has a data quality product called Data Quality Services (DQS), but it does not seem to be supported anymore, is limited in features, and also requires SQL Server. If you are using an MDM tool like Profisee, it has built-in data quality features. Otherwise, look at other options: Informatica Cloud Data Quality, Talend Data Quality
- Data virtualization: Microsoft has sort of a “light” version of virtualization with their Serverless SQL pool in Azure Synapse Analytics, which can query remote data stores. It currently only supports querying data in the Azure Data Lake (Parquet, Delta Lake, delimited text formats), Cosmos DB, or Dataverse, but hopefully more will come in the future. Power BI with DirectQuery is also a light version of virtualization (see Data Virtualization in Microsoft Power BI and supported DirectQuery sources). For full virtualization software, check out: Denodo, Dremio, Starburst, Fraxses, Stratio
- Data Catalog: Microsoft has a nice product in this area called Purview, but it is not yet GA. If you need a GA product or one that has been around a while, check out: Informatica, Waterline Data, Alation, Collibra, Amundsen, Databricks Unity Catalog (not GA), erwin Data Intelligence, Apache Atlas, data.world
- Attribute-based access control (ABAC): ABAC for security is becoming more popular but Microsoft has limited support for it (see What is Azure attribute-based access control (Azure ABAC)? (preview)). Hopefully ABAC will be added to Purview, but for now look at: Immuta, Okera. For an excellent paper to see the benefits of ABAC over RBAC check out GigaOm Report: Immuta vs. Apache Ranger
- Multi-master cluster warehouse: Basically this means you can have multiple compute clusters all accessing the same database, as opposed to a cluster only being able to access the one database it is assigned to (e.g. five clusters all accessing databaseA, instead of cluster1 only accessing databaseA, cluster2 only accessing databaseB, etc). This functionality was demo’d in Azure Synapse Analytics quite a long time ago (see Azure Synapse Analytics & Power BI concurrency), but is still not available. Snowflake does have this feature and it is quite popular
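To make the ABAC bullet above a bit more concrete, here is a minimal Python sketch of the difference between RBAC and ABAC. It is purely illustrative: the names (`User`, `Dataset`, `ROLE_GRANTS`, `rbac_can_read`, `abac_can_read`) and the `region`, `clearance`, and `sensitivity` attributes are all made up for this example and are not the API of Azure ABAC, Immuta, Okera, or any other product. The idea it tries to show is that RBAC needs a grant per role and dataset, while ABAC evaluates one rule against attributes at request time.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


@dataclass
class Dataset:
    name: str
    attributes: dict = field(default_factory=dict)


# RBAC (hypothetical setup): a static mapping from role to readable datasets.
# Every new region or dataset needs a new role and a new grant.
ROLE_GRANTS = {
    "sales_analyst_emea": {"sales_emea"},
    "sales_analyst_apac": {"sales_apac"},
}


def rbac_can_read(user: User, dataset: Dataset) -> bool:
    """Allow access only if one of the user's roles was granted this dataset."""
    return any(dataset.name in ROLE_GRANTS.get(role, set()) for role in user.roles)


def abac_can_read(user: User, dataset: Dataset) -> bool:
    """Allow access if user and dataset attributes satisfy a single rule.

    Assumed rule for this sketch: same region, and the user's clearance is
    at least the dataset's sensitivity level.
    """
    region = user.attributes.get("region")
    same_region = region is not None and region == dataset.attributes.get("region")
    cleared = user.attributes.get("clearance", 0) >= dataset.attributes.get("sensitivity", 0)
    return same_region and cleared


alice = User("alice", roles={"sales_analyst_emea"},
             attributes={"region": "EMEA", "clearance": 2})
carlos = User("carlos", attributes={"region": "LATAM", "clearance": 1})

sales_emea = Dataset("sales_emea", {"region": "EMEA", "sensitivity": 1})
sales_latam = Dataset("sales_latam", {"region": "LATAM", "sensitivity": 1})
```

In this sketch, `carlos` has no role at all, so RBAC denies him `sales_latam` until someone creates and grants a LATAM role, whereas the ABAC rule already covers him because his `region` attribute matches the dataset's. That is the kind of role explosion ABAC is meant to avoid.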
Note these are just some of the products for each category based on my knowledge. Please leave a comment for products that I have missed that you like!
Very helpful article, James! I think that Synapse also allows virtualization with Storage accounts, in addition to ADLS. Also, Talend Data Catalog is very good, with auto scrolling capacity.
For data virtualisation, SQL Server 2019 does offer options, including connectivity to Teradata, Oracle, and MongoDB in addition to what you have mentioned – https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization?view=sql-server-ver15
A tool for budgeting and forecasting. Something like SAP BPC or IBM TM1.
Microsoft has had forecasting for many years. It’s a part of SQL Server Analysis Services. Unfortunately, I think that it was removed in the 2017 release. We used to do a LOT of forecasting for different retailers when I worked for a company that got bought by AC Nielsen.
A financial forecast, on the detailed level of accounts? In the format of 3/9, 6/6 and 9/3? I would have liked to see that in SSAS using the data mining tools, especially with the limited documentation.
For a non-IaaS solution, Azure SQL Managed Instance will also support PolyBase on ADLS Gen2. It has been in private preview since February. In the first release, it will support the Parquet and CSV formats.
Additionally, it can already query other databases such as SQL Server and Azure SQL, which would make it a good “low cost” candidate for some data virtualization cases.
Great info, thanks Eric!
Here is a gap my company, Cloud Data Solutions, has addressed. Azure Data Factory (ADF) has limitations in terms of managing and scheduling pipelines. We have developed ChillETL to address this gap. It is completely metadata-driven, which makes it possible to import many objects from different data sources with just a few pipelines.
Native feature store for ML
Other gaps: Data labeling, classification, and tokenization service
Microsoft lacks retail till software in its stack, whereas Oracle has Simphony and others and promotes the benefits of their synergy and integration with the Fusion ERP. OK, that’s not directly data… but it sure generates plenty of it. 🙂
I am a retail data specialist, hence landing on this page at all. The above is bigger than any item in the data list in this sector due to its wider consequences. [with my apologies for going slightly off beam]