Microsoft Products vs Hadoop/OSS Products
Microsoft’s end goal is for Azure to become the best cloud platform for customers to run their data workloads. This means Microsoft will provide customers the best environment to run their big data/Hadoop workloads, as well as a place where Microsoft can offer services with our unique point of view. A key decision point when considering Hadoop is whether the customer wants to use open source technologies. Some of the benefits of running open source software (OSS) on Azure include:
- Quick installs
- Support
- Easy scale
- Products work together
- Don’t need to get your own hardware
To determine the cost savings by moving your OSS to Azure, see the Total Cost of Ownership (TCO) Calculator.
Of course, there are many benefits to using Microsoft products over OSS, such as ease of use, support, better security, a larger pool of people with the relevant skills, less frequent version updates, more stability (fewer bugs), and better compatibility and integration between products. But there are still reasons to use OSS (e.g., cost, faster performance in some cases, and a wider selection of products and features), so I created a list that shows many of the Microsoft products and their equivalent, or close equivalent, Hadoop/OSS product.
I tried to list only Apache products unless there was no equivalent Apache product or there was a really popular OSS product (updated 1/30/21).
Microsoft Product | Hadoop/Open Source Software Product |
Office365/Excel | OpenOffice/Calc |
Cosmos DB | MongoDB, MarkLogic, HBase, Cassandra |
SQL Database | SQLite, MySQL, PostgreSQL, MariaDB, Apache Ignite |
Azure Data Lake Analytics/YARN | None |
Azure VM/IaaS | OpenStack |
Blob Storage | HDFS, Ceph (Note: These are distributed file systems, whereas Blob Storage is an object store rather than a file system) |
Azure HBase | Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion |
Event Hub | Apache Kafka |
Azure Stream Analytics | Apache Storm, Apache Spark Streaming, Apache Flink, Apache Beam, Twitter Heron |
Power BI/Reporting | Apache Zeppelin, Jupyter, Apache Superset, Preset (pay), Kibana |
Power BI/Cubing engine | Arcadia Data (pay) |
HDInsight | Hortonworks (pay), Cloudera (pay), MapR (pay) |
Azure ML (Machine Learning) | Apache Mahout, Apache Spark MLlib, Apache PredictionIO |
Microsoft R Open | R |
Azure Synapse Analytics/Interactive queries | Apache Hive LLAP, Presto, Apache Spark SQL, Apache Drill, Apache Impala, Databricks SQL Analytics, Apache Arrow |
IoT Hub | Apache NiFi |
Azure Data Factory | Apache Falcon, Apache Airflow, Apache Oozie, Azkaban, data build tool (dbt), dbt Cloud (pay), Astronomer (pay), Prefect, Luigi, Argo, Kubeflow, Dagster, Flyte |
Azure Data Lake Storage Gen2/WebHDFS | Apache Ozone |
Azure Analysis Services/SSAS | Apache Kylin, Apache Druid, AtScale (pay) |
SQL Server Reporting Services | None |
Hadoop Indexes | Jethro Data (pay) |
Azure Purview | Apache Atlas, Amundsen |
PolyBase | Apache Drill |
Azure Cognitive Search | Apache Solr, Elasticsearch (Azure Search is built on ES) |
SQL Server Integration Services (SSIS) | Talend Open Studio, Pentaho Data Integration |
Microsoft Common Data Model (Storage overlay) | Delta Lake, Apache Hudi, Apache Iceberg |
Others | Apache Ambari (manage Hadoop clusters), Apache Ranger (data security such as row/column-level security), Apache Knox (secure entry point for Hadoop clusters), Apache Flume (collecting log data) |
Many of the Hadoop/OSS products are available in Azure. If you feel I’m missing some products from this list, please let me know as this is very subjective and comments are always welcome!