Distributed SQL
A very brief history of databases for online transaction processing (OLTP) workloads starts with relational databases (RDBMS), which worked well for many years. With the advent of the internet and the need for millions of transactions per second, NoSQL filled that need, but is modeled in means other than the tabular relations used in relational databases and uses eventual consistency instead of ACID (among other differences). Then came NewSQL, which is a class of RDBMS that seeks to provide the scalability of NoSQL systems for OLTP workloads while maintaining the ACID guarantees of a traditional database system. Now the latest technology for databases is Distributed SQL.
A Distributed SQL database is a single relational database which replicates data across multiple servers. They are strongly consistent and most support consistency across racks, data centers, and wide area networks including cloud availability zones and cloud geographic zones. Distributed SQL databases typically use the Paxos or Raft algorithms to achieve consensus across multiple nodes. They are for OLTP applications and not for OLAP solutions. Sometimes distributed SQL databases are referred to as NewSQL but NewSQL is a more inclusive term that includes databases that are not distributed databases. Usually distributed SQL databases are built from the ground-up and NewSQL databases include replication and sharding technologies added to existing client-server relational databases like PostgreSQL. Some experts define Distributed SQL databases as a more specific subset of NewSQL databases. For more details on the differences between relational databases, non-relational databases (NoSQL), and NewSQL, see my blog Relational databases vs Non-relational databases.
A Distributed SQL database was really made possible because of the cloud – with it’s ability to easily scale and with its built-in resiliency and disaster recovery. You simply scale by adding more nodes, and each node can handle read or write queries. It can scale beyond just a single data center into multiple regions (even different clouds), but maintains one single logical database (so your applications uses one single connection string). It can survive the failure of a node or even a service disruption in an entire region. You can also geo-locate data near a user to reduce read/write latencies or comply with regulations.
To better understand how a Distributed SQL database works, I’ll go into details using CockroachDB, since it is one of the more popular ones (2nd most popular next to Amazon Aurora according to db-engines):
Every database consists of three layers, from bottom to top: Storage, SQL execution, and SQL Syntax. CockroachDB rearchitects the bottom two. Behind the scenes you will find most Distributed SQL solutions store the data differently than what you find in a relational database. For example, CockroachDB uses a key-value store (called Pebble). So data appears to the user to be in a relational database (PostgreSQL in the case of CockroachDB) but is actually in a key-value store to allow tables to be sharded (called ranges) across multiple regions. It uses the Raft algorithm to replicate data to three replicas in real-time (you can configure to use more than three replicas). Each of the three replicas can be placed anywhere, based on your applications requirements, such as diversity (balanced across nodes to improve resiliency), load (balance placement based on real-time usage to improve performance), or latency & geo-partitioning (to improve latency and satisfy regulations). It provides consistent transactions (ACID) so it can be used for things like financial transactions (unlike NoSQL which provides eventual consistency). If a node is added, CockroachDB automatically redistributes the replicas to even the load across the cluster based on the three replica placement heuristics just discussed. If a node goes down, it automatically replaces it with a new replica on an active node. The tradeoff to writing to three replicas is writes can take longer, especially if the nodes are in regions that are far apart. And reads can take longer if a query needs data that is in multiple nodes that are far apart (this is improved via push-down queries and cost-based optimizers). The key to improving latency is to have your ranges and replicas all close together via the three replica placement heuristics (or via other methods discussed at 9 Techniques to Build Cloud-Native, Geo-Distributed SQL Apps with Low Latency), but there are built-in optimizations in CockroachDB to help with distributed transaction performance (called Lazy Transaction Records and Write Pipelining).
The below slide from a CockroachDB video compares Distributed SQL with relational and NoSQL:
Distributed SQL databases include: MariaDB’s Xpand, CockroachDB, Google Cloud Spanner, YugabyteDB, MariaDB’s SkySQL, Azure Database for PostgreSQL-Hyperscale, Amazon Aurora, and PingCap TiDB. NoSQL databases include: MongoDB, Azure Cosmos DB, Cassandra, Coachbase, and HBase. NewSQL databases include VoltDB, NuoDB, MemSQL, SAP HANA, Splice Machine, Clustrix, and Altibase.
In my experience, I have yet to see a company that uses a Distributed SQL database. Considering the most popular one, Amazon Aurora, is only ranked #45 on db-engines means they are not that popular. They have certainly become more popular than NewSQL databases, which have really dropped in customer use. I can’t see Distributed SQL databases becoming much more popular than they are now, as there use case is small: if your OLTP application does not need to support millions of transactions per second, then a traditional relational database will do just fine. And most workloads that do need that type of performance have gone the NoSQL route, and I think it will be hard for Distributed SQL databases to cut into the NoSQL market which has been around a lot longer and is very popular.
More info:
Distributed SQL Takes Databases to the Next Level
Video Distributed SQL Architecture | Cloud Database | Distributed Database
SQL VS. NOSQL VS. NEWSQL: FINDING THE RIGHT SOLUTION
NoSQL vs. NewSQL vs. Distributed SQL: DZone’s 2020 Trend Report
A Decade in Review: Distributed SQL Takes the Stage as NewSQL Exits
Great explanation
Jim, where does the snowflake fits in?
Snowflake is a OLAP/DW solution so I did not consider it for this blog, which is for OLTP.
Interesting perspective you have on the future of distributed SQL databases: “I can’t see Distributed SQL databases becoming much more popular than they are now, as there use case is small”
How large do you think the Total Addressable Market (TAM) for Distributed SQL Databases is?