Relational databases vs Non-relational databases
I see a lot of confusion about the place and purpose of the many new database solutions (“NoSQL databases”) compared to the relational databases solutions that have been around for many years. So let me try to explain the differences and best use cases for each.
First lets clarify these database solutions into two groups:
1) Relational databases, which can also be called relational database management systems (RDBMS) or SQL databases. The most popular of these are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2. These RDBMS’s are mostly used in large enterprise scenarios, with the exception of MySQL, which is mostly used to store data for web applications, typically as part of the popular LAMP stack (Linux, Apache, MySQL, PHP/ Python/ Perl).
2) Non-relational databases, also called NoSQL databases, the most popular being MongoDB, DocumentDB, Cassandra, Coachbase, HBase, Redis, and Neo4j. These databases are usually grouped into four categories: Key-value stores, Graph stores, Column stores, and Document stores (see Types of NoSQL databases).
All relational databases can be used to manage transaction-oriented applications (OLTP), and most non-relational databases that are in the categories Document stores and Column stores can also be used for OLTP, adding to the confusion. OLTP databases can be thought of as “Operational” databases, characterized by frequent, short transactions that include updates and that touch a small amount of data and where concurrency of thousands of transactions is very important (examples including banking applications and online reservations). Integrity of data is very important so they support ACID transactions (Atomicity, Consistency, Isolation, Durability). This is opposed to data warehouses, which are considered “Analytical” databases characterized by long, complex queries that touch a large amount of data and require a lot of resources. Updates are infrequent. An example is analysis of sales over the past year.
Relational databases usually work with structured data, while non-relational databases usually work with semi-structured data (i.e. XML, JSON).
Let’s look at each group in more detail:
Relational Databases
A relational database is organized based on the relational model of data, as proposed by E.F. Codd in 1970. This model organizes data into one or more tables (or “relations”) of rows and columns, with a unique key for each row. Generally, each entity type that is described in a database has its own table with the rows representing instances of that type of entity and the columns representing values attributed to that instance. Since each row in a table has its own unique key, rows in a table can be linked to rows in other tables by storing the unique key of the row to which it should be linked (where such unique key is known as a “foreign key”). Codd showed that data relationships of arbitrary complexity can be represented using this simple set of concepts.
Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.
The reasons for the dominance of relational databases are: simplicity, robustness, flexibility, performance, scalability and compatibility in managing generic data.
But to offer all of this, relational databases have to be incredibly complex internally. For example, a relatively simple SELECT statement could have dozens of potential query execution paths, which a query optimizer would evaluate at run time. All of this is hidden to users, but under the hood, the RDBMS determines the best “execution plan” to answer requests by using things like cost-based algorithms.
For large databases, especially ones used for web applications, the main concern is scalability. As more and more applications are created in environments that have massive workloads (i.e. Amazon), their scalability requirements can change very quickly and grow very large. Relational databases scale well, but usually only when that scaling happens on a single server (“scale-up”). When the capacity of that single server is reached, you need to “scale-out” and distribute that load across multiple servers, moving into so-called distributed computing. This is when the complexity of relational databases starts to cause problems with their potential to scale. If you try to scale to hundreds or thousands of servers the complexities become overwhelming. The characteristics that make relational databases so appealing are the very same that also drastically reduce their viability as platforms for large distributed systems.
Non-relational databases
A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
Motivations for this approach include:
- Simplicity of design. Not having to deal with the “impedance mismatch” between the object-oriented approach to write applications and the schema-based tables and rows of a relational database. For example, storing all the customer order info in one document as opposed to having to join many tables together, resulting in less code to write, debug, and maintain
- Better “horizontal” scaling to clusters of machines, which solves the problem when the number of concurrent users skyrockets for applications that are accessible via the web and mobile devices. Using documents makes it much easier to scale-out as all the info for that customer order is contained in one place as opposed to being spread out on multiple tables. NoSQL databases automatically spread data across servers without requiring application changes (auto-sharding), meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption
- Finer control over availability. Servers can be added or removed without application downtime. Most NoSQL databases support data replication, storing multiple copies of data across the cluster or even across data centers, to ensure high availability and disaster recovery
- To easily capture all kinds of data “Big Data” which include unstructured and semi-structured data. Allowing for a flexible database that can easily and quickly accommodate any new type of data and is not disrupted by content structure changes. This is because document database are schemaless, allowing you to freely add fields to JSON documents without having to first define changes (schema-on-read instead of schema-on-write). You can have documents with a different number of fields than other documents. For example, a patient record that may or may not contain fields that list allergies
- Speed. The data structures used by NoSQL databases (i.e. JSON documents) differ from those used by default in relational databases, making many operations faster in NoSQL than relational databases due to not having to join tables (at the cost of increased storage space due to duplication of data – but storage space is so cheap nowadays this is usually not an issue). In fact, most NoSQL databases do not even support joins
- Cost. NoSQL databases usually use clusters of cheap commodity servers, while RDBMS tend to rely on expensive proprietary servers and storage systems. Also, the licenses for RDBMS systems can be quite expensive while many NoSQL databases are open source and therefore free
The particular suitability of a given NoSQL database depends on the problem it must solve.
NoSQL databases are increasingly used in big data and real-time web applications. They became popular with the introduction of the web, when databases went from a max of a few hundred users on an internal company application to thousands or millions of users on a web application. NoSQL systems are also called “Not only SQL” to emphasize that they may also support SQL-like query languages.
Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance. Some reasons that block adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and huge investments in existing SQL. Also, most NoSQL stores lack true ACID transactions or only support transactions in certain circumstances and at certain levels (e.g., document level). Finally, RDBMS’s are usually much simpler to use as they have GUI’s where many NoSQL solution use a command-line interface.
Comparing the two
One of the most severe limitations of relational databases is that each item can only contain one attribute. If we use a bank example, each aspect of a customer’s relationship with a bank is stored as separate row items in separate tables. So the customer’s master details are in one table, the account details are in another table, the loan details in yet another, investments in a different table, and so on. All these tables are linked to each other through the use of relations such as primary keys and foreign keys.
Non-relational databases, specifically a database’s key-value stores or key-value pairs, are radically different from this model. Key-value pairs allow you to store several related items in one “row” of data in the same table. We place the word “row” in quotes because a row here is not really the same thing as the row of a relational table. For instance, in a non-relational table for the same bank, each row would contain the customer’s details as well as their account, loan and investment details. All data relating to one customer would be conveniently stored together as one record.
This seems an obviously superior method of storing data, but it has a major drawback: key-value stores, unlike relational databases, cannot enforce relationships between data items. For instance, in our key-value database, the customer details (name, social security, address, account number, loan processing number, etc.) would all be stored as one data record (instead of being stored in several tables, as in the relational model). The customer’s transactions (account withdrawals, account deposits, loan repayments, bank charges, etc.) would also be stored as another single data record.
In the relational model, there is an built-in and foolproof method of ensuring and enforcing business logic and rules at the database layer, for instance that a withdrawal is charged to the correct bank account, through primary keys and foreign keys. In key-value stores, this responsibility falls squarely on the application logic and many people are very uncomfortable leaving this crucial responsibility just to the application. This is one reason why relational databases will continued to be used.
However, when it comes to web-based applications that use databases, the aspect of rigorously enforcing business logic is often not a top priorities. The highest priority is the ability to service large numbers of user requests, which are typically read-only queries. For example, on a site like eBay, the majority of users simply browse and look through posted items (read-only operations). Only a fraction of these users actually place bids or reserve the items (read-write operations). And remember, we are talking about millions, sometimes billions, of page views per day. The eBay site administrators are more interested in quick response time to ensure faster page loading for the site’s users, rather than the traditional priorities of enforcing business rules or ensuring a balance between reads and writes.
Relational-model databases can be tweaked and set up to run large-scale read-only operations through data warehousing, and thus potentially serve a large amount of users who are querying a large amount of data, especially when using relational MPP architectures like Analytics Platform System, Teradata, Oracle Exadata, or IBM Netezza, which all support scaling. As mentioned before, data warehouses are distinct from typical databases in that they are used for more complex analysis of data. This differs from the transactional (OLTP) database, whose main use is to support operational systems and offer day-to-day, small scale reporting.
However, the real challenge is the relational model’s lack of scalability when dealing with OLTP applications, or any solution with a lot of individual writes, which is the domain of relational SMP architectures. This is where non-relational models can really shine. They can easily distribute their data loads across dozens, hundreds and in extreme cases (think Google search) even thousands of servers. With each server handling only a small percentage of the total requests from users, response time is very good for each individual user. Although this distributed computing model can be built for relational databases, it is a real pain to implement, especially when there are a lot of writes (i.e OLTP), requiring techniques like sharding which usually requires significant coding outside of the application’s business logic. This is because the relational model insists on data integrity at all levels, which must be maintained, even as the data is accessed and modified by several different servers. This is the reason for the non-relational model as the architecture of choice for web applications such as cloud-computing and social networking.
So in summary, RDBMS’s suffer from no horizontal scaling for high transaction loads (millions of read-writes), while NoSQL databases solve high transaction loads but at the cost of data integrity and joins.
Keep in mind many solutions will use a combination of relational and non-relational databases (see What is Polyglot Persistence?).
Also keep in mind that you may not need the performance of a non-relational database and instead just going with storing files in HDFS and using Apache Hive will be enough (Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis that it provides via an SQL-like language called HiveQL).
And to end on a note that adds to the confusion, we have a another category forming called NewSQL: NewSQL is a class of modern RDBMS’s that seek to provide the same scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional relational database system. The disadvantages is they are not for OLAP-style queries, and they are inappropriate for databases over a few terabytes. Examples include VoltDB, NuoDB, MemSQL, SAP HANA, Splice Machine, Clustrix, and Altibase.
A picture showing the categories that many of the products fit into:
An excellent graphic that shows how all the technologies fit in the Azure cloud is from Understanding NoSQL on Microsoft Azure:
The bottom line for using a NoSQL solution is if you have an OLTP application that has thousands of users and has a very large database requiring a scale-out solution and/or is using JSON data, in particular if this JSON data has various structures. You also get the benefit of high availability as NoSQL solutions store multiple copies of the data. Just keep in mind you for performance you may sacrifice data consistency, as well as the ability to join data, use SQL, and to do quick mass updates.
More info:
MySQL vs. MongoDB: Looking At Relational and Non-Relational Databases
10 things you should know about NoSQL databases
Difference between SQL and NoSQL : Comparision
SQL vs NoSQL Database Differences Explained with few Example DB
NoSQL, NewSQL, or RDBMS: How To Choose
NoSQL vs NewSQL Databases Choose the Right Tool for the Right Job
SQL vs NoSQL: you do want to have a relational storage by default
Oracle Defends Relational DBs Against NoSQL Competitors
Understanding NoSQL on Microsoft Azure
Meet the Avant-Garde of New Relational Databases
To SQL or NoSQL? That’s the database question
What’s really new with NewSQL?
This is a good overview but the technologies are ever-changing. I’d argue that:
–not all columnstores are NoSQL/non-relational. HP Vertica is an MPP columnstore and its SQL is standard ANSI. You perform your physical modeling just like any standard data warehouse model.
–storage engines can now be swapped in/out for most products. Vertica can use HDFS natively. MySQL storage engines were swappable for at least 12 years. Products are becoming “hybrid” and “crossovers” to meet more use cases.
–the lines will further blur as NoSQL products add relational features and relational products add more NoSQL-like features. SQL Server added delayed transaction durability I’m sure as a response to a perceived shortcoming. And sql-like extensions are being added to the Hadoop stack daily. Remember when people were berated when they asked, “how can I turn off tran log?” Well, maybe we don’t always need the ability to roll back (or recover)? What heresy!!!
–many people are making in-memory work for huge data sets. It can be done and is likely the future for HANA and SAP’s BW product. I have a 100TB couchbase cluster that is all in-memory.
–NoSQL databases, while open source (free as in speech), are not always free (as in beer). You’ll definitely want vendor support and that ain’t cheap. I mean really, who actually understands how MapReduce works? Are you going to look at the source code at 3am?
–And if you are an ISV then you need to be aware that anything that is GPL’d will require you to either pursue a commercial license or open source your product. See MongoDB’s licensing terms. This scares ISVs…a lot.
–You actually can model your documents in a document store to be somewhat relational yet still avoid the overhead of JOINs. This is conceptually like nested sets and pointers. Likewise, I’ve seen lots of SQL Server databases using EAV patterns and XML columns that are not relational at all. This is where most people get confused that truly don’t understand these “new” non-relational technologies. There is logical modeling and then there is the physical implementation. The same logical model can often be expressed in a rowstore, columnstore, or document store.
–These NoSQL ideas aren’t “new” but are re-expressions of old ideas. Back before Codd we had “network” and “hierarchical” datastores and many hospitals still rely on them today (Mumps).
–I think the real motivation for “NoSQL” is the CAP Theorem and the fact that most SQL offerings have traditionally done a lousy job with it. Each product handles this uniquely differently and can often be tweaked for specific needs (MySQL has done this for 8 years, SQL Server now has delayed durability.) The CAP Theorem is the key. Most relational guys don’t get this. Yes, there are actually times where I may not mind some transactions being lost or replayed twice if I can process massive datasets realtime. Sometimes data isn’t isn’t your company’s most valuable asset.
–Every relational guy should understand the basics of this stuff to be able to speak somewhat intelligently about it and not just regurgitate FUD and non-truths. Likewise NoSQL guys that would express an accounting data model in Mongo need their heads examined.
To that end, great post.
Great points Dave, thanks for contributing!
Thank you James for putting this piece together. It is clear and concise written and will give any data “newbie” a very good picture of the data landscape we are in these days. I like you categorization of DB’s. It would be nice to add to it what is commercial and non-commercial use DB’s. Where would you place APS on your diagram?
Glad you liked it Louw! APS would fall under the same category as Netezza.
Nice article James, its cleared my doubts about RDBMS and non relational (NoSQL).
I am more interested in BigData technology,could you please lte me know which one is good to start.
I have 11+ ys of experiences in MSSQL and MSBI now I am looking in Analytics/analytical area could you advise me that should I change complete MS to Big or MS has own Big Data technology then what are those , so I can start looking these tech.
Thanks in Advance
Good Article, Really good explaination, there is link where i got first idea, i hop it might help some one.
http://growthefuturenow.com/relational-vs-non-relational-database/
Pingback:12 Core Competencies For Product Managers - Pendo blog
good contents sir….
such a good things that i learned through this article. this really helpful for someone who are intrested in databases,sql,BIGDATA……
thank you,
for such wonderfull article…..i must do share this article to my friends.
Great article! I am working on an executive overview that I need to present to my manager, can anybody point me to some other high-level discussions on this topic, I’m not a great writer (except when it comes to code :)) and I would like to see how some people describe it in a way that’s easy for non-technical folks … Thanks
I love how well you articulated the differences!
May I just add one more RDBMS, because I’m sure many people choosing between relational and non-relational are also stuck deciding on which software to use. The ones you listed are well known in North America, but I started using Tibero 6 recently and my company saved 50% on licensing fees. I think your readers need to know of cheaper versions, plus it has a better security 😉
You can find it here: http://www.tmaxsoft.com/cn_en/tibero_cn_en/
Carolina
I would suggest to put memory optimized rowstore of MemSQL under analytics and operational, as well as disk based columnstore of MemSQL under analytics
I would also provide here a link of our VP of engineering elaborating on use of rowstore vs columnstore: https://blog.memsql.com/should-you-use-a-rowstore-or-a-columnstore/
I think that a) SQL can scale to millions of writes and reads, MemSQL is a proof of this b) lines between operational and analytical workloads are blurring, due to use of memory and distributed architectures
NoSQL is ignoring the real need for joins. Let’s say General Motors has a work environment compensation package (I have no clue) and it is by type of worker. In SQL, this is in one table; in MongoDB, it’s a part of EVERY record for every worker around the world. One change to the compensation package would send MongoDB into a month-long tailspin to change it in every single worker’s record.
I’m still new to MongoDB so I really HOPE someone can address this, because arguing that the only reason for joins is to save space is a feeble one.
Be careful when you shoot down a product/solution/architecture given one single use case. It’s often a straw man. Let’s assume you absolutely would want to use Mongo in your use case…then I wouldn’t model the *physical* design in a relational manner with joins. Doc databases have made architectural decisions to avoid joins to gain benefits elsewhere. Joins are suboptimal in the physical implementation of a document-oriented store.
Assuming you really, REALLY wanted to use Mongo in your use case (which I wouldn’t) then you would want to have a “pointer” in the emp record that points to comp pkg lookup. You are then going to retrieve 2 documents. This is no different than your app today that makes multiple sql calls to point a webpage. This admittedly limits your ability to write reports. This again is a design decision for doc dbs.
As for “the only reason for joins is to save space”…this is a gross over-simplification. We use star schemas and data warehouse structures to avoid joins in relational dbs.
It’s important to understand when to use a technology…as well as when not to. It’s also important to understand architectural tradeoffs with these technologies. The best architects understand the strengths and limitations of various tools to ensure good decisions are being made
Pingback:SQL versus NoSQL databases | Big Data and Analytics
Pingback:Making sense of Microsoft technology | James Serra's Blog
Pingback:Making sense of Microsoft technology – Cloud Data Architect
Thanks for the clear and concise explanation! I’m researching possible cloud implementation of the R package and come from the desktop/system admin world so I have a lot of reading to do.
This was a great intro. Thanks.
Pingback:Distributed SQL – SQLServerCentral
Pingback:Distributed SQL | James Serra's Blog
Pingback:20 Punkte, die Sie bei der Erstellung Ihrer ersten Node.js-Anwendung beachten sollten
Pingback:What Types of Business Decisions Would an EIS Use AI For?