Cover

Can't We Just Get Along: The Spurious NoSQL Battle

August 3, 2014
No Comments.

The first blog post I ever wrote was a short one on databases. In fact for those of you who haven’t been following me for more than ten years, my old domain was “ADOGuy.com”. I wrote about ADO and ADO.NET pretty exclusively for years.

These days I’m more known for web and XAML than data but it’s something that touches most developers so I continue to watch the trends. Of course the NoSQL v. Relational Database  is the current fight in that space (taking over from the decade long ORM or no ORM skirmish). These fights seem awfully silly in the big picture because the answer is usually “it depends” or even “it’s both…” Let me see if I can convince you that the argument in itself is wrong.

Comparing NoSQL to Relational Databases

Let’s start by explaining what I mean by NoSQL and Relational, just to be clear. Relational databases (e.g. SQL Server, Oracle, DB2, and MySQL) store their data as two-dimensional arrays (e.g. Tables). Each of these tables support a way to relate to other tables. NoSQL databases (e.g. MongoDB, CouchDB, RavenDB, etc.) are really document stores. They tend to store their data as hierarchies. It is important to note that not all NoSQL data stores are document databases or use hierarchies, but most of them seem to. Why does this storage method matter?

NoSQL and relational are also different because of the way you access data. In general this means using SQL and not using SQL. I don’t think difference matters a whole lot. Some NoSQL databases invent their own language for search but for most simple data retrieval it is just retrieving hierarchies by ID.

Another big difference is the nature of schema. The E.F. Codd paper back in the early 1970s codified the idea of separating the design of the structure of your data storage. Essentially this introduced the idea of logical organization of the data inside a system. This was the backbone of most relational databases. To many of the NoSQL database designers, schema got in the way of dealing with data. Their argument was that code could ensure the structure of the data and that having the data storage be responsible for both. As data structures changed over time, the code could deal with these changes instead of having to go back to historical data and change the schema of all the data in a data store. Because of this, most NoSQL data stores don’t have rules for the structure of data. The data stores become just a serialization format.

Lastly, the deployment of NoSQL and relational databases can be very different. NoSQL advocates tend to say that they are easier to manage for ‘scale-out’ but I am not that convinced that it is that different. It feels different because doing fail-over and sharding is managed differently. But relational servers does not imply scale-up (e.g. adding memory and processors) as the only solution. There are plenty of companies that ‘shard’ their SQL servers too, though it tends to require more forethought than NoSQL.

As you can see, NoSQL and relational databases have different approaches to storing, changing and querying data. They each have their upsides as well as downsides. Like most things, both approaches contain many compromises to achieve their overall goals.

A Bit of History

The early 1970’s were an amazing time in computer science. Many new ideas were being put forward that still have reverberations in current-day development. One of these was the idea of a “relational data model”. In its wake, several important database systems were created including Ingres and System R. These were the forbearers of SQL Server, Oracle, DB2 and others. When relational databases were proposed to replace the existing storage systems in use, developers scoffed at the newly minted ideas. It took relational databases a while to take hold. But by the 1980’s they were the backbone of most of the software used by big businesses (though older COBOL, RPG and other systems were still in use like they are today).

As relational databases took hold, their limitations didn’t go un-noticed. Relational databases were meant to solve a common problem that happened: database corruption. The tenets of relational databases were there to help create consistency in data. But because of this they tended to be tuned for writing and not reading. Ultimately the loss of data was the big problem here and by tuning for writing, systems that were creating a lot of new data (or changing data) could be relied on more.

But in parallel to the development of relational database, there was always a need to have data stores that were tuned for reading or deferred writes. In fact, back in 2000 (yeah, fourteen years ago) – I wrote an article for MSDN magazine (then MSJ magazine, and the article is now just a dead link on Microsoft’s website) that pushed this very idea: use LDAP to store your data since LDAP is tuned for reading.

In reality, many of us have actually solved this by writing our own databases. We didn’t call them database, we called them “Cache Servers”, “Data Caches” or other some nomenclature. But we were basically writing what is now a NoSQL data store. So the idea of NoSQL isn’t all that new. We’ve always needed data storage that was fast and tuned for reading. We just didn’t think of them as databases.

Let’s talk about how document databases and relational databases think about data differently.

“Isn’t Most Data Made Up of Hierarchies?”

When people are taught about relational data, it usually starts with a diagram something like this:

nosql_1

Using the idea of a table of customers, table of orders and then a table of order details. The relationships between these business ideas is pretty simple. When you’re taught about relational databases in this way, it seems obvious that the data in a relational data might be just a simple hierarchy:

nosql_2

But this breaks down pretty quickly because it isn’t a hierarchy. We want it to be a hierarchy because we live in hierarchies. Class diagrams, results in REST, XML…they are all developer ideas that feel right in our world. But if we take this simple example just one step further, we can see that adding a product to the order detail breaks the hierarchy idea:

nosql_3

This is broken because we need the join. Joins are at the center of all of this. Even in NoSQL, at some point you’ll either need to join two documents together or store redundant data. In this example, it could just store the product…but if you do, you can’t easily check for availability, backordered items, etc.

Of course, not everything breaks this way. There are lots of great examples of true documents that make more sense to store as an atomic item than try to normalize them into a table structure. Back when I wrote the LDAP article, I was doing this because I was cataloging a million healthcare articles. If we broke up each article into the independent bits, the database would have been useful, but we’d have 30 joins to get each article back together. Each of these was an actual document and should have been stored that way.

But we had a problem. People needed to search for articles based on some of the document’s properties (e.g. author(s), keyword, date, etc.) So we pulled out these searchable elements into something that could be queried easily (in our case that was an LDAP server). This worked because we knew that once an article went into the system, the amount of change an article would have was pretty low. We would benefit from the speed of search and retrieval. Changes into the system were low priority and rare. That’s what made a NoSQL-like data store so perfect.

I Blame ORMs

A lot of the focus for developers to think about NoSQL and relational databases in an adversarial way goes back to object relational mappers. Most developers don’t want to be database developers. There has been so much code written in the attempt to make data an afterthought. Developers seem to just want ‘serialization’ not data. Lots has been written about the dissonance between relational models and object models. Trying to get one into the other has been a laudable goal. In fact, many ORMs of today are trying to hide all that information from you.

In that light, NoSQL data stores seem perfect. You take an object graph and pass it in, and it just saves it. No ORM needed because there is no mapping. You’re storing data closer to the way you use it. But as we’ve seen not all data is correctly modeled as a hierarchy or object graph.

The fact that developers seem to want to go to any length to avoid building databases and want the persistence ignorance that ORMs and NoSQL affords them, we’re going to be stuck. The fact is that today there are better tools for what to do with data once you have it. Developers seem to focus on creating and modifying data, but not analyzing it. But being able to do real reporting and business intelligence over your large data stores (e.g. data warehousing or data marts) is difficult today in the NoSQL world. If that is the type of data use you see happening eventually, escape from the persistence ignorance fallacy and build something that fulfills all your customer’s needs…not just the programming requirements. Think bigger. See the entire use case of your data.

Knowing the Roles

What does all of this tell us? To me it makes clear a simple idea: both data stores are good for their respective strengths. Understanding the requirements of your data storage needs is what will tell you about what database to use. Or in reality, where to use each. I suspect for most projects of any size, they can benefit from both relational and document data stores. Knowing how to mix and match these great technologies will be the challenge over the next few years.

Ultimately if we can stop the bickering about which is better and understand that they both have their appropriate benefits and weaknesses, we can focus on making better systems…not arguing about nonsensical dogma of why one is better than the other.

What do you think?