NoSQL Database Notes

When to use NoSQL databases, especially MongoDB

You expect a high write load.
You need high availability in an unreliable environment.
You need to grow big.
You need to shard data.
Your data is location-based (sharding).
Your data set is going to be big, but the schema won't be stable.
You don't have a dedicated database administration person.

Document	MongoDB, CouchDB
Wide-column	Cassandra, HBase, Amazon SimpleDB
Key-value	Riak, Redis, DynamoDB, Voldemort, MemcacheDB
Graph	Neo4j, OrientDB
Search	(persistent-store) Lucene, Solr, Elasticsearch

ACID and CAP

ACID: atomic (all or none), consistent (what's written is valid), isolated (one operation at a time) and durable (once committed, it's there).

CAP: consistent (all clients always have the same view of the data), available (the service is always up: each client can always read and write) and partition-tolerant (the system keeps working despite physical network partitions; only a complete network failure prevents a response).

Most NoSQL databases deliver only 2 of the 3. MongoDB, for instance, is (eventually) consistent and partition-tolerant. Examples:

CA	MySQL, PostgreSQL, Oracle, SQL Server, Vertica
AP	Cassandra, SimpleDB, CouchDB, Riak, Dynamo, Voldemort
CP	MongoDB, MemcacheDB, Redis, HBase

Random advice

If you shard heavily and only use key/value look-ups, then Riak is probably easier to manage on a large scale. In fact, the Bump post says exactly this: "We decided to move to Riak because it offers better operational qualities than MongoDB...Nagios will email us instead of page us,..."

If you're using MongoDB heavily as a cache, you may end up moving to Membase, Redis or HBase.

If you start using MongoDB as a queue, you will eventually want to look at real queuing systems: RabbitMQ, ActiveMQ, ZeroMQ, etc.

If you start using MongoDB to do search, you will eventually find that things like Solr and Sphinx are better suited to resolve the full spectrum of search queries.

The key here is that MongoDB is really a set of trade-offs. Many of the databases above are very specific in what they do. MongoDB is less specific but is serviceable for handling many different cases.

Couchbase

MongoDB is superbly adapted to reads, but it requires a serious amount of understanding and effort (especially in picking suitable shard keys) to ensure write scalability.

    MongoDB (2007) -----------------------------------------------------------> MongoDB

CouchDB (2005) -------------------------------+ (personnel-only)
                                              |
                                              v
NorthScale (?, memcache) -------> Membase ------------------------------------> Couchbase

Couchbase claims a JavaScript idiom.

MongoDB has done a great job with sharding, something that was mostly an add-on feature back in the days of traditional RDBMS. However, MongoDB is dedicated to the document. Sharding works well, but a) it's challenging to choose the best shard key and b) "document" implies operations weighted more heavily toward reads than writes.

Couchbase is more recent, stands on the shoulders of other NoSQL work and was built specifically to solve the difficulty of sharding through a sort of "auto-sharding" that balances write loads.
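
Why the shard key matters for writes can be shown with a toy model. This is only a sketch under my own assumptions (four shards, range chunks of 1000 keys, MD5 as the hash, all names hypothetical); it is not how MongoDB or Couchbase actually place data. It compares range-based and hash-based placement for a burst of monotonically increasing keys such as timestamps or counters.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size


def range_shard(key: int, chunk_size: int = 1000) -> int:
    """Range-based placement: a contiguous run of keys lands on one shard."""
    return (key // chunk_size) % NUM_SHARDS


def hashed_shard(key: int) -> int:
    """Hash-based placement: a digest of the key picks the shard."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# A burst of 1000 inserts whose keys only ever increase.
burst = range(5000, 6000)
range_load = Counter(range_shard(k) for k in burst)
hash_load = Counter(hashed_shard(k) for k in burst)

print("range-based:", dict(range_load))  # every write hits the same shard
print("hash-based: ", dict(hash_load))   # writes spread over all shards
```

With the range scheme the whole burst lands on a single shard, which is the write hot spot a poorly chosen shard key produces. Hashing spreads the writes, at the price of turning range queries on that key into scatter-gather reads.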

There are a lot of good things about Couch. The multi-master replication is very nice, though it requires some special handling for the inevitable conflicts that occur when the same data is changed in multiple places before replication occurs. Without that handling you'll potentially lose track of some of your data (not technically lose it, since it's still there).
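
The "lose track of, but not lose" behaviour can be sketched with a toy revision model. This is an illustration under my own assumptions, not Couch's actual replication algorithm: the `Doc` class, the revision-id format and the tie-breaking rule are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    value: str
    revs: list = field(default_factory=list)       # revision ids, oldest first
    conflicts: list = field(default_factory=list)  # losing values, kept for review


def edit(doc: Doc, new_value: str, node: str) -> Doc:
    """Return an edited copy with a new revision id appended."""
    rev = f"{len(doc.revs) + 1}-{node}"
    return Doc(new_value, doc.revs + [rev], list(doc.conflicts))


def replicate(a: Doc, b: Doc) -> Doc:
    """Merge two replicas of the same document."""
    if a.revs[:len(b.revs)] == b.revs:  # b is an ancestor of a: a is newer
        return a
    if b.revs[:len(a.revs)] == a.revs:  # a is an ancestor of b: b is newer
        return b
    # The histories diverged. Pick the winner by a rule every node can apply
    # identically, and keep the loser so the application can resolve it later.
    winner, loser = sorted((a, b), key=lambda d: (len(d.revs), d.revs[-1]),
                           reverse=True)
    return Doc(winner.value, winner.revs, winner.conflicts + [loser.value])


base = Doc("phone: 555-0100", ["1-origin"])
on_a = edit(base, "phone: 555-0111", "a")  # changed on node a
on_b = edit(base, "phone: 555-0122", "b")  # changed on node b before replication
merged = replicate(on_a, on_b)
print(merged.value, merged.conflicts)
```

The losing value is still stored in `conflicts`, but nothing surfaces it unless the application goes looking, which is the special handling mentioned above.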

The main downside (or upside, depending on how you look at it) to Couch is that whatever types of lookups you want must be defined as "views." A view is basically an indexed map/reduce output, though it's actually only the map that is indexed. It's very powerful, but in some ways limited.
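
To make the idea of a view concrete, here is a minimal sketch in Python (real Couch views are written as JavaScript functions; every name below is hypothetical). A map function emits key/value rows for each document, the rows are kept sorted by key, and a lookup is a binary search over that sorted index.

```python
import bisect


def by_city(doc):
    """Map function: emit (key, value) pairs for one document."""
    if "city" in doc:
        yield doc["city"], doc["name"]


def build_view(docs, map_fn):
    """Run the map function over every document and sort the rows by key."""
    rows = [(key, value, doc_id)
            for doc_id, doc in docs.items()
            for key, value in map_fn(doc)]
    rows.sort()
    return rows


def query(view, key):
    """Binary-search the sorted rows for all values emitted under one key."""
    keys = [row[0] for row in view]
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return [value for _, value, _ in view[lo:hi]]


docs = {
    "u1": {"name": "Ann", "city": "Oslo"},
    "u2": {"name": "Bob", "city": "Rome"},
    "u3": {"name": "Cy", "city": "Oslo"},
    "u4": {"name": "Dee"},  # no city, so the map function emits nothing
}
view = build_view(docs, by_city)
print(query(view, "Oslo"))  # ['Ann', 'Cy']
```

The limitation shows up as soon as you want to look users up by name instead: that needs a second map function and a second index, defined ahead of time.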

What MongoDB has over Couchbase is that it is a lot more flexible and is a good middle ground, providing a lot of the query capabilities of a traditional RDBMS while still giving the flexibility of a JSON document store. Couchbase requires a major adjustment in thinking since you can't just do look-ups on arbitrary fields; basically you have to manage the indices explicitly, using JavaScript functions. On the other hand, it is possible to tie in something like Couch-Lucene to gain full-text search, so there is a lot of flexibility there if you're willing to do the work for it.

Couchbase to MongoDB

Couchbase is key-value pairs; there is no indexing of secondary fields possible. This is one way in which a document database like MongoDB is superior.

Viber, MongoDB and Couchbase

Is what you're doing more write- or read-intensive?

MongoDB is really about documents at the ready for reading, with redundancy as backup. Sharding was created to handle write-intensity, but if writing is the lion's share of what you're doing, you may wish to choose something else.

In a recent, high-profile example, Israeli company Viber abandoned MongoDB after years of swearing by it because of its very high write volume.

http://www.severalnines.com/news/article/database-clustering/viber-explains-switch-from-mongodb-to-couchbase/801696332

This said, Couchbase is, according to what I've heard, pretty nasty in the "finding and reading objects" department. Here they use Cassandra though I'm not confident they arrived at the decision in the right way. I've been told mostly that it works better with Amazon S3, but I'm pretty sure they're wrong about that and just didn't know what they were doing. Here, it's read-intensive, so I think they've made a mistake, but I'm nobody anyway.

There's also old-fashioned RDBMS like MySQL and PostgreSQL (and Oracle, hehehe), but these aren't so good for writing to and their sharding solutions are add-ons. They also do objects with more difficulty (though the problem's been mastered via Hibernate in Java).

MongoDB & Bitcoin: How NoSQL Design Flaws Brought Down Two Exchanges

http://java.dzone.com/articles/mongodb-bitcoin-how-nosql

Four major groupings of NoSQL

Key-value stores, e.g.: Redis
Document databases, e.g.: MongoDB
Columnar stores, e.g.: Cassandra
Graph databases, e.g.: Neo4j

Each uses a different approach with different features and drawbacks. Motivations:

Massive write scaling (sharding) needed—more than what a single server can provide.
Only simple data-access patterns needed.

What you give up (in theory or in practice):

Powerful query language.
Sophisticated query optimizer.
Data normalization.
Joins (see data normalization).
Referential integrity.
Durability.

Additional downsides...

Schemaless data requires complex client-side knowledge to process.

MongoDB: not all sunshine and rainbows

Here are some downsides to MongoDB, excerpted from an article.

Data consistency/durability and performance

This is a common tradeoff people make when using MongoDB to achieve high performance. We ended up making the tradeoff too by specifying the most aggressive write concern (error ignored) and read preference (secondaryPreferred) for the most performance demanding modules. We would rather not do that if MongoDB could give us both strong data consistency/durability and high performance. The cost of the tradeoff was potential data loss and data inconsistency. Although for these modules minor data loss or temporary data inconsistency is acceptable, we want to react quickly if the situation gets worse. That was why we ended up building comprehensive monitoring support to watch the data and replication lag closely.
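
The risk described in that excerpt can be illustrated with a toy replica set. This is a sketch under my own assumptions (one primary, one secondary, replication only when `sync()` is called; the class and its methods are hypothetical and are not a MongoDB driver API). It shows why unacknowledged writes combined with secondary reads give stale reads and possible data loss, and why replication lag is worth monitoring.

```python
class ToyReplicaSet:
    """One primary and one secondary; replication happens only on sync()."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = []  # writes not yet applied to the secondary

    def write(self, key, value):
        """Fire-and-forget write: applied on the primary, nothing is awaited."""
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key, prefer_secondary=True):
        node = self.secondary if prefer_secondary else self.primary
        return node.get(key)

    def replication_lag(self):
        """Number of writes the secondary has not seen yet."""
        return len(self.pending)

    def sync(self):
        for key, value in self.pending:
            self.secondary[key] = value
        self.pending.clear()

    def primary_crash(self):
        """The primary dies before replicating; the secondary takes over."""
        lost = list(self.pending)
        self.pending.clear()
        self.primary = dict(self.secondary)
        return lost


rs = ToyReplicaSet()
rs.write("views", 1)
rs.sync()
rs.write("views", 2)                             # not replicated yet

stale = rs.read("views")                         # 1, read from the secondary
fresh = rs.read("views", prefer_secondary=False) # 2, read from the primary
lag = rs.replication_lag()                       # 1 write behind
lost = rs.primary_crash()                        # the unreplicated write is gone
print(stale, fresh, lag, lost)
```

A monitor that alerts when `replication_lag()` grows is the toy equivalent of the monitoring support the excerpt describes.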

More normalized data and fewer network round trips

In MongoDB, there is no support for joins. If the data is highly normalized, then the client application has to issue more queries (more network round trips that add latency) to fetch data from different collections (tables). We had to de-normalize the data in DPS to reduce network round trips. The cost was that same data could be scattered in different collections, which not only occupied more disk space but also could easily lead to data inconsistency. Applications also needed to do busy work duplicating data in different collections, which became frustrating at times. It is a good idea to carefully design your schema in order to make the right tradeoffs for your application.
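
The trade-off in that excerpt can be sketched with an in-memory stand-in for a document store that counts queries. Everything here is hypothetical (the `CountingStore` class, the collection names, the sample data); it only illustrates the shape of the problem.

```python
class CountingStore:
    """In-memory stand-in for a document database that counts round trips."""

    def __init__(self, collections):
        self.collections = collections
        self.round_trips = 0

    def find_one(self, collection, doc_id):
        self.round_trips += 1
        return self.collections[collection][doc_id]


def load_normalized(store, order_id):
    """Client-side join: one query per collection involved."""
    order = store.find_one("orders", order_id)
    customer = store.find_one("customers", order["customer_id"])
    return {"order": order_id, "customer_name": customer["name"]}


def load_denormalized(store, order_id):
    """The customer name is copied into the order, so one query is enough."""
    order = store.find_one("orders_denorm", order_id)
    return {"order": order_id, "customer_name": order["customer_name"]}


store = CountingStore({
    "customers": {"c1": {"name": "Ann Smith"}},
    "orders": {"o1": {"customer_id": "c1"}},
    "orders_denorm": {"o1": {"customer_id": "c1", "customer_name": "Ann Smith"}},
})

load_normalized(store, "o1")
trips_normalized = store.round_trips  # 2

store.round_trips = 0
load_denormalized(store, "o1")
trips_denormalized = store.round_trips  # 1

# The cost: renaming the customer in one place leaves the copy stale.
store.collections["customers"]["c1"]["name"] = "Ann Lee"
current = load_normalized(store, "o1")["customer_name"]    # "Ann Lee"
stale = load_denormalized(store, "o1")["customer_name"]    # "Ann Smith"
print(trips_normalized, trips_denormalized, current, stale)
```

Keeping the copies in step is the "busy work" the excerpt complains about: every rename has to be written to each collection that duplicates the field.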

Giving up multi-document or multi-collection transactions

Using write concerns and read preferences can mitigate some of the data consistency and durability problems without using transactions, but it cannot guarantee atomic update across multiple documents or multiple collections. As of now, we are still not confident to use MongoDB in Perfect Market Vault (a web based admin tool that is used by both internal administrators and external partners) because due to the nature of the data Vault manages the requirement on multi-document or multi-collection data consistency for Vault is much more demanding than it is for DPS.
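
What "no atomic update across multiple documents" means in practice can be shown with a toy transfer between two documents. This is a sketch with made-up names (`accounts`, `transfer`, `Crash`), modelling each single-document update as atomic on its own and nothing more.

```python
class Crash(Exception):
    """Stands in for a client or network failure between two updates."""


def transfer(store, src, dst, amount, crash_between=False):
    """Two independent single-document updates; no transaction spans them."""
    store[src]["balance"] -= amount  # update 1 is applied
    if crash_between:
        raise Crash("client died between the two updates")
    store[dst]["balance"] += amount  # update 2 never runs after a crash


def total(store):
    return sum(doc["balance"] for doc in store.values())


accounts = {"alice": {"balance": 100}, "bob": {"balance": 50}}
before = total(accounts)  # 150

try:
    transfer(accounts, "alice", "bob", 30, crash_between=True)
except Crash:
    pass

after = total(accounts)  # 120: the debit happened, the credit did not
print(before, after)
```

No write concern or read preference repairs this half-applied state; the application has to detect it and compensate, which is why the stricter Vault use case was held back.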