Wanted: Distributed Document Database

After experiencing both PostgreSQL and CouchDB, I believe I can list what I'd like most in a storage system.

Documents

The hash or dictionary is a shockingly useful data structure. A JSON document is a great choice for a record format that serializes easily.

HTTP/REST Interface

CouchDB/Mongo led the way here. GET and PUT easy for a web developer to grasp and takes nothing more than a generic http library and json parser in your app.

Easy Replicas

CouchDB  made replication dead-simple. It showed that multi-master can work and can be easy, as long as you have MVCC and a merge conflict approach that works for your needs. Using "newest-wins" is enough in many cases.

Another innovation from CouchDB is the _changes feed which provides a vector-clock indexed list of changes to the database and updates contiuously using long-polling.

Possibilities

If this storage system is setup into two layers, logic and storage, there are interesting posibilities for each. Node.js is my first thought for something that answers lots of HTTP requests. LevelDB is simplicity itself with only PUT, GET, and DELETE operations (possibly *too* simple). A new project, which is probably crazy given the number of interesting db projects out now, would be a useful excuse to code something in rust.

Further down the road

If the minimum version of this storage system were to exist, it would outgrow itself quickly. It would then need:

Stored Procedures

Of the many huge changes going from SQL to NoSQL, lack of group operations on the server side is a sneaky gotcha. I never use stored procedures in SQL - keep the storage dumb and put the complexity in the middleware. That being said, its easy to assume that 'DELETE from USERS where CREATION_DATE < '2012-01-01" has a no-sql equivalent, but there isnt!

The answer is to do the query alone and transfer the answer - all the users made before this year- into RAM of your application/webapp. Then iterate through the list and retrieve each record(!) to get the last revision number, then delete each record with a seperate HTTP DELETE using the record id and revision number. Its incredibly slow and inefficient. The CouchDB Bulk API exists but only addresses part of the problem.

This is more of a nice-to-have and could be implemented with uploaded javascript that is executed with access to the whole database and run on demand.

Distributed Queries

Map/reduce takes a huge shift in thinking but the rewards are great. A SQL query can only be as fast as the box that is executing the query, but a map/reduce query can divide the query itself into pieces to be farmed out.

tags: