directed graphs and document storage

With the release of Google Buzz, i was reminded of the XFN friends network. I dusted off my XFN Spider and thought about replacing the SQL storage with a document database. There are many options for data storage these days, as described in this writeup

I'm still trying to fundamentally understand the differences between document storage and relational/sql storage. I figure this directed graph is a good opportunity to go further with that. I've decided to stick with couchdb as my document system of choice because its only retrieval mechanism is map/reduce. Once i get that down, i'll look at mongodb further and its custom query language with optional map/reduce.

A directed graph has nodes and links between nodes. Its a directed graph because a link goes from one node and to another node. A social network has people that "follow" each other. This can be modeled by two SQL tables. CREATE TABLE nodes (id int, name text); CREATE TABLE links (id int, from_node_id int, to_node_id int)

The following is a graph from my home page to twitter and delicious. My twitter page links back to my homepage while the delicious page does not.

nodes

idname
1donpark.org
2twitter.com
3delicious.com

links

idfrom_node_idto_node_id
11 2
12 1
11 3

Modeling this data with documents is remarkably similar to the SQL case. Here I'm focusing on creating relationships between documents. Documents can also contain entire related records but thats another topic. A database in couchdb holds documents. A document is a set of key/value pairs. Couchdb gives every document an ID automatically. Since couchdb has no schema, I'll list the data directly.

nodes

{_id: "ad7b377d501ae5f17f623e7853f89ccc", name: "donpark.org"}
{_id: "49c02d3dbffe3649cc9f1dbf99827d97", name: "twitter.com"}
{_id: "5b5a99f6799bbf984da4b11e62a434ab", name: "delicious.com"}

links

{_id: "9efd0174f043dead5319aaee3ac9df92", from_node_id: "ad7b377d501ae5f17f623e7853f89ccc", to_node_id: "49c02d3dbffe3649cc9f1dbf99827d97"}
{_id: "a227c15681639f83e26a93c2a6aa3522", _rev: "1-76c2de7086aa314f0971e41043b34924", from_node_id: "49c02d3dbffe3649cc9f1dbf99827d97", to_node_id: "ad7b377d501ae5f17f623e7853f89ccc"}
{_id: "dc77b57e71bb9b05af7bdd0ab7eb1001", _rev: "1-fe817df1eb7d6e5208f3ce1be2d622ab", from_node_id: "ad7b377d501ae5f17f623e7853f89ccc", to_node_id: "5b5a99f6799bbf984da4b11e62a434ab"}
Here I'm already exposing my newbieness with document storage. I created two databases, one for nodes and one for links. I'm not sure if thats a good division or not. Another approach is one database and a 'type' field so that both node and link records are in the same database. That means each view has to filter to that type which is what lead me to separate databases.

In each case, given a starting node, I want to follow the links to get each targeted node and return a list of those nodes, or at least a list of the IDs of those nodes.

Given a starting string of "donpark.org", the SQL solution is SELECT id FROM nodes WHERE name="donpark.org"; SELECT to_node_id FROM links WHERE from_node_id=1; The couchdb view solution is function(doc) { emit(doc.name, null); } then HTTP query this view, and use a key parameter of ?key="donpark.org". $ curl http://localhost:5984/directedgraphs/_design/nodes/_view/select {"total_rows":3,"offset":0,"rows":[ {"id":"5b5a99f6799bbf984da4b11e62a434ab","key":"delicious.com","value":null}, {"id":"ad7b377d501ae5f17f623e7853f89ccc","key":"donpark.org","value":null}, {"id":"49c02d3dbffe3649cc9f1dbf99827d97","key":"twitter.com","value":null} ]} $ curl http://localhost:5984/directedgraphs/_design/nodes/_view/select?key='"donpark.org"' {"total_rows":3,"offset":1,"rows":[ {"id":"ad7b377d501ae5f17f623e7853f89ccc","key":"donpark.org","value":null} ]} The point of a view is to take pieces of the document to be the key and value in a list of key/values. The key is strategically chosen to support finding a match for the question being asked, while the value is strategically chosen to be the answer. The second view is function(doc) { emit(doc.from_node_id, doc.to_node_id) } and query this view with ?key="ad7b377d501ae5f17f623e7853f89ccc" $ curl http://localhost:5984/links/_design/links/_view/fromto {"total_rows":3,"offset":0,"rows":[ {"id":"a227c15681639f83e26a93c2a6aa3522","key":"49c02d3dbffe3649cc9f1dbf99827d97","value":"ad7b377d501ae5f17f623e7853f89ccc"}, {"id":"9efd0174f043dead5319aaee3ac9df92","key":"ad7b377d501ae5f17f623e7853f89ccc","value":"49c02d3dbffe3649cc9f1dbf99827d97"}, {"id":"dc77b57e71bb9b05af7bdd0ab7eb1001","key":"ad7b377d501ae5f17f623e7853f89ccc","value":"5b5a99f6799bbf984da4b11e62a434ab"} ]} $ curl http://localhost:5984/links/_design/links/_view/fromto?key='"ad7b377d501ae5f17f623e7853f89ccc"' {"total_rows":3,"offset":1,"rows":[ {"id":"9efd0174f043dead5319aaee3ac9df92","key":"ad7b377d501ae5f17f623e7853f89ccc","value":"49c02d3dbffe3649cc9f1dbf99827d97"}, {"id":"dc77b57e71bb9b05af7bdd0ab7eb1001","key":"ad7b377d501ae5f17f623e7853f89ccc","value":"5b5a99f6799bbf984da4b11e62a434ab"} ]} The two value fields are the IDs of the nodes which donpark.org points to.

If they both do the same thing, with SQL being easier to do general queries with, why use a document storage system like couchdb? It is supposed to be simple to setup any number of couchdb instances to scale both the read load and the write load. Couchdb is multi-master and rather than locking on writes, it lets writes happen simultaneously and does a conflict resolution process later. For a social network graph, its enough to simply let the most recent information win in the event of a conflict.

In theory, reads are supposed to scale not only with more boxes answering queries (each having their own copy of the data), but the work of the query itself can be divided and spread amongst boxes. Though I don't think couchdb supports this, that I believe is the whole point of using a map/reduce framework. An overly broad statement but in the right ballpark is: Five servers can handle five times the number of queries sure, but they can also answer a single query five times as fast. That needs more looking into.

update: the solution for distributing a single query across multiple servers is to use couchdb-lounge

tags: