With the release of Google Buzz, i was reminded of the XFN friends network. I dusted off my XFN Spider and thought about replacing the SQL storage with a document database. There are many options for data storage these days, as described in this writeup
I'm still trying to fundamentally understand the differences between document storage and relational/sql storage. I figure this directed graph is a good opportunity to go further with that. I've decided to stick with couchdb as my document system of choice because its only retrieval mechanism is map/reduce. Once i get that down, i'll look at mongodb further and its custom query language with optional map/reduce.
A directed graph has nodes and links between nodes. Its a directed graph because a link goes from one node and to another node. A social network has people that "follow" each other. This can be modeled by two SQL tables.
The following is a graph from my home page to twitter and delicious. My twitter page links back to my homepage while the delicious page does not.
nodes
|
links
|
Modeling this data with documents is remarkably similar to the SQL case. Here I'm focusing on creating relationships between documents. Documents can also contain entire related records but thats another topic. A database in couchdb holds documents. A document is a set of key/value pairs. Couchdb gives every document an ID automatically. Since couchdb has no schema, I'll list the data directly.
nodes
{_id: "ad7b377d501ae5f17f623e7853f89ccc", name: "donpark.org"} |
{_id: "49c02d3dbffe3649cc9f1dbf99827d97", name: "twitter.com"} |
{_id: "5b5a99f6799bbf984da4b11e62a434ab", name: "delicious.com"} |
links
{_id: "9efd0174f043dead5319aaee3ac9df92", from_node_id: "ad7b377d501ae5f17f623e7853f89ccc", to_node_id: "49c02d3dbffe3649cc9f1dbf99827d97"} |
{_id: "a227c15681639f83e26a93c2a6aa3522", _rev: "1-76c2de7086aa314f0971e41043b34924", from_node_id: "49c02d3dbffe3649cc9f1dbf99827d97", to_node_id: "ad7b377d501ae5f17f623e7853f89ccc"} |
{_id: "dc77b57e71bb9b05af7bdd0ab7eb1001", _rev: "1-fe817df1eb7d6e5208f3ce1be2d622ab", from_node_id: "ad7b377d501ae5f17f623e7853f89ccc", to_node_id: "5b5a99f6799bbf984da4b11e62a434ab"} |
In each case, given a starting node, I want to follow the links to get each targeted node and return a list of those nodes, or at least a list of the IDs of those nodes.
Given a starting string of "donpark.org", the SQL solution is
If they both do the same thing, with SQL being easier to do general queries with, why use a document storage system like couchdb? It is supposed to be simple to setup any number of couchdb instances to scale both the read load and the write load. Couchdb is multi-master and rather than locking on writes, it lets writes happen simultaneously and does a conflict resolution process later. For a social network graph, its enough to simply let the most recent information win in the event of a conflict.
In theory, reads are supposed to scale not only with more boxes answering queries (each having their own copy of the data), but the work of the query itself can be divided and spread amongst boxes. Though I don't think couchdb supports this, that I believe is the whole point of using a map/reduce framework. An overly broad statement but in the right ballpark is: Five servers can handle five times the number of queries sure, but they can also answer a single query five times as fast. That needs more looking into.
update: the solution for distributing a single query across multiple servers is to use couchdb-lounge