adhd

Adhd is an asynchronous, distributed hard drive. Actually we’re not sure if it’s really asynchronous or even what asynchronicity would mean in the context of a hard drive, but it is definitely distributed. Adhd is essentially a management layer (written using Ruby and eventmachine) which controls clusters of CouchDB databases to replicate files across disparate machines. Unlike most clustering storage solutions, adhd assumes that machines in the cluster may have different capabilities and is designed to work both inside and outside the data centre.

How it works

External API

The user issues an HTTP PUT request to an adhd URL. This request can go to any machine in the adhd cluster. The file is sent as the body of the request, and adhd stores it on a number of different servers. At a later time, the user can issue a GET request to the same URL and get their file back.
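
For illustration only, here is roughly what that round trip might look like from Ruby, using the standard Net::HTTP library. The node address and the /adhd/ path are borrowed from the Usage section below; they are not a fixed API.

    # Minimal sketch of a PUT followed by a GET against an adhd node.
    # The address, port and path are taken from the Usage examples below.
    require 'net/http'
    require 'uri'

    uri = URI.parse("http://192.168.1.93:5985/adhd/somefile.txt")

    # PUT: the file contents become the request body.
    Net::HTTP.start(uri.host, uri.port) do |http|
      put = Net::HTTP::Put.new(uri.path)
      put.body = File.read("/path/to/somefile.txt")
      put["Content-Type"] = "text/plain"
      http.request(put)
    end

    # GET: the same URL later returns the stored file.
    puts Net::HTTP.get(uri)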

Under the Hood

Overview

Multiple server nodes are part of an adhd network. Each node has a full copy of two CouchDB databases: the node management database, and the shard range database.

The node management database keeps a record of all existing nodes, their location (IP), and current status. This database is replicated across all nodes in the cluster; any node is in theory capable of becoming a management node if the current management nodes become unavailable.

The shard range database is also replicated everywhere. It allows any node to find out which storage nodes are responsible for holding a particular piece of content. Management nodes are in charge of maintaining the shard database and pushing changes to non-management nodes.

Storage nodes hold one or more shards in addition to the management and shard range databases. These shards contain the content and some metadata. A shard consists of a CouchDB database which holds Couch documents - one attached file per document.
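
As a rough illustration (the real field names in adhd's databases may well differ), the three kinds of records described above could be pictured as Ruby hashes:

    # Illustrative only: plausible shapes for the documents described above.
    # Field names are assumptions, not taken from the adhd source.
    node_record = {
      "_id"    => "my_adhd_node",            # node_name from the config file
      "url"    => "http://192.168.1.93:5984",
      "status" => "RUNNING"                  # current availability
    }

    shard_range_record = {
      "_id"         => "shard_00",
      "range_start" => "00000000000000000000",   # lowest internal ID it covers
      "range_end"   => "3fffffffffffffffffff",   # highest internal ID it covers
      "node_list"   => ["my_adhd_node", "another_adhd_node"],
      "master_node" => "my_adhd_node"
    }

    content_doc = {
      "_id"          => "internal ID derived from the filename",
      "filename"     => "somefile.txt",
      "content_type" => "text/plain",
      "size"         => 1024
      # the file itself is stored as a CouchDB attachment on this document
    }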

PUT requests

A PUT request reaches any adhd node. The node parses the request line and headers: the file's name comes from the URL, and its MIME type and content length come from the incoming HTTP request headers. Based on the MD5 hash of the filename, we derive a 20-byte internal ID string for the file and use this ID to work out which shard range the file will be assigned to.

Once we have a shard range, we ask the shard range database which nodes are in charge of storing that shard and are currently available. One of these nodes is chosen (starting with the shard's master node and falling back to other nodes as necessary). The file's metadata is written as a CouchDB document to the chosen node, followed by the file itself as an attachment.
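
A sketch of that assignment step, assuming the document shapes pictured in the Overview above; Digest::MD5 is Ruby's standard library, everything else is an assumption rather than the actual adhd code:

    require 'digest/md5'

    def internal_id(filename)
      # MD5 hexdigest of the filename, truncated to a 20-character ID string.
      Digest::MD5.hexdigest(filename)[0, 20]
    end

    def shard_range_for(id, shard_ranges)
      # Pick the shard range whose [range_start, range_end] interval covers the ID.
      shard_ranges.find { |r| id >= r["range_start"] && id <= r["range_end"] }
    end

    def storage_node_for(shard_range, available_nodes)
      # Prefer the shard's master node, then fall back to the other nodes
      # listed for the same shard range.
      candidates = [shard_range["master_node"]] + shard_range["node_list"]
      candidates.find { |name| available_nodes.include?(name) }
    end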

GET requests

A GET request follows the same path as a PUT request, except that the file is retrieved from the CouchDB shard instead of being written to it.
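
The retrieval side, under the same assumptions as the sketches above: once the node has mapped the filename back to an internal ID and a shard database, the file can be read straight off CouchDB's attachment URL.

    # Sketch only: the shard database name and document layout are assumptions
    # carried over from the sketches above.
    require 'net/http'

    def fetch_file(couch_host, couch_port, shard_db, filename)
      id   = internal_id(filename)              # same hashing as on the PUT path
      path = "/#{shard_db}/#{id}/#{filename}"   # CouchDB attachment URL
      Net::HTTP.get_response(couch_host, path, couch_port).body
    end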

Replication

Replication happens differently for different databases. The node management database and the shard range database are periodically synced to and from the management nodes. The content shards are replicated when they get updated (i.e. when a file is added to a node which is responsible for a given shard, the file will be replicated to all other nodes which are responsible for the same shard).
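
Pushing one database to a peer uses CouchDB's standard _replicate API; a node could trigger it with something like the sketch below. The database name is illustrative, and the scheduling (periodic for the management and shard range databases, on update for content shards) is handled by adhd itself and not shown here.

    require 'net/http'
    require 'uri'
    require 'json'

    # Ask the local CouchDB to replicate source_db to target_url.
    def replicate(couch_url, source_db, target_url)
      uri = URI.parse("#{couch_url}/_replicate")
      req = Net::HTTP::Post.new(uri.path, "Content-Type" => "application/json")
      req.body = { "source" => source_db, "target" => target_url }.to_json
      Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
    end

    # e.g. push a (hypothetically named) node database to a buddy node:
    # replicate("http://192.168.1.93:5984", "adhd_nodes",
    #           "http://192.168.1.74:5984/adhd_nodes")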

Rationale

We can have multiple redundant nodes storing the same files, getting rid of the need for backups. Not every node needs to store every file, so we can scale up storage capacity by adding more nodes. The design also provides load balancing as it means that we can serve files from multiple nodes.

Because all admin information is shared by every node, the unavailability of any node does not jeopardize the operation of the system. As soon as a node becomes unavailable, any other node can perform its functions (although we will lose storage capacity throughout the cluster if we lose multiple large storage nodes).

Some vapourware objectives

The design also allows for different capabilities in storage nodes. Shards can be assigned based on specific properties (available bandwidth, available processing power, etc) so that we can store files in the most efficient possible way. It might be desirable, for example, to put all video files on machines with fast processors for doing video encoding, whereas audio and photos wouldn’t need any post-processing and could go somewhere else. Or we could store new and popular content in shard ranges which are on fast servers with high throughput, while putting less popular content on servers with less bandwidth (and lower storage costs). Ideally we would like to get to a point where low-demand files can be stored on relatively low-bandwidth home or office network connections and still be available in the cluster.
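
Purely as a thought experiment (none of this is implemented), capability-aware assignment could be as simple as tagging nodes and filtering on those tags when a shard range is allocated:

    # Vapourware sketch: node names, tags and the selection rule are all made up.
    nodes = [
      { "name" => "video_box", "tags" => ["fast_cpu", "high_bandwidth"] },
      { "name" => "home_nas",  "tags" => ["cheap_storage"] }
    ]

    def nodes_with(nodes, tag)
      nodes.select { |n| n["tags"].include?(tag) }
    end

    # Video shards go to nodes tagged "fast_cpu", cold archives to "cheap_storage":
    # nodes_with(nodes, "fast_cpu")       #=> the video_box node
    # nodes_with(nodes, "cheap_storage")  #=> the home_nas node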

Fictional use cases (no one has ever used this software in real life)

Wikipedia: photos of Cheryl Cole and Robbie Williams are unfortunately wildly popular at present and will be requested often. Photos of the millions of singers with more talent and smaller marketing budgets will not be requested very often, but it’s still good if they are in the cluster. If by some stroke of good luck Kevin Quain suddenly gets the recognition he deserves, his photo will be shifted from the “dial-up” node class to something with more stomp.

Archive.org: those sex-ed videos and Bert the turtle from the Prelinger Archive are awesome and everybody wants to watch them, so they should be on beefy video-storage nodes with high bandwidth. Documentaries on the yellow-bellied sapsucker may be requested only once in a while.

BBC News: material going up on the site right now will need a lot of bandwidth, but by next week most of this media will be out of the news cycle and can be consigned to a set of servers with much less bandwidth. By next year, it is unlikely that this media will be viewed even a few times a day, so it could be smart to trade storage costs for speed and put old media on cheaper, lower-bandwidth boxes with huge hard drives.

Installation

Don’t use this software yet. It is experimental and may eat your mother.

Having said that,

sudo apt-get install couchdb; sudo gem install adhd

may do the job.

Usage

You’ll need a node_config.yml file that looks something like this:

---
   node_name: my_adhd_node
   node_url: "192.168.1.93"
   couchdb_server_port: 5984
   http_server_port: 5985
   buddy_server_url: "http://192.168.1.74:5984"
   buddy_server_node_name: another_adhd_node

The config parameters are:

node_name: a unique prefix name that will be used as an identifier for your nodes in the global node database.

node_url: the hostname or IP which the node should bind to.

couchdb_server_port: the port that CouchDB is running on, usually 5984.

http_server_port: the port which the adhd node should run on.

buddy_server_url: the URL (including the CouchDB port) of any other adhd node which is currently running. The node will connect to the buddy_server_url and replicate the cluster’s management database information from it. This can be blank for the first node or if you want to make an entirely new cluster.

buddy_server_node_name: the node_name of the buddy node as set in its config file.

You can start adhd once it’s installed by invoking it like this:

adhd -c /path/to/node_config.yml

If it worked, you should see a bunch of messages in your console telling you what’s replicating, how many shards there are, etc. If you want to test out multiple nodes on the same machine, you can simply have multiple config files which use different http_server_port and node_name values.

Once a node has started, you can do something like this to add content to it:

curl -i -0 -T /path/to/somefile.txt http://192.168.1.93:5985/adhd/somefile.txt

That command should PUT somefile.txt into the adhd cluster. If there is more than one node running, somefile.txt will be automatically replicated to the proper shard.

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, but do not mess with the rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a commit by itself so I can ignore it when I pull.)

  • Send me a pull request. Bonus points for topic branches.

Copyright © 2009 dave@netbook. See LICENSE for details.