My friends and I each have TBs of data on our systems. None of us has geographically distributed full backups, however, because at that volume of data, services such as Dropbox, S3, et al. are cost-prohibitive for us. Yet each of us has local storage to spare: TBs each, in fact, going unused.
We began thinking: If we could network our hosts into some form of Distributed File System, we could each gain geographically distributed backups of our complete data sets while achieving higher utilization of the storage capacity we have. The perfect solution... we think.
- There are at least 3 of us. Surely 6 or more if the project yields fruit.
- Each of us has 1-2TB of data, and at least that much to spare.
- We're all spread out over WAN.
- We'd need any host to be able to join or leave the cluster arbitrarily.
- Real(ish)-time synchronization (roughly the behaviour sketched just after this list). Otherwise we'd just meet up once a week over beers and trade around a pile of external HDDs.
- F/OSS is requisite, but we have plenty of elbow grease.
- If we can use/learn/leverage a distributed computing platform in the process, so much the better.
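To make "real(ish)-time" concrete, here's a minimal sketch of the behaviour we're after: watch the local data set and push changes to every peer as they happen. The peer hostnames and paths are invented, and watchdog + rsync are just stand-ins for whatever a real DFS would do internally.

```python
import subprocess
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

PEERS = ["alice.example.net", "bob.example.net"]   # hypothetical peer hosts
LOCAL_ROOT = "/srv/mydata/"                        # made-up local data root
REMOTE_ROOT = "/srv/backups/me/"                   # made-up remote target

class PushOnChange(FileSystemEventHandler):
    def on_any_event(self, event):
        if event.is_directory:
            return
        # Mirror the whole tree to each peer; rsync only transfers the deltas.
        for peer in PEERS:
            subprocess.run(["rsync", "-az", "--delete",
                            LOCAL_ROOT, f"{peer}:{REMOTE_ROOT}"])

observer = Observer()
observer.schedule(PushOnChange(), LOCAL_ROOT, recursive=True)
observer.start()
observer.join()
```

Obviously a pile of rsync pushes like this isn't a distributed file system, which is exactly why we're asking.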
We started out thinking about building a Dropbox-esque interface on top of OpenStack or Hadoop, but I'd like to hear whether there are other alternatives we're overlooking. Perhaps there's an even simpler solution for our case? Is something like this even feasible, given the low number of nodes in the cluster?
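For reference, the sort of thing we had in our heads for the Hadoop route is a thin upload shim over the WebHDFS REST API. The NameNode address, port, user, and paths below are made up for illustration; we haven't built or deployed any of this, it's just a sketch of the interface we're imagining.

```python
import requests

NAMENODE = "http://namenode.example:9870"   # hypothetical NameNode address

def put_file(local_path, hdfs_path, user="backup"):
    """Upload one local file into HDFS via WebHDFS (two-step create)."""
    # Step 1: ask the NameNode where to write; it answers with a 307 redirect
    # pointing at a DataNode.
    resp = requests.put(
        f"{NAMENODE}/webhdfs/v1{hdfs_path}",
        params={"op": "CREATE", "overwrite": "true", "user.name": user},
        allow_redirects=False,
    )
    datanode_url = resp.headers["Location"]

    # Step 2: stream the file body to the DataNode we were redirected to.
    with open(local_path, "rb") as f:
        requests.put(datanode_url, data=f)

put_file("/home/alice/photos/2012-01.tar", "/backups/alice/photos/2012-01.tar")
```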
NB: Naturally, the initial synchronization/balancing/transfer/etc. will take days at the very least, but that's acceptable.