Friday, June 14, 2013

Distributed file system, part 1

There's a lot of data around on a lot of computers everywhere...far too much to fit on any one machine, or even on some larger storage array, in any cost-effective manner for us little guys.

At work I have a SAN with 100TB of available storage. THAT is a lot of storage; but given what I do there, actually not all that hard to fill up. And that kind of device STILL doesn't solve the larger problem, nor is it very cost-effective--I could upgrade the drives from 2TB to 3TB, but that's only a 50% capacity increase...suppose I need a 10X increase? 100X? More?

2TB drives aren't very expensive anymore (it seems almost absurd to even be able to say that, given that my first computer had a 20MB drive in it), and it's not hard to find dirt-cheap machines around, used or even free. Regrettably they are seldom small, and therefore tend to be a little power-hungry...not a problem for a data-center kind of place, but uncomfortable for me at home.

Suppose I decided I had a problem to work on where 30TB looked like the right capacity...and let's say that means 10 machines @ 3TB each...

I've written a heterogeneous, distributed, OS-agnostic Grid Engine--perfect for doing data processing on a 10-node cluster. But it really works best when all the nodes use a shared/common file system, and THAT works best with a SAN and a blade server, like at work. Well, the blade server part isn't really very expensive ($3k will buy a decent used chassis, fully populated and pleasantly fast--look on eBay for IBM HS21 systems). But getting a SAN on there--not going to happen. OK, I could perhaps put some high-capacity 2.5" drives in the blades, etc, but that doesn't solve the resulting problem, which is still: how do the blades share data with each other?

Well, on a limited basis you can make file shares and cross-mount all the shares across all the machines--but that doesn't scale very far, the tangle of shares becomes a nightmare to administer, and they STILL aren't a shared common file system.
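Just to put numbers on the scaling problem: with pairwise cross-mounting, each of N machines has to mount the shares exported by the other N-1, so the mount count grows quadratically. A quick back-of-the-envelope calculation (Python, nothing fancy):

    # Each of N machines mounts the N - 1 shares exported by the others,
    # so there are N * (N - 1) mounts to configure, monitor, and keep alive.
    for n in (3, 10, 30, 100):
        print(f"{n:>4} machines -> {n * (n - 1):>6} mounts")

Ten machines is already 90 mounts to babysit; at 100 machines it's nearly 10,000.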

So really the problem is: how do I make a shared common file system across a bunch of machines? It needs to be heterogeneous, since I run Mac/Win/Linux machines and am considering other things like Gumstix boards.

There are distributed file systems around...several, it turns out, but they are mostly homogeneous, Linux-only affairs (FUSE-based systems, Lustre, GlusterFS, etc), which doesn't help me. OK, I could just buy the cheap hardware and install Linux everywhere, but what happens when I have a Windows-only software tool to run?

I've been hunting for an OS-agnostic tool, and it's not really clear whether there is such a thing. OpenAFS (the open-source Andrew File System) might do it, which would perhaps be the ideal solution; I haven't tried it yet. Pretty much everything else I've read about fails my requirements, heterogeneous support being the first fail point. At work I'm using StorNext with the SAN, but I can't afford that on my own.

So I think I have to solve this myself. What I kinda want is a BYOD approach: run an agent to join, and in exchange you get access to everything shared on the network, without having to cross-mount a zillion shares you can't even discover casually.
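To make that a little more concrete, here's a minimal sketch of just the discovery half of such an agent--enough to show the "join and see everything" idea. It assumes a plain UDP broadcast on the local subnet; the port number, message format, and example paths are all invented for illustration, and real file access would still need a transfer protocol (and some security) layered on top:

    import json
    import socket
    import threading
    import time

    # Sketch of the discovery half of a "join with an agent" scheme.
    # The port number and message format are invented for illustration;
    # a real system would also need authentication and actual file access.
    PORT = 47000

    def announce(host_name, shared_paths):
        """Broadcast this node's shared directories to the local subnet."""
        msg = json.dumps({"host": host_name, "shares": shared_paths}).encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            s.sendto(msg, ("<broadcast>", PORT))

    def listen(seconds):
        """Collect announcements for a while; return {host: [shares]}."""
        catalog = {}
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind(("", PORT))
            s.settimeout(0.5)
            deadline = time.time() + seconds
            while time.time() < deadline:
                try:
                    data, _addr = s.recvfrom(65536)
                except socket.timeout:
                    continue
                info = json.loads(data)
                catalog[info["host"]] = info["shares"]
        return catalog

    if __name__ == "__main__":
        # Demo on one machine: listen in the background, then announce,
        # so the listener catches our own broadcast.
        t = threading.Thread(target=lambda: print(listen(seconds=2)))
        t.start()
        time.sleep(0.5)
        announce(socket.gethostname(), ["/data/projects", "/data/archive"])
        t.join()

Broadcast keeps it zero-configuration, which fits the BYOD spirit, but it only reaches one subnet; anything bigger would want a registry node, or gossip between the agents, instead.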

What you would NOT have is something that shows up in Finder/Windows Explorer. I can probably figure out how to finagle that too, although I don't consider it a critical requirement. I expect that OpenAFS has that figured out.

Is it going to take an Advanced Degree(tm) to figure this out? It's not an easy problem.
