Friday, June 14, 2013

Distributed file system, part 1

There's a lot of data around on a lot of computers everywhere...far too much to fit on any one machine, or even on some larger storage array, in any cost-effective manner for us little guys.

At work I have a SAN with 100TB of available storage. THAT is a lot of storage; but given what I do there, actually not all that hard to fill up. And that kind of device STILL doesn't solve the larger problem, nor is it very cost-effective--I could upgrade the drives from 2TB to 3TB, but that's only a 50% capacity increase...suppose I need a 10X increase? 100X? More?

2TB drives aren't very expensive anymore (it seems almost absurd to even be able to say that, given that my first computer had a 20MB drive in it), and it's not hard to find dirt-cheap machines around, used or even free. Regrettably they are seldom small, and therefore tend to be a little power-hungry...not a problem for a data-center kind of place, but uncomfortable for me at home.

Suppose I decided I had a problem to work on where 30TB looked like the right capacity...and let's say that means 10 machines @ 3TB each...

I've written a heterogeneous, distributed, OS-agnostic Grid Engine--perfect for doing data processing on a 10-node cluster. But it really works best when all the nodes use a shared/common file system, and THAT works best with a SAN and a blade server, like at work. Well, the blade server part isn't really very expensive ($3k will buy a decent used chassis, fully populated and pleasantly fast--look on eBay for IBM HS21 systems). But getting a SAN on there--not going to happen. OK, I could perhaps put some high-capacity 2.5" drives in the blades, etc, but that doesn't solve the resulting problem, which is still: how do the blades share data with each other?

Well, on a limited basis you can make file shares and cross-mount all the shares across all the machines--but that doesn't scale very far, the tangle of shares becomes a nightmare to administer, and they STILL aren't a shared common file system.
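Just to put numbers on the scaling problem: with pairwise cross-mounting, each of N machines has to mount the shares exported by the other N-1, so the mount count grows quadratically. A quick back-of-the-envelope calculation (Python, nothing fancy):

    # Each of N machines mounts the N - 1 shares exported by the others,
    # so there are N * (N - 1) mounts to configure, monitor, and keep alive.
    for n in (3, 10, 30, 100):
        print(f"{n:>4} machines -> {n * (n - 1):>6} mounts")

Ten machines is already 90 mounts to babysit; at 100 machines it's nearly 10,000.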

So really the problem is: how do I make a shared common file system across a bunch of machines? It needs to be heterogeneous, since I run Mac/Win/Linux machines and am considering other things like Gumstix boards.

There are distributed file systems around...several, it turns out, but they are mostly homogeneous, Linux-only affairs (FUSE-based systems, Lustre, GlusterFS, etc), which doesn't help me. OK, I could just buy the cheap hardware and install Linux everywhere, but what happens when I have a Windows-only software tool to run?

I've been hunting for an OS-agnostic tool, and it's not really clear whether there is such a thing. OpenAFS (the open-source Andrew File System) might do it, which would perhaps be the ideal solution; I haven't tried it yet. Pretty much everything else I've read about fails my requirements, heterogeneous support being the first fail point. At work I'm using StorNext with the SAN, but I can't afford that on my own.

So I think I have to solve this myself. What I kinda want is a BYOD approach: run an agent to join, and in exchange you get access to everything shared on the network, without having to cross-mount a zillion shares you can't even discover casually.
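To make that a little more concrete, here's a minimal sketch of just the discovery half of such an agent--enough to show the "join and see everything" idea. It assumes a plain UDP broadcast on the local subnet; the port number, message format, and example paths are all invented for illustration, and real file access would still need a transfer protocol (and some security) layered on top:

    import json
    import socket
    import threading
    import time

    # Sketch of the discovery half of a "join with an agent" scheme.
    # The port number and message format are invented for illustration;
    # a real system would also need authentication and actual file access.
    PORT = 47000

    def announce(host_name, shared_paths):
        """Broadcast this node's shared directories to the local subnet."""
        msg = json.dumps({"host": host_name, "shares": shared_paths}).encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            s.sendto(msg, ("<broadcast>", PORT))

    def listen(seconds):
        """Collect announcements for a while; return {host: [shares]}."""
        catalog = {}
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind(("", PORT))
            s.settimeout(0.5)
            deadline = time.time() + seconds
            while time.time() < deadline:
                try:
                    data, _addr = s.recvfrom(65536)
                except socket.timeout:
                    continue
                info = json.loads(data)
                catalog[info["host"]] = info["shares"]
        return catalog

    if __name__ == "__main__":
        # Demo on one machine: listen in the background, then announce,
        # so the listener catches our own broadcast.
        t = threading.Thread(target=lambda: print(listen(seconds=2)))
        t.start()
        time.sleep(0.5)
        announce(socket.gethostname(), ["/data/projects", "/data/archive"])
        t.join()

Broadcast keeps it zero-configuration, which fits the BYOD spirit, but it only reaches one subnet; anything bigger would want a registry node, or gossip between the agents, instead.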

What you would NOT have is something that shows up in Finder/Windows Explorer. I can probably figure out how to finagle that too, although I don't consider it a critical requirement. I expect that OpenAFS has that figured out.

Is it going to take an Advanced Degree(tm) to figure this out? It's not an easy problem.
