Wednesday, June 19, 2013

Distributed File System, part 2

In April, I had a training class called "Intro to Big Data", from Learning Tree. It's really aimed at getting you into Hadoop, but the preliminary topics were covered first with separate tools. Nice course, really. LT is clearly good at this kind of thing (it was my first and only LT course), unlike some other "training" I've had in the last year.

So what sparked my thinking again on DFS/VFS was the segment about Distributed Hash Tables. That might work as the lookup mechanism I need to serve as the complete distributed file table.

Making a distributed database is not easy; even the big guys have trouble with this, and overall performance is not all that great. My fave SQL database, H2, is not distributed.

I do not, as yet, know anything about what sort of performance I need. *I* probably don't need all that much, but running my Grid Engine would need more.

Suppose I take a DHT tool (Apache Cassandra is one possibility) and have it store this:

filename, directory path, host

where filename is the access key, and maybe host/path is stored as a URL.

filename, URL

If the URL is good, I could pass it to the Grid Engine as is, and let relevant/interested process(es) use it directly to open a file stream. That could work, though it could mean having a lot of file streams/handles open at any one time. (The GE typically wouldn't have more than 100 at a time per machine, probably. Well, maybe 200.) So depending on file size, maybe that's too much network traffic; if nothing else, it's not going to scale well.
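To make the filename → URL idea concrete, here's a minimal sketch in Java, with a plain HashMap standing in for the DHT (the real thing would be a network call out to Cassandra or whatever) and the hostname/path entirely made up:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the filename -> URL table. A local HashMap stands in
// for the real DHT; the names and URLs here are invented.
public class FileTableSketch {

    // In the real thing, this lookup would go over the network to the DHT.
    private final Map<String, String> fileTable = new HashMap<>();

    public void register(String filename, String url) {
        fileTable.put(filename, url);
    }

    // The Grid Engine would call something like this and hold the stream open.
    public InputStream open(String filename) throws IOException {
        String url = fileTable.get(filename);
        if (url == null) {
            throw new IOException("No entry for " + filename);
        }
        return new URL(url).openStream();
    }

    public static void main(String[] args) throws IOException {
        FileTableSketch table = new FileTableSketch();
        table.register("results.csv", "http://host1.example.com/data/results.csv");
        try (InputStream in = table.open("results.csv")) {
            System.out.println("first byte: " + in.read());
        }
    }
}

Every open() is one more stream held on the worker machine, which is exactly the handle-count worry above.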

Maybe I should be using the file-content MD5 as the key? That is at least fixed size (32 hex chars). That ends up being much more the DHT approach, because you could distribute keys based on the first character of the MD5 (or maybe the first two, if you had a lot of machines).

MD5, URL
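Something like this, just to see the shape of it; the node count and file path are placeholders, and MessageDigest does the actual MD5 work:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: MD5 of the file contents as the key, node picked from the
// first hex character. The node count and file path are made up.
public class Md5KeySketch {

    static String md5Of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // reading drives the digest; nothing else to do here
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();  // 32 hex chars
    }

    // Distribute keys by the first hex character (0-15), folded onto N nodes.
    static int nodeFor(String md5Hex, int numNodes) {
        return Character.digit(md5Hex.charAt(0), 16) % numNodes;
    }

    public static void main(String[] args) throws Exception {
        String key = md5Of(Paths.get("/tmp/some-file.dat"));
        System.out.println(key + " -> node " + nodeFor(key, 4));
    }
}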

So what am I doing with these things? Suppose I have what I think a DHT is: a local service which can tell me where a file actually is for a given MD5; that MD5 has come from the Grid Engine. OK, that feels clunky, because I only know MD5s from the GE.
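If it helps to pin that down, the "local service" is really just this (name invented):

// The GE hands in a content MD5 and gets back a URL it can open,
// or null if the key isn't in this node's slice of the table.
public interface Md5Locator {
    String locate(String md5Hex);
}

The clunky part is that nothing outside the GE knows those MD5s in the first place.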



Other tools: HDFS (Hadoop) has several issues: the "ingest problem" (i.e., how do you get all your data into it), internal replication (it wants 3X, although you can set that to just 1X: you lose any redundancy protection, but ingest is faster), and block size, since the default is 64 MB ??!! That's maybe not so painful if your files are all 2 GB video...
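For what it's worth, if I ever do try HDFS, the replication factor and block size are just config knobs. A sketch using the Hadoop 1.x property names, with all the paths made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: turn replication down to 1 (no redundancy, faster ingest)
// and set the 64 MB default block size explicitly. Paths are invented.
public class HdfsSettingsSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "1");                   // 1X instead of the usual 3X
        conf.setLong("dfs.block.size", 64L * 1024 * 1024);  // 64 MB, same as the default

        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/local/data/file.bin"),
                             new Path("/data/file.bin"));
    }
}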

Another reason to NOT try to use a huge SAN cluster (you can daisy-chain these things) is that you end up having a minimum block size around 4 KB or 8 KB. Well, that's fine if your files are mostly big, but what happens when you tend to have a lot of 1 KB files? That issue argues for a VFS that lets you use (for example) a ZIP file as a file system, which probably gets around the minimum block-size problem; I expect that has other performance issues, but wasted space isn't one of them.
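On the ZIP-as-filesystem idea, Java 7's zip filesystem provider already does roughly that (Commons VFS has a zip provider too). A sketch, with the archive path and entry name made up:

import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Sketch: treat a ZIP archive as a file system, so lots of tiny files
// don't each eat a 4 KB/8 KB block. Archive path and entry are invented.
public class ZipAsFileSystem {
    public static void main(String[] args) throws IOException {
        Map<String, String> env = new HashMap<>();
        env.put("create", "false");  // don't create the archive if it's missing

        URI zipUri = URI.create("jar:file:/data/small-files.zip");
        try (FileSystem zipFs = FileSystems.newFileSystem(zipUri, env)) {
            Path entry = zipFs.getPath("/notes/tiny-1k-file.txt");
            System.out.println(new String(Files.readAllBytes(entry), "UTF-8"));
        }
    }
}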
