Sunday, June 23, 2013

Distributed File System, part 3

I've thought through a lot of this already, but I have not implemented much. I have gotten started, though...

Wish I could put a block diagram thing in here somehow...image seems the only way, but I don't really have any.

So you want to have all the shared files available everywhere, but you sure can't keep copies everywhere. I've already discussed the idea of cross-mounts, buying truly massive storage devices, etc... none of those things are really workable.

So what I think you do is gather the knowledge of what all the shared files are, catalog them, publish the catalog via a web-service, and then transparently copy things locally when you need to use them, and age them off later (either by a size heuristic, or a time heuristic, or LRU heuristic), making them available locally.
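To make the age-off idea concrete, here's a tiny Java sketch of one possible policy (LRU with a byte-size cap); the class and method names are just placeholders I'm playing with, not code I've actually written:

  import java.io.File;
  import java.util.Iterator;
  import java.util.LinkedHashMap;
  import java.util.Map;

  // Hypothetical sketch: track locally cached copies, and age off the
  // least-recently-used ones once a byte budget is exceeded.
  public class LocalCacheAger {

      private final long maxBytes;
      private long currentBytes = 0;

      // accessOrder=true means iteration order is least-recently-used first
      private final LinkedHashMap<String, File> cache =
              new LinkedHashMap<String, File>(16, 0.75f, true);

      public LocalCacheAger(long maxBytes) {
          this.maxBytes = maxBytes;
      }

      // Record that a file (keyed by its MD5, say) has been copied locally.
      public synchronized void add(String md5, File localCopy) {
          cache.put(md5, localCopy);
          currentBytes += localCopy.length();
          ageOff();
      }

      // Look up (and "touch") a cached copy so it stays fresh in LRU order.
      public synchronized File get(String md5) {
          return cache.get(md5);
      }

      // Delete least-recently-used local copies until we're under budget.
      // The shared original still lives wherever it was published from.
      private void ageOff() {
          Iterator<Map.Entry<String, File>> it = cache.entrySet().iterator();
          while (currentBytes > maxBytes && it.hasNext()) {
              Map.Entry<String, File> oldest = it.next();
              currentBytes -= oldest.getValue().length();
              oldest.getValue().delete();
              it.remove();
          }
      }
  }

A time-based or size-threshold heuristic would just swap out the eviction loop; the "touch on use" part is what makes it behave like a library checkout.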

Depending on what's happening across your network, you could wind up with a popular file having copies actually residing in a lot of places...for a while. Most files would only have two locations: primary shared, and whatever there is for standard backup.

At the moment I'm thinking of it rather like a library system. You have your own collection of files (books you acquired somewhere). You are willing to share some of them. Others are likewise willing to share some. There's a "library/librarian" service. You can ask the service what all is available (from all those willing to share--the "library" doesn't have its own repository), and you can have a copy of anything listed, until you bump into your local age-off restrictions. Remember how your local physical library works? You can look at the catalog, find something you want, check out a book for 30 days, take it home to be in your personal library, and then return it: i.e., locate a file, copy it locally for temporary use, and then delete it.

If you find yourself having age-off space problems, maybe you buy some bigger bookshelves (i.e., a new and larger disk drive).

This is not a perfect analogy, but it works OK for the moment.

So there are some other storage units that could/should participate in this, and they need a proxy of sorts to do so: SAN, NAS--that sort of thing. A NAS device can be just a mountable filesystem, which suggests that perhaps the Librarian needs to take on the management of that, although that doesn't quite fit the analogy the right way: I am thinking of the file-copying as being a lot more like a P2P file-transfer system.

So there's the Librarian service(s), the local shared-publishing service, the P2P file-transferring, and the local storage management. I've written a small part of the Librarian, more of the local shared-publishing piece, I've been looking at file-transfer code, and have merely thought about the storage mgmt. It's all just casual so far, although it's been in the back of my mind for months. Been writing down the use cases, too. I should have a working system in a couple of months, I think.

[Later: ok, I've put less time into it recently, so not til this fall at the earliest]
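Roughly, here's how I picture those four pieces fitting together, as Java interfaces; every name below is a placeholder I'm sketching with, not anything that actually exists yet:

  import java.io.InputStream;
  import java.nio.file.Path;
  import java.util.List;

  // Hypothetical interfaces for the four pieces; names are illustrative only.

  // The Librarian: knows what everyone shares, but holds no file content itself.
  interface Librarian {
      List<CatalogEntry> search(String namePattern);  // browse the catalog
      List<String> locations(String md5);             // who currently has a copy?
      void publish(CatalogEntry entry);               // a sharer announces a file
      void retract(CatalogEntry entry);               // ...or stops sharing it
  }

  // Runs on each machine: decides which local files are shared, registers them.
  interface SharePublisher {
      void shareDirectory(Path dir);
      void refresh();   // re-scan and re-publish, e.g. on a timer
  }

  // P2P-style transfer of an actual file from whoever has it to whoever wants it.
  interface FileTransfer {
      InputStream open(String md5, String preferredHost);
      Path copyLocal(String md5);   // fetch into the local cache
  }

  // Local storage management: the age-off policy lives here.
  interface StorageManager {
      Path reserve(String md5, long sizeBytes);  // where to put an incoming copy
      void ageOff();                             // LRU / size / time heuristics
  }

  // A catalog record: enough to find and verify a file.
  class CatalogEntry {
      String filename;
      String md5;
      long sizeBytes;
      String url;    // e.g. host + path where the shared copy lives
  }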

Friday, June 21, 2013

The annual V-day M/F relationships writings...

you see online...  

[this blog post was started in 2011 and then forgotten for a while]

There were several interesting ones this year [2011]. The first was from a Mormon woman, let's say late 20s, in a big city (possibly NYC, but I don't recall). She was lamenting the usual "can't find a man" situation. So of course she had some requirements that weren't being met: same religion, no pre-marital sex. IIRC, she was unhappy that guys would not stick around long; not like she wasn't a good catch: good education, good job. Eventually one of them made it clear to her: "You left nothing for us to be/do in your life" (Mormons being still a bit more traditional per historical attitudes.) In other words: you are sufficiently independent that we have no self-perceived value in the relationship--how can we be a "provider" when you don't need that?

A male comment on an entirely different story I read some weeks later put it better: "We need to be needed." When we aren't, well, it's time to leave.

So around V-day there was a story by a woman in NYC who basically said to other women: "Can't find a man? It's you, not them." It went right to the heart of things: what you say you want and what you do aren't the same. Of course there was a firestorm of comments in response. Many were a little off-target ("Why the assumption that every woman needs a man?" -- you have to wonder why those folks even read the story to begin with, and then complained; they aren't the target audience). The author was herself having this trouble, thinks NYC demographics are part of the problem (apparently there are noticeably more single women than men there), but blames herself for essentially pursuing the excitement factor and variety rather than something else.

HU sez: "don't bitch about there being no good men--if you haven't found one then that isn't what you want."

Game Philosophy

What causes a game to be successful? Do you need an Advanced Degree (tm) to figure it out?

To what extent is a game's success based on:

  1. Visuals/graphics
  2. Story
  3. Action
  4. Explorability
  5. Good AI
  6. Other

What do I mean here?

Visuals/graphics: the very best-looking games these days are things like Skyrim, CoD, etc. Fabulous 3D world to wander through. I love Skyrim (altho I think I like Oblivion better, for reasons of greater variety); visually stunning. But other games, less good, have decent graphics, too, and some interesting games have fairly limited graphics. I've replayed Total Annihilation recently (from GOG, despite my having the original install disk), and heck, that's only just barely 3D at all, it's 8-bit color, etc, and yet that doesn't matter in the end--it can still be quite difficult.

Story: Skyrim etc have pretty good stories in them. There's a main plot, and some relevant/valuable major sub-plots, and lots of little tiny things. This all works great. In fact, those little side projects work so well, I haven't even started the main plot yet, and that's after several hundred hours of game time.

Action: Quake 1-3, Unreal Tournament, etc, are all about the action. The 3D-ness of the maps is interesting, but not critical. There's no story whatsoever. For me, this makes for limited interest. Replayability is all about improving your twitch skill. I enjoy the speed and action, but the only real interesting thing about replayability is that you can do it in relatively tiny increments, like 5-10 mins.

Explorability: Half-Life 2 is a great game, but it gets a zero on this scale. It's very linear. Too linear. Dungeon Siege 1 is equally linear (well, nearly so), but you have a lot of leeway in how you play your character/team. Skyrim etc are anything BUT linear--you don't EVER have to play the main story. I like this aspect--I really don't like being locked into playing a game only one possible way, being locked into a developer's limitations--they might as well do machinima of it for you. I'm not suggesting that linearity = ease of play; it means there's no opportunity to meander around and look at things.

Good AI: This doesn't even apply to a wide range of games. Team Fortress 2, UT04, Quake3, etc. The AI is other humans. Alpha Centauri, otoh, is mostly AI, and can be really hard to take on.

Other: not sure what I think this is, but maybe it's something like you can find in MMO games, where you can participate without exactly being a quest player, like by being a "crafter". This doesn't interest me. I actually felt more distracted by this whole routine. DLC is a new aspect.

Think back a bit on all your games...Pong, 40 years ago, was the absolute minimalist graphics game, but was not at all easy--it was action-only, no AI, playable in tiny increments; Tank was much the same, only very slightly more complex. This was the era when graphics were super-limited. Think of other games where the graphics are somewhat better, but the game still has to be dominated by something else--while better-looking, Diablo is a little less about action; it seems more about the process of managing loot and suchlike. There's some story of sorts, but I wasn't really keeping track of that too well--despite the maps being mostly unique each play-through, it's still fairly linear.

So where is the trade-off sweet-spot? I'm sure there's a range. Could we describe it, put some bounds on it? Maybe more by example than by measurement. Reason I ask: I have developed a game or two in the distant past (known as the 70s), and have contemplated making one again, but I find myself debating what flavor I would create. Certainly it would avoid things I dislike, like the repair/crafting stuff. I'd want auto-generated maps to maximize replayability. I'd want to have some reasonable amount of action, but not where it devolves into a twitch game. I'd want some reasonable amount of story; I think I'm more story-driven than most folks. My son is more action-oriented, it seems, he can play TF2 for hours/days; he has, however, played Oblivion et al about as much as I have, HL2 more, Mass Effect, Fallout3/FNV more...he does have more time right now, but that won't last.

How much work goes into making a good story? Is it really all that much? If it's not, you should be able to take one of the free "game engines" and make a game. How difficult is it? How do you make it a story you can actually participate in, as opposed to just following a script? Think of making a game from a movie: seems over-constrained.

It seems to me that good story is what really makes a game--for the kind where there even IS a story. Think about it--I think we tolerate less-than-photorealistic visuals for a better story.

So how hard is it to make a really good story? Do you need more than one? Is it even possible to have more than one? They'd mostly have to be disjoint. Perhaps retirement is the time for me to tackle creating a better story for a game. The problem with that is that it is probably going to still feel too linear. If you allow much variability it's going to become very hard to manage reaching a pre-defined endgame conclusion. My goal would probably be to aim for a much less predictable outcome: create a starting point, play rules, and run it more like a simulation, and watch to see what happens.

I need to re-experiment with some AI activities. Can I make something that is largely emergent-behavior and interesting?
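Just to remind myself what "simple rules, emergent behavior" can look like in almost no code, here's the classic Conway's Life step rule in Java; nothing game-specific, just the flavor of rules-driven emergence:

  // Tiny emergent-behavior toy: Conway's Game of Life. The rules fit in a
  // few lines; everything interesting that happens is emergent, not scripted.
  public class LifeToy {
      public static boolean[][] step(boolean[][] g) {
          int rows = g.length, cols = g[0].length;
          boolean[][] next = new boolean[rows][cols];
          for (int r = 0; r < rows; r++) {
              for (int c = 0; c < cols; c++) {
                  int n = 0;
                  for (int dr = -1; dr <= 1; dr++)
                      for (int dc = -1; dc <= 1; dc++)
                          if ((dr != 0 || dc != 0)
                                  && g[(r + dr + rows) % rows][(c + dc + cols) % cols])
                              n++;
                  // live cell survives with 2-3 neighbors; dead cell born with 3
                  next[r][c] = g[r][c] ? (n == 2 || n == 3) : (n == 3);
              }
          }
          return next;
      }
  }

A real game would obviously need richer agents than cells, but the shape of the experiment is the same: define rules, seed a starting state, and just watch.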

Wednesday, June 19, 2013

Distributed File System, part 2

In April, I had a training class called "Intro to Big Data", from Learning Tree. It's really aimed at getting you into Hadoop, but preliminary topics were covered first with separate tools. Nice course, really. LT is clearly good at this kind of thing (it was my first/only LT course), unlike some other "training" I've had in the last year.

So what sparked my thinking again on DFS/VFS was the segment about Distributed Hash Tables. That might work as the lookup mechanism I need to serve as the complete distributed file table.

Making a distributed database is not easy; even the big guys have trouble with this, and overall performance is not all that great. My fave SQL database, H2, is not distributed.

I do not, as yet, know anything about what sort of performance I need. *I* probably don't need all that much, but running my Grid Engine would need more.

Suppose I take a DHT tool (Apache Cassandra is one possibility) and have it store this:

filename, directory path, host

where filename is the access key, and maybe host/path is stored as a URL, so the entry becomes:

filename, URL

If the URL is good, I could pass it to the Grid Engine as is, and let relevant/interested process(es) use it directly to open a file stream. That could work; it could mean having a lot of file streams/handles open at any one time. (The GE typically wouldn't have more than 100 at a time per machine, probably. Well, maybe 200.) So depending on file size, maybe that's too much network traffic; if nothing else, it's not going to scale well.
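Here's a minimal Java sketch of the "hand the GE a URL and let it open a stream" idea, assuming each sharing machine runs some simple HTTP-ish file server; the host, port, and path below are made up:

  import java.io.InputStream;
  import java.net.URL;

  // Sketch: a Grid Engine task gets a URL from the lookup, then just reads it.
  public class RemoteFileReader {
      public static InputStream open(String fileUrl) throws Exception {
          // e.g. fileUrl = "http://hostA:8080/share/some/dir/input.dat"
          return new URL(fileUrl).openStream();
      }

      public static void main(String[] args) throws Exception {
          try (InputStream in = open("http://hostA:8080/share/input.dat")) {
              byte[] buf = new byte[8192];
              int n;
              long total = 0;
              while ((n = in.read(buf)) > 0) total += n;
              System.out.println("read " + total + " bytes");
          }
      }
  }

Each open stream is a live network connection, which is exactly why a hundred or two of these per machine starts to look like a scaling problem.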

Maybe I should be using the file-content MD5 as the key? That is at least a fixed size (32 hex chars). That ends up being much more the DHT approach, because you could distribute keys based on the first character of the MD5 (or maybe the first two, if you had a lot of machines).

MD5, URL
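A quick Java sketch of the MD5-as-key idea, including picking a node by the first hex character; the node count and the partitioning rule are just for illustration:

  import java.io.InputStream;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.security.MessageDigest;

  // Compute a file's MD5 and pick which node "owns" that key by its first
  // hex character (a crude DHT-style partitioning into up to 16 buckets).
  public class Md5Keying {
      public static String md5Of(Path file) throws Exception {
          MessageDigest md = MessageDigest.getInstance("MD5");
          try (InputStream in = Files.newInputStream(file)) {
              byte[] buf = new byte[8192];
              int n;
              while ((n = in.read(buf)) > 0) md.update(buf, 0, n);
          }
          StringBuilder hex = new StringBuilder();
          for (byte b : md.digest()) hex.append(String.format("%02x", b));
          return hex.toString();   // 32 hex chars
      }

      // 16 possible first characters, folded onto however many nodes exist.
      public static int partitionFor(String md5, int nodeCount) {
          return Character.digit(md5.charAt(0), 16) % nodeCount;
      }

      public static void main(String[] args) throws Exception {
          String key = md5Of(Paths.get(args[0]));
          System.out.println(key + " -> node " + partitionFor(key, 4));
      }
  }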

So what am I doing with these things? Suppose I have what I think of as a DHT: a local service which can tell me where a file actually is for a given MD5; that MD5 has come from the Grid Engine. OK, that feels clunky, because I only know MD5s from the GE.



Other tools: HDFS (Hadoop) has several issues: the "ingest problem" (i.e., how do you get all your data into it), internal replication (it wants 3X, although you can set that to just 1X: you lose the redundancy safety net, but ingest is faster), and block size, since it defaults to 64MB(!). That's maybe not so painful if your files are all 2GB video...

Another reason to NOT try to use a huge SAN cluster (you can daisy-chain these things) is that you end up having to have a minimum block size around 4k or 8k. Well, that's fine if your files are mostly big, but what happens when you tend to have a lot of 1K files? That issue argues for a VFS that lets you use (for example) a ZIP file as a file system, which probably gets around the minimum block-size problem; I expect that has other performance issues, but wasted space isn't one of them.
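For the ZIP-as-filesystem notion, the JDK (Java 7+) already ships a zip filesystem provider, so a sketch might look like this; the paths are made up:

  import java.net.URI;
  import java.nio.file.*;
  import java.util.HashMap;
  import java.util.Map;

  // Treat a ZIP archive as a mounted filesystem via the built-in provider.
  public class ZipAsFileSystem {
      public static void main(String[] args) throws Exception {
          Map<String, String> env = new HashMap<String, String>();
          env.put("create", "true");   // create the archive if it doesn't exist

          URI zipUri = URI.create("jar:file:/tmp/smallfiles.zip");
          try (FileSystem zipFs = FileSystems.newFileSystem(zipUri, env)) {
              // Lots of tiny files live inside one archive, so there's no
              // per-file minimum-block-size waste on the underlying storage.
              Path inside = zipFs.getPath("/notes/tiny1.txt");
              Files.createDirectories(inside.getParent());
              Files.write(inside, "a 1K file would go here".getBytes("UTF-8"));

              try (DirectoryStream<Path> listing =
                      Files.newDirectoryStream(zipFs.getPath("/notes"))) {
                  for (Path p : listing) {
                      System.out.println(p + " : " + Files.size(p) + " bytes");
                  }
              }
          }
      }
  }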

Friday, June 14, 2013

Distributed file system, part 1

There's a lot of data around on a lot of computers everywhere...far too much to fit on any one machine, or even on some kind of larger storage in any cost-effective manner for us little guys.

At work I have a SAN with 100TB of available storage. THAT is a lot of storage; but given what I do there, it's actually not all that hard to fill up. But that kind of device STILL does not solve the larger problem, nor was it very cost effective--I could replace the drives, going from 2TB to 3TB, but that would only be a 50% increase...suppose I need a 10X increase? 100X? More?

2TB drives aren't very expensive any more (you know, it seems almost absurd to even be able to say that, given that my first computer had a 20 MB drive in it), and it's not hard to find dirt-cheap machines around, used or even free. Regrettably they are seldom small, and therefore tend to be a little power hungry...not a prob for a data center kinda place, but uncomfortable for me at home.

Suppose I decided I had a problem to work where 30TB looked like the right capacity...and let's say that means 10 machines @ 3TB each...

I've written a heterogeneous distributed OS-agnostic Grid Engine. Perfect for doing data processing on a 10-node cluster. But this really works best when all the nodes are using a shared/common file system. THAT works best with a SAN and a Blade Server, like at work. Well, the blade server part isn't really very expensive ($3k will buy a decent used one that is full, and pleasantly fast--look on EBay for IBM HS21 systems). But getting a SAN on there--not going to happen. OK, I could perhaps put some high-cap 2.5" drives in the blades, etc, but that doesn't solve the resulting problem, which is still: how do they share data with each other?

Well, on a limited basis you can make file shares and cross-mount all the shares across all the machines--but that doesn't scale all that far, and those shares all become a nightmare--and they STILL aren't a shared common file system.

So really the problem I have is: how do I make a shared common file system across a bunch of machines? I need it to be heterogeneous, since I run Mac/Win/Linux machines, and am considering other things like Gumstix.

There are homogeneous file systems around...several, it turns out, although they are mostly Linux-only (FUSE, Lustre/Gluster, etc), which doesn't help me. OK, I could just buy the cheap hardware, and install Linux everywhere, but what happens when I have a Windows-only software tool to run?

I've been hunting for an OS-agnostic tool; it's not really clear whether there is such a thing. OpenAFS (i.e., the Andrew File System) might do it, which would be perhaps the ideal solution. I haven't tried this yet. Pretty much everything I've read about doesn't meet my requirements, heterogeneous being the first fail point. At work I'm using StorNext with the SAN, but I can't afford that on my own.

So I think I have to solve this myself. What I kinda think I want is a BYOD approach where you'd have to run some agents to join, but you'd have access to everything shared on the network without having to cross mount a zillion things that you can't even find out about casually.
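A rough Java sketch of what such an agent might do at startup: walk its shared directory and announce each file to a catalog service. The endpoint URL and the one-line-per-file format are invented here just to show the shape of it:

  import java.io.File;
  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  // Walk a shared directory tree and POST each file's path and size to a
  // catalog service; everything about the service itself is hypothetical.
  public class ShareAgent {
      public static void main(String[] args) throws Exception {
          File shareRoot = new File(args.length > 0 ? args[0] : "/data/shared");
          announce(shareRoot);
      }

      static void announce(File f) throws Exception {
          if (f.isDirectory()) {
              File[] kids = f.listFiles();
              if (kids != null) for (File k : kids) announce(k);
              return;
          }
          String line = f.getAbsolutePath() + "," + f.length() + "\n";
          HttpURLConnection conn = (HttpURLConnection)
                  new URL("http://catalog-host:8080/publish").openConnection();
          conn.setRequestMethod("POST");
          conn.setDoOutput(true);
          try (OutputStream out = conn.getOutputStream()) {
              out.write(line.getBytes("UTF-8"));
          }
          System.out.println("announced " + f + " -> " + conn.getResponseCode());
      }
  }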

What you would NOT have is something that shows up in Finder/Windows-Explorer. I can probably figure out how to finagle that too, altho I don't consider that a critical requirement. I expect that OpenAFS has that figured out.

Is it going to take an Advanced Degree(tm) to figure this out? It's not an easy problem.