Wednesday, June 10, 2015

Disk drive troubles

or other "storage devices", since they ain't always rotating platters any more.

Fall 2014 I had an OMG-bad system crash at home, when the power supply on my PC decided to die in the middle of Windows installing system updates (you know, the one point where they tell you "Don't turn off your computer now" -- yeah, that's bad; there is no recovery).

So this time I put in an SSD and re-installed Windows 7. So far, that's been just fine. Windows definitely boots faster, and software installed there starts faster. Great. Hmm, I wonder: did I clone that thing afterwards? I probably should have put in a smaller drive, or partitioned it smaller, and then cloned the boot partition. Well, Acronis doesn't work the way I'd like (which is to say: just like CCC on Mac). [Later: well, it didn't at that time, or I didn't figure it out; subsequently, at work, we've cloned a smaller drive onto a larger one without the problem I had.]

Anyway...early this year, I got a "promotion" of sorts: I am now leading a tiny team (4 ppl) doing a new version of a project at the customer location. New (used) hardware, new concepts and variations on things I was doing last year.

So we bought some used servers, a partially filled rack: twelve HP DL360 G6 1U boxes, dual quad-core Intel Xeon 5560s, four drive slots each (into which we're putting three SSDs and one platter drive for backup), 96 GB RAM each (we're going to try a software re-architecting using RAMDISK instead of actual disks). Three clusters for federated databases, for dev/int/rel usage.

The SSDs are small, 128 GB SATA. Crucial MX100.

Well, in a few weeks I have already killed one drive (totally dead), a second is going bad, and I might have pushed a third one onto the death-track.

How? Well, I've replaced the indexing engine in the commercial database we've been using with Berkeley DB, a very seriously fast (but not, itself, SOTA) hashing engine. Whatever BDB does, and how it does it, apparently kills an SSD pretty fast, exceeding the write limit on it.
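For the curious, the write load looks roughly like this (a minimal sketch, not our actual code; it assumes the bsddb3 Python bindings purely for illustration): every put lands in whatever hash bucket the key maps to, so a stream of inserts dirties pages scattered all over the file.

    # Minimal sketch of the kind of hash-indexed write load described above.
    # Assumption: the bsddb3 Python bindings to Berkeley DB, for illustration only;
    # this is not our actual indexing code.
    import os
    from bsddb3 import db

    index = db.DB()
    index.open("objects.db", dbtype=db.DB_HASH, flags=db.DB_CREATE)

    # Millions of puts, each key hashing to an effectively random bucket,
    # so the on-disk pattern is a stream of small scattered page updates.
    for i in range(1000000):
        key = os.urandom(16)                 # stand-in for an object ID
        index.put(key, b"object metadata")   # each put can dirty a different page

    index.close()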

Remember that an SSD has a finite write lifetime, quite short in comparison with a regular old platter-style disk. I have indexed hundreds of millions of objects with BDB on platter drives at work over the last year--with zero failures. I can't even get through a few million objects on the SSD before it starts reporting "unable to find sector" errors, which means I have damaged it. I don't even know if CHKDSK can repair that--it may be completely hosed (the first one is totally dead; won't even mount).
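A back-of-envelope check (every number here is an assumption for illustration, not a measurement from our drives; I haven't pulled the SMART counters): a 128 GB consumer drive in this class is rated for something like 72 TB written, and the usable lifetime scales inversely with however much write amplification the workload causes.

    # Back-of-envelope SSD endurance estimate. Every number below is an assumption
    # for illustration, not a measurement from our drives.
    rated_endurance_bytes = 72e12    # assumed ~72 TB-written rating for a 128 GB consumer SSD
    page_size = 4096                 # assumed database/flash page size in bytes
    pages_touched_per_object = 2     # assumed: hash bucket page + overflow/metadata page
    write_amplification = 10         # assumed flash-level amplification for small random updates

    bytes_worn_per_object = page_size * pages_touched_per_object * write_amplification
    objects_to_exhaustion = rated_endurance_bytes / bytes_worn_per_object
    print("roughly %.0f million objects before the rated endurance is gone"
          % (objects_to_exhaustion / 1e6))
    # Even with these assumptions that's hundreds of millions of objects, so dying
    # after only a few million suggests the effective amplification of this access
    # pattern is far worse than a factor of 10.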

OK, this means for now that we are going to RAMDISK a bit sooner than I anticipated, which presents its own interesting challenges. Not because making/using a RAMDISK is hard (it's nearly trivial), but because the logistics of not losing any data permanently present challenges given the rest of the system requirements. 96 GB allows for an interesting-size RAMDISK.
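The shape of it is simple enough; the hard part is deciding how much data you can afford to lose between flushes. A minimal sketch of the idea (paths and cadence are hypothetical, not our actual setup): periodically copy the RAMDISK contents down to the platter backup drive.

    # Minimal sketch: periodically snapshot a RAM disk to the platter backup drive
    # so a crash or power loss can't permanently lose the data living in RAM.
    # The paths and the 15-minute cadence are hypothetical.
    import shutil
    import time

    RAMDISK_PATH = "R:/dbdata"          # hypothetical RAM-disk mount
    BACKUP_ROOT = "D:/ramdisk_backups"  # hypothetical directory on the spinning backup drive

    def snapshot():
        dest = "%s/%s" % (BACKUP_ROOT, time.strftime("%Y%m%d-%H%M%S"))
        shutil.copytree(RAMDISK_PATH, dest)   # copy the live RAM-disk contents to disk
        return dest

    if __name__ == "__main__":
        while True:
            snapshot()
            time.sleep(15 * 60)   # how much work you can stand to lose sets this interval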

After talking with a friend who's a disk/storage-system expert: RAMDISK is probably the final answer anyway, but at the very least an Enterprise SSD would be a good idea. So we might be trying that too.

First, however, I've bought an Enterprise 146 GB 10K SAS drive for the moment ($25, omg) to replace the now-dying SSD (which was itself a replacement for the first, REALLY dead one); we'll see what that does. And then maybe we'll try an Enterprise SSD to see what that buys us.

There are a bunch of performance needs on this new system (the current prod system has speed issues all over, but it has ZERO optimizations, so the new system gets to deliver with ALL of those optimizations, as we figure them out).

But really... I've killed two SSDs in the last 4 weeks, and I could easily kill the others in a matter of hours...

Painful lesson. Those "consumer" SSDs have an entirely different target audience than me and my HPC work.

Later: the SAS drive works great, no errors, $24 delivered. So we bought more of them, to be the boot drives in each server. They aren't as fast as I'd like, but RAMDISK is still the target, because we're going to deliver on machines with 256 GB of RAM--what the heck else should we do with that much RAM?

Later later: it turned out that RAMDISK didn't make a difference in our usage of the commercial database--that wasn't the slow part. I was bummed.

RAMDISK would have been great on a previous effort, if I'd had machines I could have put enough RAM into to make it worthwhile, because I was pounding some enterprise 2T drives hard enough to have to replace four (out of 60, in a SAN). The region thing I was pounding that hard could have been handled by a RAMDISK: boot the system, copy some executables into the RAMDISK, and then run from there, avoiding the pounding on the drive where those things sat (the pounding was reading them + libs hundreds to thousands of times per minute, totaling hundreds of millions of reads by the time that activity ended).
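For what it's worth, that idea is only a few lines (a sketch with hypothetical paths and binary names, assuming a Linux-style RAM-backed mount; the point is just to read the binaries off the platters once instead of millions of times):

    # Sketch of "stage the hot executables into a RAM disk at boot, run from there".
    # Paths and the binary name are hypothetical.
    import shutil
    import subprocess

    SOURCE_DIR = "/opt/app/bin"        # where the executables + libs normally live (on disk)
    RAMDISK_DIR = "/mnt/ramdisk/bin"   # RAM-backed mount point

    shutil.copytree(SOURCE_DIR, RAMDISK_DIR)    # one read pass off the drive, at startup
    subprocess.call([RAMDISK_DIR + "/worker"])  # every subsequent launch reads from RAM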

Later(3): we punted that commercial database. There were other issues, too, about it being unable to deal with the volume of data we were putting into it. Other customers were apparently able to put a lot more data in, but their data was different from ours--we were banging up against some implementation decisions that turn out to be poor ones for data like ours (supernodes in an object database). We've replaced it with Neo4j, for some other speed reasons; we don't yet know if that is going to play out long term.
