Monday, June 12, 2017

Music tastes and online info

My favorite flavor of music is called "Progressive Rock."

If you know what it is already, then you know what it is. If you don't know what the name means, a short description is that it's rock music that has progressed beyond simple two- or three-chord songs.

Generally its roots are in classical music rather than pop songs. It's longer and more complex: music for listening rather than singing or dancing. The focus is much more on the instrumental composition than on the lyrics; a lot of it has no lyrics at all.

And it is generally best described by a list of the performers. The category is broad, and there are flavors of Prog Rock I'm not keen on (in general, the newer and kinda derivative stuff). So: ELP, Yes, early Genesis. The style originated around 1968.

My favorite online source for info is the Gibraltar Encyclopedia of Progressive Rock. I was casually involved with that back in the 90s, before it was even an online source. My participation was slight in the early-to-mid 90s, but it was there, back when there was really just the alt.music.progressive newsgroup (or whatever it was called; it's been years). The web has improved the amount of info available, but things have splintered. Things like FreeDB are good, but they have zero meta-info beyond album name, track names, and track lengths, and zero cross-linking.

Anyway. GEPR. There was once a book of it. I bought that; it's a great item, but such a thing is out of date the day after it's printed.

The GEPR website has languished for about six years now. Fred Trafton ran it for years and still owns it, but he hasn't had time for it, which endangers the value and the life of the content (granted, the Wayback Machine would still have it all even if Fred didn't at least renew the domain name).

So he and I were having an email chat a few weeks ago. I argued that it needs converting into a wiki, so that others can participate and he doesn't have to worry about maintenance. I offered to do a chunk of that initial setup; in particular, to scrape the existing website and strip all the raw info out of the web pages (regrettably, there's no database in the background, just hand-managed HTML, and aging HTML at that).

So I've been working on that. I did a large data-mine off another (actually dead) website a few years ago, so I knew what I was up against in attempting it. I don't expect or intend to achieve a perfect extraction; that isn't even worth my time to attempt. Because it's hand-made HTML, there are lots of tiny variations all over, and I'm not writing special handlers for each one.
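To give a flavor of the pass I'm writing, here's a minimal sketch in Python with BeautifulSoup. The one-artist-per-h2 convention and the gepr_pages/ layout are illustrative guesses on my part, not the actual GEPR structure; the point is just the shape of the loop.

    # Sketch of an extraction pass over hand-managed HTML.
    # The h2-per-artist convention and gepr_pages/ layout are
    # hypothetical, for illustration only.
    import glob
    from bs4 import BeautifulSoup

    def extract_entries(path):
        with open(path, encoding="latin-1", errors="replace") as f:
            # html.parser is forgiving about unclosed tags and other slop
            soup = BeautifulSoup(f.read(), "html.parser")
        entries = []
        for heading in soup.find_all("h2"):        # say each artist starts at an <h2>
            name = heading.get_text(strip=True)
            parts = []
            for sib in heading.find_next_siblings():
                if sib.name == "h2":               # next artist begins; stop
                    break
                parts.append(sib.get_text(" ", strip=True))
            entries.append({"artist": name, "text": " ".join(parts)})
        return entries

    all_entries = []
    for page in glob.glob("gepr_pages/*.html"):
        all_entries.extend(extract_entries(page))

Funneling every page through one function like that means each little hand-made variation surfaces as a visible flaw in one place, instead of hiding in per-page code.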

At this point I'm approaching 2/3 completion. Some of that is done via correcting HTML flaws in the source rather than writing special-case handling for singletons.

The hard thing about doing this kind of extraction is that people who make pages like this treat HTML as a structural content organizer, rather than just as visual markup. And then they aren't consistent about what they do, so the structure is casual rather than strict. This is possible because web browsers tolerate a lot of slop in the HTML. That's really not a good thing any more.
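If you want a tiny demonstration of the slop problem: feed a lenient parser some mis-nested tags and it will cheerfully invent a repaired tree. Another parser may invent a different repair, which is exactly why the "structure" in pages like this can't be trusted to mean anything.

    from bs4 import BeautifulSoup

    sloppy = "<b>Yes <i>Close to the Edge</b> 1972"  # mis-nested, <i> never closed
    # The parser accepts this and guesses at a fix; a different parser
    # (html5lib, lxml) may well guess differently.
    print(BeautifulSoup(sloppy, "html.parser").prettify())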

The extraction is going into a database now. Once I can read everything I want from the HTML into the DB, I have to make a simplistic form for getting it out again, so that I/we can test to be sure there are no content-rip disasters in there.
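The database side is nothing fancy; here's a sketch of what I mean, with SQLite and column names I made up. The review.html dump at the end is the "simplistic form for getting it out again":

    import sqlite3

    # Stand-in rows; in practice this is the output of the extraction pass
    all_entries = [{"artist": "Yes", "text": "...", "file": "y.html"}]

    con = sqlite3.connect("gepr.db")
    con.execute("""CREATE TABLE IF NOT EXISTS entries (
        artist TEXT PRIMARY KEY, review TEXT, source_file TEXT)""")
    con.executemany("INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
                    [(e["artist"], e["text"], e["file"]) for e in all_entries])
    con.commit()

    # Crude round-trip check: re-render every row as bare HTML and
    # eyeball it against the original page for content-rip disasters.
    with open("review.html", "w", encoding="utf-8") as out:
        for artist, review, src in con.execute(
                "SELECT artist, review, source_file FROM entries ORDER BY artist"):
            out.write(f"<h2>{artist}</h2>\n<p>{review}</p>\n<!-- from {src} -->\n")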

Then I have to figure out how to dump that database into something that is wiki pages. I really have no idea how I might do this. Maybe I don't even want to do that, exactly. Maybe I want to push the DB into a wiki DB.
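One possibility, and I stress it's just a possibility: generate one wikitext file per artist and let the wiki engine's import tooling do the page creation (MediaWiki, for instance, ships a maintenance script, importTextFiles.php, that turns a directory of text files into pages, using the file names as titles). The category tag below is a made-up example.

    import sqlite3
    import pathlib

    outdir = pathlib.Path("wiki_pages")
    outdir.mkdir(exist_ok=True)

    con = sqlite3.connect("gepr.db")
    for artist, review in con.execute("SELECT artist, review FROM entries"):
        body = f"{review}\n\n[[Category:Progressive rock artists]]\n"
        # The file name becomes the page title on import, so strip
        # characters a title (or a file name) can't hold
        safe = artist.replace("/", "-")
        (outdir / f"{safe}.txt").write_text(body, encoding="utf-8")

The other route, writing rows straight into the wiki's own database tables, means matching whatever internal schema the engine uses, which feels far more fragile than going through its import tooling.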

My involvement here is to get to the point where there's a wiki full of as much content as I can manage without spending forever on it. I don't want to own it after that.
