Sunday, February 28, 2010

Vancouver Olympics

I'm watching the gold-medal ice hockey game between the Canadian and US men's teams...I watched the women's final as well...I had forgotten that the USA had a good women's ice hockey team, but we weren't quite good enough this time. It's interesting watching hockey where checking isn't allowed. The hockey games have not been as challenging as one might have wished, given that the US and Canadian teams are really dominant.

The Olympics have certainly been good this year. I watch nearly every minute that is actually broadcast while I'm home, which is about 10 times as much TV as I can really stand. It is pretty high-res stuff, tho...in the instant replay for hockey, the puck is really sharp on the screen, which is impressive considering that thing can be moving 100 mph.

I have something else going on this year: I wanted to track news stories about various events, because I need some additional input content for training a piece of software I wrote a year ago. It filters text content (stories, like news) based on training data which is pre-categorized...i.e., I take various stories, assign topics, create a training model from them, and then use that training model to categorize new stories. The technique uses what's called a "support vector machine"; it's mostly about positive examples (where other trained systems use both positive and negative training). Anyway, I generally grab content from various RSS feeds, because they are slightly pre-categorized (like the Washington Post sports feed). This is not great, because it tends to be limited content: NFL, NBA, MLB, and the college equivalents. I need some other sources, but haven't looked very hard yet. The Olympics is a fairly concentrated bunch of stories about skiing, skating, hockey, and curling. And it turned out that Vancouver has its own RSS feed, which is ideal for me: I have a separate tool, written several years ago, whose sole purpose is to grab stories from RSS feeds. This means I can continuously grab stories, not worry about them getting superseded or expiring, and then zip through a folder of them, marking them with their training topics.
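To make the train-then-categorize idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the software described above: it trains an ordinary linear SVM per topic (one-vs-rest, using both positive and negative examples) with Pegasos-style subgradient steps on the hinge loss, over crude bag-of-words features. The function names, the tiny training set, and the four-letter word cutoff are all made up for the example.

```python
import random
import re
from collections import Counter

# Hypothetical hand-labeled stories; a real training folder would be far larger.
TRAINING = [
    ("The goalie stopped the puck in overtime", "hockey"),
    ("Canada scored on a power play after the puck hit the post", "hockey"),
    ("The skier won the downhill race on fresh snow", "skiing"),
    ("A slalom skier missed a gate in heavy snow", "skiing"),
]

def featurize(text):
    """Bag-of-words vector, L2-normalized. Words under four letters are
    dropped as a crude stand-in for a stop-word list (an assumption)."""
    counts = Counter(re.findall(r"[a-z]{4,}", text.lower()))
    norm = sum(c * c for c in counts.values()) ** 0.5 or 1.0
    return {word: c / norm for word, c in counts.items()}

def score(weights, features):
    """Dot product of a sparse weight vector with a sparse feature vector."""
    return sum(weights.get(word, 0.0) * v for word, v in features.items())

def train_svm(examples, lam=0.01, epochs=200, seed=0):
    """Linear SVM without a bias term, trained by Pegasos-style subgradient
    descent. `examples` is a list of (features, label), label being +1 or -1."""
    rng = random.Random(seed)
    weights = {}
    step = 0
    for _ in range(epochs):
        order = list(examples)
        rng.shuffle(order)
        for features, label in order:
            step += 1
            eta = 1.0 / (lam * step)
            # Check the margin against the current weights before updating.
            violated = label * score(weights, features) < 1.0
            # Regularization shrink, applied on every step.
            for word in weights:
                weights[word] *= 1.0 - eta * lam
            # Hinge-loss step, only when the margin is violated.
            if violated:
                for word, v in features.items():
                    weights[word] = weights.get(word, 0.0) + eta * label * v
    return weights

def train_topics(labeled_stories):
    """One-vs-rest: one SVM per topic, every other topic counts as negative."""
    featurized = [(featurize(text), topic) for text, topic in labeled_stories]
    topics = sorted({topic for _, topic in featurized})
    return {
        topic: train_svm([(f, 1 if t == topic else -1) for f, t in featurized])
        for topic in topics
    }

def categorize(models, text):
    """Return the topic whose SVM gives the new story the highest score."""
    features = featurize(text)
    return max(models, key=lambda topic: score(models[topic], features))
```

Usage would look like `models = train_topics(TRAINING)` followed by `categorize(models, "The puck slid past the goalie")`. A positive-examples-only setup would need a different formulation (a one-class SVM, for instance), which this sketch does not attempt.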

I should fire up a few more feeds for this, but I especially wanted to get the Olympics because of the fairly unique set of stories I could get.
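For a sense of what "grabbing stories from a feed into a folder" involves, here is a small Python sketch using only the standard library. It is not the tool mentioned above; the function names and the file-naming scheme (a hash of the story link) are assumptions for illustration, and the feed URL is left for the caller to supply.

```python
import hashlib
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path

def fetch_feed(url):
    """Download the raw RSS XML; the caller supplies the feed URL."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def parse_rss(xml_data):
    """Pull title, link, and description out of each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_data)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return stories

def save_new_stories(stories, folder):
    """Write each story to its own file, skipping ones already saved, so
    stories survive after the feed drops or replaces them."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for story in stories:
        # Name the file by a hash of the link (or the title if there is no link).
        key = story["link"] or story["title"]
        path = folder / (hashlib.sha1(key.encode("utf-8")).hexdigest()[:16] + ".txt")
        if path.exists():
            continue
        path.write_text(
            story["title"] + "\n\n" + story["description"] + "\n",
            encoding="utf-8",
        )
        saved.append(path)
    return saved
```

Running `save_new_stories(parse_rss(fetch_feed(url)), "stories")` on a schedule would keep accumulating stories in one folder, ready to be marked with topics later.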

The only flaw in this is that I don't have a really good set of story topics that covers a lot of territory in a lot of detail. If I made more topics, I could probably get into more detail, but I don't know how much is really appropriate.

This whole idea, for me, dates back to some work from around 1996. You'd have a text-processing system where content would be brought in (like a large number of RSS feeds), you'd run it through a topic recognizer, and run that output through several differently-trained name-finders. From there you'd feed the names into a database and use some other correlation techniques.
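The shape of that pipeline can be sketched in a few lines of Python. Everything here is a placeholder: the keyword lookup stands in for a trained topic recognizer, the capitalized-word regex stands in for a trained name-finder, and "correlation" is reduced to asking which names show up in more than one story. The table layout and all names are hypothetical.

```python
import re
import sqlite3

# Hypothetical keyword lists; a trained topic recognizer would replace this.
TOPIC_KEYWORDS = {
    "hockey": ["hockey", "puck", "goalie"],
    "skiing": ["ski", "slalom", "downhill"],
}

def recognize_topic(text):
    """Stand-in topic recognizer: first topic with a keyword hit, else 'other'."""
    lowered = text.lower()
    for topic, words in TOPIC_KEYWORDS.items():
        if any(word in lowered for word in words):
            return topic
    return "other"

def capitalized_name_finder(text):
    """Crude stand-in name-finder: runs of two or more capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", text)

# Each topic could have its own differently-trained finders; here they share one.
NAME_FINDERS = {
    "hockey": [capitalized_name_finder],
    "skiing": [capitalized_name_finder],
}

def process(stories, db):
    """Run each story through topic recognition, then the name-finders for
    that topic, and store (story, topic, name) rows in the database."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS names (story_id INTEGER, topic TEXT, name TEXT)"
    )
    for story_id, text in enumerate(stories):
        topic = recognize_topic(text)
        for finder in NAME_FINDERS.get(topic, [capitalized_name_finder]):
            for name in set(finder(text)):
                db.execute(
                    "INSERT INTO names VALUES (?, ?, ?)", (story_id, topic, name)
                )
    db.commit()

def names_in_multiple_stories(db):
    """A very simple correlation: names that occur in more than one story."""
    rows = db.execute(
        "SELECT name, COUNT(DISTINCT story_id) FROM names "
        "GROUP BY name HAVING COUNT(DISTINCT story_id) > 1 ORDER BY name"
    )
    return [(name, count) for name, count in rows]
```

With `db = sqlite3.connect(":memory:")`, calling `process(stories, db)` and then `names_in_multiple_stories(db)` gives a list of names worth a closer look. The regex will happily pick up things like "Coach Bob Jones" as a single name, which is exactly the kind of noise trained name-finders are meant to reduce.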

I was trying to go this direction again in '08 with the pirates demo: could you find out what those guys are up to from online content? (In retrospect, I think not; it seems to be entirely targets of opportunity, rather than any organized piracy with malice aforethought.) Like the drug-related stuff in Mexico and Colombia, it's mostly about kidnapping rather than loot.

----

The Canadian hockey team is outplaying us, same as in the women's game. Sigh. We are not doing the passing we need to, and we're shooting too early.