Saturday, March 14, 2009

Text Processing etc, part 2

Been continuing on with another text processing tool. This one will be able to read in a story, and spit back the top several topics in that story.

Actually, this is three tools. First is the training-input creator. Second is the model creator. Third is the runtime document processor.

The training creator shows you an input doc (web page, text file, etc.; simple things), lets you mark it with topics, and saves the result. Hmm...it just occurred to me: should I allow PDF as input? That's not actually too hard to accomplish, with a PDF-ripper front-end.

The Model Creator takes the training input and creates a recognition model. It's not a statistical model. I was thinking about using an SVM (Support Vector Machine) here, but that kinda wants actual probabilities, which I don't have. I probably could get them if I can think of a way to normalize the values.
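One simple way to normalize raw word counts into probability-like values in the 0-1 range might be plain relative frequency. A minimal sketch (the function name and shape are my own, not from the actual tool):

```python
from collections import Counter

def normalize_counts(words):
    """Turn raw word counts into relative frequencies that sum to 1.0."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```

Whether relative frequencies are actually a good enough stand-in for probabilities here is a separate question; this just gets the values onto a common scale.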

Runtime processor receives the story you want to know about, and returns some topics.
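As a rough sketch of what that last step might look like (my guess at a minimal version, not the actual model): score each topic by how many of its training words appear in the document, and return the best-scoring topics.

```python
def top_topics(doc_words, topic_models, n=3):
    """Score each topic by word overlap with the document; return the top n.

    topic_models maps topic name -> set of words seen in training docs
    tagged with that topic (a stand-in for whatever the real model holds).
    """
    doc = {w.lower() for w in doc_words}
    scores = {topic: len(doc & words) for topic, words in topic_models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

A real model would weight words rather than just count overlaps, but the shape of the runtime step is the same: document in, ranked topics out.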

I also use an English word dictionary in this (although I don't think there's any requirement to do that). You'd think that finding a good one wouldn't be that hard...I thought that. But I was wrong! Finding dictionary files is not that hard; I have several. The biggest one I could find online had over 200K words, but you'd be amazed at the basic words that were missing... "cat", for example. And "horse"? You'd be likewise amazed at the really unusual words it DOES have: "catachrestically" -- what the heck is that? And why is "catawampously" in there? Have you ever even seen those two before?

This is a bizarre dictionary. And it's WAY bigger than the others I found...although it seems likely that the others have a lot more of the basic/common stuff and not so much the exotic words. Maybe I just need to merge them all...
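Merging is easy enough if every list is one word per line. A sketch (the sources here are any iterables of lines, e.g. open files):

```python
def merge_word_lists(sources):
    """Union several word-list sources (iterables of lines) into one sorted list."""
    words = set()
    for source in sources:
        for line in source:
            word = line.strip().lower()
            if word:  # skip blank lines
                words.add(word)
    return sorted(words)
```

Lowercasing everything on the way in also papers over the lists disagreeing about capitalization.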

But this weirdness has forced me to track unknown words, since a lot of them are fairly common.
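Tracking the unknowns can be as simple as counting everything the word list rejects. A sketch, assuming the word list is held as a set of lowercase words:

```python
from collections import Counter

def track_unknown_words(doc_words, word_list):
    """Count document words the word list doesn't know, most frequent first.

    Frequent unknowns are good candidates for adding to the word list.
    """
    unknown = Counter(w.lower() for w in doc_words if w.lower() not in word_list)
    return unknown.most_common()
```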

Why can't we have a good, pretty complete, free dictionary word list? i.e., one that is better than the ones I've found recently...

Related to that: ever looked at WordNet? An interesting project. If you look around, you can find a number of browser-based WN viewers: enter a word, get a view of the words or phrases that are nearby in terms of some flavor of semantics. You also get the part of speech (noun, verb, etc.). And some more exotic relations that I don't quite know the meaning of.

What you don't get is also interesting, because I went looking for this. You don't get the root word for your word. E.g., if your word is "catawampously", the root word for that is "catawampous". So who cares about root words? Well, the topic-ID software would have a better model if I could convert training words and runtime-doc words into their root/stemmed form.

So I do, of course, know about the Porter Stemmer; I grabbed the Java version and have integrated it...the problem is that it overstems, in my opinion. I have read some of the more formal studies that compare stemmers; Porter is really good -- for English -- and really bad -- for other languages. Porter is entirely suffix-based and only knows English suffixes. (You could do the same thing for other languages, I'm sure.) Porter will make an error like stemming "heading" into "head" -- where "heading" most likely means "direction" and "head" most likely means "the part of your body where your brain is" -- although I suspect that both have a less common usage that is exactly reversed. The formal comparisons suggest this is a small problem. So I don't know. It'd be easy enough to insert the Porter stemmer into the pipeline and try it out -- except that I don't know how I'd tell whether it was better...
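The overstemming behavior is easy to reproduce with a toy version. This is NOT the real Porter algorithm, just a minimal suffix stripper of my own to show how purely suffix-based rules collapse distinct senses:

```python
# Toy suffix stripper: not the real Porter algorithm, just an illustration
# of how purely suffix-based rules behave.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "s")

def naive_suffix_stem(word):
    """Strip the first matching suffix, if the remainder stays long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Both "heading" and "heads" collapse to "head" here, which is exactly the direction-vs-body-part conflation: the rule has no way to know they were different words.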

Leads you to wonder why there's no serious dictionary-based stemmer...I've read about those, too, and what you seem to get is a hybrid that does a little of the Porter style plus more table lookup.

So why isn't there a pure dictionary-based table-lookup stemmer? You'd base it off a really large dictionary (you see where this has been going now). It would not be perfect; you'd get some errors where the stem differs depending on noun/verb usage. You'd only need a hash table to implement it. If you needed to be fancier, you could deal with the noun-verb-etc. aspect, but figuring that out in the first place is probably more expensive than the error (and is itself an imperfect process, so you'd be introducing a different flavor of error into the answer).
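A sketch of that hash-table idea, with a tiny hand-made table standing in for one generated from a large word list (all the entries here are mine, for illustration):

```python
# Tiny stand-in for a table generated from a large dictionary/word list.
STEM_TABLE = {
    "headings": "heading",        # a lookup table can keep "heading" distinct...
    "heading": "heading",
    "heads": "head",              # ...from "head"
    "catawampously": "catawampous",
}

def lookup_stem(word, table=STEM_TABLE):
    """Return the root form if the table knows it; otherwise the word unchanged."""
    return table.get(word.lower(), word)
```

Unknown words pass through unchanged, which is arguably safer behavior than a suffix rule guessing at them.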

This doesn't make sense to me...a pure dictionary-based stemmer would be time-consuming to create, but trivial to use. And it would work for all languages where root words exist (i.e., not Chinese/Japanese/etc.). It'd be a little large: a complete English dictionary is a few megabytes, whereas the Porter stemmer code is a few kilobytes.

---

Further notes: this weird dictionary, "unabr.dict", appears to be associated with password cracking...which might explain the missing common words. Might. Associated with crossword puzzles, too?

This URL:

http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start

seems to say a lot more about word lists, and "unabr.dict" especially. I downloaded all the word lists mentioned there; together they should produce a better set than "unabr.dict".

Should also point out that when I have written "dictionary" here, I really mean "word list", not a dictionary-with-definitions-n-stuff. That probably explains why I found "unabr.dict" early on.

Also to be noted: these lists aren't going to contain names, except when a name is also an ordinary word. That's vaguely annoying if you're doing what I'm doing with word lists, because "Bernanke" and "Greenspan" are going to correspond to several money/gov't topics, and probably nothing else.
