Thursday, March 26, 2009

News reporting n such

Loved the Jon Stewart new-A ripping on Jim Cramer. (As I recall, the last time Stewart really took someone to task, a week later they didn't have a show any more.)

So there's this local nitwit, Richard Cohen, who writes for the Wash Post, who complains about Stewart's having done this...and goes on to prove he didn't actually watch the episode. And it all feels like whiny-boy stuff, and "how come my newspaper's not doing so well?"

Which of course is because newspapers are in the process of ceasing to exist. ALL of them.

Which is kinda sad, but probably inevitable.

More about the Jag

Spring is here, and I'm getting itchy to get the Jag fixed up and drive it.

Finally got a replacement (not new) alternator to try out (hope that's all that is wrong).

Just ordered new inner tubes for the wheels, so I can get the tires all replaced too. Then I can get out and really wind it up!

Having it out of commission the past few months has been painful...going past it in the garage, knowing I can't do anything but look.

Last I fiddled with it, a few weekends back when there was some actual warmth outside, I took off the non-functional A/C and the old alternator. With luck, this new(er) one...

Monday, March 23, 2009

O/S basics

I don't know why this is, the solution seems absurdly obvious...Only recently did calendar and address book functionality become part of the operating system, but it's so basic you have to wonder why it took so long.

It's not like that kind of info is difficult to store and make available...

I was back on this issue because of getting the new MacBook (17") at the beginning of the month (3/09). I wanted to get it set to read the calendars on my G5 and my wife's Mini. But I had lost track of how to publish those calendars in the Leopard upgrade a year ago. So we hadn't been sharing calendars for months. And that extended to our PDAs.

The problem has several aspects: #1--I don't want my calendars out on the web. I put things in them that are only for her to know about. #2--I'm not buying Leopard Server for another umpty-hundred $ (apparently "LS" has built-in assistance for managing this). #3--I wasn't remembering the correct name for what I wanted to do; I kept thinking it was CalDAV, which it would be for "LS".

What I needed was WebDAV for iCal. Regrettably there doesn't seem to be any helper tool around to get you through the awkward need to use Terminal (cmd-line). Not that I can't, been a unix user for nearly 20 years...but there are more than a few tiny details.

Turned out that I still had the old setup properly in place, I just needed to do the Apache parts, which are about creating a userid/password and an httpd config block. Of course, this means YAPTR (yet another password to remember), which really means it has to get written down somewhere...

I'm still migrating from the old PowerBook onto the new one...there's built-in help for something that complex, so why not for WebDAV? The migration is mostly done, except for things like my old address book, the keychain, the mail archive, and probably something else I don't remember. I've done manual drag/drop so far and haven't run the migration tool. (Why not? Because I had to completely reconfigure my home network again. Seems I have to start over every single time I add a new wireless device, because I don't remember how I did it before. Even after writing it down. Too many passwords.)

Spelling check should have been part of the O/S years ago, too. Why wasn't it? It's not like that is hard either...granted, a big dictionary is a good-sized file, which would have been problematic >20 years ago, but now? Should be a standard function so that any app can use it.

Built-in general-purpose database, too. That, too, would have been a problem in the 80s...but it ain't now. Granted, there is no shortage of free databases around, but they take a lot of work to do anything, even something simple.

Which is why Excel became the de facto database for an awful lot of information. I mostly use FileMaker for that sort of thing, for my own personal data. It's pretty friendly.

But a built-in database would be the right kind of place to store all kinds of stuff...you could argue that the filesystem IS a database, and in a very loose sense, that's true, inasmuch as you can store anything. But it doesn't really have any built-in organizing capabilities; limited sorting; usually doesn't handle large quantities of files in a single folder very well...

What other things should be O/S built-in capabilities?

Saturday, March 14, 2009

Text Processing etc, part 2

Been continuing on with another text processing tool. This one will be able to read in a story, and spit back the top several topics in that story.

Actually this is 3 tools. First is the training input creator. Second is the model creator. Third is the runtime document processor.

Training creator shows you an input doc (web page, text file, etc; simple things), allows you to mark it with topics, and saves the result. Hmm...just occurred to me: should I allow PDF as input? That's not actually too hard to accomplish, with a PDF ripper front-end.

Model Creator takes the training input and creates a recognition model. It's not a statistical model. I was thinking about using SVM (Support Vector Machine) in this, but that kinda wants actual percent probabilities, which I don't have. I probably could if I think of a way to normalize values.

Runtime processor receives the story you want to know about, and returns some topics.
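To make the three-stage split concrete, here is a minimal sketch of how the pieces could fit together. This is not the actual model described above; it is a deliberately simple stand-in that scores topics by raw word counts (so no probabilities are involved). All function names and the scoring rule are made up for illustration.

```python
from collections import Counter, defaultdict

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (raw.strip(PUNCT).lower() for raw in text.split())
    return [w for w in words if w]

def make_training_example(text, topics):
    """Stage 1 (hypothetical): pair a document with its hand-marked topics."""
    return {"text": text, "topics": list(topics)}

def build_model(examples):
    """Stage 2 (hypothetical): per topic, count the words seen in its documents."""
    model = defaultdict(Counter)
    for example in examples:
        words = tokenize(example["text"])
        for topic in example["topics"]:
            model[topic].update(words)
    return model

def top_topics(model, story, n=3):
    """Stage 3 (hypothetical): rank topics by summed raw counts of the story's words."""
    words = tokenize(story)
    scores = {topic: sum(counts[w] for w in words)
              for topic, counts in model.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [topic for topic, score in ranked[:n] if score > 0]

examples = [
    make_training_example("The Fed cut interest rates again", ["economy"]),
    make_training_example("The team won the game in overtime", ["sports"]),
]
model = build_model(examples)
print(top_topics(model, "Interest rates fell", n=1))
```

The scores here are plain counts, which is exactly the normalization problem mentioned above: they would need some kind of scaling before they could be treated as probabilities.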

I also use an English word dictionary in this (although I don't think there's any requirement to do that). You'd think that finding a good one wouldn't be that hard...I thought that. But we are wrong! Finding dictionary files is not that hard, I have several. The biggest one I could find online had over 200K words, but you'd be amazed at the basic words that were missing... "cat", for example. And "horse"? You'd be likewise amazed at the really unusual words it DOES have: "catachrestically" -- what the heck is that? And why is "catawampously" in there? Have you ever even seen those two before?

This is a bizarre dictionary. And it's WAY bigger than the others I found...although it seems likely that the others have a lot more of the basic/common stuff and not so much the exotic words. Maybe I just need to merge them all...
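Merging is about as simple as it sounds. A minimal sketch, assuming each word list is one word per line (the function names are made up, and the sample words stand in for real files):

```python
def load_word_list(lines):
    """Normalize an iterable of lines into a set of lowercase words."""
    return {line.strip().lower() for line in lines if line.strip()}

def merge_word_lists(*lists):
    """Union any number of word lists; duplicates and case differences collapse."""
    merged = set()
    for lines in lists:
        merged |= load_word_list(lines)
    return merged

# In-memory stand-ins for the exotic list and a small common-word list.
exotic = ["catachrestically\n", "catawampously\n"]
common = ["cat\n", "horse\n", "Cat\n", "\n"]

merged = merge_word_lists(exotic, common)
print(sorted(merged))
```

An open file object can be passed in place of each list, since iterating a file yields its lines.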

But this weirdness has forced me to track unknown words, since a lot of them are fairly common.
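Tracking unknown words can be as simple as counting everything that misses the word list. A rough sketch, with a made-up function name and a tiny stand-in word list:

```python
from collections import Counter

def count_unknown_words(text, known_words):
    """Count words in text that are absent from known_words (a set of lowercase words)."""
    unknown = Counter()
    for raw in text.split():
        word = raw.strip(".,;:!?\"'()").lower()
        if word and word not in known_words:
            unknown[word] += 1
    return unknown

known = {"the", "sat", "on", "mat"}  # pretend "cat" is missing, as in the big list
unknown = count_unknown_words("The cat sat on the mat. The cat!", known)
print(unknown.most_common(5))
```

Sorting by frequency puts the common-but-missing words at the top, which are the ones worth adding back to the list first.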

Why can't we have a good, pretty complete, free dictionary word list? i.e., one that is better than the ones I've found recently...

Related to that: ever looked at WordNet? An interesting project. If you look around, you can find a number of browser-based WN viewers: enter a word, get a view of the words or phrases that are nearby in terms of some flavor of semantics. You also get the part of speech (noun, verb, etc.), plus some more exotic aspects that I don't quite understand.

What you don't get is also interesting, because I went looking for this. You don't get the root word for your word. E.g., if your word is "catawampously", the root word for that is "catawampous". So who cares about root words? Well, the topic-ID software would have a better model if I could convert training words and runtime-doc words into their root/stemmed form.

So I do of course know about the Porter Stemmer. I grabbed the Java version and have integrated that...problem is that it overstems, in my opinion. (I have read some of the more formal study work that compares stemmers; Porter is really good--for English--and really bad--for other languages. Porter is entirely suffix-based, and only knows English suffixes. You could do the same thing for other languages, I'm sure.) Porter will make an error like stemming "heading" into "head"--where "heading" most likely means "direction" and "head" most likely means "part of your body where your brain is", although I suspect that both have less common usage that is exactly reversed. The formal comparisons suggest this is a small problem. So I don't know. It'd be easy enough to insert the Porter stemmer into the pipeline and try it out--except that I don't know how I'd tell if it was better...
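To show why any purely suffix-based approach overstems, here is a toy suffix stripper. It is NOT the Porter algorithm (which has several rule steps and conditions); the suffix list and the minimum stem length are assumptions picked just to make the point.

```python
# Toy suffix list, checked in order; this is an illustration, not Porter's rule set.
SUFFIXES = ("ing", "ly", "ed", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, as long as at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("walking", "heading", "catawampously", "catawampous", "sing"):
    print(w, "->", strip_suffix(w))
```

The rule has no idea what a word means, so "heading" collapses into "head" just as readily as "walking" collapses into "walk", and "catawampous" loses its final "s" even though it is not a plural.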

Leads you to wonder why there's no serious dictionary-based stemmer...I've read about them, too, and what you seem to get is a hybrid that does a little of the Porter style, and more table-lookup.

So why isn't there a pure dictionary-based table-lookup stemmer? You'd base it off a really large dictionary (you see where this has been going now). That would not be perfect, you'd get some errors where the stem is different depending on noun/verb usage. You'd only need a hash-table to implement this. If you needed to be fancier, you could deal with the noun-verb-etc aspect, but figuring that out in the first place is probably more expensive than the error (and is itself an imperfect process, so you'd be introducing a different flavor of error into the answer).
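A minimal sketch of what that could look like. The lookup side really is just a hash table; the time-consuming part is building the table. Here the table is bootstrapped by accepting a candidate root only when that root is itself in the word list, with hand-made overrides for the cases that come out wrong. The function names, the suffix list, and the sample words are all assumptions for illustration.

```python
def build_stem_table(word_list, suffixes=("ly", "s", "ing"), overrides=None):
    """Map each word to a root that also appears in the word list, else to itself."""
    words = {w.lower() for w in word_list}
    table = {}
    for word in words:
        table[word] = word
        for suffix in suffixes:
            root = word[: -len(suffix)]
            if word.endswith(suffix) and root in words:
                table[word] = root
                break
    # Hand corrections win over anything generated automatically.
    table.update(overrides or {})
    return table

def lookup_stem(word, table):
    """Stemming is a single hash-table lookup; unknown words pass through unchanged."""
    key = word.lower()
    return table.get(key, key)

words = ["cat", "cats", "horse", "horses", "head", "heading",
         "catawampous", "catawampously"]
table = build_stem_table(words, overrides={"heading": "heading"})

for w in ("cats", "catawampously", "catawampous", "heading", "Bernanke"):
    print(w, "->", lookup_stem(w, table))
```

Without the override, "heading" would still map to "head" because both are legitimate words, which is the noun/verb kind of error described above. The difference from a suffix stripper is that a word is never reduced to something that is not itself a word.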

This doesn't make sense to me...a pure dictionary-based stemmer would be time-consuming to create, but trivial to use. And it would work for all languages where root words exist (i.e., not Chinese/Japanese/etc). It'd be a little large: a complete English dictionary is a few megabytes, whereas the Porter stemmer code is a few kilobytes.

---

Further notes: this weird dictionary, "unabr.dict", appears to be associated with password-cracking...which might explain the missing common words. Might. Associated with crossword puzzles, too?

This URL:

http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start

seems to say a lot more about word lists, and "unabr.dict" especially. I downloaded all the word lists mentioned there. Should produce a better set than "unabr.dict".

Should also point out that when I have written "dictionary" here, I really mean "word list", not a dictionary-with-definitions-n-stuff. That probably explains why I found "unabr.dict" early on.

Also to be noted: these lists aren't going to contain names, except when a name is also an ordinary word. Vaguely annoying, if you're doing what I'm doing with word lists, because "Bernanke" and "Greenspan" are going to correspond to several money/gov't topics, and probably nothing else.