Tuesday, September 19, 2006

Personal web-crawler/indexers

The WWW has been around for over a decade now. Crawling and indexing ought to be pretty standard by this point; you ought to be able to just find a tool and use it. But that turns out to be A LOT harder than you'd think. Does it take an Advanced Degree (tm) to figure it out? Yes, it does.

I was hoping it didn't, however.

Well, in some sense, it doesn't, until you want to index Acrobat/PDF files and MS Word docs, too. That's where the real trouble starts. There are a number of crawler tools, and really just one indexer (Lucene). But there are very few tools that are combined crawler-indexers.

I didn't investigate them all, so some of my conclusions may not be 100% correct. But the ones I did investigate all fell rather short of my goals.

So what were my goals?

1) more or less like Google (R)
2) easy to use
3) filter-limitable to an internal website (don't want to index the world)
4) must be able to extract and index the contents of PDF and Word (ppt too would be nice)
5) cost $0 (Google mini appliance is $2k-10k)
6) continuous incremental crawl/index
7) written in Java if I have to do coding
8) web-based interface for search/retrieval
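
Goal #3 is the easiest one to show in code. What follows is only a minimal sketch of one way to keep a crawler on a single internal site, not the code described in this post: the class name CrawlScope and the host intranet.example.com are made up for illustration. The idea is to check every discovered link against an allowed host before queueing it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: decides whether a discovered link belongs to the
// one site we want to crawl. Names here are illustrative assumptions.
final class CrawlScope {
    private final String allowedHost;

    CrawlScope(String allowedHost) {
        this.allowedHost = allowedHost.toLowerCase(Locale.ROOT);
    }

    // Returns true only for http/https URLs whose host matches exactly.
    boolean inScope(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return false; // relative, opaque (mailto:), or host-less link
            }
            boolean isWeb = scheme.equalsIgnoreCase("http")
                    || scheme.equalsIgnoreCase("https");
            return isWeb && host.toLowerCase(Locale.ROOT).equals(allowedHost);
        } catch (URISyntaxException e) {
            return false; // malformed links are simply skipped
        }
    }
}
```

A crawler loop would call inScope() on each extracted link and drop anything that returns false, so the index never wanders out onto the wider web. Relative links would need to be resolved against the page URL first.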

Lucene supplies #8, which is good; I could have written one myself, it's not that hard. It's Java, which covers #7 when combined with a Java crawler. #6 comes from my own coding, along with some mods I needed in order to do #4 and to solve some other issues (having the crawl run only at night, since that server is occasionally busy during the day, while the search/retrieve functions need to run during the day but not at night). As for #5, the cost is nothing but my time. PDF/Word extraction (#4) comes from a couple of other programs, pdf2text and antiword; both have known failure rates, but work most of the time. #3 is from my own coding, and #2 and #1 are likewise.
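
Two of the pieces above, the night-only crawl and handing PDF/Word files to external programs, can be sketched roughly like this. It is a sketch under assumptions, not the actual code: the class name NightlyExtractor and the 22:00 to 06:00 window are invented, the PDF command is assumed to be an Xpdf-style `pdftotext file -` that writes text to standard output (substitute whatever converter is installed), `antiword file` is assumed to print to standard output, and the output is assumed to be UTF-8. It needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.time.LocalTime;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch: night-only gating plus text extraction via external tools.
final class NightlyExtractor {
    // Assumed crawl window; it crosses midnight (22:00 up to, not including, 06:00).
    static final LocalTime START = LocalTime.of(22, 0);
    static final LocalTime END = LocalTime.of(6, 0);

    static boolean inCrawlWindow(LocalTime now) {
        if (START.isBefore(END)) {
            return !now.isBefore(START) && now.isBefore(END);
        }
        // Window wraps past midnight: late evening OR early morning.
        return !now.isBefore(START) || now.isBefore(END);
    }

    // Picks an external command by file extension; empty if unsupported.
    static Optional<List<String>> extractorCommand(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return Optional.of(List.of("pdftotext", path, "-")); // assumed CLI
        }
        if (lower.endsWith(".doc")) {
            return Optional.of(List.of("antiword", path)); // assumed CLI
        }
        return Optional.empty();
    }

    // Runs the extractor and returns its stdout, or empty if it failed.
    static Optional<String> extractText(String path)
            throws IOException, InterruptedException {
        Optional<List<String>> command = extractorCommand(path);
        if (!command.isPresent()) {
            return Optional.empty();
        }
        Process process = new ProcessBuilder(command.get())
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        // These tools fail on some files, so a non-zero exit just means "skip it".
        return process.waitFor() == 0 ? Optional.of(text) : Optional.empty();
    }
}
```

The crawler would check inCrawlWindow(LocalTime.now()) before fetching each page and pause when it returns false. Returning an empty Optional on failure lets the indexer skip a bad document instead of stopping the whole run.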

Result: other tools I tried claimed to do what I wanted, but didn't. The biggest one of the bunch, Nutch, didn't do PDF/Word as far as I could tell; some online documentation suggests it does, but I wasn't able to figure out how. Other tools didn't even claim to do PDF/Word, so they only got limited attention. Nutch did do the crawl and index just fine, as well as search and retrieval. But I've gotta have PDF and Word. There's a tool I didn't try out, Lius, because I found out about it a lot later, after I had done nearly all the requisite coding. It too claims to do PDF and Word, and might be better than Nutch. Too late now.

Anyway, if you are interested in my code for this, just email me. I'm happy to give it to you. Parts are sloppy, I warn you. However, it does a good job.

But it did take an Advanced Degree (tm), because of the amount of coding I had to do. Well, that's something I'm good at.