Tuesday, February 03, 2009

Text processing concepts and tools

Over the past 15 years I have worked on text-processing software tools more than once, and lately I'm doing it again.

While it doesn't take an Advanced Degree (tm) to understand *most* of it, some aspects do get pretty exotic.

I started on this stuff way back when, and I've used several of the available tools over the years. Most don't really meet my needs or wants.

I participated in some of the MUC evaluations (MUC-6 and MUC-7, I think), and have followed the MET and ACE evaluations.

There are others, of course.

There are other tools around that do the named-entity job. I wrote one myself, partly because the one I had used most had some flaws I didn't care for (one of which was occasionally a show-stopper), and partly for experimental purposes.

What I would consider an interesting set of text-processing capabilities:

Tokenizer (separate words from each other and non-words)
Reconstitutor (re-assemble words or other things from separate tokens)
Stemmer (separate root words from their suffixes; the Porter Stemmer is the standard; a small sketch of this and the tokenizer follows the list)
Pattern matcher (match word sequences)
Name lists (annotated/typed names of whatever)
Dictionaries
Topic Finding
WordNet
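
To make the first few items concrete, here is a minimal sketch of a tokenizer feeding a stemmer. The regex is a deliberately naive assumption about what counts as a word, and I'm borrowing NLTK's PorterStemmer purely for convenience; this is illustration, not the tool I actually wrote.

import re
from nltk.stem import PorterStemmer  # assumes the NLTK package is available

# Naive tokenizer: a word is a run of letters/digits/apostrophes;
# any other single non-space character (punctuation) is its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")

def tokenize(text):
    """Separate words from each other and from non-words."""
    return TOKEN_RE.findall(text)

def stem_tokens(tokens, stemmer=PorterStemmer()):
    """Reduce word tokens to their Porter stems; leave punctuation alone."""
    return [stemmer.stem(t) if t[0].isalnum() else t for t in tokens]

text = "The tokenizers separated words, stemming them afterwards."
tokens = tokenize(text)
print(tokens)
print(stem_tokens(tokens))

The reconstitutor is roughly the inverse operation: gluing tokens back together, which mostly means remembering what separated them in the first place.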

----

Other related tools whose value I'm not convinced of:

POS tagging
Sentence splitting/parsing

-----

Why are these tools of interest or value?

There is a lot of text content on the Web and in databases. There is no possible way to read it all, and nearly no way to even find out what you might *want* to read. How do you find all the stuff you *should* read? Or stories that mention things of interest? How do you find stories that are on topics of interest but didn't happen to use the words you expected (i.e., defeating Google)? What if the story is in a foreign language, which REALLY defeats Google?

You need some help.

That leads to two tools of particular interest.

1) Named-entity recognition. Find various reasonably-unique-meaning words and phrases (a sketch follows this list).
2) Topic recognition. Stories on any given topic are likely to use a lot of the same words.
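
For the name-recognition side, here is a toy sketch of the name-list approach: look up known, typed names in tokenized text, longest match first. The little NAMES table is invented for illustration; a real recognizer (mine included) also has to cope with names it has never seen.

# Toy name recognizer: annotated name lists plus longest-match lookup.
# The entries and types below are made up for the example.
NAMES = {
    ("barack", "obama"): "PERSON",
    ("united", "states"): "LOCATION",
    ("new", "york"): "LOCATION",
}
MAX_NAME_LEN = max(len(k) for k in NAMES)

def find_names(tokens):
    """Return (start, end, type) spans for known names, longest match first."""
    spans = []
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_NAME_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in NAMES:
                spans.append((i, i + n, NAMES[key]))
                i += n
                break
        else:
            i += 1
    return spans

print(find_names("Barack Obama visited New York .".split()))
# [(0, 2, 'PERSON'), (3, 5, 'LOCATION')]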

A third tool of interest would recognize relationships between words in the stories. This could include the simple concept of pronoun references, but also more complex relationships: "Barack Obama is the President of the United States" contains a person name, a location name, a job title, and the relationships among them. In the MUC bake-offs this territory was covered by the coreference and template tasks. It's dramatically harder than the others.
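
Just to make that third idea concrete, here is the crudest possible relation sketch: once names are in hand, run patterns over the text around them. The single hand-written pattern below only catches the exact "X is the Y of Z" phrasing, which is precisely why the real task is so much harder than plain name or topic recognition.

import re

# One hand-written relation pattern: "<person> is the <title> of <place>".
# Every other phrasing of the same fact would need its own pattern
# (or real parsing), which is where the difficulty lives.
RELATION_RE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) is the "
    r"(?P<title>[A-Z][a-z]+) of (?P<place>the United States|[A-Z][a-z]+)"
)

m = RELATION_RE.search("Barack Obama is the President of the United States.")
if m:
    print(m.group("person"), "|", m.group("title"), "|", m.group("place"))
    # Barack Obama | President | the United States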

Name recog is important because you can use it to mark up stories as being about that particular name, without necessarily having seen that name before. Topic recog is valuable because you can then find stories "about" something-or-other, without having to know any of the right keywords. Of course the list of topics isn't going to be tiny, so choosing the right topic is not necessarily trivial.
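
To show what "a lot of the same words" buys you, here is a bare-bones sketch of topic scoring by vocabulary overlap (cosine similarity over word counts). The two tiny topic profiles are invented; a real system would build much larger profiles from many stories and would weight and stem the words.

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def words(text):
    return Counter(text.lower().split())

# Invented, tiny topic profiles; real ones come from many stories.
TOPICS = {
    "elections": words("vote ballot candidate election campaign president"),
    "baseball": words("inning pitcher batter league baseball stadium"),
}

story = words("The candidate won the election after a long campaign")
scores = {t: round(cosine(story, TOPICS[t]), 3) for t in TOPICS}
print(max(scores, key=scores.get), scores)

The story never had to contain the word "elections" itself, which is the whole point.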

-----

Peculiarities:

Human languages fall into a number of families, but only a few broad groupings matter here. One group is the "Romance" languages, derived from Latin; many European languages are of that type, or at least share its alphabet. Then there are languages written largely with ideographic characters, like Chinese and Japanese (Korean mostly uses its own alphabet, Hangul, though it historically mixed in Chinese characters). There are oddments like Thai, with its own distinctive script. Arabic belongs to yet another family, the Semitic languages.

Writing direction: left to right, right to left, top to bottom...likewise varies, but probably corresponds closely to origin.

Use of an alphabet, and whitespace. Chinese and Japanese don't have an alphabet in the same way that Latin-script languages do, nor do they use white space as word separators (neither does Thai, for that matter). I'd argue that these are fundamental flaws in the languages' written form.

All these things complicate computing, because they lead to language-specific solutions.
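
One example of such a language-specific solution: for scripts written without spaces, whitespace tokenization is useless, so a common (if simplistic) trick is greedy longest-match segmentation against a word list. The sketch below uses plain ASCII and an invented dictionary so the idea is visible; real segmenters for Chinese, Japanese, or Thai are considerably more involved.

# Greedy longest-match segmentation for text written without spaces.
# The dictionary is invented; real ones hold many thousands of entries.
DICTIONARY = {"the", "white", "house", "whitehouse", "is", "open", "today"}
MAX_WORD = max(len(w) for w in DICTIONARY)

def segment(text):
    """Split an unspaced string into dictionary words, longest match first."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_WORD, len(text) - i), 0, -1):
            if text[i:i + n] in DICTIONARY:
                words.append(text[i:i + n])
                i += n
                break
        else:
            # No dictionary word starts here; emit one character and move on.
            words.append(text[i])
            i += 1
    return words

print(segment("thewhitehouseisopentoday"))
# ['the', 'whitehouse', 'is', 'open', 'today']

Note that longest-match happily picks "whitehouse" over "white" plus "house"; whether that is right depends entirely on the dictionary, which is part of what makes these solutions so language-specific.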

-----

Words are important. Without them you cannot express concepts, and you can't really invent new concepts. Language has to be mutable. But let's have the computer do some of the work.
