Tuesday, October 24, 2006

Why does BlogRevolution mess up the summaries sometimes?

... because it's complicated.

That's the short answer.

BlogRevolution finds newly-popular links on web logs, assembles a list of them according to some arbitrary threshold, then goes out and tries to generate titles and descriptions for these links. Virtually all of the most popular new links for a day on political blogs are news articles or other blog posts.
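For the curious, here's a rough sketch in Python of what that first step amounts to -- the names, the toy data, and the threshold value are all made up for illustration, not lifted from the real code:

    from collections import Counter

    POPULARITY_THRESHOLD = 3   # arbitrary cutoff, purely for illustration

    def find_popular_links(blog_posts):
        """Count how many distinct posts mention each outbound link,
        then keep only the links that clear the threshold."""
        counts = Counter()
        for links in blog_posts:
            for link in set(links):          # de-dupe within a single post
                counts[link] += 1
        return [link for link, n in counts.items() if n >= POPULARITY_THRESHOLD]

    # Toy usage: three posts all pointing at the same news article
    posts = [
        ["http://example.com/story", "http://example.com/other"],
        ["http://example.com/story"],
        ["http://example.com/story"],
    ]
    print(find_popular_links(posts))   # ['http://example.com/story']

The summarizing step that comes after this is where the trouble starts.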

It's completely automated. A non-intelligent computer program that doesn't understand English tirelessly does it all.

Sometimes it messes up the descriptions.

The current state of the front page is an example -- most of the summaries barely work. It's not usually this bad; it usually gets about 9 out of 10 right, but that's the nature of quasi-randomness.

It usually messes up on articles in 2nd-tier newspapers or local TV news sites -- this is because major news sites leave little clues behind for search engines as to what the summaries should be. Apparently the cheaper, less-thorough content management systems that lesser news sites use don't do this, or do it badly.

There are three basic scenarios where the article summarizer (which usually gets it right!) chokes. One: often a news site will put a bunch of ad-copy propaganda into the title tag of its news article ("The best news site evarrr!!!"). I have various half-formed ideas on how to defeat this and filter it out. The second source of error is when the news site tries to game search engines by stuffing those clues I mentioned earlier with similar exclamation-pointed agitprop instead of a description of what the story is actually about. Finally, if the summarizer doesn't find any clues, it is forced to compensate by guessing from the content of the article itself. In such cases it again gets it right more often than not, but the error rate is still too high.
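Put another way, the summarizer falls through a chain of fallbacks, and each scenario above is one link in that chain failing. A minimal sketch of the idea, again in Python with invented helper names and deliberately crude regexes (the real thing works differently and is less naive):

    import re

    def looks_like_ad_copy(text):
        """Crude stand-in for the agitprop detector mentioned below."""
        return "!" in text or "best" in text.lower()

    def extract(pattern, html):
        m = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
        return m.group(1).strip() if m else None

    def summarize(html):
        # Scenario 1: take the <title> tag, but reject ad-copy titles
        title = extract(r"<title>(.*?)</title>", html)
        if title and looks_like_ad_copy(title):
            title = None
        # Scenario 2: the meta-description "clue" left for search engines,
        # which some sites stuff with the same exclamation-pointed agitprop
        desc = extract(r'<meta\s+name="description"\s+content="([^"]*)"', html)
        if desc and looks_like_ad_copy(desc):
            desc = None
        if title and desc:
            return title, desc
        # Scenario 3: no usable clues -- guess from the article body itself
        body = re.sub(r"<[^>]+>", " ", html)          # strip tags, crudely
        guess = " ".join(body.split())
        return (title or guess[:60], desc or guess[:200])

    page = ('<html><head><title>City council passes budget</title>'
            '<meta name="description" content="The best news site evarrr!!!">'
            '</head><body><p>The council voted 5-2 on Monday...</p></body></html>')
    # The stuffed meta description gets rejected, so the body guess is used:
    print(summarize(page))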

I have an agitprop detector that can virtually always tell whether a few lines of text are booster ad copy or factual news. The problem is that if it rejects the text, the only option left is to guess from the page content itself, and I've found that sites that try to fool you with ad copy usually have such awfully written HTML that the page scanner has a much higher error rate than usual.
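I won't reproduce the detector here, but the general flavor is a handful of hand-tuned heuristics that score a snippet as ad copy versus news copy. A toy version -- word lists, weights, and threshold all invented for this post -- might look like:

    def agitprop_score(text):
        """Higher score means the snippet looks more like booster ad copy
        than factual news. Everything here is an invented placeholder."""
        lowered = text.lower()
        score = 0
        score += 2 * text.count("!")                   # exclamation points
        score += sum(lowered.count(w) for w in
                     ("best", "breaking", "exclusive", "award-winning", "#1"))
        if text.isupper():                             # ALL-CAPS shouting
            score += 3
        # real news copy tends to attribute and report; crude proxy for that:
        score -= sum(lowered.count(w) for w in
                     ("said", "reported", "according to"))
        return score

    def is_agitprop(text, threshold=2):
        return agitprop_score(text) >= threshold

    print(is_agitprop("The best news site evarrr!!!"))             # True
    print(is_agitprop("The mayor said the budget passed Monday"))  # False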

When BlogRevolution messes up a summary, it's probably more jarring to people who are non-me, because I understand what's going wrong on the backend, but the average reader does not. They may even think this site is done by amateurs, which, in fact, is a pretty good guess. Most people may not even figure out that the site's content is put together by a machine, and will expect perfection, or worse, professionalism.

This flaw is one reason that I have not actively promoted the site at all -- the page scanner needs work. Its high accuracy rate is not high enough, and I think that the occasional error is too jarring for the casual user.
