Tuesday, December 12, 2006

What's cooking in BlogRevolution Labs

  • Better title cleaning. Some article titles still contain promotional cruft like ' | The St. Olaf Times-Dispatch'. We've taken steps to remove most of this, but we're working on something more general that will also handle more obscure news sites like the St. Olaf Times-Dispatch.

  • News photo detection from blog sites in addition to news sites. This is simple in the abstract but requires a rewrite of our RSS detection algorithm, if it can be called an algorithm; it's more like a series of heuristics.

    Web sites or weblogs can announce the location of their RSS feed by placing its URL in a <link> tag in the head section of their HTML pages. Unfortunately, remarkably few bloggers seem to know about this, and when this method fails, as it does surprisingly often, we have to detect links to the feed using a series of searches. First, look for a link that goes to a URL ending in either .rss or .xml and that contains a small orange image. If that fails, look for any link to a URL ending in .rss or .xml that has "feed" in the link text. And so on, through a dozen or so rules, each a little more general (and therefore more error-prone) than the previous. Currently the detector uses a best-match method in which it only checks the most likely result; we're changing it so that it gives each candidate a score and tests, say, the top 5 candidates to see whether they really are the site's RSS feed.
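
To make the scoring idea concrete, here is a rough sketch in Python. It is not our actual code: the class name, function name, and score weights are all made up for illustration, it only covers the <link> tag and the first two link rules described above, and it treats any image inside a link as a stand-in for "a small orange image", since checking size or color would mean fetching the image.

```python
from html.parser import HTMLParser

# MIME types that a <link rel="alternate"> tag uses to announce a feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedCandidateParser(HTMLParser):
    """Collect possible feed URLs from a page, each with a heuristic score.

    The weights below are arbitrary, illustrative values.
    """

    def __init__(self):
        super().__init__()
        self.scores = {}        # url -> best score seen so far
        self._href = None       # href of the <a> currently open, if any
        self._text = []         # link text collected inside that <a>
        self._has_img = False   # whether that <a> contains an <img>

    def _add(self, url, score):
        if score > 0:
            self.scores[url] = max(score, self.scores.get(url, 0))

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            mime = (attrs.get("type") or "").lower()
            if "alternate" in rel and mime in FEED_TYPES and attrs.get("href"):
                self._add(attrs["href"], 10)  # explicit announcement: best signal
        elif tag == "a" and attrs.get("href"):
            self._href, self._text, self._has_img = attrs["href"], [], False
        elif tag == "img" and self._href is not None:
            self._has_img = True

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = self._href
            text = "".join(self._text).lower()
            score = 0
            if url.lower().endswith((".rss", ".xml")):
                score += 3                    # feed-like extension
                if self._has_img:
                    score += 3                # link wraps an image (feed icon?)
                if "feed" in text:
                    score += 2                # "feed" in the link text
            self._add(url, score)
            self._href = None


def rank_feed_candidates(html, top_n=5):
    """Return up to top_n (url, score) pairs, highest score first."""
    parser = FeedCandidateParser()
    parser.feed(html)
    ranked = sorted(parser.scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

Each of the returned candidates would then be fetched and checked to see whether it really is a feed, instead of trusting only the single best match. Keeping the score per URL (rather than stopping at the first rule that fires) is what lets a looser rule still contribute a fallback candidate when the stricter ones guess wrong.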

Oh, and of course in about 1 in 25 articles the summarizer completely messes up. This is a long-standing issue for which we have several possible fixes, but which we have thus far avoided in favor of grasping at lower-hanging fruit.

BCC: Tantalus
