Thursday, October 26, 2006

Oh, boy

That meme that was/is going around?

Slashdot's covering it as deliberate Google-bombing.

You're not just spamming Google, you're spamming me. I will have to find a way to detect this activity and eliminate its effect on the news results.

Editions are A-OK

Recently I added "editions" of the site -- every update during the day is given a name.

I really like it. It makes it much more obvious when the site has been updated, and encourages people to check back. It meets my personal seal of approval, which is the best that any BlogRevolution feature could ever hope for.

Wednesday, October 25, 2006

Attack of the killer meme

Today's posts (archive link) are all the result of many bloggers copying-and-pasting the same list of links and encouraging others to do the same. Since BlogRevolution finds news by looking for new links appearing in blogs, this virus-like "meme" has totally overwhelmed all other news that might have been detected today.

Tuesday, October 24, 2006

Why does BlogRevolution mess up the summaries sometimes?

... because it's complicated.

That's the short answer.

BlogRevolution finds newly-popular links on web logs, assembles a list of them according to some arbitrary threshold, then goes out and tries to generate titles and descriptions of these links. Virtually all of the most popular new links for a day on political blogs are news articles or other blog posts.
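
For the curious, the link-finding step can be sketched roughly like this in Python. It is an illustration, not the actual BlogRevolution code: the function name `newly_popular_links`, the shape of the inputs, and the default threshold of 3 are all assumptions made up for the example.

```python
from collections import Counter


def newly_popular_links(todays_posts, known_links, threshold=3):
    """Return links first seen today that at least `threshold` blogs cite.

    todays_posts: dict mapping a blog name to the list of links it posted today.
    known_links:  set of links already seen on earlier days (hypothetical input).
    threshold:    arbitrary cutoff, chosen here only for illustration.
    """
    counts = Counter()
    for blog, links in todays_posts.items():
        # Count each blog at most once per link, and skip links that are not new.
        for link in set(links) - set(known_links):
            counts[link] += 1
    # Most-cited first; ties are broken alphabetically so the output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [link for link, n in ranked if n >= threshold]


if __name__ == "__main__":
    posts = {
        "blog_a": ["http://example.com/x", "http://example.com/y"],
        "blog_b": ["http://example.com/x", "http://example.com/old"],
        "blog_c": ["http://example.com/x", "http://example.com/y"],
    }
    print(newly_popular_links(posts, {"http://example.com/old"}, threshold=2))
```

Counting each blog only once per link keeps a single chatty blog from pushing a link over the threshold by itself.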

It's completely automated. A non-intelligent computer program that doesn't understand English tirelessly does it all.

Sometimes it messes up the descriptions.

The current state of the front page is an example -- most of the summaries barely work. It's not usually this bad; it usually gets 9 out of 10 correct, but that's the nature of quasi-randomness.

It usually messes up on articles in second-tier newspapers or local TV news sites -- this is because major news sites leave little clues behind for search engines as to what the summaries should be. Apparently the cheaper, less-thorough content management systems that lesser news sites use don't do this, or do it badly.

There are three basic scenarios where the article summarizer (which usually gets it right!) chokes. First, a news site will often put a bunch of ad copy propaganda into the title tag of its news article ("The best news site evarrr!!!"). I have various half-formed ideas on how to defeat this and filter it out. Second, the news site may try to torment search engines by using those clues I mentioned earlier to lead to similar exclamation-pointed agitprop, instead of a description of what the story is actually about. Finally, if the summarizer doesn't find any clues, it is forced to compensate by guessing from the content of the article itself. In such cases, again, it gets it right more often than not, but the error rate is still too high.
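
A minimal sketch of that fallback order, in Python, might look like the following. Everything specific here is an assumption for illustration: it treats the "clue" as a `<meta name="description">` tag, guesses from the first `<p>` when the clue is missing or rejected, and uses a crude stand-in (`looks_like_ad_copy`) in place of a real ad-copy check. It does not try to filter ad copy out of the title.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects the title, the meta description, and the first paragraph."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.first_paragraph = ""
        self._inside = None          # "title", "p", or None
        self._paragraph_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""
        elif tag == "title":
            self._inside = "title"
        elif tag == "p" and not self._paragraph_done:
            self._inside = "p"

    def handle_endtag(self, tag):
        if tag == self._inside:
            # Stop collecting paragraphs once one with real text has closed.
            if tag == "p" and self.first_paragraph.strip():
                self._paragraph_done = True
            self._inside = None

    def handle_data(self, data):
        if self._inside == "title":
            self.title += data
        elif self._inside == "p":
            self.first_paragraph += data


def looks_like_ad_copy(text):
    # Crude placeholder heuristic, invented for this example only.
    lowered = text.lower()
    return "!!" in text or "best news" in lowered


def summarize(html):
    """Return (title, description), preferring the meta clue over page text."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    title = " ".join(scanner.title.split())
    for candidate in (scanner.meta_description, scanner.first_paragraph):
        candidate = " ".join(candidate.split())  # collapse stray whitespace
        if candidate and not looks_like_ad_copy(candidate):
            return title, candidate
    return title, ""  # nothing usable was found
```

The weak spot is the last step: guessing from the first paragraph only works when the page's HTML is reasonably clean.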

I have an agitprop detector that can virtually always tell whether a few lines of text are booster ad copy or factual news. The problem is that if it rejects this text, the only option left is to guess from the page content itself, and I've found that sites that try to fool you with ad copy usually have such awfully written HTML that the page scanner has a much higher error rate than usual.

When BlogRevolution messes up a summary, it's probably more jarring to people who are non-me, because I can understand what's going wrong on the backend, but the average reader cannot. They may even think this site is done by amateurs, which, in fact, is a pretty good guess. Most people may not even figure out that the site's content is put together by a machine, and expect perfection, or worse, professionalism.

This flaw is one reason that I have not actively promoted the site at all -- the page scanner needs work. Its high accuracy rate is not high enough, and I think that the occasional error is too jarring for the casual user.

Monday, October 23, 2006

New features!

Everyone loves new features. We've rolled out a number of them this weekend, with more on the way soon.

Recently-added features include:
∞ Links to "archive" pages for the current and previous day
∞ Each update during the day has a named update "edition"
∞ A number of changes on the backend that you can't tell are there

The point of naming the editions is to make the schedule for updating the front page more obvious. I hope it will encourage people to visit the site more often, since it will be clearer when the site will contain new updates during the day.

Friday, October 20, 2006

Yummy RAM chips

The little old dinky server I'm using as a spider is going down for a badly-needed RAM infusion. The main link tables in the database have grown extremely large, and crawls take 3 or 4 times as long as they used to, probably because of this.

... God bless RAM. More generally, God bless Moore's Law. A good compliment for a visiting foreigner should be, "May the RAM in your country be cheap and plentiful." We're living that dream today.

Tuesday, October 17, 2006

Update schedule

How often are updates posted to BlogRevolution?

Currently crawl cycles are taken once every 4 hours, and new updates are computed once every 3 hours.

The exact formulas and thresholds used for updates are something that I'm constantly tinkering with. This schedule is almost certainly going to change; likely I'll make it so that updates are only sent after a crawl cycle is complete. Running extra updates in the middle of a cycle usually doesn't add much, and sometimes it completely messes up, as was the case a few minutes ago when only one item was on the front page.
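
To see why the two timers clash, here is a small Python sketch of the arithmetic. It assumes, purely for illustration, that both timers start together at hour 0 and that a crawl cycle runs for its whole 4-hour slot; the real timings are not spelled out above.

```python
from math import gcd

CRAWL_EVERY_H = 4   # a crawl cycle starts every 4 hours
UPDATE_EVERY_H = 3  # an update is computed every 3 hours


def mid_cycle_updates(hours=24):
    """Hours at which an update runs while a crawl cycle is still under way."""
    return [h for h in range(0, hours, UPDATE_EVERY_H) if h % CRAWL_EVERY_H != 0]


def hours_between_aligned_updates():
    """How often an update lands exactly on a crawl boundary (the LCM)."""
    return CRAWL_EVERY_H * UPDATE_EVERY_H // gcd(CRAWL_EVERY_H, UPDATE_EVERY_H)


if __name__ == "__main__":
    print(mid_cycle_updates())
    print(hours_between_aligned_updates())
```

Under those assumptions, most updates in a day run in the middle of a crawl, which is the case for tying updates to the end of a crawl cycle instead.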

Monday, October 16, 2006

First post

Welcome to the official BlogRevolution blog!

BlogRevolution is something I've been working on quietly for 4 (5?) months now. Since then it has grown enormously in complexity and size, and appears to have a small number of regular users.

I'm really excited about BlogRevolution because it works so well. I check it myself every day.

There are a lot of new features coming down the pike and, though it pains me to admit it, several bugs that need fixing.

So that's my little introduction. I hope you enjoy BlogRevolution as much as I do.