Saturday, December 30, 2006

Goals

John Edwards' YouTube announcement of his candidacy is the #1 link on the front page at the moment. This is actually a mistake -- all of today's "new" links are old and should have registered yesterday or the day before, but they are showing up now because a bug was corrected. Today was what you would call a major news day -- the execution of Saddam Hussein -- so this may distort the results. This kind of thing is likely to occur from time to time as long as this site is in use at the same time as it is in development, and it should serve to remind people that BlogRevolution's results should not be mistaken for scientific ones.

On an unrelated note, I feel like reflecting on my goals for the site.

BlogRevolution will never be earth-shattering or paradigm-shifting like YouTube or even digg, but what I would like the site to be is cool and useful and interesting to a lot of people. I hope that it has largely achieved these goals already. It is already cool and useful and interesting to me, for example.

No active promotion has ever been done for BlogRevolution; we only opened the site up to search engine crawlers a couple of months ago. There are a few more features that I'd like to see added and a few more bugs lurking out there -- the string functions need some work, for example -- and after that the site may be ready for a final "release".

Thursday, December 28, 2006

The Corner's RSS feed is broken

Here's an interesting fact: the RSS feed of classic right wing blog The Corner mangles HTML entities.

Here's the content of a recent post:

I didn#39;t realize that the Scott Johnson from the 2002 World Net Daily article who wrote to the State Department about Arafat#39;s responsibility fo... . . .

Normally, if for some reason you are worried that a single quotation mark (or 'apostrophe', as you may know it) isn't allowed at a particular point in HTML or XML markup, you may encode it using the following string of characters:

&#39;

The Corner's blog appears to encode it as follows:

#39;

Which will not actually work. It will just look like a literal '#39;'.
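The difference is easy to see with a short sketch in Python (my choice for illustration; nothing here is specific to The Corner's software). A numeric character reference needs the leading ampersand, so a parser decodes the well-formed version and leaves the broken one alone:

```python
import html

# Well-formed numeric character reference: ampersand, '#', code point, ';'
good = "didn&#39;t"
# What the feed appears to emit: the leading ampersand is missing
bad = "didn#39;t"

print(html.unescape(good))  # the reference is decoded to an apostrophe
print(html.unescape(bad))   # nothing to decode, so the text comes back unchanged

# Going the other way: escape an apostrophe for use inside markup
escaped = html.escape("didn't", quote=True)
print(escaped)
```

Note that html.escape writes the apostrophe in hexadecimal form rather than decimal; both forms decode to the same character.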

Whatever content management system the National Review uses, it looks totally custom. Some have noticed the strange format of their permalinks, which in fact appears to be a base64-encoded md5 or sha1 hash. Using base64 to encode such a hash will add no new information and only make the string longer. Also this completely throws the math in the linked article off. Maybe they meant to encode a binary hash but accidentally left hexadecimal mode flipped on.

The query fragment in this URL, on the other hand: http://nrd.nationalreview.com/?q=MjAwNjEyMzE= decodes to '20061231', an obvious representation of a date. Maybe it base64 encodes all primary keys? And it uses (inefficient) fixed-length char strings as primary keys everywhere?
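Both observations can be checked with a few lines of Python. This is only a sketch: the query value comes from the URL above, but the hashed input ("example-post-id") is made up, and SHA-1 is used simply because it is one of the two hash types guessed at.

```python
import base64
import hashlib

# Query value taken from the URL quoted above
q = "MjAwNjEyMzE="
decoded = base64.b64decode(q).decode("ascii")
print(decoded)

# Hypothetical post identifier, used only to have something to hash
digest = hashlib.sha1(b"example-post-id")

hex_form = digest.hexdigest()                             # hash as hex text
b64_of_hex = base64.b64encode(hex_form.encode("ascii"))   # base64 of the hex text
b64_of_raw = base64.b64encode(digest.digest())            # base64 of the raw bytes

# Lengths of the three representations of the same hash
print(len(hex_form), len(b64_of_hex), len(b64_of_raw))
```

Base64 of the hex text is the longest of the three, while base64 of the raw bytes is the shortest, which is why encoding the hex form looks like an accident.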

Monday, December 25, 2006

Site down

Our ISP had a serious failure that took out our server and its disk data. At least it happened on a very low traffic day. The BlogRevolution site will be relatively easy to bring back online but all the other stuff we had on that server may be gone forever.

UPDATE: Back as of about 1:00 AM this morning.

Sunday, December 24, 2006

YouTube magic

Just proving that you can teach a new dog new tricks, here's an example of a YouTube video that was handled correctly.

Friday, December 22, 2006

Better last ditch content detection

New feature just rolled out: The page summarizer will now mess up to a lesser degree when it finds the meta tags and rss feeds of a page uninformative and tries to generate a summary directly from the contents of a page.

The summarizer also depends less on the informativeness of the <title> tag; IMO that dependency has been BlogRevolution's greatest flaw up until this point.

Tuesday, December 12, 2006

What's cooking in BlogRevolution Labs

  • Better title cleaning. Some article titles still contain promotional cruft like ' | The St. Olaf Times-Dispatch'. We've taken steps to remove most of this but we're working on something more general that will work with more obscure news sites like the St. Olaf Times-Dispatch and others.

  • News photo detection from blog sites in addition to news sites. This is simple in the abstract but requires a rewrite of our RSS detection algorithm, if it can be called an algorithm; it's more like a series of heuristics.

    Web sites or weblogs can announce the location of their RSS feed by placing its URL in a <link> tag in the head section of their HTML pages. Unfortunately, remarkably few bloggers seem to know about this, and when this method fails, as it does surprisingly often, we have to detect links to the feed using a series of searches: look for a link that goes to a URL that ends with either .rss or .xml and which contains a small orange image; if that fails, look for any link to a URL that ends in .rss or .xml and has "feed" in the link text; and so on, with a dozen or so rules, each a little more general (and therefore more error-prone) than the previous. Currently it uses a best-match method where it will only check the most likely result; we're changing it so that it will give each candidate a score and test, say, the top 5 candidates to see whether they really are the site's RSS feed.
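To make the scoring idea concrete, here is a minimal sketch in Python. It is not BlogRevolution's code: the class and function names, the score values, and the handful of rules are all invented for illustration, and "contains an image" stands in for the "small orange image" test.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedCandidateParser(HTMLParser):
    """Collects scored guesses for a page's feed URL. Scores are made up."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.scores = {}        # absolute URL -> best score seen
        self._href = None       # href of the <a> currently open, if any
        self._text = []         # link text collected inside that <a>
        self._has_img = False   # whether that <a> contains an <img>

    def _add(self, href, score):
        url = urljoin(self.base_url, href)
        self.scores[url] = max(score, self.scores.get(url, 0))

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("href"):
            rel = (a.get("rel") or "").lower().split()
            if "alternate" in rel and (a.get("type") or "").lower() in FEED_TYPES:
                self._add(a["href"], 100)   # explicit <link> announcement
        elif tag == "a" and a.get("href"):
            self._href, self._text, self._has_img = a["href"], [], False
        elif tag == "img" and self._href is not None:
            self._has_img = True

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        href, text = self._href, "".join(self._text).lower()
        self._href = None
        score = 0
        if href.lower().endswith((".rss", ".xml")):
            score += 40
            if self._has_img:
                score += 20     # stand-in for the "small orange image" rule
        if "feed" in text or "rss" in text:
            score += 15
        if score:
            self._add(href, score)


def feed_candidates(html_text, base_url, limit=5):
    """Return up to `limit` candidate feed URLs, most likely first."""
    parser = FeedCandidateParser(base_url)
    parser.feed(html_text)
    ranked = sorted(parser.scores.items(), key=lambda kv: -kv[1])
    return [url for url, _ in ranked[:limit]]
```

The sketch stops at ranking. Each returned URL would still have to be fetched and checked to see whether it really is a feed, which is the step described above.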

Oh, and of course in about 1 in 25 articles the summarizer completely messes up. This is a long-standing issue that we have several possible ways to improve, but we have thus far put them off in favor of grasping at lower-hanging fruit.

BCC: Tantalus

Saturday, December 09, 2006

Favicon updates

The little icons next to the story URL in BlogRevolution are called "favicons." Users of many web browsers — but not Internet Explorer 6 — see these next to every URL in the Location Bar or its equivalent.

Favicons are usually located at the root of the web server at /favicon.ico but may also be specified by a <link> tag in HTML.
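Those two lookup rules can be sketched in a few lines of Python. The names IconLinkParser and favicon_url are hypothetical, and this is only an illustration of the rules, not the code BlogRevolution runs:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Remembers the href of the first <link> whose rel includes "icon"."""

    def __init__(self):
        super().__init__()
        self.icon_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.icon_href is not None:
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if "icon" in rel and a.get("href"):
            self.icon_href = a["href"]


def favicon_url(page_url, html_text):
    """Return the favicon URL for a page, preferring an explicit <link> tag."""
    parser = IconLinkParser()
    parser.feed(html_text)
    # Fall back to the conventional location at the server root
    return urljoin(page_url, parser.icon_href or "/favicon.ico")
```

Splitting the rel attribute on whitespace means both rel="icon" and the older rel="shortcut icon" spelling are recognized.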

For example, Blogspot's favicon looks like this:

Until now, BlogRevolution would detect the location of a story's favicon and test the image file to make sure it was valid. It would then discard its copy and when an update to the site was generated it would show the favicon by creating an <img> pointing to the favicon URL on the original site.

By including a copy of the favicon on BlogRevolution's server, we can improve user experience in several ways:

  • Sometimes Safari would refuse to show .ico images correctly, and users on this browser would see a broken image icon.

  • Internet Explorer would show the lowest-resolution version of the icon available rather than the highest if multiple resolutions of the icon were contained in the .ico file.

  • Far away servers would occasionally refuse to serve up the file, resulting in a broken image icon on any browser.

  • The .ico file format was never particularly efficient and BlogRevolution will now convert favicons to png, typically resulting in a file size 10 or 20 times smaller. The Blogspot favicon is 3638 bytes, but in our format your browser would have to download just 238 bytes. This will lower page loading time.

  • Page loading time will be reduced in most circumstances because your operating system will not have to resolve different domain names for each icon. On the other hand, load on our server will be increased.

Internally, BlogRevolution stores favicons under the hash of the .ico file data, because many blog or content management systems provide a default favicon that site owners never change, and as a result many sites end up with the same favicon.
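A minimal sketch of that storage scheme, in Python. The FaviconStore class, its in-memory dictionaries, and the choice of SHA-1 are all assumptions made for the example; the post does not say which hash or storage backend is used.

```python
import hashlib


class FaviconStore:
    """In-memory stand-in for favicon storage keyed by content hash."""

    def __init__(self):
        self.files = {}     # hash of the .ico bytes -> stored image bytes
        self.by_site = {}   # site hostname -> hash of its favicon

    def add(self, site, ico_bytes):
        key = hashlib.sha1(ico_bytes).hexdigest()
        # Identical icon data maps to the same key, so it is stored only once
        self.files.setdefault(key, ico_bytes)
        self.by_site[site] = key
        return key
```

Two sites that both ship the same default icon end up pointing at a single stored file, so the shared icon is kept and served once.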

It's the little things that count I suppose, and this update was long in coming.

Tuesday, December 05, 2006

New version!

We are pleased to announce a dramatic new version of BlogRevolution!

Some of the new features:

  • News photos detected for some articles
  • Permalinks for all articles
  • Permalink pages will show the story's rank if in the top 100 and a list of the daily pages on which the story appears
  • RSS feed is now valid RSS 2.0
  • If an article is a PDF file, a thumbnail screenshot of its first page will be shown
  • New look and new sidebar, including "Blast from the past" and an archive calendar
  • New site favicons

and a few more.

There are still more features in the works that were not included in order to get this release out the door.

Sunday, December 03, 2006

Feature release coming soon

By far the largest feature release ever is coming soon. Lots of new and exciting stuff.

We hope to start using it live in the next couple of days.

Monday, November 20, 2006

Abandoning XSLT

Warning: Geeky post ahead

Our content generator currently uses XSLT to send updates to the web server. One XML file is prepared for an entire day's worth of data and pushed to the web server, which then processes it into front pages, archive pages, and the RSS feed via different XSLT stylesheets.

It's remarkably elegant. Each layer is left in its own space, and the entire process is highly efficient and painless.

However, for two reasons we've decided to abandon this approach.

ONE: XSLT is oversimple. It direly needs string manipulation functions. What happens if you encounter a date in an XML file and you want to convert it to another format? It's simple: there is absolutely no way to do that. Tough luck. Want to generate a calendar? Too bad. There's no way to do that.
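For contrast, here is roughly what those two tasks look like in a general-purpose language. This is a sketch in Python rather than the PHP we are moving to, and the input date format (YYYY-MM-DD) is an assumption made for the example:

```python
from calendar import Calendar
from datetime import datetime


def reformat_date(text):
    """Convert an assumed 'YYYY-MM-DD' date into a long, readable form."""
    return datetime.strptime(text, "%Y-%m-%d").strftime("%A, %B %d, %Y")


def month_grid(year, month):
    """Return the month as a list of weeks, Monday first; 0 marks a blank cell."""
    return Calendar(firstweekday=0).monthdayscalendar(year, month)


print(reformat_date("2006-11-20"))
for week in month_grid(2006, 11):
    print(week)
```

The day and month names come from the process locale, which is English unless the program changes it.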

TWO: Too many layers slow down the development cycle. Most changes require changes to the database, then to the XML file format, and then to the XSLT stylesheets. I've noticed that it takes a lot more time to push new stuff through the pipe, with an end result of a lot less stuff.

SO: Now we're going to generate pages using Object-oriented PHP and push them using a more traditional method. (Why not Python? That's another blog post...)

Friday, November 03, 2006

On the hour

Another update that's coming is to make updates occur exactly on the hour: x:00 on the clock. I think this will make update times easier to learn and remember.

Right now it's publishing updates at x:46 on the clock. This is for esoteric technical reasons related to the schedule of the crawler, but I'll find some way to do it.
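The clock arithmetic itself is simple. As a small Python sketch (the function name is made up, and this ignores the crawler-schedule problem entirely), the next x:00 after any given moment is:

```python
from datetime import datetime, timedelta


def next_top_of_hour(now):
    """Return the first x:00 strictly after `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


now = datetime(2006, 11, 3, 14, 46)
wait = next_top_of_hour(now) - now
print(next_top_of_hour(now), wait)
```

A publisher that is ready at x:46 would simply hold the finished update for the length of `wait` before pushing it.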

Thursday, October 26, 2006

Oh, boy

That meme that was/is going around?

Slashdot's covering it as deliberate Google-bombing.

You're not just spamming Google, you're spamming me. I will have to find a way to detect this activity and eliminate its effect on the news results.

Editions are A-OK

Recently I added "editions" of the site -- every update during the day is given a name.

I really like it. It makes it much more obvious when the site has been updated, and encourages people to check back. It meets my personal seal of approval, which is the best that any BlogRevolution feature could ever hope for.

Wednesday, October 25, 2006

Attack of the killer meme

Today's posts (archive link) are all the result of many bloggers copying-and-pasting the same list of links and encouraging others to do the same. Since BlogRevolution finds news by looking for new links appearing in blogs, this virus-like "meme" has totally overwhelmed all other news that might have been detected today.
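One conceivable way to spot a copy-and-pasted list like this is to look for the same set of links turning up in several posts. The Python sketch below is purely hypothetical: the function name, the thresholds, and the shape of the input are invented, and it is not something BlogRevolution does today.

```python
from collections import Counter


def chain_links(posts, min_links=5, min_posts=3):
    """Return links belonging to a link list repeated verbatim across posts.

    posts: mapping of post URL -> set of outbound link URLs.
    """
    # Count how many posts carry each exact set of links (small sets are ignored)
    seen = Counter(
        frozenset(links) for links in posts.values() if len(links) >= min_links
    )
    flagged = set()
    for link_set, count in seen.items():
        if count >= min_posts:
            flagged |= link_set
    return flagged
```

Exact matching is the weak point of this sketch. A blogger who adds or drops a single link would slip through, so a real detector would need some measure of overlap between link sets instead.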

Tuesday, October 24, 2006

Why does BlogRevolution mess up the summaries sometimes?

... because it's complicated.

That's the short answer.

BlogRevolution finds newly-popular links on web logs, assembles a list of them according to some arbitrary threshold, then goes out and tries to generate titles and descriptions of these links. Virtually all of the most popular new links for a day on political blogs are news articles or other blog posts.
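The first step can be sketched in a few lines of Python. The function name, the input shape, and the threshold of 3 are all assumptions for illustration; the real thresholds are not given here.

```python
from collections import defaultdict


def newly_popular(links_today, known_urls, threshold=3):
    """Rank URLs not seen before by how many distinct blogs link to them.

    links_today: iterable of (blog, url) pairs found in the latest crawl.
    known_urls:  set of URLs already seen in earlier crawls.
    """
    blogs_per_url = defaultdict(set)
    for blog, url in links_today:
        if url not in known_urls:
            blogs_per_url[url].add(blog)
    # Most-linked first; ties are broken by URL to keep the order stable
    ranked = sorted(blogs_per_url.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(url, len(blogs)) for url, blogs in ranked if len(blogs) >= threshold]
```

Counting distinct blogs rather than raw links means one blog linking to the same story ten times still counts once.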

It's completely automated. A non-intelligent computer program that doesn't understand English tirelessly does it all.

Sometimes it messes up the descriptions.

The current state of the front page is an example -- most of the summaries barely work. It's not usually this bad; it usually gets 9 out of 10 correct, but that's the nature of quasi-randomness.

It usually messes up on articles in second-tier newspapers or local TV news sites -- this is because major news sites leave little clues behind for search engines as to what the summaries should be. Apparently the cheaper, less-thorough content management systems that lesser news sites use don't do this, or do it badly.

There are three basic scenarios where the article summarizer (which usually gets it right!) chokes. One: often a news site will put a bunch of ad copy propaganda into the title tag of their news article ("The best news site evarrr!!!"). I have various half-formed ideas on how to defeat this and filter it out. The second source of error is when the news site tries to torment search engines by filling those clues I mentioned earlier with similar exclamation-pointed agitprop, instead of a description of what the story is actually about. Finally, if the summarizer doesn't find any clues, it will be forced to compensate by guessing from the content of the article itself. In such cases, again, it gets it right more often than not, but the error rate is still too high.

I have an agitprop detector that can virtually always tell whether a few lines of text are booster ad copy or factual news. The problem is that if it rejects this text, the only option left is to guess from the page content itself, and I've found that sites that try to fool you with ad copy usually have such awfully written HTML that the page scanner has a much higher error rate than usual.
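The order of fallbacks can be sketched as follows, in Python. Everything here is a toy: looks_like_ad_copy is a crude stand-in for the agitprop detector (whose workings aren't described here), the word list is invented, and taking the first sentence is just one possible way of guessing from the page content.

```python
import re

# Toy stand-in for the agitprop detector; the word list is invented
HYPE_WORDS = {"best", "leading", "award-winning", "trusted", "#1"}


def looks_like_ad_copy(text):
    """Crude check: exclamation marks or hype words suggest ad copy."""
    words = set(re.findall(r"[#\w-]+", text.lower()))
    return "!" in text or bool(words & HYPE_WORDS)


def first_sentence(body_text):
    """Last-ditch guess: the first sentence of the article text."""
    match = re.search(r"(.+?[.!?])(\s|$)", body_text.strip(), re.S)
    return match.group(1) if match else body_text.strip()


def choose_summary(meta_description, body_text):
    """Return (summary, source), preferring the page's own description."""
    if meta_description and not looks_like_ad_copy(meta_description):
        return meta_description.strip(), "meta"
    return first_sentence(body_text), "content"
```

Returning the source alongside the summary makes it easy to measure how often the error-prone content fallback is actually being used.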

When BlogRevolution messes up a summary, it's probably more jarring to people who are non-me, because I can understand what's going wrong on the backend, but the average reader cannot. They may even think this site is done by amateurs, which, in fact, is a pretty good guess. Most people may not even figure out that the site's content is put together by a machine, and expect perfection, or worse, professionalism.

This flaw is one reason that I have not actively promoted the site at all -- the page scanner needs work. Its high accuracy rate is not high enough, and I think that the occasional error is too jarring for the casual user.

Monday, October 23, 2006

New features!

Everyone loves new features. We've rolled out a number of them this weekend, with more on the way soon.

Recently-added features include:
  • Links to "archive" pages for the current and previous day
  • Each update during the day has a named update "edition"
  • A number of changes on the backend that you can't tell are there

The point of naming the editions is to make the front page's update schedule more obvious, and I hope that it will encourage people to visit the site more often, since it will be clearer when the site will contain new updates during the day.

Friday, October 20, 2006

Yummy RAM chips

The little old dinky server I'm using as a spider is going down for a badly-needed RAM infusion. The main link tables in the database have grown extremely large and crawls take 3 or 4 times as long as they used to probably because of this.

... God bless RAM. More generally, God bless Moore's Law. A good compliment for a visiting foreigner should be, "May the RAM in your country be cheap and plentiful." We're living that dream today.

Tuesday, October 17, 2006

Update schedule

How often are updates posted to BlogRevolution?

Currently crawl cycles run once every 4 hours, and new updates are computed once every 3 hours.

The exact formulas and thresholds used for updates are something that I'm constantly tinkering with. Almost certainly this schedule is going to change; most likely I'll make it so that updates are only sent after a crawl cycle is complete. Running extra updates in the middle of a cycle usually doesn't add much, and sometimes it completely messes up, as was the case a few minutes ago when only one item was on the front page.
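A little arithmetic shows why the two schedules fit together awkwardly. The Python sketch below assumes, purely for illustration, that both schedules start at hour 0 and that each crawl finishes exactly on its 4-hour mark. Neither assumption is true of the real system.

```python
CRAWL_EVERY = 4   # hours between crawl cycles, from the schedule above
UPDATE_EVERY = 3  # hours between updates, from the schedule above


def stale_updates(hours):
    """Update times (in hours) with no crawl finishing since the previous update."""
    stale = []
    for t in range(UPDATE_EVERY, hours + 1, UPDATE_EVERY):
        previous = t - UPDATE_EVERY
        # A crawl finished in (previous, t] if a multiple of CRAWL_EVERY lies there
        fresh = (t // CRAWL_EVERY) > (previous // CRAWL_EVERY)
        if not fresh:
            stale.append(t)
    return stale


print(stale_updates(24))
```

Under those assumptions, some updates in every 12-hour stretch are computed without any new crawl data, which fits the observation that mid-cycle updates usually don't add much.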

Monday, October 16, 2006

First post

Welcome to the official BlogRevolution blog!

BlogRevolution is something I've been working on quietly for 4 (5?) months now. Since then it has grown enormously in complexity and size, and appears to have a small number of regular users.

I'm really excited about BlogRevolution because it works so well. I check it myself every day.

There are a lot of new features coming down the pike, and though it pains me to admit it, several bugs that need fixing.

So that's my little introduction. I hope you enjoy BlogRevolution as much as I do.