Saturday, December 30, 2006

Goals

John Edwards' YouTube announcement of his candidacy is the #1 link on the front page at the moment. This is actually a mistake -- all of today's "new" links are old and should have registered yesterday or the day before; they are showing up now because a bug was fixed. Today was what you would call a major news day -- the execution of Saddam Hussein -- so this may distort the results. This kind of thing is likely to occur from time to time as long as this site is in use at the same time as it is in development, and it should serve to remind people that BlogRevolution's results should not be mistaken for scientific ones.

On an unrelated note, I feel like reflecting on my goals for the site.

BlogRevolution will never be earth-shattering or paradigm-shifting like YouTube or even digg, but what I would like the site to be is cool and useful and interesting to a lot of people. I hope that it has largely achieved these goals already. It is already cool and useful and interesting to me, for example.

No active promotion has ever been done for BlogRevolution; we only opened the site up to search engine crawlers a couple of months ago. There are a few more features that I'd like to see added and a few more bugs lurking out there -- the string functions need some work, for example -- and after that the site may be ready for a final "release".

Thursday, December 28, 2006

The Corner's RSS feed is broken

Here's an interesting fact: classic right wing blog The Corner's RSS feed messes up HTML entities.

Here's the content of a recent post:

I didn#39;t realize that the Scott Johnson from the 2002 World Net Daily article who wrote to the State Department about Arafat#39;s responsibility fo... . . .

Normally, if for some reason you are worried that a single quotation mark (or 'apostrophe', as you may know it) isn't allowed at a particular point in HTML or XML markup, you may encode it using the following string of characters:

&#39;

The Corner's blog appears to encode it as follows:

#39;

Which will not actually work. It will just look like a literal '#39;'.
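To make the difference concrete, here is a minimal sketch using Python's standard html module. The two strings are taken from the excerpt above; the variable names are only for illustration. A decoder recognizes the reference only when the leading ampersand is present:

```python
import html

correct = "didn&#39;t"  # well-formed numeric character reference
broken = "didn#39;t"    # what the feed appears to emit (no ampersand)

decoded_correct = html.unescape(correct)
decoded_broken = html.unescape(broken)

print(decoded_correct)  # the apostrophe is restored
print(decoded_broken)   # left untouched, so readers see the literal text
```

Without the ampersand there is nothing for a parser to recognize, so the text passes through unchanged.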

Whatever content management system the National Review uses, it looks totally custom. Some have noticed the strange format of their permalinks, which in fact appears to be a base64-encoded md5 or sha1 hash. Using base64 to encode such a hash will add no new information and only make the string longer. Also this completely throws the math in the linked article off. Maybe they meant to encode a binary hash but accidentally left hexadecimal mode flipped on.
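As a rough illustration of the length argument, the sketch below compares base64 over the hexadecimal text of an MD5 hash with base64 over the raw binary digest. The input string is made up, and whether the site actually uses MD5 is only a guess:

```python
import base64
import hashlib

data = b"example permalink key"  # hypothetical input, not the site's real key

hex_digest = hashlib.md5(data).hexdigest()  # 32 hexadecimal characters
raw_digest = hashlib.md5(data).digest()     # the same hash as 16 raw bytes

# base64 over the hex text: longer than the hex string it started from
b64_of_hex = base64.b64encode(hex_digest.encode("ascii")).decode("ascii")
# base64 over the raw bytes: shorter than the hex string
b64_of_raw = base64.b64encode(raw_digest).decode("ascii")

print(len(hex_digest), len(b64_of_hex), len(b64_of_raw))
```

Encoding the hex text costs 44 characters where the plain hex needed 32, while encoding the raw digest would have needed only 24.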

The query parameter in this URL, on the other hand: http://nrd.nationalreview.com/?q=MjAwNjEyMzE= decodes to '20061231', an obvious representation of a date. Maybe it base64-encodes all primary keys? And maybe it uses (inefficient) fixed-length char strings as primary keys everywhere?
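The decoding is easy to check with the standard library. This sketch pulls the q parameter out of the URL above, base64-decodes it, and parses the result as a YYYYMMDD date; reading it as a date is my interpretation, not something the site documents:

```python
import base64
from datetime import datetime
from urllib.parse import parse_qs, urlparse

url = "http://nrd.nationalreview.com/?q=MjAwNjEyMzE="

# Extract the value of the "q" query parameter
q_value = parse_qs(urlparse(url).query)["q"][0]

# Base64-decode it to plain ASCII text
decoded = base64.b64decode(q_value).decode("ascii")

# Interpret the text as a date (assumed format: YYYYMMDD)
as_date = datetime.strptime(decoded, "%Y%m%d").date()

print(decoded, as_date.isoformat())
```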

Monday, December 25, 2006

Site down

Our ISP had a serious failure affecting our server and the data on its disk. At least it happened on a very low traffic day. The BlogRevolution site will be relatively easy to bring back online but all the other stuff we had on that server may be gone forever.

UPDATE: Back as of about 1:00 AM this morning.

Sunday, December 24, 2006

YouTube magic

Just proving that you can teach a new dog new tricks, here's an example of a YouTube video that was handled correctly.

Friday, December 22, 2006

Better last ditch content detection

New feature just rolled out: The page summarizer will now mess up to a lesser degree when it finds the meta tags and RSS feeds of a page uninformative and tries to generate a summary directly from the contents of the page.

The summarizer's dependence on the <title> tag being informative, which IMO has been BlogRevolution's greatest flaw up until this point, has also been reduced.

Tuesday, December 12, 2006

What's cooking in BlogRevolution Labs

  • Better title cleaning. Some article titles still contain promotional cruft like ' | The St. Olaf Times-Dispatch'. We've taken steps to remove most of this but we're working on something more general that will work with more obscure news sites like the St. Olaf Times-Dispatch and others.

  • News photo detection from blog sites in addition to news sites. This is simple in the abstract but requires a rewrite of our RSS detection algorithm, if it can be called an algorithm; it's really more like a series of heuristics.

    Web sites or weblogs can announce the location of their RSS feed by placing its URL in a <link> tag in the head section of their HTML pages. Unfortunately, remarkably few bloggers seem to know about this, and when this method fails, as it does surprisingly often, we have to detect links to the feed using a series of searches: look for a link that goes to a URL that ends with either .rss or .xml and which contains a small orange image; if that fails, look for any link to a URL that ends in .rss or .xml that has "feed" in the link text; and so on, with a dozen or so rules, each a little more general (and therefore more error-prone) than the previous. Currently it uses a best-match method where it will only check the most likely result; we're changing it so that it will give each candidate a score and test, say, the top 5 candidates to see whether they really are the site's RSS feed.
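The scoring idea can be sketched roughly as follows. This is not BlogRevolution's actual code: the class and function names, the score values, and the handful of rules are all invented for illustration, and the "small orange image" rule is simplified to "the link contains any image". It only ranks candidates; actually fetching the top few to confirm they are feeds is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedCandidateParser(HTMLParser):
    """Collects possible feed URLs from one HTML page, each with a score."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = {}   # absolute URL -> best score seen so far
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # link text collected for the open <a>
        self._has_img = False  # whether the open <a> contains an <img>

    def _add(self, href, score):
        url = urljoin(self.base_url, href)
        self.candidates[url] = max(score, self.candidates.get(url, 0))

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("href"):
            rel = (a.get("rel") or "").lower().split()
            if "alternate" in rel and (a.get("type") or "").lower() in FEED_TYPES:
                self._add(a["href"], 100)  # explicit announcement: strongest signal
        elif tag == "a" and a.get("href"):
            self._href, self._text, self._has_img = a["href"], [], False
        elif tag == "img" and self._href is not None:
            self._has_img = True

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        href, text = self._href, "".join(self._text).lower()
        self._href = None
        feed_like = href.lower().split("?")[0].endswith((".rss", ".xml"))
        score = 0
        if feed_like:
            score += 40  # URL ends in .rss or .xml
        if feed_like and self._has_img:
            score += 30  # link wraps an image (stand-in for the orange icon)
        if "feed" in text:
            score += 20  # link text mentions "feed"
        if score:
            self._add(href, score)


def rank_feed_candidates(page_html, base_url, limit=5):
    """Return up to `limit` candidate feed URLs, most likely first."""
    parser = FeedCandidateParser(base_url)
    parser.feed(page_html)
    parser.close()
    ranked = sorted(parser.candidates.items(), key=lambda kv: -kv[1])
    return [url for url, _ in ranked[:limit]]


sample_page = """<html><head><title>t</title></head><body>
<a href="/about.html">About</a>
<a href="/sitemap.xml">Sitemap</a>
<a href="/index.rss"><img src="/feed-icon.gif"> Subscribe to feed</a>
</body></html>"""

print(rank_feed_candidates(sample_page, "http://example.com/"))
```

Keeping every candidate with a score, instead of stopping at the first rule that matches, is what lets a later step try the second or third guess when the best one turns out not to be a feed.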

Oh, and of course in about 1 in 25 articles the summarizer completely messes up. This is a long-standing issue that we have several possible ways to improve, but we have thus far put it off in favor of grasping at lower-hanging fruit.

BCC: Tantalus

Saturday, December 09, 2006

Favicon updates

The little icons next to the story URL in BlogRevolution are called "favicons." Users of many web browsers — but not Internet Explorer 6 — see these next to every URL in the Location Bar or its equivalent.

Favicons are usually located at the root of the web server at /favicon.ico but may also be specified by a <link> tag in HTML.
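A minimal sketch of that lookup, assuming we already have the page's HTML in hand: prefer a <link> tag whose rel includes "icon", and otherwise fall back to /favicon.ico at the server root. The function and class names are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Remembers the href of the first <link> whose rel includes 'icon'."""

    def __init__(self):
        super().__init__()
        self.icon_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.icon_href is not None:
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if "icon" in rel and a.get("href"):  # matches "icon" and "shortcut icon"
            self.icon_href = a["href"]


def favicon_url(page_url, page_html):
    """Guess the favicon URL for a page (hypothetical helper)."""
    parser = IconLinkParser()
    parser.feed(page_html)
    parser.close()
    # No <link> tag found: fall back to the conventional root location
    return urljoin(page_url, parser.icon_href or "/favicon.ico")


page = "http://example.com/blog/post.html"
print(favicon_url(page, "<html><head></head></html>"))
print(favicon_url(page, '<head><link rel="shortcut icon" href="img/fav.png"></head>'))
```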

For example, Blogspot's favicon looks like this:

Until now, BlogRevolution would detect the location of a story's favicon and test the image file to make sure it was valid. It would then discard its copy and when an update to the site was generated it would show the favicon by creating an <img> pointing to the favicon URL on the original site.

By serving a copy of the favicon from BlogRevolution's own server, we can improve the user experience in several ways, some of which fix problems with the old approach:

  • Sometimes Safari would refuse to show .ico images correctly, and users on this browser would see a broken image icon.

  • Internet Explorer would show the lowest-resolution version of the icon available rather than the highest if multiple resolutions of the icon were contained in the .ico file.

  • Far away servers would occasionally refuse to serve up the file, resulting in a broken image icon on any browser.

  • The .ico file format was never particularly efficient, and BlogRevolution will now convert favicons to PNG, typically resulting in a file 10 to 20 times smaller. The Blogspot favicon is 3638 bytes, but in our format your browser has to download just 238 bytes (about 15 times smaller). This will lower page loading time.

  • Page loading time will be reduced in most circumstances because your operating system will not have to resolve different domain names for each icon. On the other hand, load on our server will be increased.

Internally, BlogRevolution stores favicons under the hash of the .ico file data. Many blog and content management systems provide a default favicon that site owners never change, so many sites end up with the same favicon, and keying by hash means each distinct icon is stored only once.
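Here is a small sketch of what hash-keyed storage looks like. The post does not say which hash function is used, so SHA-1 is an assumption, and the in-memory FaviconStore class is purely illustrative; a real store would write files to disk.

```python
import hashlib


class FaviconStore:
    """Toy in-memory store that keeps one copy of each distinct icon."""

    def __init__(self):
        self._blobs = {}    # hex digest -> icon bytes
        self._by_site = {}  # site URL -> hex digest of its icon

    def put(self, site, ico_bytes):
        key = hashlib.sha1(ico_bytes).hexdigest()  # assumed hash function
        self._blobs.setdefault(key, ico_bytes)     # store the bytes only once
        self._by_site[site] = key
        return key

    def key_for(self, site):
        return self._by_site[site]

    def unique_icons(self):
        return len(self._blobs)


store = FaviconStore()
default_icon = b"\x00\x00\x01\x00 pretend default icon bytes"
store.put("http://a.example/", default_icon)
store.put("http://b.example/", default_icon)            # same default icon
store.put("http://c.example/", b"a custom icon")        # different bytes

print(store.unique_icons())
```

Three sites are registered but only two icons are kept, since the first two share identical file data.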

It's the little things that count, I suppose, and this update was a long time coming.

Tuesday, December 05, 2006

New version!

We are pleased to announce a dramatic new version of BlogRevolution!

Some of the new features:

  • News photos detected for some articles

  • Permalinks for all articles

  • Permalink pages will show the story's rank if in the top 100 and a list of the daily pages on which the story appears

  • RSS feed is now valid RSS 2.0

  • If an article is a PDF file, a thumbnail screenshot of its first page will be shown

  • New look and new sidebar, including "Blast from the past" and an archive calendar

  • New site favicons

and a few more.

There are still more features in the works that were left out in order to get this release out the door.

Sunday, December 03, 2006

Feature release coming soon

By far the largest feature release ever is coming soon. Lots of new and exciting stuff.

We hope to start using it live in the next couple of days.