Wednesday, October 22, 2008

New feature: quote quality

I've added a new feature to Blogrevolution: quote quality thresholds.

Under each story we show a list of the blogs in Blogrevolution's database that have linked to that particular story, each with a quote from the linking post. The way Blogrevolution's parser grabbed these quotes wasn't always ideal, and they often included cruft unrelated to the story in question.

To address this I created a statistical model of high- versus low-quality quotes. On the site, the model now limits the quotes shown to a smaller number of higher-quality ones. The remaining blogs are still listed, without context, under the residual final list item, as seen below:

The statistical model is a Bayesian model, like all of the other decision-making models that Blogrevolution uses. It's still something of a work in progress, judging quote quality correctly about 6 times out of 7 (roughly 86 percent accuracy).

One of the interesting things about a model like this is that it discovers all sorts of unexpected relationships that you can nonetheless be sure of, through the magic of empiricism. For instance, the presence of the word 'this' was a predictor of a high-quality quote, while the appearance of 'that' was a predictor of a low-quality quote. Some of the others you can guess, like the percentage of all-caps words or the use of passive-voice constructions.
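To make the idea concrete, here is a minimal sketch of how a Bayesian quote classifier along these lines could work. It is not Blogrevolution's actual code: it assumes a plain naive Bayes model over a few binary features, and the feature names, the 20 percent all-caps cutoff, and the function names are all made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical feature names, loosely based on the predictors mentioned above.
ALL_FEATURES = ("has_this", "has_that", "many_all_caps")

def extract_features(quote):
    """Return the set of binary features that are present in a quote."""
    words = quote.split()
    lowered = [w.strip(".,;:!?\"'()").lower() for w in words]
    features = set()
    if "this" in lowered:
        features.add("has_this")
    if "that" in lowered:
        features.add("has_that")
    # Assumed cutoff: more than 20% of the words are written in all caps.
    caps = sum(1 for w in words if len(w) > 1 and w.isupper())
    if words and caps / len(words) > 0.2:
        features.add("many_all_caps")
    return features

def train(labeled_quotes):
    """Count feature occurrences per class from (quote, is_high_quality) pairs."""
    class_counts = {True: 0, False: 0}
    feature_counts = {True: defaultdict(int), False: defaultdict(int)}
    for quote, label in labeled_quotes:
        class_counts[label] += 1
        for feature in extract_features(quote):
            feature_counts[label][feature] += 1
    return class_counts, feature_counts

def prob_high_quality(quote, model):
    """Posterior probability that a quote is high quality, via naive Bayes."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    present = extract_features(quote)
    log_scores = {}
    for label in (True, False):
        # Add-one (Laplace) smoothing keeps unseen features from zeroing out a class.
        log_p = math.log((class_counts[label] + 1) / (total + 2))
        for feature in ALL_FEATURES:
            p = (feature_counts[label][feature] + 1) / (class_counts[label] + 2)
            log_p += math.log(p if feature in present else 1 - p)
        log_scores[label] = log_p
    # Normalize the two log scores into a probability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])
```

Given a handful of hand-labeled quotes, `train` builds the counts and `prob_high_quality` returns a score between 0 and 1 that can be compared against a quality threshold. A real model would need far more features and training data than this.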

There is also a threshold on the uniqueness of a quote: if its context isn't unique enough compared to the other quotes, it gets shunted to the residual "also" category of shame at the bottom.

Finally, the quotes from the different sites are now listed in the order that the model scores them, rather than in order of recency as before.
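The thresholding and ordering described above can be sketched roughly as follows. This is only an illustration under stated assumptions: uniqueness is measured here with Jaccard similarity over word sets, which may not be what Blogrevolution actually uses, and the function name `arrange_quotes` and both threshold values are hypothetical.

```python
def word_set(text):
    """Lowercased words of a quote, with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}

def jaccard(a, b):
    """Share of words two sets have in common (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def arrange_quotes(scored_quotes, min_score=0.5, max_similarity=0.6):
    """Split (blog, quote, score) tuples into shown quotes and 'also' blogs.

    Quotes are visited from highest to lowest model score. A quote is shown
    only if it clears the quality threshold and is not too similar to a
    quote that is already being shown.
    """
    shown, also = [], []
    ranked = sorted(scored_quotes, key=lambda q: q[2], reverse=True)
    for blog, quote, score in ranked:
        words = word_set(quote)
        too_similar = any(
            jaccard(words, word_set(other)) > max_similarity
            for _, other, _ in shown
        )
        if score >= min_score and not too_similar:
            shown.append((blog, quote, score))
        else:
            # Listed without context under the final "also" item.
            also.append(blog)
    return shown, also
```

Because the quotes are visited in score order, the `shown` list comes out already ranked by the model, and when two quotes are near-duplicates the higher-scoring one is the one that keeps its context.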
