Monday, October 8, 2007, 8:55pm
The CMU research team is involved in digitising old books and manuscripts supplied by a non-profit organisation called the Internet Archive, and uses Optical Character Recognition (OCR) software to examine scanned images of texts and turn them into digital text files which can be stored and searched by computers.
But the OCR software is unable to read about one in 10 words, due to the poor quality of the original documents.
The only reliable way to decode them is for a human to examine them individually - a mammoth task since CMU processes thousands of pages of text every month.
To solve this problem the team takes images of the words which the OCR software can't read, and uses them as CAPTCHAs.
These CAPTCHAs, known as reCAPTCHAs, are then distributed to websites around the world to be used in place of conventional CAPTCHAs.
When visitors decipher the reCAPTCHAs to gain access to the website, the answers - the results of humans examining the images - are sent back to CMU.
Every time an Internet user deciphers a reCAPTCHA, another word from an old book or manuscript is digitised.
To ensure that the reCAPTCHAs are deciphered correctly, website visitors are actually presented with images of two words to examine, one of which is already known.
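The two-word check described above can be sketched roughly as follows. This is only an illustration, not CMU's actual code: the class name, the method names, and the idea of waiting for a minimum number of matching answers before trusting a transcription are all assumptions. The article says only that one of the two words is already known.

```python
from collections import Counter, defaultdict


def normalize(word: str) -> str:
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return word.strip().lower()


class TwoWordChallenge:
    """Hypothetical model of a challenge pairing a known word with an unread one."""

    def __init__(self, min_votes: int = 3):
        # votes[image_id] counts the answers typed for each unread word image.
        self.votes = defaultdict(Counter)
        # Assumed agreement threshold; the article does not give one.
        self.min_votes = min_votes

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown) -> bool:
        """Return True if the visitor passes; record their guess for the unread word."""
        if normalize(typed_known) != normalize(known_answer):
            # The control word was wrong, so the other answer is not trusted.
            return False
        self.votes[unknown_id][normalize(typed_unknown)] += 1
        return True

    def transcription(self, unknown_id):
        """Return the most common answer once enough visitors agree, else None."""
        counts = self.votes[unknown_id]
        if not counts:
            return None
        word, n = counts.most_common(1)[0]
        return word if n >= self.min_votes else None
```

The point of the design is that the visitor cannot tell which word is the control, so passing the known word is taken as evidence that the answer for the unread word was typed in good faith as well.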
Oddly, I knew nothing about this until reading about it at the BBC. Apparently, there's even a reCAPTCHA Drupal module that I can use. Nifty.