Mike Beversluis

Tuesday, March 03, 2009

A Thousand Monkeys Typing...

...really will produce Shakespeare, assuming that you happen to need Hamlet scanned in from an old library book and the OCR software spat it back out: How Spam Saves Libraries

I spend a lot of time peddling my accumulated junk on the virtual garage sale known as Craigslist. Each time I post an item, the website dutifully presents me with its security check, forcing me to decode one of those sequences of squiggly, distorted letters that look like a cross between a Rorschach test and a four-year-old’s signature — a captcha, as computer scientists call them, short for “Completely Automated Public Turing test to tell Computers and Humans Apart.”

The curious thing about Craigslist’s captchas, however, is that instead of testing me with a single sequence of random letters (ujFRuQ, say), the site asks me to decode two words, both in a distinctly old-fashioned (though distorted) font.

Read the rest, as they say. I once looked at the little codes Blogger made me type in, hoping that there was some ghost in the machine trying to say hello. With the random letters, not so much, but here where they're trying to recapture texts, there's a distinct chance you could get that impression.

Also, the whole point of captchas is that they must be hard to automate, and the tasks were picked with this explicitly in mind. That makes me wonder what the complete list of such tasks would be. I like the idea that there are tasks that are hard to algorithmize, in part because I am not good at thinking algorithmically (hence I am not a computer programmer), but also because I always wonder whether self-consciousness is uncomputable.

Anyway, I think optical character recognition is most definitely computable, but as my grandfather's chicken scratch illustrated, it can be very difficult to do. What's neat about the work above is that, after probably a million engineer-years of trying to automate the process, you pick the exact inputs that have proved resistant to automation, which both screens out spam-bots and provides a useful service to whoever was trying to scan in the original document. Neat.


