Entity Extraction via the Calais Web Service
A few weeks ago, Barney mentioned that ClearForest, now owned by Reuters, had opened up an API to their Calais Web Service. After a few weeks of working through bugs on their end, I was able to publish a Ruby wrapper for their POST-based API. That was the easy part; my eventual goal was to build something genuinely useful on top of this library. Last night, I published the Autotagger, a really simple Merb application built on my Calais library. Check it out. I’ve gotten some good results by pasting in text from Wikipedia or from the New York Times.
Why is entity extraction useful? One of the hardest problems for anyone working with the Semantic Web is that not all data on the web is tagged. It’s difficult to generate and leverage relationships between types of data when there’s no way to make that content portable. The Calais Web Service attacks this problem by pulling names and terms out of a body of text. It can also provide relationships between those terms. (I haven’t yet exposed this in my Autotagger, but the data is present in the Ruby library.)
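To make the idea concrete, here’s a rough sketch of how extracted entities might be turned into tags. It is only an illustration: the `Entity` struct, the `tags_by_type` helper, and the sample data are all made up for this example, and none of it reflects the actual Calais response format or the API of my Ruby library.

```ruby
# Hypothetical entity record: a type ("Company", "Person", ...) and a name.
# This is an assumed shape for illustration, not the real Calais output.
Entity = Struct.new(:type, :name)

# Group entity names by their type, dropping case-insensitive duplicates
# so that "Reuters" and "reuters" produce a single tag.
def tags_by_type(entities)
  entities
    .group_by(&:type)
    .transform_values { |group| group.map(&:name).uniq { |name| name.downcase } }
end

# Made-up sample data standing in for what an extraction pass might find.
SAMPLE_ENTITIES = [
  Entity.new("Company", "Reuters"),
  Entity.new("Company", "ClearForest"),
  Entity.new("Company", "reuters"),
  Entity.new("Technology", "Semantic Web")
]

tags_by_type(SAMPLE_ENTITIES).each do |type, names|
  puts "#{type}: #{names.join(', ')}"
end
```

Grouping by type is what makes the tags more useful than a flat keyword list: once you know a term is a company rather than a person, you can start relating it to other data of the same kind.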

