New annotation tool: DiscoverText

Submitted by dan.mccloy on Wed, 12/01/2010 - 12:58

Stuart Shulman recently gave a workshop at UW on DiscoverText, a new annotation tool he is developing. It has limitations that make it unsuitable for some linguistic purposes, but it is powerful in other ways and might be a good choice for certain types of tasks.

Highlights:

Robust support for collaborative annotation, including non-destructive adjudication, tag merging, etc.
Annotation can be crowdsourced, and the results filtered based on various annotator credentials.
Built-in function to scrape publicly available data from Facebook, Twitter, and RSS.
A nice subset-via-search function called "bucketing" that will extract tokens and their surrounding context (of customizable size) and collect them into a subset of your corpus.
Alleged to be Unicode compliant, though this wasn't explicitly demonstrated in the workshop.
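
To make the "bucketing" idea concrete, here is a rough Python sketch of the general technique (search for a term, then keep each hit together with a configurable window of surrounding tokens). This is not DiscoverText's code or API, which I haven't seen; the function name `bucket`, the `context` parameter, and the simple word-based tokenization are all my own assumptions for illustration.

```python
import re

def bucket(documents, term, context=3):
    """Hypothetical helper: collect each occurrence of `term`
    with up to `context` tokens on either side."""
    hits = []
    for doc_id, text in enumerate(documents):
        # Naive tokenization on word characters (Unicode-aware for str in Python 3).
        tokens = re.findall(r"\w+", text)
        for i, tok in enumerate(tokens):
            if tok.lower() == term.lower():
                start = max(0, i - context)
                hits.append({
                    "doc": doc_id,
                    "left": tokens[start:i],
                    "match": tok,
                    "right": tokens[i + 1:i + 1 + context],
                })
    return hits

# Toy corpus, invented for this example.
corpus = [
    "The new tool supports collaborative annotation of text.",
    "Annotation can be crowdsourced and later adjudicated.",
]

for hit in bucket(corpus, "annotation", context=2):
    print(hit["doc"], " ".join(hit["left"]), "[" + hit["match"] + "]", " ".join(hit["right"]))
```

The resulting list of hits plays the role of the "bucket": a subset of the corpus that could then be annotated on its own.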

Lowlights:

The codable unit size is fixed (determined by delimiters when the data is imported), so as far as I can tell, you can't annotate at the segment level, word level, and phrase level all within the same dataset.
DiscoverText is hosted software that appears to have a "freemium" usage model (some functionality is free, some requires a paid account). It is not yet clear where the line will be drawn once the product goes out of beta.
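
The fixed-unit limitation can be pictured with a small Python sketch. It assumes a data model in which the text is split on a delimiter once at import and each code attaches to a whole unit. This is only my reading of what was shown in the workshop, not the tool's actual internals; `import_units` and the `codes` dictionary are hypothetical names.

```python
def import_units(raw_text, delimiter="\n"):
    """Hypothetical import step: split raw text into codable units once."""
    return [u.strip() for u in raw_text.split(delimiter) if u.strip()]

# Toy data, invented for this example.
raw = "I like this tool.\nThe pricing is unclear.\n"
units = import_units(raw)

# Codes are keyed by unit index only. With no start/end offsets,
# a single word or phrase inside a unit cannot carry its own code.
codes = {0: "positive", 1: "negative"}

for i, unit in enumerate(units):
    print(i, codes[i], unit)
```

Under this assumed model, annotating at a different granularity would mean re-importing the same data with a different delimiter as a separate dataset.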

Stuart emphasized that the platform is still in development and that he is eager for suggestions and feature requests, so if this tool looks valuable for your research, I would encourage you to contact him.