WikiWoods
WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for the full [http://en.wikipedia.org English Wikipedia]. A high-level discussion of the WikiWoods motivation and general methodology is provided by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)]. The first public release will be available for download from this site on or before June 1, 2010.
The WikiWoods corpus is extracted from a [http://www.delph-in.net/wescience/enwiki-20080727-pages-articles.xml.bz2 Wikipedia snapshot]. As of mid-2010, the corpus comprises around 1.2 million content articles, for a total of around 55 million 'sentences' (or other types of root-level utterances). The corpus is available in two forms: (a) as a collection of [http://www.delph-in.net/wikiwoods/1004/raw.tgz raw articles], prior to preprocessing; and (b) as a set of preprocessed and sentence-segmented [http://www.delph-in.net/wikiwoods/1004/txt.tgz text files]. Both sets of files are organized into segments, each comprising 100 articles.
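As a rough illustration of working with the segment organization described above, the following Python sketch counts the files in a locally downloaded archive, grouped by top-level path component, and estimates how many segments a given number of articles would occupy at 100 articles per segment. The function names and the local file path are hypothetical, and the internal layout of the archives (for example, whether each segment is a directory or a single file) is an assumption not confirmed on this page, so inspect the output before relying on it.

```python
import tarfile
from collections import Counter


def count_files_by_top_dir(archive_path):
    """Count regular files in a gzip-compressed tar archive,
    grouped by the first component of each member's path.

    Assumption: segments show up as top-level path components;
    the real archive layout may differ.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Drop empty and "." components such as a leading "./"
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                counts[parts[0]] += 1
    return counts


def estimated_segments(n_articles, per_segment=100):
    """Number of segments needed for n_articles (ceiling division)."""
    return -(-n_articles // per_segment)


if __name__ == "__main__":
    # Around 1.2 million articles at 100 per segment
    print(estimated_segments(1_200_000))
    # Hypothetical local copy of one of the archives linked above:
    # print(count_files_by_top_dir("raw.tgz").most_common(5))
```

With the figures quoted above, around 1.2 million articles at 100 per segment corresponds to roughly 12,000 segments, and around 55 million utterances over 1.2 million articles averages out to about 46 per article.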