We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. Authors: Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean and Andrew Y. Ng. (2012)
Description
It is a long-standing dream of AI to have algorithms automatically read and obtain knowledge from text. By applying a learning algorithm to parsed text, we have developed methods that can automatically identify the concepts in the text and the relations between them. For example, reading the phrase "heavy water rich in the doubly heavy hydrogen atom called deuterium", our algorithm learns (and adds to its semantic network) the fact that deuterium is a type of atom (Snow et al., 2005). By applying this procedure (and its extensions: Snow et al., 2006; Snow et al., 2007) to large amounts of text, our algorithms automatically acquire hundreds of thousands of items of world knowledge and use them to produce significantly enhanced versions of WordNet (made freely available online). WordNet, a laboriously hand-coded database, is a major NLP resource, but it has proven very expensive to build and maintain manually. By automatically inducing knowledge to add to WordNet, our work provides an even greater NLP resource (e.g., significantly greater precision/recall in identifying various relations) at a tiny fraction of the cost.
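To make the deuterium example concrete, here is a minimal sketch of extracting "is a type of" relations from raw text. It assumes two hand-written lexical patterns in the spirit of Hearst (1992). It is not the method described above, which learns its patterns from parsed text. The names `PATTERNS`, `extract_hypernym_pairs`, and `add_to_network` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical hand-written patterns (an assumption for illustration only).
# Each pattern captures a hypernym and a hyponym in named groups.
PATTERNS = [
    # "atom called deuterium" -> deuterium is a type of atom
    re.compile(r"(?P<hyper>\w+) called (?P<hypo>\w+)"),
    # "metals such as copper" -> copper is a type of metals
    re.compile(r"(?P<hyper>\w+) such as (?P<hypo>\w+)"),
]


def extract_hypernym_pairs(text):
    """Return (hyponym, hypernym) pairs matched by the toy patterns."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group("hypo"), match.group("hyper")))
    return pairs


def add_to_network(network, pairs):
    """Record each pair in a dict mapping hyponym -> set of hypernyms."""
    for hypo, hyper in pairs:
        network.setdefault(hypo, set()).add(hyper)
    return network


phrase = "heavy water rich in the doubly heavy hydrogen atom called deuterium"
network = add_to_network({}, extract_hypernym_pairs(phrase))
print(network)
```

Fixed surface patterns like these are brittle and match single words only. A learned approach over parsed text can, in principle, generalize to many more constructions.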
Relevant Papers
Rion Snow, Sushant Prakash, and Andrew Y. Ng, "Learning to Merge Word Senses". EMNLP 2007. [pdf]
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng, "Semantic taxonomy induction from heterogenous evidence". COLING/ACL, 2006. Received Best Paper Award. [pdf]
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng, "Learning syntactic patterns for automatic hypernym discovery". NIPS 17, 2005. [pdf]
Related Papers
Sharon Caraballo, "Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text". Brown University Ph.D. Thesis, 2001.
Marti Hearst, "Automatic Acquisition of Hyponyms from Large Text Corpora". COLING 1992. [pdf]
George Miller, "WordNet: a lexical database for English". Communications of the ACM, 1995.