
arXiv:1709.08600 [cs.CL]

EZLearn: Exploiting Organic Supervision in Large-Scale Data Annotation

Maxim Grechkin, Hoifung Poon, Bill Howe

Published 2017-09-25 (Version 1)

We propose Extreme Zero-shot Learning (EZLearn) for classifying data into potentially thousands of classes, with zero labeled examples. The key insight is to leverage the abundant unlabeled data together with two sources of organic supervision: a lexicon for the annotation classes, and text descriptions that often accompany unlabeled data. Such indirect supervision is readily available in science and other high-value applications. The classes represent the consensus conceptualization of a given domain, and a standard lexicon for them can often be obtained from an existing domain ontology. Likewise, to facilitate reuse, public datasets typically include text descriptions, some of which mention the relevant classes. To exploit such organic supervision, EZLearn introduces an auxiliary natural language processing system, which uses the lexicon to generate initial noisy labels from the text descriptions and then co-teaches the main classifier until convergence. Effectively, EZLearn combines distant supervision and co-training into a new learning paradigm for leveraging unlabeled data. Because no hand-labeled examples are required, EZLearn is naturally applicable to domains with a long tail of classes and/or frequent updates. We evaluated EZLearn on applications in functional genomics and scientific figure comprehension. In both cases, using text descriptions as the pivot, EZLearn learned to accurately annotate data samples without direct supervision, substantially outperforming even state-of-the-art supervised methods trained on tens of thousands of annotated examples.
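Concretely, the training loop the abstract describes can be pictured as distant supervision followed by iterated co-training. The Python sketch below is an illustration only: the scikit-learn models are stand-ins, and the helper names (lexicon_match, ezlearn_sketch) and the exactly-one-label agreement filter are hypothetical simplifications, not the authors' implementation.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def lexicon_match(description, lexicon):
    # Distant supervision: a class is a candidate label if any of its
    # lexicon terms appears in the free-text description (noisy by design).
    text = description.lower()
    return [cls for cls, terms in lexicon.items()
            if any(term in text for term in terms)]

def ezlearn_sketch(samples, descriptions, lexicon, rounds=5):
    samples = np.asarray(samples)
    # Initial noisy labels come from the lexicon alone.
    labels = [lexicon_match(d, lexicon) for d in descriptions]
    X_text = TfidfVectorizer().fit_transform(descriptions)

    main = None
    for _ in range(rounds):
        # Train only on samples with a single unambiguous candidate label
        # (a crude confidence filter; the paper's criterion may differ).
        idx = [i for i, l in enumerate(labels) if len(l) == 1]
        y = [labels[i][0] for i in idx]

        # Two views of the same data: the main classifier sees the data
        # features, the auxiliary classifier sees the text descriptions.
        main = LogisticRegression(max_iter=1000).fit(samples[idx], y)
        aux = LogisticRegression(max_iter=1000).fit(X_text[idx], y)

        # Co-teaching step: relabel everything with both views and keep
        # only the samples on which the two classifiers agree as the next
        # round's supervision.
        main_pred = main.predict(samples)
        aux_pred = aux.predict(X_text)
        labels = [[m] if m == a else [] for m, a in zip(main_pred, aux_pred)]

    return main

The structural points match the abstract: no hand-labeled examples enter the loop, the lexicon only ever touches the text descriptions, and the text view acts as the pivot that transfers supervision to the main classifier.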

Related articles:
arXiv:1812.00677 [cs.CL] (Published 2018-12-03)
Clinical Document Classification Using Labeled and Unlabeled Data Across Hospitals
arXiv:2010.03276 [cs.CL] (Published 2020-10-07)
ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization
arXiv:2302.04813 [cs.CL] (Published 2023-02-09)
Explanation Selection Using Unlabeled Data for In-Context Learning