This article surveys recent work in active learning aimed at making it more practical for real-world use. In general, active learning systems aim to make machine learning more economical, since they can participate in the acquisition of their own training data. An active learner might iteratively select informative query instances to be labeled by an oracle, for example.
We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day.
This paper describes DUALIST, an active learning annotation paradigm which solicits and learns from labels on both features (e.g., words) and instances (e.g., documents). We present a novel semi-supervised training algorithm developed for this setting, which is (1) fast enough to support real-time interactive speeds, and (2) at least as accurate as preexisting methods for learning with mixed feature and instance labels.
This is a brief addendum to previous work (Settles, 2011), which corrects some errors reported in the results in Table 1. These errors were introduced by third-party code, and the main results of the paper still hold. This note is intended to shed more light on the somewhat surprising result that the proposed multinomial naive Bayes (MNB) variant drastically outperformed maximum entropy (MaxEnt) models trained using generalized expectation (GE) criteria (Druck et al., 2008).
Accumulating evidence suggests that reversible protein acetylation may be a major regulatory mechanism that rivals phosphorylation. With the recent cataloging of thousands of acetylation sites on hundreds of proteins comes the challenge of identifying the acetyltransferases and deacetylases that regulate acetylation levels. Sirtuins are a conserved family of NAD+-dependent protein deacetylases that are implicated in genome maintenance, metabolism, cell survival, and lifespan.
At personal goal-setting websites, people join others in committing to a challenging goal, such as losing ten pounds or writing a novel in a month. Despite the popularity of these online communities, we know little about whether or how they improve goal performance. Based on theories of goal-setting and group attachment, we examine the influence of two social factors in an online "songwriting challenge" community: early feedback evoking a shared social identity, and one-on-one collaborations with other members.
Active learning addresses this inherent bottleneck by allowing the learner to selectively choose which parts of the available data are labeled for training. The goal is to maximize the accuracy of the learner through such "queries," while minimizing the work required of human annotators. In this thesis, I explore several important questions regarding active learning for these and similar tasks involving structured instances. What query strategies are available for these learning algorithms, and how do they compare?
Summary: ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g., protein and cell line) trained on standard corpora, for which performance is roughly state of the art.
This paper describes two natural language processing systems designed to assist songwriters in obtaining and developing ideas for their craft. Titular is a text synthesis algorithm for automatically generating novel song titles, which lyricists can use to back-form concepts and narrative story arcs. LyriCloud is a word-level language "browser" or "explorer," which allows users to interactively select words and receive lyrical suggestions in return.
We describe a system developed for the Annotation Hierarchy subtask of the Text Retrieval Conference (TREC) 2004 Genomics Track. The goal of this track is to automatically predict Gene Ontology (GO) domain annotations given full-text biomedical journal articles and associated genes. Our system uses a two-tier statistical machine learning system that makes predictions first on "zone"-level text (i.e., abstract, introduction, etc.) and then combines evidence to make final document-level predictions.
Methods that learn from prior information about input features such as generalized expectation (GE) have been used to train accurate models with very little effort. In this paper, we propose an active learning approach in which the machine solicits "labels" on features rather than instances. In both simulated and real user experiments on two sequence labeling tasks we show that our active learning method outperforms passive learning with features as well as traditional active learning with instances.
An important challenge in information retrieval is bridging the “semantic gap”, which refers to the disconnect between the way that humans and machines represent and describe objects. The semantic gap prevents humans from expressing their information needs using natural language, and makes it difficult for machines to explain the relevance of the retrieved items to humans. Attributes help bridge this semantic gap.
The goal of active learning is to minimize the cost of training an accurate model by allowing the learner to choose which instances are labeled for training. However, most research in active learning to date has assumed that the cost of acquiring labels is the same for all instances. In domains where labeling costs may vary, a reduction in the number of labeled instances does not guarantee a reduction in cost.
Most approaches to classifying media content assume a fixed, closed vocabulary of labels. In contrast, we advocate machine learning approaches which take advantage of the millions of free-form tags obtainable via online crowd-sourcing platforms and social tagging websites. The use of such open vocabularies presents learning challenges due to typographical errors, synonymy, and a potentially unbounded set of tag labels.
The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator).
As the wealth of biomedical knowledge in the form of literature increases, there is a rising need for effective natural language processing tools to assist in organizing, curating, and retrieving this information. To that end, named entity recognition (the task of identifying words and phrases in free text that belong to certain classes of interest) is an important first step for many of these larger information management goals.