DOI: 10.1145/3448016.3457334
Research article

Adaptive Rule Discovery for Labeling Text Data

Published: 18 June 2021

Abstract

    Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines, and the emergence of automated feature generation techniques such as deep learning, which typically require large amounts of training data, has further exacerbated the problem. While weak-supervision techniques circumvent this bottleneck, existing frameworks either require users to write a set of diverse, high-quality rules to label data (e.g., Snorkel) or require a labeled subset of the data from which to automatically mine rules (e.g., Snuba). Manually writing rules can be tedious and time-consuming, while creating a labeled subset of the data can be costly and even infeasible in imbalanced settings.
    To address these shortcomings, we present DARWIN, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly-supervised settings. Given an initial labeling rule, DARWIN automatically generates a set of candidate rules for the labeling task at hand and uses the annotator's feedback to adapt them. DARWIN is scalable and versatile: it operates over large text corpora (more than 1 million sentences) and supports a wide range of labeling functions (any function that can be specified using a context-free grammar).
    Finally, we demonstrate through a suite of experiments over five real-world datasets that DARWIN enables annotators to generate weakly-supervised labels efficiently and at small cost. Our experiments show that the rules discovered by DARWIN identify, on average, 40% more positive instances than Snuba, even when Snuba is provided with 1,000 labeled instances.
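    The interaction loop described in the abstract — seed rule, automatic candidate generation, annotator feedback — can be sketched as a toy. This is an illustrative approximation, not DARWIN's actual algorithm: the keyword-based rules, candidate scoring, corpus, and simulated annotator below are all hypothetical.

    ```python
    from collections import Counter

    def keyword_rule(keyword):
        # Labeling function: 1 (positive) if the keyword occurs in the
        # sentence, 0 (abstain) otherwise.
        return lambda sentence: int(keyword in sentence.lower().split())

    def propose_candidates(corpus, accepted_keywords, top_k=3):
        # Candidate generation: rank words that co-occur with sentences
        # already matched by the accepted rules.
        rules = [keyword_rule(k) for k in accepted_keywords]
        matched = [s for s in corpus if any(r(s) for r in rules)]
        counts = Counter(w for s in matched for w in s.lower().split()
                         if w not in accepted_keywords)
        return [w for w, _ in counts.most_common(top_k)]

    def discover(corpus, seed_keyword, annotator, rounds=2):
        # Adaptive loop: propose candidate rules, keep the ones the
        # annotator approves, and re-propose from the enlarged rule set.
        accepted = [seed_keyword]
        for _ in range(rounds):
            for cand in propose_candidates(corpus, accepted):
                if cand not in accepted and annotator(cand):
                    accepted.append(cand)
        return accepted

    # Toy corpus and a simulated annotator that approves drug-related words.
    corpus = [
        "aspirin treats headache",
        "ibuprofen treats fever",
        "aspirin reduces fever",
        "the market closed higher today",
    ]
    approve = lambda word: word in {"aspirin", "ibuprofen", "treats", "fever"}
    rules = discover(corpus, "treats", approve)
    ```

    Starting from the single seed keyword "treats", the loop surfaces co-occurring words, the simulated annotator accepts the relevant ones, and the expanded rule set never matches the off-topic sentence. DARWIN's real candidate space (any labeling function expressible in a context-free grammar) is far richer than single keywords.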

    Supplementary Material

    MP4 File (3448016.3457334.mp4)

    References

    [1] Appen. https://appen.com.
    [2] ClueWeb. https://lemurproject.org/clueweb09/.
    [3] spaCy embeddings. https://spacy.io/models/en#en_core_web_lg.
    [4] NELL. http://rtw.ml.cmu.edu/rtw/kbbrowser/.
    [5] spaCy. https://spacy.io/.
    [6] Enrique Alfonseca, Katja Filippova, Jean-Yves Delort, and Guillermo Garrido. Pattern learning for relation extraction with a hierarchical topic model. In ACL, 2012.
    [7] Allan Peter Davis, Thomas C. Wiegers, Phoebe M. Roberts, Benjamin L. King, Jean M. Lay, Kelley Lennon-Hopkins, Daniela Sciaky, Robin Johnson, Heather Keating, Nigel Greene, et al. A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions. Database, 2013.
    [8] Xin Luna Dong and Divesh Srivastava. Big data integration. Synthesis Lectures on Data Management, 7(1), 2015.
    [9] L. Eadicicco. Baidu's Andrew Ng on the future of artificial intelligence. Time, January 11, 2017.
    [10] Sainyam Galhotra, Donatella Firmani, Barna Saha, and Divesh Srivastava. Robust entity resolution using random graphs. In SIGMOD, 2018.
    [11] Sainyam Galhotra, Donatella Firmani, Barna Saha, and Divesh Srivastava. Efficient and effective ER with progressive blocking. The VLDB Journal, pages 1-21, 2021.
    [12] Sainyam Galhotra, Behzad Golshan, and Wang-Chiew Tan. Adaptive rule discovery for labeling text data. arXiv preprint arXiv:2005.06133, 2020.
    [13] Arpita Ghosh, Satyen Kale, and Preston McAfee. Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM Conference on Electronic Commerce, 2011.
    [14] Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, and Donald Kossmann. Fault-tolerant entity resolution with the crowd. arXiv preprint arXiv:1512.00537, 2015.
    [15] Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. Training classifiers with natural language explanations. arXiv preprint arXiv:1805.03818, 2018.
    [16] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In COLING, 1992.
    [17] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
    [18] Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. A survey on truth discovery. ACM SIGKDD Explorations Newsletter, 17(2), 2016.
    [19] Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
    [20] C. Metz. Google's hand-fed AI now gives answers, not just search results. Wired, November 29, 2016.
    [21] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In ACL/IJCNLP, 2009.
    [22] Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP-CoNLL, 2012.
    [23] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.
    [24] Protiva Rahman, Courtney Hebert, and Arnab Nandi. Icarus: minimizing human effort in iterative data completion. PVLDB, 11, 2018.
    [25] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. PVLDB, 11(3), 2017.
    [26] Theodoros Rekatsinas, Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, and Christopher Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In SIGMOD, 2017.
    [27] Stephen Roller, Douwe Kiela, and Maximilian Nickel. Hearst patterns revisited: Automatic hypernym detection from large text corpora. arXiv preprint arXiv:1806.03191, 2018.
    [28] Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.
    [29] Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In NIPS, 2004.
    [30] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL, 2012.
    [31] Paroma Varma, Bryan D. He, Payal Bajaj, Nishith Khandwala, Imon Banerjee, Daniel L. Rubin, and Christopher Ré. Inferring generative model structure with static analysis. In NIPS, 2017.
    [32] Paroma Varma and Christopher Ré. Snuba: Automating weak supervision to label training data. PVLDB, 2019.
    [33] Jinpeng Wang, Gao Cong, Wayne Xin Zhao, and Xiaoming Li. Mining user intents in Twitter: A semi-supervised approach to inferring intent categories for tweets. 2015.
    [34] Xiaolan Wang, Aaron Feng, Behzad Golshan, Alon Halevy, George Mihaila, Hidekazu Oiwa, and Wang-Chiew Tan. Scalable semantic querying of text. PVLDB, 11(9), 2018.
    [35] Peter Welinder, Steve Branson, Serge J. Belongie, and Pietro Perona. The multidimensional wisdom of crowds. In NIPS, 2010.
    [36] Fan Yang, Zhilin Yang, and William W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NIPS, 2017.
    [37] Jingru Yang, Ju Fan, Zhewei Wei, Guoliang Li, Tongyu Liu, and Xiaoyong Du. Cost-effective data annotation using game-based crowdsourcing. In PVLDB, 2018.


      Published In

      SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
      June 2021, 2969 pages
      ISBN: 9781450383431
      DOI: 10.1145/3448016

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. data labelling
      2. information extraction
      3. weak supervision

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '21

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%


      Cited By

      • (2024) KGRED: Knowledge-graph-based rule discovery for weakly supervised data labeling. Information Processing & Management, 61(5), September 2024. DOI: 10.1016/j.ipm.2024.103816
      • (2023) Understanding the influence of news on society decision making: application to economic policy uncertainty. Neural Computing and Applications, 35(20):14929-14945, 2023. DOI: 10.1007/s00521-023-08438-8
      • (2022) Nemo. Proceedings of the VLDB Endowment, 15(13):4093-4105, 2022. DOI: 10.14778/3565838.3565859
      • (2022) Witan. Proceedings of the VLDB Endowment, 15(11):2334-2347, 2022. DOI: 10.14778/3551793.3551797
      • (2022) Let's Embed Your Knowledge into AI by Trial and Error Instead of Annotation. Adjunct Proceedings of the 2022 ACM International Joint Conference on Pervasive and Ubiquitous Computing and the 2022 ACM International Symposium on Wearable Computers, pages 105-107, 2022. DOI: 10.1145/3544793.3560352
