DOI: 10.1145/3448016.3457334
Research article

Adaptive Rule Discovery for Labeling Text Data

Published: 18 June 2021

Abstract

    Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines, and the emergence of automated feature generation techniques such as deep learning, which typically require large amounts of training data, has further exacerbated the problem. While weak-supervision techniques circumvent this bottleneck, existing frameworks either require users to write a set of diverse, high-quality rules to label data (e.g., Snorkel) or require a labeled subset of the data from which to automatically mine rules (e.g., Snuba). Manually writing rules can be tedious and time-consuming, while creating a labeled subset of the data can be costly and even infeasible in imbalanced settings.
    To address these shortcomings, we present DARWIN, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly-supervised settings. Given an initial labeling rule, DARWIN automatically generates a set of candidate rules for the labeling task at hand and uses the annotator's feedback to adapt them. DARWIN is scalable and versatile: it operates over large text corpora (more than 1 million sentences) and supports a wide range of labeling functions (any function that can be specified using a context-free grammar).
    Finally, we demonstrate through a suite of experiments over five real-world datasets that DARWIN enables annotators to generate weakly-supervised labels efficiently and at small cost. Our experiments show that the rules discovered by DARWIN identify, on average, 40% more positive instances than Snuba, even when Snuba is provided with 1,000 labeled instances.
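    The interaction loop described in the abstract — seed rule, automatic candidate generation, annotator feedback — can be sketched as a toy. This is an illustrative approximation, not DARWIN's actual algorithm: the keyword-based rules, candidate scoring, corpus, and simulated annotator below are all hypothetical.

    ```python
    from collections import Counter

    def keyword_rule(keyword):
        # Labeling function: 1 (positive) if the keyword occurs in the
        # sentence, 0 (abstain) otherwise.
        return lambda sentence: int(keyword in sentence.lower().split())

    def propose_candidates(corpus, accepted_keywords, top_k=3):
        # Candidate generation: rank words that co-occur with sentences
        # already matched by the accepted rules.
        rules = [keyword_rule(k) for k in accepted_keywords]
        matched = [s for s in corpus if any(r(s) for r in rules)]
        counts = Counter(w for s in matched for w in s.lower().split()
                         if w not in accepted_keywords)
        return [w for w, _ in counts.most_common(top_k)]

    def discover(corpus, seed_keyword, annotator, rounds=2):
        # Adaptive loop: propose candidate rules, keep the ones the
        # annotator approves, and re-propose from the enlarged rule set.
        accepted = [seed_keyword]
        for _ in range(rounds):
            for cand in propose_candidates(corpus, accepted):
                if cand not in accepted and annotator(cand):
                    accepted.append(cand)
        return accepted

    # Toy corpus and a simulated annotator that approves drug-related words.
    corpus = [
        "aspirin treats headache",
        "ibuprofen treats fever",
        "aspirin reduces fever",
        "the market closed higher today",
    ]
    approve = lambda word: word in {"aspirin", "ibuprofen", "treats", "fever"}
    rules = discover(corpus, "treats", approve)
    ```

    Starting from the single seed keyword "treats", the loop surfaces co-occurring words, the simulated annotator accepts the relevant ones, and the expanded rule set never matches the off-topic sentence. DARWIN's real candidate space (any labeling function expressible in a context-free grammar) is far richer than single keywords.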

    Supplementary Material

    MP4 File (3448016.3457334.mp4)

    References

    [1] Appen. https://appen.com.
    [2] ClueWeb. https://lemurproject.org/clueweb09/.
    [3] spaCy embeddings. https://spacy.io/models/en#en_core_web_lg.
    [4] NELL. http://rtw.ml.cmu.edu/rtw/kbbrowser/.
    [5] spaCy. https://spacy.io/.
    [6] Enrique Alfonseca, Katja Filippova, Jean-Yves Delort, and Guillermo Garrido. Pattern learning for relation extraction with a hierarchical topic model. In ACL, 2012.
    [7] Allan Peter Davis, Thomas C. Wiegers, Phoebe M. Roberts, Benjamin L. King, Jean M. Lay, Kelley Lennon-Hopkins, Daniela Sciaky, Robin Johnson, Heather Keating, Nigel Greene, et al. A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions. Database, 2013.
    [8] Xin Luna Dong and Divesh Srivastava. Big data integration. Synthesis Lectures on Data Management, 7(1), 2015.
    [9] L. Eadicicco. Baidu's Andrew Ng on the future of artificial intelligence. Time, January 11, 2017.
    [10] Sainyam Galhotra, Donatella Firmani, Barna Saha, and Divesh Srivastava. Robust entity resolution using random graphs. In SIGMOD, 2018.
    [11] Sainyam Galhotra, Donatella Firmani, Barna Saha, and Divesh Srivastava. Efficient and effective ER with progressive blocking. The VLDB Journal, pages 1-21, 2021.
    [12] Sainyam Galhotra, Behzad Golshan, and Wang-Chiew Tan. Adaptive rule discovery for labeling text data. arXiv preprint arXiv:2005.06133, 2020.
    [13] Arpita Ghosh, Satyen Kale, and Preston McAfee. Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM Conference on Electronic Commerce, 2011.
    [14] Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, and Donald Kossmann. Fault-tolerant entity resolution with the crowd. arXiv preprint arXiv:1512.00537, 2015.
    [15] Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. Training classifiers with natural language explanations. arXiv preprint arXiv:1805.03818, 2018.
    [16] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In COLING, 1992.
    [17] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
    [18] Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. A survey on truth discovery. ACM SIGKDD Explorations Newsletter, 17(2), 2016.
    [19] Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
    [20] C. Metz. Google's hand-fed AI now gives answers, not just search results. Wired, November 29, 2016.
    [21] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In ACL/IJCNLP, 2009.
    [22] Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP-CoNLL, 2012.
    [23] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.
    [24] Protiva Rahman, Courtney Hebert, and Arnab Nandi. Icarus: minimizing human effort in iterative data completion. PVLDB, 11, 2018.
    [25] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. PVLDB, 11(3), 2017.
    [26] Theodoros Rekatsinas, Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, and Christopher Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In SIGMOD, 2017.
    [27] Stephen Roller, Douwe Kiela, and Maximilian Nickel. Hearst patterns revisited: Automatic hypernym detection from large text corpora. arXiv preprint arXiv:1806.03191, 2018.
    [28] Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.
    [29] Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In NIPS, 2004.
    [30] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL, 2012.
    [31] Paroma Varma, Bryan D. He, Payal Bajaj, Nishith Khandwala, Imon Banerjee, Daniel L. Rubin, and Christopher Ré. Inferring generative model structure with static analysis. In NIPS, 2017.
    [32] Paroma Varma and Christopher Ré. Snuba: Automating weak supervision to label training data. PVLDB, 2019.
    [33] Jinpeng Wang, Gao Cong, Wayne Xin Zhao, and Xiaoming Li. Mining user intents in Twitter: A semi-supervised approach to inferring intent categories for tweets. 2015.
    [34] Xiaolan Wang, Aaron Feng, Behzad Golshan, Alon Halevy, George Mihaila, Hidekazu Oiwa, and Wang-Chiew Tan. Scalable semantic querying of text. PVLDB, 11(9), 2018.
    [35] Peter Welinder, Steve Branson, Serge J. Belongie, and Pietro Perona. The multidimensional wisdom of crowds. In NIPS, 2010.
    [36] Fan Yang, Zhilin Yang, and William W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NIPS, 2017.
    [37] Jingru Yang, Ju Fan, Zhewei Wei, Guoliang Li, Tongyu Liu, and Xiaoyong Du. Cost-effective data annotation using game-based crowdsourcing. In PVLDB, 2018.


      Published In

      SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
      June 2021, 2969 pages
      ISBN: 9781450383431
      DOI: 10.1145/3448016

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. data labelling
      2. information extraction
      3. weak supervision

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '21

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%


      Cited By

      • (2024) KGRED: Knowledge-graph-based rule discovery for weakly supervised data labeling. Information Processing & Management, 61(5), September 2024. DOI: 10.1016/j.ipm.2024.103816
      • (2023) Understanding the influence of news on society decision making: application to economic policy uncertainty. Neural Computing and Applications, 35(20):14929-14945, 2023. DOI: 10.1007/s00521-023-08438-8
      • (2022) Nemo. Proceedings of the VLDB Endowment, 15(13):4093-4105, 2022. DOI: 10.14778/3565838.3565859
      • (2022) Witan. Proceedings of the VLDB Endowment, 15(11):2334-2347, 2022. DOI: 10.14778/3551793.3551797
      • (2022) Let's Embed Your Knowledge into AI by Trial and Error Instead of Annotation. Adjunct Proceedings of the 2022 ACM International Joint Conference on Pervasive and Ubiquitous Computing and the 2022 ACM International Symposium on Wearable Computers, pages 105-107, 2022. DOI: 10.1145/3544793.3560352
