TruePIE: Discovering Reliable Patterns in Pattern-Based Information Extraction

Authors:

Timothy P. Hanratty,

Jiawei Han

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 1675 - 1684

https://doi.org/10.1145/3219819.3220017

Published: 19 July 2018

Abstract

Pattern-based methods have been successful in information extraction and NLP research. Previous approaches learn the quality of a textual pattern as its relatedness to a certain task, based on statistics of its individual content (e.g., length, frequency) and hundreds of carefully annotated labels. However, patterns with good content quality may generate heavily conflicting information because of the large gap between relatedness and correctness. Evaluating the correctness of information is critical in (entity, attribute, value)-tuple extraction. In this work, we propose a novel method, called TruePIE, that finds reliable patterns, which extract information that is not only related but also correct. TruePIE adopts the self-training framework and repeats the training-predicting-extracting process to gradually discover an increasing number of reliable patterns. To better represent the textual patterns, pattern embeddings are formulated so that patterns with similar semantic meanings are embedded close to each other. The embeddings jointly consider the local pattern information and the distributional information of the extractions. To address the lack of supervision on pattern reliability, TruePIE automatically generates high-quality training patterns from a couple of seed patterns by applying arity constraints to distinguish highly reliable patterns (i.e., positive patterns) from highly unreliable patterns (i.e., negative patterns). Experiments on a large news dataset (over 25 GB) demonstrate that the proposed TruePIE significantly outperforms baseline methods on each of three tasks: reliable tuple extraction, reliable pattern extraction, and negative pattern extraction.
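
The abstract does not spell out how arity constraints turn seed patterns into training labels, so the following is only a minimal Python sketch of one plausible reading, not the authors' implementation. It assumes a single-valued attribute (each entity has at most `max_values` correct values), treats tuples extracted by seed patterns as trusted, and labels a candidate pattern by how often its extractions agree or conflict with those trusted values. The function name `label_patterns`, both thresholds, the pattern strings, and the placeholder entities are all hypothetical.

```python
from collections import defaultdict


def label_patterns(extractions, seed_patterns, max_values=1,
                   pos_threshold=0.8, neg_threshold=0.2):
    """Label patterns as positive / negative / unlabeled (illustrative only).

    extractions: dict mapping a pattern string to a set of (entity, value) pairs.
    seed_patterns: set of pattern strings assumed to be reliable.
    max_values: assumed arity constraint (max correct values per entity).
    """
    # Values extracted by seed patterns are treated as trusted.
    trusted = defaultdict(set)
    for pattern in seed_patterns:
        for entity, value in extractions.get(pattern, ()):
            trusted[entity].add(value)

    labels = {}
    for pattern, pairs in extractions.items():
        if pattern in seed_patterns:
            labels[pattern] = "positive"
            continue
        agree = conflict = 0
        for entity, value in pairs:
            known = trusted.get(entity)
            if not known:
                continue  # no trusted evidence for this entity
            if value in known:
                agree += 1
            elif len(known) >= max_values:
                # Accepting this value would exceed the allowed arity.
                conflict += 1
        judged = agree + conflict
        if judged == 0:
            labels[pattern] = "unlabeled"
            continue
        ratio = agree / judged
        if ratio >= pos_threshold:
            labels[pattern] = "positive"
        elif ratio <= neg_threshold:
            labels[pattern] = "negative"
        else:
            labels[pattern] = "unlabeled"
    return labels


# Hypothetical extractions for a "president of a country" attribute.
extractions = {
    "president $PERSON of $COUNTRY": {("Country A", "Person X"),
                                      ("Country B", "Person Y")},
    "$COUNTRY president $PERSON": {("Country A", "Person X"),
                                   ("Country B", "Person Y"),
                                   ("Country C", "Person Z")},
    "$PERSON visited $COUNTRY": {("Country A", "Person Y"),
                                 ("Country B", "Person X")},
}
seeds = {"president $PERSON of $COUNTRY"}
labels = label_patterns(extractions, seeds)
```

In this toy data the second pattern agrees with every trusted value and would serve as a positive training pattern, while the "visited" pattern is related to the same entity types but conflicts with the trusted values under the one-value assumption, which is the relatedness-versus-correctness gap the abstract describes.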

Supplementary Material

MP4 File (li_truepie_patterns.mp4), 448.10 MB

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Information extraction

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2018, 2925 pages

ISBN: 9781450355520

DOI: 10.1145/3219819

General Chairs: Yike Guo (Imperial College London), Faisal Farooq (IBM)

Copyright © 2018 ACM.

© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 19 - 23, 2018

London, United Kingdom

Acceptance Rates

KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Article Metrics

Total Citations: 14

Total Downloads: 1,263

Downloads (Last 12 months): 136

Downloads (Last 6 weeks): 26

Reflects downloads up to 12 Feb 2025

Cited By

Hou W, Hong L, Zhu Z (2024). KGRED: Knowledge-graph-based rule discovery for weakly supervised data labeling. Information Processing & Management 61:5 (103816). Online publication date: Sep-2024.
https://doi.org/10.1016/j.ipm.2024.103816
Hou W, Hong L, Xu H, Yin W (2023). RoRED. Information Sciences: an International Journal 629:C (62-76). Online publication date: 1-Jun-2023.
https://dl.acm.org/doi/10.1016/j.ins.2023.01.132
Fu Y (2022). DPRL: Labeling Relation Based on Distant Supervision and POS Rule. 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE) (63-67). Online publication date: 14-Jan-2022.
https://doi.org/10.1109/ICCECE54139.2022.9712720
Ye C, Wang H, Dai G (2022). Fact Discovery for Text Data. Knowledge Discovery from Multi-Sourced Data (69-83). Online publication date: 14-Jun-2022.
https://doi.org/10.1007/978-981-19-1879-7_5
Ye C, Wang H, Dai G (2022). Introduction. Knowledge Discovery from Multi-Sourced Data (1-11). Online publication date: 14-Jun-2022.
https://doi.org/10.1007/978-981-19-1879-7_1
Jiang T, Zeng Q, Zhao T, Qin B, Liu T, Chawla N, Jiang M (2021). Biomedical Knowledge Graphs Construction From Conditional Statements. IEEE/ACM Transactions on Computational Biology and Bioinformatics 18:3 (823-835). Online publication date: 1-May-2021.
https://doi.org/10.1109/TCBB.2020.2979959
Wang X, Zhang Y, Chauhan A, Li Q, Han J (2020). Textual Evidence Mining via Spherical Heterogeneous Information Network Embedding. 2020 IEEE International Conference on Big Data (Big Data) (828-837). Online publication date: 10-Dec-2020.
https://doi.org/10.1109/BigData50022.2020.9377958
Wang X, Jiang M (2020). Precise temporal slot filling via truth finding with data-driven commonsense. Knowledge and Information Systems. Online publication date: 16-Jul-2020.
https://doi.org/10.1007/s10115-020-01493-w
Yu J, Lu W, Xu W, Tang Z (2020). Entity Synonym Discovery via Multiple Attentions. Semantic Technology (271-286). Online publication date: 14-Feb-2020.
https://doi.org/10.1007/978-3-030-41407-8_18
Wang X, Zhang H, Li Q, Shi Y, Jiang M (2019). A Novel Unsupervised Approach for Precise Temporal Slot Filling from Incomplete and Noisy Temporal Contexts. The World Wide Web Conference (3328-3334). Online publication date: 13-May-2019.
https://dl.acm.org/doi/10.1145/3308558.3313435
