Article

Open information extraction from the web

Authors:

Michael J. Cafarella,

Stephen Soderland,

Matt Broadhead,

Oren Etzioni

IJCAI'07: Proceedings of the 20th international joint conference on Artificial intelligence

Pages 2670 - 2676

Published: 06 January 2007

Abstract

Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations.

This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries.
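As a rough illustration of what "tuples assigned a probability and indexed to support user queries" could look like, here is a minimal sketch. It is not TEXTRUNNER's implementation: the `Extraction` and `TupleIndex` names, the token-level inverted index, and the example tuples and probabilities are all hypothetical, chosen only to show the shape of the data.

```python
from collections import defaultdict
from typing import NamedTuple


class Extraction(NamedTuple):
    """A relational tuple (arg1, relation, arg2) with a confidence score."""
    arg1: str
    relation: str
    arg2: str
    probability: float  # hypothetical score in [0, 1], supplied by the caller


class TupleIndex:
    """Toy in-memory inverted index over extracted tuples (illustrative only)."""

    def __init__(self):
        # Maps each lowercased token to the ids of the tuples containing it.
        self._postings = defaultdict(set)
        self._tuples = []

    def add(self, extraction):
        tuple_id = len(self._tuples)
        self._tuples.append(extraction)
        text = " ".join((extraction.arg1, extraction.relation, extraction.arg2))
        for token in text.lower().split():
            self._postings[token].add(tuple_id)

    def query(self, *terms, min_probability=0.0):
        """Return tuples containing every term, most probable first."""
        if not terms:
            return []
        id_sets = [self._postings.get(term.lower(), set()) for term in terms]
        matching_ids = set.intersection(*id_sets)
        hits = [
            self._tuples[i]
            for i in matching_ids
            if self._tuples[i].probability >= min_probability
        ]
        return sorted(hits, key=lambda e: e.probability, reverse=True)


# Made-up example tuples; the probabilities are placeholders, not real results.
index = TupleIndex()
index.add(Extraction("Edison", "invented", "the phonograph", 0.93))
index.add(Extraction("Edison", "was born in", "Ohio", 0.81))
index.add(Extraction("Bell", "invented", "the telephone", 0.90))

results = index.query("edison", "invented")
```

A query here is just a conjunction of keywords, ranked by the stored probability. A real system working over millions of tuples would presumably need a disk-backed or distributed index rather than in-memory sets.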

We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
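For readers unfamiliar with the term, "error reduction" is usually reported as a relative quantity. Assuming that reading applies here (the abstract does not state the underlying error rates, so the symbols below are placeholders), the reported figure would correspond to:

```latex
\[
\text{relative error reduction}
  = \frac{e_{\text{KNOWITALL}} - e_{\text{TEXTRUNNER}}}{e_{\text{KNOWITALL}}}
  \approx 0.33
\]
```

where $e_{\text{KNOWITALL}}$ and $e_{\text{TEXTRUNNER}}$ denote each system's error rate on the comparable set of extractions.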


Published In

IJCAI'07: Proceedings of the 20th international joint conference on Artificial intelligence

January 2007

2953 pages

Editors:
Rajeev Sangal (International Institute of Information Technology, Hyderabad),
Harish Mehta (Onward Technologies Limited),
R. K. Bagga (International Institute of Information Technology, Hyderabad)

Sponsors

The International Joint Conferences on Artificial Intelligence, Inc.

Publisher

Morgan Kaufmann Publishers Inc.

San Francisco, CA, United States
