Open information extraction from the web

Published: 06 January 2007

Abstract

Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations.
This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries.
We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
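
The abstract describes tuples that are assigned a probability and indexed so that users can query them. The sketch below illustrates that idea only; it is not TEXTRUNNER's implementation, which the abstract does not detail. The names `Extraction` and `TupleIndex`, the token-level inverted index, and the example tuples and probabilities are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """A relational tuple (arg1, relation, arg2) with an assigned probability."""
    arg1: str
    relation: str
    arg2: str
    probability: float


class TupleIndex:
    """Hypothetical inverted index: token -> ids of the tuples containing it."""

    def __init__(self):
        self._tuples = []
        self._postings = defaultdict(set)

    def add(self, extraction):
        # Index every lowercase token of the tuple's three fields.
        tuple_id = len(self._tuples)
        self._tuples.append(extraction)
        text = f"{extraction.arg1} {extraction.relation} {extraction.arg2}"
        for token in text.lower().split():
            self._postings[token].add(tuple_id)

    def query(self, *terms):
        """Return tuples containing every term, most probable first."""
        if not terms:
            return []
        id_sets = [self._postings.get(t.lower(), set()) for t in terms]
        matching = set.intersection(*id_sets)
        hits = [self._tuples[i] for i in matching]
        return sorted(hits, key=lambda e: e.probability, reverse=True)


# Illustrative tuples and probabilities, invented for this sketch
# (they are not taken from the paper's results).
index = TupleIndex()
index.add(Extraction("Paris", "is the capital of", "France", 0.95))
index.add(Extraction("Berlin", "is the capital of", "Germany", 0.90))
index.add(Extraction("Paris", "is located in", "Texas", 0.40))

for hit in index.query("paris"):
    print(hit.arg1, "|", hit.relation, "|", hit.arg2, "|", hit.probability)
```

Ranking hits by probability is one plausible way to surface the most reliable tuples first; a real system at Web scale would need a disk-backed or distributed index rather than in-memory sets.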

Published In

IJCAI'07: Proceedings of the 20th International Joint Conference on Artificial Intelligence
January 2007
2953 pages
Editors: Rajeev Sangal, Harish Mehta, R. K. Bagga

Sponsors

  • The International Joint Conferences on Artificial Intelligence, Inc.

Publisher

Morgan Kaufmann Publishers Inc.

San Francisco, CA, United States
