
Document Spanners: A Formal Approach to Information Extraction

Published: 06 May 2015

Abstract

An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a document spanner (or just spanner for short). A spanner maps an input string into a relation over the spans (intervals specified by bounding indices) of the string. The focus of this article is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables.
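To make the notion of a spanner concrete, here is a minimal sketch in Python of a one-variable spanner: a function that maps an input string to a relation whose tuples assign a span to a variable. The names `Span` and `unary_spanner` are illustrative assumptions, not notation from the article, and the sketch uses Python's 0-based, end-exclusive indices rather than any particular indexing convention of the article. It enumerates every span whose substring fully matches a regular expression, which is a brute-force illustration rather than an efficient evaluation strategy.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive, 0-based (assumption: Python slice convention)
    end: int    # exclusive


def unary_spanner(pattern: str, var: str) -> Callable[[str], List[Dict[str, Span]]]:
    """Build a spanner with a single variable `var` (hypothetical helper).

    The returned function maps a string to a relation over spans: one
    tuple for every span whose substring fully matches `pattern`.
    """
    compiled = re.compile(pattern)

    def apply(doc: str) -> List[Dict[str, Span]]:
        n = len(doc)
        return [
            {var: Span(i, j)}
            for i in range(n + 1)
            for j in range(i, n + 1)
            # fullmatch with pos/endpos tests exactly the substring doc[i:j]
            if compiled.fullmatch(doc, i, j)
        ]

    return apply


digits = unary_spanner(r"[0-9]+", "num")
for row in digits("ab 12 c3"):
    span = row["num"]
    print(span, repr("ab 12 c3"[span.start:span.end]))
```

Note that the relation contains overlapping spans (for example, both digits of "12" individually and the pair together), because the spanner returns every span that satisfies the pattern, not only the leftmost or longest match.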
We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners—the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
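The algebraic layer described above can be sketched as operations on span relations. The following Python fragment is an illustration under stated assumptions: relations are modelled as lists of dictionaries from variable names to `(start, end)` pairs (0-based, end exclusive), and the function names `natural_join` and `select_string_eq` are hypothetical. Natural join stands in for one of the standard relational operators, and `select_string_eq` illustrates string-equality selection, which compares the substrings that two spans cover rather than the spans themselves.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]            # (start, end), 0-based, end exclusive (assumption)
Relation = List[Dict[str, Span]]  # each dict assigns a span to every variable


def natural_join(r1: Relation, r2: Relation) -> Relation:
    """Combine pairs of tuples that agree on all shared variables."""
    out: Relation = []
    for t1 in r1:
        for t2 in r2:
            if all(t1[v] == t2[v] for v in t1.keys() & t2.keys()):
                out.append({**t1, **t2})
    return out


def select_string_eq(doc: str, r: Relation, x: str, y: str) -> Relation:
    """String-equality selection: keep tuples whose x and y spans
    cover equal substrings of `doc`, even if the spans differ."""
    return [t for t in r if doc[t[x][0]:t[x][1]] == doc[t[y][0]:t[y][1]]]


doc = "abab"
r1 = [{"x": (0, 2)}]
r2 = [{"y": (2, 4)}, {"y": (1, 3)}]

joined = natural_join(r1, r2)
print(select_string_eq(doc, joined, "x", "y"))
```

In this toy input, the spans `(0, 2)` and `(2, 4)` are different intervals that both cover the substring "ab", so the selection keeps that tuple and discards the one pairing "ab" with "ba".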


Information & Contributors

Information

Published In

Journal of the ACM, Volume 62, Issue 2
May 2015
304 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/2772377
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 May 2015
Accepted: 01 October 2014
Revised: 01 October 2014
Received: 01 August 2013
Published in JACM Volume 62, Issue 2

Author Tags

  1. Information extraction
  2. document spanners
  3. finite-state automata
  4. regular expressions

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 83
  • Downloads (last 6 weeks): 5

Download counts reflect activity up to 09 Nov 2024.

Citations

Cited By

  • (2024) Automatic Extraction and Cluster Analysis of Natural Disaster Metadata Based on the Unified Metadata Framework. ISPRS International Journal of Geo-Information 13:6 (201). DOI: 10.3390/ijgi13060201. Online publication date: 14-Jun-2024
  • (2024) SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow. Proceedings of the VLDB Endowment 17:12 (4281-4284). DOI: 10.14778/3685800.3685855. Online publication date: 1-Aug-2024
  • (2024) Revisiting Weighted Information Extraction: A Simpler and Faster Algorithm for Ranked Enumeration. Proceedings of the ACM on Management of Data 2:5 (1-19). DOI: 10.1145/3695840. Online publication date: 7-Nov-2024
  • (2024) Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC. Proceedings of the ACM on Management of Data 2:2 (1-18). DOI: 10.1145/3651143. Online publication date: 14-May-2024
  • (2024) Demonstrating REmatch: A Novel RegEx Engine for Finding all Matches. Companion of the 2024 International Conference on Management of Data (448-451). DOI: 10.1145/3626246.3654746. Online publication date: 9-Jun-2024
  • (2024) Mitigating Data Sparsity in Integrated Data through Text Conceptualization. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3490-3504). DOI: 10.1109/ICDE60146.2024.00269. Online publication date: 13-May-2024
  • (2024) Languages Generated by Conjunctive Query Fragments of FC[REG]. Theory of Computing Systems. DOI: 10.1007/s00224-024-10198-4. Online publication date: 10-Oct-2024
  • (2024) The Information Extraction Framework of Document Spanners - A Very Informal Survey. SOFSEM 2024: Theory and Practice of Computer Science (3-22). DOI: 10.1007/978-3-031-52113-3_1. Online publication date: 7-Feb-2024
  • (2023) Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages. Proceedings of the VLDB Endowment 16:11 (3098-3110). DOI: 10.14778/3611479.3611511. Online publication date: 24-Aug-2023
  • (2023) REmatch: A Novel Regex Engine for Finding All Matches. Proceedings of the VLDB Endowment 16:11 (2792-2804). DOI: 10.14778/3611479.3611488. Online publication date: 24-Aug-2023
