research-article

Constant-Delay Enumeration for Nondeterministic Document Spanners

Published: 04 September 2020

Abstract

One of the classical tasks in information extraction is to extract subparts of texts through regular expressions. In the database theory literature, this approach has been generalized and formalized as document spanners. In this model, extraction is performed by evaluating a particular kind of automaton, called a sequential variable-set automaton (VA). The efficiency of this task is then measured in the context of enumeration algorithms: we first run a preprocessing phase that computes a compact representation of the answers, and then produce the results one after the other with a short time between consecutive answers, called the delay of the enumeration. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., a constant delay that does not depend on the document. We present such an algorithm for a variant of VAs called extended sequential VAs and give an experimental evaluation of this algorithm.
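
To make the notion of "answers" concrete, the following is a minimal Python sketch of what a single-variable spanner outputs: every span [i, j) of the document whose text matches a regular expression. It is a brute-force illustration only, not the algorithm of the paper. The function name `enumerate_spans` and the half-open span convention are assumptions made for this example. The loop offers neither a separate preprocessing phase nor a constant delay, since the time between two consecutive answers can grow with the document length.

```python
import re
from typing import Iterator, Tuple

# A span is a half-open interval [start, end) of positions in the document.
# (Convention chosen for this sketch; not taken from the paper.)
Span = Tuple[int, int]


def enumerate_spans(pattern: str, document: str) -> Iterator[Span]:
    """Yield every span [i, j) whose substring fully matches `pattern`.

    Naive baseline: tries all O(n^2) candidate spans, so the delay between
    two consecutive answers is NOT constant in the document size.
    """
    compiled = re.compile(pattern)
    n = len(document)
    for i in range(n + 1):
        for j in range(i, n + 1):
            # fullmatch with pos/endpos tests the substring document[i:j]
            if compiled.fullmatch(document, i, j):
                yield (i, j)


if __name__ == "__main__":
    # All non-empty runs of 'a' in the document "aab".
    print(list(enumerate_spans("a+", "aab")))  # → [(0, 1), (0, 2), (1, 2)]
```

Because the function is a generator, answers are produced one after the other, which mirrors the enumeration setting described above; what the sketch lacks is any bound on the work done between two consecutive `yield` statements.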


Information & Contributors

Information

Published In

ACM SIGMOD Record, Volume 49, Issue 1
March 2020
72 pages
ISSN:0163-5808
DOI:10.1145/3422648
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2020
Published in SIGMOD Volume 49, Issue 1

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 4
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 26 Dec 2024

Citations

Cited By

  • (2024) Revisiting Weighted Information Extraction: A Simpler and Faster Algorithm for Ranked Enumeration. Proceedings of the ACM on Management of Data 2:5 (1-19). DOI: 10.1145/3695840. Online publication date: 7-Nov-2024
  • (2024) Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC. Proceedings of the ACM on Management of Data 2:2 (1-18). DOI: 10.1145/3651143. Online publication date: 14-May-2024
  • (2024) The Information Extraction Framework of Document Spanners - A Very Informal Survey. SOFSEM 2024: Theory and Practice of Computer Science (3-22). DOI: 10.1007/978-3-031-52113-3_1. Online publication date: 19-Feb-2024
  • (2023) Enumerating grammar-based extractions. Discrete Applied Mathematics 341 (372-392). DOI: 10.1016/j.dam.2023.08.014. Online publication date: Dec-2023
  • (2022) Document Spanners - A Brief Overview of Concepts, Results, and Recent Developments. Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (139-150). DOI: 10.1145/3517804.3526069. Online publication date: 12-Jun-2022
  • (2021) Spanner Evaluation over SLP-Compressed Documents. Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (153-165). DOI: 10.1145/3452021.3458325. Online publication date: 20-Jun-2021
  • (2020) Formal Languages in Information Extraction and Graph Databases. Beyond the Horizon of Computability (306-309). DOI: 10.1007/978-3-030-51466-2_28. Online publication date: 24-Jun-2020
