research-article

Database Principles and Challenges in Text Analysis

Authors: Johannes Doleschal, Benny Kimelfeld, Wim Martens

ACM SIGMOD Record, Volume 50, Issue 2

Pages 6 - 17

https://doi.org/10.1145/3484622.3484624

Published: 31 August 2021

Abstract

A common conceptual view of text analysis is that of a two-step process: we first extract relations from text documents and then apply a relational query over the result. Hence, text analysis shares technical challenges with, and can draw ideas from, relational databases. A framework that formally instantiates this connection is that of document spanners. In this article, we review recent advances in research efforts that adapt fundamental database concepts to text analysis through the lens of document spanners. Among other topics, we discuss query evaluation, aggregate queries, provenance, and distributed query planning.
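To make the two-step view concrete, the following is a minimal sketch in Python and is not taken from the article. It assumes a toy extractor built on named regex groups, in which each match yields one tuple that maps a variable to a span of the document (a pair of offsets); a relational-style aggregate is then computed over the extracted relation. The names `Span`, `extract`, `span_text`, `DOC`, and `PATTERN` are hypothetical. Note one simplification: `re.finditer` returns only non-overlapping leftmost matches, whereas a document spanner in the formal sense may return every valid assignment of spans.

```python
import re
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset into the document
    end: int    # exclusive character offset into the document


def extract(pattern: str, doc: str) -> list[dict[str, Span]]:
    """Step 1 (extraction): turn each regex match into one tuple that
    maps every named group (variable) to the span it matched."""
    return [
        {var: Span(*m.span(var)) for var in m.re.groupindex}
        for m in re.finditer(pattern, doc)
    ]


def span_text(doc: str, span: Span) -> str:
    """Return the substring of the document covered by a span."""
    return doc[span.start:span.end]


# Hypothetical input document and extraction pattern.
DOC = "Ann works at Acme. Bob works at Initech. Cy works at Acme."
PATTERN = r"(?P<person>[A-Z][a-z]+) works at (?P<org>[A-Z][a-z]+)"

# Step 1: a relation with attributes "person" and "org", whose values are spans.
relation = extract(PATTERN, DOC)

# Step 2 (relational query): group by organization and count the tuples.
counts = Counter(span_text(DOC, t["org"]) for t in relation)

for org, n in sorted(counts.items()):
    print(org, n)
```

Keeping spans rather than extracted strings in the relation preserves the position of each value in the document, so two occurrences of the same string remain distinct tuples until a query explicitly maps them to their text.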


Cited By

C. Riveros, N. Van Sint Jan, and D. Vrgoč. REmatch: A Novel Regex Engine for Finding All Matches. Proceedings of the VLDB Endowment, 16(11):2792--2804, 2023. Online publication date: 1-Jul-2023.
https://dl.acm.org/doi/10.14778/3611479.3611488
L. Peterfreund. Enumerating grammar-based extractions. Discrete Applied Mathematics, 341(C):372--392, 2023. Online publication date: 31-Dec-2023.
https://dl.acm.org/doi/10.1016/j.dam.2023.08.014


Published In


ACM SIGMOD Record Volume 50, Issue 2

June 2021

42 pages

ISSN:0163-5808

DOI:10.1145/3484622

Editors:
Rada Chirkova, North Carolina State University
Vanessa Braganholo, Universidade Federal Fluminense
Wim Martens, University of Bayreuth
Divesh Srivastava, ATT research
Marcelo Arenas, Research Highlights
Marianne Winslett, University of Illinois
Jun Yang, Duke University
Azza Abouzied, NYU
Lyublena Antova, Datometry
Aaron J. Elmore, University of Chicago
Kyriakos Mouratidis, Singapore Management University
Dan Olteanu, University of Oxford
Immanuel Trummer, Cornell University
Yannis Velegrakis, Utrecht University
Renata Borovica-Gajic, Surveys
Tamer Özsu, University of Waterloo
Pınar Tözün, IT University of Copenhagen


Copyright © 2021 Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States
