
Database Principles and Challenges in Text Analysis

Published: 31 August 2021

    Abstract

    A common conceptual view of text analysis is that of a two-step process, where we first extract relations from text documents and then apply a relational query over the result. Hence, text analysis shares technical challenges with, and can draw ideas from, relational databases. A framework that formally instantiates this connection is that of document spanners. In this article, we review recent advances in various research efforts that adapt fundamental database concepts to text analysis through the lens of document spanners. Among other topics, we discuss aspects of query evaluation, aggregate queries, provenance, and distributed query planning.
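The two-step view described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the article: the sample document, the two patterns, and the helper names `Span`, `extract`, and `text` are all assumptions made for illustration. Step 1 turns regex matches into relations whose values are spans (start and end offsets into the document); step 2 runs an ordinary relational query (a join and a projection) over those relations. One simplification to keep in mind: `re.finditer` returns only non-overlapping, leftmost matches, whereas a document spanner is defined to return every tuple of spans that matches.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    """A region of the document: start offset inclusive, end offset exclusive."""
    start: int
    end: int


def extract(pattern: str, doc: str) -> list[dict[str, Span]]:
    """Step 1: build a relation with one attribute per named group and one row per match."""
    rows = []
    for m in re.finditer(pattern, doc):
        rows.append({var: Span(*m.span(var)) for var in m.groupdict()})
    return rows


def text(doc: str, span: Span) -> str:
    """Return the substring of the document that a span points to."""
    return doc[span.start:span.end]


# Hypothetical document and extraction patterns, chosen only for illustration.
DOC = "Ann lives in Paris. Bob lives in Oslo. Ann works at Acme."
LIVES = r"(?P<person>[A-Z][a-z]+) lives in (?P<city>[A-Z][a-z]+)"
WORKS = r"(?P<person>[A-Z][a-z]+) works at (?P<org>[A-Z][a-z]+)"

lives = extract(LIVES, DOC)
works = extract(WORKS, DOC)

# Step 2: a relational query over the extracted relations.
# Join on equality of the strings under the person spans, then project city and org.
result = [
    (text(DOC, l["city"]), text(DOC, w["org"]))
    for l in lives
    for w in works
    if text(DOC, l["person"]) == text(DOC, w["person"])
]

print(result)
```

Note that the join compares the strings under the spans rather than the spans themselves: the two occurrences of "Ann" sit at different offsets, so a join on span equality would return nothing here.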


    Published In

    ACM SIGMOD Record  Volume 50, Issue 2
    June 2021
    42 pages
    ISSN:0163-5808
    DOI:10.1145/3484622
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 August 2021
    Published in SIGMOD Volume 50, Issue 2

    Qualifiers

    • Research-article
