DOI: 10.1145/3485447.3511972
research-article
Open access

StruBERT: Structure-aware BERT for Table Search and Matching

Published: 25 April 2022

    Abstract

    A table is composed of data values organized in rows and columns, which provide implicit structural information. A table is usually accompanied by secondary information, such as the caption and page title, that forms its textual information. Understanding the connection between textual and structural information is an important yet neglected aspect of table retrieval, as previous methods treat each source of information independently. In this paper, we propose StruBERT, a structure-aware BERT model that fuses the textual and structural information of a data table to produce context-aware representations of both its textual and tabular content. We introduce the concept of horizontal self-attention, which extends the idea of vertical self-attention introduced in TaBERT and allows us to treat both dimensions of a table equally. StruBERT features are integrated into a new end-to-end neural ranking model to solve three table-related downstream tasks: keyword-based table retrieval, content-based table retrieval, and table similarity. We evaluate our approach on three datasets and demonstrate substantial improvements in retrieval and classification metrics over state-of-the-art methods.
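
    The distinction between vertical and horizontal self-attention can be illustrated with a minimal sketch. It is not the StruBERT or TaBERT implementation: it assumes each table cell is already a fixed-size vector, uses plain scaled dot-product attention with identity query/key/value projections (no learned weights), and the function names are hypothetical. It only shows along which table axis the attention is applied.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    # Scaled dot-product self-attention over a sequence of vectors.
    # Assumption: identity Q/K/V projections, single head, no learned weights.
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out

def vertical_self_attention(table):
    # Attend along each column: a cell mixes with the cells in the
    # other rows of the same column.
    n_rows, n_cols = len(table), len(table[0])
    result = [[None] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        column = [table[r][c] for r in range(n_rows)]
        for r, vec in enumerate(self_attention(column)):
            result[r][c] = vec
    return result

def horizontal_self_attention(table):
    # Attend along each row: a cell mixes with the cells in the
    # other columns of the same row.
    return [self_attention(row) for row in table]

# Toy table: 2 rows x 3 columns, each cell a 2-dimensional embedding.
table = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]],
]
vertical = vertical_self_attention(table)
horizontal = horizontal_self_attention(table)
```

    Both functions preserve the shape of the table; they differ only in which cells are allowed to exchange information, which is the sense in which the two table dimensions can be treated symmetrically.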




          Published In

          WWW '22: Proceedings of the ACM Web Conference 2022
          April 2022
          3764 pages
          ISBN:9781450390965
          DOI:10.1145/3485447
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. table matching
          2. table search
          3. table similarity

          Qualifiers

          • Research-article
          • Research
          • Refereed limited


          Conference

          WWW '22
          WWW '22: The ACM Web Conference 2022
          April 25 - 29, 2022
          Virtual Event, Lyon, France

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

          Article Metrics

          • Downloads (Last 12 months): 1,584
          • Downloads (Last 6 weeks): 726
          Reflects downloads up to 27 Jul 2024

          Cited By

          • (2024) Rethinking Table Retrieval from Data Lakes. Proceedings of the Seventh International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, 1-5. DOI: 10.1145/3663742.3663972. Online publication date: 14-Jun-2024
          • (2024) A Large Scale Test Corpus for Semantic Table Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1142-1151. DOI: 10.1145/3626772.3657877. Online publication date: 10-Jul-2024
          • (2024) Enhancing Dataset Search with Compact Data Snippets. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1093-1103. DOI: 10.1145/3626772.3657837. Online publication date: 10-Jul-2024
          • (2024) Multi-Intent Attribute-Aware Text Matching in Searching. Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 360-368. DOI: 10.1145/3616855.3635813. Online publication date: 4-Mar-2024
          • (2024) Towards Cross-Table Masked Pretraining for Web Data Mining. Proceedings of the ACM on Web Conference 2024, 4449-4459. DOI: 10.1145/3589334.3645707. Online publication date: 13-May-2024
          • (2023) A Format-sensitive BERT-based Approach to Resume Segmentation. 2023 33rd Conference of Open Innovations Association (FRUCT), 30-37. DOI: 10.23919/FRUCT58615.2023.10143072. Online publication date: 24-May-2023
          • (2023) Simulating Users in Interactive Web Table Retrieval. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 3875-3879. DOI: 10.1145/3583780.3615187. Online publication date: 21-Oct-2023
          • (2023) MGeo: Multi-Modal Geographic Language Model Pre-Training. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 185-194. DOI: 10.1145/3539618.3591728. Online publication date: 19-Jul-2023
          • (2023) Enhancing Table Retrieval with Dual Graph Representations. Machine Learning and Knowledge Discovery in Databases: Research Track, 107-123. DOI: 10.1007/978-3-031-43421-1_7. Online publication date: 18-Sep-2023
