DOI: 10.1145/3626772.3657837
research-article

Enhancing Dataset Search with Compact Data Snippets

Published: 11 July 2024

Abstract

In light of the growing availability and significance of open data, the problem of dataset search has attracted great attention in the field of information retrieval. Nevertheless, current metadata-based approaches have revealed shortcomings due to the low quality and limited availability of dataset metadata, while the magnitude and heterogeneity of actual data have hindered the development of content-based solutions. To address these challenges, we propose to convert different formats of structured data into a unified form, from which we extract a compact data snippet that indicates the relevance of the whole data. Because the snippet is compact, we can feed it into a dense reranker to improve search accuracy. We also convert it back to the original format so that it can be presented to users to help them judge relevance. The effectiveness of our approach has been demonstrated by extensive experiments on two test collections for dataset search.
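
The pipeline described in the abstract (unify formats, extract a compact snippet, pass it to a reranker) can be illustrated with a minimal sketch. The abstract does not specify the unified form or the selection criterion, so everything here is an assumption for illustration and not the authors' method: the unified form is taken to be (row, field, value) triples, the snippet is chosen by a simple greedy term-coverage heuristic, and the function names `csv_to_triples`, `json_to_triples`, `extract_snippet`, and `snippet_to_text` are hypothetical.

```python
import csv
import io
import json


def csv_to_triples(text):
    """Flatten CSV rows into (row_id, field, value) triples (assumed unified form)."""
    reader = csv.DictReader(io.StringIO(text))
    return [(f"row{i}", col, val)
            for i, row in enumerate(reader)
            for col, val in row.items() if val]


def json_to_triples(text):
    """Flatten a JSON list of flat objects into the same triple form."""
    records = json.loads(text)
    return [(f"row{i}", str(key), str(val))
            for i, rec in enumerate(records)
            for key, val in rec.items() if val not in (None, "")]


def terms(triple):
    """Lowercased tokens of a triple's field name and value."""
    _, field, value = triple
    return set(f"{field} {value}".lower().split())


def extract_snippet(triples, k):
    """Greedily pick up to k triples, each adding the most not-yet-covered terms.

    This coverage heuristic is a stand-in, not the paper's extraction algorithm.
    """
    chosen, covered = [], set()
    remaining = list(triples)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda t: len(terms(t) - covered))
        if not terms(best) - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= terms(best)
        remaining.remove(best)
    return chosen


def snippet_to_text(snippet):
    """Serialize a snippet as short text that a reranker could take as input."""
    return " ; ".join(f"{field}: {value}" for _, field, value in snippet)


csv_text = "city,population\nParis,2100000\nParis,2100000\nLyon,516000\n"
snippet = extract_snippet(csv_to_triples(csv_text), k=3)
print(snippet_to_text(snippet))
```

In this sketch the duplicate row contributes nothing new and is skipped, so the short text stays small enough to pair with a query as reranker input. The reranking step itself is left out because the abstract does not name a specific model.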

    Published In

    SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2024
    3164 pages
    ISBN:9798400704314
    DOI:10.1145/3626772
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ad hoc dataset retrieval
    2. data snippet extraction
    3. dataset search

    Qualifiers

    • Research-article

    Funding Sources

    • NSFC

    Conference

    SIGIR 2024

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
