Enhancing Dataset Search with Compact Data Snippets

Authors:

Qiaosheng Chen,

Gong Cheng

SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 1093 - 1103

https://doi.org/10.1145/3626772.3657837

Published: 11 July 2024

Abstract

In light of the growing availability and significance of open data, the problem of dataset search has attracted great attention in the field of information retrieval. Nevertheless, current metadata-based approaches have revealed shortcomings due to the low quality and limited availability of dataset metadata, while the magnitude and heterogeneity of actual data have hindered the development of content-based solutions. To address these challenges, we propose to convert different formats of structured data into a unified form, from which we extract a compact data snippet that indicates the relevance of the whole data. Thanks to its compactness, we feed it into a dense reranker to improve search accuracy. We also convert it back to the original format and present it to users to assist their relevance judgment. The effectiveness of our approach has been demonstrated by extensive experiments on two test collections for dataset search.
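
The pipeline outlined in the abstract (unify formats, extract a compact snippet, serialize it for a reranker, convert it back for display) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the unified form or the extraction algorithm, so the sketch assumes (subject, predicate, object) triples as the unified form and substitutes a simple greedy term-coverage heuristic for snippet extraction. All function names are hypothetical.

```python
import csv
import io
import json
import re


def table_to_triples(csv_text, dataset_id):
    """Flatten CSV rows into (subject, predicate, object) triples (assumed unified form)."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        subject = f"{dataset_id}/row{i}"
        for column, value in row.items():
            if value:  # skip empty or missing cells
                triples.append((subject, column, value))
    return triples


def json_to_triples(json_text, dataset_id):
    """Flatten a JSON array of flat objects into the same triple form."""
    triples = []
    for i, record in enumerate(json.loads(json_text)):
        subject = f"{dataset_id}/item{i}"
        for key, value in record.items():
            triples.append((subject, key, str(value)))
    return triples


def terms(triple):
    """Lower-cased word tokens of a triple's predicate and object."""
    _, predicate, obj = triple
    return set(re.findall(r"\w+", f"{predicate} {obj}".lower()))


def extract_snippet(triples, k):
    """Greedily pick up to k triples that cover the most new terms (stand-in heuristic)."""
    remaining = list(triples)
    covered, snippet = set(), []
    while remaining and len(snippet) < k:
        best = max(remaining, key=lambda t: len(terms(t) - covered))
        if not terms(best) - covered:
            break  # no remaining triple adds a new term
        snippet.append(best)
        covered |= terms(best)
        remaining.remove(best)
    return snippet


def snippet_to_text(snippet):
    """Serialize the snippet as short text that a reranker could take as input."""
    return " ; ".join(f"{p}: {o}" for _, p, o in snippet)


def triples_to_rows(snippet):
    """Regroup snippet triples by subject, i.e. back into partial table rows for display."""
    rows = {}
    for subject, predicate, obj in snippet:
        rows.setdefault(subject, {})[predicate] = obj
    return list(rows.values())


csv_text = "city,population\nParis,2100000\nLyon,520000\n"
snippet = extract_snippet(table_to_triples(csv_text, "demo"), k=2)
print(snippet_to_text(snippet))   # city: Paris ; population: 2100000
print(triples_to_rows(snippet))   # [{'city': 'Paris', 'population': '2100000'}]
```

Because the snippet is bounded by `k`, its serialized text stays short enough for a length-limited reranker input regardless of the size of the underlying dataset, which is the property the abstract attributes to compactness.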

Index Terms

Information systems → Information retrieval → Specialized information retrieval

Information & Contributors

Information

Published In

SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2024

3164 pages

ISBN: 9798400704314

DOI: 10.1145/3626772

General Chairs: Grace Hui Yang (Georgetown University, USA), Hongning Wang (Tsinghua University, China), Sam Han (The Washington Post, USA)

Program Chairs: Claudia Hauff (Spotify, Netherlands), Guido Zuccon (The University of Queensland, Australia), Yi Zhang (University of California Santa Cruz, USA)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSFC

Conference

SIGIR 2024

Sponsor: SIGIR

SIGIR 2024: The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 14 - 18, 2024

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
