DOI: 10.1145/3400903.3400919
research-article

A Versatile Hypergraph Model for Document Collections

Published: 30 July 2020

Abstract

Efficiently and effectively representing large collections of text is of central importance to information retrieval tasks such as summarization and search. Since models for these tasks frequently rely on an implicit graph structure of the documents or their contents, graph-based document representations are naturally appealing. For tasks that consider the joint occurrence of words or entities, however, existing document representations often fall short in capturing cooccurrences of higher order, higher multiplicity, or at varying proximity levels. Furthermore, while numerous applications benefit from structured knowledge sources, external data sources are rarely considered as integral parts of existing document models.
To address these shortcomings, we introduce heterogeneous hypergraphs as a versatile model for representing annotated document collections. We integrate external metadata, document content, entity and term annotations, and document segmentation at different granularity levels in a joint model that bridges the gap between structured and unstructured data. We discuss selection and transformation operations on the set of hyperedges, which can be chained to support a wide range of query scenarios. To ensure compatibility with established information retrieval methods, we discuss projection operations that transform hyperedges to traditional dyadic cooccurrence graph representations. Using PostgreSQL and Neo4j, we investigate the suitability of existing database systems for implementing the hypergraph document model, and explore the impact of utilizing implicit and materialized hyperedge representations on storage space requirements and query performance.
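
The chaining of selection and projection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names Hyperedge, select, and project, the "level" label, and the toy data are all assumptions made for illustration. Each hyperedge is treated as a plain set of typed nodes, so multiplicity within a hyperedge is ignored here.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Tuple

# A node is a (type, label) pair, e.g. ("entity", "Berlin"). Hypothetical encoding.
Node = Tuple[str, str]


@dataclass(frozen=True)
class Hyperedge:
    """One cooccurrence context; 'level' is an assumed granularity label."""
    level: str
    nodes: FrozenSet[Node]


def select(edges: Iterable[Hyperedge],
           predicate: Callable[[Hyperedge], bool]) -> List[Hyperedge]:
    """Selection: keep only the hyperedges that satisfy the predicate."""
    return [e for e in edges if predicate(e)]


def project(edges: Iterable[Hyperedge]) -> Counter:
    """Projection: count, for every unordered node pair, the hyperedges containing both."""
    counts: Counter = Counter()
    for e in edges:
        # Sorting gives each unordered pair one canonical (u, v) key.
        for u, v in combinations(sorted(e.nodes), 2):
            counts[(u, v)] += 1
    return counts


# Toy data, invented for this example.
berlin, germany = ("entity", "Berlin"), ("entity", "Germany")
capital, river = ("term", "capital"), ("term", "river")

edges = [
    Hyperedge("sentence", frozenset({berlin, germany, capital})),
    Hyperedge("sentence", frozenset({berlin, germany})),
    Hyperedge("document", frozenset({berlin, capital, river})),
]

# Chain the two operations: restrict to sentence-level hyperedges, then project.
sentence_edges = select(edges, lambda e: e.level == "sentence")
graph = project(sentence_edges)
```

In this sketch the projected graph is a weighted edge list, which is the shape that traditional dyadic cooccurrence methods expect; the higher-order information (which three nodes appeared together) is lost in the projection, which is the trade-off the hypergraph model is meant to avoid until projection is actually needed.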

Cited By

  • (2022) A Hypergraph Approach for Estimating Growth Mechanisms of Complex Networks. IEEE Access 10, 35012-35025. DOI: 10.1109/ACCESS.2022.3143612. Online publication date: 2022
  • (2022) Vec2Node: Self-Training with Tensor Augmentation for Text Classification with Few Labels. Machine Learning and Knowledge Discovery in Databases, 571-587. DOI: 10.1007/978-3-031-26390-3_33. Online publication date: 19-Sep-2022

Published In

SSDBM '20: Proceedings of the 32nd International Conference on Scientific and Statistical Database Management
July 2020
241 pages
ISBN: 9781450388146
DOI: 10.1145/3400903
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSDBM 2020

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%
