A Versatile Hypergraph Model for Document Collections

Authors:

Dennis Aumiller,

Bálint Soproni,

Michael Gertz

SSDBM '20: Proceedings of the 32nd International Conference on Scientific and Statistical Database Management

Article No.: 7, Pages 1 - 12

https://doi.org/10.1145/3400903.3400919

Published: 30 July 2020

Abstract

Efficiently and effectively representing large collections of text is of central importance to information retrieval tasks such as summarization and search. Since models for these tasks frequently rely on an implicit graph structure of the documents or their contents, graph-based document representations are naturally appealing. For tasks that consider the joint occurrence of words or entities, however, existing document representations often fall short in capturing cooccurrences of higher order, higher multiplicity, or at varying proximity levels. Furthermore, while numerous applications benefit from structured knowledge sources, external data sources are rarely considered as integral parts of existing document models.

To address these shortcomings, we introduce heterogeneous hypergraphs as a versatile model for representing annotated document collections. We integrate external metadata, document content, entity and term annotations, and document segmentation at different granularity levels in a joint model that bridges the gap between structured and unstructured data. We discuss selection and transformation operations on the set of hyperedges, which can be chained to support a wide range of query scenarios. To ensure compatibility with established information retrieval methods, we discuss projection operations that transform hyperedges to traditional dyadic cooccurrence graph representations. Using PostgreSQL and Neo4j, we investigate the suitability of existing database systems for implementing the hypergraph document model, and explore the impact of utilizing implicit and materialized hyperedge representations on storage space requirements and query performance.
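The chaining of selection, transformation, and projection described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the authors' implementation: the names `Hyperedge`, `select`, `restrict`, and `project`, the sample data, and the choice to weight a projected edge by the number of hyperedges containing both endpoints are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable

# All names below are hypothetical and chosen for illustration only.
Node = tuple[str, str]  # (node type, label), e.g. ("entity", "Vienna")


@dataclass
class Hyperedge:
    """One text segment: a multiset of annotated nodes plus metadata."""
    doc_id: str
    level: str                    # segmentation granularity, e.g. "sentence"
    nodes: Counter                # Node -> multiplicity within the segment
    meta: dict = field(default_factory=dict)  # external metadata


def select(edges: list[Hyperedge],
           predicate: Callable[[Hyperedge], bool]) -> list[Hyperedge]:
    """Selection: keep the hyperedges that satisfy the predicate."""
    return [e for e in edges if predicate(e)]


def restrict(edges: list[Hyperedge], node_types: set[str]) -> list[Hyperedge]:
    """Transformation: keep only nodes of the given types in each hyperedge."""
    result = []
    for e in edges:
        kept = Counter({n: c for n, c in e.nodes.items() if n[0] in node_types})
        if kept:  # drop hyperedges that become empty
            result.append(Hyperedge(e.doc_id, e.level, kept, e.meta))
    return result


def project(edges: list[Hyperedge]) -> Counter:
    """Projection: dyadic cooccurrence graph as (u, v) -> number of
    hyperedges containing both u and v (one possible weighting)."""
    graph = Counter()
    for e in edges:
        for u, v in combinations(sorted(e.nodes), 2):
            graph[(u, v)] += 1
    return graph


vienna, ssdbm = ("entity", "Vienna"), ("entity", "SSDBM")
conference = ("term", "conference")

edges = [
    Hyperedge("d1", "sentence",
              Counter({vienna: 1, ssdbm: 1, conference: 2}), {"year": 2020}),
    Hyperedge("d1", "sentence",
              Counter({vienna: 1, conference: 1}), {"year": 2020}),
    Hyperedge("d2", "sentence",
              Counter({vienna: 1, ssdbm: 1}), {"year": 2019}),
]

# Chain: select by metadata, restrict to entities, project to a dyadic graph.
recent = select(edges, lambda e: e.meta.get("year") == 2020)
entity_graph = project(restrict(recent, {"entity"}))
full_graph = project(edges)
```

Because each operation takes and returns a plain list of hyperedges, the operations compose in any order, and the projection step is only needed when a downstream method expects an ordinary weighted graph.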

Cited By

M. Inoue, T. Pham, and H. Shimodaira. 2022. A Hypergraph Approach for Estimating Growth Mechanisms of Complex Networks. IEEE Access 10 (2022), 35012-35025. https://doi.org/10.1109/ACCESS.2022.3143612

S. Abdali, S. Mukherjee, and E. Papalexakis. 2022. Vec2Node: Self-Training with Tensor Augmentation for Text Classification with Few Labels. In Machine Learning and Knowledge Discovery in Databases, 571-587. Online publication date: 19 September 2022. https://dl.acm.org/doi/10.1007/978-3-031-26390-3_33

Published In

SSDBM '20: Proceedings of the 32nd International Conference on Scientific and Statistical Database Management

July 2020

241 pages

ISBN: 9781450388146

DOI: 10.1145/3400903

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SSDBM 2020

SSDBM 2020: 32nd International Conference on Scientific and Statistical Database Management

July 7 - 9, 2020

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%
