TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters

Authors:

Hwanjo Yu

WWW '22: Proceedings of the ACM Web Conference 2022

Pages 2819 - 2829

https://doi.org/10.1145/3485447.3512002

Published: 25 April 2022

Abstract

Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct a topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage partial (or incomplete) information about the topic structure as guidance to discover the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises embedding and clustering techniques that are closely linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns each term to either one of the known sub-topics or one of the novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates a high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines on a downstream task.
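To make the idea of assigning terms to known or novel sub-topics concrete, the following is a minimal sketch and not the TaxoCom implementation. It assumes that terms and known sub-topic centers are already embedded as vectors, that a term counts as novel when its best cosine similarity to any known center falls below a cutoff, and that novel terms are then grouped with a plain spherical k-means. The function names, the cutoff value of 0.7, and the choice of spherical k-means are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def assign_known_or_novel(term_vecs, known_centers, threshold=0.7):
    """Return the index of the closest known sub-topic per term, or -1 for novel.

    `threshold` is a hypothetical cosine-similarity cutoff, not a value from the paper.
    """
    sims = normalize(term_vecs) @ normalize(known_centers).T
    best = sims.argmax(axis=1)
    best_sim = sims.max(axis=1)
    return np.where(best_sim >= threshold, best, -1)


def spherical_kmeans(vecs, k, n_iter=20, seed=0):
    """Group unit vectors by cosine similarity (assumed stand-in for novel-cluster discovery)."""
    vecs = normalize(vecs)
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points.
    centers = vecs[rng.choice(len(vecs), size=k, replace=False)]
    for _ in range(n_iter):
        labels = (vecs @ centers.T).argmax(axis=1)
        for j in range(k):
            members = vecs[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = normalize(members.sum(axis=0, keepdims=True))[0]
    return labels


# Toy 2-D example: two known sub-topics and six terms.
known = np.array([[1.0, 0.0], [0.0, 1.0]])
terms = np.array([
    [0.9, 0.1], [0.1, 0.95],       # close to the known sub-topics
    [-1.0, 0.05], [-0.9, -0.1],    # one unseen direction
    [0.05, -1.0], [-0.1, -0.9],    # another unseen direction
])

assignment = assign_known_or_novel(terms, known)
novel_labels = spherical_kmeans(terms[assignment == -1], k=2)
print(assignment)
print(novel_labels)
```

In this toy setting the first two terms attach to the known sub-topics, and the remaining four are flagged as novel and split into two groups. A real system would need a principled way to choose the cutoff and the number of novel clusters, which this sketch leaves as fixed inputs.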

Index Terms

TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.

Published In

WWW '22: Proceedings of the ACM Web Conference 2022

April 2022

3764 pages

ISBN:9781450390965

DOI:10.1145/3485447

Editors:
Frédérique Laforest (INSA Lyon, France)
Raphaël Troncy (EURECOM, France)
Elena Simperl (King’s College London, UK)
Deepak Agarwal (Pinterest, USA)
Aristides Gionis (KTH Royal Institute of Technology, Sweden)
Ivan Herman (W3C / retired)
Lionel Médini (Université Lyon 1, France)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '22

Sponsor:

SIGWEB

WWW '22: The ACM Web Conference 2022

April 25 - 29, 2022

Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 5
Total Downloads: 536
Downloads (Last 12 months): 185
Downloads (Last 6 weeks): 14

Reflects downloads up to 27 Jul 2024

Cited By

Zhang Y, Wan C, Xiao K, Wan Q, Liu D, Liu X. (2023). rHDP: An Aspect Sharing-Enhanced Hierarchical Topic Model for Multi-Domain Corpus. ACM Transactions on Information Systems 42:3 (1-31). Online publication date: 29-Dec-2023.
https://dl.acm.org/doi/10.1145/3631352

Yoon S, Meng Y, Lee D, Han J. (2023). SCStory: Self-supervised and Continual Online Story Discovery. Proceedings of the ACM Web Conference 2023 (1853-1864). Online publication date: 30-Apr-2023.
https://dl.acm.org/doi/10.1145/3543507.3583507

Yoon S, Chan H, Han J. (2023). PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream. Proceedings of the ACM Web Conference 2023 (1650-1661). Online publication date: 30-Apr-2023.
https://dl.acm.org/doi/10.1145/3543507.3583371

Yoon S, Lee D, Zhang Y, Han J, Chen H, Duh W, Huang H, Kato M, Mothe J, Poblete B. (2023). Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (802-811). Online publication date: 19-Jul-2023.
https://dl.acm.org/doi/10.1145/3539618.3591782

Zhao Q, Fan L, Zhang Y, Li J, Shi Y, Rao W, Liu X. (2023). DualTaxoVec. Knowledge-Based Systems 271:C. Online publication date: 8-Jul-2023.
https://dl.acm.org/doi/10.1016/j.knosys.2023.110565
