DOI: 10.1145/3485447.3512002
research-article

TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters

Published: 25 April 2022

Abstract

    Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct a topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage partial (or incomplete) information about the topic structure as guidance for discovering the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises its embedding and clustering techniques to be closely linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns each term to either one of the known sub-topics or one of the novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates a high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines on a downstream task.
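    The general idea behind step (ii), assigning a term either to a known sub-topic or to a novel one, can be illustrated with a small sketch. This is not the TaxoCom algorithm: the abstract does not specify the clustering procedure, so the sketch substitutes a fixed cosine-similarity threshold plus greedy leader clustering. The function names, the toy 3-dimensional vectors, the topic labels, and the threshold of 0.8 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_terms(term_vecs, known_centers, threshold=0.8):
    """Assign each term to its most similar known sub-topic, or flag it as novel.

    Hypothetical helper: a term whose best similarity falls below `threshold`
    is treated as not belonging to any known sub-topic.
    Assumes `known_centers` is non-empty.
    """
    assignments, novel = {}, []
    for term, vec in term_vecs.items():
        best_topic, best_sim = max(
            ((topic, cosine(vec, center)) for topic, center in known_centers.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            assignments[term] = best_topic
        else:
            novel.append(term)
    return assignments, novel


def group_novel(novel_terms, term_vecs, threshold=0.8):
    """Group novel terms with greedy leader clustering (an illustrative stand-in).

    The first term of each cluster acts as its leader; a later term joins the
    first cluster whose leader it is similar enough to, else starts a new one.
    """
    clusters = []  # list of (leader_vector, member_terms)
    for term in novel_terms:
        vec = term_vecs[term]
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(term)
                break
        else:
            clusters.append((vec, [term]))
    return [members for _, members in clusters]


# Toy data (made up): two known sub-topics and four terms.
term_vecs = {
    "svm": (0.9, 0.1, 0.0),
    "sql": (0.1, 0.9, 0.0),
    "tcp": (0.0, 0.1, 0.9),
    "router": (0.1, 0.0, 0.9),
}
known_centers = {
    "machine_learning": (1.0, 0.0, 0.0),
    "databases": (0.0, 1.0, 0.0),
}

assignments, novel = assign_terms(term_vecs, known_centers)
clusters = group_novel(novel, term_vecs)
print(assignments)
print(clusters)
```

    In this toy setup, "svm" and "sql" stay with the known sub-topics, while "tcp" and "router" match neither and end up together in one candidate novel cluster. A fixed threshold is the simplest possible choice; the paper's clustering is described as novelty adaptive, which a fixed threshold does not capture.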

          Published In

          WWW '22: Proceedings of the ACM Web Conference 2022
          April 2022
          3764 pages
          ISBN:9781450390965
          DOI:10.1145/3485447
          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. Hierarchical topic discovery
          2. Novelty detection
          3. Text clustering
          4. Text embedding
          5. Topic taxonomy completion

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

          WWW '22: The ACM Web Conference 2022
          April 25 - 29, 2022
          Virtual Event, Lyon, France

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

          Article Metrics

          • Downloads (last 12 months): 185
          • Downloads (last 6 weeks): 14
          Reflects downloads up to 27 Jul 2024

          Cited By

          • (2023) rHDP: An Aspect Sharing-Enhanced Hierarchical Topic Model for Multi-Domain Corpus. ACM Transactions on Information Systems 42:3 (1-31). DOI: 10.1145/3631352. Online publication date: 29-Dec-2023
          • (2023) SCStory: Self-supervised and Continual Online Story Discovery. Proceedings of the ACM Web Conference 2023 (1853-1864). DOI: 10.1145/3543507.3583507. Online publication date: 30-Apr-2023
          • (2023) PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream. Proceedings of the ACM Web Conference 2023 (1650-1661). DOI: 10.1145/3543507.3583371. Online publication date: 30-Apr-2023
          • (2023) Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (802-811). DOI: 10.1145/3539618.3591782. Online publication date: 19-Jul-2023
          • (2023) DualTaxoVec. Knowledge-Based Systems 271:C. DOI: 10.1016/j.knosys.2023.110565. Online publication date: 8-Jul-2023
