Seed-Guided Topic Model for Document Filtering and Classification

Published: 06 December 2018

Abstract

An important task is to filter out irrelevant information and organize the relevant information into meaningful categories. However, developing text classifiers often requires a large number of labeled documents as training examples, and manually labeling documents is costly and time-consuming. More importantly, it is unrealistic to know beforehand all the categories covered by the documents. Recently, a few methods have been proposed to label documents by using a small set of relevant keywords for each category, an approach known as dataless text classification. In this article, we propose a seed-guided topic model for dataless text filtering and classification, named DFC. Given a collection of unlabeled documents and, for each specified category, a small set of seed words relevant to the semantic meaning of that category, DFC filters out the irrelevant documents and classifies the relevant documents into the corresponding categories through topic influence. DFC models two kinds of topics: category-topics and general-topics. There are in turn two kinds of category-topics: relevant-topics and irrelevant-topics. Each relevant-topic is associated with one specific category and represents its semantic meaning. The irrelevant-topics represent the semantics of the unknown categories covered by the document collection, and the general-topics capture the global semantic information. DFC assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that DFC learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then filtered, or classified, based on its posterior category-topic assignment.
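The final step described above, assigning each document a single category-topic and then filtering or classifying on that assignment, can be illustrated with a minimal Python sketch. It is not the DFC inference procedure: it assumes the topic-word distributions are already known, ignores the general-topics, and scores a document under a simple one-topic-per-document unigram model. The function names, the toy probabilities, the `floor` smoothing value, and the topic labels are all hypothetical.

```python
import math


def category_posterior(doc_tokens, topic_word_probs, topic_prior, floor=1e-6):
    """Posterior over category-topics for one document.

    Assumption: the document is generated by exactly one category-topic,
    and unseen words get a small hypothetical probability `floor`.
    """
    log_scores = {}
    for topic, word_probs in topic_word_probs.items():
        score = math.log(topic_prior[topic])
        for token in doc_tokens:
            score += math.log(word_probs.get(token, floor))
        log_scores[topic] = score
    # Normalize in log space (log-sum-exp) to avoid underflow.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {t: math.exp(s - top) / total for t, s in log_scores.items()}


def filter_or_classify(doc_tokens, topic_word_probs, topic_prior, relevant_topics):
    """Return the category of the most probable relevant-topic,
    or None when an irrelevant-topic wins (the document is filtered out)."""
    posterior = category_posterior(doc_tokens, topic_word_probs, topic_prior)
    best = max(posterior, key=posterior.get)
    return best if best in relevant_topics else None


# Toy, hand-written distributions (hypothetical, for illustration only).
TOPIC_WORD_PROBS = {
    "sports": {"goal": 0.4, "match": 0.4, "vote": 0.1},
    "politics": {"vote": 0.5, "election": 0.4},
    "other": {"recipe": 0.5, "oven": 0.4},  # stands in for an irrelevant-topic
}
TOPIC_PRIOR = {"sports": 1 / 3, "politics": 1 / 3, "other": 1 / 3}
RELEVANT_TOPICS = {"sports", "politics"}

if __name__ == "__main__":
    for doc in (["goal", "match"], ["vote", "election"], ["recipe", "oven"]):
        label = filter_or_classify(doc, TOPIC_WORD_PROBS, TOPIC_PRIOR, RELEVANT_TOPICS)
        print(doc, "->", label)
```

A document whose most probable category-topic is an irrelevant-topic is filtered out (returned as `None` here); otherwise it receives the category of its relevant-topic.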
Experiments on two widely used datasets show that DFC consistently outperforms state-of-the-art dataless text classifiers for both classification with filtering and classification without filtering. In many tasks, DFC also achieves classification accuracy comparable to or even better than state-of-the-art supervised learning solutions. Our experimental results further show that DFC is insensitive to its tuning parameters. Moreover, we conduct a thorough study of the impact of seed words on existing dataless text classification techniques. The results reveal that dataless classification performance is affected not by the number of seed words used but by the document coverage of the seed words for the corresponding category.
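The seed-word finding can be made concrete with a small sketch. It assumes that the document coverage of a seed set means the fraction of a category's documents that contain at least one of its seed words; the abstract does not give a formula, so this definition, the function name, and the toy documents are assumptions for illustration only. Measuring coverage this way needs category labels, so it is an analysis tool rather than part of a dataless classifier.

```python
def seed_document_coverage(category_docs, seed_words):
    """Fraction of a category's documents containing at least one seed word.

    Assumption: this is one plausible reading of "document coverage";
    each document is a list of tokens.
    """
    if not category_docs:
        return 0.0
    seeds = set(seed_words)
    covered = sum(1 for doc in category_docs if seeds.intersection(doc))
    return covered / len(category_docs)


# Toy documents for a hypothetical "sports" category.
SPORTS_DOCS = [
    ["goal", "match"],
    ["team", "coach"],
    ["goal", "league"],
    ["ticket", "stadium"],
]

if __name__ == "__main__":
    three_seeds = ["goal", "match", "league"]
    two_seeds = ["goal", "team"]
    print(len(three_seeds), "seeds:", seed_document_coverage(SPORTS_DOCS, three_seeds))
    print(len(two_seeds), "seeds:", seed_document_coverage(SPORTS_DOCS, two_seeds))
```

In this toy data the two-word seed set covers more documents than the three-word one, which is the kind of situation the finding describes.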

Information

Published In

ACM Transactions on Information Systems, Volume 37, Issue 1
January 2019
435 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/3289475
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 December 2018
Accepted: 01 June 2018
Revised: 01 June 2018
Received: 01 January 2018
Published in TOIS Volume 37, Issue 1

Author Tags

  1. Topic model
  2. dataless classification
  3. document filtering

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Singapore Ministry of Education Academic Research Fund Tier 2
  • Natural Science Foundation of Hubei Province
  • Natural Scientific Research Program of Wuhan University
  • Academic Team Building Plan for Young Scholars from Wuhan University
  • National Natural Science Foundation of China

Article Metrics

  • Downloads (Last 12 months): 67
  • Downloads (Last 6 weeks): 9
Reflects downloads up to 04 Oct 2024

Cited By

  • (2024) NLP-based approach for automated safety requirements information retrieval from project documents. Expert Systems with Applications 239 (122401). DOI: 10.1016/j.eswa.2023.122401. Online publication date: Apr-2024
  • (2023) Immigration Policy is Health Policy: News Media Effects on Health Disparities for Latinx Immigrant and Indigenous Groups. Health Promotion Practice 24:5 (818-827). DOI: 10.1177/15248399221150816. Online publication date: 1-Mar-2023
  • (2023) rHDP: An Aspect Sharing-Enhanced Hierarchical Topic Model for Multi-Domain Corpus. ACM Transactions on Information Systems 42:3 (1-31). DOI: 10.1145/3631352. Online publication date: 29-Dec-2023
  • (2023) Exploring Time-aware Multi-pattern Group Venue Recommendation in LBSNs. ACM Transactions on Information Systems 41:3 (1-31). DOI: 10.1145/3564280. Online publication date: 7-Feb-2023
  • (2023) Keyword‐Assisted Topic Models. American Journal of Political Science 68:2 (730-750). DOI: 10.1111/ajps.12779. Online publication date: Apr-2023
  • (2023) Are the BERT family zero-shot learners? A study on their potential and limitations. Artificial Intelligence 322 (103953). DOI: 10.1016/j.artint.2023.103953. Online publication date: Sep-2023
  • (2022) Simplifying Text Mining Activities: Scalable and Self-Tuning Methodology for Topic Detection and Characterization. Applied Sciences 12:10 (5125). DOI: 10.3390/app12105125. Online publication date: 19-May-2022
  • (2022) A Seed-Guided Latent Dirichlet Allocation Approach to Predict the Personality of Online Users Using the PEN Model. Algorithms 15:3 (87). DOI: 10.3390/a15030087. Online publication date: 8-Mar-2022
  • (2022) Dealing With Hierarchical Types and Label Noise in Fine-Grained Entity Typing. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (1305-1318). DOI: 10.1109/TASLP.2022.3155281. Online publication date: 2022
  • (2022) Changes in service quality of sharing accommodation: Evidence from airbnb. Technology in Society 71 (102092). DOI: 10.1016/j.techsoc.2022.102092. Online publication date: Nov-2022
