DOI: 10.1145/1390334.1390441
research-article

Topic-bridged PLSA for cross-domain text classification

Published: 20 July 2008

Abstract

In many Web applications, such as blog classification and newsgroup classification, labeled data are in short supply. It often happens that obtaining labeled data in a new domain is expensive and time consuming, while there may be plenty of labeled data in a related but different domain. Traditional text classification approaches are not able to cope well with learning across different domains. In this paper, we propose a novel cross-domain text classification algorithm which extends the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data, which come from different but related domains, into a unified probabilistic model. We call this new model Topic-bridged PLSA, or TPLSA. By exploiting the common topics between two domains, we transfer knowledge across different domains through a topic-bridge to help the text classification in the target domain. A unique advantage of our method is its ability to maximally mine knowledge that can be transferred between domains, resulting in superior performance when compared to other state-of-the-art text classification approaches. Experimental evaluation on different kinds of datasets shows that our proposed algorithm can improve the performance of cross-domain text classification significantly.
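
The idea of bridging two domains through shared topics can be illustrated with a minimal sketch. The code below is not the TPLSA model described in the paper: the abstract does not give the model's equations, so this only fits plain PLSA by EM on the pooled documents of both domains, which makes the topic-word distributions P(w|z) common to both, and then maps topics to classes using the labeled source documents. The function names `plsa` and `classify_target`, the count-matrix input format, and the topic-to-label voting step are all assumptions made for illustration.

```python
import random


def _normalize(row):
    """Scale a list of non-negative numbers to sum to 1 (uniform if all zero)."""
    total = sum(row)
    if total <= 0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit plain PLSA by EM on a document-term count matrix (list of lists).

    Returns (p_z_d, p_w_z): P(z|d) for each document, P(w|z) for each topic.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [_normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [_normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = _normalize([p_z_d[d][z] * p_w_z[z][w]
                                   for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_z_d[d][z] += n * post[z]
                    new_w_z[z][w] += n * post[z]
        # M-step: renormalize the expected counts.
        p_z_d = [_normalize(r) for r in new_z_d]
        p_w_z = [_normalize(r) for r in new_w_z]
    return p_z_d, p_w_z


def classify_target(source_counts, source_labels, target_counts,
                    n_topics, n_iter=50, seed=0):
    """Assumed illustration: label target documents through shared topics."""
    # Pool both domains so that every topic's P(w|z) is shared between them.
    p_z_d, _ = plsa(source_counts + target_counts, n_topics, n_iter, seed)
    n_src = len(source_counts)
    labels = sorted(set(source_labels))
    # Estimate P(label|z) from the labeled source documents only.
    p_c_z = []
    for z in range(n_topics):
        mass = {c: 1e-9 for c in labels}
        for d in range(n_src):
            mass[source_labels[d]] += p_z_d[d][z]
        total = sum(mass.values())
        p_c_z.append({c: m / total for c, m in mass.items()})
    # Score each target document: sum over z of P(z|d) * P(label|z).
    preds = []
    for d in range(n_src, len(p_z_d)):
        score = {c: sum(p_z_d[d][z] * p_c_z[z][c] for z in range(n_topics))
                 for c in labels}
        preds.append(max(score, key=score.get))
    return preds
```

A call such as `classify_target(source_counts, source_labels, target_counts, n_topics=2)` returns one predicted label per target document. Because EM starts from a random initialization, the result can vary with `seed`, and a fixed number of iterations is used here instead of a convergence check.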




    Published In

    SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
    July 2008
    934 pages
    ISBN:9781605581644
    DOI:10.1145/1390334

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. cross-domain
    2. text classification
    3. topic-bridged PLSA

    Qualifiers

    • Research-article

    Conference

    SIGIR '08

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
