Large-Scale Cross-Language Web Page Classification via Dual Knowledge Transfer Using Fast Nonnegative Matrix Trifactorization

Published: 27 July 2015

Abstract

With the rapid growth of modern technologies, the Internet has reached almost every corner of the world. As a result, it has become increasingly important to manage and mine the information contained in Web pages in different languages. Traditional supervised learning methods usually require a large amount of training data to obtain accurate and robust classification models. However, the number of labeled Web pages has not grown as fast as the Internet itself. The lack of sufficient training Web pages in many languages, especially less commonly used ones, makes it challenging for traditional classification algorithms to achieve satisfactory performance. To address this, we observe that Web pages on the same topic in different languages usually share common semantic patterns, though in different representation forms. We also observe that the associations between word clusters and Web page classes are another reliable carrier for transferring knowledge across languages. Based on these observations, in this article we propose a novel joint nonnegative matrix trifactorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification. Our approach transfers knowledge from the auxiliary language, in which abundant labeled Web pages are available, to the target languages, in which we want to classify Web pages, through two different paths: word cluster approximation and the associations between word clusters and Web page classes. Because these two knowledge transfer paths reinforce each other, our approach can achieve better classification accuracy. To deal with large-scale real-world data, we further develop the proposed DKT approach by constraining the factor matrices of NMTF to be cluster indicator matrices. Due to the nature of cluster indicator matrices, we can decouple the proposed optimization objective, and the resulting subproblems are much smaller and involve far fewer matrix multiplications, which makes our new approach much more computationally efficient. We evaluate the proposed approach in extensive experiments on a real-world cross-language Web page data set. Promising results demonstrate the effectiveness of our approach and are consistent with our theoretical analyses.
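
To make the idea of indicator-constrained trifactorization more concrete, the following is a minimal sketch, not the DKT algorithm of this article. It co-clusters a single nonnegative matrix X (for example, words by Web pages) as X ≈ F S Gᵀ, where F and G are 0/1 cluster indicator matrices. Under that assumption the problem decouples: S reduces to block means, and each row or column is assigned independently to its nearest prototype. The function name `indicator_nmtf`, its parameters, and the toy data are hypothetical and chosen only for illustration; the cross-language transfer terms are omitted.

```python
import numpy as np


def indicator_nmtf(X, n_row_clusters, n_col_clusters, n_iter=20, seed=0):
    """Approximate X ~ F @ S @ G.T with 0/1 indicator matrices F and G.

    Illustrative sketch only: plain co-clustering of one matrix,
    without any knowledge-transfer terms.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    # Random initial cluster assignment for every row and column.
    row_labels = rng.integers(n_row_clusters, size=n_rows)
    col_labels = rng.integers(n_col_clusters, size=n_cols)

    def block_means(row_labels, col_labels):
        # One-hot encode the labels to obtain the indicator matrices.
        F = np.eye(n_row_clusters)[row_labels]
        G = np.eye(n_col_clusters)[col_labels]
        # With indicator factors, the least-squares S is the mean of X
        # over each (row cluster, column cluster) block; empty blocks give 0.
        counts = np.outer(F.sum(axis=0), G.sum(axis=0))
        S = (F.T @ X @ G) / np.maximum(counts, 1.0)
        return F, G, S

    for _ in range(n_iter):
        F, G, S = block_means(row_labels, col_labels)
        # Each column of X goes to the nearest column of F @ S.
        col_protos = F @ S  # shape: (n_rows, n_col_clusters)
        col_dist = ((X[:, :, None] - col_protos[:, None, :]) ** 2).sum(axis=0)
        col_labels = col_dist.argmin(axis=1)
        G = np.eye(n_col_clusters)[col_labels]
        # Each row of X goes to the nearest row of S @ G.T.
        row_protos = S @ G.T  # shape: (n_row_clusters, n_cols)
        row_dist = ((X[:, None, :] - row_protos[None, :, :]) ** 2).sum(axis=2)
        row_labels = row_dist.argmin(axis=1)

    # Recompute S so that it matches the final assignments.
    _, _, S = block_means(row_labels, col_labels)
    return row_labels, col_labels, S


# Toy word-by-page count matrix with two visible blocks (made-up data).
X = np.array([[5, 5, 0, 0],
              [5, 5, 0, 0],
              [0, 0, 3, 3],
              [0, 0, 3, 3],
              [0, 0, 3, 3]], dtype=float)
row_labels, col_labels, S = indicator_nmtf(X, 2, 2)
print(row_labels, col_labels)
print(S)
```

In this sketch, S plays the role of the association between row clusters and column clusters. Each update touches only small per-cluster summaries and nearest-prototype assignments, which illustrates why indicator constraints can avoid the large dense matrix products that multiplicative NMTF updates require.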


    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 10, Issue 1
    July 2015
    321 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/2808688
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 July 2015
    Accepted: 01 December 2014
    Revised: 01 May 2014
    Received: 01 July 2013
    Published in TKDD Volume 10, Issue 1

    Author Tags

    1. Cross-language classification
    2. cluster indicator matrix
    3. knowledge transfer
    4. large-scale data
    5. nonnegative matrix trifactorization

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • NSF-DBI 1356628
    • NSF-IIS 1423591
    • NSF-IIS 1117965
    • NSF-IIS 1302675
    • NSF-IIS 1344152

    Cited By

    • (2024) Learning semi-supervised enrichment of longitudinal imaging-genetic data for improved prediction of cognitive decline. BMC Medical Informatics and Decision Making 24:S1. DOI: 10.1186/s12911-024-02455-w. Online publication date: 28-May-2024
    • (2020) Public service hot issue discovery with binary differential evolution algorithm based on fuzzy system theory. Journal of Intelligent & Fuzzy Systems (1-7). DOI: 10.3233/JIFS-179940. Online publication date: 6-Jul-2020
    • (2019) An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data 6:1. DOI: 10.1186/s40537-019-0254-8. Online publication date: 17-Oct-2019
    • (2019) Semi-supervised Multi-view Individual and Sharable Feature Learning for Webpage Classification. The World Wide Web Conference (3349-3355). DOI: 10.1145/3308558.3313492. Online publication date: 13-May-2019
    • (2019) The optimal feasible knowledge transfer path in a knowledge creation driven team. Data & Knowledge Engineering. DOI: 10.1016/j.datak.2019.01.002. Online publication date: Jan-2019
    • (2017) Finding cut from the same cloth. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (1467-1473). DOI: 10.5555/3298239.3298453. Online publication date: 4-Feb-2017
    • (2017) Semi-supervised multi-view correlation feature learning with application to webpage classification. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (1374-1381). DOI: 10.5555/3298239.3298440. Online publication date: 4-Feb-2017
    • (2017) An optimized approach for massive web page classification using entity similarity based on semantic network. Future Generation Computer Systems 76 (510-518). DOI: 10.1016/j.future.2017.03.003. Online publication date: Nov-2017
    • (2016) Fast robust non-negative matrix factorization for large-scale human action data clustering. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (2104-2110). DOI: 10.5555/3060832.3060915. Online publication date: 9-Jul-2016
