
A nonnegative matrix factorization framework for semi-supervised document clustering with dual constraints

  • Regular Paper
Knowledge and Information Systems

Abstract

In this paper, we propose Orthogonal Semi-Supervised Nonnegative Matrix Factorization (OSS-NMF), a new semi-supervised co-clustering algorithm for document clustering. The approach incorporates two kinds of prior knowledge into the NMF co-clustering framework: domain knowledge about data points (documents), given as pair-wise constraints, and category knowledge about features (words). Under this framework, clustering is formulated as the problem of finding a local minimizer of an objective function that takes this dual prior knowledge into account. We derive the update rules and design an iterative algorithm for the co-clustering process. We prove the correctness and convergence of the algorithm, which establishes its mathematical rigor. Our experimental evaluations show that incorporating these constraints markedly improves document clustering performance.
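The general idea of co-clustering by nonnegative tri-factorization with dual prior knowledge can be illustrated with a minimal sketch. It is not the OSS-NMF algorithm itself: the abstract does not give the objective or the update rules, so the objective below, the weights `alpha`, `beta`, `gamma`, and the function name `constrained_tri_nmf` are all assumptions made for illustration, and the orthogonality constraints are omitted. The sketch factorizes a word-by-document matrix as X ≈ F S Gᵀ, pulls the word factor F toward a prior category matrix `F0`, and penalizes violated must-link and cannot-link pairs on the document factor G, using standard multiplicative updates.

```python
import numpy as np

def constrained_tri_nmf(X, k, F0, must=(), cannot=(), alpha=1.0, beta=1.0,
                        gamma=1.0, n_iter=200, eps=1e-9, seed=0):
    """Illustrative tri-factorization X ~ F S G^T with soft dual constraints.

    Assumed objective (not taken from the paper):
      ||X - F S G^T||^2 + alpha ||F - F0||^2
        + beta tr(G^T (D - A) G) + gamma tr(G^T C G)
    where A marks must-link document pairs, D is its degree matrix,
    and C marks cannot-link document pairs.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps   # word-cluster memberships
    S = rng.random((k, k)) + eps   # word-cluster / doc-cluster association
    G = rng.random((n, k)) + eps   # document-cluster memberships

    A = np.zeros((n, n))
    C = np.zeros((n, n))
    for i, j in must:
        A[i, j] = A[j, i] = 1.0
    for i, j in cannot:
        C[i, j] = C[j, i] = 1.0
    D = np.diag(A.sum(axis=1))

    def loss():
        return (np.linalg.norm(X - F @ S @ G.T) ** 2
                + alpha * np.linalg.norm(F - F0) ** 2
                + beta * np.trace(G.T @ (D - A) @ G)
                + gamma * np.trace(G.T @ C @ G))

    history = [loss()]
    for _ in range(n_iter):
        # Each update multiplies by (negative gradient part) / (positive part),
        # which keeps every factor nonnegative.
        F *= (X @ G @ S.T + alpha * F0) / (F @ (S @ G.T @ G @ S.T) + alpha * F + eps)
        G *= (X.T @ F @ S + beta * (A @ G)) / (
            G @ (S.T @ F.T @ F @ S) + beta * (D @ G) + gamma * (C @ G) + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        history.append(loss())
    return F, S, G, history

# Toy data: 6 words x 6 documents with two obvious blocks.
X = np.kron(np.eye(2), np.ones((3, 3)))
F0 = np.kron(np.eye(2), np.ones((3, 1)))   # assumed word-category prior
F, S, G, history = constrained_tri_nmf(
    X, k=2, F0=F0, must=[(0, 1)], cannot=[(0, 5)])
doc_labels = G.argmax(axis=1)               # hard document assignments
```

Reading cluster labels off the largest entry in each row of G is one common convention; the paper may assign documents differently.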



Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61163039, 61105052), National Basic Research Priorities Programme (No. 2007CB311004), Funding of enhancement of young teachers’ research of Northwest Normal University (No. NWNU-LKQN-10-1), Doctoral Start-up Funding of Xiangtan University (No. 10QDZ42).

Author information


Correspondence to Huifang Ma.


About this article

Cite this article

Ma, H., Zhao, W. & Shi, Z. A nonnegative matrix factorization framework for semi-supervised document clustering with dual constraints. Knowl Inf Syst 36, 629–651 (2013). https://doi.org/10.1007/s10115-012-0560-3


  • DOI: https://doi.org/10.1007/s10115-012-0560-3
