Abstract
Clustering has become an important tool for data scientists because it supports exploratory data analysis and the summarization of large amounts of data. For text data in particular, clustering faces additional challenges that stem from the high-dimensional space in which the data is represented. Furthermore, despite the important contributions already made, scalability remains a major challenge when the whole-data-in-memory approach is no longer valid, as in real scenarios where data is collected in massive volumes. This chapter reviews recent contributions on high-dimensional text data clustering, with particular emphasis on scalability issues and on the impact of the curse of dimensionality on distance-based clustering methods.
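As a rough illustration of why the curse of dimensionality matters for distance-based methods, the following sketch measures how the gap between the nearest and the farthest neighbor of a query shrinks, relative to the nearest distance, as the dimension grows. It is not taken from the chapter: the uniform synthetic data, the function name `relative_contrast`, and the chosen dimensions and sample size are all assumptions made for the example, and real text vectors (sparse, non-uniform) will behave differently in detail.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query to n_points random points in the unit cube [0, 1]^dim.

    Hypothetical helper for illustration only; uniform data is an assumption.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# Compare a low-dimensional space with increasingly high-dimensional ones.
contrasts = {dim: relative_contrast(dim) for dim in (2, 20, 200, 2000)}
for dim, c in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={c:.3f}")
```

When the relative contrast approaches zero, the nearest and the farthest point are almost equally far from the query, so a clustering method that relies on raw distances has little signal left to separate groups.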
Acknowledgments
It is a great honor for me to be part of this homage book for a person whom I sincerely admire. I met Professor Moraga in 2010 through a seminar course that he gave for doctoral students at the Universidad Técnica Federico Santa María. Since then I have come to appreciate his human warmth and constant eagerness to help others (me among them). I am quite sure he would not be comfortable if I wrote a long acknowledgment text, so I just want to express my gratitude for his advice, discussions and great willingness. Finally, I wanted to include a photo that is very important to me, taken at my doctoral defense. Professor Moraga served on the committee; his observations were invaluable and made my thesis a much better work.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Zamora, J. (2017). Recent Advances in High-Dimensional Clustering for Text Data. In: Seising, R., Allende-Cid, H. (eds) Claudio Moraga: A Passion for Multi-Valued Logic and Soft Computing. Studies in Fuzziness and Soft Computing, vol 349. Springer, Cham. https://doi.org/10.1007/978-3-319-48317-7_20
DOI: https://doi.org/10.1007/978-3-319-48317-7_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48316-0
Online ISBN: 978-3-319-48317-7
eBook Packages: Engineering, Engineering (R0)