10.5555/2017701.2017709
chapter

Tolerance rough set theory based data summarization for clustering large datasets

January 2011
Pages 139 - 158
Published: 01 January 2011

Abstract

Finding clusters in large datasets is a challenge in many fields of science and technology. Many clustering methods have been developed over the years, but most of them need multiple scans of the data to converge and therefore cannot readily be applied to cluster analysis of large datasets. Data summarization can be used as a pre-processing step to speed up classical clustering methods on such datasets. In this paper, we propose a data summarization scheme based on tolerance rough set theory, termed rough bubble. Rough bubble uses the leaders clustering method to collect sufficient statistics of the dataset, which can then be used to cluster the dataset. We show that the proposed summarization scheme outperforms the recently introduced data bubble summarization scheme when the agglomerative hierarchical (single-link) clustering method is applied to it. We also introduce a technique to reduce the number of distance computations required by the leaders clustering method. Experiments with synthetic and real-world datasets show the effectiveness of our methods for large datasets.
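To make the summarization idea concrete, the following is a minimal sketch of the classical single-scan leaders clustering method, extended to keep simple per-leader statistics. It is an illustration only, not the rough bubble scheme of the chapter: the function name `leaders_summary`, the threshold parameter `tau`, the Euclidean distance, and the choice of statistics (count, linear sum, sum of squares) are all assumptions made for this example, and the tolerance rough set component is not modelled.

```python
import math

def leaders_summary(points, tau):
    """Single-scan leaders clustering with per-leader statistics.

    Hypothetical sketch: each point joins the first leader within
    distance tau; otherwise it becomes a new leader. The statistics
    kept here are an assumption, not taken from the chapter.
    """
    summaries = []
    for p in points:
        for s in summaries:
            if math.dist(p, s["leader"]) <= tau:
                # Point is a follower: update the leader's statistics.
                s["count"] += 1
                s["linear_sum"] = [a + b for a, b in zip(s["linear_sum"], p)]
                s["square_sum"] += sum(x * x for x in p)
                break
        else:
            # No leader is close enough: the point becomes a new leader.
            summaries.append({
                "leader": tuple(p),
                "count": 1,
                "linear_sum": list(p),
                "square_sum": sum(x * x for x in p),
            })
    return summaries

points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.1), (0.1, 0.2)]
summaries = leaders_summary(points, tau=1.0)
print([s["count"] for s in summaries])  # [3, 2]
```

A clustering method such as single-link could then be run on the much smaller set of leaders and their statistics instead of on every point, which is the general motivation for summarization described in the abstract.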


Published In

Transactions on rough sets XIV
January 2011
233 pages
ISBN: 9783642215629

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

1. data summarization
2. hierarchical clustering method
3. large datasets
4. leaders clustering
5. rough bubble
6. tolerance rough set theory
