Abstract
Application domains such as the life sciences (e.g. molecular biology) produce a tremendous amount of data that can no longer be managed without efficient and effective data mining methods. One of the primary data mining tasks is clustering. However, traditional clustering algorithms often fail to detect meaningful clusters because of the high-dimensional, inherently sparse feature space of most real-world data sets. Nevertheless, these data sets often contain clusters hidden in various subspaces of the original feature space. We present a pre-processing step for traditional clustering algorithms that detects all interesting subspaces of high-dimensional data containing clusters. For this purpose, we define a quality criterion for the interestingness of a subspace and propose an efficient algorithm called RIS (Ranking Interesting Subspaces) to examine all such subspaces. A broad evaluation based on synthetic and real-world data sets empirically shows that RIS is suitable for finding all relevant subspaces in large, high-dimensional, sparse data and for ranking them accordingly.
The work is supported in part by the German Ministry for Education, Science, Research and Technology (BMBF) under grant no. 031U112F within the BFAM (Bioinformatics for the Functional Analysis of Mammalian Genomes) project, which is part of the German Genome Analysis Network (NGFN).
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kailing, K., Kriegel, H.-P., Kröger, P., Wanka, S. (2003). Ranking Interesting Subspaces for Clustering High Dimensional Data. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds) Knowledge Discovery in Databases: PKDD 2003. Lecture Notes in Computer Science, vol 2838. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39804-2_23
DOI: https://doi.org/10.1007/978-3-540-39804-2_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20085-7
Online ISBN: 978-3-540-39804-2
eBook Packages: Springer Book Archive