EDSC: efficient density-based subspace clustering

Authors:

Emmanuel Müller,

Thomas Seidl

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

Pages 1093 - 1102

https://doi.org/10.1145/1458082.1458227

Published: 26 October 2008

Abstract

Subspace clustering mines clusters hidden in subspaces of high-dimensional data sets. Density-based approaches have been shown to successfully mine clusters of arbitrary shape even in the presence of noise in full space clustering. Exhaustive search of all density-based subspace clusters, however, results in infeasible runtimes for large high-dimensional data sets. This is due to the exponential number of possible subspace projections in addition to the high computational cost of density-based clustering.
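
To make the exponential growth concrete: a data set with d dimensions has 2^d - 1 non-empty axis-parallel subspace projections. The short Python sketch below counts and enumerates them. The function names `count_subspaces` and `all_subspaces` are illustrative and do not come from the paper.

```python
from itertools import combinations


def count_subspaces(d):
    """Number of non-empty subsets of d dimensions (2^d - 1)."""
    return 2 ** d - 1


def all_subspaces(d):
    """Yield every non-empty subset of the dimensions 0..d-1 as a tuple."""
    for k in range(1, d + 1):
        yield from combinations(range(d), k)


# The count grows quickly with the dimensionality d.
for d in (3, 10, 20, 50):
    print(d, count_subspaces(d))
```

Enumerating with `all_subspaces` is only practical for small d, which is why an exhaustive search needs some form of pruning.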

In this paper, we propose lossless efficient detection of density-based subspace clusters. In our EDSC (efficient density-based subspace clustering) algorithm we reduce the high computational cost of density-based subspace clustering by a complete multistep filter-and-refine algorithm. Our first hypercube filter step avoids exhaustive search of all regions in all subspaces by enclosing potentially density-based clusters in hypercubes. Our second filter step provides additional pruning based on a density monotonicity property. In the final refinement step, the exact unbiased density-based subspace clustering result is detected. As we prove that pruning is lossless in both filter steps, we guarantee completeness of the result.
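
The filter-and-refine idea can be illustrated with a minimal Python sketch. It is not the EDSC algorithm itself: it assumes a single, already chosen subspace, a plain Euclidean eps-neighborhood count as the density criterion, and a grid with cell width `eps` as a stand-in for the hypercube filter. The names `dense_points`, `eps` and `min_pts` are hypothetical. The filter counts the points in the 3^k grid cells around a point, which is an upper bound on its neighborhood size, so discarding points whose bound falls below `min_pts` cannot lose a result.

```python
import math
from collections import Counter
from itertools import product


def dense_points(points, dims, eps, min_pts):
    """Return (dense, candidates): indices of points with at least min_pts
    neighbours (the point itself included) within distance eps in the
    subspace `dims`, plus the indices that survived the filter step."""
    proj = [tuple(p[d] for d in dims) for p in points]

    # Filter step (assumed stand-in): hash each point into a grid cell of
    # width eps and count the points per cell.
    cell_of = [tuple(math.floor(x / eps) for x in q) for q in proj]
    cell_count = Counter(cell_of)
    offsets = list(product((-1, 0, 1), repeat=len(dims)))

    candidates = []
    for i, cell in enumerate(cell_of):
        # Every point within eps lies in one of the 3^k adjacent cells,
        # so this sum is an upper bound on the exact neighbour count.
        upper = sum(cell_count[tuple(c + o for c, o in zip(cell, off))]
                    for off in offsets)
        if upper >= min_pts:
            candidates.append(i)

    # Refinement step: exact distance computations for the survivors only.
    dense = []
    for i in candidates:
        exact = sum(1 for q in proj if math.dist(proj[i], q) <= eps)
        if exact >= min_pts:
            dense.append(i)
    return dense, candidates


data = [(0.0, 0.0, 5.0), (0.1, 0.1, 9.0), (0.2, 0.0, 1.0), (5.0, 5.0, 5.0)]
dense, candidates = dense_points(data, dims=(0, 1), eps=0.5, min_pts=3)
print(dense, candidates)
```

In this toy input the isolated fourth point is discarded by the filter without any distance computation, while the three nearby points pass both steps. A real implementation would also have to handle floating-point effects at cell boundaries, which this sketch ignores.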

In thorough experiments on synthetic and real-world data sets, we demonstrate substantial efficiency gains. Our lossless EDSC approach outperforms existing density-based subspace clustering algorithms by orders of magnitude.

Index Terms

EDSC: efficient density-based subspace clustering
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

October 2008

1562 pages

ISBN: 9781595939913

DOI: 10.1145/1458082

General Chair:
James G. Shanahan (Church and Duncan Group Inc, USA)

Program Chairs:
Sihem Amer-Yahia (Yahoo! Research, USA)
Ioana Manolescu (INRIA, France)
Yi Zhang (University of California, Santa Cruz, USA)
David A. Evans (JustSystems Evans Research, USA)
Alek Kolcz (Microsoft Live Labs, USA)
Key-Sun Choi (KAIST, Korea)
Abdur Chowdury (Twitter, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM08

CIKM08: Conference on Information and Knowledge Management

October 26 - 30, 2008

Napa Valley, California, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 18
Total Downloads: 658
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 2

Cited By

Houle M, Kiermeier M, Zimek A (2023). Clustering High-Dimensional Data. Machine Learning for Data Science Handbook, pp. 219-237. Online publication date: 26-Feb-2023. https://doi.org/10.1007/978-3-031-24628-9_11

Campello R, Kröger P, Sander J, Zimek A (2019). Density-based clustering. WIREs Data Mining and Knowledge Discovery 10:2. Online publication date: 29-Oct-2019. https://doi.org/10.1002/widm.1343

Günnemann S, Boden B, Seidl T (2018). Finding density-based subspace clusters in graphs with feature vectors. Data Mining and Knowledge Discovery 25:2, pp. 243-269. Online publication date: 26-Dec-2018. https://dl.acm.org/doi/10.1007/s10618-012-0272-z

Gullo F, Domeniconi C, Tagarelli A (2018). Projective clustering ensembles. Data Mining and Knowledge Discovery 26:3, pp. 452-511. Online publication date: 26-Dec-2018. https://dl.acm.org/doi/10.1007/s10618-012-0266-x

Sim K, Gopalkrishnan V, Zimek A, Cong G (2018). A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery 26:2, pp. 332-397. Online publication date: 26-Dec-2018. https://dl.acm.org/doi/10.1007/s10618-012-0258-x

Zimek A, Schubert E, Kriegel H (2018). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining 5:5, pp. 363-387. Online publication date: 20-Dec-2018. https://dl.acm.org/doi/10.1002/sam.11161

Zimek A, Assent I, Vreeken J (2014). Frequent Pattern Mining Algorithms for Data Clustering. Frequent Pattern Mining, pp. 403-423. Online publication date: 30-Aug-2014. https://doi.org/10.1007/978-3-319-07821-2_16

Assent I (2012). Efficient Density-Based Subspace Clustering in High Dimensions. Revised Selected Papers of the First International Workshop on Clustering High-Dimensional Data - Volume 7627, pp. 34-49. Online publication date: 15-May-2012. https://dl.acm.org/doi/10.1007/978-3-662-48577-4_3

Günnemann S, Boden B, Seidl T (2011). DB-CSC. Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part I, pp. 565-580. Online publication date: 5-Sep-2011. https://dl.acm.org/doi/10.5555/2034063.2034112

Puri C, Kumar N (2011). Projected Gustafson-Kessel clustering algorithm and its convergence. Transactions on rough sets XIV, pp. 159-182. Online publication date: 1-Jan-2011. https://dl.acm.org/doi/10.5555/2017701.2017710
