Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

Published: 22 July 2015 Publication History

Abstract

An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. It provides as a result a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of “outlierness” can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a “flat” (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraints violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, both in terms of running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art, density-based clustering and (global and local) outlier detection methods.

References

[1]
N. Abe, B. Zadrozny, and J. Langford. 2006. Outlier detection by active learning. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).Philadelphia, PA. 504--509.
[2]
E. Achtert, H.-P. Kriegel, E. Schubert, and A. Zimek. 2013. Interactive data mining with 3D-parallel-coordinate-trees. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). New York City, NY. 1009--1012.
[3]
C. C. Aggarwal and P. S. Yu. 2001. Outlier detection for high dimensional data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Santa Barbara, CA. 37--46.
[4]
M. Agyemang, K. Barker, and R. Alhajj. 2006. A comprehensive survey of numeric and symbolic outlier mining techniques. Intelligent Data Analysis 10 (2006), 521--538.
[5]
F. Angiulli and F. Fassetti. 2009. DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets. ACM Transactions on Knowledge Discovery from Data (TKDD) 3, 1 (2009), 4:1--57.
[6]
F. Angiulli and C. Pizzuti. 2002. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Helsinki, Finland. 15--26.
[7]
M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. 1999. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Philadelphia, PA. 49--60.
[8]
K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml.
[9]
V. Barnett. 1978. The study of outliers: Purpose and model. Applied Statistics 27, 3 (1978), 242--250.
[10]
V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley & Sons.
[11]
S. Basu, I. Davidson, and K. Wagstaff (Eds.). 2008. Constraint Clustering: Advances in Algorithms, Applications and Theory. CRC Press, Boca Raton, London, New York.
[12]
S. D. Bay and M. Schwabacher. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. 29--38.
[13]
R. J. Beckman and R. D. Cook. 1983. Outlier..........s. Technometrics 25, 2 (1983), 119--149.
[14]
M. Bilenko, S. Basu, and R. J. Mooney. 2004. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning (ICML). Banff, AB, Canada.
[15]
C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
[16]
S. Brecheisen, H.-P. Kriegel, P. Kröger, and M. Pfeifle. 2004. Visually mining through cluster hierarchies. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM). Lake Buena Vista, FL. 400--412.
[17]
M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander. 2001. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Santa Barbara, CA. 79--90.
[18]
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. 2000. LOF: Identifying density-based local outliers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Dallas, TX. 93--104.
[19]
C. Böhm and C. Plant. 2008. HISSCLU: A hierarchical density-based method for semi-supervised clustering. In Proceedings of the 11th International Conference on Extending Database Technology (EDBT). Nantes, France. 440--451.
[20]
R. J. G. B. Campello, D. Moulavi, and J. Sander. 2013a. Density-based clustering based on hierarchical density estimates. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Gold Coast, Australia. 160--172.
[21]
R. J. G. B. Campello, D. Moulavi, A. Zimek, and J. Sander. 2013b. A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining and Knowledge Discovery 27, 3 (2013), 344--371.
[22]
M. H. Chehreghani, H. Abolhassani, and M. H. Chehreghani. 2008. Improving density-based methods for hierarchical clustering of web pages. Data & Knowledge Engineering 67, 1 (2008), 30--50.
[23]
D. Comaniciu and P. Meer. 1999. Distribution free decomposition of multivariate data. Pattern Analysis and Applications 2, 1 (1999), 22--30.
[24]
D. Coomans and D. L. Massart. 1981. Potential methods in pattern recognition: Part 2. CLUPOT -- An unsupervised pattern recognition technique. Analytica Chimica Acta 133, 3 (1981), 225--239.
[25]
A. Cuevas, M. Febrero, and R. Fraiman. 2000. Estimating the number of clusters. Canadian Journal of Statistics 28, 2 (2000), 367--382.
[26]
A. Cuevas, M. Febrero, and R. Fraiman. 2001. Cluster analysis: A further approach based on density estimation. Computational Statistics and Data Analysis 36 (2001), 441--459.
[27]
X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert. 2014. Discriminative features for identifying and interpreting outliers. In Proceedings of the 30th International Conference on Data Engineering (ICDE), Chicago, IL. 88--99.
[28]
M. Daszykowski, B. Walczak, and D. L. Massart. 2001. Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems 56, 2 (2001), 83--92.
[29]
I. Davidson, K. L. Wagstaff, and S. Basu. 2006. Measuring constraint-set utility for partitional clustering algorithms. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). Berlin, Germany. 115--126.
[30]
T. de Vries, S. Chawla, and M. E. Houle. 2010. Finding local anomalies in very high dimensional space. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM). Sydney, Australia. 128--137.
[31]
T. de Vries, S. Chawla, and M. E. Houle. 2012. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems (KAIS) 32, 1 (2012), 25--52.
[32]
S. Dhandapani, G. Gupta, and J. Ghosh. 2010. Design and Implementation of Scalable Hierarchical Density Based Clustering. Technical Report. Technical Report IDEAL-2010-06, Dept. Electrical and Computer Eng., Univ. of Texas at Austin.
[33]
L. Ertöz, M. Steinbach, and V. Kumar. 2002. A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining. 105--115.
[34]
L. Ertöz, M. Steinbach, and V. Kumar. 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), San Francisco, CA.
[35]
M. Ester. 2009. Density-based Clustering. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu (Eds.). Springer, 795--799.
[36]
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). Portland, OR. 226--231.
[37]
B. S. Everitt, S. Landau, and M. Leese. 2001. Cluster Analysis (4th ed.). Arnold.
[38]
A. Foss and O. R. Zaïane. 2002. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), Maebashi City. Japan. 179--186.
[39]
A. L. N. Fred and A. K. Jain. 2005. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 6 (2005), 835--850.
[40]
K. Fukunaga and L. Hostetler. 1975. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21, 1 (1975), 32--40.
[41]
J. Gao and P.-N. Tan. 2006. Converting output scores from outlier detection algorithms into probability estimates. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM). Hong Kong, China. 212--221.
[42]
J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders. 2005. The amsterdam library of object images. International Journal of Computer Vision 61, 1 (2005), 103--112.
[43]
F. E. Grubbs. 1950. Sample criteria for testing outlying observations. The Annals of Mathematical Statistics 21, 1 (1950), 27--58.
[44]
G. Gupta, A. Liu, and J. Ghosh. 2006. Hierarchical density shaving: A clustering and visualization framework for large biological datasets. In IEEE ICDM Workshop on Data Mining in Bioinformatics (DMB). Hong Kong, China, 89--93.
[45]
G. Gupta, A. Liu, and J. Ghosh. 2010. Automated hierarchical density shaving: A robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7, 2 (2010), 223--237.
[46]
A. S. Hadi, A. H. M. Rahmatullah Imon, and M. Werner. 2009. Detection of outliers. Wiley Interdisciplinary Reviews: Computational Statistics 1, 1 (2009), 57--70.
[47]
S. Hang, Z. You, and L. Y. Chun. 2009. Incorporating biological knowledge into density-based clustering analysis of gene expression data. In 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Vol. 5. Tianjin, China, 52--56.
[48]
J. A. Hanley and B. J. McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1982), 29--36.
[49]
J. A. Hartigan. 1975. Clustering Algorithms. John Wiley & Sons, New York, London, Sydney, Toronto.
[50]
J. A. Hartigan. 1987. Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc. 82, 397 (1987), 267--270.
[51]
D. Hawkins. 1980. Identification of Outliers. Chapman and Hall.
[52]
M. Herbin, N. Bonnet, and P. Vautrot. 2001. Estimation of the number of clusters and influence zones. Pattern Recognition Letters 22, 14 (2001), 1557--1568.
[53]
A. Hinneburg and H. H. Gabriel. 2007. Denclue 2.0: Fast clustering based on kernel density estimation. In Proceedings of the 7th International Symposium on Intelligent Data Analysis (IDA). Ljubljana, Slovenia. 70--80.
[54]
A. Hinneburg and D. A. Keim. 1998. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th ACM International Conference on Knowledge Discovery and Data Mining (KDD). New York City, NY. 58--65.
[55]
A. Hinneburg and D. A. Keim. 2003. A general approach to clustering in large databases with noise. Knowledge and Information Systems (KAIS) 5 (2003), 387--415. Issue 4.
[56]
V. J. Hodge and J. Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22 (2004), 85--126.
[57]
D. Horta and R. J. G. B. Campello. 2012. Automatic aspect discrimination in data clustering. Pattern Recognition 45, 12 (2012), 4370--4388.
[58]
L. Hubert and P. Arabie. 1985. Comparing partitions. Journal of Classification 2, 1 (1985), 193--218.
[59]
J.-N. Hwang, S.-R. Lay, and A. Lippman. 1994. Nonparametric multivariate density estimation: A comparative study. IEEE Transactions on Signal Processing 42, 10 (1994), 2795--2810.
[60]
A. K. Jain and R. C. Dubes. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs.
[61]
W. Jin, A. K. Tung, and J. Han. 2001. Mining top-n local outliers in large databases. In Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). San Francisco, CA. 293--298.
[62]
W. Jin, A. K. H. Tung, J. Han, and W. Wang. 2006. Ranking outliers using symmetric neighborhood relationship. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Singapore. 577--593.
[63]
S. C. Johnson. 1967. Hierarchical clustering schemes. Psychometrika 32 (1967), 241--254.
[64]
F. Keller, E. Müller, and K. Böhm. 2012. HiCS: High contrast subspaces for density-based outlier ranking. In Proceedings of the 28th International Conference on Data Engineering (ICDE). Washington, DC.
[65]
E. M. Knorr and R. T. Ng. 1997a. A unified approach for mining outliers. In Proceedings of the conference of the Centre for Advanced Studies on Collaborative research (CASCON). Toronto, ON, Canada. 11--23.
[66]
E. M. Knorr and R. T. Ng. 1997b. A unified notion of outliers: Properties and computation. In Proceedings of the 3rd ACM International Conference on Knowledge Discovery and Data Mining (KDD). Newport Beach, CA. 219--222.
[67]
E. M. Knorr and R. T. Ng. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB). New York City, NY. 392--403.
[68]
E. M. Knorr, R. T. Ng, and V. Tucanov. 2000. Distance-based outliers: Algorithms and applications. The VLDB Journal 8, 3--4 (2000), 237--253.
[69]
G. Kollios, D. Gunopulos, N. Koudas, and S. Berchthold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE Transactions on Knowledge and Data Engineering 15, 5 (2003), 1170--1187.
[70]
H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek. 2011. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1, 3 (2011), 231--240.
[71]
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2009a. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). Hong Kong, China. 1649--1652.
[72]
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2009b. Outlier detection in axis-parallel subspaces of high dimensional data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Bangkok, Thailand. 831--838.
[73]
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2011a. Interpreting and unifying outlier scores. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM) Mesa, AZ. 13--24.
[74]
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2012. Outlier detection in arbitrarily oriented subspaces. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM). Brussels, Belgium. 379--388.
[75]
H.-P. Kriegel, P. Kröger, and A. Zimek. 2010. Outlier Detection Techniques. Tutorial at the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. (2010).
[76]
H.-P. Kriegel, M. Schubert, and A. Zimek. 2008. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas, NV. 444--452.
[77]
B. Larsen and C. Aone. 1999. Fast and effective text mining using linear-time document clustering. In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). San Diego, CA. 16--22.
[78]
A. Lazarevic and V. Kumar. 2005. Feature bagging for outlier detection. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Chicago, IL. 157--166.
[79]
L. Lelis and J. Sander. 2009. Semi-supervised density-based clustering. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM). Miami, FL. 842--847.
[80]
F. T. Liu, K. M. Ting, and Z.-H. Zhou. 2012. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6, 1 (2012), 3:1--39.
[81]
P. Liu, D. Zhou, and N. Wu. 2007. VDBSCAN: Varied density based spatial clustering of applications with noise. In International Conference on Service Systems and Service Management (ICSSSM). Chengdu, China. 1--4.
[82]
B. Micenková, R. T. Ng, X. H. Dang, and I. Assent. 2013. Explaining outliers by subspace separability. In Proceedings of the 13th IEEE International Conference on Data Mining (ICDM). Dallas, TX. 518--527.
[83]
D. W. Muller and G. Sawitzki. 1991. Excess mass estimates and tests for multimodality. J. Amer. Statist. Assoc. 86, 415 (1991), 738--746.
[84]
E. Müller, I. Assent, P. Iglesias, Y. Mülle, and K. Böhm. 2012. Outlier ranking via subspace analysis in multiple views of the data. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM). Brussels, Belgium. 529--538.
[85]
E. Müller, I. Assent, U. Steinhausen, and T. Seidl. 2008. OutRank: Ranking outliers in high dimensional data. In Proceedings of the 24th International Conference on Data Engineering (ICDE) Workshop on Ranking in Databases (DBRank). Cancun, Mexico. 600--603.
[86]
E. Müller, M. Schiffer, and T. Seidl. 2010. Adaptive outlierness for subspace outlier ranking. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM). Toronto, ON, Canada. 1629--1632.
[87]
E. Müller, M. Schiffer, and T. Seidl. 2011. Statistical selection of relevant subspace projections for outlier ranking. In Proceedings of the 27th International Conference on Data Engineering (ICDE). Hannover, Germany. 434--445.
[88]
M. C. Naldi, R. J. G. B. Campello, E. R. Hruschka, and A. C. P. L. F. Carvalho. 2011. Efficiency issues of evolutionary k-means. Applied Soft Computing 11, 2 (2011), 1938--1952.
[89]
H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan. 2010. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications (DASFAA). Tsukuba, Japan. 368--383.
[90]
H. V. Nguyen and V. Gopalkrishnan. 2009. Efficient pruning schemes for distance-based outlier detection. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD). Bled, Slovenia. 160--175.
[91]
H. V. Nguyen, V. Gopalkrishnan, and I. Assent. 2011. An unbiased distance-based outlier detection approach for high-dimensional data. In Proceedings of the 16th International Conference on Database Systems for Advanced Applications (DASFAA). Hong Kong, China. 138--152.
[92]
G. H. Orair, C. Teixeira, Y. Wang, W. Meira Jr., and S. Parthasarathy. 2010. Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment 3, 2 (2010), 1469--1480.
[93]
S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. 2003. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering (ICDE). Bangalore, India. 315--326.
[94]
D. Pascual, F. Pla, and J. S. Sánchez. 2006. Non parametric local density-based clustering for multimodal overlapping distributions. In Proceedings of the 7th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL). Burgos, Spain. 671--678.
[95]
E. S. Pearson and C. Chandra Sekar. 1936. The efficiency of statistical tools and a criterion for the rejection of outlying observations. Biometrika 28, 3/4 (1936), 308--320.
[96]
T. Pei, A. Jasra, D. J. Hand, A.-X. Zhu, and C. Zhou. 2009. DECODE: A new method for discovering clusters of different densities in spatial data. Data Mining and Knowledge Discovery 18, 3 (2009), 337--369.
[97]
T. Pei, A.-X. Zhu, C. Zhou, B. Li, and C. Qin. 2006. A new approach to the nearest-neighbour method to discover cluster features in overlaid spatial point processes. International Journal of Geographical Information Science 20, 2 (2006), 153--168.
[98]
N. Pham and R. Pagh. 2012. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Beijing, China.
[99]
S. Ramaswamy, R. Rastogi, and K. Shim. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Dallas, TX. 427--438.
[100]
C. Ruiz, M. Spiliopoulou, and E. Menasalvas. 2007. C-DBSCAN: Density-based clustering with constraints. In Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, A. An, J. Stefanowski, S. Ramanna, C. Butz, W. Pedrycz, and G. Wang (Eds.). 216--223.
[101]
C. Ruiz, M. Spiliopoulou, and E. Menasalvas. 2010. Density-based semi-supervised clustering. Data Mining and Knowledge Discovery 21, 3 (2010), 345--370.
[102]
J. Sander. 2010. Density-based clustering. In Encyclopedia of Machine Learning, C. Sammut and G. I. Webb (Eds.). Springer, US, 270--273.
[103]
J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky. 2003. Automatic extraction of clusters from hierarchical clustering representations. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Seoul, Korea. 75--87.
[104]
E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. 2012. On evaluation of outlier rankings and outlier scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM). Anaheim, CA. 1047--1058.
[105]
E. Schubert, A. Zimek, and H.-P. Kriegel. 2014a. Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM). Philadelphia, PA. 542--550.
[106]
E. Schubert, A. Zimek, and H.-P. Kriegel. 2014b. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery 28, 1 (2014), 190--237.
[107]
B. W. Silverman. 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC.
[108]
P. H. A. Sneath. 1957. The application of computers to taxonomy. Journal of General Microbiology 17 (1957), 201--226.
[109]
T. Soler and M. Chin. 1985. On transformation of covariance matrices between local Cartesian coordinate systems and commutative diagrams. In ASP-ACSM Convention. 393--406.
[110]
W. Stuetzle. 2003. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20, 1 (2003), 25--47.
[111]
W. Stuetzle and R. Nugent. 2010. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics 19, 2 (2010), 397--418.
[112]
H. Sun, J. Huang, J. Han, H. Deng, P. Zhao, and B. Feng. 2010. gSkeletonClu: Density-based network clustering via structure-connected tree division or agglomeration. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM). Sydney, Australia. 481--490.
[113]
P.-N. Tan, M. Steinbach, and V. Kumar. 2006. Introduction to Data Mining. Addison Wesley.
[114]
J. Tang, Z. Chen, A. W.-C. Fu, and D. W. Cheung. 2002. Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Taipei, Taiwan. 535--548.
[115]
K. L. Wagstaff. 2002. Intelligent Clustering with Instance-Level Constraints. Ph.D. Dissertation. Department of Computer Science, Cornell University.
[116]
D. Wishart. 1969. Mode analysis: A generalization of nearest neighbor which reduces chaining effects. In Numerical Taxonomy, A. J. Cole (Ed.). 282--311.
[117]
M. A. Wong and T. Lane. 1983. A kth nearest neighbour clustering procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 45, 3 (1983), 362--368.
[118]
R. Xu and D. Wunsch II. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16 (2005), 645--678.
[119]
R. Xu and D. Wunsch II. 2009. Clustering. IEEE Press.
[120]
J. Yang, N. Zhong, Y. Yao, and J. Wang. 2008. Local peculiarity factor and its application in outlier detection. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas, NV. 776--784.
[121]
K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. 2001. Model-based clustering and data transformations for gene expression data. Bioinformatics 17, 10 (2001), 977--987.
[122]
K. Y. Yeung, M. Medvedovic, and R. E. Bumgarner. 2003. Clustering gene-expression data with repeated measurements. Genome Biology 4, 5 (2003), R34.1--R34.17.
[123]
K. Zhang, M. Hutter, and H. Jin. 2009. A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Bangkok, Thailand. 813--822.
[124]
T. Zhang, R. Ramakrishnan, and M. Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Montreal, QC, Canada. 103--114.
[125]
A. Zimek, R. J. G. B. Campello, and J. Sander. 2013. Ensembles for unsupervised outlier detection: Challenges and research questions. ACM SIGKDD Explorations 15, 1 (2013), 11--22.
[126]
A. Zimek, R. J. G. B. Campello, and J. Sander. 2014. Data perturbation for outlier detection ensembles. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management (SSDBM). Aalborg, Denmark. 13:1--12.
[127]
A. Zimek, M. Gaudet, R. J. G. B. Campello, and J. Sander. 2013a. Subsampling for efficient and effective unsupervised outlier detection ensembles. In Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Chicago, IL. 428--436.
[128]
A. Zimek, E. Schubert, and H.-P. Kriegel. 2012. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining 5, 5 (2012), 363--387.

Cited By

View all
  • (2024)Effective News Summarization through High Cohesion Clustering with Noise Filtering and Second-Half SummaryJournal of Digital Contents Society10.9728/dcs.2024.25.8.216525:8(2165-2174)Online publication date: 31-Aug-2024
  • (2024)Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutionsPeerJ Computer Science10.7717/peerj-cs.228610(e2286)Online publication date: 29-Aug-2024
  • (2024)Fast and exact fixed-radius neighbor search based on sortingPeerJ Computer Science10.7717/peerj-cs.192910(e1929)Online publication date: 29-Mar-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data
ACM Transactions on Knowledge Discovery from Data  Volume 10, Issue 1
July 2015
321 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2808688
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 July 2015
Accepted: 01 February 2015
Revised: 01 December 2014
Received: 01 July 2013
Published in TKDD Volume 10, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Density-based clustering
  2. data visualization
  3. global/local outliers
  4. hierarchical and nonhierarchical clustering
  5. outlier detection
  6. unsupervised and semisupervised clustering

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSERC (Canada)
  • FAPESP
  • CNPq

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)717
  • Downloads (Last 6 weeks)85
Reflects downloads up to 04 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Effective News Summarization through High Cohesion Clustering with Noise Filtering and Second-Half SummaryJournal of Digital Contents Society10.9728/dcs.2024.25.8.216525:8(2165-2174)Online publication date: 31-Aug-2024
  • (2024)Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutionsPeerJ Computer Science10.7717/peerj-cs.228610(e2286)Online publication date: 29-Aug-2024
  • (2024)Fast and exact fixed-radius neighbor search based on sortingPeerJ Computer Science10.7717/peerj-cs.192910(e1929)Online publication date: 29-Mar-2024
  • (2024)Visit Nearby Halo Substructures Using LAMOST DR9 MRS DataThe Astrophysical Journal10.3847/1538-4357/ad3240965:1(62)Online publication date: 8-Apr-2024
  • (2024)Biomass Burning in Northeast China over Two Decades: Temporal Trends and Geographic PatternsRemote Sensing10.3390/rs1611191116:11(1911)Online publication date: 26-May-2024
  • (2024)Imputing Missing Data in One-Shot Devices Using Unsupervised Learning ApproachMathematics10.3390/math1218288412:18(2884)Online publication date: 15-Sep-2024
  • (2024)Development and Application of an Advanced Automatic Identification System (AIS)-Based Ship Trajectory Extraction Framework for Maritime Traffic AnalysisJournal of Marine Science and Engineering10.3390/jmse1209167212:9(1672)Online publication date: 18-Sep-2024
  • (2024)Working in Tandem to Uncover 3D Artefact Distribution in Archaeological Excavations: Mathematical Interpretation through Positional and Relational MethodsHeritage10.3390/heritage70802117:8(4472-4499)Online publication date: 18-Aug-2024
  • (2024)Anomaly Detection in Blockchain Networks Using Unsupervised Learning: A SurveyAlgorithms10.3390/a1705020117:5(201)Online publication date: 9-May-2024
  • (2024)Key point generation as an instrument for generating core statements of a political debate on TwitterFrontiers in Artificial Intelligence10.3389/frai.2024.12009497Online publication date: 20-Mar-2024
  • Show More Cited By

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media