Findout: finding outliers in very large datasets

Authors:

Gholamhosein Sheikholeslami,

Aidong Zhang

Knowledge and Information Systems, Volume 4, Issue 4

Pages 387 - 412

https://doi.org/10.1007/s101150200013

Published: 01 October 2002

Abstract

Finding the rare instances or the outliers is important in many KDD (knowledge discovery and data-mining) applications, such as detecting credit card fraud or finding irregularities in gene expressions. Signal-processing techniques have been introduced to transform images for enhancement, filtering, restoration, analysis, and reconstruction. In this paper, we present a new method in which we apply signal-processing techniques to solve important problems in data mining. In particular, we introduce a novel deviation (or outlier) detection approach, termed FindOut, based on wavelet transform. The main idea in FindOut is to remove the clusters from the original data and then identify the outliers. Although previous research showed that such techniques may not be effective because of the nature of the clustering, FindOut can successfully identify outliers from large datasets. Experimental results on very large datasets are presented which show the efficiency and effectiveness of the proposed approach.
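
The abstract's main idea (remove the clusters, then treat what is left as outliers) can be illustrated with a small sketch. This is not the published FindOut algorithm: it assumes 2-D points, uses only the low-pass (averaging) step of a one-level Haar transform on a grid of cell counts, and the function name and the parameters `cell_size` and `density_threshold` are hypothetical choices made for illustration.

```python
from collections import Counter


def find_outlier_candidates(points, cell_size=1.0, density_threshold=1.0):
    """Return the 2-D points that do not fall in dense (cluster) regions.

    Illustrative sketch only: cell_size and density_threshold are
    hypothetical parameters, not values taken from the paper.
    """

    def block_of(p):
        # Quantize the point to a grid cell, then map the cell to the
        # 2x2 block of cells that contains it.
        cx = int(p[0] // cell_size)
        cy = int(p[1] // cell_size)
        return (cx // 2, cy // 2)

    # Low-pass step of a one-level 2-D Haar transform on the cell counts:
    # each coarse value is the average count over a 2x2 block of cells.
    coarse = Counter()
    for p in points:
        coarse[block_of(p)] += 1 / 4.0

    # Blocks with a high smoothed density are treated as clusters.
    cluster_blocks = {b for b, d in coarse.items() if d >= density_threshold}

    # Remove the clustered points; the remainder are outlier candidates.
    return [p for p in points if block_of(p) not in cluster_blocks]


# Usage: a tight 16-point cluster plus two isolated points.
cluster = [(0.25 + 0.5 * i, 0.25 + 0.5 * j) for i in range(4) for j in range(4)]
sample = cluster + [(10.5, 10.5), (-7.2, 3.3)]
print(find_outlier_candidates(sample))
```

In this toy run only the two isolated points are returned. A real implementation would need a principled choice of grid resolution and threshold, which this sketch does not attempt.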

Index Terms

Findout: finding outliers in very large datasets
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
  2. Information systems applications
    1. Data mining

Published In

Knowledge and Information Systems, Volume 4, Issue 4

October 2002

133 pages

ISSN: 0219-1377

Publisher

Springer-Verlag

Berlin, Heidelberg
