research-article

EXPERIENCE: Glitches in Databases, How to Ensure Data Quality by Outlier Detection Techniques

Author:

Ciro D'Urso

Journal of Data and Information Quality (JDIQ), Volume 7, Issue 3

Article No.: 14, Pages 1 - 22

https://doi.org/10.1145/2950109

Published: 22 September 2016

Abstract

Enterprise archives are inevitably affected by data quality problems (also called glitches). This article proposes a new method for analyzing the quality of datasets stored in the tables of a database, with no knowledge of the semantics of the data and without the need to define repositories of rules. The method is based on suitable revisions of several outlier detection approaches, which are combined to boost overall performance and accuracy. A novel transformation algorithm treats the items in database tables as data points in an n-dimensional real coordinate space, so that fields containing dates and fields containing text can be processed to calculate distances between those data points. An iterative approach ensures that global and local outliers are discovered even when they are subject to masking and swamping effects, which occur primarily in datasets with multiple outliers or clusters of outliers. Applied to a set of archives, some of which have been studied extensively in the literature, the method provides very promising experimental results and outperforms the application of any single technique on its own. Finally, a list of future research directions is highlighted.
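
The abstract does not spell out the transformation algorithm or the detectors that are combined, so the following is only a minimal illustrative sketch of the general idea, not the article's method. It assumes a hypothetical encoding (dates mapped to their ordinal day number, text mapped to its length) and swaps in a simple median/MAD score as the detector; the names `to_point`, `robust_scores`, `iterative_outliers`, and the `threshold` value are all invented for illustration. The loop removes the single most extreme row and rescores the rest, which is one common way to limit masking by other outliers.

```python
from datetime import date
from statistics import median


def to_point(row):
    """Map one table row to a list of floats (hypothetical encoding)."""
    point = []
    for value in row:
        if isinstance(value, date):
            # Assumption: a date becomes its proleptic Gregorian ordinal.
            point.append(float(value.toordinal()))
        elif isinstance(value, str):
            # Assumption: text is reduced to its length, a crude placeholder.
            point.append(float(len(value)))
        else:
            point.append(float(value))
    return point


def robust_scores(points):
    """Score each point by its largest per-column |x - median| / MAD."""
    columns = list(zip(*points))
    medians = [median(col) for col in columns]
    # Fall back to 1.0 when the MAD is zero to avoid division by zero.
    mads = [
        median(abs(x - m) for x in col) or 1.0
        for col, m in zip(columns, medians)
    ]
    return [
        max(abs(x - m) / d for x, m, d in zip(p, medians, mads))
        for p in points
    ]


def iterative_outliers(rows, threshold=5.0):
    """Flag the worst row, drop it, rescore, and repeat until none is extreme."""
    points = [to_point(r) for r in rows]
    remaining = list(range(len(points)))
    flagged = []
    while len(remaining) > 2:
        scores = robust_scores([points[i] for i in remaining])
        worst = max(range(len(scores)), key=scores.__getitem__)
        if scores[worst] <= threshold:
            break
        flagged.append(remaining.pop(worst))
    return sorted(flagged)
```

Recomputing the medians after every removal is the point of the iteration: an extreme row can distort the statistics used to judge the others, so each pass scores the remaining rows against a cleaner baseline. A real implementation would need a far richer text encoding and a multivariate distance than this sketch uses.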

Index Terms

1. Applied computing
  1. Enterprise computing
    1. Enterprise data management
2. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning

Published In

Journal of Data and Information Quality Volume 7, Issue 3

Research Paper, Challenge Papers and Experience Paper

September 2016

62 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/2988525

Editor:
Louiqa Raschid
University of Maryland, College Park, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 September 2016

Accepted: 01 July 2016

Revised: 01 June 2016

Received: 01 March 2015

Published in JDIQ Volume 7, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 8
Total Downloads: 422
Downloads (last 12 months): 11
Downloads (last 6 weeks): 0

Reflects downloads up to 16 Oct 2024

Cited By

Sikder M, Batarseh F (2023). Outlier detection using AI: a survey. AI Assurance, 231-291. Online publication date: 2023.
https://doi.org/10.1016/B978-0-32-391919-7.00020-2
Kim J, Joung J, Lee B (2022). A Study on the Preprocessing Method for Power System Applications Based on Polynomial and Standard Patterns. Energies 15:4 (1441). Online publication date: 16-Feb-2022.
https://doi.org/10.3390/en15041441
de C. Costa R, Moreira J (2022). Automatic Quality Improvement of Data on the Evolution of 2D Regions. Advanced Data Mining and Applications, 288-300. Online publication date: 2-Feb-2022.
https://dl.acm.org/doi/10.1007/978-3-030-95408-6_22
Bloch L, Friedrich C (2021). Data analysis with Shapley values for automatic subject selection in Alzheimer’s disease data sets using interpretable machine learning. Alzheimer's Research & Therapy 13:1. Online publication date: 15-Sep-2021.
https://doi.org/10.1186/s13195-021-00879-4
Costa R, Miranda E, Dias P, Moreira J (2021). Experience. Journal of Data and Information Quality 13:1 (1-13). Online publication date: 13-Jan-2021.
https://dl.acm.org/doi/10.1145/3428155
Wang H, Bah M, Hammad M (2019). Progress in Outlier Detection Techniques: A Survey. IEEE Access 7 (107964-108000). Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2019.2932769
Wu X, Cai L, Ji R (2018). Gamma Mixture Models for Outlier Removal. 2018 25th IEEE International Conference on Image Processing (ICIP), 828-832. Online publication date: Oct-2018.
https://doi.org/10.1109/ICIP.2018.8451217
Dong X, He H, Li C, Liu Y, Xiong H (2018). Scene-Based Big Data Quality Management Framework. Data Science, 122-139. Online publication date: 9-Sep-2018.
https://doi.org/10.1007/978-981-13-2203-7_10
