EXPERIENCE: Glitches in Databases, How to Ensure Data Quality by Outlier Detection Techniques

Published: 22 September 2016

Abstract

Enterprise archives are inevitably affected by data quality problems (also called glitches). This article proposes a new method for analyzing the quality of datasets stored in database tables that requires no knowledge of the semantics of the data and no repository of rules. The method combines suitably revised versions of several outlier detection approaches to improve overall performance and accuracy. A novel transformation algorithm treats the items in database tables as data points in the real coordinate space of n dimensions; fields containing dates and fields containing text are processed so that distances between those data points can be calculated. An iterative approach ensures that global and local outliers are discovered even when they are subject to masking and swamping effects, which arise primarily in datasets with multiple outliers or clusters of outliers. Applied to a set of archives, some of which have been studied extensively in the literature, the method gives very promising experimental results and outperforms any single technique applied on its own. Finally, a list of future research directions is presented.
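The abstract does not spell out the transformation or the combined detectors, so the following is only a minimal sketch of the general idea, not the article's algorithm. It assumes a hypothetical encoding (dates become day counts, text becomes its length), standardizes each column, scores every row by the distance to its k-th nearest neighbour, and removes the highest-scoring row one at a time before re-scoring, which is one simple way to let outliers masked by a more extreme point surface in later rounds. The function names, the `factor` threshold and the sample rows are all illustrative assumptions.

```python
from datetime import date
from math import sqrt
from statistics import mean, median, pstdev


def to_point(row):
    """Map one table row to a numeric vector (hypothetical encoding)."""
    point = []
    for value in row:
        if isinstance(value, date):
            point.append(float(value.toordinal()))  # date -> day count
        elif isinstance(value, str):
            point.append(float(len(value)))  # text -> length (crude assumption)
        else:
            point.append(float(value))
    return point


def standardize(points):
    """Z-score each column so that no single field dominates the distance."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # constant column: avoid /0
    return [[(x - m) / s for x, m, s in zip(p, mus, sigmas)] for p in points]


def knn_score(points, i, k):
    """Euclidean distance from point i to its k-th nearest neighbour."""
    dists = sorted(
        sqrt(sum((a - b) ** 2 for a, b in zip(points[i], q)))
        for j, q in enumerate(points)
        if j != i
    )
    return dists[k - 1]


def iterative_outliers(rows, k=2, factor=3.0, max_rounds=10):
    """Flag the highest-scoring row, drop it, re-score, and repeat.

    A row is flagged only while its score exceeds `factor` times the
    median score of the rows still under consideration.
    """
    remaining = list(range(len(rows)))
    flagged = []
    for _ in range(max_rounds):
        if len(remaining) <= k + 1:
            break
        pts = standardize([to_point(rows[i]) for i in remaining])
        scores = [knn_score(pts, n, k) for n in range(len(pts))]
        worst = max(range(len(scores)), key=scores.__getitem__)
        if scores[worst] <= factor * median(scores):
            break
        flagged.append(remaining.pop(worst))
    return flagged


# Illustrative rows: (amount, booking date, customer name); the last is a glitch.
rows = [
    (100.0, date(2016, 1, 1), "alice"),
    (102.0, date(2016, 1, 2), "bob"),
    (98.0, date(2016, 1, 3), "carol"),
    (101.0, date(2016, 1, 4), "dave"),
    (99.0, date(2016, 1, 5), "erin"),
    (100.0, date(2016, 1, 6), "frank"),
    (5000.0, date(1990, 1, 1), "x" * 40),
]
flagged = iterative_outliers(rows)
print(flagged)
```

On these sample rows only the last row is flagged. Re-standardizing after each removal matters in this sketch: a single extreme row inflates every column's spread and compresses the remaining rows together, so their scores are only meaningful once it has been dropped. A relative threshold is used because absolute distances in a standardized space depend on the number of columns.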

      Published In

      Journal of Data and Information Quality  Volume 7, Issue 3
      Research Paper, Challenge Papers and Experience Paper
      September 2016
      62 pages
      ISSN:1936-1955
      EISSN:1936-1963
      DOI:10.1145/2988525
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 September 2016
      Accepted: 01 July 2016
      Revised: 01 June 2016
      Received: 01 March 2015
      Published in JDIQ Volume 7, Issue 3

      Author Tags

      1. Data quality process
      2. data preparation for econometrics of public policy evaluation
      3. databases
      4. outlier identification

      Qualifiers

      • Research-article
      • Research
      • Refereed
