
Automated error detection using association rules

Published: 01 September 2011

Abstract

High data quality is important for every application. Inaccurate or inadequate data can lead to inappropriate assumptions, misleading results, bias, and ultimately poor policy and decision making. Finding errors and cleaning data is a time-consuming process. This paper presents a framework for automatically detecting unusual and erroneous data values in datasets. The main idea is to generate association rules with very high confidence and then to identify the cases that are exceptions to these rules, that is, records that match a rule's antecedent but not its consequent. Experimental results show that the proposed framework successfully identifies erroneous values in large datasets.
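The main idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `find_rule_exceptions`, the thresholds, and the toy data are assumptions made for illustration. For brevity it considers only rules with a single attribute value on each side and counts co-occurrences directly instead of running a full association rule miner such as Apriori.

```python
from collections import Counter
from itertools import permutations


def find_rule_exceptions(rows, min_support=3, min_confidence=0.9):
    """Flag rows that violate high-confidence single-antecedent rules.

    rows: list of dicts mapping attribute name -> categorical value
          (every row is assumed to have the same attributes).
    min_confidence is assumed to be above 0.5, so that each antecedent
    has at most one expected value per consequent attribute.
    Returns a list of (row_index, antecedent, expected, observed) tuples.
    """
    if not rows:
        return []
    attributes = sorted(rows[0])
    antecedent_counts = Counter()  # occurrences of each (attribute, value)
    pair_counts = Counter()        # co-occurrences of two (attribute, value) items
    for row in rows:
        for a in attributes:
            antecedent_counts[(a, row[a])] += 1
        for a, b in permutations(attributes, 2):
            pair_counts[(a, row[a]), (b, row[b])] += 1

    # Keep rules X -> Y with high but imperfect confidence: a rule with
    # confidence 1.0 has no exceptions, so it cannot flag anything.
    rules = {}
    for (lhs, rhs), n in pair_counts.items():
        confidence = n / antecedent_counts[lhs]
        if n >= min_support and min_confidence <= confidence < 1.0:
            rules[(lhs, rhs[0])] = rhs[1]

    # An exception matches the antecedent but not the expected consequent.
    exceptions = []
    for i, row in enumerate(rows):
        for (lhs, rhs_attr), expected in rules.items():
            lhs_attr, lhs_value = lhs
            if row[lhs_attr] == lhs_value and row[rhs_attr] != expected:
                exceptions.append(
                    (i, lhs, (rhs_attr, expected), (rhs_attr, row[rhs_attr]))
                )
    return exceptions


if __name__ == "__main__":
    # Hypothetical toy data: one record contradicts "city=Paris -> country=France".
    data = (
        [{"city": "Paris", "country": "France"}] * 19
        + [{"city": "Paris", "country": "Germany"}]
        + [{"city": "Berlin", "country": "Germany"}] * 5
    )
    for hit in find_rule_exceptions(data):
        print(hit)
```

A flagged row is only a candidate error. A rare but legitimate combination of values also appears as an exception, so the output is better treated as a review list than as a set of automatic corrections.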



Published In

Intelligent Data Analysis, Volume 15, Issue 5
September 2011
166 pages

Publisher

IOS Press, Netherlands

Author Tags

1. Association Rules
2. Data Cleaning
3. Data Mining
4. Data Quality
5. Error Detection
6. Market Basket
7. Outlier Detection
