Methodologies for data quality assessment and improvement

Published: 30 July 2009

Abstract

The literature provides a wide range of techniques to assess and improve the quality of data. Due to the diversity and complexity of these techniques, research has recently focused on defining methodologies that help the selection, customization, and application of data quality assessment and improvement techniques. The goal of this article is to provide a systematic and comparative description of such methodologies. Methodologies are compared along several dimensions, including the methodological phases and steps, the strategies and techniques, the data quality dimensions, the types of data, and, finally, the types of information systems addressed by each methodology. The article concludes with a summary description of each methodology.

      Reviews

      Andreas E. Schwald

      Data quality comprises a wide subject area, with a variety of dissimilar issues. It is far from trivial to compile a comprehensive survey of the field. This treatise on data quality assessment and improvement presents 13 methodologies, over 50 pages, and lists 92 references up to the year 2007. It aims to provide a "systematic and comparative description along several dimensions, including phases and steps, strategies and techniques, data quality dimensions, types of data, and types of information systems." The paper may be unsatisfactory and too shallow for an advocate of a particular methodology. However, it can be quite helpful for a quick overview, especially for those who are looking for improvement, implementation advice, or new ideas. It covers a wide field and stimulates the interest of the reader, although, in most cases, a reference is needed to obtain an answer to a specific question or for an in-depth treatment of a topic. This is a noteworthy effort that sums up a great deal of information from a rather heterogeneous field. It covers several publications that might not be available in a library of modest size, thereby bringing this information to the attention of a wider reader community. However, whether quality is in the eyes of an observer or in measurable attributes of an object remains an open question.

      Online Computing Reviews Service

      Information

      Published In

      ACM Computing Surveys  Volume 41, Issue 3
      July 2009
      284 pages
      ISSN:0360-0300
      EISSN:1557-7341
      DOI:10.1145/1541880
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 July 2009
      Accepted: 01 May 2008
      Revised: 01 December 2007
      Received: 01 December 2006
      Published in CSUR Volume 41, Issue 3

      Author Tags

      1. Data quality
      2. data quality assessment
      3. data quality improvement
      4. data quality measurement
      5. information system
      6. methodology
      7. quality dimension

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • European IST
