Improving data quality by source analysis

Published: 02 March 2012

    Abstract

    In many domains, data cleaning is hampered by our limited ability to specify a comprehensive set of integrity constraints to assist in the identification of erroneous data. An alternative approach to improving data quality is to exploit different data sources that contain information about the same set of objects. Such overlapping sources highlight hot spots of poor data quality through conflicting data values and immediately provide alternative values for conflict resolution. To derive a dataset of high quality, we can merge the overlapping sources based on a quality assessment of the conflicting values. The quality of the resulting dataset, however, is highly dependent on our ability to assess the quality of conflicting values effectively.
    The main objective of this article is to introduce methods that aid the developer of an integrated system over overlapping but contradicting sources in the task of improving the quality of data. Value conflicts between contradicting sources are often systematic, caused by some characteristic of the different sources. Our goal is to identify such systematic differences and outline data patterns that occur in conjunction with them. Evaluated by an expert user, the regularities discovered provide insights into possible conflict reasons and help to assess the quality of inconsistent values. The contributions of this article are two concepts of systematic conflicts: contradiction patterns and minimal update sequences. Contradiction patterns resemble a special form of association rules that summarize characteristic data properties for conflict occurrence. We adapt existing association rule mining algorithms for mining contradiction patterns. Contradiction patterns, however, view each class of conflicts in isolation, sometimes leading to largely overlapping patterns. Sequences of set-oriented update operations that transform one data source into the other are compact descriptions for all regular differences among the sources. We consider minimal update sequences as the most likely explanation for observed differences between overlapping data sources. Furthermore, the order of operations within the sequences points out potential dependencies between systematic differences. Finding minimal update sequences, however, is beyond reach in practice. We show that the problem is already NP-complete for a restricted set of operations. In light of this intractability result, we present heuristics that lead to convincing results for all examples we considered.
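
    To make the idea of a contradiction pattern concrete, the following is a minimal sketch and not the article's algorithm. It assumes two sources keyed by a shared object identifier, treats a pattern as a set of attribute-value conditions read from the first source, and scores each pattern by how many conflicting objects it covers (support) and what fraction of the covered objects are in conflict (confidence). It enumerates candidates by brute force instead of adapting an association rule miner. The function names, thresholds, and the sample records are all hypothetical.

```python
from itertools import combinations


def find_conflicts(source_a, source_b, attribute):
    """Ids of objects present in both sources whose values for `attribute` differ."""
    shared = source_a.keys() & source_b.keys()
    return {oid for oid in shared
            if source_a[oid][attribute] != source_b[oid][attribute]}


def contradiction_patterns(source_a, source_b, target,
                           min_support=2, min_confidence=0.8, max_len=2):
    """Brute-force search for conditions that co-occur with conflicts in `target`.

    Returns a list of (itemset, support, confidence) tuples, where an itemset
    is a tuple of (attribute, value) conditions evaluated on source_a.
    Values are assumed to be strings so that candidate items can be sorted.
    """
    shared = sorted(source_a.keys() & source_b.keys())
    conflicts = find_conflicts(source_a, source_b, target)

    # Candidate conditions: every (attribute, value) pair seen in source_a,
    # excluding the attribute whose conflicts we want to explain.
    items = sorted({(attr, val)
                    for oid in shared
                    for attr, val in source_a[oid].items()
                    if attr != target})

    patterns = []
    for size in range(1, max_len + 1):
        for itemset in combinations(items, size):
            # Two different values for one attribute can never hold together.
            if len({attr for attr, _ in itemset}) < size:
                continue
            covered = [oid for oid in shared
                       if all(source_a[oid].get(a) == v for a, v in itemset)]
            if not covered:
                continue
            support = sum(1 for oid in covered if oid in conflicts)
            confidence = support / len(covered)
            if support >= min_support and confidence >= min_confidence:
                patterns.append((itemset, support, confidence))
    return patterns


# Hypothetical overlapping sources describing the same five objects.
source_a = {
    "p1": {"method": "NMR", "year": "1998", "resolution": "none"},
    "p2": {"method": "NMR", "year": "2001", "resolution": "none"},
    "p3": {"method": "X-ray", "year": "1998", "resolution": "2.0"},
    "p4": {"method": "X-ray", "year": "2001", "resolution": "1.8"},
    "p5": {"method": "NMR", "year": "2001", "resolution": "none"},
}
source_b = {
    "p1": {"method": "NMR", "year": "1998", "resolution": "0.0"},
    "p2": {"method": "NMR", "year": "2001", "resolution": "0.0"},
    "p3": {"method": "X-ray", "year": "1998", "resolution": "2.0"},
    "p4": {"method": "X-ray", "year": "2001", "resolution": "1.8"},
    "p5": {"method": "NMR", "year": "2001", "resolution": "0.0"},
}

patterns = contradiction_patterns(source_a, source_b, target="resolution")
for itemset, support, confidence in patterns:
    print(itemset, support, confidence)
```

    On this toy data the sketch reports both the condition method = NMR and the more specific method = NMR with year = 2001. The second adds nothing beyond the first, which illustrates the kind of overlap between patterns that the abstract mentions.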

    Information

    Published In

    Journal of Data and Information Quality  Volume 2, Issue 4
    February 2012
    62 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/2107536

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 02 March 2012
    Accepted: 01 December 2011
    Revised: 01 November 2011
    Received: 01 May 2011
    Published in JDIQ Volume 2, Issue 4

    Author Tags

    1. Conflict resolution
    2. data cleaning
    3. quality assessment
    4. semantic distance measure

    Qualifiers

    • Research-article
    • Research
    • Refereed
