DOI: 10.1145/3448016.3450583

Contextual Data Cleaning with Ontology FDs

Published: 18 June 2021

Abstract

Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We motivate the need to include context in data cleaning in order to account for the subjective nature of data quality. We enhance dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. We study the data and ontology repair problem for a set of OFDs, and propose an algorithm that finds the best ontological interpretation of the data that minimizes the number of repairs.
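
The contrast between a syntactic FD check and a synonym-based OFD check can be illustrated with a minimal sketch. It is not the authors' algorithm: the toy ontology `SENSES`, the attribute names `CC` and `CTRY`, and both helper functions are hypothetical, and the OFD test assumes one common reading of synonym OFDs, namely that all right-hand-side values in a group of tuples agreeing on the left-hand side must share at least one ontology sense.

```python
from collections import defaultdict

# Hypothetical toy ontology: each value maps to the set of senses (concepts)
# it can name. A real ontology would be loaded from an external source.
SENSES = {
    "United States": {"country:US"},
    "USA": {"country:US"},
    "America": {"country:US", "continent:Americas"},
    "Canada": {"country:CA"},
}

# Hypothetical sample relation: country code (CC) and country name (CTRY).
ROWS = [
    {"CC": "US", "CTRY": "United States"},
    {"CC": "US", "CTRY": "USA"},
    {"CC": "US", "CTRY": "America"},
    {"CC": "CA", "CTRY": "Canada"},
    {"CC": "CA", "CTRY": "USA"},
]


def group_by(rows, lhs, rhs):
    """Collect the distinct rhs values for each combination of lhs values."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].add(row[rhs])
    return groups


def fd_violations(rows, lhs, rhs):
    """Classical FD lhs -> rhs: a group violates it if rhs values differ syntactically."""
    return {k: v for k, v in group_by(rows, lhs, rhs).items() if len(v) > 1}


def synonym_ofd_violations(rows, lhs, rhs, senses):
    """Synonym OFD (assumed semantics): a group violates it if no single
    sense covers every rhs value in the group."""
    violations = {}
    for key, values in group_by(rows, lhs, rhs).items():
        # Values missing from the ontology get a private sense, so they
        # only match themselves.
        shared = set.intersection(
            *(senses.get(v, {("literal", v)}) for v in values)
        )
        if not shared:
            violations[key] = values
    return violations


fd_bad = fd_violations(ROWS, ["CC"], "CTRY")
ofd_bad = synonym_ofd_violations(ROWS, ["CC"], "CTRY", SENSES)
```

On this sample, the FD check flags both the `US` and the `CA` groups, while the synonym check flags only `CA`, because "United States", "USA", and "America" share the sense `country:US`. Choosing which sense to adopt per group, and whether to repair the data or the ontology, is the harder optimization problem the abstract refers to and is not attempted here.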

Cited By

  • (2024) An Automatic Near-Duplicate Video Data Cleaning Method Based on a Consistent Feature Hash Ring. Electronics 13:8 (1522). DOI: 10.3390/electronics13081522. Online publication date: 17-Apr-2024

    Published In

    SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
    June 2021
    2969 pages
    ISBN: 9781450383431
    DOI: 10.1145/3448016
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. constraint-based cleaning
    2. data cleaning
    3. ontology functional dependencies

    Qualifiers

    • Abstract

    Funding Sources

    • Natural Sciences and Engineering Research Council

    Conference

    SIGMOD/PODS '21

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
