interview

Cleanix: a Parallel Big Data Cleaning System

Published: 09 May 2016

    Abstract

    For big data, data quality problems are more serious. A big data cleaning system requires scalability and the ability to handle mixed errors. Motivated by this, we develop Cleanix, a prototype system for cleaning relational Big Data. Cleanix takes data integrated from multiple data sources and cleans them on a shared-nothing machine cluster. The backend system is built on top of an extensible and flexible data-parallel substrate, the Hyracks framework. Cleanix supports various data cleaning tasks such as abnormal value detection and correction, incomplete data filling, de-duplication, and conflict resolution. In this paper, we present the organization, data cleaning algorithms, and design of Cleanix.



    Published In

    ACM SIGMOD Record, Volume 44, Issue 4
    December 2015
    59 pages
    ISSN:0163-5808
    DOI:10.1145/2935694

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Interview
