DOI: 10.1145/3132847.3133187

CleanCloud: Cleaning Big Data on Cloud

Published: 06 November 2017

Abstract

We describe CleanCloud, a system for cleaning big data in the cloud based on the Map-Reduce paradigm. Using Map-Reduce, the system detects and repairs a variety of data quality problems in big data. We demonstrate the following features of CleanCloud: (a) support for cleaning multiple kinds of data quality problems in big data; (b) a visual tool for monitoring the status of the cleaning process and tuning its parameters; (c) a friendly interface for data input and configuration, as well as for collecting the cleaned data. CleanCloud is a promising system that provides a scalable and effective data cleaning mechanism for big data stored in either files or databases.
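
To make the Map-Reduce framing concrete, here is a minimal single-process sketch of how a cleaning task could be split into map, shuffle, and reduce steps. It is not CleanCloud's implementation, which the abstract does not describe. The function names, the sample records, and the cleaning rules (grouping on a normalized `name` field, then filling each field with the most common non-missing value in the group) are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def map_phase(records, key_fields):
    """Emit (key, record) pairs; the key is a normalized tuple of the chosen fields."""
    for rec in records:
        key = tuple(str(rec.get(f) or "").strip().lower() for f in key_fields)
        yield key, rec


def shuffle(pairs):
    """Group records by key, as a Map-Reduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    return groups


def reduce_phase(group):
    """Merge one group of likely duplicates into a single repaired record."""
    # Collect field names in first-seen order.
    fields = []
    for rec in group:
        for f in rec:
            if f not in fields:
                fields.append(f)
    merged = {}
    for f in fields:
        # Ignore missing values (None or empty string), then take the most common one.
        values = [rec[f] for rec in group if rec.get(f) not in (None, "")]
        merged[f] = Counter(values).most_common(1)[0][0] if values else None
    return merged


def clean(records, key_fields):
    """Run map, shuffle, and reduce in sequence over an in-memory list of records."""
    groups = shuffle(map_phase(records, key_fields))
    return [reduce_phase(g) for g in groups.values()]


# Hypothetical sample data: two spellings of one person, each missing a value.
records = [
    {"name": "Ann Lee", "city": "Harbin", "phone": None},
    {"name": "ann lee ", "city": "Harbin", "phone": "555-0100"},
    {"name": "Bo Chen", "city": "", "phone": "555-0199"},
]
cleaned = clean(records, key_fields=["name"])
```

In this sketch the two "Ann Lee" records land in the same group, so the reduce step can fill the missing phone number from the duplicate. On a real cluster the shuffle is done by the framework, and each reduce call can run on a different node, which is what makes this decomposition scale.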

Cited By

  • (2021) Data cleansing mechanisms and approaches for big data analytics: a systematic study. Journal of Ambient Intelligence and Humanized Computing 14:1 (99-111). DOI: 10.1007/s12652-021-03590-2. Online publication date: 17-Nov-2021

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN:9781450349185
DOI:10.1145/3132847
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2017

Author Tags

  1. data cleaning
  2. entity resolution
  3. parallel computing

Qualifiers

  • Research-article

Funding Sources

  • National Sci-Tech Support Plan
  • MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology
  • Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province
  • National Natural Science Foundation of China

Conference

CIKM '17
Acceptance Rates

CIKM '17 Paper Acceptance Rate: 171 of 855 submissions, 20%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 11
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 06 Feb 2025
