research-article

CleanCloud: Cleaning Big Data on Cloud

Authors:

Hong Gao

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pages 2543 - 2546

https://doi.org/10.1145/3132847.3133187

Published: 06 November 2017

Abstract

We describe CleanCloud, a system for cleaning big data based on the Map-Reduce paradigm in the cloud. Using Map-Reduce, the system detects and repairs various data quality problems in big data. We demonstrate the following features of CleanCloud: (a) support for cleaning multiple data quality problems in big data; (b) a visual tool for monitoring the status of the cleaning process and tuning data cleaning parameters; (c) a friendly interface for data input and configuration, as well as for collecting the cleaned data. CleanCloud is a promising system that provides a scalable and effective data cleaning mechanism for big data stored in either files or databases.
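
The abstract does not spell out CleanCloud's algorithms, so the following is only a minimal sketch of how one data quality problem (duplicate records) might be detected and repaired in a Map-Reduce style. It runs in a single process, and every name in it (`blocking_key`, `map_phase`, `shuffle`, `reduce_phase`, `clean`, and the `name`/`city` fields) is a hypothetical choice for illustration, not part of CleanCloud.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]


def blocking_key(record: Record) -> str:
    """Assumed blocking key: the name, lowercased with whitespace collapsed."""
    return " ".join(record["name"].lower().split())


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, Record]]:
    """Map: emit (key, record) so that likely duplicates share a key."""
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs: Iterable[Tuple[str, Record]]) -> Dict[str, List[Record]]:
    """Shuffle: group records by key, as a Map-Reduce framework would."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return groups


def reduce_phase(group: List[Record]) -> Record:
    """Reduce: merge a group into one record.

    For each field, the first non-empty value seen in the group wins.
    """
    merged: Record = {}
    for record in group:
        for field, value in record.items():
            if not merged.get(field):
                merged[field] = value
    return merged


def clean(records: Iterable[Record]) -> List[Record]:
    """Run map, shuffle, and reduce in sequence on an in-memory dataset."""
    groups = shuffle(map_phase(records))
    return [reduce_phase(group) for group in groups.values()]


if __name__ == "__main__":
    sample = [
        {"name": "Ann  Lee", "city": ""},
        {"name": "ann lee", "city": "Harbin"},
        {"name": "Bob Wu", "city": "Singapore"},
    ]
    for row in clean(sample):
        print(row)
```

In this sketch the two "Ann Lee" rows share a blocking key, so the reduce step merges them and fills the missing city from the second row. A real deployment would hand the map and reduce functions to a cluster framework and would likely use a similarity measure rather than exact key equality.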

Cited By

Hosseinzadeh M, Azhir E, Ahmed O, Ghafour M, Ahmed S, Rahmani A, Vo B (2021). Data cleansing mechanisms and approaches for big data analytics: a systematic study. Journal of Ambient Intelligence and Humanized Computing 14:1 (99-111). Online publication date: 17-Nov-2021.
https://doi.org/10.1007/s12652-021-03590-2

Index Terms

CleanCloud: Cleaning Big Data on Cloud
1. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning
      2. Entity resolution

Information

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

November 2017

2604 pages

ISBN:9781450349185

DOI:10.1145/3132847

General Chairs:
Ee-Peng Lim (Singapore Management University, Singapore)
Marianne Winslett (University of Illinois at Urbana-Champaign, USA, and Advanced Digital Sciences Center, Singapore)

Program Chairs:
Mark Sanderson (RMIT, Australia)
Ada Fu (Chinese University of Hong Kong, Hong Kong)
Jimeng Sun (Georgia Tech, USA)
Shane Culpepper (RMIT, Australia)
Eric Lo (Chinese University of Hong Kong, Hong Kong)
Joyce Ho (Emory University, USA)
Debora Donato (Mix Tech, Inc., USA)
Rakesh Agrawal (Data Insights Laboratories, USA)
Yu Zheng (Microsoft Research Asia, China)
Carlos Castillo (Qatar Computing Research Institute, Qatar)
Aixin Sun (Nanyang Technological University, Singapore)
Vincent S. Tseng (National Cheng Kung University, Taiwan)
Chenliang Li (Wuhan University, China)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Sci-Tech Support Plan
MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology
Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province
National Natural Science Foundation of China

Conference

CIKM '17

Sponsor:

CIKM '17: ACM Conference on Information and Knowledge Management

November 6 - 10, 2017

Singapore, Singapore

Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Article Metrics

Total Citations: 1
Total Downloads: 235
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 1

Reflects downloads up to 06 Feb 2025
