DOI: 10.1145/3216122.3216154
research-article

Efficient Big Data Clustering

Published: 18 June 2018 Publication History

Abstract

The need to support advanced analytics on Big Data is driving data scientists' interest toward massively parallel distributed systems and software platforms, such as Map-Reduce and Spark, that make their scalable utilization possible. However, when complex data mining algorithms are required, their fully scalable deployment on such platforms faces a number of technical challenges that grow with the complexity of the algorithms involved. Thus, algorithms that were originally designed for sequential execution must often be redesigned in order to use the distributed computational resources effectively. In this paper, we explore these problems and then propose a solution that has proven to be very effective on the complex hierarchical clustering algorithm CLUBS+. By using four stages of successive refinement, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion. Experimental results confirm the accuracy and scalability of CLUBS+.
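
The redesign problem the abstract describes can be illustrated with a generic sketch. It is not taken from the paper and is not CLUBS+ itself. It assumes the data is split into non-empty partitions and that the quantities of interest are a centroid and the sum of squared errors (SSE) around it. Each partition is summarised independently (the map step), and the summaries are merged with an associative function (the reduce step), so no single worker ever needs the full data set. The function names `partial_stats`, `merge_stats`, and `centroid_and_sse` are hypothetical.

```python
from functools import reduce


def partial_stats(partition):
    """Map step: summarise one non-empty partition of points.

    Returns (count, per-dimension sums, per-dimension sums of squares).
    """
    dims = len(partition[0])
    count = len(partition)
    sums = [sum(p[d] for p in partition) for d in range(dims)]
    sq_sums = [sum(p[d] ** 2 for p in partition) for d in range(dims)]
    return count, sums, sq_sums


def merge_stats(a, b):
    """Reduce step: combine two summaries. Associative and commutative."""
    return (
        a[0] + b[0],
        [x + y for x, y in zip(a[1], b[1])],
        [x + y for x, y in zip(a[2], b[2])],
    )


def centroid_and_sse(partitions):
    """Compute the global centroid and SSE from per-partition summaries."""
    count, sums, sq_sums = reduce(merge_stats, map(partial_stats, partitions))
    centroid = [s / count for s in sums]
    # Identity used: SSE = sum ||p||^2 - n * ||centroid||^2
    sse = sum(sq_sums) - count * sum(c * c for c in centroid)
    return centroid, sse


if __name__ == "__main__":
    # Two partitions standing in for data held by two workers.
    partitions = [[(0.0, 0.0), (2.0, 0.0)], [(0.0, 2.0), (2.0, 2.0)]]
    print(centroid_and_sse(partitions))
```

Because `merge_stats` is associative, the same two functions could in principle be handed to a map and a reduce operation on a distributed platform. The plain lists above only simulate that setting.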


Published In

IDEAS '18: Proceedings of the 22nd International Database Engineering & Applications Symposium
June 2018
328 pages
ISBN:9781450365277
DOI:10.1145/3216122

In-Cooperation

  • Concordia University

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Big Data
  2. Clustering
  3. Spark

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IDEAS 2018
