A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines
Abstract
1. Introduction
Algorithm 1: The sequential k-means algorithm.
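As a reference point for the distributed variants discussed later, the sequential algorithm can be sketched in a few lines. The following is a minimal Python illustration of Lloyd's algorithm, not the authors' implementation; the function and parameter names are ours:

```python
import random

def kmeans(points, k, max_iter=100, tol=1e-4, seed=0):
    """Plain sequential k-means (Lloyd's algorithm).

    points: list of equal-length feature vectors (lists of floats)."""
    rng = random.Random(seed)
    # Initialize centroids with k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centroids.append(
                    [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Convergence check: stop once no centroid moved more than tol.
        shift = max(sum((a - b) ** 2 for a, b in zip(c0, c1))
                    for c0, c1 in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids
```

The assignment and update steps are exactly the two phases that the distributed versions below parallelize: assignment is embarrassingly parallel over the data, while the update requires a global aggregation.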
2. Related Work
3. Implementation
3.1. The Actor System
3.2. The Actor System on a Cluster of Nodes
- Join the cluster. This is done using the cluster extension, which, behind the scenes, handles all aspects of joining the cluster. This of course requires specifying the correct seed nodes in the configuration file.
- Recognize the other parts of the actor system and the role assigned to each one of them. This is attained by subscribing to change notifications of the cluster membership. The subscribed nodes receive information about the joining nodes and the role assigned to each one.
- Exchange messages with the targeted nodes transparently without necessarily knowing their physical locations. The actor system in each node is programmed to communicate with the remote actor system using its logical address. The physical address is automatically identified at runtime.
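For concreteness, joining via seed nodes is typically driven by each node's configuration file. The fragment below is an illustrative sketch for a recent Akka version; the hostnames, ports, and the system name ClusterSystem are placeholders, not taken from the paper:

```hocon
akka {
  actor.provider = "cluster"
  remote.artery.canonical {
    hostname = "node1.example.org"   # this node's own address
    port = 2551
  }
  cluster {
    seed-nodes = [                   # contact points used to join the cluster
      "akka://ClusterSystem@seed1.example.org:2551",
      "akka://ClusterSystem@seed2.example.org:2552"
    ]
    roles = ["worker"]               # role advertised to the other members
  }
}
```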
- Joining the cluster: Each node in the cluster is identified by a <hostname:port:uid> tuple, where the uid uniquely identifies a specific actor system running on the <hostname:port> machine. Some nodes must be designated as seed nodes with known IP addresses and port numbers. These nodes serve as contact points for new nodes joining the cluster. A node is introduced to the cluster by sending a JOIN message to one of the cluster members. If the cluster consists only of seed nodes, this message is sent to one of them; otherwise, the message can be sent to any member of the cluster. Figure 3 provides an overview of a cluster with five nodes, two of which are seed nodes.
- Assigning a leader: The role of the leader in a cluster is to add nodes to and remove nodes from the cluster. This role is given to the first node at which the cluster converges to a globally consistent state, and it may shift to a different node between rounds of convergence.
- Membership Lifecycle: After sending a JOIN message, the state of the node is switched to Joining. Once all nodes have seen that a node is joining the cluster, the state of that node is changed to Up, and it can then participate in the cluster. If a node opts to leave the cluster, its state is changed to Leaving. Once all nodes have seen that a node is leaving, the state is switched to Exiting, and the node will eventually be marked as Removed. Each node in the cluster is monitored by several other nodes. When a node becomes unreachable, for whatever reason, the rest of the nodes in the cluster receive this information via the gossip protocol. The unreachable node must become reachable again or be removed from the cluster, as a cluster with an unreachable node cannot add new nodes.
- Node roles: A specific role can be assigned to any node in the cluster. The cluster can then be configured to specify the required number of nodes with a certain role. In such a case, the status of the joining nodes will only be changed to Up when the required number of nodes have already joined the cluster.
- Failure detector: The nodes in a cluster monitor each other using heartbeating. Each node keeps a history of the heartbeats of every node in the cluster. Based on that history, a decision is made about whether a node is reachable. A single missed heartbeat, however, does not immediately mark a node as failed. When a node determines that another node is unreachable, it disseminates this information to the other nodes in the cluster. Once all nodes have received this information (i.e., convergence is reached), that node is marked as unreachable. All nodes keep trying to reconnect to the unreachable node. If it remains disconnected for a relatively long period of time, it is marked as Down and removed from the cluster.
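Akka's heartbeat-history approach is based on the phi accrual failure detector of Hayashibara et al. The sketch below is a simplified Python illustration of the idea, not Akka's implementation: inter-arrival times of heartbeats are recorded, and the time since the last heartbeat is converted into a suspicion level phi that is compared against a threshold (Akka's default threshold is 8; the normal-distribution model and window size here are our simplifications).

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Simplified phi accrual failure detector (after Hayashibara et al.)."""

    def __init__(self, threshold=8.0, window=100):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # sliding window of inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        # Record the inter-arrival time of each observed heartbeat.
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)
        elapsed = now - self.last_heartbeat
        # Probability that a heartbeat would still arrive this late, under a
        # normal model of the inter-arrival times.
        p_later = 1.0 - 0.5 * (1.0 + math.erf((elapsed - mean) / (std * math.sqrt(2))))
        p_later = max(p_later, 1e-12)
        return -math.log10(p_later)

    def is_available(self, now):
        # The node is suspected once phi crosses the threshold.
        return self.phi(now) < self.threshold
```

Because phi grows continuously as heartbeats stay overdue, the threshold trades off detection speed against the risk of falsely suspecting a slow but live node.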
3.3. The Preliminary Model
3.3.1. The Master Node
- The workers’ manager: This handler maintains a data structure of type map, called backends, to keep track of all workers in the cluster. Upon receiving a WorkRegistration message from a worker, the master adds the address of that worker to the backends map. Once the minimum number of workers has registered, the master actor sends a Start message to itself, a message that is handled by the clustering process initiator.
- The clustering process initiator: This handler is only invoked once at the beginning of the clustering process. When a Start message is received, a random set of centroids is selected, and an Iterate message is sent to the data splitter handler.
- The data splitter: This handler is executed at the beginning of each iteration. It is responsible for distributing the data evenly among the workers, along with the current centroids, using a portion(..) message. At each iteration, the algorithm iterates over the backends map and splits the data among the registered workers.
- The centroids computer and data aggregator: This handler is responsible for collecting the results from all workers. These results are sent by the workers via WorkerDone messages. The data aggregator receives these messages from all workers and consequently constructs a new set of centroids. After that, the algorithm checks whether the convergence criteria were met. If not, an Iterate message is sent to the splitter. These steps are repeated until the optimal solution or the maximum number of iterations is reached.
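The split/aggregate cycle described above can be sketched as follows. This is a minimal Python illustration of one distributed iteration under our own naming, not the authors' Akka code: each worker returns per-centroid partial sums and counts (the payload of a WorkerDone-style message), and the master merges them into the new centroids.

```python
def assign_partial(points, centroids):
    """Worker-side step: per-centroid partial sums and counts for one data split."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Nearest centroid by squared Euclidean distance.
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1
        for d, v in enumerate(p):
            sums[j][d] += v
    return sums, counts

def master_iteration(splits, centroids):
    """Master-side step: merge the workers' partial results into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for split in splits:  # in the real system each split is processed by a worker
        sums, counts = assign_partial(split, centroids)
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # A centroid with no assigned points keeps its previous position.
    return [[s / c if c else centroids[j][d] for d, s in enumerate(total_sums[j])]
            for j, c in enumerate(total_counts)]
```

Because only the (sums, counts) pairs cross the network, the per-iteration communication cost depends on k and the dimensionality, not on the data size.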
3.3.2. The Worker Node
3.3.3. Issues with the Preliminary Model
- The master is a single point of failure in the system. If the master fails, the whole clustering process fails.
- If one worker is lagging, the whole process will be waiting for that worker to finish.
- If one worker fails, the last iteration needs to be repeated and the load needs to be distributed among the intact workers.
- Messages are not guaranteed to be delivered to their destination.
- Workers can temporarily become unreachable. This leads to messages sent between a worker and the master being lost.
- There is no result checking.
- Data splitting is done manually.
3.4. A Robust, Distributed k-Means Algorithm
3.5. Data Distribution
Algorithm 2: Workers Manager Handler—Master Node
Algorithm 3: Clustering Initiator Handler—Master Node
3.6. Failing and Lagging Workers
Algorithm 4: Data Aggregator Handler—Master Node
3.7. Master Node Failure
- Active replication: The same operations are performed at each replica.
- Passive replication: Only the main master performs the operations and transfers the results to the other replicas.
- Determining that the master has failed: The secondary master should be able to detect the failure of the primary master and react accordingly. When the primary master receives a MemberUp message, it checks the role of that member; if the role is passiveMaster, the primary master keeps the address of that node in order to provide it with the updated centroids. A WatchMe message is also sent to the passive master. Upon receiving that message, the passive master starts watching the primary master (Algorithm 5, line 3) using Akka's DeathWatch mechanism. DeathWatch relies on the Akka cluster failure detector for nodes in the cluster. It detects JVM crashes and network failures as well as the graceful termination of watched nodes and, as a result, generates a Terminated() message that is sent to the passive replica.
- Promoting a secondary master to primary: Upon receiving a Terminated() message, the passive master changes its internal state, using Become(master), to start behaving as a primary master (Algorithm 5, line 8).
- Reconfiguring the system to communicate with the new master: A message is then sent from the new master to all workers, instructing them to start sending their results to the new master (Algorithm 5, lines 5–7).
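The three failover steps above can be sketched in a few lines. The following Python simulation of the passive replica's logic (Algorithm 5) is an illustration under our own naming, assuming a become()-style behavior switch as in Akka; it is not the authors' implementation, and the message names are placeholders:

```python
class PassiveMaster:
    """Sketch of the passive-replica failover logic (Algorithm 5)."""

    def __init__(self, workers):
        self.workers = workers          # worker mailboxes, modeled as plain lists
        self.centroids = None           # latest centroids pushed by the primary
        self.behavior = self.passive    # start as a passive replica

    def receive(self, message):
        self.behavior(message)          # dispatch to the current behavior

    def passive(self, message):
        kind = message[0]
        if kind == "CentroidsUpdate":
            # Passive replication: the primary pushes its state here.
            self.centroids = message[1]
        elif kind == "Terminated":
            # DeathWatch reported the primary's failure: promote ourselves...
            self.behavior = self.active  # the become(master) step
            # ...and redirect every worker to the new master.
            for w in self.workers:
                w.append(("NewMaster", self))

    def active(self, message):
        if message[0] == "WorkerDone":
            pass  # from here on, aggregate results exactly as the primary would
```

Because the passive replica already holds the latest centroids when it is promoted, the clustering process resumes from the current iteration rather than restarting from scratch.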
Algorithm 5: Passive Master
3.8. Message Delivery
4. Results
4.1. Experimental Setup
4.2. Evaluation of the Preliminary Model
4.3. The Impact of Using Several Replicas on the Overall Performance
4.4. Master Failure
4.5. Undelivered Messages
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Data Set | No. of Rows (Millions) | No. of Features | Size |
---|---|---|---|
DS1 | 2.5 | 68 | 0.5 GB |
DS2 | 5.0 | 80 | 1 GB |
Methods | 4 Nodes | 8 Nodes | 16 Nodes |
---|---|---|---|
No master failure | 1332 | 782 | 618 |
With master failure | 1371 | 819 | 654 |
Methods | 4 Nodes | 8 Nodes | 16 Nodes |
---|---|---|---|
No master failure | 13,233 | 7333 | 5088 |
With master failure | 13,283 | 7361 | 5134 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Taamneh, S.; Al-Hami, M.; Bani-Salameh, H.; Abdallah, A.E. A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines. Data 2021, 6, 73. https://doi.org/10.3390/data6070073