Abstract
The continued growth of Big Data drives demand for novel solutions and purpose-built algorithms that can filter and process Big Data efficiently, increasingly in real time. Scaling Machine Learning algorithms to larger datasets and more complex methods therefore calls for distributed parallelism. This book chapter presents a thorough literature review of the distributed, parallel, data-intensive Machine Learning algorithms applied to Big Data to date. The selected algorithms span several Machine Learning categories: (i) unsupervised learning, (ii) supervised learning, (iii) semi-supervised learning, and (iv) deep learning. Popular programming frameworks well suited to parallelizing Machine Learning algorithms, such as MapReduce, PLANET, DryadLINQ, the IBM Parallel Machine Learning Toolbox (PML), and the Compute Unified Device Architecture (CUDA), are cited throughout the review. The review focuses primarily on the performance and implementation traits of scalable Machine Learning algorithms rather than on framework-level choices and their trade-offs.
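To illustrate how a Machine Learning algorithm maps onto the MapReduce paradigm mentioned above, the following minimal Python sketch simulates one iteration of a MapReduce-style parallel k-means: the map phase assigns each point of a data partition to its nearest centroid, and the reduce phase aggregates the points grouped under each cluster to recompute that centroid. The function and variable names are illustrative assumptions, not code from any of the surveyed systems.

```python
# Minimal sketch of one MapReduce-style iteration of parallel k-means.
# All names here (map_partition, reduce_cluster, kmeans_iteration) are
# illustrative assumptions, not taken from any surveyed framework.
from collections import defaultdict
import math


def map_partition(points, centroids):
    """Map phase: emit (cluster_id, (point, 1)) for each point in one data split."""
    for p in points:
        cid = min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
        yield cid, (p, 1)


def reduce_cluster(cid, values):
    """Reduce phase: average the points assigned to one cluster to get its new centroid."""
    dim = len(values[0][0])
    sums, count = [0.0] * dim, 0
    for point, c in values:
        count += c
        for d in range(dim):
            sums[d] += point[d]
    return cid, tuple(s / count for s in sums)


def kmeans_iteration(partitions, centroids):
    """Group map outputs by cluster id (the shuffle), then reduce each group locally."""
    grouped = defaultdict(list)
    for part in partitions:                      # in a real job, one mapper per split
        for cid, value in map_partition(part, centroids):
            grouped[cid].append(value)
    new_centroids = list(centroids)
    for cid, values in grouped.items():          # one reducer per cluster id
        _, new_centroids[cid] = reduce_cluster(cid, values)
    return new_centroids


if __name__ == "__main__":
    data_splits = [[(1.0, 1.0), (1.5, 2.0)], [(8.0, 8.0), (9.0, 7.5)]]
    print(kmeans_iteration(data_splits, [(0.0, 0.0), (10.0, 10.0)]))
```

In an actual Hadoop or Spark deployment the grouping step is performed by the framework's shuffle rather than in the driver, and the iteration is repeated until the centroids converge.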
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Skënduli, M.P., Biba, M., Ceci, M. (2018). Implementing Scalable Machine Learning Algorithms for Mining Big Data: A State-of-the-Art Survey. In: Roy, S., Samui, P., Deo, R., Ntalampiras, S. (eds) Big Data in Engineering Applications. Studies in Big Data, vol 44. Springer, Singapore. https://doi.org/10.1007/978-981-10-8476-8_4
DOI: https://doi.org/10.1007/978-981-10-8476-8_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8475-1
Online ISBN: 978-981-10-8476-8
eBook Packages: Engineering (R0)