Abstract
With the growing number of available databases containing very large numbers of records, existing knowledge discovery tools need to be adapted and new tools need to be created. Genetic Programming (GP) has proven to be an efficient algorithm, in particular for classification problems. However, GP is hampered by its computational cost, which becomes more acute with large datasets. This paper presents how an existing GP implementation (DEAP) can be adapted by distributing evaluations on a Spark cluster. An additional sampling step is then applied so that the approach also fits small clusters. Experiments are carried out on Higgs boson classification with different settings. They show the benefits of using Spark as a parallelization technology for GP.
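As a rough illustration of the two ideas named in the abstract (evaluating individuals through a pluggable map step, and sampling the training records first), a minimal pure-Python sketch might look as follows. It is not the authors' implementation: the record layout, the function names `sample_records`, `accuracy` and `evaluate_population`, and the use of plain accuracy as fitness are all assumptions made for the example. The `map_fn` argument stands in for whatever distributes the work; here it defaults to the built-in `map`, and a cluster-backed map would be substituted in a Spark setting.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical layout: one record is (feature vector, class label).
Record = Tuple[Sequence[float], int]
# A classifier stands in for a compiled GP individual.
Classifier = Callable[[Sequence[float]], int]


def sample_records(data: Sequence[Record], fraction: float,
                   rng: random.Random) -> List[Record]:
    """Draw a uniform random subset of the records, without replacement."""
    size = max(1, int(len(data) * fraction))
    return rng.sample(list(data), size)


def accuracy(individual: Classifier, records: Sequence[Record]) -> float:
    """Fitness of one individual: share of records it labels correctly."""
    hits = sum(1 for features, label in records
               if individual(features) == label)
    return hits / len(records)


def evaluate_population(population: Sequence[Classifier],
                        records: Sequence[Record],
                        map_fn=map) -> List[float]:
    """Evaluate every individual; map_fn decides where the work runs.

    Assumption: map_fn has the same calling convention as the built-in
    map, so a parallel or distributed map can be swapped in.
    """
    return list(map_fn(lambda ind: accuracy(ind, records), population))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic one-feature data, used only to exercise the functions.
    data = [((x,), int(x > 0.5)) for x in (rng.random() for _ in range(1000))]
    # Three threshold classifiers standing in for evolved programs.
    population = [lambda f, t=t: int(f[0] > t) for t in (0.1, 0.5, 0.9)]
    subset = sample_records(data, 0.1, rng)
    print(evaluate_population(population, subset))
```

Keeping the map step as a parameter means the evaluation logic does not change when the execution backend does, which is the kind of separation that makes it possible to adapt an existing GP library rather than rewrite it.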
Notes
1. ‘A data lake is a collection of storage instances of various data assets additional to the originating data sources.’ (Source: Gartner).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hmida, H., Ben Hamida, S., Borgi, A., Rukoz, M. (2019). Genetic Programming over Spark for Higgs Boson Classification. In: Abramowicz, W., Corchuelo, R. (eds) Business Information Systems. BIS 2019. Lecture Notes in Business Information Processing, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-030-20485-3_23
DOI: https://doi.org/10.1007/978-3-030-20485-3_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20484-6
Online ISBN: 978-3-030-20485-3
eBook Packages: Computer Science, Computer Science (R0)