Genetic Programming over Spark for Higgs Boson Classification

  • Conference paper
Business Information Systems (BIS 2019)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 353)

Abstract

With the growing number of available databases containing very large numbers of records, existing knowledge discovery tools need to be adapted and new tools need to be created. Genetic Programming (GP) has proven to be an efficient algorithm, in particular for classification problems. However, GP is hampered by its computing cost, which becomes more acute with large datasets. This paper presents how an existing GP implementation (DEAP) can be adapted by distributing evaluations on a Spark cluster. An additional sampling step is then applied to fit tiny clusters. Experiments are conducted on Higgs boson classification with different settings. They show the benefits of using Spark as a parallelization technology for GP.
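
The idea of distributing fitness evaluations and sampling the training data can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, with a thread pool standing in for a Spark cluster, and every name in it (`make_records`, `sample_records`, `split`, `count_correct`, `evaluate`, the toy data and the toy classifier) is a hypothetical chosen for illustration. The sketch assumes that a candidate GP program can be treated as a callable classifier, that the per-partition work is a count of correct predictions, and that the sampling step is a plain random subsample.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_records(n, seed=0):
    """Build hypothetical toy (features, label) records; a stand-in for real data."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        label = 1 if x[0] + x[1] > 0 else 0
        records.append((x, label))
    return records


def sample_records(records, fraction, seed):
    """Assumed sampling step: a random subsample drawn without replacement."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)


def split(records, n_parts):
    """Split the records into n_parts roughly equal partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_correct(classifier, partition):
    """Map step: count the records in one partition that are classified correctly."""
    return sum(1 for x, y in partition if classifier(x) == y)


def evaluate(classifier, partitions, map_fn=map):
    """Reduce step: combine per-partition counts into an overall accuracy."""
    counts = list(map_fn(lambda p: count_correct(classifier, p), partitions))
    total = sum(len(p) for p in partitions)
    return sum(counts) / total


records = make_records(1000)
sample = sample_records(records, 0.1, seed=1)
partitions = split(sample, 4)

# A hand-written candidate that happens to match the rule used to label the toy data.
candidate = lambda x: 1 if x[0] + x[1] > 0 else 0

# The thread pool plays the role of the cluster; the serial built-in map is the default.
with ThreadPoolExecutor(max_workers=4) as pool:
    accuracy = evaluate(candidate, partitions, map_fn=pool.map)

print(accuracy)
```

The `map_fn` parameter is the only point where the execution backend enters, so the evaluation logic stays the same whether the partitions are processed serially, by threads, or by a cluster. DEAP exposes a similar hook: its toolbox lets the caller register a replacement `map` function. How the paper wires this to Spark is not stated in the abstract.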

Notes

  1. ‘A data lake is a collection of storage instances of various data assets additional to the originating data sources.’ (Source: Gartner).

  2. https://hadoop.apache.org.

  3. https://spark.apache.org.

Corresponding author

Correspondence to Hmida Hmida.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Hmida, H., Ben Hamida, S., Borgi, A., Rukoz, M. (2019). Genetic Programming over Spark for Higgs Boson Classification. In: Abramowicz, W., Corchuelo, R. (eds) Business Information Systems. BIS 2019. Lecture Notes in Business Information Processing, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-030-20485-3_23

  • DOI: https://doi.org/10.1007/978-3-030-20485-3_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20484-6

  • Online ISBN: 978-3-030-20485-3

  • eBook Packages: Computer Science, Computer Science (R0)
