Benchmarking Spark Machine Learning Using BigBench

Singh, Sweta

doi:10.1007/978-3-319-54334-5_4

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10080)

Included in the following conference series:

Technology Conference on Performance Evaluation and Benchmarking

Abstract

Databases such as dashDB are adding High Speed Connectors for Spark to efficiently extract large volumes of data. This allows the data to be combined with other unstructured data sources and used for Machine Learning (ML). Machine Learning is a key ingredient for such use cases. To assess the performance of the data connectors and machine learning frameworks, we sought benchmarks that can scale datasets to very large volumes and apply Machine Learning algorithms. After exploring several options, we found BigBench to be a good fit. In this paper, we describe our experience of using BigBench, with special focus on its 5 Machine Learning queries and their default implementation in Spark. We discuss how the effectiveness of BigBench for benchmarking Machine Learning could be improved by avoiding bias and including real-time analytics. We also see scope for improving the coverage of Machine Learning by adding more use cases, such as Collaborative Filtering. Lastly, we share some interesting visualizations of 4 ML queries using SPSS Modeler and our experiments with different Clustering and Classification algorithms.

Acknowledgement

We would like to thank Berni Schiefer, Steve Rees, Torsten Steinbach, John Poelman and Manish Anand for providing their valuable feedback.

Author information

Authors and Affiliations

IBM, Dallas, USA
Sweta Singh

Corresponding author

Correspondence to Sweta Singh.

Editor information

Editors and Affiliations

Cisco Systems, Inc., San Jose, California, USA
Raghunath Nambiar
Oracle Corporation, Redwood City, California, USA
Meikel Poess

About this paper

Cite this paper

Singh, S. (2017). Benchmarking Spark Machine Learning Using BigBench. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking. Traditional - Big Data - Internet of Things. TPCTC 2016. Lecture Notes in Computer Science, vol 10080. Springer, Cham. https://doi.org/10.1007/978-3-319-54334-5_4

DOI: https://doi.org/10.1007/978-3-319-54334-5_4
Published: 18 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54333-8
Online ISBN: 978-3-319-54334-5
eBook Packages: Computer Science, Computer Science (R0)
