Teaching Big Data with a Virtual Cluster

Author:

Joshua Eckroth

SIGCSE '16: Proceedings of the 47th ACM Technical Symposium on Computing Science Education

Pages 175 - 180

https://doi.org/10.1145/2839509.2844651

Published: 17 February 2016

Abstract

Both industry and academia are confronting the challenge of big data, i.e., data processing that involves data so voluminous or arriving at such high velocity that no single commodity machine is capable of storing or processing them all. A common approach to handling big data is to divide and distribute the processing job to a cluster of machines. Ideally, a course that teaches students how to work with big data would provide students access to a cluster for hands-on practice. However, a cluster of physical, on-premise machines may be prohibitively expensive, particularly at smaller institutions with smaller budgets.

In this report, we summarize our experiences developing and using a virtual cluster in a big data mining and analytics course at a small private liberal arts college. A single moderately-sized server hosts a cluster of virtual machines, which run the popular Apache Hadoop system. The virtual cluster gives students hands-on experience and costs less than an equal number of physical machines. It is also easily constructed and reconfigured. We describe our implementation, analyze its performance characteristics, and compare costs with physical clusters and the Amazon Elastic MapReduce cloud service. We summarize our use of the virtual cluster in the classroom and show student feedback.
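The "divide and distribute" approach mentioned in the abstract can be illustrated with a minimal sketch of the MapReduce pattern that Hadoop popularized. This is not code from the paper: the function names (`split_into_chunks`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are illustrative assumptions, and the "workers" are simulated sequentially in one process rather than run on real cluster nodes.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, num_workers):
    """Divide the input into one chunk per (simulated) worker."""
    return [records[i::num_workers] for i in range(num_workers)]


def map_phase(chunk):
    """Each worker emits (word, 1) pairs for its own chunk only."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Group emitted values by key so each key is reduced in one place."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the grouped values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records, num_workers=3):
    """Hypothetical driver: split, map per chunk, shuffle, then reduce."""
    chunks = split_into_chunks(records, num_workers)
    # On a real cluster these map calls would run in parallel on separate nodes.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    lines = ["big data big cluster", "virtual cluster", "big virtual machines"]
    print(word_count(lines))
```

Because the map step touches only its own chunk and the reduce step only its own key, the result does not depend on how many workers the input is split across, which is what lets the same job run on one machine or many.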

Reviews

Reviewer: Mariam Kiran

Web logs and sensor-collected data are two means by which organizations collect huge amounts of data. Parallel processing tools such as Hadoop are used to tame petabytes of data, store large amounts of data in the Hadoop distributed file system, and produce quick results, making them an ideal solution. Processing models such as MapReduce are used to acquire these quick results.

Starting with a description of various Apache projects, the author goes into the details of setting up local and cluster Hadoop distributions. The goal of such setups is to enable students to conduct their experiments onsite and understand more about the software that lies underneath.

The paper is well written. The author does a very good job of comparing the costs of setting up the various clusters and the total time it takes to start up. The author then compares these with cloud-hosted Hadoop services, such as Amazon Elastic MapReduce (EMR), to highlight the differences seen when facilities are onsite versus in the cloud. The author gives students a collection of experiments and then records their experiences, with comments added to the paper as an evaluation of the setups. The final conclusion is to invest more and add more servers for the students.

As an academic, I see the author's motivation in setting this up locally and taming the problem by giving students the ability to work in the facility. Faculties should be persuaded to invest in ideas such as this rather than moving toward commercial cloud providers, especially for enhancing student learning experiences.

Online Computing Reviews Service

Published In

SIGCSE '16: Proceedings of the 47th ACM Technical Symposium on Computing Science Education

February 2016

768 pages

ISBN:9781450336857

DOI:10.1145/2839509

General Chairs: Carl Alphonce (University at Buffalo), Jodi Tims (Baldwin Wallace University)

Program Chairs: Michael Caspersen (Aarhus University), Stephen Edwards (Virginia Tech University)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGCSE '16

Sponsor:

SIGCSE

SIGCSE '16: The 47th ACM Technical Symposium on Computing Science Education

March 2 - 5, 2016

Memphis, Tennessee, USA

Acceptance Rates

SIGCSE '16 Paper Acceptance Rate 105 of 297 submissions, 35%;

Overall Acceptance Rate 1,595 of 4,542 submissions, 35%

Article Metrics

Total Citations: 11
Total Downloads: 306
Downloads (last 12 months): 8
Downloads (last 6 weeks): 1

Reflects downloads up to 30 Aug 2024

Cited By

Wu W, Stephenson B, Stone J, Battestilli L, Rebelsky S, Shoop L (2024). Learning Big Data Systems via Emulation. Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, pages 1449-1455. Online publication date: 7-Mar-2024.
https://dl.acm.org/doi/10.1145/3626252.3630888

Ismail A, Mutalib S, Haron H (2023). Data science technology course: The design, assessment and computing environment perspectives. Education and Information Technologies 28:8, pages 10209-10234. Online publication date: 24-Jan-2023.
https://doi.org/10.1007/s10639-022-11558-8

Cisternino A, Ducange P, Tonellotto N, Vallati C (2021). Leveraging Cloud Infrastructures for Teaching Advanced Computer Engineering Classes. Bridges and Mediation in Higher Distance Education, pages 256-270. Online publication date: 29-Jan-2021.
https://doi.org/10.1007/978-3-030-67435-9_20

Yang Z, Guo X (2020). Teaching Hadoop Using Role Play Games. Decision Sciences Journal of Innovative Education 18:1, pages 6-21. Online publication date: 8-Feb-2020.
https://doi.org/10.1111/dsji.12197

Shamsi J, ul Hassan S, Bawany N, Shoaib N (2018). A Comprehensive Course on Big Data for Undergraduate Students. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 353-360. Online publication date: May-2018.
https://doi.org/10.1109/IPDPSW.2018.00067

Zhu W (2018). Cloud-based Labs and Programming Assignments in Networking and Cybersecurity Courses. 2018 IEEE Frontiers in Education Conference (FIE), pages 1-9. Online publication date: 3-Oct-2018.
https://dl.acm.org/doi/10.1109/FIE.2018.8659020

Ferraro Petrillo U (2018). Using Software Visualization for Supporting the Teaching of MapReduce. Network and System Security, pages 349-360. Online publication date: 18-Dec-2018.
https://doi.org/10.1007/978-3-030-02744-5_26

DePratti R, Dancik G, Lucci F, Sampson R (2017). Development of an introductory big data programming and concepts course. Journal of Computing Sciences in Colleges 32:6, pages 175-182. Online publication date: 1-Jun-2017.
https://dl.acm.org/doi/10.5555/3069658.3069683

Eickholt J, Shrestha S, Caspersen M, Edwards S, Barnes T, Garcia D (2017). Teaching Big Data and Cloud Computing with a Physical Cluster. Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, pages 177-181. Online publication date: 8-Mar-2017.
https://dl.acm.org/doi/10.1145/3017680.3017705

Eckroth J (2017). Teaching Future Big Data Analysts: Curriculum and Experience Report. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 346-351. Online publication date: May-2017.
https://doi.org/10.1109/IPDPSW.2017.122
