DOI: 10.1145/2839509.2844651
research-article

Teaching Big Data with a Virtual Cluster

Published: 17 February 2016

Abstract

Both industry and academia are confronting the challenge of big data, i.e., data processing that involves data so voluminous or arriving at such high velocity that no single commodity machine is capable of storing or processing them all. A common approach to handling big data is to divide and distribute the processing job to a cluster of machines. Ideally, a course that teaches students how to work with big data would provide students access to a cluster for hands-on practice. However, a cluster of physical, on-premise machines may be prohibitively expensive, particularly at smaller institutions with smaller budgets.
In this report, we summarize our experiences developing and using a virtual cluster in a big data mining and analytics course at a small private liberal arts college. A single moderately-sized server hosts a cluster of virtual machines, which run the popular Apache Hadoop system. The virtual cluster gives students hands-on experience and costs less than an equal number of physical machines. It is also easily constructed and reconfigured. We describe our implementation, analyze its performance characteristics, and compare costs with physical clusters and the Amazon Elastic MapReduce cloud service. We summarize our use of the virtual cluster in the classroom and show student feedback.

Reviews

Mariam Kiran

Web logs and sensor-collected data are two means by which organizations collect huge amounts of data. Parallel processing tools such as Hadoop are used to tame petabytes of data, store them in the Hadoop distributed file system, and produce quick results, which makes them an ideal solution. Processing models such as MapReduce are used to obtain these results.

Starting with a description of various Apache projects, the author goes into the details of setting up local and cluster Hadoop distributions. The goal of such setups is to let students conduct their experiments onsite and understand more about the underlying software.

The paper is well written. The author does a very good job of comparing the costs of setting up the various clusters and the total time each takes to start up. The author then compares the local setup with cloud-hosted Hadoop services, such as Amazon Elastic MapReduce (EMR), to highlight the differences between on-site and cloud facilities. The author gives students a collection of experiments and then records their experiences, with their comments included in the paper as an evaluation of the setups. The final conclusion is to invest more and add more servers for the students.

As an academic, I see the author's motivation in setting this up locally and taming the problem by giving students the ability to work in the facility. Faculties should be convinced to invest in ideas such as this rather than move toward commercial cloud providers, especially to enhance student learning experiences.

Online Computing Reviews Service

Published In

SIGCSE '16: Proceedings of the 47th ACM Technical Symposium on Computing Science Education
February 2016
768 pages
ISBN: 9781450336857
DOI: 10.1145/2839509

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data
  2. cloud computing
  3. curriculum
  4. virtual machines

Qualifiers

  • Research-article

Conference

SIGCSE '16

Acceptance Rates

SIGCSE '16 Paper Acceptance Rate 105 of 297 submissions, 35%;
Overall Acceptance Rate 1,595 of 4,542 submissions, 35%

Cited By

  • (2024) Learning Big Data Systems via Emulation. Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, pages 1449-1455. DOI: 10.1145/3626252.3630888. Online publication date: 7-Mar-2024
  • (2023) Data science technology course: The design, assessment and computing environment perspectives. Education and Information Technologies, 28:8, pages 10209-10234. DOI: 10.1007/s10639-022-11558-8. Online publication date: 24-Jan-2023
  • (2021) Leveraging Cloud Infrastructures for Teaching Advanced Computer Engineering Classes. Bridges and Mediation in Higher Distance Education, pages 256-270. DOI: 10.1007/978-3-030-67435-9_20. Online publication date: 29-Jan-2021
  • (2020) Teaching Hadoop Using Role Play Games. Decision Sciences Journal of Innovative Education, 18:1, pages 6-21. DOI: 10.1111/dsji.12197. Online publication date: 8-Feb-2020
  • (2018) A Comprehensive Course on Big Data for Undergraduate Students. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 353-360. DOI: 10.1109/IPDPSW.2018.00067. Online publication date: May-2018
  • (2018) Cloud-based Labs and Programming Assignments in Networking and Cybersecurity Courses. 2018 IEEE Frontiers in Education Conference (FIE), pages 1-9. DOI: 10.1109/FIE.2018.8659020. Online publication date: 3-Oct-2018
  • (2018) Using Software Visualization for Supporting the Teaching of MapReduce. Network and System Security, pages 349-360. DOI: 10.1007/978-3-030-02744-5_26. Online publication date: 18-Dec-2018
  • (2017) Development of an introductory big data programming and concepts course. Journal of Computing Sciences in Colleges, 32:6, pages 175-182. DOI: 10.5555/3069658.3069683. Online publication date: 1-Jun-2017
  • (2017) Teaching Big Data and Cloud Computing with a Physical Cluster. Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, pages 177-181. DOI: 10.1145/3017680.3017705. Online publication date: 8-Mar-2017
  • (2017) Teaching Future Big Data Analysts: Curriculum and Experience Report. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 346-351. DOI: 10.1109/IPDPSW.2017.122. Online publication date: May-2017
