

6CS030 Big Data

2019/0
Portfolio – Part 1
Worksheet Three – 5%
Hand-out: Week 9. Demo: Week 10 Workshop

This worksheet is based on the Hadoop Workbooks 1 and 2.

1. This worksheet uses three CSV exports generated from the Employment Rate &
Qualifications Profile of Adults spreadsheet seen in Worksheet One.
They have undergone some cleaning to remove non-numeric characters from any fields
containing figures. The header row has also been removed, so the first record contains data.
The files are available on hpd-srv.wlv.ac.uk in the /home/6cs030/Worksheet3
directory.
An updated version of the Population.java file from Hadoop Workbook 2 (Week 9) can
also be found in the /home/6cs030/Worksheet3 directory. It has been amended to check
whether the figures found are whole numbers or floats.
You need to analyse just one of the CSV datasets.
First, take your student number and divide it by 3. Use the remainder (modulus) to
pick one of the following datasets:
Remainder Value   CSV Dataset to use   Java Class Name
0                 Employment_Rate      EmpRate
1                 Degree-Level_Quals   DegreeQuals
2                 No_Quals             NoQuals

For example, if your student number is 1712345, then 1712345 mod 3 = 2, so you would use
the No_Quals dataset. See the Remainder spreadsheet if you are not sure how to do this.
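As a quick sanity check of the allocation rule, the remainder can be computed with shell arithmetic (1712345 is just the example number from the text; substitute your own):

```shell
# Sketch only: replace the example value with your own student number.
student=1712345
remainder=$((student % 3))   # modulus: remainder after dividing by 3
echo "$remainder"            # prints 2 -> No_Quals / NoQuals per the table above
```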
2. Examine your dataset and carry out the following tasks:
Task no Task

a Java and Hadoop
Using the updated Population.java file, amend it as follows:
- To use the CSV file allocated from Part 1
- Amend the Population class name to the Java Class Name shown
above
- Amend the Mapper and Reducer class names to include your initials
- Reflect these changes in the main method
b Run the code
- Show the steps required to run the code produced in Part a. This
should include all steps (e.g. compiling, creating the jar file, storing the
data in HDFS).
- The input and output directories produced should include your initials
- Show the contents of your output directory and some of the output
produced
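A hedged sketch of what the steps in Part b might look like. The names here are placeholders, not the allocated ones: the class name (NoQuals), jar name, and the initials "ab" in the directory names are assumptions you must replace with your own allocation, and exact invocations may differ on hpd-srv:

```shell
# Sketch under assumptions: class NoQuals, initials "ab" -- substitute your own.
# 1. Compile the amended Java source against the Hadoop libraries.
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes NoQuals.java

# 2. Package the compiled classes into a jar file.
jar cf noquals.jar -C classes .

# 3. Store the CSV data in HDFS (input directory name includes your initials).
hdfs dfs -mkdir -p input_ab
hdfs dfs -put No_Quals.csv input_ab

# 4. Run the job; the output directory also carries your initials.
hadoop jar noquals.jar NoQuals input_ab output_ab

# 5. Show the contents of the output directory and some of the output.
hdfs dfs -ls output_ab
hdfs dfs -cat output_ab/part-r-00000
```

These commands assume a working Hadoop installation on the server, so they are shown for orientation rather than verbatim copying.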
c Apache Spark
Write a command to:
- Load the same CSV file into Apache Spark
- Show two queries that manipulate the data, one using the DataFrame
API and one using a SQL query
d Name one advantage of using Hadoop or Spark for handling Big Data and
include a brief explanation of why you think this is an advantage.
e Name one disadvantage of using Hadoop or Spark for handling Big Data and
include a brief explanation of why you think this is a disadvantage.

Note: this is an individual assessment. Any group answers will be classed as plagiarism.
For this exercise you can use either the Mongo Shell or a Python Notebook to carry out the
commands.
Upload

Upload evidence of the above tasks to Canvas by Week 10.


You must use the dataset allocated to you; otherwise, 0 marks will be awarded.

Demonstration

During the demonstration you will be asked to show what you have done for one of the above
tasks.
