

6CS030 Big Data

2019/0
Portfolio – Part 1
Worksheet Three – 5%
Hand-out: Week 9. Demo: Week 10 Workshop

This worksheet is based on the Hadoop Workbooks 1 and 2.

1. This worksheet uses three CSV exports generated from the Employment Rate &
Qualifications Profile of Adults spreadsheet seen in Worksheet One.
They have undergone some cleaning to remove non-numeric characters from any fields
containing figures. The header row has also been removed, so the first record contains data.
The files are available on hpd-srv.wlv.ac.uk in the /home/6cs030/Worksheet3
directory.
An updated version of the Population.java file from Hadoop Workbook 2 (Week 9) can
also be found in the /home/6cs030/Worksheet3 directory. It has been amended to check
whether the figures found are whole numbers or floats.
You need to analyse just one of the CSV datasets.
First, take your student number and divide it by 3. Use the remainder (modulus) to
pick one of the following datasets:
Remainder Value   CSV Dataset to use   Java Class Name
0                 Employment_Rate      EmpRate
1                 Degree-Level_Quals   DegreeQuals
2                 No_Quals             NoQuals

For example, if your student number is 1712345, then 1712345 mod 3 = 2, so you would use
the No_Quals dataset. See the Remainder spreadsheet if you are not sure how to do this.
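As a quick sanity check of the allocation rule, the remainder can be computed with shell arithmetic (1712345 is just the example number from the text; substitute your own):

```shell
# Sketch only: replace the example value with your own student number.
student=1712345
remainder=$((student % 3))   # modulus: remainder after dividing by 3
echo "$remainder"            # prints 2 -> No_Quals / NoQuals per the table above
```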
2. Examine your dataset and carry out the following tasks:
Task no Task

a Java and Hadoop
Using the updated Population.java file, amend it as follows:
- To use the CSV file allocated from Part 1
- Amend the Population class name to the Java Class Name shown
above
- Amend the Mapper and Reducer class names to include your initials
- Reflect these changes in the main method
b Run the code
- Show the steps required to run the code produced in Part a. This
should include all steps (e.g. compiling, creating the jar file, storing the
data in HDFS).
- The input and output directories produced should include your initials
- Show the contents of your output directory and some of the output
produced
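A hedged sketch of what the steps in Part b might look like. The names here are placeholders, not the allocated ones: the class name (NoQuals), jar name, and the initials "ab" in the directory names are assumptions you must replace with your own allocation, and exact invocations may differ on hpd-srv:

```shell
# Sketch under assumptions: class NoQuals, initials "ab" -- substitute your own.
# 1. Compile the amended Java source against the Hadoop libraries.
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes NoQuals.java

# 2. Package the compiled classes into a jar file.
jar cf noquals.jar -C classes .

# 3. Store the CSV data in HDFS (input directory name includes your initials).
hdfs dfs -mkdir -p input_ab
hdfs dfs -put No_Quals.csv input_ab

# 4. Run the job; the output directory also carries your initials.
hadoop jar noquals.jar NoQuals input_ab output_ab

# 5. Show the contents of the output directory and some of the output.
hdfs dfs -ls output_ab
hdfs dfs -cat output_ab/part-r-00000
```

These commands assume a working Hadoop installation on the server, so they are shown for orientation rather than verbatim copying.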
c Apache Spark
Write a command to:
- Load the same CSV file into Apache Spark
- Show two queries that manipulate the data, one using the DataFrame
API and one using a SQL query
d Name one advantage of using Hadoop or Spark for handling Big Data and
include a brief explanation of why you think this is an advantage.
e Name one disadvantage of using Hadoop or Spark for handling Big Data and
include a brief explanation of why you think this is a disadvantage.

Note: this is an individual assessment. Any group answers will be classed as plagiarism.
For this exercise you can use either the Mongo Shell or a Python Notebook to carry out the
commands.
Upload

Upload evidence of the above tasks to Canvas by Week 10.


You must use the dataset allocated to you; otherwise, 0 marks will be awarded.

Demonstration

During the demonstration you will be asked to show what you have done for one of the above
tasks.
