Scalable Data Mining (Autumn 2021)

Assignment 1 (Full Marks: 100)

Steps for Spark installation:

1. Follow the guidelines given in this link to install Spark on your system:
https://medium.com/@josemarcialportilla/installing-scala-and-spark-on-ubuntu-5665ee4b62b1

Instructions: Please submit your answers (code, output, and your approach to the problem) to the following questions as a write-up in a PDF file via Moodle.

Question 1 (Marks = 30)

In this assignment, you will use Spark to explore the MovieLens dataset, which contains user-generated
ratings for movies. The dataset comes in 3 files:

● ratings.dat contains the ratings in the following format: UserID::MovieID::Rating::Timestamp
● users.dat contains demographic information about the users: UserID::Gender::Age::Occupation::Zip-code
● movies.dat contains meta information about the movies: MovieID::Title::Genres

Please read the readme file in the zip folder for further information.
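
As a rough illustration of the ratings format described above, the sketch below parses one line into a case class. The names Rating and RatingParser are hypothetical, and storing the rating as an Int assumes whole-star ratings; check the readme before relying on that.

```scala
// Hypothetical names for illustration; they are not prescribed by the assignment.
// Assumption: ratings are whole stars, so Int is used for the rating field.
case class Rating(userId: Int, movieId: Int, rating: Int, timestamp: Long)

object RatingParser {
  // Split on the literal "::" delimiter (UserID::MovieID::Rating::Timestamp).
  // Malformed lines (wrong field count or non-numeric values) yield None.
  def parse(line: String): Option[Rating] =
    line.split("::") match {
      case Array(u, m, r, t) =>
        scala.util.Try(Rating(u.toInt, m.toInt, r.toInt, t.toLong)).toOption
      case _ => None
    }
}
```

In a Spark shell, where a SparkContext is available as sc, the same function could be applied along the lines of sc.textFile("ratings.dat").flatMap(line => RatingParser.parse(line).toList); the file path here is a placeholder.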

============================================================================

(10 points):
a) Download the ratings file, parse it and load it in an RDD named ratings.
b) How many lines does the ratings RDD contain?

(20 points):
c) Count how many unique movies have been rated.
d) Which user gave the most ratings? Return the userID and number of ratings.
e) Which user gave the most '5' ratings? Return the userID and number of ratings.
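
The counting questions above all reduce to "group, count, take the maximum". The sketch below shows that shape on a small in-memory Seq standing in for the ratings RDD; RatingStats and its method names are hypothetical, and ties are not handled in any particular way.

```scala
object RatingStats {
  // (userId, movieId, stars): a minimal stand-in for one parsed rating.
  type R = (Int, Int, Int)

  // Number of distinct movie IDs that appear in at least one rating.
  def uniqueMovies(rs: Seq[R]): Int = rs.map(_._2).distinct.size

  // (userId, count) for the user with the most ratings that pass `keep`.
  // Throws on empty input, because maxBy has no maximum to return.
  def topUser(rs: Seq[R], keep: R => Boolean = _ => true): (Int, Int) =
    rs.filter(keep)
      .groupBy(_._1)
      .map { case (user, group) => (user, group.size) }
      .maxBy(_._2)
}
```

Calling topUser(rs) answers the "most ratings" question and topUser(rs, _._3 == 5) the "most '5' ratings" one. On an RDD, the per-user count would more typically be written with map to (userId, 1) pairs followed by reduceByKey(_ + _), and the unique-movie count with distinct().count().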

Question 2 (Marks = 40)

Using the same dataset from Question 1, perform the following operations:

(20 points):
a) Read the movies and users files into RDDs. How many records are there in each RDD?
b) How many of the movies are comedies?
c) Which comedy has the most ratings? Return the title and the number of ratings. Answer this
question by joining two datasets.
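
One possible shape for the join in part c) is sketched below on plain Scala collections. ComedyJoin, Movie, and the method names are hypothetical. The pipe-separated genre list is an assumption about movies.dat that should be checked against the readme.

```scala
object ComedyJoin {
  case class Movie(id: Int, title: String, genres: Set[String])

  // Parses MovieID::Title::Genres.
  // Assumption: genres are pipe-separated, e.g. "Comedy|Romance".
  def parseMovie(line: String): Option[Movie] =
    line.split("::", 3) match {
      case Array(id, title, genres) =>
        scala.util.Try(Movie(id.toInt, title, genres.split('|').toSet)).toOption
      case _ => None
    }

  // Inner-joins comedies with per-movie rating counts on the movie ID
  // and returns the (title, count) pair with the highest count.
  def mostRatedComedy(movies: Seq[Movie], ratedMovieIds: Seq[Int]): (String, Int) = {
    val counts: Map[Int, Int] =
      ratedMovieIds.groupBy(identity).map { case (id, g) => (id, g.size) }
    val comedies = movies.filter(_.genres.contains("Comedy"))
    // Comedies without any rating have no entry in counts and drop out here.
    val joined = comedies.flatMap(m => counts.get(m.id).map(n => (m.title, n)))
    joined.maxBy(_._2)
  }
}
```

With RDDs, the equivalent step would key both sides by movie ID and use join on the resulting pair RDDs.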

(20 points):
d) Compute the number of unique users that rated the movies with MovieIDs 2858, 356 and 2329
without using an inverted index. Measure the time (in seconds) it takes to make this computation.

e) Create an inverted index on ratings, keyed on the MovieID field. Print the first item.
f) Compute the number of unique users that rated the movies with MovieIDs 2858, 356 and 2329
using the index built in the previous step. Measure the time (in seconds) it takes to compute the same
result using the index.
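
The comparison between a full scan and an index lookup can be sketched as follows, again on an in-memory Seq rather than an RDD. MovieIndex and its members are hypothetical names. The sketch assumes the question asks for users who rated at least one of the listed movies; if it means users who rated all three, the union would become an intersection.

```scala
object MovieIndex {
  // (userId, movieId): the two fields this computation needs.
  type R = (Int, Int)

  // Full scan: every rating is inspected.
  def usersByScan(rs: Seq[R], movieIds: Set[Int]): Int =
    rs.filter(r => movieIds.contains(r._2)).map(_._1).distinct.size

  // Inverted index: movieId -> set of users who rated that movie.
  def buildIndex(rs: Seq[R]): Map[Int, Set[Int]] =
    rs.groupBy(_._2).map { case (movie, g) => (movie, g.map(_._1).toSet) }

  // Index lookup: only the entries for the requested movies are touched.
  def usersByIndex(index: Map[Int, Set[Int]], movieIds: Set[Int]): Int =
    movieIds.flatMap(id => index.getOrElse(id, Set.empty[Int])).size

  // Evaluates `body` once and returns (result, elapsed seconds).
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }
}
```

When timing the indexed query, it is worth deciding up front whether index construction counts toward the measured time. In Spark, transformations are lazy, so an index that has not been materialized (for example through cache() followed by an action) would be rebuilt inside the timed query.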

Question 3 (Marks = 30)

Download the file from this link on Google Drive: data2_1. Write a function to load this data into an
RDD and name it ‘Assignment_1’. Make sure you use a case class to map the file fields.

Each line in this file contains the following fields: debug_level: String, timestamp: Date,
download_id: Integer, retrieval_stage: String, rest: String
Example: DEBUG, 2017-03-24T12:06:23+00:00, ghtorrent-49 -- ghtorrent.rb: Repo Shikanime/print exists
Here, debug_level = DEBUG; timestamp = 2017-03-24T12:06:23+00:00; download_id = ghtorrent-49;
retrieval_stage = ghtorrent.rb; rest = Repo Shikanime/print exists
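
A minimal sketch of mapping the example line onto a case class is shown below. LogLine and LogParser are hypothetical names, and the regular expression is derived only from the single example above, so real data may need a more forgiving pattern. The download_id is kept as a String here because the example value is ghtorrent-49; that is an assumption about how to reconcile it with the Integer type listed in the field description.

```scala
import java.time.OffsetDateTime
import scala.util.Try

// Hypothetical case class; field names follow the field list in the text.
case class LogLine(debugLevel: String, timestamp: OffsetDateTime,
                   downloadId: String, retrievalStage: String, rest: String)

object LogParser {
  // Shape assumed from the example: LEVEL, TIMESTAMP, ID -- STAGE: REST
  private val Pattern = """^(\w+), (\S+), (\S+) -- ([^:]+): (.*)$""".r

  // Lines that do not match, or whose timestamp is not ISO-8601, yield None.
  def parse(line: String): Option[LogLine] = line match {
    case Pattern(level, ts, id, stage, rest) =>
      Try(OffsetDateTime.parse(ts)).toOption.map(LogLine(level, _, id, stage, rest))
    case _ => None
  }
}
```

Returning Option instead of throwing makes it straightforward to drop unparseable lines with flatMap when loading the file.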

a. Create a function that, given an RDD and a field (e.g. download_id), computes an inverted
index on the RDD for efficiently searching the records of the RDD using values of the field as
keys.
b. Compute the number of different repositories accessed by the client ‘ghtorrent-22’ (without
using the inverted index).
c. Compute the number of different repositories accessed by the client ‘ghtorrent-22’ using the
inverted index calculated above.
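
The three parts above could be approached roughly as sketched below, with the field passed as a key-extractor function. InvertedIndex, Entry, and repoOf are hypothetical. In particular, repoOf only recognizes the "Repo owner/name" form seen in the example line, which is an assumption; the real log will likely need a different way to extract repository names.

```scala
object InvertedIndex {
  // Hypothetical minimal record; a full case class would carry all five fields.
  case class Entry(downloadId: String, rest: String)

  // Generic inverted index: groups records by the value of the chosen field.
  def build[A, K](records: Seq[A])(field: A => K): Map[K, Seq[A]] =
    records.groupBy(field)

  private val RepoPattern = """Repo (\S+/\S+)""".r

  // Assumption: a repository appears as "Repo owner/name" inside `rest`.
  def repoOf(e: Entry): Option[String] =
    RepoPattern.findFirstMatchIn(e.rest).map(_.group(1))

  // Without the index: filter every record by client, then count distinct repos.
  def reposByScan(records: Seq[Entry], client: String): Int =
    records.filter(_.downloadId == client).flatMap(e => repoOf(e)).distinct.size

  // With the index: look up the client's records directly.
  def reposByIndex(index: Map[String, Seq[Entry]], client: String): Int =
    index.getOrElse(client, Seq.empty[Entry]).flatMap(e => repoOf(e)).distinct.size
}
```

An RDD offers a groupBy method that takes the same kind of key-extractor function, so the build step carries over with little change.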

Submission Instructions:

Submit one file named RollNo_AssignmentNo.pdf containing the following:
(1) a description of the logic you use to solve each problem with Spark in Scala,
(2) the code snippets for each problem, and
(3) their respective outputs.
