02 - BDAV - Lab Submission 2

MCAL31 Big Data Analytics and Visualization Lab Practical
Practical No 05
Aim: To perform operations using Spark:
i. Basic operations 
ii. Creating RDD using 3 methods
Starting spark-shell in cloudera
[cloudera@quickstart ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/
SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.12.0.jar!/org/slf4j/impl/
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
21/11/28 22:03:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
21/11/28 22:03:14 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback
address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
21/11/28 22:03:14 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context available as sc (master = local[*], app id = local-1638165801967).
21/11/28 22:03:33 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature
cannot be used because libhadoop cannot be loaded.
SQL context available as sqlContext.
Spark Create RDD from Seq or List (using Parallelize)

scala> val rdd=sc.parallelize(Seq(("Java", 20000), ("Python",100000), ("Scala",3000)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at
<console>:27
scala> rdd.foreach(println)
(Java,20000)
(Python,100000)
Name: Abdulbaqui Abdulrashid Ansari Roll No: 02
(Scala,3000)
Create an RDD from a text file(text file must be in hdfs file system)
scala> val rddnew = sc.textFile("/input/File2.txt")
rddnew: org.apache.spark.rdd.RDD[String] = /input/File2.txt MapPartitionsRDD[7] at textFile at
<console>:27
scala> rddnew.foreach(println)
hello world
abdul hello
hello world
If you want to read the entire content of a file as a single record use wholeTextFiles() method on
sparkContext.
-----------------------------------------------------
scala> val rdd2 = sc.wholeTextFiles("/input/File2.txt")
rdd2: org.apache.spark.rdd.RDD[(String, String)] = /input/File2.txt MapPartitionsRDD[9] at
wholeTextFiles at <console>:27
scala> rdd2.foreach(record=>println("FileName : "+record._1+", FileContents :"+record._2))

FileName : hdfs://quickstart.cloudera:8020/input/File2.txt,
FileContents :hello world
abdul hello
hello world
Creating from another RDD
You can use transformations like map, flatmap, filter to create a new RDD from an existing one.
scala> val rdd3 = rdd.map(row=>{(row._1,row._2+100)})
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:29
scala> rdd3.foreach(println)
(Java,20100)
(Python,100100)
(Scala,3100)
Practical No 06
Aim: To perform operations using Spark:
i. Starting mongosh
C:\Users\ASUS>mongosh
Current Mongosh Log ID: 61b08351c37c3d62383b8963
Connecting to: mongodb://127.0.0.1:27017/?
directConnection=true&serverSelectionTimeoutMS=2000
Using MongoDB: 5.0.4
Using Mongosh: 1.1.2
For mongosh info see: https://docs.mongodb.com/mongodb-shell/

------
The server generated these startup warnings when booting:
2021-12-07T13:41:09.138+05:30: Access control is not enabled for the database. Read and write
access to data and configuration is unrestricted
------
Warning: Found ~/.mongorc.js, but not ~/.mongoshrc.js. ~/.mongorc.js will not be loaded.
You may want to copy or rename ~/.mongorc.js to ~/.mongoshrc.js.
ii. Create a Database and Collection
test> use mydb
switched to db mydb
mydb> show dbs
admin 41 kB
config 36.9 kB
local 73.7 kB
using createCollection method
mydb> db.createCollection("inventory")
{ ok: 1 }
using create collection with NOT NULL constraint on a column
mydb> db.createCollection( "student", {
... validator: { $jsonSchema: {
..... bsonType: "object",
..... required: [ "fname" ],
..... properties: {
....... fname: {
......... bsonType: "string",
......... description: "must be a string and is required"
......... }
....... }
..... } }
... } )
{ ok: 1 }
iii. INSERT DOCUMENT

insert single document:
mydb> db.inventory.insertOne(
... { "item" : "canvas",
..... "qty" : 100,
..... "tags" : ["cotton"],
..... "size" : { "h" : 28, "w" : 35.5, "uom" : "cm" }
..... }
... )
{
acknowledged: true,
insertedId: ObjectId("61b084034fd1c6595ac15b69")
}
insert Multiple documents at a time:
mydb> db.inventory.insertMany([
... { item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
... { item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
... { item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
... { item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
... { item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
... ])
{
acknowledged: true,
insertedIds: {
'0': ObjectId("61b091a04fd1c6595ac15b6a"),
'1': ObjectId("61b091a04fd1c6595ac15b6b"),
'2': ObjectId("61b091a04fd1c6595ac15b6c"),
'3': ObjectId("61b091a04fd1c6595ac15b6d"),
'4': ObjectId("61b091a04fd1c6595ac15b6e")
}
}
Show collection in database
mydb> db.getCollectionNames();
[ 'student', 'inventory' ]
Inserts a document or documents Specifying an _id Field into a collection:
mydb> db.products.insert( { _id: 10, item: "box", qty: 20 } )
DeprecationWarning: Collection.insert() is deprecated. Use insertOne, insertMany, or bulkWrite.
{ acknowledged: true, insertedIds: { '0': 10 } }
mydb> db.products.insert(
... [
... { _id: 11, item: "pencil", qty: 50, type: "no.2" },
... { item: "pen", qty: 20 },
... { item: "eraser", qty: 25 }
... ]
... )
{
acknowledged: true,
insertedIds: {
'0': 11,
'1': ObjectId("61b092ee4fd1c6595ac15b6f"),
'2': ObjectId("61b092ee4fd1c6595ac15b70")
}
}
unordered insert
mydb> db.products.insert(
... [
... { _id: 20, item: "lamp", qty: 50, type: "desk" },
... { _id: 21, item: "lamp", qty: 20, type: "floor" },
... { _id: 22, item: "bulk", qty: 100 }
... ],
... { ordered: false }
... )
{ acknowledged: true, insertedIds: { '0': 20, '1': 21, '2': 22 } }
validation violation insert
mydb> try {
... db.student.insertOne({ _id: 1, fullname: "Sohan", age: 22 })
... } catch (e) {
... print (e);
... }
MongoServerError: Document failed validation
at C:\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:54498:25
at C:\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:61790:13
at handleOperationResult (C:
\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:63069:5)
at MessageStream.messageHandler (C:
at MessageStream.emit (events.js:400:28)
at MessageStream.emit (domain.js:475:12)
at processIncomingData (C:
at MessageStream._write (C:
at writeOrBuffer (internal/streams/writable.js:358:12)
at MessageStream.Writable.write (internal/streams/writable.js:303:10) {
index: 0,
code: 121,
errInfo: {
failingDocumentId: 1,
details: {
operatorName: '$jsonSchema',
schemaRulesNotSatisfied: [
{
operatorName: 'required',
specifiedAs: { required: [ 'fname' ] },
missingProperties: [ 'fname' ]
}
]
}
}
}
mydb> db.student.insertOne({ _id: 1, fname: "Sohan", age: 22 })
{ acknowledged: true, insertedId: 1 }
iv. QUERY DOCUMENT
List all the documents inside a collection
--SELECT * from inventory:
mydb> db.inventory.find( {} )
[
{
_id: ObjectId("61b084034fd1c6595ac15b69"),
item: 'canvas',
qty: 100,
tags: [ 'cotton' ],
size: { h: 28, w: 35.5, uom: 'cm' }
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6a"),
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6b"),
item: 'notebook',
qty: 50,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'A'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6c"),
item: 'paper',
qty: 100,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'D'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6d"),
item: 'planner',
qty: 75,
size: { h: 22.85, w: 30, uom: 'cm' },
status: 'D'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6e"),
item: 'postcard',
qty: 45,
size: { h: 10, w: 15.25, uom: 'cm' },
status: 'A'
}
]
SELECT * FROM inventory WHERE status = "D"
mydb> db.inventory.find( { status: "D" } )
[
{
item: 'paper',
qty: 100,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'D'
},
{
item: 'planner',
qty: 75,
size: { h: 22.85, w: 30, uom: 'cm' },
status: 'D'
}
]
SELECT * FROM inventory WHERE status in ("A", "D"):
mydb> db.inventory.find( { status: { $in: [ "A", "D" ] } } )
[
{
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6b"),
item: 'notebook',
qty: 50,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'A'
},
{
item: 'paper',
qty: 100,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'D'
},
{
item: 'planner',
qty: 75,
size: { h: 22.85, w: 30, uom: 'cm' },
status: 'D'
},
{
item: 'postcard',
qty: 45,
size: { h: 10, w: 15.25, uom: 'cm' },
status: 'A'
}
]
SELECT * FROM inventory WHERE status = "A" AND qty < 30:
mydb> db.inventory.find( { status: "A", qty: { $lt: 30 } } )
[
{
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
}
]
SELECT * FROM inventory WHERE status = "A" OR qty < 30:
mydb> db.inventory.find( { $or: [ { status: "A" }, { qty: { $lt: 30 } } ] } )
[
{
_id: ObjectId("61b0b17db51a84608a2d39fa"),
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
},
{
_id: ObjectId("61b0b17db51a84608a2d39fb"),
item: 'notebook',
qty: 50,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'A'
},
{
_id: ObjectId("61b0b17db51a84608a2d39fe"),
item: 'postcard',
qty: 45,
size: { h: 10, w: 15.25, uom: 'cm' },
status: 'A'
}
]
SELECT * FROM inventory WHERE status = "A" AND ( qty < 30 OR item LIKE "p%"):
mydb> db.inventory.find( {
... status: "A",
... $or: [ { qty: { $lt: 30 } }, { item: /^p/ } ]
... } )
[
{
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
},
{
item: 'postcard',
qty: 45,
size: { h: 10, w: 15.25, uom: 'cm' },
status: 'A'
}
]
v. DELETE DOCUMENT:
delete the first documents in the collection that matches the result
mydb> db.inventory.deleteOne( { status: "D" } )
{ acknowledged: true, deletedCount: 1 }
delete all the documents in the collection that match the result
mydb> db.inventory.find( { status: "D" } )
[
{
item: 'planner',
qty: 75,
size: { h: 22.85, w: 30, uom: 'cm' },
status: 'D'
}
]
delete all the documents in the collection
mydb> db.inventory.deleteMany({ status : "A" })
mydb> db.inventory.find( { status: "A" } )
mydb> db.inventory.deleteMany({})
mydb> db.inventory.find( {} )
mydb> show dbs
admin 41 kB
config 36.9 kB
local 73.7 kB
mydb 164 kB
Practical No 7
To perform Visualization operation using Tableau:
Starting with Tableau
Create New File
Click on Connect to Data
Select Sample -Superstore data source
Now Data is added
A] Analysis operations
Q 1: Find the customer with the highest overall profit. What is his/her profit ratio?
Ans:
Step 1: Open the superstore subset Excel data set
Step 2: Drag Orders sheet to sheet area
Step 3: Go to sheet 1 and add Customer name as rows and profit as column
Step 4: Sort the data by clicking on Profit label on bottom
Step 5:To calculate profit ratio

(Sum([Profit])/Sum([Sales]))
This formula needs to be entered as tooltip or label
Click on Analysis>Create Calculated Field and enter the formula
You can see Calculation1 in measures. Drag it to Marks area.
Final answer is: Note: Your answer might be different… As Data set values are different
Q2: Which state has the highest Sales (Sum)? What is the total Sales for that
state?
Ans: Add State/provinance in Rows
Q 3: Which customer segment has both the highest order quantity and average
discount rate?
What is the order quantity and average discount rate for that state?
Add Customer segment in columns
Check Discount and quantity. Rest fields are unchecked (According to question check the
appropriate field)
Right click on dimensions to remove it from rows and columns
Select Measure Values As Rows, Segment as Columns
You Can Change Graph by Clicking on dropdown list. Here I have selected line graph
Q 4: Which Product Category has the highest total Sales? Which Product
Category has the worst Profit? Name the Product Category and $ amount for
each.
Ans:
a. Bar Chart displaying total Sales for each Product Category (Change the graph
style)
b. Remove the previous columns and rows by right clicking on it. Add new fields.
c. Add a color scale indicating Profit

Add Sum(Sales) and Sum(Profit) by clicking on icon you can change the marking
system. (refer the arrow) by clicking on that you can change view
d. Each Product Category labeled with total Sales and Each Product Category
labeled with Profit
Q 5: Use the same visualization created for Question #4.What was the
Profit on Technology (Product Category) in Boca Raton (City) ?
Add dimension City to filter. Select the city as per question.

Click→apply→ok. This will give you profit for Boca Raton City.
Select the city to show a single bar for the city
Apply a filter for Technology. Same as above add Category dimension to filter on
technology.
B] Preparing Maps
Data Set for this Lab Sample- Superstore
Q 6: Prepare a Geographic map to show sales in each state and city.
Ans:
a) Connect to dataset
b) Join sheets
c. Create a Geographic Hierarchy

In the Data pane, right-click the geographic field ie Country or State or province,
and then select Hierarchy > Create Hierarchy.
In the Create Hierarchy dialog box that opens, give the hierarchy a name, such as
Mapping Items, and then click OK.
At the bottom of the Dimensions section, the Mapping Items hierarchy is created
with the State field.
In the Data pane, drag the city field to the hierarchy and place it below the State
field. Do same for Postal Code fields.
d. Build a basic map
1) In the Data pane, double-click State.
2) On the Marks card, click the + icon on the State field.
e. Add visual detail
Add color
Minimize to state
From Measures, drag Sales to Marks card.
Change the Sale appearance as Color
Add labels
From Measures, drag Sales to Label on the Marks card.

Each state is labeled with sum of sales.
The numbers need a little bit of formatting, however.
In the Data pane, right-click Sales and select Default Properties > Number
Format. In the Default Number Format dialog box that opens, select Number
(Custom), and then do the following:
For Decimal Places, enter 0.
For Units, select Thousands (K).
Click OK.
Q 7: Show Profit Ratio of each state as tooltip on map
Show Profit Ratio of each city as tooltip on map
Q 8: Show Profit ratio for Grip Envelop products (Add product name in filter)
Q 9: In the technology product category which unprofitable state is surrounded

byonly profitable states.
Drag the product category on the filter shelf and select technology.
Now Drag the profit measure to color mark Add profit Label
.
Q. 10: Which state has the worst Gross Profit Ratio on Envelopes in the Corporate
Customer Segment that were Shipped in 2018?
Customer Segment is Corporate (Apply Filter on customer segment)
Product Sub-category is envelope (Apply filter on Product Sub-category)
And the Shipped date year is 2013

Apply Filter on Order Date→ Select year
C] Preparing Reports
Data Set: Super store
11)Prepare a report showing product category wise sales
12)Report showing region wise product wise sales
13)Report showing state wise sales
14)What is the percent of total Sales for the ‘Home Office’ Customer
Segment in July of 2019?
15)Find the top 10 Product Names by Sales within each region. Which product is
ranked #2 in both the Central & West regions in 2018?
Drag “Product Name” dimension from data pane window to Row Shelf and then add
an
“order Date” on Filter shelf and select “Year” of Order date as 2013 (Sown in previous
assignment)).After that put region on Filter shelf and select “Central” and “West”
checkbox. Also, put a copy of region to the Column shelf as well.
Drag a Sales measure to the text label.
So for getting the Top 10 “product name” by sales, we need to add the “Product
name” on Filter shelf.
Once the Filter Pop up is open, Select “TOP” tab >By Field > Top 10 by Sum (Sales).
Right click on the aggregated Sales measure → select Quick Table Calculation →
Rank.
As the default addressing is Table across, please change it into Table

Down Right click on Sales Measures→Compute using → Table Down.

02 - BDAV - Lab Submission 2

Uploaded by

Copyright:

Available Formats

02 - BDAV - Lab Submission 2

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

02 - BDAV - Lab Submission 2

Uploaded by

Copyright:

Available Formats

MCAL31 Big Data Analytics and Visualization Lab Practical

Spark Create RDD from Seq or List (using Parallelize)

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

scala> rdd2.foreach(record=>println("FileName : "+record._1+", FileContents :"+record._2))

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

For mongosh info see: https://docs.mongodb.com/mongodb-shell/

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

iii. INSERT DOCUMENT

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

mydb> db.inventory.find( { status: "A" } )

mydb> show dbs

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Click on Connect to Data

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Select Sample -Superstore data source

Now Data is added

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Step 4: Sort the data by clicking on Profit label on bottom

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Step 5:To calculate profit ratio

This formula needs to be entered as tooltip or label

Click on Analysis>Create Calculated Field and enter the formula

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

You can see Calculation1 in measures. Drag it to Marks area.

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Ans: Add State/provinance in Rows

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Add Customer segment in columns

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Right click on dimensions to remove it from rows and columns

Select Measure Values As Rows, Segment as Columns

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

c. Add a color scale indicating Profit

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical

Name: Abdulbaqui Abdulrashid Ansari Roll No: 02

MCAL31 Big Data Analytics and Visualization Lab Practical