02 - BDAV - Lab Submission 2
Practical No 05
Aim: To perform operations using Spark:
i. Basic operations
ii. Creating RDD using 3 methods
Starting spark-shell in Cloudera
[cloudera@quickstart ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.12.0.jar!/org/slf4j/impl/
StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
21/11/28 22:03:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
21/11/28 22:03:14 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback
address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
21/11/28 22:03:14 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context available as sc (master = local[*], app id = local-1638165801967).
21/11/28 22:03:33 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature
cannot be used because libhadoop cannot be loaded.
SQL context available as sqlContext.
Create an RDD from a text file (the text file must be in the HDFS file system)
scala> val rddnew = sc.textFile("/input/File2.txt")
rddnew: org.apache.spark.rdd.RDD[String] = /input/File2.txt MapPartitionsRDD[7] at textFile at
<console>:27
scala> rddnew.foreach(println)
hello world
abdul hello
hello world
If you want to read the entire content of a file as a single record, use the wholeTextFiles() method on sparkContext.
-----------------------------------------------------
scala> val rdd2 = sc.wholeTextFiles("/input/File2.txt")
rdd2: org.apache.spark.rdd.RDD[(String, String)] = /input/File2.txt MapPartitionsRDD[9] at
wholeTextFiles at <console>:27
Create an RDD from an existing collection using parallelize()
scala> val rdd3 = sc.parallelize(Seq(("Java", 20100), ("Python", 100100), ("Scala", 3100)))
scala> rdd3.foreach(println)
(Java,20100)
(Python,100100)
(Scala,3100)
Practical No 06
Aim: To perform operations using MongoDB:
i. Starting mongosh
C:\Users\ASUS>mongosh
Current Mongosh Log ID: 61b08351c37c3d62383b8963
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000
Using MongoDB: 5.0.4
Using Mongosh: 1.1.2
ii. INSERT DOCUMENT
{
acknowledged: true,
insertedIds: {
'0': 11,
'1': ObjectId("61b092ee4fd1c6595ac15b6f"),
'2': ObjectId("61b092ee4fd1c6595ac15b70")
}
}
unordered insert
mydb> db.products.insert(
... [
... { _id: 20, item: "lamp", qty: 50, type: "desk" },
... { _id: 21, item: "lamp", qty: 20, type: "floor" },
... { _id: 22, item: "bulk", qty: 100 }
... ],
... { ordered: false }
... )
{ acknowledged: true, insertedIds: { '0': 20, '1': 21, '2': 22 } }
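The point of { ordered: false } is that a failing document does not stop the rest of the batch. A plain-JS sketch (no MongoDB driver; the in-memory Map and error code are illustrative) of that behavior:

```javascript
// Sketch of ordered vs unordered insert semantics (not the real driver):
// with ordered: false, a failing document (e.g. a duplicate _id) does not
// prevent the remaining documents from being inserted.
function insertMany(collection, docs, { ordered = true } = {}) {
  const insertedIds = {};
  const writeErrors = [];
  docs.forEach((doc, i) => {
    if (ordered && writeErrors.length > 0) return; // ordered: stop at first error
    if (collection.has(doc._id)) {
      writeErrors.push({ index: i, code: 11000 }); // duplicate key
    } else {
      collection.set(doc._id, doc);
      insertedIds[i] = doc._id;
    }
  });
  return { acknowledged: true, insertedIds, writeErrors };
}

const products = new Map([[20, { _id: 20, item: 'lamp' }]]); // _id 20 already exists
const res = insertMany(products, [
  { _id: 20, item: 'lamp', qty: 50, type: 'desk' },  // duplicate -> fails
  { _id: 21, item: 'lamp', qty: 20, type: 'floor' }, // still inserted
  { _id: 22, item: 'bulk', qty: 100 },               // still inserted
], { ordered: false });
```

With the default ordered: true, the same batch would stop at the first error and documents 21 and 22 would never be attempted.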
validation violation insert
mydb> try {
... db.student.insertOne({ _id: 1, fullname: "Sohan", age: 22 })
... } catch (e) {
... print (e);
... }
MongoServerError: Document failed validation
at C:\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:54498:25
at C:\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:61790:13
at handleOperationResult (C:\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:63069:5)
at MessageStream.messageHandler (C:\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:50299:5)
at MessageStream.emit (events.js:400:28)
at MessageStream.emit (domain.js:475:12)
at processIncomingData (C:\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:49214:12)
at MessageStream._write (C:\Users\ASUS\AppData\Local\Programs\mongosh\mongosh.exe:49110:5)
at writeOrBuffer (internal/streams/writable.js:358:12)
at MessageStream.Writable.write (internal/streams/writable.js:303:10) {
index: 0,
code: 121,
index: 0,
code: 121,
errInfo: {
failingDocumentId: 1,
details: {
operatorName: '$jsonSchema',
schemaRulesNotSatisfied: [
{
operatorName: 'required',
specifiedAs: { required: [ 'fname' ] },
missingProperties: [ 'fname' ]
}
]
}
}
}
mydb> db.student.insertOne({ _id: 1, fname: "Sohan", age: 22 })
{ acknowledged: true, insertedId: 1 }
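The failed insert above was rejected by the collection's $jsonSchema validator because the required field 'fname' was missing (the first document used 'fullname' instead). A plain-JS sketch (no server) of that required-fields rule:

```javascript
// Sketch (not the MongoDB server) of the $jsonSchema 'required' rule:
// a document missing any required property is rejected with the same
// errInfo shape shown in the error above.
function validateRequired(doc, required) {
  const missingProperties = required.filter((k) => !(k in doc));
  if (missingProperties.length > 0) {
    return {
      ok: false,
      errInfo: {
        details: {
          operatorName: '$jsonSchema',
          schemaRulesNotSatisfied: [
            { operatorName: 'required', specifiedAs: { required }, missingProperties },
          ],
        },
      },
    };
  }
  return { ok: true };
}

const bad = validateRequired({ _id: 1, fullname: 'Sohan', age: 22 }, ['fname']);
const good = validateRequired({ _id: 1, fname: 'Sohan', age: 22 }, ['fname']);
```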
iv. QUERY DOCUMENT
List all the documents inside a collection
SELECT * FROM inventory
mydb> db.inventory.find( {} )
[
{
_id: ObjectId("61b084034fd1c6595ac15b69"),
item: 'canvas',
qty: 100,
tags: [ 'cotton' ],
size: { h: 28, w: 35.5, uom: 'cm' }
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6a"),
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6b"),
item: 'notebook',
qty: 50,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'A'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6c"),
item: 'paper',
qty: 100,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'D'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6d"),
item: 'planner',
qty: 75,
size: { h: 22.85, w: 30, uom: 'cm' },
status: 'D'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6e"),
item: 'postcard',
qty: 45,
size: { h: 10, w: 15.25, uom: 'cm' },
status: 'A'
}
]
SELECT * FROM inventory WHERE status = "D"
mydb> db.inventory.find( { status: "D" } )
[
{
_id: ObjectId("61b091a04fd1c6595ac15b6c"),
item: 'paper',
qty: 100,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'D'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6d"),
item: 'planner',
qty: 75,
size: { h: 22.85, w: 30, uom: 'cm' },
status: 'D'
}
]
SELECT * FROM inventory WHERE status in ("A", "D"):
mydb> db.inventory.find( { status: { $in: [ "A", "D" ] } } )
[
{
_id: ObjectId("61b091a04fd1c6595ac15b6a"),
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6b"),
item: 'notebook',
qty: 50,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'A'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6c"),
item: 'paper',
qty: 100,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'D'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6d"),
item: 'planner',
qty: 75,
size: { h: 22.85, w: 30, uom: 'cm' },
status: 'D'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6e"),
item: 'postcard',
qty: 45,
size: { h: 10, w: 15.25, uom: 'cm' },
status: 'A'
}
]
SELECT * FROM inventory WHERE status = "A" AND qty < 30:
mydb> db.inventory.find( { status: "A", qty: { $lt: 30 } } )
[
{
_id: ObjectId("61b091a04fd1c6595ac15b6a"),
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
}
]
SELECT * FROM inventory WHERE status = "A" OR qty < 30:
mydb> db.inventory.find( { $or: [ { status: "A" }, { qty: { $lt: 30 } } ] } )
[
{
_id: ObjectId("61b0b17db51a84608a2d39fa"),
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
},
{
_id: ObjectId("61b0b17db51a84608a2d39fb"),
item: 'notebook',
qty: 50,
size: { h: 8.5, w: 11, uom: 'in' },
status: 'A'
},
{
_id: ObjectId("61b0b17db51a84608a2d39fe"),
item: 'postcard',
qty: 45,
size: { h: 10, w: 15.25, uom: 'cm' },
status: 'A'
}
]
SELECT * FROM inventory WHERE status = "A" AND ( qty < 30 OR item LIKE "p%"):
mydb> db.inventory.find( {
... status: "A",
... $or: [ { qty: { $lt: 30 } }, { item: /^p/ } ]
... } )
[
{
_id: ObjectId("61b091a04fd1c6595ac15b6a"),
item: 'journal',
qty: 25,
size: { h: 14, w: 21, uom: 'cm' },
status: 'A'
},
{
_id: ObjectId("61b091a04fd1c6595ac15b6e"),
item: 'postcard',
qty: 45,
size: { h: 10, w: 15.25, uom: 'cm' },
status: 'A'
}
]
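The SQL-to-MongoDB mappings above can be sketched as a tiny matcher in plain JavaScript (an illustration of how the operators evaluate, not the real server; it supports only the operators used in these queries):

```javascript
// Tiny sketch of MongoDB filter evaluation (equality, $lt, $in, $or, regex).
function matches(doc, query) {
  return Object.entries(query).every(([key, cond]) => {
    if (key === '$or') return cond.some((sub) => matches(doc, sub));
    const value = doc[key];
    if (cond instanceof RegExp) return cond.test(value); // item: /^p/
    if (cond !== null && typeof cond === 'object') {
      if ('$lt' in cond) return value < cond.$lt;
      if ('$in' in cond) return cond.$in.includes(value);
    }
    return value === cond; // plain equality
  });
}

const inventory = [
  { item: 'journal', qty: 25, status: 'A' },
  { item: 'notebook', qty: 50, status: 'A' },
  { item: 'paper', qty: 100, status: 'D' },
  { item: 'planner', qty: 75, status: 'D' },
  { item: 'postcard', qty: 45, status: 'A' },
];

// status = "A" AND (qty < 30 OR item LIKE "p%")
const result = inventory.filter((d) =>
  matches(d, { status: 'A', $or: [{ qty: { $lt: 30 } }, { item: /^p/ }] })
);
console.log(result.map((d) => d.item)); // [ 'journal', 'postcard' ]
```

This reproduces the transcript above: journal matches via qty < 30, postcard via the /^p/ regex.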
v. DELETE DOCUMENT:
delete the first document in the collection that matches the filter
mydb> db.inventory.deleteOne( { status: "D" } )
{ acknowledged: true, deletedCount: 1 }
verify the remaining documents that match the filter
mydb> db.inventory.find( { status: "D" } )
[
{
_id: ObjectId("61b091a04fd1c6595ac15b6d"),
item: 'planner',
qty: 75,
size: { h: 22.85, w: 30, uom: 'cm' },
status: 'D'
}
]
delete all the documents that match the filter
mydb> db.inventory.deleteMany({ status : "A" })
{ acknowledged: true, deletedCount: 3 }
delete all the documents in the collection
mydb> db.inventory.deleteMany({})
{ acknowledged: true, deletedCount: 2 }
mydb> db.inventory.find( {} )
mydb> show dbs
admin 41 kB
config 36.9 kB
local 73.7 kB
mydb 164 kB
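The difference between deleteOne and deleteMany can be sketched in plain JavaScript (an in-memory illustration, not the real driver):

```javascript
// Sketch of deleteOne vs deleteMany semantics: deleteOne removes only the
// first matching document, deleteMany removes every match.
function deleteDocs(docs, pred, { many }) {
  let deletedCount = 0;
  const remaining = docs.filter((d) => {
    if ((many || deletedCount === 0) && pred(d)) {
      deletedCount++;
      return false; // drop this document
    }
    return true; // keep this document
  });
  return { remaining, deletedCount };
}

const docs = [{ status: 'D' }, { status: 'A' }, { status: 'D' }];
const one = deleteDocs(docs, (d) => d.status === 'D', { many: false });
const many = deleteDocs(docs, (d) => d.status === 'D', { many: true });
console.log(one.deletedCount, many.deletedCount); // 1 2
```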
Practical No 07
Aim: To perform visualization operations using Tableau:
Starting with Tableau
Create New File
A] Analysis operations
Q 1: Find the customer with the highest overall profit. What is his/her profit ratio?
Ans:
Step 1: Open the superstore subset Excel data set
Step 2: Drag Orders sheet to sheet area
Step 3: Go to sheet 1 and add Customer name as rows and profit as column
Final answer: note that your answer might be different, as the dataset values differ.
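Q1 asks for a profit ratio, which the steps above never define. In Tableau this is typically built as a calculated field (the field name here is an assumption; the formula is the standard profit-ratio calculation):

```
// Analysis > Create Calculated Field, named e.g. "Profit Ratio"
SUM([Profit]) / SUM([Sales])
```

Drag this field next to Profit to read off the ratio for the top customer.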
Q2: Which state has the highest Sales (Sum)? What is the total Sales for that
state?
Q 3: Which customer segment has both the highest order quantity and average
discount rate?
What is the order quantity and average discount rate for that segment?
Check Discount and Quantity; leave the other fields unchecked (check the fields appropriate to the question).
You can change the graph type by clicking on the dropdown list. Here a line graph is selected.
Q 4: Which Product Category has the highest total Sales? Which Product
Category has the worst Profit? Name the Product Category and $ amount for
each.
Ans:
a. Bar chart displaying total Sales for each Product Category (change the graph style)
b. Remove the previous columns and rows by right-clicking on them, then add the new fields.
c. Label each Product Category with its total Sales and with its Profit.
Q 5: Use the same visualization created for Question #4. What was the
Profit on Technology (Product Category) in Boca Raton (City)?
Apply a filter for Technology: as above, add the Product Category dimension to the Filters shelf and select Technology.
B] Preparing Maps
Ans:
a) Connect to dataset
b) Join sheets
In the Create Hierarchy dialog box that opens, give the hierarchy a name, such as
Mapping Items, and then click OK.
At the bottom of the Dimensions section, the Mapping Items hierarchy is created
with the State field.
In the Data pane, drag the City field to the hierarchy and place it below the State field. Do the same for the Postal Code field.
Add color
Minimize to state
From Measures, drag Sales to the Marks card.
Change the Sales appearance to Color.
Add labels
Q 8: Show the Profit ratio for Grip Envelop products (add Product Name to the filter)
Drag the Product Name dimension to the Filters shelf and select the Grip Envelop products.
Now drag the Profit measure to the Color mark and add a Profit label.
Q. 10: Which state has the worst Gross Profit Ratio on Envelopes in the Corporate
Customer Segment that were Shipped in 2018?
C] Preparing Reports
14) What is the percent of total Sales for the ‘Home Office’ Customer
Segment in July of 2019?
15) Find the top 10 Product Names by Sales within each region. Which product is
ranked #2 in both the Central & West regions in 2018?
Drag the “Product Name” dimension from the Data pane to the Rows shelf, then add
“Order Date” to the Filters shelf and select the “Year” of Order Date as 2013 (shown in a
previous assignment). After that, put Region on the Filters shelf and select the “Central”
and “West” checkboxes. Also put a copy of Region on the Columns shelf.
To get the top 10 product names by Sales, add “Product Name” to the Filters shelf.
Once the filter pop-up opens, select the “Top” tab > By Field > Top 10 by Sum(Sales).
Right click on the aggregated Sales measure → select Quick Table Calculation →
Rank.