BIG DATA Lab Record-2024
CERTIFICATE
This is to certify that _____________________________________ of B.Sc. Third
Date:
Place: Hyderabad
EXPERIMENT: 1
Install, configure and run Python, NumPy and Pandas.
PROGRAM:
AIM: To install, configure and run Python, NumPy and Pandas.
How to Install Anaconda on Windows?
Anaconda is an open-source distribution that bundles tools such as Jupyter and Spyder, which are used for large-scale data processing, data analytics and heavy scientific computing. Anaconda supports both the R and Python programming languages. Spyder (a sub-application of Anaconda) is an IDE for Python, and OpenCV for Python will work in Spyder. Package versions are managed by the package management system called conda.
To begin working with Anaconda, one must get it installed first. Follow the instructions below to download and install Anaconda on your system:
Download and install Anaconda:
Head over to anaconda.com and download the latest version of Anaconda. Make sure to download the "Python 3.7 Version" for the appropriate architecture.
Select Installation Type: Select "Just Me" if you want the software to be used by a single user.
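Once the installation finishes, a quick way to confirm that Python, NumPy and Pandas are all usable is to run a short script (a minimal check, not part of the original record; the versions printed will depend on your installation):

import sys
import numpy as np
import pandas as pd

print("Python :", sys.version.split()[0])
print("NumPy  :", np.__version__)
print("Pandas :", pd.__version__)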
import pandas as pd

dataset1 = pd.read_csv("crime.csv")   # load the crime dataset into a DataFrame
dataset1
dataset1.head()        # first 5 rows
dataset1.tail()        # last 5 rows
dataset1.head(10)      # first 10 rows
dataset1.tail(10)      # last 10 rows
type(dataset1)         # pandas.core.frame.DataFrame
dataset1.shape         # (number of rows, number of columns)
dataset1.skew()        # skewness of each numeric column
dataset1.var()         # variance of each numeric column
dataset1.kurtosis()    # kurtosis of each numeric column
print(dataset1.dtypes) # data type of each column
NUMPY
NumPy is the core library for scientific and numerical computing in Python. It provides a high-performance multidimensional array object and tools for working with arrays.
NumPy's main object is the multidimensional array: a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.
In NumPy, dimensions are called axes.
NumPy is fast, convenient and occupies less memory when compared to a Python list.
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
import numpy as np
print(np.__version__)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
type(): This built-in Python function tells us the type of the object passed to it. As in the code above, it shows that arr is of type numpy.ndarray.
To create an ndarray, we can pass a list, tuple or any array-like object into the array() method, and it
will be converted into an ndarray:
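For instance, a tuple is converted the same way as a list (a small illustrative example):

import numpy as np

arr = np.array((1, 2, 3, 4, 5))   # a tuple also becomes an ndarray
print(arr)
print(type(arr))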
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
import numpy as np
arr = np.array(42)   # a 0-D array (scalar)
print(arr)
1-D Arrays
An array that has 0-D arrays (scalars) as its elements is called a uni-dimensional or 1-D array. These are the most common and basic arrays.
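For completeness, an example in the same style as the 2-D and 3-D ones below:
#Create a 1-D array containing the values 1,2,3,4,5:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr)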
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent a matrix or a 2nd-order tensor.
#Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array. These are often used to represent a 3rd-order tensor.
#Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
NumPy Array Indexing
Access Array Elements
Array indexing is the same as accessing an array element.
You can access an array element by referring to its index number.
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the second
has index 1 etc.
#Get third and fourth elements from the following array and add them.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers representing the dimension and
the index of the element.
Think of 2-D arrays like a table with rows and columns, where the dimension represents the row and the
index represents the column.
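A short sketch of 2-D indexing (the array values here are illustrative):

import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print('2nd element on 1st row:', arr[0, 1])   # prints 2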
OUTPUT:
EXPERIMENT: 2
Install, Configure and Run Hadoop and HDFS
PROGRAM:
AIM: To install, configure and run Hadoop and HDFS.
HADOOP INSTALLATION IN WINDOWS
1. Prerequisites
Hardware Requirement
* RAM — Min. 8GB, if you have SSD in your system then 4GB RAM would also work.
* CPU — Min. Quad core, with at least 1.80GHz
2. JRE 1.8 — Offline installer for JRE
3. Java Development Kit — 1.8
4. A Software for Un-Zipping like 7Zip or Win Rar
* I will be using 64-bit Windows for this process; please check and download the version supported by your system (x86 or x64) for all the software.
5. Download Hadoop zip
* I am using Hadoop 2.9.2; you can use any other STABLE version of Hadoop.
Once we have downloaded all the above software, we can proceed with the next steps of installing Hadoop.
2. Unzip and Install Hadoop
After downloading Hadoop, we need to unzip the hadoop-2.9.2.tar.gz file.
Now we can organize our Hadoop installation: create a folder and move the final extracted files into it. For example:
Please note, while creating folders, DO NOT ADD SPACES IN THE FOLDER NAME (it can cause issues later).
I have placed my Hadoop in the D: drive; you can use C: or any other drive also.
3. Setting Up Environment Variables
Another important step in setting up a work environment is to set your system's environment variables.
To edit environment variables, go to Control Panel > System > click on the "Advanced system settings" link.
Alternatively, we can right-click on the This PC icon, click on Properties and then click on the "Advanced system settings" link.
Or, the easiest way is to search for Environment Variable in the search bar and there you GO…
Setting JAVA_HOME
Open Environment Variables and click on "New" in "User Variables".
Now, as shown, add JAVA_HOME as the variable name and the path of Java (JDK) as the variable value.
Click OK and we are half done with setting JAVA_HOME.
Setting HADOOP_HOME
Open Environment Variables and click on "New" in "User Variables".
Now, as shown, add HADOOP_HOME as the variable name and the path of the Hadoop folder as the variable value.
Click OK and we are half done with setting HADOOP_HOME.
Note:- If you want the path to be set for all users, you need to select "New" from System Variables.
Setting Path Variable
The last step in setting environment variables is setting Path in System Variables. Edit the Path variable and add new entries for %JAVA_HOME%\bin and %HADOOP_HOME%\bin, so that the java and hadoop commands are available in any Command Prompt.
Next, create a new folder named data inside the Hadoop home directory. Once the data folder is created, we need to create 2 new folders, namely namenode and datanode, inside the data folder.
These folders are important because files on HDFS reside inside the datanode.
Editing Configuration Files
Now we need to edit the following config files to configure Hadoop
(we can find these files in Hadoop -> etc -> hadoop):
* core-site.xml
* hdfs-site.xml
* mapred-site.xml
* yarn-site.xml
* hadoop-env.cmd
Editing core-site.xml
Right click on the file, select edit and paste the following content within <configuration>
</configuration> tags.
Note:- Below part already has the configuration tag, we need to copy only the part inside it.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Editing hdfs-site.xml
Right click on the file, select edit and paste the following content within the <configuration></configuration> tags.
Note:- Below part already has the configuration tag; we need to copy only the part inside it. Also replace the namenode and datanode paths below with the paths of the namenode and datanode folders we created earlier.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop\data\datanode</value>
</property>
</configuration>
Editing mapred-site.xml
Right click on the file, select edit and paste the following content within <configuration>
</configuration> tags.
Note:- Below part already has the configuration tag, we need to copy only the part inside it.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Editing yarn-site.xml
Right click on the file, select edit and paste the following content within <configuration>
</configuration> tags.
Note:- Below part already has the configuration tag, we need to copy only the part inside it.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Verifying hadoop-env.cmd
Right click on the file, select edit and check whether JAVA_HOME is set correctly or not.
We can replace the JAVA_HOME variable in the file with the actual JAVA_HOME that we configured in the System Variables:
set JAVA_HOME=%JAVA_HOME%
OR
set JAVA_HOME="C:\Program Files\Java\jdk1.8.0_221"
Replacing bin
The last step in configuring Hadoop is to download and replace the bin folder.
* Go to this GitHub Repo and download the bin folder as a zip.
* Extract the zip and copy all the files present under the bin folder to %HADOOP_HOME%\bin.
Note:- If you are using a different version of Hadoop, then please search for its respective bin folder and download it.
5. Testing Setup
Congratulations..!!!!!
We are done with setting up Hadoop on our system. Now we need to check if everything works smoothly…
Formatting Namenode
Before starting Hadoop we need to format the namenode. For this, open a NEW Command Prompt and run the command below:
hadoop namenode -format
Note:- This command formats all the data in the namenode, so it is advisable to use it only during initial setup and not every time you start the Hadoop cluster, to avoid data loss.
Launching Hadoop
Now open a new Command Prompt (remember to run it as administrator to avoid permission issues) and execute the command below:
start-all.cmd
This will open 4 new cmd windows running the 4 different daemons of Hadoop:
* Namenode
* Datanode
* Resourcemanager
* Nodemanager
Note:- We can verify that all the daemons are up and running using the jps command in a new cmd window.
6. Running Hadoop (Verifying Web UIs)
Namenode
Open localhost:50070 in a browser tab to verify namenode health.
Resourcemanager
Open localhost:8088 in a browser tab to check resourcemanager details.
Datanode
Open localhost:50075 in a browser tab to checkout datanode.
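If you prefer to check the web UIs from a script, here is a hedged Python sketch (assuming the daemons are running locally on the Hadoop 2.x default ports used above; the script name is illustrative):

# check_hadoop_ui.py -- probe the Hadoop web UIs started above
import urllib.request

UIS = {
    "Namenode":        "http://localhost:50070",
    "Resourcemanager": "http://localhost:8088",
    "Datanode":        "http://localhost:50075",
}

for name, url in UIS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(name, url, "-> HTTP", resp.status)
    except OSError as err:   # connection refused, timeout, etc.
        print(name, url, "-> not reachable:", err)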
EXPERIMENT: 3
Visualize Data Using Basic Plotting Techniques In Python.
PROGRAM:
AIM: To visualize data using basic plotting techniques in Python.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

crime = pd.read_csv('crime.csv')   # load the crime dataset
crime

plt.plot(crime.Murder, crime.Assault);   # line plot of Murder vs Assault

sns.scatterplot(x=crime.Murder, y=crime.Assault, hue=crime.Murder, s=100);   # scatter plot

plt.figure(figsize=(12, 6))
plt.title('Murder Vs Assault')
sns.scatterplot(x=crime.Murder, y=crime.Assault, hue=crime.Murder, s=100);

plt.title('Histogram for Robbery')
plt.hist(crime.Robbery);   # distribution of Robbery values

crime_bar = crime.set_index('Year')   # assumed definition: the original record uses crime_bar without defining it
plt.bar(crime_bar.index, crime_bar.Robbery);   # bar chart of Robbery by Year

sns.barplot(x='Robbery', y='Year', data=crime);   # horizontal bar plot
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = pd.read_csv('crime.csv')
x = data.Population
y = data.CarTheft

plt.scatter(x, y)            # scatter plot of Population vs CarTheft
plt.xlabel('Population')
plt.ylabel('CarTheft')
plt.title('Population Vs CarTheft')
plt.show();
EXPERIMENT: 4
Implement NoSQL Database Operations: CRUD Operations, Arrays Using MongoDB.
PROGRAM:
AIM: To perform CRUD and array operations using the MongoDB NoSQL database.
TITLE: Basic CRUD operations in MongoDB.
CRUD operations refer to the basic Create (Insert), Read, Update and Delete operations.
Inserting a document into a collection (Create)
➢ The command db.collection.insert() will perform an insert operation of a document into a collection.
➢ Let us insert a document into a student collection. You must be connected to a database to do any insert. It is done as follows:
db.student.insert({
    regNo: "3014",
    name: "Test Student",
    course: { courseName: "MCA", duration: "3 Years" },
    address: {
        city: "Bangalore",
        state: "KA",
        country: "India"
    }
})
An entry has been made into the collection called student.
Updating a document in a collection (Update)
In order to update specific field values of a collection in MongoDB, run the query below:
db.collection_name.update()
➢ The update() method specified above takes the field name and the new value as arguments to update a document.
➢ Let us update the attribute name of the student collection for the document with regNo 3014:
db.student.update(
    { "regNo": "3014" },
    { $set: { "name": "Viraj" } }
)
Removing an entry from the collection (Delete)
➢ Let us now look into deleting an entry from a collection. In order to delete an entry from a collection, run the command as shown below:
db.collection_name.remove({"fieldname":"value"})
➢ For Example : db.student.remove({"regNo":"3014"})
Note that after running the remove() method, the entry has been deleted from the student collection.
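The same CRUD cycle can also be driven from Python. Here is a hedged sketch using the pymongo driver (an alternative client, not part of the original record; it assumes pymongo is installed and MongoDB is running on the default localhost:27017, and the database name test is illustrative), with the Read step included via find_one():

# crud_demo.py -- CRUD against the student collection via pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
db = client["test"]                                 # assumed database name

# Create
db.student.insert_one({
    "regNo": "3014",
    "name": "Test Student",
    "course": {"courseName": "MCA", "duration": "3 Years"},
    "address": {"city": "Bangalore", "state": "KA", "country": "India"},
})

# Read
print(db.student.find_one({"regNo": "3014"}))

# Update
db.student.update_one({"regNo": "3014"}, {"$set": {"name": "Viraj"}})

# Delete
db.student.delete_one({"regNo": "3014"})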
EXPERIMENT: 5
Implement Functions: Count – Sort – Limit – Skip – Aggregate Using MongoDB.
PROGRAM:
AIM: To use the count, sort, limit, skip and aggregate functions in MongoDB.
1. COUNT
How do you get the number of Debit and Credit transactions? One way to do it is by using the count() function, as below:
> db.transactions.count({cr_dr : "D"});
2. SORT
$sort
Sorts all input documents and returns them to the pipeline in sorted
order. The $sort stage has the following prototype form:
{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }
$sort takes a document that specifies the field(s) to sort by and the
respective sort order. <sort order> can have one of the following values:
Value                    Description
1                        Sort ascending.
-1                       Sort descending.
{ $meta: "textScore" }   Sort by the computed textScore metadata in descending order. See Text Score Metadata Sort for an example.
If sorting on multiple fields, sort order is evaluated from left to right. For
example, in the form above, documents are first sorted by <field1>. Then
documents with the same <field1> values are further sorted by <field2>.
Behavior
Limits
You can sort on a maximum of 32 keys.
Sort Consistency
MongoDB does not store documents in a collection in a particular order.
When sorting on a field which contains duplicate values, documents
containing those values may be returned in any order.
If consistent sort order is desired, include at least one field in your sort that contains unique values. The easiest way to guarantee this is to include the _id field in your sort query.
Consider the following restaurant collection:
db.restaurants.insertMany( [
{ "_id" : 1, "name" : "Central Park Cafe", "borough" : "Manhattan"},
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens"},
{ "_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn"},
{ "_id" : 4, "name" : "Stan's Pizzaria", "borough" : "Manhattan"},
{ "_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn"},
])
The following command uses the $sort stage to sort on the borough field:
db.restaurants.aggregate(
[
{ $sort : { borough : 1 } }
]
)
In this example, sort order may be inconsistent, since the borough field contains duplicate values for both Manhattan and Brooklyn. Documents are returned in alphabetical order by borough, but the order of those documents with duplicate values for borough might not be the same across multiple executions of the same sort. For example, here are the results from two different executions of the above command:
{ "_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn" }
{ "_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn" }
{ "_id" : 1, "name" : "Central Park Cafe", "borough" : "Manhattan" }
{ "_id" : 4, "name" : "Stan's Pizzaria", "borough" : "Manhattan" }
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens" }
{ "_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn" }
{ "_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn" }
{ "_id" : 4, "name" : "Stan's Pizzaria", "borough" : "Manhattan" }
{ "_id" : 1, "name" : "Central Park Cafe", "borough" : "Manhattan" }
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens" }
While the values for borough are still sorted in alphabetical order, the
order of the documents containing duplicate values
for borough (i.e. Manhattan and Brooklyn) is not the same.
To achieve a consistent sort, add a field which contains exclusively
unique values to the sort. The following command uses the $sort stage to
sort on both the borough field and the _id field:
db.restaurants.aggregate(
[
{ $sort : { borough : 1, _id: 1 } }
]
)
Since the _id field is always guaranteed to contain exclusively unique values, the returned sort order will always be the same across multiple executions of the same sort.
Examples
Ascending/Descending Sort
For the field or fields to sort by, set the sort order to 1 or -1 to specify an
ascending or descending sort respectively, as in the following example:
db.users.aggregate(
[
{ $sort : { age : -1, posts: 1 } }
]
)
4. SKIP
$skip
Skips over the specified number of documents that pass into the stage and passes the remaining documents to the next stage in the pipeline.
The $skip stage has the following prototype form:
{ $skip: <positive 64-bit integer> }
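To tie the functions of this experiment together, here is a hedged pymongo sketch that runs count, sort, limit and skip against the restaurants collection used above (assumptions: pymongo installed, MongoDB on the default localhost:27017, database name test illustrative; the $limit stage follows the same pattern as $sort and $skip):

# pipeline_demo.py -- count/sort/limit/skip with pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
db = client["test"]                                 # assumed database name

# COUNT: number of restaurants in Manhattan
print(db.restaurants.count_documents({"borough": "Manhattan"}))

# SORT + SKIP + LIMIT in one aggregation pipeline:
# sort by borough (and _id for a stable order), skip the first
# result, then pass at most two documents to the next stage.
pipeline = [
    {"$sort": {"borough": 1, "_id": 1}},
    {"$skip": 1},
    {"$limit": 2},
]
for doc in db.restaurants.aggregate(pipeline):
    print(doc)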
EXPERIMENT: 6
Implement Word Count / Frequency Programs Using MapReduce.
PROGRAM:
AIM: To implement a word count / frequency program using MapReduce.
We use the Hadoop Streaming API to pass data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output).
Note: make sure both files have execute permission:
chmod +x /home/hduser/mapper.py
chmod +x /home/hduser/reducer.py
Mapper program
mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    line = line.strip()     # remove leading and trailing whitespace
    words = line.split()    # split the line into words
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
Reducer program
reducer.py
#!/usr/bin/env python
"""reducer.py"""
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
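The record breaks off here; the body of the reducer is missing. Below is a minimal completion in the spirit of the standard Hadoop Streaming word-count reducer, assuming the tab-delimited <word, 1> pairs from mapper.py arrive sorted by key (which the sort -k1,1 step below guarantees):

# input comes from STDIN, sorted by word
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)   # parse what mapper.py produced
    try:
        count = int(count)
    except ValueError:
        continue                        # ignore lines where count is not a number
    if current_word == word:
        current_count += count          # same word as before: accumulate
    else:
        if current_word:
            # a new word has been reached: emit the count for the previous one
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to emit the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))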
hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
bar 1
foo 3
labs 1
quux 2
PROGRAM:
This Python program reads the data from a dataset (stored in the file data.csv, the wine-quality dataset).
The mapped data is stored in shuffled.pkl by mapper.py.
The contents of shuffled.pkl are then reduced using reducer.py.
Mapper Program
import pandas as pd
import pickle

data = pd.read_csv('data.csv')

# Slicing data into four contiguous parts (so no rows are skipped)
slice1 = data.iloc[0:400, :]
slice2 = data.iloc[400:800, :]
slice3 = data.iloc[800:1200, :]
slice4 = data.iloc[1200:, :]

def mapper(data):
    # The record leaves this body incomplete; the (key, value) pairing
    # below is inferred from reducer.py, which averages the volatile
    # acidity values per wine-quality class.
    mapped = []
    for quality, acidity in zip(data['quality'], data['volatile acidity']):
        mapped.append((quality, acidity))
    return mapped

map1 = mapper(slice1)
map2 = mapper(slice2)
map3 = mapper(slice3)
map4 = mapper(slice4)

# Shuffle step: group the mapped values by wine-quality class
shuffled = {
    3.0: [],
    4.0: [],
    5.0: [],
    6.0: [],
    7.0: [],
    8.0: [],
}
for i in [map1, map2, map3, map4]:
    for j in i:
        shuffled[j[0]].append(j[1])

file = open('shuffled.pkl', 'wb')   # 'wb', not 'ab', so reruns overwrite old data
pickle.dump(shuffled, file)
file.close()

print("Data has been mapped. Now, run reducer.py to reduce the contents in shuffled.pkl file.")
Reducer Program
import pickle

file = open('shuffled.pkl', 'rb')
shuffled = pickle.load(file)
file.close()

def reduce(shuffled_dict):
    # average the volatile acidity values collected for each quality class
    reduced = {}
    for i in shuffled_dict:
        reduced[i] = sum(shuffled_dict[i]) / len(shuffled_dict[i])
    return reduced

final = reduce(shuffled)
print("Average volatile acidity in different classes of wine: ")
for i in final:
    print(i, ':', final[i])
EXPERIMENT: 8
PROGRAM:
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
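The record stops after loading the data. Below is a minimal continuation, adapted from the standard Spark MLlib K-Means example (a sketch: it assumes a PySpark installation and the sample data file shipped with Spark):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model with k = 2 clusters.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions and evaluate them with the silhouette score.
predictions = model.transform(dataset)
evaluator = ClusteringEvaluator()
print("Silhouette with squared euclidean distance:", evaluator.evaluate(predictions))

# Show the cluster centers.
for center in model.clusterCenters():
    print(center)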
PROGRAM:
If you type fluidPage() in the R console, you will see that the method returns a tag <div class="container-fluid"></div>.
Layout methods
The various layout features available in Bootstrap are implemented by R Shiny. The components are:
Panels
These are methods that group elements together into a single panel. These include:
absolutePanel()
inputPanel()
conditionalPanel()
headerPanel()
fixedPanel()
Layout functions
These organize the panels for a particular layout. These include:
fluidRow()
verticalLayout()
flowLayout()
splitLayout()
sidebarLayout()
Output methods
These methods are used for displaying R output components such as images, tables and plots. They include:
imageOutput()
plotOutput()
tableOutput()
textOutput()
verbatimTextOutput()
Server function
After you have created the appearance of the application and the ways to take input values from the user, it is time to set up the server. The server function lets you write the server-side code for the Shiny app: you create functions that map the user inputs to the corresponding outputs. This function is called by Shiny when the application is loaded in a web browser.
It takes input and output parameters, and its return value is ignored. An optional session parameter is also accepted by this function.
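As a small illustration, here is a hedged sketch of a complete app whose server function maps one input to one output (the input/output names num and result are illustrative, not from the original record):

library(shiny)

ui <- fluidPage(
  sliderInput("num", "Choose a number:", min = 1, max = 100, value = 25),
  textOutput("result")
)

# The server function maps the slider input to a text output.
server <- function(input, output, session) {
  output$result <- renderText({
    paste("You selected:", input$num)
  })
}

shinyApp(ui = ui, server = server)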
library(shiny)
runExample("01_hello")