DMDWLab Book Answers
(Science)
DSE II BCA 357- Laboratory (Data Mining) Workbook
Savitribai Phule Pune University
Answers
R Programming
Set A
1. Write an R program to add, multiply and divide two vectors of integer type.
(vector length should be minimum 4)
Solution :
> print(vector1)
[1] 10 20 30 40
> print(vector2)
[1] 20 10 40 40
Disp_table(number)
Disp_table=function(number)
{
Output
[1] "3 * 1 = 3"
[1] "3 * 2 = 6"
[1] "3 * 3 = 9"
[1] "3 * 4 = 12"
[1] "3 * 5 = 15"
[1] "3 * 6 = 18"
[1] "3 * 7 = 21"
[1] "3 * 8 = 24"
[1] "3 * 9 = 27"
[1] "3 * 10 = 30"
#display list
> print(list_data)
[[1]]
[1] "Ram Sharma"
[[2]]
[1] "Sham Varma"
[[3]]
[1] "Raj Jadhav"
[[4]]
[1] "Ved Sharma"
print(list_data)
[[1]]
[1] "Ram Sharma"
[[2]]
[1] "Sham Varma"
[[3]]
[1] "Raj Jadhav"
[[4]]
[1] "Ved Sharma"
[[5]]
[1] "Kavya Anjali"
#remove the 3rd employee
list_data[3] <- NULL
print(list_data)
[[1]]
[1] "Ram Sharma"
[[2]]
[1] "Sham Varma"
[[3]]
[1] "Ved Sharma"
[[4]]
[1] "Kavya Anjali"
Set B
Reverse_Sum=function(n)
{
sum=0
rev=0
while(n>0)
{
r = n%%10
sum= sum+r;
rev=rev*10+r
n=n%/%10 # %/% is used for integer division
}
print(paste("Sum of digit : ",sum))
print(paste("Reverse of number : ",rev))
}
n = as.integer(readline(prompt = "Enter a number :"))
Reverse_Sum(n)
Output
Enter a number :123
[1] "Sum of digit : 6"
[1] "Reverse of number : 321"
4. Write an R program to create a data frame using two given vectors and display the
duplicate elements
> companies <- data.frame(Shares = c("TCS", "Reliance", "HDFC Bank", "Infosys",
"Reliance"),
+ Price = c(3200, 1900, 1500, 2200, 1900))
> companies
Shares Price
1 TCS 3200
2 Reliance 1900
3 HDFC Bank 1500
4 Infosys 2200
5 Reliance 1900
> cat("Duplicate rows ", "\n")
Duplicate rows
> companies[duplicated(companies),]
Shares Price
5 Reliance 1900
Set C
1. Write an R program to perform the following:
a. Display all rows of the data set having weight greater than 120.
> women
height weight
b. Subset the data set by mpg column for values greater than 15.0
subset(data,data$mpg>15.0)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
b. Create a Scattered plot to show the relationship between ozone and wind
values by giving appropriate values to colour argument.
with(data,plot(Ozone ~ Wind,pch=mlev,col=mlev))
#We have seen that for the plot() function the color option is called col. For
#the shape option it is called pch, which stands for plotting character.
c. Create a Bar plot to show the ozone level for all the days having temperature
> 70.
(Use inbuilt dataset airquality)
> data<-data[data$Temp >70,]
> data <-na.omit(data)
> barplot(height=data$Ozone,main="ozone level for all the days having temperature > 70",
+ xlab="Temperature", ylab="Ozone", names.arg = data$Temp,border = "dark blue", col="pink")
Data Pre-Processing
1) Write a python program to convert Categorical values to numeric format for a
given Dataset (https://codefires.com/how-convert-categorical-data-numerical-data-python/)
import pandas as pd
info = {
'Gender' : ['Male', 'Female', 'Female', 'Male', 'Female', 'Female'],
'Position' : ['Head', 'Asst.Prof.', 'Associate Prof.', 'Asst.Prof.', 'Head', 'Asst.Prof.']
}
df = pd.DataFrame(info)
print(df)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
gender_encoded = le.fit_transform(df['Gender'])
encoded_position = le.fit_transform(df['Position'])
df['Encoded_Gender'] = gender_encoded
df['Encoded_Position'] = encoded_position
print(df)
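from sklearn.preprocessing import OneHotEncoder
# one-hot encode the integer-coded Gender column
gender_encoded = le.fit_transform(df['Gender'])
gender_encoded = gender_encoded.reshape(len(gender_encoded), -1)
one = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense for printing
print(one.fit_transform(gender_encoded).toarray())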
2) Write a python program to rescale the data between 0 and 1. (use inbuilt
dataset) (IRIS DATA SET) https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
1) Normalisation
# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the iris dataset
iris = load_iris()
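The exercise asks for values rescaled to the range 0 to 1. A minimal min-max scaling sketch in plain Python is given here; the sample measurements and the helper name min_max_scale are made up for illustration and are not taken from the Iris data or from scikit-learn.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical measurements (not real Iris rows)
sepal_lengths = [4.0, 5.0, 6.0, 8.0]
scaled = min_max_scale(sepal_lengths)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

scikit-learn's MinMaxScaler (in sklearn.preprocessing) applies the same formula to each column of a dataset.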
2) Data Standardization
# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()
print(iris.data.shape)
# separate the data and target attributes
X = iris.data
y = iris.target
# standardize the data attributes
standardized_X = preprocessing.scale(X)
3) Write a python program to split the dataset into training and testing sets
1) Using pandas
import pandas as pd
from sklearn.datasets import load_iris
iris_data = load_iris()
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
print(df)
training_data = df.sample(frac=0.8, random_state=25)
testing_data = df.drop(training_data.index)
2) Using scikit-learn
from sklearn.model_selection import train_test_split
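A short sketch of how train_test_split might be used follows. The tiny array is hypothetical, and the 80/20 ratio and random_state=25 are assumptions chosen to mirror the pandas version.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, plus a binary label
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=25
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```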
1. Write a python program to find all null values in a given data set and remove them.
import pandas as pd
dataset = pd.read_csv('city_day.csv')
dataset
dataset.isnull()
dataset.isnull().head(10)
dataset.isnull().sum()
dataset.isnull().head().sum()
modifieddataset=dataset.fillna(" ")
modifieddataset.isnull().sum()
dataset=dataset.dropna()
2. Write a python program to convert the Categorical values to numeric format for a given dataset.
import numpy as np
import pandas as pd
dataset = pd.read_csv('Data2.csv')
dataset
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
dataset['outlook'] = le.fit_transform(dataset.outlook)
dataset['temp'] = le.fit_transform(dataset.temp)
dataset['humidity'] = le.fit_transform(dataset.humidity)
dataset['playgolf'] = le.fit_transform(dataset.playgolf)
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x1 = st_x.fit_transform(x)
print(x1)
3. Write a python program to split the dataset into training and testing sets.
import numpy as np
import pandas as pd
dataset = pd.read_csv("play_tennis.csv")
dataset
from sklearn import preprocessing
Numpy :
NumPy is a Python package. It stands for ‘Numerical Python’. It is a library consisting of multidimensional array objects and a
collection of routines for processing those arrays.
Example 1
import numpy as np
a = np.array([1,2,3])
print(a)
The output is as follows:
[1 2 3]
Example 2
Pandas features
Time series analysis
Split-apply-combine is a common strategy used during analysis to summarize data—you split data into logical subgroups, apply some
function to each subgroup, and stick the results back together again. In pandas, this is accomplished using the groupby() function and
whatever functions you want to apply to the subgroups.
Group By: split-apply-combine
Data visualization
Visualization
Pivot tables
Common features
Creating Objects
Viewing Data
Selection
Manipulating Data
Grouping Data
Merging, Joining and Concatenating
Working with Date and Time
Working With Text Data
Working with CSV and Excel files
Operations
Visualization
Applications and Projects
Miscellaneous
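The split-apply-combine strategy described under the pandas features can be sketched with groupby() as follows. The sales table is hypothetical and exists only to illustrate the pattern.

```python
import pandas as pd

# Hypothetical sales records, used only to illustrate the pattern
sales = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Mumbai", "Pune"],
    "amount": [100, 200, 150, 50, 250],
})

# split by city, apply sum() to each group, combine into one Series
totals = sales.groupby("city")["amount"].sum()
print(totals)
```

Here Pune totals 500 and Mumbai totals 250; any other aggregation such as mean() or count() can replace sum().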
sklearn.preprocessing
The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors
into a representation that is more suitable for the downstream estimators.
sklearn.preprocessing.LabelEncoder
Encode target labels with value between 0 and n_classes-1.
This transformer should be used to encode target values, i.e. y, and not the input X.
Attributes
classes_ : ndarray of shape (n_classes,)
Holds the label for each class.
Methods
fit(y)
Fit label encoder.
Parameters
y : array-like of shape (n_samples,)
Target values.
Returns
self : returns an instance of self.
Fitted label encoder.
fit_transform(y)
Fit label encoder and return encoded labels.
Parameters
y : array-like of shape (n_samples,)
Target values.
Returns
y : array-like of shape (n_samples,)
Encoded labels.
transform(y)
Transform labels to normalized encoding.
Parameters
y : array-like of shape (n_samples,)
Target values.
Returns
y : array-like of shape (n_samples,)
Labels as normalized encodings.
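The LabelEncoder methods listed above can be tried on a small example. The label values here are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Male", "Female", "Female", "Male"]  # hypothetical target values

le = LabelEncoder()
le.fit(labels)                       # learns the sorted unique classes
print(le.classes_)                   # ['Female' 'Male']

encoded = le.transform(labels)       # maps each label to its class index
print(encoded)                       # [1 0 0 1]

# inverse_transform maps the integers back to the original labels
print(le.inverse_transform(encoded))
```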
pandas.DataFrame.iloc
Differences between loc and iloc
The main distinction between loc and iloc is:
loc is label-based, which means that you have to specify rows and columns based on their row and
column labels.
iloc is integer position-based, so you have to specify rows and columns by their integer position
values (0-based integer position).
Syntax
loc[row_label, column_label]
iloc[row_position, column_position]
With loc, we can pass the row label 'Fri' and the column label 'Temperature'.
# To get Friday's temperature
>>> df.loc['Fri', 'Temperature']
10.51
The equivalent iloc statement should take the row number 4 and the column number 1.
# The equivalent `iloc` statement
>>> df.iloc[4, 1]
10.51
We can use the syntax A:B:S to select data from label A to label B with step size S (Both A and B are
included):
# Slicing with step
df.loc['Mon':'Fri':2 , :]
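The loc and iloc statements above refer to a data frame that is not shown. A minimal sketch with a hypothetical weather table follows; only the Friday temperature (10.51) comes from the text, and every other value is made up.

```python
import pandas as pd

# Hypothetical weather table; only the Friday temperature (10.51) comes from the text
df = pd.DataFrame(
    {
        "Weather": ["Sunny", "Rain", "Cloudy", "Sunny", "Rain"],
        "Temperature": [12.8, 9.7, 11.2, 13.0, 10.51],
    },
    index=["Mon", "Tue", "Wed", "Thu", "Fri"],
)

by_label = df.loc["Fri", "Temperature"]   # label-based lookup
by_position = df.iloc[4, 1]               # position-based lookup (row 4, column 1)
print(by_label, by_position)

# Label slices include both endpoints; a step of 2 keeps Mon, Wed, Fri
every_other_day = df.loc["Mon":"Fri":2, :]
print(every_other_day)
```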
In all Estimators:
model.fit() : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)).
For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
In supervised estimators:
model.predict() : given a trained model, predict the label of a new set of data. This method accepts one argument, the new
data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
model.predict_proba() : For classification problems, some estimators also provide this method, which returns the probability that a new observation
has each categorical label. In this case, the label with the highest probability is returned by model.predict().
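model.score() : for classification or regression problems, most estimators implement a score method. For classifiers the default score is the mean accuracy, between 0 and 1; for regressors it is the R² coefficient, which is at most 1 and can be negative. In both cases a larger
score indicates a better fit.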
In unsupervised estimators:
model.transform() : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the
new representation of the data based on the unsupervised model.
model.fit_transform() : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
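The estimator methods described above can be seen together in one short run. This is only a sketch: the choice of the Iris data, StandardScaler as the transformer, DecisionTreeClassifier as the supervised model, and the 75/25 split are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.25, random_state=0)

# Transformer: fit() needs only X, transform() re-expresses data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit and transform in one call
X_new_scaled = scaler.transform(X_new)           # reuse the fitted statistics

# Supervised estimator: fit() takes both X and y
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train_scaled, y_train)

labels = model.predict(X_new_scaled)             # one label per row
probs = model.predict_proba(X_new_scaled)        # one probability per class per row
accuracy = model.score(X_new_scaled, y_new)      # mean accuracy for a classifier
print(labels[:5])
print(probs.shape)
print(accuracy)
```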
Classification:
SET A)
1) Write a Python program to build a Decision Tree Classifier using the Scikit-learn
package for the diabetes data set (download the dataset from
https://www.kaggle.com/uciml/pima-indians-diabetes-database)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
# accuracy
print("Accuracy:", metrics.accuracy_score(Y_test,y_pred))
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(classifier, out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names =
feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
2) Write a Python program to build a Decision Tree Classifier for shows.csv using
pandas and predict the class label for a show starring a 40-year-old American
comedian with 10 years of experience and a comedy ranking of 7. Create
a csv file as shown in
https://www.w3schools.com/python/python_ml_decision_tree.asp
import pandas
from sklearn import tree
import pydotplus
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import matplotlib.image as pltimg
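df = pandas.read_csv(r"c:\shows.csv")
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)
# 40 years old, 10 years of experience, comedy ranking 7, Nationality 1 (USA)
print(dtree.predict([[40, 10, 7, 1]]))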
SET B
1) Consider following dataset
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mi
#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print("Predicted Value:", predicted)
output: Predicted Value: [1]
Here, 1 indicates that players can 'play'.
SET C
1) Write a Python program to build SVM model to Cancer dataset. The
dataset is available in the scikit-learn library. Check the accuracy of model
with precision and recall.
#Load dataset
from sklearn import datasets
cancer = datasets.load_breast_cancer()
# print the names of the 30 features
print("Features: ", cancer.feature_names)
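The remaining steps of the exercise (training the SVM and checking accuracy, precision and recall) could look like the sketch below. The 70/30 split, the random_state value and the linear kernel are assumptions, not choices confirmed by the workbook.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = datasets.load_breast_cancer()

# Assumed split: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)

clf = SVC(kernel="linear")   # assumed kernel choice
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
```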
Association Rules
SET A)
1) Write a Python program to read the dataset (“Iris.csv”), downloaded from
(https://archive.ics.uci.edu/ml/datasets/iris), and apply the Apriori algorithm.
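Apriori-style rule mining filters candidate rules by support, confidence and lift, which is what thresholds such as min_support, min_confidence and min_lift refer to. The plain-Python sketch here computes the three measures for a single rule; the transactions and the helper names support and rule_metrics are hypothetical and are not part of any library.

```python
# Hypothetical market-basket transactions, used only for illustration
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(antecedent | consequent)
    confidence = s_both / support(antecedent)
    lift = confidence / support(consequent)
    return s_both, confidence, lift

s, c, l = rule_metrics({"bread"}, {"milk"})
print(s, c, l)
```

For the rule bread -> milk the support is 2/4, the confidence is 2/3 and the lift is 8/9; a lift below 1 means the two items occur together less often than independence would predict.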
Ans
from apyori import apriori  # assumed source of apriori(): the apyori package
association_rules=apriori(records,min_support=0.0045,min_confidence=0.2,min_lift=3,min_length=2)
association_results=list(association_rules)
print(len(association_results))
print(association_results[0])
for item in association_results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule:"+items[0]+"->"+items[1])
    print("Support:"+str(item[1]))
SET B
1) Write a Python program to read “StudentsPerformance.csv” file. Solve following:
association_rules=apriori(records,min_support=0.0040,min_confidence=0.2,min_lift=3,min_length=2)
association_results=list(association_rules)
print(len(association_results))
print(association_results[0])
for item in association_results:
Set C
Ans
1) Consider the following observations/data. Apply simple linear regression and find
the estimated coefficients b0 and b1. (Use the numpy package)
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,11,13]
y = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12,16, 18]
Ans
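import numpy as np
import matplotlib.pyplot as plt
# putting labels
plt.xlabel('x')
plt.ylabel('y')
# estimate_coef is not shown in the original answer; this is the standard
# least-squares form for the line y = b_0 + b_1*x
def estimate_coef(x, y):
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)
def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9,11,13])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12,16, 18])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
\nb_1 = {}".format(b[0], b[1]))
if __name__ == "__main__":
    main()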
2) Consider the following observations/data. Apply simple linear regression and find
the estimated coefficients b0 and b1. Also analyse the performance of the model.
(Use sklearn package)
x = np.array([1,2,3,4,5,6,7,8])
y = np.array([7,14,15,18,19,21,26,23])
Ans
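import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x = np.array([1,2,3,4,5,6,7,8])
y = np.array([7,14,15,18,19,21,26,23])
# fit the line; r measures how well it describes the data
slope, intercept, r, p, std_err = stats.linregress(x, y)
print("b0 =", intercept, "b1 =", slope, "r =", r)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()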
1) Write a python program to implement multiple Linear Regression model for a car
dataset.
Dataset can be downloaded from:
https://www.w3schools.com/python/python_ml_multiple_regression.asp
Ans
import pandas
from sklearn import linear_model
df = pandas.read_csv(r"d:dmdataset\carsm.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
predictedCO2 = regr.predict(pandas.DataFrame([[2300, 1300]], columns=['Weight', 'Volume']))
print(predictedCO2)
2) Write a python programme to implement multiple linear regression model for stock
market data frame as follows:
Stock_Market = {'Year':
[2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
'Interest_Rate':
[2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
'Unemployment_Rate':
[5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
'Stock_Index_Price':
[1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719] }
And draw a graph of stock market price versus interest rate.
Ans
import pandas as pd
import matplotlib.pyplot as plt
Stock_Market = {'Year':
[2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
'Interest_Rate':
[2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
'Unemployment_Rate':
[5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
'Stock_Index_Price':
[1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
}
df = pd.DataFrame(Stock_Market,columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])
# graph of stock index price versus interest rate
plt.scatter(df['Interest_Rate'], df['Stock_Index_Price'], color='red')
plt.xlabel('Interest Rate')
plt.ylabel('Stock Index Price')
plt.show()
1) Write a program in python to print the number of outliers. Generate 200 samples
from a normal distribution centered around the value 100, with a standard deviation
of 5.
Ans
Z-Score and How It’s Used to Determine an Outlier | by Iden W. | Clarusway | Medium
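A minimal sketch of the z-score approach named in the reference above, using only the standard library. The seed and the cut-off of 3 standard deviations are assumptions; the exercise does not state a threshold.

```python
import random
import statistics

random.seed(42)  # assumed seed, only to make the run repeatable

# 200 samples from a normal distribution centred on 100 with standard deviation 5
samples = [random.gauss(100, 5) for _ in range(200)]

mean = statistics.mean(samples)
std = statistics.pstdev(samples)

# Assumed rule: a point is an outlier when its z-score magnitude exceeds 3
threshold = 3
outliers = [x for x in samples if abs((x - mean) / std) > threshold]

print("Number of outliers:", len(outliers))
```

With a threshold of 3, a normal sample of this size is expected to contain very few outliers, often none.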
Clustering
Set A
1. Write a python program to implement k-means algorithm to build prediction model
(Use Credit Card Dataset CC GENERAL.csv Download from kaggle.com)
Ans
#dataset --> https://www.kaggle.com/mlg-ulb/creditcardfraud/version/3
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
dataset = pd.read_csv('creditcard.csv')
dataset
x = dataset.iloc[:, [3, 4]].values
print(x)
from sklearn.cluster import KMeans
wcss_list= []
for i in range(1, 11):
kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
kmeans.fit(x)
wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()
kmeans = KMeans(n_clusters=3, init='k-means++', random_state= 42)
y_predict= kmeans.fit_predict(x)
Ans
dataset = pd.read_csv('Mall_Customers.csv')
Set B
1. Write a python program to implement k-means algorithms on a synthetic dataset.
data[0].shape
data[1]
plt.scatter(data[0][:,0],data[0][:,1],c=data[1],cmap='brg')
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(data[0])
kmeans.cluster_centers_
kmeans.labels_
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(10,6))
ax1.set_title('K Means')
ax1.scatter(data[0][:,0],data[0][:,1],c=kmeans.labels_,cmap='brg')
ax2.set_title("Original")
ax2.scatter(data[0][:,0],data[0][:,1],c=data[1],cmap='brg')
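To show what a k-means fit does on synthetic data, here is a plain-Python sketch of the assignment and update steps (Lloyd's algorithm). The two blobs, the seed, the iteration count and the hand-picked starting centres are all assumptions; scikit-learn's KMeans uses k-means++ initialisation and other refinements not reproduced here.

```python
import random

random.seed(0)  # assumed seed for repeatability

# Hypothetical synthetic data: two well-separated 2-D blobs of 50 points each
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
points += [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(50)]

def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def k_means(data, centers, iterations=10):
    """Lloyd's algorithm: alternate assignment and centre update steps."""
    for _ in range(iterations):
        # Assignment step: index of the nearest centre for every point
        labels = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in data
        ]
        # Update step: move each centre to the mean of its assigned points
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(centers[c])  # keep a centre that lost all points
        centers = new_centers
    return centers, labels

# Start from one point of each blob (hand-picked, not k-means++)
centers, labels = k_means(points, [points[0], points[50]])
print(centers)
```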
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
dataset = pd.read_csv('Wholesale customers data.csv')
dataset
x = dataset.iloc[:, [3, 4]].values
print(x)
import scipy.cluster.hierarchy as shc
dendro = shc.dendrogram(shc.linkage(x, method="ward"))
mtp.title("Dendrogram Plot")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()
from sklearn.cluster import AgglomerativeClustering
# ward linkage always uses Euclidean distance, so no distance argument is needed
hc= AgglomerativeClustering(n_clusters=5, linkage='ward')
y_pred= hc.fit_predict(x)
mtp.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
mtp.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
mtp.scatter(x[y_pred== 2, 0], x[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
mtp.title('Clusters of customers')
mtp.xlabel('Milk')
mtp.ylabel('Grocery')
mtp.legend()
mtp.show()