DMDW Lab Manual
LAB-MANUAL
R20
SWARNANDHRA
COLLEGE OF ENGINEERING & TECHNOLOGY
(AUTONOMOUS)
EXPERIMENT-I
Aim: Demonstrate the following data preprocessing tasks using python
libraries.
a) Loading the dataset
b) Identifying the dependent and independent variables.
c) Dealing with missing data
Solution:
a) Loading the dataset
Pandas.read_csv():
Pandas is a very popular and commonly used data manipulation library. One of its most important and mature functions is read_csv(), which can read any .csv file easily and helps us manipulate it.
#import libraries
import pandas as pd
# Load the dataset locally
file_path = r"C:\Users\ayyap\OneDrive\Desktop\Demo.csv"
Demo= pd.read_csv(file_path)
# Display the first few rows of the dataset
print("Original Dataset:")
print(Demo.head())
Output:
Original Dataset:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
b) Identifying the dependent and independent variables
import pandas as pd
file_path = r"C:\Users\ayyap\OneDrive\Desktop\Demo.csv"
dataset = pd.read_csv(file_path)
The variables here can be classified as independent and dependent variables. The independent variables are used to determine the dependent variable. In our dataset, the first three columns (Country, Age, Salary) are independent variables, which will be used to determine the dependent variable (Purchased), the fourth column.
Independent Variables (Features):
These are the variables used as input to predict the dependent variable. They are typically denoted as 'X'. In a dataset, every column except the dependent variable can be considered an independent variable, and features can be categorical or numerical. Here we create a matrix of independent variables 'X' from the dataset, containing the columns used to predict the dependent variable.
(i) Creating the matrix of features (independent variables)
Now, we need to separate the matrix of features containing the independent variables from the dependent variable 'Purchased'.
x = dataset.iloc[:, :-1].values
print(x)
Output:
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 61000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
In the code above, the first ':' stands for the rows we want to include, and the second one stands for the columns we want to include. By default, if only the ':' (colon) is used, all the rows/columns are included. In the case of our dataset, we need to include all the rows (:) and all the columns except the last one (:-1).
(ii) Creating the dependent variable vector
Dependent Variable (Target Variable):
This is the variable that you are trying to predict or explain with your model. It is typically denoted as 'y'. In a classification problem, it is a categorical variable (e.g., predicting whether an email is spam or not); in a regression problem, it is a continuous variable (e.g., predicting house prices). In our dataset, the dependent variable 'y' is a categorical variable indicating whether a purchase was made ('Yes' or 'No').
We follow the same procedure to create the dependent variable vector 'y'. The only change is the columns we want in y. As in the matrix of features, we include all the rows, but from the columns we need only the 4th one (index 3, keeping in mind Python's zero-based indexing). Therefore, the code for the same looks as follows:
y= dataset.iloc[:,3].values
print(y)
Output:
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
c) Dealing with missing data
import pandas as pd
file_path = r"C:\Users\ayyap\OneDrive\Desktop\Demo.csv"
Demo = pd.read_csv(file_path)
# Find the number of missing values in each column
print(Demo.isnull().sum())
Output:
Country 0
Age 1
Salary 1
Purchased 0
dtype: int64
Input:
# Find the total number of missing values in the entire dataset
print(Demo.isnull().sum().sum())
Output:
2
# Drop the rows that contain missing values
df = Demo.dropna(axis=0)
df.isnull().sum()
Output:
Country 0
Age 0
Salary 0
Purchased 0
dtype: int64
# Drop an entire column (here 'Purchased') instead of rows
df = Demo.drop(['Purchased'], axis=1)
df.isnull().sum()
Output:
Country 0
Age 1
Salary 1
dtype: int64
EXPERIMENT-2
Aim: Demonstrate the following data preprocessing tasks using python
libraries.
a) Dealing with categorical data.
b) Scaling the features.
c) Splitting dataset into Training and Testing Sets
Solution:
a) Dealing with categorical data
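The original code listing for this part is not reproduced in this copy of the manual. Below is a minimal sketch, assuming the same Demo.csv file from Experiment-I, that label-encodes the target and one-hot encodes the 'Country' column; the encoding choices are illustrative, not the manual's own listing.
# Import libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Load the dataset used in Experiment-I (assumption)
Demo = pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\Demo.csv")
# Label-encode the dependent variable 'Purchased' (Yes/No -> 1/0)
le = LabelEncoder()
Demo['Purchased'] = le.fit_transform(Demo['Purchased'])
# One-hot encode the categorical 'Country' column (one binary column per country)
Demo = pd.get_dummies(Demo, columns=['Country'])
print(Demo.head())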
Sample Output:
b) Scaling the features
Standardization is a scaling technique in which the mean of each feature becomes equal to zero and its standard deviation equal to one. Before scaling, let's visualize the unscaled features of the Ames Housing dataset against SalePrice:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\Ameshousing.csv")
x=df[['Gr Liv Area','Overall Qual']].values
y=df['SalePrice'].values
fig,ax=plt.subplots(ncols=2,figsize=(12,4))
ax[0].scatter(x[:,0],y)
ax[1].scatter(x[:,1],y)
plt.show()
Output:
Plotting both features on the same axes makes the difference in their scales obvious:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\Ameshousing.csv")
x=df[['Gr Liv Area','Overall Qual']].values
y=df['SalePrice'].values
fig,ax=plt.subplots(figsize=(12,4))
ax.scatter(x[:,0],y)
ax.scatter(x[:,1],y)
plt.show()
Output:
1. Standardization
The StandardScaler class is used to transform the data by standardizing it.
Let's import it and scale the data via its fit_transform() method:
Input:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# x and y come from the earlier cell that loaded Ameshousing.csv
fig, ax = plt.subplots(figsize=(12, 4))
scaler = StandardScaler()
x_std = scaler.fit_transform(x)
ax.scatter(x_std[:,0],y)
ax.scatter(x_std[:,1],y)
plt.show()
Output:
2. MinMaxScaler
To normalize features, we use the MinMaxScaler class. It works in much the same way as StandardScaler, but uses a fundamentally different approach to scaling the data: the features are normalized into the range [0, 1].
Input:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
fig,ax=plt.subplots(figsize=(12,4))
scaler=MinMaxScaler()
x_minmax=scaler.fit_transform(x)
ax.scatter(x_minmax[:,0],y)
ax.scatter(x_minmax[:,1],y)
plt.show()
Output:
c) Splitting dataset into Training and Testing Sets
Example:
Download kc_house_data.csv
Input:
import pandas as pd
from sklearn.model_selection import train_test_split
df=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\kc_house_data.csv")
columns=['bedrooms','bathrooms','sqft_living','sqft_lot','floors','price']
df=df.loc[:,columns]
print(df.head())
Output:
import pandas as pd
# 'price' is the target, so it is not included in the feature list
features=['bedrooms','bathrooms','sqft_living','sqft_lot','floors']
X=df.loc[:,features]
y=df.loc[:,['price']]
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0,train_size=.75)
print(X_train.count())
print(X_test.count())
print(y_train.count())
print(y_test.count())
EXPERIMENT-3
Aim: Demonstrate the following Similarity and Dissimilarity Measures using
python
a) Euclidean Distance
b) Manhattan Distance
c) Minkowski Distance
d) Cosine Similarity
e) Jaccard Similarity
f) Pearson’s Correlation
Solution:
Similarity
• A similarity measure quantifies how much alike two data objects are.
• In a data mining or machine learning context, a similarity measure is a distance whose dimensions represent features of the objects.
• If the distance is small, the objects have a high degree of similarity, whereas a large distance indicates a low degree of similarity.
• Similarity is subjective and is highly dependent on the domain and application.
• For example, two fruits may be similar because of colour, size or taste. Special care should be taken when calculating distance across dimensions/features that are unrelated.
Generally, similarity is measured in the range 0 to 1 [0, 1]. In the machine learning world, this score in the range of [0, 1] is called the similarity score.
Two main considerations of similarity:
• Similarity = 1 if X = Y (where X, Y are two objects)
• Similarity = 0 if X ≠ Y
Dissimilarity
A dissimilarity measure works in the opposite way to a similarity measure, i.e., it returns 1 if the objects are dissimilar and 0 if they are similar.
Proximity refers to either a similarity or a dissimilarity.
a) Euclidean Distance
The Euclidean distance between two points is the shortest (straight-line) distance between them. In other words, it is the length of the displacement between the two points.
Given two points, A (a, b) and B (c, d), in a 2-dimensional plane, the Euclidean distance between A and B is given as:
d(A, B) = √((c − a)² + (d − b)²)
To find the distance between two points in three-dimensional space, let A (x1, y1, z1) and B (x2, y2, z2) be two points; then:
d(A, B) = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²)
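Input: (the original listing is missing here; the sketch below follows the same style as the other measures, with hypothetical points chosen so that the printed distance matches the output shown)
from math import sqrt
def euclidean_distance(x, y):
    # square root of the sum of squared coordinate differences
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
print("Euclidean distance between point1 and point2:", euclidean_distance([2, 2, 2], [2, 2, 5]))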
Output:
Euclidean distance between point1 and point2: 3.0
b) Manhattan Distance
The Manhattan distance is the sum of the absolute differences between the coordinates of the two points.
Input:
from math import *
def manhat_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))
print("Manhattan distance between two points:", manhat_distance([10, 12, 15], [8, 12, 20]))
Output:
Manhattan distance between two points: 7
c) Minkowski Distance
The Minkowski distance is a generalized metric form of the Euclidean distance and the Manhattan distance. For two points X = (x1, ..., xn) and Y = (y1, ..., yn) it is defined as:
D(X, Y) = (Σ |xi − yi|^p)^(1/p)
where p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
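Input: (the original listing is missing here; the points [0, 3, 4, 5] and [7, 6, 3, -1] and the order p = 3 below are assumptions that reproduce the printed value)
def minkowski_distance(x, y, p):
    # p-th root of the sum of the absolute differences raised to the power p
    return round(sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p), 3)
print("Minkowski Distance between two points:", minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))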
Output:
Minkowski Distance between two points: 8.373
d) Cosine Similarity
The cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we effectively find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].
Input:
from math import *
def square_rooted(x):
    return round(sqrt(sum([a * a for a in x])), 3)
def cosine_similarity(x, y):
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)
print("Cosine Similarity between two points:", cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))
Output:
Cosine Similarity between two points: 0.972
e) Jaccard Similarity
The Jaccard similarity measures the similarity between finite sample sets and is defined as the cardinality of the intersection of the sets divided by the cardinality of the union of the sets.
For two sets A and B, the Jaccard similarity is the ratio of the cardinality of A ∩ B to the cardinality of A ∪ B:
J(A, B) = |A ∩ B| / |A ∪ B|
Input:
from math import *
def jaccard_similarity(x, y):
    intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
    union_cardinality = len(set.union(*[set(x), set(y)]))
    return intersection_cardinality / float(union_cardinality)
print("Jaccard Similarity between two points:", jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 5, 7, 9]))
Output:
Jaccard Similarity between two points: 0.375
f) Pearson’s Correlation
Correlation:
Variables within a dataset can be related for lots of
reasons. For example:
• One variable could cause or depend on the values of another variable.
• One variable could be lightly associated with another variable.
• Two variables could depend on a third unknown variable.
It can be useful in data analysis and modeling to better understand the relationships
between variables. The statistical relationship between two variables is referred to
as their correlation.
• Positive Correlation: both variables change in the same direction.
• Neutral Correlation: No relationship in the change of the variables.
• Negative Correlation: variables change in opposite directions.
Pearson's Correlation:
Pearson's correlation coefficient measures the strength of the linear relationship between two variables:
r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²)
When two variables are linearly related, a linear function that predicts the values of the dependent variable as a function of the independent variable can be found for two-dimensional sample points with one independent variable and one dependent variable; this is the regression line.
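Input: (the manual does not reproduce a Python listing for this measure; below is a minimal sketch in the same style as the other measures, using hypothetical sample lists)
from math import sqrt
def pearson_correlation(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    # covariance term divided by the product of the standard-deviation terms
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return round(numerator / denominator, 3)
print("Pearson's Correlation between two lists:", pearson_correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
Output:
Pearson's Correlation between two lists: 1.0
A value of +1 indicates a perfect positive linear relationship between the two lists.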
The below graph explains the relation between Salary and Years of Experience
Equation : y = mx + c
This is the simple linear regression equation where c is the constant and m is
the slope and describes the relationship between x (independent
variable) and y (dependent variable). The coefficient can be positive or negative
and is the degree of change in the dependent variable for every 1 unit of change
in the independent variable.
β0 (y-intercept) and β1 (slope) are the coefficients; their estimated values determine how closely the predicted values match the actual values.
EXPERIMENT-4
Aim: Build a model using Simple Linear Regression.
Solution:
Implement Simple Linear Regression in Python
In this example, we will use the salary data concerning the experience of
employees. In this dataset, we have two columns YearsExperience and Salary
Step 1: Import the required python packages
We need Pandas for data manipulation, NumPy for mathematical calculations,
and MatplotLib, and Seaborn for visualizations. Sklearn libraries are used for
machine learning operations
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Step 2: Load the dataset
Download the dataset from here and upload it to your notebook and read it into
the pandas dataframe.
# Get dataset
df_sal = pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\Salary_Data.csv")
df_sal.head()
Step 3: Data analysis
Now that we have our data ready, let's analyze and understand its trend in detail.
To do that we can first describe the data below –
# Describe data
df_sal.describe()
Here, we can see that Salary ranges from 37731 to 122391, with a median of 65237. We can also see how the data is distributed visually using Seaborn's histplot.
# Data distribution
plt.title('Salary Distribution Plot')
sns.histplot(df_sal['Salary'])
plt.show()
It is clearly visible now, our data varies linearly. That means, that an individual
receives more Salary as they gain Experience.
Step 4: Split the dataset into dependent/independent variables
Experience (X) is the independent variable
Salary (y) is dependent on experience
# Splitting variables
X = df_sal.iloc[:, :1] # independent
y = df_sal.iloc[:, 1:] # dependent
Step 5: Split data into Train/Test sets
Further, split your data into training (80%) and test (20%) sets using train_test_split.
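The call to train_test_split is not shown in this copy of the manual; a minimal sketch (the random_state value is an assumption, not taken from the original) is:
# Splitting dataset into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
Next, train the LinearRegression model on the training set: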
# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Step 6: Predict the result
Here comes the interesting part, when we are all set and ready to predict any
value of y (Salary) dependent on X (Experience) with the trained model
using regressor.predict
# Prediction result
y_pred_test = regressor.predict(X_test) # predicted value of y_test
y_pred_train = regressor.predict(X_train) # predicted value of y_train
Step 7: Plot the training and test results
It's time to test our predicted results by plotting graphs.
Plot training set data vs predictions
First, we plot the training set data (X_train, y_train) together with the regression line, i.e. X_train against the predicted value of y_train (regressor.predict(X_train)).
# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/Pred(y_test)', 'X_train/y_train'], title = 'Sal/Exp', loc='best',
facecolor='white')
plt.box(False)
plt.show()
Plot test set data vs predictions
Secondly, we plot the test set data (X_test, y_test) together with the regression line obtained from the training set, i.e. X_train against the predicted value of y_train (regressor.predict(X_train)).
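The corresponding listing is not reproduced in this copy; a sketch that simply mirrors the training-set plot above is:
# Prediction on test set
plt.scatter(X_test, y_test, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick') # regression line learned from the training set
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/Pred(y_train)', 'X_test/y_test'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()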
EXPERIMENT-5
Aim: Build a classification model using the Decision Tree algorithm on a suitable dataset.
Solution:
A decision tree is a machine learning algorithm that uses a tree-like model of
decisions and their subsequent consequences to arrive at a particular decision. It is a
Supervised Machine Learning model, where the data is continuously split according
to a certain parameter, and finally, a decision is made.
Usually, a decision tree is drawn upside down, with the root node at the top and the
leaf nodes at the bottom. A decision tree usually contains 3 types of nodes.
1. Root node: The very top node that represents the entire population or sample.
2. Decision nodes: Sub-nodes that split from the root node.
3. Leaf nodes: Nodes with no children, also known as terminal nodes.
We will be using the IRIS dataset to build a decision tree classifier. The dataset
contains information for three classes of the IRIS plant, namely IRIS Setosa, IRIS
Versicolour, and IRIS Virginica, with the following attributes: sepal length, sepal
width, petal length, and petal width.
Our aim is to predict the class of the IRIS plant based on the given attributes.
Source Code:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
data=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\iris.csv")
data.head()
Output:
First five records of ‘iris’ dataset:
Source Code:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
data=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\iris.csv")
#Extracting Attributes/Features
X=data[['SepalLength','SepalWidth','PetalLength','PetalWidth']]
#Extracting Target/Class Labels
y=data['Species']
#Import Library for splitting data
from sklearn.model_selection import train_test_split
#creating Train and test Datasets
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=50,test_size=0.25)
#Creating Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier()
clf.fit(X_train,y_train)
#predict Accuracy Score
#Predict the response for test dataset
y_pred=clf.predict(X_train)
print("Train data accuracy:",accuracy_score(y_train,y_pred)*100)
y_pred=clf.predict(X_test)
print("Test data accuracy:",accuracy_score(y_test,y_pred)*100)
Output:
EXPERIMENT-6
Aim: Build a classification model using the Naive Bayes algorithm on a suitable dataset.
Solution:
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a family of algorithms that all share a common principle, i.e. every pair of features being classified is independent of each other. Before moving to the formula for Naive Bayes, it is important to know about Bayes' theorem.
Bayes' Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes' theorem is stated mathematically as the following equation:
P(A|B) = P(B|A) · P(A) / P(B)
where A and B are events and P(B) ≠ 0.
Consider a fictional dataset that describes the weather conditions for playing a game
of golf. Given the weather conditions, each tuple classifies the conditions as
fit(“Yes”) or unfit(“No”) for playing golf.
The dataset is divided into two parts, namely, feature matrix and the response
vector.
In the above dataset, the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy', and the response vector contains the value of the class variable 'Play golf'.
X = {Outlook, Temperature, Humidity, Windy}, y = 'Play golf'
E.g., consider the first row in the dataset: X = (Rainy, Hot, High, False), y = No.
Now, with regard to our dataset, we can apply Bayes' theorem in the following way:
P(y|X) = P(X|y) · P(y) / P(X)
where y is the class variable and X is a dependent feature vector (of size n):
X = (x1, x2, x3, ..., xn)
Basically, P(y|X) here means the probability of "Not playing golf" given that the weather conditions are "Rainy outlook", "Temperature is hot", "high humidity" and "no wind".
Naïve Assumption:
The fundamental Naive Bayes assumption is that each feature makes an:
• independent
• equal
contribution to the outcome.
With relation to our dataset, this concept can be understood as: we assume that no pair of features is dependent (for example, the temperature being 'Hot' has nothing to do with the humidity), and each feature is given the same weight (importance) in deciding the outcome.
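The preprocessing that produces the label-encoded *_le columns used below is not reproduced in this copy of the manual. A minimal sketch, assuming a weather/golf CSV with the columns Outlook, Temperature, Humidity, Windy and Play (the file name golf.csv is hypothetical), could be:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Hypothetical path and file name; the original dataset file is not named in this copy
df = pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\golf.csv")
le = LabelEncoder()
for col in ["Outlook", "Temperature", "Humidity", "Windy", "Play"]:
    # Add a label-encoded copy of each categorical column
    df[col + "_le"] = le.fit_transform(df[col])
df.head(3)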
# Drop the original (un-encoded) categorical columns, keeping only the *_le columns
df = df.drop(["Outlook", "Temperature", "Humidity", "Windy", "Play"], axis=1)
df.head(3)
Categorical Naïve Bayes:
X=df[["Outlook_le","Temperature_le","Humidity_le","Windy_le"]]
y=df["Play_le"]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42,test_size=0.4)
from sklearn.naive_bayes import CategoricalNB
cnb=CategoricalNB()
cnb.fit(X_train,y_train)
y_pred=cnb.predict(X_test)
print(y_test)
print(y_pred)
from sklearn.metrics import accuracy_score
acc=accuracy_score(y_test,y_pred)*100
print("Accuracy of Categorical Naive Bayes:”,acc)
SampleOutput:
EXPERIMENT-7
Aim: Demonstrate association rule mining using the Apriori algorithm on a transactional dataset.
Solution:
➢ The Apriori algorithm is a well-known Machine Learning algorithm used for
association rule learning.
➢ Association rule learning is taking a dataset and finding relationships
between items in the data. For example, if you have a dataset of grocery
store items, you could use association rule learning to find items that are
often purchased together.
➢ The Apriori algorithm is used on frequent item sets to generate association
rules and is designed to work on the databases containing transactions.
➢ The process of generating association rules is called association rule mining
or association rule learning. We can use these association rules to measure
how strongly or weakly two objects from the dataset are related.
➢ Frequent itemsets are those whose support value exceeds the user-
specified minimum support value.
The most common problems that this algorithm helps to solve are:
➢ Product recommendation
➢ Market basket recommendation
Confidence
Confidence measures how often items in Y appear in transactions that contain X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Lift
Lift describes how confident we are that B will be purchased too when the customer buys A:
Lift(A → B) = Confidence(A → B) / Support(B)
Example:
Let’s imagine we have a history of 3000 customers’ transactions in our database, and
we have to calculate the Support, Confidence, and Lift to figure out how likely the
customers who buy Biscuits will buy Chocolate.
the confidence value shows the probability that customers buy Chocolate if they buy
Biscuits
To calculate this value, we need to divide the number of transactions that contain
Biscuits and Chocolates by the total number of transactions having Biscuits:
the Lift value shows the potential increase in the ratio of the sale of Chocolates when
you sell Biscuits. The larger the value of the lift, the better:
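The manual's worked figures are not reproduced in this copy; the short sketch below uses purely hypothetical counts to illustrate how the three measures are computed:
# Hypothetical transaction counts (for illustration only)
total_transactions = 3000   # all transactions in the database
biscuits = 400              # transactions containing Biscuits
chocolates = 600            # transactions containing Chocolates
both = 200                  # transactions containing both Biscuits and Chocolates
support = both / total_transactions                     # P(Biscuits and Chocolates)
confidence = both / biscuits                            # P(Chocolates | Biscuits)
lift = confidence / (chocolates / total_transactions)   # confidence / P(Chocolates)
print(f"Support = {support:.3f}, Confidence = {confidence:.2f}, Lift = {lift:.2f}")
# Support = 0.067, Confidence = 0.50, Lift = 2.50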
First, the algorithm will create a table containing each item set’s support count in
the given dataset – the Candidate set
Let’s assume that we’ve set the minimum support value to 3, meaning the algorithm
will drop all the items with a support value of less than three.
The algorithm will take out all the itemsets with a greater support count than the
minimum support (frequent itemset) in the next step:
Next, the algorithm will generate the second candidate set (C2) with the help of the
frequent itemset (L1) from the previous calculation. The candidate set 2 (C2) will
be formed by creating the pairs of itemsets of L1. After creating new subsets, the
algorithm will again find the support count from the main transaction table of
datasets by calculating how often these pairs have occurred together in the given
dataset.
After that, the algorithm will compare C2's support count values with the minimum support count (3), and the itemsets with a smaller support count will be eliminated.
Step-1: Load the transaction data and one-hot encode it using TransactionEncoder
Code:
import pandas as pd
data_df=pd.read_csv(r'C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\product.csv'
,header=None)
data_df.tail()
data_list=[]
for x in data_df.values.tolist():
    data_list.append([item for item in x if str(item) != 'nan'])
data_list[1]
from mlxtend.preprocessing import TransactionEncoder
te=TransactionEncoder()
te_ary=te.fit(data_list).transform(data_list)
df=pd.DataFrame(te_ary,columns=te.columns_)
df.head()
Sample Output:
Step-2: Generate the frequent itemsets with a minimum support of 1% using apriori
Code:
from mlxtend.frequent_patterns import apriori
frequent_itemsets=apriori(df,min_support=0.01,use_colnames=True)
print(frequent_itemsets)
Sample Output:
Step-3: Add a column 'length' and store the length of each frequent itemset
Code:
frequent_itemsets['length']=frequent_itemsets['itemsets'].apply(lambda x:len(x))
print(frequent_itemsets.tail())
Sample Output:
Step-4: Filter the frequent itemsets with length = 3 and support >= 1.5%
Code:
frequent_itemsets[(frequent_itemsets['length']==3)&(frequent_itemsets['support']>=0.015)]
Sample Output:
Step-5: Generate Association rules for the frequent item sets of step-4 with
confidence=50%
Code:
from mlxtend.frequent_patterns import association_rules
rules=association_rules(frequent_itemsets,metric='confidence',min_threshold=0.5)
rules
Sample Output:
From the above output, the rules generated with support >= 1.5% and confidence >= 50% are:
EXPERIMENT-8
Aim: Apply K-Means clustering algorithm on any dataset.
Solution:
• K-Means is an unsupervised machine learning algorithm that is used
for clustering problems.
• K-Means divides unlabelled data points into specific clusters/groups of
points. As a result, each data point belongs to only one cluster that has
similar properties.
K-Means Algorithm
The steps involved in K-Means are as follows:-
Step-1: Loading the libraries and dataset and display first 5 rows
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
dataset=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\Mall_Customers.csv")
dataset.head()
Output:
Step-2: Select the columns 'Annual Income' and 'Spending Score' as X and use them to determine the number of clusters using the Elbow Method
X=dataset.iloc[:,[3,4]].values
from sklearn.cluster import KMeans
wcss=[]
for i in range(1,11):
    kmeans=KMeans(n_clusters=i,init='k-means++',random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.xlabel("Number of Clusters")
plt.ylabel("WCSS")
plt.show()
Output:
From the above plot, it is clear that the number of clusters to be formed is 5 (the 'elbow' point), so we choose k = 5.
Step-3:
Using the KMeans class of sklearn.cluster, create the clusters of X and fit X to predict the cluster labels
kmeans=KMeans(n_clusters=5,init='k-means++',random_state=42)
y_kmeans=kmeans.fit_predict(X)
Step-4: Visualize the clusters along with their centroids
plt.scatter(X[y_kmeans==0,0],X[y_kmeans==0,1],s=60,c="red",label="Cluster1")
plt.scatter(X[y_kmeans==1,0],X[y_kmeans==1,1],s=60,c="blue",label="Cluster2")
plt.scatter(X[y_kmeans==2,0],X[y_kmeans==2,1],s=60,c="green",label="Cluster3")
plt.scatter(X[y_kmeans==3,0],X[y_kmeans==3,1],s=60,c="violet",label="Cluster4")
plt.scatter(X[y_kmeans==4,0],X[y_kmeans==4,1],s=60,c="yellow",label="Cluster5")
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=100,c='black',label='centroids')
plt.xlabel("Annual Income(K$)")
plt.ylabel("Spending Score(1-100)")
plt.legend()
plt.show()
Output:
EXPERIMENT-9
Aim: Apply Hierarchical clustering algorithm on any dataset.
Solution:
Hierarchical clustering:
Hierarchical clustering groups similar objects into clusters and represents them with a dendrogram. It merges similar clusters iteratively, starting with each data point as a separate cluster. This creates a tree-like structure that shows the relationships between clusters and their hierarchy.
The dendrogram from hierarchical clustering reveals the hierarchy of clusters at different levels,
highlighting natural groupings in the data. It provides a visual representation of the relationships
between clusters, helping to identify patterns and outliers, making it a useful tool for exploratory
data analysis.
There are mainly two types of hierarchical clustering:
1. Agglomerative hierarchical clustering
2. Divisive Hierarchical clustering
1. Agglomerative Hierarchical Clustering
In Agglomerative Hierarchical Clustering, Each data point is considered as a single cluster
making the total number of clusters equal to the number of data points. And then we keep
grouping the data based on the similarity metrics, making clusters as we move up in the
hierarchy. This approach is also called a bottom-up approach.
2. Divisive Hierarchical Clustering
Divisive hierarchical clustering is opposite to what agglomerative HC is. Here we start with a
single cluster consisting of all the data points. With each iteration, we separate points which are
distant from others based on distance metrics until every cluster has exactly 1 data point.
Example:
Suppose we have data related to marks scored by 4 students in Math and Science and we need to
create clusters of students to draw insights.
Step-1: Construct a Distance matrix. Distance between each point can be found using various
metrics i.e. Euclidean Distance, Manhattan Distance, etc.
We’ll use Euclidean distance for this example:
Distance Calculated Between Each Data Point
We now form a cluster of S1 and S2 because they are closest to each other.
Step-2: We take the average of the marks obtained by S1 and S2 and the values we get will
represent the marks for this cluster.
Dataset After First Clustering
Again find the closest points and create another cluster.
Clustering S3 And S4
Step-3: Repeat the steps above and keep clustering until we are left with just one cluster containing all the points; we get a result as below.
Program:
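The program listing is not reproduced in this copy of the manual. Below is a minimal sketch, assuming the same Mall_Customers.csv file used in the K-Means and DBSCAN experiments, that draws the dendrogram and then fits agglomerative (bottom-up) clustering with 5 clusters:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
# Assumed dataset: the Mall_Customers.csv used in the other clustering experiments
dataset = pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\Mall_Customers.csv")
X = dataset.iloc[:, [3, 4]].values
# Plot the dendrogram to decide the number of clusters
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()
# Fit agglomerative clustering with 5 clusters (ward linkage uses Euclidean distance)
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_hc = hc.fit_predict(X)
# Visualize the clusters on Annual Income vs Spending Score
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s=60, c='red', label='Cluster1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s=60, c='blue', label='Cluster2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s=60, c='green', label='Cluster3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s=60, c='violet', label='Cluster4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s=60, c='yellow', label='Cluster5')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()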
Output:
EXPERIMENT-10
Aim:
Apply DBSCAN clustering algorithm on any dataset.
Solution:
K-Means and Hierarchical Clustering both fail in creating clusters of
arbitrary shapes. They are not able to form clusters based on varying densities.
That’s why we need DBSCAN clustering.
Density-Based Clustering refers to unsupervised learning methods that
identify distinctive groups/clusters in the data, based on the idea that a cluster
in data space is a contiguous region of high point density, separated from other
such clusters by contiguous regions of low point density.
Density-based spatial clustering of applications with noise (DBSCAN)
DBSCAN is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes from a large amount of data that contains noise and outliers.
The most exciting feature of DBSCAN clustering is that it is robust to outliers. It also does not require the number of clusters to be specified in advance.
The DBSCAN algorithm uses two parameters:
• eps (epsilon): the radius of the neighbourhood drawn around each data point.
• minPoints: the minimum number of data points required inside that radius for the point to be considered a core point.
Here, we have some data points represented by grey color. Let’s see how
DBSCAN clusters these data points.
The above figure shows us a cluster created by DBSCAN with minPoints = 3. Here, we draw a circle of radius epsilon around every data point. These two parameters help in creating spatial clusters.
All the data points with at least 3 points in the circle including itself are considered as
Core points represented by red color.
All the data points with less than 3 but greater than 1 point in the circle
including itself are considered as Border points. They are represented by
yellow color.
Finally, data points with no point other than itself present inside the circle are considered
as Noise represented by the purple color.
Reachability in terms of density establishes a point to be reachable from
another if it lies within a particular distance (eps) from it.
Connectivity, on the other hand, involves a transitivity-based chaining approach to determine whether points are located in a particular cluster. For example, points p and q could be connected if p→r→s→t→q, where a→b means b is in the neighborhood of a.
Program:
Input-1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\Mall_Customers.csv")
x=dataset.loc[:,['Annual Income (k$)','Spending Score (1-100)']].values
Input-2
from sklearn.neighbors import NearestNeighbors
# Compute each point's distance to its nearest neighbour (used to choose epsilon)
neighb=NearestNeighbors(n_neighbors=2)
nbrs=neighb.fit(x)
distances,indices=nbrs.kneighbors(x)
Input-3
# Sort the nearest-neighbour distances; the 'elbow' of this curve suggests a good epsilon
distances=np.sort(distances,axis=0)
distances=distances[:,1]
plt.rcParams['figure.figsize']=(6,4)
plt.xlabel('Data points sorted by distance')
plt.ylabel('Epsilon')
plt.plot(distances)
plt.show()
Output:
Input-4
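The final listing (fitting DBSCAN and plotting the clusters) is not reproduced in this copy of the manual; a minimal sketch is given below. The eps value is read off the elbow of the k-distance plot above, and both eps = 8 and min_samples = 4 are assumptions, not values taken from the original.
from sklearn.cluster import DBSCAN
# Fit DBSCAN; points labelled -1 are treated as noise
dbscan = DBSCAN(eps=8, min_samples=4)   # parameter values are assumptions
labels = dbscan.fit_predict(x)
# Visualize the clusters on Annual Income vs Spending Score
plt.scatter(x[:, 0], x[:, 1], c=labels, cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('DBSCAN clusters on the Mall Customers data')
plt.show()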
Output: