
Dmdw-Lab Manual


IV SEMESTER

DATA MINING USING PYTHON

LAB-MANUAL

R20

AI&ML,CSE-DS & AI-DS

SWARNANDHRA
COLLEGE OF ENGINEERING & TECHNOLOGY
(AUTONOMOUS)

SEETHARAMAPURAM, NARSAPUR-534 280, W.G.DT., A.P.


INDEX

S.NO LIST OF EXPERIMENTS


1 Demonstrate the following data preprocessing tasks using python
libraries.
a) Loading the dataset
b) Identifying the dependent and independent variables
c) Dealing with missing data
2 Demonstrate the following data preprocessing tasks using python
libraries.
a) Dealing with categorical data
b) Scaling the features
c) Splitting dataset into Training and Testing Sets
3 Demonstrate the following Similarity and Dissimilarity Measures using
python
a) Pearson’s Correlation
b) Cosine Similarity
c) Jaccard Similarity
d) Euclidean Distance
e) Manhattan Distance
4 Build a model using linear regression algorithm on any dataset.
5 Build a classification model using Decision Tree algorithm on iris dataset
6 Apply Naïve Bayes Classification algorithm on any dataset
7 Generate frequent itemsets using Apriori Algorithm in python and also
generate association rules for any market basket data.
8 Apply K- Means clustering algorithm on any dataset.
9 Apply Hierarchical Clustering algorithm on any dataset.
10 Apply DBSCAN clustering algorithm on any dataset.
DATA MINING USING PYTHON LAB

EXPERIMENT-1
Aim: Demonstrate the following data preprocessing tasks using python
libraries.
a) Loading the dataset
b) Identifying the dependent and independent variables.
c) Dealing with missing data
Solution:
a) Loading the dataset
pandas.read_csv():
pandas is a widely used data manipulation library. One of its most important and
mature functions is read_csv(), which reads a .csv file into a DataFrame that we
can then inspect and manipulate.

#import libraries
import pandas as pd
# Load the dataset locally
file_path = r"C:\Users\ayyap\OneDrive\Desktop\Demo.csv"
Demo= pd.read_csv(file_path)
# Display the first few rows of the dataset
print("Original Dataset:")
print(Demo.head())

Output:
Original Dataset:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
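
The listing above depends on a file at a local Windows path. As a minimal sketch for trying the same call without that file, the CSV text can be read from memory instead. The text below is an assumption: it is rebuilt from the ten rows printed later in this experiment, with empty fields where the output shows NaN.

```python
import io
import pandas as pd

# Assumed stand-in for Demo.csv, rebuilt from the rows printed in this experiment.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,61000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts any file-like object, not only a path on disk.
demo = pd.read_csv(io.StringIO(CSV_TEXT))
print(demo.shape)   # (10, 4)
print(demo.head())
```

Empty fields are parsed as NaN, so the 'Age' and 'Salary' columns are read as floating-point columns.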

b) Identifying the dependent and independent variables.


Load the dataset using read_csv() of pandas module:

import pandas as pd
file_path=r"C:\Users\ayyap\OneDrive\Desktop\Demo.csv"
dataset= pd.read_csv(file_path)
The variables here can be classified as independent and dependent variables. The
independent variables are used to determine the dependent variable. In our dataset,
the first three columns (Country, Age, Salary) are independent variables which
will be used to determine the dependent variable (Purchased), which is the fourth
column.
Independent Variables (Features):
These are the variables used as input to predict the dependent variable. They are
typically denoted as 'X'. In a dataset, each column except the dependent variable
can be treated as an independent variable. Features can be categorical or
numerical.
Now, we need to separate the matrix of features containing the independent
variables from the dependent variable ‘Purchased’.

(i) Creating the matrix of features


The matrix of features will contain the variables ‘Country’, ‘Age’ and ‘Salary’.
The code to declare the matrix of features will be as follows:

x = dataset.iloc[:, :-1].values
print(x)

Output:
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 61000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]

In the code above, the first ‘:’ stands for the rows which we want to include, and
the next one stands for the columns we want to include. By default, if only the ‘:’
(colon) is used, it means that all the rows/columns are to be included. In case of
our dataset, we need to include all the rows (:) and all the columns but the last one
(:-1).
(ii)Creating the dependent variable vector
Dependent Variable (Target Variable):
This is the variable that you are trying to predict or explain in your model. It is
typically denoted as 'y'. In a classification problem, it is a categorical variable
(e.g., predicting whether an email is spam or not). In a regression problem, it is a
continuous variable (e.g., predicting house prices). In our dataset, the dependent
variable 'y' is categorical: 'Purchased' takes the values 'Yes' or 'No'.
We follow the same procedure to create the dependent variable vector ‘y’. The
only change is the columns we select. As in the matrix of features, we include all
the rows, but from the columns we need only the 4th, which has index 3 because
Python indexes from 0. Therefore, the code for this looks as follows:

y= dataset.iloc[:,3].values
print(y)

Output:
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
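
The positional slices used above can also be written with column names, which keeps working if the column order changes. The following is a minimal sketch on a small made-up frame with the same column layout as Demo.csv; it is an alternative, not the method used in this manual.

```python
import pandas as pd

# Small illustrative frame with the same column layout as Demo.csv.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# Select by column name instead of by position.
X = dataset.drop(columns=["Purchased"]).values  # matrix of features
y = dataset["Purchased"].values                 # dependent variable vector

# Both selections agree with the positional slices.
same_X = (X == dataset.iloc[:, :-1].values).all()
same_y = (y == dataset.iloc[:, 3].values).all()
print(X.shape, y.shape, same_X, same_y)
```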

c) Dealing with missing data


What Is a Missing Value?
Missing data is defined as values that are not stored (or not present) for some
variables in the given dataset. A common example is the Titanic dataset, in which
the columns ‘Age’ and ‘Cabin’ have some missing values.

In the raw dataset, a blank field indicates a missing value.

In pandas, missing values are usually represented by NaN, which stands for
Not a Number.

Types of Missing Values


1. Missing Completely At Random (MCAR)
In MCAR, the probability of data being missing is the same for all the
observations. In this case, there is no relationship between the missing data and
any other values observed or unobserved (the data which is not recorded) within
the given dataset. That is, missing values are completely independent of other data.
There is no pattern.
2. Missing At Random (MAR)
MAR data means that the reason for missing values can be explained by variables
on which you have complete information, as there is some relationship between
the missing data and other values/data. In this case, the data is not missing for all
the observations. It is missing only within sub-samples of the data, and there is
some pattern in the missing values.
3. Missing Not At Random (MNAR)
Missing values depend on the unobserved data. If there is some structure/pattern
in the missing data and other observed data cannot explain it, then it is considered
to be Missing Not At Random (MNAR). If the missing data does not fall under
MCAR or MAR, it can be categorized as MNAR.
Checking for Missing Values in Python:
The first step in handling missing values is to carefully look at the complete data
and find all the missing values. The following code shows the total number of
missing values in each column. It also shows the total number of missing values
in the entire data set.

import pandas as pd
file_path = r"C:\Users\ayyap\OneDrive\Desktop\Demo.csv"
Demo= pd.read_csv(file_path)
#Find the missing values from each column
print(Demo.isnull().sum())

Output:
Country 0
Age 1
Salary 1
Purchased 0
dtype: int64

IN:
#Find the total number of missing values from the entire dataset
print(Demo.isnull().sum().sum())

OUT:
2

There are 2 missing values in total.

Handling Missing Values


There are 2 primary ways of handling missing values:
1. Deleting the Missing values
2. Imputing the Missing Values
1. Deleting the Missing value
Generally, this approach is not recommended. It is a quick and dirty technique
for dealing with missing values. If the missing value is of the type Missing Not
At Random (MNAR), then it should not be deleted. If the missing value is of type
Missing At Random (MAR) or Missing Completely At Random (MCAR), then it can
be deleted. (In pairwise deletion, each analysis uses all cases that have data for
the variables involved, and the missing observations are assumed to be MCAR.)
The disadvantage of this method is that one might end up deleting some useful
data from the dataset.
There are 2 ways one can delete the missing data values:
(i) Deleting the entire row (listwise deletion):
If a row has missing values, you can drop the entire row. If every row has some
column value missing, you might end up deleting the whole dataset. The code to
drop the entire row is as follows:

df = Demo.dropna(axis=0)
df.isnull().sum()

Output:
Country 0
Age 0
Salary 0
Purchased 0
dtype: int64

(ii) Deleting the entire column


If a certain column has many missing values, then you can choose to drop the entire
column. The code below drops the ‘Purchased’ column only to illustrate the call; in
this dataset the columns with missing values are ‘Age’ and ‘Salary’:

df = Demo.drop(['Purchased'],axis=1)
df.isnull().sum()

Output:
Country 0
Age 1
Salary 1
dtype: int64

2. Imputing the Missing Value


There are many imputation methods for replacing the missing values. You can
use different Python libraries such as pandas and scikit-learn to do this.
Replacing with an arbitrary value
E.g., in the following code, we replace the missing values of the ‘Purchased’
column with ‘0’. (In Demo.csv this column has no missing values, so the call
leaves it unchanged; the same call works for any column.)

#Replace the missing value with '0' using 'fillna' method


Demo['Purchased'] = Demo['Purchased'].fillna(0)
Demo['Purchased'].isnull().sum()
Output:
0
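
Another common imputation choice for numeric columns is to replace each missing value with the mean of its column. The sketch below assumes the 'Age' and 'Salary' values printed earlier in this experiment and uses fillna() with the column means; it is one possible approach, not the only one.

```python
import pandas as pd

# 'Age' and 'Salary' values as printed earlier; None marks the missing entries.
demo = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, None, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, None,
               58000, 61000, 79000, 83000, 67000],
})

# demo.mean() skips NaN by default; fillna() then fills each column with its own mean.
imputed = demo.fillna(demo.mean())

print(imputed.isnull().sum())
print(imputed.loc[6, "Age"], imputed.loc[4, "Salary"])
```

The median (demo.median()) can be used in the same way when a column contains outliers.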

EXPERIMENT-2
Aim: Demonstrate the following data preprocessing tasks using python
libraries.
a) Dealing with categorical data.
b) Scaling the features.
c) Splitting dataset into Training and Testing Sets

Solution:

a) Dealing with categorical data.


● Categorical Data
○ Categorical data is a type of data that is used to group
information with similar characteristics.
○ Numerical data is a type of data that expresses
information in the form of numbers.
○ Example of categorical data: gender
● Encoding Categorical Data
○ Most machine learning algorithms cannot handle categorical
variables unless we convert them to numerical values
○ The performance of many algorithms also varies with
how the categorical variables are encoded
● Categorical variables can be divided into two categories:
○ Nominal: no particular order
○ Ordinal: there is some order between values

Nominal data: This type of categorical data consists of names or labels
without any numerical value or order. For example, in any
organization, the names of the different departments, such as the research
and development department, the human resource department, and the accounts
and billing department, are nominal data.
Ordinal data: This type of categorical data consists of values with a natural
order or scale. For example, the blood sugar level of each patient in a list can
be classified as low, medium, or high.

● Different encoding techniques for dealing with categorical data


○ Label (or) Ordinal Encoding
○ One-hot Encoding
(i) Label encoding

In label encoding in Python, we replace the categorical value with a
numeric value between 0 and the number of classes minus 1. If the
categorical variable contains 5 distinct classes, we use (0, 1, 2, 3,
and 4).
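
The idea can be sketched in plain Python before turning to a library. The helper label_encode below is hypothetical and written only for illustration; numbering the classes in sorted order is an assumption of this sketch.

```python
def label_encode(values):
    """Map each distinct category to an integer in 0 .. n_classes - 1."""
    classes = sorted(set(values))                    # assumed ordering: alphabetical
    mapping = {c: i for i, c in enumerate(classes)}  # category -> integer code
    return [mapping[v] for v in values], classes

codes, classes = label_encode(["France", "Spain", "Germany", "Spain", "Germany"])
print(classes)  # ['France', 'Germany', 'Spain']
print(codes)    # [0, 2, 1, 2, 1]
```

Note that the integer codes imply an order (0 < 1 < 2) that nominal categories such as countries do not actually have.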
Ex: Let us take the dataset salary.csv and load it using read_csv () function
Input:
import pandas as pd
data=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\Salary.csv")
print(data)
Output:

Now we will encode the values of the categorical attribute ‘Country’ using the
Label Encoding technique.
Input:
from sklearn import preprocessing as p
df=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\Salary.csv")
le=p.LabelEncoder()
df['Country']=le.fit_transform(df['Country'])
print(df)

Sample Output:

(ii) One hot encoding


One-Hot Encoding is another popular technique for treating categorical
variables. It simply creates additional features based on the number of
unique values in the categorical feature. Every unique value in the
category will be added as a feature.
In this encoding technique, each category is represented as a one-hot vector.
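
As a minimal sketch of what a one-hot vector looks like, the hypothetical helper one_hot below builds one 0/1 column per distinct category in plain Python. It only illustrates the idea and is not the scikit-learn implementation.

```python
def one_hot(values):
    """Return one 0/1 row per value, with one column per distinct category."""
    classes = sorted(set(values))  # column order (an assumption of this sketch)
    rows = [[1 if v == c else 0 for c in classes] for v in values]
    return rows, classes

rows, classes = one_hot(["France", "Spain", "Germany", "Spain"])
print(classes)  # ['France', 'Germany', 'Spain']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 0, 1]
# [0, 1, 0]
# [0, 0, 1]
```

Each row contains exactly one 1, in the column of its own category.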
Input:
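import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\Salary.csv")
# Integer category codes, kept alongside the one-hot columns for comparison
data['Country']=data['Country'].astype('category')
data['Country_new']=data['Country'].cat.codes
# One column per distinct country; toarray() converts the sparse result to dense
enc=OneHotEncoder()
enc_data=pd.DataFrame(enc.fit_transform(data[['Country']]).toarray())
New_df=data.join(enc_data)
print(New_df)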

Output:

b) Scaling the features


Feature scaling is a technique for bringing the values of all the
independent features of our dataset onto the same scale. Feature scaling
helps algorithms perform their calculations quickly. It is an important
stage of data preprocessing.
If we do not scale the features, a machine learning model tends to give
higher weightage to features with larger values and lower weightage to
features with smaller values. Training the model can also take longer.
Many machine learning algorithms that use Euclidean distance as a
metric to calculate similarity will fail to give reasonable
weight to a feature with a small range (for example, the number of bedrooms
in a house next to its area in square feet), which in practice can turn out
to be an important feature.
There are several ways to do feature scaling.
Types of Feature Scaling
1. Normalization
Normalization is a scaling technique in which the values are rescaled between
the range 0 to 1.

To normalize our data, we need to import MinMaxScaler from the
scikit-learn library and apply it to our dataset. After applying the
MinMaxScaler, the minimum value will be zero and the maximum value
will be one.
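
The rescaling itself is the formula x' = (x - min) / (max - min). The sketch below applies it in plain Python to made-up values; the helper min_max_scale is hypothetical and assumes the values are not all equal.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi != lo

areas = [1000, 2000, 5000]   # made-up living areas in square feet
scaled = min_max_scale(areas)
print(scaled)                # [0.0, 0.25, 1.0]
```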

2. Standardization
Standardization is another scaling technique in which the mean will be equal
to zero and the standard deviation equal to one.
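
Standardization replaces each value x with z = (x - mean) / std. The sketch below uses made-up values and the population standard deviation (statistics.pstdev), which is the convention StandardScaler also follows; the helper standardize is hypothetical.

```python
import statistics

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]    # made-up values: mean 5, std 2
z = standardize(scores)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After the transformation the values have mean 0 and standard deviation 1.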

To standardize our data, we need to import StandardScaler from the
scikit-learn library and apply it to our dataset. We'll be working with the Ames
Housing Dataset, which contains 79 features regarding houses sold in Ames.
Let's import the data and take a look at some of the features we'll be using:

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\Ameshousing.csv")
x=df[['Gr Liv Area','Overall Qual']].values
y=df['SalePrice'].values
fig,ax=plt.subplots(ncols=2,figsize=(12,4))
ax[0].scatter(x[:,0],y)
ax[1].scatter(x[:,1],y)
plt.show()
Output:

From the output, there's a clear strong positive correlation between
(a) the "Gr Liv Area" feature and the "SalePrice" feature - with only a couple
of outliers.
(b) the "Overall Qual" feature and the "SalePrice" feature.
The "Gr Liv Area" spans up to ~5000 (measured in square feet), while the
"Overall Qual" feature spans up to 10 (discrete categories of quality). If we
were to plot these two on the same axes, we wouldn't be able to tell much
about the "Overall Qual" feature:

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\Ameshousing.csv")
x=df[['Gr Liv Area','Overall Qual']].values
y=df['SalePrice'].values
fig,ax=plt.subplots(figsize=(12,4))
ax.scatter(x[:,0],y)
ax.scatter(x[:,1],y)
plt.show()
Output:

1. Standardization
The StandardScaler class is used to transform the data by standardizing it.
Let's import it and scale the data via its fit_transform() method:
Input:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
fig, ax=plt.subplots(figsize=(12,4))
scaler=StandardScaler()
x_std=scaler.fit_transform(x)
ax.scatter(x_std[:,0],y)
ax.scatter(x_std[:,1],y)
plt.show()
Output:
2.MinMaxScaler
To normalize features, we use the MinMaxScaler class. It works in much the
same way as StandardScaler, but uses a fundamentally different approach to
scaling the data: the values are rescaled to the range [0, 1].
Input:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
fig,ax=plt.subplots(figsize=(12,4))
scaler=MinMaxScaler()
x_minmax=scaler.fit_transform(x)
ax.scatter(x_minmax[:,0],y)
ax.scatter(x_minmax[:,1],y)
plt.show()
Output:

c) Splitting dataset into Training and Testing Sets


What Is the Train Test Split Procedure?
Train test split is a model validation procedure that allows you to
simulate how a model would perform on new/unseen data. Here is how
the procedure works:
1. Arrange the Data
Make sure your data is arranged into a format acceptable for train test
split. In scikit-learn, this consists of separating your full data set into
“Features” and “Target.”
2. Split the Data
Split the data set into two pieces: a training set and a testing set. This
consists of randomly sampling, without replacement, about 75 percent of the
rows (you can vary this) and putting them into your training set. The
remaining 25 percent is put into your test set. Splitting “Features” and
“Target” in this way gives four pieces: “X_train,” “X_test,” “y_train,”
and “y_test.”
3. Train the Model
Train the model on the training set (“X_train” and “y_train”).
4. Test the Model
Test the model on the testing set (“X_test” and “y_test”) and evaluate
the performance.
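
The splitting step can be sketched in plain Python as a shuffle followed by a cut. The helper simple_train_test_split below is hypothetical; it only illustrates sampling without replacement and is not the scikit-learn implementation.

```python
import random

def simple_train_test_split(rows, train_size=0.75, seed=0):
    """Shuffle the row indices, then cut them into a train part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # each row lands in exactly one part
    cut = int(round(train_size * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

rows = list(range(20))                    # 20 made-up row identifiers
train, test = simple_train_test_split(rows)
print(len(train), len(test))              # 15 5
```

Fixing the seed plays the same role as the random_state argument: the same split is produced on every run.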

Example:
Download kc_house_data.csv
Input:
import pandas as pd
from sklearn.model_selection import train_test_split
df=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\kc_house_data.csv")
columns=['bedrooms','bathrooms','sqft_living','sqft_lot','floors','price']
df=df.loc[:,columns]
print(df.head())
Output:

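import pandas as pd
from sklearn.model_selection import train_test_split
# 'price' is the target, so it must not also appear among the features
features=['bedrooms','bathrooms','sqft_living','sqft_lot','floors']
X=df.loc[:,features]
y=df.loc[:,['price']]
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0,train_size=.75)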

print(X_train.count())

print(X_test.count())
print(y_train.count())

print(y_test.count())
EXPERIMENT-3
Aim: Demonstrate the following Similarity and Dissimilarity Measures using
python
a) Euclidean Distance
b) Manhattan Distance
c) Minkowski Distance
d) Cosine Similarity
e) Jaccard Similarity
f) Pearson’s Correlation
Solution:
Similarity
• A similarity measure quantifies how alike two data objects are.
• In a data mining or machine learning context, a similarity measure is often
derived from a distance whose dimensions represent features of the objects.
• If the distance is small, the objects have a high degree of
similarity, whereas a large distance indicates a low degree of similarity.
• Similarity is subjective and is highly dependent on the domain and application.
• For example, two fruits are similar because of color or size or taste. Special
care should be taken when calculating distance across dimensions/features
that are unrelated.
Generally, similarity is measured in the range 0 to 1 [0,1]. In the machine
learning world, this score in the range of [0, 1] is called the similarity score.
Two boundary cases of similarity:
• Similarity = 1 if X = Y (where X, Y are two objects)
• Similarity = 0 if X and Y are completely different

Dissimilarity
A dissimilarity measure works opposite to a similarity measure,
i.e., it returns 1 if the objects are completely dissimilar and 0 if they are identical.
Proximity refers to either a similarity or a dissimilarity.
a) Euclidean Distance
Euclidean distance between two points
is the length of the straight line segment
between them, i.e., the shortest distance.
In other words, it is the displacement
length between two points.
Given two points, A (a, b) and B (c, d), in a 2-dimensional plane, the Euclidean
distance between A and B is given as:
d(A, B) = sqrt((a - c)^2 + (b - d)^2)

To find the distance between two points in three-dimensional space, let A (x1, y1, z1) and B
(x2, y2, z2) be two points:
d(A, B) = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)

(i) General Method


Input:
from math import *
point1=[1,3,5]
point2=[2,5,3]
sqrs=(point1[0]-point2[0])**2+(point1[1]-point2[1])**2+(point1[2]-point2[2])**2
euc_dist=sqrt(sqrs)
print("Euclidian distance between point1 and point2:",euc_dist)

Output:
Euclidian distance between point1 and point2: 3.0

(ii) Using the linalg.norm() method of NumPy

NumPy arrays offer many advantages over Python lists, including faster computation, better
memory efficiency, and a wide range of functions for mathematical operations.
Here we use the numpy.linalg.norm() function, which calculates the
norm (magnitude) of a vector. By subtracting p2 from p1, we get a vector representing the
difference between the two points. The norm() function then calculates the magnitude
of this vector, which is the Euclidean distance between the two points.
Input:
import numpy as np
p1=np.array((4,7,9))
p2=np.array((10,12,14))
euc_dist=np.linalg.norm(p1-p2)
print("Euclidian distance between point1 and point2:",euc_dist)
Output:
Euclidian distance between point1 and point2: 9.273618495495704
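
As a third option, the Python standard library (version 3.8 and later) provides math.dist(), which computes the Euclidean distance between two points of any equal dimension. A minimal sketch with the same two points as above:

```python
import math

p1 = (4, 7, 9)
p2 = (10, 12, 14)

# math.dist works for any number of coordinates, as long as both points match.
euc_dist = math.dist(p1, p2)
print("Euclidean distance between p1 and p2:", euc_dist)
```

The squared differences are 36, 25 and 25, so the result is the square root of 86.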
b) Manhattan Distance
• Manhattan distance is a metric in which the distance between two points is the
sum of the absolute differences of their Cartesian coordinates.
• Put simply, it is the sum of the absolute differences between the
x-coordinates and the y-coordinates.
• Suppose we have a Point A and a Point B: to find the Manhattan
distance between them, we sum the absolute x-axis and y-axis
differences. In other words, we measure the distance between the two points
along axes at right angles.
• In a plane with p1 at (x1, y1) and p2 at (x2, y2).
Manhattan distance = |x1–x2|+|y1–y2|

Input:
def manhat_distance(x,y):
    # Sum of the absolute coordinate differences
    return sum(abs(a-b) for a,b in zip(x,y))
print("Manhattan distance between two points:",manhat_distance([10,12,15],[8,12,20]))
Output:
Manhattan distance between two points: 7
c)Minkowski Distance
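The Minkowski distance is a generalized metric form of Euclidean distance
and Manhattan distance. For two points x = (x1, ..., xn) and y = (y1, ..., yn)
and an order p >= 1, it is defined as:
D(x, y) = (|x1 - y1|^p + |x2 - y2|^p + ... + |xn - yn|^p)^(1/p)
When p = 2, Minkowski distance is the same as the Euclidean
distance. When p = 1, Minkowski distance is the same as the
Manhattan distance.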
Input:
from decimal import Decimal
def nth_root(value,n_root):
    # value ** (1/n_root), rounded to 3 decimal places
    root_value=1/float(n_root)
    return round(Decimal(value)**Decimal(root_value),3)
def minkowski_distance(x,y,p_value):
    # p-th root of the sum of |a-b|**p over all coordinates
    return nth_root(sum(pow(abs(a-b),p_value) for a,b in zip(x,y)),p_value)
print("Minkowski Distance between two points:",minkowski_distance([0,3,4,5],[7,6,3,-1],3))

Output:
Minkowski Distance between two points: 8.373
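
The two special cases (p = 1 and p = 2) can be checked numerically. The sketch below uses a plain floating-point helper, minkowski, written only for this check, together with the same two points as in the listing above.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = [0, 3, 4, 5], [7, 6, 3, -1]

manhattan = sum(abs(a - b) for a, b in zip(x, y))  # 7 + 3 + 1 + 6 = 17
euclidean = math.dist(x, y)                        # sqrt(49 + 9 + 1 + 36)

print(math.isclose(minkowski(x, y, 1), manhattan))  # True
print(math.isclose(minkowski(x, y, 2), euclidean))  # True
```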

d)Cosine Similarity
The cosine similarity metric is the normalized dot product of two vectors.
Computing it amounts to finding the cosine of the angle between the two objects.
The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a
judgment of orientation and not magnitude. Two vectors with the same orientation
have a cosine similarity of 1, and two vectors at 90° have a similarity of 0, whereas
two diametrically opposed vectors have a similarity of -1, independent of their
magnitude. Cosine similarity is particularly used in positive
space, where the outcome is neatly bounded in [0,1].
Input:
from math import sqrt
def square_rooted(x):
    # Euclidean norm of x, rounded to 3 decimal places
    return round(sqrt(sum([a*a for a in x])),3)
def cosine_similarity(x,y):
    # Dot product divided by the product of the two norms
    numerator=sum(a*b for a,b in zip(x,y))
    denominator=square_rooted(x)*square_rooted(y)
    return round(numerator/float(denominator),3)
print("Cosine Similarity between two points:",cosine_similarity([3,45,7,2],[2,54,13,15]))
Output:
Cosine Similarity between two points: 0.972
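
The statement that cosine similarity depends on orientation and not on magnitude can be checked with three made-up vectors: one pointing the same way as v but three times longer, one at 90° to v, and one pointing the opposite way. This is a minimal sketch without the intermediate rounding used above.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

v = [1.0, 2.0]
same = cosine_similarity(v, [3.0, 6.0])    # same direction, 3 times longer
orth = cosine_similarity(v, [-2.0, 1.0])   # perpendicular to v
opp = cosine_similarity(v, [-1.0, -2.0])   # opposite direction

print(round(same, 6), round(orth, 6), round(opp, 6))  # 1.0 0.0 -1.0
```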

e)Jaccard similarity

The Jaccard similarity measures the similarity between finite sample sets and is defined as
the cardinality of the intersection of sets divided by the cardinality of the union of the sample
sets.
Suppose you want to find the Jaccard similarity between two sets A and B: it is the ratio of
the cardinality of A ∩ B to the cardinality of A ∪ B.
Input:
def jaccard_similarity(x,y):
    # |intersection| / |union| of the two collections, treated as sets
    intersection_cardinality=len(set(x) & set(y))
    union_cardinality=len(set(x) | set(y))
    return intersection_cardinality/float(union_cardinality)
print("Jaccard Similarity between two points:",jaccard_similarity([0,1,2,5,6],[0,2,3,5,7,9]))
Output:
Jaccard Similarity between two points: 0.375
f) Pearson’s Correlation
Correlation:
Variables within a dataset can be related for lots of
reasons. For example:
• One variable could cause or depend on the values of another variable.
• One variable could be lightly associated with another variable.
• Two variables could depend on a third unknown variable.
It can be useful in data analysis and modeling to better understand the relationships
between variables. The statistical relationship between two variables is referred to
as their correlation.
• Positive Correlation: both variables change in the same direction.
• Neutral Correlation: No relationship in the change of the variables.
• Negative Correlation: variables change in opposite directions.

Pearson’s Correlation:

The Pearson correlation coefficient, often referred to as Pearson’s r, is a measure
of linear correlation between two variables. It is a normalized measurement of
covariance (i.e., a value between -1 and 1 that shows how much the variables vary
together).
Pearson’s correlation coefficient is calculated as the covariance of the two
variables divided by the product of the standard deviation of each data sample. It is
the normalization of the covariance between the two variables to give an interpretable
score.
Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))
(i) Calculating Pearsons Correlation using pandas
Input & Output:
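
A minimal sketch of this step, assuming two short made-up samples. Series.corr() computes Pearson's r by default.

```python
import pandas as pd

# Made-up sample values, used only for illustration.
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

r = x.corr(y)       # method="pearson" is the default
print(round(r, 4))  # 0.7746
```

For a whole DataFrame, df.corr() returns the matrix of pairwise coefficients.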

(ii) Calculating Pearsons Correlation using numpy
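
A minimal sketch of this step, assuming two short made-up samples. numpy.corrcoef() returns the 2 x 2 correlation matrix, and the off-diagonal entry is Pearson's r.

```python
import numpy as np

# Made-up sample values, used only for illustration.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

corr_matrix = np.corrcoef(x, y)  # [[1, r], [r, 1]]
r = corr_matrix[0, 1]
print(round(float(r), 4))        # 0.7746
```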


(iii) Calculating Pearsons Correlation using scipy
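
A minimal sketch of this step, assuming two short made-up samples. scipy.stats.pearsonr() returns the coefficient together with a p-value for the hypothesis of no correlation.

```python
from scipy.stats import pearsonr

# Made-up sample values, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, p_value = pearsonr(x, y)  # the result unpacks as (coefficient, p-value)
print(round(float(r), 4))    # 0.7746
```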
EXPERIMENT-4
Aim: Build a model using linear regression algorithm on any dataset.
What is Simple Linear Regression?

In statistics, simple linear regression is a linear regression model with a single
explanatory variable. In simple linear regression, we predict scores on one variable
based on results on another. The criterion variable Y is the variable we are
predicting. The predictor variable X is the variable we use to make our
predictions. The prediction approach is known as simple regression because there is
only one predictor variable.

For two-dimensional sample points with one independent variable and one dependent
variable, the result is a linear function that predicts the values of the dependent
variable as a function of the independent variable.

The below graph explains the relation between Salary and Years of Experience

Equation : y = mx + c

This is the simple linear regression equation, where c is the constant (y-intercept)
and m is the slope; it describes the relationship between x (independent
variable) and y (dependent variable). The slope can be positive or negative
and is the change in the dependent variable for every 1 unit of change
in the independent variable.
In the notation y = β0 + β1x, β0 (the y-intercept, c) and β1 (the slope, m) are the
coefficients that the model estimates so that the predicted values are as close as
possible to the actual values.
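
For a single predictor, the least-squares slope and intercept have a closed form: m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2) and c = mean_y - m * mean_x. The sketch below applies these formulas in plain Python to made-up points that lie exactly on y = 2x + 1; the helper fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # made-up points on the line y = 2x + 1
m, c = fit_line(xs, ys)
print(m, c)             # 2.0 1.0
```

With noisy data the same formulas give the line that minimizes the sum of squared vertical distances to the points.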
Implement Simple Linear Regression in Python
In this example, we will use the salary data concerning the experience of
employees. In this dataset, we have two columns YearsExperience and Salary
Step 1: Import the required python packages
We need Pandas for data manipulation, NumPy for mathematical calculations,
and MatplotLib, and Seaborn for visualizations. Sklearn libraries are used for
machine learning operations
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Step 2: Load the dataset
Download the dataset, upload it to your notebook, and read it into
a pandas dataframe.
# Get dataset
df_sal = pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\Salary_Data.csv")
df_sal.head()
Step 3: Data analysis
Now that we have our data ready, let's analyze and understand its trend in detail.
To do that we can first describe the data below –
# Describe data
df_sal.describe()

Here, we can see that Salary ranges from 37731 to 122391, with a median of 65237. We
can also see how the data is distributed visually using Seaborn's histplot.
# Data distribution
plt.title('Salary Distribution Plot')
sns.histplot(df_sal['Salary'])
plt.show()

A histplot shows how the values are distributed by drawing a histogram.
Passing kde=True to histplot additionally overlays a smoothed density line.
Then we check the relationship between Salary and Experience –
# Relationship between Salary and Experience
plt.scatter(df_sal['YearsExperience'], df_sal['Salary'], color = 'lightcoral')
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()

The plot shows that Salary increases roughly linearly with Experience: an individual
receives more Salary as they gain Experience.
Step 4: Split the dataset into dependent/independent variables
Experience (X) is the independent variable
Salary (y) is dependent on experience
# Splitting variables
X = df_sal.iloc[:, :1] # independent
y = df_sal.iloc[:, 1:] # dependent
Step 4 (continued): Split data into Train/Test sets
Next, split your data into training (80%) and test (20%) sets
using train_test_split

# Splitting dataset into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step 5: Train the regression model


Pass the X_train and y_train data into the regressor model by regressor.fit to
train the model with our training data.

# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Step 6: Predict the result
Here comes the interesting part, when we are all set and ready to predict any
value of y (Salary) dependent on X (Experience) with the trained model
using regressor.predict
# Prediction result
y_pred_test = regressor.predict(X_test) # predicted value of y_test
y_pred_train = regressor.predict(X_train) # predicted value of y_train
Step 7: Plot the training and test results
It's time to check our predicted results by plotting graphs
Plot training set data vs predictions
First we plot the result of training sets (X_train, y_train) with X_train and
predicted value of y_train (regressor.predict(X_train))
# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral', label = 'X_train/y_train')
plt.plot(X_train, y_pred_train, color = 'firebrick', label = 'X_train/Pred(y_train)')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
# Labels are attached to each artist, so the legend entries cannot be swapped
plt.legend(title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
Plot test set data vs predictions
Secondly, we plot the result of test sets (X_test, y_test) with X_train and
predicted value of y_train (regressor.predict(X_train))

# Prediction on test set


plt.scatter(X_test, y_test, color = 'lightcoral', label = 'X_test/y_test')
plt.plot(X_train, y_pred_train, color = 'firebrick', label = 'X_train/Pred(y_train)')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
# Labels are attached to each artist, so the legend entries cannot be swapped
plt.legend(title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
We can see that, in both plots, the regression line fits both the training and the
test data. You could also plot the line using the predicted values of y_test
(regressor.predict(X_test)), but the regression line would remain the same, as it is
generated from the single linear regression equation fitted on the same training data.
Recall the linear equation y = mx + c discussed at the start of this experiment.
We can also get the c (y-intercept) and m (slope/coefficient) from the regressor model.
# Regressor coefficients and intercept
print(f'Coefficient: {regressor.coef_}')
print(f'Intercept: {regressor.intercept_}')
Output:
EXPERIMENT-5
Aim: Build a classification model using Decision Tree algorithm on iris dataset

Solution:
A decision tree is a machine learning algorithm that uses a tree-like model of
decisions and their subsequent consequences to arrive at a particular decision. It is a
Supervised Machine Learning model, where the data is continuously split according
to a certain parameter, and finally, a decision is made.
Usually, a decision tree is drawn upside down, with the root node at the top and the
leaf nodes at the bottom. A decision tree usually contains 3 types of nodes.
1. Root node: The very top node that represents the entire population or sample.
2. Decision nodes: Sub-nodes that split from the root node.
3. Leaf nodes: Nodes with no children, also known as terminal nodes.

How decision trees work


Decision trees work in a step-wise manner: at each node the data is split on a single
feature, and the same process is then repeated on each resulting subset. Decision trees
follow a tree-like structure, where the nodes of a tree are split using the features based
on defined criteria. The main criteria based on which decision trees split are:
• Gini impurity: Measures how mixed the class labels in a node are. It is 0 when all samples in the node belong to one class.
• Entropy: Measures the randomness (disorder) of the class labels in a node.
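The two criteria can be computed by hand for a list of class labels. The sketch below uses only the standard library; the helper names gini and entropy and the two sample nodes are hypothetical.

```python
import math
from collections import Counter

def class_probabilities(labels):
    # Fraction of samples belonging to each class in the node
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    # Gini impurity: 1 - sum of squared class probabilities
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    # Entropy in bits: sum of -p * log2(p) over the classes
    return sum(-p * math.log2(p) for p in class_probabilities(labels))

pure_node = ["setosa"] * 4
mixed_node = ["setosa", "setosa", "versicolor", "versicolor"]
print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

A node with a single class scores zero on both measures, while an even two-class mix gives the largest two-class values (Gini 0.5, entropy 1 bit). A split is preferred when it lowers the impurity of the child nodes.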
You can follow the steps below to create a feasible and useful decision tree:
• Gather the data.
• Import the required Python libraries and build a data frame.
• Create the model in Python (we will use decision trees).
• Use the test dataset to make a prediction and check the accuracy score of the
model.

We will be using the IRIS dataset to build a decision tree classifier. The dataset
contains information for three classes of the IRIS plant, namely IRIS Setosa, IRIS
Versicolour, and IRIS Virginica, with the following attributes: sepal length, sepal
width, petal length, and petal width.

Our aim is to predict the class of the IRIS plant based on the given attributes.

Source Code:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
data=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\iris.csv")
data.head()

Output:
First five records of ‘iris’ dataset:
Source Code:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
data=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\iris.csv")
#Extracting Attributes/Features
X=data[['SepalLength','SepalWidth','PetalLength','PetalWidth']]
#Extracting Target/Class Labels
y=data['Species']
#Import Library for splitting data
from sklearn.model_selection import train_test_split
#creating Train and test Datasets
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=50,test_size=0.25)
#Creating Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier()
clf.fit(X_train,y_train)
#Predict on the training data and report the accuracy score
y_pred_train=clf.predict(X_train)
print("Train data accuracy:",accuracy_score(y_train,y_pred_train)*100)
#Predict the response for the test dataset and report the accuracy score
y_pred=clf.predict(X_test)
print("Test data accuracy:",accuracy_score(y_test,y_pred)*100)

Output:

Train data accuracy: 100.0


Test data accuracy: 94.73684210526315
EXPERIMENT-6
Aim: Apply Naïve Bayes Classification algorithm on any dataset

Solution:
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem. Naive Bayes is not a single algorithm but a family of algorithms that all
share a common principle, i.e. every pair of features being classified is assumed to be
independent of each other, given the class. Before moving to the formula for Naive
Bayes, it is important to know about Bayes’ theorem.

Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the
probability of another event that has already occurred. Bayes’ theorem is stated
mathematically as the following equation:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where A and B are events and P(B) ≠ 0.

• Basically, we are trying to find the probability of event A, given that event B is
true. Event B is also termed the evidence.
• P(A) is the prior of A (the prior probability, i.e. the probability of the event before
the evidence is seen). The evidence is an attribute value of an unknown
instance (here, it is event B).
• P(A|B) is the posterior probability of A, i.e. the probability of the event after the
evidence is seen.

Consider a fictional dataset that describes the weather conditions for playing a game
of golf. Given the weather conditions, each tuple classifies the conditions as
fit (“Yes”) or unfit (“No”) for playing golf.
The dataset is divided into two parts, namely, the feature matrix and the response
vector.
In this dataset, the features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
The response vector contains the value of the class variable ‘Play golf’.
X = {‘Outlook’, ‘Temperature’, ‘Humidity’, ‘Windy’}
y = ‘Play golf’
E.g., consider the first row of the dataset:
X = (Rainy, Hot, High, False)
y = No

Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$$

where y is the class variable and X is a feature vector (of size n):

$$X = (x_1, x_2, \ldots, x_n)$$

For the first row, P(y|X) means the probability of “Not playing golf” given that the
weather conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and
“no wind”.
Naïve Assumption:
The fundamental Naive Bayes assumption is that each feature makes an:
• independent
• equal
contribution to the outcome.
With relation to our dataset, this concept can be understood as:
• We assume that no pair of features is dependent. For example, the
temperature being ‘Hot’ has nothing to do with the humidity, and the outlook
being ‘Rainy’ has no effect on the wind. Hence, the features are assumed
to be independent.
• Secondly, each feature is given the same weight (or importance). For
example, knowing only the temperature and humidity cannot predict the
outcome accurately. No attribute is treated as irrelevant, and all are assumed
to contribute equally to the outcome.
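Under the independence assumption, the posterior for a class is proportional to the prior multiplied by the per-feature likelihoods. The sketch below works this out by counting, using the standard library only. The six rows and the function name posterior are hypothetical; they are not the manual's golf table, and no smoothing is applied, so a feature value never seen with a class gives that class a probability of zero.

```python
from collections import Counter

# Hypothetical toy rows: (outlook, windy, play). Not the manual's golf dataset.
rows = [
    ("Sunny", False, "Yes"),
    ("Sunny", True, "No"),
    ("Rainy", True, "No"),
    ("Rainy", False, "Yes"),
    ("Overcast", False, "Yes"),
    ("Sunny", False, "No"),
]

def posterior(rows, query):
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(y)
        for i, value in enumerate(query):
            matches = sum(1 for row in rows if row[-1] == label and row[i] == value)
            score *= matches / n_label  # likelihood P(x_i | y)
        scores[label] = score
    # Dividing by the total plays the role of P(X) in Bayes' theorem
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = posterior(rows, ("Sunny", False))
print(result)
```

For the query (Sunny, not windy), the unnormalised scores are 1/2 × 1/3 × 1 for “Yes” and 1/2 × 2/3 × 1/3 for “No”, which normalise to 0.6 and 0.4.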
Source Code:
import pandas as pd
df=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\golf_df.csv")
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
Outlook_le=le.fit_transform(df.Outlook)
Temperature_le=le.fit_transform(df.Temperature)
Humidity_le=le.fit_transform(df.Humidity)
Windy_le=le.fit_transform(df.Windy)
Play_le=le.fit_transform(df.Play)
df["Outlook_le"]=Outlook_le
df["Temperature_le"]=Temperature_le
df["Humidity_le"]=Humidity_le
df["Windy_le"]=Windy_le
df["Play_le"]=Play_le
df.head(3)

df=df.drop(["Outlook","Temperature","Humidity","Windy","Play"],axis=1)
df.head(3)
Categorical Naïve Bayes:

X=df[["Outlook_le","Temperature_le","Humidity_le","Windy_le"]]
y=df["Play_le"]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42,test_size=0.4)
from sklearn.naive_bayes import CategoricalNB
cnb=CategoricalNB()
cnb.fit(X_train,y_train)
y_pred=cnb.predict(X_test)
print(y_test)
print(y_pred)
from sklearn.metrics import accuracy_score
acc=accuracy_score(y_test,y_pred)*100
print("Accuracy of Categorical Naive Bayes:",acc)

Sample Output:

Gaussian Naïve Bayes:

from sklearn.naive_bayes import GaussianNB


gnb=GaussianNB()
gnb.fit(X_train,y_train)
y_pred=gnb.predict(X_test)
print(y_test)
print(y_pred)
from sklearn.metrics import accuracy_score
acc=accuracy_score(y_test,y_pred)*100
print("Accuracy of Gaussian Naive bayes :",acc)
Sample Output:
EXPERIMENT-7
Aim: Generate frequent item sets using Apriori Algorithm in python and also
generate association rules for any market basket data.

Solution:
➢ The Apriori algorithm is a well-known Machine Learning algorithm used for
association rule learning.
➢ Association rule learning is taking a dataset and finding relationships
between items in the data. For example, if you have a dataset of grocery
store items, you could use association rule learning to find items that are
often purchased together.
➢ The Apriori algorithm is used on frequent item sets to generate association
rules and is designed to work on the databases containing transactions.
➢ The process of generating association rules is called association rule mining
or association rule learning. We can use these association rules to measure
how strongly or weakly two objects from the dataset are related.
➢ Frequent itemsets are those whose support value exceeds the user-
specified minimum support value.
The most common problems that this algorithm helps to solve are:
➢ Product recommendation
➢ Market basket recommendation

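There are three major measures used by the Apriori algorithm:
➢ Support
➢ Confidence
➢ Lift
Support
Support of item I is the ratio of the number of transactions in which item I appears
to the total number of transactions.

$$\text{Support}(I) = \frac{\text{number of transactions containing } I}{\text{total number of transactions}}$$

Confidence
Confidence measures how often items in Y appear in transactions that contain X.

$$\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

Lift
Lift describes how much more likely B is to be purchased when the customer buys A,
compared with how often B is purchased overall.

$$\text{Lift}(A \rightarrow B) = \frac{\text{Confidence}(A \rightarrow B)}{\text{Support}(B)}$$

Example: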
Let’s imagine we have a history of 3000 customers’ transactions in our database, and
we have to calculate the Support, Confidence, and Lift to figure out how likely the
customers who buy Biscuits will buy Chocolate.

Here are some numbers from our dataset:
➢ 3000 customers’ transactions
➢ 400 out of 3000 transactions contain Biscuit purchases
➢ 600 out of 3000 transactions contain Chocolate purchases
➢ 200 out of 3000 transactions contain purchases where customers bought
Biscuits and Chocolates together

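The support value for Biscuits will be:

$$\text{Support}(\text{Biscuits}) = \frac{400}{3000} \approx 13.33\%$$

The confidence value shows the probability that customers buy Chocolate if they buy
Biscuits. To calculate this value, we need to divide the number of transactions that
contain both Biscuits and Chocolates by the total number of transactions having Biscuits:

$$\text{Confidence}(\text{Biscuits} \rightarrow \text{Chocolate}) = \frac{200}{400} = 50\%$$

The Lift value shows the potential increase in the ratio of the sale of Chocolates when
you sell Biscuits. It is the confidence divided by the support of Chocolate. A lift greater
than 1 means the two items are bought together more often than would be expected if
they were independent, and the larger the value of the lift, the better:

$$\text{Lift}(\text{Biscuits} \rightarrow \text{Chocolate}) = \frac{0.5}{600/3000} = \frac{0.5}{0.2} = 2.5$$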
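The three measures for this example can be checked with a few lines of arithmetic. The sketch below uses the transaction counts given above; the variable names are illustrative, and lift is taken as confidence divided by the support of Chocolate, which is the standard definition.

```python
# Counts taken from the example above
total_transactions = 3000
biscuit_transactions = 400
chocolate_transactions = 600
both_transactions = 200

# Support: share of all transactions that contain the item(s)
support_biscuits = biscuit_transactions / total_transactions
support_chocolate = chocolate_transactions / total_transactions

# Confidence(Biscuits -> Chocolate): share of Biscuit transactions that also contain Chocolate
confidence = both_transactions / biscuit_transactions

# Lift(Biscuits -> Chocolate): confidence relative to the baseline support of Chocolate
lift = confidence / support_chocolate

print(f"Support(Biscuits) = {support_biscuits:.4f}")
print(f"Confidence = {confidence:.2f}")
print(f"Lift = {lift:.2f}")
```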

Apriori Algorithm steps
1. Start with itemsets containing just a single item (individual items).
2. Determine the support for the itemsets.
3. Keep the itemsets that meet the minimum support threshold and remove
the itemsets that do not.
4. Using the itemsets kept in Step 3, generate all the possible larger itemset
combinations.
5. Repeat Steps 2 to 4 until there are no more new itemsets.
Example:

Let’s take a look at these steps while using a sample dataset:

First, the algorithm will create a table containing each item set’s support count in
the given dataset – the Candidate set

Let’s assume that we’ve set the minimum support value to 3, meaning the algorithm
will drop all the items with a support value of less than three.
In the next step, the algorithm keeps only the itemsets whose support count is at least
the minimum support (the frequent itemsets, L1):

Next, the algorithm will generate the second candidate set (C2) with the help of the
frequent itemset (L1) from the previous calculation. The candidate set 2 (C2) will
be formed by creating the pairs of itemsets of L1. After creating new subsets, the
algorithm will again find the support count from the main transaction table of
datasets by calculating how often these pairs have occurred together in the given
dataset.
After that, the algorithm will compare C2’s support count values with the
minimum support count (3), and the itemsets with a lower support count will be
eliminated from table C2.
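The candidate and frequent itemset passes described above can be sketched in plain Python. This is a simplified illustration, not the mlxtend implementation used later: the function name apriori_counts and the five transactions are hypothetical, and the usual subset-pruning step is left out because every candidate's support is counted directly.

```python
from itertools import combinations

def apriori_counts(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]  # C1: single items
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep only the candidates that meet the minimum support count (Lk)
        kept = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(kept)
        k += 1
        # Join step: unions of kept itemsets that contain exactly k items (Ck)
        candidates = list({a | b for a, b in combinations(kept, 2) if len(a | b) == k})
    return frequent

# Hypothetical market basket data
transactions = [
    ["bread", "milk", "butter"],
    ["bread", "milk", "jam"],
    ["bread", "butter"],
    ["milk", "butter"],
    ["bread", "milk", "butter"],
]
frequent = apriori_counts(transactions, min_count=3)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

With a minimum support count of 3, "jam" is dropped in the first pass, all three pairs of the remaining items survive the second pass, and the single 3-itemset is dropped because it appears in only two transactions.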

Program to generate frequent itemsets using the Apriori algorithm in Python and
also generate association rules for any market basket data

Download the dataset Market_Basket_Optimization.csv

Step-1: Load the dataset and perform preprocessing using the
TransactionEncoder class

Code:
import pandas as pd
data_df=pd.read_csv(r'C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\product.csv'
,header=None)
data_df.tail()
#Convert each row to a list of items, dropping the NaN padding
data_list=[]
for x in data_df.values.tolist():
    data_list.append([item for item in x if str(item)!='nan'])
data_list[1]
from mlxtend.preprocessing import TransactionEncoder
te=TransactionEncoder()
te_ary=te.fit(data_list).transform(data_list)
df=pd.DataFrame(te_ary,columns=te.columns_)
df.head()
Sample Output:

Step-2: Using the Apriori algorithm, generate frequent itemsets with
min_support=0.01 (1%)

Code:
from mlxtend.frequent_patterns import apriori
frequent_itemsets=apriori(df,min_support=0.01,use_colnames=True)
print(frequent_itemsets)

Sample Output:
Step-3: Add a column ‘length’ and store the length of each frequent itemset

Code:
frequent_itemsets['length']=frequent_itemsets['itemsets'].apply(lambda x:len(x))
print(frequent_itemsets.tail())

Sample Output:

Step-4: Find the 3-itemsets (length=3) from the frequent itemsets with
support>=0.015 (1.5%)

Code:
frequent_itemsets[(frequent_itemsets['length']==3)&((frequent_itemsets['support']>=
0.015))]

Sample Output:

Step-5: Generate Association rules for the frequent item sets of step-4 with
confidence=50%

Code:
from mlxtend.frequent_patterns import association_rules
rules=association_rules(frequent_itemsets,metric='confidence',min_threshold=0.5)
rules
Sample Output:

From the above output, the rules generated with support>=1.5% and
confidence>=50% are:

{ground beef,eggs} ->{mineral water}

{ground beef,milk} ->{mineral water}


EXPERIMENT-8
Aim: Apply K- Means clustering algorithm on any dataset.

Solution:
• K-Means is an unsupervised machine learning algorithm that is used
for clustering problems.
• K-Means divides unlabelled data points into specific clusters/groups of
points. As a result, each data point belongs to only one cluster that has
similar properties.

K-Means Algorithm
The steps involved in K-Means are as follows:
1. Choose ‘K’, i.e. the number of clusters to be created.
2. Randomly assign K centroid points.
3. Assign each data point to its nearest centroid to create K clusters.
4. Re-calculate the centroids using the newly created clusters.
5. Repeat steps 3 and 4 until the centroids no longer change.
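The steps above can be sketched in plain Python as follows. This is an illustration rather than the scikit-learn implementation: the function name simple_kmeans and the six points are hypothetical, and the starting centroids are fixed instead of random so that the run is reproducible.

```python
import math

def simple_kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, clusters = simple_kmeans(points, centroids=[(1, 1), (8, 8)])
print(final_centroids)
print(clusters)
```

On this data the assignment is stable after the first pass, so the loop stops on the second iteration with one centroid at the mean of each group.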
Download the dataset ‘Mall_Customers.csv’
Code:

Step-1: Loading the libraries and dataset and display first 5 rows
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
dataset=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\Mall_Customers.csv")
dataset.head()
Output:

Step-2: Select the columns ‘Annual Income’ and ‘Spending Score’ as X and use
them for determining no.of clusters using Elbow Method

The Elbow Method


The elbow method is used in cluster analysis to help determine the optimal number
of clusters in a dataset.
It works by:
1. defining a range of K values to run K-Means clustering on
2. evaluating the Sum of Squared Errors (SSE) for the model using each of the
defined numbers of clusters.
The optimal K value is usually found at the “elbow”, where the curve starts to
flatten.
WCSS (Within-Cluster Sum of Squares) is the sum of the squared distances
between the points in a cluster and the cluster centroid.
Inertia (the inertia_ attribute of KMeans) is the sum of squared distances of samples
to their closest cluster center, i.e. the WCSS summed over all clusters.
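To make the WCSS definition concrete, the sketch below computes it by hand for a fixed grouping of points. The function name wcss and the four points are hypothetical; the clusters are supplied directly rather than found by K-Means.

```python
def wcss(clusters):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for cluster in clusters:
        # Centroid = mean of the cluster along each axis
        centroid = [sum(axis) / len(cluster) for axis in zip(*cluster)]
        for point in cluster:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total

# Hypothetical points forming two tight groups that are far apart
points = [(1, 1), (1, 3), (9, 1), (9, 3)]
one_cluster = wcss([points])
two_clusters = wcss([points[:2], points[2:]])
print(one_cluster, two_clusters)
```

Splitting the points into their two natural groups makes the WCSS drop sharply (from 68 to 4 here). Adding further clusters would reduce it only slightly, which is the bend that the elbow plot looks for.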

X=dataset.iloc[:,[3,4]].values
from sklearn.cluster import KMeans
wcss=[]
for i in range(1,11):
    kmeans=KMeans(n_clusters=i,init='k-means++',random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.xlabel("Number of Clusters")
plt.ylabel("WCSS")
plt.show()
Output:

From the above plot, it is clear that no.of clusters to be formed is 5. So choose k=5

Step-3:

Using the KMeans class of sklearn.cluster, create 5 clusters of X and predict the
cluster label of each data point
kmeans=KMeans(n_clusters=5,init='k-means++',random_state=42)
y_kmeans=kmeans.fit_predict(X)

Step-4:

Plot the 5 clusters and centroids using scatter plot.

plt.scatter(X[y_kmeans==0,0],X[y_kmeans==0,1],s=60,c="red",label="Cluster1")
plt.scatter(X[y_kmeans==1,0],X[y_kmeans==1,1],s=60,c="blue",label="Cluster2")
plt.scatter(X[y_kmeans==2,0],X[y_kmeans==2,1],s=60,c="green",label="Cluster3")
plt.scatter(X[y_kmeans==3,0],X[y_kmeans==3,1],s=60,c="violet",label="Cluster4")
plt.scatter(X[y_kmeans==4,0],X[y_kmeans==4,1],s=60,c="yellow",label="Cluster5")
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=100,c='black',label='centroids')
plt.xlabel("Annual Income(K$)")
plt.ylabel("Spending Score(1-100)")
plt.legend()
plt.show()
Output:
EXPERIMENT-9
Aim: Apply Hierarchical clustering algorithm on any dataset.

Solution:
Hierarchical clustering:
Hierarchical clustering groups similar objects into a dendrogram. It merges similar clusters
iteratively, starting with each data point as a separate cluster. This creates a tree-like structure
that shows the relationships between clusters and their hierarchy.
The dendrogram from hierarchical clustering reveals the hierarchy of clusters at different levels,
highlighting natural groupings in the data. It provides a visual representation of the relationships
between clusters, helping to identify patterns and outliers, making it a useful tool for exploratory
data analysis.
There are mainly two types of hierarchical clustering:
1. Agglomerative hierarchical clustering
2. Divisive Hierarchical clustering
1. Agglomerative Hierarchical Clustering
In Agglomerative Hierarchical Clustering, each data point is initially considered as a single
cluster, making the total number of clusters equal to the number of data points. We then keep
grouping the data based on the similarity metrics, making clusters as we move up in the
hierarchy. This approach is also called a bottom-up approach.
2. Divisive Hierarchical Clustering
Divisive hierarchical clustering is the opposite of agglomerative HC. Here we start with a
single cluster consisting of all the data points. With each iteration, we separate points which are
distant from others based on distance metrics until every cluster has exactly 1 data point.
Example:
Suppose we have data related to marks scored by 4 students in Math and Science and we need to
create clusters of students to draw insights.

Step-1: Construct a Distance matrix. Distance between each point can be found using various
metrics i.e. Euclidean Distance, Manhattan Distance, etc.
We’ll use Euclidean distance for this example:
Distance Calculated Between Each Data Point

We now form a cluster between S1 and S2 because they are the closest to each other.

Step-2: We take the average of the marks obtained by S1 and S2 and the values we get will
represent the marks for this cluster.
Dataset After First Clustering
Again find the closest points and create another cluster.
Clustering S3 And S4

Step-3: Repeat the steps above and keep on clustering until we are left with just one cluster
containing all the clusters. We get a result as below:

Dendrogram Of Our Example
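The merging procedure in this example can be sketched in plain Python. The marks below are hypothetical (the manual's table is not reproduced here), and the function name agglomerate is illustrative. Following the description above, a merged cluster is represented by the simple average of the two representatives being merged.

```python
import math

def agglomerate(points):
    # Each cluster is a pair: (member names, representative marks)
    clusters = [((name,), marks) for name, marks in points.items()]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose representatives are closest (Euclidean distance)
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: math.dist(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        (names_i, marks_i), (names_j, marks_j) = clusters[i], clusters[j]
        # Record which clusters merged and at what distance
        merges.append((names_i + names_j, math.dist(marks_i, marks_j)))
        # Represent the merged cluster by the average of the two representatives
        merged_marks = tuple((x + y) / 2 for x, y in zip(marks_i, marks_j))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names_i + names_j, merged_marks))
    return merges

# Hypothetical (Math, Science) marks for four students
marks = {"S1": (10, 10), "S2": (12, 10), "S3": (30, 30), "S4": (32, 34)}
merges = agglomerate(marks)
for members, distance in merges:
    print(members, round(distance, 2))
```

The recorded merge distances are the heights at which the branches join in a dendrogram: S1 and S2 merge first, then S3 and S4, and finally the two pairs.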

Program:

import scipy.cluster.hierarchy as sch


import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.title("Dendrogram")#title of the dendrogram
plt.xlabel("Customers")#label of the x-axis
plt.ylabel("Euclidean distance")#label of the y-axis
#Selecting Annual Income and Spending Scores by index
#(dataset is the Mall_Customers DataFrame loaded in Experiment-8)
selected_data = dataset.iloc[:, 3:5]
#Finding the optimal number of clusters using dendrogram
dendrogram = sch.dendrogram(sch.linkage(selected_data, method='ward'))
plt.axhline(y=150, color='r', linestyle='--')
plt.show()#show the dendrogram

Output:
EXPERIMENT-10
Aim:
Apply DBSCAN clustering algorithm on any dataset.

Solution:
K-Means and Hierarchical Clustering both fail in creating clusters of
arbitrary shapes. They are not able to form clusters based on varying densities.
That’s why we need DBSCAN clustering.
Density-Based Clustering refers to unsupervised learning methods that
identify distinctive groups/clusters in the data, based on the idea that a cluster
in data space is a contiguous region of high point density, separated from other
such clusters by contiguous regions of low point density.
Density-based spatial clustering of applications with noise (DBSCAN)
DBSCAN is a base algorithm for density-based clustering. It can discover
clusters of different shapes and sizes from a large amount of data that
contains noise and outliers.
The most exciting feature of DBSCAN clustering is that it is robust to outliers.
It also does not require the number of clusters to be specified in advance.
The DBSCAN algorithm uses two parameters:

minPts: The minimum number of points (a threshold) clustered
together for a region to be considered dense.
Epsilon (ε): The radius of the circle to be created around each data
point to check the density.

Let’s understand it with the help of an example.

Here, we have some data points represented by grey color. Let’s see how
DBSCAN clusters these data points.

DBSCAN creates a circle of epsilon radius around every data point
and classifies them into Core point, Border point, and Noise.
A data point is a Core point if the circle around it contains at least
‘minPoints’ number of points.
If the number of points is less than minPoints but the circle contains a Core
point, then it is classified as a Border point.
If a data point is neither a Core point nor a Border point, it is treated as Noise.
In particular, a point with no other data points within epsilon radius is always Noise.

The above figure shows us a cluster created by DBSCAN with minPoints = 3. Here, we
draw a circle of equal radius epsilon around every data point. These two parameters help
in creating spatial clusters.
All the data points with at least 3 points in the circle including itself are considered as
Core points represented by red color.
All the data points with less than 3 but greater than 1 point in the circle
including itself are considered as Border points. They are represented by
yellow color.
Finally, data points with no point other than itself present inside the circle are considered
as Noise represented by the purple color.
Reachability in terms of density establishes a point to be reachable from
another if it lies within a particular distance (eps) from it.
Connectivity, on the other hand, involves a transitivity based chaining-
approach to determine whether points are located in a particular cluster. For
example, p and q points could be connected if p->r->s->t->q, where a->b means
b is in the neighborhood of a.
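The Core, Border and Noise labels can be worked out by hand for a small set of points. The sketch below uses the standard library only; the function name classify_points and the six points are hypothetical. It applies the standard DBSCAN definitions: a Core point has at least min_pts points (itself included) within eps, a Border point is a non-core point within eps of a Core point, and everything else is Noise.

```python
import math

def classify_points(points, eps, min_pts):
    # Neighbourhood of each point within radius eps (the point itself is included)
    neighbours = {
        p: [q for q in points if math.dist(p, q) <= eps]
        for p in points
    }
    core = {p for p, near in neighbours.items() if len(near) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbours[p]):
            labels[p] = "border"  # not dense itself, but reachable from a core point
        else:
            labels[p] = "noise"
    return labels

# Hypothetical points: a small dense square, one point at its edge, one far away
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.1, min_pts=3)
for point, label in labels.items():
    print(point, label)
```

Here the four corners of the square are Core points, (2, 1) is a Border point because its only neighbour (1, 1) is a Core point, and (10, 10) is Noise.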
Program:
Input-1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset=pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\Mall_Customers.csv")
x=dataset.loc[:,['Annual Income (k$)','Spending Score (1-100)']].values

Input-2
from sklearn.neighbors import NearestNeighbors
neighb=NearestNeighbors(n_neighbors=2)
nbrs=neighb.fit(x)
distances,indices=nbrs.kneighbors(x)

Input-3
distances=np.sort(distances,axis=0)
distances=distances[:,1]
plt.rcParams['figure.figsize']=(6,4)
plt.xlabel('Data points sorted by distance')
plt.ylabel('Epsilon')
plt.plot(distances)
plt.show()

Output:
Input-4

from sklearn.cluster import DBSCAN


dbscan=DBSCAN(eps=8,min_samples=4).fit(x)
labels=dbscan.labels_
plt.scatter(x[:,0],x[:,1],c=labels,cmap="plasma")
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.show()

Output:
