
Data Preprocessing:-

1. Get Data Set

2. Import important libraries:-


	import numpy as np :- for numerical calculations and array manipulation

	import matplotlib.pyplot as plt :- for pictorial representation of results

	import pandas as pd :- to read and manipulate the data, and for series operations

3. Import dataset:- (sir's example)


	data.csv/xls

	dataset = pd.read_csv('Data.csv')

	> create matrix of all independent variables (sir's example)

		x = dataset.iloc[:, :-1].values

	> create matrix of dependent variables (sir's example)
		y = dataset.iloc[:, 3].values

4. Handling missing values


	taking care of missing data:-
	> from sklearn.preprocessing import Imputer
		(sklearn is an ML library for many jobs; Imputer is used to fill in
		missing values -- remember the capital I. In newer scikit-learn
		versions this class is SimpleImputer from sklearn.impute.)

	> imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
	  imputer = imputer.fit(x[:, 1:3])

	> x[:, 1:3] = imputer.transform(x[:, 1:3])
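	The Imputer class above comes from older scikit-learn releases; a minimal
	equivalent sketch with the current API (assuming scikit-learn >= 0.22,
	where it was replaced by SimpleImputer):

	import numpy as np
	from sklearn.impute import SimpleImputer

	# replace NaN entries in columns 1-2 with the column mean
	imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
	x[:, 1:3] = imputer.fit_transform(x[:, 1:3])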

5. Categorical Data:-
	Encoding Categorical Data:
	# Encoding the independent variable:-
		> from sklearn.preprocessing import LabelEncoder, OneHotEncoder
			(LabelEncoder gives numbers to the entities of the same category)

		> labelencoder_x = LabelEncoder()
		  x[:, 0] = labelencoder_x.fit_transform(x[:, 0])
			(here it will encode the first column's values as 0, 1, 2, ...)

		> onehotencoder = OneHotEncoder(categorical_features=[0])

		> x = onehotencoder.fit_transform(x).toarray()
			(encodes the categorical column of x as dummy columns of 0's and
			1's; the other values may display in exponential form)
		  x

		> labelencoder_y = LabelEncoder()   (encoding y)
		  y = labelencoder_y.fit_transform(y)
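	The categorical_features argument was later removed from OneHotEncoder; a
	minimal modern sketch (assuming scikit-learn >= 0.20) does the same job
	with ColumnTransformer:

	from sklearn.compose import ColumnTransformer
	from sklearn.preprocessing import OneHotEncoder

	# one-hot encode column 0, pass the remaining columns through unchanged
	ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
	                       remainder='passthrough')
	x = ct.fit_transform(x)   # may come back sparse; call .toarray() if needed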
6. Splitting Training and Test Data:-

	> from sklearn.cross_validation import train_test_split
	note:- (cross_validation is the module for splitting the whole data set
	into training and testing data; from it we call train_test_split to do
	the split. In newer scikit-learn versions this lives in
	sklearn.model_selection.)

	> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
		(splits the whole data set into 80% training data and 20% test data;
		random_state keeps the split consistent -- without it, a different
		set of values is taken every time)

	try to keep the test split in the range of 20-30%: a much smaller test set
	gives an unreliable performance estimate, and a much larger one leaves too
	little data for training.
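	A minimal self-contained sketch of the split on toy data (my own
	illustration, using the newer sklearn.model_selection import):

	import numpy as np
	from sklearn.model_selection import train_test_split

	toy_x = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
	toy_y = np.arange(10)

	xtr, xte, ytr, yte = train_test_split(toy_x, toy_y, test_size=0.2,
	                                      random_state=0)
	print(xtr.shape, xte.shape)   # (8, 2) (2, 2) -> 80% train / 20% test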

7. Feature scaling:- (used to bring features with very different ranges onto a
comparable scale -- like putting two numbers and the square of their difference
on the same graph)
	(Note:- after standardization most values fall roughly between -1 and +1,
	though they are not strictly bounded)

	> from sklearn.preprocessing import StandardScaler
	sc_x = StandardScaler()
	x_train = sc_x.fit_transform(x_train)  # fit_transform -- use only on training data
	x_test = sc_x.transform(x_test)
	## x_train holds the independent variables
	## StandardScaler is a class that scales the values using the mean and
	   standard deviation learned from the training data
	## fit() -- learns the scaling parameters from the training data (only
	   makes the machine learn); it gets the object ready
	## transform() -- applies the learned parameters to produce the transformed data set

	note:- fit_transform() simply combines fit() and transform() in one step;
	apply it to the training data only, and use plain transform() on the test
	data so the test set is scaled with the training statistics.

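	A minimal sketch of what StandardScaler actually computes, on a toy column
	(illustrative numbers of my own): each value becomes z = (x - mean) / std.

	import numpy as np
	from sklearn.preprocessing import StandardScaler

	ages = np.array([[20.0], [30.0], [40.0]])   # toy column: mean 30, std ~8.165

	sc = StandardScaler()
	print(sc.fit_transform(ages).ravel())
	# -> approximately [-1.2247  0.  1.2247], i.e. (x - 30) / 8.165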
**k-nn algorithm:-
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
------- till here the machine is fit with training data and learns from it -------

## in sklearn, neighbors is the module in which we have KNeighborsClassifier


## KNeighborsClassifier takes some values === n_neighbors is the number of
   neighbours... usually an odd number, to avoid ties
			metric == defines the type of distance method being used
			p=2 means using Euclidean distance
------ for testing and predicting ------
y_pred = classifier.predict(x_test)   ## predicts only on the x_test values given before
y_pred

**making the confusion matrix----

from sklearn.metrics import confusion_matrix		## confusion_matrix is a function


cm = confusion_matrix(y_test, y_pred)
cm
gives out a 2x2 confusion matrix in [[TN, FP], [FN, TP]] format
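	A minimal sketch of reading that matrix in the binary case (my own
	addition): accuracy is the trace of the matrix over the total count.

	# cm = [[TN, FP],
	#       [FN, TP]]
	tn, fp, fn, tp = cm.ravel()
	accuracy = (tn + tp) / (tn + fp + fn + tp)
	print(accuracy)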

*STEP 8:- Visualizing the Training set results

from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

*STEP 9:- Visualizing the Test set results

from matplotlib.colors import ListedColormap

x_set, y_set = x_test, y_test

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
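One thing the notes leave implicit is how to choose n_neighbors; a minimal
sketch of my own (not from the notes) compares test accuracy across a few odd
k values, reusing the scaled x_train/x_test from above:

from sklearn.neighbors import KNeighborsClassifier

# try a few odd k values and report test-set accuracy for each
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    knn.fit(x_train, y_train)
    print(k, knn.score(x_test, y_test))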

**decision tree:--

dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler


sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.tree import DecisionTreeClassifier


classifier = DecisionTreeClassifier(criterion = 'entropy',random_state=0)
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test,y_pred)
cm

training plot:-
from matplotlib.colors import ListedColormap
x_set,y_set=x_train,y_train

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Decision Tree (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Test plot :--


from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Decision Tree (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
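To inspect the fitted tree itself (not covered in the notes), newer
scikit-learn versions (>= 0.21) provide a plotting helper; a minimal sketch,
assuming the two features are Age and EstimatedSalary:

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
tree.plot_tree(classifier, feature_names=['Age', 'EstimatedSalary'],
               class_names=['0', '1'], filled=True)
plt.show()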
***Naive Bayes
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler


sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.naive_bayes import GaussianNB


classifier = GaussianNB()
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test,y_pred)
cm

training plot:-
from matplotlib.colors import ListedColormap

x_set,y_set=x_train,y_train

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Test plot :--
from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Naive Bayes (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
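Because GaussianNB is probabilistic, it can report per-class probabilities as
well as hard labels; a minimal sketch of my own addition:

# per-class probabilities for the first five test samples
proba = classifier.predict_proba(x_test[:5])
print(proba)   # each row sums to 1: [P(class 0), P(class 1)]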

**Random forest
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler


sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.ensemble import RandomForestClassifier


classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test,y_pred)
cm
training plot:-
from matplotlib.colors import ListedColormap

x_set,y_set=x_train,y_train

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Random Forest (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Test plot :--

from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Random Forest (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
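Random forests also expose how much each feature contributed to the splits; a
minimal sketch of my own, assuming the two features are Age and
EstimatedSalary:

# importance scores sum to 1 across the features
for name, score in zip(['Age', 'EstimatedSalary'], classifier.feature_importances_):
    print(name, score)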

Linear Regression:-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset=pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,1].values

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(x_train,y_train)

y_pred = regressor.predict(x_test)

plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
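The notes stop at the training-set plot; a matching test-set plot (keeping the
regression line fitted on the training data, which is the usual practice)
would be a minimal sketch like:

plt.scatter(x_test, y_test, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')   # same fitted line
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()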
