
Data Preprocessing:-

1. Get Data Set

2. Import important libraries:-


	import numpy as np :- for numerical calculations and array manipulation

	import matplotlib.pyplot as plt :- for pictorial representation of results

	import pandas as pd :- to read and manipulate the data, and for series operations

3. Import dataset:- (sir's example)


	data.csv/xls

	dataset = pd.read_csv('Data.csv')

	> create matrix of all independent variables (sir's example)

		x = dataset.iloc[:, :-1].values

	> create matrix of dependent variables (sir's example)
		y = dataset.iloc[:, 3].values

4. Handling missing values


	taking care of missing data:-
	> from sklearn.preprocessing import Imputer
		(sklearn is an ML library for many jobs; Imputer is used to fill in
		missing values -- remember the capital I. In newer scikit-learn
		versions this class is SimpleImputer from sklearn.impute.)

	> imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
	  imputer = imputer.fit(x[:, 1:3])

	> x[:, 1:3] = imputer.transform(x[:, 1:3])
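	The Imputer class above comes from older scikit-learn releases; a minimal
	equivalent sketch with the current API (assuming scikit-learn >= 0.22,
	where it was replaced by SimpleImputer):

	import numpy as np
	from sklearn.impute import SimpleImputer

	# replace NaN entries in columns 1-2 with the column mean
	imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
	x[:, 1:3] = imputer.fit_transform(x[:, 1:3])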

5. Categorical Data:-
	Encoding Categorical Data:
	# Encoding the independent variable:-
		> from sklearn.preprocessing import LabelEncoder, OneHotEncoder
			(LabelEncoder gives numbers to the entities of the same category)

		> labelencoder_x = LabelEncoder()
		  x[:, 0] = labelencoder_x.fit_transform(x[:, 0])
			(here it will encode the first column's values as 0, 1, 2, ...)

		> onehotencoder = OneHotEncoder(categorical_features=[0])

		> x = onehotencoder.fit_transform(x).toarray()
			(encodes the categorical column of x as dummy columns of 0's and
			1's; the other values may display in exponential form)
		  x

		> labelencoder_y = LabelEncoder()   (encoding y)
		  y = labelencoder_y.fit_transform(y)
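	The categorical_features argument was later removed from OneHotEncoder; a
	minimal modern sketch (assuming scikit-learn >= 0.20) does the same job
	with ColumnTransformer:

	from sklearn.compose import ColumnTransformer
	from sklearn.preprocessing import OneHotEncoder

	# one-hot encode column 0, pass the remaining columns through unchanged
	ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
	                       remainder='passthrough')
	x = ct.fit_transform(x)   # may come back sparse; call .toarray() if needed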
6. Splitting Training and Test Data:-

	> from sklearn.cross_validation import train_test_split
	note:- (cross_validation is the module for splitting the whole data set
	into training and testing data; from it we call train_test_split to do
	the split. In newer scikit-learn versions this lives in
	sklearn.model_selection.)

	> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
		(splits the whole data set into 80% training data and 20% test data;
		random_state keeps the split consistent -- without it, a different
		set of values is taken every time)

	try to keep the test split in the range of 20-30%: a much smaller test set
	gives an unreliable performance estimate, and a much larger one leaves too
	little data for training.
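	A minimal self-contained sketch of the split on toy data (my own
	illustration, using the newer sklearn.model_selection import):

	import numpy as np
	from sklearn.model_selection import train_test_split

	toy_x = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
	toy_y = np.arange(10)

	xtr, xte, ytr, yte = train_test_split(toy_x, toy_y, test_size=0.2,
	                                      random_state=0)
	print(xtr.shape, xte.shape)   # (8, 2) (2, 2) -> 80% train / 20% test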

7. Feature scaling:- (used to bring features with very different ranges onto a
comparable scale -- like putting two numbers and the square of their difference
on the same graph)
	(Note:- after standardization most values fall roughly between -1 and +1,
	though they are not strictly bounded)

	> from sklearn.preprocessing import StandardScaler
	sc_x = StandardScaler()
	x_train = sc_x.fit_transform(x_train)  # fit_transform -- use only on training data
	x_test = sc_x.transform(x_test)
	## x_train holds the independent variables
	## StandardScaler is a class that scales the values using the mean and
	   standard deviation learned from the training data
	## fit() -- learns the scaling parameters from the training data (only
	   makes the machine learn); it gets the object ready
	## transform() -- applies the learned parameters to produce the transformed data set

	note:- fit_transform() simply combines fit() and transform() in one step;
	apply it to the training data only, and use plain transform() on the test
	data so the test set is scaled with the training statistics.

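	A minimal sketch of what StandardScaler actually computes, on a toy column
	(illustrative numbers of my own): each value becomes z = (x - mean) / std.

	import numpy as np
	from sklearn.preprocessing import StandardScaler

	ages = np.array([[20.0], [30.0], [40.0]])   # toy column: mean 30, std ~8.165

	sc = StandardScaler()
	print(sc.fit_transform(ages).ravel())
	# -> approximately [-1.2247  0.  1.2247], i.e. (x - 30) / 8.165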
**k-nn algorithm:-
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
------- till here the machine is fit with training data and learns from it -------

## in sklearn, neighbors is the module in which we have KNeighborsClassifier


## KNeighborsClassifier takes some values === n_neighbors is the number of
   neighbours... usually an odd number, to avoid ties
			metric == defines the type of distance method being used
			p=2 means using Euclidean distance
------ for testing and predicting ------
y_pred = classifier.predict(x_test)   ## predicts only on the x_test values given before
y_pred

**making the confusion matrix----

from sklearn.metrics import confusion_matrix		## confusion_matrix is a function


cm = confusion_matrix(y_test, y_pred)
cm
gives out a 2x2 confusion matrix in [[TN, FP], [FN, TP]] format
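	A minimal sketch of reading that matrix in the binary case (my own
	addition): accuracy is the trace of the matrix over the total count.

	# cm = [[TN, FP],
	#       [FN, TP]]
	tn, fp, fn, tp = cm.ravel()
	accuracy = (tn + tp) / (tn + fp + fn + tp)
	print(accuracy)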

*STEP 8:- Visualizing the Training set results

from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

*STEP 9:- Visualizing the Test set results

from matplotlib.colors import ListedColormap

x_set, y_set = x_test, y_test

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
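One thing the notes leave implicit is how to choose n_neighbors; a minimal
sketch of my own (not from the notes) compares test accuracy across a few odd
k values, reusing the scaled x_train/x_test from above:

from sklearn.neighbors import KNeighborsClassifier

# try a few odd k values and report test-set accuracy for each
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    knn.fit(x_train, y_train)
    print(k, knn.score(x_test, y_test))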

**decision tree:--

dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler


sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.tree import DecisionTreeClassifier


classifier = DecisionTreeClassifier(criterion = 'entropy',random_state=0)
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test,y_pred)
cm

training plot:-
from matplotlib.colors import ListedColormap
x_set,y_set=x_train,y_train

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Decision Tree (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Test plot :--


from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Decision Tree (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
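To inspect the fitted tree itself (not covered in the notes), newer
scikit-learn versions (>= 0.21) provide a plotting helper; a minimal sketch,
assuming the two features are Age and EstimatedSalary:

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
tree.plot_tree(classifier, feature_names=['Age', 'EstimatedSalary'],
               class_names=['0', '1'], filled=True)
plt.show()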
***Naive Bayes
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler


sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.naive_bayes import GaussianNB


classifier = GaussianNB()
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test,y_pred)
cm

training plot:-
from matplotlib.colors import ListedColormap

x_set,y_set=x_train,y_train

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Test plot :--
from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Naive Bayes (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
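Because GaussianNB is probabilistic, it can report per-class probabilities as
well as hard labels; a minimal sketch of my own addition:

# per-class probabilities for the first five test samples
proba = classifier.predict_proba(x_test[:5])
print(proba)   # each row sums to 1: [P(class 0), P(class 1)]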

**Random forest
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler


sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.ensemble import RandomForestClassifier


classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test,y_pred)
cm
training plot:-
from matplotlib.colors import ListedColormap

x_set,y_set=x_train,y_train

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Random Forest (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Test plot :--

from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1, x2 = np.meshgrid(
    np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
    np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))

plt.contourf(x1, x2,
             classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Random Forest (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
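Random forests also expose how much each feature contributed to the splits; a
minimal sketch of my own, assuming the two features are Age and
EstimatedSalary:

# importance scores sum to 1 across the features
for name, score in zip(['Age', 'EstimatedSalary'], classifier.feature_importances_):
    print(name, score)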

Linear Regression:-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset=pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,1].values

from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(x_train,y_train)

y_pred = regressor.predict(x_test)

plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
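The notes stop at the training-set plot; a matching test-set plot (keeping the
regression line fitted on the training data, which is the usual practice)
would be a minimal sketch like:

plt.scatter(x_test, y_test, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')   # same fitted line
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()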
