Data Preprocessing
import pandas as pd :- to read and manipulate the data (DataFrame and Series operations)
5. Categorical Data:-
Encoding Categorical Data:
#Encoding the independent variable:-
> from sklearn.preprocessing import LabelEncoder,OneHotEncoder
(LabelEncoder assigns an integer label to each distinct category in a column)
> labelencoder_x = LabelEncoder()
x[:,0] = labelencoder_x.fit_transform(x[:,0])  (here it encodes the values of the first
column as 0, 1, 2, ...)
> onehotencoder = OneHotEncoder(categorical_features=[0])
> x = onehotencoder.fit_transform(x).toarray()  (creates one 0/1 dummy column per category;
the remaining numeric values may print in scientific notation)
x
>labelencoder_y=LabelEncoder() (encoding y)
y = labelencoder_y.fit_transform(y)
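Note: the categorical_features argument above comes from older scikit-learn versions and has since
been removed; a rough modern equivalent (a sketch, assuming the categorical column is column 0)
uses ColumnTransformer:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],  # one-hot encode column 0
                       remainder='passthrough')                           # keep the other columns unchanged
x = np.array(ct.fit_transform(x))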
6. Splitting the Training and Test Data:-
> from sklearn.model_selection import train_test_split
note:- (older scikit-learn versions imported this from sklearn.cross_validation, which has since
been removed; either way, train_test_split is the function that splits the whole data set into
training and testing parts)
>x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
(splits the data set into 80% for training and 20% for testing;
random_state keeps the split reproducible, otherwise a different split is produced on every run)
Try to keep the test split in the range of 20-30%: much smaller and the evaluation becomes
unreliable, much larger and too little data is left for training.
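A quick sanity check of the split (a sketch; the exact shapes depend on how many rows the data set has):
print(x_train.shape, x_test.shape)   # e.g. (80, n) and (20, n) for a 100-row data set with test_size=0.2
print(y_train.shape, y_test.shape)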
7. Feature scaling:- (puts features with very different ranges onto a comparable scale, e.g. a raw
number and the square of a difference on the same graph, so that no single feature dominates)
(Note:- StandardScaler centres each feature to mean 0 and unit variance, so most values end up
roughly between -3 and +3)
> from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train) # fit_transform -- use only on training data
x_test = sc_x.transform(x_test)
## x_train holds the independent variables (features), y_train the dependent variable
## StandardScaler is a class that standardizes every feature using the mean and standard deviation
learned from the training data
## fit() - learns the scaling parameters from the training data (it only makes the object ready,
nothing is transformed yet)
## transform() - applies those learned parameters to produce the transformed data set
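A small sketch of what fit() and transform() actually do for StandardScaler (toy numbers, kept
separate from the course data):
import numpy as np
from sklearn.preprocessing import StandardScaler
demo_train = np.array([[1.0], [2.0], [3.0], [4.0]])   # toy training column
demo_test = np.array([[2.5], [10.0]])                 # toy test column
sc = StandardScaler()
sc.fit(demo_train)               # fit(): learns mean_ (2.5 here) and scale_ (std) from the training data only
print(sc.mean_, sc.scale_)
print(sc.transform(demo_test))   # transform(): applies the training mean/std to the test data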
**k-NN algorithm:-
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
------- up to here the classifier is fit to the training data, i.e. the machine learns from the training data -------
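After fitting, the next step (not written out in these notes) is to predict on the test set and
check the accuracy, e.g. with a confusion matrix. A minimal sketch:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(x_test)       # predictions for the scaled test set
cm = confusion_matrix(y_test, y_pred)     # rows = actual class, columns = predicted class
print(cm)
print(accuracy_score(y_test, y_pred))     # fraction of correct predictions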
*STEP 8:- Visualizing the Training set results
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
*STEP 9:- Visualizing the Test set results
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
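The contourf code in steps 8 and 9 only draws the coloured decision regions; the usual template
then scatters the actual points of each class and calls plt.show(). A minimal sketch (the axis
labels assume the Age / EstimatedSalary features used in the later sections; adjust the title for
the training or test set):
colors = ('red', 'green')
for i, j in enumerate(np.unique(y_set)):
    # plot the points belonging to class j in its own colour
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=colors[i], label=j)
plt.title('k-NN (Test set)')          # or 'k-NN (Training set)' for step 8
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()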
**Decision tree:-
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values
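The notes jump straight to predict() below; the missing steps are the train/test split, feature
scaling, and fitting the classifier itself. A minimal sketch, assuming the same preprocessing as
in steps 6-7 (the entropy criterion and the 0.25 test size are assumptions, not from the notes):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)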
y_pred = classifier.predict(x_test)
training plot:-
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
test plot:-
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
**Random forest
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values
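As with the decision tree, the split, scaling, and classifier are missing before predict(); a
minimal sketch (n_estimators=10, the entropy criterion, and the 0.25 test size are assumptions):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)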
y_pred = classifier.predict(x_test)
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
Linear Regression:-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,1].values
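Again the split and the regressor itself are not written out before predict(); a minimal sketch
(the 1/3 test size is an assumption):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)
regressor = LinearRegression()
regressor.fit(x_train, y_train)   # learns the intercept and slope from the training data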
y_pred = regressor.predict(x_test)
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
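The block above plots the training set; the matching test-set plot just swaps the scatter to the
test data while keeping the same regression line (a sketch):
plt.scatter(x_test, y_test, color='red')                      # actual test points
plt.plot(x_train, regressor.predict(x_train), color='blue')   # line learned from the training data
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()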