Handling The Dataset Using R
Before building the model, before forecasting, and before publishing, we have to check that the data is complete. The 4 key steps in all supervised Machine Learning projects are:
Data Pre-Processing - (includes cleaning the data)
Data Manipulation – (summarizing, finding the significant variables, etc.)
Data Exploratory Analysis – (relation between DV and IDV)
Data Visualization – (building visualizations, plots, etc.)
To find and treat the missing values in the dataset:
colSums(is.na(dataset_name))   # counts the missing values in each column
Identify whether the variable is a number or a character. If it is a number, we use the MEAN or MEDIAN.
If there is an outlier, we use the MEDIAN; otherwise we use the MEAN.
If it is a character, we use the MODE.
If more than 25% of a variable's data is missing, we remove the entire variable; otherwise we impute it with the MEAN or MEDIAN.
Feature Scaling: we use feature scaling when there are outliers, or when one variable influences (dominates) another because of its larger scale.
We use NORMALIZATION:
(actual value - min) / (max - min)
or STANDARDIZATION (z-score):
(actual value - mean) / (standard deviation)
Normalization vs Standardization:
Normalization gives only positive (+) values; Standardization gives both positive (+) and negative (-) values.
Normalization is used "row-wise" (per observation); Standardization is used "column-wise" (per variable).
Normalization is used when there is NO OUTLIER; Standardization is used when there is an OUTLIER.
Encoding Concept:
The encoding concept is converting the characters to numbers, or vice versa.
caTools
To fix the random number (seed) while splitting the data into the training and test datasets.
How to write it:
install.packages("caTools")
library(caTools)
set.seed(456)
split <- sample.split(dataset_name$DV, SplitRatio = 0.75 )
split
table(split)
Note: on what basis do we decide to split the data into 75% or 80%? Normally, as per industry standards, we take 75-80% of the data as the training dataset for building the model, and we park the remaining 25% or 20% as the test data for evaluating the predictions of the model we are going to build. This is done to escape from over/under-fitting the data.
REGRESSION:
If it is a regression model, we use lm (Linear Model) as the model-building algorithm.
We use the OLS (Ordinary Least Squares) method for linear models to get the "best fit line".
How to write it:
Packages to be installed:
o rpart
o rattle
SUMMARY
summary() gives the descriptive statistics values.
summary() reports the missing values (NA counts).
summary() is like a boxplot in text form, for finding the outliers.
summary() works on both numeric and character variables.
Hence summary() is very important in the whole project; for a fitted model it tells us which variables are significant and which are non-significant.
DECISION TREE:
Dec_tree_cbind <- cbind(test$dv, prediction_threshold)
Dec_tree_cbind
RANDOM FOREST:
rand_pred_bin <- ifelse(rand_pred>=0.5,1,0)
rand_pred_bin
SVM:
Confusion matrix:
                   Predicted 0 (No)         Predicted 1 (Yes)
Actual 0 (No)      TN (True Negative)       FP (False Positive)
Actual 1 (Yes)     FN (False Negative)      TP (True Positive)
Packages to be installed:
install.packages("rpart")
library(rpart)
install.packages("ROCR")
library(ROCR)
How to write it:
ROCprediction <- prediction(pred_thres, test$Dv)   # prediction(predictions, labels)
ROCprediction
ROCperformance <- performance(ROCprediction, 'tpr','fpr')
ROCperformance
plot(ROCperformance, col="color", print.cutoffs.at=seq(0.1, 1, by=0.1))
abline(a=0,b=1)
With these measures we can handle any supervised Machine Learning dataset easily using R.
BASICS OF PYTHON:
Types of Variables
o NUMBER – integer, float, and complex number
o STRING - character
o BOOLEAN / LOGICAL VALUES – yes/no, True/False
LOOPS IN PYTHON:
We use LOOPS in Python where doing the work manually would not help.
o ZIP LOOP
o WHILE LOOP
o FOR LOOP.
ZIP LOOP:
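The zip loop is a for loop driven by Python's built-in zip() function, which pairs up elements from two (or more) sequences. A minimal sketch (the example lists of names and marks are illustrative):
names = ["Anil", "Bala", "Chitra"]      # illustrative data
marks = [85, 90, 78]
for name, mark in zip(names, marks):    # zip() pairs each name with its mark
    print(name, "scored", mark)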
WHILE LOOP:
The while loop in Python is used to iterate over a block of code
as long as the test expression (condition) is true.
We generally use this loop when we don't know the number of
times to iterate beforehand.
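A minimal sketch of a while loop (the counter and the limit are illustrative):
i = 1
while i <= 5:          # the loop runs as long as this condition is True
    print(i)
    i = i + 1          # without this increment the loop would never stop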
FOR LOOP:
For loop is to iterate over a sequence of elements using the different variations
of the loop.
The for loop in Python is used to iterate over a sequence (list, tuple, string) or
other iterable objects. Iterating over a sequence is called traversal.
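A minimal sketch of a for loop traversing a list (the list of fruits is illustrative):
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:               # iterates over the sequence one element at a time
    print(fruit)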
GATES IN PYTHON:
Logic gates are used to create a circuit that performs calculations, data storage, or shows off object-oriented programming, especially the power of inheritance.
A logic gate is an elementary building block of any digital circuit. It takes one or two inputs and produces an output based on those inputs. Outputs may be high (1) or low (0).
There are seven basic logic gates; in Python these are:
AND GATE
OR GATE
NOT GATE
NAND GATE
NOR GATE
XOR GATE
XNOR GATE
AND GATE:
The AND gate is a basic digital logic gate that implements logical conjunction; it behaves according to its truth table.
A HIGH output (1) results only if all the inputs to the AND gate are HIGH (1).
The AND gate gives an output of 1 if both inputs are 1; it gives 0 otherwise.
In simple terms:
The AND gate provides an output of 0 if either of the inputs is 0. This operation corresponds to multiplication of binary numbers.
We can see in the truth table that whenever either of the two inputs is 0, the output is 0 too.
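How to write the AND gate in Python – a minimal sketch (the function name AND and the print statements are illustrative, written to reproduce the output below):
def AND(a, b):
    # returns 1 only when both inputs are 1, otherwise 0
    if a == 1 and b == 1:
        return 1
    return 0

print("Output of 0 AND 0 is", AND(0, 0))
print("Output of 0 AND 1 is", AND(0, 1))
print("Output of 1 AND 0 is", AND(1, 0))
print("Output of 1 AND 1 is", AND(1, 1))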
Output:
Output of 0 AND 0 is 0
Output of 0 AND 1 is 0
Output of 1 AND 0 is 0
Output of 1 AND 1 is 1
OR GATE:
The OR gate is a digital logic gate that implements logical disjunction; it behaves according to its truth table.
A HIGH output (1) results if one or both of the inputs to the gate are HIGH (1). If neither input is high, a LOW output (0) results.
The OR gate gives an output of 1 if either of the two inputs is 1; it gives 0 otherwise.
In simple terms:
OR gate provides the output as 1 if either of the inputs is 1. It is similar to
an “addition” operation, with respect to binary numbers.
How to write OR Gate in python:
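A minimal sketch (the function name OR and the print statements are illustrative, written to reproduce the output below):
def OR(a, b):
    # returns 1 if at least one input is 1, otherwise 0
    if a == 1 or b == 1:
        return 1
    return 0

print("Output of 0 OR 0 is", OR(0, 0))
print("Output of 0 OR 1 is", OR(0, 1))
print("Output of 1 OR 0 is", OR(1, 0))
print("Output of 1 OR 1 is", OR(1, 1))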
output:
Output of 0 OR 0 is 0
Output of 0 OR 1 is 1
Output of 1 OR 0 is 1
Output of 1 OR 1 is 1
NAND Gate:
The NAND gate (negated AND) gives an output of 0 if both inputs are 1, it
gives 1 otherwise.
In simple terms:
The NAND gate is the negation of the AND gate: it gives an output of 0 only when both inputs are 1, and 1 in every other case.
output:
Output of 0 NAND 0 is 1
Output of 0 NAND 1 is 1
Output of 1 NAND 0 is 1
Output of 1 NAND 1 is 0
NOR Gate:
The NOR gate (negated OR) gives an output of 1 if both inputs are 0; it gives 0 otherwise.
In simple terms:
The NOR gate is the negation of the OR gate: it gives an output of 1 only when both inputs are 0.
OUTPUT:
Output of 0 NOR 0 is 1
Output of 0 NOR 1 is 0
Output of 1 NOR 0 is 0
Output of 1 NOR 1 is 0
NOT Gate:
It acts as an inverter. It takes only one input. If the input is given as 1, it will
invert the result as 0 and vice-versa.
In simple terms:
This gate provides the negation of the input given. This gate supports only a
single input.
Output:
Output of NOT 0 is 1
Output of NOT 1 is 0
Note: The NOT function provides correct results for bit values 0 and 1.
There are two special types of logic gates, XOR and XNOR, that focus on the number
of inputs of 0 or 1, rather than individual values.
XOR Gate:
The XOR gate gives an output of 1 if the two inputs are different; it gives 0 if they are the same.
In simple terms:
An acronym for Exclusive-OR, XOR gate provides an output of 1 when the
number of 1s in the input is odd.
OUTPUT:
Output of 0 XOR 0 is 0
Output of 0 XOR 1 is 1
Output of 1 XOR 0 is 1
Output of 1 XOR 1 is 0
XNOR GATE:
The XNOR gate (negated XOR) gives an output of 1 if both inputs are the same, and 0 if they are different.
It is formed as a result of the combination of XOR and NOT gates.
Opposite to XOR, it provides an output of 1, when the number of 1s in
the input is even.
How to write XNOR Gate in Python:
The XNOR () function can be implemented by using the XOR () and NOT () functions in
Python.
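A minimal sketch along those lines (the XOR and NOT helpers are defined here so the example is self-contained; the names are illustrative):
def NOT(a):
    return 1 - a                  # inverts the single input bit

def XOR(a, b):
    if a != b:                    # 1 when the inputs differ
        return 1
    return 0

def XNOR(a, b):
    return NOT(XOR(a, b))         # XNOR is the negation of XOR

print("Output of 0 XNOR 0 is", XNOR(0, 0))
print("Output of 0 XNOR 1 is", XNOR(0, 1))
print("Output of 1 XNOR 0 is", XNOR(1, 0))
print("Output of 1 XNOR 1 is", XNOR(1, 1))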
OUTPUT:
Output of 0 XNOR 0 is 1
Output of 0 XNOR 1 is 0
Output of 1 XNOR 0 is 0
Output of 1 XNOR 1 is 1
LIST:
A list is written with square brackets [ ].
It is mutable.
Python offers a range of compound data types often referred to as sequences.
The list is one of the most frequently used and most versatile data types in Python.
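A minimal sketch showing list creation, indexing, and mutability (the values are illustrative):
my_list = [10, "hello", 3.5]     # a list can mix data types
print(my_list[0])                # indexing starts at 0 -> 10
my_list[1] = "world"             # lists are mutable: items can be changed in place
my_list.append(99)               # and new items can be added
print(my_list)                   # [10, 'world', 3.5, 99]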
DICTIONARY:
Python dictionary is an unordered collection of items. Each item of a dictionary
has a key/value pair.
Dictionaries are optimized to retrieve values when the key is known.
A dictionary is written with curly braces { }.
It is mutable (changeable).
An item has a KEY and a corresponding VALUE that is expressed as a pair
(key: value).
# the dictionary used in this example (recovered from the output below)
my_dict = {'name': 'Jack', 'age': 26}
# Output: Jack
print(my_dict['name'])
# Output: 26
print(my_dict.get('age'))
# Output: None (get() returns None for a missing key)
print(my_dict.get('address'))
# KeyError ([] raises an error for a missing key)
print(my_dict['address'])
Output
Jack
26
None
Traceback (most recent call last):
File "<string>", line 15, in <module>
print(my_dict['address'])
KeyError: 'address'
# from sequence having each item as a pair
my_dict = dict([(1,'apple'), (2,'ball')])
SET:
A set is mutable.
A set is written with curly brackets { }.
A set keeps only the unique values; we can add or remove items in it.
A set is unordered (when printed, small integers may appear in ascending order, but the order is not guaranteed).
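A minimal sketch that produces the output below (since sets are unordered, the second line may print in a different order):
my_set = {1, 2, 3}                   # a set of integers; duplicates would be dropped
print(my_set)

my_set = {1.0, "Hello", (1, 2, 3)}   # a set of mixed (hashable) data types
print(my_set)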
Output
{1, 2, 3}
{1.0, (1, 2, 3), 'Hello'}
TUPLE:
Tuples are used to store multiple items in a single variable.
Tuple is one of 4 built-in data types in Python used to store collections of data,
the other 3 are List, Set, and Dictionary, all with different qualities and usage.
A tuple is a collection which is ordered and unchangeable.
A tuple is exactly the same as a list; the only difference is that it is immutable.
Tuples are widely used in Python.
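A minimal sketch showing that a tuple behaves like a list except that it cannot be changed (the values are illustrative):
my_tuple = ("apple", "banana", "cherry")
print(my_tuple[0])               # indexing works exactly as with a list -> apple
# my_tuple[0] = "mango"          # uncommenting this raises TypeError: tuples are immutable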
NUMPY:
NumPy is mainly for ARRAYS and for mathematical calculations/operations.
It converts all the data into numbers.
It is like the psych package in R, which gives us stats values.
PANDAS:
Pandas is for Data Manipulation.
It gives Series values.
It gives Index values.
It is used to import and export data.
It is called the "Grammar of Manipulation".
It is sometimes used for visualization (its plotting is built on top of Matplotlib).
MATPLOTLIB:
Mainly used for visualization.
SEABORN:
Mainly used for visualization.
It is used for Statistics.
It is used for EDA.
It is built on top of Matplotlib (a more advanced interface to it).
SKLEARN:
It is mainly for Machine Learning along with data pre-processing.
sklearn is short for scikit-learn, which is built on top of the SciPy stack.
TENSORFLOW 2:
Mainly used for Deep Learning.
KERAS:
Mainly used for Deep Learning.
It is a high-level API ("Application Programming Interface"), which means many lines of code can be replaced by a single line of code with the help of Keras.
Now, how to give the path: go to the folder/file where the dataset is, right click – Properties – Security – select the whole path – copy the path – paste it (escaping the backslashes, i.e. "\\") and press Alt + Enter.
Checking whether the working directory is working fine or not:
os.getcwd()   alt + enter
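A minimal sketch of setting and then checking the working directory (the folder path shown is a hypothetical example; replace it with the path copied above):
import os
os.chdir(r"C:\Users\your_name\Datasets")   # hypothetical path; the r"..." prefix avoids escaping the backslashes
print(os.getcwd())                         # confirm the working directory has changed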
Importing the important packages before starting on any dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Importing the dataset into the workspace (i.e. in the Jupyter notebook):
dataset = pd.read_csv("file name.csv")
Checking the head/ info of dataset:
dataset.head()
dataset.info()
Data Pre-Processing:
Checking whether there is any missing data in the dataset:
pd.DataFrame(dataset).isnull().any()   # or .sum()
Checking the dataset length/columns/data types:
print(len(dataset))
print(len(dataset.columns))
print(dataset.dtypes)
Checking/ dividing dependent and independent variables/correlations:
how to write it:
independent_vars = dataset.iloc[:,starting_num:ending_num].columns.tolist()
independent_vars
dependent_var:
DV = dataset.iloc[:,numofdv].name
DV
Checking the correlation of the DV with the other variables, to see how many variables are strongly correlated with the DV:
How to write it:
Package to be installed to see the correlation:
from scipy.stats import pearsonr
correlations = {}
for i in independent_vars:                 # loop over the independent variables defined above
    data = dataset[[i, DV]]
    x1 = data[i].values
    x2 = data[DV].values
    key = i + "Vs" + DV
    correlations[key] = pearsonr(x1, x2)[0]
correlations                               # alt + enter
data_correlations = pd.DataFrame(correlations, index=['Value']).T
data_correlations.loc[data_correlations['Value'].abs().sort_values(ascending=False).index]   # alt + enter
To check the heat-map:
plt.figure(figsize=(30,8))
sns.heatmap(dataset.corr(), cmap='rainbow', annot=True)
plt.show()
pd.DataFrame(dataset).isnull().any()   # or .sum()
Let’s see the example that we have seen in the class from titanic dataset:
To check whether the variable with missing values is skewed or not:
X.hist('age')
Treating the missing value with median in dataset:
X['age'] = X['age'].fillna(X.age.median()) # because the hist is skewed
print (X.age.isnull().sum())
Treating the missing value with mode in dataset:
print(X.Embarked.mode()[0])
X['Embarked'] = X['Embarked'].fillna(X.Embarked.mode()[0])
print (X.Embarked.isnull().sum())
Feature Scaling: we use feature scaling when there are outliers, or when one variable influences (dominates) another because of its larger scale.
We use NORMALIZATION:
(actual value - min) / (max - min)
or STANDARDIZATION (z-score):
(actual value - mean) / (standard deviation)
Normalization vs Standardization:
Normalization gives only positive (+) values; Standardization gives both positive (+) and negative (-) values.
Normalization is used "row-wise" (per observation); Standardization is used "column-wise" (per variable).
Normalization is used when there is NO OUTLIER; Standardization is used when there is an OUTLIER.
Normalization formula = (actual value - min) / (max - min); Standardization formula = (actual value - mean) / (standard deviation).
Package to be installed:
from sklearn.preprocessing import StandardScaler
In Python how we write it:
scaler = StandardScaler()
display(X[:5])
X['IDV_name'] = scaler.fit_transform(X[['IDV_name']])   # replace IDV_name with the column to be scaled
display(X[:5])
Encoding Concept:
The encoding concept is converting the characters to numbers, or vice versa.
In Python we have 3 (three) different types of encoding concepts available: Label Encoding, One-Hot Encoding, and Dummy Variables (pd.get_dummies).
Packages to be installed:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
How to write it (Label Encoding):
labelencoder = LabelEncoder()
x[:,3] = labelencoder.fit_transform(x[:,3])
Or, in case the one-hot encoder is not working, we have to use the column transformer:
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(categories='auto'), [3])],
                       remainder='passthrough')
# by default categories = 'auto'
onehot_x = np.array(ct.fit_transform(x), dtype=str)   # np.str is deprecated in newer NumPy; use the built-in str
Dummy Variables:
Let’s see the example that we have seen in the class from titanic dataset:
X = X.join(pd.get_dummies(X.Embarked, prefix='Embarked'))
display (X[:5])
Classification Models:
Logistic Regression:
Package to be installed:
from sklearn.linear_model import LogisticRegression
How to write it:
logmodel = LogisticRegression()
logmodel.fit(x_train, y_train)
y_pred = logmodel.predict(x_test)
y_pred
y_test
Confusion matrix:
                        Predicted 0 (No)        Predicted 1 (Yes)
Actual 0 (No)           TN (True Negative)      FP (False Positive)
Actual 1 (Yes)          FN (False Negative)     TP (True Positive)
F1 score
Precision
Recall
Support
F1 Score:
packages to be installed:
from sklearn.metrics import classification_report
How to write it:
print(classification_report(y_test,y_pred))
Decision Tree:
Packages to be installed:
from sklearn.tree import DecisionTreeRegressor   # or DecisionTreeClassifier for classification
How to write it:
from sklearn.metrics import explained_variance_score
from time import time
# how to write Model Building
start = time()
decision = DecisionTreeRegressor()
decision.fit(x_train, y_train)
decc = decision.score(x_test,y_test)
# Prediction
decpredict = decision.predict(x_test)
# explained_variance_score - comparing actual vs predicted (argument order: y_true, y_pred)
# confusion_matrix - comparing actual vs predicted
# Score / Accuracy
exp_dec = explained_variance_score(y_test, decpredict)
end = time()
train_time_dec = end-start
exp_dec
plot:
plt.figure(figsize=(17,8))
plt.plot(y_test, label = "Test")
plt.plot(name_of_model_predict, label = "predict")
plt.show()
Note:
If the plot is not a linear line (i.e. the relationship is non-linear), then we build a decision tree for the regression problem.
Otherwise, it is better to avoid decision tree regression models, since they are non-linear.
Note: here in Support Vector Machines there are 4 kernels:
linear
sigmoid
polynomial
RBF (Radial Basis Function)
SVM is mainly used in AI (artificial intelligence).
By default, the system chooses the RBF kernel; if we want, we can change it in the kernel parameter. Kernel = approach.
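A minimal sketch of an SVM classifier, assuming x_train, y_train, x_test, y_test already exist from the train/test split (the kernel argument can be changed to 'linear', 'sigmoid', or 'poly'):
from sklearn.svm import SVC

svm_model = SVC(kernel='rbf')             # 'rbf' is the default kernel
svm_model.fit(x_train, y_train)
svm_pred = svm_model.predict(x_test)
print(svm_model.score(x_test, y_test))    # accuracy on the test data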
Facts of KNN:
KNN is mostly used in clinical sectors.
In KNN the dependent variable must be a factor/character.
KNN is also called the LAZY algorithm of all the classification models, because it takes the nearest value rather than the exact value; it works with a range.
KNN works efficiently with small-sized datasets.
In KNN the key rule is that the K-value should be an odd number.
The disadvantage of KNN is that if the k-value is increased, the answer changes.
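A minimal sketch of KNN classification, assuming x_train, y_train, x_test, y_test already exist (n_neighbors is kept odd, as per the key rule above):
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5)   # k should be an odd number
knn_model.fit(x_train, y_train)
knn_pred = knn_model.predict(x_test)
print(knn_model.score(x_test, y_test))            # accuracy on the test data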
NAIVE BAYES:
Package to be installed:
from sklearn.naive_bayes import GaussianNB
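A minimal usage sketch, using the import above and assuming x_train, y_train, x_test, y_test already exist (the variable name nb_mdl is illustrative):
nb_mdl = GaussianNB()
nb_mdl.fit(x_train, y_train)
nb_pred = nb_mdl.predict(x_test)
print(nb_mdl.score(x_test, y_test))   # accuracy of the Naive Bayes model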
ENSEMBLE TECHNIQUES:
We use ensemble techniques to improve the accuracy and performance of our machine learning models' predictions.
They are very important, fast, accurate, and very efficient in nature.
Hence these techniques help a lot in getting our M.L. models to give results in the best way possible.
Random Forest is the ensemble technique most widely accepted across the data science industry.
AdaBoostRegressor / Classifier:
Package to be installed:
from sklearn.ensemble import AdaBoostRegressor   # or AdaBoostClassifier for classification
How to write it:
start = time()
ada = AdaBoostRegressor(n_estimators=50, learning_rate=0.2,
                        loss='exponential').fit(x_train, y_train)   # the loss parameter applies to the regressor only
adab = ada.score(x_test, y_test)
# Loss functions - MAE, MAPE, MSE, RMSE
end = time()
train_test_ada = end - start
predict_ada = ada.predict(x_test)
exp_ada = explained_variance_score(y_test, predict_ada)
exp_ada
XGBoost Method:
XGBoost works fast, is accurate, and works efficiently alongside all the algorithms.
Package to be installed:
!pip install xgboost
from xgboost import XGBClassifier   # or XGBRegressor for regression
how to write it:
classifier_xgb = XGBClassifier()
classifier_xgb.fit(x_train, y_train)
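A minimal follow-up sketch for getting predictions and accuracy from the fitted XGBoost model (the variable names are illustrative):
y_pred_xgb = classifier_xgb.predict(x_test)
print(classifier_xgb.score(x_test, y_test))   # accuracy of the XGBoost classifier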
K-Fold method:
k-fold method is one of the key methods in machine learning
concepts it is very accurate and very frequently used concept across
the M.L models.
Package to be installed:
from sklearn.model_selection import cross_val_score
How to write it:
accuracy = cross_val_score(estimator=est, X=x_train, y=y_train, cv=10)   # est: the fitted model, e.g. a gradient boosting regressor
accuracy
Note: here in this example we have taken the gradient boosting regressor (est) to check the accuracy, but we can use any model to get the accuracy with the k-fold method.
K-FOLD with Logistic Regression:
from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(estimator=logmodel,   # the LogisticRegression model fitted earlier
                           X=x_train, y=y_train, cv=15)
accuracy
NOTE:
Please change the ensemble technique names to Classifier or Regressor as appropriate when you are dealing with the ensemble techniques. The above are examples of the code, and you need to make changes according to your needs.
Model Comparison:
Model comparison on the basis of each model's Accuracy Score and Explained Variance Score.
model_validation = pd.DataFrame({
'Model':['Decision Tree','Random Forest','Gradient Boosting','AdaBoost',
'Support Vector Machine','Linear Regression'],
'Score': [decc,random,gradient,adab,svr1,regressor1],
'Variance Score': [exp_dec,exp_rand,exp_est,exp_ada,exp_svr,exp_linear]
})
model_validation.sort_values(by='Score', ascending=False)
Model Analysis:
Analysing the training time each model has taken:
Model = ['Decision Tree','Random Forest','Gradient Boosting','AdaBoost',
'Support Vector Machine','Linear Regression']
Train_time = [
train_time_dec,
train_test_rand,
train_test_est,
train_test_ada,
train_time_svr,
train_time_linear
]
index = np.arange(len(Model))
plt.bar(index, Train_time)
plt.xlabel("Machine Learning Models", fontsize =20)
plt.ylabel("Training Time", fontsize = 25)
plt.xticks(index, Model, fontsize=10)
plt.title("Comparison of Training Time taken by all the ML Models")
plt.show()
THANK YOU
&
ALL THE BEST