
Handling the dataset using R

Supervised Machine Learning:


Things to remember when using R:
 To get a new script in R: go to File – select New File –
select R Script, or use Ctrl+Shift+N.

 To execute the code, use Run or Ctrl+Enter.

 In R (RStudio) we have 4 windows:


 Source code window – (where we write and execute the
code)
 Console window – (where we see the execution results)
 Environment window – (where we see the objects that are
created as we write the code)
 And the Results window – (where we get help, see the
installed packages, and check the working directory)
 R is a case-sensitive language.

 R works with all sorts of files and their extensions.

Before building the model, before forecasting, and before publishing, we have to check
that the data is complete. There are 4 key steps that apply to all supervised
machine learning work:
 Data Pre-Processing – (includes cleaning the data)
 Data Manipulation – (summarizing, finding the significant
variables, etc.)
 Data Exploratory Analysis – (relationship between the DV and the IDVs)
 Data Visualization – (building visualizations, plots, etc.)

So let's get started with handling the dataset in R:


Importing the dataset:
How to write it:
dataset_name <- read.csv(choose.files())
When we run this, a pop-up dialog box appears asking us to choose
the file we want to work with.
Choose the file you want to work with.

Looking at the structure of the data using R functions like:

 dim(dataset_name): tells us the dimensions of the


dataset.
 View(dataset_name): lets us see how the dataset looks.
 head(dataset_name): shows the top 6 rows of the dataset.
 tail(dataset_name): shows the bottom 6 rows of the dataset.
 str(dataset_name): shows the structure of the dataset.

Data Pre-Processing or Cleaning the Dataset.

There are 4 Major Steps included in Data-Preprocessing:


 Finding the Missing Values
 Handling the Outlier
 Feature Scaling
 The Encoding Concept.
A detailed study of pre-processing is required, as it is extremely
important for any dataset we are dealing with, in order to get good
accuracy and good results.
Finding whether there are any missing values in the dataset:

colSums(is.na(dataset_name))
To find and treat the Missing values in the dataset:
 Identify whether the value is number or character. If it is a number we
have to use MEAN or MEDIAN.
 If there is Outlier, we use MEDIAN otherwise use MEAN.
 If it is Character, we use MODE.
 If more than 25% of a variable's data is missing, we remove the entire
variable; otherwise we impute with the MEAN or MEDIAN.

Outlier: how to find it:


We use boxplot to find the Outlier:
boxplot(dataset_name$variable_name)

 The components of the box-plot are:


 IQR (inter-quartile range) = Q3 - Q1
 RANGE = Max - Min
 POSITIVE (upper) OUTLIER boundary = Q3 + 1.5*IQR
 NEGATIVE (lower) OUTLIER boundary = Q1 - 1.5*IQR
 Q3 = value at position 3(n+1)/4
 Q1 = value at position (n+1)/4
 Q2 = Median of the dataset
 To handle the Outlier, we use TRANSFORMATION concepts like
SQUARE ROOT, CUBE ROOT and LOG.

Feature Scaling: We use feature scaling when there is an outlier, or when one variable is
on a much larger scale and would otherwise dominate (influence) another variable.

We use NORMALIZATION:
(Actual Value - min) / (max - min)
or STANDARDIZATION (z-score):
(Actual Value - mean) / standard deviation

Normalization:
 Gives only positive (+) values.
 Used "row-wise" (per observation).
 Used when there is "NO OUTLIER".

Standardization:
 Gives both positive (+) and negative (-) values.
 Used "column-wise" (per variable).
 Used when there is an "OUTLIER".
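For example, take a hypothetical value of 50 in a column whose minimum is 20, maximum is 120, mean is 60 and standard deviation is 25. Using the formulas above:
Normalized value = (50 - 20) / (120 - 20) = 0.30
Standardized value = (50 - 60) / 25 = -0.40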

Encoding Concept:
The encoding concept is converting characters to numbers, or vice
versa.

Splitting the data into training and test data:

 After the Pre-Processing is completed, we have to split the data into


TRAINING and TEST data.
Fixing the Random number:
 Packages to be installed:

 caTools
To fix the random number while splitting into the training and test datasets.
How to write it:
install.packages("caTools")
library(caTools)
set.seed(456)
split <- sample.split(dataset_name$DV, SplitRatio = 0.75 )
split
table(split)

note: on what basis do we decide to split the data into 75 or 80%? Normally,
by "industry standards", we take 75 to 80% of the data as the training dataset for
building the model, and park the remaining 25 or 20% as test data for prediction with
the model we are going to build. This is done to escape from over/under-
fitting the data.

Checking whether the data has split correctly:


How to write it:
training
training <- subset(dataset_name, split==TRUE)
nrow(training)
test
test <- subset(dataset_name, split==FALSE)
nrow(test)

If the data has split correctly, we can go for model building:

REGRESSION:
 If it is a Regression model, we use lm (Linear Model) as the model-building
algorithm.
 We use the OLS (Ordinary Least Squares) method for linear models, to get the
"best fit line".
How to write it:

reg <- lm(dv~., data = training)


reg
CLASSIFICATION:

 If it is a Classification model, we use glm (Generalized Linear Model) as the


model-building algorithm.
How to write it:
classify <- glm(dv~., data = training, family = "binomial")
names(dataset_name)
DECISION TREE:
 For a decision tree in a classification model we use rpart as the model-building
algorithm.
 The algorithm is "CART" = Classification And Regression Tree.
 CART is a predictive model, which explains how an outcome variable's
values can be predicted based on the other values.

 Packages to be installed:
o rpart
o rattle

Parts of Decision Tree:


 ROOT NODE
 DECISION NODE
 TERMINAL/ LEAF NODE
ROOT NODE:
The most important node of the whole tree; it is the top node of the tree.
DECISION NODE/BRANCH NODE:
A branch of the tree; the second subset of the tree.
TERMINAL/LEAF NODE:
The final node of the tree, and the final subset of the tree after pruning the
tree.
On what basis do we decide the root node?

Components of Decision Tree:


 Gini Index = p^2 + q^2
 Entropy = -p*Log2(p) - q*Log2(q)
 Information Gain = 1 - Entropy
 Chi-square test = sum over i = 1 to n of (actual value -
expected value)^2 / expected value
In simple terms:
Gini index:
The higher the value, the higher the possibility of being the root node.
Entropy:
The lower the value, the higher the possibility of being the root node.
Information gain:
This is nothing but entropy expressed the other way: since it should give 100% information about the tree,
it is 1 - Entropy. The higher the value, the higher the possibility of being the root node.
Chi-Square test:
The higher the value, the higher the possibility of being the root node.
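For example, with hypothetical class proportions p = 0.8 and q = 0.2 at a candidate root node, the formulas above give:
Gini index = 0.8^2 + 0.2^2 = 0.68
Entropy = -0.8*Log2(0.8) - 0.2*Log2(0.2) ≈ 0.72
Information gain = 1 - 0.72 = 0.28
The variable whose split gives a higher Gini index (or information gain) than the others is preferred as the root node.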
Packages to be installed:
How to write it:
install.packages("rpart")
library(rpart)
For getting the visualization:
install.packages("rattle")
library(rattle)
How to write it to build the model:
dec_tree <- rpart(dv~., data = training)
dec_tree
fancyRpartPlot(dec_tree) – to get a fancy visualization.
RANDOM FOREST:
Packages to be installed:
install.packages("randomForest")
library(randomForest)
How to write it:
rand_for <- randomForest(dv~., data = training, ntree = 500)
NOTE: In Random Forest, the package and the algorithm are both called randomForest.

SUPPORT VECTOR MACHINE:


Packages to be installed:
install.packages("e1071")
library(e1071)
How to write it:
There are 4 different types of kernels (approaches):
svm_linear <- svm(dv~., data = training, kernel = "linear")
svm_sigmoid <- svm(dv~., data = training, kernel = "sigmoid")
svm_polynomial <- svm(dv~., data = training, kernel = "polynomial")
svm_rbf <- svm(dv~., data = training, kernel = "radial")   # "radial" is the RBF kernel in e1071

Finding the significant variables:


There is a very easy and very powerful function called:

SUMMARY
 summary() gives the key statistical values (min, quartiles, mean, max).
 summary() reports missing values (the count of NA's).
 summary() works like a boxplot for spotting outliers (via the quartiles).
 summary() output guides the visualizations we build.
 summary() handles both numeric and character variables.
Hence summary is very important in the whole project, as it tells us which
variables are significant and which are non-significant in the project.

How to write it:


summary(model_name)

Predicting the model with test data in Supervised Machine Learning:


REGRESSION:
pred <- predict(reg, newdata = test)
pred
CLASSIFICATION: (LOGISTIC)
pred_log <- predict(classify, newdata = test, type = "response")
pred_log
DECISION_TREE:
dec_pred <- predict(dec_tree, newdata = test, type='class')
dec_pred
RANDOM FOREST:
rand_pred <- predict(rand_for, newdata = test)
rand_pred
SVM (Support Vector Machine):
svm_lin_pre <- predict(svm_linear, newdata = test, type='class')
svm_sig_pre <- predict(svm_sigmoid, newdata = test, type='class')
svm_poly_pre <- predict(svm_polynomial, newdata = test, type='class')
svm_rbf_pre <- predict(svm_rbf, newdata = test, type='class')

Predicting/comparing the actual and predicted values:


 In Regression we get the accuracy right after the prediction, hence there won't
be any further steps needed to get the accuracy. So we simply combine the predicted and
test values.
 But in a classification problem we don't get any accuracy immediately after
predicting with the model. So we fix a threshold value to get the
accuracy.
REGRESSION:
pred_regressor_cbind <- cbind(test$dv, pred)
CLASSIFICATION: (LOGIT)
pred_threshold <- ifelse(pred_log <= 0.5, 0, 1)
Then combining the prediction and threshold values:
pred_cbind <- cbind(test$dv, pred_threshold)

DECISION TREE:
dec_tree_cbind <- cbind(test$dv, dec_pred)
dec_tree_cbind
RANDOM FOREST:
rand_pred_bin <- ifelse(rand_pred>=0.5,1,0)
rand_pred_bin
SVM:

svm_lin_pre_bin <- ifelse(svm_lin_pre>=0.5,1,0)


svm_sig_pre_bin <- ifelse(svm_sig_pre>=0.5,1,0)
svm_poly_pre_bin <- ifelse(svm_poly_pre>=0.5,1,0)
svm_rbf_pre_bin <- ifelse(svm_rbf_pre>=0.5,1,0)

BUILDING CONFUSION MATRIX:


 One of the key things in the whole project.
 Plays a very significant role in Classification models.
 After predicting on the test data we need the accuracy, which is what we
want in classification models; hence the confusion matrix comes to the
rescue.
 The confusion matrix tells us how accurate we are with our predicted
model.
 With the help of the confusion matrix we can find accuracy,
sensitivity, specificity, recall, precision, etc.
 The name suggests confusion, but in fact it gives us
clarity about our prediction.
 With the help of the C.M. we can choose methods to reduce the error of our
prediction.
 Without the confusion matrix we can't measure the error % in Classification
models.
Let's see how it looks:

                       Predicted values
                       0 (No)                  1 (Yes)
Actual values
0 (No)                 TN (True Negative)      FP (False Positive)
1 (Yes)                FN (False Negative)     TP (True Positive)

 We use Confusion Matrix to find the Error of our prediction.


Measures of the Confusion Matrix:

ACCURACY = (True Positive + True Negative) / (TP + FP + TN + FN)

SENSITIVITY = True Positive / (TP + FN)

SPECIFICITY = True Negative / (TN + FP)

RECALL: also called "Sensitivity".
Recall = True Positive / (TP + FN)
PRECISION: also called "Positive Predictive Value".
Precision = True Positive / (TP + FP)
NEGATIVE PREDICTIVE VALUE = True Negative / (TN + FN)

Note: values less than 0.5 are considered as 0: Negative, False;


values of 0.5 or more are considered as 1: Positive, True.
The Confusion Matrix tells us the maximum error of the prediction.
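For example, with hypothetical counts TP = 40, TN = 45, FP = 5, FN = 10 (100 test records), the measures above work out as:
Accuracy = (40 + 45) / 100 = 0.85
Sensitivity / Recall = 40 / (40 + 10) = 0.80
Specificity = 45 / (45 + 5) = 0.90
Precision = 40 / (40 + 5) ≈ 0.89
Negative Predictive Value = 45 / (45 + 10) ≈ 0.82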

ROC and AUC:

ROC = Receiver Operating Characteristic


AUC = Area Under Curve
ROC and AUC come under MLE (Maximum Likelihood Estimation). We use AUC and ROC
when the threshold values get changed, so that we end up with a smaller error than the prior one with
the help of MLE (Maximum Likelihood Estimation). So we use these techniques to improve our
accuracy.

Packages to be installed:
install.packages("rpart")
library(rpart)
install.packages("ROCR")
library(ROCR)
How to write it:
ROCprediction <- prediction(pred_threshold, test$dv)
ROCprediction
ROCperformance <- performance(ROCprediction, 'tpr', 'fpr')
ROCperformance
plot(ROCperformance, col = "color", print.cutoffs.at = seq(0.1, 1, by = 0.1))
abline(a = 0, b = 1)

TPR: True Positive Rate


FPR: False Positive Rate
After this ROC and AUC step we get the AIC, NULL DEVIANCE, RESIDUAL (MULTIPLE) DEVIANCE,
F-MEASURE, etc. values, and hence get the accuracy in a way everyone can comprehend.
AIC: it's like R2 for classification.
NULL Deviance: the default error given by the system without the IDVs.
RESIDUAL (MULTIPLE) DEVIANCE: the error with the IDVs included.
F-MEASURE = 2 * Recall * Precision / (Recall + Precision).

With these measures we can easily handle any supervised Machine Learning dataset using R.

ALL THE BEST


&
THANK YOU
Handling the dataset using Python
Supervised Machine Learning:
Things to remember when using Python:
Python – to get Python, we install Anaconda, open Jupyter Notebook, and
launch a Jupyter notebook.
 Python is a case-sensitive language.
 Python is a collection of data types.
 Shut down the kernel when you log off, or on the homepage go to Running and
shut down the kernels.
 CTRL+ENTER executes the cell only.
 SHIFT+ENTER executes the code and selects the cell below.
 ALT+ENTER executes the code and creates a new cell after the
execution.
 While execution is taking place we see the [*] symbol.
 To get a new notebook we have 2 ways:
 File – New Notebook – select Python 3.
 Homepage – select New – choose Python 3.
 Rename the notebook immediately after logging in to the Jupyter notebook, in order not to
lose the data, info, and code.
 After completing the code, to save the document go to File – select
Download as – Notebook (.ipynb), and delete the auto-saved file in the drive
we are working with (from the homepage).
 axis = 0, 1 where 0 = row, 1 = column.
 None = all.
 NaN = missing values.
 dropna = deletes the missing values.
 range = basic Python expression; in NumPy the equivalent is arange.
 [[-]] – means specifying the index value.
 elif = combination of the else and if functions.
 insert = assigning a place (index) value and then inserting the value where
you would like it in the dataset.
 append = adding the value at the end.
 copy = duplicate/replication as a new object.
 clear = clears all the values in the object.
 delete = deletes the entire object.
 remove = removes a particular value in the dataset.
 pop(index) = removes and returns the value at the given index.
 Random numbers:
o rand – only positive values (uniform between 0 and 1, like probabilities)
o randn – negative (-), positive (+), float/decimal values
o randint – random integers within a range (to fix the sample values like set.seed in R, a random seed is used)
 To install any package, we can use 2 (two) ways:
o !pip install package_name
o Go to the Anaconda Prompt and type: pip install package_name.
 Python is an OOP language, where OOP stands for object-oriented programming.

BASICS OF PYTHON:
 Types of variables:
o NUMBER – integer, float, and complex number
o STRING – character
o BOOLEAN / LOGICAL VALUES – yes/no, True/False

LOOPS IN PYTHON:
 We use LOOPS in Python where manual work couldn't help.
o ZIP LOOP
o WHILE LOOP
o FOR LOOP

 AUTOMATION LOOP FUNCTIONS IN PYTHON / USER-DEFINED


FUNCTIONS (a short sketch follows this list):
o elif – it's a combination of "else and if".
o DEF – we can create our own customized function.
o LAMBDA – a simpler/shorter version of a def function; it is inbuilt.
o MAP – applies a function (e.g. a def function) to all the values.
o FILTER – it gives the filtered values.
o BREAK
o PASS
o CONTINUE
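A minimal sketch of these user-defined function tools, using hypothetical names (square, cube, nums):

def square(x):           # DEF: our own customized function
    return x * x

cube = lambda x: x ** 3  # LAMBDA: a shorter, inline version of def

nums = [1, 2, 3, 4]
squares = list(map(square, nums))                  # MAP: apply a function to every value
evens = list(filter(lambda x: x % 2 == 0, nums))   # FILTER: keep only the values that pass a test

print(squares)   # [1, 4, 9, 16]
print(evens)     # [2, 4]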
LOOPS IN PYTHON:
 We use loops mainly to avoid manual activity, and to make things better
and more accurate.
 Less error, more efficiency, and saving time are the most important
objectives of these loops.
 Let's see some important loops in Python:

 ZIP LOOP:

 zip() is a built-in Python function (related iteration tools live in the "ITERTOOLS" package).


 Python's zip() function is defined as zip(*iterables).
 The function takes in iterables as arguments and returns an iterator.
 This iterator generates a series of tuples containing elements from each
iterable.
 zip() can accept any type of iterable, such as files, lists,
tuples, dictionaries, sets, and so on.
 It is typically used with a for loop (see the short example below).
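A short example (with hypothetical lists names and marks) of zip() inside a for loop:

names = ["A", "B", "C"]
marks = [80, 90, 70]
for name, mark in zip(names, marks):   # pairs up elements position by position
    print(name, mark)
# Output:
# A 80
# B 90
# C 70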

 WHILE LOOP:
 The while loop in Python is used to iterate over a block of code
as long as the test expression (condition) is true.
 We generally use this loop when we don't know the number of
times to iterate beforehand.
 FOR LOOP:

 For loop is to iterate over a sequence of elements using the different variations
of the loop.
 The for loop in Python is used to iterate over a sequence (list, tuple, string) or
other iterable objects. Iterating over a sequence is called traversal.
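A small sketch (with a hypothetical list numbers) showing a while loop and a for loop doing the same traversal:

numbers = [10, 20, 30]

# while loop: repeat as long as the condition is true
i = 0
while i < len(numbers):
    print(numbers[i])
    i += 1

# for loop: iterate directly over the sequence (traversal)
for n in numbers:
    print(n)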

GATES IN PYTHON:
 Logic gates are used to create a circuit that performs calculations,
data storage or shows off object-oriented programming especially the power of
inheritance.
 A Logic gate is an elementary building block of any digital circuits. It takes one
or two inputs and produces output based on those inputs. Outputs may be high
(1) or low (0).
 There are seven basic logic gates defined; in Python these
are:
 AND GATE
 OR GATE
 NOT GATE
 NAND GATE
 NOR GATE
 XOR GATE
 XNOR GATE

AND GATE:
 The AND gate is a basic digital logic gate that implements
logical conjunction.
 It behaves according to its truth table.
 A HIGH output (1) results only if all the inputs to the AND gate are HIGH (1).
 The AND gate gives an output of 1 if both inputs are 1; it gives 0
otherwise.
In simple terms:

 And gate provides an output of 0 if either of the inputs are 0. This operation is
considered as multiplication in binary numbers.
 We can see in the truth table that whenever either of the two inputs is 0, the
output is 0 too.

How to write AND Gate in python:

def AND(A, B):
    return A & B

print("Output of 0 AND 0 is", AND(0, 0))


print("Output of 0 AND 1 is", AND(0, 1))
print("Output of 1 AND 0 is", AND(1, 0))
print("Output of 1 AND 1 is", AND(1, 1))

Output:

Output of 0 AND 0 is 0
Output of 0 AND 1 is 0
Output of 1 AND 0 is 0
Output of 1 AND 1 is 1

OR GATE:
 The OR gate is a digital logic gate that implements logical disjunction – it
behaves according to its truth table.
 A HIGH output (1) results if one or both the inputs to the gate are HIGH
(1). If neither input is high, a LOW output (0) results.
 The OR gate gives an output of 1 if either of the two inputs are 1, it
gives 0 otherwise.
In simple terms:
OR gate provides the output as 1 if either of the inputs is 1. It is similar to
an “addition” operation, with respect to binary numbers.
How to write OR Gate in python:

def OR(A, B):
    return A | B

print("Output of 0 OR 0 is", OR(0, 0))


print("Output of 0 OR 1 is", OR(0, 1))
print("Output of 1 OR 0 is", OR(1, 0))
print("Output of 1 OR 1 is", OR(1, 1))

output:

Output of 0 OR 0 is 0
Output of 0 OR 1 is 1
Output of 1 OR 0 is 1
Output of 1 OR 1 is 1

Universal Logic Gates in Python:


 There are two universal logic gates, NAND and NOR.
 They are named universal because any Boolean circuit can be
implemented using only these gates.

NAND Gate:
 The NAND gate (negated AND) gives an output of 0 if both inputs are 1, it
gives 1 otherwise.

in simple terms:

 The “NAND” gate is a combination of AND gate followed by NOT gate.


Opposite to AND gate, it provides an output of 0 only when both the bits are set,
otherwise 1.
 In Python, the NAND function can be implemented using the
AND() and NOT() functions created before.

How to write the NAND Gate in Python:


NAND Gate

# Function to simulate AND Gate
def AND(A, B):
    return A & B

# Function to simulate NOT Gate
def NOT(A):
    return ~A+2

# Function to simulate NAND Gate
def NAND(A, B):
    return NOT(AND(A, B))

print("Output of 0 NAND 0 is", NAND(0, 0))


print("Output of 0 NAND 1 is", NAND(0, 1))
print("Output of 1 NAND 0 is", NAND(1, 0))
print("Output of 1 NAND 1 is", NAND(1, 1))

output:
Output of 0 NAND 0 is 1
Output of 0 NAND 1 is 1
Output of 1 NAND 0 is 1
Output of 1 NAND 1 is 0

NOR Gate:
 The NOR gate (negated OR) gives an output of 1 if both inputs are 0; it gives 0
otherwise.

In simple terms:

 The NOR gate is a result of cascading of OR gate followed by NOT gate.


 Contrary to OR gate, it provides an output of 1, when all the inputs are 0.
 Similar to NAND () function, NOR () can be implemented using already created
functions.

How to write NOR gate in python:

# Function to calculate OR Gate
def OR(A, B):
    return A | B

# Function to simulate NOT Gate
def NOT(A):
    return ~A+2

# Function to simulate NOR Gate
def NOR(A, B):
    return NOT(OR(A, B))

print("Output of 0 NOR 0 is", NOR(0, 0))


print("Output of 0 NOR 1 is", NOR(0, 1))
print("Output of 1 NOR 0 is", NOR(1, 0))
print("Output of 1 NOR 1 is", NOR(1, 1))

OUTPUT:
Output of 0 NOR 0 is 1
Output of 0 NOR 1 is 0
Output of 1 NOR 0 is 0
Output of 1 NOR 1 is 0

NOT Gate:
 It acts as an inverter. It takes only one input. If the input is given as 1, it will
invert the result as 0 and vice-versa.
In simple terms:
 This gate provides the negation of the input given. This gate supports only a
single input.

How to write NOT gate in Python:


# Function to simulate NOT Gate
def NOT(A):
    return ~A+2

print("Output of NOT 0 is", NOT(0))


print("Output of NOT 1 is", NOT(1))

Output:

Output of NOT 0 is 1
Output of NOT 1 is 0
Note: The NOT function provides correct results for bit values 0 and 1.

Exclusive Logic Gates in Python:

There are two special types of logic gates, XOR and XNOR, that focus on the number
of inputs of 0 or 1, rather than individual values.
XOR Gate:
 The XOR gate gives an output of 1 if the two inputs are different; it gives 0 if
they are the same.
In simple terms:
 An acronym for Exclusive-OR, XOR gate provides an output of 1 when the
number of 1s in the input is odd.

How to write XOR gate in python:


Function to simulate XOR Gate:
def XOR(A, B):
    return A ^ B

print("Output of 0 XOR 0 is", XOR(0, 0))


print("Output of 0 XOR 1 is", XOR(0, 1))
print("Output of 1 XOR 0 is", XOR(1, 0))
print("Output of 1 XOR 1 is", XOR(1, 1))

OUTPUT:

Output of 0 XOR 0 is 0
Output of 0 XOR 1 is 1
Output of 1 XOR 0 is 1
Output of 1 XOR 1 is 0

XNOR GATE:

 The XNOR gate (negated XOR) gives an output of 1 if both inputs are the same and
0 if they are different.
 It is formed as a result of the combination of XOR and NOT gates.
 Opposite to XOR, it provides an output of 1, when the number of 1s in
the input is even.
How to write XNOR Gate in Python:

The XNOR () function can be implemented by using the XOR () and NOT () functions in
Python.

# Function to simulate XOR Gate
def XOR(A, B):
    return A ^ B

# Function to simulate NOT Gate
def NOT(A):
    return ~A+2

# Function to simulate XNOR Gate
def XNOR(A, B):
    return NOT(XOR(A, B))

print("Output of 0 XNOR 0 is", XNOR(0, 0))


print("Output of 0 XNOR 1 is", XNOR(0, 1))
print("Output of 1 XNOR 0 is", XNOR(1, 0))
print("Output of 1 XNOR 1 is", XNOR(1, 1))

OUTPUT:
Output of 0 XNOR 0 is 1
Output of 0 XNOR 1 is 0
Output of 1 XNOR 0 is 0
Output of 1 XNOR 1 is 1

DATA STRUCTURES OF PYTHON:


 A data structure is a way of organizing and storing data such that we can
access and modify it efficiently.
 In Python we have TWO kinds of data structures / types:
 Mutable (changeable)
 Immutable (unchangeable)
 In the mutable data structures / types we have:
 LIST
 DICTIONARY
 SET
 In the immutable data structures / types we have:
 NUMBER
 STRING
 TUPLE

Let's see how these data types work:

LIST:
 A list starts with square brackets "[-]" (index notation).
 It's mutable.
 Python offers a range of compound data types often referred to as
sequences.
 The list is one of the most frequently used and most versatile data types
in Python.

 a list is created by placing all the items


(elements) inside square brackets [] , separated
by commas.
 It can have any number of items and they may
be of different types (integer, float, string etc.).
 A list can also have another list as an item. This
is called a nested list.

Examples of creating a list and a nested list:


List: my_list = ["hello", 20, 30, 40, 5.9, 8.7, 9.9]

Nested list: my_nested_list = ["important", [5, 6, 7],


[4.6, 7.9], 10]

Python List Methods:


append() - Add an element to the end of the list

extend() - Add all elements of a list to another list

insert() - Insert an item at the defined index

remove() - Removes an item from the list

pop() - Removes and returns an element at the given index

clear() - Removes all items from the list

index() - Returns the index of the first matched item

count() - Returns the count of the number of items passed as an argument

sort() - Sort items in a list in ascending order

reverse() - Reverse the order of items in the list.

copy() - Returns a shallow copy of the list.
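A quick sketch (with a hypothetical list my_list) of a few of the methods above:

my_list = [30, 10, 20]
my_list.append(40)       # [30, 10, 20, 40]
my_list.insert(1, 15)    # [30, 15, 10, 20, 40]
my_list.remove(10)       # [30, 15, 20, 40]
last = my_list.pop()     # returns 40; the list is now [30, 15, 20]
my_list.sort()           # [15, 20, 30]
print(my_list, last)     # [15, 20, 30] 40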

DICTIONARY:
 Python dictionary is an unordered collection of items. Each item of a dictionary
has a key/value pair.
 Dictionaries are optimized to retrieve values when the key is known.
 Dictionary starts with curly braces {--}
 It is mutable(changeable)
 An item has a KEY and a corresponding VALUE that is expressed as a pair
 (key: value).

Example of creating a dictionary:

# get vs [] for retrieving elements


my_dict = {'name': 'Jack', 'age': 26}

# Output: Jack
print(my_dict['name'])

# Output: 26
print(my_dict.get('age'))

# Trying to access keys which doesn't exist throws error


# Output None
print(my_dict.get('address'))

# KeyError
print(my_dict['address'])

Output

Jack
26
None
Traceback (most recent call last):
File "<string>", line 15, in <module>
print(my_dict['address'])
KeyError: 'address'
# from sequence having each item as a pair
my_dict = dict([(1,'apple'), (2,'ball')])

SET:
 A set is mutable.
 A set starts with curly brackets {-}.
 A set gives the unique values. We can add or remove items in it.
 Sets are unordered (a small set of numbers often displays in ascending order).

# Different types of sets in Python


# set of integers
my_set = {1, 2, 3}
print(my_set)

# set of mixed datatypes


my_set = {1.0, "Hello", (1, 2, 3)}
print(my_set)

Output

{1, 2, 3}
{1.0, (1, 2, 3), 'Hello'}

Immutable (unchangeable) data types:

TUPLE:
 Tuples are used to store multiple items in a single variable.
 The tuple is one of 4 built-in data types in Python used to store collections of data;
the other 3 are List, Set, and Dictionary, all with different qualities and usage.
 A tuple is a collection which is ordered and unchangeable.
 A tuple is exactly the same as a list; the only difference is that it is immutable.
 Tuples are widely used in Python.
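A short sketch (with a hypothetical tuple my_tuple) showing that a tuple is indexed like a list but cannot be changed:

my_tuple = ("hello", 20, 5.9)
print(my_tuple[0])       # indexing works like a list: hello
# my_tuple[0] = "hi"     # would raise TypeError, because tuples are immutable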

NUMBER & STRING:


 Python supports integers, floating-point numbers and complex numbers.
 They are defined as int , float , and complex classes in Python.
 Integers and floating points are separated by the presence or absence of a
decimal point.
 A string is a sequence of characters.

 a string is a sequence of Unicode characters. Unicode was introduced to include


every character in all languages and bring uniformity in encoding.
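A quick sketch (with hypothetical values) of the number and string types:

a = 10           # int
b = 10.5         # float
c = 2 + 3j       # complex
s = "dataset"    # string: a sequence of characters
print(type(a), type(b), type(c))
print(s[0], len(s))   # d 7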

PACKAGES IN PYTHON AND THEIR IMPORTANCE:


In Python there are many packages. These packages are very significant,
and their importance is high when we write code using them.
So let’s see what are they:

NUMPY:
 NumPy is mainly for ARRAYS and for mathematical
calculations / operations.
 It will convert all the data into numbers.
 It is like the psych package in R, which gives us stats values.

PANDAS:
 Pandas is for data manipulation.
 It gives Series values.
 It gives Index values.
 It is used to import and export data.
 It is called the "Grammar of Manipulations".
 It is sometimes also used for visualization (its plotting is built on Matplotlib).

MATPLOTLIB:
 Mainly used for visualization.

SEABORN:
 Mainly used for visualization.
 It is used for Statistics.
 It is used for EDA.
 It is the advanced version of Matplotlib.

SKLEARN:
 It is mainly for Machine Learning along with data pre-processing.
 It is the advanced version of the "PSYCHIC" package.

TENSORFLOW 2:
 Mainly used for Deep Learning.

KERAS:
 Mainly used for Deep Learning.
 Advanced version of an API ("Application Programming Interface"), which
means many lines of code are replaced by a single line of code with the help of
Keras.

NLTK (NATURAL LANGUAGE TOOL KIT):


 Mainly used in NLP (natural language process).

Writing the program in Python:


Setting up the working directory:
import os
os.chdir("give\\the\\path\\like\\this")

 Now, how to give the path: go to the folder/file where the dataset is,
right click – Properties – Security – select the whole path –
copy the path – paste it inside the quotes (using "\\" separators) and press Alt + Enter.
Checking whether the working directory is set correctly:
os.getcwd()   # Alt + Enter
Importing the 5 important packages before starting on any dataset:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Importing the dataset to working place (i.e in jupyter notebook):
dataset = pd.read_csv(“file name.csv”)
Checking the head / info of the dataset:
dataset.head()
dataset.info()
Data Pre-processing:
Checking whether there is any missing data in the dataset:
pd.DataFrame(dataset).isnull().any()   # or .sum()
Checking the dataset length/columns/data types:
print(len(dataset))
print(len(dataset.columns))
print(dataset.dtypes)
Checking/ dividing dependent and independent variables/correlations:
how to write it:
independent_vars = dataset.iloc[:,starting_num:ending_num].columns.tolist()
independent_vars
dependent_var:
DV = dataset.iloc[:,numofdv].name
DV
Checking the correlation of the DV with the other variables, to see how many variables
are strongly correlated with the DV.
How to write it:
To see the correlation, the package to be imported is:
from scipy.stats import pearsonr
correlations = { }
for i in independent_vars:
    data = dataset[[i, DV]]
    x1 = data[i].values
    x2 = data[DV].values
    key = i + "Vs" + DV
    correlations[key] = pearsonr(x1, x2)[0]
correlations
data_correlations = pd.DataFrame(correlations, index = ['Value']).T
data_correlations.loc[data_correlations['Value'].abs().sort_values(ascending=False)
.index]
To check the heat-map:
plt.figure(figsize=(30,8))
sns.heatmap(dataset.corr(), cmap='rainbow', annot=True)
plt.show()

EDA (Exploratory Data Analysis) or Data Visualization:


Packages to be installed:
from scipy.stats import stats
from scipy.stats import norm, skew
How to write it:
To see the visualization of idv’s with dv:
sns.lmplot(x="IDV1", y=DV, data=dataset)
To see the boxplot:
plt.figure(figsize = (16,8))
sns.boxplot(x="IDV2", y="DV", data = dataset)
plt.show()
To see the barplot:
plt.figure(figsize = (16,8))
sns.barplot(x="IDV2", y="DV", data = dataset)
plt.show()

To check the pair-plot:


sns.pairplot(dataset, hue="IDV2")
sns.pairplot(dataset, hue="IDV2", diag_kind='hist')
to see the normal distribution of dataset:
sns.distplot(dataset['Profit'], fit=norm);
# fitted with some parameter by using mu and sigma
(mu, sigma) = norm.fit(dataset['Profit'])
plt.legend(['Normal Dist. ($\mu=$ {:.2f} and $\sigma=${:.2f})'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title("Profit Distribution")

There are 4 Major Steps included in Data-Preprocessing:


 Finding the Missing Values
 Handling the Outlier
 Feature Scaling
 The Encoding Concept.
A detailed study of pre-processing is required, as it is extremely
important for any dataset we are dealing with, in order to get good
accuracy and good results.
Finding whether there are any missing values in the dataset:

pd.DataFrame(dataset).isnull().any()   # or .sum()

To find and treat the Missing values in the dataset:


 Identify whether the value is number or character. If it is a number,
we have to use MEAN or MEDIAN.
 If there is Outlier, we use MEDIAN otherwise use MEAN.
 If it is Character, we use MODE.
 If more than 25% of a variable's data is missing, we remove the entire
variable; or if it is less than 25%, we impute with the MEAN or MEDIAN.

Let’s see the example that we have seen in the class from titanic dataset:
To check whether missing value is skewed or not
X.hist('age')
Treating the missing value with median in dataset:
X['age'] = X['age'].fillna(X.age.median()) # because the hist is skewed
print (X.age.isnull().sum())
Treating the missing value with mode in dataset:
print(X.Embarked.mode()[0])
X['Embarked'] = X['Embarked'].fillna(X.Embarked.mode()[0])
print (X.Embarked.isnull().sum())

Outlier: how to find it:


We use boxplot to find the Outlier:
plt.figure(figsize = (16,8))
sns.boxplot(x='State', y='Profit', data = dataset)
plt.show()

 The components of the box-plot are:


 IQR (inter-quartile range) = Q3 - Q1
 RANGE = Max - Min
 POSITIVE (upper) OUTLIER boundary = Q3 + 1.5*IQR
 NEGATIVE (lower) OUTLIER boundary = Q1 - 1.5*IQR
 Q3 = value at position 3(n+1)/4
 Q1 = value at position (n+1)/4
 Q2 = Median of the dataset
 To handle the Outlier, we use TRANSFORMATION concepts like
SQUARE ROOT, CUBE ROOT and LOG.

Feature Scaling: We use feature scaling when there is an outlier, or when one variable is
on a much larger scale and would otherwise dominate (influence) another variable.

We use NORMALIZATION:
(Actual Value - min) / (max - min)
or STANDARDIZATION (z-score):
(Actual Value - mean) / standard deviation
Normalization:
 Gives only positive (+) values.
 Used "row-wise" (per observation).
 Used when there is "NO OUTLIER".
 Formula = (actual value - min) / (max - min)

Standardization:
 Gives both positive (+) and negative (-) values.
 Used "column-wise" (per variable).
 Used when there is an "OUTLIER".
 Formula = (actual value - mean) / standard deviation

Package to be installed:
from sklearn.preprocessing import StandardScaler
In Python how we write it:
scaler = StandardScaler()
display(X[:5])
X["IDV_name"] = scaler.fit_transform(X[["IDV_name"]])
display(X[:5])

Encoding Concept:
The encoding concept is converting characters to numbers, or vice versa.
In Python we have 3 (THREE) different types of encoding concepts available:

 LABEL ENCODER – when there is a change of character to number or vice


versa.
 ONE HOT ENCODER – all the categories are converted into columns in binary
format.
 DUMMY VARIABLE – widely used in Python and very popular; its formula
is "n-1", so the first value gets removed and there won't be multicollinearity.
How to write it:
Packages to be installed:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder = LabelEncoder()
x[:,3] = labelencoder.fit_transform(x[:,3])

Example of Label encoder:


State label encoder
TN 0
ND 1
MP 2
GUJ 3

Packages to be installed:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
Or, in case the one-hot encoder is not working, we have to use the column transformer:
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(categories='auto'),
[3])], remainder='passthrough')
# by default categories = 'auto'
onehot_x = np.array(ct.fit_transform(x), dtype=str)

Example of One Hot Encoder:


0 1 2 3 4
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0

Dummy Variables:
Let’s see the example that we have seen in the class from titanic dataset:
X = X.join(pd.get_dummies(df.Embarked, prefix ='Embarked'))
display (X[:5])

X = X.drop(['Embarked_C', 'Embarked'], axis=1)


display (X[:5])
   sex   age  sibsp  parch Embarked  pclass_2  pclass_3  Embarked_C  Embarked_Q  Embarked_S
0    0  22.0      1      0        S         0         1           0           0           1
1    1  38.0      1      0        C         0         0           1           0           0
2    1  26.0      0      0        S         0         1           0           0           1
3    1  35.0      1      0        S         0         0           0           0           1
4    0  35.0      0      0        S         0         1           0           0           1

   sex   age  sibsp  parch  pclass_2  pclass_3  Embarked_Q  Embarked_S
0    0  22.0      1      0         0         1           0           1
1    1  38.0      1      0         0         0           0           0
2    1  26.0      0      0         0         1           0           1
3    1  35.0      1      0         0         0           0           1
4    0  35.0      0      0         0         1           0           1

Splitting the data into training and test:


We split the data into 70-80% for training and 30-20% for test, as
per the industry rules, since that is universally accepted across the industry.
Let’s see how we write it:
Packages to be installed:
from sklearn.model_selection import train_test_split
Syntax:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state =
42)
To check whether the data has split correctly or not:
print (X_train.shape)
print (X_test.shape)
print (y_train.shape)
print (y_test.shape)

Model Building for dataset:


Regression models:
Regression:
Packages to be installed:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score
from time import time
How to write it:
start = time()
regressor = LinearRegression()
regressor.fit(x_train, y_train)
end = time()
train_time_linear = end - start
regressor1 = regressor.score(x_test, y_test)
prediction_linear = regressor.predict(x_test)
exp_linear = explained_variance_score(prediction_linear, y_test)

classification Models:
Logistic Regression:
Package to be installed:
from sklearn.linear_model import LogisticRegression
How to write it:
logmodel = LogisticRegression()
logmodel.fit(x_train, y_train)
y_pred = logmodel.predict(x_test)
y_pred
y_test

BUILDING CONFUSION MATRIX:


 One of the key things in the whole project.
 Plays a very significant role in Classification models.
 After predicting on the test data we need the accuracy, which is what we
want in classification models; hence the confusion matrix comes to the
rescue.
 The confusion matrix tells us how accurate we are with our predicted
model.
 With the help of the confusion matrix we can find accuracy,
sensitivity, specificity, recall, precision, etc.
 The name suggests confusion, but in fact it gives us
clarity about our prediction.
 With the help of the C.M. we can choose methods to reduce the error of our
prediction.
 Without the confusion matrix we can't measure the error % in Classification
models.
Let's see how it looks:

                       Predicted values
                       0 (No)                  1 (Yes)
Actual values
0 (No)                 TN (True Negative)      FP (False Positive)
1 (Yes)                FN (False Negative)     TP (True Positive)

 We use Confusion Matrix to find the Error of our prediction.


Measures of the Confusion Matrix:

ACCURACY = (True Positive + True Negative) / (TP + FP + TN + FN)

SENSITIVITY = True Positive / (TP + FN)

SPECIFICITY = True Negative / (TN + FP)

RECALL: also called "Sensitivity".
Recall = True Positive / (TP + FN)
PRECISION: also called "Positive Predictive Value".
Precision = True Positive / (TP + FP)
NEGATIVE PREDICTIVE VALUE = True Negative / (TN + FN)
Note: values less than 0.5 are considered as 0: Negative, False;
values of 0.5 or more are considered as 1: Positive, True.
The Confusion Matrix tells us the maximum error of the prediction.
Package to be installed:
from sklearn.metrics import confusion_matrix
How to write it:
confusion_matrix(y_test,y_pred)

to get the classification report:


In this classification report we get elements like:

 F1 score
 Precision
 Recall
 Support
F1 Score:

 The F1 score can be seen as a weighted average of precision and


recall.
 The F1 score reaches its best value at 1 and its worst at 0.
 The relative contributions of precision and recall to the F1 score are equal.
 The formula for the F1 score is: F1 = 2 * (precision * recall) / (precision +
recall)
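For example, with a hypothetical precision of 0.75 and recall of 0.60:
F1 = 2 * (0.75 * 0.60) / (0.75 + 0.60) = 0.90 / 1.35 ≈ 0.67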

packages to be installed:
from sklearn.metrics import classification_report
How to write it:
print(classification_report(y_test,y_pred))

ROC AND AUC CURVE:


The ROC curve is a simple plot that shows the tradeoff between the true positive
rate and the false positive rate of a classifier for various choices of the probability
threshold.
y_pred_prob = model1.predict(X_test)
package to be installed:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Generate ROC curve values: fpr, tpr, thresholds


fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
Plot for ROC curve:
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
AUC (Area Under Curve):
roc_auc_score(y_test,predicted_df['Predicted_Class'])

Metrics and their report / values after the model evaluation:


These parameters are very key things in understanding
how well the model performs:
 Model – which model we are working with
 R-Squared value – (calculated for both significant and non-
significant variables)
 ROC Score – gives the trade-off between TPR and FPR
 Precision Score – one of the metrics of the C.M.
 Recall Score – another metric of the C.M.
 Accuracy Score – defines the accuracy of the model
 Kappa Score – one of the key things in the whole classification concept;
it is very important for seeing the model's accuracy and
performance.
cols = ['Model','R-Squared Value','ROC Score', 'Precision Score', 'Recall
Score','Accuracy Score','Kappa Score']
models_report = pd.DataFrame(columns = cols)
from sklearn import metrics
tmp1 = pd.Series({'Model': " Logistic Regression Base Model",
'R-Squared Value': model1.prsquared,
'ROC Score' : metrics.roc_auc_score(y_test,
predicted_df['Predicted_Class']),
'Precision Score': metrics.precision_score(y_test,
predicted_df['Predicted_Class']),
'Recall Score': metrics.recall_score(y_test, predicted_df['Predicted_Class']),
'Accuracy Score': metrics.accuracy_score(y_test,
predicted_df['Predicted_Class']),
'Kappa Score':metrics.cohen_kappa_score(y_test,
predicted_df['Predicted_Class'])})

model1_report = models_report.append(tmp1, ignore_index = True)


model1_report

Decision tree:
Packages to be installed:
from sklearn.tree import DecisionTreeRegressor   # or DecisionTreeClassifier
How to write it:
from sklearn.metrics import explained_variance_score
from time import time
# how to write Model Building
start = time()
decision = DecisionTreeRegressor()
decision.fit(x_train, y_train)
decc = decision.score(x_test,y_test)
# Prediction
decpredict = decision.predict(x_test)
# explained_variance_score - comparing pred vs actual
# confusion_matrix - comparing actual vs pred
# Score / Accuracy
exp_dec = explained_variance_score(decpredict, y_test)
end = time()
train_time_dec = end-start
exp_dec

To see the visualization for all algorithms:


Scatter_plot:
plt.figure(figsize=(20,7))
plt.scatter(y_test, name_of_model_predict, c = 'red')
plt.xlabel("Y Test")
plt.ylabel("Predicted Y")
plt.show()

plot:
plt.figure(figsize=(17,8))
plt.plot(y_test, label = "Test")
plt.plot(name_of_model_predict, label = "predict")
plt.show()

Note:
 If the plot is not a linear line (i.e., the relationship is non-linear), then we build a decision
tree for the regression problem.
 Otherwise, it's better to avoid using decision-tree regression models, as they are non-
linear.

Random Forest Regressor / Classifier:


Package to be installed:
from sklearn.ensemble import RandomForestRegressor   # or RandomForestClassifier
How to write it:
start = time()
rand_regr = RandomForestRegressor(n_estimators = 400, random_state=0)
rand_regr.fit(x_train, y_train)
random = rand_regr.score(x_test, y_test)
end = time()
train_test_rand = end - start
predict_rand = rand_regr.predict(x_test)
exp_rand = explained_variance_score(predict_rand , y_test)
exp_rand

Support Vector Machine


Regressor / Classifier:
from sklearn.svm import SVR   # or SVC for classification
How to write it:
start = time()
svr = SVR(kernel='linear')
svr.fit(x_train, y_train)
end = time()
train_time_svr = end-start
svr1 = svr.score(x_test, y_test)
prediction_svr = svr.predict(x_test)
exp_svr = explained_variance_score(prediction_svr, y_test)
exp_svr

note: in support vector machines there are 4 kernels:
 linear
 sigmoid
 polynomial
 RBF (Radial Basis Function)
 SVM is mainly used in AI (artificial intelligence).
By default the system chooses the RBF kernel; if we want, we can change it in the
kernel argument. Kernel = approach.

KNN (K-Nearest neighbors):


Package to be installed:
from sklearn.neighbors import KNeighborsClassifier   # or KNeighborsRegressor
How to write it:
model = KNeighborsClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

Facts of KNN:
 KNN is mostly used in clinical sectors.
 In KNN the dependent variable must be a factor / character.
 KNN is also called the LAZY algorithm among classification models, because
it takes the nearest value rather than an exact value; it works with a range.
 KNN works efficiently with small-sized datasets.
 In KNN the key rule is that the K value should be an odd number.
 The disadvantage of KNN is that if the k value gets increased, the answer can
change.

KNN (k-nearest neighbors):
 It's supervised machine learning.
 The range: how many k values (neighbors) are taken to find the value.

K-Means:
 It's unsupervised machine learning.
 k = how many clusters.
Gaussian Naive Bayes Theorem:
 The Naïve Bayes theorem is used when there is ad hoc work.
 It is effective compared with KNN.
 It takes less time to give accurate results.

Package to be installed:
from sklearn.naive_bayes import GaussianNB

how to write it:


nb_mdl = GaussianNB()
nb_mdl.fit(x_train, y_train)
y_pred_nb = nb_mdl.predict(x_test)

ENSEMBLE TECHNIQUES:
 We use ensemble techniques to improve the accuracy and the
performance of our predicted machine learning models.
 They are very important, fast, accurate, and very efficient in nature.
 Hence these techniques help our M.L. models a lot to
give results in the best way possible.
 Random Forest itself is widely accepted across the data science
industry as an ensemble technique in M.L.

Let’s see some of the important ensemble techniques used in M.L:

Use in Regression and classification models:


GradientBoostingRegressor/Classifier
Package to be installed:
from sklearn.ensemble import GradientBoostingRegressor   # or GradientBoostingClassifier
how to write it:
start = time()
est = GradientBoostingRegressor(n_estimators = 400, max_depth=5, loss='ls',
min_samples_split=2, learning_rate=0.1).fit(x_train, y_train)
gradient = est.score(x_test, y_test)
# Loss function - MAE, MAPE, MSE, RME
end = time()
train_test_est = end - start
predict_est = est.predict(x_test)
exp_est = explained_variance_score(predict_est, y_test)
exp_est

AdaBoostRegressor / Classifier:
Package to be installed:
from sklearn.ensemble import AdaBoostRegressor   # or AdaBoostClassifier
how to write it:
start = time()
ada = AdaBoostRegressor(n_estimators=50, learning_rate=0.2,
loss='exponential').fit(x_train, y_train)
adab = ada.score(x_test, y_test)
# Loss function - MAE, MAPE, MSE, RME
end = time()
train_test_ada = end - start
predict_ada = ada.predict(x_test)
exp_ada = explained_variance_score(predict_ada, y_test)
exp_ada

XGBoost Method:
XGBoost works fast, accurately, and efficiently with all the algorithms.

Package to be installed:
!pip install xgboost
from xgboost import XGBClassifier   # or XGBRegressor
how to write it:
classifier_xgb = XGBClassifier()
classifier_xgb.fit(x_train, y_train)

K-Fold method:
The k-fold method is one of the key methods in machine learning;
it is very accurate and a very frequently used concept across
M.L. models.

Package to be installed:
from sklearn.model_selection import cross_val_score
How to write it:
accuracy = cross_val_score(estimator = est, X = x_train, y=y_train, cv = 10)
accuracy

Note: here in this example we have taken gradient boosting regressor to see the
accuracy, but we can use any model to get accuracy with the k-fold method.
K-FOLD with Logistic Regression:
from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(estimator=logmodel, X=x_train, y=y_train, cv=15)


accuracy

K-FOLD with naiveBayesTheorem:


from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(estimator=nb_mdl,
X=x_train, y=y_train, cv=15)
accuracy

# K-FOLD with knn


from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(estimator=model,
X=x_train, y=y_train, cv=15)
accuracy

# K-FOLD with xgboost


from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(estimator=classifier_xgb,
X=x_train, y=y_train, cv=15)
accuracy

NOTE:
Please change the ensemble technique names to classifier or regressor
when you are dealing with the ensemble techniques. Note that the above are
examples of the code, and you need to make changes according to your needs.
Model comparison:
Model comparison on the basis of the Accuracy Score and Explained
Variance Score of the different models.
model_validation = pd.DataFrame({
'Model':['Decision Tree','Random Forest','Gradient Boosting','AdaBoost',
'Support Vector Machine','Linear Regression'],
'Score': [decc,random,gradient,adab,svr1,regressor1],
'Variance Score': [exp_dec,exp_rand,exp_est,exp_ada,exp_svr,exp_linear]
})
model_validation.sort_values(by='Score', ascending=False)

Model analysis:
Analysing the training time each model has taken:
Model = ['Decision Tree','Random Forest','Gradient Boosting','AdaBoost',
'Support Vector Machine','Linear Regression']
Train_time = [
train_time_dec,
train_test_rand,
train_test_est,
train_test_ada,
train_time_svr,
train_time_linear
]
index = np.arange(len(Model))
plt.bar(index, Train_time)
plt.xlabel("Machine Learning Models", fontsize =20)
plt.ylabel("Training Time", fontsize = 25)
plt.xticks(index, Model, fontsize=10)
plt.title("Comparison of Training Time taken by all the ML Models")
plt.show()

THANK YOU
&
ALL THE BEST
