Inductive Learning and Machine Learning

Machine learning enables machines to learn from data and improve their performance without being explicitly programmed. It involves building models from sample data called training data to make predictions or decisions on new data. Some key types of machine learning include supervised learning where labeled data is used to train models to predict output values, unsupervised learning to find hidden patterns in unlabeled data, and reinforcement learning where agents learn through trial-and-error interactions with a dynamic environment. Machine learning has various applications and brings benefits like solving complex problems, extracting useful insights from vast amounts of data, and automating decision making.


• What is Machine Learning

• In the real world, we are surrounded by humans who can learn everything from
their experiences with their learning capability, and we have computers or
machines which work on our instructions. But can a machine also learn from
experiences or past data like a human does? So here comes the role of Machine
Learning.
• Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experience on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:
• Machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed.
• With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together to create predictive models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the better the performance.
• A machine has the ability to learn if it can improve its performance by
gaining more data.
• How does Machine Learning work
• A Machine Learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data, as a huge amount of data helps to build a better model which predicts the output more accurately.
• Suppose we have a complex problem where we need to make some predictions. Instead of writing code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms, the machine builds the logic from the data and predicts the output. Machine learning has changed our way of thinking about such problems. The below block diagram explains the working of a Machine Learning algorithm:
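Since the block diagram itself is not reproduced here, the following minimal sketch illustrates the same idea in code: feed training data to a generic algorithm, let it build the logic, and ask it to predict the output for new data. The tiny dataset and the choice of a decision tree are illustrative assumptions, not part of the original text.

from sklearn.tree import DecisionTreeClassifier

# historical (training) data: [hours studied, hours slept] -> pass (1) / fail (0)
X_train = [[2, 9], [1, 5], [5, 8], [6, 7], [0, 4]]
y_train = [0, 0, 1, 1, 0]

model = DecisionTreeClassifier()   # a generic learning algorithm
model.fit(X_train, y_train)        # the machine "builds the logic" from the data

print(model.predict([[4, 8]]))     # predicted output for new, unseen data, e.g. [1]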
Features of Machine Learning:
•Machine learning uses data to detect various patterns in a given dataset.
•It can learn from past data and improve automatically.
•It is a data-driven technology.
•Machine learning is very similar to data mining, as it also deals with huge amounts of data.
• Need for Machine Learning
• The need for machine learning is increasing day by day. The reason behind the need for machine learning is that it is capable of doing tasks that are too complex for a person to implement directly. As humans, we have some limitations: we cannot access and process huge amounts of data manually, so we need computer systems, and machine learning makes things easy for us.
• We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.
• The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyze user interests and recommend products accordingly.
• Following are some key points which show the importance of Machine
Learning:
• Rapid increase in the production of data
• Solving complex problems which are difficult for a human
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful information from data.
• Classification of Machine Learning
• At a broad level, machine learning can be classified into three types:
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Supervised Learning
• Supervised learning is a type of machine learning method in which we provide
sample labeled data to the machine learning system in order to train it, and on
that basis, it predicts the output.
• The system creates a model using labeled data to understand the dataset and learn about each data point. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output or not.
• The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, much like a student learning under the supervision of a teacher. An example of supervised learning is spam filtering.
Supervised learning can be grouped further in two categories of algorithms:
•Classification
•Regression
• Classification Algorithm in Machine Learning
• Supervised Machine Learning algorithms can be broadly classified into Regression and Classification algorithms. Regression algorithms predict the output for continuous values, but to predict categorical values, we need Classification algorithms.
• What is the Classification Algorithm?
• The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.
• Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with
the corresponding output.
• In a classification algorithm, a discrete output function y is mapped to the input variable x:
• y = f(x), where y is the categorical output
• The best example of an ML classification algorithm is Email Spam Detector.
• The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.
• Classification algorithms can be better understood using the below diagram. In the diagram, there are two classes, Class A and Class B. Within each class, the observations have features that are similar to each other and dissimilar to those of the other class.
• The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
• Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
• Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.
• Learners in Classification Problems:
• In the classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner's case, classification is done on the basis of the most closely related data stored in the training dataset. It takes less time in training but more time for prediction.
Examples: K-NN algorithm, Case-based reasoning
• Eager Learners: Eager learners develop a classification model based on the training dataset before receiving the test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction. Examples: Decision Trees, Naïve Bayes, ANN.
• Types of ML Classification Algorithms:
• Classification algorithms can be mainly divided into two categories:
• Linear Models
• Logistic Regression
• Support Vector Machines
• Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
• Evaluating a Classification model:
• Once our model is complete, it is necessary to evaluate its performance, whether it is a Classification or a Regression model. For evaluating a Classification model, we have the following ways:
• Log Loss or Cross-Entropy Loss:
• It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
• For a good binary Classification model, the value of log loss should be near 0.
• The value of log loss increases if the predicted value deviates from the actual value.
• A lower log loss represents a higher accuracy of the model.
• For binary classification, cross-entropy can be calculated as:
  Loss = -(y log(p) + (1 - y) log(1 - p)), where y is the actual label (0 or 1) and p is the predicted probability of class 1.
• Cross-Entropy Loss Function
• When working on a Machine Learning or Deep Learning problem, loss/cost functions are used to optimize the model during training. The objective is almost always to minimize the loss function; the lower the loss, the better the model. Cross-entropy loss is one of the most important cost functions. It is used to optimize classification models. Understanding cross-entropy relies on understanding the Softmax activation function.
• Consider a 4-class classification task where an image is classified as either a dog,
cat, horse or cheetah.
• In the above Figure, Softmax converts logits into probabilities. The purpose of the
Cross-Entropy is to take the output probabilities (P) and measure the distance
from the truth values (as shown in Figure below).

For the example above the desired output is [1,0,0,0] for the class dog but the model
outputs [0.775, 0.116, 0.039, 0.070] .
The objective is to make the model output be as close as possible to the desired output (truth
values). During model training, the model weights are iteratively adjusted accordingly with the
aim of minimizing the Cross-Entropy loss. The process of adjusting the weights is what
defines model training and as the model keeps training and the loss is getting minimized, we
say that the model is learning.

The concept of cross-entropy traces back to the field of Information Theory, where Claude Shannon introduced the concept of entropy in 1948. Before diving into the cross-entropy cost function, let us introduce entropy.
Entropy
The entropy of a random variable X is the level of uncertainty inherent in the variable's possible outcomes. For a probability distribution p(x) of a random variable X, entropy is defined as follows:

H(X) = -Σ p(x) log(p(x)), summed over all outcomes x

Reason for the negative sign: log(p(x)) < 0 for all p(x) in (0,1). p(x) is a probability distribution, and therefore its values must range between 0 and 1.
The greater the value of entropy H(X), the greater the uncertainty of the probability distribution, and the smaller the value, the less the uncertainty.
• Example
• Consider the following 3 “containers” with shapes: triangles and circles
• Container 1: The probability of picking a triangle is 26/30 and the probability of picking a
circle is 4/30. For this reason, the probability of picking one shape and/or not picking another is
more certain.
• Container 2: The probability of picking a triangle is 14/30 and the probability of picking a circle is 16/30. There is an almost 50-50 chance of picking either shape, so there is less certainty about the outcome than in container 1.
• Container 3: A shape picked from container 3 is highly likely to be a circle. The probability of picking a circle is 29/30 and the probability of picking a triangle is 1/30. It is highly certain that the shape picked will be a circle.
• Let us calculate the entropy to verify our assertions about the certainty of picking a given shape.
• As expected, the entropy for the first and third containers is smaller than for the second one. This is because the probability of picking a given shape is more certain in containers 1 and 3 than in container 2.
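A small NumPy sketch that computes the entropy of the three containers; the helper function entropy and the use of base-2 logarithms are illustrative assumptions for this sketch.

import numpy as np

def entropy(probs, base=2):
    # Shannon entropy: H(X) = -sum p(x) * log p(x)
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs)) / np.log(base)

containers = {
    "Container 1": [26/30, 4/30],
    "Container 2": [14/30, 16/30],
    "Container 3": [29/30, 1/30],
}
for name, p in containers.items():
    print(name, round(entropy(p), 3))
# Container 1 ~0.567, Container 2 ~0.997, Container 3 ~0.211 (bits):
# container 2 has the most uncertainty, containers 1 and 3 the least.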
Cross-Entropy Loss Function
• Also called logarithmic loss, log loss or logistic loss. Each predicted class probability is compared to the actual class's desired output (0 or 1), and a score/loss is calculated that penalizes the probability based on how far it is from the actual expected value. The penalty is logarithmic in nature, yielding a large score for large differences close to 1 and a small score for small differences tending to 0.
• Cross-entropy loss is used when adjusting model weights during training. The aim is to
minimize the loss, i.e, the smaller the loss the better the model. A perfect model has a cross-
entropy loss of 0.
Cross-entropy is defined as

L(T, P) = -Σ T_i log(P_i), summed over the classes i,

where T_i is the truth value (one-hot) and P_i is the Softmax probability for the i-th class.

Binary Cross-Entropy Loss

For binary classification, we have binary cross-entropy defined as

L(y, p) = -(y log(p) + (1 - y) log(1 - p))

• Binary cross-entropy is often calculated as the average cross-entropy across all data examples:

L = -(1/N) Σ [y_j log(p_j) + (1 - y_j) log(1 - p_j)], summed over the N examples j

• Example
• Consider a classification problem with given Softmax probabilities (S) and labels (T). The objective is to calculate the cross-entropy loss given this information.

The categorical cross-entropy is computed as follows.
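The figure with the specific probabilities and labels is not reproduced here, so the short sketch below reuses the dog example from earlier (truth [1,0,0,0], softmax output [0.775, 0.116, 0.039, 0.070]) to show the computation; the variable names are illustrative.

import numpy as np

T = np.array([1, 0, 0, 0])                    # one-hot truth values
P = np.array([0.775, 0.116, 0.039, 0.070])    # softmax probabilities

loss = -np.sum(T * np.log(P))                 # categorical cross-entropy
print(round(loss, 4))                         # ~0.2549; only the true class contributes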


• Softmax is a continuously differentiable function. This makes it possible to calculate the derivative of the loss function with respect to every weight in the neural network. This property allows the model to adjust the weights accordingly to minimize the loss function (bringing the model output close to the true values).
• Assume that after some iterations of model training, the model outputs a new vector of logits for which the loss is 0.095. This is less than the previous loss of 0.3677, implying that the model is learning. The process of optimization (adjusting weights so that the output is close to the true values) continues until training is over.
Keras provides the following cross-entropy loss functions: binary, categorical, and sparse categorical cross-entropy.

Categorical Cross-Entropy and Sparse Categorical Cross-Entropy


Both categorical cross-entropy and sparse categorical cross-entropy have the same loss function, as defined above. The only difference between the two is in how the truth labels are defined.
•Categorical cross-entropy is used when true labels are one-hot encoded; for example, for a 3-class classification problem the true values are [1,0,0], [0,1,0] and [0,0,1].
•In sparse categorical cross-entropy, truth labels are integer encoded, for example [1], [2] and [3] for a 3-class problem.
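The sketch below (assuming TensorFlow/Keras is installed) shows that the two losses agree and differ only in the label encoding; the example probabilities are reused from the dog example, and the 0-based integer label is an assumption of this sketch.

import tensorflow as tf

y_pred = [[0.775, 0.116, 0.039, 0.070]]   # softmax output for one image

# one-hot encoded truth ("dog" as class 0) -> categorical cross-entropy
cce = tf.keras.losses.CategoricalCrossentropy()
print(float(cce([[1.0, 0.0, 0.0, 0.0]], y_pred)))    # ~0.255

# integer-encoded truth -> sparse categorical cross-entropy
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(float(scce([0], y_pred)))                      # ~0.255, the same value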
• Confusion Matrix:
• The confusion matrix provides us with a matrix/table as output that describes the performance of the model.
• It is also known as the error matrix.
• The matrix summarizes the prediction results, giving the total numbers of correct and incorrect predictions. The matrix looks like the table below:
                        Actual Positive     Actual Negative

Predicted Positive      True Positive       False Positive

Predicted Negative      False Negative      True Negative
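A brief sketch of building a confusion matrix with scikit-learn; the actual and predicted labels below are made-up illustrative values.

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # predicted classes

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] (rows = actual, columns = predicted)
print(confusion_matrix(y_true, y_pred))   # [[3 1]
                                          #  [1 3]]
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total = 6/8 = 0.75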


• AUC-ROC curve:
• ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
• It is a graph that shows the performance of the classification model at different thresholds.
• To visualize the performance of the multi-class classification model, we use the AUC-ROC
Curve.
• The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
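A quick sketch of computing the ROC curve and AUC with scikit-learn; the true labels and predicted probabilities are illustrative values.

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]     # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPR (x-axis) and TPR (y-axis) at each threshold
print(roc_auc_score(y_true, y_score))               # area under the curve, 0.75 here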
• Use cases of Classification Algorithms
• Classification algorithms can be used in different places. Below are some popular use cases
of Classification Algorithms:
• Email Spam Detection
• Speech Recognition
• Identifications of Cancer tumor cells.
• Drugs Classification
• Biometric Identification, etc.
• Supervised learning
• Supervised learning, as the name indicates, has the presence of a supervisor as a teacher. Basically, supervised learning is when we teach or train the machine using data that is well labeled, which means some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
• For instance, suppose you are given a basket filled with different kinds of fruits.
Now the first step is to train the machine with all different fruits one by one like
this:
• It is necessary then to generalize from the samples and the mapping so that it can
be used to estimate the output for new samples in the future.
• In practice, estimating the function is almost always too complicated, so we seek
excellent approximations.

•If the shape of the object is rounded, has a depression at the top, and is red in color, then it will be labeled as Apple.
•If the shape of the object is a long curving cylinder with a green-yellow color, then it will be labeled as Banana.

Now suppose that after training, you give the machine a new, separate fruit, say a banana from the basket, and ask it to identify it.
Since the machine has already learned from the previous data, it now has to use that knowledge wisely: it will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket containing fruits) and then applies that knowledge to the test data (the new fruit).
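A tiny sketch of the fruit example as a supervised classifier; the binary shape/colour features and the decision tree model are illustrative assumptions made for this sketch.

from sklearn.tree import DecisionTreeClassifier

# features: [rounded with depression, long curving cylinder, red, green-yellow]
X_train = [[1, 0, 1, 0],   # apple
           [0, 1, 0, 1],   # banana
           [1, 0, 1, 0],   # apple
           [0, 1, 0, 1]]   # banana
y_train = ["Apple", "Banana", "Apple", "Banana"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[0, 1, 0, 1]]))   # a new long, green-yellow fruit -> ['Banana']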
Supervised learning is classified into two categories of algorithms:

•Classification: A classification problem is when the output variable is a category, such as "red" or "blue", or "disease" and "no disease".
•Regression: A regression problem is when the output variable is a real value, such as "dollars" or "weight".
• Supervised learning deals with or learns with “labeled” data. This implies that
some data is already tagged with the correct answer.
• Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
• Advantages:-
• Supervised learning allows collecting data and produces a data output from previous experience.
• It helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world computation problems.
• Disadvantages:-
• Classifying big data can be challenging.
• Training for supervised learning requires a lot of computation time.
• Logistic Regression in Machine Learning
• Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or false, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:
Note: Logistic regression uses the concept of predictive modeling like regression, which is why it is called logistic regression; however, it is used to classify samples and therefore falls under classification algorithms.
• Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
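A small sketch of the sigmoid (logistic) function and the thresholding idea using NumPy; the 0.5 threshold is the usual default assumption, not something fixed by the text above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # maps any real value into (0, 1)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
print(probs)                            # approx. [0.018 0.269 0.5 0.731 0.982]
print((probs >= 0.5).astype(int))       # values at or above the threshold map to class 1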
• Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variable should not have multi-collinearity.
• Logistic Regression Equation:
• The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:
• We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

•In Logistic Regression, y can be between 0 and 1 only, so we divide the above equation by (1 - y):

y / (1 - y); this is 0 for y = 0 and infinity for y = 1

• But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it becomes:

log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
•Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
•Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
•Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
• Python Implementation of Logistic Regression (Binomial)
• To understand the implementation of Logistic Regression in Python, we will use the
below example:
• Example: We are given a dataset containing information about various users, obtained from a social networking site. A car manufacturer has recently launched a new SUV, and the company wants to check how many users from the dataset want to purchase the car.
• For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent
variables).
• Steps in Logistic Regression: To implement the Logistic Regression using Python, we
will use the same steps as we have done in previous topics of Regression. Below are the
steps:
• Data Pre-processing step
• Fitting Logistic Regression to the Training set
• Predicting the test result
• Test accuracy of the result(Creation of Confusion matrix)
• Visualizing the test set result.
• Data Pre-processing step: In this step, we will pre-process/prepare the data so
that we can use it in our code efficiently. It will be the same as we have done in
Data pre-processing topic. The code for this is given below:
# Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set= pd.read_csv('user_data.csv')
By executing the above lines of code, we will get the dataset as the output. Consider the
given image:
• Now, we will extract the dependent and independent variables from the given dataset.
Below is the code for it:
# Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
• In the above code, we have taken [2, 3] for x because our independent variables are age
and salary, which are at index 2, 3. And we have taken 4 for y variable because our
dependent variable is at index 4. The output will be:
• Now we will split the dataset into a training set and test set. Below is the code
for it:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
• The output for this is given below:
For test set:
For training set:
# feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
The scaled output is given below:
• 2. Fitting Logistic Regression to the Training set:
• Our dataset is now well prepared, and we will train the model using the training set. To fit the model to the training set, we will import the LogisticRegression class of the sklearn library.
• After importing the class, we will create a classifier object and use it to fit the logistic regression model to the training set. Below is the code for it:
# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
• Output: By executing the above code, we will get the below output:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
• Hence our model is well fitted to the training set.
• Predicting the Test Result
• Our model is well trained on the training set, so we will now predict the result by using
test set data. Below is the code for it:
# Predicting the test set result
y_pred= classifier.predict(x_test)
• In the above code, we have created a y_pred vector to predict the test set result.
• Output: By executing the above code, a new vector (y_pred) will be created under the
variable explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or
not purchase the car.
• Test Accuracy of the result
• Now we will create the confusion matrix here to check the accuracy of the classification. To create it, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters, y_true (the actual values) and y_pred (the predicted values returned by the classifier). Below is the code for it:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:

We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above output, we can see 65 + 24 = 89 correct predictions and 8 + 3 = 11 incorrect predictions.
• 5. Visualizing the training set result
• Finally, we will visualize the training set result. To visualize the result, we will use ListedColormap class of
matplotlib library. Below is the code for it:
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
• In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables, x_set and y_set, to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid that ranges from the minimum value minus 1 to the maximum value plus 1 of each feature, with pixel points at a resolution of 0.01.
• To create a filled contour, we have used the mtp.contourf command, which creates regions of the provided colors (purple and green). In this function, we have passed classifier.predict to show the class predicted by the classifier at each point.
• Output: By executing the above code, we will get the below output:
• The graph can be explained in the below points:
• In the above graph, we can see that there are some Green points within the green
region and Purple points within the purple region.
• All these data points are the observation points from the training set, which shows the
result for purchased variables.
• This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.
• The purple point observations are for which purchased (dependent variable) is
probably 0, i.e., users who did not purchase the SUV car.
• The green point observations are for which purchased (dependent variable) is
probably 1 means user who purchased the SUV car.
• We can also estimate from the graph that the users who are younger with low salary,
did not purchase the car, whereas older users with high estimated salary purchased
the car.
• But there are some purple points in the green region (buying the car) and some green points in the purple region (not buying the car). These are the observations that go against the general trend: some younger users with a high estimated salary purchased the car, whereas some older users with a low estimated salary did not.
• The goal of the classifier:
• We have successfully visualized the training set result for the logistic regression,
and our goal for this classification is to divide the users who purchased the SUV
car and who did not purchase the car. So from the output graph, we can clearly
see the two regions (Purple and Green) with the observation points. The Purple
region is for those users who didn't buy the car, and Green Region is for those
users who purchased the car.
• Linear Classifier:
• As we can see from the graph, the classifier is a straight line, i.e., linear in nature, as we have used a linear model for Logistic Regression. In further topics, we will learn about non-linear classifiers.
• Visualizing the test set result:
• Our model is well trained using the training dataset. Now, we will visualize the result for new observations (the test set). The code for the test set remains the same as above, except that here we will use x_test and y_test instead of x_train and y_train. Below is the code for it:
# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
The above graph shows the test set result. As we can see, the graph is divided into two regions (purple and green), with green observations in the green region and purple observations in the purple region, so we can say it is a good prediction and a good model. Some of the green and purple data points are in the wrong regions, which can be ignored, as we have already counted this error using the confusion matrix (11 incorrect outputs).
Hence our model is pretty good and ready to make new predictions for this classification problem.
• Unsupervised Learning :
It’s a type of learning where we don’t give a target to our model while training i.e. training
model has only input parameter values. The model by itself has to find which way it can learn.
Data-set in Figure A is mall data that contains information of its clients that subscribe to them.
Once subscribed they are provided a membership card and so the mall has complete information
about the customer and his/her every purchase. Now using this data and unsupervised learning
techniques, the mall can easily group clients based on the parameters we are feeding in.
•Unlabeled data: Data only contains a value for input parameters, there is no targeted
value(output). It is easy to collect as compared to labeled one in the Supervised approach.
• Types of Unsupervised Learning:-

• Clustering: Broadly, this technique is applied to group data based on the different patterns our model finds. For example, in the above figure we are not given an output parameter value, so this technique will be used to group clients based on the input parameters provided by our data (a brief K-Means sketch is given after the algorithm list below).
• Association: This technique is a rule-based ML technique that finds very useful relations between the parameters of a large dataset. For example, shopping stores use algorithms based on this technique to find the relationship between the sales of one product and the sales of others, based on customer behavior. Once trained well, such models can be used to increase sales by planning different offers.
• Some algorithms:
• K-Means Clustering
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
• BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies
• Hierarchical Clustering
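As mentioned above, here is a minimal sketch of grouping mall clients with K-Means from scikit-learn; the two features, the handful of made-up clients, and the choice of three clusters are illustrative assumptions.

from sklearn.cluster import KMeans
import numpy as np

# [annual income (k$), spending score] for a few made-up clients
X = np.array([[15, 80], [16, 85], [70, 20], [75, 15], [40, 50], [42, 48]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)    # no target values are given to the model
print(labels)                     # each client is assigned to one of 3 groups, e.g. [2 2 1 1 0 0]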
• Semi-supervised Learning:
As the name suggests, semi-supervised learning lies between supervised and unsupervised techniques. We use these techniques when we are dealing with data of which a small portion is labeled and the rest, a large portion, is unlabeled. We can use unsupervised techniques to predict labels and then feed these labels to supervised techniques. This technique is mostly applicable to image datasets, where usually not all images are labeled.
Reinforcement Learning:
In this technique, the model keeps improving its performance using reward feedback to learn the behavior or pattern. These algorithms are specific to a particular problem, e.g., Google's self-driving car, or AlphaGo, where a bot competes with humans and even with itself to become a better and better Go player. Each time we feed in data, the agent learns and adds the data to its knowledge, which becomes its training data. So, the more it learns, the better trained and hence the more experienced it becomes. (A small Q-learning sketch follows the list of methods below.)
•Agents observe input.
•An agent performs an action by making some decisions.
•After performing the action, the agent receives a reward, reinforces its behavior accordingly, and stores the information as a state-action pair.
•Temporal Difference (TD)
•Q-Learning
•Deep Adversarial Networks
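As mentioned above, here is a tiny Q-learning sketch on a made-up one-dimensional grid world; the states, rewards, and hyperparameters are illustrative assumptions, not part of the original text.

import random

n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate

for episode in range(500):
    s = 0                               # start in the leftmost state
    while s != n_states - 1:            # the rightmost state is the rewarded goal
        if random.random() < epsilon:
            a = random.randrange(n_actions)         # explore
        else:
            a = 0 if Q[s][0] >= Q[s][1] else 1      # exploit the best known action
        s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0  # reward only at the goal
        # temporal-difference (Q-learning) update of the state-action value
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)   # "move right" values grow as states get closer to the rewarded goal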
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified nor
labeled and allowing the algorithm to act on that information without guidance. Here the task of
the machine is to group unsorted information according to similarities, patterns, and differences
without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore, the machine is restricted to finding the hidden structure in the unlabeled data by itself.
For instance, suppose it is given an image having both dogs and cats which it has never seen.
• Thus the machine has no idea about the features of dogs and cats, so it can't categorize the image as 'dogs and cats'. But it can categorize the pictures according to their similarities, patterns, and differences, i.e., it can easily split them into two parts: the first part may contain all pictures having dogs in them, and the second part may contain all pictures having cats in them. The machine has not learned anything beforehand, which means there is no training data or prior examples.
• It allows the model to work on its own to discover patterns and information that
was previously undetected. It mainly deals with unlabelled data.
• Unsupervised learning is classified into two categories of algorithms:

• Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
• Types of Unsupervised Learning:-
• Clustering
• Exclusive (partitioning)
• Agglomerative
• Overlapping
• Probabilistic
• Clustering Types:-
• Hierarchical clustering
• K-means clustering
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
Supervised vs. Unsupervised Machine Learning

Parameters                  Supervised machine learning                    Unsupervised machine learning

Input Data                  Algorithms are trained using labeled data.     Algorithms are used against data that is not labeled.

Computational Complexity    Simpler method                                 Computationally complex

Accuracy                    Highly accurate                                Less accurate


What is Learning for a machine?

A machine is said to be learning from past experiences (data fed in) with respect to some class of tasks if its performance in a given task improves with experience. For example, assume that a machine has to predict whether a customer will buy a specific product, let's say "Antivirus", this year or not. The machine will do it by looking at previous knowledge/past experience, i.e., the data of products that the customer has bought every year. If he buys Antivirus every year, then there is a high probability that the customer is going to buy an antivirus this year as well. This is how machine learning works at the basic conceptual level.
Supervised Learning :
Supervised learning is when the model is trained on a labelled dataset. A labelled dataset is one that has both input and output parameters. In this type of learning, both the training and validation datasets are labelled, as shown in the figures below.
• Both the above figures have labelled data set –
• Figure A: It is a dataset of a shopping store that is useful in predicting whether a
customer will purchase a particular product under consideration or not based on
his/ her gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0
means that the customer won’t purchase it.
• Figure B: It is a Meteorological dataset that serves the purpose of predicting wind
speed based on different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
Output: Wind Speed
• Training the system:
While training the model, the data is usually split in the ratio 80:20, i.e., 80% as training data and the rest as testing data. For the training data, we feed in the input as well as the output. The model learns from the training data only. We use different machine learning algorithms (discussed in detail in later topics) to build our model. By learning, we mean that the model will build some logic of its own.
Once the model is ready, it can be tested. At the time of testing, the input is fed from the remaining 20% of the data, which the model has never seen before; the model predicts some value, and we compare it with the actual output to calculate the accuracy.
• Types of Supervised Learning:
• Classification: It is a Supervised Learning task where the output has defined labels (discrete values). For example, in Figure A above, the output Purchased has defined labels, i.e., 0 or 1; 1 means the customer will purchase and 0 means the customer won't purchase. The goal here is to predict discrete values belonging to a particular class and evaluate them on the basis of accuracy.
It can be either binary or multi-class classification. In binary classification, the
model predicts either 0 or 1; yes or no but in the case of multi-class classification,
the model predicts more than one class.
Example: Gmail classifies mails in more than one class like social, promotions,
updates, forums.
• Regression: It is a Supervised Learning task where the output has a continuous value.
For example, in Figure B above, the output Wind Speed does not have discrete values but is continuous within a particular range. The goal here is to predict a value as close to the actual output value as our model can, and evaluation is then done by calculating the error value. The smaller the error, the greater the accuracy of our regression model.
• Example of Supervised Learning Algorithms:
• Linear Regression
• Nearest Neighbor
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest
• Understanding Hypothesis
• In most supervised machine learning algorithms, our main goal is to find a possible hypothesis from the hypothesis space that maps the inputs to the proper outputs.
The following figure shows the common method of finding a possible hypothesis from the hypothesis space:
• Hypothesis Space (H):
The hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm determines the single best hypothesis that describes the target function or the outputs.
• Hypothesis (h):
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. To better understand the hypothesis space and hypothesis, consider the following coordinate plot that shows the distribution of some data:
Say suppose we have test data for which we have to determine the outputs or results. The
test data is as shown below:
We can predict the outcomes by dividing the coordinate as shown below:
So the test data would yield the following result:
But note here that we could have divided the coordinate plane as:
The way in which the coordinate plane is divided depends on the data, the algorithm and the constraints.
•All the legal possible ways in which we can divide the coordinate plane to predict the outcome of the test data together compose the hypothesis space.
•Each individual possible way is known as a hypothesis.
Hence, in this example the hypothesis space would be like:
• Understanding Hypothesis Testing
• A hypothesis is a statement about the given problem. Hypothesis testing is a statistical method that is used for making a statistical decision using experimental data. Hypothesis testing is basically an assumption that we make about a population parameter. It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.
Example:
You claim that the average age of students in the class is 30, or that boys are taller than girls. These are examples of claims that we assume and need some statistical way to prove. We need a mathematical way to conclude whether what we are assuming is true.
• Need for Hypothesis Testing
Hypothesis testing is an important procedure in statistics. Hypothesis testing
evaluates two mutually exclusive population statements to determine which
statement is most supported by sample data. When we say that the findings are
statistically significant, it is thanks to hypothesis testing.
• Null hypothesis (H0): In statistics, the null hypothesis is a general given statement or default position that there is no relationship between two measured cases or no difference among groups.
• In other words, it is a basic assumption made based on knowledge of the problem.
• Example: A company's production is 50 units per day.
• Alternative hypothesis (H1): The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis.
• Example: A company's production is not equal to 50 units per day.
• Level of significance
• It refers to the degree of significance at which we accept or reject the null hypothesis. 100% accuracy is not possible for accepting a hypothesis, so we select a level of significance, which is usually 5%. It is normally denoted by α (alpha) and is generally 0.05 or 5%, which means your output should be 95% likely to give a similar kind of result in each sample.
• P-value
• The P-value, or calculated probability, is the probability of finding the observed (or more extreme) results when the null hypothesis (H0) of the given problem is true. If your P-value is less than the chosen significance level, then you reject the null hypothesis, i.e., you accept that your sample supports the alternative hypothesis.
• Example:
We are given a coin, and it is not known whether it is fair or tricky, so let's decide the null and alternative hypotheses:
• Null Hypothesis(H0): a coin is a fair coin.
• Alternative Hypothesis(H1) : a coin is a tricky coin.
• alpha = 5% or 0.05
• Now let’s toss the coin and calculate p-value (probability value).

• Toss the coin a 1st time and assume that the result is heads: the p-value is 50% (as heads and tails have equal probability).
• Toss the coin a 2nd time and assume that the result again is heads: now the p-value is 50/2 = 25%.
• Similarly, if we toss the coin 6 consecutive times and get all heads, the p-value is about 1.5%.
• But we set our significance level at 5% (i.e., 95% confidence), and the p-value here is below that level, i.e., our null hypothesis does not hold, so we reject it and propose that this coin is a tricky coin, which indeed seems to be the case because it gave us 6 consecutive heads.
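A quick sketch of the coin example in Python: under the null hypothesis of a fair coin, the p-value after n consecutive heads is 0.5 ** n.

for n in range(1, 7):
    p_value = 0.5 ** n
    print(f"{n} heads in a row: p-value = {p_value:.3%}")
# 6 heads -> about 1.562%, which is below the 5% significance level,
# so we reject the null hypothesis that the coin is fair.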
• Error in Hypothesis Testing
• Type I error: When we reject the null hypothesis, although that hypothesis was
true. Type I error is denoted by alpha.
• Type II errors: When we accept the null hypothesis but it is false. Type II errors
are denoted by beta.
• Type I Errors — False Positives (Alpha)
• There will almost always be a possibility of wrongly rejecting a null hypothesis
when it should not have been rejected while performing hypothesis tests. Data
scientists have the option of selecting an alpha (𝛼) confidence level threshold that
they will use to accept or reject the null hypothesis. This confidence threshold,
which is in other words a level of trust, is also the likelihood that you will reject
the null hypothesis when it is actually valid. This case is a type I error, which is
more generally referred to as a false positive.
• In hypothesis testing, you need to decide at what degree of confidence, or trust, you can dismiss the null hypothesis. If a scientist were to set alpha (α) = 0.05, this means that there is a 5 percent probability that they would reject the null
hypothesis when it is actually valid. Another way to think about this is that you
would expect the hypothesis to be rejected once, simply by chance, if you repeated
this experiment 20 times. Generally speaking, an alpha level of 0.05 is adequate to
show that certain findings are statistically significant.
• Type II Errors — False Negatives (Beta)
• Beta (β) is another type of error, which is the possibility that you have not
rejected the null hypothesis when it is actually incorrect. Type II errors are also
known as false negatives. Beta is linked to something called Power, which, given
that the null hypothesis is actually false, is the likelihood of rejecting it. When
planning an experiment, researchers will always select the power level they want
and get their Type II error rate from that.
• Is one more important than the other?
• Various situations allow researchers to mitigate one form of error over the other.
The two types of error are inversely related to each other; decreasing type I
errors will increase type II errors, and vice versa. To decide when a type I or type
II error would be safer, let’s go through a couple of scenarios.
• Imagine that you are on a jury and that you need to determine if an individual is
going to be sent to jail for a crime. Since you don’t know the truth as to whether
or not this person committed a crime, which would be worse, a type I or type II
error? I hope you say that a type I error is going to be worse. A type I error would
suggest that, if they were really not guilty, you would send them to jail! The jury
has dismissed the null hypothesis that the defendant is innocent while he has not
committed any crime. You would also not want to make a type II error here
because this would mean that someone has actually committed a crime and the
jury is letting them get away with it.
• Let’s take another example of a medical situation. A patient with multiple
migraine headaches is referred to the doctor for an MRI head scan. The doctor
believes that a brain tumor may be present in the patient. Is it going to be worse
for this situation to have a type I or type II error? Let’s hope you said that a Type
II error would be worse. A type II error would mean that there is a brain tumor in
the patient, but the doctor insists that there is nothing wrong with them! In other
words, the null hypothesis is that the person does not have a brain tumor, and this
hypothesis is not denied. This implies that, even though they are genuinely far
from it, the person is diagnosed as healthy.
• As researchers design experiments and make choices about the degrees of alpha
level and power, they need to weigh the risks of Type I and Type II errors in order
to prepare for whatever type of error they want to mitigate.
• Inductive Learning
• Machine learning is one of the most important subfields of artificial intelligence. It
has been viewed as a viable way of avoiding the knowledge bottleneck problem in
developing knowledge-based systems.
• Inductive Learning, also known as Concept Learning, is how AI systems attempt to derive a generalized rule from observations.
• To generate a set of classification rules, Inductive Learning Algorithms (ILAs) are used. These generated rules are in the "If this then that" format.
• These rules determine the state of an entity at each iteration step in Learning and
how the Learning can be effectively changed by adding more rules to the existing
ruleset.
• When examples of a function's inputs and outputs are fed into the AI system, inductive learning attempts to learn the function so that it can be applied to new data.
• The Fundamental Concept of Inductive Learning
• There are two methods for obtaining knowledge in the real world: first, from
domain experts, and second, from machine learning.
• Domain experts are not very useful or reliable for large amounts of data. As a result, for this task we adopt a machine learning approach.
• One machine learning method is to replicate the logic of the 'experts' in algorithms, but this work may be very complex, time-consuming, and expensive.
• As a result, an alternative is inductive algorithms, which generate a strategy for performing a task without requiring instruction at each step.
• According to Jason Brownlee in his article “Basic Concepts in Machine
Learning,” an excellent method to understand how Inductive Learning works is,
for example, if we are given input samples (x) and output samples (f(x)) from the
perspective of inductive Learning, and the problem is to estimate the function (f).
• Inductive Learning Algorithm
• Inductive Learning Algorithm (ILA) is an iterative and inductive machine learning algorithm used for generating a set of classification rules of the form "IF-THEN" from a set of examples, producing rules at each iteration and appending them to the rule set.
• Basic Idea:
There are basically two methods for knowledge extraction firstly from domain
experts and then with machine learning.
• For very large amounts of data, domain experts are not very useful or reliable, so we move towards the machine learning approach for this work.
• One machine learning method is to replicate the experts' logic in the form of algorithms, but this work is very tedious, time-consuming, and expensive.
• So we move towards inductive algorithms, which themselves generate a strategy for performing a task and need not be instructed separately at each step.
• Need for ILA in the presence of other machine learning algorithms:
The ILA is a newer algorithm that was needed even when other inductive learning algorithms like ID3 and AQ were available.
• The need was due to the pitfalls present in the previous algorithms; one of the major pitfalls was the lack of generalisation of rules.
• ID3 and AQ used decision tree production methods, which were too specific, difficult to analyse, and very slow for basic short classification problems.
• The decision tree-based algorithms were unable to work on a new problem if some attributes were missing.
• The ILA produces a general set of rules instead of decision trees, which overcomes the above problems.
• THE ILA ALGORITHM:
• General requirements at start of the algorithm:-
• list the examples in the form of a table ‘T’ where each row corresponds to an
example and each column contains an attribute value.
• create a set of m training examples, each example composed of k attributes and a
class attribute with n possible decisions.
• create a rule set, R, having the initial value false.
• initially all rows in the table are unmarked.
• Steps in the algorithm:-
• Step 1:
divide the table ‘T’ containing m examples into n sub-tables (t1, t2,…..tn). One
table for each possible value of the class attribute. (repeat steps 2-8 for each sub-
table)
• Step 2:
Initialize the attribute combination count ‘ j ‘ = 1.
• Step 3:
For the sub-table on which work is going on, divide the attribute list into distinct combinations,
each combination with ‘j ‘ distinct attributes.
• Step 4:
For each combination of attributes, count the number of occurrences of attribute values that appear under the same combination of attributes in unmarked rows of the sub-table under consideration and that, at the same time, do not appear under the same combination of attributes in the other sub-tables. Call the first combination with the maximum number of occurrences the max-combination 'MAX'.
• Step 5:
If 'MAX' == null, increase 'j' by 1 and go to Step 3.
• Step 6:
Mark all rows of the sub-table being worked on in which the values of 'MAX' appear as classified.
• Step 7:
Add a rule (IF attribute = “XYZ” –> THEN decision is YES/ NO) to R whose left-hand side
will have attribute names of the ‘MAX’ with their values separated by AND, and its right-hand
side contains the decision attribute value associated with the sub-table.
• Step 8:
If all rows are marked as classified, then move on to process another sub-table and go to Step 2;
else, go to Step 4. If no sub-tables are available, exit with the set of rules obtained till then.
• An example showing the use of ILA
Suppose an example set has the attributes place type, weather, location and decision, with seven
examples; our task is to generate a set of rules that tell us, under which conditions, what the decision is.
Example no.   Place type   Weather   Location   Decision
I             hilly        winter    kullu      Yes
II            mountain     windy     Mumbai     No
III           mountain     windy     Shimla     Yes
IV            beach        windy     Mumbai     No
V             beach        warm      goa        Yes
VI            beach        windy     goa        No
VII           beach        warm      Shimla     Yes

step 1
subset 1 (decision = Yes)

s.no   Place type   Weather   Location   Decision
1      hilly        winter    kullu      Yes
2      mountain     windy     Shimla     Yes
3      beach        warm      goa        Yes
4      beach        warm      Shimla     Yes

subset 2 (decision = No)

s.no   Place type   Weather   Location   Decision
5      mountain     windy     Mumbai     No
6      beach        windy     Mumbai     No
7      beach        windy     goa        No

• step (2-8)
• At iteration 1:
rows 3 & 4, column weather, are selected and rows 3 & 4 are marked.
The rule added to R: IF weather is warm THEN the decision is Yes.
• At iteration 2:
row 1, column place type, is selected and row 1 is marked.
The rule added to R: IF place type is hilly THEN the decision is Yes.
• At iteration 3:
row 2, column location, is selected and row 2 is marked.
The rule added to R: IF location is Shimla THEN the decision is Yes.
• At iteration 4:
rows 5 & 6, column location, are selected and rows 5 & 6 are marked.
The rule added to R: IF location is Mumbai THEN the decision is No.
• At iteration 5:
row 7, columns place type & weather, are selected and row 7 is marked.
The rule added to R: IF place type is beach AND weather is windy THEN the decision is No.
• Finally, we get the rule set:
• Rule Set
• Rule 1: IF the weather is warm THEN the decision is yes.
• Rule 2: IF place type is hilly THEN the decision is yes.
• Rule 3: IF location is Shimla THEN the decision is yes.
• Rule 4: IF location is Mumbai THEN the decision is no.
• Rule 5: IF place type is beach AND the weather is windy THEN the decision is no.
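• The steps above can be condensed into a short Python sketch (the function and variable names below are illustrative, not from any library), applied to the seven-example table; running it reproduces the five rules listed above.

from itertools import combinations

# The seven examples from the table above: (place type, weather, location, decision)
examples = [
    ("hilly",    "winter", "kullu",  "Yes"),
    ("mountain", "windy",  "Mumbai", "No"),
    ("mountain", "windy",  "Shimla", "Yes"),
    ("beach",    "windy",  "Mumbai", "No"),
    ("beach",    "warm",   "goa",    "Yes"),
    ("beach",    "windy",  "goa",    "No"),
    ("beach",    "warm",   "Shimla", "Yes"),
]
attributes = ["place type", "weather", "location"]

def ila(examples, attributes):
    rules = []
    classes = list(dict.fromkeys(e[-1] for e in examples))   # class values in order of appearance
    # Step 1: one sub-table per value of the class attribute
    subtables = {c: [e[:-1] for e in examples if e[-1] == c] for c in classes}
    for c in classes:
        rows = subtables[c]
        other_rows = [e[:-1] for e in examples if e[-1] != c]
        marked = [False] * len(rows)
        j = 1                                                # Step 2: attribute-combination size
        while not all(marked) and j <= len(attributes):
            best_combo, best_values, best_count = None, None, 0
            # Step 3: all combinations of j distinct attributes
            for combo in combinations(range(len(attributes)), j):
                # Step 4: count value combinations of unmarked rows that never
                # appear under the same attributes in the other sub-tables
                counts = {}
                for i, row in enumerate(rows):
                    if marked[i]:
                        continue
                    values = tuple(row[k] for k in combo)
                    if any(tuple(o[k] for k in combo) == values for o in other_rows):
                        continue
                    counts[values] = counts.get(values, 0) + 1
                for values, count in counts.items():
                    if count > best_count:
                        best_combo, best_values, best_count = combo, values, count
            if best_combo is None:                           # Step 5: MAX is null, try larger combinations
                j += 1
                continue
            for i, row in enumerate(rows):                   # Step 6: mark the rows covered by MAX
                if not marked[i] and tuple(row[k] for k in best_combo) == best_values:
                    marked[i] = True
            condition = " AND ".join(f"{attributes[k]} is {v}"
                                     for k, v in zip(best_combo, best_values))
            rules.append(f"IF {condition} THEN decision is {c}")   # Step 7: add the IF-THEN rule
    return rules

for rule in ila(examples, attributes):
    print(rule)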
• Classification Algorithm in Machine Learning
• As we know, Supervised Machine Learning algorithms can be broadly classified into
Regression and Classification algorithms. With Regression algorithms we predict
the output for continuous values, but to predict categorical values we need
Classification algorithms.
• What is the Classification Algorithm?
• The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data. In Classification, a
program learns from the given dataset or observations and then classifies new
observation into a number of classes or groups. Such as, Yes or No, 0 or 1, Spam or Not
Spam, cat or dog, etc. Classes can be called as targets/labels or categories.
• Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with
the corresponding output.
• In a classification algorithm, the input variable (x) is mapped to a discrete output variable (y).
• y = f(x), where y = categorical output
• The best example of an ML classification algorithm is Email Spam Detector.
• The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for the
categorical data.
• Classification algorithms can be better understood using the below diagram. In the
below diagram, there are two classes, class A and Class B. These classes have
features that are similar to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
•Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
•Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:


In classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test
dataset. In the lazy learner's case, classification is done on the basis of the most related data stored in the
training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a training dataset before
receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and
less time in prediction. Example: Decision Trees, Naïve Bayes, ANN.
• Types of ML Classification Algorithms:
• Classification algorithms can be further divided mainly into two categories; a short sketch using one model from each category follows the list below:
• Linear Models
• Logistic Regression
• Support Vector Machines
• Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
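• A minimal sketch, assuming scikit-learn and its built-in Iris dataset, that trains one model from the linear family (Logistic Regression) and one from the non-linear family (K-Nearest Neighbours):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Small multi-class dataset used only for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

linear_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)       # linear model
nonlinear_clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # non-linear, lazy learner

print("Logistic Regression accuracy:", linear_clf.score(X_test, y_test))
print("K-NN accuracy:", nonlinear_clf.score(X_test, y_test))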
• Evaluating a Classification model:
• Once our model is completed, it is necessary to evaluate its performance, whether it
is a Classification or a Regression model. For evaluating a Classification model,
we have the following ways:
• 1. Log Loss or Cross-Entropy Loss:
• It is used for evaluating the performance of a classifier whose output is a
probability value between 0 and 1.
• For a good binary Classification model, the value of log loss should be near to 0.
• The value of log loss increases if the predicted value deviates from the actual
value.
• The lower log loss represents the higher accuracy of the model.
• For Binary classification, cross-entropy can be calculated as:
• -(y log(p) + (1 - y) log(1 - p))
• Where y= Actual output, p= predicted output.
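• A minimal sketch of this calculation (the labels and probabilities below are made-up illustrative values); scikit-learn's log_loss gives the same averaged result:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])            # actual outputs
p_pred = np.array([0.9, 0.2, 0.8, 0.6, 0.3])  # predicted probabilities of class 1

# -(y*log(p) + (1-y)*log(1-p)), averaged over the samples
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(manual)
print(log_loss(y_true, p_pred))   # same value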
• Confusion Matrix:
• The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
• It is also known as the error matrix.
• The matrix summarizes the prediction results, giving the total number of
correct predictions and incorrect predictions. The matrix looks like the below table:

                       Actual Positive    Actual Negative
Predicted Positive     True Positive      False Positive
Predicted Negative     False Negative     True Negative

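• A quick way to obtain this matrix is scikit-learn's confusion_matrix; note that scikit-learn puts actual classes in the rows and predicted classes in the columns, i.e. the transpose of the layout above (the labels below are illustrative):

from sklearn.metrics import confusion_matrix, accuracy_score

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_actual, y_predicted)   # [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("Accuracy:", accuracy_score(y_actual, y_predicted))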
• AUC-ROC curve:
• ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area
Under the Curve.
• It is a graph that shows the performance of the classification model at different thresholds.
• To visualize the performance of the multi-class classification model, we use the AUC-ROC
Curve.
• The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis and
FPR(False Positive Rate) on X-axis.
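• A small sketch of computing the ROC curve points and the AUC with scikit-learn (the scores below are illustrative):

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)     # FPR/TPR at each threshold
print("AUC:", roc_auc_score(y_true, y_scores))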
• Use cases of Classification Algorithms
• Classification algorithms can be used in different places. Below are some popular
use cases of Classification Algorithms:
• Email Spam Detection
• Speech Recognition
• Identifications of Cancer tumor cells.
• Drugs Classification
• Biometric Identification, etc.
• Regression Analysis in Machine learning
• Regression analysis is a statistical method to model the relationship between a
dependent (target) variable and one or more independent (predictor)
variables. More specifically, regression analysis helps us to
understand how the value of the dependent variable changes with respect
to an independent variable when the other independent variables are held fixed. It
predicts continuous/real values such as temperature, age, salary, price, etc.
• We can understand the concept of regression analysis using the below example:
• Example: Suppose there is a marketing company A, who does various
advertisement every year and get sales on that. The below list shows the
advertisement made by the company in the last 5 years and the corresponding
sales:
Now, the company wants to do the advertisement of $200 in the year 2019 and wants to know
the prediction about the sales for this year. So to solve such type of prediction problems in
machine learning, we need regression analysis.
• Regression is a supervised learning technique
• which helps in finding the correlation between variables and enables us to predict
the continuous output variable based on the one or more predictor variables. It is
mainly used for prediction, forecasting, time series modeling, and determining
the causal-effect relationship between variables.
• In Regression, we fit a line or curve to the variables which best fits the given
datapoints; using this plot, the machine learning model can make predictions about
the data. In simple words, "Regression shows a line or curve that passes through
the datapoints on the target-predictor graph in such a way that the vertical
distance between the datapoints and the regression line is minimum." The
distance between the datapoints and the line tells whether a model has captured a strong
relationship or not.
• Some examples of regression can be as:
• Prediction of rain using temperature and other factors
• Determining Market trends
• Prediction of road accidents due to rash driving.
• Terminologies Related to the Regression Analysis:
• Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
• Independent Variable: The factors which affect the dependent variables or which are used to
predict the values of the dependent variables are called independent variable, also called as
a predictor.
• Outliers: An outlier is an observation which contains either a very low value or a very high value in
comparison to the other observed values. An outlier may hamper the result, so it should be avoided.
• Multicollinearity: If the independent variables are highly correlated with each other,
then this condition is called multicollinearity. It should not be present in the dataset,
because it creates problems when ranking the most influential variables.
• Underfitting and Overfitting: If our algorithm works well with the training dataset but not
well with the test dataset, then the problem is called overfitting. And if our algorithm does not
perform well even with the training dataset, then the problem is called underfitting.
• Why do we use Regression Analysis?
• As mentioned above, Regression analysis helps in the prediction of a continuous
variable. There are various scenarios in the real world where we need some future
predictions such as weather condition, sales prediction, marketing trends, etc., for
such case we need some technology which can make predictions more accurately.
So for such case we need Regression analysis which is a statistical method and
used in machine learning and data science. Below are some other reasons for
using Regression analysis:
• Regression estimates the relationship between the target and the independent
variable.
• It is used to find the trends in data.
• It helps to predict real/continuous values.
• By performing the regression, we can confidently determine the most important
factor, the least important factor, and how each factor is affecting the other
factors.
• Types of Regression
• There are various types of regressions which are used in data science and machine
learning. Each type has its own importance on different scenarios, but at the core,
all the regression methods analyze the effect of the independent variable on
dependent variables. Here we are discussing some important types of regression
which are given below:
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Ridge Regression
• Lasso Regression
• Linear Regression:
• Linear regression is a statistical regression method which is used for predictive
analysis.
• It is one of the very simple and easy algorithms which works on regression and
shows the relationship between the continuous variables.
• It is used for solving the regression problem in machine learning.
• Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence called linear regression.
• If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
• The relationship between variables in the linear regression model can be explained
using the below image. Here we are predicting the salary of an employee on the
basis of the year of experience.
•Below is the mathematical equation for Linear regression:
Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
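• As a minimal sketch (with made-up X and Y values), the coefficients a and b can be estimated by least squares with NumPy:

import numpy as np

# Illustrative data only: X = years of experience, Y = salary in thousands
X = np.array([1, 2, 3, 4, 5])
Y = np.array([30, 35, 42, 48, 55])

# np.polyfit with degree 1 fits Y = aX + b and returns [a, b]
a, b = np.polyfit(X, Y, 1)
print(f"Y = {a:.2f}X + {b:.2f}")
print("Prediction for X = 6:", a * 6 + b)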
• Linear Regression in Machine Learning
• Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.
• The linear regression algorithm shows a linear relationship between a dependent (y) variable
and one or more independent (x) variables, hence it is called linear regression. Since
linear regression shows a linear relationship, it finds how the value
of the dependent variable changes according to the value of the independent
variable.
• The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
• Here,
• y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor to each input value)
ε = random error
• The values for x and y variables are training datasets for Linear Regression model
representation.
• Types of Linear Regression
• Linear regression can be further divided into two types of the algorithm:
• Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
• Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
•Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-axis,
then such a relationship is termed as a Positive linear relationship.
•Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and independent variable increases on the X-
axis, then such a relationship is called a negative linear relationship.
• Finding the best fit line:
• When working with linear regression, our main goal is to find the best fit line that
means the error between predicted values and actual values should be minimized.
The best fit line will have the least error.
• The different values for weights or the coefficient of lines (a0, a1) gives a different
line of regression, so we need to calculate the best values for a0 and a1 to find the
best fit line, so to calculate this we use cost function.
• Simple Linear Regression in Machine Learning
• Simple Linear Regression is a type of Regression algorithms that models the
relationship between a dependent variable and a single independent variable.
The relationship shown by a Simple Linear Regression model is linear or a
sloped straight line, hence it is called Simple Linear Regression.
• The key point in Simple Linear Regression is that the dependent variable must
be a continuous/real value. However, the independent variable can be
measured on continuous or categorical values.
• Simple Linear regression algorithm has mainly two objectives:
• Model the relationship between the two variables. Such as the relationship
between Income and expenditure, experience and Salary, etc.
• Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.
• Simple Linear Regression Model:
• The Simple Linear Regression model can be represented using the below
equation:
• y= a0+a1x+ ε
• Where,
• a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is
increasing or decreasing.
ε = The error term. (For a good model it will be negligible)
Implementation of Simple Linear Regression Algorithm using Python
Problem Statement example for Simple Linear Regression:
Here we are taking a dataset that has two variables: salary (dependent variable) and experience
(independent variable). The goals of this problem are:
•To find out whether there is any correlation between these two variables.
•To find the best fit line for the dataset.
•To see how the dependent variable changes as the independent variable changes.
In this section, we will create a Simple Linear Regression model to find out the best fitting line
for representing the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we need to
follow the below steps:
• Step-1: Data Pre-processing
• The first step in creating the Simple Linear Regression model is data pre-processing.
We have already done it earlier in this tutorial, but there will be some
changes, which are given in the below steps:
• First, we will import the three important libraries, which will help us for loading
the dataset, plotting the graphs, and creating the Simple Linear Regression
model.
• import numpy as nm
• import matplotlib.pyplot as mtp
• import pandas as pd
• Next, we will load the dataset into our code:
• data_set= pd.read_csv('Salary_Data.csv')
• By executing the above line of code (ctrl+ENTER), we can read the dataset on our Spyder IDE
screen by clicking on the variable explorer option.
• The above output shows the dataset, which has two variables: Salary and
Experience.
• Note: In Spyder IDE, the folder containing the code file must be saved as a
working directory, and the dataset or csv file should be in the same folder.
• After that, we need to extract the dependent and independent variables from
the given dataset. The independent variable is years of experience, and the
dependent variable is salary. Below is code for it:
• x= data_set.iloc[:, :-1].values
• y= data_set.iloc[:, 1].values
• In the above lines of code, for x variable, we have taken -1 value since we want
to remove the last column from the dataset. For y variable, we have taken 1
value as a parameter, since we want to extract the second column and indexing
starts from the zero.
• By executing the above line of code, we will get the output for X and Y variable
as:
• In the above output image, we can see the X (independent) variable and Y
(dependent) variable has been extracted from the given dataset.
• Next, we will split both variables into the test set and training set. We have 30
observations, so we will take 20 observations for the training set and 10
observations for the test set. We are splitting our dataset so that we can train
our model using a training dataset and then test the model using a test dataset.
The code for this is given below:
• # Splitting the dataset into training and test set.
• from sklearn.model_selection import train_test_split
• x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)
• By executing the above code, we will get x-test, x-train and y-test, y-train
dataset. Consider the below images:
Test-dataset:
• Training Dataset:
• For simple linear Regression, we will not use Feature Scaling. Because Python libraries
take care of it for some cases, so we don't need to perform it here. Now, our dataset is well
prepared to work on it and we are going to start building a Simple Linear Regression
model for the given problem.
• Step-2: Fitting the Simple Linear Regression to the Training Set:
• Now the second step is to fit our model to the training dataset. To do so, we will import
the LinearRegression class of the linear_model library from the scikit learn. After
importing the class, we are going to create an object of the class named as a regressor.
The code for this is given below:
• #Fitting the Simple Linear Regression model to the training dataset
• from sklearn.linear_model import LinearRegression
• regressor= LinearRegression()
• regressor.fit(x_train, y_train)
• In the above code, we have used a fit() method to fit our Simple Linear Regression object to
the training set. In the fit() function, we have passed the x_train and y_train, which is our
training dataset for the dependent and an independent variable. We have fitted our regressor
object to the training set so that the model can easily learn the correlations between the
predictor and target variables. After executing the above lines of code, we will get the below
output.
• Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


• Step: 3. Prediction of test set result:
• Our model is now trained on the relationship between the dependent (salary) and independent
(Experience) variables. So, now, our model is ready
to predict the output for the new observations. In this step, we will provide the test dataset
(new observations) to the model to check whether it can predict the correct output or not.
• We will create a prediction vector y_pred, and x_pred, which will contain predictions of
test dataset, and prediction of training set respectively.
• #Prediction of Test and Training set result
• y_pred= regressor.predict(x_test)
• x_pred= regressor.predict(x_train)
• On executing the above lines of code, two variables named y_pred and x_pred
will be generated in the variable explorer, containing the salary predictions for
the test set and the training set respectively.
• Output:
• You can check the variable by clicking on the variable explorer option in the IDE, and also compare
the result by comparing values from y_pred and y_test. By comparing these values, we can check how
good our model is performing.
• Step: 4. visualizing the Training set results:
• Now in this step, we will visualize the training set result. To do so, we will use the scatter() function of
the pyplot library, which we have already imported in the pre-processing step. The scatter ()
function will create a scatter plot of observations.
• In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of
employees. In the function, we will pass the real values of training set, which means a year of
experience x_train, training set of Salaries y_train, and color of the observations. Here we are taking a
green color for the observation, but it can be any color as per the choice.
• Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot
library. In this function, we will pass the years of experience for training set, predicted salary for
training set x_pred, and color of the line.
• Next, we will give the title for the plot. So here, we will use the title() function of the pyplot library
and pass the name ("Salary vs Experience (Training Dataset)".
• After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.
• Finally, we will represent all above things in a graph using show(). The code is given below:
• mtp.scatter(x_train, y_train, color="green")
• mtp.plot(x_train, x_pred, color="red")
• mtp.title("Salary vs Experience (Training Dataset)")
• mtp.xlabel("Years of Experience")
• mtp.ylabel("Salary(In Rupees)")
• mtp.show()
• Output:
• By executing the above lines of code, we will get the below graph plot as an
output.
In the above plot, we can see the real observations as green dots and the predicted values covered by the
red regression line. The regression line shows a correlation between the dependent and independent variable.
The goodness of fit of the line can be judged by the difference between the actual and predicted values.
As we can see in the above plot, most of the observations are close to the regression line, hence our
model is good for the training set.
• Step: 5. visualizing the Test set results:
• In the previous step, we have visualized the performance of our model on the training
set. Now, we will do the same for the Test set. The complete code will remain the same
as the above code, except in this, we will use x_test, and y_test instead of x_train and
y_train.
• Here we are also changing the color of observations and regression line to differentiate
between the two plots, but it is optional.
• #visualizing the Test set results
• mtp.scatter(x_test, y_test, color="blue")
• mtp.plot(x_train, x_pred, color="red")
• mtp.title("Salary vs Experience (Test Dataset)")
• mtp.xlabel("Years of Experience")
• mtp.ylabel("Salary(In Rupees)")
• mtp.show()
• Output:
• By executing the above line of code, we will get the output as:
• In the above plot, there are observations given by the blue color, and prediction is
given by the red regression line. As we can see, most of the observations are close to
the regression line, hence we can say our Simple Linear Regression is a good model
and able to make good predictions.
• Multiple Linear Regression
• In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there
may be various cases in which the response variable is affected by more than one predictor
variable; for such cases, the Multiple Linear Regression algorithm is used.
• Moreover, Multiple Linear Regression is an extension of Simple Linear regression as it
takes more than one predictor variable to predict the response variable. We can define it as:
• Multiple Linear Regression is one of the important regression algorithms which models the
linear relationship between a single dependent continuous variable and more than one
independent variable.
• Example:
• Prediction of CO2 emission based on engine size and number of cylinders in a car.
• Some key points about MLR:
• For MLR, the dependent or target variable(Y) must be the continuous/real, but the
predictor or independent variable may be of continuous or categorical form.
• Each feature variable must model the linear relationship with the dependent variable.
• MLR tries to fit a regression line through a multidimensional space of data-points.
• MLR equation:
• In Multiple Linear Regression, the target variable(Y) is a linear combination of
multiple predictor variables x1, x2, x3, ...,xn. Since it is an enhancement of Simple
Linear Regression, so the same is applied for the multiple linear regression equation,
the equation becomes:
• Y = b0 + b1x1 + b2x2 + b3x3 + ...... + bnxn
• Where,
• Y= Output/Response variable
• b0, b1, b2, b3 , bn....= Coefficients of the model.
• x1, x2, x3, x4,...= Various Independent/feature variable
• Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the Target and predictor variables.
• The regression residuals must be normally distributed.
• MLR assumes little or no multicollinearity (correlation between the
independent variable) in data.
• Implementation of Multiple Linear Regression model using Python:
• To implement MLR using Python, we have below problem:
• Problem Description:
• We have a dataset of 50 start-up companies. This dataset contains five main
information: R&D Spend, Administration Spend, Marketing Spend, State, and Profit
for a financial year. Our goal is to create a model that can easily determine which
company has a maximum profit, and which is the most affecting factor for the profit of a
company.
• Since we need to find the Profit, so it is the dependent variable, and the other four
variables are independent variables. Below are the main steps of deploying the MLR
model:
• Data Pre-processing Steps
• Fitting the MLR model to the training set
• Predicting the result of the test set
• Step-1: Data Pre-processing Step:
• The very first step is data pre-processing
• , which we have already discussed in this tutorial. This process contains the below steps:
• Importing libraries: Firstly we will import the library which will help in building the
model. Below is the code for it:
• # importing libraries
• import numpy as nm # NumPy is a Python library used for working with arrays
• import matplotlib.pyplot as mtp #Matplotlib is a plotting library for the Python programming language
• import pandas as pd #pandas is a software library written for the Python programming language for data
manipulation and analysis.
• Importing dataset: Now we will import the dataset(50_CompList), which contains all the
variables. Below is the code for it:
• #importing datasets
• data_set= pd.read_csv('50_CompList.csv')
• Output: We will get the dataset as:

• In the above output, we can clearly see that there are five variables, of which four
are continuous and one is a categorical variable.
• Extracting dependent and independent Variables:
• #Extracting Independent and dependent Variable
• x= data_set.iloc[:, :-1].values
• y= data_set.iloc[:, 4].values
• Output:
• Out[5]:
array([[165349.2, 136897.8, 471784.1, 'New York'], [162597.7, 151377.59, 443898.53, 'California'], [153441.51,
101145.55, 407934.54, 'Florida'], [144372.41, 118671.85, 383199.62, 'New York'], [142107.34, 91391.77,
366168.42, 'Florida'], [131876.9, 99814.71, 362861.36, 'New York'], [134615.46, 147198.87, 127716.82,
'California'], [130298.13, 145530.06, 323876.68, 'Florida'], [120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'], [101913.08, 110594.11, 229160.95, 'Florida'], [100671.96,
91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'], [91992.39, 135495.07, 252664.93, 'California'], [119943.24,
156547.42, 256512.92, 'Florida'], [114523.61, 122616.84, 261776.23, 'New York'], [78013.11, 121597.55,
264346.06, 'California'], [94657.16, 145077.58, 282574.31, 'New York'], [91749.16, 114175.79, 294919.57,
'Florida'], [86419.7, 153514.11, 0.0, 'New York'], [76253.86, 113867.3, 298664.47, 'California'], [78389.47,
153773.43, 299737.29, 'New York'], [73994.56, 122782.75, 303319.26, 'Florida'], [67532.53, 105751.03,
304768.73, 'Florida'], [77044.01, 99281.34, 140574.81, 'New York'],

[64664.71, 139553.16, 137962.62, 'California'], [75328.87, 144135.98, 134050.07, 'Florida'], [72107.6,


127864.55, 353183.81, 'New York'], [66051.52, 182645.56, 118148.2, 'Florida'], [65605.48, 153032.06,
107138.38, 'New York'], [61994.48, 115641.28, 91131.24, 'Florida'], [61136.38, 152701.92, 88218.23, 'New
York'], [63408.86, 129219.61, 46085.25, 'California'], [55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'], [46014.02, 85047.44, 205517.64, 'New York'], [28663.76,
127056.21, 201126.82, 'Florida'], [44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'], [38558.51, 82982.09, 174999.3, 'California'], [28754.33, 118546.05,
172795.67, 'California'], [27892.92, 84710.77, 164470.71, 'Florida'], [23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'], [22177.74, 154806.14, 28334.72, 'California'], [1000.23, 124153.04,
1903.93, 'New York'], [1315.46, 115816.21, 297114.46, 'Florida'], [0.0, 135426.92, 0.0, 'California'], [542.05,
51743.15, 0.0, 'New York'], [0.0, 116983.8, 45173.06, 'California']], dtype=object)
• As we can see in the above output, the last column contains categorical variables which are
not suitable to apply directly for fitting the model. So we need to encode this variable.
• Encoding Dummy Variables:
• As we have one categorical variable (State), which cannot be directly applied to the model, so
we will encode it. To encode the categorical variable into numbers, we will use
the LabelEncoder class. But it is not sufficient because it still has some relational order,
which may create a wrong model. So in order to remove this problem, we will
use OneHotEncoder, which will create the dummy variables. Below is code for it:
• #Catgorical data
• from sklearn.preprocessing import LabelEncoder, OneHotEncoder
• labelencoder_x= LabelEncoder()
• x[:, 3]= labelencoder_x.fit_transform(x[:,3])
• onehotencoder= OneHotEncoder(categorical_features= [3])
• x= onehotencoder.fit_transform(x).toarray()
• Here we are only encoding one independent variable, which is state as other variables are
continuous.
Output:
• As we can see in the above output, the state column has been converted into dummy
variables (0 and 1). Here each dummy variable column is corresponding to the one
State. We can check by comparing it with the original dataset. The first column
corresponds to the California State, the second column corresponds to the Florida State,
and the third column corresponds to the New York State.
• Note: We should not use all the dummy variables at the same time; the number of dummy variables used
must be one less than the total number of dummy variables, otherwise it will create the dummy variable trap.
• Now, we are writing a single line of code just to avoid the dummy variable trap:
• #avoiding the dummy variable trap:
• x = x[:, 1:]
• If we do not remove the first dummy variable, then it may introduce
multicollinearity in the model.
• As we can see in the above output image, the first column has been removed.
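• Note: the categorical_features argument of OneHotEncoder used above has been removed in newer versions of scikit-learn. A hedged alternative sketch for recent versions, assuming the same 50_CompList.csv layout (State in column index 3), performs the encoding and avoids the dummy variable trap in one step:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data_set = pd.read_csv('50_CompList.csv')
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 4].values

# drop='first' leaves out one dummy column, which avoids the dummy variable trap
ct = ColumnTransformer(
    transformers=[('state', OneHotEncoder(drop='first'), [3])],
    remainder='passthrough')          # keep the three numeric spend columns unchanged
x = ct.fit_transform(x)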
• Now we will split the dataset into training and test set. The code for this is given below:
• # Splitting the dataset into training and test set.
• from sklearn.model_selection import train_test_split
• x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
• The above code will split our dataset into a training set and test set.
• Output: The above code will split the dataset into training set and test set. You can
check the output by clicking on the variable explorer option given in Spyder IDE. The
test set and training set will look like the below image:
• Test set:
Training set:
• Step: 2- Fitting our MLR model to the Training set:
• Now, we have well prepared our dataset in order to provide training, which means we
will fit our regression model to the training set. It will be similar to as we did in Simple
Linear Regression
• model. The code for this will be:
• #Fitting the MLR model to the training set:
• from sklearn.linear_model import LinearRegression
• regressor= LinearRegression()
• regressor.fit(x_train, y_train)
• Output:
Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
• Now, we have successfully trained our model using the training dataset. In the next step,
we will test the performance of the model using the test dataset.
• Step: 3- Prediction of Test set results:
• The last step for our model is checking the performance of the model. We will do it by
predicting the test set result. For prediction, we will create a y_pred vector. Below is the
code for it:
• #Predicting the Test set result;
• y_pred= regressor.predict(x_test)
• By executing the above lines of code, a new vector will be generated under the variable
explorer option. We can test our model by comparing the predicted values and test set
values.
Output:
• In the above output, we have predicted result set and test set. We can check model
performance by comparing these two value index by index. For example, the first index
has a predicted value of 103015$ profit and test/real value of 103282$ profit. The
difference is only of 267$, which is a good prediction, so, finally, our model is
completed here.
• We can also check the score for training dataset and test dataset. Below is the
code for it:
• print('Train Score: ', regressor.score(x_train, y_train))
• print('Test Score: ', regressor.score(x_test, y_test))
• Output: The score is:
Train Score: 0.9501847627493607
Test Score: 0.9347068473282446
The above scores (R² values returned by regressor.score) tell us that the model explains about 95% of the variance on the training dataset and about 93% on the test dataset.
• Applications of Multiple Linear Regression:
• There are mainly two applications of Multiple Linear Regression:
• Effectiveness of Independent variable on prediction:
• Predicting the impact of changes:
• ML Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship between
a dependent(y) and independent variable(x) as nth degree polynomial. The Polynomial
Regression equation is given below:
• y = b0 + b1x1 + b2x1^2 + b3x1^3 + ...... + bnx1^n
• It is also called the special case of Multiple Linear Regression in ML. Because we add
some polynomial terms to the Multiple Linear regression equation to convert it into
Polynomial Regression.
• It is a linear model with some modification in order to increase the accuracy.
• The dataset used in Polynomial regression for training is of non-linear nature.
• It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
• Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."
Need for Polynomial Regression:
The need of Polynomial Regression in ML can be understood in the below points:
•If we apply a linear model to a linear dataset, it provides a good result, as we
have seen in Simple Linear Regression, but if we apply the same model without any
modification to a non-linear dataset, it will produce poor results: the
loss function will increase, the error rate will be high, and the accuracy will decrease.
•So for such cases, where data points are arranged in a non-linear fashion, we need
the Polynomial Regression model. We can understand it in a better way using the below
comparison diagram of the linear dataset and non-linear dataset.
• In the above image, we have taken a dataset which is arranged non-linearly. So if we try
to cover it with a linear model, then we can clearly see that it hardly covers any data
point. On the other hand, a curve is suitable to cover most of the data points, which is of
the Polynomial model.
• Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.
• Note: A Polynomial Regression algorithm is also called Polynomial Linear
Regression because it does not depend on the variables, instead, it depends on
the coefficients, which are arranged in a linear fashion.
• Equation of the Polynomial Regression Model:
• Simple Linear Regression equation:   y = b0 + b1x                                  .........(a)
• Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn     .........(b)
• Polynomial Regression equation:      y = b0 + b1x + b2x^2 + b3x^3 + .... + bnx^n   ..........(c)
• When we compare the above three equations, we can clearly see that all three are
polynomial equations but differ in the degree of the variables. The Simple and Multiple Linear
equations are polynomial equations of degree one, and the Polynomial regression
equation is a linear equation with terms up to the nth degree. So if we add higher-degree terms to our linear equations,
they are converted into Polynomial Linear equations.
• Note: To better understand Polynomial Regression, you must have knowledge of Simple
Linear Regression.
• Implementation of Polynomial Regression using Python:
• Here we will implement the Polynomial Regression using Python. We will understand
it by comparing Polynomial Regression model with the Simple Linear Regression
model. So first, let's understand the problem for which we are going to build the
model.
• Problem Description: There is a Human Resource company which is going to hire a
new candidate. The candidate has stated that his previous salary was 160K per annum, and
HR has to check whether he is telling the truth or bluffing. To identify this, they only
have a dataset of his previous company in which the salaries of the top 10 positions are
mentioned with their levels. By checking the available dataset, we have found that
there is a non-linear relationship between the Position levels and the salaries. Our
goal is to build a Bluffing detector regression model, so HR can hire an honest
candidate. Below are the steps to build such a model.
• Steps for Polynomial Regression:
• The main steps involved in Polynomial Regression are given below:
• Data Pre-processing
• Build a Linear Regression model and fit it to the dataset
• Build a Polynomial Regression model and fit it to the dataset
• Visualize the result for Linear Regression and Polynomial Regression model.
• Predicting the output.
• Note: Here, we will build the Linear regression model as well as Polynomial Regression
to see the results between the predictions. And Linear regression model is for reference.
• Data Pre-processing Step:
• The data pre-processing step will remain the same as in previous regression models,
except for some changes. In the Polynomial Regression model, we will not use feature
scaling, and also we will not split our dataset into training and test set. It has two reasons:
• The dataset contains very few records, so it is not suitable to split it into a test and
training set; otherwise our model will not be able to find the correlations between the salaries
and levels.
• In this model, we want very accurate predictions for salary, so the model should have
enough information.
• The code for pre-processing step is given below:
• # importing libraries
• import numpy as nm
• import matplotlib.pyplot as mtp
• import pandas as pd

• #importing datasets
• data_set= pd.read_csv('Position_Salaries.csv')

• #Extracting Independent and dependent Variable
• x= data_set.iloc[:, 1:2].values
• y= data_set.iloc[:, 2].values
• Explanation:
• In the above lines of code, we have imported the important Python libraries to
import dataset and operate on it.
• Next, we have imported the dataset 'Position_Salaries.csv', which contains
three columns (Position, Levels, and Salary), but we will consider only two
columns (Salary and Levels).
• After that, we have extracted the dependent (y) and independent variable (x)
from the dataset. For the x-variable, we have taken the parameters as [:, 1:2], because
we want index 1 (Levels), and included :2 so that x remains a matrix (2-D array).
Output:
By executing the above code, we can read our dataset as:
• As we can see in the above output, there are three columns present (Positions, Levels,
and Salaries). But we are only considering two columns, because the Levels column can be
seen as the encoded form of the Positions.
• Here we will predict the output for level 6.5 because the candidate has 4+ years'
experience as a regional manager, so he must be somewhere between levels 7 and 6.
• Building the Linear regression model:
• Now, we will build and fit the Linear regression model to the dataset. In building
polynomial regression, we will take the Linear regression model as reference
and compare both the results. The code is given below:
• #Fitting the Linear Regression to the dataset
• from sklearn.linear_model import LinearRegression
• lin_regs= LinearRegression()
• lin_regs.fit(x,y)
• In the above code, we have created the Simple Linear model
using lin_regs object of LinearRegression class and fitted it to the dataset
variables (x and y).
• Output:
• Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
• Building the Polynomial regression model:
• Now we will build the Polynomial Regression model, but it will be a little different
from the Simple Linear model. Because here we will use PolynomialFeatures class
of preprocessing library. We are using this class to add some extra features to our
dataset.
• #Fitting the Polynomial regression to the dataset
• from sklearn.preprocessing import PolynomialFeatures
• poly_regs= PolynomialFeatures(degree= 2)
• x_poly= poly_regs.fit_transform(x)
• lin_reg_2 =LinearRegression()
• lin_reg_2.fit(x_poly, y)
• In the above lines of code, we have used poly_regs.fit_transform(x), because first we are
converting our feature matrix into polynomial feature matrix, and then fitting it to the
Polynomial regression model. The parameter value(degree= 2) depends on our choice. We
can choose it according to our Polynomial features.
• After executing the code, we will get another matrix x_poly, which can be seen under the
variable explorer option:
• Next, we have used another LinearRegression object, namely lin_reg_2, to fit
our x_poly vector to the linear model.
Output:

Out[11]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


Visualizing the result for Linear regression:
Now we will visualize the result for Linear regression model as we did in Simple
Linear Regression. Below is the code for it:
1.#Visulaizing the result for Linear Regression model
2.mtp.scatter(x,y,color="blue")
3.mtp.plot(x,lin_regs.predict(x), color="red")
4.mtp.title("Bluff detection model(Linear Regression)")
5.mtp.xlabel("Position Levels")
6.mtp.ylabel("Salary")
7.mtp.show()
Output:

In the above output image, we can clearly see that the regression line is so far from the
datasets. Predictions are in a red straight line, and blue points are actual values. If we
consider this output to predict the value of CEO, it will give a salary of approx. 600000$,
which is far away from the real value.
So we need a curved model to fit the dataset other than a straight line.
• Visualizing the result for Polynomial Regression
• Here we will visualize the result of Polynomial regression model, code for which is
little different from the above model.
• Code for this is given below:
• #Visulaizing the result for Polynomial Regression
• mtp.scatter(x,y,color="blue")
• mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
• mtp.title("Bluff detection model(Polynomial Regression)")
• mtp.xlabel("Position Levels")
• mtp.ylabel("Salary")
• mtp.show()
• In the above code, we have used lin_reg_2.predict(poly_regs.fit_transform(x)) instead
of x_poly, because we want the linear regressor object to predict on the polynomial
feature matrix.
Output:

As we can see in the above output image, the predictions are close to the real values.
The above plot will vary as we will change the degree.
• For degree= 3:
• If we change the degree to 3, then we will get a more accurate plot, as shown in
the below image.

So as we can see in the above output image, the predicted salary for level 6.5 is near
170K$-190K$, which suggests that the future employee is telling the truth about his salary.
• Degree= 4: Let's again change the degree to 4, and now will get the most
accurate plot. Hence we can get more accurate results by increasing the degree
of Polynomial.
• Predicting the final result with the Linear Regression model:
• Now, we will predict the final output using the Linear regression model to see
whether an employee is saying truth or bluff. So, for this, we will use
the predict() method and will pass the value 6.5. Below is the code for it:
• lin_pred = lin_regs.predict([[6.5]])
• print(lin_pred)
• Output:
[330378.78787879]


• Predicting the final result with the Polynomial Regression model:
• Now, we will predict the final output using the Polynomial Regression model to
compare with Linear model. Below is the code for it:
• poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
• print(poly_pred)
• Output:
[ 158862.45265153]
• As we can see, the predicted output of the Polynomial Regression is
[158862.45265153], which is much closer to the real value; hence we can say that
the future employee is telling the truth.

• Cost function-
• The different values for weights or coefficient of lines (a0, a1) gives the different
line of regression, and the cost function is used to estimate the values of the
coefficient for the best fit line.
• Cost function optimizes the regression coefficients or weights. It measures how a
linear regression model is performing.
• We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also
known as Hypothesis function.
• For Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of squared error occurred between the predicted values and
actual values. It can be written as:
For the above linear equation, MSE can be calculated as:

MSE = (1/N) * Σ (Yi - (a1xi + a0))²

Where,
N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value.
Residuals: The distance between the actual value and the predicted value is called the residual. If the
observed points are far from the regression line, the residuals will be high, and so the cost function
will be high. If the scatter points are close to the regression line, the residuals will be small and
hence the cost function will be small.
• Gradient Descent:
• Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
• A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
• It starts with randomly selected values of the coefficients and then iteratively updates them to
reach the minimum of the cost function.
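• A minimal sketch of gradient descent for the line y = a1x + a0, using the gradients of the MSE cost function (the data points and learning rate are illustrative choices):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

a0, a1 = 0.0, 0.0          # initial coefficient values
learning_rate = 0.01

for _ in range(5000):
    error = (a1 * x + a0) - y            # predicted minus actual
    grad_a0 = 2 * error.mean()           # d(MSE)/d(a0)
    grad_a1 = 2 * (error * x).mean()     # d(MSE)/d(a1)
    a0 -= learning_rate * grad_a0        # step against the gradient
    a1 -= learning_rate * grad_a1

mse = (((a1 * x + a0) - y) ** 2).mean()
print(f"a1 = {a1:.3f}, a0 = {a0:.3f}, MSE = {mse:.4f}")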
• Model Performance:
• The Goodness of fit determines how the line of regression fits the set of
observations. The process of finding the best model out of various models is
called optimization. It can be achieved by below method:

• R-squared method:
• R-squared is a statistical method that determines the goodness of fit.
• It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
• The high value of R-square determines the less difference between the predicted
values and actual values and hence represents a good model.
• It is also called a coefficient of determination, or coefficient of multiple
determination for multiple regression.
• It can be calculated from the below formula:

R-squared = 1 - (Residual Sum of Squares / Total Sum of Squares) = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²

where ŷi is the predicted value and ȳ is the mean of the actual values.
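• A small sketch comparing the manual calculation with scikit-learn's r2_score (the values are illustrative):

import numpy as np
from sklearn.metrics import r2_score

y_actual    = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_actual - y_predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
print("Manual R-squared:", 1 - ss_res / ss_tot)
print("sklearn r2_score:", r2_score(y_actual, y_predicted))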
• Assumptions of Linear Regression
• Below are some important assumptions of Linear Regression. These are some formal
checks while building a Linear Regression model, which ensures to get the best possible
result from the given dataset.
• Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and
independent variables.
• Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and
the target variable; in other words, it is difficult to determine which predictor variable is
affecting the target variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.
• Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.
• Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal
distribution pattern. If error terms are not normally distributed, then confidence
intervals will become either too wide or too narrow, which may cause difficulties
in finding coefficients.
It can be checked using a Q-Q plot. If the plot shows a straight line without any
deviation, the error terms are normally distributed.
• No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there
is any correlation in the error terms, it will drastically reduce the
accuracy of the model. Autocorrelation usually occurs if there is a dependency
between residual errors.
• Logistic Regression in Machine Learning
• Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a given
set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must
be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but instead of
giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
• Logistic Regression is much like Linear Regression except in how they are used. Linear
Regression is used for solving Regression problems, whereas Logistic regression is used for solving
classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and can easily
determine the most effective variables used for the classification. The below image is showing the
logistic function:
Note: Logistic regression uses the concept of predictive modeling, like regression, which is why it is
called logistic regression; but because it is used to classify samples, it falls under the
classification algorithms.
• Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is
called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides between
the classes 0 and 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
• Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variable should not have multi-collinearity.
• Logistic Regression Equation:
• The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:
• We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
• In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation
by (1 - y):
y / (1 - y); this is 0 for y = 0 and infinity for y = 1
• But we need a range between -[infinity] and +[infinity], so taking the logarithm of the
equation, it becomes:
log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.
• Type of Logistic Regression:
• On the basis of the categories, Logistic Regression can be classified into three
types:
• Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".
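A minimal sketch of binomial logistic regression with scikit-learn is shown below; the tiny one-feature dataset is made up purely for illustration:

# A minimal sketch of binomial logistic regression (made-up data)
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # single feature
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # categorical outcome (0 or 1)

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict_proba([[4.5]]))  # probabilities for class 0 and class 1
print(clf.predict([[4.5]]))        # class chosen using the default 0.5 threshold

With three or more classes in y, the same scikit-learn class handles the multinomial case as well.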
• Some popular applications of linear regression are:
• Analyzing trends and sales estimates
• Salary forecasting
• Real estate prediction
• Arriving at ETAs in traffic.
• Logistic Regression:
• Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary or
discrete format such as 0 or 1.
• Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True
or False, Spam or not spam, etc.
• It is a predictive analysis algorithm which works on the concept of probability.
• Logistic regression is a type of regression, but it differs from the linear regression algorithm
in terms of how it is used.
• Logistic regression uses the sigmoid function, or logistic function, to model the data; the model
is trained by minimizing a corresponding (log-loss) cost function. The sigmoid function can be
represented as:
f(x) = 1 / (1 + e^(-x))
• f(x) = output between the 0 and 1 value.
• x = input to the function
• e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives the S-curve as follows:
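As a small sketch of that S-curve, the sigmoid can be evaluated directly (plotting is optional and omitted here):

# A minimal sketch of the sigmoid (logistic) function f(x) = 1 / (1 + e^(-x))
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 9)      # sample inputs
print(np.round(sigmoid(x), 3))   # outputs lie between 0 and 1 and trace the S-curve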
•In Polynomial regression, the original features are transformed into polynomial features of
given degree and then modeled using a linear model. Which means the datapoints are best
fitted using a polynomial line.
• The equation for polynomial regression is also derived from the linear regression
equation: the Linear regression equation Y = b0 + b1x is transformed into the
Polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
• Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is our independent/input variable.
• The model is still linear, because it is linear in the coefficients even though the features are quadratic and higher-degree terms of x.
• Note: This is different from Multiple Linear regression in such a way that in
Polynomial regression, a single element has different degrees instead of multiple
variables with the same degree.
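A minimal sketch of polynomial regression with scikit-learn's PolynomialFeatures follows; the data is made up to have a roughly cubic shape:

# A minimal sketch of polynomial regression: transform x into polynomial
# features of a given degree, then fit an ordinary linear model (made-up data)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 8, 27, 64, 125])             # roughly cubic relationship

poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)                # columns: 1, x, x^2, x^3

model = LinearRegression()
model.fit(x_poly, y)
print(model.predict(poly.transform([[6]])))   # prediction for a new point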
Distance metrics are a key part of several machine learning algorithms. These distance metrics are
used in both supervised and unsupervised learning, generally to calculate the similarity between
data points.
An effective distance metric improves the performance of our machine learning model, whether
that’s for classification tasks or clustering.

Let’s say we want to create clusters using K-Means Clustering, or use the k-Nearest Neighbour
algorithm to solve a classification or regression problem. How will you define the similarity
between different observations here? How can we say that two points are similar to each other?
This will happen if their features are similar, right? When we plot these points, they will be closer
to each other in distance.
• Hence, we can calculate the distance between points and then define the
similarity between them. Here’s the million-dollar question – how do we
calculate this distance and what are the different distance metrics in machine
learning?
• Types of Distance Metrics in Machine Learning
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
• Hamming Distance
• Let’s start with the most commonly used distance metric – Euclidean Distance.
Euclidean Distance
Euclidean Distance represents the shortest distance between two points.
Most machine learning algorithms including K-Means use this distance metric to measure the
similarity between observations. Let’s say we have two points as shown below:
• So, the Euclidean Distance between these two points A = (x1, y1) and B = (x2, y2) will be:
Here’s the formula for Euclidean Distance:
d(A, B) = sqrt( (x2 - x1)^2 + (y2 - y1)^2 )
We use this formula when we are dealing with 2 dimensions. We can generalize this for an n-
dimensional space as:
d(p, q) = sqrt( sum over i of (pi - qi)^2 )
Where,
•n = number of dimensions
•pi, qi = data points
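A minimal NumPy sketch of the n-dimensional Euclidean distance (the points are made up):

# A minimal sketch of Euclidean distance for n-dimensional points
import numpy as np

p = np.array([1, 2, 3])
q = np.array([4, 6, 3])

euclidean = np.sqrt(np.sum((p - q) ** 2))
print(euclidean)   # 5.0 here, since sqrt(3^2 + 4^2 + 0^2) = 5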
Manhattan Distance
Manhattan Distance is the sum of absolute differences between points across all
the dimensions.
We can represent Manhattan Distance as:
• Since the above representation is 2 dimensional, to calculate Manhattan
Distance, we will take the sum of absolute distances in both the x and y
directions. So, the Manhattan distance in a 2-dimensional space is given as:
d = |x2 - x1| + |y2 - y1|
And the generalized formula for an n-dimensional space is given as:
d(p, q) = sum over i of |pi - qi|
Where,
•n = number of dimensions
•pi, qi = data points
Minkowski Distance
Minkowski Distance is the generalized form of Euclidean and Manhattan Distance.
The formula for Minkowski Distance is given as:
D(a, b) = ( sum over i of |ai - bi|^p )^(1/p)
Here, p represents the order of the norm: p = 1 gives the Manhattan Distance and p = 2 gives the
Euclidean Distance. Let’s calculate the Minkowski Distance of the order 3:
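A small sketch that computes the Minkowski distance for orders 1, 2, and 3 on made-up points, showing that p = 1 recovers Manhattan distance and p = 2 recovers Euclidean distance:

# A minimal sketch of Minkowski distance; p = 1 gives Manhattan,
# p = 2 gives Euclidean, and p = 3 is the order-3 case mentioned above
import numpy as np

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1, 2, 3])
b = np.array([4, 6, 3])

print(minkowski(a, b, 1))   # Manhattan distance: 7.0
print(minkowski(a, b, 2))   # Euclidean distance: 5.0
print(minkowski(a, b, 3))   # Minkowski distance of order 3: (3^3 + 4^3)^(1/3), about 4.5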
Hamming Distance
Hamming Distance measures the similarity between two strings of the same length. The Hamming
Distance between two strings of the same length is the number of positions at which the
corresponding characters are different.
Let’s understand the concept using an example. Let’s say we have two strings:
“euclidean” and “manhattan”
Since the length of these strings is equal, we can calculate the Hamming Distance. We will go
character by character and match the strings. The first character of both the strings (e and m
respectively) is different. Similarly, the second character of both the strings (u and a) is different.
and so on.
Look carefully – seven characters are different whereas two characters (the last two characters)
are similar:

Hence, the Hamming Distance here will be 7. Note that the larger the Hamming Distance between
two strings, the more dissimilar those strings will be (and vice versa).
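A minimal sketch of the Hamming distance applied to the two strings above:

# A minimal sketch of Hamming distance between two equal-length strings
def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Strings must be of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("euclidean", "manhattan"))   # 7, as worked out above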
• Principal Component Analysis
• Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from the given dataset by reducing
the dimensionality while retaining as much of the variance as possible.
• PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data.
• PCA works by considering the variance of each attribute, because an attribute with high
variance gives a good split between the classes, and hence it reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation systems, and optimizing the
power allocation in various communication channels. It is a feature extraction technique, so it
keeps the important variables and drops the least important ones.
• The PCA algorithm is based on some mathematical concepts such as:
• Variance and Covariance
• Eigenvalues and Eigenvectors
• Some common terms used in PCA algorithm:
• Dimensionality: It is the number of features or variables present in the given dataset. More
easily, it is the number of columns present in the dataset.
• Correlation: It signifies that how strongly two variables are related to each other. Such as if one
changes, the other variable also gets changed. The correlation value ranges from -1 to +1. Here,
-1 occurs if variables are inversely proportional to each other, and +1 indicates that variables are
directly proportional to each other.
• Orthogonal: It defines that variables are not correlated to each other, and hence the correlation
between the pair of variables is zero.
• Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an
eigenvector of M if Mv is a scalar multiple of v.
• Covariance Matrix: A matrix containing the covariance between the pair of variables is called
the Covariance Matrix.
• Principal Components in PCA
• As described above, the transformed new features or the output of PCA are the
Principal Components. The number of these PCs are either equal to or less than
the original features present in the dataset. Some properties of these principal
components are given below:
• The principal component must be the linear combination of the original features.
• These components are orthogonal, i.e., the correlation between a pair of variables
is zero.
• The importance of each component decreases when going from 1 to n: the 1st
PC has the most importance, and the nth PC has the least importance.
• Steps for PCA algorithm
• Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is the
training set, and Y is the validation set.
• Representing data into a structure
Now we will represent our dataset into a structure. Such as we will represent the two-
dimensional matrix of independent variable X. Here each row corresponds to the data items, and
the column corresponds to the Features. The number of columns is the dimensions of the
dataset.
• Standardizing the data
In this step, we will standardize our dataset. Without standardization, in a particular column the features with
high variance would appear more important than the features with lower variance.
Since the importance of features should be independent of the variance of the feature, we divide
each data item in a column by the standard deviation of the column (after subtracting the column mean). Here we will name the
resulting matrix Z.
• Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After
transpose, we will multiply it by Z. The output matrix will be the Covariance matrix of Z.
• Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors for the resultant covariance matrix of
Z. Eigenvectors of the covariance matrix are the directions of the axes with high information,
and the corresponding eigenvalues measure the amount of variance along those directions.
• Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and sort them in decreasing order, which means
from largest to smallest, and simultaneously sort the eigenvectors accordingly in the matrix P of
eigenvectors. The resultant matrix will be named P*.
• Calculating the new features Or Principal Components
Here we will calculate the new features. To do this, we will multiply the matrix Z by P*. In
the resultant matrix Z*, each observation is a linear combination of the original features, and the
columns of the Z* matrix are independent of each other.
• Remove less important features from the new dataset.
The new feature set is now obtained, so we decide here what to keep and what to remove: we
keep only the relevant or important features in the new dataset, and the unimportant
features are removed. A NumPy sketch of these steps is given below.
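A minimal NumPy sketch of the steps above, using a small random matrix purely for illustration (NumPy's cov and eigh routines are used rather than hand-written ones):

# A minimal NumPy sketch of the PCA steps described above (illustrative data)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                     # 10 observations, 3 features

# Standardizing the data (subtract mean, divide by standard deviation)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of Z
cov = np.cov(Z, rowvar=False)

# Eigenvalues and eigenvectors of the covariance matrix
eig_values, eig_vectors = np.linalg.eigh(cov)

# Sorting the eigenvectors by decreasing eigenvalues (matrix P*)
order = np.argsort(eig_values)[::-1]
P_star = eig_vectors[:, order]

# New features / principal components: project Z onto the sorted eigenvectors
Z_star = Z @ P_star

# Keep only the most important components (here, the first two)
Z_reduced = Z_star[:, :2]
print(Z_reduced.shape)                           # (10, 2)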
• More specifically, the reason why it is critical to perform standardization prior to
PCA, is that the latter is quite sensitive regarding the variances of the initial
variables. That is, if there are large differences between the ranges of initial
variables, those variables with larger ranges will dominate over those with small
ranges (For example, a variable that ranges between 0 and 100 will dominate over
a variable that ranges between 0 and 1), which will lead to biased results. So,
transforming the data to comparable scales can prevent this problem.
• Mathematically, this can be done by subtracting the mean and dividing by the
standard deviation for each value of each variable:
z = (value - mean) / standard deviation
Once the standardization is done, all the variables will be transformed to the same
scale.
• COVARIANCE MATRIX COMPUTATION
• The aim of this step is to understand how the variables of the input data set are
varying from the mean with respect to each other, or in other words, to see if there
is any relationship between them. Because sometimes, variables are highly
correlated in such a way that they contain redundant information. So, in order to
identify these correlations, we compute the covariance matrix.
• The covariance matrix is a p × p symmetric matrix (where p is the number of
dimensions) that has as entries the covariances associated with all possible pairs of
the initial variables. For example, for a 3-dimensional data set with 3
variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:
[ Cov(x,x)  Cov(x,y)  Cov(x,z) ]
[ Cov(y,x)  Cov(y,y)  Cov(y,z) ]
[ Cov(z,x)  Cov(z,y)  Cov(z,z) ]
The sign of a covariance entry is what matters:
•if positive, then the two variables increase or decrease together (correlated)
•if negative, then one increases when the other decreases (inversely correlated)
Now that we know that the covariance matrix is no more than a table that
summarizes the correlations between all the possible pairs of variables, let’s move
to the next step.
• Mathematics Behind PCA
• PCA can be thought of as an unsupervised learning problem. The whole process
of obtaining principal components from a raw dataset can be simplified into six
parts:
• Take the whole dataset consisting of d+1 dimensions and ignore the labels such
that our new dataset becomes d dimensional.
• Compute the mean for every dimension of the whole dataset.
• Compute the covariance matrix of the whole dataset.
• Compute eigenvectors and the corresponding eigenvalues.
• Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with
the largest eigenvalues to form a d × k dimensional matrix W.
• Use this d × k eigenvector matrix to transform the samples onto the new
subspace.
• So, let’s unfurl the maths behind each of this one by one.
• Take the whole dataset consisting of d+1 dimensions and ignore the labels such
that our new dataset becomes d dimensional.
• Let’s say we have a dataset which is d+1 dimensional. Where d could be thought
as X_train and 1 could be thought as y_train (labels) in modern machine learning
paradigm. So, X_train + y_train makes up our complete train dataset.
• So, after we drop the labels we are left with d dimensional dataset and this would
be the dataset we will use to find the principal components. Also, let’s assume
we are left with a three-dimensional dataset after ignoring the labels i.e d = 3.
• we will assume that the samples stem from two different classes, where one-half
samples of our dataset are labeled class 1 and the other half class 2.
• Let our data matrix X be the score of three students :

Compute the mean of every dimension of the whole dataset.


The data from the above table can be represented in matrix A, where each column
in the matrix shows scores on a test and each row shows the score of a student.
• Matrix A

So, The mean of matrix A would be


Compute the covariance matrix of the whole dataset ( sometimes also called as
the variance-covariance matrix)
So, we can compute the covariance of two variables X and Y using the following
formula:
cov(X, Y) = sum over i of (Xi - mean(X)) * (Yi - mean(Y)) / (n - 1)
Using the above formula, we can find the covariance matrix of A. Also, the result
would be a square matrix of d × d dimensions.
Let’s rewrite our original matrix like this
Its covariance matrix would be
• Few points that can be noted here is :
• Shown in Blue along the diagonal, we see the variance of scores for each test. The
art test has the biggest variance (720); and the English test, the smallest (360). So
we can say that art test scores have more variability than English test scores.
• The covariance is displayed in black in the off-diagonal elements of the covariance matrix of A
• a) The covariance between math and English is positive (360), and the covariance
between math and art is positive (180). This means the scores tend to covary in a
positive way. As scores on math go up, scores on art and English also tend to go
up; and vice versa.
• b) The covariance between English and art, however, is zero. This means there
tends to be no predictable relationship between the movement of English and art
scores.
• Compute Eigenvectors and corresponding Eigenvalues
• Intuitively, an eigenvector is a vector whose direction remains unchanged when
a linear transformation is applied to it.
• Now, we can easily compute eigenvalue and eigenvectors from the covariance
matrix that we have above.
• Let A be a square matrix, ν a vector and λ a scalar that satisfies Aν = λν, then λ is
called eigenvalue associated with eigenvector ν of A.
• The eigenvalues of A are roots of the characteristic equation det(A - λI) = 0.
Calculating det(A - λI) first, where I is an identity matrix:

Simplifying the matrix first, we can calculate the determinant later,


• Now that we have our simplified matrix, we can find the determinant of the same
:
• After solving this equation for the value of λ, we get the following value

Eigenvalues
Now, we can calculate the eigenvectors corresponding to the above eigenvalues. We
will not show the full working here; each eigenvector ν is obtained by solving
(A - λI)ν = 0 for its corresponding eigenvalue λ.
So, after solving for eigenvectors we would get the following solution for the
corresponding eigenvalues
Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest
eigenvalues to form a d × k dimensional matrix W.
We started with the goal to reduce the dimensionality of our feature space, i.e., projecting the
feature space via PCA onto a smaller subspace, where the eigenvectors will form the axes of this
new feature subspace. However, the eigenvectors only define the directions of the new axis, since
they have all the same unit length 1.
So, in order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace,
we have to take a look at the corresponding eigenvalues of the eigenvectors. Roughly speaking,
the eigenvectors with the lowest eigenvalues bear the least information about the distribution of
the data, and those are the ones we want to drop.
The common approach is to rank the eigenvectors from highest to lowest corresponding
eigenvalue and choose the top k eigenvectors.
So, after sorting the eigenvalues in decreasing order, we have
For our simple example, where we are reducing a 3-dimensional feature space to a
2-dimensional feature subspace, we are combining the two eigenvectors with the
highest eigenvalues to construct our d×k dimensional eigenvector matrix W.
So, eigenvectors corresponding to two maximum eigenvalues are :
Transform the samples onto the new subspace
In the last step, we use the 3×2 dimensional matrix W that we just computed to
transform our samples onto the new subspace via the equation y = W′ ×
x, where W′ is the 2×3 transpose of the matrix W.
So lastly, we have computed our two principal components and projected the data
points onto the new subspace.
• Applications of Principal Component Analysis
• PCA is mainly used as the dimensionality reduction technique in various AI
applications such as computer vision, image compression, etc.
• It can also be used for finding hidden patterns if data has high dimensions. Some
fields where PCA is used are Finance, data mining, Psychology, etc.
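For completeness, here is a minimal sketch using scikit-learn's PCA class on illustrative random data (the sample size and feature count are assumptions):

# A minimal sketch using scikit-learn's PCA on illustrative data
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                 # 100 samples, 5 features (made up)

X_std = StandardScaler().fit_transform(X)     # standardize before PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                    # (100, 2): data projected onto 2 principal components
print(pca.explained_variance_ratio_)  # share of variance captured by each component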

• Naïve Bayes Classifier Algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training
dataset.
• Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can
make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
• Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
• Why is it called Naïve Bayes?
• The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which
can be described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a fruit is
identified on the basis of color, shape, and taste, then a red, spherical, and sweet
fruit is recognized as an apple. Hence each feature individually contributes to
identifying it as an apple, without depending on the other features.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
• Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
• The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
• Bayes' theorem in Artificial intelligence
• Bayes' theorem:
• Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning,
which determines the probability of an event with uncertain knowledge.
• In probability theory, it relates the conditional probability and marginal probabilities of
two random events.
• Bayes' theorem was named after the British mathematician Thomas Bayes.
The Bayesian inference is an application of Bayes' theorem, which is fundamental to
Bayesian statistics.
• It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).
• Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
• Example: If cancer corresponds to one's age then by using Bayes' theorem, we can
determine the probability of cancer more accurately with the help of age.
• Bayes' theorem can be derived using product rule and conditional probability of event A
with known event B:
• As from the product rule we can write:
• P(A ⋀ B) = P(A|B) P(B)
• Similarly, the probability of event B with known event A:
• P(A ⋀ B) = P(B|A) P(A)
• Equating the right hand sides of both equations, we will get:
P(A|B) = P(B|A) P(A) / P(B)        ...(a)
The above equation (a) is called as Bayes' rule or Bayes' theorem. This equation is basic of
most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
• P(A|B) is known as the posterior, which we need to calculate, and it is read as the
probability of hypothesis A when evidence B has occurred.
• P(B|A) is called the likelihood, in which we consider that hypothesis is true, then
we calculate the probability of evidence.
• P(A) is called the prior probability, probability of hypothesis before considering
the evidence
• P(B) is called marginal probability, pure probability of an evidence.
• In equation (a), in general, we can write P(B) = Σi P(Ai) * P(B|Ai), hence
Bayes' rule can be written as:
P(Ai|B) = P(Ai) * P(B|Ai) / Σk P(Ak) * P(B|Ak)
Where A1, A2, A3, ........, An is a set of mutually exclusive and exhaustive events.
• Applying Bayes' rule:
• Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A).
This is very useful in cases where we have a good probability of these three terms and want to
determine the fourth one. Suppose we want to perceive the effect of some unknown cause, and
want to compute that cause, then the Bayes' rule becomes:
P(cause|effect) = P(effect|cause) * P(cause) / P(effect)
Question: what is the probability that a patient has diseases meningitis with a stiff neck?
Given Data:
A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs 80% of the
time. He is also aware of some more facts, which are given as follows:
•The Known probability that a patient has meningitis disease is 1/30,000.
•The Known probability that a patient has a stiff neck is 2%.
Let a be the proposition that patient has stiff neck and b be the proposition that patient has meningitis. , so
we can calculate the following as:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
P(b|a) = P(a|b) * P(b) / P(a) = (0.8 × 1/30000) / 0.02 = 0.00133 ≈ 1/750
Hence, we can assume that 1 patient out of 750 patients with a stiff neck has the meningitis disease.
Question 2: From a standard deck of playing cards, a single card is drawn. The probability that
the card is king is 4/52, then calculate posterior probability P(King|Face), which means the
drawn face card is a king card.
Solution:
P(King|Face) = P(Face|King) * P(King) / P(Face)        ...(i)
P(King): probability that the card is a King = 4/52 = 1/13
P(Face): probability that a card is a face card = 12/52 = 3/13
P(Face|King): probability of a face card when we assume it is a king = 1
Putting all values in equation (i), we get:
P(King|Face) = 1 × (1/13) / (3/13) = 1/3
• Working of Naïve Bayes' Classifier:
• Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
• Suppose we have a dataset of weather conditions and corresponding target
variable "Play". So using this dataset we need to decide that whether we should
play or not on a particular day according to the weather conditions. So to solve
this problem, we need to follow the below steps:
• Convert the given dataset into frequency tables.
• Generate Likelihood table by finding the probabilities of given features.
• Now, use Bayes theorem to calculate the posterior probability.
• Problem: If the weather is sunny, then the Player should play or not?
• Solution: To solve this, first consider the below dataset:
Outlook Play

0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather Conditions:
Weather     Yes    No
Overcast     5      0
Rainy        2      2
Sunny        3      2
Total       10      4
Likelihood table for the weather condition:
Weather     No            Yes           P(Weather)
Overcast    0             5             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71
• Applying Bayes'theorem:
• P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
• P(Sunny|Yes)= 3/10= 0.3
• P(Sunny)= 0.35
• P(Yes)=0.71
• So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
• P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
• P(Sunny|NO)= 2/4=0.5
• P(No)= 0.29
• P(Sunny)= 0.35
• So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
• So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)
• Hence on a Sunny day, Player can play the game.
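The counts and posterior probabilities above can be reproduced with a few lines of plain Python; the lists below are copied from the table of 14 weather observations (the small differences from 0.41 above come from the rounded table values):

# Reproducing the weather/play example above with plain Python
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
p_sunny = outlook.count("Sunny") / n     # P(Sunny) = 5/14
p_yes = play.count("Yes") / n            # P(Yes)   = 10/14
p_no = play.count("No") / n              # P(No)    = 4/14

sunny_given_yes = sum(o == "Sunny" and c == "Yes" for o, c in zip(outlook, play)) / play.count("Yes")
sunny_given_no  = sum(o == "Sunny" and c == "No" for o, c in zip(outlook, play)) / play.count("No")

print(sunny_given_yes * p_yes / p_sunny)   # P(Yes|Sunny) = 0.60
print(sunny_given_no * p_no / p_sunny)     # P(No|Sunny)  = 0.40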
• Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.
• Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
• Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis.
• Types of Naïve Bayes Model:
• There are three types of Naive Bayes Model, which are given below:
• Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means if predictors take continuous values instead of discrete,
then the model assumes that these values are sampled from the Gaussian
distribution.
• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification problems,
i.e., deciding which category a particular document belongs to, such as Sports, Politics,
Education, etc.
The classifier uses the frequency of words as the predictors.
• Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but
the predictor variables are the independent Booleans variables. Such as if a
particular word is present or not in a document. This model is also famous for
document classification tasks.
• Python Implementation of the Naïve Bayes algorithm:
• Now we will implement a Naive Bayes Algorithm using Python. So for this, we
will use the "user_data" dataset, which we have used in our other classification
model. Therefore we can easily compare the Naive Bayes model with the other
models.
• Steps to implement:
• Data Pre-processing step
• Fitting Naive Bayes to the Training set
• Predicting the test result
• Test accuracy of the result(Creation of Confusion matrix)
• Visualizing the test set result.
• Data Pre-processing step:
• In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is similar as we did in data-pre-
processing. The code for this is given below:
• Importing the libraries
• import numpy as nm
• import matplotlib.pyplot as mtp
• import pandas as pd

• # Importing the dataset
• dataset = pd.read_csv('user_data.csv')
• x = dataset.iloc[:, [2, 3]].values
• y = dataset.iloc[:, 4].values

• # Splitting the dataset into the Training set and Test set
• from sklearn.model_selection import train_test_split
• x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

• # Feature Scaling
• from sklearn.preprocessing import StandardScaler
• sc = StandardScaler()
• x_train = sc.fit_transform(x_train)
• x_test = sc.transform(x_test)
• In the above code, we have loaded the dataset into our program using "dataset =
pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test
set, and then we have scaled the feature variable.
• The output for the dataset is given as:
• Fitting Naive Bayes to the Training Set:
• After the pre-processing step, now we will fit the Naive Bayes model to the
Training set. Below is the code for it:
• # Fitting Naive Bayes to the Training set
• from sklearn.naive_bayes import GaussianNB
• classifier = GaussianNB()
• classifier.fit(x_train, y_train)
• In the above code, we have used the GaussianNB classifier to fit it to the training
dataset. We can also use other classifiers as per our requirement.
• Prediction of the test set result:
• Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions.
• # Predicting the Test set results
• y_pred = classifier.predict(x_test)
• The above output shows the result for prediction vector y_pred and real vector
y_test. We can see that some predications are different from the real values,
which are the incorrect predictions.
• 4) Creating Confusion Matrix:
• Now we will check the accuracy of the Naive Bayes classifier using the Confusion
matrix. Below is the code for it:
• # Making the Confusion Matrix
• from sklearn.metrics import confusion_matrix
• cm = confusion_matrix(y_test, y_pred)
Output:

As we can see in the above confusion matrix output, there are 7+3= 10 incorrect
predictions, and 65+25=90 correct predictions.
• Visualizing the training set result:
• Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code for it:
• # Visualising the Training set results
• from matplotlib.colors import ListedColormap
• x_set, y_set = x_train, y_train
• X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
• nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
• mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
• alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
• mtp.xlim(X1.min(), X1.max())
• mtp.ylim(X2.min(), X2.max())
• for i, j in enumerate(nm.unique(y_set)):
• mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
• c = ListedColormap(('purple', 'green'))(i), label = j)
• mtp.title('Naive Bayes (Training set)')
• mtp.xlabel('Age')
• mtp.ylabel('Estimated Salary')
• mtp.legend()
• mtp.show()
Output:

In the above output we can see that the Naïve Bayes classifier has segregated the data points
with a fine boundary. The boundary is a Gaussian-shaped curve because we have used the GaussianNB classifier in our
code.
• # Visualising the Test set results
• from matplotlib.colors import ListedColormap
• x_set, y_set = x_test, y_test
• X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
• nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
• mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
• alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
• mtp.xlim(X1.min(), X1.max())
• mtp.ylim(X2.min(), X2.max())
• for i, j in enumerate(nm.unique(y_set)):
• mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
• c = ListedColormap(('purple', 'green'))(i), label = j)
• mtp.title('Naive Bayes (test set)')
• mtp.xlabel('Age')
• mtp.ylabel('Estimated Salary')
• mtp.legend()
• mtp.show()
Output:

The above output is the final output for the test set data. As we can see, the classifier has created a
Gaussian-shaped curve to divide the "purchased" and "not purchased" variables. There are some wrong
predictions, which we have counted in the Confusion matrix, but it is still a pretty good classifier.
• The Bayes Optimal Classifier is a probabilistic model that makes the most probable
prediction for a new example: it uses the training data and the space of hypotheses to find
the most probable prediction for a new data instance.
• Data Mining Bayesian Classifiers
• In numerous applications, the connection between the attribute set and the class variable is non-
deterministic. In other words, the class label of a test record cannot be assumed with
certainty even though its attribute set is the same as some of the training examples. These
circumstances may emerge due to noisy data or the presence of certain confounding factors that
influence classification but are not included in the analysis. For example, consider the task of
predicting whether an individual is at risk for liver illness based on the individual's
eating habits and working efficiency. Although most people who eat healthily and exercise
consistently have a lower probability of liver disease, they may still develop it due to
other factors, for example consumption of high-calorie street food or alcohol
abuse. Determining whether an individual's eating routine is healthy or their workout efficiency is
sufficient is also subject to analysis, which in turn may introduce uncertainties into the learning
problem.
• Bayesian classification uses Bayes' theorem to predict the occurrence of any
event. Bayesian classifiers are statistical classifiers built on the Bayesian
understanding of probability. The theorem expresses how a level of belief,
expressed as a probability, should be updated to account for evidence.
• Bayes' theorem is named after Thomas Bayes, who first utilized
conditional probability to provide an algorithm that uses evidence to calculate
limits on an unknown parameter.
• Bayes' theorem is expressed mathematically by the following equation:
P(X|Y) = P(Y|X) * P(X) / P(Y)
• Where X and Y are events and P(Y) ≠ 0
• P(X|Y) is a conditional probability that describes the occurrence of event X
given that Y is true.
• P(Y|X) is a conditional probability that describes the occurrence of event Y
given that X is true.
• P(X) and P(Y) are the probabilities of observing X and Y independently of each
other. This is known as the marginal probability.
• Bayesian interpretation:
• In the Bayesian interpretation, probability measures a "degree of belief." Bayes'
theorem connects the degree of belief in a hypothesis before and after
accounting for evidence. For example, let us consider the example of a coin. If
we toss a coin, then we get either heads or tails, and the chance of occurrence of
either heads or tails is 50%. If the coin is flipped a number of times and the
outcomes are observed, the degree of belief may rise, fall, or remain the same
depending on the outcomes.
For proposition X and evidence Y,
•P(X), the prior, is the initial degree of belief in X.
•P(X|Y), the posterior, is the degree of belief after having accounted for Y.
•The quotient P(Y|X)/P(Y) represents the support Y provides for X.

Bayes' theorem can be derived from conditional probability:

P(X|Y) = P(X ⋂ Y) / P(Y)   and   P(Y|X) = P(X ⋂ Y) / P(X)

Where P(X ⋂ Y) is the joint probability of both X and Y being true. Equating
P(X ⋂ Y) from the two expressions gives Bayes' theorem:

P(X|Y) = P(Y|X) * P(X) / P(Y)
• Bayesian network:
• A Bayesian Network falls under the classification of Probabilistic Graphical
Modelling (PGM) procedure that is utilized to compute uncertainties by utilizing
the probability concept. Generally known as Belief Networks, Bayesian
Networks are used to show uncertainties using Directed Acyclic Graphs (DAG)
• A Directed Acyclic Graph is used to show a Bayesian Network, and like some
other statistical graph, a DAG consists of a set of nodes and links, where the
links signify the connection between the nodes.

The nodes here represent random variables, and the edges define the relationship between these
variables.
• A DAG models the uncertainty of an event taking place based on the
Conditional Probability Distribution (CPD) of each random variable.
A Conditional Probability Table (CPT) is used to represent the CPD of each
variable in a network.
• K-Nearest Neighbor(KNN) Algorithm for Machine Learning
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
• K-NN algorithm assumes the similarity between the new case/data and available cases and puts the new
case into the category that is most similar to the available categories.
• K-NN algorithm stores all the available data and classifies a new data point based on the similarity. This
means when new data appears it can be easily classified into a well-suited category by using the K-NN
algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies
that data into a category that is much similar to the new data.
• Example: Suppose, we have an image of a creature that looks similar to cat and dog, but we want to
know either it is a cat or dog. So for this identification, we can use the KNN algorithm, as it works on a
similarity measure. Our KNN model will find the similar features of the new data set to the cats and dogs
images and based on the most similar features it will put it in either cat or dog category.

Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1, so this data point will lie in which of these categories. To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular dataset. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
•Step-1: Select the number K of the neighbors
•Step-2: Calculate the Euclidean distance of K number of neighbors
•Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
•Step-4: Among these k neighbors, count the number of the data points in each category.
•Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
•Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

•Firstly, we will choose the number of neighbors, so we will choose the k=5.
•Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two
points, which we have already studied in geometry. It can be calculated as:
•By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in category A and two
nearest neighbors in category B. Consider the below image:
•As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
• There is no particular way to determine the best value for "K", so we need to try
some values to find the best out of them; a common starting value for K is 5. A small
cross-validation sketch for choosing K is given below.
• A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects
of outliers in the model.
• Large values for K reduce noise, but they can smooth over small local patterns and
increase computation.
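One common, informal way to pick K is to cross-validate a few candidate values and keep the best one; here is a minimal sketch on synthetic data (generated with scikit-learn purely for illustration):

# A minimal sketch: choosing K by cross-validation on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(k, round(score, 3))        # pick the K with the best mean accuracy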
• Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
• Disadvantages of KNN Algorithm:
• Always needs to determine the value of K which may be complex some time.
• The computation cost is high because of calculating the distance between the data points
for all the training samples.
• Python implementation of the KNN algorithm
• To do the Python implementation of the K-NN algorithm, we will use the same problem
and dataset which we have used in Logistic Regression. But here we will improve the
performance of the model. Below is the problem description:
• Problem for K-NN Algorithm: There is a Car manufacturer company that has
manufactured a new SUV car. The company wants to give the ads to the users who are
interested in buying that SUV. So for this problem, we have a dataset that contains
multiple user's information through the social network. The dataset contains lots of
information but the Estimated Salary and Age we will consider for the independent
variable and the Purchased variable is for the dependent variable. Below is the dataset:
• Steps to implement the K-NN algorithm:
• Data Pre-processing step
• Fitting the K-NN algorithm to the Training set
• Predicting the test result
• Test accuracy of the result(Creation of Confusion matrix)
• Visualizing the test set result.
• Data Pre-Processing Step:
• The Data Pre-processing step will remain exactly the same as Logistic Regression.
Below is the code for it:
• # importing libraries
• import numpy as nm
• import matplotlib.pyplot as mtp
• import pandas as pd
• #importing datasets
• data_set= pd.read_csv('user_data.csv')
• #Extracting Independent and dependent Variable
• x= data_set.iloc[:, [2,3]].values
• y= data_set.iloc[:, 4].values
• # Splitting the dataset into training and test set.
• from sklearn.model_selection import train_test_split
• x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
• #feature Scaling
• from sklearn.preprocessing import StandardScaler
• st_x= StandardScaler()
• x_train= st_x.fit_transform(x_train)
• x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed.
After feature scaling our test dataset will look like:
• From the above output image, we can see that our data is successfully scaled.
• Fitting K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this we will import
the KNeighborsClassifier class of Sklearn Neighbors library. After importing
the class, we will create the Classifier object of the class. The Parameter of this
class will be
• n_neighbors: To define the required neighbors of the algorithm. Usually, it takes 5.
• metric='minkowski': This is the default parameter and it decides the distance between the
points.
• p=2: It is equivalent to the standard Euclidean metric.
• And then we will fit the classifier to the training data. Below is the code for it:
• #Fitting K-NN classifier to the training set
• from sklearn.neighbors import KNeighborsClassifier
• classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )
• classifier.fit(x_train, y_train)
• Output: By executing the above code, we will get the output as:
• Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:
• #Predicting the test set result
• y_pred= classifier.predict(x_test)
• Output:
• The output for the above code will be:
• #Creating the Confusion matrix
• from sklearn.metrics import confusion_matrix
• cm= confusion_matrix(y_test, y_pred)
• In the above code, we have imported the confusion_matrix function and stored
its result in the variable cm.
• Output: By executing the above code, we will get the matrix as below:
• In the above image, we can see there are 64+29= 93 correct predictions and 3+4=
7 incorrect predictions, whereas, in Logistic Regression, there were 11 incorrect
predictions. So we can say that the performance of the model is improved by
using the K-NN algorithm.
• Visualizing the Training set result:
Now, we will visualize the training set result for K-NN model. The code will
remain same as we did in Logistic Regression, except the name of the graph.
Below is the code for it:
• #Visualizing the training set result
• from matplotlib.colors import ListedColormap
• x_set, y_set = x_train, y_train
• x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() -
1, stop = x_set[:, 0].max() + 1, step =0.01),
• nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
• mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
• alpha = 0.75, cmap = ListedColormap(('red','green' )))
• mtp.xlim(x1.min(), x1.max())
• mtp.ylim(x2.min(), x2.max())
• for i, j in enumerate(nm.unique(y_set)):
• mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
• c = ListedColormap(('red', 'green'))(i), label = j)
• mtp.title('K-NN Algorithm (Training set)')
• mtp.xlabel('Age')
• mtp.ylabel('Estimated Salary')
• mtp.legend()
• mtp.show()
Output:
By executing the above code, we will get the below graph:
• The output graph is different from the graph which we obtained in Logistic
Regression. It can be understood through the below points:
• As we can see the graph is showing the red point and green points. The green
points are for Purchased(1) and Red Points for not Purchased(0) variable.
• The graph is showing an irregular boundary instead of showing any straight line
or any curve because it is a K-NN algorithm, i.e., finding the nearest neighbor.
• The graph has classified users in the correct categories as most of the users who
didn't buy the SUV are in the red region and users who bought the SUV are in
the green region.
• The graph is showing a good result, but still there are some green points in the
red region and red points in the green region. This is not a big issue, as tolerating
these few misclassifications prevents the model from overfitting.
• Hence our model is well trained.
• Visualizing the Test set result:
After the training of the model, we will now test the result by putting a new
dataset, i.e., Test dataset. Code remains the same except some minor changes:
such as x_train and y_train will be replaced by x_test and y_test.
Below is the code for it:
• #Visualizing the test set result
• from matplotlib.colors import ListedColormap
• x_set, y_set = x_test, y_test
• x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() -
1, stop = x_set[:, 0].max() + 1, step =0.01),
• nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
• mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
• alpha = 0.75, cmap = ListedColormap(('red','green' )))
• mtp.xlim(x1.min(), x1.max())
• mtp.ylim(x2.min(), x2.max())
• for i, j in enumerate(nm.unique(y_set)):
• mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
• c = ListedColormap(('red', 'green'))(i), label = j)
• mtp.title('K-NN algorithm(Test set)')
• mtp.xlabel('Age')
• mtp.ylabel('Estimated Salary')
• mtp.legend()
• mtp.show()
Output:

The above graph is showing the output for the test data set. As we can see in the graph, the predicted output is quite
good, as most of the red points are in the red region and most of the green points are in the green region.
However, there are a few green points in the red region and a few red points in the green region. These are the
incorrect observations that we counted in the confusion matrix (7 incorrect outputs).
• Radial Basis Functions Neural Networks — All we need to know
• In Single Perceptron / Multi-layer Perceptron(MLP), we only have linear
separability because they are composed of input and output layers(some
hidden layers in MLP)
• ⁃ For example, the AND and OR functions are linearly separable, while the XOR function
is not linearly separable.
• We need at least one hidden layer to derive a non-linear separation.
• ⁃ What our RBNN does is transform the input signal into another form,
which can then be fed into the network to get linear separability.
• ⁃ RBNN is structurally the same as a perceptron (MLP).
• RBNN is composed of input, hidden, and output layer. RBNN is strictly limited to
have exactly one hidden layer. We call this hidden layer as feature vector.
• RBNN increases dimension of feature vector.

Simplest diagram shows the architecture of RBNN


Extended diagram shows the architecture of RBNN with hidden functions.
We apply non-linear transfer function to the feature vector before we go for classification
problem.
⁃ When we increase the dimension of the feature vector, the linear separability of feature vector
increases.
A non-linearity separable problem(pattern classification problem) is highly separable in high
dimensional space than it is in low dimensional space.

What is a Radial Basis Function ?

⁃ We define a receptor t.
⁃ We draw contour maps around the receptor.
⁃ Gaussian functions are generally used for the Radial Basis Function (contour mapping). So we
define the radial distance r = ||x - t||.
Gaussian Radial Function :=

ϕ(r) = exp (- r²/2σ²)

where σ > 0
Classification only happens in the second phase, where a linear combination of the hidden functions
is driven to the output layer.
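A tiny sketch of the Gaussian radial basis function applied to the distance between an input x and a receptor t (the vectors and the σ value are made up for illustration):

# A minimal sketch of a Gaussian radial basis function
import numpy as np

def gaussian_rbf(x, t, sigma):
    r = np.linalg.norm(x - t)          # radial distance r = ||x - t||
    return np.exp(-r**2 / (2 * sigma**2))

x = np.array([1.0, 0.0])               # input vector
t = np.array([0.0, 0.0])               # receptor (centre)
print(gaussian_rbf(x, t, sigma=1.0))   # close to 1 near the receptor, decays with distance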
• Example. XOR function :-
• ⁃ I have 4 inputs and I will not increase the dimension of the feature vector here. So I will
select 2 receptors here. For each transformation function ϕ(x), we will have a
receptor t.
• ⁃ Now consider the RBNN architecture,
• P := # of input features/values.
• ⁃ M = # of transformed vector dimensions (hidden layer width). So usually M ≥ P.
• ⁃ Each node in the hidden layer performs a non-linear radial basis function.
• ⁃ The output C remains the same as for classification problems (a certain number of
predefined class labels).
Architecture of XOR RBNN
Transformation function with receptors and variances.
Output → linear combination of transformation function is tabulated.
• Only the nodes in the hidden layer perform the radial basis transformation function.
• ⁃ The output layer performs a linear combination of the outputs of the hidden layer
to give a final probabilistic value at the output layer.
• ⁃ So the classification is done only at the (hidden layer → output layer) stage.
• Training the RBNN :-
• ⁃ First, we should train the hidden layer using back propagation.
• ⁃ Neural Network training(back propagation) is a curve fitting method.
It fits a non-linear curve during the training phase. It runs through stochastic
approximation, which we call the back propagation.
• ⁃ For each of the node in the hidden layer, we have to find t(receptors) & the
variance (σ)[variance — the spread of the radial basis function]
• ⁃ On the second training phase, we have to update the weighting
vectors between hidden layers & output layers.
• ⁃ In the hidden layer, each node represents one transformation basis
function. Any one of the functions could satisfy the non-linear separability, or
even a combination of a set of functions could satisfy the non-linear separability.
• So in our hidden layer transformation, all the non-linearity terms are included.
Say like X² + Y² + 5XY ; its all included in a hyper-surface equation(X & Y are
inputs).
• ⁃ Therefore, the first stage of training is done by clustering algorithm. We
define the number of cluster centers we need. And by clustering algorithm, we
compute the cluster centers, which then is assigned as the receptors for each
hidden neurons.
• ⁃ I have to cluster N samples or observations into M clusters (N > M).
• ⁃ So the output “clusters” are the “receptors”.
• ⁃ For each receptor, I can find the variance as “the mean of the squared
distances between the respective receptor & the cluster's nearest
samples”: σ² = (1/N) * Σ ||X - t||²
• The interpretation of the first training phase is that the “feature vector is
projected onto the transformed space”.
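Putting the two training phases together, here is a hedged sketch of an RBF network for XOR. The two receptors are fixed at (0,0) and (1,1) to match the XOR example above; in general they would be found by a clustering algorithm such as K-Means, and σ = 1 is an illustrative choice:

# A minimal sketch of the two-phase RBF network training described above (XOR)
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
y = np.array([0, 1, 1, 0], dtype=float)                        # XOR targets

centers = np.array([[0.0, 0.0], [1.0, 1.0]])                   # receptors t (fixed for this sketch)
sigma = 1.0                                                    # spread, chosen for illustration

# Phase 1 output: transform inputs into RBF feature space
d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # distances ||x - t||
Phi = np.exp(-d**2 / (2 * sigma**2))                               # Gaussian RBF features

# Phase 2: solve the linear output weights (plus a bias) by least squares
A = np.column_stack([Phi, np.ones(len(X))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.round(A @ w, 2))    # approximately [0, 1, 1, 0]: XOR recovered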

Complex diagram depicting the RBNN


Advantages of using RBNN over the MLP :-
1. Training in RBNN is faster than in a Multi-layer Perceptron (MLP) → MLP needs many
iterations.
2. We can easily interpret the meaning / function of each node in the
hidden layer of the RBNN. This is difficult in MLP.
3. The parameterization (what should be the # of nodes in the hidden layer & the # of hidden layers)
is difficult in MLP. But this problem is not found in RBNN.
4. On the other hand, classification will take more time in RBNN than in MLP.
• What are the fundamental differences between RBF and FFNN?
• The fundamental differences between Radial Basis Functions and Feedforward Neural
Networks are described below.
• RBF has localized basis functions (e.g. Gaussian) whereas FFNN has global basis
functions (sigmoid).
• RBFs can be solved using linear regression (if the spread (S) and number of basis
functions (m) is fixed) while FFNNs require non-linear regression, i.e. using an
optimization algorithm to minimize error. That slows down the regression.
• Since RBFs are cheaper to solve, we have embedded cross-validation in the outer loop
of the regression to determine m and S. Cross-validation means that we minimize the
PRESS error over m and S to theoretically get the best predictor.
• Because we do the non-linear regression of FFNNs from a random starting point, they
tend to have a little variation. To counter the variation, we generate ensembles of
networks which are then averaged. An ensemble typically has 9 members (but can be
adjusted by the user).
• In our experience, FFNNs are better at approximating smooth functions since they tend
to interpolate rather than average. They also seem better at approximating sparse
data, mainly because cross-validation does not work well for sparse sets. They also
seem to be more accurate for non-uniform point distributions, again mainly because
cross-validation works better for uniformly dense sets.
• For uniformly dense sets, RBFs may be better since, theoretically, cross-validation
should provide a more accurate response surface. As you probably know
obtaining a dense set for high dimensionality can be very expensive and would
typically run into the thousands of simulations for only 50 variables.

• We have also had feedback from large automotive users that they prefer FFNNs,
but cannot afford them (a typical automotive design problem might have 7 cases,
50 variables and 100 constraint functions). So when using ensembles, typically
4500 neural networks must be computed individually (including the ensembles
and hidden nodes options). This could take days.

• For an optimization in which the user is really only interested to arrive at a single
design point (i.e. a converged solution), using the default SRSM (sequential)
approach with linear basis functions (the default approach) is still the best and
cheapest. It also works well for a large number of variables and its cost is in a
linear relation to the number of variables.
