Inductive Learning and Machine Learning
• In the real world, humans learn from their experiences, while computers and machines simply follow our instructions. But can a machine also learn from experience or past data the way a human does? This is where Machine Learning comes in.
• Machine Learning is a subset of artificial intelligence that is mainly concerned with developing algorithms which allow a computer to learn from data and past experience on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:
• Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly
programmed.
• With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together for creating predictive models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the better the performance.
• A machine has the ability to learn if it can improve its performance by
gaining more data.
• How does Machine Learning work?
• A Machine Learning system learns from historical data, builds prediction models, and whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data: a large amount of data helps to build a better model, which predicts the output more accurately.
• Suppose we have a complex problem where we need to perform some predictions. Instead of writing code for it, we just feed the data to generic algorithms; with the help of these algorithms, the machine builds the logic from the data and predicts the output. Machine learning has changed our way of thinking about such problems. The block diagram below explains the working of a Machine Learning algorithm:
Features of Machine Learning:
•Machine learning uses data to detect various patterns in a given dataset.
•It can learn from past data and improve automatically.
•It is a data-driven technology.
•Machine learning is quite similar to data mining, as it also deals with huge amounts of data.
• Need for Machine Learning
• The need for machine learning is increasing day by day. The reason is that machine learning can do tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot access and process huge amounts of data manually, so we need computer systems, and this is where machine learning makes things easy for us.
• We can train machine learning algorithms by providing them huge amounts of data and letting them explore the data, construct the models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.
• The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interests and recommend products accordingly.
• Following are some key points which show the importance of Machine
Learning:
• Rapid increment in the production of data
• Solving complex problems, which are difficult for a human
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful information from data.
• Classification of Machine Learning
• At a broad level, machine learning can be classified into three types:
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Supervised Learning
• Supervised learning is a type of machine learning method in which we provide sample labeled data to the machine learning system in order to train it, and on that basis, it predicts the output.
• The system creates a model using labeled data to understand the datasets and learn about each data point. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output or not.
• The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns things under the supervision of a teacher. An example of supervised learning is spam filtering.
Supervised learning can be grouped further in two categories of algorithms:
•Classification
•Regression
• Classification Algorithm in Machine Learning
• The Supervised Machine Learning algorithms can be broadly classified into Regression and Classification algorithms. Regression algorithms predict the output for continuous values, whereas Classification algorithms are needed to predict categorical values.
• What is the Classification Algorithm?
• The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data. In Classification, a
program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets/labels or categories.
• Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with
the corresponding output.
• In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):
• y = f(x), where y = categorical output
• The best example of an ML classification algorithm is Email Spam Detector.
• The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.
• Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are similar
to each other and dissimilar to other classes.
• The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
• Binary Classifier: If the classification problem has only two possible outcomes, it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
• Multi-class Classifier: If a classification problem has more than two outcomes, it is called a Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
• Learners in Classification Problems:
• In the classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner's case, classification is done on the basis of the most closely related data stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
• Eager Learners: Eager Learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to Lazy learners, Eager Learner takes
more time in learning, and less time in prediction. Example: Decision Trees, Naïve
Bayes, ANN.
• Types of ML Classification Algorithms:
• Classification Algorithms can be further divided into mainly two categories:
• Linear Models
• Logistic Regression
• Support Vector Machines
• Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
• Evaluating a Classification model:
• Once our model is complete, it is necessary to evaluate its performance, whether it is a Classification or a Regression model. For evaluating a Classification model, we have the following ways:
• Log Loss or Cross-Entropy Loss:
• It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
• For a good binary Classification model, the value of log loss should be near 0.
• The value of log loss increases if the predicted value deviates from the actual value.
• A lower log loss represents a higher accuracy of the model.
• For binary classification, cross-entropy (log loss) can be calculated as:
Log loss = -(y log(p) + (1 - y) log(1 - p)), where y is the actual label (0 or 1) and p is the predicted probability of class 1.
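• A minimal numerical sketch of the log loss formula above, using hypothetical true labels and predicted probabilities (the array values are invented for illustration):

import numpy as np

# Hypothetical ground-truth labels and predicted probabilities (illustrative only)
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.6, 0.1])

eps = 1e-15                                   # avoid log(0)
p = np.clip(y_prob, eps, 1 - eps)
# Binary cross-entropy: average of -[y*log(p) + (1-y)*log(1-p)]
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(round(log_loss, 4))                     # ≈ 0.23 here; closer to 0 means a better classifier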
• Cross-Entropy Loss Function
• When working on a Machine Learning or Deep Learning problem, loss/cost functions are used to optimize the model during training. The objective is almost always to minimize the loss function: the lower the loss, the better the model. Cross-Entropy loss is one of the most important cost functions. It is used to optimize classification models. Understanding Cross-Entropy relies on understanding the Softmax activation function.
• Consider a 4-class classification task where an image is classified as either a dog,
cat, horse or cheetah.
• In the above Figure, Softmax converts logits into probabilities. The purpose of the
Cross-Entropy is to take the output probabilities (P) and measure the distance
from the truth values (as shown in Figure below).
For the example above the desired output is [1,0,0,0] for the class dog but the model
outputs [0.775, 0.116, 0.039, 0.070] .
The objective is to make the model output be as close as possible to the desired output (truth
values). During model training, the model weights are iteratively adjusted accordingly with the
aim of minimizing the Cross-Entropy loss. The process of adjusting the weights is what
defines model training and as the model keeps training and the loss is getting minimized, we
say that the model is learning.
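• A short sketch reproducing the dog/cat/horse/cheetah example above: the logit values are an assumption chosen so that Softmax yields roughly the quoted probabilities, and the cross-entropy is then measured against the one-hot truth [1, 0, 0, 0]:

import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalise the exponentials
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Hypothetical logits for the classes [dog, cat, horse, cheetah]
logits = np.array([3.2, 1.3, 0.2, 0.8])
probs = softmax(logits)                  # roughly [0.775, 0.116, 0.039, 0.070]

truth = np.array([1, 0, 0, 0])           # one-hot target for the class "dog"
cross_entropy = -np.sum(truth * np.log(probs))
print(np.round(probs, 3), round(cross_entropy, 3))   # loss ≈ 0.255, i.e. -log(0.775)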
The concept of cross-entropy traces back into the field of Information Theory where Claude
Shannon introduced the concept of entropy in 1948. Before diving into Cross-Entropy cost
function, let us introduce entropy .
Entropy
The entropy of a random variable X is the level of uncertainty inherent in the variable's possible outcomes. For a probability distribution p(x) of a random variable X, entropy is defined as follows:
H(X) = -Σ p(x) log(p(x)), summing over the possible outcomes x
Reason for the negative sign: log(p(x)) < 0 for all p(x) in (0, 1), since p(x) is a probability distribution and its values must range between 0 and 1.
The greater the value of entropy H(X), the greater the uncertainty of the probability distribution, and the smaller the value, the less the uncertainty.
• Example
• Consider the following 3 “containers” with shapes: triangles and circles
• Container 1: The probability of picking a triangle is 26/30, and the probability of picking a circle is 4/30. The outcome of a pick from this container is therefore fairly certain.
• Container 2: The probability of picking a triangular shape is 14/30, and 16/30 otherwise. There is almost a 50-50 chance of picking either shape, so there is less certainty about a pick than in container 1.
• Container 3: A shape picked from container 3 is highly likely to be a circle. The probability of picking a circle is 29/30, and the probability of picking a triangle is 1/30. It is highly certain that the shape picked will be a circle.
• Let us calculate the entropy so that we can verify our assertions about the certainty of picking a given shape (see the sketch below).
• As expected, the entropy for the first and third containers is smaller than for the second one. This is because the probability of picking a given shape is more certain in containers 1 and 3 than in container 2.
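• A small sketch that computes the entropy (in bits) of the three containers described above, so the certainty claims can be checked numerically:

import numpy as np

def entropy(probs):
    # H(X) = -sum p(x) * log2 p(x), measured in bits
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log2(probs))

containers = {
    "container 1": [26/30, 4/30],    # mostly triangles
    "container 2": [14/30, 16/30],   # roughly even split
    "container 3": [1/30, 29/30],    # mostly circles
}
for name, p in containers.items():
    print(name, round(entropy(p), 3))
# Roughly 0.567, 0.997 and 0.211 bits: the near 50-50 container has the
# highest uncertainty, matching the argument above.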
Cross-Entropy Loss Function
• Also called logarithmic loss, log loss or logistic loss. Each predicted class probability is compared to the actual class's desired output (0 or 1), and a score/loss is calculated that penalizes the probability based on how far it is from the actual expected value. The penalty is logarithmic in nature, yielding a large score for large differences close to 1 and a small score for small differences tending to 0.
• Cross-entropy loss is used when adjusting model weights during training. The aim is to minimize the loss, i.e., the smaller the loss, the better the model. A perfect model has a cross-entropy loss of 0.
Cross-entropy is defined as
CE = -Σ ti log(pi), where ti is the true (one-hot) label and pi is the predicted (Softmax) probability for the i-th class.
•If the shape of the object is rounded and has a depression at the top, is red in color, then it will be labeled as –Apple.
•If the shape of the object is a long curving cylinder having Green-Yellow color, then it will be labeled as –Banana.
Now suppose after training the data, you have given a new separate fruit, say Banana from the basket, and asked to
identify it.
Since the machine has already learned these things from the previous data, this time it has to use that knowledge wisely. It will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from training data (the basket containing fruits) and then applies that knowledge to test data (the new fruit).
Supervised learning is classified into two categories of algorithms: Classification and Regression.
We can find the accuracy of the predicted results by interpreting the confusion matrix. From the above output, we can see that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.
• 5. Visualizing the training set result
• Finally, we will visualize the training set result. To visualize the result, we will use ListedColormap class of
matplotlib library. Below is the code for it:
#Visualizing the training set result
import numpy as nm                        # numpy and pyplot aliases used throughout this example
import matplotlib.pyplot as mtp
from matplotlib.colors import ListedColormap
# x_train, y_train and the fitted classifier come from the earlier pre-processing and training steps
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
• In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables, x_set and y_set, to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid that ranges from the minimum value minus 1 to the maximum value plus 1 of each feature. The pixel points we have taken have a resolution of 0.01.
• To create a filled contour, we have used the mtp.contourf command; it will create regions of the provided colors (purple and green). In this function, we have passed classifier.predict to show the data points predicted by the classifier.
• Output: By executing the above code, we will get the below output:
• The graph can be explained in the below points:
• In the above graph, we can see that there are some Green points within the green
region and Purple points within the purple region.
• All these data points are the observation points from the training set, which shows the
result for purchased variables.
• This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.
• The purple point observations are for which purchased (dependent variable) is
probably 0, i.e., users who did not purchase the SUV car.
• The green point observations are for which purchased (dependent variable) is
probably 1 means user who purchased the SUV car.
• We can also estimate from the graph that the users who are younger with low salary,
did not purchase the car, whereas older users with high estimated salary purchased
the car.
• But there are some purple points in the green region (predicted as buying the car) and some green points in the purple region (predicted as not buying). These are the misclassified observations, which we have already counted as the 11 incorrect predictions in the confusion matrix.
• The goal of the classifier:
• We have successfully visualized the training set result for the logistic regression,
and our goal for this classification is to divide the users who purchased the SUV
car and who did not purchase the car. So from the output graph, we can clearly
see the two regions (Purple and Green) with the observation points. The Purple
region is for those users who didn't buy the car, and Green Region is for those
users who purchased the car.
• Linear Classifier:
• As we can see from the graph, the classifier is a Straight line or linear in nature
as we have used the Linear model for Logistic Regression. In further topics, we
will learn for non-linear Classifiers.
• Visualizing the test set result:
• Our model is well trained using the training dataset. Now, we will visualize the
result for new observations (Test set). The code for the test set will remain
same as above except that here we will use x_test and y_test instead of x_train
and y_train. Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
The above graph shows the test set result. As we can see, the graph is divided into two regions (purple and green), with green observations in the green region and purple observations in the purple region, so we can say the model makes good predictions. Some of the green and purple data points are in the wrong regions, but this error was already accounted for in the confusion matrix (11 incorrect outputs).
Hence our model is pretty good and ready to make new predictions for this classification problem.
• Unsupervised Learning :
It is a type of learning where we do not give a target to our model while training, i.e., the training data has only input parameter values. The model has to find on its own how it can learn. The dataset in Figure A is mall data that contains information about the clients that subscribe to the mall. Once subscribed, they are given a membership card, so the mall has complete information about each customer and his/her every purchase. Using this data and unsupervised learning techniques, the mall can easily group clients based on the parameters we feed in, as in the clustering sketch below.
•Unlabeled data: The data only contains values for the input parameters; there is no target value (output). It is easier to collect than the labeled data used in the supervised approach.
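• As a hedged illustration of how such mall clients might be grouped without labels, here is a minimal K-Means sketch; the feature names (annual income, spending score), the sample values, and the number of clusters are assumptions made for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled customer data: [annual income (k$), spending score]
X = np.array([
    [15, 39], [16, 81], [17, 6], [18, 77], [19, 40],
    [70, 29], [71, 95], [72, 11], [73, 88], [75, 35],
])

# Ask K-Means for 3 groups; note that no target column is supplied anywhere
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)            # cluster index assigned to each customer
print(kmeans.cluster_centers_)   # centre of each discovered group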
• Types of Unsupervised Learning:-
A machine is said to be learning from past experiences (data fed in) with respect to some class of tasks if its performance on a given task improves with experience. For example, assume that a machine has to predict whether a customer will buy a specific product, say an antivirus, this year or not. The machine will do this by looking at previous knowledge/past experience, i.e., the data on the products that the customer has bought every year; if he buys an antivirus every year, then there is a high probability that the customer is going to buy an antivirus this year as well. This is how machine learning works at the basic conceptual level.
Supervised Learning :
Supervised learning is when the model is trained on a labelled dataset.
A labelled dataset is one that has both input and output parameters. In this type of learning, both the training and validation datasets are labelled, as shown in the figures below.
• Both the above figures have labelled data set –
• Figure A: It is a dataset of a shopping store that is useful in predicting whether a
customer will purchase a particular product under consideration or not based on
his/ her gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0
means that the customer won’t purchase it.
• Figure B: It is a Meteorological dataset that serves the purpose of predicting wind
speed based on different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
Output: Wind Speed
• Training the system:
While training the model, data is usually split in the ratio of 80:20 i.e. 80% as
training data and rest as testing data. In training data, we feed input as well as
output for 80% of data. The model learns from training data only. We use
different machine learning algorithms(which we will discuss in detail in the next
articles) to build our model. By learning, it means that the model will build some
logic of its own.
Once the model is ready, it can be tested. At the time of testing, the input is fed from the remaining 20% of the data, which the model has never seen before; the model predicts some value, and we compare it with the actual output to calculate the accuracy (see the split sketch below).
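• A minimal sketch of the 80:20 split described above, using scikit-learn's train_test_split; the data arrays are randomly generated placeholders, so the accuracy itself is only illustrative:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical labeled data: 100 samples, 3 input features, binary output
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# 80% for training, 20% held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)   # the model learns from training data only
y_pred = model.predict(X_test)                       # predict on the unseen 20%
print("accuracy:", accuracy_score(y_test, y_pred))   # compare predictions with actual outputs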
• Types of Supervised Learning:
• Classification: It is a Supervised Learning task where output is having defined
labels(discrete value). For example in above Figure A, Output – Purchased has
defined labels i.e. 0 or 1; 1 means the customer will purchase and 0 means that
customer won’t purchase. The goal here is to predict discrete values belonging to
a particular class and evaluate them on the basis of accuracy.
It can be either binary or multi-class classification. In binary classification, the
model predicts either 0 or 1; yes or no but in the case of multi-class classification,
the model predicts more than one class.
Example: Gmail classifies mails in more than one class like social, promotions,
updates, forums.
• Regression: It is a Supervised Learning task where output is having continuous
value.
Example in above Figure B, Output – Wind Speed is not having any discrete value
but is continuous in the particular range. The goal here is to predict a value as
much closer to the actual output value as our model can and then evaluation is
done by calculating the error value. The smaller the error the greater the
accuracy of our regression model.
• Example of Supervised Learning Algorithms:
• Linear Regression
• Nearest Neighbor
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest
• Understanding Hypothesis
• In most supervised machine learning algorithms, our main goal is to find a possible hypothesis from the hypothesis space that could map the inputs to the proper outputs.
The following figure shows the common method of finding a possible hypothesis from the hypothesis space:
• Hypothesis Space (H):
The hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm determines the best possible hypothesis (only one) that would best describe the target function or the outputs.
• Hypothesis (h):
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. To better understand the hypothesis space and the hypothesis, consider the following coordinate plot that shows the distribution of some data:
Suppose we have test data for which we have to determine the outputs or results. The test data is as shown below:
We can predict the outcomes by dividing the coordinate as shown below:
So the test data would yield the following result:
But note here that we could have divided the coordinate plane as:
The way in which the coordinate would be divided depends on the data,
algorithm and constraints.
•All these legal possible ways in which we can divide the coordinate plane to predict the outcome of the test data make up the Hypothesis Space.
•Each individual possible way is known as a hypothesis.
Hence, in this example the hypothesis space would be like:
• Understanding Hypothesis Testing
• A hypothesis is a statement about the given problem. Hypothesis testing is a statistical method used to make a statistical decision from experimental data. A hypothesis test is basically an assumption that we make about a population parameter. It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.
Example:
We might claim that the average age of students in the class is 30, or that boys are taller than girls. These are examples of assumptions that need some statistical way to be proved; we need a mathematical conclusion that whatever we are assuming is true.
• Need for Hypothesis Testing
Hypothesis testing is an important procedure in statistics. Hypothesis testing
evaluates two mutually exclusive population statements to determine which
statement is most supported by sample data. When we say that the findings are
statistically significant, it is thanks to hypothesis testing.
• Null hypothesis (H0): In statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured cases or no difference among groups.
• In other words, it is a basic assumption, made based on knowledge of the problem.
• Example: A company's production is 50 units per day.
• Alternative hypothesis (H1): The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis.
• Example: A company's production is not equal to 50 units per day.
• Level of significance
• It refers to the degree of significance at which we accept or reject the null hypothesis. 100% accuracy is not possible for accepting a hypothesis, so we select a level of significance, usually 5%. This is normally denoted by α and is generally 0.05 or 5%, which means your output should be 95% confident to give a similar kind of result in each sample.
• P-value
• The P-value, or calculated probability, is the probability of finding the observed or more extreme results when the null hypothesis (H0) of the given problem is true. If your P-value is less than the chosen significance level, then you reject the null hypothesis, i.e., you accept that your sample supports the alternative hypothesis.
• Example:
Given a coin, it is not known whether it is fair or tricky, so let's define the null and alternative hypotheses:
• Null Hypothesis (H0): the coin is a fair coin.
• Alternative Hypothesis (H1): the coin is a tricky coin.
• alpha = 5% or 0.05
• Now let's toss the coin and calculate the p-value (probability value).
• Toss the coin a 1st time and assume the result is heads: p-value = 50% (as heads and tails have equal probability).
• Toss the coin a 2nd time and assume the result is again heads: now the p-value = 50/2 = 25%.
• Similarly, if we toss 6 consecutive times and get all heads, the p-value becomes about 1.5%.
• But we set our significance level at 5%, i.e., we allow a 5% error rate, and here the p-value has fallen below that level. Our null hypothesis therefore does not hold, so we reject it and propose that this coin is a tricky coin, which indeed appears to be the case since it gave us 6 consecutive heads.
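• A tiny sketch of the arithmetic in the coin example: the p-value after six consecutive heads under the null hypothesis of a fair coin, compared against the 5% significance level:

alpha = 0.05                      # chosen significance level (5%)
p_head = 0.5                      # P(head) under the null hypothesis (fair coin)

p_value = p_head ** 6             # probability of 6 heads in a row = 0.5^6 ≈ 0.0156 (about 1.5%)
print(p_value, p_value < alpha)   # True -> reject H0 and conclude the coin looks tricky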
• Error in Hypothesis Testing
• Type I error: When we reject the null hypothesis, although that hypothesis was
true. Type I error is denoted by alpha.
• Type II errors: When we accept the null hypothesis but it is false. Type II errors
are denoted by beta.
• Type I Errors — False Positives (Alpha)
• There will almost always be a possibility of wrongly rejecting a null hypothesis
when it should not have been rejected while performing hypothesis tests. Data
scientists have the option of selecting an alpha (𝛼) confidence level threshold that
they will use to accept or reject the null hypothesis. This confidence threshold,
which is in other words a level of trust, is also the likelihood that you will reject
the null hypothesis when it is actually valid. This case is a type I error, which is
more generally referred to as a false positive.
• In hypothesis testing, you need to decide what degree of confidence, or trust, for
which you can dismiss the null hypothesis. If a scientist were to set alpha (𝛼) =.05,
this means that there is a 5 percent probability that they would reject the null
hypothesis when it is actually valid. Another way to think about this is that you
would expect the hypothesis to be rejected once, simply by chance, if you repeated
this experiment 20 times. Generally speaking, an alpha level of 0.05 is adequate to
show that certain findings are statistically significant.
• Type II Errors — False Negatives (Beta)
• Beta (β) is another type of error, which is the possibility that you have not
rejected the null hypothesis when it is actually incorrect. Type II errors are also
known as false negatives. Beta is linked to something called Power, which, given
that the null hypothesis is actually false, is the likelihood of rejecting it. When
planning an experiment, researchers will always select the power level they want
and get their Type II error rate from that.
• Is one more important than the other?
• Various situations allow researchers to mitigate one form of error over the other.
The two types of error are inversely related to each other; decreasing type I
errors will increase type II errors, and vice versa. To decide when a type I or type
II error would be safer, let’s go through a couple of scenarios.
• Imagine that you are on a jury and that you need to determine if an individual is
going to be sent to jail for a crime. Since you don’t know the truth as to whether
or not this person committed a crime, which would be worse, a type I or type II
error? I hope you say that a type I error is going to be worse. A type I error would
suggest that, if they were really not guilty, you would send them to jail! The jury
has dismissed the null hypothesis that the defendant is innocent while he has not
committed any crime. You would also not want to make a type II error here
because this would mean that someone has actually committed a crime and the
jury is letting them get away with it.
• Let’s take another example of a medical situation. A patient with multiple
migraine headaches is referred to the doctor for an MRI head scan. The doctor
believes that a brain tumor may be present in the patient. Is it going to be worse
for this situation to have a type I or type II error? Let’s hope you said that a Type
II error would be worse. A type II error would mean that there is a brain tumor in the patient, but the doctor insists that there is nothing wrong with them! In other words, the null hypothesis is that the person does not have a brain tumor, and this hypothesis is not rejected. This implies that the person is diagnosed as healthy even though they are genuinely unwell.
• As researchers design experiments and make choices about the degrees of alpha
level and power, they need to weigh the risks of Type I and Type II errors in order
to prepare for whatever type of error they want to mitigate.
• Inductive Learning
• Machine learning is one of the most important subfields of artificial intelligence. It
has been viewed as a viable way of avoiding the knowledge bottleneck problem in
developing knowledge-based systems.
• Inductive Learning, also known as Concept Learning, is how AI systems attempt
to use a generalized rule to carry out observations.
• To generate a set of classification rules, Inductive Learning Algorithms (ILAs) are used. These generated rules are in the "If this then that" format.
• These rules determine the state of an entity at each iteration step in learning and how learning can be effectively changed by adding more rules to the existing ruleset.
• When the output and examples of the function are fed into the AI system,
inductive Learning attempts to learn the function for new data.
• The Fundamental Concept of Inductive Learning
• There are two methods for obtaining knowledge in the real world: first, from
domain experts, and second, from machine learning.
• Domain experts are not very useful or reliable for large amounts of data. As a
result, for this project, we are adopting a machine learning approach.
• The other method, using machine learning, replicates the logic of ‘experts’ in
algorithms, but this work may be very complex, time-consuming, and expensive.
• As a result, an option is the inductive algorithms, which generate a strategy for
performing a task without requiring instruction at each step.
• According to Jason Brownlee in his article "Basic Concepts in Machine Learning," a good way to understand how inductive learning works is this: if we are given input samples (x) and output samples (f(x)), the problem of inductive learning is to estimate the function (f) for new data.
• Inductive Learning Algorithm
• Inductive Learning Algorithm (ILA) is an iterative and inductive machine learning
algorithm which is used for generating a set of a classification rule, which
produces rules of the form “IF-THEN”, for a set of examples, producing rules at
each iteration and appending to the set of rules.
• Basic Idea:
There are basically two methods for knowledge extraction firstly from domain
experts and then with machine learning.
• For very large amounts of data, domain experts are not very useful or reliable, so we move towards the machine learning approach for this work.
• One machine learning method is to replicate the expert's logic in the form of algorithms, but this work is very tedious, time-consuming, and expensive.
• So we move towards inductive algorithms, which generate the strategy for performing a task themselves and need not be instructed separately at each step.
• Need for ILA in the presence of other machine learning algorithms:
The ILA is a newer algorithm that was needed even when other inductive learning algorithms like ID3 and AQ were available.
• The need arose because of pitfalls present in the previous algorithms; one of the major pitfalls was the lack of generalisation of rules.
• ID3 and AQ used the decision-tree production method, which was too specific, difficult to analyse, and very slow for basic short classification problems.
• The decision-tree-based algorithms were unable to work on a new problem if some attributes were missing.
• ILA instead produces a general set of rules rather than decision trees, which overcomes the above problems.
• THE ILA ALGORITHM:
• General requirements at the start of the algorithm:
• List the examples in the form of a table 'T' where each row corresponds to an example and each column contains an attribute value.
• Create a set of m training examples, each example composed of k attributes and a class attribute with n possible decisions.
• Create a rule set, R, having the initial value false.
• Initially, all rows in the table are unmarked.
• Steps in the algorithm:-
• Step 1:
divide the table ‘T’ containing m examples into n sub-tables (t1, t2,…..tn). One
table for each possible value of the class attribute. (repeat steps 2-8 for each sub-
table)
• Step 2:
Initialize the attribute combination count ‘ j ‘ = 1.
• Step 3:
For the sub-table on which work is going on, divide the attribute list into distinct combinations,
each combination with ‘j ‘ distinct attributes.
• Step 4:
For each combination of attributes, count the number of occurrences of attribute values that appear under that combination in unmarked rows of the sub-table under consideration and, at the same time, do not appear under the same combination of attributes in the other sub-tables. Call the first combination with the maximum number of occurrences the max-combination 'MAX'.
• Step 5:
If ‘MAX’ = = null , increase ‘ j ‘ by 1 and go to Step 3.
• Step 6:
Mark all rows of the sub-table being worked on, in which the values of 'MAX' appear, as classified.
• Step 7:
Add a rule (IF attribute = “XYZ” –> THEN decision is YES/ NO) to R whose left-hand side
will have attribute names of the ‘MAX’ with their values separated by AND, and its right-hand
side contains the decision attribute value associated with the sub-table.
• Step 8:
If all rows are marked as classified, then move on to process another sub-table and go to Step 2. Otherwise, go to Step 4. If no sub-tables are available, exit with the set of rules obtained so far. A Python sketch of the whole procedure follows.
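• The steps above can be sketched compactly in Python. This is a hedged illustration of the procedure rather than a reference implementation; the table is represented as a list of dictionaries, and the attribute names and values in the toy example at the end are made up:

from itertools import combinations
from collections import Counter

def ila(examples, attributes, class_attr):
    # Sketch of the Inductive Learning Algorithm: examples is a list of dicts,
    # e.g. {"weather": "sunny", "location": "hilly", "decision": "Yes"}.
    rules = []
    classes = {e[class_attr] for e in examples}
    # Step 1: one sub-table per value of the class attribute
    subtables = {c: [e for e in examples if e[class_attr] == c] for c in classes}

    for cls, rows in subtables.items():
        others = [e for e in examples if e[class_attr] != cls]
        marked = [False] * len(rows)
        j = 1                                      # Step 2: attribute combination count
        while not all(marked) and j <= len(attributes):
            # Steps 3-4: count value combinations that never occur in the other sub-tables
            counts = Counter()
            for combo in combinations(attributes, j):
                in_others = {tuple(e[a] for a in combo) for e in others}
                for i, row in enumerate(rows):
                    if marked[i]:
                        continue
                    values = tuple(row[a] for a in combo)
                    if values not in in_others:
                        counts[(combo, values)] += 1
            if not counts:                         # Step 5: MAX is null -> try larger combinations
                j += 1
                continue
            (combo, values), _ = counts.most_common(1)[0]   # the max-combination 'MAX'
            # Step 6: mark the matching rows; Step 7: emit an IF-THEN rule
            for i, row in enumerate(rows):
                if not marked[i] and tuple(row[a] for a in combo) == values:
                    marked[i] = True
            cond = " AND ".join(f"{a} = {v}" for a, v in zip(combo, values))
            rules.append(f"IF {cond} THEN {class_attr} = {cls}")
    return rules

# Hypothetical usage with a tiny made-up table:
data = [
    {"weather": "sunny", "location": "hilly", "decision": "Yes"},
    {"weather": "rainy", "location": "flat", "decision": "No"},
    {"weather": "sunny", "location": "flat", "decision": "Yes"},
]
print(ila(data, ["weather", "location"], "decision"))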
• An example showing the use of ILA
Suppose an example set has the attributes place type, weather, and location, plus a decision attribute, and seven examples; our task is to generate a set of rules stating under what conditions each decision is made. (The full example table and the class-wise sub-tables are shown in the accompanying figures.)
In the above output image, we can clearly see that the regression line is far from the data points. Predictions lie on the red straight line, and the blue points are the actual values. If we use this output to predict the value for the CEO level, it gives a salary of approximately $600,000, which is far from the real value.
So we need a curved model to fit the dataset rather than a straight line.
• Visualizing the result for Polynomial Regression
• Here we will visualize the result of Polynomial regression model, code for which is
little different from the above model.
• Code for this is given below:
#Visualizing the result for Polynomial Regression
mtp.scatter(x, y, color = "blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color = "red")
mtp.title("Bluff detection model (Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
• In the above code, we have used lin_reg_2.predict(poly_regs.fit_transform(x)) instead of x_poly, because we want the Linear regression object to predict on the polynomial features matrix.
Output:
As we can see in the above output image, the predictions are close to the real values. The above plot will vary as we change the degree.
• For degree = 3:
• If we change the degree to 3, then we get a more accurate plot, as shown in the image below.
As we can see in the above output image, the predicted salary for level 6.5 is near $170K-$190K, which suggests that the future employee is telling the truth about his salary.
• Degree = 4: Let's again change the degree to 4; now we get the most accurate plot. Hence we can get more accurate results by increasing the degree of the polynomial.
• Predicting the final result with the Linear Regression model:
• Now, we will predict the final output using the Linear regression model to see
whether an employee is saying truth or bluff. So, for this, we will use
the predict() method and will pass the value 6.5. Below is the code for it:
• lin_pred = lin_regs.predict([[6.5]])
• print(lin_pred)
• Output:
[330378.78787879]
• Predicting the final result with the Polynomial Regression model:
• Now, we will predict the final output using the Polynomial Regression model to
compare with Linear model. Below is the code for it:
• poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
• print(poly_pred)
• Output:
[ 158862.45265153]
• As we can see, the predicted output from Polynomial Regression is [158862.45265153], which is much closer to the real value; hence, we can say that the future employee is telling the truth.
• Cost function-
• The different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
• The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
• We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.
• For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values.
For the above linear equation, MSE can be calculated as:
MSE = (1/N) Σ (Yi - (a1xi + a0))², summing over the N observations
Where,
N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value.
Residuals: The distance between the actual values and the predicted values is called the residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will also be small.
• Gradient Descent:
• Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
• A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
• It starts with randomly selected coefficient values and then iteratively updates them to reach the minimum of the cost function, as in the sketch below.
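• A minimal sketch of gradient descent minimizing the MSE cost for the line y = a0 + a1x; the data points and the learning rate are assumptions chosen for illustration:

import numpy as np

# Hypothetical data roughly following y = 2x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

a0, a1 = 0.0, 0.0            # start from arbitrary coefficient values
lr = 0.01                    # learning rate

for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Gradients of MSE = mean((y_pred - y)^2) with respect to a0 and a1
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= lr * grad_a0       # step against the gradient to reduce the cost
    a1 -= lr * grad_a1

mse = np.mean((a0 + a1 * x - y) ** 2)
print(a0, a1, mse)           # the coefficients settle near 1.15 and 1.95 for this data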
• Model Performance:
• The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the method below:
• R-squared method:
• R-squared is a statistical method that determines the goodness of fit.
• It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
• A high value of R-squared indicates less difference between the predicted values and the actual values, and hence represents a good model.
• It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
• It can be calculated from the formula below:
R-squared = Explained variation / Total variation = 1 - (Residual sum of squares / Total sum of squares)
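• A quick sketch of the R-squared computation using the formula above, with hypothetical actual and predicted values:

import numpy as np

# Hypothetical actual values and model predictions
y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

ss_res = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)        # close to 1 -> the predictions track the actual values well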
• Assumptions of Linear Regression
• Below are some important assumptions of Linear Regression. These are some formal
checks while building a Linear Regression model, which ensures to get the best possible
result from the given dataset.
• Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and
independent variables.
• Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable. In other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.
• Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.
• Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, then the confidence intervals will become either too wide or too narrow, which may cause difficulties in finding the coefficients.
This can be checked using a q-q plot: if the plot shows a straight line without any deviation, the errors are normally distributed.
• No autocorrelations:
The linear regression model assumes no autocorrelation in error terms. If there
will be any correlation in the error term, then it will drastically reduce the
accuracy of the model. Autocorrelation usually occurs if there is a dependency
between residual errors.
• Simple Linear Regression in Machine Learning
• Simple Linear Regression is a type of Regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear, a sloped straight line, hence it is called Simple Linear Regression.
• The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous or
categorical values.
• Simple Linear regression algorithm has mainly two objectives:
• Model the relationship between the two variables. Such as the relationship between Income
and expenditure, experience and Salary, etc.
• Forecasting new observations. Such as Weather forecasting according to temperature, Revenue
of a company according to the investments in a year, etc.
• Simple Linear Regression Model:
• The Simple Linear Regression model can be represented using the equation below:
y = a0 + a1x + ε
Where,
a0 = the intercept of the regression line (obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing
ε = the error term (for a good model it will be negligible)
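• A minimal sketch of fitting the equation above with scikit-learn and reading off a0 (intercept) and a1 (slope); the experience/salary numbers are invented for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical experience (years) vs salary (k$) data
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # single independent variable
y = np.array([35.0, 42.0, 50.0, 57.0, 66.0])        # continuous dependent variable

model = LinearRegression().fit(x, y)
a0, a1 = model.intercept_, model.coef_[0]
print(f"y = {a0:.2f} + {a1:.2f}x")       # the fitted regression line
print(model.predict([[6.0]]))            # forecasting a new observation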
• Multiple Linear Regression
• In the previous topic, we have learned about Simple Linear Regression, where a
single Independent/Predictor(X) variable is used to model the response variable
(Y). But there may be various cases in which the response variable is affected by
more than one predictor variable; for such cases, the Multiple Linear Regression
algorithm is used.
• Moreover, Multiple Linear Regression is an extension of Simple Linear regression
as it takes more than one predictor variable to predict the response variable. We
can define it as:
• Multiple Linear Regression is one of the important regression algorithms which
models the linear relationship between a single dependent continuous variable
and more than one independent variable.
• Example:
• Prediction of CO2 emission based on engine size and number of cylinders in a car.
• Some key points about MLR:
• For MLR, the dependent or target variable(Y) must be the continuous/real, but
the predictor or independent variable may be of continuous or categorical form.
• Each feature variable must model the linear relationship with the dependent
variable.
• MLR tries to fit a regression line through a multidimensional space of data-points.
• MLR equation:
• In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same idea applies to the multiple linear regression equation, which becomes:
• Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
• Where,
• Y = output/response variable
• b0, b1, b2, b3, ..., bn = coefficients of the model
• x1, x2, x3, x4, ... = the independent/feature variables
• Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the Target and predictor variables.
• The regression residuals must be normally distributed.
• MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
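• A hedged sketch of the MLR equation above in practice, using the CO2-emission example mentioned earlier; the engine-size, cylinder, and emission numbers are made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [engine size (L), number of cylinders] -> CO2 emission (g/km)
X = np.array([[1.6, 4], [2.0, 4], [2.4, 4], [3.0, 6], [3.5, 6], [4.4, 8]])
y = np.array([140.0, 155.0, 170.0, 200.0, 220.0, 260.0])

mlr = LinearRegression().fit(X, y)
print("b0:", mlr.intercept_)         # intercept
print("b1, b2:", mlr.coef_)          # one coefficient per predictor variable
print(mlr.predict([[2.8, 6]]))       # predicted emission for a new car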
• Logistic Regression in Machine Learning
• Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a given
set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must
be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but instead of
giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
• Logistic Regression is quite similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and can easily
determine the most effective variables used for the classification. The below image is showing the
logistic function:
Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it is
called logistic regression, but is used to classify samples; Therefore, it falls under the
classification algorithm.
• Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is
called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
• Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variable should not have multi-collinearity.
• Logistic Regression Equation:
• The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
• We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
• In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 - y):
y / (1 - y); this is 0 for y = 0 and infinity for y = 1
• But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it becomes:
log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn
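• A small numerical sketch of the sigmoid/log-odds relationship derived above; the coefficient values and inputs are arbitrary assumptions:

import numpy as np

def sigmoid(z):
    # Maps any real value into the (0, 1) range, so it can be read as a probability
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary coefficients for log(y / (1 - y)) = b0 + b1*x
b0, b1 = -3.0, 1.5
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

z = b0 + b1 * x                   # linear part (log-odds), ranges over -inf .. +inf
p = sigmoid(z)                    # probability of class 1, the "S"-shaped curve
y_hat = (p >= 0.5).astype(int)    # threshold at 0.5 to obtain the 0/1 class
print(np.round(p, 3), y_hat)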
Let's say we want to create clusters using the K-Means clustering or k-Nearest Neighbour algorithm to solve a classification or regression problem. How will you define the similarity between different observations? How can we say that two points are similar to each other? This happens when their features are similar, right? When we plot such points, they will be closer to each other in distance.
• Hence, we can calculate the distance between points and then define the
similarity between them. Here’s the million-dollar question – how do we
calculate this distance and what are the different distance metrics in machine
learning?
• Types of Distance Metrics in Machine Learning
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
• Hamming Distance
• Let’s start with the most commonly used distance metric – Euclidean Distance.
Euclidean Distance
Euclidean Distance represents the shortest distance between two points.
Most machine learning algorithms including K-Means use this distance metric to measure the
similarity between observations. Let’s say we have two points as shown below:
• So, the Euclidean Distance between these two points A and B will be:
d(p, q) = sqrt( Σ (pi - qi)² ), summing over i = 1 ... n
Where,
•n = number of dimensions
•pi, qi = data points
Manhattan Distance
Manhattan Distance is the sum of absolute differences between points across all
the dimensions.
We can represent Manhattan Distance as:
d(p, q) = Σ |pi - qi|, summing over i = 1 ... n
• Since the above representation is 2-dimensional, to calculate the Manhattan Distance we take the sum of the absolute distances in both the x and y directions. So, the Manhattan distance in a 2-dimensional space is given as:
d = |p1 - q1| + |p2 - q2|
Where,
•n = number of dimensions
•pi, qi = data points
Minkowski Distance
Minkowski Distance is the generalized form of Euclidean and Manhattan Distance.
The formula for Minkowski Distance is given as:
d(p, q) = ( Σ |pi - qi|^p )^(1/p)
Here, p represents the order of the norm (p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance). Let's calculate the Minkowski Distance of order 3:
Hamming Distance
Hamming Distance measures the similarity between two strings of the same length. The Hamming
Distance between two strings of the same length is the number of positions at which the
corresponding characters are different.
Let’s understand the concept using an example. Let’s say we have two strings:
“euclidean” and “manhattan”
Since the length of these strings is equal, we can calculate the Hamming Distance. We will go
character by character and match the strings. The first character of both the strings (e and m
respectively) is different. Similarly, the second character of both the strings (u and a) is different.
and so on.
Look carefully – seven characters are different whereas two characters (the last two characters)
are similar:
Hence, the Hamming Distance here will be 7. Note that the larger the Hamming Distance between two strings, the more dissimilar those strings are (and vice versa).
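• A compact sketch computing the four distance metrics discussed above; the two points p and q are arbitrary, and the Hamming example reuses the strings "euclidean" and "manhattan":

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 8.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))                  # straight-line distance
manhattan = np.sum(np.abs(p - q))                          # sum of absolute differences
order = 3
minkowski = np.sum(np.abs(p - q) ** order) ** (1 / order)  # generalised form with norm order 3

s1, s2 = "euclidean", "manhattan"
hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))          # differing positions of equal-length strings

print(euclidean, manhattan, minkowski, hamming)            # the Hamming distance prints 7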
• Principal Component Analysis
• Principal Component Analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. It is one of the popular tools used for exploratory data analysis and predictive modeling. It is a technique for drawing strong patterns from a given dataset by reducing dimensionality while retaining most of the variance.
• PCA generally tries to find a lower-dimensional surface onto which to project the high-dimensional data.
• PCA works by considering the variance of each attribute, because a high-variance attribute shows a good split between the classes, and this is how it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important ones.
• The PCA algorithm is based on some mathematical concepts such as:
• Variance and Covariance
• Eigenvalues and Eigenvectors
• Some common terms used in PCA algorithm:
• Dimensionality: It is the number of features or variables present in the given dataset. More
easily, it is the number of columns present in the dataset.
• Correlation: It signifies that how strongly two variables are related to each other. Such as if one
changes, the other variable also gets changed. The correlation value ranges from -1 to +1. Here,
-1 occurs if variables are inversely proportional to each other, and +1 indicates that variables are
directly proportional to each other.
• Orthogonal: It defines that variables are not correlated to each other, and hence the correlation
between the pair of variables is zero.
• Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.
• Covariance Matrix: A matrix containing the covariance between the pair of variables is called
the Covariance Matrix.
• Principal Components in PCA
• As described above, the transformed new features, i.e., the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
• Each principal component must be a linear combination of the original features.
• These components are orthogonal, i.e., the correlation between any pair of components is zero.
• The importance of each component decreases going from 1 to n: the 1st PC has the most importance, and the nth PC has the least importance.
• Steps for PCA algorithm
• Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is the
training set, and Y is the validation set.
• Representing data into a structure
Now we will represent our dataset in a structure, such as a two-dimensional matrix of the
independent variable X. Here each row corresponds to a data item, and each column corresponds
to a feature. The number of columns gives the dimensionality of the dataset.
• Standardizing the data
In this step, we will standardize our dataset. In a particular column, features with high variance
would otherwise be treated as more important than features with lower variance; if the importance
of a feature should be independent of its variance, we divide each data item in a column by the
standard deviation of that column. We will call the resulting matrix Z.
• Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z and transpose it. After transposing,
we will multiply it by Z (dividing by n − 1, the number of observations minus one, to obtain the
sample covariance). The output matrix is the covariance matrix of Z.
• Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix.
The eigenvectors of the covariance matrix are the directions of the axes with the highest
information (variance), and the corresponding eigenvalues give the amount of variance carried
along each of those directions.
• Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and sort them in decreasing order, i.e., from
largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P. The
resulting matrix of sorted eigenvectors will be named P*.
• Calculating the new features Or Principal Components
Here we will calculate the new features. To do this, we will multiply the matrix Z by P*. In
the resulting matrix Z*, each observation is a linear combination of the original features, and
each column of the Z* matrix is independent of the others.
• Remove less or unimportant features from the new dataset.
We now have the new feature set, so we decide what to keep and what to remove. That is, we
will keep only the relevant or important features in the new dataset, and the unimportant
features will be removed.
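A minimal NumPy sketch of these steps (variable names are illustrative; the data matrix X is assumed to hold one observation per row and one feature per column):

import numpy as nm

def pca(X, k):
    # Standardizing the data: subtract the mean and divide by the standard deviation
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance of Z: Z transposed multiplied by Z (scaled by n - 1)
    cov = (Z.T @ Z) / (Z.shape[0] - 1)
    # Eigenvalues and eigenvectors of the symmetric covariance matrix
    eig_vals, eig_vecs = nm.linalg.eigh(cov)
    # Sort the eigenvectors by decreasing eigenvalue and keep the top k (matrix P*)
    order = nm.argsort(eig_vals)[::-1]
    P_star = eig_vecs[:, order[:k]]
    # New features / principal components: Z* = Z P*
    return Z @ P_star

# Usage on hypothetical data: reduce 4 features to 2 principal components
Z_star = pca(nm.random.rand(100, 4), k=2)
print(Z_star.shape)   # (100, 2)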
• More specifically, the reason why it is critical to perform standardization prior to
PCA is that PCA is quite sensitive to the variances of the initial
variables. That is, if there are large differences between the ranges of the initial
variables, those variables with larger ranges will dominate over those with small
ranges (For example, a variable that ranges between 0 and 100 will dominate over
a variable that ranges between 0 and 1), which will lead to biased results. So,
transforming the data to comparable scales can prevent this problem.
• Mathematically, this can be done by subtracting the mean and dividing by the
standard deviation for each value of each variable.
Once the standardization is done, all the variables will be transformed to the same
scale.
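In symbols, each value x of a variable with mean μ and standard deviation σ is replaced by the standardized value z = (x − μ) / σ.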
• COVARIANCE MATRIX COMPUTATION
• The aim of this step is to understand how the variables of the input data set are
varying from the mean with respect to each other, or in other words, to see if there
is any relationship between them. Because sometimes, variables are highly
correlated in such a way that they contain redundant information. So, in order to
identify these correlations, we compute the covariance matrix.
• The covariance matrix is a p × p symmetric matrix (where p is the number of
dimensions) that has as entries the covariances associated with all possible pairs of
the initial variables. For example, for a 3-dimensional data set with 3
variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:
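        Cov(x, x)   Cov(x, y)   Cov(x, z)
        Cov(y, x)   Cov(y, y)   Cov(y, z)
        Cov(z, x)   Cov(z, y)   Cov(z, z)
Since Cov(x, x) = Var(x), the entries on the main diagonal are the variances of the variables, and since Cov(x, y) = Cov(y, x), the matrix is symmetric with respect to the main diagonal.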
It is the sign of each covariance that matters:
•if positive: the two variables increase or decrease together (correlated)
•if negative: one increases when the other decreases (inversely correlated)
Now that we know that the covariance matrix is no more than a table that
summarizes the correlations between all the possible pairs of variables, let's move
to the next step.
• Mathematics Behind PCA
• PCA can be thought of as an unsupervised learning problem. The whole process
of obtaining principal components from a raw dataset can be simplified into six
parts:
• Take the whole dataset consisting of d+1 dimensions and ignore the labels such
that our new dataset becomes d dimensional.
• Compute the mean for every dimension of the whole dataset.
• Compute the covariance matrix of the whole dataset.
• Compute eigenvectors and the corresponding eigenvalues.
• Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with
the largest eigenvalues to form a d × k dimensional matrix W.
• Use this d × k eigenvector matrix to transform the samples onto the new
subspace.
• So, let's unfurl the maths behind each of these steps one by one.
• Take the whole dataset consisting of d+1 dimensions and ignore the labels such
that our new dataset becomes d dimensional.
• Let's say we have a dataset which is d+1 dimensional, where d could be thought of
as X_train and the remaining 1 as y_train (the labels) in the modern machine learning
paradigm. So X_train + y_train makes up our complete training dataset.
• So, after we drop the labels we are left with a d-dimensional dataset, and this is
the dataset we will use to find the principal components. Also, let's assume
we are left with a three-dimensional dataset after ignoring the labels, i.e., d = 3.
• We will assume that the samples stem from two different classes, where one half
of the samples in our dataset are labeled class 1 and the other half class 2.
• Let our data matrix X be the scores of three students (on math, English and art tests):
Using the covariance formula, we can find the covariance matrix of this data; the result
is a square matrix of d × d dimensions.
Rewriting our original matrix as A, its covariance matrix (shown in the original figure) has the
variances of the three tests on its diagonal and the pairwise covariances off the diagonal.
• A few points can be noted here:
• Shown in blue along the diagonal, we see the variance of scores for each test. The
art test has the biggest variance (720) and the English test the smallest (360). So
we can say that art test scores have more variability than English test scores.
• The covariances are displayed in black in the off-diagonal elements of the matrix A.
• a) The covariance between math and English is positive (360), and the covariance
between math and art is positive (180). This means the scores tend to covary in a
positive way. As scores on math go up, scores on art and English also tend to go
up; and vice versa.
• b) The covariance between English and art, however, is zero. This means there
tends to be no predictable relationship between the movement of English and art
scores.
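A small NumPy check of this kind of covariance matrix; the scores below are hypothetical values chosen so that the population covariance matrix reproduces the numbers quoted above (art variance 720, English 360, math/English covariance 360, math/art 180, English/art 0), not values taken from the original figure:

import numpy as nm

# Hypothetical scores: rows are students, columns are math, English and art
scores = nm.array([[90, 60, 90],
                   [90, 90, 30],
                   [60, 60, 60],
                   [60, 60, 90],
                   [30, 30, 30]], dtype=float)

# rowvar=False: columns are the variables; bias=True: population covariance (divide by N)
cov = nm.cov(scores, rowvar=False, bias=True)
print(cov)
# [[504. 360. 180.]
#  [360. 360.   0.]
#  [180.   0. 720.]]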
• Compute Eigenvectors and corresponding Eigenvalues
• Intuitively, an eigenvector is a vector whose direction remains unchanged when
a linear transformation is applied to it.
• Now, we can easily compute the eigenvalues and eigenvectors from the covariance
matrix that we have above.
• Let A be a square matrix, ν a vector and λ a scalar that satisfies Aν = λν; then λ is
called the eigenvalue associated with the eigenvector ν of A.
• The eigenvalues of A are the roots of the characteristic equation det(A − λI) = 0,
where I is the identity matrix. Solving this equation gives the eigenvalues of the
covariance matrix.
Now, we can calculate the eigenvectors corresponding to the above eigenvalues. We will not
show the full eigenvector calculation here; it amounts to solving (A − λI)ν = 0 for each
eigenvalue λ.
So, after solving for the eigenvectors, we get one eigenvector for each of the
corresponding eigenvalues.
Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest
eigenvalues to form a d × k dimensional matrix W.
We started with the goal to reduce the dimensionality of our feature space, i.e., projecting the
feature space via PCA onto a smaller subspace, where the eigenvectors will form the axes of this
new feature subspace. However, the eigenvectors only define the directions of the new axis, since
they have all the same unit length 1.
So, in order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace,
we have to take a look at the corresponding eigenvalues of the eigenvectors. Roughly speaking,
the eigenvectors with the lowest eigenvalues bear the least information about the distribution of
the data, and those are the ones we want to drop.
The common approach is to rank the eigenvectors from highest to lowest corresponding
eigenvalue and choose the top k eigenvectors.
So, after sorting the eigenvalues in decreasing order, we have them ranked from largest to smallest, together with their eigenvectors.
For our simple example, where we are reducing a 3-dimensional feature space to a
2-dimensional feature subspace, we are combining the two eigenvectors with the
highest eigenvalues to construct our d×k dimensional eigenvector matrix W.
So, eigenvectors corresponding to two maximum eigenvalues are :
Transform the samples onto the new subspace
In the last step, we use the 3×2 dimensional matrix W that we just computed to
transform our samples onto the new subspace via the equation y = W′ ×
x, where W′ is the transpose of the matrix W.
So lastly, we have computed our two principal components and projected the data
points onto the new subspace.
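Steps 4 to 6 of this walkthrough can be sketched in NumPy as follows (cov is assumed to be the d × d covariance matrix from above and X_centered the mean-centered data, one sample per row):

import numpy as nm

def project_onto_principal_subspace(cov, X_centered, k=2):
    # Eigenvalues and eigenvectors of the symmetric covariance matrix
    eig_vals, eig_vecs = nm.linalg.eigh(cov)
    # Sort by decreasing eigenvalue and keep the k eigenvectors with the
    # largest eigenvalues: W has dimensions d x k
    order = nm.argsort(eig_vals)[::-1][:k]
    W = eig_vecs[:, order]
    # y = W' x for every sample (all rows transformed at once)
    return X_centered @ W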
• Applications of Principal Component Analysis
• PCA is mainly used as the dimensionality reduction technique in various AI
applications such as computer vision, image compression, etc.
• It can also be used for finding hidden patterns when the data has high dimensionality. Some
fields where PCA is used are finance, data mining, psychology, etc.
• Naïve Bayes Classifier Algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training
dataset.
• Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can
make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
• Some popular examples of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.
• Why is it called Naïve Bayes?
• The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which
can be described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a fruit is
identified on the basis of color, shape, and taste, then a red, spherical, and sweet
fruit is recognized as an apple; each feature individually contributes to
identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
• Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
• The formula for Bayes' theorem is given as:
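P(A|B) = [P(B|A) · P(A)] / P(B)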
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
• Bayes' theorem in Artificial intelligence
• Bayes' theorem:
• Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning,
which determines the probability of an event with uncertain knowledge.
• In probability theory, it relates the conditional probability and marginal probabilities of
two random events.
• Bayes' theorem was named after the British mathematician Thomas Bayes.
The Bayesian inference is an application of Bayes' theorem, which is fundamental to
Bayesian statistics.
• It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).
• Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
• Example: If cancer corresponds to one's age then by using Bayes' theorem, we can
determine the probability of cancer more accurately with the help of age.
• Bayes' theorem can be derived using product rule and conditional probability of event A
with known event B:
• As from product rule we can write:
• P(A ⋀ B)= P(A|B) P(B) or
• Similarly, the probability of event B with known event A:
• P(A ⋀ B)= P(B|A) P(A)
• Equating the right-hand sides of both the equations, we get:
• P(A|B) = [P(B|A) · P(A)] / P(B)        ...... (a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of
most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
• P(A|B) is known as the posterior, which we need to calculate; it is read as the
probability of hypothesis A given that evidence B has occurred.
• P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate the
probability of the evidence.
• P(A) is called the prior probability: the probability of the hypothesis before considering
the evidence.
• P(B) is called the marginal probability: the probability of the evidence on its own.
• In equation (a), in general we can write P(B) = Σi P(Ai) · P(B|Ai); hence the
Bayes' rule can be written as:
P(Ai|B) = [P(B|Ai) · P(Ai)] / Σk [P(B|Ak) · P(Ak)]
Where A1, A2, A3,........, An is a set of mutually exclusive and exhaustive events.
• Applying Bayes' rule:
• Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A).
This is very useful in cases where we have good estimates of these three terms and want to
determine the fourth one. Suppose we want to perceive the effect of some unknown cause and
want to compute that cause; then Bayes' rule becomes:
P(cause|effect) = [P(effect|cause) · P(cause)] / P(effect)
Question: What is the probability that a patient has the disease meningitis, given that they have a stiff neck?
Given Data:
A doctor is aware that the disease meningitis causes a patient to have a stiff neck, and that this occurs 80% of the
time. He is also aware of some more facts, which are given as follows:
•The known probability that a patient has the meningitis disease is 1/30,000.
•The known probability that a patient has a stiff neck is 2%.
Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has meningitis, so
we can calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a)= .02
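Applying Bayes' rule:
P(b|a) = [P(a|b) · P(b)] / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.00133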
Hence, we can assume that about 1 patient out of 750 patients with a stiff neck has the meningitis disease.
Question 2: From a standard deck of playing cards, a single card is drawn. The probability that
the card is a king is 4/52. Calculate the posterior probability P(King|Face), i.e., the probability
that the drawn card is a king given that it is a face card.
Solution:
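P(King|Face) = [P(Face|King) · P(King)] / P(Face) = (1 × 4/52) / (12/52) = 4/12 = 1/3
(Every king is a face card, so P(Face|King) = 1, and a standard deck contains 12 face cards, so P(Face) = 12/52.)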
Worked example: suppose we have the following dataset of weather conditions (Outlook) and the
corresponding target variable Play, and we need to decide whether to play on a given day.
        Outlook     Play
0       Rainy       Yes
1       Sunny       Yes
2       Overcast    Yes
3       Overcast    Yes
4       Sunny       No
5       Rainy       Yes
6       Sunny       Yes
7       Overcast    Yes
8       Rainy       No
9       Sunny       No
10      Sunny       Yes
11      Rainy       No
12      Overcast    Yes
13      Overcast    Yes
Frequency table for the Weather Conditions:
Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4
Likelihood table for the weather conditions:
Weather     No             Yes
Overcast    0              5              5/14 = 0.35
Rainy       2              2              4/14 = 0.29
Sunny       2              3              5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71
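Applying Bayes' theorem to decide whether to play on a sunny day (a worked continuation of the tables above):
P(Yes|Sunny) = P(Sunny|Yes) · P(Yes) / P(Sunny) = (3/10 × 10/14) / (5/14) = 0.60
P(No|Sunny) = P(Sunny|No) · P(No) / P(Sunny) = (2/4 × 4/14) / (5/14) = 0.40
Since P(Yes|Sunny) > P(No|Sunny), the prediction for a sunny day is Yes (play).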
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect
predictions, and 65+25=90 correct predictions.
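For context, a confusion matrix like the one above would come from fitting a Gaussian Naïve Bayes model; a minimal sketch, assuming the same x_train, y_train, x_test, y_test pre-processing used in the K-NN implementation later in this document:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Predicting the Test set results and building the Confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)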
• Visualizing the training set result:
• Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code for it:
• # Visualising the Training set results
• from matplotlib.colors import ListedColormap
• x_set, y_set = x_train, y_train
• X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
• nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
• mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
• alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
• mtp.xlim(X1.min(), X1.max())
• mtp.ylim(X2.min(), X2.max())
• for i, j in enumerate(nm.unique(y_set)):
• mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
• c = ListedColormap(('purple', 'green'))(i), label = j)
• mtp.title('Naive Bayes (Training set)')
• mtp.xlabel('Age')
• mtp.ylabel('Estimated Salary')
• mtp.legend()
• mtp.show()
Output:
In the above output we can see that the Naïve Bayes classifier has segregated the data points
with a fine boundary. The boundary is a Gaussian curve, as we have used the GaussianNB classifier
in our code.
• # Visualising the Test set results
• from matplotlib.colors import ListedColormap
• x_set, y_set = x_test, y_test
• X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
• nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
• mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
• alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
• mtp.xlim(X1.min(), X1.max())
• mtp.ylim(X2.min(), X2.max())
• for i, j in enumerate(nm.unique(y_set)):
• mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
• c = ListedColormap(('purple', 'green'))(i), label = j)
• mtp.title('Naive Bayes (test set)')
• mtp.xlabel('Age')
• mtp.ylabel('Estimated Salary')
• mtp.legend()
• mtp.show()
Output:
The above output is the final output for the test set data. As we can see, the classifier has created a
Gaussian curve to divide the "purchased" and "not purchased" variables. There are some wrong
predictions, which we have counted in the confusion matrix, but it is still a pretty good classifier.
• The Bayes Optimal Classifier is a probabilistic model that makes the most probable
prediction for a new example: it uses the training data and the space of hypotheses to find
the most probable prediction for a new data instance.
• Data Mining Bayesian Classifiers
• In numerous applications, the connection between the attribute set and the class variable is non-
deterministic. In other words, the class label of a test record cannot be assumed with
certainty even though its attribute set is the same as that of some of the training examples. These
circumstances may emerge due to noisy data or the presence of certain confounding factors that
influence classification but are not included in the analysis. For example, consider the task of
predicting whether an individual is at risk of liver illness based on the individual's
eating habits and working efficiency. Although most people who eat healthily and exercise
consistently have a lower probability of liver disease, they may still develop it due to
other factors, for example consumption of high-calorie street food or alcohol
abuse. Determining whether an individual's eating routine is healthy or their workout efficiency is
sufficient is also subject to analysis, which in turn may introduce uncertainty into the learning
problem.
• Bayesian classification uses Bayes' theorem to predict the occurrence of an
event. Bayesian classifiers are statistical classifiers built on Bayesian
probability. The theorem expresses how a degree of belief, expressed as a
probability, should be updated in the light of evidence.
• Bayes' theorem is named after Thomas Bayes, who first utilized
conditional probability to provide an algorithm that uses evidence to calculate
limits on an unknown parameter.
• Bayes's theorem is expressed mathematically by the following equation:
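P(X/Y) = [P(Y/X) · P(X)] / P(Y)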
• Where X and Y are events and P(Y) ≠ 0:
• P(X/Y) is a conditional probability: the probability of event X occurring given
that Y is true.
• P(Y/X) is a conditional probability: the probability of event Y occurring given
that X is true.
• P(X) and P(Y) are the probabilities of observing X and Y independently of each
other. This is known as the marginal probability.
• Bayesian interpretation:
• In the Bayesian interpretation, probability measures a "degree of belief," and Bayes'
theorem connects the degree of belief in a hypothesis before and after
accounting for evidence. For example, consider tossing a coin: we get either heads or
tails, each with a probability of 50%. If the coin is flipped a number of times and the
outcomes are observed, the degree of belief may rise, fall, or remain the same
depending on the outcomes.
For proposition X and evidence Y,
•P(X), the prior, is the initial degree of belief in X.
•P(X/Y), the posterior, is the degree of belief after having accounted for Y.
In a Bayesian (belief) network, represented as a directed acyclic graph (DAG), the nodes represent
random variables, and the edges define the relationships between these variables.
• A DAG models the uncertainty of an event taking place based on the
Conditional Probability Distribution (CPD) of each random variable.
A Conditional Probability Table (CPT) is used to represent the CPD of each
variable in the network.
• K-Nearest Neighbor(KNN) Algorithm for Machine Learning
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new
case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This
means that when new data appears, it can easily be classified into a well-suited category using the K-NN
algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead it stores the dataset and, at the time of classification, performs an action on the dataset.
• The KNN algorithm at the training phase just stores the dataset, and when it gets new data it classifies
that data into the category that is most similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to
know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a
similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog
images and, based on the most similar features, will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point
x1; in which of these categories will this data point lie? To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
•Step-1: Select the number K of the neighbors
•Step-2: Calculate the Euclidean distance of K number of neighbors
•Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
•Step-4: Among these k neighbors, count the number of the data points in each category.
•Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
•Step-6: Our model is ready.
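A minimal from-scratch Python sketch of these steps (an illustrative helper only, independent of the scikit-learn implementation shown later; x_train and y_train are assumed to be NumPy arrays):

import numpy as nm
from collections import Counter

def knn_predict(x_train, y_train, new_point, k=5):
    # Step-2/3: Euclidean distance from the new point to every training point,
    # then take the K nearest neighbors
    distances = nm.sqrt(((x_train - new_point) ** 2).sum(axis=1))
    nearest = nm.argsort(distances)[:k]
    # Step-4/5: count the neighbors in each category and assign the majority category
    return Counter(y_train[nearest]).most_common(1)[0][0]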
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
•Firstly, we will choose the number of neighbors, so we will choose the k=5.
•Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two
points, which we have already studied in geometry. Between two points (x1, y1) and (x2, y2) it can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
•By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two
nearest neighbors in category B. Consider the below image:
•As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
• There is no particular way to determine the best value for "K", so we need to try
some values to find the best out of them. The most preferred value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects
of outliers in the model.
• Large values for K are good, but a value that is too large may cause difficulties, since
neighbors from other categories start to be included.
• Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
• Disadvantages of KNN Algorithm:
• It always needs a value of K to be determined, which may be complex at times.
• The computation cost is high because of calculating the distances between a new data point
and all the training samples.
• Python implementation of the KNN algorithm
• To do the Python implementation of the K-NN algorithm, we will use the same problem
and dataset which we have used in Logistic Regression. But here we will improve the
performance of the model. Below is the problem description:
• Problem for the K-NN Algorithm: There is a car manufacturer company that has
manufactured a new SUV car. The company wants to show ads to the users who are
interested in buying that SUV. For this problem, we have a dataset that contains
information about multiple users from a social network. The dataset contains lots of
information, but we will consider Estimated Salary and Age as the independent
variables and the Purchased variable as the dependent variable. Below is the dataset:
• Steps to implement the K-NN algorithm:
• Data Pre-processing step
• Fitting the K-NN algorithm to the Training set
• Predicting the test result
• Test accuracy of the result(Creation of Confusion matrix)
• Visualizing the test set result.
• Data Pre-Processing Step:
• The Data Pre-processing step will remain exactly the same as Logistic Regression.
Below is the code for it:
• # importing libraries
• import numpy as nm
• import matplotlib.pyplot as mtp
• import pandas as pd
• #importing datasets
• data_set= pd.read_csv('user_data.csv')
• #Extracting Independent and dependent Variable
• x= data_set.iloc[:, [2,3]].values
• y= data_set.iloc[:, 4].values
• # Splitting the dataset into training and test set.
• from sklearn.model_selection import train_test_split
• x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
• #feature Scaling
• from sklearn.preprocessing import StandardScaler
• st_x= StandardScaler()
• x_train= st_x.fit_transform(x_train)
• x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed.
After feature scaling our test dataset will look like:
• From the above output image, we can see that our data is successfully scaled.
• Fitting K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this we will import
the KNeighborsClassifier class of the sklearn.neighbors library. After importing
the class, we will create the classifier object of the class. The parameters of this
class are:
• n_neighbors: the number of neighbors used by the algorithm. Usually, it takes 5.
• metric='minkowski': the default parameter; it decides the distance measure between the
points.
• p=2: with the Minkowski metric, this is equivalent to the standard Euclidean metric.
• And then we will fit the classifier to the training data. Below is the code for it:
• #Fitting K-NN classifier to the training set
• from sklearn.neighbors import KNeighborsClassifier
• classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )
• classifier.fit(x_train, y_train)
• Output: By executing the above code, we will get the output as:
• Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:
• #Predicting the test set result
• y_pred= classifier.predict(x_test)
• Output:
• The output for the above code will be:
• #Creating the Confusion matrix
• from sklearn.metrics import confusion_matrix
• cm= confusion_matrix(y_test, y_pred)
• In the above code, we have imported the confusion_matrix function and stored its
result in the variable cm.
• Output: By executing the above code, we will get the matrix as below:
• In the above image, we can see there are 64+29= 93 correct predictions and 3+4=
7 incorrect predictions, whereas, in Logistic Regression, there were 11 incorrect
predictions. So we can say that the performance of the model is improved by
using the K-NN algorithm.
• Visualizing the Training set result:
Now, we will visualize the training set result for the K-NN model. The code will
remain the same as in Logistic Regression, except for the name of the graph.
Below is the code for it:
• #Visualizing the training set result
• from matplotlib.colors import ListedColormap
• x_set, y_set = x_train, y_train
• x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
•              nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
• mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
• alpha = 0.75, cmap = ListedColormap(('red','green' )))
• mtp.xlim(x1.min(), x1.max())
• mtp.ylim(x2.min(), x2.max())
• for i, j in enumerate(nm.unique(y_set)):
• mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
• c = ListedColormap(('red', 'green'))(i), label = j)
• mtp.title('K-NN Algorithm (Training set)')
• mtp.xlabel('Age')
• mtp.ylabel('Estimated Salary')
• mtp.legend()
• mtp.show()
Output:
By executing the above code, we will get the below graph:
• The output graph is different from the graph we obtained in Logistic
Regression. It can be understood from the following points:
• As we can see, the graph shows red points and green points. The green
points are for the Purchased (1) variable and the red points for the Not Purchased (0) variable.
• The graph shows an irregular boundary instead of a straight line
or a curve, because it is a K-NN algorithm, i.e., it finds the nearest neighbors.
• The graph has classified the users into the correct categories, as most of the users who
didn't buy the SUV are in the red region and the users who bought the SUV are in
the green region.
• The graph shows a good result, but still there are some green points in the
red region and red points in the green region. This is not a big issue, as it
prevents the model from overfitting.
• Hence our model is well trained.
• Visualizing the Test set result:
After training the model, we will now test the result using a new
dataset, i.e., the test dataset. The code remains the same except for some minor
changes: x_train and y_train are replaced by x_test and y_test.
Below is the code for it:
• #Visualizing the test set result
• from matplotlib.colors import ListedColormap
• x_set, y_set = x_test, y_test
• x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
•              nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
• mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
• alpha = 0.75, cmap = ListedColormap(('red','green' )))
• mtp.xlim(x1.min(), x1.max())
• mtp.ylim(x2.min(), x2.max())
• for i, j in enumerate(nm.unique(y_set)):
• mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
• c = ListedColormap(('red', 'green'))(i), label = j)
• mtp.title('K-NN algorithm(Test set)')
• mtp.xlabel('Age')
• mtp.ylabel('Estimated Salary')
• mtp.legend()
• mtp.show()
Output:
The above graph shows the output for the test dataset. As we can see in the graph, the predicted output is quite
good, as most of the red points are in the red region and most of the green points are in the green region.
However, there are a few green points in the red region and a few red points in the green region. These are the
incorrect observations that we counted in the confusion matrix (7 incorrect predictions).
• Radial Basis Functions Neural Networks — All we need to know
• In a Single Perceptron / Multi-Layer Perceptron (MLP), we only have linear
separability, because they are composed of input and output layers (with some
hidden layers in the MLP).
• ⁃ For example, the AND and OR functions are linearly separable, while the XOR function
is not linearly separable.
• We need at least one hidden layer to derive a non-linear separation.
• ⁃ What the RBNN does is transform the input signal into another form,
which can then be fed into the network to obtain linear separability.
• ⁃ The RBNN is structurally the same as a perceptron (MLP).
• The RBNN is composed of input, hidden, and output layers. The RBNN is strictly limited to
exactly one hidden layer. We call this hidden layer the feature vector.
• The RBNN increases the dimension of the feature vector.
We define a receptor t.
⁃ We draw contour maps around the receptor.
⁃ Gaussian functions are generally used as the Radial Basis Function (for the contour mapping). So we
define the radial distance r = ||x − t||.
Gaussian Radial Function: φ(r) = exp(−r² / 2σ²), where σ > 0.
Classification only happens in the second phase, where a linear combination of the hidden functions
is driven to the output layer.
• Example: the XOR function.
• ⁃ We have 4 inputs and we will not increase the dimension of the feature vector here, so we
select 2 receptors. For each transformation function ϕ(x), we have a
receptor t.
• ⁃ Now consider the RBNN architecture:
• P := number of input features/values.
• ⁃ M := number of transformed vector dimensions (hidden layer width). Usually M ≥ P.
• ⁃ Each node in the hidden layer performs a non-linear radial basis function.
• ⁃ The output C remains the same as for classification problems (a certain predefined number of
class labels).
Architecture of XOR RBNN
Transformation function with receptors and variances.
Output → linear combination of transformation function is tabulated.
• Only the nodes in the hidden layer perform the radial basis transformation function.
• ⁃ The output layer performs a linear combination of the outputs of the hidden layer
to give a final probabilistic value at the output layer.
• ⁃ So the classification is done only at the (hidden layer → output layer) stage.
• Training the RBNN:
• ⁃ First, we should train the hidden layer using back propagation.
• ⁃ Neural network training (back propagation) is a curve-fitting method.
It fits a non-linear curve during the training phase and runs through stochastic
approximation, which we call back propagation.
• ⁃ For each of the nodes in the hidden layer, we have to find t (the receptors) and the
variance σ (the variance is the spread of the radial basis function).
• ⁃ In the second training phase, we have to update the weighting
vectors between the hidden layer and the output layer.
• ⁃ In the hidden layer, each node represents one transformation basis
function. Any one of the functions could satisfy the non-linear separability, or
even a combination of a set of functions could satisfy it.
• So in our hidden-layer transformation, all the non-linearity terms are included,
such as X² + Y² + 5XY; it is all included in a hyper-surface equation (X and Y are
inputs).
• ⁃ Therefore, the first stage of training is done by a clustering algorithm. We
define the number of cluster centers we need, and by the clustering algorithm we
compute the cluster centers, which are then assigned as the receptors of the
hidden neurons.
• ⁃ We have to cluster N samples or observations into M clusters (N > M).
• ⁃ So the output “clusters” are the “receptors”.
• ⁃ For each receptor, we can find the variance as the mean of the squared
distances between the receptor and its nearest cluster
samples: σ² = (1/N) * Σ ||X − t||².
• The interpretation of the first training phase is that the “feature vector is
projected onto the transformed space”.
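A compact Python sketch of the two training phases described above, under the assumption that the receptors come from k-means clustering and the output weights from a linear least-squares fit (the names and the least-squares step are illustrative; the second phase could equally update the weights iteratively):

import numpy as nm
from sklearn.cluster import KMeans

def train_rbnn(X, y, m_clusters=2):
    # Phase 1: clustering gives the receptors (cluster centers) and their variances
    km = KMeans(n_clusters=m_clusters, n_init=10).fit(X)
    receptors = km.cluster_centers_
    # Variance of each receptor: (1/N) * sum ||X - t||^2 over its cluster samples
    var = nm.array([nm.mean(nm.sum((X[km.labels_ == i] - receptors[i]) ** 2, axis=1))
                    for i in range(m_clusters)]) + 1e-8
    # Gaussian radial basis features: phi = exp(-||x - t||^2 / (2 * sigma^2))
    d2 = ((X[:, None, :] - receptors[None, :, :]) ** 2).sum(axis=2)
    phi = nm.exp(-d2 / (2 * var))
    # Phase 2: linear combination of the hidden outputs, fitted by least squares
    weights, *_ = nm.linalg.lstsq(phi, y, rcond=None)
    return receptors, var, weights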
• We have also had feedback from large automotive users that they prefer FFNNs,
but cannot afford them (a typical automotive design problem might have 7 cases,
50 variables and 100 constraint functions). So when using ensembles, typically
4500 neural networks must be computed individually (including the ensembles
and hidden nodes options). This could take days.
• For an optimization in which the user is really only interested to arrive at a single
design point (i.e. a converged solution), using the default SRSM (sequential)
approach with linear basis functions (the default approach) is still the best and
cheapest. It also works well for a large number of variables and its cost is in a
linear relation to the number of variables.