ML Final
DIGITAL
NOTES ON
Machine
Learning
(R20D5803)
M.Tech., II YEAR – I
SEM (2021-2022)
UNIT - I
Introduction: Well-posed learning problems, designing a learning system, perspectives and issues in
machine learning. Concept learning and the general-to-specific ordering: Introduction, a concept learning
task, concept learning as search, Find-S: Finding a Maximally Specific Hypothesis, Version Spaces and
the Candidate Elimination algorithm, Remarks on Version Spaces and Candidate Elimination, Inductive
Bias. Decision Tree Learning: Introduction, Decision Tree Representation, Appropriate Problems for
Decision Tree Learning, The Basic Decision Tree Learning Algorithm, Hypothesis Space Search in
Decision Tree Learning, Inductive Bias in Decision Tree Learning, Issues in Decision Tree Learning.
UNIT - II
Artificial Neural Networks: Introduction, Neural Network Representation, Appropriate Problems for
Neural Network Learning, Perceptrons, Multilayer Networks and the Backpropagation Algorithm,
Discussion on the Backpropagation Algorithm, An Illustrative Example: Face Recognition.
UNIT - III
Bayesian Learning: Introduction, Bayes Theorem, Bayes Theorem and Concept Learning, Maximum
Likelihood and Least Squared Error Hypotheses, Maximum Likelihood Hypotheses for Predicting
Probabilities, Minimum Description Length Principle, Bayes Optimal Classifier, Gibbs Algorithm, Naïve
Bayes Classifier, An Example: Learning to Classify Text, Bayesian Belief Networks, EM Algorithm.
Instance-Based Learning-Introduction, k-Nearest Neighbor Learning, Locally Weighted Regression,
Radial Basis Functions, Case-Based Reasoning, Remarks on Lazy and Eager Learning.
UNIT -IV
Pattern Comparison Techniques: Temporal patterns, Dynamic Time Warping Methods. Clustering:
Introduction to clustering, K-means clustering, K-Mode Clustering, Codebook Generation, Vector
Quantization.
UNIT - V
Genetic Algorithms: Different search methods for induction - Explanation-based Learning: using prior
knowledge to reduce sample complexity. Dimensionality reduction: feature selection, principal
component analysis, linear discriminant analysis, factor analysis, independent component analysis,
multidimensional scaling, and manifold learning.
Textbooks:
1. Tom M. Mitchell, "Machine Learning", MGH.
2. Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition".
3. Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, Prentice Hall of India,
3rd Edition, 2014.
4. Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar, "Foundations of
Machine Learning", MIT Press, 2012.
References:
1. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", Taylor & Francis.
INDEX
Unit IV: Clustering
Unit IV: K-means clustering
Unit IV: K-Mode Clustering. Codebook Generation
Unit IV: Vector Quantization
Unit V: Dimensionality reduction
UNIT-I
Machine Learning
Machine learning is the field of study that gives computers the capability to learn without
being explicitly programmed. ML is one of the most exciting technologies
that one can come across. As is evident from the name, it
gives the computer the ability that makes it more similar to humans: the ability to
learn. Machine learning is actively being used today, perhaps in many more
places than one would expect.
Machine learning evolved from left to right as shown in the above diagram.
• Initially, researchers started out with Supervised Learning, as in the
case of housing price prediction.
• This was followed by Unsupervised Learning, where the machine is made
to learn on its own without any supervision.
• Scientists then discovered that it may be a good idea to reward the
machine when it does the job the expected way, and so came
Reinforcement Learning.
• Very soon, the available data became so large that the conventional
techniques developed so far failed to analyse it and provide
predictions.
• Thus came Deep Learning, where the human brain is simulated by
Artificial Neural Networks (ANN) created in our binary computers.
• The machine now learns on its own using the high computing power and
huge memory resources that are available today.
• Deep Learning has since solved many previously
unsolvable problems.
• The technique has been further advanced by giving rewards to Deep
Learning networks, which finally gives Deep Reinforcement
Learning.
Let us now study each of these categories in more detail.
Supervised Learning:
Supervised learning is analogous to training a child to walk. You will hold
the child’s hand, show him how to take his foot forward, walk yourself for a
demonstration and so on, until the child learns to walk on his own.
Regression:
Similarly, in the case of supervised learning, you give concrete known
examples to the computer. You say that for a given feature value x1 the output
is y1, for x2 it is y2, for x3 it is y3, and so on. Based on this data, you let the
computer figure out an empirical relationship between x and y. Once the
machine has been trained in this way with a sufficient number of data points,
you ask the machine to predict Y for a given X. Assuming that you
know the real value of Y for this given X, you will be able to judge whether
the machine's prediction is correct. Thus, you test whether the machine
has learned by using known test data. Once you are satisfied that the
machine is able to make predictions with a desired level of accuracy (say 80
to 90%), you can stop training the machine. Now you can use
the machine to make predictions on unknown data points, that is, ask the
machine to predict Y for a given X for which you do not know the real value
of Y. This kind of training is called regression.
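The train-then-test workflow described above can be sketched in a few lines. This is a minimal illustration, assuming a single feature, a straight-line relationship, and synthetic data generated from y = 3x + 2 plus noise; the names `fit_line` and `mean_squared_error` are hypothetical helpers, and mean squared error stands in for the "accuracy" check because the output is a number rather than a class.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b for one feature; returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

def mean_squared_error(w, b, xs, ys):
    """Average squared gap between predictions and known outputs."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 0.5) for x in xs]

# Train on the first 150 points, hold out the remaining 50 as known test data.
x_train, y_train = xs[:150], ys[:150]
x_test, y_test = xs[150:], ys[150:]

w, b = fit_line(x_train, y_train)
test_error = mean_squared_error(w, b, x_test, y_test)
print(f"w={w:.2f}, b={b:.2f}, test MSE={test_error:.3f}")
```

Only when the error on the held-out points is acceptably small would the fitted line be used to predict Y for inputs whose true output is unknown.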
Classification:
You may also use machine learning techniques for classification problems. In
classification problems, you classify objects of similar nature into a single
group. For example, in a set of 100 students, you may like to group them
into three groups based on their heights: short, medium and tall. Measuring
the height of each student, you place them in the proper group. Now, when
a new student comes in, you put him in the appropriate group by
measuring his height. Following the same principles as in regression training, you
train the machine to classify a student based on his feature, the height.
When the machine has learned how the groups are formed, it will be able to
classify a new, unknown student. Once again, you would use
test data to verify that the machine has learned your technique of
classification before putting the developed model in production.
Supervised learning is where AI really began its journey. This technique has been
applied successfully in several cases. You have used such a model when using
handwriting recognition on your machine. Several algorithms have been
developed for supervised learning. You will learn about them in the
following chapters.
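The height example can be made concrete with a small sketch. It assumes a nearest-centroid rule (each group is summarised by its mean height, and a new student joins the group with the closest mean), which is one simple choice and not necessarily the method intended by these notes; the heights and the helper names `train_centroids` and `classify` are made up for illustration.

```python
def train_centroids(heights, labels):
    """Return the mean height of each labelled group."""
    sums, counts = {}, {}
    for h, lab in zip(heights, labels):
        sums[lab] = sums.get(lab, 0.0) + h
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

def classify(height, centroids):
    """Assign the group whose mean height is closest to the given height."""
    return min(centroids, key=lambda lab: abs(height - centroids[lab]))

# Hypothetical training data: heights in cm with their known groups.
heights = [150, 152, 155, 165, 168, 170, 180, 183, 186]
labels = ["short"] * 3 + ["medium"] * 3 + ["tall"] * 3

centroids = train_centroids(heights, labels)
print(classify(158, centroids))  # group for a new, unseen student
```

As with regression, a held-out set of students with known groups would be used to check the classifier before relying on it.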
Unsupervised Learning:
In unsupervised learning, we do not specify a target variable to the machine;
rather, we ask the machine “What can you tell me about X?”. More specifically,
given a huge data set X, we may ask questions such as “What are the five
best groups we can make out of X?” or “What features occur together most
frequently in X?”. To arrive at the answers to such questions, the
machine typically needs a very large number of data points. In the case of supervised learning, the
machine can be trained with even a few thousand data points.
However, in the case of unsupervised learning, the number of data points that is
reasonably accepted for learning starts at a few million. These days, data
is generally abundantly available. The data ideally requires curating;
however, given the amount of data that is continuously flowing through a social
network, data curation is in most cases an impossible task. The following
figure shows the boundary between the yellow and red dots as determined by
unsupervised machine learning. You can see clearly that the machine
would be able to determine the class of each of the black dots with fairly
good accuracy.
Reinforcement Learning:
Consider training a pet dog to bring a ball to us. We throw
the ball a certain distance and ask the dog to fetch it back to us. Every time
the dog does this right, we reward the dog. Slowly, the dog learns that doing
the job right earns him a reward, and the dog then starts doing the job the right
way every time. Exactly this concept is applied in the “Reinforcement”
type of learning. The technique was initially developed for machines to play
games. The machine is given an algorithm to analyse all possible moves at
each stage of the game. The machine may select one of the moves at random.
If the move is right, the machine is rewarded; otherwise it may be penalized.
Slowly, the machine starts differentiating between right and wrong
moves, and after several iterations it learns to solve the game puzzle with
better accuracy. The rate of winning the game improves as the
machine plays more and more games.
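The reward-and-penalty loop described above can be sketched as follows. This is a deliberately tiny illustration under stated assumptions: there is a single game stage with three possible moves, exactly one of which is "right"; the reward is +1 and the penalty is -1; and the machine explores a random move with probability `epsilon`. The function name `train` and all parameter values are hypothetical.

```python
import random

def train(n_moves=3, right_move=2, episodes=500, epsilon=0.1, seed=0):
    """Learn an average reward estimate for each move by trial and error."""
    rng = random.Random(seed)
    value = [0.0] * n_moves   # running average reward per move
    count = [0] * n_moves     # how often each move has been tried
    for _ in range(episodes):
        if rng.random() < epsilon:
            move = rng.randrange(n_moves)                       # explore at random
        else:
            move = max(range(n_moves), key=lambda m: value[m])  # exploit best so far
        reward = 1.0 if move == right_move else -1.0            # reward or penalty
        count[move] += 1
        value[move] += (reward - value[move]) / count[move]     # update the average
    return value

values = train()
best_move = max(range(len(values)), key=lambda m: values[m])
print(values, best_move)
```

After enough episodes the rewarded move has the highest estimated value, so the machine picks it whenever it is not exploring.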
The entire process may be depicted in the following diagram:
Deep Learning:
Deep learning is a family of models based on Artificial Neural Networks (ANN),
of which Convolutional Neural Networks (CNNs) are one example. There are several
architectures used in deep learning, such as deep neural networks, deep belief
networks, recurrent neural networks, and convolutional neural networks.
These networks have been successfully applied to problems in
computer vision, speech recognition, natural language processing,
bioinformatics, drug design, medical image analysis, and games. There are
several other fields in which deep learning is actively applied. Deep
learning requires huge processing power and very large amounts of data, both of which are
generally easily available these days. We will talk about deep learning in more
detail in the coming chapters.
Now that you have got a brief introduction to various machine learning models, let us
explore in slightly more depth the various algorithms that are available under these
models.
Just now we looked into the learning process and also understood the goal
of the learning. When we want to design a learning system that follows
the learning process, we need to consider a few design choices. The
design choices will be to decide the following key components:
1. Type of training experience
2. Choosing the Target Function
3. Choosing a representation for the Target Function
4. Choosing an approximation algorithm for the Target Function
5. The final Design
We will look at the checkers learning problem and apply the above
design choices. For a checkers learning problem, the three elements will be:
• Task T: To play checkers
• Performance measure P: Percentage of games won in the tournament.
• Training experience E: A set of games played against itself.
1. Teacher or Not:
Supervised:
The training experience will be labelled, which means, all the board states
will be labelled with the correct move. So the learning takes place in the
presence of a supervisor or a teacher.
Un-Supervised:
The training experience will be unlabelled, which means, all the board
states will not have the moves. So the learner generates random games and
plays against itself with no supervision or teacher involvement.
Semi-supervised:
Learner generates game states and asks the teacher for help in finding
the correct move if the board state is confusing.
Choosing the Target Function:
When you are playing the checkers game, at any moment of time, you make
a decision on choosing the best move from different possibilities. You think
state tends towards the winning situation. Now the same learning has to be
defined in terms of the target function.
Here there are two considerations: direct and indirect experience.
• With direct experience, the checkers learning system needs only
to learn how to choose the best move from some large search space. We
need to find a target function that will help us choose the best move among
alternatives.
Let us call this function ChooseMove and use the notation ChooseMove: B
→ M to indicate that this function accepts as input any board from the set of
legal board states B and produces as output some move from the set of legal
moves M.
• With indirect experience, it becomes difficult to learn such a
function. An alternative is to assign a real-valued score to each board state,
using a target function V.
If the system can successfully learn such a target function V, then it can
easily use it to select the best move from any board position.
Let us therefore define the target value V(b) for an arbitrary board state b in
B, as follows:
4. if b is not a final state in the game, then V(b) = V(b’), where b’ is the best
final board state that can be achieved starting from b and playing optimally
until the end of the game.
Case (4) is a recursive definition: to determine the value of V(b) for a
particular board state, it searches ahead for the optimal line of
play, all the way to the end of the game. Because this definition is not efficiently
computable by our checkers-playing program, we say that it is a
non-operational definition.
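To see why the recursive definition is non-operational, it helps to write it down literally. The sketch below is an illustration under several assumptions: the game is a tiny hand-made tree rather than checkers, final states are scored +100 for a win, -100 for a loss and 0 for a draw (values assumed here, since the notes do not show the first three cases), and "playing optimally" is taken to mean that we maximise while the opponent minimises.

```python
# Scores for final board states (assumed values for illustration).
TERMINAL_VALUE = {"win": 100, "loss": -100, "draw": 0}

def V(node, our_turn=True):
    """Value of a board state under optimal play by both sides.

    A final state is a string ("win", "loss" or "draw"); a non-final
    state is a list of the states reachable in one move.
    """
    if isinstance(node, str):
        return TERMINAL_VALUE[node]
    child_values = [V(child, not our_turn) for child in node]
    # We pick the best successor on our turn; the opponent picks the worst for us.
    return max(child_values) if our_turn else min(child_values)

# Hypothetical three-move game tree, searched all the way to the end.
game_tree = [
    ["win", "draw"],                      # opponent can force a draw
    ["loss", "win"],                      # opponent can force our loss
    [["win", "draw"], ["win", "win"]],    # we win whatever the opponent does
]
print(V(game_tree))
```

Every call explores the whole subtree below a state, which is affordable for this toy tree but not for the number of board states in a real game of checkers.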
$V_{train}(b) \leftarrow \hat{V}(\mathrm{Successor}(b))$
CONCEPT LEARNING:
Consider a set of example days, each described by six attributes. The task is to
learn to predict the value of EnjoySport for an arbitrary day, based on the
values of its attributes.
FIND-S:
• The FIND-S algorithm starts from the most specific hypothesis and generalizes it
by considering only positive examples.
• The FIND-S algorithm ignores negative examples: as long as the hypothesis
space contains a hypothesis that describes the
true target concept, and the training data contains no errors, ignoring
negative examples does not cause any problem.
• The FIND-S algorithm finds the most specific hypothesis within H that is
consistent with the positive training examples. The final hypothesis will
also be consistent with the negative examples if the correct target concept is in
H and the training examples are correct.
FIND-S Algorithm:
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   For each attribute constraint a_i in h
      If the constraint a_i is satisfied by x
      Then do nothing
      Else replace a_i in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
FIND-S Algorithm – Example:
Important-Representation:
2. Take the next example; if it is negative, then no changes occur to the
hypothesis.
3. If the example is positive and we find that our current hypothesis is too
specific, then we update the hypothesis to a more general condition.
4. Keep repeating the above steps until all the training examples have been processed.
5. After we have processed all the training examples, we will have the final
hypothesis, which we can use to classify new examples.
Example: Consider the following data set, which records which particular seeds are
poisonous.
Consider example 1:
The data in example 1 is {GREEN, HARD, NO, WRINKLED}. We see that
our initial hypothesis is more specific and we have to generalize it for this
example.
Hence, the hypothesis becomes:
h = {GREEN, HARD, NO, WRINKLED}
Consider example 2:
Here we see that this example has a negative outcome. Hence we neglect
this example and our hypothesis remains the same.
h = {GREEN, HARD, NO, WRINKLED}
Consider example 3:
Here we see that this example has a negative outcome. Hence we neglect
this example and our hypothesis remains the same.
h = {GREEN, HARD, NO, WRINKLED}
Consider example 4:
The data present in example 4 is {ORANGE, HARD, NO, WRINKLED}. We
compare every single attribute with the current hypothesis, and if any mismatch is
found we replace that particular attribute with the general case (“?”). After
doing this the hypothesis becomes:
h = {?, HARD, NO, WRINKLED}
Consider example 5:
The data present in example 5 is {GREEN, SOFT, YES, SMOOTH}. We
compare every single attribute with the current hypothesis, and if any mismatch is
found we replace that particular attribute with the general case (“?”).
After doing this the hypothesis becomes:
h = {?, ?, ?, ?}
Since we have reached a point where all the attributes in our hypothesis
have the general condition, example 6 and example 7 would result in the
same hypothesis with all general attributes.
h = {?, ?, ?, ?}
Hence, for the given data the final hypothesis would be:
Final Hypothesis: h = {?, ?, ?, ?}.
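The trace above can be reproduced with a short sketch of FIND-S. It assumes hypotheses are lists of attribute values with "?" as the general case, and it uses `None` to stand for the most specific hypothesis; only the three positive examples whose attribute values appear in the text are included, since the algorithm skips negative examples anyway. The function name `find_s` is illustrative.

```python
def find_s(examples):
    """Return the most specific hypothesis consistent with the positive examples."""
    h = None  # stands for the most specific hypothesis, which matches nothing
    for attributes, is_positive in examples:
        if not is_positive:
            continue  # FIND-S ignores negative examples
        if h is None:
            h = list(attributes)  # first positive example is adopted as-is
        else:
            # Keep a constraint if the example satisfies it, else generalise to "?".
            h = [hv if hv == av else "?" for hv, av in zip(h, attributes)]
    return h

# Positive examples 1, 4 and 5 from the worked example above.
examples = [
    (("GREEN", "HARD", "NO", "WRINKLED"), True),
    (("ORANGE", "HARD", "NO", "WRINKLED"), True),
    (("GREEN", "SOFT", "YES", "SMOOTH"), True),
]

print(find_s(examples[:2]))  # hypothesis after example 4
print(find_s(examples))      # final hypothesis
```

Because attribute values here are plain symbols, the "next more general constraint" for a mismatched attribute is always "?".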
Version Spaces
Definition (Version space). A concept is complete if it covers all positive
examples.
A concept is consistent if it covers none of the negative examples. The
version space is the set of all complete and consistent concepts. This set is
convex and is fully defined by its least and most general elements.
• Consider the third training example. This negative example reveals that the
boundary of the version space is overly general, that is, the hypothesis in G
incorrectly predicts that this new example is a positive example.
• The hypothesis in the G boundary must therefore be specialized until it
correctly classifies this new negative example.
Given that there are six attributes that could be specified to specialize G2,
why are there only three new hypotheses in G3?
Inductive bias:
A decision tree can be explained using the binary tree above. Let’s say
you want to predict whether a person is fit, given information such as age,
eating habits, and physical activity. The decision nodes here are questions like
‘What’s the age?’, ‘Does he exercise?’, and ‘Does he eat a lot of pizzas?’. The
leaves are outcomes such as ‘fit’ or ‘unfit’. In this case, this is a
binary classification problem (a yes/no type problem). There are two main types
of Decision Trees:
Here the decision or the outcome variable is continuous, e.g. a number like
123.
Working
Now that we know what a Decision Tree is, we’ll see how it
works internally. There are many algorithms that construct
Decision Trees, but one of the best known is the ID3 algorithm. ID3 stands
for Iterative Dichotomiser 3.
ID3 searches the set of possible decision trees from simple to complex, using a hill-climbing search.
Capability:
• ID3 cannot determine how many alternative decision trees are consistent with
the available training data.
• ID3 uses all training examples at each step to make statistically based
decisions regarding how to refine its current hypothesis.
BFS-ID3
Difference between ID3 and Candidate-Elimination (C-E); restriction bias and preference
bias
ID3:
• Searches a complete hypothesis space incompletely.
• Inductive bias is solely a consequence of the ordering of hypotheses by its search strategy.
• Type of bias: preference bias.
Candidate-Elimination:
• Searches an incomplete hypothesis space completely.
• Inductive bias is solely a consequence of the expressive power of its hypothesis representation.
• Type of bias: restriction bias.
UNIT-II
Artificial Neural Networks
Introduction:
Artificial Neural Networks (ANN) are algorithms based on brain function
and are used to model complicated patterns and forecasting problems. The
ANN is a deep learning method that arose from
the concept of the biological neural networks of the human brain. The
development of ANN was the result of an attempt to replicate the workings
of the human brain. The workings of ANN are similar to those of
biological neural networks, although they are not identical. ANN algorithms
accept only numeric and structured data.
The ANN applications:
• Classification: the aim is to predict the class of an input vector.
• Pattern matching: the aim is to produce a pattern best associated with a
given input vector.
• Pattern completion: the aim is to complete the missing parts of a given input
vector.
• Optimization: the aim is to find the optimal values of parameters in an
optimization problem.
• Control: an appropriate action is suggested based on given input vectors.
• Function approximation/time series modelling: the aim is to learn the
functional relationships between input and desired output vectors.
• Data mining: the aim is to discover hidden patterns from
data (knowledge discovery).
ANN architectures
• Neural networks are known to be universal function approximators.
• Various architectures are available to approximate any nonlinear function.
• Different architectures allow for generation of functions of
different complexity and power:
Feed forward networks
Feedback networks
Lateral networks
4. 1969: Minsky and Papert showed that the Perceptron cannot deal with
nonlinearly separable data sets, even those that represent simple functions
such as XOR.
5. 1970-1985: Very little research on neural nets.
6. 1986: Invention of Backpropagation (Rumelhart and McClelland, but also
Parker and, earlier on, Werbos), which can learn from nonlinearly separable
data sets.
7. Since 1985: A lot of research in neural nets!
Definition:
The backpropagation algorithm in a neural network computes the gradient of
the loss function with respect to each weight by the chain rule. It computes the gradient efficiently,
one layer at a time, unlike a naive direct computation. It computes the
gradient, but it does not define how the gradient is used. It generalizes the
computation in the delta rule.
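A minimal sketch of the chain-rule computation may help here. It assumes the smallest possible network (one input, one sigmoid hidden unit, one linear output) and a squared-error loss; the weight names `w1` and `w2` and the helper functions are illustrative and do not come from the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    """Forward pass: input -> sigmoid hidden unit -> linear output."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    return h, y_hat

def loss(w1, w2, x, y):
    """Squared-error loss for one training pair (x, y)."""
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    """Gradient of the loss with respect to each weight, one layer at a time."""
    h, y_hat = forward(w1, w2, x)
    d_yhat = y_hat - y              # dL/dy_hat at the output
    d_w2 = d_yhat * h               # output-layer weight
    d_h = d_yhat * w2               # error passed back to the hidden unit
    d_w1 = d_h * h * (1 - h) * x    # through the sigmoid to the first weight
    return d_w1, d_w2

grad_w1, grad_w2 = backprop(0.5, -0.3, 1.5, 0.8)
print(grad_w1, grad_w2)
```

The function returns only the gradient; how it is used (for example, a gradient descent step `w = w - learning_rate * grad`) is a separate choice, as noted above.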
Consider the following Back propagation neural network example diagram
to understand:
• If any pattern remains in the pool, then go back to Step 2. If all the training
patterns in the pool have been used, then set EP = EP+1, and if EP < EPMax,
then create a random pool of patterns and go to Step 2. If EP = EPMax, then
stop.
UNIT - III
Imagine a situation where your friend gives you a new coin and asks you about the
fairness of the coin (or the probability of observing heads) without even
flipping the coin once. You are also aware that your friend has not
made the coin biased. In general, you have seen that coins are fair, thus you
expect the probability of observing heads to be 0.5. In the absence of any
observations, you assert the fairness of the coin using only your past
experiences or observations with coins.
Suppose that you are allowed to flip the coin 10 times in order to
determine the fairness of the coin. Your observations from the experiment
will fall under one of the following cases:
If case 1 is observed, you are now more certain that the coin is a fair coin,
and you will decide that the probability of observing heads is 0.5 with
more confidence. If case 2 is observed you can either:
1. Neglect your prior beliefs since you now have new data, and decide that the
probability of observing heads is h/10 by depending solely on recent
observations.
2. Adjust your belief according to the value of h that you have just observed,
and decide the probability of observing heads using both your prior belief and your recent
observations.
The first method is the frequentist method, where we
omit our beliefs when making decisions. However, the second method
seems to be more convenient because 10 coin flips are insufficient to
determine the fairness of a coin. Therefore, we can make better decisions
by combining our recent observations and the beliefs that we have gained
through our past experiences. This way of thinking, which uses our most
recent observations together with our beliefs or inclinations,
is known as Bayesian thinking.
Moreover, assume that your friend allows you to conduct another 10 coin
flips. Then we can use these new observations to further update our beliefs.
As we gain more data, we can incrementally update our beliefs, increasing
the certainty of our conclusions. This is known as incremental learning,
where you update your knowledge incrementally with new evidence.
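One common way to make this incremental update concrete is a Beta prior over the probability of heads, which is an assumption introduced here for illustration and not something these notes specify. A Beta(a, b) belief behaves like having already seen a heads and b tails, so each new batch of flips is absorbed by adding its counts. The prior (5, 5) and the second batch of flips are hypothetical.

```python
def update(belief, heads, tails):
    """Absorb a batch of coin flips into a Beta(a, b) belief."""
    a, b = belief
    return (a + heads, b + tails)

def mean(belief):
    """Expected probability of heads under the current belief."""
    a, b = belief
    return a / (a + b)

prior = (5, 5)                                        # hypothetical prior centred on 0.5
after_first = update(prior, heads=6, tails=4)         # first 10 flips
after_second = update(after_first, heads=4, tails=6)  # another 10 flips

print(mean(prior), mean(after_first), mean(after_second))
```

Updating batch by batch gives the same belief as updating once with all 20 flips, which is what makes the incremental view convenient.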
Bayesian learning comes into play on such occasions, where we are unable
to use frequentist statistics due to the drawbacks that we have discussed
above. We can use Bayesian learning to address all these drawbacks, and
even gain additional capabilities (such as incremental updates of the
posterior), when testing a hypothesis to estimate unknown parameters of a
machine learning model. Bayesian learning uses Bayes’ theorem to
determine the conditional probability of a hypothesis given some evidence
or observations.
The Famous Coin Flip Experiment
When we flip a coin, there are two possible outcomes - heads or tails. Of
course, there is a third rare possibility where the coin balances on its edge
without falling onto either side, which we assume is not a possible outcome
of the coin flip for our discussion. We conduct a series of coin flips and
record our observations i.e. the number of the heads (or tails) observed for a
certain number of coin flips. In this experiment, we are trying to determine
the fairness of the coin, using the number of heads (or tails) that we observe.
Frequentist Statistics
Let us think about how we can determine the fairness of the coin using our
observations in the above-mentioned experiment. Once we have conducted a
sufficient number of coin flip trials, we can determine the frequency or the
probability of observing heads (or tails). If we observe heads and tails
with equal frequencies, that is, the probability of observing heads (or tails) is
0.5, then it can be established that the coin is a fair coin. Failing that, it is
a biased coin. Let's denote p as the probability of observing heads.
Since the amount by which p deviates from 0.5 indicates how
biased the coin is, p can be considered as the degree of fairness of the coin.
Will p continue to change when we further increase the number of coin flip
trials?
We cannot find out the exact answers to the first three questions using
frequentist statistics. We may assume that the true value of p is closer to
0.55 than 0.6 because the former is computed using observations from
a considerable number of trials compared to what we used to compute the
latter. Yet there is no way of confirming that hypothesis. However, if we
further increase the number of trials, we may get a different probability from
both of the above values for observing heads, and eventually we may
even discover that the coin is a fair coin.
Number of coin flips   Number of heads   Probability of observing heads
10                     6                 0.6
50                     29                0.58
100                    55                0.55
200                    94                0.47
500                    245               0.49
Table 1 - Coin flip experiment results when increasing the number of
trials
Moreover, we may have valuable insights or prior beliefs (for example, coins
are usually fair and the coin used is not made biased intentionally, therefore
p ≈ 0.5) that describe the value of p. Embedding that information can
significantly improve the accuracy of the final conclusion. Such beliefs play a
significant role in shaping the outcome of a hypothesis test, especially when
we have limited data. However, with frequentist statistics, it is not possible to
incorporate such beliefs or past experience to increase the accuracy of the
hypothesis test.
Bayes’ Theorem
all the test cases, including our prior belief that we have rarely observed any
bugs in our code. However, this intuition goes beyond that simple hypothesis
test where there are multiple events or hypotheses involved (let us not worry
about this for the moment).
$P(\theta|X) = \frac{P(X|\theta)\,P(\theta)}{P(X)}$
I will now explain each term in Bayes’ theorem using the above example.
Consider the hypothesis that there are no bugs in our code. $\theta$ and $X$ denote
that our code is bug free and passes all the test cases respectively.
now know the values for the other three terms in the Bayes’ theorem, we can
calculate the posterior probability using the following formula:
$P(\theta|X) = \frac{1 \times p}{0.5(1+p)}$
We can also calculate the probability of observing a bug, given that our code
passes all the test cases:
$P(\neg\theta|X) = \frac{P(X|\neg\theta)\,P(\neg\theta)}{P(X)} = \frac{0.5 \times (1-p)}{0.5 \times (1+p)} = \frac{1-p}{1+p}$
We can use MAP to determine the valid hypothesis from a set of hypotheses.
According to MAP, the hypothesis that has the maximum posterior
probability is considered as the valid hypothesis. Therefore, we can express
the hypothesis $\theta_{MAP}$ that is concluded using MAP as follows:
$\theta_{MAP} = \arg\max_{\theta} P(\theta_i|X) = \arg\max_{\theta} \left( \frac{P(X|\theta_i)\,P(\theta_i)}{P(X)} \right)$
The $\arg\max_{\theta}$ operator estimates the event or hypothesis $\theta_i$ that
maximizes the posterior probability $P(\theta_i|X)$. Let us apply MAP to the
above example in order to determine the true hypothesis:
$\theta_{MAP} = \arg\max_{\theta} \left\{ \theta: P(\theta|X) = \frac{p}{0.5(1+p)},\ \neg\theta: P(\neg\theta|X) = \frac{1-p}{1+p} \right\}$
$\theta_{MAP} = \arg\max_{\theta} \left\{ \theta: P(\theta|X) = \frac{0.4}{0.5(1+0.4)},\ \neg\theta: P(\neg\theta|X) = \frac{0.5(1-0.4)}{0.5(1+0.4)} \right\} = \arg\max_{\theta} \left\{ \theta: P(\theta|X) = 0.57,\ \neg\theta: P(\neg\theta|X) = 0.43 \right\} = \theta \implies$ No bugs present in our code
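The arithmetic in this MAP calculation can be checked with a short sketch. It reads the two likelihoods off the formulas above (the code passes all tests with probability 1 if it is bug free and 0.5 if it has bugs) and takes the prior p = 0.4 used in the calculation; the function name `posteriors` and the dictionary keys are illustrative.

```python
def posteriors(p, pass_given_bug_free=1.0, pass_given_bug=0.5):
    """Posterior of each hypothesis given that all test cases pass.

    p is the prior probability that the code is bug free.
    """
    # P(X): total probability of passing all tests under both hypotheses.
    evidence = pass_given_bug_free * p + pass_given_bug * (1 - p)
    return {
        "no bugs": pass_given_bug_free * p / evidence,
        "bugs": pass_given_bug * (1 - p) / evidence,
    }

post = posteriors(0.4)
theta_map = max(post, key=post.get)  # hypothesis with the largest posterior
print({name: round(value, 2) for name, value in post.items()}, theta_map)
```

The evidence term is the same for both hypotheses, so dividing by it changes the posterior values but not which hypothesis wins.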
However, $P(X)$ is independent of $\theta$, and thus $P(X)$ is the same for all
hypotheses. It can therefore be dropped from the maximization, as shown
below: $\theta_{MAP} = \arg\max_{\theta} \left( P(X|\theta_i)\,P(\theta_i) \right)$
Using Bayes’ theorem, we can now incorporate our belief as the prior
probability, which was not possible when we used frequentist statistics.
However, we still have the problem of deciding a sufficiently large number of
trials or attaching a confidence to the concluded hypothesis. This is because
the above example was solely designed to introduce Bayes’ theorem
and each of its terms. Let us now gain a better understanding of
Bayesian learning to learn about the full potential of Bayes’ theorem.
Binomial Likelihood
The likelihood for the coin flip experiment is given by the probability of
observing heads out of all the coin flips given the fairness of the coin. As we
have defined the fairness of the coin ($\theta$) using the probability of observing
heads for each coin flip, we can define the probability of observing heads or
tails given the fairness of the coin, $P(y|\theta)$, where $y=1$ for observing
heads and $y=0$ for observing tails:
$P(y=1|\theta) = \theta$
$P(y=0|\theta) = 1-\theta$
Now that we have defined two conditional probabilities for each outcome,
note that $y$ can only take either 0 or 1, and $\theta$ lies within the range
[0, 1]. The two cases can therefore be combined into a single expression as
follows:
$P(Y=y|\theta) = \theta^{y} \times (1-\theta)^{1-y}$
The above equation represents the likelihood of a single coin flip
experiment.
For $N$ coin flips with $k$ observed heads, the likelihood is the binomial distribution
shown below:
$P(k, N|\theta) = \binom{N}{k}\, \theta^{k} (1-\theta)^{N-k}$
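As a quick illustration, the binomial likelihood can be evaluated directly. The sketch below uses the first row of Table 1 (6 heads in 10 flips) and scans a grid of candidate values for θ, which is an illustrative choice made here; `binomial_likelihood` is a hypothetical helper name.

```python
from math import comb

def binomial_likelihood(k, n, theta):
    """Probability of observing k heads in n flips for a coin with P(heads) = theta."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

# Likelihood of 6 heads in 10 flips if the coin is fair.
fair_likelihood = binomial_likelihood(6, 10, 0.5)

# Scan candidate fairness values and keep the one with the highest likelihood.
grid = [i / 100 for i in range(101)]
best_theta = max(grid, key=lambda t: binomial_likelihood(6, 10, t))

print(fair_likelihood, best_theta)
```

The likelihood peaks at the observed frequency of heads, which is why an estimate based on the likelihood alone coincides with the frequentist estimate.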
Least squares estimates are calculated by fitting a regression line to the points
from a data set that has the minimal sum of the deviations squared (least
square error). In reliability analysis, the line and the data are plotted on a
probability plot.
The Bayes optimal classifier is a probabilistic model that makes the most
probable prediction for a new example, given the training dataset.
This model is also referred to as the Bayes optimal learner, the Bayes
classifier, Bayes optimal decision boundary, or the Bayes optimal
discriminant function.
Let’s take a look at an example. Suppose we had the following posterior and
conditional probability distributions.
EXAMPLE
Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we
should play or not on a particular day according to the weather conditions.
To solve this problem, we need to follow the steps below:
      Outlook    Play
0     Rainy      Yes
1     Sunny      Yes
2     Overcast   Yes
3     Overcast   Yes
4     Sunny      No
5     Rainy      Yes
6     Sunny      Yes
7     Overcast   Yes
8     Rainy      No
9     Sunny      No
10    Sunny      Yes
11    Rainy      No
12    Overcast   Yes
13    Overcast   Yes
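Frequency table for the weather conditions:
Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4
Likelihood table for the weather conditions:
Weather    No             Yes             P(Weather)
Overcast   0              5               5/14 = 0.36
Rainy      2              2               4/14 = 0.29
Sunny      2              3               5/14 = 0.36
All        4/14 = 0.29    10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.36
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = (3/10 * 10/14) / (5/14) = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 5/14 = 0.36
So P(No|Sunny) = (2/4 * 4/14) / (5/14) = 0.40
Since P(Yes|Sunny) > P(No|Sunny), we can play on a sunny day.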
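The posterior for this example can be checked by counting directly from the 14-row dataset. The sketch below is illustrative only; the function name `posterior` is an assumption, and the rows are copied from the Outlook/Play table in the example.

```python
# (Outlook, Play) rows 0..13 from the example dataset.
rows = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]


def posterior(outlook, play, rows):
    """P(play | outlook) via Bayes' theorem, using relative frequencies."""
    n = len(rows)
    class_count = sum(1 for _, p in rows if p == play)
    joint_count = sum(1 for o, p in rows if o == outlook and p == play)
    outlook_count = sum(1 for o, _ in rows if o == outlook)
    likelihood = joint_count / class_count   # P(outlook | play)
    prior = class_count / n                  # P(play)
    evidence = outlook_count / n             # P(outlook)
    return likelihood * prior / evidence


print(posterior("Sunny", "Yes", rows))
print(posterior("Sunny", "No", rows))
```

Working with exact counts avoids the small rounding differences that appear when the intermediate probabilities are rounded to two decimals.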
• In this example Bayesian network, we have an alarm ‘A’, a node, say installed
in the house of a person ‘gfg’, which rings upon two events, burglary ‘B’ and
fire ‘F’, which are the parent nodes of the alarm node. The alarm is in turn the
parent node of two person nodes, ‘P1 calls’ (‘P1’) and ‘P2 calls’ (‘P2’).
• Upon a burglary or a fire, ‘P1’ and ‘P2’ each call the person ‘gfg’. However,
there are a few drawbacks in this case: ‘P1’ may sometimes forget to call
‘gfg’ even after hearing the alarm, as he has a tendency to forget things
quickly. Similarly, ‘P2’ sometimes fails to call ‘gfg’, as he is only able to
hear the alarm from a certain distance.
Expectation-Maximization Algorithm
In real-world applications of machine learning, it is very common that many
relevant features are available for learning but only a small subset of them is
observable. For variables that are sometimes observable and sometimes not, we
can use the instances in which the variable is observed for learning, and then
predict its value in the instances in which it is not observed.
The Expectation-Maximization algorithm can also be used for latent variables
(variables that are not directly observable and are instead inferred from the
values of other observed variables) in order to predict their values, provided
that the general form of the probability distribution governing those latent
variables is known. This algorithm is at the base of many unsupervised
clustering algorithms in machine learning.
It was explained, proposed and named in a paper published in 1977 by Arthur
Dempster, Nan Laird, and Donald Rubin. It is used to find local maximum
likelihood estimates of the parameters of a statistical model in cases where
latent variables are involved and the data is missing or incomplete.
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the
dataset, estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the
expectation (E) step is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.
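The four steps above can be sketched on a small, commonly used toy problem: two coins with unknown biases, where each trial records the number of heads but not which coin was tossed. The problem setup, the function name `em_two_coins`, the sample trials, and the assumption that both coins are equally likely to be picked are all illustrative choices, not part of these notes.

```python
def em_two_coins(trials, theta_a, theta_b, max_iter=100, tol=1e-8):
    """EM for two coins. trials is a list of (heads, flips) pairs.

    The hidden (missing) data is which coin produced each trial.
    Assumes both coins are equally likely to be chosen for a trial.
    """
    for _ in range(max_iter):
        heads_a = flips_a = heads_b = flips_b = 0.0
        for heads, flips in trials:
            tails = flips - heads
            # E-step: responsibility of each coin for this trial.
            like_a = theta_a ** heads * (1 - theta_a) ** tails
            like_b = theta_b ** heads * (1 - theta_b) ** tails
            resp_a = like_a / (like_a + like_b)
            resp_b = 1.0 - resp_a
            # Accumulate expected head and flip counts per coin.
            heads_a += resp_a * heads
            flips_a += resp_a * flips
            heads_b += resp_b * heads
            flips_b += resp_b * flips
        # M-step: re-estimate each bias from the expected counts.
        new_a = heads_a / flips_a
        new_b = heads_b / flips_b
        converged = abs(new_a - theta_a) < tol and abs(new_b - theta_b) < tol
        theta_a, theta_b = new_a, new_b
        if converged:
            break
    return theta_a, theta_b


# Hypothetical data: five trials of ten flips each, starting guesses 0.6 and 0.5.
trials = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
theta_a, theta_b = em_two_coins(trials, 0.6, 0.5)
print(theta_a, theta_b)
```

The result depends on the starting parameters, which reflects the fact that EM finds a local rather than a global maximum of the likelihood.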
Usage of EM algorithm
• It can be used to fill the missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used for the purpose of estimating the parameters of Hidden
Markov Model (HMM).
• It can be used for discovering the values of latent variables.
Advantages of EM algorithm
• It is guaranteed that the likelihood will not decrease from one iteration to the next.
• The E-step and M-step are often pretty easy for many problems in terms
of implementation.
• Solutions to the M-steps often exist in the closed form.
Instance-based learning
The Machine Learning systems categorized as instance-based learning are
systems that learn the training examples by heart and then generalize to new
instances based on some similarity measure. The approach is called
instance-based because it builds its hypotheses from the training instances.
It is also known as memory-based learning or lazy learning. The time
complexity depends on the size of the training data: with a brute-force search,
answering a single query takes O(n) time in the worst case, where n is the
number of training instances.
For example, if we were to create a spam filter with an instance-based
learning algorithm, instead of just flagging emails that are already marked as
spam emails, our spam filter would be programmed to also flag emails that
are very similar to them. This requires a measure of resemblance between
two emails. A similarity measure between two emails could be the same
sender or the repetitive use of the same keywords or something else.
Advantages:
1. Instead of estimating the target function for the entire instance set, local
approximations can be made to the target function.
2. The approach adapts easily to new data that is collected as we go.
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each
query involves starting the identification of a local model from
scratch.
Some of the instance-based learning algorithms are:
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
• 3.1 − Calculate the distance between the test point and each row of the
training data using a distance measure such as Euclidean, Manhattan, or
Hamming distance. The most commonly used distance measure is
Euclidean.
• 3.2 − Sort the training rows in ascending order of distance.
• 3.3 − Choose the top K rows from the sorted array.
• 3.4 − Assign a class to the test point based on the most frequent class
among these rows.
Step 4 – End
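Steps 3.1 to 3.4 can be sketched in a few lines of Python. This is a minimal brute-force version using Euclidean distance; the function name `knn_classify` and the sample points are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(train_rows, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training rows."""
    # 3.1: Euclidean distance from the query to every training row.
    distances = [(dist(row, query), label)
                 for row, label in zip(train_rows, train_labels)]
    # 3.2: sort in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # 3.3: keep the labels of the top k rows.
    top_k = [label for _, label in distances[:k]]
    # 3.4: return the most frequent class among them.
    return Counter(top_k).most_common(1)[0][0]


# Hypothetical 2-D training data with two classes.
train_rows = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_rows, train_labels, (2, 2), k=3))
```

Because every query scans the whole training set, the cost per prediction grows with the number of stored instances, which is the high classification cost listed among the disadvantages of instance-based learning.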
Lazy learning is very suitable for complex and incomplete problem domains,
where a complex target function can be represented by a collection of less
complex local approximations.
Eager learning methods use the same approximation to the target function,
which must be learned based on training examples and before input queries
are observed.
UNIT - IV
PATTERN COMPARISON TECHNIQUES
Before searching for a pattern, some steps must be carried out, and the first is to
collect data from the real world. The collected data needs to be filtered and
preprocessed so that the system can extract features from it. Then,
based on the type of data, the system chooses the appropriate approach
among classification, clustering, and regression to recognize the pattern.
• Classification. In classification, the algorithm assigns labels to data based on
the predefined features. This is an example of supervised learning.
• Clustering. An algorithm splits data into a number of clusters based on the
similarity of features. This is an example of unsupervised learning.
• Regression. Regression algorithms try to find a relationship between variables
and predict unknown dependent variables based on known data. It is based on
supervised learning. [2]
• Features can be represented as continuous, discrete, or discrete binary
variables. A feature is basically a function of one or more measurements,
computed to quantify the significant characteristics of the object. The feature is
one of the most important components in the Pattern Recognition system.
Example: consider a football, shape, size and color, etc. are features of the
football.
Temporal patterns
A temporal pattern is defined as a segment of signals that recurs frequently
in the whole temporal signal sequence. For example, the temporal signal
sequences could be the movements of the head, hand, and body, a piece of
music, and so on.
Temporal abstraction and data mining are two research fields that have tried to
synthesize time-oriented data and bring out an understanding of the hidden
relationships that may exist between time-oriented events. In clinical settings,
the ability to recognize hidden relationships in patient data as they
unfold could help save a life by aiding the detection of conditions that are not
obvious to clinicians and healthcare workers. Understanding the hidden patterns
is a huge challenge due to the exponential search space unique to time-series
data. One proposed approach is a temporal pattern recognition model based on
dimension reduction and similarity measures, which maintains the temporal
nature of the raw data.
INTRODUCTION
Temporal pattern processing is important for various intelligent behaviours,
including hearing, vision, speech, music and motor control. Because we live in
an ever-changing environment, an intelligent system, whether it be a human or a
robot, must encode patterns over time, recognize and generate temporal
patterns. Time is embodied in a temporal pattern in two different ways:
• Temporal order. It refers to the ordering among the components of a sequence.
For example, the sequence N-E-T is different from T-E-N. Temporal order may
where Wij is the connection weight from unit xj in the input layer to sequence
recognizer si in the recognition layer. Parameter C controls learning rate.
Hebbian learning is applied after the presentation of the entire sequence is
completed. The templates thus formed can be used to recognize specific input
sequences. The recognition layer typically includes recurrent connections for
selecting a winner by self-organization (e.g. winner-take-all) during training or
recognition.
If one views each memory state as a category, the Hopfield net performs pattern
recognition: the recalled category is the recognized pattern. This process of
dynamic evolution can also be viewed as an optimization process, which
minimizes a cost function until equilibrium is reached.
With normalized exponential kernel STM, Tank and Hopfield (1987) described
a recognition network based on associative memory dynamics. A layer of
sequence recognizers receives inputs from the STM model. Each recognizer
encodes a different template sequence by its unique weight vector acting upon
the inputs in STM. In addition, recognizers form a competitive network. The
recognition process uses the current input sequence (evidence) to bias a
minimization process so that the most similar template wins the competition,
thus activating its corresponding recognizer. Due to the exponential kernels,
they demonstrated that recognition is fairly robust to time warping, i.e. distortions
in duration. A similar architecture was later applied to speaker-independent spoken
digit recognition.
Multilayer Perceptrons
A popular approach to temporal pattern learning is multilayer perceptrons
(MLP). MLPs have been demonstrated to be effective for static pattern
recognition. It is natural to combine MLP with an STM model to do temporal
pattern recognition. For example, using delay-line STM, Waibel et al. (1989)
reported an architecture called Time Delay Neural Networks (TDNN) for
spoken phoneme recognition. Besides the input layer, TDNN uses 2 hidden
layers and an output layer where each unit encodes one phoneme. The feed
forward connections converge from the input layer to each successive layer so
that each unit in a specific layer receives inputs within a limited time window
from the previous layer. They demonstrated good recognition performance: for
the three stop consonants /b/, /d/, and /g/, the accuracy of speaker dependent
recognition reached 98.5%.
DYNAMIC TIME WARPING
The name sounds like time travelling or some kind of futuristic technique; however, it is not.
Dynamic Time Warping is used to compare the similarity or calculate the
distance between two arrays or time series with different lengths. Suppose we
want to calculate the distance between two equal-length arrays:
a = [1, 2, 3]
b = [3, 2, 2]
How do we do that? One obvious way is to match up a and b in a 1-to-1 fashion and
sum up the distances of each component. This sounds easy, but what if a
and b have different lengths?
a = [1, 2, 3]
b = [2, 2, 2, 3, 4]
How do we match them up? Which element should map to which? Dynamic time
warping solves this problem. Just as its name indicates, it warps the series
so that they can be matched up.
Use Cases
Before digging into the algorithm, you might ask whether it is useful.
Do we really need to compare the distance between two unequal-length time
series?
Yes, DTW plays a key role in a lot of scenarios.
Stock Market
In a stock market, people always hope to be able to predict the future. However,
using general machine learning algorithms can be exhausting, as most prediction
tasks require the test and training sets to have the same number of features.
If you have ever speculated in the stock market, you will know that even the
same pattern of a stock can be reflected with very different lengths on K-lines and
indicators.
Suppose we have two different arrays, red and blue, with different lengths.
Clearly these two series follow the same pattern, but the blue curve is longer
than the red. If we apply the one-to-one match, shown at the top, the mapping is
not perfectly synced up and the tail of the blue curve is left out.
Rules
In general, DTW is a method that calculates an optimal match between two
given sequences (e.g. time series) subject to certain restrictions and rules (from
Wikipedia):
• Every index from the first sequence must be matched with one or more indices
from the other sequence, and vice versa.
• The first index from the first sequence must be matched with the first index
from the other sequence (but it does not have to be its only match).
• The last index from the first sequence must be matched with the last index from
the other sequence (but it does not have to be its only match).
• The mapping of the indices from the first sequence to indices from the other
sequence must be monotonically increasing, and vice versa, i.e. if j > i are indices
from the first sequence, then there must not be two indices l > k in the other
sequence such that index i is matched with index l and index j is matched with
index k, and vice versa.
The optimal match is the match that satisfies all the restrictions and
rules and that has the minimal cost, where the cost is computed as the sum of
absolute differences, for each matched pair of indices, between their values.
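A minimal sketch of the standard dynamic programming solution is shown below. It fills a table in which each cell holds the minimal cost of matching the first i elements of one sequence with the first j elements of the other, which enforces the boundary and monotonicity rules. The function name `dtw_distance` is an illustrative choice.

```python
def dtw_distance(a, b):
    """Minimal total cost of a DTW match between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: best cost of matching a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the best of: advance in a only, in b only, or in both.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


print(dtw_distance([1, 2, 3], [2, 2, 2, 3, 4]))
```

The table has (n + 1) x (m + 1) cells, so this sketch takes time and memory proportional to the product of the two sequence lengths.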
Introduction to Clustering:
Clustering is a type of unsupervised learning method. An unsupervised learning method
is a method in which we draw inferences from datasets consisting of input data without
labelled responses. Generally, it is used as a process to find meaningful structure,
explanatory underlying processes, generative features, and groupings inherent in a set of
examples.
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same group are more similar to each other than to the data
points in other groups. It is basically a grouping of objects on the basis of the
similarity and dissimilarity between them.
For example, data points in a graph that lie close together can be classified into one
single group. We can distinguish the clusters, and we can identify that there are 3 clusters
in the picture below.
K means Clustering:
It is one of the simplest unsupervised learning algorithms for solving the clustering problem.
The K-means algorithm partitions n observations into k clusters, where each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
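A minimal sketch of the usual iterative procedure (assign each point to the nearest mean, then recompute the means) is given below. To keep the example deterministic, the initial centroids are passed in explicitly rather than chosen at random; the function name `kmeans` and the sample points are hypothetical.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and mean-update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two groups, with k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
```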
K-MODE CLUSTERING
KModes clustering is one of the unsupervised Machine Learning algorithms that is used to
cluster categorical variables.
How does the KModes algorithm work?
Example: Imagine we have a dataset that has information about the hair color, eye color, and
skin color of persons. We aim to group them based on the available information (maybe we
want to suggest some styling ideas).
Hair color, eye color, and skin color are all categorical variables. Below is what our dataset
looks like.
Alright, we have the sample data now. Let us proceed by defining the number of
clusters (K) = 3.
Step 1: Pick K observations at random and use them as leaders/clusters.
I am choosing P1, P7, P8 as leaders/clusters.
Step 2: Calculate the dissimilarities (number of mismatches) and assign each observation to its
closest cluster.
Iteratively compare the cluster data points to each of the observations. Similar data points
give 0, dissimilar data points give 1.
Considering one cluster at a time, for each feature, look for the mode and update the
leaders.
Explanation: The Cluster 1 observations (P1, P2, P5) have brunette as the most observed hair
color, amber as the most observed eye color, and fair as the most observed skin color.
Below are our new leaders after the update.
Repeat steps 2–4: After obtaining the new leaders, again calculate the dissimilarities
between the observations and the newly obtained leaders.
Likewise, calculate all the dissimilarities and put them in a matrix. Assign each
observation to its closest cluster.
The observations P1, P2, P5 are assigned to Cluster 1; P3, P7 are assigned to Cluster 2;
and P4, P6, P8 are assigned to Cluster 3.
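The K-Modes procedure walked through above can be sketched as follows: count mismatches to each leader, assign every observation to its closest leader, then replace each leader by the per-feature mode of its cluster. The six rows below are hypothetical stand-ins and are not the dataset from the example; the function names are also illustrative.

```python
from collections import Counter


def dissimilarity(a, b):
    """Number of features on which two categorical observations differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def kmodes(rows, leaders, max_iter=100):
    """Alternate assignment and mode-update steps until the leaders stop changing."""
    leaders = [tuple(leader) for leader in leaders]
    clusters = [[] for _ in leaders]
    for _ in range(max_iter):
        # Assign each observation to the leader with the fewest mismatches.
        clusters = [[] for _ in leaders]
        for row in rows:
            nearest = min(range(len(leaders)),
                          key=lambda i: dissimilarity(row, leaders[i]))
            clusters[nearest].append(row)
        # Update each leader to the most frequent value of every feature.
        new_leaders = []
        for old, members in zip(leaders, clusters):
            if members:
                new_leaders.append(tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)))
            else:
                new_leaders.append(old)
        if new_leaders == leaders:
            break
        leaders = new_leaders
    return leaders, clusters


# Hypothetical (hair, eye, skin) observations, clustered with K = 2.
rows = [
    ("brunette", "amber", "fair"),
    ("brunette", "amber", "brown"),
    ("black", "hazel", "brown"),
    ("black", "hazel", "fair"),
    ("brunette", "gray", "fair"),
    ("black", "hazel", "brown"),
]
leaders, clusters = kmodes(rows, [rows[0], rows[2]])
print(leaders)
```

Ties between equally close leaders are broken here in favour of the first leader, which is one possible convention among several.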
Vector Quantization
Learning Vector Quantization (or LVQ) is a type of Artificial Neural Network that is
also inspired by biological models of neural systems. It is a prototype-based supervised
classification algorithm that trains its network through a competitive learning
algorithm similar to the Self-Organizing Map. It can also deal with multiclass
classification problems. LVQ has two layers: an input layer and an output layer.
The architecture of Learning Vector Quantization, with c classes in the input data
and n input features for any sample, is given below:
Let us say the input data has size (m, n), where m is the number of training examples and n
is the number of features in each example, with a label vector of size (m, 1). First, the
weights of size (n, c) are initialized from the first c training samples with different labels,
and these samples are then discarded from the training set. Here, c is the number of classes.
The algorithm then iterates over the remaining input data; for each training example, it
updates the winning vector (the weight vector with the shortest distance, e.g. Euclidean
distance, from the training example). The weight update rule is given by:
where alpha is the learning rate at time t, j denotes the winning vector, i denotes the i-th
feature of the training example and k denotes the k-th training example from the input data.
After training the LVQ network, the trained weights are used for classifying new examples.
A new example is labeled with the class of the winning vector.
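A minimal sketch of LVQ training and prediction is shown below. It assumes the commonly used LVQ1 update, in which the winning vector moves toward the training example when their classes agree and away from it otherwise; this sign convention, the constant learning rate, the function names, and the sample data are assumptions made for illustration and are not taken from these notes.

```python
from math import dist


def winner(x, prototypes):
    """Index of the prototype (weight vector) closest to x."""
    return min(range(len(prototypes)), key=lambda j: dist(x, prototypes[j]))


def train_lvq(samples, labels, prototypes, proto_labels, alpha=0.1, epochs=10):
    """LVQ1-style training with a constant learning rate alpha (an assumption)."""
    prototypes = [list(p) for p in prototypes]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            j = winner(x, prototypes)
            # Move toward x if the classes match, away from x otherwise.
            sign = 1.0 if proto_labels[j] == y else -1.0
            prototypes[j] = [w + sign * alpha * (xi - w)
                             for w, xi in zip(prototypes[j], x)]
    return prototypes


def predict_lvq(x, prototypes, proto_labels):
    """Label a new example with the class of its winning vector."""
    return proto_labels[winner(x, prototypes)]


# Hypothetical data: two classes, one prototype per class.
samples = [(1, 1), (1, 2), (8, 8), (9, 8)]
labels = [0, 0, 1, 1]
prototypes = train_lvq(samples, labels, [(1.5, 1.0), (8.0, 8.5)], [0, 1])
print(predict_lvq((2, 2), prototypes, [0, 1]))
```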
Algorithm
UNIT- V
Genetic Algorithms
EBL Architecture:
• EBL model during training
• During training, the model generalizes the training example in such a way that
all scenarios lead to the Goal Concept, not just in specific cases. (As shown in
Fig 1)
Dimensionality Reduction
An intuitive example of dimensionality reduction can be discussed through
a simple e-mail classification problem, where we need to classify whether
an e-mail is spam or not. This can involve a large number of features,
such as whether or not the e-mail has a generic title, the content of the
e-mail, whether the e-mail uses a template, etc. However, some of these
features may overlap. Similarly, a classification problem that
relies on both humidity and rainfall can be collapsed into just one
underlying feature, since the two are correlated to a
high degree. Hence, we can reduce the number of features in such
problems. A 3-D classification problem can be hard to visualize, whereas a
2-D one can be mapped to a simple two-dimensional space, and a 1-D
problem to a simple line. The figure below illustrates this concept, where a
3-D feature space is split into two 1-D feature spaces, and later, if found to
be correlated, the number of features can be reduced even further.
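As a minimal sketch of the humidity and rainfall idea, the code below measures the Pearson correlation between features and keeps only one feature from each highly correlated pair. The threshold of 0.9, the function names, and the sample values are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical measurements: rainfall closely tracks humidity, wind does not.
features = {
    "humidity": [30, 45, 60, 80, 90],
    "rainfall": [1.0, 2.0, 3.1, 4.2, 5.0],
    "wind": [5, 1, 4, 2, 3],
}
print(drop_correlated(features))
```

This is a simple filter-style feature selection: it looks only at the features themselves and not at how well a model performs with them.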
Components of Dimensionality Reduction
There are two components of dimensionality reduction:
• Feature selection: Here we try to find a subset of the original set of variables,
or features, that can be used to model the problem. There are
usually three approaches:
1. Filter
2. Wrapper
3. Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a
lower-dimensional space, i.e. a space with fewer dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be either linear or non-linear, depending
upon the method used. The prime linear method, called Principal
Component Analysis, or PCA, is discussed below.
• We may not know how many principal components to keep; in practice, some
rules of thumb are applied.
Factor analysis
Factor analysis is a statistical method used to describe variability among
observed, correlated variables in terms of a potentially lower number of
unobserved variables called factors. For example, it is possible that variations in
six observed variables mainly reflect the variations in two unobserved
(underlying) variables. Factor analysis searches for such joint variations in
response to unobserved latent variables. The observed variables are modelled
as linear combinations of the potential factors plus "error" terms; hence factor
analysis can be thought of as a special case of errors-in-variables models.
Here, there is a party going on in a room full of people. There are n
speakers in that room, and they are speaking simultaneously. In the
same room, there are also n microphones placed at different
distances from the speakers, which record the n speakers’ voice signals.
Hence, the number of speakers is equal to the number of microphones in
the room.
Now, using these microphones’ recordings, we want to separate all n
speakers’ voice signals, given that each microphone recorded the voice
signals coming from each speaker at a different intensity due to the difference in
distances between them. Decomposing the mixed signal of each microphone’s
recording into the independent sources’ speech signals can be done using the
machine learning technique called independent component analysis.
[ X1, X2, ..., Xn ] => [ Y1, Y2, ..., Yn ]
where X1, X2, ..., Xn are the original signals present in the mixed signal, and
Y1, Y2, ..., Yn are the new features, which are components that are
independent of each other.
Restrictions on ICA
Multidimensional scaling
Multidimensional scaling (MDS) is a visual representation of distances or
dissimilarities between sets of objects.
“Objects” can be colors, faces, map coordinates, political persuasion, or any
kind of real or conceptual stimuli
(Kruskal and Wish, 1978). Objects that are more similar (or have shorter
distances) are closer together on the graph than objects that are less similar (or
have longer distances). As well as interpreting dissimilarities as distances on a
graph, MDS can also serve as a dimension reduction technique for
high-dimensional data (Buja et al., 2007).
MDS is now used in a wide variety of disciplines. Its use isn’t limited to a
specific matrix or set of data; in fact, just about any matrix can be analyzed with
the technique as long as the matrix contains some type of relational data
(Young, 2013). Examples of relational data include correlations, distances,
multiple rating scales or similarities.
Manifold learning
What is a manifold?
A two-dimensional manifold is, loosely speaking, any 2-D shape that can be made
to fit in a higher-dimensional space by twisting or bending it.
In simpler terms, most high-dimensional data lies close to
a much lower-dimensional manifold. The process of modelling the
manifold on which the training instances lie is called Manifold Learning.
For each training instance x(i), the algorithm first finds its k nearest neighbors
and then tries to express x(i) as a linear function of them. In general, if there are
m training instances in total, then it tries to find the set of weights w which
minimizes the squared distance between x(i) and its linear representation.
Here the weights wi,j are kept fixed while we try to find the optimum coordinates
y(i)