Machine Learning
1 Introduction
CONTENTS
Part-1 : Introduction, Well Defined Learning Problems,
         Designing a Learning System, Issues in Machine Learning
Part-2 : The Concept Learning Task : General-to-Specific
         Ordering of Hypotheses, Find-S, List-Then-Eliminate
         Algorithm, Candidate Elimination Algorithm, Inductive Bias
1-1G(CSITIOE-Sem-8)
PART-1
Introduction, Well Defined Learning Problems, Designing a
Learning System, Issues in Machine Learning.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. Machine learning is an application of Artificial Intelligence (AI) that
provides systems the ability to automatically learn and improve from
experience without being explicitly programmed.
2. Machine learning focuses on the development of computer programs
that can access data.
3. The primary aim is to allow the computers to learn automatically without
human intervention or assistance and adjust actions accordingly.
4. Machine learning enables analysis of massive quantities of data.
5. It generally delivers faster and more accurate results in order to identify
profitable opportunities or dangerous risks.
6. Combining machine learning with AI and cognitive technologies can
make it even more effective in processing large volumes of information.
Que 1.2. Describe different types of machine learning algorithm.
Answer
Answer
3. Continuous improvement :
a. As ML algorithms gain experience, they keep improving in accuracy
and efficiency.
b. As the amount of data keeps growing, algorithms learn to make
accurate predictions faster.
4. Handling multi-dimensional and multi-variety data :
Machine learning algorithms are good at handling data that are
multi-dimensional and multi-variety, and they can do this in dynamic
or uncertain environments.
Que 1.6. Write short note on well defined learning problem with
example.
Answer
For example :
1. A checkers learning problem :
a. Task (T) : Playing checkers.
b. Performance measure (P) : Percent of games won against
opponents.
c. Training experience (E) : Playing practice games against itself.
2. A handwriting recognition learning problem :
a. Task (T) : Recognizing and classifying handwritten words within
images.
b. Performance measure (P) : Percent of words correctly classified.
c. Training experience (E) : A database of handwritten words with
given classifications.
3. A robot driving learning problem :
a. Task (T) : Driving on public four-lane highways using vision sensors.
b. Performance measure (P) : Average distance travelled before an
error (as judged by human overseer).
c. Training experience (E) : A sequence of images and steering
commands recorded while observing a human driver.
Que 1.7. Describe the role of well defined learning problems in
machine learning.
Answer
Role of well defined learning problems in machine learning :
1. Learning to recognize spoken words :
a. Successful speech recognition systems employ machine learning in
some form.
b. For example, the SPHINX system learns speaker-specific strategies
for recognizing the primitive sounds (phonemes) and words from
the observed speech signal.
c. Neural network learning methods and methods for learning hidden
Markov models are effective for automatically customizing to
individual speakers, vocabularies, microphone characteristics,
background noise, etc.
d. Similar techniques have potential applications in many signal
interpretation problems.
2. Learning to drive an autonomous vehicle :
a. Machine learning methods have been used to train computer
controlled vehicles to steer correctly when driving on a variety of
road types.
b. For example, the ALVINN system has used its learned strategies to
drive unassisted at 70 miles per hour for 90 miles on public highways
among other cars.
c. Similar techniques have possible applications in many sensor based
control problems.
3. Learning to classify new astronomical structures :
a. Machine learning methods have been applied to a variety of large
databases to learn general regularities implicit in the data.
b. For example, decision tree learning algorithms have been used by
NASA to learn how to classify celestial objects from the second
Palomar Observatory Sky Survey.
c. This system is used to automatically classify all objects in the Sky
Survey, which consists of three terabytes of image data.
4. Learning to play world class backgammon :
a. The most successful computer programs for playing games such as
backgammon are based on machine learning algorithms.
b. For example, the world's top computer program for backgammon,
TD-GAMMON, learned its strategy by playing over one million
practice games against itself.
c. It now plays at a level competitive with the human world champion.
d. Similar techniques have applications in many practical problems
where large search spaces must be examined efficiently.
Fig. 1.8.1. General learning model (blocks : environment or teacher,
critic/performance evaluator, knowledge base, performance component;
signals : tasks, response).
1. Acquisition of new knowledge:
a. One component of learning is the acquisition of new knowledge.
b. Simple data acquisition is easy for computers, even though it is
difficult for people.
2. Problem solving :
The other component of learning is the problem solving that is required
both to integrate into the system new knowledge that is presented to it
and to deduce new information when the required facts have not been
presented.
Que 1.10. What are the steps used to design a learning system ?
Answer
Steps used to design a learning system are :
1. Specify the learning task.
2. Choose a suitable set of training data to serve as the training experience.
3. Divide the training data into groups or classes and label accordingly.
4. Determine the type of knowledge representation to be learned from the
training experience.
5. Choose a learner classifier that can generate general hypotheses from
the training data.
6. Apply the learner classifier to test data.
7. Compare the performance of the system with that of an expert human.
Fig. 1.10.1. Learning system (blocks : environment/experience, learner,
knowledge, performance element).
Que 1.11. How do we split data in machine learning ?
Answer
Data is split into three parts in machine learning :
1. Training data :
a. The part of data we use to train our model.
b. This is the data which our model actually sees (both input and
output) and learns from.
2. Validation data :
a. The part of data which is used to do a frequent evaluation of the
model fit on the training dataset, and to improve the
hyperparameters involved (parameters set before the model begins
learning).
b. This data plays its part while the model is actually training.
3. Testing data :
a. Once our model is completely trained, testing data provides the
unbiased evaluation.
b. When we feed in the inputs of testing data, our model will predict
some values without seeing the actual output.
c. After prediction, we evaluate our model by comparing its predictions
with the actual output present in the testing data.
d. This is how we evaluate and see how much our model has learned
from the experiences fed in as training data at the time of
training.
Fig. 1.11.1. Data in machine learning.
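The three-way split described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the 60/20/20 proportions and the toy data are assumptions made for the example and are not prescribed by the text.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and return (train, validation, test) lists.

    The fractions are illustrative defaults; whatever is left after the
    training and validation parts becomes the test part.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test part")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Toy (input, output) pairs, invented for the example.
data = [(x, 2 * x) for x in range(10)]
train_set, val_set, test_set = split_dataset(data)
```

The model would be fitted on `train_set`, tuned against `val_set`, and evaluated once on `test_set`, whose outputs it never sees during training.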
Que 1.12. Describe the terminologies used in machine learning.
Answer
Terminologies used in machine learning are :
1. Features : A set of variables that carry discriminating and characterizing
information about the objects under consideration.
2. Feature vector : A collection of d features, ordered in a meaningful way
into a d-dimensional column vector that represents the signature of the
object to be identified.
3. Feature space : Feature space is the d-dimensional space in which the
feature vectors lie. A d-dimensional vector in a d-dimensional space
constitutes a point in that space.
$x = [x_1, x_2, \ldots, x_d]^T$, where $x_i$ is the value of feature $i$.
Answer
Components of machine learning system are :
1. Sensing :
a. It uses a transducer such as a camera or microphone for input.
b. The PR (Pattern Recognition) system depends on the bandwidth,
resolution, sensitivity, distortion, etc., of the transducer.
2. Segmentation : Patterns should be well separated and should not
overlap.
3. Feature extraction :
a. It is used for distinguishing features.
b. This process extracts features that are invariant with respect to
translation, rotation and scale.
4. Classification :
a. It uses a feature vector provided by a feature extractor to assign the
object to a category.
b. It is not always possible to determine the values of all the features.
5. Post processing :
a. Post processor uses the output of the classifier to decide on the
recommended action.
[Figure : components of the system, from input to output : sensing,
segmentation, feature extraction, classification (with adjustments for
missing features), post-processing (with adjustments for context and
costs), decision.]
4. Rule extraction:
a. In rule extraction, data is used as the basis for the extraction of
propositional rules.
b. These rules discover statistically supportable relationships between
attributes in the data.
PART-2
Questions-Answers
10. A hypothesis is thereby more general than another if its set of allowed
instances is a superset of the set of instances belonging to the other
hypothesis.
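The "more general than" relation can be checked mechanically for conjunctive hypotheses. The sketch below assumes a hypothesis is a tuple of constraints in which "?" accepts any value and `None` stands for the empty constraint that accepts no value; the function names and the attribute values are hypothetical.

```python
ANY = "?"     # constraint that accepts every value
EMPTY = None  # assumed encoding of the constraint that accepts no value


def covers(hypothesis, instance):
    """True if the instance satisfies every constraint of the hypothesis."""
    return all(c == ANY or c == v for c, v in zip(hypothesis, instance))


def more_general_or_equal(h1, h2):
    """True if every instance allowed by h2 is also allowed by h1."""
    if any(c is EMPTY for c in h2):
        return True  # h2 allows no instance, so any h1 is at least as general
    return all(c1 == ANY or c1 == c2 for c1, c2 in zip(h1, h2))


general = ("Sunny", "?", "?")
specific = ("Sunny", "Warm", "?")
is_more_general = more_general_or_equal(general, specific)
```

Here `general` allows a superset of the instances allowed by `specific`, so the check succeeds in that direction and fails in the other.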
Que 1.17. Define the term concept and concept learning. How can
we represent a concept ?
Answer
Concept : A concept is a Boolean-valued function defined over a large set of
objects or events.
Concept learning : Concept learning is defined as inferring a Boolean-
valued function from training examples of input and output of the function.
Concept learning can be represented using :
1. Instance x : Instance x is a collection of attributes (Sky, AirTemp,
Humidity, etc.)
2. Target function c : EnjoySport : $X \rightarrow \{0, 1\}$
3. Hypothesis h : Hypothesis h is a conjunction of constraints on the
attributes. A constraint can be :
Que 1.18. Explain the working of find-S algorithm with flow chart.
Answer
Working of find-S algorithm :
1. The process starts with initializing 'h' with the most specific hypothesis;
generally, it is the first positive example in the data set.
2. We check each training example. If the example is negative, we
move on to the next example, but if it is a positive example we
consider it for the next step.
3. We check if each attribute in the example is equal to the hypothesis
value.
4. If the value matches, then no changes are made.
5. If the value does not match, the value is changed to "?".
6. We do this until we reach the last positive example in the data
set.
Fig. 1.18.1. Flow chart of the Find-S algorithm : initialize h; identify a
positive example; test whether each attribute value is equal to the
hypothesis value; if not, replace the value with "?".
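The steps of Find-S can be sketched as follows. The training rows are the usual textbook EnjoySport examples, reproduced here as an assumption because the data table does not appear in this section; the function name `find_s` is likewise illustrative.

```python
def find_s(examples):
    """Return the maximally specific hypothesis consistent with the positives.

    examples: list of (attribute_tuple, is_positive) pairs.
    """
    h = None
    for attributes, is_positive in examples:
        if not is_positive:
            continue  # negative examples are ignored
        if h is None:
            h = list(attributes)  # the first positive example initializes h
        else:
            # keep matching values, generalize mismatches to "?"
            h = [hv if hv == xv else "?" for hv, xv in zip(h, attributes)]
    return tuple(h) if h is not None else None


# Attributes: Sky, AirTemp, Humidity, Wind, Water, Forecast (assumed data).
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
hypothesis = find_s(EXAMPLES)
# hypothesis == ("Sunny", "Warm", "?", "Strong", "?", "?")
```

Only the three positive rows influence the result; every attribute on which they disagree ends up as "?".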
Answer
1. The Candidate-Elimination algorithm computes the version space
containing all hypotheses from H that are consistent with an observed
sequence of training examples.
2. It begins by initializing the version space to the set of all hypotheses in
H, that is, by initializing the G boundary set to contain the most general
hypothesis in H
$G_0 = \{\langle ?, ?, ?, ?, ?, ? \rangle\}$
and initializing the S boundary set to contain the most specific hypothesis
$S_0 = \{\langle \emptyset, \emptyset, \emptyset, \emptyset, \emptyset, \emptyset \rangle\}$
3. These two boundary sets delimit the entire hypothesis space, because
every other hypothesis in H is both more general than $S_0$ and more
specific than $G_0$.
4. As each training example is considered, the S and G boundary sets are
generalized and specialized, respectively, to eliminate from the version
space any hypotheses found inconsistent with the new training example.
5. After all examples have been processed, the computed version space
contains all the hypotheses consistent with these examples and only such
hypotheses.
Algorithm :
1. Initialize G to the set of maximally general hypotheses in H.
2. Initialize S to the set of maximally specific hypotheses in H.
3. For each training example d, do
a. If d is a positive example
Remove from G any hypothesis that does not include d
For each hypothesis s in S that does not include d
Remove s from S
Fig. 1.22.1.
Answer
S. No.  Supervised learning                  Unsupervised learning
1.      Supervised learning is also known    Unsupervised learning is also known
        as associative learning, in which    as self-organization, in which an
        the network is trained by providing  output unit is trained to respond
        it with input and matching output    to clusters of pattern within the
        patterns.                            input.
2.      Supervised training requires the     Unsupervised training is employed
        pairing of each input vector with a  in self-organizing neural networks.
        target vector representing the
        desired output.
3.      During the training session, an      During training, the neural network
        input vector is applied to the       receives input patterns and
        network, and it results in an        organizes these patterns into
        output vector. This response is      categories. When a new input
        then compared with the target        pattern is applied, the neural
        response.                            network provides an output response
                                             indicating the class to which the
                                             input pattern belongs.
4.      If the actual response differs from  If a class cannot be found for the
        the target response, the network     input pattern, a new class is
        will generate an error signal.       generated.
5.      The error minimization in this kind  Unsupervised training does not
        of training requires a supervisor    require a teacher; it requires
        or teacher. These input-output       certain guidelines to form groups.
        pairs can be provided by an          Grouping can be done based on
        external teacher, or by the system   colour, shape, and any other
        which contains the neural network.   property of the object.
6.      Supervised training methods are      Unsupervised learning is useful for
        used to perform non-linear mapping   data compression and clustering.
        in pattern classification networks,
        pattern association networks and
        multi-layer neural networks.
7.      Supervised learning generates a      In this, a system is supposed to
        global model and a local model.      discover statistically salient
                                             features of the input population.
UNIT 2
Decision Tree Learning
CONTENTS
Part-1 : Decision Tree Learning, Decision Tree Learning
         Algorithm, Inductive Bias, Issues in Decision
         Tree Learning
Part-2 : Artificial Neural Network, Perceptrons, Gradient
         Descent and the Delta Rule, Adaline
PART-1
Decision Tree Learning : Decision Tree Learning Algorithm,
Inductive Bias, Issues in Decision Tree Learning.
Questions-Answers
Que 2.2. What are the steps used for making decision tree ?
Answer
Que 2.3. Write short note on Gini impurity and Gini impurity
index.
Answer
Answer
Answer
1. Overfitting is the phenomenon in which the learning system tightly fits
the given training data so that it would be inaccurate in predicting the
outcomes of the untrained data.
2. In decision trees, overfitting occurs when the tree is designed to perfectly
fit all samples in the training data set.
3. To keep a decision tree from overfitting, we remove the branches that
make use of features having low importance. This method is called
pruning or post-pruning.
4. It reduces the complexity of the tree, and hence improves predictive
accuracy by the reduction of overfitting.
5. Pruning should reduce the size of a learning tree without reducing
predictive accuracy as measured by a cross-validation set.
6. There are two major pruning techniques :
a. Minimum error : The tree is pruned back to the point where the
cross-validated error is minimum.
b. Smallest tree : The tree is pruned back slightly further than the
minimum error. Pruning creates a decision tree with cross-validation
error within 1 standard error of the minimum error. The smaller
tree is more intelligible at the cost of a small increase in error.
7. Another method to prevent over-fitting is to try and stop the tree
building process early, before it produces leaves with very small samples.
This heuristic is known as early stopping or pre-pruning decision trees.
8. At each stage of splitting the tree, we check the cross-validation error. If
the error does not decrease significantly enough then we stop.
9. Early stopping is a quick fix heuristic. If early stopping is used together
with pruning, it can save time.
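The two pruning criteria in point 6 can be compared with a short sketch. It assumes that a set of candidate pruned trees has already been evaluated by cross-validation; the tuple layout (number of leaves, cross-validated error, standard error) and all numbers below are invented for illustration.

```python
def choose_pruned_trees(candidates):
    """Return (minimum_error_tree, smallest_tree_within_one_standard_error).

    candidates: list of (n_leaves, cv_error, std_error) tuples, one per
    candidate pruned tree.
    """
    min_error_tree = min(candidates, key=lambda c: c[1])
    # "Smallest tree" rule: accept any tree whose error is within one
    # standard error of the minimum, then take the one with fewest leaves.
    threshold = min_error_tree[1] + min_error_tree[2]
    acceptable = [c for c in candidates if c[1] <= threshold]
    smallest_tree = min(acceptable, key=lambda c: c[0])
    return min_error_tree, smallest_tree


# Hypothetical cross-validation results for four pruning levels.
CANDIDATES = [
    (20, 0.30, 0.03),
    (12, 0.25, 0.03),
    (7, 0.27, 0.03),
    (3, 0.31, 0.03),
]
min_error_tree, smallest_tree = choose_pruned_trees(CANDIDATES)
```

With these numbers the minimum-error rule keeps the 12-leaf tree, while the smallest-tree rule accepts the 7-leaf tree because its error (0.27) lies within one standard error of the minimum (0.25 + 0.03).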
Answer
1. Decision trees classify instances by sorting them down the tree from the
root to some leaf node, which provides the classification of the instance.
2. An instance is classified by starting at the root node of the tree, testing
the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute as shown in Fig. 2.6.1.
3. This process is then repeated for the subtree rooted at the new node.
4. The decision tree in Fig. 2.6.1 classifies a particular morning according
to whether it is suitable for playing tennis, returning the classification
associated with the particular leaf.
5. For example, the instance
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the leftmost branch of this decision tree and
would therefore be classified as a negative instance.
6. In other words, decision trees represent a disjunction of conjunctions of
constraints on the attribute values of instances.
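Fig. 2.6.1. Decision tree for playing tennis : the root tests Outlook; its
branches lead to a Humidity test (leaves No / Yes), a Yes leaf, and a
Wind test (leaves No / Yes).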
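Sorting an instance down a tree can be sketched with nested dictionaries. The branch values (Sunny/Overcast/Rain, High/Normal, Strong/Weak) follow the standard PlayTennis tree and are an assumption here, since the figure's branch labels are not legible in this section; the function names are illustrative.

```python
# Inner nodes are {attribute: {value: subtree}}; leaves are class labels.
PLAY_TENNIS_TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}


def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf."""
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node


def positive_paths(node, path=()):
    """List the root-to-leaf paths ending in "Yes".

    Each path is a conjunction of attribute tests; the list as a whole is
    the disjunction that the tree represents.
    """
    if not isinstance(node, dict):
        return [path] if node == "Yes" else []
    attribute, branches = next(iter(node.items()))
    paths = []
    for value, child in branches.items():
        paths.extend(positive_paths(child, path + ((attribute, value),)))
    return paths


instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
label = classify(PLAY_TENNIS_TREE, instance)  # "No"
```

`positive_paths(PLAY_TENNIS_TREE)` yields three conjunctions: (Outlook = Sunny and Humidity = Normal), (Outlook = Overcast), and (Outlook = Rain and Wind = Weak).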
Answer
Issues related to the applications of decision trees are:
1. Missing data :
a. Values may have gone unrecorded, or they might be too expensive
to obtain.
b. Two problems arise :
i. To classify an object that is missing one of the test attributes.
ii. To modify the information gain formula when examples have
unknown values for the attribute.
2. Multi-valued attribute :
a. When an attribute has many possible values, the information gain
measure gives an inappropriate indication of the attribute's
usefulness.
b. In the extreme case, we could use an attribute that has a different
value for every example.
c. Then each subset of examples would be a singleton with a unique
classification, so the information gain measure would have its
highest value for this attribute, even though the attribute could be
irrelevant or useless.
d. One solution is to use the gain ratio.
Fig. 2.8.1.
4. Leaf/Terminal node : Nodes that do not split are called leaf or terminal
nodes.
5. Pruning : When we remove sub-nodes of a decision node, this process
is called pruning. This process is the opposite of the splitting process.
6. Branch/sub-tree : A sub-section of the entire tree is called a branch or
sub-tree.
7. Parent and child node : A node which is divided into sub-nodes is
called the parent node of the sub-nodes, whereas the sub-nodes are the
children of the parent node.
Answer
1. Decision trees can be visualized and are simple to understand and
interpret.
2. They require less data preparation whereas other techniques often
require data normalization, the creation of dummy variables and removal
of blank values.
3. The cost of using the tree (for predicting data) is logarithmic in the
number of data points used to train the tree.
4. Decision trees can handle both categorical and numerical data whereas
other techniques are specialized for only one type of variable.
5. Decision trees can handle multi-output problems.
6. Decision tree is a white box model i.e., the explanation for the condition
can be explained easily by Boolean logic because there are two outputs.
For example yes or no.
7. Decision trees can be used even if assumptions are violated by the
dataset from which the data is taken.
Answer
Advantages of ID3 algorithm :
1. The training data is used to create understandable prediction rules.
2. It builds short and fast tree.
3. ID3 searches the whole dataset to create the whole tree.
4. It finds the leaf nodes thus enabling the test data to be pruned and
reducing the number of tests.
5. The calculation time of ID3 is the linear function of the product of the
characteristic number and node number.
Disadvantages of ID3 algorithm :
1. For a small sample, data may be overfitted or overclassified.
2. For making a decision, only one attribute is tested at an instant thus
consuming a lot of time.
3. Classifying the continuous data may prove to be expensive in terms of
computation, as many trees have to be generated to see where to break
the continuous sequence.
4. It is overly sensitive to features when given a large number of input
values.
Advantages of C4.5 algorithm :
1. C4.5 is easy to implement.
2. C4.5 builds models that can be easily interpreted.
3. It can handle both categorical and continuous values.
4. It can deal with noise and missing value attributes.
Disadvantages of C4.5 algorithm :
1. A small variation in data can lead to different decision trees when using
C4.5.
2. For a small training set, C4.5 does not work very well.
Answer
Attribute selection measures used in decision tree are :
1. Entropy:
i. Entropy is a measure of uncertainty associated with a random
variable.
ii. The entropy increases with the increase in uncertainty or
randomness and decreases with a decrease in uncertainty or
randomness.
iii. For a two-class problem, the value of entropy ranges from 0 to 1.
$\mathrm{Entropy}(D) = -\sum_{i} p_i \log_2(p_i)$
where $p_i$ is the non-zero probability that an arbitrary tuple in D
belongs to class $C_i$ and is estimated by $|C_{i,D}| / |D|$.
iv. A log function of base 2 is used because the entropy is encoded in
bits 0 and 1.
2. Information gain :
i. ID3 uses information gain as its attribute selection measure.
ii. Information gain is the difference between the original information
requirement (i.e. based on the proportion of classes) and the
new requirement (i.e. obtained after the partitioning on A).
$\mathrm{Gain}(D, A) = \mathrm{Entropy}(D) - \sum_{j=1}^{V} \frac{|D_j|}{|D|} \, \mathrm{Entropy}(D_j)$
Where,
D : A given data partition
A : Attribute
V : Suppose we partition the tuples in D on some
attribute A having V distinct values
iii. D is split into V partitions or subsets $\{D_1, D_2, \ldots, D_V\}$, where $D_j$
contains those tuples in D that have outcome $a_j$ of A.
iv. The attribute that has the highest information gain is chosen.
3. Gain ratio :
i. The information gain measure is biased towards tests with many
outcomes.
ii. That is, it prefers to select attributes having a large number of
values.
iii. As each partition is pure, the information gain by partitioning is
maximal. But such partitioning cannot be used for classification.
iv. C4.5 uses this attribute selection measure which is an extension to
the information gain.
v. Gain ratio differs from information gain, which measures the
information with respect to a classification that is acquired based
on some partitioning.
vi. Gain ratio applies a kind of normalization to information gain using
a split information value defined as :
$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{V} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)$
vii. The gain ratio is then defined as :
$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$
viii. A splitting attribute is selected which is the attribute having the
maximum gain ratio.
4. Gini index : Refer Q. 2.3, Page 2-3G,Unit-2.
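The entropy, information gain and gain ratio measures can be computed directly from their definitions. The sketch below is illustrative only: the function names, the dictionary-per-row data layout and the four-row toy data set (with a deliberately useless unique "ID" attribute) are assumptions, not part of the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of values: sum of -p * log2(p)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy(D) minus the weighted entropy of the partitions of D on A."""
    n = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def split_info(rows, attribute):
    """Entropy of the attribute's own value distribution."""
    return entropy([row[attribute] for row in rows])


def gain_ratio(rows, labels, attribute):
    """Information gain normalized by split information."""
    si = split_info(rows, attribute)
    return information_gain(rows, labels, attribute) / si if si > 0 else 0.0


# Toy data: "ID" takes a different value on every row.
ROWS = [
    {"Wind": "Weak", "ID": "1"},
    {"Wind": "Weak", "ID": "2"},
    {"Wind": "Strong", "ID": "3"},
    {"Wind": "Strong", "ID": "4"},
]
LABELS = ["Yes", "Yes", "No", "No"]
```

On this toy data both attributes reach the maximal information gain of 1 bit, but the gain ratio of "ID" is only 0.5 (its split information is 2 bits) while that of "Wind" is 1.0, which illustrates how the normalization penalizes attributes with many values.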
Answer
The various decision tree applications in data mining are :
1. E-Commerce : Decision trees are used widely in the field of e-commerce;
a decision tree helps to generate an online catalog, which is an important
factor for the success of an e-commerce website.
2. Industry : Decision tree algorithm is useful for producing quality control
(faults identification) systems.
3. Intelligent vehicles : An important task for the development of
intelligent vehicles is to find the lane boundaries of the road.
4. Medicine :
a. Decision tree is an important technique for medical research and
practice. A decision tree is used for diagnosis of various diseases.
b. Decision tree is also used for heart sound diagnosis.
5. Business: Decision trees find use in the field of business where they
are used for visualization of probabilistic business models, used in CRM
(Customer Relationship Management) and used for credit scoring for
credit card users and for predicting loan risks in banks.
Answer
ID3(Examples, TargetAttribute, Attributes)
1. Create a Root node for the tree.
2. If all Examples are positive, return the single-node tree Root, with
label = +
3. If all Examples are negative, return the single-node tree Root, with
label = -
4. If Attributes is empty, return the single-node tree Root, with label =
most common value of TargetAttribute in Examples.
5. Otherwise begin
a. A <- the attribute from Attributes that best classifies Examples
b. The decision attribute for Root <- A
c. For each possible value, v_i, of A,
i. Add a new tree branch below Root, corresponding to the test
A = v_i
ii. Let Examples_vi be the subset of Examples that have value v_i
for A
iii. If Examples_vi is empty
a. Then below this new branch add a leaf node with label
= most common value of TargetAttribute in Examples
b. Else below this new branch add the sub-tree ID3(Examples_vi,
TargetAttribute, Attributes - {A})
6. End
7. Return Root.
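The ID3 procedure above can be sketched as a short recursive function. This is a minimal illustration, not a reference implementation: "best classifies" is taken to mean highest information gain, trees are nested dictionaries, and the training rows are the usual 14-example PlayTennis table, included as an assumption because the data set is not reproduced in this section.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by partitioning on the attribute."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


def id3(rows, labels, attributes, domains):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:                        # steps 2-3: pure node
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                               # step 4
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in domains[best]:                      # step 5c
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        if not pairs:                                # empty subset: majority leaf
            branches[value] = majority
        else:
            branches[value] = id3([r for r, _ in pairs],
                                  [l for _, l in pairs], remaining, domains)
    return {best: branches}


# Assumed training data: Outlook, Temperature, Humidity, Wind, PlayTennis.
RAW = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
ROWS, LABELS = [], []
for line in RAW.splitlines():
    *values, label = line.split()
    ROWS.append(dict(zip(ATTRIBUTES, values)))
    LABELS.append(label)
DOMAINS = {a: sorted({row[a] for row in ROWS}) for a in ATTRIBUTES}
TREE = id3(ROWS, LABELS, ATTRIBUTES, DOMAINS)
```

On this data the root test is Outlook, the Sunny branch tests Humidity, the Rain branch tests Wind, and the Overcast branch is a Yes leaf.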
Answer
Refer Q. 1.22, Page 1-20G, Unit-1.
PART-2
Questions-Answers
Answer
Advantages of Artificial Neural Networks (ANN) :
1. Problems in ANN are represented by attribute-value pairs.
2. ANNs are used for problems in which the target function output may be
discrete-valued, real-valued, or a vector of several real or discrete-valued
attributes.
3. ANN learning methods are quite robust to noise in the training data.
The training examples may contain errors, which do not affect the final
output.
4. It is used where the fast evaluation of the learned target function is
required.
5. ANNs can bear long training times depending on factors such as the
number of weights in the network, the number of training examples
considered, and the settings of various learning algorithm parameters.
Disadvantages of Artificial Neural Networks (ANN) :
1. Hardware dependence :
a. Artificial neural networks require processors with parallel processing
power, by their structure.
b. For this reason, the realization of the network depends on the
available equipment.
2. Unexplained functioning of the network :
a. This is the most important problem of ANN.
b. When ANN gives a probing solution, it does not give a clue as to
why and how.
c. This reduces trust in the network.
3. Assurance of proper network structure :
a. There is no specific rule for determining the structure of artificial
neural networks.
b. The appropriate network structure is achieved through experience
and trial and error.
4. The difficulty of showing the problem to the network :
a. ANNs can work with numerical information.
b. Problems have to be translated into numerical values before being
introduced to ANN.
c. The display mechanism to be determined will directly influence the
performance of the network.
d. This is dependent on the user's ability.
5. The duration of the network is unknown :
a. Reducing the network error on the sample to a certain value
means that the training has been completed.
Answer
Answer
Different types of neuron connection are :
1. Single-layer feed forward network:
a. In this type of network, we have only two layers i.e., input layer
and output layer but input layer does not count because no
computation is performed in this layer.
b. Output layer is formed when different weights are applied on input
nodes and the cumulative effect per node is taken.
c. After this the neurons collectively give the output layer to compute
the output signals.
[Figure : single-layer feed forward network; the input layer is connected
to the output layer through weights w11 ... wnm.]
2. Multilayer feed forward network :
a. This network has a hidden layer which is internal to the network and
has no direct contact with the external layer.
b. Existence of one or more hidden layers enables the network to be
computationally stronger.
c. There are no feedback connections in which outputs of the model
are fed back into itself.
Fig. 2.20.1. (diagram labels : input, output, feedback).
[Figure : multilayer feed forward network with weight labels w and v.]
Answer
1. Artificial neural networks are flexible and adaptive.
2. Artificial neural networks are used in sequence and pattern recognition
systems, data processing, robotics, modeling, etc.
3. ANN acquires knowledge from their surroundings by adapting to internal
and external parameters and they solve complex problems which are
difficult to manage.
4. It generalizes knowledge to produce adequate responses to unknown
situations.
5. Artificial neural networks are flexible and have the ability to learn,
generalize and adapt to situations based on their findings.
6. This function allows the network to efficiently acquire knowledge by
learning. This is a distinct advantage over a traditionally linear network
that is inadequate when it comes to modelling non-linear data.
7. An artificial neural network is capable of greater fault tolerance than a
traditional network. Without the loss of stored data, the network is able
to regenerate a fault in any of its components.
8. An artificial neural network is based on adaptive learning.
Answer
1. Gradient descent is an optimization technique used in machine learning
and deep learning, and it can be used with any learning algorithm whose
cost function is differentiable.
2. A gradient is the slope of a function, the degree of change of a parameter
with the amount of change in another parameter.
3. Mathematically, it can be described as the partial derivatives of a
function with respect to its inputs. The larger the gradient, the steeper
the slope.
4. Gradient descent is guaranteed to reach the global minimum when the
cost function is convex and the step size is suitably small.
5. Gradient descent can be described as an iterative method which is used
to find the values of the parameters of a function that minimize the
cost function as much as possible.
6. The parameters are initially set to particular values and, from there,
gradient descent is run in an iterative fashion to find the optimal values
of the parameters, using calculus, so as to find the minimum possible
value of the given cost function.
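The iterative update can be sketched on a one-parameter convex cost. The cost function J(theta) = (theta - 3)^2, the learning rate and the iteration count below are assumptions chosen only to make the behaviour easy to follow.

```python
def gradient_descent(gradient, theta0, learning_rate=0.1, n_iters=100):
    """Repeatedly step against the gradient, starting from theta0."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - learning_rate * gradient(theta)
    return theta


def cost_gradient(theta):
    """Derivative of the illustrative cost J(theta) = (theta - 3) ** 2."""
    return 2 * (theta - 3)


theta_min = gradient_descent(cost_gradient, theta0=0.0)
```

Each step shrinks the distance to the minimizer theta = 3 by a factor of 0.8 with these settings, so `theta_min` ends up very close to 3.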
Answer
Different types of gradient descent are:
1. Batch gradient descent:
a. This is a type of gradient descent which processes all the training
examples for each iteration of gradient descent.
b. When the number of training examples is large, then batch gradient
descent is computationally very expensive. So, it is not preferred.
c. Instead, we prefer to use stochastic gradient descent or
mini-batch gradient descent.
2. Stochastic gradient descent :
a. This is a type of gradient descent which processes a single training
example per iteration.
b. Hence, the parameters are being updated even after one iteration
in which only a single example has been processed.
c. Hence, each iteration is faster than in batch gradient descent. However,
when the number of training examples is large, it still processes only
one example at a time, which can be additional overhead for the system
as the number of iterations will be large.
3. Mini-batch gradient descent :
a. This is a mixture of both stochastic and batch gradient descent.
b. The training set is divided into multiple groups called batches.
c. Each batch has a number of training samples in it.
d. At a time, a single batch is passed through the network which
computes the loss of every sample in the batch and uses their
average to update the parameters of the neural network.
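The three variants differ only in how many samples feed each parameter update, which a single sketch with a `batch_size` argument can show. The model (a straight line y = w*x + b fitted by squared error), the toy data and the hyperparameters are assumptions made for illustration; the text itself speaks of neural network parameters in general.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by averaging squared-error gradients over each batch.

    batch_size == len(xs) gives batch gradient descent, batch_size == 1
    gives stochastic gradient descent, anything in between is mini-batch.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new batch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # average gradient of (w * x + b - y) ** 2 over the batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]
w_batch, b_batch = minibatch_gd(XS, YS, batch_size=len(XS))
w_sgd, b_sgd = minibatch_gd(XS, YS, batch_size=1)
w_mini, b_mini = minibatch_gd(XS, YS, batch_size=2)
```

On this noise-free data all three settings approach w = 2 and b = 1; they differ in how many parameter updates are made per pass over the data (1, 4 and 2 respectively).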
$O = \frac{1}{1 + \exp(-W^T o)}$
where $o$ is the output vector of the hidden layer, with components
$o_l = \frac{1}{1 + \exp(-w_l^T x)}$
Step 4 : Weights of the output unit are updated
$W := W + \eta \delta o$
where $\delta = (y - O) \, O \, (1 - O)$
Step 5 : Weights of the hidden units are updated
$w_l := w_l + \eta \delta W_l o_l (1 - o_l) x, \quad l = 1, \ldots, L$
Step 6 : Cumulative cycle error is computed by adding the present error
to $E$
$E := E + \frac{1}{2} (y - O)^2$
Step 7 : If $k < K$ then $k := k + 1$ and we continue the training by going
back to step 2, otherwise we go to step 8.
Step 8 : The training cycle is completed. For $E < E_{\max}$ terminate the
training session. If $E > E_{\max}$ then $E := 0$, $k := 1$ and we initiate a new
training cycle by going back to step 3.
PART-3
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
Adaline network training algorithm is as follows :
Step 0 : Set the weights and bias to some random values other than zero. Set the learning rate parameter α.
Step 1 : Perform steps 2-6 while the stopping condition is false.
Step 2 : Perform steps 3-5 for each bipolar training pair.
Step 3 : Set the activations of the input units, for i = 1 to n.
Step 4 : Calculate the net input to the output unit.
Step 5 : Update the weights and bias, for i = 1 to n.
Step 6 : If the highest weight change that occurred during training is smaller than a specified tolerance then stop the training process, else continue. This is the test for the stopping condition of the network.
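The steps above can be sketched in code as follows. The text does not state the update formula, so this sketch assumes the standard delta (LMS) rule, in which each weight changes by α(t - y_in)x_i and the bias by α(t - y_in). The names `train_adaline` and `predict`, the fixed starting values and the `max_epochs` cap are assumptions made for illustration.

```python
def train_adaline(pairs, alpha=0.1, tolerance=1e-3, max_epochs=200):
    """Adaline training with the delta rule on bipolar training pairs."""
    n = len(pairs[0][0])
    # Step 0: small non-zero starting values (fixed here for reproducibility).
    weights = [0.1] * n
    bias = 0.1
    for _ in range(max_epochs):                      # Step 1
        largest_change = 0.0
        for x, target in pairs:                      # Steps 2 and 3
            net = bias + sum(w * xi for w, xi in zip(weights, x))  # Step 4
            error = target - net
            for i in range(n):                       # Step 5
                change = alpha * error * x[i]
                weights[i] += change
                largest_change = max(largest_change, abs(change))
            bias += alpha * error
            largest_change = max(largest_change, abs(alpha * error))
        if largest_change < tolerance:               # Step 6
            break
    return weights, bias

def predict(weights, bias, x):
    """Bipolar output obtained by thresholding the net input at zero."""
    net = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if net >= 0 else -1

# Bipolar AND function.
and_pairs = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
weights, bias = train_adaline(and_pairs)
```

With a constant learning rate the weight changes may never fall below the tolerance, because the net input cannot match every target exactly, so the sketch also stops after `max_epochs`.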
Fig. 2.30.1. Signal flow graph of the perceptron (inputs x_1, ..., x_m, weights w_1, ..., w_m, hard limiter).
8. From the model, we find that the hard limiter input, or induced local field, of the neuron is
    v = Σ_{i=1}^{m} w_i x_i + b
Fig. 2.30.2.
12. There are two decision regions separated by a hyperplane defined as :
    Σ_{i=1}^{m} w_i x_i + b = 0
The synaptic weights w_1, w_2, ..., w_m of the perceptron can be adapted on an iteration-by-iteration basis.
13. For the adaptation, an error-correction rule known as the perceptron convergence algorithm is used.
14. For a perceptron to function properly, the two classes G_1 and G_2 must be linearly separable.
15. Linearly separable means that the patterns or set of inputs to be classified must be separable by a straight line.
16. Generalizing, a set of points in n-dimensional space is linearly separable if there is a hyperplane of (n - 1) dimensions that separates the sets.
(Figure : decision regions for class G_1 and class G_2.)
Answer
Statement: The Perceptron convergence theorem states that for any data
set which is linearly separable the Perceptron learning rule is guaranteed to
find a solution in a finite number of steps.
Proof:
1. The aim is to derive the error-correction learning algorithm for the perceptron.
2. The synaptic weights w_1, w_2, ..., w_m of the perceptron are adapted on an iteration-by-iteration basis.
3. The bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1.
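The error-correction rule behind the theorem can be sketched as below, with the bias handled as a weight on a fixed input of +1 as described above. The sketch assumes the common form w(n+1) = w(n) + η[d(n) - y(n)]x(n) with bipolar targets; the name `train_perceptron` and the `max_epochs` cap are illustrative choices.

```python
def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Perceptron error-correction learning with bipolar targets (+1 / -1)."""
    m = len(samples[0][0])
    w = [0.0] * (m + 1)          # w[0] plays the role of the bias b
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            xa = [1.0] + list(x)  # fixed input +1 drives the bias weight
            v = sum(wi * xi for wi, xi in zip(w, xa))  # induced local field
            y = 1 if v >= 0 else -1                     # hard limiter
            if y != d:
                # Correct the weights only when the response is wrong.
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:           # a full pass without mistakes: converged
            return w, True
    return w, False

# Bipolar AND, which is linearly separable.
and_samples = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w, converged = train_perceptron(and_samples)
```

For data that are not linearly separable (for example XOR) the loop would simply run until `max_epochs` and report `False`.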
Answer
Multilayer perceptron :
1. Perceptrons arranged in layers form a multilayer perceptron. This model has three layers : an input layer, an output layer and a hidden layer.
2. For the perceptrons in the input layer, the linear transfer function is used, and for the perceptrons in the hidden layer and output layer, the sigmoidal or squashed-S function is used.
3. The input signal propagates through the network in a forward direction, on a layer-by-layer basis.
4. In the multilayer perceptron, the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1. The input vector is
    x(n) = [+1, x_1(n), x_2(n), ..., x_m(n)]^T
where n denotes the iteration step in applying the algorithm. Correspondingly, we define the weight vector as :
    w(n) = [b(n), w_1(n), w_2(n), ..., w_m(n)]^T
5. Accordingly, the linear combiner output is written in the compact form :
    v(n) = Σ_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)
(Figure : multilayer perceptron, with the input signal entering at the input layer and the output signal leaving the output layer.)
The sigmoidal activation function is :
    y_j = 1 / (1 + exp(-v_j))
where v_j is the induced local field of neuron j (i.e., the weighted sum of all its inputs plus the bias) and y_j is the output of neuron j.
2. The network contains hidden neurons that are not a part of the input or output of the network. The hidden layers of neurons enable the network to learn complex tasks.
3. The network exhibits a high degree of connectivity.
Answer
Effect of tuning parameters of the backpropagation neural network:
1. Momentum factor :
a. The momentum factor has a significant role in deciding the values of learning rate that will produce rapid learning.
b. It determines the size of change in weights or biases.
c. If the momentum factor is zero, the smoothing is minimum and the entire weight adjustment comes from the newly calculated change.
d. If the momentum factor is one, the new adjustment is ignored and the previous one is repeated.
e. Between 0 and 1 is a region where the weight adjustment is smoothed by an amount proportional to the momentum factor.
f. The momentum factor effectively increases the speed of learning without leading to oscillations, and filters out high-frequency variations of the error surface in the weight space.
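The behaviour described for momentum factors 0, 1 and values in between corresponds to a weighted blend of the newly calculated change and the previous adjustment. The sketch below assumes exactly that blended form; many texts instead add a momentum term to the full new change, so the function `smoothed_adjustment` should be read as one possible interpretation, not as the definitive rule.

```python
def smoothed_adjustment(new_change, previous_adjustment, momentum):
    """Blend the newly calculated weight change with the previous adjustment.

    momentum = 0 -> only the new change is used;
    momentum = 1 -> the new change is ignored and the previous one repeated.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return (1.0 - momentum) * new_change + momentum * previous_adjustment

# A change that flips sign from step to step is damped by the smoothing.
adjustment = 0.0
history = []
for raw_change in [1.0, -1.0, 1.0, -1.0]:
    adjustment = smoothed_adjustment(raw_change, adjustment, momentum=0.5)
    history.append(adjustment)
```

In the loop, the raw change alternates between +1 and -1, while the smoothed adjustments stay smaller in magnitude, which is the filtering of high-frequency variations mentioned in point f.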
2. Learning coefficient :
a. A formula to select the learning coefficient is :
    η = 1.5 / sqrt(N_1^2 + N_2^2 + ... + N_m^2)
where N_1 is the number of patterns of type 1 and m is the number of different pattern types.
b. A small value of the learning coefficient, less than 0.2, produces slower but stable training.
c. With a large value of the learning coefficient, i.e., greater than 0.5, the weights are changed drastically, but this may cause the optimum combination of weights to be overshot, resulting in oscillations about the optimum.
d. The optimum value of the learning rate is 0.6, which produces fast learning without leading to oscillations.
3. Sigmoidal gain :
a. If the sigmoidal function is selected, the input-output relationship of the neuron can be set as
    O = 1 / (1 + e^{-λ(I + θ)})    ...(2.33.1)
where λ is the sigmoidal gain.
10 *T=
I, +1,)
10T
I,=
I,+1,)
Which yields the value for I,.
2. Momentum coefficient a :
a. To reduce the training time we use the momentum factor, because it enhances the training process.
b. The influence of momentum on weight change is shown in Fig. 2.34.1.
Fig. 2.34.1. Influence of momentum term on weight change : the new weight change [ΔW]^{n+1} combines the weight change without momentum with the momentum term α[ΔW]^n.
c. The momentum also overcomes the effect of local minima.
d. The use of the momentum term will carry a weight change process through one or more local minima and get it into the global minimum.
3. Sigmoidal gain λ :
a. When the weights become large and force the neuron to operate in a region where the sigmoidal function is very flat, a better method of coping with network paralysis is to adjust the sigmoidal gain.
b. By decreasing this scaling factor, we effectively spread out the sigmoidal function over a wide range so that training proceeds faster.
4. Local minima :
a. One of the most practical solutions involves the introduction of a shock which changes all weights by specific or random amounts.
b. If this fails, then the most practical solution is to rerandomize the weights and start the training all over.
UNIT 3
Evaluating Hypotheses
CONTENTS
Part-1 : Evaluating Hypotheses, Estimating Hypotheses Accuracy
Part-2 : Basics of Sampling Theory, Comparing Learning Algorithms
Part-3 : Bayesian Learning, Bayes Theorem, Concept Learning, Bayes Optimal Classifier
Part-4 : Naive Bayes Classifier, Bayesian Belief Networks, EM Algorithm
PART- 1
Questions-Answers
Answer
Examples :
i. Higher the unemployment, higher would be the rate of crime in society.
ii. Lower the use of fertilizers, lower would be the agricultural productivity.
iii. Higher the poverty in a society, higher would be the rate of crimes.
2. Complex hypothesis : A complex hypothesis is a hypothesis that reflects a relationship among more than two variables.
Examples :
i. Higher the poverty, higher the illiteracy in a society, higher will be the rate of crime (three variables : two independent variables and one dependent variable).
ii. Lower the use of fertilizer, improved seeds and modern equipment, lower would be the agricultural productivity (four variables : three independent variables and one dependent variable).
iii. Higher the illiteracy in a society, higher will be the poverty and crime rate (three variables : one independent variable and two dependent variables).
3. Working hypothesis :
i. A hypothesis that is accepted to be put to test and worked on in a research is called a working hypothesis.
ii. It is a hypothesis that is assumed to be suitable to explain certain facts and relationships of phenomena.
iii. It is supposed that this hypothesis would generate a productive theory, and it is accepted to be put to test for investigation.
iv. It can be any hypothesis that is processed for work during the research.
4. Alternative hypothesis :
i. If the working hypothesis is proved wrong or rejected, another hypothesis (to replace the working hypothesis) is formulated to be tested to generate the desired results; this is known as an alternate hypothesis.
ii. It is an alternate assumption (a relationship or an explanation) which is adopted after the working hypothesis fails to generate the required theory. The alternative hypothesis is denoted by H
PART-2
Answer
After being arranged, the sample elements are picked on the basis of a pre-defined interval set or function.
P(of getting selected) depends upon the ordered population tray after it has been sorted.
The basic methods of employing systematic random sampling are :
i. Choosing the population set wisely.
ii. Checking whether systematic sampling will be the efficient method or not.
iii. If yes, then application of a sorting method to get an ordered pair of population elements.
iv. Choosing a periodicity to crawl out elements.
3. Stratified sampling :
a. Stratified sampling is a hybrid method concerning both simple
random sampling as well as systematic sampling.
b. It is one of the most advanced types of sampling method available,
providing accurate result to the tester.
c. In this method, the population tray is divided into sub-segments, also known as strata (singular : stratum).
d. Each stratum can have its own unique property. After being divided into different strata, SRS or systematic sampling can be used to create and pick out samples for performing statistics.
e. The elementary methods for stratified sampling are :
i. Choosing the population tray.
ii. Checking for periodicity or any other features, so that they can be divided into different strata.
iii. Dividing the population tray into sub-sets and sub-groups on the basis of selective property.
iv. Using SRS or systematic sampling of each individual stratum to form the sample frame.
v. We can even apply different sampling methods to different sub-sets.
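The systematic and stratified procedures described above can be sketched as follows. This is an illustration only: the function names `systematic_sample` and `stratified_sample`, the interval, the strata and the seed are hypothetical, and simple random sampling (SRS) is used inside each stratum.

```python
import random
from collections import defaultdict

def systematic_sample(population, interval, start=0):
    """Sort the population, then pick every `interval`-th element from `start`."""
    ordered = sorted(population)
    return ordered[start::interval]

def stratified_sample(population, stratum_of, per_stratum, seed=0):
    """Divide the population into strata, then draw an SRS from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Never ask for more elements than the stratum contains.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

picked = systematic_sample(range(20), interval=5, start=2)
by_remainder = stratified_sample(range(30), stratum_of=lambda v: v % 3, per_stratum=4)
```

Here the strata are formed by the remainder modulo 3 purely for demonstration; in practice the dividing property would come from the data, such as age group or region.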
PART-3
Questions-Answers
Answer
Bayesian learning :
1. Bayesian learning is a fundamental statistical approach to the problem of pattern classification.
2. This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany such decisions.
3. Because the decision problem is solved on the basis of probabilistic terms, it is assumed that all the relevant probabilities are known.
4. For this we define the state of nature of the things present in the particular pattern. We denote the state of nature by ω.
5. For example, if there are a number of balls which are red and blue in colour, then ω = ω_1 when the ball is red and ω = ω_2 when the ball is blue. Because the state of nature is so unpredictable, we consider ω to be a variable that must be described probabilistically.
6. If one ball is red then we can say that the next ball is equally likely to be red or blue.
7. We assume that there is a prior probability p(ω_1) that the next ball is red, and a prior probability p(ω_2) that the next ball is blue.
8. These prior probabilities reflect the prior knowledge of how likely it is that a ball obtained is red or blue before the ball actually appears.
9. Now, after defining the state of nature and prior probabilities, the decision has to be made as to which class a particular ball belongs.
10. A decision rule is used to take the decision as :
Decide ω_1 if p(ω_1) > p(ω_2), otherwise ω_2.
Two-category classification :
1. Let ω_1, ω_2 be the two classes of the patterns. It is assumed that the a priori probabilities p(ω_1) and p(ω_2) are known.
2. Even if they are not known, they can easily be estimated from the available training feature vectors.
3. If N is the total number of available training patterns and N_1, N_2 of them belong to ω_1 and ω_2 respectively, then p(ω_1) = N_1/N and p(ω_2) = N_2/N.
4. The conditional probability density functions p(x|ω_i), i = 1, 2 are also assumed to be known; they describe the distribution of the feature vectors in each of the classes.
5. The feature vectors can take any value in the l-dimensional feature space.
6. The density functions p(x|ω_i) become probabilities, and will be denoted by P(x|ω_i), when the feature vectors can take only discrete values.
where p(x) is the probability density function of x and for which we have
    p(x) = Σ_{i=1}^{2} p(x|ω_i) p(ω_i)
Fig. 3.10.1. Bayesian classifier for the case of two equiprobable classes (regions R_1 and R_2 separated by the threshold x_0).
13. The dotted line at x_0 is a threshold which partitions the space into two regions, R_1 and R_2. According to Bayes' decision rule, for all values of x in R_1 the classifier decides ω_1 and for all values in R_2 it decides ω_2.
14. From Fig. 3.10.1, it is obvious that errors are unavoidable. There is a finite probability for an x to lie in the R_2 region and at the same time to belong to class ω_1. Then there is an error in the decision.
15. The total probability P_e of committing a decision error for two equiprobable classes is given by
    P_e = (1/2) ∫_{-∞}^{x_0} p(x|ω_2) dx + (1/2) ∫_{x_0}^{+∞} p(x|ω_1) dx
which is equal to the total shaded area under the curves in Fig. 3.10.1.
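The error probability for two equiprobable classes can be evaluated numerically once the class-conditional densities are fixed. The sketch below assumes, purely for illustration, that both densities are Gaussian with equal standard deviation and that class ω_1 has the smaller mean, so that R_1 lies to the left of the threshold x_0; the function names and the numbers are hypothetical.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def error_probability(x0, mean1, mean2, std=1.0):
    """P_e for two equiprobable Gaussian classes and decision threshold x0."""
    class2_in_r1 = normal_cdf(x0, mean2, std)        # class-2 mass left of x0
    class1_in_r2 = 1.0 - normal_cdf(x0, mean1, std)  # class-1 mass right of x0
    return 0.5 * class2_in_r1 + 0.5 * class1_in_r2

# Means 0 and 2: the two densities cross at x0 = 1.
best = error_probability(1.0, 0.0, 2.0)
shifted_left = error_probability(0.5, 0.0, 2.0)
shifted_right = error_probability(1.5, 0.0, 2.0)
```

Under these assumptions the error is smallest when the threshold sits where the two densities cross, and it grows when the threshold is moved to either side.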
Answer
1. The Bayesian classifier is optimal in the sense that it minimizes the classification error probability.
2. In Fig. 3.10.1, it is observed that when the threshold is moved away from x_0, the corresponding shaded area under the curves always increases.
3. Hence, we have to decrease this shaded area to minimize the error.
4. Let R_1 be the region of the feature space for ω_1 and R_2 be the corresponding region for ω_2.
5. Then an error occurs if x ∈ R_1 although it belongs to ω_2, or if x ∈ R_2 although it belongs to ω_1, i.e.,
    P_e = p(x ∈ R_2, ω_1) + p(x ∈ R_1, ω_2)    ...(3.11.1)
6. P_e can be written as,
    P_e = p(x ∈ R_2 | ω_1) p(ω_1) + p(x ∈ R_1 | ω_2) p(ω_2)
        = p(ω_1) ∫_{R_2} p(x|ω_1) dx + p(ω_2) ∫_{R_1} p(x|ω_2) dx    ...(3.11.2)
7. Using Bayes' rule,
    P_e = ∫_{R_2} p(ω_1|x) p(x) dx + ∫_{R_1} p(ω_2|x) p(x) dx    ...(3.11.3)
8. The error will be minimized if the partitioning regions R_1 and R_2 of the feature space are chosen so that
    R_1 : p(ω_1|x) > p(ω_2|x)
    R_2 : p(ω_2|x) > p(ω_1|x)    ...(3.11.4)
9. Since the union of the regions R_1, R_2 covers all the space, we have
    ∫_{R_1} p(ω_1|x) p(x) dx + ∫_{R_2} p(ω_1|x) p(x) dx = p(ω_1)    ...(3.11.5)
10. Combining equation (3.11.3) and (3.11.5), we get,
    P_e = p(ω_1) - ∫_{R_1} [p(ω_1|x) - p(ω_2|x)] p(x) dx
    P(x|ω_1) = 1/(a_2 - a_1) if x ∈ [a_1, a_2], and 0 otherwise
    P(x|ω_2) = 1/(b_2 - b_1) if x ∈ [b_1, b_2], and 0 otherwise
Show the classification results for some values for a and b.
Answer
Typical cases are presented in the Fig.3.12.1.
Fig. 3.12.1. Typical cases (a) to (d) : the uniform densities of height 1/(a_2 - a_1) on [a_1, a_2] and 1/(b_2 - b_1) on [b_1, b_2], drawn for different relative positions of the two intervals.
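The classification for concrete values of a and b can be worked out with a short sketch. It assumes equal prior probabilities unless others are passed in, breaks ties in favour of class 1, and returns `None` where both densities are zero; the names `uniform_density` and `classify` and the intervals used are hypothetical.

```python
def uniform_density(x, low, high):
    """Uniform density 1/(high - low) on [low, high], zero elsewhere."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def classify(x, a, b, prior1=0.5, prior2=0.5):
    """Return 1 or 2 for the class with the larger density times prior."""
    score1 = uniform_density(x, *a) * prior1
    score2 = uniform_density(x, *b) * prior2
    if score1 == 0.0 and score2 == 0.0:
        return None  # x lies outside both intervals
    return 1 if score1 >= score2 else 2

a = (0.0, 2.0)   # class 1 : density 1/2 on [0, 2]
b = (1.0, 5.0)   # class 2 : density 1/4 on [1, 5]
results = {x: classify(x, a, b) for x in (0.5, 1.5, 3.0, 6.0)}
```

In the overlap [1, 2] the narrower interval wins because its density is higher, which is the typical behaviour in the overlapping cases.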
Answer
1. Let x be a thing in a pattern. In Bayesian terms, x is considered "evidence". Let H be some hypothesis, such as that x belongs to a specified class C.
2. For a classification problem, p(H|x) is determined, which is the probability that the hypothesis holds given the observed x.
3. In other words, the probability that x belongs to class C is determined, given that the description of x is known.
4. p(H|x) is the posterior probability of H conditioned on x.
5. For example, suppose there are a number of customers described by the attributes age and income, and that x is a 35 years old customer with an income of Rs. 40,000.
6. Suppose that H is the hypothesis that our customer will buy a computer. Then p(H|x) reflects the probability that customer x will buy a computer given that the customer's age and income are known.
7. Similarly, p(x|H) is the posterior probability of x conditioned on H.
8. It is the probability that customer x is 35 years old and earns Rs. 40,000, given that we know that the customer will buy a computer. p(x) is the prior probability of x.
9. It is the probability that a person from the set of customers is 35 years old and earns Rs. 40,000.
10. Bayes' theorem is useful in that it provides a way of calculating the posterior probability p(H|x) from p(H), p(x|H) and p(x).
Bayes' theorem is,
    p(H|x) = p(x|H) p(H) / p(x)
can work with the Naive Bayes model without believing in Bayesian
probability or using any Bayesian methods.
5. An advantage of the Naive Bayes classifier is that it requires a small
amount of training data to estimate the parameters (means and
variances of the variables) necessary for classification.
6. The perceptron bears a certain relationship to a classical pattern
classifier known as the Bayes classifier.
7. When the environment is Gaussian, the Bayes classifier reduces to a linear classifier.
In the Bayes classifier, or Bayes hypothesis testing procedure, we minimize the average risk, denoted by R. For a two-class problem, represented by classes C_1 and C_2, the average risk is defined as :
    R = C_11 P_1 ∫_{H_1} p_x(x|C_1) dx + C_22 P_2 ∫_{H_2} p_x(x|C_2) dx
      + C_21 P_1 ∫_{H_2} p_x(x|C_1) dx + C_12 P_2 ∫_{H_1} p_x(x|C_2) dx
where the various terms are defined as follows :
P_i = Prior probability that the observation vector x is drawn from subspace H_i, with i = 1, 2, and P_1 + P_2 = 1
C_ij = Cost of deciding in favour of class C_i, represented by subspace H_i, when class C_j is true, with i, j = 1, 2
p_x(x|C_i) = Conditional probability density function of the random vector X, given class C_i
8. Fig. 3.14.1(a) depicts a block diagram representation of the Bayes classifier. The important points in this block diagram are twofold :
The data processing in designing the Bayes classifier is confined entirely to the computation of the likelihood ratio Λ(x).
Fig. 3.14.1. Two equivalent implementations of the Bayes classifier : (a) Likelihood ratio test : the input vector x feeds a likelihood ratio computer and a comparator, which assigns x to class C_1 if Λ(x) > ξ and otherwise to class C_2. (b) Log-likelihood ratio test : the same scheme with log Λ(x) compared against log ξ.
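The log-likelihood ratio test of Fig. 3.14.1(b) can be sketched as below. The sketch assumes one-dimensional Gaussian class-conditional densities with a common standard deviation and a user-supplied threshold; these choices, and the names `gaussian_log_density` and `log_likelihood_ratio_test`, are illustrative assumptions and not part of the text.

```python
import math

def gaussian_log_density(x, mean, std):
    """Natural log of the Gaussian density at x."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def log_likelihood_ratio_test(x, mean1, mean2, std=1.0, threshold=1.0):
    """Assign x to class 1 if log(likelihood ratio) > log(threshold), else class 2."""
    log_ratio = gaussian_log_density(x, mean1, std) - gaussian_log_density(x, mean2, std)
    return 1 if log_ratio > math.log(threshold) else 2

label_near_class1 = log_likelihood_ratio_test(0.5, mean1=0.0, mean2=2.0)
label_near_class2 = log_likelihood_ratio_test(1.5, mean1=0.0, mean2=2.0)
```

Working with logarithms turns the ratio of densities into a difference, which is numerically safer when the densities are very small. Raising the threshold makes the classifier more reluctant to choose class 1.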
Answer
1. Let D be a training set of features and their associated class labels. Each feature is represented by an n-dimensional attribute vector X = (x_1, x_2, ..., x_n), depicting n measurements made on the feature from n attributes, respectively A_1, A_2, ..., A_n.
2. Suppose that there are m classes, C_1, C_2, ..., C_m. Given a feature X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the classifier predicts that X belongs to class C_i if and only if,
    p(C_i|X) > p(C_j|X) for 1 ≤ j ≤ m, j ≠ i
Thus, we maximize p(C_i|X). The class C_i for which p(C_i|X) is maximized is called the maximum posterior hypothesis. By Bayes' theorem,
    p(C_i|X) = p(X|C_i) p(C_i) / p(X)
3. As p(X) is constant for all classes, only p(X|C_i) p(C_i) needs to be maximized. If the class prior probabilities are not known then it is commonly assumed that the classes are equally likely, i.e., p(C_1) = p(C_2) = ... = p(C_m), and therefore p(X|C_i) is maximized. Otherwise p(X|C_i) p(C_i) is maximized.
4. a. Given data sets with many attributes, the computation of p(X|C_i) will be extremely expensive.
b. To reduce computation in evaluating p(X|C_i), the assumption of class conditional independence is made.
c. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the feature, so that p(X|C_i) = p(x_1|C_i) × p(x_2|C_i) × ... × p(x_n|C_i).
    g(x, μ, σ) = (1 / (sqrt(2π) σ)) e^{-(x - μ)^2 / (2σ^2)}
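A naive Bayesian classifier for continuous attributes can be sketched as follows. This is a minimal illustration that assumes each attribute is Gaussian within a class, estimates priors from class frequencies, and works with log-probabilities for numerical stability; the function names, the small variance floor and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the class priors and per-attribute mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # floor avoids division by zero
        model[label] = (prior, stats)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class maximizing log p(C) + sum_k log p(x_k | C)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

rows = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
labels = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(rows, labels)
```

The sum of per-attribute log-densities is exactly the class conditional independence assumption written in logarithmic form.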
Que 3.16. Let blue, green, and red be three classes of objects with
prior probabilities given by P(blue) =1/4, P(green) =1/2, P(red) =1/4.
Let there be three types of objects pencils, pens, and paper. Let the
class-conditional probabilities of these objects be given as follows.
Use Bayes classifier to classify pencil, pen and paper.
P(pencil/green) = 1/3    P(pen/green) = 1/2    P(paper/green) = 1/6
P(pencil/blue) = 1/2    P(pen/blue) = 1/6    P(paper/blue) = 1/3
P(pencil/red) = 1/6    P(pen/red) = 1/3    P(paper/red) = 1/2
Answer
As per Bayes rule :
P(green/pencil) = P(pencil/green) P(green) / [P(pencil/green) P(green) + P(pencil/blue) P(blue) + P(pencil/red) P(red)]
    = (1/3 × 1/2) / (1/3 × 1/2 + 1/2 × 1/4 + 1/6 × 1/4) = (1/6) / (1/3) = 0.500
P(blue/pencil) = P(pencil/blue) P(blue) / [P(pencil/green) P(green) + P(pencil/blue) P(blue) + P(pencil/red) P(red)]
    = (1/2 × 1/4) / (1/3) = (1/8) / (1/3) = 0.375
P(red/pencil) = P(pencil/red) P(red) / [P(pencil/green) P(green) + P(pencil/blue) P(blue) + P(pencil/red) P(red)]
    = (1/6 × 1/4) / (1/3) = (1/24) / (1/3) = 0.125
Since P(green/pencil) has the highest value, pencil belongs to class green.
P(green/pen) = P(pen/green) P(green) / [P(pen/green) P(green) + P(pen/blue) P(blue) + P(pen/red) P(red)]
    = (1/2 × 1/2) / (1/2 × 1/2 + 1/6 × 1/4 + 1/3 × 1/4) = (1/4) / 0.375 = 0.667
P(blue/pen) = P(pen/blue) P(blue) / [P(pen/green) P(green) + P(pen/blue) P(blue) + P(pen/red) P(red)]
    = (1/6 × 1/4) / 0.375 = (1/24) / 0.375 = 0.111
P(red/pen) = P(pen/red) P(red) / [P(pen/green) P(green) + P(pen/blue) P(blue) + P(pen/red) P(red)]
    = (1/3 × 1/4) / 0.375 = (1/12) / 0.375 = 0.222
Since P(green/pen) has the highest value, pen belongs to class green.
P(green/paper) = P(paper/green) P(green) / [P(paper/green) P(green) + P(paper/blue) P(blue) + P(paper/red) P(red)]
    = (1/6 × 1/2) / (1/6 × 1/2 + 1/3 × 1/4 + 1/2 × 1/4) = (1/12) / (1/12 + 1/12 + 1/8) = (1/12) / 0.291 = 0.286
P(blue/paper) = P(paper/blue) P(blue) / [P(paper/green) P(green) + P(paper/blue) P(blue) + P(paper/red) P(red)]
    = (1/3 × 1/4) / 0.291 = (1/12) / 0.291 = 0.286
P(red/paper) = P(paper/red) P(red) / [P(paper/green) P(green) + P(paper/blue) P(blue) + P(paper/red) P(red)]
    = (1/2 × 1/4) / 0.291 = (1/8) / 0.291 = 0.429
Since P(red/paper) has the highest value, paper belongs to class red.
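The arithmetic of this worked example can be checked with exact fractions. The sketch below simply encodes the priors and class-conditional probabilities given in the question (numerators of 1 are assumed where the printed fractions are incomplete, consistent with the worked solution); the function names are illustrative.

```python
from fractions import Fraction as F

priors = {"green": F(1, 2), "blue": F(1, 4), "red": F(1, 4)}
likelihood = {
    "green": {"pencil": F(1, 3), "pen": F(1, 2), "paper": F(1, 6)},
    "blue":  {"pencil": F(1, 2), "pen": F(1, 6), "paper": F(1, 3)},
    "red":   {"pencil": F(1, 6), "pen": F(1, 3), "paper": F(1, 2)},
}

def posteriors(obj):
    """Posterior probability of every class given the observed object."""
    joint = {c: likelihood[c][obj] * priors[c] for c in priors}
    evidence = sum(joint.values())  # the common denominator of Bayes rule
    return {c: joint[c] / evidence for c in joint}

def classify(obj):
    """Class with the highest posterior probability."""
    post = posteriors(obj)
    return max(post, key=post.get)

for obj in ("pencil", "pen", "paper"):
    print(obj, classify(obj), {c: float(p) for c, p in posteriors(obj).items()})
```

Using exact fractions avoids the small rounding differences that appear when the denominator is first rounded to two or three decimals.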
PART-4
Naive Bayes Classifier, Bayesian Belief Network, EMAlgorithm.
Questions-Answers
Answer
1. Naive Bayes model is the most common Bayesian network model used in machine learning.
2. Here, the class variable C is the root, which is to be predicted, and the attribute variables X_i are the leaves.
3. The model is Naive because it assumes that the attributes are conditionally independent of each other, given the class.
4. Assuming Boolean variables, the parameters are :
    θ = P(C = true), θ_i1 = P(X_i = true | C = true), θ_i2 = P(X_i = true | C = false)
5. Naive Bayes models can be viewed as Bayesian networks in which each X_i has C as the sole parent and C has no parents.
6. A Naive Bayes model with gaussian P(X_i|C) is equivalent to a mixture of gaussians with diagonal covariance matrices.
7. While mixtures of gaussians are widely used for density estimation in continuous domains, Naive Bayes models are used in discrete and mixed domains.
8. Naive Bayes models allow for very efficient inference of marginal and conditional distributions.
9. Naive Bayes learning has no difficulty with noisy data and can give more appropriate probabilistic predictions.
Fig. 3.17.1. The learning curve for Naive Bayes learning (proportion correct on test set against training set size, shown together with the curve for decision tree learning).
Answer
2 Bad 2 3 Indian 4 1
Asha
Usha 2 2
Tasty
Yes No
6 4
Likelihood table : for each attribute (Cook, Health status, Cuisine) the conditional probabilities are tabulated against Tasty = Yes and Tasty = No, with P(Tasty = Yes) = 6/10 and P(Tasty = No) = 4/10.
Likelihood of yes = 2/6 × 2/6 × 2/6 × 6/10 = 0.022
Likelihood of no = 0 × 3/4 × 3/4 × 4/10 = 0
Therefore, the prediction is tasty.
(Figure : a simple Bayesian network with the nodes Weather, Cavity, Toothache and Catch.)
3. Bayesian Network (BN) has been accepted as a powerful tool for common
knowledge representation and reasoning of partial beliefs under
uncertainty.
4. Bayesian networks utilize knowledge about the independence of variables
to simplify the model.
5. One of the most important features of Bayesian networks is the fact
that they provide an elegant mathematical structure for modeling
complicated relationships among random variables while keeping a
relatively simple visualization of these relationships.
Que 3.20. Write short note on Bayesian belief networks.
Answer
1. Bayesian belief networks specify joint conditional probability distributions.
2. They are also known as belief networks, Bayesian networks, or probabilistic networks.
3. A belief network allows class conditional independencies to be defined between subsets of variables.
4. It provides a graphical model of causal relationships, on which learning can be performed.
5. We can use a trained Bayesian network for classification.
6. There are two components that define a Bayesian belief network :
a. Directed acyclic graph :
i. Each node in a directed acyclic graph represents a random variable.
ii. These variables may be discrete or continuous valued.
iii. These variables may correspond to the actual attributes given in the data.
Directed acyclic graph representation : The following diagram shows a
directed acyclic graph for six Boolean variables.
CONTENTS
Part-1 : Computational Learning Theory, Sample Complexity for Finite Hypothesis Spaces, Sample Complexity for Infinite Hypothesis Space, The Mistake Bound Model of Learning
Computational Learning Theory
PART-1
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. Computational Learning Theory (CLT) is a field of AI used for studying the design of machine learning algorithms, to determine what sorts of problems are learnable.
2. The ultimate goals are to understand the theoretical ideas of deep learning programs and what makes them work or not, while improving accuracy and efficiency.
3. This field merges many disciplines, such as probability theory, statistics, programming optimization, information theory, calculus and geometry.
4. Computational learning theory is used to :
i. Provide a theoretical analysis of learning.
ii. Show when a learning algorithm can be expected to succeed.
iii. Show when learning may be impossible.
5. There are three areas comprised by CLT :
i. Sample complexity : Sample complexity describes how many examples we need to find a good hypothesis.
ii. Computational complexity : Computational complexity describes how much computational power we need to find a good hypothesis.
iii. Mistake bound : The mistake bound describes how many mistakes we will make before finding a good hypothesis.
Que 4.2. Describe sample complexity for finite hypothesis space.
Answer
1. The sample complexity of a machine learning algorithm represents the
number of training samples that it needs in order to successfully learn a
target function.
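As a hedged illustration of sample complexity for a finite hypothesis space, the sketch below uses the standard bound for consistent learners, m ≥ (1/ε)(ln|H| + ln(1/δ)), which is not stated in the text above. Here ε is the allowed error, δ the allowed probability of failure and |H| the size of the hypothesis space; the function name `sample_complexity` is hypothetical.

```python
import math

def sample_complexity(hypothesis_count, epsilon, delta):
    """Number of training examples sufficient under the finite-|H| bound."""
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)

# Conjunctions over 10 Boolean attributes: |H| = 3**10.
m = sample_complexity(3 ** 10, epsilon=0.1, delta=0.05)
print(m)  # 140
```

The bound grows only logarithmically with |H| and with 1/δ, but linearly with 1/ε.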
Answer
1. An algorithm A learns a class C with mistake bound M iff
    Mistake(A, C) ≤ M.
2. In the mistake bound model, learning proceeds in rounds, one by one. Suppose Y = {-1, +1}.
3. At the beginning of round t, the learning algorithm A has the hypothesis h_t. In round t, we see x_t and predict h_t(x_t).
4. At the end of the round, y_t is revealed and A makes a mistake if h_t(x_t) ≠ y_t. The algorithm then updates its hypothesis to h_{t+1}, and this continues till time T.
5. Suppose the labels were actually produced by some function f in a given concept class C.
6. Then we bound the total number of mistakes the learner commits as :
    Mistake(A, C) := max over f ∈ C, T and x_1, ..., x_T of Σ_{t=1}^{T} 1[h_t(x_t) ≠ f(x_t)]
7. A further issue is the amount of computation A has to do in each round in order to update its hypothesis from h_t to h_{t+1}.
8. Setting this issue aside for a moment, we have a remarkably simple algorithm HALVING(C) that has a mistake bound of lg(|C|) for any finite concept class C.
9. For a finite set H of hypotheses, define the hypothesis majority(H) as follows :
    majority(H)(x) := +1 if |{h ∈ H : h(x) = +1}| ≥ |H|/2, and -1 otherwise
Algorithm :
HALVING(C)
C_1 ← C
h_1 ← majority(C_1)
for t = 1 to T do
    Receive x_t
    Predict h_t(x_t)
    Receive y_t
    C_{t+1} ← {f ∈ C_t | f(x_t) = y_t}
    h_{t+1} ← majority(C_{t+1})
end for
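The halving algorithm can be sketched directly from the pseudocode. The concept class used here, threshold functions on the integers 0 to 7, and the helper names are hypothetical choices for demonstration.

```python
def majority(hypotheses, x):
    """Predict +1 if at least half of the hypotheses output +1 on x, else -1."""
    positive = sum(1 for h in hypotheses if h(x) == 1)
    return 1 if positive >= len(hypotheses) / 2 else -1

def halving(concept_class, stream):
    """Run HALVING on a stream of (x, y) pairs; return mistakes and survivors."""
    version_space = list(concept_class)
    mistakes = 0
    for x, y in stream:
        if majority(version_space, x) != y:
            mistakes += 1
        # Keep only the concepts consistent with the revealed label.
        version_space = [f for f in version_space if f(x) == y]
    return mistakes, version_space

def threshold(k):
    """Concept that labels x as +1 exactly when x >= k."""
    return lambda x: 1 if x >= k else -1

concepts = [threshold(k) for k in range(9)]   # |C| = 9
target = concepts[5]
stream = [(x, target(x)) for x in range(8)]
mistakes, survivors = halving(concepts, stream)
```

Every mistake removes at least half of the surviving concepts, because the majority was wrong, which is why the number of mistakes cannot exceed lg(|C|).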
PART-2
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
1. Instance-Based Learning (IBL) is an extension of nearest neighbour or K-NN classification algorithms.
2. IBL algorithms do not maintain a set of abstractions of a model created from the instances.
3. The K-NN algorithms have a large space requirement.
4. IBL algorithms also extend K-NN with a significance test to work with noisy instances, since a lot of real-life datasets have noisy training instances and K-NN algorithms do not work well with noise.
5. Instance-based learning is based on the memorization of the dataset.
6. The number of parameters is unbounded and grows with the size of the data.
7. The classification is obtained through the memorized examples.
8. The cost of the learning process is 0; all the cost is in the computation of the prediction.
9. This kind of learning is also known as lazy learning.
Que 4.5.Explain instance-based learning representation.
Answer
Instance-based representation (1) :
1. The simplest form of learning is plain memorization.
2. This is a completely different way of representing the knowledge extracted from a set of instances : just store the instances themselves and operate by relating new instances whose class is unknown to existing ones whose class is known.
3. Instead of creating rules, work directly from the examples themselves.
Instance-based representation (2) :
1. Instance-based learning is lazy, deferring the real work as long as possible.
2. In instance-based learning, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is also called the nearest-neighbour classification method.
3. Sometimes more than one nearest neighbour is used, and the majority class of the closest k-nearest neighbours is assigned to the new instance. This is termed the k-nearest neighbour method.
Instance-based representation (3) :
1. When computing the distance between two examples, the standard Euclidean distance may be used.
2. When nominal attributes are present, we may use the following procedure.
3. A distance of 0 is assigned if the values are identical, otherwise the distance is 1.
4. Some attributes will be more important than others. We need some kind of attribute weighting. To get suitable attribute weights from the training set is a key problem.
5. It may not be necessary, or desirable, to store all the training instances.
Instance-based representation (4) :
1. Generally some regions of attribute space are more stable with regard to class than others, and just a few examples are needed inside stable regions.
2. An apparent drawback to instance-based representations is that they do not make explicit the structures that are learned.
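The distance computation for instances with both numeric and nominal attributes can be sketched as below. This assumes a weighted Euclidean combination in which numeric attributes contribute their difference and nominal attributes contribute 0 or 1; the function name `instance_distance` and the optional `weights` argument are illustrative assumptions.

```python
import math

def instance_distance(a, b, weights=None):
    """Weighted Euclidean distance with a 0/1 difference for nominal attributes."""
    if weights is None:
        weights = [1.0] * len(a)
    total = 0.0
    for u, v, w in zip(a, b, weights):
        if isinstance(u, (int, float)) and isinstance(v, (int, float)):
            diff = u - v                      # numeric attribute
        else:
            diff = 0.0 if u == v else 1.0     # nominal attribute
        total += w * diff ** 2
    return math.sqrt(total)

same_colour = instance_distance((1.0, "red"), (4.0, "red"))
different_colour = instance_distance((0.0, "a"), (3.0, "b"))
```

In practice numeric attributes are usually normalized first, otherwise an attribute with a large range dominates the 0/1 nominal differences.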
1. Generality :
a. This is the class of concepts that can be described by the representation of an algorithm.
b. IBL algorithms can PAC-learn any concept whose boundary is a union of a finite number of closed hyper-curves of finite size.
2. Accuracy : This describes the accuracy of classification.
3. Learning rate :
a. This is the speed at which classification accuracy increases during training.
b. It is a more useful indicator of the performance of the learning algorithm than accuracy for finite-sized training sets.
4. Incorporation costs :
a. These are incurred while updating the concept descriptions with a single training instance.
b. They include classification costs.
5. Storage requirement : This is the size of the concept description for IBL algorithms, which is defined as the number of saved instances used for classification decisions.
Answer
Functions of instance-based learning are :
1. Similarity function :
a. This computes the similarity between a training instance i and the
instances in the concept description.
b. Similarities are numeric-valued.
2. Classification function :
a. This receives the similarity function's results and the classification performance records of the instances in the concept description.
b. It yields a classification for i.
3. Concept description updater :
a. This maintains records on classification performance and decides which instances to include in the concept description.
b. Inputs include i, the similarity results, the classification results, and a current concept description. It yields the modified concept description.
Que 4.8. What are the advantages and disadvantages of instance
based learning ?
Answer
Advantages of instance-based learning:
1. Learning is trivial.
2. Works efficiently.
3. Noise resistant.
4. Rich representation, arbitrary decision surfaces.
5. Easy to understand.
Disadvantages of instance-based learning :
1. Need lots of data.
2. Computational cost is high.
3. Restricted to x ∈ R^n.
4. Implicit weights of attributes (need normalization).
5. Need large space for storage i.e., require large memory.
6. Expensive application time.
Answer
1 The KNN classification algorithm is used to decide the new instance
should belong to which class.
2 When K = 1, we have the nearest neighbour algorithm.
3. KNN classification is incremental.
4 KNN classification does not have a training phase, all instances are
stored. Training uses indexing to find neighbours quickly.
4-8G (CS/IT/OE-Sem-8) Computational Learning Theory
5. During testing, the KNN classification algorithm has to find the K-nearest
neighbours of a new instance. This is time consuming if we do an exhaustive
comparison.
6. K-nearest neighbours use the local neighbourhood to obtain a prediction.
Algorithm : Let m be the number of training data samples. Let p be an
unknown point.
1. Store the training samples in an array of data points arr. This means
each element of this array represents a tuple (x, y).
2. For i = 0 to m - 1 :
Calculate the Euclidean distance d(arr[i], p).
3. Make a set S of the K smallest distances obtained. Each of these distances
corresponds to an already classified data point.
4. Return the majority label among S.
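The steps above can be sketched in Python as follows. This is a minimal illustration; the function name `knn_classify` and the toy training data are assumptions, and the exhaustive comparison is used instead of an index.

```python
import math
from collections import Counter

def knn_classify(arr, p, k):
    # Step 2: Euclidean distance from p to every stored sample (exhaustive comparison).
    distances = [(math.dist(x, p), y) for x, y in arr]
    # Step 3: keep the K smallest distances with their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority label among the K nearest samples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Step 1: training samples stored as (x, y) tuples (hypothetical data).
arr = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"),
       ((5.0, 5.0), "b"), ((6.0, 5.0), "b"), ((5.0, 6.0), "b")]
label_near_origin = knn_classify(arr, (0.2, 0.1), k=3)
label_near_five = knn_classify(arr, (5.0, 5.2), k=3)
```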
Answer
Advantages of KNN algorithm :
1. No training period :
a. KNN is called a lazy learner (instance-based learning).
b. It does not learn anything in the training period. It does not derive
any discriminative function from the training data.
c. In other words, there is no training period for it. It stores the
training dataset and learns from it only at the time of making real
time predictions.
d. This makes the KNN algorithm much faster to set up than other algorithms
that require training, for example, SVM, Linear Regression etc.
2. Since the KNN algorithm requires no training before making predictions,
new data can be added seamlessly without impacting the accuracy of
the algorithm.
3. KNN is very easy to implement. There are only two parameters required
to implement KNN, i.e., the value of K and the distance function (for
example, Euclidean).
Disadvantages of KNN :
1. Does not work well with large datasets : In large datasets, the cost of
calculating the distance between the new point and each existing point
is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions : The KNN algorithm
does not work well with high dimensional data because, with a large number
of dimensions, it becomes difficult to compute meaningful distances.
Fig. 4.11.1.
6. The LOESS (Locally Estimated Scatterplot Smoothing) model performs
a linear regression on points in the data set, weighted by a kernel
centered at x.
7. The kernel shape is a design parameter. The original LOESS model uses a
tricubic kernel; the form used here is a Gaussian kernel :
$h_i(x) = h(x - x_i) = \exp(-k (x - x_i)^2)$,
where k is a smoothing parameter.
8. For brevity, we will drop the argument x for $h_i(x)$, and define $n = \sum_i h_i$.
We can then write the estimated means and covariances as :
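A minimal Python sketch of these kernel-weighted quantities follows. It assumes the standard weighted-moment definitions (weighted means of x and y, weighted variance of x, weighted covariance of x and y) and a one-dimensional input; the function name `loess_predict` and the sample data are hypothetical.

```python
import math

def loess_predict(xs, ys, x0, k=1.0):
    # Kernel weights h_i = exp(-k (x0 - x_i)^2), centered at the query point x0.
    h = [math.exp(-k * (x0 - xi) ** 2) for xi in xs]
    n = sum(h)
    # Kernel-weighted means (assumed standard definitions).
    mu_x = sum(hi * xi for hi, xi in zip(h, xs)) / n
    mu_y = sum(hi * yi for hi, yi in zip(h, ys)) / n
    # Kernel-weighted variance of x and covariance of x and y.
    var_x = sum(hi * (xi - mu_x) ** 2 for hi, xi in zip(h, xs)) / n
    cov_xy = sum(hi * (xi - mu_x) * (yi - mu_y)
                 for hi, xi, yi in zip(h, xs, ys)) / n
    # Local linear regression: the line through the weighted means
    # with slope cov_xy / var_x, evaluated at x0.
    return mu_y + (cov_xy / var_x) * (x0 - mu_x)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1
prediction = loess_predict(xs, ys, 2.5)
```

A larger k makes the kernel narrower, so the fit at x0 depends more strongly on the nearest points.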
Answer
1. Radial Basis Function (RBF) networks have three layers :an input
layer, a hidden layer with a non-linear RBF activation function and a
linear output layer.
2. The input can be modeled as a vector of real numbers $x \in \mathbb{R}^n$.
3. The output of the network is then a scalar function of the input vector,
$\varphi : \mathbb{R}^n \to \mathbb{R}$, and is given by
$\varphi(x) = \sum_{i=1}^{N} a_i \, \rho(\lVert x - c_i \rVert)$
where N is the number of neurons in the hidden layer, $c_i$ is the center
vector for neuron i and $a_i$ is the weight of neuron i in the linear output
neuron.
Fig. 4.13.1. Architecture of a radial basis function network. An input
vector x is used as input to all radial basis functions, each with different
parameters. The output of the network is a linear combination of the
outputs from radial basis functions.
4. Functions that depend only on the distance from a center vector are
radially symmetric about that vector.
5. In the basic form all inputs are connected to each hidden neuron.
6. The radial basis function is taken to be Gaussian :
$\rho(\lVert x - c_i \rVert) = \exp(-\beta \lVert x - c_i \rVert^2)$
7. The Gaussian basis functions are local to the center vector in the sense
that
$\lim_{\lVert x \rVert \to \infty} \rho(\lVert x - c_i \rVert) = 0$
i.e., changing parameters of one neuron has only a small effect for input
values that are far away from the center of that neuron.
8. Given certain mild conditions on the shape of the activation function,
RBF networks are universal approximators on a compact subset of $\mathbb{R}^n$.
9. This means that an RBF network with enough hidden neurons can
approximate any continuous function on a closed, bounded set with
arbitrary precision.
10. The parameters $a_i$, $c_i$, and $\beta$ are determined in a manner that optimizes
the fit between $\varphi$ and the data.
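The forward pass of such a network, with Gaussian basis functions, can be sketched in Python as follows. The function name `rbf_forward` and the example centers and weights are hypothetical; fitting the parameters to data is not shown.

```python
import math

def rbf_forward(x, centers, weights, beta=1.0):
    # Output = sum over hidden neurons of a_i * exp(-beta * ||x - c_i||^2).
    total = 0.0
    for c, a in zip(centers, weights):
        squared_distance = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
        total += a * math.exp(-beta * squared_distance)
    return total

centers = [(0.0, 0.0), (3.0, 3.0)]   # hypothetical center vectors c_i
weights = [2.0, -1.0]                # hypothetical linear output weights a_i
at_first_center = rbf_forward((0.0, 0.0), centers, weights)
far_away = rbf_forward((100.0, 100.0), centers, weights)
```

The second call illustrates the locality property: far from every center, each Gaussian term is close to zero, so the output is close to zero as well.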
PART-3
Case-Based Learning.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Answer
Case-based learning algorithm processing stages are :
1. Case retrieval : After the problem situation has been assessed, the
best matching case is searched in the case-base and an approximate
solution is retrieved.
2. Case adaptation : The retrieved solution is adapted to fit better in the
new problem.
Figure : case-based learning cycle (Problem, Retrieve, Revise, Retain; proposed solution, confirmed solution).
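The retrieval and adaptation stages can be illustrated with a small Python sketch. Everything domain-specific here is an assumption made for the example: cases are described by a single numeric feature, the best match is the nearest case, and adaptation rescales the stored solution in proportion to that feature.

```python
import math

def retrieve(case_base, problem):
    # Case retrieval: the best matching case is the nearest stored problem.
    return min(case_base, key=lambda case: math.dist(case["problem"], problem))

def adapt(case, problem):
    # Case adaptation (hypothetical rule): scale the stored solution
    # by the ratio of the new feature value to the stored one.
    ratio = problem[0] / case["problem"][0]
    return case["solution"] * ratio

def retain(case_base, problem, solution):
    # Store the confirmed solution as a new case.
    case_base.append({"problem": problem, "solution": solution})

case_base = [{"problem": (50.0,), "solution": 100.0},
             {"problem": (100.0,), "solution": 210.0}]
new_problem = (90.0,)
best_case = retrieve(case_base, new_problem)
proposed = adapt(best_case, new_problem)
retain(case_base, new_problem, proposed)
```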
Que 4.17. What are the benefits of CBL as a lazy problem solving
method?
Answer
The benefits of CBL as a lazy problem solving method are :
1. Ease of knowledge elicitation :
a. Lazy methods can utilise easily available cases or problem instances
instead of rules that are difficult to extract.
b. So, classical knowledge engineering is replaced by case acquisition
and structuring.
2. Absence of problem-solving bias :
a. Cases can be used for multiple problem-solving purposes, because
they are stored in a raw form.
b. This is in contrast to eager methods, which can be used merely for
the purpose for which the knowledge has already been compiled.
3. Incremental learning :
a. A CBL system can be put into operation with a minimal set of solved
cases furnishing the case base.
b. The case base will be filled with new cases, increasing the system's
problem-solving ability.
c. Besides augmentation of the case base, new indexes and cluster
categories can be created and the existing ones can be changed.
d. In contrast, eager methods require a special training period whenever
information extraction (knowledge generalisation) is performed.
e. Hence, dynamic on-line adaptation to a non-rigid environment is
possible.
Answer
Limitations of CBL are :
1. Handling large case bases :
a. High memory/storage requirements and time-consuming retrieval
accompany CBL systems utilising large case bases.
b. Although the order of both is linear with the number of cases,
these problems usually lead to increased construction costs and
reduced system performance.
c. These problems are less significant as the hardware components
become faster and cheaper.
2. Dynamic problem domains :
a. CBL systems may have difficulties in handling dynamic problem
domains, where they may be unable to follow a shift in the way
problems are solved, since they are strongly biased towards what
has already worked.
b. This may result in an outdated case base.
3. Handling noisy data :
a. Parts of the problem situation may be irrelevant to the problem
itself.
b. Unsuccessful assessment of such noise present in a problem
situation currently imposed on a CBL system may result in the
same problem being unnecessarily stored numerous times in the
case base because of the difference due to the noise.
c. In turn this implies inefficient storage and retrieval of cases.
4. Fully automatic operation :
a. In a CBL system, the problem domain is not fully covered.
b. Hence, some problem situations can occur for which the system
has no solution.
c. In such situations, CBL systems expect input from the user.
Genetic Algorithm
CONTENTS
Part-1 : Genetic Algorithm : An Illustrative Example,
Hypothesis Space Search,
Genetic Programming
Part-2 : Models of Evolution
and Learning
PART-1
Genetic Algorithm : An Illustrative Example, Hypothesis Space
Search, Genetic Programming.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 5.1. Write short note on Genetic algorithm.
Answer
1. Genetic algorithms are computerized search and optimization algorithms
based on the mechanics of natural genetics and natural selection.
2. These algorithms mimic the principle of natural genetics and natural
selection to construct search and optimization procedures.
3. Genetic algorithms convert the design space into genetic space. Design
space is a set of feasible solutions.
4. Genetic algorithms work with a coding of variables.
5. The advantage of working with a coding of variables is that coding
discretizes the search space even though the function may be continuous.
6. Search space is the space of all possible feasible solutions of a particular
problem.
7. Following are the benefits of Genetic algorithm :
a. They are robust.
b. They provide optimization over a large state space.
c. They do not break on a slight change in input or the presence of noise.
8. Following are the applications of Genetic algorithms :
a. Recurrent neural network
b. Mutation testing
c. Code breaking
d. Filtering and signal processing
e. Learning fuzzy rule base
Que 5.2. Write procedure of Genetic algorithm with advantages
and disadvantages.
Answer
Procedure of Genetic algorithm :
1. Generate a set of individuals as the initial population.
2. Use genetic operators such as selection or crossover.
3. Apply mutation or digital reverse if necessary.
4. Evaluate the fitness function of the new population.
5. Use the fitness function for determining the best individuals and replace
predefined members from the original population.
6. Iterate steps 2-5 and terminate when some predefined population
threshold is met.
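The procedure can be sketched in Python as follows. This is an illustrative sketch under several assumptions that the text does not specify: individuals are bit strings, selection is a two-way tournament, the single best individual is carried over unchanged, and the loop stops at a fitness `target` or after a fixed number of generations. The names `run_ga` and `tournament` are hypothetical.

```python
import random

def tournament(population, fitness, rng):
    # Selection: pick two individuals at random and keep the fitter one.
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def run_ga(fitness, length=12, pop_size=20, generations=50,
           p_mut=0.02, target=None, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Steps 4-5: evaluate fitness and identify the best individual.
        best = max(population, key=fitness)
        history.append(fitness(best))
        # Step 6: terminate when the predefined threshold is met.
        if target is not None and fitness(best) >= target:
            break
        children = [best[:]]            # the best individual survives unchanged
        while len(children) < pop_size:
            # Step 2: selection and single-point crossover.
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            point = rng.randint(1, length - 1)
            child = p1[:point] + p2[point:]
            # Step 3: mutation, flipping each bit with a low probability.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness), history

# Toy fitness (assumed): the number of 1s in the string.
best, history = run_ga(sum, target=12)
```

Because the best individual is always carried over, the best fitness recorded in `history` never decreases from one generation to the next.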
Advantages of Genetic algorithm :
1. Genetic algorithms can be executed in parallel. Hence, they can run
faster on parallel hardware.
2. They are useful for solving optimization problems.
Disadvantages of Genetic algorithm :
1. Identification of the fitness function is difficult as it depends on the
problem.
2. The selection of suitable genetic operators is difficult.
Que 5.3. Explain different phases of genetic algorithm.
Answer
Different phases of genetic algorithm are :
1. Initial population :
a. The process begins with a set of individuals which is called a
population.
b. Each individual is a solution to the problem we want to solve.
C. An individual is characterized by a set of parameters (variables)
known as genes.
d Genes are joined into a string to form a chromosome (solution).
e. In a genetic algorithm, the set of genes of an individual is represented
using a string.
f. Usually, binary values are used (string of 1s and 0s).
A1 : 0 0 0 0 0 0
A2 : 1 1 1 1 1 1
A3 : 1 0 1 0 1 1
A4 : 1 1 0 1 1 0
(Each bit is a gene, each row is a chromosome, and the four rows together form the population.)
c. Individuals with high fitness have more chance to be selected for
reproduction.
4. Crossover :
a. Crossover is the most significant phase in a genetic algorithm.
b. For each pair of parents to be mated, a crossover point is chosen at
random from within the genes.
c. For example, consider the crossover point to be 3 as shown :
A1 : 0 0 0 | 0 0 0
A2 : 1 1 1 | 1 1 1
(The bar marks the crossover point.)
d. Offspring are created by exchanging the genes of parents among
themselves until the crossover point is reached.
A5 : 1 1 1 0 0 0
A6 : 0 0 0 1 1 1
5. Mutation :
a. When new offspring are formed, some of their genes can be subjected
to a mutation with a low random probability.
b. This implies that some of the bits in the bit string can be flipped.
Before mutation : A5 : 1 1 1 0 0 0
After mutation : A5 : 1 1 0 1 1 0
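The crossover and mutation examples can be reproduced with a short Python sketch. The function names are hypothetical, and the mutated positions are fixed by hand here only to match the bit strings of the example; in an actual run each bit would be flipped at random with a low probability.

```python
def single_point_crossover(parent1, parent2, point):
    # Exchange the genes of the two parents up to the crossover point.
    child1 = parent2[:point] + parent1[point:]
    child2 = parent1[:point] + parent2[point:]
    return child1, child2

def flip_bits(chromosome, positions):
    # Flip the bits at the given 0-based positions.
    return "".join(
        ("1" if bit == "0" else "0") if i in positions else bit
        for i, bit in enumerate(chromosome)
    )

a1, a2 = "000000", "111111"
a5, a6 = single_point_crossover(a1, a2, 3)
a5_mutated = flip_bits(a5, {2, 3, 4})
```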
Answer
1. Genetic Programming (GP) is a type of Evolutionary Algorithm (EA),
i.e., a subset of machine learning.
2. EAs are used to discover solutions to problems that humans do not
know how to solve.
3. Free of human preconceptions or biases, the adaptive nature of EAs can
generate solutions that are comparable to, and often better than, the
best human efforts.
4. GP software systems implement an algorithm that uses random
mutation, crossover, a fitness function, and multiple generations of
evolution to resolve a user-defined task.
5. GP is used to discover a functional relationship between features in data
(symbolic regression), to group data into categories (classification), and
to assist in the design of electrical circuits, antennae, and quantum
algorithms.
6. GP is applied to software engineering through code synthesis, genetic
improvement, automatic bug-fixing, and in developing game-playing
strategies.
Que 5.6. What are the advantages and disadvantages of Genetic
programming ?
Answer
Advantages of Genetic programming are :
1. It does not impose any fixed length of solution, so the maximum length can
be extended up to hardware limits.
2. In genetic programming, it is not necessary for an individual to have
maximum knowledge of the problem and its solutions.
Disadvantages of Genetic programming are :
1. In GP, the number of possible programs that can be constructed by the
algorithm is immense. This is one of the main reasons why people
thought that it would be impossible to find programs that are good
solutions to a given problem.
2. GP can produce results very quickly when it works on machine code, but
if a high level language that needs to be compiled is used, it can
generate errors and make the program slow.
3. There is a high probability that even a very small variation has a disastrous
effect on the fitness of the solution generated.
Que 5.7. Explain different types of Genetic programming.
Answer
Different types of Genetic programming are:
1. Tree-based Genetic programming :
a. In tree-based GP, the computer programs are represented in tree
structures that are evaluated recursively to produce the resulting
multivariate expressions.
b. Traditional nomenclature states that a tree node (or just node) is
an operator (+, -, *, /) and a terminal node (or leaf) is a variable
(a, b, c, d).
2. Stack-based Genetic programming :
a. In stack-based genetic programming, the programs in the evolving
population are expressed in a stack-based programming language.
b. In stack-based genetic programming, programs are composed of
instructions that take arguments from data stacks and push results
back on data stacks.
c. A separate stack is provided for each data type, and program code
itself can be manipulated on data stacks and subsequently executed.
3. Linear Genetic Programming (LGP) :
a. Linear Genetic Programming (LGP) is a subset of genetic
programming where computer programs in a population are
represented as a sequence of instructions from an imperative
programming language or machine language.
4. Grammatical Evolution (GE) :
a. Grammatical Evolution is a new evolutionary computation technique
used to find an executable program or program fragment that will
achieve a good fitness value for the given objective function.
b. Grammatical Evolution applies genetic operators to an integer string,
subsequently mapped to a program (or similar) through the use of
a grammar.
c. The benefit of GE is that this mapping simplifies the application of
search to different programming languages and other structures.
5. Cartesian Genetic Programming (CGP) :
a. CGP is a highly efficient and flexible form of Genetic programming
that encodes a graph representation of a computer program.
b. CGP represents computational structures (mathematical equations,
circuits, computer programs etc.) as a string of integers.
c. These integers, known as genes, determine the functions of nodes
in the graph, the connections between nodes, the connections to
inputs and the locations in the graph from where outputs are taken.
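As an illustration of the tree-based representation in point 1, the following Python sketch evaluates an expression tree recursively. The tuple encoding of nodes, the function name `evaluate` and the protected division (returning 1.0 when the divisor is zero, a common GP convention) are assumptions made for the example.

```python
import operator

def protected_div(a, b):
    # Assumed convention: avoid division by zero by returning 1.0.
    return a / b if b != 0 else 1.0

OPERATORS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": protected_div}

def evaluate(node, variables):
    # Internal node: (operator, left subtree, right subtree).
    if isinstance(node, tuple):
        op, left, right = node
        return OPERATORS[op](evaluate(left, variables),
                             evaluate(right, variables))
    # Leaf: a variable name looked up in the variable bindings.
    if isinstance(node, str):
        return variables[node]
    # Leaf: a numeric constant.
    return node

# The program (a * b) + (c - d) as a tree.
tree = ("+", ("*", "a", "b"), ("-", "c", "d"))
result = evaluate(tree, {"a": 2, "b": 3, "c": 10, "d": 4})
```

Crossover in tree-based GP would swap subtrees between two such tuples, and mutation would replace a subtree with a randomly generated one; neither is shown here.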
PART-2
Models of Evolution and Learning.
Questions-Answers
Answer
1. The Learnable Evolution Model (LEM) is a non-Darwinian methodology
for evolutionary computation that employs machine learning to guide
the generation of new individuals (candidate problem solutions).
2. Unlike standard, Darwinian-type evolutionary computation methods
that use random or semi-random operators for generating new individuals
(such as mutations and/or recombination), LEM employs hypothesis
generation and instantiation operators.
3. The hypothesis generation operator applies a machine learning program
to induce descriptions that distinguish between high-fitness and
low-fitness individuals in each consecutive population.
4. Such descriptions delineate areas in the search space that most likely
contain the desirable solutions.
5. Subsequently the instantiation operator samples these areas to create
new individuals.
PART-3
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 5.13. Explain the procedure of learn-one-rule algorithm.
Answer
Learn-one-rule (Target_attribute, Attributes, Examples, k)
Returns a single rule that covers some of the examples. Conducts a general-
to-specific greedy beam search for the best rule, guided by the performance
metric.
1. Initialize Best_hypothesis to the most general hypothesis Ø.
2. Initialize Candidate_hypotheses to the set {Best_hypothesis}.
3. While Candidate_hypotheses is not empty, do :
a. Generate the next more specific Candidate_hypotheses :
i. All_constraints ← the set of all constraints of the form (a = v),
where a is a member of Attributes, and v is a value of a that
occurs in the current set of Examples.
ii. New_candidate_hypotheses ←
for each h in Candidate_hypotheses,
for each c in All_constraints,
create a specialization of h by adding the constraint c.
iii. Remove from New_candidate_hypotheses any hypotheses that
are duplicates, inconsistent, or not maximally specific.
b. Update Best_hypothesis :
i. For all h in New_candidate_hypotheses do :
If Performance(h, Examples, Target_attribute)
> Performance(Best_hypothesis, Examples, Target_attribute)
then Best_hypothesis ← h.
c. Update Candidate_hypotheses :
i. Candidate_hypotheses ← the k best members of
New_candidate_hypotheses, according to the Performance
measure.
4. Return a rule of the form "IF Best_hypothesis THEN prediction", where
prediction is the most frequent value of Target_attribute among the
Examples that match Best_hypothesis.
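A compact Python sketch of this beam search follows. It is a simplified illustration, not a definitive implementation: the performance measure is assumed to be the accuracy of the majority class among covered examples, hypotheses are sets of (attribute, value) constraints, hypotheses that cover no example are dropped, and the "not maximally specific" filter is omitted. The function names and the toy data are hypothetical.

```python
from collections import Counter

def covers(hypothesis, example):
    # A hypothesis covers an example if every (attribute, value) constraint matches.
    return all(example[a] == v for a, v in hypothesis)

def performance(hypothesis, examples, target):
    # Assumed metric: fraction of covered examples in the majority class.
    covered = [e for e in examples if covers(hypothesis, e)]
    if not covered:
        return 0.0
    counts = Counter(e[target] for e in covered)
    return counts.most_common(1)[0][1] / len(covered)

def learn_one_rule(target, attributes, examples, k):
    best = frozenset()                      # most general hypothesis
    candidates = {best}
    all_constraints = {(a, e[a]) for e in examples for a in attributes}
    while candidates:
        # Specialize each candidate by one constraint on an unused attribute
        # (using an attribute twice would be a duplicate or inconsistent).
        new_candidates = set()
        for h in candidates:
            used = {a for a, _ in h}
            for a, v in all_constraints:
                if a not in used:
                    new_candidates.add(h | {(a, v)})
        new_candidates = {h for h in new_candidates
                          if any(covers(h, e) for e in examples)}
        # Update the best hypothesis (sorted for a deterministic order).
        for h in sorted(new_candidates, key=sorted):
            if performance(h, examples, target) > performance(best, examples, target):
                best = h
        # Keep the k best members as the next beam.
        ranked = sorted(new_candidates,
                        key=lambda h: (-performance(h, examples, target), sorted(h)))
        candidates = set(ranked[:k])
    covered = [e for e in examples if covers(best, e)]
    prediction = Counter(e[target] for e in covered).most_common(1)[0][0]
    return best, prediction

examples = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
]
rule, prediction = learn_one_rule("play", ["outlook", "wind"], examples, k=2)
```

Pure accuracy favours rules that cover few examples; an entropy-based measure or a minimum-coverage requirement are common alternatives.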
Answer
1. Sequential covering is a general procedure that repeatedly learns a
single rule to create a decision list (or set) that covers the entire dataset
rule by rule.
2. Many rule learning algorithms are variants of the sequential covering
algorithm.
3. This is the most popular algorithm implementing rule learning.
Fig. 5.18.2. 8-puzzle example boards for move 1 and move 2, annotated with heuristic values h(n).
After selecting move 2, we reach the position shown in
Fig. 5.18.3.
Fig. 5.18.3. 8-puzzle example board, annotated with its heuristic value h(n).
PART-4
FOIL, Reinforcement Learning, The Learning Task, Q Learning.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Figure : block diagram of a learning system, its actions, and the heuristic reinforcement signal.
Fig. 5.23.1. (a) A policy $\pi$ for the 4 × 3 world;
(b) The utilities of the states in the 4 × 3 world, given policy $\pi$.
Utilities legible in panel (b) : row 3 : 0.812, 0.868, 0.918, +1; row 2 : 0.762, 0.660, -1.
9. Each state percept is subscripted with the reward received. The object is
to use the information about rewards to learn the expected utility $U^{\pi}(s)$
associated with each non-terminal state s.
10. The utility is defined to be the expected sum of (discounted) rewards
obtained if policy $\pi$ is followed :
$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \mid \pi, s_0 = s\right]$
where $\gamma$ is a discount factor; for the 4 × 3 world we set $\gamma = 1$.
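The definition can be made concrete with a small Python sketch that computes the discounted sum of rewards for one trial and averages it over several trials starting from the same state. The function names and the reward sequences are hypothetical, and averaging observed returns is only one simple way to estimate the expectation.

```python
def discounted_return(rewards, gamma=1.0):
    # Sum of gamma^t * R(s_t) along one observed trial.
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

def estimate_utility(trials, gamma=1.0):
    # Average the observed returns of trials that start in the same state.
    returns = [discounted_return(trial, gamma) for trial in trials]
    return sum(returns) / len(returns)

single = discounted_return([1, 1, 1], gamma=0.5)
average = estimate_utility([[1, 1], [0, 1]], gamma=1.0)
```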
Active reinforcement learning :
1. An active agent must decide what actions to take.
2. First, the agent will need to learn a complete model with outcome
probabilities for all actions, rather than just the model for the fixed policy.
3. We need to take into account the fact that the agent has a choice of
actions.
4. The utilities it needs to learn are those defined by the optimal policy;
they obey the Bellman equations :
$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s') \, U(s')$
5. These equations can be solved to obtain the utility function U using the
value iteration or policy iteration algorithms.
6. Having obtained a utility function U that is optimal for the learned model,
the agent can extract an optimal action by one-step look-ahead to maximize
the expected utility.
7. Alternatively, if it uses policy iteration, the optimal policy is already
available, so it should simply execute the action the optimal policy
recommends.
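Value iteration for the Bellman equations, followed by one-step look-ahead, can be sketched in Python as below. The data layout (nested dictionaries `T[s][a] = {s': probability}`), the function names and the two-state example model are assumptions made for illustration.

```python
def expected_utility(s, a, T, U):
    # Sum over successor states of T(s, a, s') * U(s').
    return sum(p * U[s2] for s2, p in T[s][a].items())

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-9):
    U = {s: 0.0 for s in states}
    while True:
        # Bellman update: U(s) = R(s) + gamma * max_a sum_s' T(s, a, s') U(s').
        new_U = {s: R[s] + gamma * max(expected_utility(s, a, T, U)
                                       for a in actions)
                 for s in states}
        delta = max(abs(new_U[s] - U[s]) for s in states)
        U = new_U
        if delta < tol:
            return U

def best_action(s, actions, T, U):
    # One-step look-ahead: pick the action with the highest expected utility.
    return max(actions, key=lambda a: expected_utility(s, a, T, U))

# Hypothetical deterministic two-state model.
states = ["A", "B"]
actions = ["stay", "go"]
T = {"A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
     "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}}}
R = {"A": 0.0, "B": 1.0}
U = value_iteration(states, actions, T, R, gamma=0.5)
```

In this example the fixed point is U(B) = 2 and U(A) = 1, so the look-ahead chooses "go" in state A and "stay" in state B.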
b. It can be probabilistic.
Que 5.26. Write short note on Q-learning.
Answer
1. Reinforcement learning is the problem faced by an agent that must
learn behaviour through trial-and-error interactions with a dynamic
environment. Q-learning is model-free reinforcement learning, and it is
typically easier to implement.
2. For example, in an electricity market application, each residential load
is defined as an agent. Agents should learn how to participate in the
electrical market and optimize their cost simultaneously.
3. A reinforcement learning algorithm called Q-learning is
employed.
4. When an agent i is modeled by a Q-learning algorithm, it keeps in
memory a function $Q_i : A_i \to \mathbb{R}$ such that $Q_i(a_i)$ represents the
expected reward it believes it will obtain by playing action $a_i$.
5. It then plays with a high probability the action it believes is going to lead
to the highest reward, observes the reward it obtains and uses this
observation to update its estimate of $Q_i$. Suppose that the t-th time the
game is played, the joint action $(a_1, \ldots, a_n)$ represents the actions the
different agents have taken.
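A single agent of this kind can be sketched in Python as follows. The sketch makes assumptions that the text above does not state: the estimate is updated with the common rule Q(a) ← Q(a) + α (r − Q(a)), exploration is epsilon-greedy, and the class name, parameter names and action labels are hypothetical.

```python
import random

class StatelessQAgent:
    def __init__(self, actions, alpha=0.1, epsilon=0.1, seed=0):
        self.q = {a: 0.0 for a in actions}   # estimate Q_i(a_i) for each action
        self.alpha = alpha                   # learning rate (assumed)
        self.epsilon = epsilon               # exploration probability (assumed)
        self.rng = random.Random(seed)

    def choose(self):
        # With high probability, play the action believed to give the highest reward.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, action, reward):
        # Move the estimate towards the observed reward.
        self.q[action] += self.alpha * (reward - self.q[action])

agent = StatelessQAgent(["low", "high"], alpha=0.5, epsilon=0.0)
agent.update("high", 1.0)
chosen = agent.choose()
```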