Machine Learning

This document provides an introduction to machine learning. It describes machine learning as giving systems the ability to automatically learn and improve from experience without explicit programming, outlines the main types of machine learning algorithms (supervised, unsupervised, semi-supervised, and reinforcement learning), and discusses the advantages and disadvantages of each type.


UNIT 1
Introduction

CONTENTS
Part-1 : Introduction, Well Defined Learning Problems, Designing a Learning System, Issues in Machine Learning
Part-2 : The Concept Learning Task : General-to-Specific Ordering of Hypothesis, Find-S, List Then Eliminate Algorithm, Candidate Elimination Algorithm, Inductive Bias

1-1 G (CS/IT/OE-Sem-8)

PART-1
Introduction, Well Defined Learning Problems, Designing a
Learning System, Issues in Machine Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.1. Explain briefly the term machine learning.

Answer
1. Machine learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
2. Machine learning focuses on the development of computer programs that can access data.
3. The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.
4. Machine learning enables analysis of massive quantities of data.
5. It generally delivers faster and more accurate results in order to identify profitable opportunities or dangerous risks.
6. Combining machine learning with AI and cognitive technologies can make it even more effective in processing large volumes of information.
Que 1.2. Describe the different types of machine learning algorithms.
Answer

Different types of machine learning algorithms are :


1. Supervised machine learning algorithms :
a. Supervised learning is defined when the model is trained on a labelled dataset.
b. A labelled dataset has both input and output parameters.
c. In this type of learning, both training and validation datasets are labelled.
2. Unsupervised machine learning algorithms :
a. Unsupervised machine learning is used when the information is neither classified nor labelled.
b. Unsupervised learning studies how systems can infer a function to describe a hidden structure from unlabelled data.
c. The system does not figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures in unlabelled data.
3. Semi-supervised machine learning algorithms :
a. Semi-supervised machine learning algorithms fall in between supervised and unsupervised learning, since they use both labelled and unlabelled data for training.
b. The systems that use this method are able to improve learning accuracy.
c. Semi-supervised learning is chosen when labelled data requires skilled and relevant resources in order to train/learn from it.
4. Reinforcement machine learning algorithms :
a. A reinforcement machine learning algorithm is a learning method that interacts with its environment by producing actions and discovering errors or rewards.
b. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning.
c. This method allows machines and software agents to automatically determine the ideal behaviour within a specific context in order to maximize performance.
d. Simple reward feedback is required for the agent to learn which action is best.
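As a rough illustration (the feature values and states below are invented, not from the text), the four settings differ chiefly in the shape of the data the learner consumes:

```python
# Labelled data: (input, output) pairs; unlabelled data: inputs only.
labelled = [([5.1, 3.5], "A"), ([6.2, 2.9], "B")]
unlabelled = [[5.0, 3.4], [6.1, 3.0]]

# Supervised learning trains on fully labelled data.
supervised_train = labelled

# Unsupervised learning sees only the inputs.
unsupervised_train = unlabelled

# Semi-supervised learning mixes a little labelled with much unlabelled data.
semi_supervised_train = (labelled, unlabelled)

# Reinforcement learning instead consumes (state, action, reward) interactions
# gathered by acting in an environment.
interactions = [("s0", "right", 0.0), ("s1", "right", 1.0)]

print(len(supervised_train), len(unsupervised_train), len(interactions))  # -> 2 2 2
```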

Que 1.3. What are the advantages and disadvantages of different types of machine learning algorithms ?
Answer
Advantages of supervised machine learning algorithm :
1. Classes represent the features on the ground.
2. Training data is reusable unless features change.
Disadvantages of supervised machine learning algorithm :
1. Classes may not match spectral classes.
2. Varying consistency in classes.
3. Cost and time are involved in selecting training data.
Advantages of unsupervised machine learning algorithm :
1. No previous knowledge of the image area is required.
2. The opportunity for human error is minimised.
3. It produces unique spectral classes.
4. It is relatively easy and fast to carry out.

Disadvantages of unsupervised machine learning algorithm :
1. The spectral classes do not necessarily represent the features on the ground.
2. It does not consider spatial relationships in the data.
3. It can take time to interpret the spectral classes.
Advantages of semi-supervised machine learning algorithm :
1. It is easy to understand.
2. It reduces the amount of annotated data used.
3. It is stable and fast convergent.
4. It is simple.
5. It has high efficiency.
Disadvantages of semi-supervised machine learning algorithm :
1. Iteration results are not stable.
2. It is not applicable to network-level data.
3. It has low accuracy.
Advantages of reinforcement learning algorithm :
1. Reinforcement learning is used to solve complex problems that cannot be solved by conventional techniques.
2. This technique is preferred to achieve long-term results which are very difficult to achieve.
3. This learning model is very similar to the learning of human beings. Hence, it is close to achieving perfection.
Disadvantages of reinforcement learning algorithm :
1. Too much reinforcement learning can lead to an overload of states which can diminish the results.
2. Reinforcement learning is not preferable for solving simple problems.
3. Reinforcement learning needs a lot of data and a lot of computation.
4. The curse of dimensionality limits reinforcement learning for real physical systems.
Que 1.4. What are the applications of machine learning ?

Answer

Applications of machine learning are :


1. Image recognition :
a. Image recognition is the process of identifying and detecting an object or a feature in a digital image or video.
b. This is used in many applications like systems for factory automation, toll booth monitoring, and security surveillance.
2. Speech recognition :
a. Speech Recognition (SR) is the translation of spoken words into text.
b. It is also known as Automatic Speech Recognition (ASR), computer speech recognition, or Speech To Text (STT).
c. In speech recognition, a software application recognizes spoken words.
3. Medical diagnosis :
a. ML provides methods, techniques, and tools that can help in solving diagnostic and prognostic problems in a variety of medical domains.
b. It is being used for the analysis of the importance of clinical parameters and their combinations for prognosis.
4. Statistical arbitrage :
a. In finance, statistical arbitrage refers to automated trading strategies that are typically short-term and involve a large number of securities.
b. In such strategies, the user tries to implement a trading algorithm for a set of securities on the basis of quantities such as historical correlations and general economic variables.
5. Learning associations : Learning association is the process of discovering relations between variables in large databases.
6. Extraction :
a. Information Extraction (IE) is another application of machine learning.
b. It is the process of extracting structured information from
unstructured data.

Que 1.5. What are the advantages and disadvantages of machine learning ?
Answer
Advantages of machine learning are :
1. Easily identifies trends and patterns :
a. Machine learning can review large volumes of data and discover specific trends and patterns that would not be apparent to humans.
b. For an e-commerce website like Flipkart, it serves to understand the browsing behaviours and purchase histories of its users to help cater to the right products, deals, and reminders relevant to them.
c. It uses the results to reveal relevant advertisements to them.
2. No human intervention needed (automation) : Machine learning does not require physical force i.e., no human intervention is needed.

3. Continuous improvement :
a. As ML algorithms gain experience, they keep improving in accuracy and efficiency.
b. As the amount of data keeps growing, algorithms learn to make accurate predictions faster.
4. Handling multi-dimensional and multi-variety data : Machine learning algorithms are good at handling data that are multi-dimensional and multi-variety, and they can do this in dynamic or uncertain environments.
Disadvantages of machine learning are :
1. Data acquisition : Machine learning requires massive data sets to train on, and these should be inclusive/unbiased, and of good quality.
2. Time and resources :
a. ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose with a considerable amount of accuracy and relevancy.
b. It also needs massive resources to function.
3. Interpretation of results : To accurately interpret the results generated by the algorithms, we must carefully choose the algorithms for our purpose.
4. High error-susceptibility :
a. Machine learning is autonomous but highly susceptible to errors.
b. It takes time to recognize the source of the issue, and even longer to correct it.

Que 1.6. Write a short note on well defined learning problem with example.
Answer

Well defined learning problem :
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Three features in learning problems :
1. The class of tasks (T)
2. The measure of performance to be improved (P)
3. The source of experience (E)

For example :
1. A checkers learning problem :
a. Task (T) : Playing checkers.
b. Performance measure (P) : Percent of games won against opponents.
c. Training experience (E) : Playing practice games against itself.
2. A handwriting recognition learning problem :
a. Task (T) : Recognizing and classifying handwritten words within images.
b. Performance measure (P) : Percent of words correctly classified.
c. Training experience (E) : A database of handwritten words with given classifications.
3. A robot driving learning problem :
a. Task (T) : Driving on public four-lane highways using vision sensors.
b. Performance measure (P) : Average distance travelled before an error (as judged by a human overseer).
c. Training experience (E) : A sequence of images and steering commands recorded while observing a human driver.
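The (T, P, E) examples above can be captured in a small sketch; the class and field names are my own illustrative choices, not from the text:

```python
from dataclasses import dataclass

@dataclass
class LearningProblem:
    task: str                  # T : the class of tasks
    performance_measure: str   # P : how improvement is measured
    experience: str            # E : the source of training experience

# The checkers example from the text, encoded as a LearningProblem.
checkers = LearningProblem(
    task="Playing checkers",
    performance_measure="Percent of games won against opponents",
    experience="Playing practice games against itself",
)

print(checkers.task)  # -> Playing checkers
```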
Que 1.7. Describe the role of well defined learning problems in machine learning.
Answer
The role of well defined learning problems in machine learning :
1. Learning to recognize spoken words :
a. Successful speech recognition systems employ machine learning in some form.
b. For example, the SPHINX system learns speaker-specific strategies for recognizing the primitive sounds (phonemes) and words from the observed speech signal.
c. Neural network learning methods and methods for learning hidden Markov models are effective for automatically customizing to individual speakers, vocabularies, microphone characteristics, background noise, etc.
d. Similar techniques have potential applications in many signal interpretation problems.
2. Learning to drive an autonomous vehicle :
a. Machine learning methods have been used to train computer-controlled vehicles to steer correctly when driving on a variety of road types.

b. For example, the ALVINN system has used its learned strategies to drive unassisted at 70 miles per hour for 90 miles on public highways among other cars.
c. Similar techniques have possible applications in many sensor-based control problems.
3. Learning to classify new astronomical structures :
a. Machine learning methods have been applied to a variety of large databases to learn general regularities implicit in the data.
b. For example, decision tree learning algorithms have been used by NASA to learn how to classify celestial objects from the second Palomar Observatory Sky Survey.
c. This system is used to automatically classify all objects in the Sky Survey, which consists of three terabytes of image data.
4. Learning to play world class backgammon :
a. The most successful computer programs for playing games such as backgammon are based on machine learning algorithms.
b. For example, the world's top computer program for backgammon, TD-GAMMON, learned its strategy by playing over one million practice games against itself.
c. It now plays at a level competitive with the human world champion.
d. Similar techniques have applications in many practical problems where large search spaces must be examined efficiently.

Que 1.8. What is learning ? Explain the important components of learning.
Answer
1. Learning refers to the change in a subject's behaviour to a given situation brought by repeated experiences in that situation, provided that the behaviour changes cannot be explained on the basis of native response tendencies, maturation or temporary states of the subject.
2. Learning agent can be thought of as containing a performance element
that decides what actions to take and a learning element that modifies
the performance element so that it makes better decisions.
3. The design of a learning element is affected by three major issues :
a. Components of the performance element.
b. Feedback of components.
C. Representation of the components.

The important components of learning are :
(Diagram : the environment or teacher supplies stimuli/examples to the learner component, which updates a knowledge base; the performance component produces responses to tasks, and a critic/performance evaluator feeds feedback back to the learner.)
Fig. 1.8.1. General learning model.
1. Acquisition of new knowledge:
a. One component of learning is the acquisition of new knowledge.
b. Simple data acquisition is easy for computers, even though it is
difficult for people.
2. Problem solving :
The other component of learning is the problem solving that is required both to integrate new knowledge into the system when it is presented and to deduce new information when required facts are not presented.

Que 1.9. Differentiate between artificial intelligence and machine learning.
Answer

S. No. | Artificial Intelligence (AI) | Machine Learning (ML)
1. | AI is human intelligence demonstrated by machines to perform simple to complex tasks. | It provides machines the ability to learn and understand without being explicitly programmed.
2. | The idea behind AI is to program machines to carry out tasks in human ways or smart ways. | The idea behind ML is to teach computers to think and understand like humans.
3. | It is based on characteristics of human intelligence. | It is based on the system of probability.
4. | It is used in healthcare, finance, transportation, marketing, media, education, etc. | It is used for optical character recognition, web security, imitation learning, etc.

Que 1.10. What are the steps used to design a learning system ?

Answer
Steps used to design a learning system are :
1. Specify the learning task.
2. Choose a suitable set of training data to serve as the training experience.
3. Divide the training data into groups or classes and label them accordingly.
4. Determine the type of knowledge representation to be learned from the training experience.
5. Choose a learner classifier that can generate general hypotheses from the training data.
6. Apply the learner classifier to test data.
7. Compare the performance of the system with that of an expert human.
(Diagram : the environment/experience feeds the learner, which produces knowledge used by the performance element.)
Fig. 1.10.1.
Que 1.11. How do we split data in machine learning ?

Answer
Data is split in three ways in machine learning :
1. Training data :
a. The part of the data we use to train our model.
b. This is the data which our model actually sees (both input and output) and learns from.
2. Validation data :
a. The part of the data which is used to do a frequent evaluation of the model as it is fit on the training dataset, along with improving the involved hyperparameters (parameters set before the model begins learning).
b. This data plays its part when the model is actually training.
3. Testing data :
a. Once our model is completely trained, testing data provides the unbiased evaluation.
b. When we feed in the inputs of testing data, our model will predict some values without seeing the actual output.
c. After prediction, we evaluate our model by comparing it with the actual output present in the testing data.
d. This is how we evaluate and see how much our model has learned from the experiences fed in as training data, set at the time of training.
(Diagram : data in machine learning divides into training data, validation data and testing data.)
Fig. 1.11.1.
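The three-way split can be sketched as below; the 60/20/20 ratio, the shuffling step and the helper name are assumptions for illustration, not prescribed by the text:

```python
import random

def split_data(data, train=0.6, val=0.2, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)      # shuffle so each split is representative
    n_train = int(len(data) * train)
    n_val = int(len(data) * val)
    return (data[:n_train],                # training data: seen by the model
            data[n_train:n_train + n_val], # validation data: used while tuning
            data[n_train + n_val:])        # testing data: unbiased final check

train_set, val_set, test_set = split_data(range(100))
print(len(train_set), len(val_set), len(test_set))  # -> 60 20 20
```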
Que 1.12. Describe the terminologies used in machine learning.
Answer
Terminologies used in machine learning are :
1. Features : A set of variables that carry discriminating and characterizing information about the objects under consideration.
2. Feature vector : A collection of d features, ordered in a meaningful way into a d-dimensional column vector that represents the signature of the object to be identified.
3. Feature space : Feature space is the d-dimensional space in which the feature vectors lie. A d-dimensional vector in a d-dimensional space constitutes a point in that space.
x = [x1, x2, ..., xd]^T, where xi denotes feature i
(Diagram : a feature vector shown as a point in a 3-D feature space.)
Fig. 1.12.1.

4. Class : The category to which a given object belongs, denoted by ω.
5. Decision boundary : A boundary in the d-dimensional feature space that separates patterns of different classes from each other.
6. Classifier : An algorithm which adjusts its parameters to find the correct decision boundaries through a learning algorithm using a training dataset such that a cost function is minimized.
7. Error : Incorrect labelling of the data by the classifier.
8. Training performance : The ability/performance of the classifier in correctly identifying the classes of the training data, which it has already seen. It is not a good indicator of the generalization performance.
9. Generalization (Test performance) : Generalization is the ability/performance of the classifier in identifying the classes of previously unseen data.
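A toy sketch (data invented) that ties these terms together: each object is a 2-dimensional feature vector, a nearest-class-mean rule plays the part of the classifier, and the points equidistant from the two class means form its decision boundary:

```python
import math

training_data = {                       # feature vectors grouped by class
    "small": [(1.0, 1.2), (1.1, 0.9)],
    "large": [(4.0, 4.1), (3.9, 4.2)],
}

def class_mean(vectors):
    d = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(d))

means = {c: class_mean(vs) for c, vs in training_data.items()}

def classify(x):
    # Assign the class whose mean is nearest; a wrong label here is an error.
    return min(means, key=lambda c: math.dist(x, means[c]))

print(classify((1.0, 1.0)))  # -> small
print(classify((4.2, 3.8)))  # -> large
```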

Que 1.13. Explain the components of machine learning system.

Answer
Components of machine learning system are :
1. Sensing :
a. It uses a transducer such as a camera or microphone for input.
b. A PR (Pattern Recognition) system depends on the bandwidth, resolution, sensitivity, distortion, etc., of the transducer.
2. Segmentation : Patterns should be well separated and should not overlap.
3. Feature extraction :
a. It is used for extracting distinguishing features.
b. This process extracts features invariant with respect to translation, rotation and scale.
4. Classification :
a. It uses a feature vector provided by a feature extractor to assign the object to a category.
b. It is not always possible to determine the values of all the features.
5. Post processing :
a. The post processor uses the output of the classifier to decide on the recommended action.
(Diagram : input → sensing → segmentation → feature extraction → classification → post-processing → decision; adjustments for missing features feed classification, and adjustments for context and costs feed post-processing.)
Fig. 1.13.1.

Que 1.14. What are the classes of problems in machine learning ?


Answer

Common classes of problems in machine learning :


1. Classification :
a. In classification, data is labelled i.e., it is assigned a class, for example, spam/non-spam or fraud/non-fraud.
b. The decision being modelled is to assign labels to new unlabelled pieces of data.
c. This can be thought of as a discrimination problem, modelling the differences or similarities between groups.
2. Regression :
a. In regression, data is labelled with a real value rather than a label.
b. The decision being modelled is what value to predict for new unpredicted data.
3. Clustering :
a. In clustering, data is not labelled, but can be divided into groups based on similarity and other measures of natural structure in the data.
b. For example, organising pictures by faces without names, where the human user has to assign names to groups, like iPhoto on the Mac.
4. Rule extraction :
a. In rule extraction, data is used as the basis for the extraction of propositional rules.
b. These rules discover statistically supportable relationships between attributes in the data.

Que 1.15. Briefly explain the issues related with machine learning.
Answer

Issues related with machine learning are:


1. Data quality :
a. It is essential to have good quality data to produce quality ML
algorithms and models.
b. To get high-quality data, we must implement data evaluation,
integration, exploration, and governance techniques prior to
developing ML models.
C. Accuracy of ML is driven by the quality of the data.
2. Transparency : It is difficult to make definitive statements on how well a model is going to generalize in new environments.
3. Manpower :
a. Manpower means having data and being able to use it without introducing bias into the model.
b There should be enough skill sets in the organization for software
development and data collection.
4. Other :
a. The most common issue with ML is people using it where it does
not belong.
b. Every time there is some new innovation in ML, we see overzealous
engineers trying to use it where it's not really necessary.
C. This used to happen a lot with deep learning and neural networks.
d. Traceability and reproduction of results are two main issues.

PART-2
The Concept Learning Task, General-to-Specific Ordering of Hypothesis, Find-S, List Then Eliminate Algorithm, Candidate Elimination Algorithm, Inductive Bias.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 1.16. Write short note on concept learning task.


Answer
1. The concept learning task, in contrast to learning from observations, can be described as being given a set of training data {(A(1), C(1)), (A(2), C(2)), ..., (A(n), C(n))}, where each A(i) = [A_1, A_2, ..., A_m] with A_j ∈ A represents the observable part of the data (here denoted as a vector of attributes in the common formalism) and each C(i) represents a valuation of this data.

2. If a functional relationship between the A and C values is to be discovered, this task is either called regression (in the statistics domain) or supervised learning (in the machine learning domain).
3. The more special case where the C values are restricted to some finite set C is called classification or concept learning in computational learning theory.
4. The classical approach to concept learning is concerned with learning concept descriptions for predefined classes C_i of entities from E.
5. A concept is regarded as a function mapping attribute values A_i of discrete attributes to a Boolean value indicating concept membership.
6. In this case, the set of entities E is defined by the outer product over the range of the considered attributes in A.
7. Concepts are described as hypotheses, i.e., the conjunction of restrictions on allowed attribute values, like allowing just one specific value, a set of values, or any value for an attribute.
8. The task of classical concept learning consists of finding a hypothesis for each class C_i that matches the training data.
9. This task can be performed as a directed search in hypothesis space by exploiting a pre-existing ordering relation, called general-to-specific ordering of hypotheses.
10. A hypothesis thereby is more general than another if its set of allowed instances is a superset of the set of instances belonging to the other hypothesis.
Que 1.17. Define the term concept and concept learning. How can
we represent a concept ?

Answer
Concept : A concept is a Boolean-valued function defined over a large set of objects or events.
Concept learning : Concept learning is defined as inferring a Boolean-valued function from training examples of the input and output of the function.
Concept learning can be represented using :
1. Instance x : An instance x is a collection of attributes (Sky, AirTemp, Humidity, etc.).
2. Target function c : EnjoySport : X → {0, 1}
3. Hypothesis h : A hypothesis h is a conjunction of constraints on the attributes. A constraint can be :
a. a specific value (for example, Water = Warm)
b. do not care (for example, Water = ?)
c. no value allowed (for example, Water = ∅)
4. Training example d : An instance x paired with the target function c, where
c(x) = 0 for a negative example, and
c(x) = 1 for a positive example.
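The constraint types above can be sketched as a small match function: a literal value must match exactly, '?' accepts anything, and the empty-set constraint ∅ is represented here (my own convention) by None, which never matches:

```python
def satisfies(hypothesis, instance):
    # A conjunction of per-attribute constraints: all must hold.
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))

h = ("Sunny", "Warm", "?", "Strong", "?", "?")
x_pos = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
x_neg = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")

print(satisfies(h, x_pos))  # -> True
print(satisfies(h, x_neg))  # -> False
```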

Que 1.18. Explain the working of find-S algorithm with flow chart.
Answer
Working of find-S algorithm :
1. The process starts with initializing h with the most specific hypothesis; generally, it is the first positive example in the data set.
2. We check each example. If the example is negative, we move on to the next example, but if it is a positive example, we consider it for the next step.
3. We check whether each attribute in the example is equal to the hypothesis value.
4. If the value matches, then no changes are made.
5. If the value does not match, the value is changed to '?'.
6. We do this until we reach the last positive example in the data set.

(Flow chart : initialize h → identify a positive example → check each attribute; if the attribute value equals the hypothesis value, keep it, otherwise replace the value with '?'; repeat until the last positive example.)
Fig. 1.18.1.

Que 1.19. Write find-S algorithm, with its advantages and disadvantages.
Answer
Find-S algorithm :
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x :
   For each attribute constraint a_i in h :
      If the constraint a_i in h is satisfied by x, then do nothing ;
      else replace a_i in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
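A minimal Python rendering of the algorithm above, run on the classic EnjoySport-style examples (attribute order as in Que 1.17); None stands in for the empty-set constraint and '?' means any value:

```python
def find_s(examples):
    n = len(examples[0][0])
    h = [None] * n                      # most specific hypothesis in H
    for x, label in examples:
        if label != "Yes":              # Find-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:            # first positive example: copy it
                h[i] = value
            elif h[i] != value:         # mismatch: generalize to '?'
                h[i] = "?"
    return h

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]

print(find_s(examples))  # -> ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```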

Advantages of Find-S algorithm :
1. If the correct target concept is contained in H and the training data are correct, the Find-S algorithm guarantees that the output is the most specific hypothesis in H that is consistent with the positive examples.
Disadvantages of Find-S algorithm :
1. There is no way to determine if the hypothesis is consistent throughout the data.
2. Inconsistent training sets can mislead the Find-S algorithm, since it ignores the negative examples.
3. The Find-S algorithm does not provide a backtracking technique to determine the best possible changes that could be made to improve the resulting hypothesis.
4. The convergence of the learning process is poor, and convergence to the correct objective function cannot be guaranteed.
5. The robustness to noise is weak, and, for a number of special assumptions, the algorithm becomes powerless.
Que 1.20. What is version space ? Explain list-then-eliminate
algorithm.
Answer
Version space :
1. A hypothesis h is consistent with a set of training examples D of target concept c if and only if h(x) = c(x) for each training example <x, c(x)> in D :
   Consistent(h, D) ≡ (∀ <x, c(x)> ∈ D) h(x) = c(x)
2. The version space, VS_{H,D}, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with all training examples in D :
   VS_{H,D} ≡ {h ∈ H | Consistent(h, D)}
List-then-eliminate algorithm :
1. The List-Then-Eliminate algorithm initializes the version space to contain all hypotheses in H, then eliminates any hypothesis that is inconsistent with a training example.
2. The version space of candidate hypotheses thus shrinks as more examples are observed, until one hypothesis remains that is consistent with all the observed examples.
a. Presumably this is the desired target concept.
b. If insufficient data is available to narrow the version space to a single hypothesis, then the algorithm can output the entire set of hypotheses consistent with the observed data.
3. The List-Then-Eliminate algorithm can be applied whenever the hypothesis space H is finite.
a. It has many advantages, including the fact that it is guaranteed to output all hypotheses consistent with the training data.
b. However, it requires exhaustively enumerating all hypotheses in H - an unrealistic requirement for all but the most trivial hypothesis spaces.
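For a deliberately tiny, finite hypothesis space over two invented attributes, the two steps above can be sketched directly (conjunctive hypotheses only; each slot is a value or '?'):

```python
from itertools import product

domains = [("Sunny", "Rainy"), ("Warm", "Cold")]
examples = [(("Sunny", "Warm"), 1), (("Rainy", "Cold"), 0)]

def predict(h, x):
    return int(all(hi == "?" or hi == xi for hi, xi in zip(h, x)))

# 1. List every hypothesis in the finite space H.
H = list(product(*[d + ("?",) for d in domains]))

# 2. Eliminate every hypothesis inconsistent with some training example.
version_space = [h for h in H
                 if all(predict(h, x) == c for x, c in examples)]

print(sorted(version_space))
# -> [('?', 'Warm'), ('Sunny', '?'), ('Sunny', 'Warm')]
```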
Que 1.21. Explain candidate elimination algorithm with its
procedure.

Answer
1. The Candidate-Elimination algorithm computes the version space containing all hypotheses from H that are consistent with an observed sequence of training examples.
2. It begins by initializing the version space to the set of all hypotheses in H, that is, by initializing the G boundary set to contain the most general hypothesis in H,
   G0 = {<?, ?, ?, ?, ?, ?>}
and initializing the S boundary set to contain the most specific hypothesis,
   S0 = {<∅, ∅, ∅, ∅, ∅, ∅>}
3. These two boundary sets delimit the entire hypothesis space, because every other hypothesis in H is both more general than S0 and more specific than G0.
4. As each training example is considered, the S and G boundary sets are generalized and specialized, respectively, to eliminate from the version space any hypotheses found inconsistent with the new training example.
5. After all examples have been processed, the computed version space contains all the hypotheses consistent with these examples, and only these hypotheses.
Algorithm :
1. Initialize G to the set of maximally general hypotheses in H.
2. Initialize S to the set of maximally specific hypotheses in H.
3. For each positive training example d :
a. Remove from G any hypothesis that does not include d.
b. For each hypothesis s in S that does not include d :
   i. Remove s from S.
   ii. Add to S all minimal generalizations h of s such that h includes d, and some member of G is more general than h.
   iii. Remove from S any hypothesis that is more general than another hypothesis in S.
4. For each negative training example d :
a. Remove from S any hypothesis that does include d.
b. For each hypothesis g in G that does include d :
   i. Remove g from G.
   ii. Add to G all minimal specializations h of g such that h does not include d, and some member of S is more specific than h.
   iii. Remove from G any hypothesis that is less general than another hypothesis in G.
5. If G or S ever becomes empty, the data is not consistent (with H).
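Using the same conventions ('?' for any value, None for the empty constraint), the steps above can be sketched compactly; this is my simplified rendering (attribute domains are read off the data, and duplicate pruning is minimal), checked on the classic EnjoySport examples:

```python
def covers(h, x):
    return all(hi == "?" or hi == xi for hi, xi in zip(h, x))

def more_general_or_equal(h1, h2):
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples):
    n = len(examples[0][0])
    domains = [sorted({x[i] for x, _ in examples}) for i in range(n)]
    S = [tuple([None] * n)]          # most specific boundary
    G = [tuple(["?"] * n)]           # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                else:
                    # minimal generalization of s that covers x
                    h = tuple(xi if si is None else (si if si == xi else "?")
                              for si, xi in zip(s, x))
                    if any(more_general_or_equal(g, h) for g in G):
                        new_S.append(h)
            # drop members of S more general than another member
            S = [s for s in new_S
                 if not any(t != s and more_general_or_equal(s, t) for t in new_S)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                else:
                    # minimal specializations of g that exclude x
                    for i in range(n):
                        if g[i] == "?":
                            for v in domains[i]:
                                if v != x[i]:
                                    h = g[:i] + (v,) + g[i + 1:]
                                    if any(more_general_or_equal(h, s) for s in S):
                                        new_G.append(h)
            # drop members of G less general than another member
            G = [g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
    return S, G

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

S, G = candidate_elimination(examples)
print(S)          # -> [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(sorted(G))  # -> [('?', 'Warm', '?', '?', '?', '?'), ('Sunny', '?', '?', '?', '?', '?')]
```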

Que 1.22. Explain inductive bias with inductive system.


Answer
Inductive bias:
1. Inductive bias refers to the restrictions that are imposed by the
assumptions made in the learning method.
2. For example, assuming that the solution to the problem of road safety
can be expressed as a conjunction of a set of eight concepts.
3 This does not allow for more complex expressions that cannot be
expressed as a conjunction.
4. This inductive bias means that there are some potential solutions that
we cannot explore, and not contained within the version space we
examine.

5. Order to have an unbiased learner, the version space would have to


contain every possible hypothesis that could possibly be expressed.
6. The solution that the learner produced could never be more general
than the complete set of training data.
7. In other words, it would be able to classify data that it had previously
seen (as the rote learner could) but would be unable to generalize in
order to classify new, unseen data.
8. The inductive bias of the candidate elimination algorithm is that it is
only able to classify a new piece of data if all the hypotheses contained
within its version space give the data the same classification.
9. Hence, the inductive bias does impose a limitation on the learning method.
Inductive system :
The candidate elimination algorithm takes the training examples and a new
instance as input and, using hypothesis space H, outputs a classification of
the new instance (or reports that it does not know).

Fig. 1.22.1.

Que 1.23. Explain inductive learning algorithm.


Answer
Inductive learning algorithm :
Step 1 : Divide the table T containing m examples into n sub-tables
(t1, t2, ..., tn), one table for each possible value of the class attribute
(repeat steps 2-8 for each sub-table).
Step 2 : Initialize the attribute combination count j = 1.
Step 3 : For the sub-table on which work is going on, divide the attribute list
into distinct combinations, each combination with j distinct attributes.
Step 4 : For each combination of attributes, count the number of occurrences
of attribute values that appear under the same combination of attributes in
unmarked rows of the sub-table under consideration, and at the same time do
not appear under the same combination of attributes of other sub-tables.
Call the first combination with the maximum number of occurrences the
max-combination MAX.
Step 5 : If MAX == null, increase j by 1 and go to Step 3.
Step 6 : Mark all rows of the sub-table being worked on, in which the values
of MAX appear, as classified.
Step 7 : Add a rule (IF attribute = "XYZ" THEN decision is YES/NO) to R
(rule set) whose left-hand side will have attribute names of the MAX with
their values separated by AND, and whose right-hand side contains the decision
attribute value associated with the sub-table.
Step 8 : If all rows are marked as classified, then move on to process another
sub-table and go to Step 2; else, go to Step 4. If no sub-tables are available,
exit with the set of rules obtained till then.

Que 1.24. What are the learning algorithms used in inductive bias ?
Answer
Learning algorithms used in inductive bias are :
1. Rote-learner :
   a. Learning corresponds to storing each observed training example in
      memory.
   b. Subsequent instances are classified by looking them up in memory.
   c. If the instance is found in memory, the stored classification is
      returned.
   d. Otherwise, the system refuses to classify the new instance.
   e. Inductive bias : There is no inductive bias.
2. Candidate-elimination :
   a. New instances are classified only in the case where all members of
      the current version space agree on the classification.
   b. Otherwise, the system refuses to classify the new instance.
   c. Inductive bias : The target concept can be represented in its
      hypothesis space.
3. FIND-S :
   a. This algorithm finds the most specific hypothesis consistent with
      the training examples.
   b. It then uses this hypothesis to classify all subsequent instances.
   c. Inductive bias : The target concept can be represented in its
      hypothesis space, and all instances are negative instances unless
      the opposite is entailed by its other knowledge.
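FIND-S, as described in point 3, can be sketched compactly: negative examples are ignored, and the most specific hypothesis is minimally generalized on each positive example. The tuple representation with '?' as a wildcard, and the toy data in the usage, are illustrative assumptions.

```python
def find_s(examples):
    # examples: list of (attribute_tuple, label) pairs; labels are True/False
    positives = [x for x, label in examples if label]
    h = list(positives[0])         # start from the first positive example
    for x in positives[1:]:
        for i, (hv, xv) in enumerate(zip(h, x)):
            if hv != xv:
                h[i] = '?'         # minimally generalize on a mismatch
    return tuple(h)
```

For example, two positive examples differing only in the third attribute yield the hypothesis ('Sunny', 'Warm', '?').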

Que 1.25. Differentiate between supervised and unsupervised


learning.

Answer
1. Supervised : Supervised learning is also known as associative learning,
   in which the network is trained by providing it with input and matching
   output patterns.
   Unsupervised : Unsupervised learning is also known as self-organization,
   in which an output unit is trained to respond to clusters of patterns
   within the input.
2. Supervised : Supervised training requires the pairing of each input
   vector with a target vector representing the desired output.
   Unsupervised : Unsupervised training is employed in self-organizing
   neural networks.
3. Supervised : During the training session, an input vector is applied to
   the network, and it results in an output vector. This response is then
   compared with the target response.
   Unsupervised : During training, the neural network receives input
   patterns and organizes these patterns into categories. When a new input
   pattern is applied, the neural network provides an output response
   indicating the class to which the input pattern belongs.
4. Supervised : If the actual response differs from the target response,
   the network will generate an error signal.
   Unsupervised : If a class cannot be found for the input pattern, a new
   class is generated.
5. Supervised : The error minimization in this kind of training requires a
   supervisor or teacher. These input-output pairs can be provided by an
   external teacher, or by the system which contains the neural network.
   Unsupervised : Unsupervised training does not require a teacher; it
   requires certain guidelines to form groups. Grouping can be done based
   on colour, shape, and any other property of the object.
6. Supervised : Supervised training methods are used to perform non-linear
   mapping in pattern classification networks, pattern association networks
   and multi-layer neural networks.
   Unsupervised : Unsupervised learning is useful for data compression and
   clustering.
7. Supervised : Supervised learning generates a global model and a local
   model.
   Unsupervised : In this, a system is supposed to discover statistically
   salient features of the input population.
UNIT
2
Decision Tree Learning

CONTENTS
Part-1 : Decision Tree Learning, Decision Tree Learning Algorithm,
         Inductive Bias, Issues in Decision Tree Learning
Part-2 : Artificial Neural Network, Perceptrons, Gradient Descent and
         the Delta Rule, Adaline
Part-3 : Multilayer Network, Derivation of Backpropagation Rule,
         Backpropagation Algorithm, Convergence, Generalization

PART-1
Decision Tree Learning : Decision Tree Learning Algorithm,
Inductive Bias, Issues in Decision Tree Learning.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 2.1. Explain decision tree in detail.


Answer
1 A decision tree is a flowchart structure in which each internal node
represents a test on a feature, each leaf node represents a class label
and branches represent conjunctions of features that lead to those class
labels.
2. The paths from root to leaf represent classification rules.
3. Fig. 2.1.1 illustrates the basic flow of a decision tree for decision
   making with labels Rain(Yes) and Rain(No).

   Outlook
   ├─ Sunny → Humidity
   │    ├─ High → No
   │    └─ Normal → Yes
   ├─ Overcast → Yes
   └─ Rain → Wind
        ├─ Strong → No
        └─ Weak → Yes

   Fig. 2.1.1.
4. Decision tree is the predictive modelling approach used in statistics, data
mining and machine learning.
5. Decision trees are constructed via an algorithmic approach that identifies
the ways to split a data set based on different conditions.
6. Decision trees are a non-parametric supervised learning method used
for both classification and regression tasks.
7. Classification trees are the tree models where the target variable can
   take a discrete set of values.
8. Regression trees are the decision trees where the target variable can
   take a continuous set of values.

Que 2.2. What are the steps used for making decision tree ?

Answer

Steps used for making decision tree are :
1. Get the list of rows (dataset) which are taken into consideration for
   making the decision tree (recursively at each node).
2. Calculate the uncertainty of our dataset, or Gini impurity, or how much
   our data is mixed up, etc.
3. Generate the list of all questions which need to be asked at that node.
4. Partition rows into True rows and False rows based on each question
   asked.
5. Calculate information gain based on Gini impurity and partition of data
   from previous step.
6. Update highest information gain based on each question asked.
7. Update question based on information gain (higher information gain).
8. Divide the node on question. Repeat again from step 1 until we get pure
   node (leaf nodes).

Que 2.3. Write short note on Gini impurity and Gini impurity
index.

Answer

1. Gini impurity is the probability of incorrectly classifying a randomly
   chosen element in the dataset if it were randomly labeled according to
   the class distribution in the dataset.
2. The Gini impurity index measures the impurity of an input feature with
   respect to the classes.
3. The Gini impurity index reaches its minimum (zero) when all attributes in
   the node fall into a single information class.
4. The Gini index associated with attribute X = [x_1, x_2, ..., x_m] for node
   t is denoted by I(t, x) and is expressed as

      I(t, x) = 1 - Σ_{j=1}^{m} f(t, j)^2

   where f(t, j) is the proportion of samples with the value x belonging to
   class j at node t.
5. The decision tree splitting criterion is based on choosing the attribute
   with the lowest impurity index of the split.
6. Let node t be split into k children, let n_i be the number of records at
   child i, and let N be the total number of samples at node t. The Gini
   impurity index of the split at node t for attribute X is then computed by

      I_split(t, X) = Σ_{i=1}^{k} (n_i / N) I(t_i, x)
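The two expressions above can be sketched directly: the node index is computed from class-label proportions, and the split index is the record-weighted sum over the children. The class labels in the usage are illustrative assumptions.

```python
from collections import Counter

def gini_index(labels):
    # I(t) = 1 - sum_j f(t, j)^2, where f(t, j) is the class-j proportion
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(children):
    # weighted Gini index of a split: sum_i (n_i / N) * I(t_i)
    total = sum(len(ch) for ch in children)
    return sum(len(ch) / total * gini_index(ch) for ch in children)
```

A node with two 'Yes' and two 'No' labels has Gini index 0.5; splitting it into two pure children gives a split index of 0, the best possible split.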

Que 2.4. What are the advantages and disadvantages of decision tree method ?

Answer

Advantages of decision tree method are :
1. Decision trees are able to generate understandable rules.
2. Decision trees perform classification without requiring much computation.
3. Decision trees are able to handle both continuous and categorical
   variables.
4. Decision trees provide a clear indication of the fields that are important
   for prediction or classification.
Disadvantages of decision tree method are :
1. Decision trees are less appropriate for estimation tasks where the goal
   is to predict the value of a continuous attribute.
2. Decision trees are prone to errors in classification problems with many
   classes and a relatively small number of training examples.
3. Decision trees are computationally expensive to train. At each node,
   each candidate splitting field must be sorted before its best split can be
   found.
4. In decision tree algorithms, combinations of fields are used and a search
   must be made for optimal combining weights. Pruning algorithms can
   also be expensive since many candidate sub-trees must be formed and
   compared.
Que 2.5. How to avoid overfitting in decision tree model ?

Answer
1. Overfitting is the phenomenon in which the learning system tightly fits
   the given training data so that it would be inaccurate in predicting the
   outcomes of the untrained data.
2. In decision trees, overfitting occurs when the tree is designed to
   perfectly fit all samples in the training data set.
3. To avoid the decision tree from overfitting, we remove the branches that
   make use of features having low importance. This method is called
   pruning or post-pruning.
4. It reduces the complexity of the tree, and hence improves predictive
   accuracy by the reduction of overfitting.
5. Pruning should reduce the size of a learning tree without reducing
   predictive accuracy as measured by a cross-validation set.
6. There are two major pruning techniques :
   a. Minimum error : The tree is pruned back to the point where the
      cross-validated error is minimum.
   b. Smallest tree : The tree is pruned back slightly further than the
      minimum error. Pruning creates a decision tree with cross-validation
      error within 1 standard error of the minimum error. The smaller
      tree is more intelligible at the cost of a small increase in error.
7. Another method to prevent overfitting is to try and stop the
   tree-building process early, before it produces leaves with very small
   samples. This heuristic is known as early stopping or pre-pruning
   decision trees.
8. At each stage of splitting the tree, we check the cross-validation error.
   If the error does not decrease significantly enough, then we stop.
9. Early stopping is a quick fix heuristic. If early stopping is used together
   with pruning, it can save time.

Que 2.6. How can we express decision trees ?

Answer
1. Decision trees classify instances by sorting them down the tree from the
   root to a leaf node, which provides the classification of the instance.
2. An instance is classified by starting at the root node of the tree, testing
the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute as shown in Fig. 2.6.1.
3. This process is then repeated for the subtree rooted at the new node.
4. The decision tree in Fig. 2.6.1 classifies a particular morning according
   to whether it is suitable for playing tennis, returning the classification
   associated with the particular leaf.
5. For example, the instance
(Outlook =Rain, Temperature =Hot, Humidity =High, Wind = Strong)
would be sorted down the left most branch of this decision tree and
would therefore be classified as a negative instance.
6. In other words, decision tree represent a disjunction of conjunctions of
constraints on the attribute values of instances.

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨
(Outlook = Rain ∧ Wind = Weak)

   Outlook
   ├─ Sunny → Humidity
   │    ├─ High → No
   │    └─ Normal → Yes
   ├─ Overcast → Yes
   └─ Rain → Wind
        ├─ Strong → No
        └─ Weak → Yes

   Fig. 2.6.1.

Que 2.7. Discuss the issues related to the applications of decision


trees.

Answer
Issues related to the applications of decision trees are:
1. Missing data:
a. When values have gone unrecorded, or they might be too expensive
to obtain.
b. Two problems arise :
   i. To classify an object when some of the test attributes are missing.
   ii. To modify the information gain formula when examples have
       unknown values for the attribute.
2. Multi-valued attribute :
a. When an attribute has many possible values, the information gain
measure gives an inappropriate indication of the attribute's
usefulness.
b. In the extreme case, we could use an attribute that has a different
   value for every example.
c. Then each subset of examples would be a singleton with a unique
   classification, so the information gain measure would have its
   highest value for this attribute, yet the attribute could be irrelevant
   or useless.
d. One solution is to use the gain ratio.

3. Continuous and integer valued input attributes :


a. Height and weight have an infinite set of possible values.
b. Rather than generating infinitely many branches, decision tree
learning algorithms find the split point that gives the highest
information gain.
c. Efficient dynamic programming methods exist for finding good
   split points, but it is still the most expensive part of real-world
   decision tree learning applications.
4. Continuous-valued output attributes :
a. If we are trying to predict a numerical value, such as the price of
a work of art, rather than discrete classifications, then we need a
regression tree.
b. Such a tree has a linear function of some subset of numerical
   attributes, rather than a single value at each leaf.
c. The learning algorithm must decide when to stop splitting and
   begin applying linear regression using the remaining attributes.
Que 2.8. Describe the basic terminology used in decision tree.
Answer
Basic terminology used in decision trees are :
1. Root node : It represents the entire population or sample and this
   further gets divided into two or more homogeneous sets.
2. Splitting : It is a process of dividing a node into two or more sub-nodes.
3. Decision node : When a sub-node splits into further sub-nodes, then it
is called decision node.
A root node splits into decision nodes; decision nodes split further into
decision nodes and terminal nodes; any sub-section of the tree forms a
branch/sub-tree.

Fig. 2.8.1.
4. Leaf/Terminal node : Nodes that do not split are called leaf or terminal
   nodes.
5. Pruning : When we remove sub-nodes of a decision node, this process
   is called pruning. This process is opposite to the splitting process.
6. Branch/sub-tree : A sub-section of the entire tree is called a branch or
   sub-tree.
7. Parent and child node : A node which is divided into sub-nodes is
   called the parent node of the sub-nodes, whereas the sub-nodes are the
   children of the parent node.

Que 2.9. Why do we use decision tree?

Answer
1 Decision trees can be visualized, simple to understand and interpret.
2. They require less data preparation, whereas other techniques often
   require data normalization, the creation of dummy variables and removal
   of blank values.
3. The cost of using the tree (for predicting data) is logarithmic in the
number of data points used to train the tree.
4. Decision trees can handle both categorical and numerical data whereas
other techniques are specialized for only one type of variable.
5. Decision trees can handle multi-output problems.
6. Decision tree is a white box model i.e., the explanation for the condition
   can be given easily by Boolean logic because there are two outputs, for
   example yes or no.
7. Decision trees can be used even if assumptions are violated by the
dataset from which the data is taken.

Que 2.10.Explain various decision tree learning algorithms.


Answer
Various decision tree learning algorithms are :
1. ID3 (Iterative Dichotomiser 3) :
i. ID3 is an algorithm used to generate a decision tree from a dataset.
ii. To construct a decision tree, ID3 uses a top-down, greedy search
through the given sets, where each attribute at every tree node is
tested toselect the attribute that is best for classification of a given
set.
iii. Therefore, the attribute with the highest information gain can be
     selected as the test attribute of the current node.
iv. In this algorithm, small decision trees are preferred over the larger
    ones. It is a heuristic algorithm because it does not construct the
    smallest tree.
v. For building a decision tree model, ID3 only accepts categorical
   attributes. Accurate results are not given by ID3 when there is
   noise and when it is serially implemented.
vi. Therefore, data is preprocessed before constructing a decision tree.
vii. For constructing a decision tree, information gain is calculated for
     each and every attribute and the attribute with the highest information
     gain becomes the root node. The rest of the possible values are denoted
     by arcs.
viii. All the outcome instances that are possible are examined, whether
      they belong to the same class or not. For the instances of the same
      class, a single name is used to denote the class; otherwise the
      instances are classified on the basis of the splitting attribute.
2. C4.5 :
i. C4.5 is an algorithm used to generate a decision tree. It is an extension
   of the ID3 algorithm.
ii. C4.5 generates decision trees which can be used for classification
    and therefore C4.5 is referred to as a statistical classifier.
iii. It is better than the ID3 algorithm because it deals with both
     continuous and discrete attributes and also with missing values
     and pruning trees after construction.
iv. C5.0 is the commercial successor of C4.5 because it is faster, memory
    efficient and used for building smaller decision trees.
v. C4.5 performs by default a tree pruning process. This leads to the
   formation of smaller trees, simple rules and produces more intuitive
   interpretations.
3. CART (Classification And Regression Trees) :
i. CART algorithm builds both classification and regression trees.
ii. The classification tree is constructed by CART through binary
    splitting of the attribute.
iii. Gini Index is used for selecting the splitting attribute.
iv. The CART is also used for regression analysis with the help of
regression tree.
V. The regression feature of CART can be used in forecasting a
dependent variable given a set of predictor variable over a given
period of time.
vi. CART has an average speed of processing and supports both
continuous and nominal attribute data.

Que 2.11. What are the advantages and disadvantages of different


decision tree learning algorithm ?

Answer
Advantages of ID3 algorithm :
1 The training data is used to create understandable prediction rules.
2 It builds short and fast tree.
3 ID3 searches the whole dataset to create the whole tree.
4. It finds the leaf nodes, thus enabling the test data to be pruned and
   reducing the number of tests.
5. The calculation time of ID3 is a linear function of the product of the
   characteristic number and node number.
Disadvantages of ID3 algorithm :
1. For a small sample, data may be overfitted or overclassified.
2. For making a decision, only one attribute is tested at an instant, thus
   consuming a lot of time.
3. Classifying continuous data may prove to be expensive in terms of
   computation, as many trees have to be generated to see where to break
   the continuous sequence.
4. It is overly sensitive to features when given a large number of input
   values.
Advantages of C4.5 algorithm :
1. C4.5 is easy to implement.
2. C4.5 builds models that can be easily interpreted.
3. It can handle both categorical and continuous values.
4. It can deal with noise and missing value attributes.
Disadvantages of C4.5 algorithm :
1. A small variation in data can lead to different decision trees when using
   C4.5.
2. For a small training set, C4.5 does not work very well.

Advantages of CART algorithm:


1. CART can handle missing values automatically using proxy splits.
2. It uses a combination of continuous/discrete variables.
3. CART automatically performs variable selection.
4 CART can establish interactions among variables.
5. CART does not vary according to the monotonic transformation of
predictive variable.
Disadvantages of CART algorithm :
1. CART has unstable decision trees.
2. CART splits only by one variable.
3. It is a non-parametric algorithm.

Que 2.12. Explain attribute selection measures used in decision tree.


tree.

Answer
Attribute selection measures used in decision tree are :
1. Entropy :
   i. Entropy is a measure of uncertainty associated with a random
      variable.
   ii. The entropy increases with the increase in uncertainty or
       randomness and decreases with a decrease in uncertainty or
       randomness.
   iii. The value of entropy ranges from 0-1.

        Entropy(D) = - Σ_{i=1}^{m} p_i log_2(p_i)

        where p_i is the non-zero probability that an arbitrary tuple in D
        belongs to class C_i and is estimated by |C_{i,D}| / |D|.
   iv. A log function of base 2 is used because the entropy is encoded in
       bits 0 and 1.
2. Information gain :
   i. ID3 uses information gain as its attribute selection measure.
   ii. Information gain is the difference between the original information
       requirement (i.e., based on the proportion of classes) and the new
       requirement (i.e., obtained after the partitioning on A).

        Gain(D, A) = Entropy(D) - Σ_{j=1}^{V} (|D_j| / |D|) Entropy(D_j)

        where,
        D : a given data partition
        A : attribute
        V : the number of distinct values of attribute A on which the
            tuples in D are partitioned
   iii. D is split into V partitions or subsets, {D_1, D_2, ..., D_V}, where
        D_j contains those tuples in D that have outcome a_j of A.
   iv. The attribute that has the highest information gain is chosen.
3. Gain ratio :
   i. The information gain measure is biased towards tests with many
      outcomes.
   ii. That is, it prefers to select attributes having a large number of
       values.
   iii. As each partition is pure, the information gain by partitioning is
        maximal. But such partitioning cannot be used for classification.
   iv. C4.5 uses this attribute selection measure, which is an extension to
       the information gain.
   v. Gain ratio differs from information gain, which measures the
      information with respect to a classification that is acquired based
      on some partitioning.
   vi. Gain ratio applies a kind of normalization to information gain using
       a split information value defined as :

        SplitInfo_A(D) = - Σ_{j=1}^{V} (|D_j| / |D|) log_2(|D_j| / |D|)

   vii. The gain ratio is then defined as :

        Gain ratio(A) = Gain(A) / SplitInfo_A(D)

   viii. The splitting attribute selected is the attribute having the
         maximum gain ratio.
4. Gini index : Refer Q. 2.3, Page 2-3G,Unit-2.
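The three measures above can be sketched directly from the formulas. The labels and partitions in the usage are illustrative assumptions, and each partition is assumed to be non-empty.

```python
from math import log2

def entropy(labels):
    # Entropy(D) = - sum_i p_i * log2(p_i)
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(labels, partitions):
    # Gain(D, A) = Entropy(D) - sum_j (|D_j| / |D|) * Entropy(D_j)
    n = len(labels)
    remainder = sum(len(d) / n * entropy(d) for d in partitions)
    return entropy(labels) - remainder

def split_info(labels, partitions):
    # SplitInfo_A(D) = - sum_j (|D_j| / |D|) * log2(|D_j| / |D|)
    n = len(labels)
    return -sum(len(d) / n * log2(len(d) / n) for d in partitions)

def gain_ratio(labels, partitions):
    return information_gain(labels, partitions) / split_info(labels, partitions)
```

For a node with two 'Yes' and two 'No' labels split into two pure halves, the entropy is 1 bit, the information gain is 1, and the gain ratio is also 1 since the split information equals 1.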

Que 2.13. Explain applications of decision tree in various areas


of data mining.

Answer
The various decision tree applications in data mining are :
1. E-Commerce : It is used widely in the field of e-commerce; a decision
   tree helps to generate an online catalog, which is an important factor for
   the success of an e-commerce website.
2 Industry:Decision tree algorithm is useful for producing quality control
(faults identification) systems.
3. Intelligent vehicles : An important task for the development of
intelligent vehicles is to find the lane boundaries of the road.
4. Medicine:
a. Decision tree is an important technique for medical research and
   practice. A decision tree is used for diagnosis of various diseases.
b. Decision tree is also used for heart sound diagnosis.
5. Business: Decision trees find use in the field of business where they
are used for visualization of probabilistic business models, used in CRM
(Customer Relationship Management) and used for credit scoring for
credit card users and for predicting loan risks in banks.

Que 2.14. Explain the procedure of ID3 algorithm.



Answer
ID3(Examples, TargetAttribute, Attributes) :
1. Create a Root node for the tree.
2. If all Examples are positive, return the single-node tree Root, with
   label = +.
3. If all Examples are negative, return the single-node tree Root, with
   label = -.
4. If Attributes is empty, return the single-node tree Root, with label =
   most common value of TargetAttribute in Examples.
5. Otherwise begin :
   a. A ← the attribute from Attributes that best classifies Examples.
   b. The decision attribute for Root ← A.
   c. For each possible value v_i of A :
      i. Add a new tree branch below Root, corresponding to the test
         A = v_i.
      ii. Let Examples_vi be the subset of Examples that have value v_i
          for A.
      iii. If Examples_vi is empty, then below this new branch add a leaf
           node with label = most common value of TargetAttribute in
           Examples; else below this new branch add the sub-tree
           ID3(Examples_vi, TargetAttribute, Attributes - {A}).
6. End
7. Return Root.
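The procedure above can be sketched recursively in Python for categorical data. Representing examples as dicts, and the tiny weather dataset in the usage, are illustrative assumptions rather than part of the text.

```python
from math import log2
from collections import Counter

def entropy(examples, target):
    n = len(examples)
    counts = Counter(e[target] for e in examples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def best_attribute(examples, target, attributes):
    # step 5a: the attribute with the highest information gain
    def gain(a):
        rem = 0.0
        for v in set(e[a] for e in examples):
            subset = [e for e in examples if e[a] == v]
            rem += len(subset) / len(examples) * entropy(subset, target)
        return entropy(examples, target) - rem
    return max(attributes, key=gain)

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # steps 2-3: pure node
        return labels[0]
    if not attributes:                        # step 4: no attributes left
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, target, attributes)
    branches = {}
    for v in set(e[a] for e in examples):     # step 5c: one branch per value
        subset = [e for e in examples if e[a] == v]
        branches[v] = id3(subset, target, [x for x in attributes if x != a])
    return (a, branches)
```

On a four-row weather table, the root becomes Outlook with one leaf per outlook value.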

Que 2.15. Explain inductive bias with example.

Answer
Refer Q. 1.22, Page 1-20G, Unit-1.

PART-2

Artificial Neural Network, Perceptrons, Gradient Descent
and the Delta Rule, Adaline.

Questions-Answers

Long Answer Type and Medium Answer Type Questions



Que 2.16. Write short note on Artificial Neural Network (ANN).


Answer
1. Artificial Neural Networks (ANN) or neural networks are computational
   algorithms that are intended to simulate the behaviour of biological
   systems composed of neurons.
2. ANNs are computational models inspired by an animal's central nervous
   system.
3. It is capable of machine learning as well as pattern recognition.
4. A neural network is an oriented graph. It consists of nodes which, in the
   biological analogy, represent neurons, connected by arcs.
5. The arcs correspond to dendrites and synapses. Each arc is associated
   with a weight at each node.
6. A neural network is a machine learning algorithm based on the model
   of a human neuron. The human brain consists of millions of neurons.
7. It sends and processes signals in the form of electrical and chemical
   signals.
8. These neurons are connected with a special structure known as synapses.
   Synapses allow neurons to pass signals.
9. An Artificial Neural Network is an information processing technique. It
   works like the way the human brain processes information.
10. ANN includes a large number of connected processing units that work
    together to process information. They also generate meaningful results
    from it.
11. A neural network contains the following three layers :
    a. Input layer : The activity of the input units represents the raw
       information that can feed into the network.
    b. Hidden layer :
       i. Hidden layer is used to determine the activity of each hidden
          unit.
       ii. The activities of the input units and the weights depend on the
           connections between the input and the hidden units.
       iii. There may be one or more hidden layers.
    c. Output layer : The behaviour of the output units depends on the
       activity of the hidden units and the weights between the hidden
       and output units.
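The three-layer structure above can be illustrated with a minimal forward pass. The sigmoid activation and the weight values in the usage are arbitrary assumptions (and bias terms are omitted for brevity), not values from the text.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def layer(inputs, weights):
    # each row of weights feeds one unit in the next layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(x, w_hidden, w_output):
    hidden = layer(x, w_hidden)       # activity of the hidden units
    return layer(hidden, w_output)    # activity of the output units
```

With two inputs, two hidden units and one output unit, the result is a single activation between 0 and 1.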

Que 2.17. What are the advantages and disadvantages of Artificial
Neural Network ?

Answer
Advantages of Artificial Neural Networks (ANN) :
1 Problems in ANN are represented by attribute-value pairs.
2 ANNs are used for problems having the target function, output may be
discrete-valued, real-valued, or avector of several real or discrete-valued
attributes.
3 ANNslearning methods are quite robust to noise in the training data.
The training examples may contain errors, which do not affect the final
output.
4. It is used where fast evaluation of the learned target function is
   required.
5. ANNs can bear long training times depending on factors such as the
   number of weights in the network, the number of training examples
   considered, and the settings of various learning algorithm parameters.
Disadvantages of Artificial Neural Networks (ANN) :
1 Hardware dependence :
a. Artificial neural networks require processors with parallel processing
power, by their structure.
b. For this reason, the realization of the equipment is dependent.
2. Unexplained functioning of the network :
This is the most important problem of ANN.
b When ANN gives a probing solution, it does not give a clue as to
why and how.
C. This reduces trust in the network.
3. Assurance of proper network structure:
a There is no specific rule for determining the structure of artificial
neural networks.
b. The appropriate network structure is achieved through experience
and trial and error.
4. The difficulty of showing the problem to the network:
a. ANNs can work with numerical information.
b. Problems have to be translated into numerical values before being
introduced to ANN.
C. The display mechanism to be determined will directly influence the
performance of the network.
d. This is dependent on the user's ability.
5. The duration of thenetwork is unknown:
a. The network being reduced to a certain value of the error on the
   sample means that the training has been completed.
Que 2.18. What are the characteristics of Artificial Neural Network ?

Answer

Characteristics of Artificial Neural Network are :
1. It is a neurally implemented mathematical model.
2. It contains a large number of interconnected processing elements called
   neurons to do all the operations.
3. Information stored in the neurons is basically the weighted linkage of
   neurons.
4. The input signals arrive at the processing elements through connections
   and connecting weights.
5. It has the ability to learn, recall and generalize from the given data by
   suitable assignment and adjustment of weights.
6. The collective behaviour of the neurons describes its computational
   power, and no single neuron carries specific information.
Que 2.19. Explain the application areas of artificial neural network.

Answer
Application areas of artificial neural network are :
1. Speech recognition :
   a. Speech occupies a prominent role in human-computer interaction.
   b. Therefore, it is natural for people to expect to talk with
      computers.
   c. In the present era, for communication with machines, humans still
      need sophisticated languages which are difficult to learn and use.
   d. To ease this communication barrier, a simple solution could be
      communication in a spoken language that is possible for the machine
      to understand.
   e. Hence, ANN is playing a major role in speech recognition.
2. Character recognition :
   a. It is a problem which falls under the general area of Pattern
      Recognition.
   b. Many neural networks have been developed for automatic
      recognition of handwritten characters, either letters or digits.
3. Signature verification application :
   a. Signatures are useful ways to authorize and authenticate a person
      in legal transactions.

b. Signature verification technique is a non-vision based technique.


C. For this application, the first approach is to extract the feature or
rather the geometrical feature set representing the signature.
d. With these feature sets, we have to train the neural networks
using an efficient neural network algorithm.
e. This trained neural network will classify the signature as being
genuine or forged under the verification stage.
4. Human face recognition:
a. It is one of the biometric methods to identify the given face.
b. It is a typical task because of the characterization of "non-face"
images.
C. However, if a neural network is well trained, then it can be divided
into two classes namely images having faces and images that do not
have faces.

Que 2.20. Explain different types of neuron connection with


architecture.

Answer
Different types of neuron connection are :
1. Single-layer feed forward network:
a. In this type of network, we have only two layers i.e., input layer and output layer, but the input layer does not count because no computation is performed in this layer.
b. Output layer is formed when different weights are applied on input nodes and the cumulative effect per node is taken.
C. After this the neurons collectively give the output layer to compute
the output signals.
(Figure : input layer nodes x1, ..., xn connected to output layer nodes y1, ..., ym through weights w11, ..., wnm.)
2. Multilayer feed forward network :
a. This network has a hidden layer which is internal to the network and has no direct contact with the external layer.
b. Existence of one or more hidden layers enables the network to be computationally stronger.
C. There are no feedback connections in which outputs of the model
are fed back into itself.

(Figure : input layer nodes x1, ..., xn connected through weights w to a hidden layer, which connects through weights v to the output layer.)

3. Single node with its own feedback :


a. When outputs can be directed back as inputs to the same layer or preceding layer nodes, then it results in feedback networks.
b. Recurrent networks are feedback networks with closed loop.
c. Fig. 2.20.1 shows a single recurrent network having a single neuron with feedback to itself.

(Figure : a single neuron whose output is fed back as its own input.)
Fig. 2.20.1.

4. Single-layer recurrent network:


a. This network is single layer network with feedback connection in
which processing element's output can be directed back to itself or
to other processing elements or both.

b Recurrent neural network is a class of artificial neural network


where connections between nodes form a directed graph along a
sequence.
C. This allows it to exhibit dynamic temporal behaviour for a time
sequence. Unlike feed forward neural networks, RNNs can use
their internal state (memory) to process sequences of inputs.

(Figure : a single-layer recurrent network with feedback weights w11, w22, ..., wnm.)

5. Multilayer recurrent network :


a. In this type of network, processing element output can be directed
to the processing element in the same layer and in the preceding
layer forming a multilayer recurrent network.
b. They perform the same task for every element of a sequence, with the output being dependent on the previous computations. Inputs are not needed at each time step.
C. The main feature of a multilayer recurrent neural network is its
hidden state, which captures information about a sequence.

(Figure : a multilayer recurrent network with input nodes x, weights w and v, and output nodes z, including feedback connections between layers.)
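All of the layered topologies above share the same forward computation: weight the incoming signals, sum them per node, and pass the sum through an activation. A small NumPy sketch of the two feed forward cases (the layer sizes and weight values here are invented for illustration):

```python
import numpy as np

def sigmoid(v):
    # Squashing activation applied at each computing node
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([0.5, -1.0, 0.25])          # n = 3 input signals

# Single-layer feed forward: inputs connect straight to the output layer
W = np.array([[0.2, -0.4],               # W[i, j] = weight from input i
              [0.7,  0.1],               # to output node j (n x m)
              [-0.3, 0.5]])
y_single = sigmoid(x @ W)                # m = 2 output signals

# Multilayer feed forward: a hidden layer sits between input and output
V = np.array([[0.1, -0.2],
              [0.3,  0.4]])              # hidden-to-output weights
hidden = sigmoid(x @ W)                  # hidden-layer activations
y_multi = sigmoid(hidden @ V)
print(y_single.shape, y_multi.shape)     # (2,) (2,)
```

The recurrent variants differ only in that some of these outputs are fed back as inputs on the next time step.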

Que 2.21. Discuss the benefits of artificial neural


network.

Answer
1. Artificial neural networks are flexible and adaptive.
2. Artificial neural networks are used in sequence and pattern recognition systems, data processing, robotics, modeling, etc.
3. ANN acquires knowledge from its surroundings by adapting to internal and external parameters, and it solves complex problems which are difficult to manage.
4 It generalizes knowledge to produce adequate responses to unknown
situations.
5 Artificial neural networks are flexible and have the ability to learn,
generalize and adapts to situations based on its findings.
6 This function allows the network to efficiently acquire knowledge by
learning. This is a distinct advantage over a traditionally linear network
that is inadequate when it comes to modelling non-linear data.
7. An artificial neural network is capable of greater fault tolerance than a traditional network. Without the loss of stored data, the network is able to recover from a fault in any of its components.
8. An artificial neural network is based on adaptive learning.

Que 2.22. Write short note on gradient descent.

Answer
1. Gradient descent is an optimization technique in machine learning and
deep learning and it can be used with all the learning algorithms.
2. A gradient is the slope of a function, the degree of change of a parameter with the amount of change in another parameter.
3. Mathematically, it can be described as the partial derivatives of a set of parameters with respect to its inputs. The more the gradient, the steeper the slope.
4. Gradient descent works best on convex cost functions.
5. Gradient descent can be described as an iterative method which is used to find the values of the parameters of a function that minimize the cost function as much as possible.
6. The parameters are initially set to particular values and from that, gradient descent is run in an iterative fashion to find the optimal values of the parameters, using calculus, to find the minimum possible value of the given cost function.
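The iterative update described in the points above can be sketched for a one-parameter case; the cost J(w) = (w - 3)^2, the starting point and the learning rate are all invented for the example:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Iteratively move w against the gradient of the cost function."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)     # w := w - eta * dJ/dw
    return w

# Cost J(w) = (w - 3)^2 has gradient dJ/dw = 2*(w - 3) and minimum at w = 3
w_opt = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_opt, 4))           # converges very close to 3.0
```

Each step moves the parameter downhill by an amount proportional to the slope, so the updates shrink as the minimum is approached.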

Que 2.23. Explain different types of gradient descent.



Answer
Different types of gradient descent are:
1. Batch gradient descent:
a. This is a type of gradient descent which processes all the training examples for each iteration of gradient descent.
b. When the number of training examples is large, then batch gradient descent is computationally very expensive. So, it is not preferred.
c. Instead, we prefer to use stochastic gradient descent or mini-batch gradient descent.
2. Stochastic gradient descent :
a. This is a type of gradient descent which processes single training
example per iteration.
b. Hence, the parameters are being updated even after one iteration in which only a single example has been processed.
c. Hence, this is faster than batch gradient descent. However, when the number of training examples is large, it still processes only one example at a time, which can be an additional overhead for the system as the number of iterations will be large.
3. Mini-batch gradient descent :
a. This is a mixture of both stochastic and batch gradient descent.
b. The training set is divided into multiple groups called batches.
c. Each batch has a number of training samples in it.
d. At a time, a single batch is passed through the network which
computes the loss of every sample in the batch and uses their
average to update the parameters of the neural network.
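The three variants differ only in how many training samples feed each parameter update. A schematic with a linear model and squared loss (NumPy; the data, sizes and learning rate are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 training examples
y = X @ np.array([1.0, -2.0, 0.5])       # targets from a known linear rule
w = np.zeros(3)
lr = 0.05

def grad(Xb, yb, w):
    # Gradient of the mean squared error over one batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

batch_size = 10                          # mini-batch gradient descent
# batch_size = len(X) -> batch gradient descent
# batch_size = 1      -> stochastic gradient descent
for epoch in range(200):
    idx = rng.permutation(len(X))        # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        w -= lr * grad(X[b], y[b], w)    # one update per batch

print(np.round(w, 3))                    # approaches [1.0, -2.0, 0.5]
```

Changing only `batch_size` switches between the three regimes, which is why they are usually presented as one family of methods.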

Que 2.24. What are the advantages and disadvantages of batch


gradient descent ?
Answer
Advantages of batch gradient descent :
1. Less oscillations and noisy steps taken towards the global minima of the loss function due to updating the parameters by computing the average of all the training samples rather than the value of a single sample.
2. It can benefit from vectorization, which increases the speed of processing all training samples together.
3. It produces a more stable gradient descent convergence and stable error gradient than stochastic gradient descent.
4 It is computationally efficient as all computer resources are not being
used to process a single sample rather are being used for all training
samples.
Disadvantages of batch gradient descent :
1. Sometimes a stable error gradient can lead to a local minima and, unlike stochastic gradient descent, no noisy steps are there to help to get out of the local minima.
2. The entire training set can be too large to process in the memory due to
which additional memory might be needed.
3. Depending on computer resources it can take too long for processing all
the training samples as a batch.
Que 2.25. What are the advantages and disadvantages of stochastic gradient descent ?
Answer
Advantages of stochastic gradient descent :
1. It is easier to fit into memory due to a single training sample being
processed by the network.
2 It is computationally fast as only one sample is processed at a time.
3 For larger datasets it can converge faster as it causes updates to the
parameters more frequently.
4 Due to frequent updates the steps taken towards the minima of the loss
function have oscillations which can help getting out oflocal minimums
of the loss function (in case the computed position turns out to be the
local minimum).
Disadvantages of stochastic gradient descent :
1. Due to frequent updates the steps taken towards the minima are very noisy. This can often lead the gradient descent into other directions.
2. Also, due to noisy steps it may take longer to achieve convergence to the minima of the loss function.
3. Frequent updates are computationally expensive due to using all
resources for processing one training sample at a time.
4. It loses the advantage of vectorized operations as it deals with only a
single example at a time.
Que 2.26. Explain delta rule. Explain generalized delta learning
rule (error backpropagation learning rule).
Answer
Delta rule:
1. The delta rule is a specialized version of the backpropagation learning rule that applies to single-layer neural networks.

2. It calculates the error between the calculated output and the sample output data, and uses this to create a modification to the weights, thus implementing a form of gradient descent.
Generalized delta learning rule (Error backpropagation learning) :
In generalized delta learning rule (error backpropagation learning), we are given the training set
{(x^1, y^1), ..., (x^K, y^K)}
where x^k ∈ R^n and y^k ∈ R, k = 1, ..., K.
Step 1 : η > 0, E_max > 0 are chosen.
Step 2 : Weights w are initialized at small random values, k := 1, and the running error E is set to 0.
Step 3 : Input x^k is presented, x := x^k, y := y^k, and the output O is computed as
O = 1 / (1 + exp(-W^T o))
where o is the output vector of the hidden layer
o_l = 1 / (1 + exp(-w_l^T x))
Step 4 : Weights of the output unit are updated
W := W + ηδo, where δ = (y - O)O(1 - O)
Step 5 : Weights of the hidden units are updated
w_l := w_l + ηδW_l o_l(1 - o_l)x, l = 1, ..., L
Step 6 : Cumulative cycle error is computed by adding the present error to E
E := E + 1/2(y - O)^2
Step 7 : If k < K then k := k + 1 and we continue the training by going back to Step 3, otherwise we go to Step 8.
Step 8 : The training cycle is completed. For E < E_max terminate the training session. If E > E_max then E := 0, k := 1 and we initiate a new training cycle by going back to Step 3.
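Steps 3 to 6 of the cycle above can be sketched with NumPy for one hidden layer. The data (logical AND targets), the network sizes and η are invented, and the bias is folded in as an extra input fixed at +1; treat this as an illustrative reading of the algorithm, not the book's own code:

```python
import numpy as np

rng = np.random.default_rng(1)
# Bias handled as an extra input fixed at +1 (an assumption for this demo)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])       # targets y^k (logical AND)

L_hid, eta = 4, 0.5                      # hidden units L and learning rate
w = rng.normal(scale=0.5, size=(L_hid, 3))   # hidden-unit weights w_l
W = rng.normal(scale=0.5, size=L_hid)        # output-unit weights W

def f(v):                                # sigmoid activation
    return 1.0 / (1.0 + np.exp(-v))

for _ in range(5000):                    # repeated training cycles (Step 8)
    E = 0.0
    for xk, yk in zip(X, y):             # one pass k = 1..K (Steps 3-7)
        o = f(w @ xk)                    # Step 3: hidden outputs o_l
        O = f(W @ o)                     # network output O
        delta = (yk - O) * O * (1 - O)   # Step 4: delta = (y-O)O(1-O)
        W = W + eta * delta * o          # Step 4: W := W + eta*delta*o
        w = w + eta * delta * np.outer(W * o * (1 - o), xk)  # Step 5
        E += 0.5 * (yk - O) ** 2         # Step 6: cumulative cycle error
print(round(E, 4))
```

After enough cycles the cumulative error E falls below a chosen E_max and training stops, exactly as in Step 8.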

PART-3

Adaline, Multilayer Network, Derivation of Backpropagation Rule,


Backpropagation Algorithm, Convergence, Generalization.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 2.27. Explain adaline network with its architecture.


Answer
1. ADALINE is an Adaptive Linear Neuron network with a single linear
unit. The Adaline network is trained using the delta rule.
2 It receives input from several units and bias unit.
3. An Adaline model consists of trainable weights. The inputs are of two values (+1 or -1) and the weights have signs (positive or negative).
4. Initially random weights are assigned. The net input calculated is applied to a quantizer transfer function (activation function) that restores the output to +1 or -1.
5 The Adaline model compares the actual output with the target output
and with the bias units and then adjusts all the weights.

(Figure : input units feeding Adaline units, which feed the output units.)

Que 2.28. Explain training and testing algorithm used in adaline


network.

Answer
Adaline network training algorithm is as follows :
Step 0: Weights and bias are to be set to some random values but not zero.
Set the learning rate parameter a.
Step 1:Perform steps 2-6 when stopping condition is false.
Step 2: Perform steps 3-5 for each bipolar training pair.
Step 3 : Set activations for input units i = 1 to n.
Step 4 : Calculate the net input to the output unit.
Step 5 : Update the weight and bias for i = 1 to n.
Step 6:If the highest weight change that occurred during training is smaller
than a specified tolerance then stop the training process, else continue. This
is the test for the stopping condition of a network.

Adaline networks testing algorithm is as follows :


When the training has been completed, the Adaline can be used to classify input patterns. A step function is used to test the performance of the network.
Step 0: Initialize the weights. (The weights are obtained from the training
algorithm.)
Step 1 : Perform steps 2-4 for each bipolar input vector x.
Step 2: Set the activations of the input units to x.
Step 3: Calculate the net input to the output units.
Step 4 : Apply the activation function over the net input calculated.
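A minimal NumPy sketch of this train-then-test cycle on bipolar AND data (the learning rate, tolerance and epoch cap are illustrative choices; the cap is added because with a fixed learning rate the per-sample updates may never fall below the tolerance):

```python
import numpy as np

# Bipolar training pairs (logical AND in bipolar form, invented for the demo)
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1.0, -1.0, -1.0, -1.0])

rng = np.random.default_rng(0)
w = rng.uniform(-0.1, 0.1, size=2)       # Step 0: small non-zero weights
b = rng.uniform(-0.1, 0.1)
alpha, tol = 0.05, 1e-4                  # learning rate and tolerance

for epoch in range(200):                 # Steps 1-2: loop over training pairs
    max_change = 0.0
    for x, target in zip(X, t):
        y_in = x @ w + b                 # Step 4: net input to the output unit
        err = target - y_in              # delta-rule error term
        w = w + alpha * err * x          # Step 5: update weights and bias
        b = b + alpha * err
        max_change = max(max_change, float(np.max(np.abs(alpha * err * x))))
    if max_change < tol:                 # Step 6: stopping condition
        break

# Testing: a step function over the net input classifies each pattern
out = np.where(X @ w + b >= 0, 1, -1)
print(out)
```

The trained weights settle near the least-squares solution, so the step function recovers the correct bipolar class for every training pattern.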
Que 2.29. Write short note on backpropagation algorithm.
Answer
1. Backpropagation is an algorithm used in the training of feedforward
neural networks for supervised learning.
2 Backpropagation efficiently computes the gradient of the loss function
with respect to the weights of the network for a single input-output
example.
3. This makes it feasible to use gradient methods for training multi-layer networks; to update weights and minimize loss, we use gradient descent or variants such as stochastic gradient descent.
4 The backpropagation algorithm works by computing the gradient of the
loss function with respect to each weight by the chain rule, iterating
backwards one layer at a time from the last layer to avoid redundant
calculations of intermediate terms in the chain rule;this is an example
of dynamic programming.
5 The term backpropagation refers only to the algorithm for computing
the gradient, but it is often used loosely to refer to the entire learning
algorithm, also including how the gradient is used, such as by stochastic
gradient descent.
6 Backpropagation generalizes the gradient computation in the delta rule,
which is the single-layer version of backpropagation, and is in turn
generalized by automatic differentiation, where backpropagation is a
special case of reverse accumulation (reverse mode).
Que 2.30. Explain perceptron with signal flow graph.
Answer
1 The perceptron is the simplest form of a neural network used for
classification of patterns said to be linearly separable.
2 It consists of a single neuron with adjustable synaptic weights and bias.
3. The perceptron built around a single neuron is limited to performing pattern classification with only two classes.
4 By expanding the output layer of perceptron to include more than one
neuron, more than two classes can be classified.
5. Suppose a perceptron has synaptic weights denoted by w1, w2, w3, ..., wm.
6. The inputs applied to the perceptron are denoted by x1, x2, ..., xm.
7. The externally applied bias is denoted by b.
(Figure : inputs x1, ..., xm weighted by w1, ..., wm, summed together with the bias b, and passed through a hard limiter to produce the output.)
Fig. 2.30.1. Signal flow graph of the perceptron.
8. From the model, we find the hard limiter input or induced local field of the neuron as
v = Σ_{i=1}^{m} w_i x_i + b
9. The goal of the perceptron is to correctly classify the set of externally applied inputs x1, x2, ..., xm into one of two classes G1 and G2.
10. The decision rule for classification is : if the output y is +1, assign the point represented by inputs x1, x2, ..., xm to class G1; if y is -1, assign it to class G2.
11. In Fig. 2.30.2, a point (x1, x2) lying below the boundary line is assigned to class G1 and a point above the line is assigned to class G2. The decision boundary is calculated as :
w1x1 + w2x2 + b = 0
(Figure : the line w1x1 + w2x2 + b = 0 separating class G1 from class G2.)
Fig. 2.30.2.
12. There are two decision regions separated by a hyperplane defined as :
Σ_{i=1}^{m} w_i x_i + b = 0
The synaptic weights w1, w2, ..., wm of the perceptron can be adapted on an iteration by iteration basis.
13. For the adaption, an error-correction rule known as the perceptron convergence algorithm is used.
14. For a perceptron to function properly, the two classes G1 and G2 must be linearly separable.
15. Linearly separable means the pattern or set of inputs to be classified must be separable by a straight line.
16. Generalizing, a set of points in n-dimensional space is linearly separable if there is a hyperplane of (n - 1) dimensions that separates the sets.
(a) A pair of linearly separable patterns. (b) A pair of non-linearly separable patterns.
Fig. 2.30.3.
Que 2.31. State and prove perceptron convergence theorem.

Answer
Statement: The Perceptron convergence theorem states that for any data
set which is linearly separable the Perceptron learning rule is guaranteed to
find a solution in a finite number of steps.
Proof:
1. To derive the error-correction learning algorithm for the perceptron, we proceed as follows.
2. The perceptron convergence theorem uses the fact that the synaptic weights w1, w2, ..., wm of the perceptron can be adapted on an iteration by iteration basis.
3. The bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1 :
x(n) = [+1, x1(n), x2(n), ..., xm(n)]^T
where n denotes the iteration step in applying the algorithm.
4. Correspondingly, we define the weight vector as
w(n) = [b(n), w1(n), w2(n), ..., wm(n)]^T
5. Accordingly, the linear combiner output is written in the compact form
v(n) = Σ_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)

The algorithm for adapting the weight vector is stated as :
1. If the nth member of the input set x(n) is correctly classified into linearly separable classes by the weight vector w(n) (that is, the output is correct), then no adjustment of weights is done :
w(n + 1) = w(n) if w^T(n) x(n) > 0 and x(n) belongs to class G1;
w(n + 1) = w(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class G2.
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule :
w(n + 1) = w(n) - η(n) x(n) if w^T(n) x(n) > 0 and x(n) belongs to class G2;
w(n + 1) = w(n) + η(n) x(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class G1.
where η(n) is the learning-rate parameter for controlling the adjustment applied to the weight vector at iteration n.
Also, a small η leads to slow learning and a large η to fast learning. For a constant η, the learning algorithm is termed a fixed increment algorithm.
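The fixed increment rule stated above can be sketched directly; the toy linearly separable data and η = 1 are invented for the demonstration:

```python
import numpy as np

# Toy linearly separable data: class G1 labelled +1, class G2 labelled -1
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
d = np.array([1, 1, -1, -1])

# Bias treated as a weight driven by a fixed +1 input: x(n) = [+1, x1, x2]^T
Xb = np.hstack([np.ones((len(X), 1)), X])
w = np.zeros(3)
eta = 1.0                                 # constant rate: fixed increment rule

converged = False
while not converged:                      # finite by the convergence theorem
    converged = True
    for x, target in zip(Xb, d):
        y = 1 if w @ x > 0 else -1        # hard limiter output
        if y != target:                   # misclassified: correct the weights
            w = w + eta * target * x      # error-correction update
            converged = False

preds = np.where(Xb @ w > 0, 1, -1)
print(preds)                              # [ 1  1 -1 -1]
```

Because the data are linearly separable, the loop is guaranteed to terminate with every training point on the correct side of the hyperplane.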
Que 2.32. Explain multilayer perceptron with its architecture
and characteristics.

Answer
Multilayer perceptron :
1. The perceptrons which are arranged in layers are called multilayer
perceptron. This model has three layers : an input layer, output layer
and hidden layer.
2. For the perceptrons in the input layer, the linear transfer function is used, and for the perceptrons in the hidden layer and output layer, the sigmoidal or squashed-S function is used.
3. The input signal propagates through the network in a forward direction.
4. On a layer by layer basis, in the multilayer perceptron the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1 :
x(n) = [+1, x1(n), x2(n), ..., xm(n)]^T
where n denotes the iteration step in applying the algorithm. Correspondingly, we define the weight vector as :
w(n) = [b(n), w1(n), w2(n), ..., wm(n)]^T
5. Accordingly, the linear combiner output is written in the compact form :
v(n) = Σ_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)

The algorithm for adapting the weight vector is stated as :


1. If the nth member of the input set x(n) is correctly classified into linearly separable classes by the weight vector w(n) (that is, the output is correct), then no adjustment of weights is done :
w(n + 1) = w(n) if w^T(n) x(n) > 0 and x(n) belongs to class G1;
w(n + 1) = w(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class G2.
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule.
Architecture of multilayer perceptron :

(Figure : input signals enter the input layer, pass through the first and second hidden layers, and emerge from the output layer as output signals.)
Fig. 2.32.1.

1. Fig. 2.32.1 shows the architectural graph of a multilayer perceptron with two hidden layers and an output layer.
2. Signal flow through the network progresses in a forward direction, from left to right and on a layer-by-layer basis.
3 Two kinds of signals are identified in this network :
a. Functional signals : A functional signal is an input signal that propagates forward and emerges at the output end of the network as an output signal.
b. Error signals : Error signal originates at an output neuron and
propagates backward through the network.
4 Multilayer perceptrons have been applied successfully to solve some
difficult and diverse problems by training them in a supervised manner
with a highly popular algorithm known as the error backpropagation algorithm.
Characteristics of multilayer perceptron :
1. In this model, each neuron in the network includes a non-linear
activation function (non-linearity is smooth). Most commonly used
non-linear function is defined by :
y_j = 1 / (1 + exp(-v_j))
where v_j is the induced local field (i.e., the sum of all weighted inputs and the bias) and y_j is the output of neuron j.
2. The network contains hidden neurons that are not a part of input or
output of the network. Hidden layer of neurons enabled network to
learn complex tasks.
3. The network exhibits a high degree of connectivity.

Que 2.33. How do tuning parameters affect the backpropagation neural network ?

Answer
Effect of tuning parameters of the backpropagation neural network:
1. Momentum factor :
The momentum factor has a significant role in deciding the values
of learning rate that willproduce rapid learning.
b. It determines the size of change in weights or biases.
C. If momentum factor is zero, the smoothening is minimum and the
entire weight adjustment comes from the newly calculated change.
d. If momentum factor is one, new adjustment is ignored and previous
one is repeated.
e.
Between 0 and 1 is a region where the weight adjustment is
smoothened by an amount proportional to the momentum factor.
f The momentum factor effectively increases the speed of learning
without leading to oscillations and filters out high frequency
variations of the error surface in the weight space.
2. Learning coefficient :
a. A formula to select the learning coefficient has been :
η = 1.5 / √(N1² + N2² + ... + Nm²)
where N1 is the number of patterns of type 1 and m is the number of different pattern types.

b. A small value of learning coefficient, less than 0.2, produces slower but stable training.
c. For the largest values of learning coefficient, i.e., greater than 0.5, the weights are changed drastically, but this may cause the optimum combination of weights to be overshot, resulting in oscillations about the optimum.
d. The optimum value of learning rate is 0.6, which produces fast learning without leading to oscillations.
3. Sigmoidal gain :
a. If the sigmoidal function is selected, the input-output relationship of the neuron can be set as
O = 1 / (1 + e^(-λ(I + θ)))    ...(2.33.1)
where I is the net input and λ is a scaling factor known as sigmoidal gain.
b. As the scaling factor increases, the input-output characteristic of the analog neuron approaches that of the two-state neuron, i.e., the activation function approaches the step function.
c. It also affects the backpropagation. To get graded output, as the sigmoidal gain factor is increased, the learning rate and momentum factor have to be decreased in order to prevent oscillations.
4. Threshold value :
a. θ in eq. (2.33.1) is called the threshold value or the bias or the noise factor.
b. A neuron fires or generates an output if the weighted sum of the inputs exceeds the threshold value.
c. One method is to simply assign a small value to it and not to change it during training.
d. The other method is to initially choose some random values and change them during training.
Que 2.34. Discuss selection of various parameters in
Backpropagation Neural Network (BPN).
Answer
Selection of various parameters in BPN:
1. Number of hidden nodes :
a. The guiding criterion is to select the minimum nodes in the first and third layer, so that the memory demand for storing the weights can be kept minimum.
b. The number of separable regions in the input space M is a function of the number of hidden nodes H in BPN, and H = M - 1.

c. When the number of hidden nodes is equal to the number of training patterns, the learning could be fastest.
d. In such cases, BPN simply remembers training patterns, losing all generalization capabilities.
e. Hence, as far as generalization is concerned, the number of hidden nodes should be small compared to the number of training patterns; this can be judged with the help of the Vapnik-Chervonenkis dimension (VCdim) of probability theory.
f. We can estimate the selection of the number of hidden nodes for a given number of training patterns from the number of weights, which is equal to I1 × I2 + I2 × I3, where I1 and I3 denote input and output nodes and I2 denotes hidden nodes.
g. Assume the training samples T to be greater than VCdim. Now if we accept the ratio 10 : 1,
10 = T / (I2(I1 + I3))
I2 = T / (10(I1 + I3))
which yields the value for I2.
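Reading the 10 : 1 guideline as "training samples ≈ 10 × number of weights" (our interpretation; treat the exact constant and the direction of the ratio as assumptions), the hidden-node estimate becomes a one-liner:

```python
def hidden_nodes(T, i1, i3, ratio=10):
    """Rule-of-thumb hidden-node count from T ~= ratio * I2 * (I1 + I3)."""
    return T / (ratio * (i1 + i3))

# e.g. T = 1000 training samples, 8 input and 2 output nodes (made-up sizes)
print(hidden_nodes(1000, 8, 2))   # -> 10.0
```

In practice the result is rounded and treated only as a starting point for experimentation, since generalization also depends on the data itself.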
2. Momentum coefficient a :
a. To reduce the training time we use the momentum factor because it enhances the training process.
b. The influence of momentum on weight change is :
[ΔW]^(n+1) = -η(∂E/∂W) + α[ΔW]^n
where -η(∂E/∂W) is the weight change without momentum and α[ΔW]^n is the momentum term.
Fig. 2.34.1. Influence of momentum term on weight change.
c. The momentum also overcomes the effect of local minima.
d. The use of momentum term will carry a weight change process through one or more local minima and get it into the global minima.

3. Sigmoidal gain à :
a. When the weights become large and force the neuron to operate in a region where the sigmoidal function is very flat, a better method of coping with network paralysis is to adjust the sigmoidal gain.
b. By decreasing this scaling factor, we effectively spread out the sigmoidal function over a wide range so that training proceeds faster.
4. Local minima:
a. One of the most practical solutions involves the introduction of a shock which changes all weights by specific or random amounts.
b. If this fails, then the most practical solution is to rerandomize the weights and start the training all over.
UNIT
3 Evaluating Hypotheses

CONTENTS
Part-1 : Evaluating hypotheses, 3-2G to 3-5G
Estimating Hypotheses
Accuracy
Part-2 : Basics of Sampling. 3-5G to 3-9G
Theory, Comparing
Learning Algorithm
Part-3 : Bayesian Learning, Bayes 3-9G to 3-19G
Theorem, Concept Learning,
Bayes Optimal Classifier
Part-4 : Naive Bayes Classifier, 3-19G to 3-26G
Bayesian Belief Networks,
EM Algorithm


PART- 1

Evaluating Hypotheses, Estimating Hypotheses Accuracy.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 3.1. Explain hypothesis with its characteristics and importance.
Answer
Hypothesis (h) :
1. A hypothesis is a function that describes the target in supervised machine learning.
2. The hypothesis depends upon the data and also upon the restrictions and bias that we have imposed on the data.
3. A hypothesis is a tentative relationship between two or more variables which directs the research activity.
4. A hypothesis is a testable prediction which is expected to occur. It can be a false or a true statement that is tested in the research to check its authenticity.
Characteristics of hypothesis :
1. Empirically testable
2. Simple and clear
3. Specific and relevant
4. Predictable
5. Manageable
Importance of hypothesis :
1. It gives a direction to the research.
2. It specifies the focus of the researcher.
3. It helps in devising research techniques.
4. It prevents blind research.
5. It ensures accuracy and precision.
6. It saves resources i.e., time, money and energy.

Que 3.2. Explain different types of hypotheses.



Answer
Different types of hypotheses are :
1. Simple hypothesis : A simple hypothesis is a hypothesis that reflects a relationship between two variables, i.e., an independent and a dependent variable.
Examples :
i. Higher the unemployment, higher would be the rate of crime in society.
ii. Lower the use of fertilizers, lower would be the agricultural productivity.
iii. Higher the poverty in a society, higher would be the rate of crimes.
2. Complex hypothesis : A complex hypothesis is a hypothesis that reflects a relationship among more than two variables.
Examples :
i. Higher the poverty, higher the illiteracy in a society, higher will be the rate of crime (three variables : two independent variables and one dependent variable).
ii. Lower the use of fertilizers, improved seeds and modern equipments, lower would be the agricultural productivity (four variables : three independent variables and one dependent variable).
iii. Higher the illiteracy in a society, higher will be the poverty and crime rate (three variables : one independent variable and two dependent variables).
3. Working hypothesis :
i. A hypothesis that is accepted to be put to test and work in a research is called a working hypothesis.
ii. It is a hypothesis that is assumed to be suitable to explain certain facts and relationships of phenomena.
iii. It is supposed that this hypothesis would generate a productive theory, and it is accepted to be put to test for investigation.
iv. It can be any hypothesis that is processed for work during the research.
4. Alternative hypothesis :
i. If the working hypothesis is proved wrong or rejected, another hypothesis (to replace the working hypothesis) is formulated to be tested to generate the desired results; this is known as an alternate hypothesis.
ii. It is an alternate assumption (a relationship or an explanation) which is adopted after the working hypothesis fails to generate the required theory. The alternative hypothesis is denoted by H1.

5. Null hypothesis : A null hypothesis is a hypothesis that has no


relationship between variables. It negates association between variables.
Examples:
Poverty has nothing to do with the rate of crime in a society.
ii. Illiteracy has nothing to do with the rate of unemployment in a
society.
iii. A null hypothesis is made with an intention where the researcher wants to disapprove, reject or nullify the null hypothesis to confirm a relationship between the variables.
iv. A null hypothesis is made for a reverse strategy : to prove it wrong in order to confirm that there is a relationship between the variables. A null hypothesis is denoted by H0.
6. Statistical hypothesis :
i. A hypothesis that can be verified statistically is known as a statistical hypothesis.
ii. It means that one can easily verify it using quantitative techniques to generate statistical data.
iii. The variables in a statistical hypothesis can be transformed into quantifiable sub-variables to test it statistically.
7. Logical hypothesis :
i. A hypothesis that can be verified logically is known as a logical
hypothesis.
ii. It is a hypothesis expressing a relationship whose interlinks can be joined on the basis of logical explanation.
iii. It is verified by logical evidence.
iv. Being verified logically does not necessarily mean that it cannot be verified statistically.

Que 3.3. Describe the difficulties faced in estimating the accuracy


of hypotheses.
Answer
Difficulties faced in estimating the accuracy of hypotheses :
1. Bias in the estimate :
a The observed accuracy of the learned hypothesis over the training
examples is a poor estimator of its accuracy over future examples.
b. Because the learned hypothesis was derived from these examples, it will provide an optimistically biased estimate of hypothesis accuracy over future examples.

c. To obtain an unbiased estimate of future accuracy, we test the hypothesis on some set of test examples chosen independently of the training examples and the hypothesis.
2. Variance in the estimate:
a. Even if the hypothesis accuracy is measured over an unbiased set of test examples, independent of the training examples, the measured accuracy can still vary from the true accuracy, depending on the makeup of the particular set of test examples.
b. The smaller the set of test examples, the greater the expected
variance.

PART-2

Basies of Sampling Theory, Comparing Learning Algorithm.


Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 3.4. What are the steps used in the process of sampling ?


Answer
Steps used in the process of sampling are :
1. Identifying the population set.
2. Determination of the size of our sample set.
3. Providing a medium for the basis of selection of samples from the population medium.
4. Picking out samples from the medium using sampling techniques like simple random, systematic or stratified sampling.
5. Checking whether the formed sample set contains elements (actually matches the different attributes of the population set) without large variations in between.
6. Checking for errors or inaccurate estimations in the formed sample set that may or may not have occurred.
7. The set which we get after performing the above steps contributes to
the sample set.
Que 3.5. Write short note on sampling frame.

Answer
1. A sampling frame is the collection of all the sample elements taken into
observation.
2. It may often happen that all elements in the sampling frame did not
take part in the actual statistics.
3. In that case, the elements that took part in the study are called samples,
as distinct from the potential elements that could have been in the study
but did not. Thus, a sampling frame is the potential list of elements on
which we will perform statistics.
4. A sampling frame is important because it will help in predicting the
reaction of the statistics result with the population set.
5. A sampling frame is not just a random set of handpicked elements;
rather it even consists of identifiers which help to identify each and
every element in the set.

Que 3.6. What are different methods of sampling ?

Answer
Different methods of sampling are :
1. Simple Random Sampling (SRS) :
a. Simple random sampling is the elementary form of sampling.
b. In this method, all the elements in populations are first divided into
random sets of equal sizes.
c. Random sets have no defining property among themselves, i.e.,
one set cannot be identified from another set based on some specific
identifier.
d. Thus every element has an equal probability of being selected, i.e.,
P(of getting selected) = 1/2
e. The basic methods for employing SRS are :
i. Choose the population set.
ii. Identify the basis of sampling.
iii. Use random number generators to pick an element
from each set.
2. Systematic sampling :
a. Systematic sampling is also known as a type of probability sampling.
b. It is more accurate than SRS and also the standard error formation
percentage is very low but not error free.

c. In this method, first, the population tray elements are arranged
based on a specific order or scheme known as being sorted.
d. It can be of any order, which totally depends upon the person
performing the statistics.
e. The elements are arranged either in ascending, descending,
lexicographic or any other known order deemed fit by the
tester.
f. After being arranged, the sample elements are picked on the basis
of a pre-defined interval set or function.
P(of getting selected) = depends upon the ordered population tray after
it has been sorted
g. The basic methods of employing systematic random sampling are :
i. Choosing the population set wisely.
ii. Checking whether systematic sampling will be the efficient
method or not.
iii. If yes, then application of a sorting method to get an ordered
set of population elements.
iv. Choosing a periodicity to crawl out elements.
3. Stratified sampling :
a. Stratified sampling is a hybrid method concerning both simple
random sampling as well as systematic sampling.
b. It is one of the most advanced types of sampling method available,
providing accurate results to the tester.
c. In this method, the population tray is divided into sub-segments
also known as strata (singular : stratum).
d. Each stratum can have its own unique property. After being
divided into different sub-strata, SRS or systematic sampling can
be used to create and pick out samples for performing statistics.
e. The elementary methods for stratified sampling are :
i. Choosing the population tray.
ii. Checking for periodicity or any other features, so that they
can be divided into different strata.
iii. Dividing the population tray into sub-sets and sub-groups on
the basis of a selective property.
iv. Using SRS or systematic sampling on each individual stratum to
form the sample frame.
v. We can even apply different sampling methods to different
sub-sets.
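The three methods above can be sketched in a few lines of Python; the population, the even/odd strata property and the sample sizes below are hypothetical choices for illustration only.

```python
import random

population = list(range(1, 101))  # hypothetical frame of 100 element IDs

# 1. Simple random sampling: every element has equal selection probability.
srs_sample = random.sample(population, 10)

# 2. Systematic sampling: sort the frame, then pick every k-th element
#    starting from a random offset inside the first interval.
k = len(population) // 10                     # sampling interval
start = random.randrange(k)
systematic_sample = sorted(population)[start::k]

# 3. Stratified sampling: split the frame into strata sharing a property,
#    then apply SRS inside each stratum.
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified_sample = [x for s in strata.values() for x in random.sample(s, 5)]
```

Each call returns a sample of 10 elements drawn from the same frame, so the three techniques can be compared on equal footing.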

Que 3.7. What are the advantages and disadvantages of different


methods of sampling ?
Answer
Advantages of Simple Random Sampling (SRS) :
1. Less exhaustive with respect to time as it is the most elementary form of
sampling.
2. Very useful for a population set with a very small number of elements.
3 SRS can be employed anywhere, anytime even without the use of special
random generators.
Disadvantages of Simple Random Sampling (SRS) :
1. Not efficient for large population sets.
2. There are chances of bias and then SRS would not be able to provide a
correct result.
3. It does not provide a specific identifier to separate statistically similar
samples.
Advantages of systematic sampling :
1. Accuracy is higher than SRS.
2. Standard probability of error is lesser.
3. No problem for bias to creep in during creation of sample frame.
Disadvantages of systematic sampling:
1. Not very efficient when it comes to time.
2. Periodicity in population tray elements can lead to absurd results.
3. Systematic sampling can either provide the most accurate result or an
impossible one.
Advantages of stratified sampling :
1. It provides results with high accuracy measurements.
2. Different results can be derived just by changing the sampling method.
3. This method also compares different strata when samples are being
drawn.
Disadvantages of stratified sampling :
1. Inefficient and expensive when it comes to resources as well as money.
2. This method will fail only where homogeneity in elements is present.
Que 3.8. Differentiate between supervised and unsupervised
learning.
Answer
Refer Q. 1.25, Page 1-22G, Unit-1.

Que 3.9. Differentiate between unsupervised and reinforcement


learning.
Answer
1. Definition : Unsupervised learning uses no external teacher or
pre-trained data, whereas reinforcement learning works by
interacting with the environment.
2. Tasks : Unsupervised learning performs clustering and association,
whereas reinforcement learning performs exploitation and exploration.
3. Mapping between input and output : Unsupervised learning finds the
underlying patterns rather than the mapping, whereas reinforcement
learning gets constant feedback from the user, for example by suggesting
a few news articles and then building a knowledge graph.
4. Platform : Unsupervised learning is operated with interactive software
or applications, whereas reinforcement learning supports and works
better in AI where human interaction is prevalent.
5. Algorithms : Many algorithms exist for unsupervised learning, whereas
reinforcement learning uses neither supervised nor unsupervised
algorithms.
6. Integration : Unsupervised learning runs on any platform or with any
applications, whereas reinforcement learning runs with any hardware
or software devices.

PART-3

Bayesian Learning : Bayes Theorem, Concept Learning, Bayes
Optimal Classifier.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 3.10. Explain Bayesian learning. Explain two category


classification.

Answer
Bayesian learning :
1. Bayesian learning is a fundamental statistical approach to the problem
of pattern classification.
2. This approach is based on quantifying the tradeoffs between various
classification decisions using probability and the costs that accompany
such decisions.
3. Because the decision problem is solved on the basis of probabilistic terms,
it is assumed that all the relevant probabilities are known.
4. For this we define the state of nature of the things present in the
particular pattern. We denote the state of nature by ω.
5. For example, if there are a number of balls which are red and blue in
colour, then ω = ω1 when the ball is red and ω = ω2 when the ball is blue.
Because the state of nature is so unpredictable, we consider ω to be a
variable that must be described probabilistically.
6. If one ball is red then we can say that the next ball is equally likely to be
red or blue.
7. We assume that there is a prior probability p(ω2) that the next ball is
blue.
8. These prior probabilities reflect the prior knowledge of how likely a ball
obtained is red or blue before the ball actually appears.
9. Now after defining the state of nature and prior probabilities, the decision
has to be made that a particular ball is present in which class.
10. A decision rule is used to take the decision as :
Decide ω1 if p(ω1) > p(ω2), otherwise decide ω2.
Two category classification :
1. Let ω1, ω2 be the two classes of the patterns. It is assumed that the a
priori probabilities p(ω1) and p(ω2) are known.
2. Even if they are not known, they can easily be estimated from the
available training feature vectors.
3. If N is the total number of available training patterns and N1, N2 of them
belong to ω1 and ω2, respectively, then p(ω1) = N1/N and p(ω2) = N2/N.
4. The conditional probability density function p(x|ωi), i = 1, 2 is also
assumed to be known, which describes the distribution of the feature
vectors in each of the classes.
5. The feature vectors can take any value in the l-dimensional feature
space.
6. Density functions p(x|ωi) become probabilities and will be denoted by
p(x|ωi) when the feature vectors can take only discrete values.

7. Consider the conditional probability,
p(ωi|x) = p(x|ωi) p(ωi) / p(x)        ...(3.10.1)
where p(x) is the probability density function of x and for which we have
p(x) = Σ(i=1 to 2) p(x|ωi) p(ωi)        ...(3.10.2)
8. Now, the Bayes classification rule can be defined as :
a. If p(ω1|x) > p(ω2|x), x is classified to ω1
b. If p(ω1|x) < p(ω2|x), x is classified to ω2        ...(3.10.3)
9. In the case of equality the pattern can be assigned to either of the two
classes. Using equation (3.10.1), the decision can equivalently be based
on the inequalities :
a. p(x|ω1) p(ω1) > p(x|ω2) p(ω2)
b. p(x|ω1) p(ω1) < p(x|ω2) p(ω2)        ...(3.10.4)
10. Here p(x) is not taken because it is the same for all classes and it does not
affect the decision.
11. Further, if the a priori probabilities are equal, i.e.,
p(ω1) = p(ω2) = 1/2, then eq. (3.10.4) becomes :
a. p(x|ω1) > p(x|ω2)
b. p(x|ω1) < p(x|ω2)
12. For example, in Fig. 3.10.1, two equiprobable classes are presented, which
shows the variations of p(x|ωi), i = 1, 2 as functions of x for the simple
case of a single feature (l = 1).
Fig. 3.10.1. Bayesian classifier for the case of two equiprobable classes
(the feature space is partitioned into regions R1 and R2 at threshold x0).
13. The dotted line at x0 is a threshold which partitions the space into two
regions, R1 and R2. According to the Bayes decision rule, for all values of x
in R1 the classifier decides ω1 and for all values in R2 it decides ω2.
14. From Fig. 3.10.1, it is obvious that errors are unavoidable. There
is a finite probability for an x to lie in the R2 region and at the same time
to belong to class ω1. Then there is an error in the decision.
15. The total probability, Pe, of committing a decision error for two
equiprobable classes is given by,
Pe = (1/2) ∫(-∞ to x0) p(x|ω2) dx + (1/2) ∫(x0 to ∞) p(x|ω1) dx
which is equal to the total shaded area under the curves in Fig. 3.10.1.
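As an illustration of the decision rule in point 8, the sketch below assumes Gaussian class-conditional densities p(x|ωi); this Gaussian choice, and the means, priors and test points, are assumptions for the example only (the rule itself holds for any densities).

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Class-conditional density p(x | w_i), assumed Gaussian here."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def bayes_decide(x, prior1, prior2, mu1, mu2, sigma=1.0):
    """Decide w1 if p(x|w1) p(w1) > p(x|w2) p(w2), else w2 (eq. 3.10.4)."""
    g1 = gaussian_pdf(x, mu1, sigma) * prior1
    g2 = gaussian_pdf(x, mu2, sigma) * prior2
    return "w1" if g1 > g2 else "w2"

# Equiprobable classes centred at 0 and 4: the threshold x0 lies at 2.
print(bayes_decide(1.0, 0.5, 0.5, mu1=0.0, mu2=4.0))  # w1
print(bayes_decide(3.0, 0.5, 0.5, mu1=0.0, mu2=4.0))  # w2
```

With equal priors the comparison reduces to the class-conditional densities alone, exactly as in point 11.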

Que 3.11. Explain how the decision error for Bayesian


classification can be minimized.

Answer
1. The Bayesian classifier can be made optimal by minimizing the
classification error probability.
2. In Fig. 3.10.1, it is observed that when the threshold is moved away from
x0, the corresponding shaded area under the curves always increases.
3. Hence, we have to decrease this shaded area to minimize the error.
4. Let R1 be the region of the feature space for ω1 and R2 be the
corresponding region for ω2.
5. Then an error will occur if
x ∈ R1 although it belongs to ω2, or if x ∈ R2 although it belongs to ω1, i.e.,
Pe = p(x ∈ R1, ω2) + p(x ∈ R2, ω1)        ...(3.11.1)
6. Pe can be written as,
Pe = p(x ∈ R1 | ω2) p(ω2) + p(x ∈ R2 | ω1) p(ω1)
   = p(ω2) ∫(R1) p(x|ω2) dx + p(ω1) ∫(R2) p(x|ω1) dx        ...(3.11.2)
7. Using the Bayes rule,
Pe = ∫(R1) p(ω2|x) p(x) dx + ∫(R2) p(ω1|x) p(x) dx        ...(3.11.3)
8. The error will be minimized if the partitioning regions R1 and R2 of the
feature space are chosen so that
R1 : p(ω1|x) > p(ω2|x)
R2 : p(ω2|x) > p(ω1|x)        ...(3.11.4)
9. Since the union of the regions R1, R2 covers all the space, we have
∫(R1) p(ω1|x) p(x) dx + ∫(R2) p(ω1|x) p(x) dx = 1        ...(3.11.5)
10. Combining equations (3.11.3) and (3.11.5), we get,
Pe = p(ω1) - ∫(R1) (p(ω1|x) - p(ω2|x)) p(x) dx        ...(3.11.6)
11. Thus, the probability of error is minimized if R1 is the region of space in
which p(ω1|x) > p(ω2|x). Then R2 becomes the region where the reverse
is true.
12. In a classification task with M classes ω1, ω2, ..., ωM, an unknown pattern,
represented by the feature vector x, is assigned to class ωi if p(ωi|x) >
p(ωj|x) for all j ≠ i.
Que 3.12. Consider the Bayesian classifier for the uniformly
distributed classes, where :
P(x|ω1) = 1/(a2 - a1) for x ∈ [a1, a2], and 0 otherwise
P(x|ω2) = 1/(b2 - b1) for x ∈ [b1, b2], and 0 otherwise
Show the classification results for some values of a and b.

Answer
Typical cases are presented in Fig. 3.12.1 : p(x|ω1) is a rectangle of height
1/(a2 - a1) over [a1, a2] and p(x|ω2) a rectangle of height 1/(b2 - b1) over
[b1, b2]; the panels show disjoint, overlapping and nested intervals.
Fig. 3.12.1.
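A minimal numerical check of the overlapping case can be written directly from the densities, assuming equal priors (so the comparison reduces to p(x|ωi) alone); the interval endpoints below are made-up values for illustration.

```python
def p_uniform(x, lo, hi):
    """Uniform class-conditional density: 1/(hi - lo) on [lo, hi], else 0."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def classify(x, a1, a2, b1, b2):
    """Assign x to the class with the larger density (equal priors assumed)."""
    p1, p2 = p_uniform(x, a1, a2), p_uniform(x, b1, b2)
    if p1 > p2:
        return "w1"
    if p2 > p1:
        return "w2"
    return "either"   # equal densities: x may be assigned to either class

# Overlapping intervals [0, 4] and [2, 8]:
print(classify(1, 0, 4, 2, 8))  # w1 : only w1's density is non-zero
print(classify(3, 0, 4, 2, 8))  # w1 : 1/4 > 1/6 inside the overlap
print(classify(6, 0, 4, 2, 8))  # w2
```

Note that inside the overlap the narrower interval wins, because its density 1/(a2 - a1) is higher.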

Que 3.13. What is Bayes theorem ? Explain.



Answer
1. Let x be a thing in a pattern. In Bayesian terms, x is considered "evidence".
Let H be some hypothesis, such as that x belongs to a specified class C.
2. For a classification problem, p(H|x) is determined, which is the probability
that the hypothesis holds given the observed x.
3. In other words, the probability that x belongs to class C is determined,
given that the description of x is known.
4. p(H|x) is the posterior probability of H conditioned on x.
5. For example, suppose there are a number of customers described by the
attributes age and income, respectively, and that x is a 35 year old
customer with an income of Rs. 40,000.
6. Suppose that H is the hypothesis that our customer will buy a computer.
Then p(H|x) reflects the probability that customer x will buy a computer
given that the customer's age and income are known.
7. Similarly, p(x|H) is the posterior probability of x conditioned on H.
8. It is the probability that customer x is 35 years old and earns Rs. 40,000,
given that we know that the customer will buy a computer. p(x) is the
prior probability of x.
9. It is the probability that a person from the set of customers is 35 years
old and earns Rs. 40,000.
10. Bayes theorem is useful in that it provides a way of calculating the
posterior probability p(H|x) from p(H), p(x|H) and p(x) :
p(H|x) = p(x|H) p(H) / p(x)
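A quick numerical illustration of the theorem with the customer example; all three input probabilities below are hypothetical figures chosen for the sketch, not taken from any data set.

```python
# p(H): fraction of all customers who buy a computer (assumed).
# p(x|H): fraction of buyers who are 35-year-olds earning Rs. 40,000 (assumed).
# p(x): fraction of all customers with that profile (assumed).
p_H = 0.4
p_x_given_H = 0.15
p_x = 0.10

# Bayes theorem: p(H|x) = p(x|H) p(H) / p(x)
p_H_given_x = p_x_given_H * p_H / p_x
print(round(p_H_given_x, 4))  # 0.6
```

So under these assumed figures, knowing the customer's profile raises the probability of a purchase from 0.4 to 0.6.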

Que 3.14. Write short note on Bayes classifier.


OR
Explain how classifieation is done by using Bayes classifier.
Answer
1. A Bayes classifier is a simple probabilistic classifier based on applying
Bayes theorem (from Bayesian statistics) with strong (Naive)
independence assumptions.
2. In simple terms, a Naive Bayes classifier assumes that the presence
(or absence) of a particular feature of a class is unrelated to the presence
(or absence) of any other feature.
3. Depending on the precise nature of the probability model, Naive Bayes
classifiers can be trained very efficiently in a supervised learning setting.
4. In many practical applications, parameter estimation for Naive Bayes
models uses the method of maximum likelihood; in other words, one
can work with the Naive Bayes model without believing in Bayesian
probability or using any Bayesian methods.
5. An advantage of the Naive Bayes classifier is that it requires a small
amount of training data to estimate the parameters (means and
variances of the variables) necessary for classification.
6. The perceptron bears a certain relationship to a classical pattern
classifier known as the Bayes classifier.
7 When the environment is Gaussian, the Bayes classifier reduces to a
linear classifier.
In the Bayes classifier, or Bayes hypothesis testing procedure, we
minimize the average risk, denoted by R. For a two-class problem,
represented by classes C1 and C2, the average risk is defined as :
R = c11 P1 ∫(H1) pX(x|C1) dx + c22 P2 ∫(H2) pX(x|C2) dx
  + c21 P1 ∫(H2) pX(x|C1) dx + c12 P2 ∫(H1) pX(x|C2) dx
where the various terms are defined as follows :
Pi = Prior probability that the observation vector x is drawn from
subspace Hi, with i = 1, 2, and P1 + P2 = 1
cij = Cost of deciding in favour of class Ci, represented by subspace Hi,
when class Cj is true, with i, j = 1, 2
pX(x|Ci) = Conditional probability density function of the random vector X
8. Fig. 3.14.1(a) depicts a block diagram representation of the Bayes
classifier. The important points in this block diagram are twofold :
a. The data processing in designing the Bayes classifier is confined
entirely to the computation of the likelihood ratio Λ(x). The input
vector x is fed to a likelihood ratio computer, and a comparator
assigns x to class 1 if Λ(x) > ξ, otherwise to class 2; the
log-likelihood version compares log Λ(x) against log ξ instead.
Fig. 3.14.1. Two equivalent implementations of the Bayes classifier :
(a) Likelihood ratio test, (b) Log-likelihood ratio test.

b. This computation is completely invariant to the values assigned to
the prior probabilities and costs involved in the decision-making
process. These quantities merely affect the value of the threshold ξ.
c. From a computational point of view, we find it more convenient to
work with the logarithm of the likelihood ratio rather than the
likelihood ratio itself.
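The log-likelihood ratio test of Fig. 3.14.1(b) can be sketched as follows, assuming Gaussian class-conditional densities with equal variance (an illustrative assumption); the threshold ξ = P2 (c12 - c22) / (P1 (c21 - c11)) follows from minimizing the average risk R above.

```python
import math

def log_likelihood_ratio(x, mu1, mu2, sigma):
    """log Λ(x) for two Gaussian classes with equal variance."""
    return ((x - mu2) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)

def bayes_test(x, p1, p2, c11, c22, c12, c21, mu1, mu2, sigma=1.0):
    """Assign x to class 1 iff log Λ(x) > log ξ (Fig. 3.14.1(b))."""
    log_xi = math.log(p2 * (c12 - c22)) - math.log(p1 * (c21 - c11))
    return 1 if log_likelihood_ratio(x, mu1, mu2, sigma) > log_xi else 2

# Zero cost for correct decisions, unit cost for errors: xi = P2/P1 = 1.
print(bayes_test(0.5, 0.5, 0.5, 0, 0, 1, 1, mu1=0.0, mu2=2.0))  # 1
print(bayes_test(1.5, 0.5, 0.5, 0, 0, 1, 1, mu1=0.0, mu2=2.0))  # 2
```

Note how the priors and costs enter only through the constant log ξ, exactly as point (b) states: the likelihood-ratio computation itself never changes.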

Que 3.15. Discuss Bayes classifier using some example in detail.

Answer
1. Let D be a training set of features and their associated class labels. Each
feature is represented by an n-dimensional attribute vector
X = (x1, x2, ..., xn), depicting n measurements made on the feature from
n attributes, respectively A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a feature X, the
classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X. That is, the classifier predicts
that X belongs to class Ci if and only if,
p(Ci|X) > p(Cj|X) for 1 ≤ j ≤ m, j ≠ i
Thus, we maximize p(Ci|X). The class Ci for which p(Ci|X) is maximized
is called the maximum posterior hypothesis. By Bayes theorem,
p(Ci|X) = p(X|Ci) p(Ci) / p(X)
3. As p(X) is constant for all classes, only p(X|Ci) p(Ci) needs to be
maximized. If the class prior probabilities are not known then it is
commonly assumed that the classes are equally likely, i.e.,
p(C1) = p(C2) = ... = p(Cm), and therefore p(X|Ci) is maximized. Otherwise
p(X|Ci) p(Ci) is maximized.
4. a. Given data sets with many attributes, the computation of p(X|Ci)
will be extremely expensive.
b. To reduce computation in evaluating p(X|Ci), the assumption of
class conditional independence is made.
c. This presumes that the values of the attributes are conditionally
independent of one another, given the class label of the feature.
Thus, p(X|Ci) = Π(k=1 to n) p(xk|Ci)
             = p(x1|Ci) × p(x2|Ci) × ... × p(xn|Ci)
d. The probabilities p(x1|Ci), p(x2|Ci), ..., p(xn|Ci) are easily estimated
from the training features. Here xk refers to the value of attribute
Ak; for each attribute, it is checked whether the attribute is
categorical or continuous valued.
e. For example, to compute p(X|Ci) we consider,
i. If Ak is categorical then p(xk|Ci) is the number of features of
class Ci in D having the value xk for Ak, divided by |Ci,D|, the
number of features of class Ci in D.
ii. If Ak is continuous valued then the continuous valued attribute
is typically assumed to have a Gaussian distribution with a mean
μ and standard deviation σ, defined by,
g(x) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))
so that p(xk|Ci) = g(xk).
f. There is a need to compute the mean μ and the standard deviation
σ of the values of attribute Ak for the training set of class Ci. These
values are used to estimate p(xk|Ci).
g. For example, let X = (35, Rs. 40,000) where A1 and A2 are the
attributes age and income, respectively. Let the class label attribute
be buys-computer.
h. The associated class label for X is yes (i.e., buys-computer = yes).
Let us suppose that age has not been discretized and therefore exists
as a continuous valued attribute.
i. Suppose that from the training set, we find that customers in D who
buy a computer are 38 ± 12 years of age. In other words, for attribute
age and this class, we have μ = 38 and σ = 12.
5. In order to predict the class label of X, p(X|Ci) p(Ci) is evaluated for each
class Ci. The classifier predicts that the class label of X is the class Ci if
and only if
p(X|Ci) p(Ci) > p(X|Cj) p(Cj) for 1 ≤ j ≤ m, j ≠ i.
The predicted class label is the class Ci for which p(X|Ci) p(Ci) is the
maximum.

Que 3.16. Let blue, green, and red be three classes of objects with
prior probabilities given by P(blue) = 1/4, P(green) = 1/2, P(red) = 1/4.
Let there be three types of objects : pencils, pens, and paper. Let the
class-conditional probabilities of these objects be given as follows.
Use the Bayes classifier to classify pencil, pen and paper.
P(pencil/green) = 1/3   P(pen/green) = 1/2   P(paper/green) = 1/6
P(pencil/blue) = 1/2    P(pen/blue) = 1/6    P(paper/blue) = 1/3
P(pencil/red) = 1/6     P(pen/red) = 1/3     P(paper/red) = 1/2

Answer
As per Bayes rule, the evidence term for each object is
P(object) = P(object/green) P(green) + P(object/blue) P(blue)
+ P(object/red) P(red).
For pencil :
P(pencil) = 1/3 × 1/2 + 1/2 × 1/4 + 1/6 × 1/4 = 1/6 + 1/8 + 1/24 = 0.33
P(green/pencil) = P(pencil/green) P(green) / P(pencil)
= (1/3 × 1/2) / 0.33 = (1/6) / 0.33 = 0.505
P(blue/pencil) = P(pencil/blue) P(blue) / P(pencil)
= (1/2 × 1/4) / 0.33 = (1/8) / 0.33 = 0.378
P(red/pencil) = P(pencil/red) P(red) / P(pencil)
= (1/6 × 1/4) / 0.33 = (1/24) / 0.33 = 0.126
Since P(green/pencil) has the highest value, pencil belongs to
class green.
For pen :
P(pen) = 1/2 × 1/2 + 1/6 × 1/4 + 1/3 × 1/4 = 1/4 + 1/24 + 1/12 = 0.375
P(green/pen) = (1/2 × 1/2) / 0.375 = 0.25 / 0.375 = 0.666
P(blue/pen) = (1/6 × 1/4) / 0.375 = (1/24) / 0.375 = 0.111
P(red/pen) = (1/3 × 1/4) / 0.375 = (1/12) / 0.375 = 0.222
Since P(green/pen) has the highest value, pen belongs to
class green.
For paper :
P(paper) = 1/6 × 1/2 + 1/3 × 1/4 + 1/2 × 1/4 = 1/12 + 1/12 + 1/8 = 0.291
P(green/paper) = (1/6 × 1/2) / 0.291 = (1/12) / 0.291 = 0.286
P(blue/paper) = (1/3 × 1/4) / 0.291 = (1/12) / 0.291 = 0.286
P(red/paper) = (1/2 × 1/4) / 0.291 = (1/8) / 0.291 = 0.429
Since P(red/paper) has the highest value, paper belongs to
class red.
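The whole computation can be verified mechanically; this sketch reproduces the classification above from the given priors and class-conditional probabilities (small rounding differences aside).

```python
priors = {"blue": 0.25, "green": 0.5, "red": 0.25}
cond = {  # P(object | class), from the question statement
    "pencil": {"green": 1/3, "blue": 1/2, "red": 1/6},
    "pen":    {"green": 1/2, "blue": 1/6, "red": 1/3},
    "paper":  {"green": 1/6, "blue": 1/3, "red": 1/2},
}

for obj, p_obj_given in cond.items():
    # Evidence P(object), then posterior P(class | object) by Bayes rule.
    evidence = sum(p_obj_given[c] * priors[c] for c in priors)
    posterior = {c: p_obj_given[c] * priors[c] / evidence for c in priors}
    best = max(posterior, key=posterior.get)
    print(obj, "->", best)
# pencil -> green, pen -> green, paper -> red
```

Since the evidence term is the same for every class of a given object, the winning class could equally be found by comparing the numerators P(object/class) P(class) alone.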

PART-4
Naive Bayes Classifier, Bayesian Belief Network, EM Algorithm.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 3.17. Explain Naive Bayes classifier.



Answer
1. The Naive Bayes model is the most common Bayesian network model used
in machine learning.
2. Here, the class variable C is the root, which is to be predicted, and the
attribute variables Xi are the leaves.
3. The model is Naive because it assumes that the attributes are
conditionally independent of each other, given the class.
4. Assuming Boolean variables, the parameters are :
θ = P(C = true), θi1 = P(Xi = true | C = true),
θi2 = P(Xi = true | C = false)
5. Naive Bayes models can be viewed as Bayesian networks in which each
Xi has C as the sole parent and C has no parents.
6. A Naive Bayes model with Gaussian P(Xi|C) is equivalent to a mixture
of Gaussians with diagonal covariance matrices.
7. While mixtures of Gaussians are widely used for density estimation in
continuous domains, Naive Bayes models are used in discrete and mixed
domains.
8. Naive Bayes models allow for very efficient inference of marginal and
conditional distributions.
9. Naive Bayes learning has no difficulty with noisy data and can give
more appropriate probabilistic predictions.
Fig. 3.17.1. The learning curve for Naive Bayes learning : proportion
correct on the test set versus training set size, compared with a
decision tree on the same data.

Que 3.18. Consider a two-class (Tasty or non-Tasty) problem with
the following training data. Use the Naive Bayes classifier to classify
the pattern :
"Cook = Asha, Health-Status = Bad, Cuisine = Continental".
Cook Health-Status Cuisine Tasty
Asha Bad Indian Yes
Asha Good Continental Yes

Sita Bad Indian No


Sita Good Indian Yes

Usha Bad Indian Yes

Usha Bad Continental No

Sita Bad Continental No

Sita Good Continental Yes

Usha Good Indian Yes

Usha Good Continental No

Answer
Frequency counts from the training data :
Cook : Asha (Yes 2, No 0), Sita (Yes 2, No 2), Usha (Yes 2, No 2)
Health-status : Bad (Yes 2, No 3), Good (Yes 4, No 1)
Cuisine : Indian (Yes 4, No 1), Continental (Yes 2, No 3)
Tasty : Yes 6, No 4
Corresponding conditional probabilities :
Cook : Asha (2/6, 0), Sita (2/6, 2/4), Usha (2/6, 2/4)
Health-status : Bad (2/6, 3/4), Good (4/6, 1/4)
Cuisine : Indian (4/6, 1/4), Continental (2/6, 3/4)
Tasty : P(Yes) = 6/10, P(No) = 4/10
Likelihood of yes = 2/6 × 2/6 × 2/6 × 6/10 = 0.022
Likelihood of no = 0 × 3/4 × 3/4 × 4/10 = 0
Therefore, the prediction is Tasty.
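The counts above plug straight into the Naive Bayes product; a small sketch with the attribute counts hard-coded from the table:

```python
# Class counts and per-attribute counts for the queried pattern
# (Cook = Asha, Health-Status = Bad, Cuisine = Continental).
counts  = {"yes": 6, "no": 4}
asha    = {"yes": 2, "no": 0}
bad     = {"yes": 2, "no": 3}
contl   = {"yes": 2, "no": 3}

def likelihood(label):
    """P(Asha|label) * P(Bad|label) * P(Continental|label) * P(label)."""
    return (asha[label] / counts[label]
            * bad[label] / counts[label]
            * contl[label] / counts[label]
            * counts[label] / 10)

print(round(likelihood("yes"), 3))  # 0.022
print(likelihood("no"))             # 0.0
```

The "no" likelihood vanishes because Asha never appears with a non-Tasty label in the training data; in practice, Laplace smoothing is often used to avoid such zero counts.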

Que 3.19. Describe Bayesian networks. How are the Bayesian


networks powerful representation for uncertainty knowledge ?
Answer
A Bayesian network is a directed acyclic graph in which each node is
annotated with quantitative probability information.
The full specification is as follows :
1. A set of random variables makes up the nodes of the network. Variables
may be discrete or continuous.
2. A set of directed links or arrows connects pairs of nodes. If there is an
arrow from node x to node y, x is said to be a parent of y.
3. Each node xi has a conditional probability distribution P(xi | Parents(xi))
that quantifies the effect of the parents on the node.
4. The graph has no directed cycles (and hence is a Directed Acyclic Graph
or DAG).
For example :
In Fig. 3.19.1, Weather is independent of the other three variables, and
Toothache and Catch are conditionally independent, given Cavity.

Weather Cavity

(Toothache Catch

Fig. 3.19.1. Asimple bayesian network.


a. A Bayesian network provides a complete description of the domain.
Every entry in the full joint probability distribution can be calculated
from the information in the network.
b Bayesian networks provide a concise way to represent conditional
independence relationships in the domain.
c. A Bayesian network specifies a full joint distribution; each joint entry is
defined as the product of the corresponding entries in the local conditional
distributions.
d. A Bayesian network is often exponentially smaller than the full joint
distribution.
e. Hybrid Bayesian networks, which include both discrete and continuous
variables, use a variety of canonical distributions.
f. Inference in Bayesian networks means computing the probability
distribution of a set of query variables, given a set of evidence variables.
Exact inference algorithms, such as variable elimination, evaluate sums
of products of conditional probabilities as efficiently as possible.
Bayesian networks are a powerful representation for uncertainty
knowledge :
1. Two variables A and B are called conditionally independent if P(A, B|C)
= P(A|C) P(B|C) or, equivalently, if P(A|B, C) = P(A|C).
2. Besides the foundational rules of computation for probabilities, the
following rules are also true :
a. Bayes Theorem : P(A|B) = P(B|A) P(A) / P(B)
b. Marginalization : P(B) = P(A, B) + P(¬A, B) = P(B|A) P(A) +
P(B|¬A) P(¬A)
c. Conditioning : P(A|B) = Σc P(A|B, C = c) P(C = c|B)
i. A variable in a Bayesian network is conditionally independent
of all non-successor variables. If none of X1, ..., X(i-1) is a
successor of Xi, we have P(Xi | X1, ..., X(i-1)) = P(Xi | Parents(Xi)).
This condition must be honored during the construction of a
network.
ii. During construction of a Bayesian network, the variables
should be ordered according to causality. First the causes, then
the hidden variables, and the diagnosis variables last.
d. Chain rule : P(X1, ..., Xn) = Π(i=1 to n) P(Xi | Parents(Xi))

3. Bayesian Network (BN) has been accepted as a powerful tool for common
knowledge representation and reasoning of partial beliefs under
uncertainty.
4. Bayesian networks utilize knowledge about the independence of variables
to simplify the model.
5. One of the most important features of Bayesian networks is the fact
that they provide an elegant mathematical structure for modeling
complicated relationships among random variables while keeping a
relatively simple visualization of these relationships.
Que 3.20. Write short note on Bayesian belief networks.
Answer
1 Bayesian belief networks specify joint conditional probability distributions.
2. They are also known as belief networks, Bayesian networks, or
probabilistic networks.
3. A belief network allows class conditional independencies to be defined
between subsets of variables.
4. It provides a graphical model of causal relationship on which learning
can be performed.
5. We can use a trained Bayesian network for classification.
6. There are two components that define a Bayesian belief network :
a. Directed acyclic graph :
i. Each node in a directed acyclic graph represents a random
variable.
ii. These variables may be discrete or continuous valued.
ii. These variables may correspond to the actual attribute given
in the data.
Directed acyclic graph representation : The following diagram shows a
directed acyclic graph for six Boolean variables.

(Family History) Smoker

Lung Cancer Emphysema

Positive Xray Dyspnea



i. The arc in the diagram allows representation of causal


knowledge.
For example, lung cancer is influenced by a person's family
history of lung cancer, as well as whether or not the person is
a smoker.
ii. It is worth noting that the variable Positive X-ray is independent
of whether the patient has a family history of lung cancer or
whether the patient is a smoker, given that we know the patient
has lung cancer.
b. Conditional probability table :
i. The conditional probability table for the values of the variable
LungCancer (LC), showing each possible combination of the values
of its parent nodes, FamilyHistory (FH) and Smoker (S), is as follows :
        FH,S   FH,¬S   ¬FH,S   ¬FH,¬S
LC      0.8    0.5     0.7     0.1
¬LC     0.2    0.5     0.3     0.9
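Reading such a table in code: the sketch below stores P(LC = true | FH, S) and combines it with made-up priors for FamilyHistory and Smoker (the table itself gives no priors) to evaluate one joint entry via the chain rule.

```python
# P(LC = true | FH, S), keyed by the truth values of the parents.
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}

# Hypothetical priors for the parent nodes, assumed independent here
# purely to illustrate the computation.
p_fh, p_s = 0.3, 0.4

# Chain rule for one full assignment:
# P(FH = true, S = true, LC = true) = P(FH) P(S) P(LC | FH, S)
joint = p_fh * p_s * p_lc[(True, True)]
print(round(joint, 3))  # 0.096
```

Every entry of the full joint distribution over (FH, S, LC) can be produced the same way, which is exactly the claim in Q3.19 that the network specifies the full joint distribution.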

Que 3.21. Explain EM algorithm with steps.


Answer
1. The Expectation-Maximization (EM) algorithm is an iterative way to
find maximum-likelihood estimates for model parameters when the
data is incomplete, or has missing data points, or has some hidden
variables.
2. EM chooses random values for the missing data points and estimates a
new set of data.
3. These new values are then recursively used to estimate better data, by
filling up the missing points, until the values get fixed.
4. These are the two basic steps of the EM algorithm :
a. Estimation Step :
i. Initialize μk, Σk and πk by random values, or by K-means
clustering results, or by hierarchical clustering results.
ii. Then, for those given parameter values, estimate the values of
the latent variables (i.e., γk).
b. Maximization Step : Update the values of the parameters (i.e., μk,
Σk and πk) calculated using the ML method :
i. Initialize the mean μk, the covariance matrix Σk and the
mixing coefficients πk by random values (or other values).
ii. Compute the γk values for all k.
iii. Again estimate all the parameters using the current γk values.
iv. Compute the log-likelihood function.
v. Put some convergence criterion.
vi. If the log-likelihood value converges to some value (or if all the
parameters converge to some values) then stop,
else return to Step 2.
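The two steps can be sketched for a one-dimensional mixture of two Gaussians; the synthetic data, the initial parameter values and the fixed iteration count (standing in for a convergence criterion) are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)
# Synthetic 1-D data drawn from two hypothetical Gaussian clusters.
data = ([random.gauss(0, 1) for _ in range(100)]
        + [random.gauss(5, 1) for _ in range(100)])

def pdf(x, mu, var):
    """1-D Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Initialise means, variances and mixing coefficients.
mu, var, pi = [0.5, 4.0], [1.0, 1.0], [0.5, 0.5]

for _ in range(30):  # fixed iterations stand in for a convergence test
    # E-step: responsibility gamma_k of each component for each point.
    gammas = []
    for x in data:
        w = [pi[k] * pdf(x, mu[k], var[k]) for k in (0, 1)]
        s = sum(w)
        gammas.append([wk / s for wk in w])
    # M-step: re-estimate the parameters from the responsibilities.
    for k in (0, 1):
        nk = sum(g[k] for g in gammas)
        mu[k] = sum(g[k] * x for g, x in zip(gammas, data)) / nk
        var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gammas, data)) / nk
        pi[k] = nk / len(data)

print(sorted(round(m, 1) for m in mu))  # recovered means, close to 0 and 5
```

With well-separated clusters the estimated means settle near the generating means after a few iterations, illustrating the guaranteed likelihood increase noted in Q3.22.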
Que 3.22. Describe the usage, advantages and disadvantages of the
EM algorithm.
Answer
Usage of EM algorithm :
1. It can be used to fill in the missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used for the purpose of estimating the parameters of the Hidden
Markov Model (HMM).
4. It can be used for discovering the values of latent variables.
Advantages of EM algorithm are :
1. It is always guaranteed that the likelihood will increase with each iteration.
2. The E-step and M-step are often pretty easy for many problems in terms
of implementation.
3. Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm are :
1. It has slow convergence.
2. It makes convergence to the local optima only.
3. It requires both the probabilities, forward and backward (numerical optimization requires only forward probability).
UNIT
4
Computational Learning Theory

CONTENTS
Part-1 : Computational Learning Theory, Sample Complexity for Finite Hypothesis Spaces, Sample Complexity for Infinite Hypothesis Space, The Mistake Bound Model of Learning
Part-2 : Instance-Based Learning, K-Nearest Neighbour Learning, Locally Weighted Regression, Radial Basis Function Network
Part-3 : Case-Based Learning

PART-1
Computational Learning Theory, Sample Complexity for Finite Hypothesis Spaces, Sample Complexity for Infinite Hypothesis Space, The Mistake Bound Model of Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.1. Write short note on computational learning theory.

Answer
1. Computational Learning Theory (CLT) is a field of AI used for studying the design of machine learning algorithms to determine what sorts of problems are learnable.
2. The ultimate goals are to understand the theoretical ideas of deep learning programs, what makes them work or not, while improving accuracy and efficiency.
3. This field merges many disciplines, such as probability theory, statistics, programming optimization, information theory, calculus and geometry.
4. Computational learning theory is used to :
i. Provide a theoretical analysis of learning.
ii. Show when a learning algorithm can be expected to succeed.
iii. Show when learning may be impossible.
5. There are three areas comprised by CLT :
i. Sample complexity : Sample complexity describes how many examples we need to find a good hypothesis.
ii. Computational complexity : Computational complexity describes how much computational power we need to find a good hypothesis.
iii. Mistake bound : Mistake bound describes how many mistakes we will make before finding a good hypothesis.
Que 4.2. Describe sample complexity for finite hypothesis space.
Answer
1. The sample complexity of a machine learning algorithm represents the
number of training samples that it needs in order to successfully learn a
target function.

2. Sample complexity is the number of training samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.
3. There are two variants of sample complexity :
a. The weak variant fixes a particular input-output distribution.
b. The strong variant takes the worst-case sample complexity over all input-output distributions.
4. It characterizes classes of learning problems or specific algorithms in terms of sample complexity, i.e., the number of training examples necessary or sufficient to learn hypotheses of a given accuracy.
5. Complexity of a learning problem depends on :
a. Size or expressiveness of the hypothesis space.
b. Accuracy to which the target concept must be approximated.
c. Probability with which the learner must produce a successful hypothesis.
d. Manner in which training examples are presented, for example, randomly or by query to an oracle.
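For a consistent learner over a finite hypothesis space H, a standard sufficient bound is m ≥ (1/ε)(ln |H| + ln(1/δ)), where ε is the allowed error and δ the failure probability. A small sketch of that calculation (the function name is ours) :

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Sufficient number of training examples for a consistent learner
    over a finite hypothesis space of size h_size to output, with
    probability at least 1 - delta, a hypothesis with error below epsilon:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 2**10 hypotheses, 5 % error, 95 % confidence
m = pac_sample_bound(2 ** 10, 0.05, 0.05)  # 199 examples suffice
```

The bound grows only logarithmically in |H|, which is why even very large finite hypothesis spaces can be learnable from modest samples.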
Que 4.3. Discuss mistake bound model of learning.

Answer
1. An algorithm A learns a class C with mistake bound M iff Mistake(A, C) ≤ M.
2. In the mistake bound model, learning proceeds in rounds, one by one. Suppose Y = {−1, +1}.
3. At the beginning of round t, the learning algorithm A has the hypothesis h_t; in round t, we see x_t and predict h_t(x_t).
4. At the end of the round, y_t is revealed and A makes a mistake if h_t(x_t) ≠ y_t. The algorithm then updates its hypothesis to h_{t+1}, and this continues till time T.
5. Suppose the labels were actually produced by some function f in a given concept class C.
6. Then we bound the total number of mistakes the learner commits as :
Mistake(A, C) := max_{f ∈ C} Mistake(A, f)
7. The amount of computation A has to do in each round in order to update its hypothesis from h_t to h_{t+1} is also of interest.
8. Setting this issue aside for a moment, we have a remarkably simple algorithm HALVING(C) that has a mistake bound of lg(|C|) for any finite concept class C.
44G(CSIT/OE-Sem-8) Computational Learning Theory

9. For a finite set H of hypotheses, define the hypothesis majority(H) as follows :
majority(H)(x) := +1 if |{h ∈ H : h(x) = +1}| ≥ |H|/2, and −1 otherwise
Algorithm :
HALVING(C)
C_1 ← C
h_1 ← majority(C_1)
for t = 1 to T do
    Receive x_t
    Predict h_t(x_t)
    Receive y_t
    C_{t+1} ← {f ∈ C_t : f(x_t) = y_t}
    h_{t+1} ← majority(C_{t+1})
end for
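The HALVING(C) procedure above can be sketched directly. Here `concepts` is a finite list of functions x → ±1 and `stream` is the sequence of (x_t, y_t) rounds (these names are ours) :

```python
def halving(concepts, stream):
    """Predict with the majority vote of the current version space, then
    discard every concept that disagrees with the revealed label.
    Makes at most lg(|C|) mistakes when some f in C labels the stream."""
    version_space = list(concepts)
    mistakes = 0
    for x, y in stream:
        plus = sum(1 for f in version_space if f(x) == +1)
        prediction = +1 if plus >= len(version_space) / 2 else -1
        if prediction != y:
            mistakes += 1
        # Every mistake at least halves the version space
        version_space = [f for f in version_space if f(x) == y]
    return mistakes
```

With |C| = 8 threshold concepts, for example, the bound guarantees at most lg 8 = 3 mistakes on any stream labeled by one of them.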

PART-2

Instance-Based Learning, K-Nearest Neighbour Learning, Locally Weighted Regression, Radial Basis Function Network.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 4.4. Write short note on instance-based learning.

Answer
1. Instance-Based Learning (IBL) is an extension of nearest neighbour or
K-NN classification algorithms.
2. IBL algorithms do not maintain a set of abstractions of models created from the instances.
3. The K-NN algorithms have a large space requirement.
4. They also extend it with a significance test to work with noisy instances, since a lot of real-life datasets have noisy training instances and K-NN algorithms do not work well with noise.
5. Instance-based learning is based on the memorization of the dataset.
6. The number of parameters is unbounded and grows with the size of the
data.
Machine Learning 4-5 G (CSIT/OE-Sem-8)
7. The classification is obtained through the memorized examples.
8. The cost of the learning process is 0; all the cost is in the computation of the prediction.
9. This kind of learning is also known as lazy learning.
Que 4.5. Explain instance-based learning representation.
Answer
Instance-based representation (1) :
1. The simplest form of learning is plain memorization.
2. This is a completely different way of representing the knowledge extracted from a set of instances : just store the instances themselves and operate by relating new instances whose class is unknown to existing ones whose class is known.
3. Instead of creating rules, work directly from the examples
themselves.
Instance-based representation (2) :
1. Instance-based learning is lazy, deferring the real work as long as
possible.
2. In instance-based learning, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is also called the nearest-neighbour classification method.
3. Sometimes more than one nearest neighbor is used, and the majority
class of the closest k-nearest neighbours is assigned to the new instance.
This is termed the k-nearest neighbour method.
Instance-based representation (3) :
1. When computing the distance between two examples, the standard Euclidean distance may be used.
2. When nominal attributes are present, we may use the following procedure.
3. A distance of 0 is assigned if the values are identical; otherwise the distance is 1.
4. Some attributes will be more important than others. We need some kind of attribute weighting. To get suitable attribute weights from the training set is a key problem.
5. It may not be necessary, or desirable, to store all the
training instances.
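The procedure in points 1-3 (Euclidean distance for numeric attributes, a 0/1 distance for nominal ones) can be sketched as follows (the function name is ours) :

```python
def mixed_distance(a, b):
    """Distance between two instances with numeric and nominal attributes:
    squared numeric differences plus a 0/1 term per nominal attribute
    (0 if the values are identical, 1 otherwise), combined Euclidean-style."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return total ** 0.5

mixed_distance((1.0, "red"), (4.0, "blue"))  # sqrt(3**2 + 1)
```

Attribute weighting (point 4) would multiply each term by a per-attribute weight before summing.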
Instance-based representation (4) :
1. Generally some regions of attribute space are more stable with regard to class than others, and just a few examples are needed inside stable regions.
2. An apparent drawback to instance-based representation is that they do not make explicit the structures that are learned.
Fig. 4.5.1.
Que 4.6. What are the performance dimensions used for instance-based learning algorithm ?
Answer
Performance dimensions used for instance-based learning algorithms are :

1. Generality :
a. This is the class of concepts that describe the representation of an algorithm.
b. IBL algorithms can PAC-learn any concept whose boundary is a union of a finite number of closed hyper-curves of finite size.
2. Accuracy : This concept describes the accuracy of classification.
3. Learning rate :
a. This is the speed at which classification accuracy increases during training.
b. It is a more useful indicator of the performance of the learning algorithm than accuracy for finite-sized training sets.
4. Incorporation costs :
a. These are incurred while updating the concept descriptions with a single training instance.
b. They include classification costs.
5. Storage requirement : This is the size of the concept description for IBL algorithms, which is defined as the number of saved instances used for classification decisions.

Que 4.7. What are the functions of instance-based learning ?

Answer
Functions of instance-based learning are :
1. Similarity function :
a. This computes the similarity between a training instance i and the instances in the concept description.
b. Similarities are numeric-valued.
2. Classification function :
a. This receives the similarity function's results and the classification performance records of the instances in the concept description.
b. It yields a classification for i.
3. Concept description updater :
a. This maintains records on classification performance and decides which instances to include in the concept description.
b. Inputs include i, the similarity results, the classification results, and a current concept description. It yields the modified concept description.
Que 4.8. What are the advantages and disadvantages of instance
based learning ?
Answer
Advantages of instance-based learning:
1. Learning is trivial.
2. Works efficiently.
3. Noise resistant.
4. Rich representation, arbitrary decision surfaces.
5. Easy to understand.
Disadvantages of instance-based learning :
1. Need lots of data.
2. Computational cost is high.
3. Restricted to x ∈ Rⁿ.
4. Implicit weights of attributes (need normalization).
5. Need large space for storage i.e., require large memory.
6. Expensive application time.

Que 4.9. Describe K-Nearest Neighbour algorithm with steps.

Answer
1. The KNN classification algorithm is used to decide which class the new instance should belong to.
2 When K = 1, we have the nearest neighbour algorithm.
3. KNN classification is incremental.
4 KNN classification does not have a training phase, all instances are
stored. Training uses indexing to find neighbours quickly.
5. During testing, the KNN classification algorithm has to find the K-nearest neighbours of a new instance. This is time consuming if we do exhaustive comparison.
6. K-nearest neighbours use the local neighborhood to obtain a prediction.
Algorithm : Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array arr[] of data points. This means each element of this array represents a tuple (x, y).
2. For i = 0 to m − 1 : calculate the Euclidean distance d(arr[i], p).
3. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an already classified data point.
4. Return the majority label among S.
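The four steps above can be sketched as a plain exhaustive-comparison implementation (`train` holds (x, y) tuples; the names are ours) :

```python
from collections import Counter

def knn_classify(train, p, k=3):
    """K-nearest neighbour classification: sort the stored samples by
    Euclidean distance to the unknown point p and return the majority
    label among the k closest ones."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    neighbours = sorted(train, key=lambda xy: dist(xy[0], p))[:k]
    labels = Counter(y for _, y in neighbours)
    return labels.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
knn_classify(train, (0.5, 0.5))  # "A"
```

This is the time-consuming exhaustive comparison mentioned in point 5; indexing structures (e.g. k-d trees) are the usual way to speed up the neighbour search.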

Que 4.10. What are the advantages and disadvantages of K-nearest neighbour algorithm ?

Answer
Advantages of KNN algorithm :
1. No training period :
a. KNN is called a lazy learner (instance-based learning).
b. It does not learn anything in the training period. It does not derive any discriminative function from the training data.
c. In other words, there is no training period for it. It stores the training dataset and learns from it only at the time of making real time predictions.
d. This makes the KNN algorithm much faster than other algorithms that require training, for example, SVM, Linear Regression, etc.
2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly, which will not impact the accuracy of the algorithm.
3. KNN is very easy to implement. There are only two parameters required to implement KNN, i.e., the value of K and the distance function (for example, Euclidean).
Disadvantages of KNN :
1. Does not work well with large dataset : In large datasets, the cost of calculating the distance between the new point and each existing point is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions : The KNN algorithm does not work well with high dimensional data because, with a large number of dimensions, it becomes difficult for the algorithm to calculate the distance in each dimension.
3. Need feature scaling : We need to do feature scaling (standardization and normalization) before applying the KNN algorithm to any dataset. If we do not do so, KNN may generate wrong predictions.
4. Sensitive to noisy data, missing values and outliers : KNN is sensitive to noise in the dataset. We need to manually represent missing values and remove outliers.
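The feature scaling of point 3 can be illustrated with min-max normalization, which maps every attribute into [0, 1] so that no wide-range feature dominates the Euclidean distance (a sketch; the function name is ours) :

```python
def min_max_scale(points):
    """Min-max normalization per feature: (v - min) / (max - min),
    so every attribute lies in [0, 1] before KNN distances are computed."""
    cols = list(zip(*points))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(p, lo, hi)) for p in points]

scaled = min_max_scale([(1.0, 100.0), (2.0, 300.0), (3.0, 500.0)])
# → [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```

Without this step the second feature (range 400) would swamp the first (range 2) in every distance computation.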

Que 4.11. Explain locally weighted regression.


Answer
1. Model-based methods, such as neural networks and the mixture of
Gaussians, use the data to build a parameterized model.
2. After training, the model is used for predictions and the data are generally
discarded.
3. In contrast, memory-based methods are non-parametric approaches
that explicitly retain the training data, and use it each time a prediction
needs to be made.
4. Locally Weighted Regression (LWR) is a memory-based method that
performs a regression around a point using only training data that are
local to that point.
5. LWR was shown to be suitable for real-time control by constructing an LWR-based system that learned a difficult juggling task.
Fig. 4.11.1.
6. The LOESS (Locally Estimated Scatterplot Smoothing) model performs a linear regression on points in the data set, weighted by a kernel centered at x.
7. The kernel shape is a design parameter; the original LOESS model uses a tricubic kernel, while here a Gaussian kernel is used :
h_i(x) = h(x − x_i) = exp(−k(x − x_i)²)
where k is a smoothing parameter.
8. For brevity, we will drop the argument x for h_i(x), and define n = Σ_i h_i. We can then write the estimated means and covariances as :
μ_x = (Σ_i h_i x_i)/n,  μ_y = (Σ_i h_i y_i)/n,
σ_x² = (Σ_i h_i (x_i − μ_x)²)/n,  σ_xy = (Σ_i h_i (x_i − μ_x)(y_i − μ_y))/n
9. We use the data covariances to express the conditional expectations and their estimated variances; the conditional expectation is :
ŷ(x) = μ_y + (σ_xy/σ_x²)(x − μ_x)

Fig. 4.11.2. Kernel too wide (includes region beyond the locally linear part), kernel just right, kernel too narrow (excludes some of the linear region).
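The weighted quantities above can be sketched in one dimension : Gaussian weights h_i around a query point x0, weighted means and (co)variances, and the resulting local linear prediction. This is an illustrative reading of the formulas, not the original system's code :

```python
import math

def lwr_predict(xs, ys, x0, k=1.0):
    """Locally weighted regression at query point x0: weight each training
    point with h_i = exp(-k (x0 - x_i)^2), compute weighted means and
    covariances, and evaluate the weighted least-squares line at x0."""
    h = [math.exp(-k * (x0 - x) ** 2) for x in xs]
    n = sum(h)
    mu_x = sum(hi * x for hi, x in zip(h, xs)) / n
    mu_y = sum(hi * y for hi, y in zip(h, ys)) / n
    s_xx = sum(hi * (x - mu_x) ** 2 for hi, x in zip(h, xs)) / n
    s_xy = sum(hi * (x - mu_x) * (y - mu_y) for hi, x, y in zip(h, xs, ys)) / n
    slope = s_xy / s_xx if s_xx > 0 else 0.0
    return mu_y + slope * (x0 - mu_x)
```

Because the weights are recomputed per query, all training data must be kept in memory, which is exactly the memory-based character described above.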

Que 4.12. Explain Radial Basis Function (RBF).


Answer
1. A Radial Basis Function (RBF) is a function that assigns a real value to each input from its domain (it is a real-value function), and the value produced by the RBF is always an absolute value, i.e., it is a measure of distance and cannot be negative.
2. Euclidean distance (the straight-line distance) between two points in Euclidean space is used.
3. Radial basis functions are used to approximate functions, just as neural networks act as function approximators.
4. The following sum represents a radial basis function network :
y(x) = Σ_{i=1}^{N} w_i φ(‖x − x_i‖)

5. The radial basis functions act as activation functions.


6. The approximant y(x) is differentiable with respect to the weights, which are learned using iterative update methods common among neural networks.

Que 4.13. Explain the architecture of a radial basis function network.
Machine Learning 4-11G(CSTT/OE-Sem-8)

Answer
1. Radial Basis Function (RBF) networks have three layers : an input layer, a hidden layer with a non-linear RBF activation function, and a linear output layer.
2. The input can be modeled as a vector of real numbers x ∈ Rⁿ.
3. The output of the network is then a scalar function of the input vector, φ : Rⁿ → R, and is given by
φ(x) = Σ_{i=1}^{N} a_i ρ(‖x − c_i‖)
Fig. 4.13.1. Architecture of a radial basis function network. An input vector x is used as input to all radial basis functions, each with different parameters. The output of the network is a linear combination of the outputs from radial basis functions.
where N is the number of neurons in the hidden layer, c_i is the center vector for neuron i, and a_i is the weight of neuron i in the linear output neuron.

4 Functions that depend only on the distance from a center vector are
radially symmetric about that vector.
5 In the basic form all inputs are connected toeach hidden neuron.
6. The radial basis function is taken to be Gaussian :
ρ(‖x − c_i‖) = exp(−β ‖x − c_i‖²)
7. The Gaussian basis functions are local to the center vector in the sense that
lim_{‖x‖→∞} ρ(‖x − c_i‖) = 0
i.e., changing parameters of one neuron has only a small effect for input values that are far away from the center of that neuron.
8. Given certain mild conditions on the shape of the activation function, RBF networks are universal approximators on a compact subset of Rⁿ.
9. This means that an RBF network with enough hidden neurons can approximate any continuous function on a closed, bounded set with arbitrary precision.
10. The parameters a_i, c_i and β are determined in a manner that optimizes the fit between φ and the data.
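The forward pass φ(x) = Σ_i a_i ρ(‖x − c_i‖) with Gaussian ρ can be sketched as follows; the centers and weights are assumed to be already trained (all names are ours) :

```python
import math

def rbf_forward(x, centers, weights, beta=1.0):
    """Output of an RBF network with Gaussian basis functions:
    phi(x) = sum_i a_i * exp(-beta * ||x - c_i||^2)."""
    out = 0.0
    for c, a in zip(centers, weights):
        sq = sum((xj - cj) ** 2 for xj, cj in zip(x, c))  # squared distance
        out += a * math.exp(-beta * sq)
    return out

y = rbf_forward((0.0, 0.0), centers=[(0.0, 0.0), (3.0, 3.0)], weights=[1.0, 2.0])
```

Evaluating at a center illustrates the locality of point 7 : the matching basis function contributes its full weight while distant centers contribute almost nothing.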

PART-3

Case-Based Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.14. Write short note on case-based learning algorithm.


Answer
1. Case-Based Learning (CBL) algorithms take as input a sequence of training cases and output a concept description, which can be used to generate predictions of goal feature values for subsequently presented cases.
2. The primary component of the concept description is the case-base, but almost all CBL algorithms maintain additional related information for the purpose of generating accurate predictions (for example, settings for feature weights).
3. Current CBL algorithms assume that cases are described using a feature-value representation, where features are either predictor or goal features.
4. CBL algorithms are distinguished by their processing behaviour.
Disadvantages of case-based learning algorithm :
1. They are computationally expensive because they save and compute
similarities to all training cases.
2. They are intolerant of noise and irrelevant features.
3. They are sensitive to the choice of the algorithm's similarity function.
4. There is no simple way they can process symbolic valued feature values.

Que 4.15. What are the functions of case-based learning algorithm?


Answer
Functions of case-based learning algorithm are :
1. Pre-processor : This prepares the input for processing (for example, normalizing the range of numeric-valued features to ensure that they are treated with equal importance by the similarity function, or formatting the raw input into a set of cases).
2. Similarity :
a. This function assesses the similarities of a given case with the previously stored cases in the concept description.
b. Assessment may involve explicit encoding and/or dynamic computation.
c. CBL similarity functions find a compromise along the continuum between these extremes.
3. Prediction : This function inputs the similarity assessments and generates a prediction for the value of the given case's goal feature (i.e., a classification when it is symbolic-valued).
4. Memory updating : This updates the stored case-base, such as by modifying or abstracting previously stored cases, forgetting cases presumed to be noisy, or updating a feature's relevance weight setting.
Que 4.16. Describe case-based learning cycle with different schemes of CBL.

Answer
Case-based learning algorithm processing stages are :
1. Case retrieval : After the problem situation has been assessed, the best matching case is searched in the case-base and an approximate solution is retrieved.
2. Case adaptation : The retrieved solution is adapted to fit better in the new problem.
Fig. 4.16.1. The CBL cycle : a problem goes through Retrieve, Revise and Retain steps, turning a proposed solution into a confirmed solution.


3. Solution evaluation :
a. The adapted solution can be evaluated either before the solution is applied to the problem or after the solution has been applied.
b. In any case, if the accomplished result is not satisfactory, the retrieved solution must be adapted again or more cases should be retrieved.
4. Case-base updating : If the solution was verified as correct, the new case may be added to the case base.
Different schemes of the CBL working cycle are :
1. Retrieve the most similar case.
2. Reuse the case to attempt to solve the current problem.
3. Revise the proposed solution if necessary.
4. Retain the new solution as a part of a new case.
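The Retrieve, Reuse/Revise, Retain scheme can be sketched abstractly. Here `similarity` and `adapt` stand in for the domain-specific functions described earlier; the whole example is an illustration, not a particular CBL system :

```python
def cbl_solve(case_base, problem, similarity, adapt):
    """One pass of the CBL cycle over a list of (problem, solution) pairs:
    Retrieve the most similar stored case, Reuse/Revise its solution via
    `adapt`, and Retain the resulting new case."""
    # Retrieve: best matching stored case
    best_problem, best_solution = max(case_base,
                                      key=lambda c: similarity(c[0], problem))
    # Reuse/Revise: adapt the retrieved solution to the new problem
    solution = adapt(best_problem, best_solution, problem)
    # Retain: add the confirmed new case to the case base
    case_base.append((problem, solution))
    return solution
```

A toy numeric domain, for instance, could use `similarity = lambda a, b: -abs(a - b)` and an `adapt` that shifts the old solution in proportion to the problem difference.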

Que 4.17. What are the benefits of CBL as a lazy problem solving
method?

Answer
The benefits of CBL as a lazy problem solving method are :
1. Ease of knowledge elicitation :
a. Lazy methods can utilise easily available case or problem instances instead of rules that are difficult to extract.
b. So, classical knowledge engineering is replaced by case acquisition and structuring.
2. Absence of problem-solving bias :
a. Cases can be used for multiple problem-solving purposes, because they are stored in a raw form.
b. This is in contrast to eager methods, which can be used merely for the purpose for which the knowledge has already been compiled.
3. Incremental learning :
a. A CBL system can be put into operation with a minimal set of solved cases furnishing the case base.
b. The case base will be filled with new cases, increasing the system's problem-solving ability.
c. Besides augmentation of the case base, new indexes and clusters/categories can be created and the existing ones can be changed.
d. This is in contrast to eager methods, which require a special training period whenever knowledge extraction (knowledge generalisation) is performed.
e. Hence, dynamic on-line adaptation in a non-rigid environment is possible.

4. Suitability for complex and not-fully formalised solution spaces :
a. CBL systems can be applied to an incomplete model of the problem domain; implementation involves both identifying relevant case features and furnishing, possibly partially, the case base with proper cases.
b. Lazy approaches are more appropriate for complex solution spaces than eager approaches, which replace the presented data with abstractions obtained by generalisation.
5. Suitability for sequential problem solving :
a. Sequential tasks, like those encountered in reinforcement learning problems, benefit from the storage of history in the form of a sequence of states or procedures.
b. Such a storage is facilitated by lazy approaches.
6. Ease of explanation :
a. The results of a CBL system can be justified based upon the similarity of the current problem to the retrieved case.
b. Since CBL results are easily traceable to precedent cases, it is also easier to analyse failures of the system.
7. Ease of maintenance : This is particularly due to the fact that CBL systems can adapt to many changes in the problem domain and the relevant environment, merely by acquiring new cases.
Que 4.18. What are the limitations of CBL?

Answer
Limitations of CBL are :
1. Handling large case bases :
a. High memory/storage requirements and time-consuming retrieval accompany CBL systems utilising large case bases.
b. Although the order of both is linear with the number of cases, these problems usually lead to increased construction costs and reduced system performance.
c. These problems are less significant as the hardware components become faster and cheaper.
2. Dynamic problem domains :
a. CBL systems may have difficulties in handling dynamic problem domains, where they may be unable to follow a shift in the way problems are solved, since they are strongly biased towards what has already worked.
b. This may result in an outdated case base.
3. Handling noisy data :
a. Parts of the problem situation may be irrelevant to the problem itself.
b. Unsuccessful assessment of such noise present in a problem situation currently imposed on a CBL system may result in the same problem being unnecessarily stored numerous times in the case base, because of the differences due to the noise.
c. In turn this implies inefficient storage and retrieval of cases.
4. Fully automatic operation :
a. In a CBL system, the problem domain is not fully covered.
b. Hence, some problem situations can occur for which the system has no solution.
c. In such situations, CBL systems expect input from the user.

Que 4.19. What are the applications of CBL ?


Answer
Applications of CBL :
1. Interpretation : It is a process of evaluating situations/problems in some context (for example, HYPO for interpretation of patent laws, KICS for interpretation of building regulations, LISSA for interpretation of non-destructive test measurements).
2. Classification : It is a process of explaining a number of encountered symptoms (for example, CASEY for classification of auditory impairments, CASCADE for classification of software failures, PAKAR for causal classification of building defects, ISFER for classification of facial expressions into user defined interpretation categories).
3. Design : It is a process of satisfying a number of posed constraints (for example, JULIA for meal planning, CLAVIER for design of optimal layouts of composite airplane parts, EADOCS for aircraft panels design).
4. Planning : It is a process of arranging a sequence of actions in time (for example, BOLERO for building diagnostic plans for medical patients, TOTLEC for manufacturing planning).
5. Advising : It is a process of resolving diagnosed problems (for example, DECIDER for advising students, HOMER).

Que 4.20. What are major paradigms of machine learning ?


Answer
Major paradigms of machine learning are :
1. Rote learning :
a. There is one-to-one mapping from inputs to stored representation.
b. Learning by memorization.
c. There is association-based storage and retrieval.
2. Induction : Machine learning uses specific examples to reach general conclusions.
3. Clustering : Clustering is a task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
4. Analogy : Determine correspondence between two different representations.
5. Discovery : Unsupervised, i.e., a specific goal is not given.
6. Genetic algorithms :
a. Genetic algorithms are stochastic search algorithms which act on a population of possible solutions.
b. They are probabilistic search methods, which means that the states they explore are not determined solely by the properties of the problems.
7. Reinforcement :
a. In reinforcement learning, only feedback (positive or negative reward) is given at the end of a sequence of steps.
b. It requires assigning reward to steps by solving the credit assignment problem : which steps should receive credit or blame for a final result.
Que 4.21. Briefly explain the inductive learning problem.
Answer
Inductive learning problems are :
1. Supervised versus unsupervised learning :
a. We want to learn an unknown function f(x) = y, where x is an input example and y is the desired output.
b. Supervised learning implies we are given a set of (x, y) pairs by a teacher.
c. Unsupervised learning means we are only given the x's.
d. In either case, the goal is to estimate f.
2. Concept learning :
a. Given a set of examples of some concept/class/category, determine if a given example is an instance of the concept or not.
b. If it is an instance, we call it a positive example.
c. If it is not, it is called a negative example.
3. Supervised concept learning by induction :
a. Given a training set of positive and negative examples of a concept, construct a description that will accurately classify whether future examples are positive or negative.
b. That is, learn some good estimate of function f given a training set {(x1, y1), (x2, y2), ..., (xn, yn)} where each yi is either + (positive) or − (negative).
UNIT
5
Genetic Algorithm

CONTENTS
Part-1 : Genetic Algorithm : An Illustrative Example, Hypothesis Space Search, Genetic Programming
Part-2 : Models of Evolution and Learning
Part-3 : Learning First Order Rules, Sequential Covering Algorithm, General to Specific Beam Search
Part-4 : FOIL, Reinforcement Learning, The Learning Task, Q Learning

PART-1
Genetic Algorithm : An Illustrative Example, Hypothesis Space Search, Genetic Programming.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.1. Write short note on Genetic algorithm.

Answer
1. Genetic algorithms are computerized search and optimization algorithms based on mechanics of natural genetics and natural selection.
2. These algorithms mimic the principle of natural genetics and natural selection to construct search and optimization procedures.
3. Genetic algorithms convert the design space into genetic space. Design space is a set of feasible solutions.
4. Genetic algorithms work with a coding of variables.
5. The advantage of working with a coding of variables is that coding discretizes the search space even though the function may be continuous.
6. Search space is the space for all possible feasible solutions of a particular problem.
7. Following are the benefits of Genetic algorithms :
a. They are robust.
b. They provide optimization over a large state space.
c. They do not break on slight change in input or presence of noise.
8. Following are the applications of Genetic algorithms :
a. Recurrent neural network
b. Mutation testing
c. Code breaking
d. Filtering and signal processing
e. Learning fuzzy rule base
Que 5.2. Write procedure of Genetic algorithm with advantages and disadvantages.
Answer
Procedure of Genetic algorithm :
1. Generate a set of individuals as the initial population.
2. Use genetic operators such as selection or crossover.
3. Apply mutation or digital reverse if necessary.
4. Evaluate the fitness function of the new population.
5 Use the fitness function for determining the best individuals and replace
predefined members from the original population.
6 Iterate steps 2-5 and terminate when some predefined population
threshold is met.
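The steps above can be sketched for bit-string individuals. This is a minimal illustration; keeping the fitter half of the population stands in for the selection operator, and all names are ours :

```python
import random

def genetic_algorithm(fitness, length=8, pop_size=20, generations=40,
                      mutation_rate=0.05, seed=1):
    """Minimal GA over bit lists: fitness-based selection, one-point
    crossover, bit-flip mutation, fixed number of generations."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half as parents
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        # Crossover and mutation produce the next generation
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)      # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(length):             # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = genetic_algorithm(sum)  # maximize the number of 1-bits ("OneMax")
```

A production version would typically use fitness-proportional (roulette-wheel) selection and a convergence-based termination test instead of a fixed generation count.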
Advantages of Genetic algorithm :
1. Genetic algorithms can be executed in parallel. Hence, genetic algorithms are faster.
2. It is useful for solving optimization problems.
Disadvantages of Genetic algorithm :
1. Identification of the fitness function is difficult as it depends on the problem.
2. The selection of suitable genetic operators is difficult.
Que 5.3. Explain different phases of genetic algorithm.

Answer
Different phases of genetic algorithm are :
1. Initial population :
a. The process begins with a set of individuals which is called a
population.
b. Each individual is a solution to the problem we want to solve.
c. An individual is characterized by a set of parameters (variables) known as genes.
d. Genes are joined into a string to form a chromosome (solution).
e. In a genetic algorithm, the set of genes of an individual is represented using a string.
f. Usually, binary values are used (string of 1s and 0s).
A1  0 0 0 0 0 0   (Gene : a single bit)
A2  1 1 1 1 1 1   (Chromosome : the whole string)
A3  1 0 1 0 1 1
A4  1 1 0 1 1 0   (Population : the set A1-A4)

FA (Pactor Analysis) fitness function :


The fitness function determines how fit an individual is (the ability
of all individual tocompete with other individual).
b. It gives a fitness score to each individual.
C. The probability that an individual will be selected for reproduction
is based on its fitness score.
Selection:
The idea of selection phase is to select the fittest individuals and let
them pass their genes to the next generation.
b. Two pairs ofindividuals (parents) are selected based on their fitness
Scores,

C
Individuals with high fitness have more chance to be selected for
reproduetion.
4. Crossover :
   a. Crossover is the most significant phase in a genetic algorithm.
   b. For each pair of parents to be mated, a crossover point is chosen at random from within the genes.
   c. For example, consider the crossover point to be 3, as shown :

      A1 : 0 0 0 | 0 0 0
      A2 : 1 1 1 | 1 1 1
                 (crossover point)

   d. Offspring are created by exchanging the genes of the parents among themselves until the crossover point is reached.
   e. The new offspring are added to the population :

      A5 : 1 1 1 0 0 0
      A6 : 0 0 0 1 1 1

5. Mutation :
   a. When new offspring are formed, some of their genes can be subjected to a mutation with a low random probability.
   b. This implies that some of the bits in the bit string can be flipped, for example :

      Before mutation   A5 : 1 1 1 0 0 0
      After mutation    A5 : 1 1 0 1 1 0

   c. Mutation occurs to maintain diversity within the population and prevent premature convergence.
6. Termination :
   a. The algorithm terminates if the population has converged (does not produce offspring which are significantly different from the previous generation).
   b. Then it is said that the genetic algorithm has provided a set of solutions to our problem.
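The crossover and mutation phases above can be sketched as simple bit-string operations. This is a minimal illustration; the six-bit strings follow the A1/A2 example in the text, and the function names are our own :

```python
import random

def crossover(p1, p2, point):
    """One-point crossover: offspring exchange genes after the crossover point."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with a low probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

a1 = [0, 0, 0, 0, 0, 0]
a2 = [1, 1, 1, 1, 1, 1]
a5, a6 = crossover(a2, a1, 3)        # crossover point = 3, as in the text
print(a5, a6)                        # [1, 1, 1, 0, 0, 0] [0, 0, 0, 1, 1, 1]
print(mutate(a5, 0.1, random.Random(42)))
```

With rate 0 a chromosome is unchanged, and with rate 1 every bit flips; intermediate rates give the low-probability flips described in the mutation phase.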
Que 5.4. Explain briefly hypothesis space search.
Answer
1. The hypothesis space is the set of possible decision trees.
2. ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space to find a locally optimal solution.
3. ID3 can be characterized as searching a space of hypotheses for one that fits the training examples.
(Figure : Backpropagation searches for hypotheses that maximize fit to the data, while TangentProp searches for hypotheses that maximize fit to the data and prior knowledge.)
4. ID3's hypothesis space of all decision trees is the complete space of finite discrete-valued functions, relative to the available attributes.
5. It maintains only a single current hypothesis (unlike the candidate-elimination algorithm, which maintains the set of all consistent hypotheses).
6. No backtracking is performed in the search, so it converges to a locally optimal solution.
7. Because training examples are used at each step, the resulting search is less sensitive to errors in individual training examples.

Que 5.5. Write short note on Genetic Programming.

Answer
1. Genetic Programming (GP) is a type of Evolutionary Algorithm (EA), i.e., a subset of machine learning.
2. EAs are used to discover solutions to problems that humans do not know how to solve.
3. Free of human preconceptions or biases, the adaptive nature of EAs can generate solutions that are comparable to, and often better than, the best human efforts.
4. GP software systems implement an algorithm that uses random
mutation, crossover, a fitness function, and multiple generations of
evolution to resolve a user-defined task.
5. GP is used to discover a functional relationship between features in data
(symbolic regression), to group data into categories (classification), and
to assist in the design of electrical circuits, antennae, and quantum
algorithms.
6. GP is applied to software engineering through code synthesis, genetic
improvement, automatic bug-fixing, and in developing game-playing
strategies.
Que 5.6. What are the advantages and disadvantages of Genetic
programming ?
Answer
Advantages of Genetic programming are :
1. It does not impose any fixed length on solutions, so the maximum length can be extended up to hardware limits.
2. In genetic programming, it is not necessary to have detailed knowledge of the problem or of the form of its solutions.
Disadvantages of Genetic programming are :
1. In GP, the number of possible programs that can be constructed by the algorithm is immense. This is one of the main reasons why people thought that it would be impossible to find programs that are good solutions to a given problem.
2. Although GP using machine code helps in providing results very fast, if a high-level language that needs to be compiled is used, it can generate errors and can make the program slow.
3. There is a high probability that even a very small variation has a disastrous effect on the fitness of the solution generated.
Que 5.7. Explain different types of Genetic programming.
Answer
Different types of Genetic programming are:
1. Tree-based Genetic programming :
   a. In tree-based GP, the computer programs are represented in tree structures that are evaluated recursively to produce the resulting multivariate expressions.
   b. Traditional nomenclature states that a tree node (or just node) is an operator [+, -, *, /] and a terminal node (or leaf) is a variable [a, b, c, d].
2. Stack-based Genetic programming :
a. In stack-based genetic programming, the programs in the evolving
population are expressed in a stack-based programming language.
b. In stack-based genetic programming, programs are composed of
instructions that take arguments from data stacks and push results
back on data stacks.
   c. A separate stack is provided for each data type, and program code itself can be manipulated on data stacks and subsequently executed.
3. Linear Genetic Programming (LGP):
Linear Genetic Programming (LGP) is a subset of genetic
programming where computer programs in a population are
represented as a sequence of instructions from imperative
programming language or machine language.
4. Grammatical Evolution (GE) :
   a. Grammatical Evolution is an evolutionary computation technique used to find an executable program or program fragment that will achieve a good fitness value for the given objective function.
b Grammatical Evolution applies genetic operators to an integer string,
subsequently mapped to a program (or similar) through the use of
a grammar.
C. The benefit of GE is that this mapping simplifies the application of
search to different programming languages and other structures.
5. Cartesian Genetic Programming (CGP) :
   a. CGP is a highly efficient and flexible form of Genetic programming that encodes a graph representation of a computer program.
   b. CGP represents computational structures (mathematical equations, circuits, computer programs, etc.) as a string of integers.
C. These integers, known as genes determine the functions of nodes
in the graph, the connections between nodes, the connections to
inputs and the locations in the graph from where outputs are taken.

6. Genetic Improvement Programming (GIP) :


a. GIP uses replacement software components that maximise
achievement of multiple objectives, while retaining the interfaces
between the components so-evolved and the surrounding system.
b. The GISMO project will develop theory, algorithms and techniques
for GIP as a way to automatically optimise multiple software
engineering objectives such as maximal throughput, fastest response
time and most reliable performance, while minimising power
consumption, faults, memory use, compiled code size, peak disk
usage and disk transfers.
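The tree representation described in type 1 above can be made concrete with a small recursive evaluator. This is an illustrative sketch; the nested-tuple encoding and the variable names are our own, not part of any GP library :

```python
import operator

# Function nodes map an operator symbol to a binary function.
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def evaluate(tree, env):
    """Recursively evaluate an expression tree against variable bindings.
    A leaf is a variable name; an inner node is (operator, left, right)."""
    if isinstance(tree, str):          # terminal node (variable)
        return env[tree]
    op, left, right = tree             # function node
    return OPS[op](evaluate(left, env), evaluate(right, env))

# The GP individual encoding (a + b) * (c - d):
individual = ('*', ('+', 'a', 'b'), ('-', 'c', 'd'))
print(evaluate(individual, {'a': 1, 'b': 2, 'c': 5, 'd': 3}))  # 6
```

Crossover in tree-based GP would swap subtrees between two such individuals; the evaluator above is the fitness-side half of that machinery.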

Que 5.8. Write procedure for cartesian genetic programming.


Answer
Procedure for Cartesian Genetic programming :
1. For all i such that 0 ≤ i < 1 + λ do
2.    Randomly generate individual i
3. End for
4. Select the fittest individual, which is promoted as the parent
5. While a solution is not found or the generation limit is not reached do
6.    For all i such that 0 ≤ i < λ do
7.       Mutate the parent to generate offspring i
8.    End for
9.    Generate the new parent using the following rules :
      i. If a single offspring has a better fitness than any other member of the population then
      ii. The offspring is chosen as parent
10.   Else if one or more offspring have an equal fitness to the parent then
11.      Randomly choose one of these as parent
12.   Else
13.      The parent chromosome remains the same as before
14.   End if
15. End while
Que 5.9. Explain genetic algorithm with steps.
Answer
Genetic algorithm : Refer Q. 5.1, Page 5-2G, Unit-5.
Algorithm :
GA(Fitness, Fitness_threshold, p, r, m)
Fitness : fitness function, Fitness_threshold : termination criterion,
p : number of hypotheses in the population,
r : fraction to be replaced by crossover,
m : mutation rate.
1. Initialize population : P ← Generate p hypotheses at random.
2. Evaluate : For each h in P, compute Fitness(h).
3. While [max_h Fitness(h)] < Fitness_threshold, Do :
   i. Select : Probabilistically select (1 - r)·p members of P to add to Ps.
   ii. Crossover : Probabilistically select r·p/2 pairs of hypotheses from P. For each pair <h1, h2>, produce two offspring and add them to Ps.
   iii. Mutate : Choose m percent of the members of Ps with uniform probability. For each, invert one randomly selected bit.
   iv. Update : P ← Ps.
   v. Evaluate : For each h ∈ P, compute Fitness(h).
4. Return the hypothesis from P that has the highest fitness.
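The prototypical GA loop above can be sketched in Python. This is a hedged illustration : the OneMax fitness (`sum` of 1-bits), the generation cap `max_gen`, and the helper names are our additions, not part of the original algorithm :

```python
import random

def ga(fitness, threshold, p, r, m, length, rng, max_gen=200):
    """Sketch of GA(Fitness, Fitness_threshold, p, r, m): keep (1-r)*p
    survivors, create r*p/2 crossover pairs, mutate a fraction m."""
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(p)]
    for _ in range(max_gen):
        if max(fitness(h) for h in pop) >= threshold:
            break
        weights = [fitness(h) + 1e-9 for h in pop]       # selection probabilities
        survivors = rng.choices(pop, weights, k=int((1 - r) * p))
        children = []
        for _ in range(int(r * p / 2)):
            h1, h2 = rng.choices(pop, weights, k=2)
            cut = rng.randrange(1, length)               # crossover point
            children += [h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]]
        pop = [h[:] for h in survivors + children]
        for h in rng.sample(pop, int(m * len(pop))):     # mutate m of the members
            i = rng.randrange(length)
            h[i] = 1 - h[i]                              # invert one random bit
    return max(pop, key=fitness)

# OneMax toy problem: fitness = number of 1-bits, threshold = all ones.
best = ga(sum, 8, p=20, r=0.6, m=0.1, length=8, rng=random.Random(0))
print(sum(best), best)
```

The generation cap replaces the open-ended While of the textbook version so the sketch always terminates, returning the best hypothesis found so far.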

PART-2

Models of Evolution and Learning.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 5.10. Explain the adaptive functions of learning in evolution.


Answer
Adaptive functions of learning in evolution :
1. It allows individuals to adapt to changes in the environment that occur in the life span of an individual or across a few generations :
   a. Learning has the same function attributed to evolution : adaptation to the environment.
   b. Learning enables an organism to adapt to changes in the environment that happen too quickly to be tracked by evolution.

2. It allows evolution to use information extracted from the environment, thereby channeling evolutionary search :
   a. Whereas ontogenetic adaptation can rely on a very rich, although not always explicit, amount of feedback from the environment, evolutionary adaptation relies on a single value which reflects how well an individual coped with its environment.
   b. This value is the number of offspring in the case of natural evolution and the fitness value in the case of artificial evolution.
   c. Instead, from the point of view of ontogenetic adaptation, individuals continuously receive feedback information from the environment through their sensors during their whole lifetime.
   d. However, this amount of information encodes only how well an individual is doing in different moments of its life or how it should modify its own behaviour in order to increase its fitness.
   e. However, ontogenetic and phylogenetic adaptation together might be capable of exploiting this information.
   f. Indeed, evolution is able to transform sensory information into self-generated reinforcement signals or teaching patterns.
3. It can help and guide evolution :
   a. Although physical changes of the phenotype cannot be written back into the genotype, learning might indeed affect the evolutionary course in subtle but effective ways : learning accelerates evolution because sub-optimal individuals can reproduce by acquiring during life the features necessary for survival.
   b. Learning allows complex phenotypes to be produced with short genotypes by extracting some of the information necessary to build the corresponding phenotypes from the environment.
   c. Moreover, learning can allow the maintenance of more genetic diversity.
   d. Different genes have more chances to be preserved in the population if the individuals who incorporate those genes are able to learn the same fit behaviours.

Que 5.11. What are the disadvantages of learning in evolution ?

Answer
Disadvantages of learning in evolution :
1. A delay in the ability to acquire fitness :
   a. Learning individuals will have a sub-optimal behaviour during the learning phase.
   b. As a consequence, they will collect less fitness than individuals who have the same behaviour genetically specified.
   c. The longer the learning period, the more accumulated costs have to be paid.
2. Increased unreliability :
   a. Since learned behaviour is determined, at least partly, by the environment, if a vital behaviour-defining stimulus is not encountered by a particular individual, then it will suffer as a consequence.
   b. The plasticity of learned behaviours provides the possibility that an individual may simply learn the wrong thing, causing it to incur an incorrect-behaviour cost.
   c. Learning thus has a stochastic element that is not present in instinctive behaviours.
3. Other costs :
   a. In natural organisms or in biologically inspired artificial organisms, learning might imply additional costs.
   b. If individuals are considered young during the learning period, learning also implies a delayed reproduction time.
   c. Moreover, learning might imply the waste of energy resources for the accomplishment of the learning process itself or for parental investment.
   d. Finally, while learning, individuals without a fully formed behaviour may damage themselves.
Que 5.12. Write short note on learnable evolution model.

Answer
1. The Learnable Evolution Model (LEM) is a non-Darwinian methodology for evolutionary computation that employs machine learning to guide the generation of new individuals (candidate problem solutions).
2. Unlike standard, Darwinian-type evolutionary computation methods that use random or semi-random operators for generating new individuals (such as mutations and/or recombination), LEM employs hypothesis generation and instantiation operators.
3. The hypothesis generation operator applies a machine learning program to induce descriptions that distinguish between high-fitness and low-fitness individuals in each consecutive population.
4. Such descriptions delineate the areas of the search space that contain the desirable solutions.
5. Subsequently, the instantiation operator samples these areas to create new individuals.

6. LEM has been modified from the optimization domain to the classification domain by augmenting LEM with ID3.
7. Learnable evolution model describes :
   a. Lamarckian evolution :
      i. The next generation is directly influenced by the experiences of individual organisms during their lifetime.
      ii. There is a direct influence of acquired traits on the genetic makeup of the offspring.
      iii. It is completely contradicted by science.
      iv. Lamarckian processes can nevertheless improve the effectiveness of genetic algorithms.
   b. Baldwin effect :
      i. A species in a changing environment underlies evolutionary pressure that favours individuals with the ability to learn.
      ii. Such individuals perform a small local search to maximize their fitness.
      iii. Additionally, such individuals rely less on genetic code.
      iv. Thus, they support a more diverse gene pool, relying on individual learning to overcome missing traits.
      v. There is an indirect influence of evolutionary adaptation on the entire population.

PART-3

Learning First Order Rules, Sequential Covering Algorithm,


General to Specific Beam Search.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.13. Explain the procedure of learn-one-rule algorithm.

Answer
Learn-one-rule(Target_attribute, Attributes, Examples, k)
Returns a single rule that covers some of the examples. Conducts a general-to-specific greedy beam search for the best rule, guided by the performance metric.
1. Initialize Best_hypothesis to the most general hypothesis Ø.
2. Initialize Candidate_hypotheses to the set {Best_hypothesis}.
3. While Candidate_hypotheses is not empty, Do :
   a. Generate the next more specific Candidate_hypotheses :
      i. All_constraints ← the set of all constraints of the form (a = v), where a is a member of Attributes, and v is a value of a that occurs in the current set of Examples.
      ii. New_candidate_hypotheses ←
          for each h in Candidate_hypotheses,
          for each c in All_constraints :
      iii. Create a specialization of h by adding the constraint c.
      iv. Remove from New_candidate_hypotheses any hypotheses that are duplicates, inconsistent, or not maximally specific.
   b. Update Best_hypothesis :
      i. For all h in New_candidate_hypotheses do :
      ii. If (Performance(h, Examples, Target_attribute) > Performance(Best_hypothesis, Examples, Target_attribute)) Then Best_hypothesis ← h.
   c. Update Candidate_hypotheses :
      i. Candidate_hypotheses ← the k best members of New_candidate_hypotheses, according to the performance measure.
4. Return a rule of the form :
   "IF Best_hypothesis THEN prediction"
   where prediction is the most frequent value of Target_attribute among those Examples that match Best_hypothesis.
Performance(h, Examples, Target_attribute) :
   h_examples ← the subset of Examples that match h.
   Return -Entropy(h_examples), where entropy is computed with respect to Target_attribute.

Que 5.14. Write short note on sequential covering algorithm.

Answer
1. Sequential covering is a general procedure that repeatedly learns a single rule to create a decision list (or set) that covers the entire dataset rule by rule.
2. Many rule learning algorithms are variants of the sequential covering algorithm.
3. This is the most popular algorithm implementing rule learning.
4. A covering algorithm, in the context of propositional learning systems, is an algorithm that develops a cover for the set of positive examples, i.e., a set of hypotheses that account for all the positive examples but none of the negative examples.
5. This is called sequential covering because it learns one rule at a time and repeats this process to gradually cover the full set of positive examples.
6. The most effective approach to Learn-One-Rule is beam search.
7. One characteristic of this algorithm is that it requires high accuracy, i.e., the prediction should be correct with high probability.
8. Another is that it possibly has low coverage, meaning that it does not make a prediction for all examples.
Que 5.15. Write procedure of sequential covering algorithm.

Answer
Sequential_covering(Target_attribute, Attributes, Examples, Threshold)
1. Learned_rules ← {}.
2. Rule ← Learn-One-Rule(Target_attribute, Attributes, Examples).
3. While Performance(Rule, Examples) > Threshold, do :
   i. Learned_rules ← Learned_rules + Rule.
   ii. Examples ← Examples - {examples correctly classified by Rule}.
   iii. Rule ← Learn-One-Rule(Target_attribute, Attributes, Examples).
4. Learned_rules ← sort Learned_rules according to performance over Examples.
5. Return Learned_rules.
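The procedure above can be sketched in Python. This is an illustrative toy : the example data, the single-attribute-test stand-in for Learn-One-Rule, and the performance measure (number of positives covered) are our own simplifications :

```python
def sequential_covering(examples, learn_one_rule, performance, threshold):
    """Learn rules one at a time, removing the examples each rule covers."""
    learned_rules = []
    rule = learn_one_rule(examples)
    while rule is not None and performance(rule, examples) > threshold:
        learned_rules.append(rule)
        examples = [(x, y) for x, y in examples if not rule(x)]
        rule = learn_one_rule(examples)
    return learned_rules

def learn_one_rule(examples):
    """Toy stand-in: the single attribute test (a = v) that covers the most
    positive examples and no negative ones, or None if no such test exists."""
    pos = [x for x, y in examples if y]
    neg = [x for x, y in examples if not y]
    best, best_cov = None, 0
    for x in pos:
        for a, v in x.items():
            cov = sum(1 for q in pos if q.get(a) == v)
            if cov > best_cov and not any(q.get(a) == v for q in neg):
                best, best_cov = (a, v), cov
    if best is None:
        return None
    a, v = best
    rule = lambda q: q.get(a) == v
    rule.test = best                      # remember the test for inspection
    return rule

data = [({'sky': 'sunny', 'wind': 'weak'}, True),
        ({'sky': 'sunny', 'wind': 'strong'}, True),
        ({'sky': 'rainy', 'wind': 'weak'}, False)]
rules = sequential_covering(
    data, learn_one_rule,
    performance=lambda r, ex: sum(r(x) for x, y in ex if y),  # positives covered
    threshold=0)
print([r.test for r in rules])            # [('sky', 'sunny')]
```

One rule, sky = sunny, covers both positives and no negatives, so the covering loop stops after a single iteration.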

Que 5.16. Explain general-to-specific beam search in detail.

Answer
1. Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set.
2. In beam search, only a predetermined number of best partial solutions are kept as candidates; it is a greedy algorithm.
3. Beam search uses breadth-first search to build its search tree.
4. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost.
5. However, it only stores a predetermined number of states at each level (called the beam width); only those states are expanded next. The greater the beam width, the fewer states are pruned.
6. With an infinite beam width, no states are pruned and beam search is identical to breadth-first search. The beam width bounds the memory required to perform the search.
7. Since a goal state could potentially be pruned, beam search sacrifices completeness (the guarantee that an algorithm will terminate with a solution, if one exists).
8. Beam search is not optimal, that is, there is no guarantee that it will find the best solution.
9. In general, beam search returns the first solution found. Beam search for machine translation is a different case : once the configured maximum search depth (i.e., translation length) is reached, the algorithm evaluates the solutions found during search at various depths and returns the best one (the one with the highest probability).
10. The beam width can either be fixed or variable. An approach that uses a variable beam width starts with the width at a minimum; if no solution is found, the beam is widened and the procedure is repeated.
Que 5.17. Write the procedure of general-to-specific beam search.
Answer
1. Initialize a set of most general complexes.
2. Evaluate the performances of those complexes over the example set :
   a. Count how many positive and negative examples each complex covers.
   b. Evaluate their performances.
3. Sort the complexes according to their performances.
4. If the best complex satisfies some threshold, form the hypothesis and return.
5. Otherwise, pick the k best-performing complexes for the next generation.
6. Specialize all k complexes in the set to find a new set of less general complexes.
7. Go to step 2.
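The level-by-level pruning described above can be sketched generically. The toy integer successor function and heuristic below are our own illustration; `width` is the beam width from the text :

```python
import heapq

def beam_search(start, successors, heuristic, is_goal, width):
    """Breadth-first search that prunes each level down to the `width`
    states with the lowest heuristic cost; returns a goal state or None."""
    level, seen = [start], {start}
    while level:
        for s in level:
            if is_goal(s):
                return s
        candidates = {c for s in level for c in successors(s)} - seen
        seen |= candidates
        level = heapq.nsmallest(width, candidates, key=heuristic)
    return None

# Toy: walk from 0 towards 10 in steps of +1/+2, guided by distance to 10.
found = beam_search(0, lambda n: [n + 1, n + 2], lambda n: abs(10 - n),
                    lambda n: n == 10, width=2)
print(found)  # 10
```

With an unreachable goal the beam empties out and the search reports failure, illustrating the loss of completeness discussed above.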
Que 5.18. What are the properties of heuristic search ? Give an example of heuristic search.
Answer
Properties of heuristic search are :
1. Admissibility condition : Algorithm A is admissible if it is guaranteed to return an optimal solution.
2. Completeness condition : Algorithm A is complete if it always terminates with a solution.
3. Dominance property : Let A1 and A2 be admissible algorithms with heuristic estimation functions h1 and h2 respectively. A1 is said to be more informed than A2 whenever h1(n) > h2(n) for all n; in that case A1 is said to dominate A2.

4. Optimality property : Algorithm A is optimal over a class of algorithms if A dominates all members of the class.
Example :
8-puzzle, with h(n) = the number of tiles out of place (heuristic function) :

   Initial state           Goal state
   1 2 3                   1 2 3
   8 6                     8   4
   7 5 4                   7 6 5
   h(n) = 3

Step 1 (Fig. 5.18.1) : the three possible moves from the initial state give successors with h(n) = 3, h(n) = 2 and h(n) = 4; the move with h(n) = 2 is chosen :

   1 2 3
   8 6 4
   7 5

Step 2 (Fig. 5.18.2) : expanding this state gives successors with h(n) = 3 and h(n) = 1; the state with h(n) = 1 is chosen :

   1 2 3
   8 6 4
   7   5

Step 3 (Fig. 5.18.3) : its successors have h(n) = 2, h(n) = 0 and h(n) = 2; the state with h(n) = 0 is the goal state :

   1 2 3
   8   4
   7 6 5

This process continues until we reach the goal node, which ends the search.

Que 5.19. Explain hill climbing algorithm.


Answer
1. Search methods based on hill climbing get their name from the way the nodes are selected for expansion.
2. At each point in the search path, the successor node that appears to lead most quickly to the top of the hill (the goal) is selected for exploration.
3. This method requires that some information be available with which to evaluate and order the most promising choices.
4. Hill climbing is like depth-first searching where the most promising child is selected for expansion.
5. When the children have been generated, alternative choices are evaluated using some type of heuristic function.
6. Hill climbing can produce substantial savings over blind searches when an informative, reliable function is available to guide the search to a global goal.
7. Here, the generate-and-test method is augmented by a heuristic function which measures the closeness of the current state to the goal state.
a. Calculate the initial state; if it is a goal state, then return and quit, otherwise continue with the initial state as the current state.
b. Follow the loop until a solution is found or until there are no new operators left to be applied in the current state :
   i. Select an operator that has not yet been applied to the current state and apply it to produce a new state.
   ii. Evaluate the new state :
       1. If it is a goal state, then return and quit.
       2. If it is not a goal state but it is better than the current state, then make it the current state.
       3. If it is not better than the current state, then continue in the current loop.
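The loop above can be sketched with the tiles-out-of-place heuristic on a toy 1x4 sliding puzzle. The puzzle, its goal, and the helper names are our own illustration (lower h is better, so "better than the current state" means a smaller h) :

```python
GOAL = (1, 2, 3, 0)          # 0 marks the blank tile

def h(state):
    """Heuristic: number of tiles out of place relative to GOAL."""
    return sum(1 for s, g in zip(state, GOAL) if s != g and s != 0)

def successors(state):
    """Slide the blank one position left or right in a 1x4 toy puzzle."""
    i = state.index(0)
    out = []
    for j in (i - 1, i + 1):
        if 0 <= j < 4:
            t = list(state)
            t[i], t[j] = t[j], t[i]
            out.append(tuple(t))
    return out

def hill_climb(state, successors, h):
    """Move to the best successor while it improves the heuristic;
    stop at a goal or a local optimum."""
    while True:
        neighbours = successors(state)
        if not neighbours:
            return state
        best = min(neighbours, key=h)
        if h(best) >= h(state):
            return state
        state = best

print(hill_climb((1, 2, 0, 3), successors, h))  # (1, 2, 3, 0)
```

Starting from a state whose only neighbour is worse, the same function returns the start state unchanged, showing how hill climbing can get stuck in a local optimum.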

PART-4
FOIL, Reinforcement Learning, The Learning Task, Q-Learning.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 5.20. Explain FOIL algorithm with steps.


Answer
FOIL is similar to the propositional rule learning approach except for the following :
1. FOIL accommodates first-order rules and thus needs to accommodate variables in the rule pre-conditions.
2. FOIL uses a special performance measure (FOIL-GAIN) which takes into account the different variable bindings.
3. FOIL seeks only rules that predict when the target literal is true (instead of predicting when it is true or when it is false).
4. FOIL performs a simple hill-climbing search rather than a beam search.
5. The FOIL algorithm is as follows :
Input : List of examples
Output : Rule in first-order predicate logic
FOIL(Examples)
   Let Pos be the positive examples
   Let Pred be the predicate to be learned
   Until Pos is empty do :
      Let Neg be the negative examples
      Set Body to empty
      Call LearnClauseBody
      Add Pred ← Body to the rule
      Remove from Pos all examples which satisfy Body
Procedure LearnClauseBody
   Until Neg is empty do :
      Choose a literal L
      Conjoin L to Body
      Remove from Neg examples that do not satisfy L
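The FOIL-GAIN measure mentioned above scores a candidate literal; under its standard definition, Gain = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))), it can be computed directly. A small sketch (the example counts are invented for illustration) :

```python
import math

def foil_gain(t, p0, n0, p1, n1):
    """FOIL gain for adding a literal to a rule: (p0, n0) and (p1, n1) are
    the positive/negative bindings covered before and after the addition,
    and t is the number of positive bindings still covered afterwards."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# A literal that keeps all 4 positive bindings and excludes all 4 negatives:
print(foil_gain(t=4, p0=4, n0=4, p1=4, n1=0))  # 4.0
```

A literal that leaves the positive/negative ratio unchanged scores zero, so FOIL's hill-climbing search prefers literals that purify the covered bindings.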
Que 5.21. Describe reinforcement learning.
Answer
1. Reinforcement learning is the study of how animals and artificial systems
can learn to optimize their behaviour in the face of rewards and
punishments.
2. Reinforcement learning algorithms are related to methods of dynamic programming, which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological
studies of animal behaviour, and in neurobiological investigations of
neuromodulation and addiction.
4 The task of reinforcement learning is to use observed rewards to learn
an optimal policy for the environment. An optimal policy is apolicy that
maximizes the expected total reward.
5 Without some feedback about what is good and what is bad, the agent
will have no grounds for deciding which move to make.
6 The agents needs to know that something good has happened when it
wins and that something bad has happened when it loses.
7. This kind of feedback is called a reward or reinforcement.
8. Reinforcement learning is valuable in the field of robotics, where the
tasks to be performed are frequently complex enough to defy encoding
as programs and no training data is available.
9. The robot's task consists of finding out, through trial and error (or
success), which actions are good in a certain situation and which are
not.
10. In many cases humans learn in a very similar way. For example, when
a child learns to walk, this usually happens without instruction, rather
simply through reinforcement.
11. Successful attempts at walking are rewarded by forward progress, and unsuccessful attempts are penalized by often painful falls.

12. Positive and negative reinforcement are also important factors in


successful learning in school and in many sports.
13. In many complex domains, reinforcement learning is the only feasible
way to train a program to perform at high levels.
(Fig. 5.21.1. Block diagram of reinforcement learning : the environment supplies a state (input) vector and a primary reinforcement signal to a critic; the critic converts it into a heuristic reinforcement signal for the learning system, whose actions act back on the environment.)

Que 5.22. Differentiate between reinforcement and supervised learning.
Answer
S. No. | Reinforcement learning | Supervised learning
1. | Reinforcement learning is all about making decisions sequentially. In simple words, the output depends on the state of the current input, and the next input depends on the output of the previous input. | In supervised learning, the decision is made on the initial input, i.e., the input given at the start.
2. | In reinforcement learning, decisions are dependent, so we give labels to sequences of dependent decisions. | In supervised learning, decisions are independent of each other, so labels are given to each decision.
3. | Example : Chess game. | Example : Object recognition.
Machine Learning 5-21 G (CSTT/OE-Sem-8)

Que 5.23. What is reinforcement learning ? Explain passive reinforcement learning and active reinforcement learning.
Answer
Reinforcement learning : Refer Q. 5.21, Page 5-19G, Unit-5.
Passive reinforcement learning :
1. In passive learning, the agent's policy π is fixed : in state s, it always executes the action π(s).
2. Its goal is simply to learn how good the policy is, that is, to learn the utility function U^π(s).
3. Fig. 5.23.1 shows a policy for the world and the corresponding utilities.
4. In Fig. 5.23.1(a) the policy happens to be optimal with rewards of R(s) = -0.04 in the non-terminal states and no discounting.
5. The passive learning agent does not know the transition model T(s, a, s'), which specifies the probability of reaching state s' from state s after doing action a; nor does it know the reward function R(s), which specifies the reward for each state.
6. The agent executes a set of trials in the environment using its policy π.
7. In each trial, the agent starts in state (1, 1) and experiences a sequence of state transitions until it reaches one of the terminal states, (4, 2) or (4, 3).
8. Its percepts supply both the current state and the reward received in that state. Typical trials might look like this :
   (1,1)_-0.04 → (1,2)_-0.04 → (1,3)_-0.04 → (1,2)_-0.04 → (1,3)_-0.04 → (2,3)_-0.04 → (3,3)_-0.04 → (4,3)_+1
   (1,1)_-0.04 → (1,2)_-0.04 → (1,3)_-0.04 → (2,3)_-0.04 → (3,3)_-0.04 → (3,2)_-0.04 → (3,3)_-0.04 → (4,3)_+1
   (1,1)_-0.04 → (2,1)_-0.04 → (3,1)_-0.04 → (3,2)_-0.04 → (4,2)_-1

      3 :  0.812  0.868  0.918   +1
      2 :  0.762         0.660   -1
      1 :  0.705  0.655  0.611  0.388
             1      2      3      4
Fig. 5.23.1. (a) A policy π for the 4 × 3 world;
(b) the utilities of the states in the 4 × 3 world, given policy π.
9. Each state percept is subscripted with the reward received. The object is to use the information about rewards to learn the expected utility U^π(s) associated with each non-terminal state s.
10. The utility is defined to be the expected sum of (discounted) rewards obtained if policy π is followed :
    U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) ]
    where γ is a discount factor; for the 4 × 3 world we set γ = 1.
Active reinforcement learning :
1. An active agent must decide what actions to take.
2. First, the agent will need to learn a complete model with outcome probabilities for all actions, rather than just the model for the fixed policy.
3. We need to take into account the fact that the agent has a choice of actions.
4. The utilities it needs to learn are those defined by the optimal policy; they obey the Bellman equations :
   U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
5. These equations can be solved to obtain the utility function U using the value iteration or policy iteration algorithms.
6. Once the utility function U is optimal for the learned model, the agent can extract an optimal action by one-step look-ahead to maximize the expected utility.
7. Alternatively, if it uses policy iteration, the optimal policy is already available, so it should simply execute the action the optimal policy recommends.
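The Bellman equations above can be solved by value iteration. A minimal sketch on a hypothetical two-state chain (the states, transition table, and rewards are invented for illustration) :

```python
def value_iteration(states, actions, T, R, gamma, eps=1e-6):
    """Iterate U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') * U(s')
    until the utilities stop changing."""
    U = {s: 0.0 for s in states}
    while True:
        new = {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in T[(s, a)])
                                     for a in actions)
               for s in states}
        if max(abs(new[s] - U[s]) for s in states) < eps:
            return new
        U = new

# Two-state chain: 'go' moves a -> b, and b stays in b; R(a)=0, R(b)=1.
# Fixed point with gamma = 0.5: U(b) = 1 + 0.5*U(b) = 2, U(a) = 0.5*U(b) = 1.
U = value_iteration(['a', 'b'], ['go'],
                    {('a', 'go'): [('b', 1.0)], ('b', 'go'): [('b', 1.0)]},
                    {'a': 0.0, 'b': 1.0}, gamma=0.5)
print(U)
```

Each sweep applies the right-hand side of the Bellman equation to every state; the utilities converge geometrically to the unique fixed point.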

Que 5.24. What are the different types of reinforcement learning ? Explain.
Answer
Types of reinforcement learning :
1. Positive reinforcement learning :
   a. Positive reinforcement learning is defined as when an event, occurring due to a particular behaviour, increases the strength and the frequency of the behaviour.
   b. In other words, it has a positive effect on the behaviour.
   c. Advantages of positive reinforcement learning are :
      i. It maximizes performance.
      ii. It sustains change for a long period of time.
   d. Disadvantage of positive reinforcement learning :
      i. Too much reinforcement can lead to an overload of states, which can diminish the results.
2. Negative reinforcement learning :
   a. Negative reinforcement is defined as the strengthening of a behaviour because a negative condition is stopped or avoided.
   b. Advantages of negative reinforcement learning :
      i. It increases behaviour.
      ii. It provides defiance to a minimum standard of performance.
   c. Disadvantage of negative reinforcement learning :
      i. It only provides enough to meet the minimum behaviour.

Que 5.25. What are the elements of reinforcement learning?


Answer
Elements of reinforcement learning :
1. Policy (π) :
   a. It defines the behaviour of the agent : which action to take in a given state to maximize the received reward in the long term.
   b. It corresponds to stimulus-response rules or associations.
   c. It could be a simple lookup table or function, or need more extensive computation (for example, search).
   d. It can be probabilistic.
2. Reward function (r) :
   a. It defines the goal in a reinforcement learning problem : it maps a state or action to a scalar number, the reward (or reinforcement).
   b. The RL agent's sole objective is to maximise the total reward it receives in the long run.
   c. It defines good and bad events.
   d. It cannot be altered by the agent but may inform changes of policy.
   e. It can be probabilistic (expected reward).
3. Value function (V) :
   a. It defines the total amount of reward an agent can expect to accumulate over the future, starting from a given state.
   b. A state may yield a low reward but have a high value (or the opposite). For example, immediate pain/pleasure vs. long-term happiness.
4. Transition model (M) :
   a. It defines the transitions in the environment : action a taken in state s1 will lead to state s2.
   b. It can be probabilistic.
Que 5.26. Write short note on Q-learning.
Answer
1. Reinforcement learning is the problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment. Q-learning is model-free reinforcement learning, and it is typically easier to implement.
2. Each residential load is defined as an agent. Agents should learn how to participate in the electricity market and optimize their cost simultaneously.
3. A reinforcement learning algorithm, the so-called Q-learning algorithm, is employed.
4. When an agent i is modeled by a Q-learning algorithm, it keeps in memory a function Q_i : A_i → R such that Q_i(a_i) represents the expected reward it will obtain if it plays action a_i.
5. It then plays with high probability the action it believes is going to lead to the highest reward, observes the reward it obtains, and uses this observation to update its estimate of Q_i. Suppose that the t-th time the game is played, the joint action (a_1, ..., a_n) represents the actions the different agents have taken.
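The play-observe-update cycle above can be sketched as tabular Q-learning on a hypothetical one-dimensional corridor. The environment, hyper-parameters, and episode structure are our own illustration, not from the electricity-market example :

```python
import random

def q_learning(n_states, actions, step, episodes, alpha=0.5, gamma=0.9,
               eps=0.2, seed=0):
    """Learn Q(s, a) from trial-and-error episodes; `step(s, a)` returns
    (next_state, reward, done)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:                       # explore
                a = rng.choice(actions)
            else:                                        # exploit
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, reward, done = step(s, a)
            target = reward if done else reward + gamma * max(
                Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])    # move towards target
            s = s2
    return Q

# Corridor 0-3: the agent earns reward 1 for reaching state 3.
def step(s, a):
    s2 = min(max(s + a, 0), 3)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

Q = q_learning(4, [-1, 1], step, episodes=500)
print({s: max([-1, 1], key=lambda a: Q[(s, a)]) for s in range(3)})
```

After training, the greedy policy derived from Q should move right at every non-terminal state, since moving towards the rewarding state has the higher learned value.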
