ML - Questions & Answer
MODULE I
In a two-dimensional space, the maximum number of training points that can be
shattered by a line is 3. Hence, for the line function, the VC dimension is h = 3.
7. Show that an axis-aligned rectangle can shatter 4 points in a 2-dimensional space.
The VC dimension of axis-aligned rectangles is 4 because there exists a set of 4
points that can be shattered by a rectangle, while no set of 5 points can be
shattered by a rectangle. It is true that a rectangle cannot shatter a set of four
collinear points with alternating positive and negative labels, but the
VC dimension is still 4 because there exists one configuration of 4 points
(for example, four points arranged in a diamond) which can be shattered, as the
sketch below illustrates.
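This claim can be checked by brute force. Below is a minimal Python sketch (mine, not part of the original answer) that places four points in a diamond arrangement, enumerates all 16 labelings, and verifies that the tightest axis-aligned rectangle around the positive points realizes every labeling.

from itertools import product

# Four points in a "diamond" arrangement (no point lies inside the bounding box of the others).
points = [(0, 1), (1, 0), (2, 1), (1, 2)]

def rectangle_realizes(labeling):
    """True if the tightest axis-aligned rectangle around the positive
    points contains none of the negative points."""
    pos = [p for p, lab in zip(points, labeling) if lab]
    neg = [p for p, lab in zip(points, labeling) if not lab]
    if not pos:                      # an empty rectangle classifies everything as negative
        return True
    xmin = min(x for x, _ in pos); xmax = max(x for x, _ in pos)
    ymin = min(y for _, y in pos); ymax = max(y for _, y in pos)
    return all(not (xmin <= x <= xmax and ymin <= y <= ymax) for x, y in neg)

# Rectangles shatter these 4 points iff every one of the 16 labelings is realizable.
print(all(rectangle_realizes(lab) for lab in product([False, True], repeat=4)))  # prints True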
MODULE II
9. In a data set the attributes are denoted by X1 and X2, each of which can take only the values 0
and 1. The functions hi denote the different hypotheses in the hypothesis class. The conditions are as
follows. Find which hypotheses will be selected based on the conditions.
X1 X2 Result
0 0 1
0 1 0
1 0 0
1 1 1
Probably approximately correct learning (PAC learning) is a framework for the
mathematical analysis of machine learning algorithms. In this framework, the
learner (that is, the algorithm) receives samples and must select a hypothesis
from a certain class of hypotheses. The goal is that, with high probability (the
“probably” part), the selected hypothesis will have low generalization error (the
“approximately correct” part).
How many training examples N should we have, such that with probability at
least 1 − δ, hypothesis h has error at most ε (where δ ≤ 1/2 and ε > 0)?
The definition of PAC learnability contains two approximation parameters: the accuracy
parameter ε determines how far the output classifier can be from the optimal one
(approximately), and the confidence parameter δ indicates how likely the classifier
is to meet that accuracy requirement (probably). In PAC learning, given a class C and
examples drawn from some unknown but fixed probability distribution, the task is to find
the number of examples N such that the hypothesis is approximately correct with probability
at least 1 − δ.
For the axis-aligned rectangle class, let S be the tightest rectangle around the positive
examples. The error region between C and the hypothesis h = S is the union of four
rectangular strips. If each strip has probability mass at most ε/4, the total error is at
most ε. The probability that one random instance misses a given strip is at most 1 − ε/4,
so the probability that all N instances miss at least one of the 4 strips is at most
4(1 − ε/4)^N. Requiring this to be at most δ and using (1 − x) ≤ e^(−x), it suffices to
take at least N ≥ (4/ε) log(4/δ) independent examples from C and use the tightest
rectangle as our hypothesis; then, with probability at least 1 − δ, a given point will be
misclassified with probability at most ε.
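A tiny sketch (mine) of evaluating this bound for concrete values of ε and δ:

import math

def rectangle_sample_size(eps, delta):
    """Examples sufficient for the tightest-fit rectangle to have error at most
    eps with probability at least 1 - delta, using N >= (4/eps) * ln(4/delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

print(rectangle_sample_size(0.1, 0.05))   # 176 examples for eps = 0.1, delta = 0.05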
11. Let X = R² and C be the set of all possible rectangles in the two-dimensional plane
which are axis-aligned (not rotated). Show that this concept class is PAC
learnable.
Let the instance space be the set X of all points in the Euclidean plane. Each point is
represented by its coordinates (x, y), so the dimension or length of the instances is 2.
Let the concept class C be the set of all “axis-aligned rectangles” in the plane; that is,
the set of all rectangles whose sides are parallel to the coordinate axes in the plane.
Since an axis-aligned rectangle can be defined by a set of inequalities of the form
a ≤ x ≤ b, c ≤ y ≤ d, having four parameters, the size of a concept is 4.
We take the set H of all hypotheses to be equal to the set C of concepts, H = C.
Given a set of sample points labelled positive or negative, let L be the algorithm which
outputs the hypothesis defined by the axis-aligned rectangle which gives the tightest
fit to the positive examples (that is, the rectangle with the smallest area that includes
all of the positive examples and none of the negative examples).
It can be shown that, in the notation introduced above, the concept class C is PAC-learnable
by the algorithm L using the hypothesis space H of all axis-aligned rectangles. A small
sketch of the learner L is given below.
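A minimal Python sketch (my own illustration, not the formal proof) of the tightest-fit
rectangle learner L, assuming training examples are given as (x, y, label) tuples:

def tightest_rectangle(examples):
    """Smallest axis-aligned rectangle (xmin, xmax, ymin, ymax) enclosing all
    positive examples; examples are (x, y, label) tuples with label in {0, 1}."""
    pos = [(x, y) for x, y, label in examples if label == 1]
    xs = [x for x, _ in pos]
    ys = [y for _, y in pos]
    return min(xs), max(xs), min(ys), max(ys)

def predict(rect, x, y):
    xmin, xmax, ymin, ymax = rect
    return 1 if (xmin <= x <= xmax and ymin <= y <= ymax) else 0

# Hypothetical training data:
train = [(1, 1, 1), (2, 3, 1), (3, 2, 1), (5, 5, 0), (0, 4, 0)]
h = tightest_rectangle(train)
print(h, predict(h, 2, 2))   # rectangle (1, 3, 1, 3); point (2, 2) is classified positive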
12. What are the factors that influence the generalization performance of a model? Describe
how we can achieve the best generalization in the prepared model.
Generalization indicates how well a model performs on new data. For best
generalization we should match the complexity of the hypothesis class with
complexity of function underlying the data.
We can control the generalization performance by controlling the model
complexity. For good generalization performance, high model complexity is not
required; the complexity only needs to be high enough to fit the data.
Two factors influence the generalization performance:
Overfitting
A complex model trained on insufficient data results in overfitting: H is more complex than
C or f.
Example: Fitting a sixth order polynomial to noisy data sampled from a third order
polynomial
A hypothesis h is said to overfit the training data if there is another hypothesis, h’,
such that h has smaller error than h’ on the training data but h has larger error on
the test data than h’.
Underfitting
Underfitting is the production of a machine learning model that is not complex
enough to accurately capture the relationships between a dataset's features and the target
variable. A model that is too simple will result in underfitting even with a large training
set: the hypothesis class H is less complex than C or f. When the model is too simple, both
training and test errors are large.
Example:
Trying to fit a line to data sampled from a third order polynomial.
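A small NumPy sketch (mine, following the two examples above) that fits degree-1, degree-3 and
degree-6 polynomials to noisy samples of a cubic; the degree-1 fit underfits, while the degree-6
fit tends to chase the noise and generalize worse than the degree-3 fit.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 10)
x_test = np.linspace(-1, 1, 100)
f = lambda x: x**3 - 0.5 * x                                  # underlying third-order polynomial
y_train = f(x_train) + rng.normal(0, 0.05, x_train.shape)     # noisy training samples

for degree in (1, 3, 6):
    coeffs = np.polyfit(x_train, y_train, degree)             # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")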
13. Describe the two dimensionality reduction techniques.
i) Feature selection
In feature selection, we are interested in finding k of the total of n features that
give us the most information, discarding the other (n − k) dimensions. Two popular
feature selection approaches are described below (a small sketch of the forward variant
follows this list).
a) Forward Selection
In forward selection, we start with no variables and add them one by one, at each step adding
the one that decreases the error the most, until any further addition does not
decrease the error (or decreases it only slightly).
b) Backward Selection
In sequential backward selection, we start with the set containing all features and at each
step remove the one feature that causes the least error.
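A minimal sketch (my own illustration) of sequential forward selection, using ordinary least
squares and training error as the evaluation criterion; in practice a validation set or
cross-validation error would normally be used, and the loop would stop when the error no longer
decreases.

import numpy as np

def sse(X, y, features):
    """Sum of squared errors of a least-squares fit using the chosen feature columns."""
    if not features:
        return np.sum((y - y.mean()) ** 2)
    A = X[:, features]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ w) ** 2)

def forward_selection(X, y, k):
    selected = []
    while len(selected) < k:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: sse(X, y, selected + [j]))
        selected.append(best)            # greedily add the error-minimizing feature
    return selected

# Hypothetical data in which only columns 0 and 2 carry signal:
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(0, 0.1, 50)
print(forward_selection(X, y, 2))        # expected: columns 0 and 2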
ii) Feature extraction
In feature extraction, we are interested in finding a new set of k
features that are combinations of the original n features. These methods may be
supervised or unsupervised depending on whether or not they use the output
information. The best known and most widely used feature extraction methods
are Principal Components Analysis (PCA) and Linear Discriminant Analysis
(LDA), which are both linear projection methods, unsupervised and supervised
respectively.
14. Explain Principal Components Analysis (PCA).
Principal component analysis (PCA) is a statistical procedure that uses an
orthogonal transformation to convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called principal
components. The number of principal components is less than or equal to the
smaller of the number of original variables and the number of observations. The
transformation is defined in such a way that the first principal component has the
largest possible variance (that is, it accounts for as much of the variability in the
data as possible), and each succeeding component in turn has the highest variance
possible under the constraint that it is orthogonal to the preceding components.
The reconstruction error should be minimal.
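A minimal NumPy sketch (mine) computing the principal components as the eigenvectors of the
sample covariance matrix, sorted by decreasing eigenvalue (variance):

import numpy as np

def pca(X, k):
    """Project the rows of X onto the k principal components."""
    Xc = X - X.mean(axis=0)                  # centre the data
    cov = np.cov(Xc, rowvar=False)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix -> ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    W = eigvecs[:, order[:k]]                # top-k principal directions
    return Xc @ W, W, eigvals[order]

# Hypothetical correlated 2-D data reduced to 1 dimension:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])
Z, W, variances = pca(X, 1)
print(W.ravel(), variances)                  # first direction captures most of the variance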
…never seen these output values before). The errors it makes are used to evaluate the
model. This method is mainly used when the data set D is large.
ii) K-fold cross-validation
In K-fold cross-validation, the dataset X is divided randomly into K equal-sized
parts Xi, i = 1,...,K. To generate each pair, we keep one of the K
parts out as the validation set Vi and combine the remaining K − 1 parts to form the
training set Ti. Doing this K times, each time leaving out a different one of the K
parts, we get K pairs (Vi, Ti). We then compute the average validation-set score over the K
rounds.
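A short sketch (mine) of generating the K train/validation index splits with NumPy; fitting
and scoring the model on each pair is left as a placeholder.

import numpy as np

def k_fold_indices(n, K, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation over n examples."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    for i in range(K):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        yield train_idx, val_idx

# Example: a 5-fold split of 23 examples.
for train_idx, val_idx in k_fold_indices(23, 5):
    print(len(train_idx), len(val_idx))   # fold sizes such as 18/5 or 19/4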
iii) Leave-one-out cross-validation
An extreme case of K-fold cross-validation is leave-one-out where given a
dataset of N instances, only one instance is left out as the validation set and
training uses the remaining N−1 instances. We then get N separate pairs by
leaving out a different instance at each iteration. This is typically used in
applications such as medical diagnosis, where labeled data is hard to find.
iv) Bootstrapping in machine learning
The term bootstrap sampling refers to the process of “random sampling with
replacement”. In machine learning, bootstrapping is the process of computing performance
measures using several randomly selected training and test datasets which
are selected through a process of sampling with replacement, that is, through
bootstrapping. Sample datasets are selected multiple times. The bootstrap
procedure will create one or more new training datasets, some of whose examples are
repeated. The corresponding test datasets are then constructed from the set of
examples that were not selected for the respective training datasets.
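A minimal sketch (mine) of one bootstrap round: sample N training indices with replacement
and use the out-of-bag examples (those never selected) as the test set.

import numpy as np

def bootstrap_split(n, seed=None):
    """Return (train_idx, test_idx): training indices drawn with replacement,
    test indices = the out-of-bag examples."""
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, n, size=n)               # sampling with replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)     # examples never selected
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(10, seed=0)
print(train_idx)   # may contain repeated indices
print(test_idx)    # on average about 37% of the examples end up out-of-bag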
16. What is noise in data? Explain how it affects hypotheses on the basis of the
following graph.
Noise is any unwanted anomaly in the data. Noise distorts data. When there is
noise in the data, learning may not produce accurate results.
Due to noise the class may be more difficult to learn: zero error may be infeasible
with a simple hypothesis class. There may be imprecision in recording input
attributes, which may shift data points in the input space.
There may be errors in labelling data points, called teacher noise.
For example, in a binary classification problem with two variables, when there is
noise, there may not be a simple boundary between the positive and negative
instances that separates them. So, when there is noise, we may make a complex
model which makes a perfect fit to the data and attains zero error, or we may use
a simple model and allow some error.
MODULE III
Bayes’ theorem is named after Thomas Bayes. P(H|X) is the probability that
hypothesis H holds given the observed data tuple X.
The posterior probability of a hypothesis H given X, P(H|X), follows from Bayes’
theorem:
P(H|X) = P(X|H) P(H) / P(X)
P(X|H) is the likelihood of X conditioned on H, P(H) is the prior probability of
hypothesis H, and P(X) is the prior probability of X (the evidence).
TP - True positive: the number of correctly classified positive samples.
TN - True negative: the number of correctly classified negative samples.
FP - False positive: the number of negative samples incorrectly classified as
positive samples.
FN - False negative: the number of positive samples incorrectly classified as
negative samples.
17 a) Explain the various classifier performance measures (accuracy, precision, recall,
sensitivity, specificity, ROC curve).
1. Accuracy = (TP +TN)/(TP +TN +FP +FN)
2. Sensitivity = recall = TP /(TP +FN)
3. Precision= TP/( TP +FP )
4. Specificity = TN /(TN +FP)
Precision p is the number of correctly classified positive examples (TP)
divided by the total number of examples that are classified as positive.
precision = TP/(TP + FP)
Recall r is the number of correctly classified positive examples divided by the
total number of actual positive examples in the test set.
recall = TP/(TP + FN)
5. ROC curve (see question no. 1)
b) Suppose a computer program for recognizing dogs in photographs identifies
eight dogs in a picture containing 12 dogs and some cats. Of the eight dogs
identified, five actually are dogs while the rest are cats. Compute the precision and
recall of the computer program.
Total dogs in the picture = 12; the number of cats is unknown
No. of animals identified as dogs = 8
No. of dogs identified as dogs (TP) = 5
No. of cats identified as dogs (FP) = 8 − 5 = 3
No. of dogs not identified as dogs (FN) = 12 − 5 = 7
Precision P = TP/(TP + FP) = 5/(5 + 3) = 0.625
Recall R = TP/(TP + FN) = 5/(5 + 7) ≈ 0.4167
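The same computation as a tiny sketch (mine):

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

# Dog-recognition example: 5 true positives, 3 false positives, 7 false negatives.
p, r = precision_recall(tp=5, fp=3, fn=7)
print(round(p, 4), round(r, 4))   # 0.625 0.4167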
18. Explain naive Bayes algorithm (Bayesian classifier)
Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular
class. Bayesian classification is based on Bayes’ theorem. Neural network
classifiers are suitable only for larger databases and take a long time to
train. Bayesian classifiers are accurate and much faster to train, and are suitable
for small, medium and large databases, but they are slower when applied to
new instances.
Naïve Bayesian algorithm steps
1. Let D be a training set of tuples and their associated class labels, and each
tuple is represented by n-dimensional attribute vector X = (x1, x2, …, xn)
2. Suppose there are m classes C1, C2, …, Cm.
The classifier will predict that X belongs to class having highest posterior
probability, conditioned on X.
Classification is to derive the maximum a posteriori class, i.e., the class with maximal P(Ci|X).
This can be derived from Bayes’ theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. Since P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized.
P(Ci)=|Ci,D|/|D|, where |Ci,D| is number of training tuples of class Ci in D.
4. A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
P(X|Ci) = ∏(k=1 to n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
This greatly reduces the computation cost: Only counts the class distribution
a) If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak
divided by |Ci, D| (# of tuples of Ci in D)
b) If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ,
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are the mean and standard
deviation of attribute Ak for the tuples of class Ci.
5. To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class
Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only
if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1≤j≤m, j≠i.
In other words, the predicted class label is the class Ci for which
P(X|Ci)P(Ci) is the maximum.
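A compact sketch (my own, for categorical attributes only and without Laplace smoothing) of
the algorithm described above:

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """rows: list of attribute tuples; labels: list of class labels."""
    n = len(rows)
    priors = {c: cnt / n for c, cnt in Counter(labels).items()}     # P(C_i)
    cond = defaultdict(Counter)          # cond[(class, attr_index)][value] = count
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond[(c, k)][value] += 1
    return priors, cond, Counter(labels)

def predict(model, row):
    priors, cond, class_counts = model
    scores = {}
    for c in priors:
        p = priors[c]
        for k, value in enumerate(row):
            p *= cond[(c, k)][value] / class_counts[c]   # P(x_k | C_i)
        scores[c] = p                                     # proportional to P(X|C_i)P(C_i)
    return max(scores, key=scores.get), scores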
19. Explain the general MLE method for estimating the parameters of a probability
distribution.
Maximum likelihood estimation (MLE) is a particular method to estimate the
parameters of a probability distribution. To develop a Bayesian classifier, we need
to estimate the probabilities P(x∣ck) for the class labels c1,...,ck from the given
data.
If the computed probabilities are good approximations of the true probabilities, then
there is an underlying probability distribution. Suppose the underlying
distribution has a particular form, say binomial, Poisson or normal, defined by a
probability density function. There are parameters which define these
functions, and these parameters are to be estimated to test whether the given data
follow some particular distribution.
MLE attempts to find the parameter values that maximize the likelihood
function, given the observations. The resulting estimate is called a maximum
likelihood estimate, which is also abbreviated as MLE.
Suppose we have a random sample X = {x1,...,xn} taken from a probability
distribution having the probability density function p(x∣θ), where x denotes a
value of the random variable and θ denotes the set of parameters that appear in
the function.
The likelihood of sample X is a function of the parameter θ and is defined as
l(θ)=p(x1∣θ)p(x2∣θ)...p(xn∣θ)
In ML estimation, find the value of θ that makes the value of the likelihood
function maximum.
For computation convenience, define the log likelihood function as the logarithm
of the likelihood function:
L(θ) = log l(θ)
= log p(x1∣θ)+log p(x2∣θ)+⋯+log p(xn∣θ)
We represent each outcome by an ordered K-tuple x = (x1,...,xK) where exactly
one of x1,...,xK is 1 and all others are 0; xi = 1 if the outcome is in the i-th class.
The probability function can be expressed as
f(x∣p1,...,pK) = p1^x1 ⋯ pK^xK.
Here, p1,...,pK are the parameters.
We choose n random samples. The i-th sample may be represented by
x^i = (x1^i,...,xK^i).
The values of the parameters that maximize the likelihood function can be
shown to be the observed class proportions, p̂j = (1/n) Σi xj^i.
For a normal (Gaussian) distribution, setting up the equations dL/dμ = 0 and dL/dσ = 0 and
solving for μ and σ, we get the maximum likelihood estimates as the sample mean
μ̂ = (1/N) Σt x^t and the standard deviation σ̂ with σ̂² = (1/N) Σt (x^t − μ̂)².
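A quick check (mine) of the normal-distribution estimates on sample data:

import numpy as np

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])
mu_hat = x.mean()                                  # MLE of the mean
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))    # MLE of sigma (divides by N, not N - 1)
print(mu_hat, sigma_hat)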
21. Apply Naive Bayes classification in the given dataset and determine the class of
the given sample. Given a new instance, X = (age <=30,Income =
medium,Student = yes, Credit_rating = Fair)
age | income | student | credit_rating | buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
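The remaining conditional probabilities and the final comparison can be obtained by counting
from the table above. The short sketch below (mine) does exactly that; it gives
P(X|yes)P(yes) ≈ 0.028 versus P(X|no)P(no) ≈ 0.007, so the predicted class for X is
buys_computer = yes.

data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")        # the new instance to classify

for c in ("yes", "no"):
    rows = [r for r in data if r[-1] == c]
    prior = len(rows) / len(data)            # P(C_i)
    likelihood = 1.0
    for k, value in enumerate(X):            # product of the P(x_k | C_i)
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    print(c, round(prior, 3), round(likelihood, 4), round(prior * likelihood, 4))
# yes: prior 0.643, P(X|yes) = 0.0439, product 0.0282
# no : prior 0.357, P(X|no)  = 0.0192, product 0.0069  -> predict buys_computer = yes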
22. What are the different types of regression? Explain each.
A regression problem is the problem of determining a relation between one or
more independent variables and an output variable which is a real continuous
variable, given a set of observed values of the set of independent variables and
the corresponding values of the output variable.
Example
Consider the navigation of a mobile robot, say an autonomous car. The output
is the angle by which the steering wheel should be turned at each time, to
advance without hitting obstacles and deviating from the route. Inputs are
provided by sensors on the car like a video camera, GPS, and so forth.
Different regression models
The different regression models are defined based on the type of function used to represent
the relation between the dependent variable y and the independent variables.
1. Simple linear regression
Assume that there is only one independent variable x. If the relation between x
and y is modelled by the relation y = a + bx,
then we have a simple linear regression.
2. Multiple regression
Let there be more than one independent variable, say x1, x2, ..., xn, and let the
relation between y and the independent variables be modelled as
y = α0 + α1x1 + ⋯ + αnxn;
then it is a case of multiple linear regression or multiple regression.
3. Polynomial regression
Let there be only one independent variable x and let the relation between x and y be modelled
as
y = a0 + a1x + a2x² + ⋯ + anxⁿ
for some positive integer n > 1; then we have a polynomial regression.
4. Logistic regression
Logistic regression is used when the dependent variable is binary (0/1,
True/False, Yes/No) in nature. Even though the output is a binary variable,
what is being sought is a probability function which may take any value
from 0 to 1.
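A minimal sketch (mine) of simple linear regression, using the least-squares formulas for a
and b:

import numpy as np

def simple_linear_regression(x, y):
    """Fit y = a + b*x by least squares."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
a, b = simple_linear_regression(x, y)
print(round(a, 3), round(b, 3))   # intercept close to 0, slope close to 2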
MODULE IV
2. Identify the best splitting attribute for the following data using information gain.
Attributes Classes
Gender Car ownership Travel cost Income Transportation mode
Male 0 cheap Low Bus
Male 1 cheap Medium Bus
Female 1 cheap Medium Train
Female 0 cheap Low Bus
Male 1 cheap Medium Bus
Male 0 standard Medium Train
Female 1 standard Medium Train
Female 1 Expensive High Car
Male 2 Expensive Medium Car
Female 2 Expensive High Car
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated
by |Ci,D|/|D|.
The expected information (entropy) needed to classify a tuple in D is
Info(D) = −Σ(i=1 to m) pi log2(pi)
and Info_A(D) = Σ(j=1 to v) (|Dj|/|D|) Info(Dj) is the information still required after
partitioning D on attribute A. Information gain is defined as the difference between the
original information requirement (i.e., based on just the proportion of classes) and the new
requirement (i.e., obtained after partitioning on A):
Gain(A) = Info(D) − Info_A(D)
Gain(A) tells us how much would be gained by branching on A.
b. Gain Ratio
It is a kind of normalization to information gain using a “split information” value
defined analogously with Info(D). Normalization is applied to overcome the
problem of multivalued attribute.
GainRatio(A) = Gain(A)/SplitInfo(A)
c. Gini Index
The Gini index measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ(i=1 to m) pi²
The attribute that provides the smallest Gini_split(D) (or, equivalently, the largest reduction
in impurity) is chosen to split the node (we need to enumerate all the possible splitting
points for each attribute).
The Gini index considers a binary split for each attribute.
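A short sketch (mine) that computes the information gain of each attribute for the dataset in
question 2 of this module; running it shows that Travel cost has the highest gain (about 1.21
bits), so it is the best splitting attribute.

import math
from collections import Counter

rows = [
    ("Male", "0", "cheap", "Low", "Bus"),
    ("Male", "1", "cheap", "Medium", "Bus"),
    ("Female", "1", "cheap", "Medium", "Train"),
    ("Female", "0", "cheap", "Low", "Bus"),
    ("Male", "1", "cheap", "Medium", "Bus"),
    ("Male", "0", "standard", "Medium", "Train"),
    ("Female", "1", "standard", "Medium", "Train"),
    ("Female", "1", "expensive", "High", "Car"),
    ("Male", "2", "expensive", "Medium", "Car"),
    ("Female", "2", "expensive", "High", "Car"),
]
attributes = ["Gender", "Car ownership", "Travel cost", "Income"]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

info_d = entropy([r[-1] for r in rows])
for k, name in enumerate(attributes):
    info_a = sum(
        (len(subset) / len(rows)) * entropy([r[-1] for r in subset])
        for subset in ([r for r in rows if r[k] == v] for v in set(r[k] for r in rows))
    )
    print(name, round(info_d - info_a, 3))   # Travel cost gets the highest gain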
4. Outline the basic algorithm/ID3 for inducing a decision tree from training
tuples.
ID3(D, Attributes, Attribute_selection_method)
1. Create a node N;
2. If all tuples in D are of same class C, then
return N as a leaf node labelled with class C;
3. If Attributes is empty then
return N as a leaf node labelled with the majority class in D;
4. Apply attribute_selection_method(D, attribute_list) to find the attribute
that best classifies tuples in D.
5. Label node N with splitting_criterion;
6. If splitting_attribute is discrete-valued and multiway splits allowed
then
Attribute_list = attribute_list-splitting_attribute
7. For each outcome j of splitting_criterion
8. Let Dj be the set of tuples in D that satisfy outcome j;
9. If Dj is empty then
attach a leaf node with label of the majority class in D
10. Else add the subtree ID3(Dj, Attribute_list)
11. Return N
5. Analyze the practical issues of decision tree learning
Learning a tree that classifies the training data perfectly may not lead to the tree
with the best generalization performance. Training error no longer provides a
good estimate of how well the tree will perform on previously unseen records
The following are the practical issues of decision tree learning:
i) Overfitting: H more complex than C or f
Overfitting results in decision trees that are more complex than
necessary. Too many branches, some of which may reflect anomalies due to noise or
outliers, lead to poor accuracy on unseen samples.
A hypothesis h is said to overfit the training data if there is another
hypothesis, h’, such that h has smaller error than h’ on the training data but h
has larger error on the test data than h’.
Two cases of overfitting:
Overfitting can occur when there is noise in the data or when the number of training examples
is too small. In these cases the algorithm can produce trees that overfit the
training examples.
i) Overfitting due to noise in the training data.
Post-pruning: A cross validation approach
- Partition training data into “grow” set and “validation” set.
- Build a complete tree for the “grow” data
- Until accuracy on the validation set decreases, do:
For each non-leaf node in the tree
Temporarily prune the subtree below it; replace it by a leaf with the majority vote
Test the accuracy of the hypothesis on the validation set
Permanently prune the node whose pruning gives the greatest increase
in accuracy on the validation set.
7. CART Algorithm: Regression Trees for Prediction
Regression trees are used with a continuous outcome variable. The procedure is similar
to that of a classification tree: many splits are attempted, and the one that minimizes
impurity is chosen.
CART, or Classification And Regression Trees, builds a binary decision tree
that is constructed by splitting a node into two child nodes repeatedly,
beginning with the root node that contains the whole learning sample. That is,
the data is split into two partitions, and partitions can in turn be split into
sub-partitions; hence the procedure is recursive. Each split is based on only one variable.
• CART tree is generated by repeated partitioning of data set
• Parent gets two children
• Each child produces two grandchildren
• Four grandchildren produce 8 great grandchildren
The main elements of CART are:
Rules for splitting data at a node based on the value of one variable
Stopping rules for deciding when a branch is terminal and can be split
no more
A prediction for the target variable in each terminal node
Key CART features
o Automated field selection
Handles any number of fields
Automatically selects relevant fields
No data preprocessing needed
Does not require any kind of variable transforms
Impervious to outliers
Missing value tolerant
Moderate loss of accuracy due to missing values
CART Algorithm Steps
Decision Tree building algorithm involves a few simple steps and these are:
i. Take Labelled Input data - with a Target Variable and a list of Independent
Variables
ii. Best Split: Find Best Split for each of the independent variables
iii. Best Variable: Select the Best Variable for the split
iv. Split the input data into Left and Right Nodes
v. Continue steps ii-iv on each of the nodes until the stopping criteria are met.
vi. Decision Tree Pruning: prune the decision tree that has been built.
8. Explain the concept of the perceptron by implementing binary AND and OR logic.
Boolean AND logic: Y = s(x1 + x2 − 1.5)
Boolean OR logic: Y = s(x1 + x2 − 0.5)
where s(·) is the threshold (unit step) function. The Boolean functions AND and OR are
linearly separable, so a single perceptron can implement each of them (see the sketch below).
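A minimal sketch (mine) of the two perceptrons, with s(·) implemented as a unit step function:

def step(z):
    return 1 if z >= 0 else 0

def and_gate(x1, x2):
    return step(x1 + x2 - 1.5)    # fires only when both inputs are 1

def or_gate(x1, x2):
    return step(x1 + x2 - 0.5)    # fires when at least one input is 1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_gate(x1, x2), or_gate(x1, x2))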
9. Explain back propagation algorithm and write down back propagation algorithm
in Neural networks.
The backpropagation algorithm iteratively processes a set of training tuples and compares
the network's prediction with the actual known target value. For each training
tuple, the weights are modified so as to minimize the mean squared error between the
network's prediction and the actual target value. Modifications are made in the
“backwards” direction: from the output layer, through each hidden layer, down to
the first hidden layer, hence the name “backpropagation”.
The net input Ij to a unit j in a hidden or output layer is
Ij = Σi wij Oi + θj
where wij is the weight of the connection from unit i in the previous layer to unit
j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit.
v) Each unit in the hidden and output layers takes its net input and then applies
an activation function to it. The logistic, or sigmoid, function is used. Given
the net input Ij to unit j, then Oj, the output of unit j, is computed as
It maps a large input domain onto the smaller range of 0 to 1:
Oj = 1 / (1 + e^(−Ij))
vi) Backpropagate the error: the error is propagated backward by updating the
weights and biases to reflect the error of the network's prediction. For a unit j
in the output layer, the error Errj is computed by
Errj = Oj (1 − Oj) (Tj − Oj)
where Tj is the known target value for the training tuple.
vii) For a unit j in a hidden layer, the error is the weighted sum of the errors of the
units connected to j in the next layer:
Errj = Oj (1 − Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to a unit k in the next
higher layer, and Errk is the error of unit k.
viii) The weights and biases are updated to reflect the propagated errors. Weights
are updated by Δwij = (l) Errj Oi and wij = wij + Δwij, where l is the learning rate and
Δwij is the change in weight wij; biases are updated analogously by Δθj = (l) Errj and
θj = θj + Δθj.
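A compact NumPy sketch (my own, using batch updates, a sigmoid activation and the error terms
above) of one hidden layer trained by backpropagation; it learns XOR as a small sanity check,
though convergence depends on the random initialization.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)           # input -> hidden weights and biases
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)           # hidden -> output weights and biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(5000):
    # Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)
    # Backward pass: Err_j = O_j(1 - O_j)(T_j - O_j) at the output,
    # Err_j = O_j(1 - O_j) * sum_k Err_k w_jk at the hidden layer.
    err_out = O * (1 - O) * (T - O)
    err_hid = H * (1 - H) * (err_out @ W2.T)
    # Updates: delta w_ij = lr * Err_j * O_i  (and delta theta_j = lr * Err_j)
    W2 += lr * H.T @ err_out; b2 += lr * err_out.sum(axis=0)
    W1 += lr * X.T @ err_hid; b1 += lr * err_hid.sum(axis=0)

print(np.round(O.ravel(), 2))    # outputs should approach [0, 1, 1, 0]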
Solution: Training can be posed as an optimization problem in which the goal is to optimize a
function (usually to minimize a cost function E) with respect to a number of free variables,
usually the weights wi. The gradient descent algorithm begins from an initialization of the
weights (e.g. a random initialization) and in an iterative procedure updates the weights wi by
a quantity Δwi, where Δwi = −η(∂E/∂wi) and (∂E/∂wi) is the gradient of the cost function with
respect to the weights, while η is a constant (the learning rate) which takes small values in
order to keep the updates low and avoid oscillations.
(2) Question: Derive the gradient descent training rule assuming that the
target function representation is:
od = w0 + w1x1 + … + wnxn.
Define explicitly the cost/error function E, assuming that a set of training examples D is
provided, where each training example d ∈ D is associated with the target output td.
Solution: The error function is E = (1/2) Σ(d ∈ D) (td − od)².
Then ∂E/∂wi = (1/2) Σ(d ∈ D) 2(td − od)(−xid) = −Σ(d ∈ D) (td − od) xid, where xid is the value
of input xi for training example d. Substituting this into Δwi = −η(∂E/∂wi) gives the gradient
descent training rule Δwi = η Σ(d ∈ D) (td − od) xid.
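A tiny sketch (mine) of this batch update rule for a linear unit:

import numpy as np

def gradient_descent_linear_unit(X, t, eta=0.01, epochs=2000):
    """Batch gradient descent for o = w0 + w1*x1 + ... + wn*xn."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend a column of 1s for w0
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        o = Xb @ w
        w += eta * Xb.T @ (t - o)                    # delta w_i = eta * sum_d (t_d - o_d) x_id
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])                   # generated by t = 1 + 2x
print(np.round(gradient_descent_linear_unit(X, t), 2))   # approximately [1. 2.]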
Instance Classification a1 a2
1 + T T
2 + T T
3 - T F
4 + F F
5 - F T
6 - F T
What is the information gain of a2 relative to these training examples? Provide the equation
for calculating the information gain as well as the intermediate results.
Solution:
Entropy E(S) = E([3+, 3−]) = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1.
E(T) = E([2+, 2−]) = 1.
E(F) = E([1+, 1−]) = 1.
Gain(S, a2) = E(S) − (4/6)E(T) − (2/6)E(F) = 1 − 4/6 − 2/6 = 0.
(5) Question: Suppose that we want to build a neural network that classifies two
dimensional data (i.e., X = [x1, x2]) into two classes: diamonds and crosses. We
have a set of training data that is plotted as follows:
[Figure: scatter plot of the training data in the (X1, X2) plane; the crosses lie in a compact
region surrounded by the diamonds.]
Draw a network that can solve this classification problem. Justify your choice of the number of
nodes and the architecture. Draw the decision boundary that your network can find on the
diagram.
Solution:
A solution is a multilayer FFNN with 2 inputs, one hidden layer with 4 neurons and an output
layer with 1 neuron. The network should be fully connected, that is, there should be connections
between all nodes in one layer and all the nodes in the previous (and next) layer. We have to
use two inputs because the input data is two-dimensional. We use an output layer with one
neuron because we have 2 classes. One hidden layer is enough because there is a single
compact region that contains the data from the crosses class and does not contain data from the
diamonds class. This region can have 4 lines as borders, therefore it suffices to have 4
neurons in the hidden layer. The 4 neurons in the hidden layer describe 4 separating lines and
the neuron at the output layer describes the square that is contained between these 4 lines.
(6) Question: Suppose that we want to solve the problem of finding out what a
good car is by using genetic algorithms. Suppose further that the solution to
the problem can be represented by a decision tree as follows:
[Figure: a decision tree with root node "size" (branches large/mid/small); its subtrees test
"brand" (Volvo/BMW/SUV), "sport" (yes/no) and "engine" (F12/V10/V8), with yes/no leaves
indicating whether the car is good.]
What is the appropriate chromosome design for the given problem? Which genetic algorithm
parameters need to be defined? What would be suitable values of those parameters for the given
problem? Provide a short explanation for each.
What is the result of applying a single round of the prototypical genetic algorithm? Explain
your answer in a clear and compact manner by providing the pseudo code of the algorithm.
Solution:
size = {large, mid, small} → 100, 010, 001, 011, …, 111, 000
brand = {Volvo, BMW, SUV} → 100, 010, 001, 011, …, 111, 000
sport = {yes, no} → 10, 01, 11, 00
engine = {F12, V12, V8} → 100, 010, 001, 011, …, 111, 000
GoodCar = {yes, no} → 10, 01, 11, 00
→ chromosome design (concatenation of the encoded attributes), e.g.:
size | brand | sport | engine | GoodCar
100  | 100   | 11    | 111    | 01
The fitness function for the given problem can be defined as a sigmoid function f(x)
= 1 / (1 + e^(−x)), where x is the percentage of all training examples correctly classified by
a specific solution (chromosome).
Selection method – e.g., the rank selection method can be used.
Crossover technique – 2-point crossover can be used for the given problem with a crossover
mask 1111110000011; the reason is that either size + brand or sport + engine defines the
solution.
Crossover rate – usually k = 60%.
Mutation rate – usually 1%.
Termination condition – e.g., all training examples are correctly classified.
GA pseudo code:
Step 1: Choose initial population.
Step 2: Evaluate the fitness of individuals in the population.
Step 3: Select k individuals to reproduce; breed the new generation through crossover and
mutation; evaluate the individual fitness of the offspring; replace the worse-ranked part of
the population with the offspring.
Step 4: Repeat step 3 until the termination condition is reached.
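A minimal, generic sketch (mine) of the prototypical GA loop described above, using a toy
fitness function (the number of 1-bits) instead of the car-classification fitness:

import random

random.seed(0)
CHROM_LEN, POP_SIZE, CROSS_RATE, MUT_RATE = 13, 20, 0.6, 0.01

def fitness(chrom):                        # toy fitness: number of 1-bits
    return sum(chrom)

def crossover(a, b):
    p1, p2 = sorted(random.sample(range(1, CHROM_LEN), 2))   # 2-point crossover
    return a[:p1] + b[p1:p2] + a[p2:]

def mutate(chrom):
    return [bit ^ 1 if random.random() < MUT_RATE else bit for bit in chrom]

population = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)       # rank individuals by fitness
    parents = population[: POP_SIZE // 2]            # rank selection: keep the better half
    children = []
    while len(children) < POP_SIZE - len(parents):
        a, b = random.sample(parents, 2)
        child = crossover(a, b) if random.random() < CROSS_RATE else a[:]
        children.append(mutate(child))
    population = parents + children                  # replace the worse-ranked part
print(max(fitness(c) for c in population))           # best fitness after 50 generations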
4. What do you mean by a well-posed learning problem? Explain the important features
that are required to well-define a learning problem.
5. Explain the inductive biased hypothesis space and unbiased learner
6. What are the basic design issues and approaches to machine learning?
7. How is Candidate Elimination algorithm different from Find-S Algorithm
8. How do you design a checkers learning problem
9. Explain the various stages involved in designing a learning system
10. Trace the Candidate Elimination Algorithm for the hypothesis space H’ given the
sequence of training examples from Table 1.
H’= < ?, Cold, High, ?,?,?>v<Sunny, ?, High, ?,?,Same>
11. Differentiate between Training data and Testing Data
12. Differentiate between Supervised, Unsupervised and Reinforcement Learning
13. What are the issues in Machine Learning
14. Explain the List Then Eliminate Algorithm with an example
15. What is the difference between Find-S and Candidate Elimination Algorithm
16. Explain the concept of Inductive Bias
17. With a neat diagram, explain how you can model inductive systems by equivalent
deductive systems
18. What do you mean by Concept Learning?
Module -2 Questions.
Instance Classification a1 a2
1 + T T
2 + T T
3 - T F
4 + F F
5 - F T
6 - F T
(a) What is the entropy of this collection of training examples with respect to the
target function classification?
(b) What is the information gain of a2 relative to these training examples?
3. NASA wants to be able to discriminate between Martians (M) and Humans (H) based on
the following characteristics: Green ∈{N, Y} , Legs ∈{2,3} , Height ∈{S, T}, Smelly
∈{N, Y}
Our available training data is as follows:
Species Green Legs Height Smelly
1 M N 3 S Y
2 M Y 2 T N
3 M Y 3 T N
4 M N 2 S Y
5 M Y 3 T N
6 H N 2 T Y
7 H N 2 S N
8 H N 2 T N
9 H Y 2 S N
10 H N 2 T Y
a) Greedily learn a decision tree using the ID3 algorithm and draw the tree.
b) (i) Write the learned concept for Martian as a set of conjunctive rules (e.g., if
(green=Y and legs=2 and height=T and smelly=N), then Martian; else if ... then Martian;...;
else Human).
(ii) The solution of part b)i) above uses up to 4 attributes in each conjunction. Find a set of
conjunctive rules using only 2 attributes per conjunction that still results in zero error in the training
set. Can this simpler hypothesis be represented by a decision tree of depth 2? Justify.
4. Discuss Entropy in ID3 algorithm with an example
5. Compare Entropy and Information Gain in ID3 with an example.
6. Describe hypothesis Space search in ID3 and contrast it with Candidate-Elimination
algorithm.
7. Relate Inductive bias with respect to Decision tree learning.
8. Illustrate Occam’s razor and relate the importance of Occam’s razor with respect to
ID3 algorithm.
9. List the issues in Decision Tree Learning. Interpret the algorithm with respect to
Overfitting the data.
10. Discuss the effect of reduced Error pruning in decision tree algorithm.
11. What type of problems are best suited for decision tree learning
12. Write the steps of the ID3 algorithm
13. What are the capabilities and limitations of ID3
14. Define (a) Preference Bias (b) Restriction Bias
15. Explain the various issues in Decision tree Learning
16. Describe Reduced Error Pruning
17. What are the alternative measures for selecting attributes
18. What is Rule Post Pruning
Module -3 Questions.
Module -4 Questions.
Module -5 Questions.