UNIT IV
Ensemble Techniques and Unsupervised Learning
Syllabus
Combining multiple learners: Model combination schemes, Voting, Ensemble Learning - bagging, boosting, stacking, Unsupervised learning: K-means, Instance Based Learning: KNN, Gaussian mixture models and Expectation maximization.
• When designing a learning machine, we make several choices: the parameters of the machine, the training data, the representation, and so on. Each of these choices introduces some variance in performance. For example, in a classification setting we may use a parametric classifier, and in a multilayer perceptron we must also decide on the number of hidden units.
• Each learning algorithm dictates a certain model that comes with a set of assumptions.
This inductive bias leads to error if the assumptions do not hold for the data.
• Different learning algorithms have different accuracies. The no free lunch theorem asserts that no single learning algorithm always achieves the best performance in every domain. Learners can therefore be combined to attain higher accuracy.
• Data fusion is the process of fusing multiple records representing the same real-world
object into a single, consistent, and clean representation. Fusion of data for improving
prediction accuracy and reliability is an important problem in machine learning.
• Combining different models is done to improve the performance of deep learning
models. Building a new model by combination requires less time, data, and
computational resources. The most common method to combine models is by
averaging multiple models, where taking a weighted average improves the accuracy.
• Different Algorithms: We can use different learning algorithms to train different base-
learners. Different algorithms make different assumptions about the data and lead to
different classifiers.
• Different Hyper-parameters: We can use the same learning algorithm but with different hyper-parameters.
• Different Input Representations: Different representations make different characteristics explicit, allowing better identification.
• Different Training Sets: Another possibility is to train different base learners on different subsets of the training set.
• Two different methods are used for generating the final output from multiple base learners: multiexpert combination and multistage combination.
1. Multiexpert combination
• Let us assume that we want to construct a function that maps inputs to outputs from a set of N_train known input-output pairs
D_train = {(x_i, y_i)}, i = 1, …, N_train
where x_i ∈ X is a D-dimensional feature input vector and y_i ∈ Y is the output.
• Classification: the output takes values in a discrete set of class labels.
4.1.2 Voting
• In this method, the first step is to create multiple classification/regression models using some training dataset. Each base model can be created using different splits of the same training dataset and the same algorithm, using the same dataset with different algorithms, or by other methods.
• Fig. 4.1.2 shows the general idea of base learners combined with a model combiner.
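As a runnable illustration of this idea, the following is a minimal scikit-learn sketch of hard (majority) voting; the dataset and the three base learners are only assumptions for the example, not part of these notes:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Three heterogeneous base models trained on the same data; each casts one vote
    voter = VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=5000)),
                    ("nb", GaussianNB()),
                    ("dt", DecisionTreeClassifier(max_depth=3, random_state=0))],
        voting="hard")          # majority vote of the base learners
    print(cross_val_score(voter, X, y, cv=5).mean())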
• When combining multiple independent and diverse decisions each of which is at least
more accurate than random guessing, random errors cancel each other out, and correct
decisions are reinforced. Human ensembles are demonstrably better.
• Use a single, arbitrary learning algorithm but manipulate training data to make it learn
multiple models.
• The problem here is that if there is an error with one of the base-learners, there may
be a misclassification because the class code words are so similar. So the approach in
error-correcting codes is to have L>K and increase the Hamming distance between the
code words.
• One possibility is pairwise separation of classes, where there is a separate base learner to separate C_i from C_j for i < j.
• For pairwise separation, L = K(K-1)/2. For K = 4 classes and L = 6 base learners, the code matrix is
W = [ +1  +1  +1   0   0   0
      -1   0   0  +1  +1   0
       0  -1   0  -1   0  +1
       0   0  -1   0  -1  -1 ]
• With a reasonable L, we find W such that the Hamming distances between rows and between columns are maximized.
• The voting scheme is
y_i = Σ_{j=1}^{L} W_ij d_j
where d_j is the output of base learner j.
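A small worked example of this voting scheme, using the pairwise code matrix W above and hypothetical base-learner outputs d_j:

    import numpy as np

    # The pairwise code matrix W above (K = 4 classes, L = 6 base learners)
    W = np.array([[+1, +1, +1,  0,  0,  0],
                  [-1,  0,  0, +1, +1,  0],
                  [ 0, -1,  0, -1,  0, +1],
                  [ 0,  0, -1,  0, -1, -1]])

    # Hypothetical base-learner outputs d_j for one test input
    d = np.array([0.9, -0.2, 0.7, -0.8, 0.4, -0.6])

    y = W @ d                          # y_i = sum_j W_ij * d_j
    print("class scores:", y)
    print("predicted class:", int(np.argmax(y)) + 1)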
• Ensemble modeling is the process of running two or more related but different analytical models and then synthesizing the results into a single score or spread in order to improve the accuracy of predictive analytics and data mining applications.
• An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples.
• Ensemble methods combine several decision tree classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind the ensemble model is that a group of weak learners comes together to form a strong learner, thus increasing the accuracy of the model.
• Why do ensemble methods work?
• They are based on one of two basic observations:
1. Variance reduction: If the training sets are completely independent, it always helps to average an ensemble, because this reduces variance without affecting bias (e.g. bagging) and reduces sensitivity to individual data points.
2. Bias reduction: For simple models, an average of models has much greater capacity than a single model. Averaging models can reduce bias substantially by increasing capacity, while variance is controlled by fitting one component at a time.
4.2.1 Bagging
• Bagging is also called Bootstrap aggregating. Bagging and boosting are meta-algorithms that pool decisions from multiple classifiers. Bagging creates ensembles by repeatedly and randomly resampling the training data.
• Bagging was the first effective method of ensemble learning and is one of the simplest methods of arcing. The meta-algorithm, which is a special case of model averaging, was originally designed for classification and is usually applied to decision tree models, but it can be used with any type of model for classification or regression.
• Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Although unsupervised models, such as clustering, do not directly generate a label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects.
• Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement. Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are replicates. The m resulting models are combined using a simple majority vote.
• In particular, on each round, the base learner is trained on what is often called a
“bootstrap replicate” of the original training set. Suppose the training set consists of
n examples. Then a bootstrap replicate is a new training set that also consists of n
examples, and which is formed by repeatedly selecting uniformly at random and with
replacement n examples from the original training set. This means that the same
example may appear multiple times in the bootstrap replicate, or it may appear not at
all.
• Bagging also decreases error by decreasing the variance in the results due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.
• Pseudocode:
1. Given training data (x_1, y_1), …, (x_m, y_m)
2. For t = 1, …, T:
a. Form a bootstrap replicate dataset S_t by selecting m random examples from the training set with replacement.
b. Let h_t be the result of training the base learning algorithm on S_t.
3. Output the combined classifier: the majority vote of h_1, …, h_T.
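A minimal runnable sketch of this pseudocode, assuming integer class labels and scikit-learn decision trees as the base learner (the helper names bagging_fit and bagging_predict are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, T=25, seed=0):
        """Train T trees, each on a bootstrap replicate of (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        models = []
        for _ in range(T):
            idx = rng.integers(0, n, size=n)       # sample n examples with replacement
            model = DecisionTreeClassifier()
            models.append(model.fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Combine the T models by simple majority vote."""
        votes = np.stack([m.predict(X) for m in models])
        return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])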
Bagging Steps:
1. Suppose there are N observations and M features in the training data set. A sample from the training data set is taken randomly with replacement.
2. A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node iteratively.
3. The tree is grown to its largest extent.
4. The above steps are repeated n times, and the prediction is given based on the aggregation of predictions from the n trees.
Advantages of Bagging:
1. Reduces the variance of unstable learners, which helps to avoid over-fitting.
Disadvantages of Bagging:
1. Since the final prediction is based on the mean of the predictions from the subset trees, it won't give precise values for the classification and regression model.
4.2.2 Boosting
• Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors. In boosting, a random sample of data is selected, fitted with a model and then trained sequentially; that is, each model tries to compensate for the weaknesses of its predecessor.
• A learner is weak if it produces a classifier that is only slightly better than random
guessing, while a learner is said to be strong if it produces a classifier that achieves a
low error with high confidence for a given concept.
• AdaBoost is a practical boosting algorithm for building ensembles that empirically improves generalization performance. Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong.
• Boosting is a bias reduction technique. It typically improves the performance of a single tree model. A reason for this is that we often cannot construct trees which are sufficiently large due to the thinning out of observations in the terminal nodes.
• Boosting is then a device to come up with a more complex solution by taking a linear combination of trees. In the presence of high-dimensional predictors, boosting is also very useful as a regularization technique for additive or interaction modeling.
• To begin, we define an algorithm for finding the rules of thumb, which we call a weak
learner. The boosting algorithm repeatedly calls this weak learner, each time feeding
it a different distribution over the training data. Each call generates a weak classifier
and we must combine all of these into a single classifier that, hopefully, is much more
accurate than any one of the rules.
• We train a set of weak hypotheses h_1, …, h_T. The combined hypothesis H is a weighted majority vote of the T weak hypotheses. During training, the focus is on the examples that are misclassified.
AdaBoost:
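The following is a minimal from-scratch sketch of the standard AdaBoost.M1 reweighting scheme, assuming binary labels in {-1, +1} and scikit-learn decision stumps as weak learners (the helper names are illustrative, not a standard API):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=50):
        """Train T decision stumps with AdaBoost reweighting. y must be in {-1, +1}."""
        n = len(y)
        w = np.full(n, 1.0 / n)              # uniform example weights to start
        stumps, alphas = [], []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.sum(w[pred != y])       # weighted training error
            if err >= 0.5:                   # weak learner no better than chance: stop
                break
            alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
            w *= np.exp(-alpha * y * pred)   # increase weight of misclassified examples
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        """Weighted majority vote of the weak hypotheses."""
        votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(votes)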
Advantages of AdaBoost:
1. Very simple to implement.
2. Fairly good generalization.
3. The prior error need not be known ahead of time.
Disadvantages of AdaBoost:
1. Suboptimal solution.
2. Can overfit in the presence of noise.
Boosting Steps:
1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
2. Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree, and use them to train a third weak learner C3.
4. Combine all the weak learners via majority voting.
Advantages of Boosting:
1. Supports different loss functions.
2. Works well with interactions.
Disadvantages of Boosting:
1. Prone to over-fitting.
2. Requires careful tuning of different hyper-parameters.
4.2.3 Stacking
• Stacking, sometimes called stacked generalization, is an ensemble machine learning method that combines multiple heterogeneous base or component models via a meta-model.
• The base models are trained on the complete training data, and then the meta-model is trained on the predictions of the base models. The advantage of stacking is the ability to explore the solution space with different models on the same problem.
• The stacking model can be visualized in levels and has at least two levels of models. The first level typically trains two or more base learners (which can be heterogeneous) and the second level might be a single meta-learner that takes the base models' predictions as input and gives the final result as output. A stacked model can have more than two such levels, but increasing the levels doesn't always guarantee better performance.
• In classification tasks, logistic regression is often used as the meta-learner, while linear regression is more suitable as the meta-learner for regression-based tasks.
• Stacking is concerned with combining multiple classifiers generated by different learning algorithms L1, …, LN on a single dataset S, which is composed of feature vectors Si = (xi, ti).
A meta-level dataset is then built in which the features are the predictions of the base-level classifiers and the class is the correct class of the example in hand.
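A minimal stacking sketch using scikit-learn's StackingClassifier, with a logistic regression meta-learner as described above; the dataset and the particular base learners are only example choices:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import StackingClassifier, RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Level-0: heterogeneous base learners; Level-1: logistic regression meta-learner
    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                    ("svm", SVC(probability=True, random_state=0)),
                    ("knn", KNeighborsClassifier(n_neighbors=5))],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5)        # meta-level features come from cross-validated base predictions
    print(stack.fit(X_tr, y_tr).score(X_te, y_te))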
4.2.4 AdaBoost
• AdaBoost, also referred to as adaptive boosting, is a method in machine learning used as an ensemble method. The most common algorithm used with AdaBoost is decision trees with one level, that is, decision trees with only one split. These trees are also referred to as decision stumps.
Stump
• In the ensemble approach, we add the weak models sequentially and then train them using weighted training data.
• We continue to iterate the process until we reach a pre-set number of weak learners or we cannot observe further improvement on the dataset. At the end of the algorithm, we are left with a number of weak learners, each with a stage value.
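A brief usage sketch with scikit-learn's AdaBoostClassifier, which by default boosts one-level decision trees (decision stumps); the dataset is only an example:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    # By default AdaBoostClassifier boosts decision stumps (depth-1 trees)
    clf = AdaBoostClassifier(n_estimators=50, random_state=0)
    print(cross_val_score(clf, X, y, cv=5).mean())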
4.2.5 Difference between Bagging and Boosting
3. Bagging mainly aims to reduce variance. Boosting mainly aims to reduce bias.
4. In bagging, every model receives an equal weight. In boosting, models are weighted by their performance.
4.3 Clustering
• Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).
• Cluster analysis can be a powerful data-mining tool for any organization that needs to identify discrete groups of customers, sales transactions, or other types of behaviors and things. For example, insurance providers use cluster analysis to detect fraudulent claims and banks use it for credit scoring.
• Clustering means grouping of data or dividing a large data set into smaller data sets of
some similarity.
• A clustering algorithm attempts to find natural groups of components or data based on some similarity. The clustering algorithm also finds the centroid of a group of data sets.
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
• Types of clustering techniques: The major clustering techniques are
a) Partitioning methods
b) Hierarchical methods
c) Density – based methods.
• The K-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest K-center. The data points that are near a particular K-center form a cluster.
Hence each cluster has data points with some commonalities, and it is away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
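A minimal scikit-learn sketch of these steps; the small two-variable dataset is a made-up stand-in for the M1/M2 example that follows:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical 2-D data with two variables M1 and M2
    X = np.array([[1.0, 1.2], [1.5, 1.8], [1.2, 0.9],
                  [8.0, 8.2], [8.5, 7.9], [7.8, 8.4]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("labels:   ", km.labels_)           # cluster assignment of each point
    print("centroids:", km.cluster_centers_)  # final K center points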
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as K points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics that we have studied for calculating the distance between two points. So, we will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near to K1 or the blue centroid, and points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing a new centroid. To choose the new centroids, we will compute the center of gravity of these centroids and find the new centroids as below:
o Next, we will reassign each data point to the new centroid. For this, we will repeat the same process of finding a median line. The median will be like the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to the new centroids.
As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of the centroids, so the new centroids will be as shown in the below image:
o As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
The performance of the K-means clustering algorithm depends upon the highly efficient clusters that it forms. But choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters; here we discuss the most appropriate method to find the number of clusters, or the value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
Here ∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges
from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm (elbow), is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, it is known as the elbow method. The graph for the elbow method looks like the below image:
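A minimal sketch of the elbow method with scikit-learn, where the attribute inertia_ gives the WCSS for each K; the data here is random placeholder data:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data

    wcss = []
    for k in range(1, 11):                                # K ranges from 1 to 10
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss.append(km.inertia_)                          # inertia_ is the WCSS for this K
    # Plot k (1..10) against wcss and pick K at the "elbow" of the curve, e.g. with matplotlib:
    # import matplotlib.pyplot as plt; plt.plot(range(1, 11), wcss, marker="o"); plt.show()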
4.4 Instance-Based Learning: K-Nearest Neighbor (KNN)
• Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, because it works on a similarity measure. Our KNN model will find the features of the new data set that are similar to the cat and dog images and, based on the most similar features, place the image in either the cat or the dog category.
• Suppose there are two categories, i.e., category A and category B, and we have a new data point x1. This data point will lie in one of these categories. To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular dataset. Consider the below diagram:
• The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors.
Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
Step-3: Take the K nearest neighbors according to the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
• Suppose we have a new data point and we need to put it in the required category. Consider the below image:
• Firstly, we will choose the number of neighbors; here we will select K = 5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For points A(x1, y1) and B(x2, y2) it can be calculated as:
d(A, B) = sqrt((x2 - x1)² + (y2 - y1)²)
• As we can see, the three nearest neighbors are from category A, hence this new data point must belong to category A.
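A minimal scikit-learn sketch of this example, with made-up 2-D points for the two categories and K = 5:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical 2-D points for category A (label 0) and category B (label 1)
    X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])

    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    knn.fit(X, y)
    x_new = np.array([[3, 4]])       # the new data point to classify
    print(knn.predict(x_new))        # majority class among the 5 nearest neighbors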
4.5 Gaussian Mixture Models
• A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions with unknown parameters.
• For example, in modeling human height data, height is typically modeled as a normal
distribution for each gender with a mean of approximately 5’10” for males and 5’5”
for females. Given only the height data and not the gender assignments for each data
point, the distribution of all heights would follow the sum of two scaled (different
variance) and shifted (different mean) normal distributions. A model making this
assumption is an example of a Gaussian mixture model.
• Gaussian mixture models do not rigidly classify each and every instance into one
class or the other. The algorithm attempts to produce K-Gaussian distributions that
would take into account the entire training space. Every point can be associated with
one or more distributions. Consequently, the deterministic factor would be the
probability that each point belongs to a certain Gaussian distribution.
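A minimal scikit-learn sketch of fitting a two-component Gaussian mixture to synthetic height data (the numbers are illustrative, roughly 5'10" and 5'5" in inches), showing the soft assignments:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Synthetic height data: two shifted, scaled normal distributions (in inches)
    heights = np.concatenate([rng.normal(70, 3, 500),
                              rng.normal(65, 2.5, 500)]).reshape(-1, 1)

    gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
    print(gmm.means_.ravel(), gmm.covariances_.ravel())   # estimated component parameters
    print(gmm.predict_proba(heights[:3]))                 # soft (probabilistic) assignments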
• GMMs have a variety of real-world applications. Some of them are listed below.
a) Used for signal processing
b) Used for customer churn analysis
c) Used for language identification
d) Used in video game industry
e) Genre classification of songs
4.5.1 Expectation – maximization
• For Gaussian mixture models, the expectation-maximization method is a powerful tool for estimating the parameters of the model. The expectation step is termed E and the maximization step is termed M.
• The E step computes, for each data point, the expected responsibility of each Gaussian component; the M step then re-estimates the parameters of each component (and the mixing weights) so as to maximize the expected likelihood computed in the E step.
• The Expectation – Maximization (EM) algorithm is used in maximum likelihood
estimation where the problem involves two sets of random variables of which one, X,
is observable and the other, Z, is hidden.
• The goal of the algorithm is to find the parameter vector φ that maximizes the likelihood of the observed values of X, L(φ | X).
• In cases where this is not feasible, we associate the extra hidden variables Z and express the underlying model using both, so as to maximize the likelihood of the joint distribution of X and Z, the complete likelihood Lc(φ | X, Z).
• Expectation -maximization (EM) is an iterative method used to find maximum
likelihood estimates of parameters in probabilistic models, where the model depends
on unobserved, also called latent, variables.
• EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step.
• The parameters found in the M step are then used to start another E step, and the process is repeated until some criterion is satisfied. EM is frequently used for data clustering, for example in Gaussian mixtures.
• In the Expectation step, find the expected values of the latent variables (here you need to use the current parameter values).
• In the Maximization step, first plug in the expected values of the latent variables into the log-likelihood of the augmented data. Then maximize this log-likelihood to re-evaluate the parameters.
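A minimal from-scratch sketch of these two steps for a one-dimensional two-component Gaussian mixture; the data and initial parameter values are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic 1-D data drawn from two Gaussians
    x = np.concatenate([rng.normal(178, 7, 300), rng.normal(165, 6, 300)])

    # Initial guesses for the mixture parameters (mixing weights, means, std devs)
    pi = np.array([0.5, 0.5])
    mu = np.array([150.0, 190.0])
    sigma = np.array([10.0, 10.0])

    def gauss(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    for _ in range(100):
        # E-step: expected responsibility of each component for each point
        r = np.stack([p * gauss(x, m, s) for p, m, s in zip(pi, mu, sigma)])
        r /= r.sum(axis=0)
        # M-step: re-estimate the parameters from the responsibility-weighted data
        nk = r.sum(axis=1)
        pi = nk / len(x)
        mu = (r @ x) / nk
        sigma = np.sqrt((r * (x - mu[:, None]) ** 2).sum(axis=1) / nk)

    print(pi, mu, sigma)   # estimated mixing weights, means and standard deviations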
• Expectation-Maximization (EM) is a technique used in point estimation. Given a set of observable variables X and unknown (latent) variables Z, we want to estimate the parameters φ of a model.
• The expectation-maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed.
• The EM algorithm is an elegant and powerful method for finding the maximum likelihood of models with hidden variables. The key concept in the EM algorithm is that it iterates between the expectation step (E-step) and the maximization step (M-step) until convergence.
• In the E-step, the algorithm estimates the posterior distribution Q of the hidden variables given the observed data and the current parameter settings; in the M-step, the algorithm calculates the ML parameter settings with Q fixed.
• At the end of each iteration, the lower bound on the likelihood is optimized with respect to the parameters (M-step) and the bound is made tight to the likelihood (E-step), which guarantees an increase in the likelihood and convergence to a local maximum, or a global maximum if the likelihood function is unimodal.
• Generally, EM works best when the fraction of missing information is small and the dimensionality of the data is not too large. EM can require many iterations, and higher dimensionality can dramatically slow down the E-step.
• EM is useful for several reasons: conceptual simplicity, ease of implementation, and the fact that each iteration improves the likelihood ℓ(φ). The rate of convergence in the first few steps is typically quite good, but can become excruciatingly slow as you approach local optima.
• Sometimes the M-step is a constrained maximization, which means that there are constraints on valid solutions not encoded in the function itself.
• Expectation maximization is an effective technique that is often used in data analysis
to manage missing data. Indeed, expectation maximization overcomes some of the
limitations of other techniques, such as mean substitution or regression substitution.
These alternative techniques generate biased estimates – and specifically,
underestimate the standard errors. Expectation maximization overcomes this problem.
2. The decision rule used to derive a classification from the K-nearest neighbors.
3. The number of neighbors used to classify the new example.
Q.10 What is K-means clustering?
Ans.: K-means clustering is a heuristic method. Here each cluster is represented by the center of the cluster. The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Q.11 List the properties of K-Means algorithm.
Ans : 1. There are always k clusters.
2. There is always at least one item in each cluster.
3. The clusters are non – hierarchical and they do not overlap.
Q.12 What is stacking?
Ans.: Stacking, sometimes called stacked generalization, is an ensemble machine learning method that combines multiple heterogeneous base or component models via a meta-model.
Q.13 How do GMMs differ from K-means clustering?
Ans.: GMMs and K-means are both clustering algorithms used for unsupervised learning tasks. However, the basic difference between them is that K-means is a distance-based clustering method while GMM is a distribution-based clustering method.