Machine Learning Notes
Unit I
BASIC MATHEMATICS FOR MACHINE LEARNING: Regression Correlation and Regression, types
of correlation – Pearson’s, Spearman’s correlations –Ordinary Least Squares, Fitting a
regression line, logistic regression, Rank Correlation Partial and Multiple correlation- Multiple
regression, multicollinearity. Gradient descent methods, Newton method, interior point
methods, active set, proximity methods, accelerated gradient methods, coordinate descent,
cutting planes, stochastic gradient descent. Discriminant analysis, Principal component
analysis, Factor analysis, k means.
Regression - Correlation and Regression
I. Introduction to Regression
A. Definition
Regression is a statistical method used in machine learning to model the relationship between
a dependent variable and one or more independent variables.
B. Purpose
Predict the value of the dependent variable based on the values of independent variables.
Understand the strength and nature of the relationships.
II. Correlation
A. Definition
Correlation measures the strength and direction of a linear relationship between two
variables.
B. Types of Correlation
1. Positive Correlation
Both variables move in the same direction.
2. Negative Correlation
Variables move in opposite directions.
3. No Correlation
No discernible relationship between variables.
C. Pearson Correlation Coefficient (r)
Quantifies the strength and direction of the linear relationship.
Range: -1 to 1.
1: Perfect positive correlation.
0: No correlation.
-1: Perfect negative correlation.
III. Regression Analysis
A. Assumptions in Regression
1. Linearity
2. Independence
3. Homoscedasticity
4. Normality of Residuals
B. Handling Outliers
Identify and address outliers that may skew the results.
C. Model Interpretation
Interpret coefficients and their significance.
IV. Conclusion
Regression is a powerful tool for predictive modeling in machine learning.
Correlation helps understand the relationships between variables.
Types of Correlation
Introduction:
Correlation is a statistical technique used to measure the strength and direction of a
relationship between two variables. There are several types of correlation, each serving
different purposes in understanding the nature of the relationship between variables.
1. Positive Correlation:
Definition: A positive correlation exists when an increase in one variable is associated with an
increase in another variable.
Example: As the number of hours spent studying increases, the exam scores also increase.
2. Negative Correlation:
Definition: A negative correlation occurs when an increase in one variable is associated with a
decrease in another variable.
Example: The more time spent commuting to work, the less time available for leisure
activities.
3. Zero or No Correlation:
Definition: Zero correlation implies no systematic relationship between two variables.
Example: There is no correlation between a person's shoe size and their exam scores.
4. Perfect Correlation:
Definition: Perfect correlation occurs when all data points fall exactly on a straight line,
indicating a perfect relationship.
Example: A temperature measured in degrees Celsius and the same temperature measured in
degrees Fahrenheit are perfectly correlated, because one is an exact linear function of the other.
5. Spurious Correlation:
Definition: Spurious correlation is a misleading association between two variables that is
caused by a third variable.
Example: A study might find a correlation between the number of swimming pool drownings
and the sale of ice cream, but the third (confounding) variable here is the summer season,
which drives both.
6. Rank Correlation:
Definition: Rank correlation assesses the strength of a relationship between the ranks or
orderings of two variables, rather than their actual values.
Example: Spearman's rank correlation coefficient is often used when dealing with ordinal data.
7. Cross-Correlation:
Definition: Cross-correlation measures the similarity between two signals as a function of the
time lag applied to one of them.
Example: Used in signal processing to find similarities between two signals that may be shifted
in time.
Conclusion:
Understanding the type of correlation between variables is crucial in drawing meaningful
conclusions from data. It helps researchers and analysts make predictions and identify trends
in various fields. Correlation on its own does not establish a causal relationship.
Pearson's Correlation Coefficient:
1. Definition:
Pearson's correlation coefficient, denoted by r, measures the strength and direction of a linear
relationship between two continuous variables.
2. Formula:
$$ r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2}} $$
Where $X_i$ and $Y_i$ are individual data points, and $\bar{X}$ and $\bar{Y}$ are the means of the
respective variables.
3. Interpretation:
r ranges from -1 to 1.
r=1 implies a perfect positive linear relationship.
r=−1 implies a perfect negative linear relationship.
r=0 implies no linear relationship.
4. Assumptions:
Linearity: Assumes a linear relationship.
Homoscedasticity: Assumes equal variance of errors.
Independence: Assumes independence of observations.
5. Use Cases:
Commonly used in statistics, psychology, economics, and many other fields to assess
relationships between variables.
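As a minimal sketch of the formula above, the following pure-Python function computes r for two equal-length lists. The function name pearson_r and the sample data (study hours and exam scores) are illustrative assumptions, not part of the notes.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]         # hypothetical study hours
scores = [52, 55, 61, 70, 72]   # hypothetical exam scores
r = pearson_r(hours, scores)    # close to +1: strong positive linear relationship
```

The function is undefined when either variable is constant, because the denominator is then zero.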
Spearman's Rank Correlation:
1. Definition:
Spearman's rank correlation, denoted by rho, measures the strength and direction of a
monotonic relationship between two variables, whether linear or not.
2. Procedure:
Convert the raw data to ranks.
Calculate the differences between the ranks for each pair of data points.
Square the differences and sum them.
Use the formula to calculate rho.
3. Formula:
$$ \rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)} $$
Where $d_i$ is the difference between the ranks of the paired data points, and n is the number of
pairs.
4. Interpretation:
rho ranges from -1 to 1.
rho=1 implies a perfect monotonic positive relationship.
rho=−1 implies a perfect monotonic negative relationship.
rho=0 implies no monotonic relationship.
5. Assumptions:
Assumes that the variables are at least ordinal.
6. Use Cases:
Useful when dealing with non-linear relationships or when the data is not normally
distributed.
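The procedure above can be sketched in a few lines of Python. This is a minimal illustration that assumes there are no tied values (the simple formula is exact only in that case); the helper names ranks and spearman_rho and the sample data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]    # y = x**3: monotonic but not linear
rho = spearman_rho(x, y)   # the ranks agree exactly, so rho is 1
```

The example shows why rho is useful for non-linear data: the relationship is not a straight line, yet the rank orderings are identical.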
3. Logistic Regression:
Introduction:
Type of Regression: Used for binary classification problems.
Output: Predicts the probability of an event occurring.
Key Concepts:
Log-odds (logit): The linear combination of features models the log-odds of the event; the logistic (sigmoid) function maps the log-odds to a probability between 0 and 1.
Application:
Examples: Medical diagnosis, spam detection, credit scoring.
Advantages:
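Simple to fit and interpret: each coefficient is the change in log-odds for a one-unit change in a feature. The decision boundary is linear in the features, so non-linear relationships require transformed or interaction features.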
Provides probabilities.
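The link between log-odds and probability can be illustrated with a short sketch. The weights, bias and feature values below are made-up numbers for illustration only; no fitting is performed here.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: maps a probability to log-odds."""
    return math.log(p / (1.0 - p))

def predict_proba(features, weights, bias):
    # The linear combination of features is the log-odds of the event.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients: log-odds = 0.5 * 2.0 - 1.0 * 1.0 = 0, so p = 0.5.
p = predict_proba([2.0, 1.0], weights=[0.5, -1.0], bias=0.0)
```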
4. Rank Correlation:
Definition:
Measure of Association: Determines the strength and direction of the relationship between the
rankings of two variables.
Spearman's Rank Correlation:
Non-parametric: Does not assume a specific distribution.
Ranking: Variables are ranked, and the correlation is calculated based on ranks.
5. Partial and Multiple Correlation - Multiple Regression:
Partial Correlation:
Definition: Examines the relationship between two variables while controlling for the
influence of other variables.
Multiple Correlation - Multiple Regression:
Extension of Simple Regression: Includes multiple independent variables.
Equation: Y=b0+b1X1+b2X2+...+bkXk+ε
Interpretation: Each coefficient represents the change in the dependent variable for a one-
unit change in the corresponding independent variable, holding other variables constant.
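A minimal sketch of fitting the multiple regression equation by ordinary least squares, assuming NumPy is available. The data are hypothetical and noise-free, generated from Y = 1 + 2*X1 + 3*X2, so the fitted coefficients should recover those values.

```python
import numpy as np

# Hypothetical noise-free data generated from Y = 1 + 2*X1 + 3*X2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient is the intercept b0.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of design @ coef = y.
coef, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coef
```

Here b1 is the change in Y for a one-unit change in X1 with X2 held constant, which matches the interpretation given above.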
6. Multicollinearity:
Definition:
Issue in Multiple Regression: High correlation among independent variables.
Consequences: Inflated standard errors, unreliable coefficients.
Detection and Remedies:
VIF (Variance Inflation Factor): Measures the degree of multicollinearity.
Remedies: Remove correlated variables, combine variables, or collect more data.
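For predictor j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below covers only the two-predictor case, where R^2 is the squared Pearson correlation between the predictors. Function names and data are hypothetical.

```python
def r_squared_simple(xs, ys):
    """R^2 from regressing ys on xs (equals the squared Pearson correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    return 1.0 / (1.0 - r_squared_simple(x1, x2))

x1 = [1, 2, 3, 4, 5]
x2_uncorrelated = [2, 1, 3, 1, 2]            # hypothetical, uncorrelated with x1
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical, roughly 2 * x1

vif_low = vif_two_predictors(x1, x2_uncorrelated)   # about 1: no inflation
vif_high = vif_two_predictors(x1, x2_collinear)     # very large: severe multicollinearity
```

A common rule of thumb treats VIF values above about 5 or 10 as a sign of problematic multicollinearity.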
Gradient Descent Methods:
Overview:
Gradient descent is an iterative optimization algorithm used for finding the minimum of a
function. It is widely employed in machine learning and numerical optimization.
Steps:
1. Initialize Parameters: Start with initial values for the parameters.
2. Compute Gradient: Calculate the gradient of the objective function with respect to the
parameters.
3. Update Parameters: Adjust the parameters in the opposite direction of the gradient to
minimize the function.
4. Convergence Check: Repeat steps 2 and 3 until the algorithm converges or a stopping criterion
is met.
Variants:
Stochastic Gradient Descent (SGD): Uses a single randomly chosen example (or a small random subset of the data) to estimate the gradient at each iteration.
Mini-batch Gradient Descent: Combines features of both full-batch and stochastic gradient
descent by using a small random subset of data.
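The four steps above can be sketched for a one-dimensional function. This is an illustrative example, not a general-purpose optimizer: it minimizes the assumed objective f(w) = (w - 3)^2, whose gradient is 2(w - 3), with a fixed learning rate.

```python
def gradient_descent(grad, w0, learning_rate=0.1, tol=1e-10, max_iters=10_000):
    """Minimize a 1-D function given its gradient, using a fixed step size."""
    w = w0                               # step 1: initialize
    for _ in range(max_iters):
        g = grad(w)                      # step 2: compute the gradient
        if abs(g) < tol:                 # step 4: stop when the gradient vanishes
            break
        w = w - learning_rate * g        # step 3: move against the gradient
    return w

# f(w) = (w - 3)**2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per iteration; too large a learning rate would make the iterates diverge.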
Newton's Method:
Overview:
Newton's method is an iterative optimization algorithm that uses second-order information
(Hessian matrix) in addition to the gradient for finding the minimum of a function.
Steps:
1. Initialize Parameters: Start with initial values for the parameters.
2. Compute Gradient and Hessian: Calculate the gradient and Hessian matrix of the objective
function with respect to the parameters.
3. Update Parameters: Adjust the parameters based on the Newton-Raphson update formula.
4. Convergence Check: Repeat steps 2 and 3 until the algorithm converges or a stopping criterion
is met.
Advantages:
Faster convergence compared to first-order methods like gradient descent.
Well-suited for smooth, convex functions.
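A minimal one-dimensional sketch of the Newton update, where the Hessian reduces to the second derivative. The objective f(x) = exp(x) - 2x is an assumed example chosen because it is smooth and convex, with its minimum at x = ln 2.

```python
import math

def newton_minimize(grad, hess, x0, tol=1e-12, max_iters=50):
    """Minimize a smooth 1-D function using gradient and second derivative."""
    x = x0
    for _ in range(max_iters):
        step = grad(x) / hess(x)   # Newton-Raphson step: f'(x) / f''(x)
        x -= step
        if abs(step) < tol:        # stop when the update becomes negligible
            break
    return x

# f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x) > 0.
x_star = newton_minimize(lambda x: math.exp(x) - 2.0,
                         lambda x: math.exp(x),
                         x0=0.0)
```

In several dimensions the division becomes the solution of a linear system with the Hessian matrix, which is the main cost of each Newton iteration.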
Interior Point Methods:
Overview:
Interior point methods are optimization algorithms that find solutions to linear and nonlinear
programming problems by moving through the interior of feasible regions.
Key Concepts:
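Barrier Function: Adds a term to the objective that grows without bound as the iterate approaches the boundary of the feasible region, keeping iterates strictly inside it.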
Central Path: A trajectory in the feasible region leading to the optimal solution.
Steps:
1. Initialization: Start with an interior point within the feasible region.
2. Path-Following: Move along the central path towards the optimal solution.
3. Barrier Update: Adjust the barrier parameter to approach the solution.
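A small worked example may help make the central path concrete. The problem below (minimize x subject to x ≥ 1) and the barrier parameter t are assumptions chosen for illustration.

```latex
\begin{aligned}
&\text{Problem:} && \min_x \; x \quad \text{subject to } x \ge 1 \\
&\text{Barrier problem:} && \min_x \; \phi_t(x) = t\,x - \log(x - 1), \qquad x > 1 \\
&\text{Stationarity:} && \phi_t'(x) = t - \frac{1}{x - 1} = 0
   \;\Longrightarrow\; x^*(t) = 1 + \frac{1}{t} \\
&\text{Central path:} && x^*(t) \to 1 \quad \text{as } t \to \infty
\end{aligned}
```

Every point x*(t) is strictly feasible, and increasing t (the barrier update) moves the iterate along the central path toward the constrained optimum x = 1.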
Active Set Methods:
Overview:
Active set methods are used for solving constrained optimization problems by iteratively
identifying and updating the active set of constraints.
Steps:
1. Initialization: Start with an initial feasible solution.
2. Identify Active Set: Determine which constraints are "active" or binding at the current
solution.
3. Update Parameters: Optimize the objective function within the active set.
4. Update Active Set: Adjust the active set based on changes in the optimization variables.
5. Convergence Check: Repeat steps 2-4 until convergence.
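Proximity (Proximal) Methods:
Overview:
Proximity methods, usually called proximal methods, are optimization algorithms that use
proximal operators to handle nonsmooth functions and constraints.
Proximal Operator:
The proximal operator of a function f maps a point v to the minimizer of f plus a squared-distance
penalty: $\mathrm{prox}_f(v) = \arg\min_x \left( f(x) + \tfrac{1}{2}\|x - v\|^2 \right)$. It returns a point
that makes f small while staying close to v.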
Steps:
1. Decomposition: Decompose the objective function into smooth and nonsmooth components.
2. Proximal Update: Apply proximal operators to nonsmooth components.
3. Update Parameters: Adjust the parameters based on the proximal updates.
4. Convergence Check: Repeat steps 2 and 3 until the algorithm converges.
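The steps above can be sketched for a one-dimensional objective with an L1 term. The objective 0.5 * (x - 3)^2 + lam * |x|, the step size and the function names are assumptions for illustration; the proximal operator of a scaled absolute value is the well-known soft-thresholding function.

```python
def soft_threshold(v, threshold):
    """Proximal operator of threshold * |x|: shrink v toward zero."""
    if v > threshold:
        return v - threshold
    if v < -threshold:
        return v + threshold
    return 0.0

def proximal_gradient(grad_smooth, lam, x0, step=0.5, iters=100):
    """Minimize smooth(x) + lam * |x| by alternating gradient and proximal steps."""
    x = x0
    for _ in range(iters):
        # Gradient step on the smooth part, then proximal step on the nonsmooth part.
        x = soft_threshold(x - step * grad_smooth(x), step * lam)
    return x

# Smooth part 0.5 * (x - 3)**2 has gradient (x - 3); with lam = 1 the minimizer is x = 2.
x_hat = proximal_gradient(lambda x: x - 3.0, lam=1.0, x0=0.0)
```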
Accelerated Gradient Methods:
Introduction:
Motivation: Traditional gradient descent can be slow, especially for ill-conditioned convex
optimization problems.
Idea: Incorporate momentum to speed up convergence.
Nesterov's Accelerated Gradient Descent:
Update Rule: Takes a momentum step using a fraction of the previous update, then applies the gradient evaluated at that look-ahead point.
Advantage: Faster convergence, especially for ill-conditioned problems.
Convergence Rate: Achieves optimal rate for smooth, convex functions.
FISTA (Fast Iterative Shrinkage-Thresholding Algorithm):
Objective Function: Often used for solving convex optimization problems with sparsity-
inducing regularization terms.
Key Steps: Proximal operator application and acceleration step.
Advantage: Efficient for problems with L1 regularization.
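To illustrate the effect of momentum, the sketch below compares plain gradient descent with a constant-momentum form of Nesterov's method on an assumed ill-conditioned quadratic, f(x, y) = 0.5 * (x^2 + 100 y^2). The step size 1/L and the momentum value are standard choices for a strongly convex quadratic with condition number 100; all names are hypothetical.

```python
def f(p):
    x, y = p
    return 0.5 * (x * x + 100.0 * y * y)

def grad_f(p):
    x, y = p
    return (x, 100.0 * y)

def plain_gd(p, step, iters):
    for _ in range(iters):
        g = grad_f(p)
        p = (p[0] - step * g[0], p[1] - step * g[1])
    return p

def nesterov(p, step, momentum, iters):
    prev = p
    for _ in range(iters):
        # Look-ahead point: current iterate plus a fraction of the last move.
        look = (p[0] + momentum * (p[0] - prev[0]),
                p[1] + momentum * (p[1] - prev[1]))
        g = grad_f(look)   # gradient is evaluated at the look-ahead point
        prev, p = p, (look[0] - step * g[0], look[1] - step * g[1])
    return p

start = (1.0, 1.0)
step = 1.0 / 100.0                        # 1 / L, with L = 100 the largest curvature
momentum = (10.0 - 1.0) / (10.0 + 1.0)    # (sqrt(kappa) - 1) / (sqrt(kappa) + 1), kappa = 100
f_gd = f(plain_gd(start, step, 100))
f_nag = f(nesterov(start, step, momentum, 100))
```

After the same 100 iterations the accelerated method reaches a much smaller objective value, because plain gradient descent shrinks the slow coordinate by only a factor of 0.99 per step.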
Coordinate Descent:
Introduction:
Motivation: Gradient descent may be computationally expensive for high-dimensional
problems.
Idea: Update only one coordinate at a time.
Cyclic Coordinate Descent:
Update Rule: Iterate through coordinates and update them sequentially.
Advantages: Simplicity, easy implementation.
Drawbacks: May be slow for certain problems, especially if coordinates are highly correlated.
Randomized Coordinate Descent:
Update Rule: Randomly select coordinates to update.
Advantages: Faster convergence in some cases, especially when coordinates are not highly
correlated.
Applications: Commonly used in machine learning for sparse and large-scale problems.
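A minimal sketch of cyclic coordinate descent on an assumed two-variable quadratic, f(x, y) = x^2 + y^2 + xy - 3x - 3y, whose minimum is at (1, 1). Each update minimizes f exactly over one coordinate while the other is held fixed.

```python
def coordinate_descent(sweeps=50):
    """Cyclic coordinate descent on f(x, y) = x^2 + y^2 + x*y - 3x - 3y."""
    x, y = 0.0, 0.0
    for _ in range(sweeps):
        # Minimize over x with y fixed: df/dx = 2x + y - 3 = 0.
        x = (3.0 - y) / 2.0
        # Minimize over y with x fixed: df/dy = 2y + x - 3 = 0.
        y = (3.0 - x) / 2.0
    return x, y

x_cd, y_cd = coordinate_descent()
```

The xy cross term couples the coordinates; the stronger that coupling, the more sweeps are needed, which is the drawback noted above for highly correlated coordinates.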
Cutting Planes:
Introduction:
Motivation: Solving linear programming problems with a large number of constraints.
Idea: Solve the problem by iteratively adding constraints (cuts) that remove regions which cannot contain the optimal solution.
Ellipsoid Method:
Principle: Iteratively refine an ellipsoid containing the feasible region.
Advantages: Polynomial time complexity, guarantees convergence.
Limitations: Practical implementations can be computationally intensive.
Interior Point Methods:
Idea: Move towards the interior of the feasible region while maintaining feasibility.
Advantages: Polynomial time complexity, often more efficient than ellipsoid methods.
Applications: Widely used for linear and quadratic programming problems.
Stochastic Gradient Descent (SGD):
Introduction:
Motivation: Traditional gradient descent can be slow on large datasets.
Idea: Use a random subset (mini-batch) of data to estimate the gradient.
SGD Update Rule:
Stochasticity: Each iteration uses a random subset of data.
Learning Rate Schedule: Can be fixed, decay over time, or adapt dynamically.
Advantages: Cheap iterations and fast initial progress; suitable for large-scale problems.
Mini-Batch SGD:
Batch Size: Balance between computational efficiency and variance in gradient estimates.
Parallelization: Well-suited for parallel processing on GPUs and distributed systems.
Applications: Widely used in training neural networks and other large-scale machine learning
tasks.
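The SGD update rule can be sketched for fitting a line y = w*x + b with squared error, updating on one example at a time. The data are hypothetical and noise-free (generated from y = 2x + 1), and the learning rate, epoch count and seed are arbitrary illustrative choices.

```python
import random

def sgd_linear_fit(data, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                 # visit the examples in random order
        for x, y in samples:
            error = (w * x + b) - y          # residual on a single example
            w -= learning_rate * error * x   # gradient of 0.5 * error**2 w.r.t. w
            b -= learning_rate * error       # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Hypothetical noise-free data from y = 2x + 1 on 20 evenly spaced points in [0, 1].
data = [(i / 19.0, 2.0 * (i / 19.0) + 1.0) for i in range(20)]
w_sgd, b_sgd = sgd_linear_fit(data)
```

With noisy data a fixed learning rate would leave the parameters fluctuating around the optimum, which is why decaying schedules are often used.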
Discriminant Analysis:
Overview:
Discriminant Analysis is a statistical technique used for classifying a set of observations into
predefined classes. The primary goal is to find the combination of predictor variables that best
separates the classes.
Key Points:
1. Objective:
Discriminant analysis aims to maximize the differences between classes while minimizing the
differences within each class.
2. Assumptions:
Assumes that the data follows a multivariate normal distribution.
Assumes homogeneity of covariance matrices across classes.
3. Linear Discriminant Function:
Computes a linear combination of predictor variables to classify observations into different
groups.
4. Wilks' Lambda:
A common statistic used to assess the significance of discriminant functions.
5. Applications:
Widely used in pattern recognition, medical diagnosis, and finance.
Principal Component Analysis (PCA):
Overview:
Principal Component Analysis is a dimensionality reduction technique that transforms
correlated variables into a set of linearly uncorrelated variables called principal components.
Key Points:
1. Objective:
Reduce the dimensionality of data while preserving as much variance as possible.
2. Steps in PCA:
Standardize the data.
Calculate the covariance matrix.
Compute the eigenvectors and eigenvalues.
Sort eigenvectors by eigenvalues to choose principal components.
Project the data onto the selected principal components.
3. Eigenvalues and Eigenvectors:
Eigenvalues represent the variance of the data along the corresponding eigenvector.
4. Applications:
Data compression, noise reduction, and visualization.
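The PCA steps listed above can be sketched with NumPy (assumed to be available). The synthetic three-column dataset, in which the second column is nearly a multiple of the first, is an assumption made for illustration.

```python
import numpy as np

def pca(data, n_components):
    # 1. Standardize each column to zero mean and unit variance.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # 2. Covariance matrix of the standardized data.
    cov = np.cov(z, rowvar=False)
    # 3. Eigen-decomposition (eigh is intended for symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort from the largest to the smallest eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Project the data onto the leading principal components.
    components = eigenvectors[:, :n_components]
    return z @ components, eigenvalues

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + 0.1 * rng.normal(size=200), rng.normal(size=200)])
projected, eigenvalues = pca(data, n_components=2)
```

Because the first two columns are almost perfectly correlated, the first principal component captures close to two of the three units of standardized variance.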
Factor Analysis:
Overview:
Factor Analysis is a statistical method used to identify underlying factors (latent variables) that
explain patterns of correlations within observed variables.
Key Points:
1. Factor Loading:
Represents the strength and direction of the relationship between an observed variable and a
factor.
2. Common Factors:
Factors that are shared among observed variables and contribute to their correlations.
3. Eigenvalue Criterion:
Factors are retained based on eigenvalues greater than 1 or a scree plot.
4. Rotation:
Varimax or oblique rotation can be applied to simplify factor structures and enhance
interpretability.
5. Applications:
Social sciences, marketing research, and psychology for identifying latent constructs.
k-Means:
Overview:
k-Means is a clustering algorithm that partitions a dataset into k clusters, where each
observation belongs to the cluster with the nearest mean.
Key Points:
1. Objective:
Minimize the sum of squared distances between observations and the centroid of their
assigned cluster.
2. Initialization:
Randomly select k initial cluster centroids.
3. Iterations:
Assign each observation to the cluster with the nearest centroid.
Recalculate the centroids based on the assigned observations.
Repeat until convergence.
4. Choosing k:
Elbow method or silhouette analysis can be used to determine the optimal number of clusters.
5. Applications:
Customer segmentation, image compression, and anomaly detection.
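The assignment and update iterations can be sketched in pure Python. For reproducibility this sketch takes the initial centroids as an argument instead of choosing them at random; the points and starting centroids are hypothetical.

```python
def kmeans(points, centroids, max_iters=100):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no observation changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
```

The result depends on the starting centroids, so in practice the algorithm is usually run from several random initializations and the solution with the lowest sum of squared distances is kept.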
Unit – II
INTRODUCTION TO MACHINE LEARNING: Introduction, Examples of various Learning
Paradigms, Perspectives and Issues, Version Spaces, Finite and Infinite Hypothesis Spaces, PAC
Learning, VC Dimension.
Introduction to Machine Learning
What Is Machine Learning?
Arthur Samuel, an early American leader in the field of computer gaming and
artificial intelligence, coined the term “Machine Learning” in 1959 while at IBM. He
defined machine learning as “the field of study that gives computers the ability to
learn without being explicitly programmed.” However, there is no universally accepted
definition for machine learning. Different authors define the term differently.
Definition of learning
Definition
A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks T, as measured
by P, improves with experience E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience E: A sequence of images and steering
commands recorded while observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself
Definition
A computer program which learns from experience is called a machine learning program
or simply a learning program. Such a program is sometimes also referred to as a learner.
Components of Learning
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates the
various components and the steps involved in the learning process.
1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for
advanced reasoning.
• In a human being, the data is stored in the brain and data is retrieved using
electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar
devices to store data and use cables and other technology to retrieve data.
2. Abstraction
The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves creating
general concepts about the data as a whole. The creation of knowledge involves application of
known models and creation of new models.
The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original
information.
3. Generalization
The term generalization describes the process of turning the knowledge about stored data into
a form that can be utilized for future action. These actions are to be carried out on tasks that
are similar, but not identical, to those that have been seen before. In generalization, the goal
is to discover those properties of the data that will be most relevant to future tasks.
4. Evaluation
Evaluation is the process of giving feedback to the user to measure the utility of the learned
knowledge. This feedback is then utilised to effect improvements in the whole learning
process.
For example, consider the space of hypotheses that could in principle be output by the above
checkers learner. This hypothesis space consists of all evaluation functions that can be
represented by some choice of values for the weights w0 through w6. The learner's task is
thus to search through this vast space to locate the hypothesis that is most consistent with the
available training examples. The LMS algorithm for fitting weights achieves this goal by
iteratively tuning the weights, adding a correction to each weight each time the hypothesized
evaluation function predicts a value that differs from the training value. This algorithm works
well when the hypothesis representation considered by the learner defines a continuously
parameterized space of potential hypotheses.
Many of the chapters in this book present algorithms that search a hypothesis space
defined by some underlying representation (e.g., linear functions, logical descriptions, decision
trees, artificial neural networks). These different hypothesis representations are appropriate
for learning different kinds of target functions. For each of these hypothesis representations,
the corresponding learning algorithm takes advantage of a different underlying structure to
organize the search through the hypothesis space.
What algorithms exist for learning general target functions from specific training examples? In
what settings will particular algorithms converge to the desired function, given sufficient
training data? Which algorithms perform best for which types of problems and
representations?
How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the character of
the learner's hypothesis space?
When and how can prior knowledge held by the learner guide the process of generalizing from
examples? Can prior knowledge be helpful even when it is only approximately correct?
What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?
What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn? Can
this process itself be automated?
How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?
A concept is complete if it covers all of the positive examples, and consistent if it covers none
of the negative examples. The version space is the set of all complete and consistent concepts.
This set is convex and is fully defined by its least and most general elements.
Representation
The Candidate-Elimination algorithm finds all describable hypotheses that are consistent
with the observed training examples. In order to define this algorithm precisely, we begin with
a few basic definitions. First, let us say that a hypothesis is consistent with the training
examples if it correctly classifies these examples.
Definition: A hypothesis h is consistent with a set of training examples D if and only if h(x)
= c(x) for each example (x, c(x)) in D.
Definition (version space): The version space, denoted $VS_{H,D}$, with respect to hypothesis space
H and training examples D, is the subset of hypotheses from H consistent with the training
examples in D.
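Definition: The general boundary G, with respect to hypothesis space H and training data D, is
the set of maximally general members of H consistent with D.
Definition: The specific boundary S, with respect to hypothesis space H and training data D, is
the set of minimally general (i.e., maximally specific) members of H consistent with D.
The version space can be represented by these two boundary sets, where $\geq_g$ denotes the
"more general than or equal to" relation:
$$ VS_{H,D} = \{\, h \in H \mid (\exists s \in S)\,(\exists g \in G)\,(g \geq_g h \geq_g s) \,\} $$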
To Prove:
1. Every h satisfying the right-hand side of the above expression is in $VS_{H,D}$.
2. Every member of $VS_{H,D}$ satisfies the right-hand side of the expression.
Sketch of proof:
1. Let g, h, s be arbitrary members of G, H, S respectively with $g \geq_g h \geq_g s$.
By the definition of S, s must be satisfied by all positive examples in D. Because $h \geq_g s$,
h must also be satisfied by all positive examples in D.
By the definition of G, g cannot be satisfied by any negative example in D, and because
$g \geq_g h$, h cannot be satisfied by any negative example in D. Because h is satisfied by all
positive examples in D and by no negative examples in D, h is consistent with D, and therefore
h is a member of $VS_{H,D}$.
2. It can be proven by assuming some h in $VS_{H,D}$ that does not satisfy the right-hand side of
the expression, then showing that this leads to an inconsistency.
The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses
from H that are consistent with an observed sequence of training examples.
Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
  • Remove from G any hypothesis inconsistent with d
  • For each hypothesis s in S that is not consistent with d
    • Remove s from S
    • Add to S all minimal generalizations h of s such that h is consistent with d, and some member of G is more general than h
    • Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent with d
    • Remove g from G
    • Add to G all minimal specializations h of g such that h is consistent with d, and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G
An Illustrative Example
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes
The CANDIDATE-ELIMINATION algorithm begins by initializing the version space to the set of all
hypotheses in H. This means initializing the G boundary set to contain the most general
hypothesis in H,
G0 = ⟨?, ?, ?, ?, ?, ?⟩
and initializing the S boundary set to contain the most specific (least general) hypothesis,
S0 = ⟨∅, ∅, ∅, ∅, ∅, ∅⟩
When the first training example is presented, the CANDIDATE-ELIMINATION algorithm checks
the S boundary and finds that it is overly specific and it fails to cover the positive example.
The boundary is therefore revised by moving it to the least more general hypothesis that
covers this new example:
S1 = ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩
No update of the G boundary is needed in response to this training example because G0
correctly covers this example.
When the second training example is observed, it has a similar effect of generalizing S further to
S2 = ⟨Sunny, Warm, ?, Strong, Warm, Same⟩, leaving G again unchanged, i.e., G2 = G1 = G0.
Consider the third training example. This negative example reveals that the G boundary of the
version space is overly general, that is, the hypothesis in G incorrectly predicts that this new
example is a positive example.
The hypothesis in the G boundary must therefore be specialized until it correctly classifies this
new negative example. The minimal specializations that remain consistent with the earlier
positive examples are:
G3 = {⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩, ⟨?, ?, ?, ?, ?, Same⟩}
Given that there are six attributes that could be specified to specialize G2, why are there only
three new hypotheses in G3?
For example, the hypothesis h = ⟨?, ?, Normal, ?, ?, ?⟩ is a minimal specialization of G2 that
correctly labels the new example as a negative example, but it is not included in G3. The
reason this hypothesis is excluded is that it is inconsistent with the previously encountered
positive examples.
Consider the fourth training example.
This positive example further generalizes the S boundary of the version space. It also results in
removing one member of the G boundary, because this member fails to cover the new positive
example.
After processing these four examples, the boundary sets S4 and G4 delimit the version space
of all hypotheses consistent with the set of incrementally observed training examples:
S4 = ⟨Sunny, Warm, ?, Strong, ?, ?⟩
G4 = {⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩}
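The trace above can be reproduced with a short sketch of the algorithm for conjunctive hypotheses. This is an illustrative implementation under stated assumptions, not a general one: hypotheses are tuples in which "?" matches any value and None stands for the empty constraint, and the attribute domains (including the values Cloudy and Weak, which do not appear in the table) follow the standard EnjoySport task. Pruning of redundant members of S is omitted because S holds a single hypothesis for this hypothesis language.

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 >=_g h2, attribute by attribute (None = empty constraint)."""
    return all(a == "?" or a == b or b is None for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple([None] * n)]   # most specific hypothesis
    G = [tuple(["?"] * n)]    # most general hypothesis
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            # Minimal generalization of each s so that it covers x.
            S = [tuple(xv if sv is None else (sv if sv == xv else "?")
                       for sv, xv in zip(s, x)) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # Minimal specializations: fix one '?' to a value that excludes x.
                for i in range(n):
                    if g[i] != "?":
                        continue
                    for value in domains[i]:
                        if value != x[i]:
                            h = g[:i] + (value,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.append(h)
            # Keep only the maximally general members.
            G = [g for g in new_G
                 if not any(o != g and more_general_or_equal(o, g) for o in new_G)]
    return S, G

domains = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S4, G4 = candidate_elimination(examples, domains)
```

After the four examples, S4 holds one hypothesis that fixes Sky, AirTemp and Wind, and G4 holds two hypotheses that each fix a single attribute (Sky = Sunny or AirTemp = Warm).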
PAC Learning
Probably approximately correct learning
In this framework, the learner (that is, the algorithm) receives samples and must
select a hypothesis from a certain class of hypotheses. The goal is that, with high probability
(the “probably” part), the selected hypothesis will have low generalization error (the
“approximately correct” part). In this section we first give an informal definition of
PAC-learnability. After introducing a few more notions, we give a more formal, mathematically
oriented, definition of PAC-learnability. At the end, we mention one of the applications of
PAC-learnability.
PAC-learnability
Let X be a set called the instance space which may be finite or infinite. For example, X may be
the set of all points in a plane.
A concept class C for X is a family of functions c : X → {0, 1}. A member of C is called a
concept. A concept can also be thought of as a subset of X. If c is a subset of X, it defines a
unique function µ_c : X → {0, 1} as follows:
µ_c(x) = 1 if x ∈ c, and µ_c(x) = 0 otherwise.
A hypothesis h is also a function h : X → {0, 1}. So, as in the case of concepts, a hypothesis
can also be thought of as a subset of X. H will denote a set of hypotheses.
We assume that F is an arbitrary, but fixed, probability distribution over X.
Training examples are obtained by taking random samples from X. We assume that the
samples are randomly generated from X according to the probability distribution F.
Examples
To illustrate the definition of PAC-learnability, let us consider some concrete examples.
Example
Let the instance space be the set X of all points in the Euclidean plane. Each point is
represented by its coordinates (x; y). So, the dimension or length of the instances is 2.
Let the concept class C be the set of all “axis-aligned rectangles” in the plane; that is, the set
of all rectangles whose sides are parallel to the coordinate axes in the plane (see Figure).
An axis-aligned rectangle can be defined by a set of inequalities of the following
form, having four parameters a, b, c and d:
a ≤ x ≤ b, c ≤ y ≤ d
Given a set of sample points labeled positive or negative, let L be the algorithm which outputs
the hypothesis defined by the axis-aligned rectangle which gives the tightest fit to the positive
examples (that is, that rectangle with the smallest area that includes all of the positive
examples and none of the negative examples) (see Figure below).
Figure : Axis-aligned rectangle which gives the tightest fit to the positive examples
It can be shown that, in the notations introduced above, the concept class C is PAC-learnable
by the algorithm L using the hypothesis space H of all axis-aligned rectangles.
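A minimal sketch of the tightest-fit learner L described above: it returns the smallest axis-aligned rectangle enclosing the positive examples. The function names, the (a, b, c, d) encoding with a ≤ x ≤ b and c ≤ y ≤ d, and the sample points are assumptions for illustration.

```python
def tightest_rectangle(samples):
    """Return (a, b, c, d) with a <= x <= b and c <= y <= d, or None if no positives."""
    positives = [point for point, label in samples if label == 1]
    if not positives:
        return None   # no positive examples: predict 0 everywhere
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    """Label a point 1 if it lies inside the learned rectangle, else 0."""
    if rect is None:
        return 0
    a, b, c, d = rect
    return 1 if a <= point[0] <= b and c <= point[1] <= d else 0

# Hypothetical labeled sample: three positive points and two negative points.
samples = [((1, 1), 1), ((2, 3), 1), ((3, 2), 1), ((0, 0), 0), ((5, 5), 0)]
rect = tightest_rectangle(samples)
```

Because the learned rectangle always lies inside the true one when the labels come from an axis-aligned rectangle, its only errors are points in the strip between the two rectangles, which is what the PAC analysis bounds.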
VC Dimension.
Vapnik-Chervonenkis (VC) dimension
The concepts of Vapnik-Chervonenkis dimension (VC dimension) and probably
approximately correct (PAC) learning are two important concepts in the mathematical theory of
learnability and hence are mathematically oriented. The former is a measure of the capacity
(complexity, expressive power, richness, or flexibility) of a space of functions that can be
learned by a classification algorithm. It was originally defined by Vladimir Vapnik and Alexey
Chervonenkis in 1971. The latter is a framework for the mathematical analysis of learning
algorithms. The goal is to check whether the probability for a selected hypothesis to be
approximately correct is very high. The notion of PAC
learning was proposed by Leslie Valiant in 1984.
V-C dimension
Let H be the hypothesis space for some machine learning problem. The Vapnik-Chervonenkis
dimension of H, also called the VC dimension of H, and denoted by V C(H), is a measure of the
complexity (or, capacity, expressive power, richness, or flexibility) of the space H. To define the
VC dimension we require the notion of the shattering of a set of instances.
Shattering of a set
Let D be a dataset containing N examples for a binary classification problem with class labels
0 and 1. Let H be a hypothesis space for the problem. Each hypothesis h in H partitions D into
two disjoint subsets as follows:
{x ∈ D | h(x) = 0} and {x ∈ D | h(x) = 1}
Such a partition of D is called a “dichotomy” in D. It can be shown that there are 2^N possible
dichotomies in D. To each dichotomy of D there is a unique assignment of the labels “1” and
“0” to the elements of D.
Thus to specify a hypothesis h, we need only specify the set {x ∈ D | h(x) = 1}. Figure 3.1 shows
all possible dichotomies of D if D has three elements. In the figure, we have shown only one of
the two sets in a dichotomy, namely the set {x ∈ D | h(x) = 1}. The circles and ellipses represent
such sets.
Definition
A set of examples D is said to be shattered by a hypothesis space H if and only if for every
dichotomy of D there exists some hypothesis in H consistent with the dichotomy of D.
Example
In the figure, we see that axis-aligned rectangles can shatter four points in two
dimensions. Moreover, no set of five points can be shattered by axis-aligned rectangles.
Hence VC(H), when H is the hypothesis class of axis-aligned rectangles in two
dimensions, is four. In calculating the VC dimension, it is enough that we find four points
that can be shattered; it is not necessary that we be able to shatter any four points in two
dimensions.
Fig: An axis-aligned rectangle can shatter four points. Only rectangles covering two points are
shown.
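The shattering condition can be checked by brute force for small point sets. The sketch below enumerates every labeling and tests whether the tightest rectangle around the points labeled 1 excludes all points labeled 0; if the tightest rectangle fails, every larger rectangle fails too. The function name and the point configurations are assumptions for illustration.

```python
from itertools import product

def shattered_by_rectangles(points):
    """Check whether axis-aligned rectangles realize every labeling of points."""
    for labels in product([0, 1], repeat=len(points)):
        positives = [p for p, label in zip(points, labels) if label == 1]
        if not positives:
            continue  # an empty rectangle labels every point 0
        x_lo, x_hi = min(p[0] for p in positives), max(p[0] for p in positives)
        y_lo, y_hi = min(p[1] for p in positives), max(p[1] for p in positives)
        # The tightest rectangle around the positives must exclude every negative.
        for p, label in zip(points, labels):
            if label == 0 and x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi:
                return False
    return True

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # four points in a diamond layout
square = [(0, 0), (1, 0), (0, 1), (1, 1)]      # corners of a square
five = diamond + [(0, 0)]                      # diamond plus its center
```

The diamond is shattered while the square corners are not (two opposite corners cannot be separated from the other two), which illustrates that only some four-point set needs to be shattered. The five-point set fails because the center always falls inside the rectangle around the other four.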
VC dimension may seem pessimistic. It tells us that using a rectangle as our hypothesis class,
we can learn only datasets containing four points and not more.