Machine Learning 1
Introduction
http://rajakishor.co.cc Page 3
Data:
Raw facts, unformatted information.
Database:
The collection of data related to a particular enterprise.
Database Management System:
The collection of data and the set of programs that operate on the data.
Information:
It is the result of processing, manipulating and organizing data in
response to a specific need.
Information relates to the understanding of the problem domain.
Knowledge:
It relates to the understanding of the solution domain – what to do?
Intelligence:
It is the knowledge in operation towards the solution – how to do? How to
apply the solution?
Artificial Intelligence:
It refers to intelligence exhibited by a computer.
Artificial intelligence is the study of how to make computers do things at which, at the
moment, people are better.
A machine learns whenever it changes its structure, program, or data, based on
its inputs or in response to external information in such a manner that its expected
future performance improves.
Machine learning usually refers to the changes in systems that perform tasks
associated with artificial intelligence. Such tasks involve recognition, diagnosis,
planning, robot control, prediction, etc. The changes might be either enhancements to
already performing systems or ab initio synthesis of new systems.
To be slightly more specific, we show the architecture of a typical AI agent in the
following figure.
This agent perceives and models its environment and computes appropriate
actions, perhaps by anticipating their effects. Changes made to any of the components
shown in the figure might count as learning. Different learning mechanisms might be
employed depending on which subsystem is being changed.
Why should machines have to learn?
The reasons include:
– Some tasks cannot be defined well except by example; that is, we might be able
to specify input-output pairs but not a concise relationship between inputs and
desired outputs. We would like machines to be able to adjust their internal
structure to produce correct outputs for a large number of sample inputs and
thus suitably constrain their input-output function to approximate the
relationship implicit in the examples.
– It is possible that important relationships and correlations are hidden among
large piles of data. Machine learning methods can often be used to
extract these relationships.
– Human designers often produce machines that do not work as well as desired in
the environments in which they are used. In fact, certain characteristics of the
working environment might not be completely known at design time. Machine
learning methods can be used for on-the-job improvement of existing machine
designs.
– The amount of knowledge available about certain tasks might be too large for
explicit encoding by humans. Machines that learn this knowledge gradually
might be able to capture more of it than humans would want to write down.
– Environments change over time. Machines that can adapt to a changing
environment would reduce the need for constant redesign.
– New knowledge about tasks is constantly being discovered by humans.
Vocabulary changes. There is a constant stream of new events in the world.
Continuing redesign of AI systems to conform to new knowledge is impractical,
but machine learning methods might be able to track much of it.
Computational Structures of Machine Learning
We will consider a variety of different computational structures:
Functions
Logic programs and rule sets
Finite-state machines
Grammars
Problem solving systems
Methods exist both for the synthesis of these structures from examples and for
changing existing structures. In the latter case, the change to the existing structure
might be simply to make it more computationally efficient rather than to increase
the coverage of the situations it can handle.
think of h as being implemented by a device that has X as input and h(X) as output. We
assume a priori that the hypothesized function, h, is selected from a class of functions H.
Sometimes we know that f also belongs to this class or to a subset of this class. We select
h based on a training set, T, of m input vector examples.
Types of Learning
There are two major settings in which we wish to learn a function. In
supervised learning, we know (sometimes only approximately) the values of f for the m
samples in the training set, T. We assume that if we can find a hypothesis, h, that closely
agrees with f for the members of T, then this hypothesis will be a good guess for f,
especially if T is large.
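The selection of h from H can be sketched in a few lines. This is only an illustration under stated assumptions: the training set T, the three candidate functions in H, and the helper names training_error and best_hypothesis are all made up here, and "closely agrees" is read as smallest sum of squared differences on T.

```python
# Hypothetical training set T: (input, known value of f) pairs for m = 4 samples.
T = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]

# Hypothetical finite hypothesis class H: a few candidate functions of one input.
H = {
    "linear": lambda x: 3.0 * x,
    "quadratic": lambda x: x * x,
    "constant": lambda x: 3.5,
}


def training_error(h, samples):
    """Sum of squared differences between h(x) and the known value of f at x."""
    return sum((h(x) - fx) ** 2 for x, fx in samples)


def best_hypothesis(hypotheses, samples):
    """Name of the hypothesis that agrees most closely with the samples."""
    return min(hypotheses, key=lambda name: training_error(hypotheses[name], samples))


print(best_hypothesis(H, T))  # quadratic
```

Here the quadratic candidate matches f exactly on T, but as the text notes, an exact match is not required; the same code picks the closest candidate when none fits perfectly.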
Curve fitting is a simple example of supervised learning of a function. Suppose
we are given the values of a two-dimensional function, f, at the four sample points
shown by the solid circles in the figure below.
We want to fit these four points with a function, h, drawn from the set, H, of
second-degree functions. We show there a two-dimensional parabolic surface above the
x1, x2 plane that fits the points. This parabolic function, h, is our hypothesis about the
function, f, that produced the four samples. In this case, h = f at the four samples, but we
need not have required exact matches.
In case of unsupervised learning, we simply have a training set of vectors
without function values for them. The problem in this case is to partition the training set
into subsets T1, …, TR, in some appropriate way. Unsupervised learning methods have
application in taxonomic problems in which it is desired to invent ways to classify data
into meaningful categories.
We shall also describe methods that are intermediate between supervised and
unsupervised learning.
Input Vectors
The input vector is called by a variety of names. Some of these are input vector,
pattern vector, feature vector, sample, example, and instance. The components, xi, of the
input vector are variously called features, attributes, input variables, and components.
The values of the components can be of three main types. They might be
real-valued numbers, discrete-valued numbers, or categorical values. As an example
illustrating categorical values, information about a student might be represented by the
values of the attributes class, major, sex, and adviser. A particular student would then be
represented by a vector such as: (sophomore, history, male, Higgins). Additionally,
categorical values may be ordered (as in {small, medium, large}) or unordered. Of
course, mixtures of all these types of values are possible.
In all cases, it is possible to represent the input in unordered form by listing the
names of the attributes together with their values. The vector form assumes that the
attributes are ordered and given implicitly by a form. As an example of an attribute-
value representation, we might have: (major: History, sex: male, class: sophomore,
adviser: Higgins, age: 19). We will be using the vector form exclusively.
An important specialization uses Boolean values, which can be regarded as a
special case of either discrete numbers (1, 0) or of categorical variables (True, False).
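The relationship between the attribute-value form and the vector form can be sketched as follows. The attribute order in ATTRIBUTE_ORDER and the function names are assumptions chosen for illustration; the only requirement is that the same order is used everywhere.

```python
# Fixed attribute order that the vector form leaves implicit (an illustrative choice).
ATTRIBUTE_ORDER = ("class", "major", "sex", "adviser")


def to_vector(record, order=ATTRIBUTE_ORDER):
    """Convert an unordered attribute-value record into an ordered vector."""
    return tuple(record[name] for name in order)


def to_attribute_values(vector, order=ATTRIBUTE_ORDER):
    """Recover the attribute-value form by pairing each component with its name."""
    return dict(zip(order, vector))


student = {"major": "history", "sex": "male", "class": "sophomore", "adviser": "Higgins"}
vector = to_vector(student)
print(vector)  # ('sophomore', 'history', 'male', 'Higgins')
```

The vector is more compact, but it is only meaningful together with the agreed ordering of the attributes.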
Outputs
1. The output may be a real number. In this case the process embodying the
function, h, is called a function estimator, and the output is called an output
value or estimate.
2. The output may be a categorical value. In this case the process embodying h is
variously called a classifier, a recognizer, or a categorizer, and the output itself is
called a label, a class, a category, or a decision.
Classifiers have application in a number of recognition problems, for example, in
the recognition of hand-written characters. The input in that case is some
suitable representation of the printed character, and the classifier maps this
input into one of, say, 64 categories.
3. The output may be a vector-valued object with components being real numbers
or categorical values.
4. The output may be Boolean. In that case, a training pattern having value 1 is
called a positive instance, and a training sample having value 0 is called a
negative instance. When the input is also Boolean, the classifier implements a
Boolean function.
Learning a Boolean function is sometimes called concept learning, and the
function is called a concept.
Training Regimes, Training Domains or Training Sets
There are several ways in which the training set, T, can be used to produce a
hypothesized function.
1. Batch method: In this case, the entire training set is available and used all at
once to compute the function, h. A variation of this method uses the entire
training set to modify a current hypothesis iteratively until an acceptable
hypothesis is obtained.
2. Incremental method: Here, we select one member at a time from the training
set and use this instance alone to modify a current hypothesis. Then another
member of the training set is selected, and so on. The selection method can be
random (with replacement) or it can cycle through the training set iteratively.
3. Online method: Here, we use the training set members as they become available.
Online methods might be used, for example, when the next training instance is some
function of the current hypothesis and the previous instance, as it would be if a
classifier were used to decide on a robot's next action given its current set of
sensory inputs: the next set of sensory inputs will depend on which action was
selected.
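The three regimes differ mainly in how the training data reaches the learner. The sketch below makes that concrete for a deliberately trivial hypothesis, a single number estimating a function assumed to be constant. The data, the running-average update, and the function names are assumptions for illustration only, not a method from the text.

```python
from itertools import cycle, islice

# Hypothetical training set: observed values of a function assumed to be constant.
T = [2.0, 4.0, 6.0, 8.0]


def batch_fit(samples):
    """Batch method: compute the hypothesis from the whole training set at once."""
    return sum(samples) / len(samples)


def incremental_fit(samples, passes=1):
    """Incremental method: revise the current hypothesis one member at a time,
    cycling through the training set `passes` times."""
    estimate, seen = 0.0, 0
    for y in islice(cycle(samples), passes * len(samples)):
        seen += 1
        estimate += (y - estimate) / seen  # running-average update
    return estimate


def online_fit(stream):
    """Online method: consume members as they become available and yield the
    current hypothesis after each one."""
    estimate, seen = 0.0, 0
    for y in stream:
        seen += 1
        estimate += (y - estimate) / seen
        yield estimate


print(batch_fit(T), incremental_fit(T), list(online_fit(iter(T))))
```

For this simple update the incremental result after one pass equals the batch result; for most learning rules the two only agree approximately.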
Noise
Sometimes the vectors in the training set are corrupted by noise. There are two
kinds of noise.
1. Class noise: It randomly alters the value of the function, i.e., the output.
2. Attribute noise: It randomly alters the values of the components of the input
vector.
In either case, it would be inappropriate to insist that the hypothesized function
agrees precisely with the values of the samples in the training set.
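The two kinds of noise can be simulated on a Boolean training set as follows. This is a sketch assuming Boolean inputs and outputs and an independent flip probability `rate`; the function names and the sample data are made up for illustration.

```python
import random


def add_class_noise(samples, rate, rng):
    """Class noise: flip the Boolean output of each sample with probability `rate`."""
    return [(x, 1 - y if rng.random() < rate else y) for x, y in samples]


def add_attribute_noise(samples, rate, rng):
    """Attribute noise: flip each Boolean input component independently
    with probability `rate`, leaving the output untouched."""
    return [
        (tuple(1 - xi if rng.random() < rate else xi for xi in x), y)
        for x, y in samples
    ]


samples = [((0, 1, 1), 1), ((1, 0, 0), 0)]  # hypothetical (input vector, value) pairs
rng = random.Random(0)                      # seeded so that runs are repeatable
print(add_class_noise(samples, 0.1, rng))
print(add_attribute_noise(samples, 0.1, rng))
```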
Performance Evaluation
It is important to have methods to evaluate the result of learning. In supervised
learning, the induced function is usually evaluated on a separate set of inputs and
function values for them called the testing set.
A hypothesized function is said to generalize when it guesses well on the testing
set. Both mean-squared-error and the total number of errors are common measures
used in performance evaluation.
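Both measures are short to state in code. The sketch below assumes the testing set is a list of (input, function value) pairs and that the hypothesis is an ordinary function; the example hypotheses and data are invented for illustration.

```python
def mean_squared_error(h, testing_set):
    """Average squared difference between h(x) and the given function value."""
    return sum((h(x) - fx) ** 2 for x, fx in testing_set) / len(testing_set)


def error_count(h, testing_set):
    """Total number of testing patterns on which h gives the wrong label."""
    return sum(1 for x, label in testing_set if h(x) != label)


# Hypothetical function estimator evaluated on real-valued outputs.
estimator = lambda x: 2.0 * x + 1.0
real_valued_tests = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(mean_squared_error(estimator, real_valued_tests))  # 1.0

# Hypothetical classifier evaluated on categorical (0/1) outputs.
classifier = lambda x: 1 if x >= 2 else 0
labelled_tests = [(1, 0), (2, 1), (3, 0)]
print(error_count(classifier, labelled_tests))  # 1
```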
Machine Learning Aims
Machine learning, as with automated reasoning, aims to use reasoning to find
new, relevant information given some background knowledge. This information may
then be used towards completing an intelligent task of more complexity. The same way
that the deductive process was harnessed for the particular task of proving theorems,
machine learning is harnessed for particular tasks.
One task is categorization: Given a set of examples, some background
information about those examples and a way of categorizing the examples, then learn
general reasons, based on the background information, why examples are put into
certain categories.
Of course, if an agent has learned a way of correctly categorizing examples, then
this can be used to predict the category of unseen examples. By stating prediction tasks
as categorization tasks, machine learning agents can be powerful tools for making
predictions.
Hence another task is: Given a set of examples, some background information
about those examples and an attribute of the examples, learn a way of predicting the
value of the attribute for an unseen example.
Machine learning application domains include:
1. Biochemistry: Predictive toxicology (described below) uses machine learning
agents to decide which drugs may turn out to be toxic to humans.
2. Medicine: Diagnosing patients based on their symptoms is an important
application of machine learning tools.
3. Bioinformatics: Machine learning tools have been used to predict the
three-dimensional structure of proteins derived from gene sequences.
4. Natural language: Agents are programmed to learn grammars in order to
improve natural language comprehension.
5. Finance: Machine learning and other statistical tools can be used to predict stock
market fluctuations.
6. Military: An early application of neural networks was to identify certain vehicle
types (such as tanks) from visual information.
7. Music: Recent applications to computer music include the use of Markov models
to classify music into styles.
2
Concept Learning or
Boolean Learning
Writing a machine learning algorithm comprises three things:
1. how to represent the solutions,
2. how to search the space of solutions for a set of solutions which perform well,
3. how to choose from this set of best solutions.
Representation
The solution(s) to machine learning tasks are often called hypotheses, because
they can be expressed as a hypothesis that the observed positives and negatives for a
categorization are explained by the concept learned for the solution. The hypotheses have
to be represented in some representation scheme.
It is important to bear in mind that a solution to a machine learning problem will
be judged in worth along (at least) these three axes:
1. Accuracy: as discussed below, we use statistical techniques to determine how
accurate solutions are, in terms of the likelihood that they will correctly predict
the category of new (unseen) examples.
2. Comprehensibility: in some cases, it is highly desirable to be able to understand
the meaning of the hypotheses.
3. Utility: there may be other criteria for the solution which override the accuracy
and comprehensibility, e.g., in biological domains, when drugs are predicted by
machine learning techniques, it is imperative that the drugs can actually be
synthesised.
Each learning task will be better suited to some representation schemes than to others.
Choosing from Suitable Hypotheses
Certain learning techniques learn a range of hypotheses as solutions to the
problem at hand. These hypotheses usually range over two axes: their generality and
their predictive accuracy over the set of examples supplied.
We have mentioned that the application of machine learning algorithms is to
predict the categorization of unseen examples, and this is also how learning
techniques are evaluated and compared. Hence, a machine learning algorithm must
choose a single hypothesis, so that it can use this hypothesis to predict the category of
an unseen example.
The overriding force in machine learning assessment is the predictive accuracy
of learned hypotheses over unseen examples. The best bet for predictive accuracy over
unseen examples is to choose the hypothesis which achieves the best accuracy over the
seen examples, unless it overfits the data. Hence, the set of hypotheses to choose from is
usually narrowed down straight away to those which achieve the best accuracy when
used to predict the categorization of the examples given to the learning process. Within
this set, there are various possibilities for choosing the candidate to use for the
prediction.
Boolean Algebra
Many important ideas about learning of functions are most easily presented
using the special case of Boolean functions. There are several important subclasses of
Boolean functions that are used as hypothesis classes for function (machine) learning.
A Boolean function, f(x1, x2, …, xn), maps an n-tuple of (0, 1) values to {0, 1}.
Boolean algebra is a convenient notation for representing Boolean functions. Boolean
algebra uses the connectives ., +, and ’ for and, or, and complement, respectively.
Sometimes the arguments and values of Boolean functions are expressed in
terms of the constants T (True) and F (False) instead of 1 and 0 respectively.
A Boolean formula consisting of a single variable, such as x1 is called an atom.
example terms are x1x7 and x1x2x4’. The size of a term is the number of literals it
contains. The examples are of sizes 2 and 3, respectively.
A clause is any function written in the form l1 + l2 + … + lk, where the li are literals.
Such a form is called a disjunction of literals.
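Terms and clauses are easy to evaluate mechanically. The sketch below assumes the usual definitions (a literal is a variable or its complement, a term is a conjunction of literals, a clause is a disjunction of literals) and encodes a literal as a hypothetical pair (index, positive), with indices starting at 1 to match x1, x2, and so on.

```python
def literal_value(literal, x):
    """Value of a literal on input x; a literal is (index, positive), index from 1."""
    index, positive = literal
    bit = x[index - 1]
    return bit if positive else 1 - bit


def term_value(term, x):
    """A term (conjunction of literals) is 1 only if every literal is 1."""
    return int(all(literal_value(lit, x) for lit in term))


def clause_value(clause, x):
    """A clause (disjunction of literals) is 1 if at least one literal is 1."""
    return int(any(literal_value(lit, x) for lit in clause))


term = [(1, True), (2, True), (4, False)]  # the term x1x2x4', of size 3
clause = [(1, True), (3, False)]           # the clause x1 + x3'

print(len(term), term_value(term, (1, 1, 0, 0)), term_value(term, (1, 1, 0, 1)))  # 3 1 0
print(clause_value(clause, (0, 1, 1, 0)), clause_value(clause, (0, 0, 0, 0)))     # 0 1
```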
Concept Learning
Concept Learning Task
A concept learning task involves defining a target function over a set of instances.
The target function can be a model that assigns each instance to a specified
concept, class or category. Here, the concept learner generates a set of candidate
hypotheses from a given set of available training examples.
Example
Concept: Days when you would enjoy water sports.
Instances X: Possible days each described by the attributes.
Sky (Sunny, Cloudy, Rainy)
AirTemp (Warm, Cold)
Humidity (Normal, High)
Wind (Strong, Weak)
Water (Warm, Cold)
Forecast (Same, Change)
Hypothesis H: Each hypothesis is a vector of 6 constraints, specifying the values
of 6 attributes.
For each attribute, the constraint is one of:
– ? if any value is acceptable for this attribute
– a single required value for the attribute
– 0 if no value is acceptable
Sample hypothesis: (Rainy, ?, ?, ?, Warm, ?)
General and specific hypotheses:
Most general hypothesis: (?, ?, ?, ?, ?, ?)
Most specific hypothesis: (0, 0, 0, 0, 0, 0)
Target concept C: EnjoySport: X → {0, 1}
Training Examples D:
Example  Sky  AirTemp  Humidity  Wind  Water  Forecast  EnjoySport
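How a hypothesis of this form classifies a day can be sketched as below. The constants ANY and NONE, the function name satisfies, and the example day are assumptions introduced for illustration; the attribute order is the one listed above (Sky, AirTemp, Humidity, Wind, Water, Forecast).

```python
ANY, NONE = "?", "0"  # "any value is acceptable" and "no value is acceptable"


def satisfies(hypothesis, instance):
    """Return 1 if the instance meets every constraint of the hypothesis, else 0."""
    return int(all(
        constraint == ANY or (constraint != NONE and constraint == value)
        for constraint, value in zip(hypothesis, instance)
    ))


sample_hypothesis = ("Rainy", "?", "?", "?", "Warm", "?")
most_general = ("?",) * 6
most_specific = ("0",) * 6

# A hypothetical day, described by Sky, AirTemp, Humidity, Wind, Water, Forecast.
day = ("Rainy", "Warm", "High", "Strong", "Warm", "Same")

print(satisfies(sample_hypothesis, day), satisfies(most_general, day),
      satisfies(most_specific, day))  # 1 1 0
```

The most general hypothesis classifies every day as positive and the most specific hypothesis classifies none as positive.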
Concept Learning as Search
Concept learning can be viewed as searching through a large space of hypotheses
implicitly defined by the hypothesis representation. Here, the goal is to efficiently
search hypothesis space to find the hypothesis that best fits the training data.
Hypothesis space is potentially very large and even possibly infinite.
The Find-S Algorithm
5. To construct more general hypotheses, each old hypothesis is taken and the least
general generalizations are constructed. These generalizations are such that
there is no more specific hypothesis which is also true of the two positives.
6. Note that, if we are using a logic representation of the hypotheses, then all that is
required to find the least general generalization is to keep changing ground
terms into variables until we arrive at a hypothesis which is true of P2.
7. Because the more specific hypothesis that we used to generalize from was true of
P1, the generalized hypothesis must also be true of P1.
8. Once this process has been exhausted for the second positive, the FIND-S method
takes the enlarged set of hypotheses and does the same generalization routine
using the third positive example.
9. This continues until all the positives have been used. Of course, it is then
necessary to start the whole process with a different first positive.
Only once it has found all the possible hypotheses, ranging from the most specific
to the most general, does the FIND-S method check how good the hypotheses are at the
particular learning task. For each hypothesis, it checks how many examples are
correctly categorized as positive or negative, and the hypotheses learned by this method
are those which achieve highest predictive accuracy on the examples given to it.
Note that because this method looks for the least general generalizations, it is
guaranteed to find the most specific solutions to the problem.
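For the attribute-vector representation of the EnjoySport example, the least general generalization has a particularly simple form. The sketch below is not the multi-hypothesis procedure described in the steps above; it is the common single-hypothesis variant of FIND-S, which starts from the most specific hypothesis and generalizes it just enough to cover each positive example in turn. The four training days are invented for illustration.

```python
def find_s(examples):
    """Single-hypothesis FIND-S for attribute vectors.

    `examples` is a list of (instance, label) pairs with label 1 for positives.
    Negative examples are ignored, as in the textbook variant.
    """
    n = len(examples[0][0])
    h = ["0"] * n                      # most specific hypothesis
    for instance, label in examples:
        if label != 1:
            continue
        for i, value in enumerate(instance):
            if h[i] == "0":            # first positive: adopt its value
                h[i] = value
            elif h[i] != value:        # conflict: generalize to "any value"
                h[i] = "?"
    return tuple(h)


# Hypothetical training examples (Sky, AirTemp, Humidity, Wind, Water, Forecast).
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), 1),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High", "Strong", "Cold", "Change"), 1),
]

print(find_s(D))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```

Each "?" marks an attribute on which the positive examples disagree, so the result is the most specific hypothesis of this form that covers all the positives.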
3
The first learning methods we present are based on the concepts of version
spaces and version graphs. These ideas are most clearly explained for the case of
Boolean function learning. Given an initial hypothesis set H, (a subset of all Boolean
functions) and the values of f(X) for each X in a training set, T, the version space is that
subset of hypotheses, Hv, that is consistent with these values. A hypothesis, h, is
consistent with the values of X in T if and only if h(X) = f(X) for all X in T. We say that the
hypotheses in H that are not consistent with the values in the training set are ruled out
by the training set.
Suppose that we have devices for implementing every function in H. We define
an incremental training procedure, which presents each pattern in T to each of the
functions in H and then eliminates those functions whose values for that pattern do not
agree with its given value. At any stage of the process we
are left with some subset of functions that are consistent with the patterns
presented so far. This subset is the version space for the patterns already presented.
This idea is illustrated in the following figure.
Consider a procedure for classifying an arbitrary input pattern, X: the pattern is
put in the same class (0 or 1) as the majority of the outputs of the functions in the
version space. During the learning procedure, if this majority is not equal to the value of
the pattern presented, we say a mistake is made, and we revise the version space
accordingly. This revision of the version space eliminates all those functions that vote
incorrectly. Thus, whenever a mistake is made, we rule out at least half of the functions
remaining in the version space.
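The majority-vote procedure can be tried out on a tiny hypothesis set. The sketch below assumes H is the set of all 16 Boolean functions of two inputs, stored as truth tables, and that the target is x1.x2; ties in the vote are broken in favor of class 0. These choices, and the names majority_vote and learn_by_voting, are illustrative assumptions.

```python
from itertools import product

INPUTS = list(product((0, 1), repeat=2))  # the four 2-bit patterns
# Every Boolean function of two inputs, stored as a dict from pattern to output.
H = [dict(zip(INPUTS, outputs)) for outputs in product((0, 1), repeat=len(INPUTS))]


def majority_vote(version_space, x):
    """Class chosen by the majority of functions in the version space (ties give 0)."""
    ones = sum(h[x] for h in version_space)
    return 1 if 2 * ones > len(version_space) else 0


def learn_by_voting(version_space, labelled_patterns):
    """Present patterns one at a time, count mistakes, and rule out every
    function that disagrees with the given value."""
    mistakes = 0
    for x, value in labelled_patterns:
        if majority_vote(version_space, x) != value:
            mistakes += 1
        version_space = [h for h in version_space if h[x] == value]
    return version_space, mistakes


# Labelled patterns for the assumed target function x1.x2.
T = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
final_space, mistakes = learn_by_voting(H, T)
print(len(final_space), mistakes)  # 1 1
```

Because each mistake removes at least half of the remaining functions, the number of mistakes can never exceed log2 of the size of H, which is 4 here.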
Version Graphs
Boolean functions can be ordered by generality. A Boolean function, f1, is more
general than a function, f2, (and f2 is more specific than f1) if f1 has value 1 for all of the
arguments for which f2 has value 1, and f1 ≠ f2. For example, x3 is more general than x2.x3 but
is not more general than x3 + x2.
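The ordering can be checked by brute force over all inputs. The sketch below is a direct transcription of the definition for functions of n Boolean inputs; the function name more_general and the lambda encodings of x3, x2.x3 and x3 + x2 are introduced here for illustration.

```python
from itertools import product


def more_general(f1, f2, n):
    """True if f1 is 1 wherever f2 is 1 and the two functions differ somewhere."""
    inputs = list(product((0, 1), repeat=n))
    covers = all(f1(x) == 1 for x in inputs if f2(x) == 1)
    differs = any(f1(x) != f2(x) for x in inputs)
    return covers and differs


x3 = lambda x: x[2]
x2_and_x3 = lambda x: x[1] & x[2]  # x2.x3
x3_or_x2 = lambda x: x[2] | x[1]   # x3 + x2

print(more_general(x3, x2_and_x3, 3))  # True
print(more_general(x3, x3_or_x2, 3))   # False
print(more_general(x3_or_x2, x3, 3))   # True
```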
We can form a graph with the hypotheses, {hi}, in the version space as nodes. A
node in the graph, hi, has an arc directed to node hj if and only if hj is more general than
hi. We call such a graph a version graph. In the following figure, we show an example of
a version graph over a 3-dimensional input space for hypotheses restricted to terms.
Figure-1
The function, which has value 1 for all inputs, corresponds to the node at the top
of the graph. It is denoted here by “1”. It is more general than any other term. Similarly,
the function “0” is at the bottom of the graph. Just below 1 is a row of nodes
corresponding to all terms having just one literal, and just below them is a row of nodes
corresponding to terms having two literals, and so on. There are 3^3 = 27 functions
altogether (the function “0”, included in the graph, is technically not a term). To make
our portrayal of the graph less cluttered, only some of the arcs are shown; each node in
the actual graph has an arc directed to all of the nodes above it that are more general.
We use this same example to show how the version graph changes as we
consider a set of labeled samples in a training set, T. Suppose we first consider the
training pattern (1, 0, 1) with value 0. Some of the functions in the version graph of
Figure-1 are inconsistent with this training pattern. These ruled out nodes are no longer
in the version graph and are shown shaded in Figure-2. We also show there the three-
dimensional cube representation in which the vertex (1, 0, 1) has value 0.
Figure-2
In a version graph, there is always a set of hypotheses that are maximally
general and a set of hypotheses that are maximally specific. These are called the general
boundary set (gbs) and the specific boundary set (sbs), respectively. In Figure-3, we
have the version graph as it exists after learning that (1, 0, 1) has value 0 and (1, 0, 0)
has value 1. The gbs and sbs are shown.
Figure-3
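The version space and its boundary sets for this example can be computed by enumeration. The sketch below assumes hypotheses are restricted to the 27 terms over x1, x2, x3 and omits the function "0", which is ruled out as soon as a pattern with value 1 is seen. The tuple encoding of terms and the helper names are choices made for illustration.

```python
from itertools import product

# A term over x1, x2, x3 is a tuple with one entry per variable:
# 1 = the variable appears, 0 = its complement appears, None = it is absent.
TERMS = list(product((1, 0, None), repeat=3))  # 3^3 = 27 terms, including "1"


def value(term, x):
    """Value of the term on pattern x."""
    return int(all(t is None or t == xi for t, xi in zip(term, x)))


def more_general(t1, t2):
    """For terms, t1 is more general than t2 when its literals are a proper
    subset of the literals of t2."""
    return t1 != t2 and all(a is None or a == b for a, b in zip(t1, t2))


def version_space(hypotheses, samples):
    """Hypotheses consistent with every labelled sample."""
    return [h for h in hypotheses if all(value(h, x) == v for x, v in samples)]


def boundaries(space):
    """General boundary set and specific boundary set of a version space."""
    gbs = [h for h in space if not any(more_general(g, h) for g in space)]
    sbs = [h for h in space if not any(more_general(h, s) for s in space)]
    return gbs, sbs


def show(term):
    """Write a term in the notation of the text, e.g. x1x2'x3'."""
    names = [f"x{i + 1}" + ("" if t == 1 else "'") for i, t in enumerate(term) if t is not None]
    return "".join(names) or "1"


samples = [((1, 0, 1), 0), ((1, 0, 0), 1)]
space = version_space(TERMS, samples)
gbs, sbs = boundaries(space)
print(sorted(show(t) for t in space))
print([show(t) for t in gbs], [show(t) for t in sbs])
```

Under these assumptions four terms remain, with x3' as the only maximally general hypothesis and x1x2'x3' as the only maximally specific one.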
4
Decision Tree Learning
Definitions
A decision tree is a tree whose internal nodes are tests on input patterns and
whose leaf nodes are categories of patterns. We show an example in the following
figure.
There are several dimensions along which decision trees might differ.
1. The tests might be multivariate (testing on several features of the input at once) or
univariate (testing on only one of the features).
2. The tests might have two outcomes or more than two. If all of the tests have two
outcomes, we have a binary decision tree.
3. The features or attributes might be categorical or numeric. Binary-valued attributes
can be regarded as either.
4. We might have two classes or more than two. If we have two classes and binary
inputs, the tree implements a Boolean function, and is called a Boolean decision tree.
It is straightforward to represent the function implemented by a univariate
Boolean decision tree in DNF form. The DNF form implemented by such a tree can be
obtained by tracing down each path leading to a tip node corresponding to an output
value of 1, forming the conjunction of the tests along this path, and then taking the
disjunction of these conjunctions.
We show an example in the following figure.
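The path-tracing construction can be sketched as a short recursive function. The tree encoding used here is an assumption made for illustration: a leaf is the integer 0 or 1, and an internal node is a tuple (feature, subtree if the feature is 0, subtree if the feature is 1). The example tree is hypothetical and is not the one in the figure.

```python
def dnf_terms(tree, path=()):
    """Collect one conjunction of tests for every path that ends in a 1 leaf."""
    if isinstance(tree, int):                       # leaf node
        return [path] if tree == 1 else []
    feature, if_zero, if_one = tree                 # univariate test on one feature
    return (dnf_terms(if_zero, path + ((feature, 0),))
            + dnf_terms(if_one, path + ((feature, 1),)))


def format_dnf(terms):
    """Write the disjunction of the conjunctions in the notation of the text."""
    def literal(feature, outcome):
        return f"x{feature}" + ("" if outcome == 1 else "'")
    written = ["".join(literal(f, o) for f, o in term) or "1" for term in terms]
    return " + ".join(written) or "0"


# Hypothetical tree: test x1 first; if x1 = 0 test x2, if x1 = 1 test x3.
tree = (1, (2, 0, 1), (3, 0, 1))
print(format_dnf(dnf_terms(tree)))  # x1'x2 + x1x3
```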
The k-DL (decision list) class of Boolean functions can be implemented by a
multi-variate decision tree having the highly unbalanced form shown in the following
figure. Each test, ci, is a term of size k or less. The vi all have values of 0 or 1.
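Evaluating such a function amounts to walking down the unbalanced tree until some test succeeds. The sketch below assumes a decision list is a sequence of (term, value) pairs followed by a default value, and that a term is a list of (index, required bit) pairs with indices starting at 1. The particular list shown is a made-up member of 2-DL.

```python
def term_true(term, x):
    """True if every (index, required bit) pair of the term holds on x."""
    return all(x[index - 1] == required for index, required in term)


def decision_list_value(decision_list, default, x):
    """Return the value attached to the first test that x satisfies."""
    for term, value in decision_list:
        if term_true(term, x):
            return value
    return default


# Hypothetical 2-DL: if x1x2' then 1, else if x3 then 0, else 1.
decision_list = [
    ([(1, 1), (2, 0)], 1),
    ([(3, 1)], 0),
]

print(decision_list_value(decision_list, 1, (1, 0, 1)))  # 1
print(decision_list_value(decision_list, 1, (0, 0, 1)))  # 0
print(decision_list_value(decision_list, 1, (0, 1, 0)))  # 1
```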
Several systems for learning decision trees have been proposed. Prominent
among these are ID3 and its successor, C4.5, and CART. We discuss here only batch
methods, although incremental ones have also been proposed.
Using Uncertainty Reduction to Select Tests
The main problem in learning decision trees for the binary-attribute case is
selecting the order of the tests. For categorical and numeric attributes, we must also
decide what the tests should be (besides selecting the order). Several techniques have
been tried; the most popular one is to select, at each stage, the test that maximally
reduces an entropy-like measure.
We show how this technique works for the simple case of tests with binary
outcomes. Extension to multiple-outcome tests is computationally straightforward but
gives poor results because entropy is always decreased by having more outcomes.
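The entropy or uncertainty about the class of a pattern, knowing that it is in some
set, T, of patterns, is defined as:
$$H(T) = -\sum_i p(i \mid T) \log_2 p(i \mid T)$$
where p(i|T) is the probability that a pattern drawn at random from T belongs to
class i, and the sum is over all of the classes. We want to select tests such
that as we travel down the decision tree, the uncertainty about the class of a pattern
becomes less and less.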
Since we do not, in general, have the probabilities p(i|T), we estimate them by
sample statistics. Although these estimates might be in error, they are nevertheless
useful in estimating uncertainties. Let p’(i|T) be the number of patterns in T belonging
to class i divided by the total number of patterns in T. Then an estimate of the
uncertainty is:
$$H(T) \approx -\sum_i p'(i \mid T) \log_2 p'(i \mid T)$$
Suppose that a test G with k possible outcomes splits T into subsets T1, …, Tk. If we
knew that a pattern in T gave the j-th outcome (that is, that it was in Tj), the
uncertainty about its class would be:
$$H(T_j) = -\sum_i p(i \mid T_j) \log_2 p(i \mid T_j)$$
Since we do not know in advance which outcome will occur, we average over the outcomes:
$$E[H_G(T)] = \sum_j p(T_j) H(T_j)$$
where HG(T) is the average uncertainty after performing test G on the patterns in T,
p(Tj) is the probability that the test has outcome j, and the sum is taken from 1 to k.
Again, we don’t know the probabilities p(Tj), but we can use the sample values.
The estimate p’(Tj) of p(Tj) is just the number of those patterns in T that have outcome j
divided by the total number of patterns in T.
The average reduction in uncertainty achieved by test G (applied to patterns in
T) is then:
RG(T) = H(T) – E[HG(T)].
A decision tree algorithm of this kind selects, for the root of the tree, the test
that gives the maximum reduction of uncertainty (entropy). The algorithm then
applies this criterion recursively until some termination condition is met.
The uncertainty calculations are particularly simple when the tests have binary
outcomes and when the attributes have binary values.
We will give a simple example to illustrate how the test selection mechanism
works in that case. Suppose we want to use the uncertainty-reduction method to build a
decision tree to classify the following patterns:
Pattern Class
(0, 0, 0) 0
(0, 0, 1) 0
(0, 1, 0) 0
(0, 1, 1) 0
(1, 0, 0) 0
(1, 0, 1) 1
(1, 1, 0) 0
(1, 1, 1) 1
The initial uncertainty for the set, T, containing all eight points is:
H(T) = - (6/8) log2(6/8) – (2/8)log2(2/8) = 0.81
Next, we calculate the uncertainty reduction if we perform x1 first. The left-hand
branch of the decision tree has only patterns belonging to class 0 (we call the set Tl), and
the right-hand branch (Tr) has two patterns in each class. So, the uncertainty of the left-
hand branch is:
Hx1(Tl) = - (4/4)log2(4/4) – (0/4)log2(0/4) = 0
And the uncertainty of the right-hand branch is:
Hx1(Tr) = - (2/4)log2(2/4) – (2/4)log2(2/4) = 1
Half of the patterns “go left” and half “go right” on test x1. Thus, the average
uncertainty after performing the x1 test is:
½ Hx1(Tl) + ½ Hx1(Tr) = 0.5
Therefore, the uncertainty reduction on T achieved by x1 is:
Rx1(T) = 0.81 – 0.5 = 0.31
By similar calculations, we see that the test x3 achieves exactly the same
uncertainty reduction, but x2 achieves no reduction whatsoever.
Thus, our “greedy” algorithm for selecting a first test would select either x1 or x3.
Suppose x1 is selected. The uncertainty-reduction procedure would select x3 as
the next test.
The decision tree that this procedure creates thus implements the Boolean
function: f = x1x3.
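The calculations in this example can be reproduced with a short program. The sketch below uses the eight patterns from the table, estimates probabilities by sample counts, and builds the tree greedily. Features are indexed from 0 (so index 0 is x1), ties are broken in favor of the lowest index, and the tuple encoding of the tree is an assumption made for illustration.

```python
from collections import Counter
from math import log2

PATTERNS = [((0, 0, 0), 0), ((0, 0, 1), 0), ((0, 1, 0), 0), ((0, 1, 1), 0),
            ((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 0), ((1, 1, 1), 1)]


def uncertainty(samples):
    """H(T), with p(i|T) estimated by counts; p*log2(1/p) equals -p*log2(p)."""
    total = len(samples)
    counts = Counter(label for _, label in samples)
    return sum((n / total) * log2(total / n) for n in counts.values())


def reduction(samples, feature):
    """R(T) = H(T) - E[H(T)] for the binary test on the given feature."""
    average = 0.0
    for outcome in (0, 1):
        subset = [s for s in samples if s[0][feature] == outcome]
        if subset:
            average += len(subset) / len(samples) * uncertainty(subset)
    return uncertainty(samples) - average


def build_tree(samples, features):
    """Greedy tree: a leaf is a class; a node is (feature, subtree_0, subtree_1)."""
    labels = Counter(label for _, label in samples)
    if len(labels) == 1 or not features:
        return labels.most_common(1)[0][0]
    best = max(features, key=lambda f: reduction(samples, f))
    rest = [f for f in features if f != best]
    branches = [build_tree([s for s in samples if s[0][best] == o], rest) for o in (0, 1)]
    return (best, branches[0], branches[1])


print(round(uncertainty(PATTERNS), 2))                       # 0.81
print([round(reduction(PATTERNS, f), 2) for f in range(3)])  # [0.31, 0.0, 0.31]
print(build_tree(PATTERNS, [0, 1, 2]))                       # (0, 0, (2, 0, 1))
```

The resulting tree tests x1, outputs 0 when x1 = 0, and otherwise tests x3, which is the function f = x1x3.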
Non-Binary Attributes
If the attributes are non-binary, we can still use the uncertainty-reduction
technique to select tests. But now, in addition to selecting an attribute, we must select a
test on that attribute. Suppose for example that the value of an attribute is a real
number and that the test to be performed is to set a threshold and to test to see if the
number is greater than or less than that threshold. In principle, given a set of labeled
patterns, we can measure the uncertainty reduction for each test that is achieved by
every possible threshold. Similarly, if an attribute is categorical, (with a finite number of
categories), there are only a finite number of mutually exclusive and exhaustive subsets
into which the values of the attribute can be split. We can calculate the uncertainty
reduction for each split.
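For a real-valued attribute, the search over thresholds can be sketched as follows. The text does not say which thresholds to try; this sketch assumes the midpoints between adjacent distinct observed values, which is one common choice because any threshold between the same two values splits the training patterns identically. The data and function names are invented for illustration.

```python
from collections import Counter
from math import log2


def uncertainty(labels):
    """Entropy of a list of class labels, with probabilities estimated by counts."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def threshold_reduction(samples, threshold):
    """Uncertainty reduction of the test 'value > threshold'."""
    low = [label for value, label in samples if value <= threshold]
    high = [label for value, label in samples if value > threshold]
    average = (len(low) * uncertainty(low) + len(high) * uncertainty(high)) / len(samples)
    return uncertainty([label for _, label in samples]) - average


def best_threshold(samples):
    """Try the midpoint between each pair of adjacent distinct values.
    Assumes the attribute takes at least two distinct values."""
    values = sorted({value for value, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: threshold_reduction(samples, t))


# Hypothetical labelled patterns: (attribute value, class).
samples = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
print(best_threshold(samples))  # 4.5
```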
The decision tree at the left of the figure implements the same function as the
network at the right of the figure. Of course, when implemented as a network, all of the
features are evaluated in parallel for any input pattern, whereas when implemented as a
decision tree only those features on the branch traveled down by the input pattern need
to be evaluated. Thus the decision-tree induction methods can be thought of as
particular ways to establish the structure and the weight values for networks.