Machine learning

Machine Learning
Dr. Amit Kumar, Dept of CSE, JUET, Guna

Chapter 1-
Introduction

Machine Learning
Chapter 1- Introduction
• The field of machine learning is concerned with
the question of how to construct computer
programs that automatically improve with
experience.
• In recent years many successful machine learning
applications have been developed.
– data-mining programs
– detect fraudulent credit card transactions
– information-filtering systems
– autonomous vehicles

Machine Learning
• The goal of this course is to present the key
algorithms and theory that form the core of
machine learning.
• Machine learning draws on concepts and
results from many fields, including –
–statistics, artificial intelligence, philosophy,
–information theory, biology, cognitive
science,
–computational complexity, and control
theory. Dr. Amit Kumar, Dept of CSE, JUET, Guna

Machine Learning
• A few specific achievements provide a
glimpse of the state of the art:
– Programs have been developed that successfully
learn to recognize spoken words (Waibel 1989; Lee
1989)
– predict recovery rates of pneumonia patients
(Cooper et al. 1997)
– detect fraudulent use of credit cards, drive
autonomous vehicles on public highways (Pomerleau
1989)
– play games such as backgammon at levels
approaching the performance of human world
champions (Tesauro 1992,1995).

Machine Learning
• Learning to recognize spoken words:
• All of the most successful speech recognition systems
employ machine learning in some form.
• For example, the SPHINX system (e.g., Lee 1989) learns
speaker-specific strategies for recognizing the primitive
sounds (phonemes) and words from the observed speech
signal.
• Neural network learning methods (e.g., Waibel et al.
1989) and methods for learning hidden Markov models
(e.g., Lee 1989) are effective for automatically customizing
to individual speakers, vocabularies, microphone
characteristics, background noise, etc.
• Similar techniques have potential applicationsin many
signal-interpretation problems.

Machine Learning
• Automated Transportation
• Machine learning methods have been used to
train computer-controlled vehicles to steer
correctly when driving on a variety of road types.
• For example, the ALVINN system (Pomerleau
1989) has used its learned strategies to drive
unassisted at 70 miles per hour for 90 miles on
public highways among other cars.
• Similar techniques have possible applications in
many sensor-based control problems.

Machine Learning
• Automated Transportation (contd..)
Google began testing a self-driving car in
2012, and since then, the U.S. Department
of Transportation has released definitions of
different levels of automation, with
Google’s car classified as the first level
down from full automation.
Other transportation methods are closer to
full automation, such as buses and trains.

Machine Learning
Classify new astronomical structures.
• Machine learning methods have been applied to
a variety of large databases to learn general
regularities implicit in the data.
• For example, decision tree learning algorithms
have been used by NASA to learn how to classify
celestial objects from the second Palomar
Observatory Sky Survey (Fayyad et al. 1995).
• This system is now used to automatically classify
all objects in the Sky Survey, which consists of
three terrabytes of image data.

Machine Learning
To play world-class backgammon.
• The most successful computer programs for playing
games such as backgammon are based on machiie
learning algorithms.
• For example, the world's top computer program for
backgammon, TD-GAMMON(T esauro 1992, 1995).
learned its strategy by playing over one million
practice games against itself.
• It now plays at a level competitive with the human
world champion.
• Similar techniques have applications in many
practical problems where very large search spaces
must be examined efficiently.

Machine Learning
Cyborg Technology
• Researcher Shimon Whiteson thinks that in the
future, we will be able to augment ourselves with
computers and enhance many of our own natural
abilities.
• Though many of these possible cyborg
enhancements would be added for convenience,
others might serve a more practical purpose.
• Yoky Matsuka of Nest believes that AI will become
useful for people with amputated limbs, as the
brain will be able to communicate with a robotic
limb to give the patient more control.
• This kind of cyborg technology would significantly
reduce the limitations that amputees deal with on a
daily basis. Dr. Amit Kumar, Dept of CSE, JUET, Guna

Machine Learning
Taking over dangerous jobs
• Robots are already taking over some of the most hazardous
jobs available, including bomb defusing.
• These robots aren’t quite robots yet, according to the
BBC. They are technically drones, being used as the physical
counterpart for defusing bombs, but requiring a human to
control them, rather than using machine learning.
• Other jobs are also being reconsidered for robot integration.
Welding, well known for producing toxic substances, intense
heat, and earsplitting noise, can now be outsourced to
robots in most cases.
• Robot Worx explains that robotic welding cells are already in
use, and have safety features in place to help prevent
human workers from fumes and other bodily harm.

Machine Learning
Solving climate change
• Solving climate change might seem like a tall
order from a robot, but as Stuart Russell
explains, machines have more access to data
than one person ever could—storing a mind-
boggling number of statistics.
• Using big data, machine learning could one
day identify trends and use that information
to come up with solutions to the world’s
biggest problems.

Machine Learning
Robot as friends
• Who wouldn’t want a friend like C-3PO?
• C-3PO is a humanoid robot character from
the ”Star Wars”.
• Built by Anakin Skywalker, C-3PO was
designed as a protocol droid intended to
assist in etiquette, customs,
and translation, boasting that he is "fluent
in over six million forms
of communication"Dr. Amit Kumar, Dept of CSE, JUET, Guna

Machine Learning
Robot as friends (Contd..)
• At this stage, most robots are still emotionless and
it’s hard to picture a robot you could relate to.
However, a company in Japan has made the first big
steps toward a robot companion—one who can
understand and feel emotions.
• Introduced in 2014, “Pepper” the companion robot
went on sale in 2015, with all 1,000 initial
units selling out within a minute.
• The robot was programmed to read human
emotions, develop its own emotions, and help its
human friends stay happy. Pepper goes on sale in
the U.S. in 2016, and more sophisticated friendly
robots are sure to follow.

Machine Learning
Improved elder care and medical care
• For many seniors, everyday life is a struggle, and
many have to hire outside help to manage their
care, or rely on family members.
• Machine learning and AI is at a stage where
replacing this need isn’t too far off, says
Matthew Taylor, computer scientist at
Washington State University.
• “Home” robots could help seniors with everyday
tasks and allow them to stay independent and in
their homes for as long as possible, which
improves their overall well-being.

Machine Learning
Sophia
• The first robot to be awarded citizenship in
the world, has said she not only wants to
start a family but also have her own career, in
addition to developing human emotions in
the future.
• Sophia, the humanoid AI robot has sprung
back to headlines after saying she wants to
start a family of her own and also mentioning
that all droids deserve to have children.

Machine Learning
The robot running for NZ Prime Minister in 2020
• The era of intelligent machines is well and truly
upon us. After Sophia was confirmed as the
world's first robot citizen in Saudi Arabia last
month, it's time for us to gape in awe at SAM,
who's become the world's first AI-powered
virtual politician from New Zealand.
• SAM's creator, Nick Gerritsen, believes that the
virtual politician may be able to contest New
Zealand's general elections in 2020.

Machine Learning
IBM Watson
• Watson is a question answering computer system capable of
answering questions posed in natural language, developed
in IBM's DeepQA project.
• The computer system was specifically developed to answer
questions on the quiz show Jeopardy! and, in 2011, the Watson
computer system competed on Jeopardy! against former
winners Brad Rutter and Ken Jennings] winning the first place
prize of $1 million.
• Watson had access to 200 million pages of structured and
unstructured content consuming four terabytes of disk
storage including the full text of Wikipedia, but was not
connected to the Internet during the game. For each clue,
Watson's three most probable responses were displayed on the
television screen. Watson consistently outperformed its human
opponents on the game's signaling device, but had trouble in a
few categories, notably those having short clues containing only
a few words.

WELL-POSED LEARNING PROBLEMS
• Definition: A computer program is said to learn
from experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improve
• For example, a computer program that learns to
play checkers might improve its performance as
measured by its ability to win at the class of tasks
involving playing checkers games, through
experience obtained by playing games against
itself with experience E.

• In general, to have a well-defined
learning problem, we must identity
these three features:
• the class of tasks (T)
• the measure of performance to be
improved (P),
• and the source of experience (E).

• A checkers learning problem:
–Task T: playing checkers
–Performance measure P: percent of games
won against opponents
–Training experience E: playing practice
games against itself
• We can specify many learning problems in this
fashion, such as learning to recognize
handwritten words, or learning to drive a
robotic automobile autonomously.

• A handwriting recognition learning
problem:
–Task T: recognizing and classifying
handwritten words within images
–Performance measure P: percent of words
correctly classified
–Training experience E: a database of
handwritten words with given
classifications

• A robot driving learning problem:
–Task T: driving on public four-lane
highways using vision sensors
–Performance measure P: average
distance traveled before an error (as
judged by human overseer)
–Training experience E: a sequence of
images and steering commands recorded
while observing a human driver

DESIGNING A LEARNING SYSTEM
• let us consider designing a program
to learn to play checkers,
• with the goal of entering it in the
world checkers tournament.
• We adopt the obvious performance
measure: the percent of games it
wins in this world tournament.

Choosing the Training Experience:
• The first design choice we face is to
choose the type of training experience
from which our system will learn.
• The type of training experience available
can have a significant impact on success
or failure of the learner.
• One key attribute is whether the training
experience provides direct or indirect
feedback regarding the choices made by
the performance system.

• For example, in learning to play checkers, the
system might learn from direct training examples
consisting of individual checkers board.
• Alternatively, it might have available only indirect
information consisting of the move sequences
and final outcomes of various games played.
• In this later case, information about the
correctness of specific moves early in the game
must be inferred indirectly from the fact that the
game was eventually won or lost.

• the learner faces an additional problem of credit
assignment, or determining the degree to which
each move in the sequence deserves credit or
blame for the final outcome.
• Credit assignment can be a particularly difficult
problem because the game can be lost even
when early moves are optimal, if these are
followed later by poor moves.
• Hence, learning from direct training feedback is
typically easier than learning from indirect
feedback.

• Choosing the Training Experience:
• A second important attribute of the
training experience is the degree to
which the learner controls the
sequence of training examples.
• For example, the learner might rely
on the teacher to select informative
board states and to provide the
correct move for each.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Alternatively, the learner might itself
propose board states that it finds
particularly confusing and ask the teacher
for the correct move.
• Or the learner may have complete control
over both the board states and (indirect)
training classifications, as it does when it
learns by playing against itself with no
teacher present.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• A third important attribute of the training
experience is how well it represents the
distribution of examples over which the final
system performance P must be measured.
• In our checkers learning scenario, the
performance metric P is the percent of games
the system wins in the world tournament.
• If its training experience E consists only of
games played against itself.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• To proceed with our design, let us
decide that our system will train by
playing games against itself.
• This has the advantage that no
external trainer need be present,
• and it therefore allows the system to
generate as much training data as
time permits.

A checkers learning problem
• Task T: playing checkers
• Performance measure P: percent of games won
in the world tournament
• Training experience E: games played against
itself
In order to complete the design of the learning
system, we must now choose:
1. the exact type of knowledge to be learned
2. a representation for this target knowledge
3. a learning mechanismDr. Amit Kumar, Dept of CSE, JUET, Guna

A CHECKERS LEARNING PROBLEM
Choosing the Target Function
• The next design choice is to determine exactly
what type of knowledge will be learned and
how this will be used by the performance
program.
• Let us begin with a checkers-playing program
that can generate the legal moves from any
board state.
• The program needs only to learn how to
choose the best move from among these legal
moves. Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Let us call this function ChooseMove
and use the notation:
ChooseMove : B  M
to indicate that this function accepts
as input any board from the set of
legal board states B and produces as
output some move from the set of
legal moves M.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Although ChooseMove is an obvious choice
for the target function in our example, this
function will turn out to be very difficult to
learn given the kind of indirect training
experience available to our system.
• An alternative target function and one that
will turn out to be easier to learn in this
setting is an evaluation function that
assigns a numerical score to any given
board state. Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Let us call this target function V and
again use the notation V : B  R to
denote that V maps any legal board
state from the set B to some real value
(we use R to denote the set of real
numbers).
• We intend for this target function V to
assign higher scores to better board
states. Dr. Amit Kumar, Dept of CSE, JUET, Guna

• If the system can successfully learn such a
target function V, then it can easily use it
to select the best move from any current
board position.
• This can be accomplished by generating
the successor board state produced by
every legal move, then using V to choose
the best successor state and therefore the
best legal move.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Let us therefore define the target value V(b) for an
arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then
V(b) = 100
2. if b is a final board state that is lost, then
V(b) = -100
3. if b is a final board state that is drawn, then
V(b) = 0
4. if b is a not a final state in the game, then
V(b) = V(b’),
where b' is the best final board state that can be achieved
starting from b and playing optimally until the end of the
game (assuming the opponent plays optimally, as well).

• Thus, we have reduced the learning task in this
case to the problem of discovering an operational
description of the ideal target function V.
• We often expect learning algorithms to acquire
only some approximation to the target function,
and for this reason the process of learning the
target function is often called function
approximation.
• we will use the symbol V* to refer to the function
that is actually learned by our program, to
distinguish it from the ideal target function V.

Choosing a Representation for the
Target Function
• let us choose a simple representation: for any given
board state, the function V* will be calculated as a
linear combination of the following board features:
• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red
(i.e., which can be captured on red's next turn)
• X6: the number of red pieces threatened by black

Choosing a Representation for the
Target Function
• Thus, our learning program will represent V*(b)
as a linear function of the form:
V*(b) = w0+w1x1+w2x2+w3x3+w4x4+w5x5+w6x6
• where w0 through w6 are numerical coefficients,
or weights, to be chosen by the learning
algorithm.
• Learned values for the weights w1 through w6
will determine the relative importance of the
various board features in determining the value
of the board, where as the weight w0 will provide
an additive constant to the board value.

Partial design of a checkers learning program
• Task T: playing checkers
• Performance measure P: percent of games won in
the world tournament
• Training experience E: games played against itself
• Target function:
V : Board  R
• Target function representation:
V*(b) = w0+w1x1+w2x2+w3x3+w4x4+w5x5+w6x6
The first three items above correspond to the
specification of the learning task, whereas the
final two items constitute design choices for the
implementation of the learning program.Dr. Amit Kumar, Dept of CSE, JUET, Guna

Choosing a Function Approximation Algorithm
• In order to learn the target function V* we require a set
of training examples, each describing a specific board
state b and the training value Vtrain(b) for b.
• In other words, each training example is an ordered pair
of the form <b, Vtrain(b)>
• For instance, the following training example describes a
board state b in which black has won the game (note x2
= 0 indicates that red has no remaining pieces)
• and for which the target function value Vtrain(b) is
therefore +100.
<(x1=3, x2 =0, x3=1, x4=0,x5=0x6=6), +100>

ESTIMATING TRAINING VALUES
• Only training information available to our learner is
whether the game was eventually won or lost.
• On the other hand, we require training examples that
assign specific scores to specific board states.
• it is easy to assign a value to board states that
correspond to the end of the game,
• it is less obvious how to assign training values to the
more numerous intermediate board states that occur
before the game's end.
• Of course the fact that the game was eventually won or
lost does not necessarily indicate that every board state
along the game path was necessarily good or bad.

• Despite the ambiguity inherent in estimating training
values for intermediate board states,
• one simple approach has been found to be surprisingly
successful.
• This approach is to assign the training value of Vtrain(b)
for any intermediate board state b to be V*(successor(b))
• where V* is the learner's current approximation to V and
where Successor(b) denotes the next board state
following b for which it is again the program's turn to
move (i.e., the board state following the program‘s move
and the opponent's response).

• This rule for estimating training
values can be summarized as:
Vtrain(b)  V*(Successor(b))
(Rule for estimating training values.)

ADJUSTING THE WEIGHTS
• All that remains is to specify the learning algorithm for
choosing the weights wi to best fit the set of training
examples { <b, Vtrain(b)>}.
• As a first step we must define what we mean by the best
fit to the training data.
• One common approach is to define the best hypothesis,
or set of weights, as that which minimizes the squared
error E between the training values and the values
predicted by the hypothesis V*.

• Several algorithms are known for finding weights of a
linear function that minimize E.
• In our case, we require an algorithm that will
incrementally refine the weights as new training
examples become available and that will be robust to
errors in these estimated training values.
• One such algorithm is called the least mean squares, or
LMS training rule.
• For each observed training example it adjusts the
weights a small amount in the direction that reduces the
error on this training example.

The LMS algorithm is defined as follows:
LMS weight update rule.
For each training example <b, Vtrain(b)>
– Use the current weights to calculate V*(b)
– For each weight wi, update it as –
• Wi  wi + η (Vtrain(b) - V*(b)) xi
• Here η is a small constant (e.g., 0.1) that moderates
the size of the weight update.
• To get an intuitive understanding for why this weight
update rule works, notice that when the error
(Vtrain(b) - V*(b)) is zero, no weights are changed.

• When (Vtrain(b) - V*(b)) is positive (i.e., when
V*(b) is too low), then each weight is increased
in proportion to the value of its corresponding
feature.
• This will raise the value of V*(b), reducing the
error.
• Notice that if the value of some feature xi is
zero, then its weight is not altered regardless of
the error, so that the only weights updated are
those whose features actually occur on the
training example board.Dr. Amit Kumar, Dept of CSE, JUET, Guna

The Final Design
• The final design of our checkers learning
system can be naturally described by four
distinct program modules that represent
the central components in many learning
systems.
–The Performance System
–The Critic
–The Generalizer
–The Experiment Generator

The Final Design

The Final Design
• The Performance System is the module that
must solve the given performance task, in
this case playing checkers, by using the
learned target function(s). It takes an
instance of a new problem (new game) as
input and produces a trace of its solution
(game history) as output.
• The Critic takes as input the history or trace
of the game and produces as output a set of
training examples of the target function.

The Final Design
• The Generalizer takes as input the training
examples and produces an output hypothesis
that is its estimate of the target function.
• The Experiment Generator takes as input the
current hypothesis (currently learned
function) and outputs a new problem (i.e.,
initial board state) for the Performance
System to explore. Its role is to pick new
practice problems that will maximize the
learning rate of the overall system.

Chapter 2
CONCEPT LEARNING AND
THE
GENERAL-TO-SPECIFIC
0RDERING

INTRODUCTION-CONCEPT LEARNING
• The problem of inducing general functions from
specific training examples is central to concept
learning.
• Concept learning: acquiring the definition of a
general category given a sample of positive and
negative training examples of the category.
• Concept learning can be formulated as a
problem of searching through a predefined
space of potential hypotheses for the
hypothesis that best fits the training examples.

• In many cases this search can be efficiently
organized by taking advantage of a naturally
occurring structure over the hypothesis space - a
general-to-specific ordering of hypotheses.
• There are several learning algorithms and
considers situations under which they converge
to the correct hypothesis.
• We also examine the nature of inductive learning
and the justification by which any program may
successfully generalize beyond the observed
training data. Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Much of learning involves acquiring general
concepts from specific training examples.
• People, for example, continually learn
general concepts or categories such as "bird,"
"car," "situations in which I should study
more in order to pass the exam," etc.
• Each such concept can be viewed as
describing some subset of objects or events
defined over a larger set (e.g., the subset of
animals that constitute birds).Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Alternatively, each concept can be thought
of as a boolean-valued function defined
over this larger set (e.g., a function
defined over all animals, whose value is
true for birds and false for other animals).
• In this chapter we consider the problem of
automatically inferring the general
definition of some concept, given
examples labeled as members or non-
members of the concept.

• This task is commonly referred to as
concept learning, or approximating a
boolean-valued function from
examples.
Concept learning: Inferring a boolean-
valued function from training
examples of its input and output.

A CONCEPT LEARNING TASK
• Consider the example task of learning the target
concept "days on which my friend Aldo enjoys
his favorite water sport.“
• Table 2.1 describes a set of example days, each
represented by a set of attributes. The attribute
EnjoySport indicates whether or not Aldo enjoys
his favorite water sport on this day.
• The task is to learn to predict the value of
EnjoySport for an arbitrary day, based on the
values of its other attributes.

• What hypothesis representation shall we
provide to the learner in this case?Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Let us begin by considering a simple
representation in which each
hypothesis consists of a conjunction of
constraints on the instance attributes.
• In particular, let each hypothesis be a
vector of six constraints, specifying
the values of the six attributes Sky,
AirTemp, Humidity, Wind, Water, and
Forecast. Dr. Amit Kumar, Dept of CSE, JUET, Guna

• For each attribute, the hypothesis will either-
– indicate by a "?' that any value is
acceptable for this attribute,
–specify a single required value (e.g., Warm,
cold for AirTemp) for the attribute, or
–indicate by a "Ø" that no value is
acceptable.
• If some instance x satisfies all the constraints
of hypothesis h, then h classifies x as a
positive example (h(x) = 1).Dr. Amit Kumar, Dept of CSE, JUET, Guna

• To illustrate, the hypothesis that
Aldo enjoys his favorite sport only
on cold days with high humidity
(independent of the values of the
other attributes) is represented by
the expression:
<?, Cold, High, ?, ?, ?>

• The most general hypothesis-that
every day is a positive example-is
represented by
<?, ?, ?, ?, ?, ?>
• and the most specific possible
hypothesis-that no day is a positive
example-is represented by
<Ø, Ø, Ø, Ø, Ø, Ø>

A CONCEPT LEARNING TASK- Notation
• The set of items over which the concept is
defined is called the set of instances,
which we denote by X.
• In the current example, X is the set of all
possible days, each represented by the
attributes Sky, AirTemp, Humidity, Wind,
Water, and Forecast.
• The concept or function to be learned is
called the target concept, which we
denote by c. Dr. Amit Kumar, Dept of CSE, JUET, Guna

A CONCEPT LEARNING TASK- Notation
• In general, c can be any boolean
valued function defined over the
instances X; that is, c : X  {0,1}.
• In the current example, the target
concept corresponds to the value of
the attribute EnjoySport (i.e., c(x) = 1 if
EnjoySport = Yes, and c(x) = 0 if
EnjoySport = No).

The definition of the EnjoySport concept
learning task

The Inductive Learning Hypothesis
• Although the learning task is to determine
a hypothesis h identical to the target concept
c over the entire set of instances X, the only
information available about c is its value over
the training examples.
• Therefore, inductive learning algorithms
can at best guarantee that the output
hypothesis fits the target concept over the
training data.

The Inductive Learning Hypothesis
•Lacking any further information, our assumption
is that the best hypothesis regarding unseen
instances is the hypothesis that best fits the
observed training data.
•This is the fundamental assumption of inductive
Learning.
The inductive learning hypothesis. Any hypothesis
found to approximate the target function well over
a sufficiently large set of training examples will
also approximate the target function well over
other unobserved examples.

CONCEPT LEARNING AS SEARCH
• Concept learning can be viewed as the task of
searching through a large space of hypotheses
implicitly defined by the hypothesis
representation.
• The goal of this search is to find the hypothesis
that best fits the training examples.
• It is important to note that by selecting a
hypothesis representation, the designer of the
learning algorithm implicitly defines the space of
all hypotheses that the program can ever
represent and therefore can ever learn.

• Consider, for example, the instances X and
hypotheses H in the EnjoySport learning task.
Given that the attribute Sky has three
possible values, and that AirTemp, Humidity,
Wind, Water, and Forecast each have two
possible values, the instance space X contains
exactly 3.2.2.2.2.2 = 96 distinct instances.
• A similar calculation shows that there are
5.4.4.4.4.4 = 5120 syntactically distinct
hypotheses within H.

• Notice, however, that every hypothesis
containing one or more " Ø " symbols represents
the empty set of instances; that is, it classifies
every instance as negative.
• Therefore, the number of semantically distinct
hypotheses is only 1+(4.3.3.3.3.3) = 973.
• EnjoySport example is a very simple learning
task, with a relatively small, finite hypothesis
space.
• Most practical learning tasks involve much larger,
sometimes infinite, hypothesis spaces.

General-to-Specific Ordering of Hypotheses
• To illustrate the general-to-specific ordering,
consider the two hypotheses
h1 = <Sunny, ?, ?, Strong, ?, ?>
h2 = <Sunny, ?, ?, ?, ?, ?>
• Any instance classified positive by hl will also
be classified positive by h2. Therefore, we say
that h2 is more general than h1.

Relation between instances and hypotheses

FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
* For each attribute constraint ai in h
If the constraint ai is satisfied by x
Then do nothing
Else replace ai in h by the next more general
constraint that is satisfied by x
3. Output hypothesis h
• To illustrate this algorithm, assume the learner is
given the sequence of training examples from Table
2.1 for the EnjoySport task. The first step of FIND-S
is to initialize h to the most specific hypothesis in H.

• h <Ø, Ø, Ø, Ø, Ø, Ø>
• Upon observing the first training example from Table
2.1, which happens to be a positive example, it
becomes clear that our hypothesis is too specific.
h <Sunny, Warm, Normal, Strong, Warm, Same>
• This h is still very specific.
• Next, the second training example (also positive in
this case) forces the algorithm to further generalize h,
this time substituting a "?' in place of any attribute
value in h that is not satisfied by the new example.
The refined hypothesis in this case is
h <Sunny, Warm, ?, Strong, Warm, Same>

•Upon encountering the third training example-in this
case a negative example- the algorithm makes no
change to h. In fact, the FIND-S algorithm simply
ignores every negative example.
• To complete our trace of FIND-S, the fourth(positive)
example leads to a further generalization of h.
h <Sunny, Warm, ?, Strong, ?, ?>
•The FIND-S algorithm illustrates one way in which
the more-general-than partial ordering can be used to
organize the search for an acceptable hypothesis.
•The search moves from hypothesis to hypothesis,
searching from the most specific to progressively
more general hypotheses along one chain of the
partial ordering. Dr. Amit Kumar, Dept of CSE, JUET, Guna

VERSION SPACES AND THE CANDIDATE-ELIMINATION
ALGORITHM
• The key idea in the CANDIDATE-ELIMINATlON
algorithm is to output a description of the set of all
hypotheses consistent with the training examples.
•The CANDIDATE-ELIMINATlON algorithm has been
applied to problems such as learning regularities in
chemical mass spectroscopy and learning control
rules for heuristic search.
•CANDIDATE-ELIMINATlON algorithm computes the
description of data set without explicitly enumerating
all of its members.
• This is accomplished by again using the more-
general-than partial ordering.

ALGORITHM
• The CANDIDATE-ELIMINATlON algorithm finds all
describable hypotheses that are consistent with the
observed training examples.
• In order to define this algorithm precisely, we begin
with a few basic definitions.
• First, let us say that a hypothesis is consistent with
the training examples if it correctly classifies these
examples.

ALGORITHM
• The CANDIDATE-ELIMINATlON algorithm
represents the set of all hypotheses consistent with
the observed training examples.
• This subset of all hypotheses is called the version
space with respect to the hypothesis space H and the
training examples D, because it contains all plausible
versions of the target concept.

The LIST-THEN-ELIMINATION Algorithm
•The LIST-THEN-ELIMINATION Algorithm first
initializes the version space to contain all hypotheses
in H,
• Then eliminates any hypothesis found inconsistent
with any training example.

General and Specific Boundary

VERSION SPACE THEORM

This is the version space with its general and specific boundary sets. The
version space includes all six hypotheses shown here, but can be represented
more simply by S and G. Arrows indicate instances of the more-general-than
relation. This is the version space for the Enjoysport concept learning problem
and training examples described in Table 2.1.Dr. Amit Kumar, Dept of CSE, JUET, Guna

Chapter 3
DECISION TREE
LEARNING

INTRODUCTION
• Decision tree learning is one of the most
widely used and practical methods for
inductive inference.
• It is a method for approximating discrete-
valued functions that is robust to noisy data
and capable of learning disjunctive
expressions.
• This chapter describes a family of decision
tree learning algorithms that includes widely
used algorithms.Dr. Amit Kumar, Dept of CSE, JUET, Guna

INTRODUCTION
• Decision tree learning algorithms such as
ID3, ASSISTANT, and C4.5 are mostly used.
• These methods search a completely
expressive hypothesis space and thus avoid
the difficulties of restricted hypothesis
spaces.
• Their inductive bias is a preference for small
trees over large trees.
• Learned trees can also be re-represented as
sets of if-then rules to improve human
readability.

DECISION TREE REPRESENTATION
• Decision trees classify instances by
sorting them down the tree from the
root to some leaf node, which provides
the classification of the instance.
• Each node in the tree specifies a test of
some attribute of the instance, and each
branch descending from that node
corresponds to one of the possible
values for this attribute.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• An instance is classified by starting at
the root node of the tree, testing the
attribute specified by this node, then
moving down the tree branch
corresponding to the value of the
attribute in the given example.

APPROPRIATE PROBLEMS FOR DECISION TREE
LEARNING
• Decision tree learning is generally best suited to
problems with the following characteristics:
• Instances are represented by attribute-value
pairs.
• The target function has discrete output values.
• Disjunctive descriptions may be required.
• The training data may contain errors.
• The training data may contain missing attribute
values. Dr. Amit Kumar, Dept of CSE, JUET, Guna

THE BASIC DECISION TREE LEARNING ALGORITHM
• Most algorithms that have been
developed for learning decision trees are
variations on a core algorithm that
employs a top-down, greedy search
through the space of possible decision
trees.
• This approach is exemplified by the ID3
algorithm (Quinlan 1986) and its
successor C4.5 (Quinlan 1993).Dr. Amit Kumar, Dept of CSE, JUET, Guna

THE BASIC DECISION TREE LEARNING ALGORITHM
• Algorithm, ID3, learns decision trees by
constructing them topdown, beginning with
the question "which attribute should be
tested at the root of the tree?”
• To answer this question, each instance
attribute is evaluated using a statistical test
to determine how well it alone classifies the
training examples.
• The best attribute is selected and used as the
test at the root node of the tree.Dr. Amit Kumar, Dept of CSE, JUET, Guna

THE ID3 LEARNING ALGORITHM

Which Attribute Is the Best Classifier?
• The central choice in the ID3 algorithm is
selecting which attribute to test at each node
in the tree.
• We will define a statistical property, called
information gain, that measures how well a
given attribute separates the training
examples according to their target
classification.
• ID3 uses this information gain measure to
select among the candidate attributes at
each step while growing the tree.Dr. Amit Kumar, Dept of CSE, JUET, Guna

ENTROPY MEASURES HOMOGENEITY OF EXAMPLES
• In order to define information gain precisely,
we begin by defining a measure commonly
used in information theory, called entropy,
that characterizes the (im)purity of an
arbitrary collection of examples.
• Given a collection S, containing positive and
negative examples of some target concept,
the entropy of S relative to this boolean
classification is:

• where p is the proportion of positive
examples in S and pΘ is the proportion of
negative examples in S.
• In all calculations involving entropy we define
0 log 0 to be 0.
• To illustrate, suppose S is a collection of 14
examples of some boolean concept,
including 9 positive and 5 negative examples
(we adopt the notation [9+, 5-] to summarize
such a sample of data). Then the entropy of S
relative to this boolean classification is:Dr. Amit Kumar, Dept of CSE, JUET, Guna

Entropy ([9+,5-]) = - (9/14) log2 (9/14)
- (5/14) log2 (5/14) = 0.940
• Notice that the entropy is 0 if all members of
S belong to the same class.
• Note the entropy is 1 when the collection
contains an equal number of positive and
negative examples.
• If the collection contains unequal numbers of
positive and negative examples, the entropy
is between 0 and 1.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• More generally, if the target attribute can take on
c different values, then the entropy of S relative to
this c-wise classification is defined as –
where pi is the proportion of S belonging to class i.
Note the logarithm is still base 2 because entropy is
a measure of the expected encoding length
measured in bits.
• Note also that if the target attribute can take on c
possible values, the entropy can be as large as log,
c. Dr. Amit Kumar, Dept of CSE, JUET, Guna

INFORMATION GAIN MEASURES THE
EXPECTED REDUCTION IN ENTROPY
• Given entropy as a measure of the impurity
in a collection of training examples,
• we can now define a measure of the
effectiveness of an attribute in classifying the
training data.
• The measure we will use, called information
gain, is simply the expected reduction in
entropy caused by partitioning the examples
according to this attribute.

• More precisely, the information gain, Gain(S,
A) of an attribute A, relative to a collection of
examples S, is defined as:
where Values(A) is the set of all possible
values for attribute A, and Sv is the subset
of S for which attribute A has value v .

• Note the first term in Equation is just the
entropy of the original collection S, and the
second term is the expected value of the
entropy after S is partitioned using attribute
A.
• The expected entropy described by this
second term is simply the sum of the
entropies of each subset Sv weighted by the
fraction of examples that belong to Sv.

• Gain(S,A) is therefore the expected
reduction in entropy caused by knowing the
value of attribute A. Put another way, Gain(S,
A) is the information provided about the
target & action value, given the value of
some other attribute A.
• The value of Gain(S, A) is the number of bits
saved when encoding the target value of an
arbitrary member of S, by knowing the value
of attribute A. Dr. Amit Kumar, Dept of CSE, JUET, Guna

• For example, suppose S is a collection of
training-example days described by
attributes including Wind, which can have
the values Weak or Strong.
• As before, assume S is a collection containing
14 examples, [9+, 5-].
• Of these 14 examples, suppose 6 of the
positive and 2 of the negative examples have
Wind = Weak, and the remainder have Wind
= Strong. Dr. Amit Kumar, Dept of CSE, JUET, Guna

Training Examples for the target concept
PalyTennis

• The information gain due to sorting the original 14
examples by the attribute Wind may then be calculated
as:

INFORMATION GAIN MEASURES

• ID3 determines the information gain for each
candidate attribute (i.e., Outlook, Temperature,
Humidity, and Wind), then selects the one with
highest information gain. The computation of
information gain for two of these attributes is
shown in Figure (last slide). The information gain
values for all four attributes are:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029Dr. Amit Kumar, Dept of CSE, JUET, Guna

• According to the information gain measure, the
Outlook attribute provides the best prediction of
the target attribute, PlayTennis over the training
examples.
• Therefore, Outlook is selected as the decision
attribute for the root node, and branches are
created below the root for each of its possible
values (i.e., Sunny, Overcast, and Rain).
• The partially learned decision tree resulting from
the first step of ID3 is shown on next slide.

HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
• As with other inductive learning methods, ID3
can be characterized as searching a space of
hypotheses for one that fits the training
examples.
• The hypothesis space searched by ID3 is the set
of possible decision trees.
• ID3 performs a simple-to-complex, hill-climbing
search through this hypothesis space, beginning
with the empty tree, then considering
progressively more elaborate hypotheses in
search of a decision tree that correctly classifies
the training data.Dr. Amit Kumar, Dept of CSE, JUET, Guna

By viewing ID3 in terms of its search space and
search strategy, we can get some insight into its
capabilities and limitations.
• ID3’s hypothesis space of all decision trees is a
complete space of finite discrete-valued
functions, relative to the available attributes.
Because every finite discrete-valued function can
be represented by some decision tree, ID3 avoids
one of the major risks of methods that search
incomplete hypothesis spaces, that the
hypothesis space might not contain the target
function.

• ID3 maintains only a single current
hypothesis as it searches through the space
of decision trees. This contrasts, for
example, with the earlier version space
candidate-Elimination method, which
maintains the set of all hypotheses
consistent with the available training
examples. By determining only a single
hypothesis, ID3 loses the capabilities that
follow from explicitly representing all
consistent hypotheses.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• ID3 in its pure form performs no backtracking in
its search. Once it, selects an attribute to test at
a particular level in the tree, it never backtracks
to reconsider this choice. Therefore, it is
susceptible to the usual risks of hill-climbing
search without backtracking: converging to
locally optimal solutions that are not globally
optimal. In the case of ID3, a locally optimal
solution corresponds to the decision tree it
selects along the single search path it explores.

• ID3 uses all training examples at each step in the
search to make statistically based decisions
regarding how to refine its current hypothesis.
This contrasts with methods that make decisions
incrementally, based on individual training
examples (e.g., FIND-S or CANDIDATE-
ELIMINATOIN ). One advantage of using
statistical properties of all the examples (e.g.,
information gain) is that the resulting search is
much less sensitive to errors in individual
training examples.

• ID3 can be easily extended to handle noisy
training data by modifying its termination
criterion to accept hypotheses that imperfectly
fit the training data.

INDUCTIVE BIAS IN DECISION TREE LEARNING
• Inductive bias is the set of assumptions that,
together with the training data, deductively
justify the classifications assigned by the learner
to future instances.
• Given a collection of training examples, there are
typically many decision trees consistent with
these examples.
• Describing the inductive bias of ID3 therefore
consists of describing the basis by which it
chooses one of these consistent hypotheses over
the others.

• Which of these decision trees does ID3 choose?
• It chooses the first acceptable tree it encounters
in its simple-to-complex, hill climbing search
through the space of possible trees.
• Roughly speaking, then, the ID3 search strategy
(a) selects in favor of shorter trees over longer
ones, and
(b) selects trees that place the attributes with
highest information gain closest to the root.

• Because of the subtle interaction
between the attribute selection
heuristic used by ID3 and the particular
training examples it encounters, it is
difficult to characterize precisely the
inductive bias exhibited by ID3.
• However, we can approximately
characterize its bias as a preference for
short decision trees over complex
trees.

• Approximate inductive bias of ID3:
Shorter trees are preferred over larger
trees.
• A closer approximation to the
inductive bias of ID3: Shorter trees are
preferred over longer trees. Trees that
place high information gain attributes
close to the root are preferred over
those that do not.

Why Prefer Short Hypotheses?
• Is ID3's inductive bias favoring shorter decision
trees a sound basis for generalizing beyond the
training data?
• Philosophers and others have debated this
question for centuries, and the debate remains
unresolved to this day.
• William of Occam was one of the first to discusst
the question, around the year 1320, so this bias
often goes by the name of Occam's razor.
• Occam's razor: Prefer the simplest hypothesis
that fits the data.

ISSUES IN DECISION TREE LEARNING?
• Practical issues in learning decision trees include:
• Determining how deeply to grow the decision
tree, (Avoiding Overfitting the Data, Reduced
Error Pruning, Rule Post-pruning)
• Handling continuous attributes,
• Choosing an appropriate attribute selection
measure,
• Handling training data with missing attribute
values,
• Handling attributes with differing costs, and
• Improving computational efficiency.Dr. Amit Kumar, Dept of CSE, JUET, Guna

Chapter 4
ARTIFICIAL
NEURAL
NETWORKS

INTRODUCTION
• Artificial neural networks (ANNs) provide a
general, practical method for learning real-
valued, discrete-valued, and vector-valued
functions from examples.
• Algorithms such as BACKPROPAGATION, gradient
descent to tune network parameters to best fit a
training set of input-output pairs.
• ANN learning is robust to errors in the training
data and has been successfully applied to
problems such as interpreting visual scenes,
speech recognition, and learning robot control
strategies. Dr. Amit Kumar, Dept of CSE, JUET, Guna

Biological Motivation
• The study of artificial neural networks (ANNs)
has been inspired in part by the observation
that biological learning systems are built of
very complex webs of interconnected
neurons.
• The human brain is estimated to contain a
densely interconnected network of
approximately 1011 neurons, each connected,
on average, to 104 others.
• Neuron activity is typically excited or inhibited
through connections to other neurons.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• The fastest neuron switching times are
known to be on the order of 10-3 seconds-
quite slow compared to computer switching
speeds of 10-10 seconds.
• Yet humans are able to make surprisingly
complex decisions, surprisingly quickly.
• For example, it requires approximately 10-1
seconds to visually recognize your mother.

• This observation has led many to speculate
that the information-processing abilities of
biological neural systems must follow from
highly parallel processes operating on
representations that are distributed over
many neurons.
• One motivation for ANN systems is to
capture this kind of highly parallel
computation based on distributed
representations.

• Most ANN software runs on sequential
machines emulating distributed processes,
although faster versions of the algorithms
have also been implemented on highly
parallel machines and on specialized
hardware designed specifically for ANN
applications.
• While ANNs are loosely motivated by
biological neural systems, there are many
complexities to biological neural systems that
are not modeled by ANNs.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Historically, two groups of researchers have
worked with artificial neural networks.
• One group has been motivated by the goal of
using ANNs to study and model biological
learning processes.
• A second group has been motivated by the
goal of obtaining highly effective machine
learning algorithms, independent of whether
these algorithms mirror biological processes.

NEURAL NETWORK - Research History
• McCulloch and Pitts (1943) are generally recognized
as the designers of the first neural network.
• They combined many simple processing units
together that could lead to an overall increase in
computational power.
• They suggested many ideas like : a neuron has a
threshold level and once that level is reached the
neuron fires.
• It is still the fundamental way in which ANNs
operate. The McCulloch and Pitts's network had a
fixed set of weights.

• Hebb (1949) developed the first learning rule, that
is if two neurons are active at the same time then
the strength between them should be increased.
• In the 1950 and 60's, many researchers (Block,
Minsky, Papert, and Rosenblatt worked on
perceptron.
• The neural network model could be proved to
converge to the correct weights, that will solve the
problem.
• The weight adjustment (learning algorithm) used in
the perceptron was found more powerful than the
learning rules used by Hebb.Dr. Amit Kumar, Dept of CSE, JUET, Guna

• Minsky & Papert (1969) showed that perceptron
could not learn those functions which are not
linearly separable.
• The neural networks research declined throughout
the 1970 and until mid 80's because the perceptron
could not learn certain important functions.
• Neural network regained importance in 1985-86.
The researchers, Parker and LeCun discovered a
learning algorithm for multi-layer networks called
back propagation that could solve problems that
were not linearly separable.

Artificial Neuron Model
• An artificial neuron is a mathematical function
conceived as a simple model of a real (biological)
neuron.
• The McCulloch-Pitts Neuron: This is a simplified
model of real neurons, known as a Threshold Logic
Unit.

PERCEPTRONS
• One type of ANN system is based on a unit called a
perceptron. A perceptron takes a vector of real-
valued inputs, calculates a linear combination of
these inputs, then outputs a 1 if the result is
greater than some threshold and -1 otherwise.

PERCEPTRONS
• More precisely, given inputs x1 through xn the
output o(x1, . . . , xn) computed by the perceptron
is:
• where each wi is a real-valued constant, or weight,
that determines the contribution of input xi to the
perceptron output.
• To simplify notation, we imagine an additional
constant input x0 = 1, allowing us to write the
above inequality as > 0

PERCEPTRONS
• In vector form as For brevity, we will
sometimes write the perceptron function as :
• Where
• Learning a perceptron involves choosing values for
the weights w0, . . . , wn.
• Therefore, the space H of candidate hypotheses
considered in perceptron learning is the set of all
possible real-valued weight vectors.

Representational Power of Perceptrons
• We can view the perceptron as representing a
hyperplane decision surface in the n-dimensional
space of instances (i.e., points).
• The perceptron outputs a 1 for instances lying on
one side of the hyperplane and outputs a -1 for
instances lying on the other side, as illustrated in
Figure The decision surface represented by a
two-input perceptron.
(a) A set of training examples and the
decision surface of a perceptron that
classifies them correctly.
(b) A set of training examples that is
not linearly separable (i.e., that cannot
be correctly classified by any straight
line). xl and x2 are the Perceptron
inputs.Dr. Amit Kumar, Dept of CSE, JUET, Guna

Representational Power of Perceptrons
• To implement the AND function is to set the
weights w0 = -0.8 , and w1 = w2 = 0.5.
• This perceptron can be made to represent the OR
function instead by altering the threshold to
w0 = - 0.3.
• In fact, AND and OR can be viewed as special cases
of m-of-n functions: that is, functions where at
least m of the n inputs to the perceptron must be
true.

The Perceptron Training Rule
• Several algorithms are known to solve
learning problem.
• The perceptron rule and the delta rule .
These two algorithms are guaranteed to
converge to somewhat different acceptable
hypotheses, under somewhat different
conditions.
• They are important to ANNs because they
provide the basis for learning networks of
many units.

• One way to learn an acceptable weight
vector is to begin with random weights, then
iteratively apply the perceptron to each
training example, modifying the perceptron
weights whenever it misclassifies an
example.
• This process is repeated, iterating through
the training examples as many times as
needed until the perceptron classifies all
training examples correctly.

• Weights are modified at each step according to the
perceptron training rule, which revises the weight
wi associated with input xi according to the rule –
Where
t = target output, o = observed output, η = learning
rate. ∆ wi = weight update

Gradient descent algorithm for training a
linear unit.

MULTILAYER NETWORKS AND THE BACKPROPAGATION
ALGORITHM

Machine learning

More Related Content

Machine learning