Key Ideas in Machine Learning
Machine Learning, Copyright © 2017 Tom M. Mitchell. All rights reserved.
*DRAFT OF December 4, 2017*
This is a rough draft chapter intended for inclusion in the upcoming second edi-
tion of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are
welcome to use this for educational purposes, but do not duplicate or repost it
on the internet. For online copies of this and other materials related to this book,
visit the web site www.cs.cmu.edu/~tom/mlbook.html.
Please send suggestions for improvements, or suggested exercises, to
Tom.Mitchell@cmu.edu.
1 Introduction
Machine learning is a discipline focused on two inter-related questions: “How can
one construct computer systems that automatically improve through experience?”
and “What are the fundamental theoretical laws that govern every learning system,
regardless of whether it is implemented in computers, humans or organizations?”
The study of machine learning is important both for addressing these fundamental
scientific and engineering questions, and for the highly practical computer soft-
ware it has produced and fielded across many applications.
Machine learning covers a diverse set of learning tasks, from learning to clas-
sify emails as spam, to learning to recognize faces in images, to learning to control
robots to achieve targeted goals. Each machine learning problem can be precisely
defined as the problem of improving some measure of performance P when ex-
ecuting some task T, through some type of training experience E. For example,
in learning an email spam filter the task T is to learn a function that maps from
any given input email to an output label of spam or not-spam. The performance
metric P to be improved might be defined as the accuracy of this spam classifier,
and the training experience E might consist of a collection of emails, each labeled
as spam or not. Alternatively, one might define a different performance metric P
that assigns a higher penalty when non-spam is labeled spam than when spam
is labeled non-spam. One might also define a different type of training experi-
ence, for example by including unlabeled emails along with those labeled as spam
and not-spam. Once the three components ⟨T, P, E⟩ have been specified fully, the
learning problem is well defined.
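To make the role of the performance metric P concrete, the brief sketch below (illustrative code, not from the text; the 5:1 penalty ratio is an arbitrary assumption) contrasts plain accuracy with the asymmetric-penalty metric just described, where y = 1 denotes spam:

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Performance metric P as plain classification accuracy.
    return np.mean(y_true == y_pred)

def asymmetric_cost(y_true, y_pred, fp_penalty=5.0, fn_penalty=1.0):
    # Alternative P: penalize labeling non-spam as spam (a false positive)
    # more heavily than letting spam through (a false negative).
    # The 5:1 penalty ratio is an arbitrary illustrative assumption.
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return fp_penalty * fp + fn_penalty * fn
```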
2 Key Concepts
This semester we examined many specific machine learning problems, applica-
tions, algorithms, and theoretical results. Below are some of the key overarching
concepts that emerge from this examination.
• Machine learning as probabilistic inference. A second perspective is that
machine learning tasks often involve probabilistic inference of
the learned model from the training data and prior probabilities. In fact, the
two primary principles for deriving learning algorithms are the probabilistic
principles of Maximum Likelihood Estimation (in which the learner seeks
the hypothesis that makes the observed training data most probable), and
Maximum a Posteriori Probability (MAP) estimation (in which the learner
seeks the most probable hypothesis, given the training data plus a prior prob-
ability distribution over possible hypotheses). In some cases, the learned
hypothesis (i.e., model) may itself contain explicit probabilities (e.g., the
learned parameters in a Naive Bayes classifier correspond to estimates of
specific probabilities). In other cases, even though the model parameters do
not correspond to specific probabilities (e.g., a trained neural network), we
may still find it useful to view the training algorithm as performing probabilistic
inference to find the Maximum Likelihood or Maximum a Posteriori
network parameter values. Note that this perspective, in which machine
learning algorithms perform probabilistic inference, is fully compatible
with the above perspective that machine learning algorithms solve
an optimization problem. In most cases, deriving a learning algo-
rithm based on the MLE or MAP principle involves first defining an objec-
tive function in terms of the parameters of the hypotheses and the training
data, then applying an optimization algorithm to solve for the hypothesis
parameter values that maximize or minimize this objective.
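As a concrete illustration of the MLE versus MAP distinction, the following minimal sketch (illustrative code, not from the text) estimates the probability theta that a coin lands heads; the Beta(2, 2) prior used for the MAP estimate is an arbitrary assumption:

```python
import numpy as np

def mle_bernoulli(flips):
    # MLE: the theta that makes the observed flips most probable.
    # For Bernoulli data this maximizer has a closed form: the sample mean.
    return np.mean(flips)

def map_bernoulli(flips, alpha=2.0, beta=2.0):
    # MAP: the most probable theta given the flips AND a Beta(alpha, beta)
    # prior over theta.  The prior acts like (alpha - 1) imaginary heads and
    # (beta - 1) imaginary tails added to the observed data.
    heads = np.sum(flips)
    n = len(flips)
    return (heads + alpha - 1) / (n + alpha + beta - 2)

flips = np.array([1, 1, 1, 0, 1])   # 4 heads, 1 tail
print(mle_bernoulli(flips))         # 0.8    -- fits the data alone
print(map_bernoulli(flips))         # 0.714  -- pulled toward the prior mean of 0.5
```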
organism” may itself change over time, as the environment of the organism
and its set of competitors evolve as well.
possible decision trees in H is the one corresponding to the target function
being taught by the trainer. Although we use decision trees as an example
here, the argument holds for any learning algorithm.
by adding a penalty to the learning objective that penalizes the magnitude
of learned parameter values (e.g., in L1 and L2 regularization), providing a
bias in which the learner prefers simpler hypotheses. This increase in bias
typically reduces the sensitivity of the learning algorithm to variance in the
observed training examples. In many cases, regularization is equivalent to
placing a prior probability distribution on the values of the parameters to
be learned, then deriving their MAP estimates (e.g., L2 regularization cor-
responds to a zero mean Gaussian prior, whereas L1 corresponds to a zero
mean Laplace prior).
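To make this concrete, here is a brief sketch (illustrative only, with arbitrary synthetic data) of L2-regularized linear regression, whose closed-form solution is exactly the MAP estimate of the weights under a zero mean Gaussian prior; a larger penalty weight corresponds to a tighter prior and smaller learned parameters:

```python
import numpy as np

# Minimizing  sum_i (y_i - w . x_i)^2  +  lam * ||w||^2  (L2 regularization)
# is equivalent to MAP estimation of w under a zero mean Gaussian prior.

def ridge_fit(X, y, lam):
    # Closed-form minimizer: w = (X^T X + lam I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 3.0]) + 0.1 * rng.normal(size=20)

for lam in [0.0, 1.0, 100.0]:
    w = ridge_fit(X, y, lam)
    # The norm of the learned weights shrinks as lam (the prior) tightens.
    print(lam, np.round(np.linalg.norm(w), 3))
```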
units are interconnected to perform a larger computation, and where learn-
ing involves simultaneously training the parameters of all units in the net-
work. Networks containing millions of learned parameters can be trained
using gradient descent methods, often with the help of specialized GPU
hardware. One important development in recent years is the growing use
of a variety of types of units, such as rectified linear units (ReLUs), and units
that contain memory, such as Long Short-Term Memory (LSTM) units. A
second important development is the invention of specific architectures for
specific families of problems, such as sequence-to-sequence architectures
used for machine translation and other types of sequential data, and convo-
lutional network architectures for problems such as image classification and
speech recognition, where the architecture provides outputs that are invari-
ant to translations of the network inputs (e.g., to recognize the same object in
different positions in the input image, or the same speech sound at different
positions in time). An important capability of deep networks is their abil-
ity to learn re-representations of the input data at different hidden layers in
the network. The ability to learn such representations has led, for example,
to networks capable of assigning text captions to input images, based on a
learned internal representation of the image content.
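The following toy sketch (illustrative only; the XOR task and all hyperparameters are assumptions) shows the core recipe described above: a network of ReLU units whose parameters are all trained simultaneously by gradient descent on a squared-error objective:

```python
import numpy as np

# One-hidden-layer network with rectified linear (ReLU) units, trained
# jointly by gradient descent.  Toy problem: learn the XOR function.
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
lr = 0.1

for step in range(5000):
    # Forward pass: hidden ReLU layer, then a linear output unit.
    h_pre = X @ W1 + b1
    h = np.maximum(h_pre, 0.0)
    pred = h @ W2 + b2
    # Backward pass: gradients of mean squared error w.r.t. every parameter.
    grad_pred = 2 * (pred - y) / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T
    grad_h_pre = grad_h * (h_pre > 0)      # ReLU derivative
    grad_W1 = X.T @ grad_h_pre
    grad_b1 = grad_h_pre.sum(axis=0)
    # Simultaneous gradient descent update of all units' parameters.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print(np.round(pred.ravel(), 2))   # should approach [0, 1, 1, 0]
```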
ter each example appears, a weighted vote is taken to produce an ensemble
prediction, and weights of individual ensemble members are adjusted once
the correct label is revealed. Interestingly, it can be proven that the num-
ber of mistakes made by this weighted vote, over the entire sequence of
examples, is only a small multiple of the number of mistakes made by the
best predictor in the ensemble, plus a term which grows only as the log of
the number of members of the ensemble. A second algorithm, called Ad-
aBoost, goes further, by learning both the voting weights and the hypotheses
themselves. It operates by training a sequence of distinct hypotheses from
a single set of training examples, by reweighting the training examples to
focus at each iteration on the examples that were previously misclassified.
PAC-style theoretical results bound the degree of overfitting for AdaBoost
based on the VC dimension (complexity) of the hypothesis space used by
the base learner. Boosting algorithms that learn ensembles of short decision
trees (decision forests) are among the most popular classification learning
methods in practice.
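As an illustration of the AdaBoost loop described above, here is a minimal sketch (not a canonical implementation; the decision-stump base learner and toy data are assumptions) showing both the example reweighting and the learned voting weights:

```python
import numpy as np

def best_stump(X, y, w):
    # Exhaustively pick the (feature, threshold, sign) one-level decision
    # tree ("stump") with the lowest weighted error under example weights w.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] >= thr, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, rounds=10):
    # Labels y are +1/-1.  Start with uniform example weights.
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = best_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # voting weight of this stump
        pred = sign * np.where(X[:, j] >= thr, 1, -1)
        # Upweight the examples this stump got wrong, downweight the rest.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    votes = sum(alpha * sign * np.where(X[:, j] >= thr, 1, -1)
                for alpha, j, thr, sign in ensemble)
    return np.sign(votes)

X = np.array([[1.], [2.], [3.], [4.]])
y = np.array([-1, -1, 1, 1])
print(predict(adaboost(X, y, rounds=5), X))   # [-1. -1.  1.  1.]
```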
train a model that maps data points from S into a lower dimensional space
in a way that allows reconstructing the original d-dimensional data as ac-
curately as possible. This can be accomplished via several methods, in-
cluding training a neural network with a low dimensional hidden layer to
output the same data point it is given as input, or factoring the original data
matrix S into the product of two other matrices that share a lower dimen-
sional inner dimension. Principal Components Analysis (PCA) learns a
linear re-representation of the input data in terms of an orthogonal basis
whose top k dimensions give the best possible linear reconstruction of the
original data. Independent Components Analysis (ICA) also learns a linear
re-representation of the input data, but one where the coordinates of the
transformed data are statistically independent. Another approach, this one
probabilistic, is to represent the data as being generated by a probability dis-
tribution conditioned on hidden variables whose values constitute the new
representation, as in mixture of Gaussians models, or a mixture of latent top-
ics using Latent Dirichlet Allocation. In addition to these unsupervised ap-
proaches, supervised methods can be employed to learn re-representations
of the data useful for specific classification or regression problems, rather
than to minimize the reconstruction error of the original data. Supervised
training of neural networks with hidden layers performs exactly this func-
tion, learning re-representations of the input data at its hidden layers, where
the hidden layer representations are optimized to maximize the accuracy of
neural network outputs.
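For instance, here is a short sketch (illustrative, using synthetic low-rank data) of PCA computed via the singular value decomposition of the centered data matrix, together with the resulting encode/decode maps:

```python
import numpy as np

def pca_fit(S, k):
    # Return the mean and the top-k principal directions of S (n x d).
    mu = S.mean(axis=0)
    # Rows of Vt are the orthogonal principal directions of the centered
    # data, ordered by the variance they explain.
    _, _, Vt = np.linalg.svd(S - mu, full_matrices=False)
    return mu, Vt[:k]

def pca_encode(S, mu, basis):
    return (S - mu) @ basis.T      # new k-dimensional representation

def pca_decode(Z, mu, basis):
    return Z @ basis + mu          # best linear reconstruction in d dimensions

rng = np.random.default_rng(0)
# Synthetic data: nearly rank-1, embedded in 5 dimensions with small noise.
S = rng.normal(size=(100, 1)) @ rng.normal(size=(1, 5)) + 0.05 * rng.normal(size=(100, 5))
mu, basis = pca_fit(S, k=1)
Z = pca_encode(S, mu, basis)
print(np.round(np.abs(S - pca_decode(Z, mu, basis)).mean(), 3))  # small reconstruction error
```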
• Distant rewards and reinforcement learning. In standard supervised func-
tion approximation we wish to learn some target function f : X → Y from
labeled training examples corresponding to input-output pairs ⟨x(i), y(i)⟩ of
f. However, in some applications, such as learning to play Chess or Go,
the training experience can be much different. In such games we wish to
learn a function from the current game state to the move we should make.
However, the training feedback signal is not provided until the game ends,
when we discover whether we have won or lost. To handle this kind of de-
layed feedback, reinforcement learning algorithms can be used, which are
based on a probabilistic decision theoretic formalism called Markov Deci-
sion Processes. In cases where the learner can simulate the effects of each
action (e.g., of any game move), algorithms such as value iteration can be
used, which employ a dynamic programming approach to learn an evalu-
ation function V (s) defined over board states s. In the more difficult case
where the learner is not able to simulate the effects of its actions (e.g., a car
driving on a slippery road), algorithms such as Q-learning can be used to
acquire a similar evaluation function Q(s, a) defined over state-action pairs.
One key advantage of Q-learning is that when the agent finds itself in state
s it can choose the best action simply by finding the action a that maxi-
mizes Q(s, a), even if it cannot accurately predict the next state that will
result from taking this action. In contrast, to choose an action from state s
using V (s), the system must perform a look-ahead search over states result-
ing from candidate actions, which requires the ability to internally simulate
action outcomes.
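The following minimal sketch (a toy illustration; the chain environment and hyperparameters are assumptions) shows tabular Q-learning acquiring Q(s, a) purely from observed transitions, then acting by maximizing Q(s, a):

```python
import numpy as np

# Tabular Q-learning on a 5-state chain.  The learner never simulates
# transitions: it only observes the next state and reward after acting,
# and nudges Q(s, a) toward the one-step target r + gamma * max_a' Q(s', a').
n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
gamma, lr, eps = 0.9, 0.5, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    # Environment dynamics, hidden from the learner: reward 1 only upon
    # reaching the rightmost state, which ends the episode.
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

for episode in range(300):
    s, done = 0, False
    for t in range(100):               # cap episode length
        # Epsilon-greedy choice: exploit argmax_a Q(s, a) with random
        # tie-breaking, explore with probability eps.
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += lr * (target - Q[s, a])
        s = s2
        if done:
            break

print(Q[:-1].argmax(axis=1))           # learned policy: go right everywhere -> [1 1 1 1]
```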
one of its predefined capabilities (e.g., tell you the weather forecast, or how
to drive to the movie theater). What if you could use that conversation to
teach the phone to do new things (e.g., whenever it snows at night, wake me
up 30 minutes earlier, because I don't want to be late getting to work). If
phones could be taught in this way by users, we would suddenly find that we
have billions of programmers, only they would be using natural language
to program their phones instead of learning the language of computers.
• Machine learning by reading. Today, the world wide web contains much of
human knowledge, but mostly in natural language which is not understood
by computers. However, significant advances are now occurring in many
areas of natural language processing (e.g., machine translation). If natu-
ral language understanding reaches a high enough level of competence, we
might suddenly see that learning by reading becomes a dominant compo-
nent of how machines learn. Machines would, unlike us humans, be able to
read the entire web, and they would suddenly be better read than you and I
by a factor of several million.