Machine Learning and Pattern Recognition Notation
Textbooks, papers, code, and other courses will all use different names and notations for the
things covered in this course. While learning a subject, these differences can be confusing.
However, dealing with different notations is a necessary research skill. There will always be
differences in presentation in different sources, due to different trade-offs and priorities.
We try to make the notation fairly consistent within the course. This note lists some of the
conventions that we’ve chosen to follow that might be unfamiliar.
You can probably skip over this note at the start of the class. Most notation is introduced
in the notes as we use it. However, everything mentioned here is something that has been
queried by previous students of this class. So please refer back to this note if you find any
unfamiliar notation later on.
1 Vectors and matrices
Vectors are column vectors of numbers, written in bold lower-case, like x and w. If we need to show the contents of a column vector, we often create it from a row vector with the transpose ⊤, for example x = [x_1 x_2 x_3]⊤, to save vertical space in the notes. We use subscripts to index into a vector (or matrix).
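For concreteness, here is how these conventions map onto NumPy arrays (a sketch; the values are arbitrary, and code indexes from 0 rather than 1):

    import numpy as np

    # The column vector x = [x_1 x_2 x_3]^T written via a transpose:
    x_col = np.array([[1.0, 2.0, 3.0]]).T      # shape (3, 1)

    # In most code we just keep a 1-D array and remember that it stands
    # for a column vector:
    x = np.array([1.0, 2.0, 3.0])              # shape (3,)

    # Subscripts index into the vector; the maths x_1 is x[0] in code,
    # because NumPy counts from zero.
    x_1 = x[0]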
Depending on your background, you might be more familiar with writing vectors with an arrow over the letter than in bold.
In handwriting, we underline vectors (write x with a line underneath) because it’s difficult to handwrite in bold! While not everyone underlines vectors, and some authors use arrows, we recommend you write with the same notation in this class to help communication. But as long as it’s clear from context what you are doing, it’s not critical. We often forget to underline vectors when writing, so we understand.
Matrices (for us, rectangular arrays of numbers) are written as upper-case letters, like A and Σ. We’ve
chosen not to bold them, even though there are sometimes numbers represented by upper-
case letters floating around (such as D for number of dimensions). It should usually be clear
from context which quantities are matrices, and what size they are. See the maths cribsheet
for details on indexing matrices, and how sizes match in matrix-vector multiplication.
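As a sketch of the size rules in NumPy (the particular sizes are arbitrary):

    import numpy as np

    K, D = 2, 3
    A = np.ones((K, D))     # a K x D matrix
    x = np.ones(D)          # a D-dimensional vector

    y = A @ x               # (K x D) times (D) gives a K-dimensional vector
    assert y.shape == (K,)

    # Element A_{k,d} of the matrix is A[k-1, d-1] in zero-based code.
    a_12 = A[0, 1]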
Addition: sizes of vectors or matrices should match when adding quantities: a + b, or A + B.
As an exception, to add a scalar c to every element, we’ll just write a + c or A + c.
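In NumPy this shorthand corresponds to broadcasting a scalar against an array; a minimal sketch (array contents chosen arbitrarily):

    import numpy as np

    A = np.arange(6.0).reshape(2, 3)
    c = 10.0

    B = A + c    # the scalar c is added to every element of A
    assert B.shape == A.shape and B[1, 2] == A[1, 2] + c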
Indexing items: Sometimes we use superscripts to identify items, such as x^(n) for the nth D-dimensional input vector in a training set with N examples. We can (and often do) stack these vectors into an N × D matrix X, so we could use a notation such as X_{n,:} to access the nth row. In this case we chose to introduce the extra superscript notation instead.
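In NumPy, with the usual zero-based indexing, accessing the nth row looks like this sketch (shapes and values are illustrative):

    import numpy as np

    N, D = 5, 3
    rng = np.random.default_rng(0)
    X = rng.standard_normal((N, D))   # N x D matrix of stacked input vectors

    n = 2
    x_n = X[n, :]                     # the maths x^(n) (row X_{n,:}), zero-based here
    assert x_n.shape == (D,)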
2 Probabilities
The probability mass of a discrete outcome x is written P(x).
When it doesn’t seem necessary (nearly always) we don’t introduce notation for a corresponding random variable X, or write more explicit expressions like P_X(x) or P(X = x).
Notation is a trade-off, and more explicit notation can be more of a burden to work with.
Joint probabilities: P(x, y). Conditional probabilities: P(x | y). Conditional probabilities are sometimes written in the literature as P(x; y) — especially in frequentist statistics rather than Bayesian statistics. The ‘|’ symbol, introduced by Jeffreys, is historically associated with Bayesian reasoning. Hence for arbitrary functions, like f(x; w), where we want to emphasize that it’s primarily a function of x controlled by parameters w, we’ve chosen not to use a ‘|’.
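To make the joint and conditional notation concrete, here is a small NumPy sketch with a made-up 2×2 table of joint probabilities (the numbers are arbitrary):

    import numpy as np

    # Joint probabilities P(x, y) for x in {0, 1} (rows), y in {0, 1} (columns).
    P_xy = np.array([[0.1, 0.2],
                     [0.3, 0.4]])
    assert np.isclose(P_xy.sum(), 1.0)

    # Marginal P(y), and conditionals P(x | y) = P(x, y) / P(y).
    P_y = P_xy.sum(axis=0)
    P_x_given_y = P_xy / P_y                          # column j holds P(x | y = j)
    assert np.allclose(P_x_given_y.sum(axis=0), 1.0)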
The probability density of a real-valued outcome x is written with a lower-case p(x), such that P(a < X < b) = ∫_a^b p(x) dx. We tend not to introduce new symbols for density functions over different variables, but again overload the notation: we call them all “p” and infer which density we are talking about from the argument.
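As a numerical illustration (the standard normal density is chosen purely as an example):

    import numpy as np

    def p(x):
        # standard normal density, as an example p(x)
        return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

    a, b, dx = -1.0, 1.0, 1e-3
    xx = np.arange(a, b, dx) + dx / 2      # midpoints of small intervals
    prob = np.sum(p(xx)) * dx              # P(a < X < b), about 0.683 here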
Gaussian distributions are reviewed later in the notes. We will write that an outcome x was sampled from a Gaussian or normal distribution using x ∼ N(µ, Σ). We write the probability density associated with that outcome as N(x; µ, Σ). We could also have chosen to write N(x | µ, Σ), as Bishop and Murphy do. The ‘;’ was force of habit, because the Gaussian (outside of this course) is used in many contexts, and not just Bayesian reasoning.
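In code, sampling and density evaluation might look like the following sketch (using NumPy and SciPy; µ and Σ are made-up values):

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])

    rng = np.random.default_rng(0)
    x = rng.multivariate_normal(mu, Sigma)                     # x ~ N(mu, Sigma)
    density = multivariate_normal.pdf(x, mean=mu, cov=Sigma)   # N(x; mu, Sigma)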
3 Expectations
The expectation (or probabilistic average) of a function f under a distribution p is written as an integral:

    E_{p(x)}[f(x)] = ∫ f(x) p(x) dx.

This is a definite integral over the whole range of the variable x. We might have written ∫_{−∞}^{∞} . . . or ∫_X . . ., but because our integrals are always over the whole range of the variable, we don’t bother to specify the limits.
The expectation notation is often quicker to work with than writing out the integral. As
above, we sometimes don’t specify the distribution (especially when handwriting), if it can
be inferred from context.
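For instance, with p(x) a standard Gaussian and f(x) = x² (choices made purely for illustration), the expectation can be estimated by Monte Carlo sampling or by numerical integration:

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        return x**2

    # Monte Carlo: average f over samples drawn from p(x).
    samples = rng.standard_normal(100_000)
    mc_estimate = np.mean(f(samples))            # close to 1.0

    # Direct numerical integration of f(x) p(x) over the real line.
    dx = 1e-3
    xx = np.arange(-10.0, 10.0, dx) + dx / 2
    p = np.exp(-0.5 * xx**2) / np.sqrt(2.0 * np.pi)
    integral = np.sum(f(xx) * p) * dx            # also close to 1.0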
Please do review the background note on expectations and sums of random variables.
Throughout the course we will see generalizations of those results to real-valued variables
(as above) and expressions with matrices and vectors. You need to have worked through the
basics.
You may have seen multiple dimensional integrals written with multiple integral signs, for
example for a 3-dimensional vector:
∫∫∫ f(x) dx_1 dx_2 dx_3.    (3)
Our integrals are often over high-dimensional vectors, so rather than writing potentially
hundreds of integral signs, we simply write:
∫ f(x) dx.    (4)
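The same Monte Carlo idea works whatever the dimension, which is one reason the compact notation is convenient. A small sketch, taking p(x) to be a 3-dimensional standard Gaussian and f(x) = ∥x∥² purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    D = 3

    def f(x):
        return np.sum(x**2, axis=-1)       # squared length of each vector

    X = rng.standard_normal((100_000, D))  # samples from a D-dimensional standard Gaussian
    estimate = np.mean(f(X))               # estimates the D-dimensional integral; about 3.0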
4 Derivatives
Partial derivative of a scalar with respect to another scalar: ∂f/∂x. Example: ∂ sin(yx)/∂x = y cos(yx).
Column vector of partial derivatives: ∇_w f = [∂f/∂w_1  ∂f/∂w_2  · · ·  ∂f/∂w_D]⊤.
These notes avoid writing derivatives involving vectors as ∂y/∂z. Usually this expression would be a matrix with (∂y/∂z)_{ij} = ∂y_i/∂z_j. Under this common convention, the derivative of a scalar f with respect to a column vector w, written ∂f/∂w, would be a 1 × D row vector, whereas ∇_w f above is a column vector.
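As a sanity check on this notation (an illustrative sketch, with the function f below chosen arbitrarily), both the scalar example and a gradient vector can be verified numerically with finite differences:

    import numpy as np

    # d sin(yx) / dx = y cos(yx): check at one point with a central difference.
    x, y, eps = 0.7, 2.0, 1e-6
    numeric = (np.sin(y * (x + eps)) - np.sin(y * (x - eps))) / (2 * eps)
    assert np.isclose(numeric, y * np.cos(y * x))

    # grad_w f: one partial derivative per element of w.
    def f(w):
        return np.sum(np.sin(w))          # an arbitrary scalar function of a vector

    w = np.array([0.1, 0.5, -0.3])
    grad = np.empty_like(w)
    for d in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[d] += eps
        w_minus[d] -= eps
        grad[d] = (f(w_plus) - f(w_minus)) / (2 * eps)
    assert np.allclose(grad, np.cos(w))   # analytic gradient of sum(sin(w))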
Later in the course we will review a notation more suitable for computing derivatives, where
derivative quantities are stored in arrays of the same size and shape as the original variables.
All will be explained in the note on Backpropagation of Derivatives.