APPPHYS202 - Tuesday 10 January 2012: Quantum Physics & Information Theory Classical Physics & Information Theory
[Figure: quantum (noncommutative) probability underlies quantum physics & information theory, just as classical (commutative) probability underlies classical physics & information theory.]
is in $E_3$ and whether or not it is in $E_{\rm even}$, you can infer whether or not it is in any of the following subsets:
$$E_{\rm odd} = E_{\rm even}^C = \{\omega_1,\omega_3,\omega_5\}, \qquad E_4 = E_3^C = \{\omega_4,\omega_5,\omega_6\},$$
$$\{\omega_2\} = E_{\rm even}\cap E_3, \qquad \{\omega_5\} = E_{\rm odd}\cap E_4,$$
$$\{\omega_1,\omega_3\} = E_{\rm odd}\cap E_3, \qquad \{\omega_4,\omega_6\} = E_{\rm even}\cap E_4,$$
as well as any possible union of these subsets.
Note that random variables can also serve to define events:
• Was the result of the die-roll such that $X = 5$? $E = \{\omega_5\}$
• Was the result of the die-roll such that $Y = 15$ or $Y = 16$? $E = \{\omega_5, \omega_6\}$
• Was the result of the die-roll such that $Z = 5$? $E = \{\omega_3, \omega_4\}$
Knowing the value of a random variable does not necessarily allow you to determine the exact configuration, but you
can narrow it down to a subset of $\Omega$. The term level set is commonly used to refer to the event that contains all
configurations for which a random variable assumes a given value. Note that the level sets of a random variable are
non-overlapping, and that the union of all level sets of a random variable is $\Omega$. There can of course be cases where
joint knowledge of the values of two random variables allows you to infer an exact configuration, even if knowing only
one of them would not be sufficient (this may remind you of the concept of a complete set of commuting observables
in quantum mechanics, but the analogy is rather subtle as we shall see).
It is natural to extend the probability distribution function $m(\cdot)$ so that it is defined not only on elementary
outcomes but also on events. Explicitly,
$$m(E) = \sum_{\omega_i \in E} m(\omega_i).$$
When viewed as a function from subsets to the reals, $m(\cdot)$ is often referred to as a probability measure (especially in
scenarios with continuous random variables). It is easy to show that the following properties hold [Grinstead and
Snell, Theorem 1.1]:
1. $m(E) \geq 0$ for every $E \subseteq \Omega$.
2. $m(\Omega) = 1$.
3. If $E \subseteq F$, then $m(E) \leq m(F)$.
4. If $A$ and $B$ are disjoint subsets of $\Omega$, then $m(A \cup B) = m(A) + m(B)$.
5. $m(A^C) = 1 - m(A)$ for every $A \subseteq \Omega$.
Here $A^C$ indicates the complement of $A$ in $\Omega$, as in our above discussion of events.
Note that the probability distribution function thus induces probabilities for the values of random variables. For a
random variable A, if we define
$$E_{A,a} = \{\omega_i : A(\omega_i) = a\}$$
as the event $A = a$, then
$${\rm Pr}(A = a) = m(E_{A,a}) = \sum_{\omega_i \in E_{A,a}} m(\omega_i).$$
For example, $X = 5$ occurs only for $\omega_5$, so ${\rm Pr}(X = 5) = m(\omega_5) = 1/6$. On the other hand,
$${\rm Pr}(Z = 5) = m(\omega_3) + m(\omega_4) = 1/3.$$
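These induced probabilities are easy to check numerically. The following Python sketch (the variable and function names are mine, chosen for the die-roll example above, not notation from the notes) computes ${\rm Pr}(X=5)$ and ${\rm Pr}(Z=5)$ by summing $m(\cdot)$ over the corresponding level sets:

```python
# Sample space for one fair six-sided die: configurations omega_1 ... omega_6.
omega = [1, 2, 3, 4, 5, 6]

# Random variables as functions (here: dictionaries) on the sample space.
X = {w: w for w in omega}                      # X(omega_i) = i
Z = {1: 2, 2: 3, 3: 5, 4: 5, 5: 7, 6: 7}       # Z takes values 2, 3, 5, 5, 7, 7

# Uniform probability distribution function m(omega_i) = 1/6.
m = {w: 1.0 / 6.0 for w in omega}

def prob(R, r):
    """Pr(R = r) = sum of m(omega_i) over the level set {omega_i : R(omega_i) = r}."""
    return sum(m[w] for w in omega if R[w] == r)

print(prob(X, 5))   # 1/6  (level set {omega_5})
print(prob(Z, 5))   # 1/3  (level set {omega_3, omega_4})
```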
Algebras of random variables
Once we have defined some random variables, such as X, Y, Z, it is very easy to generate more (here we will assume
that all random variables can be viewed as taking real values). Note that sums and products of random variables are
themselves random variables, as are the products of random variables with real numbers. Hence, random variables
have a natural algebraic structure. For example, if we define
$$R(\cdot) = \alpha X(\cdot) + \beta Z(\cdot),$$
with $\alpha, \beta$ real numbers, then
$$R(\omega_1) = \alpha + 2\beta, \quad R(\omega_2) = 2\alpha + 3\beta, \quad R(\omega_3) = 3\alpha + 5\beta,$$
$$R(\omega_4) = 4\alpha + 5\beta, \quad R(\omega_5) = 5\alpha + 7\beta, \quad R(\omega_6) = 6\alpha + 7\beta.$$
Similarly,
$$Z^2(\cdot) \equiv [Z(\cdot)]^2$$
has values
$$Z^2(\omega_1) = 4, \quad Z^2(\omega_2) = 9, \quad Z^2(\omega_3) = 25, \quad Z^2(\omega_4) = 25, \quad Z^2(\omega_5) = 49, \quad Z^2(\omega_6) = 49,$$
and
$$XZ(\cdot) \equiv X(\cdot)Z(\cdot)$$
has values
$$XZ(\omega_1) = 2, \quad XZ(\omega_2) = 6, \quad XZ(\omega_3) = 15, \quad XZ(\omega_4) = 20, \quad XZ(\omega_5) = 35, \quad XZ(\omega_6) = 42.$$
The probability distribution function on $\Omega$ clearly provides probability distribution functions for such random variables
as well.
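To see this algebraic structure concretely, one can store the six values of each random variable in an array and let ordinary arithmetic act elementwise; a minimal numpy sketch (the array names are mine):

```python
import numpy as np

# Values of X and Z on omega_1 ... omega_6.
X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
Z = np.array([2, 3, 5, 5, 7, 7], dtype=float)

alpha, beta = 1.0, 2.0          # any real coefficients

R  = alpha * X + beta * Z       # R(omega_i) = alpha*X(omega_i) + beta*Z(omega_i)
Z2 = Z**2                       # Z^2 has values 4, 9, 25, 25, 49, 49
XZ = X * Z                      # XZ has values 2, 6, 15, 20, 35, 42

print(R)    # [ 5.  8. 13. 14. 19. 20.]  (= alpha + 2 beta, 2 alpha + 3 beta, ...)
print(Z2)
print(XZ)
```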
An indicator function $\chi_E(\cdot)$ of an event (subset) $E$ is a random variable such that
$$\chi_E(\omega_i) = \begin{cases} 1, & \omega_i \in E, \\ 0, & \omega_i \notin E. \end{cases}$$
Technically speaking, any random variable can be expressed in terms of indicator functions on its level sets:
$$R(\cdot) = \sum_i r_i\, \chi_{E_{r_i}}(\cdot),$$
where $R(\cdot)$ takes values in the set $\{r_i\}$ and $E_{r_i}$ is the level set corresponding to the value $r_i$. For example,
$$Z(\cdot) = 2\,\chi_{\{\omega_1\}}(\cdot) + 3\,\chi_{\{\omega_2\}}(\cdot) + 5\,\chi_{\{\omega_3,\omega_4\}}(\cdot) + 7\,\chi_{\{\omega_5,\omega_6\}}(\cdot).$$
It thus appears that indicator functions are like basis functions for random variables. Note that for two events A and
B,
$$\chi_{A\cap B}(\omega_i) = \chi_A(\omega_i)\,\chi_B(\omega_i).$$
Hence for a pair of random variables $R(\cdot)$ and $T(\cdot)$,
$$R(\cdot)T(\cdot) = \Big[\sum_i r_i\,\chi_{E_{r_i}}(\cdot)\Big]\Big[\sum_j t_j\,\chi_{E_{t_j}}(\cdot)\Big] = \sum_{i,j} r_i t_j\,\chi_{E_{r_i}\cap E_{t_j}}(\cdot) = T(\cdot)R(\cdot).$$
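A quick numerical check of the indicator-function decomposition and of the commutativity $R(\cdot)T(\cdot) = T(\cdot)R(\cdot)$; the helper `indicator` below is a hypothetical name of my own, not something defined in the notes:

```python
import numpy as np

omega = np.arange(1, 7)                       # labels for omega_1 ... omega_6
Z = np.array([2, 3, 5, 5, 7, 7], dtype=float)

def indicator(event):
    """chi_E(omega_i): 1 if omega_i is in the event E, else 0."""
    return np.array([1.0 if w in event else 0.0 for w in omega])

# Decompose Z over its level sets: Z = 2*chi_{1} + 3*chi_{2} + 5*chi_{3,4} + 7*chi_{5,6}.
Z_rebuilt = (2 * indicator({1}) + 3 * indicator({2})
             + 5 * indicator({3, 4}) + 7 * indicator({5, 6}))
assert np.allclose(Z, Z_rebuilt)

# Pointwise products of indicators multiply events: chi_A * chi_B = chi_{A intersect B}.
assert np.allclose(indicator({1, 2, 3}) * indicator({2, 4, 6}), indicator({2}))

# Random variables commute: R*T = T*R pointwise.
X = omega.astype(float)
assert np.allclose(X * Z, Z * X)
print("decomposition and commutativity checks pass")
```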
Expectation, variance, and the notion of state
The expectation of a random variable $R(\cdot)$, which we will write $\langle R\rangle$, is defined as
$$\langle R\rangle = \sum_i R(\omega_i)\,m(\omega_i).$$
This is the average, or mean value, of $R$ with respect to the probability distribution function $m(\cdot)$. Note that for
indicator functions,
$$\langle \chi_E\rangle = m(E).$$
Similarly, the variance of $R(\cdot)$ is defined as
$${\rm var}[R] \equiv \langle R^2\rangle - \langle R\rangle^2 = \sum_i R^2(\omega_i)\,m(\omega_i) - \Big[\sum_i R(\omega_i)\,m(\omega_i)\Big]^2.$$
It is common also to define the standard deviation of $R(\cdot)$, also called the uncertainty of $R(\cdot)$, as
$${\rm std}[R] \equiv \sqrt{\langle R^2\rangle - \langle R\rangle^2} = \sqrt{\sum_i [R(\omega_i)]^2\,m(\omega_i) - \Big[\sum_i R(\omega_i)\,m(\omega_i)\Big]^2}.$$
It is common also to define the covariance of two random variables $A(\cdot)$ and $B(\cdot)$ as
$${\rm cov}[A,B] \equiv \langle (A - \langle A\rangle)(B - \langle B\rangle)\rangle = \langle AB\rangle - \langle A\rangle\langle B\rangle = \sum_i A(\omega_i)B(\omega_i)\,m(\omega_i) - \Big[\sum_i A(\omega_i)\,m(\omega_i)\Big]\Big[\sum_i B(\omega_i)\,m(\omega_i)\Big].$$
It should be clear from these definitions that, in general, $\langle R^2\rangle \neq \langle R\rangle^2$ and $\langle AB\rangle \neq \langle A\rangle\langle B\rangle$. If ${\rm cov}[A,B] = 0$ we say that
$A(\cdot)$ and $B(\cdot)$ are linearly independent (uncorrelated) random variables.
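For the fair-die state $m(\omega_i) = 1/6$, these moment formulas can be verified with a few lines of numpy (a sketch; `expect` is my own helper name):

```python
import numpy as np

m = np.full(6, 1.0 / 6.0)                     # uniform state on omega_1 ... omega_6
X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
Z = np.array([2, 3, 5, 5, 7, 7], dtype=float)

def expect(R):
    """<R> = sum_i R(omega_i) m(omega_i)."""
    return np.sum(R * m)

var_X  = expect(X**2) - expect(X)**2          # variance of X
std_X  = np.sqrt(var_X)                       # standard deviation / uncertainty of X
cov_XZ = expect(X * Z) - expect(X) * expect(Z)

print(expect(X))   # 3.5
print(var_X)       # 2.9166...
print(cov_XZ)      # <XZ> - <X><Z>, nonzero since X and Z are correlated
```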
Formally, a state is a consistent assignment of an expectation value to every random variable in an algebra. It
should be clear from the above that a state specifies variances and covariances by virtue of the fact that if $A(\cdot)$ and
$B(\cdot)$ are random variables in our algebra, then so are $A^2(\cdot)$, $B^2(\cdot)$ and $AB(\cdot)$. The probability measure $m(\cdot)$ is a
compact way of summarizing the state on an algebra of random variables. Note that state and configuration are quite
different in our usage of the terms: classically we assume that there exists an actual configuration of the system in
question (the actual disposition of the die after it has been rolled), which may or may not be known to anyone, but we
also have a state of knowledge/belief that summarizes the information we use to make predictions within a
probabilistic framework.
Matrix notation
In a finite discrete setting, for which the sample space $\Omega$ contains $N$ elements, it is natural to associate random
variables with $N\times N$ real matrices. For an arbitrary random variable $R(\cdot)$, we simply place the values $R(\omega_i)$ along the
diagonal and put zeros everywhere else. Hence, continuing with our example of the six-sided die:
$$X(\cdot) \mapsto \begin{pmatrix} 1&0&0&0&0&0\\ 0&2&0&0&0&0\\ 0&0&3&0&0&0\\ 0&0&0&4&0&0\\ 0&0&0&0&5&0\\ 0&0&0&0&0&6\end{pmatrix}, \qquad Z(\cdot) \mapsto \begin{pmatrix} 2&0&0&0&0&0\\ 0&3&0&0&0&0\\ 0&0&5&0&0&0\\ 0&0&0&5&0&0\\ 0&0&0&0&7&0\\ 0&0&0&0&0&7\end{pmatrix}.$$
We use $M(X)$ to denote the matrix representation of a random variable $X(\cdot)$. With a bit of thought you can convince
yourself that with this matrix representation, we can use the usual rules of matrix arithmetic and multiplication to
carry out algebraic manipulations among random variables. For example,
$$R(\cdot) = \alpha X(\cdot) + \beta Z(\cdot) \;\mapsto\; \alpha\begin{pmatrix} 1&0&0&0&0&0\\ 0&2&0&0&0&0\\ 0&0&3&0&0&0\\ 0&0&0&4&0&0\\ 0&0&0&0&5&0\\ 0&0&0&0&0&6\end{pmatrix} + \beta\begin{pmatrix} 2&0&0&0&0&0\\ 0&3&0&0&0&0\\ 0&0&5&0&0&0\\ 0&0&0&5&0&0\\ 0&0&0&0&7&0\\ 0&0&0&0&0&7\end{pmatrix} = {\rm diag}(\alpha+2\beta,\ 2\alpha+3\beta,\ 3\alpha+5\beta,\ 4\alpha+5\beta,\ 5\alpha+7\beta,\ 6\alpha+7\beta),$$
where the ${\rm diag}(\cdot)$ notation hopefully is obvious. Note that because all the matrices we use in this
classical probability setting are diagonal, the matrix representations of an algebra of random variables form a
commutative matrix algebra.
We note that the probability distribution can be written in exactly the same matrix notation, and that we thus
arrive at convenient expressions such as
$$\langle X\rangle = \sum_i X(\omega_i)\,m(\omega_i) = {\rm Tr}\left[\begin{pmatrix} 1&0&0&0&0&0\\ 0&2&0&0&0&0\\ 0&0&3&0&0&0\\ 0&0&0&4&0&0\\ 0&0&0&0&5&0\\ 0&0&0&0&0&6\end{pmatrix}\begin{pmatrix} 1/6&0&0&0&0&0\\ 0&1/6&0&0&0&0\\ 0&0&1/6&0&0&0\\ 0&0&0&1/6&0&0\\ 0&0&0&0&1/6&0\\ 0&0&0&0&0&1/6\end{pmatrix}\right].$$
We will use the suggestive notation $\rho \equiv {\rm diag}(m(\omega_1),\ldots,m(\omega_N))$ for the matrix representing the probability distribution
function. Hence, in general, the expectation $\langle R\rangle$ of an arbitrary random variable $R$ can be computed by taking the
trace of the product of $\rho$ with $M(R)$. The matrix $\rho$ provides a convenient representation of a state for our algebra of
random variables.
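Here is a small numpy sketch of the trace formula $\langle R\rangle = {\rm Tr}[\rho\, M(R)]$ for the fair die, assuming the diagonal matrix representations introduced above (the variable names `M_X`, `M_Z`, `rho` are mine):

```python
import numpy as np

# Diagonal matrix representations of X and Z, and the state rho for a fair die.
M_X = np.diag([1, 2, 3, 4, 5, 6]).astype(float)
M_Z = np.diag([2, 3, 5, 5, 7, 7]).astype(float)
rho = np.diag(np.full(6, 1.0 / 6.0))

# Expectation as a trace: <R> = Tr[rho M(R)].
print(np.trace(rho @ M_X))        # 3.5
print(np.trace(rho @ M_X @ M_Z))  # <XZ> = 20.0

# All these matrices are diagonal, so the algebra is commutative.
assert np.allclose(M_X @ M_Z, M_Z @ M_X)
```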
Indicator functions have a somewhat special appearance in this matrix notation, as they correspond to matrices
with zeros and ones on the diagonal. Viewed as linear operators, they are therefore projection (idempotent)
operators. For example, the indicator function $\chi_E(\cdot)$ for the event $E = \{\omega_1,\omega_2\}$ has matrix representation
$$M(\chi_E) = \begin{pmatrix} 1&0&0&0&0&0\\ 0&1&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&0&0\\ 0&0&0&0&0&0\end{pmatrix},$$
where clearly $[M(\chi_E)]^2 = M(\chi_E)$. It should be evident that the matrix representations of the indicator functions on all of the
individual outcomes $\{\omega_1\}, \{\omega_2\}, \ldots, \{\omega_N\}$ provide a linear basis for the commutative matrix algebra representing all
possible random variables on $\Omega$. In particular,
$$M(R) = \sum_{i=1}^N R(\omega_i)\, M(\chi_{\{\omega_i\}}).$$
It may occur to you that this is actually a sort of spectral decomposition of R viewed as a linear operator. Hopefully,
this perspective also highlights the fact that we can easily identify sub-algebras. For example, if we think about the
linear span of the matrix representations of the indicator functions on $\{\omega_1,\omega_3,\omega_5\}$ and $\{\omega_2,\omega_4,\omega_6\}$, we obtain a closed
matrix algebra for which the first, third and fifth diagonal elements are always the same, as are the second, fourth
and sixth. It is only really two-dimensional.
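As a concrete check of this remark, the sketch below verifies that the span of the two parity projectors is closed under matrix multiplication, and hence forms a two-dimensional commutative sub-algebra (the names are mine):

```python
import numpy as np

# Projectors (indicator matrices) onto the odd and even outcomes.
P_odd  = np.diag([1, 0, 1, 0, 1, 0]).astype(float)
P_even = np.diag([0, 1, 0, 1, 0, 1]).astype(float)

# They are idempotent and orthogonal ...
assert np.allclose(P_odd @ P_odd, P_odd)
assert np.allclose(P_even @ P_even, P_even)
assert np.allclose(P_odd @ P_even, np.zeros((6, 6)))

# ... so any product of elements a*P_odd + b*P_even stays in the span:
a, b, c, d = 2.0, -1.0, 0.5, 3.0
A = a * P_odd + b * P_even
B = c * P_odd + d * P_even
assert np.allclose(A @ B, (a * c) * P_odd + (b * d) * P_even)
print("the span of P_odd and P_even is a closed, two-dimensional algebra")
```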
Note that once we have obtained the matrix representations for the observables that we care about, and for the
state, we can actually forget about $\Omega$ and the underlying configurations! Our original notion of random variables as
functions on a sample space dictated the dimension of the matrix representations and their diagonality (required for
multiplication to be commutative).
Joint systems
Suppose now we have two six-sided dice. We can at first consider them to be independent systems living in
separate sample spaces:
$$\Omega_A = \{\omega_1^A, \omega_2^A, \omega_3^A, \omega_4^A, \omega_5^A, \omega_6^A\}, \qquad \Omega_B = \{\omega_1^B, \omega_2^B, \omega_3^B, \omega_4^B, \omega_5^B, \omega_6^B\}.$$
Clearly we can define random variables on each space, such as
$$X_A(\omega_i^A) = i, \qquad X_B(\omega_j^B) = j.$$
Note that at this level of description, $\omega_1^B$ is not in the domain of $X_A(\cdot)$ and therefore $X_A(\omega_i^B)$ is undefined. Likewise, we
have two probability distribution functions $m_A(\cdot)$ and $m_B(\cdot)$, which we might as well take to be uniform.
We can clearly construct a joint sample space by taking Cartesian products:
$$\Omega_{AB} = \{\omega_1^A\omega_1^B,\ \omega_1^A\omega_2^B,\ \omega_1^A\omega_3^B,\ \omega_1^A\omega_4^B,\ \omega_1^A\omega_5^B,\ \omega_1^A\omega_6^B,\ \omega_2^A\omega_1^B,\ \ldots,\ \omega_6^A\omega_6^B\}.$$
Now $\Omega_{AB}$ has 36 elements, corresponding to all possible outcomes of the rolling of a pair of six-sided dice. What
about the random variables and probability distribution functions? Consider the following definition:
$$R_{AB}(\cdot) \equiv R_A(\cdot)\otimes R_B(\cdot), \qquad R_{AB}(\omega_i^A\omega_j^B) \equiv R_A(\omega_i^A)\,R_B(\omega_j^B),$$
where the final expression indicates simple scalar multiplication of the numerical values of $R_A(\omega_i^A)$ and $R_B(\omega_j^B)$.
Making use of the identity functions
$$1_A(\cdot) \equiv \chi_{\Omega_A}(\cdot), \qquad 1_B(\cdot) \equiv \chi_{\Omega_B}(\cdot),$$
we can thus define ampliations of the random variables we initially defined on the factor spaces $\Omega_A$ and $\Omega_B$ to the joint
space $\Omega_{AB}$. For example,
$$X_A(\cdot) \rightarrow \mathcal{A}_B[X_A](\cdot) \equiv X_A(\cdot)\otimes 1_B(\cdot), \qquad \mathcal{A}_B[X_A](\omega_i^A\omega_j^B) = X_A(\omega_i^A) = i,$$
$$X_B(\cdot) \rightarrow \mathcal{A}_A[X_B](\cdot) \equiv 1_A(\cdot)\otimes X_B(\cdot), \qquad \mathcal{A}_A[X_B](\omega_i^A\omega_j^B) = X_B(\omega_j^B) = j.$$
Often we will simply write
$$X_A(\omega_i^A\omega_j^B) = i, \qquad X_B(\omega_i^A\omega_j^B) = j,$$
with all the ampliation stuff implied. Note that we can now also consider things like
$$X_{AB}(\cdot) \equiv X_A(\cdot)\otimes X_B(\cdot), \qquad X_{AB}(\omega_i^A\omega_j^B) = ij,$$
$$\tilde{X}_{AB}(\cdot) \equiv X_A(\cdot)\otimes 1_B(\cdot) + 1_A(\cdot)\otimes X_B(\cdot), \qquad \tilde{X}_{AB}(\omega_i^A\omega_j^B) = i + j.$$
Normally in games we consider $\tilde{X}_{AB}(\cdot)$ to correspond to the numerical value of the roll. Incidentally, note that $\tilde{X}_{AB}(\cdot)$
is a random variable on $\Omega_{AB}$ that does not simply factor into the product of a random variable on $\Omega_A$ with another
random variable on $\Omega_B$.
Turning now to the probability distribution functions, we note that
$$m_{AB}(\cdot) \equiv m_A(\cdot)\otimes m_B(\cdot), \qquad m_{AB}(\omega_i^A\omega_j^B) = m_A(\omega_i^A)\,m_B(\omega_j^B) = 1/36$$
provides the proper joint probability distribution function on $\Omega_{AB}$ (it is clearly normalized). While we are free to define
$$\tilde{m}_{AB}(\cdot) \equiv m_A(\cdot)\otimes 1_B(\cdot) + 1_A(\cdot)\otimes m_B(\cdot) = 1/3,$$
this function on $\Omega_{AB}$ is not a valid probability measure. Note that the action of $m_{AB}(\cdot)$ on subsets of $\Omega_{AB}$ follows in an
obvious way.
While the matrix representations of our new joint random variables are rather cumbersome to write down, we
note that they have dimension 36, which is the product of the matrix-representation dimensions on the factor spaces.
In fact, using notation familiar from quantum mechanics we can write
$$M(R_{AB}) = M(R_A)\otimes M(R_B),$$
where $\otimes$ denotes the tensor (Kronecker) product. If you didn't see this in your previous quantum class don't worry; we'll
review it later. If you do know how to take tensor products of matrices, perhaps you could verify the above relation
for $M(X_{AB})$ and $M(\tilde{X}_{AB})$.
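For readers who want to try the suggested verification right away, here is a numpy sketch using `np.kron` for the Kronecker product; it assumes the joint outcomes are ordered $\omega_1^A\omega_1^B, \omega_1^A\omega_2^B, \ldots, \omega_6^A\omega_6^B$, as in the Cartesian-product listing above:

```python
import numpy as np

M_XA = np.diag([1, 2, 3, 4, 5, 6]).astype(float)   # matrix rep of X_A
M_XB = np.diag([1, 2, 3, 4, 5, 6]).astype(float)   # matrix rep of X_B
I6   = np.eye(6)

# Product variable X_AB(omega_i^A omega_j^B) = i*j, built directly on the joint space ...
vals_prod = np.array([i * j for i in range(1, 7) for j in range(1, 7)], dtype=float)
# ... agrees with the Kronecker product of the factor representations.
assert np.allclose(np.diag(vals_prod), np.kron(M_XA, M_XB))

# Sum variable (i + j) corresponds to the sum of the two ampliations.
vals_sum = np.array([i + j for i in range(1, 7) for j in range(1, 7)], dtype=float)
assert np.allclose(np.diag(vals_sum), np.kron(M_XA, I6) + np.kron(I6, M_XB))
print("Kronecker-product relations verified")
```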
Finally we mention the issue of marginalization. Suppose we retain the joint sample space $\Omega_{AB}$, but I now tell you
that the dice are weighted and that I am going to roll them in some sneaky way that could correlate their outcomes. I
summarize the information numerically by giving you a new joint probability distribution function $n_{AB}(\cdot)$. If we forget
about the B die, what is the marginal probability distribution function $n_A(\cdot)$ for the A die only? In the functional
notation we can write
$$n_A(\omega_i^A) = \sum_{j=1}^6 n_{AB}(\omega_i^A\omega_j^B).$$
In matrix notation we would like to have a procedure for going from ${\rm diag}(n_{AB}(\omega_1^A\omega_1^B),\ldots,n_{AB}(\omega_6^A\omega_6^B))$ to
${\rm diag}(n_A(\omega_1^A),\ldots,n_A(\omega_6^A))$ via linear-algebra-type operations. Again, from previous quantum classes it may not surprise
you to hear that this is a partial trace operation; this also will be reviewed a bit later on in the course.
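The marginalization can be sketched in code as follows; the 6×6 array `n_AB` is just a stand-in for some weighted, correlated joint distribution, and the reshape-and-trace step is one way (under the same outcome ordering as above) to see the sum over $j$ as a partial trace over the B factor:

```python
import numpy as np

# Some (hypothetical) correlated joint distribution n_AB on the 36 outcomes,
# stored as a 6x6 array with rows indexed by omega_i^A and columns by omega_j^B.
n_AB = np.random.rand(6, 6)
n_AB /= n_AB.sum()                 # normalize so the entries sum to 1

# Functional form: n_A(omega_i^A) = sum_j n_AB(omega_i^A omega_j^B).
n_A = n_AB.sum(axis=1)

# Matrix form: reshape the 36x36 diagonal state and trace out the B factor.
rho_AB = np.diag(n_AB.reshape(36))                               # diag(n_AB(w1^A w1^B), ...)
rho_A  = np.trace(rho_AB.reshape(6, 6, 6, 6), axis1=1, axis2=3)  # partial trace over B

assert np.allclose(rho_A, np.diag(n_A))
print(n_A.sum())                   # 1.0: the marginal is properly normalized
```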
Conditioning
Suppose I roll the dice without showing you the exact outcome, but I tell you that $\tilde{X}_{AB} = 7$. There are obviously
several joint configurations consistent with this but we can rule out others, such as $\omega_1^A\omega_1^B$; how should you update
your original $m_{AB}(\cdot)$ to obtain a conditional probability distribution $m_{AB}(\cdot\,|\,\tilde{X}_{AB} = 7)$?
Most of you will have seen Bayes' Rule on some previous occasion:
$${\rm Pr}(E\,|\,F) = \frac{{\rm Pr}(F\,|\,E)\,{\rm Pr}(E)}{{\rm Pr}(F)},$$
which can be thought of as a summary of the equations
$${\rm Pr}(E,F) = {\rm Pr}(E\,|\,F)\,{\rm Pr}(F), \qquad {\rm Pr}(F,E) = {\rm Pr}(F\,|\,E)\,{\rm Pr}(E), \qquad {\rm Pr}(E,F) = {\rm Pr}(F,E).$$
For the present discussion it will be most useful to use the slightly modified form
$${\rm Pr}(E\,|\,F) = \frac{{\rm Pr}(E,F)}{{\rm Pr}(F)}.$$
Here ${\rm Pr}(E,F)$ is the joint probability of $E$ and $F$, while ${\rm Pr}(E\,|\,F)$ is the conditional probability of $E$ given $F$. The
probabilities ${\rm Pr}(E)$ and ${\rm Pr}(F)$ are understood to be prior probabilities, that is, the probabilities we would have
assigned to the events $E$ and $F$ before gaining any updated information. Note here the use of the term events, which
should immediately alert you to how we are going to proceed. Invoking Bayes' Rule for our dice scenario, we define
$$E = \{\omega_i^A\omega_j^B\}, \qquad F = \{\omega \in \Omega_{AB} : \tilde{X}_{AB}(\omega) = 7\},$$
($F$ is a level set of $\tilde{X}_{AB}(\cdot)$) and find
$$m_{AB}(\omega_i^A\omega_j^B\,|\,\tilde{X}_{AB} = 7) \equiv {\rm Pr}(E\,|\,F) = \frac{{\rm Pr}(E,F)}{{\rm Pr}(F)} = \frac{m_{AB}(E\cap F)}{m_{AB}(F)}.$$
Clearly the numerator vanishes for any configuration not in the level set, and is equal to 1/36 for any configuration
that is in the level set. The denominator is actually independent of $\omega_i^A\omega_j^B$, and in fact can be seen to be equal to the
sum over all $\omega \in \Omega_{AB}$ of $m_{AB}(\{\omega\}\cap F)$. Hence it is simply a normalization factor for the conditional probability
distribution. So in the end,
$$m_{AB}(\omega_i^A\omega_j^B\,|\,\tilde{X}_{AB} = 7) = \begin{cases} \dfrac{1}{6}, & \omega_i^A\omega_j^B \in F, \\[1ex] 0, & \omega_i^A\omega_j^B \notin F, \end{cases} \qquad F = \{\omega_1^A\omega_6^B,\ \omega_2^A\omega_5^B,\ \omega_3^A\omega_4^B,\ \omega_4^A\omega_3^B,\ \omega_5^A\omega_2^B,\ \omega_6^A\omega_1^B\}.$$
The basic structure of Bayes' Rule, which we have just highlighted, is that you condition your probability distribution
function by eliminating all configurations that are inconsistent with the information gained and then renormalizing
whatever is left over.
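The eliminate-and-renormalize recipe is equally short in code; a sketch for the $\tilde{X}_{AB} = 7$ example with the uniform prior (the array names are mine):

```python
import numpy as np

# Uniform prior on the 36 joint outcomes, as a 6x6 array m_AB[i-1, j-1].
m_AB = np.full((6, 6), 1.0 / 36.0)

# Level set F of the event "sum of the two dice equals 7", as a boolean mask.
F = np.array([[i + j == 7 for j in range(1, 7)] for i in range(1, 7)])

# Bayes rule: keep only configurations consistent with the data, then renormalize.
m_cond = np.where(F, m_AB, 0.0)
m_cond /= m_cond.sum()

print(m_cond[0, 5])   # 1/6 for omega_1^A omega_6^B, which is in F
print(m_cond[0, 0])   # 0   for omega_1^A omega_1^B, which is not
```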