Conditional expectation and least squares prediction
An important problem of probability theory is to predict the value of a future
observation Y given knowledge of a related observation X (or, more generally, given
several related observations X1, X2,…). Examples are to predict the future course of the
national economy or the path of a rocket, given its present state.
Prediction is often just one aspect of a “control” problem. For example, in guiding a
rocket, measurements of the rocket’s location, velocity, and so on are made almost
continuously; at each reading, the rocket’s future course is predicted, and a control is
then used to correct its future course. The same ideas are used in the automatic steering
of large tankers transporting crude oil, for which even slight gains in efficiency result in
large financial savings.
There is one important case in which the optimal mean square predictor actually is the
same as the optimal linear predictor. If X and Y are jointly normally distributed, the
conditional expectation of Y given X is just a linear function of X, and hence the optimal
predictor and the optimal linear predictor are the same. The form of the
bivariate normal distribution as well as expressions for the coefficients â and b̂ and for
the minimum mean square error of prediction were discovered by the English
eugenicist Sir Francis Galton in his studies of the transmission of inheritable
characteristics from one generation to the next. They form the foundation of the
statistical technique of linear regression.
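Although the displayed expressions referred to above are not reproduced here, they are classical. Writing the best linear predictor of Y as â + b̂X, the standard least squares formulas are as follows (restated in conventional notation, with ρ, σX, and σY denoting the correlation coefficient and the standard deviations of X and Y, symbols introduced here for convenience):

```latex
\hat{b} = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} = \rho\,\frac{\sigma_Y}{\sigma_X},
\qquad
\hat{a} = E(Y) - \hat{b}\,E(X),
\qquad
E\bigl[(Y - \hat{a} - \hat{b}X)^2\bigr] = \sigma_Y^2\,(1 - \rho^2).
```

The factor 1 − ρ² shows that prediction is exact when |ρ| = 1 and no better than predicting E(Y) when ρ = 0.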
The Poisson process and the Brownian motion process
The theory of stochastic processes attempts to build probability models for phenomena
that evolve over time. A primitive example appearing earlier in this article is the
problem of gambler’s ruin.
The Poisson process
An important stochastic process described implicitly in the discussion of the Poisson
approximation to the binomial distribution is the Poisson process. Modeling the
emission of radioactive particles by an infinitely large number of tosses of a coin having
infinitesimally small probability for heads on each toss led to the conclusion that the
number of particles N(t) emitted in the time interval [0, t] has the Poisson
distribution given in equation (13) with expectation μt. The primary concern of the
theory of stochastic processes is not this marginal distribution of N(t) at a particular
time but rather the evolution of N(t) over time. Two properties of the Poisson process
that make it attractive to deal with theoretically are: (i) The times between emission of
particles are independent and exponentially distributed with expected value 1/μ. (ii)
Given that N(t) = n, the times at which the n particles are emitted have the same joint
distribution as n points distributed independently and uniformly on the interval [0, t].
Examples of other phenomena for which the Poisson process often serves as
a mathematical model are the number of customers arriving at a counter and requesting
service, the number of claims against an insurance company, or the number of
malfunctions in a computer system. The importance of the Poisson process consists in
(a) its simplicity as a test case for which the mathematical theory, and hence
the implications, are more easily understood than for more realistic models and (b) its
use as a building block in models of complex systems.
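Both properties lend themselves to a quick numerical check. The sketch below (an illustration only, assuming NumPy; the rate μ = 2 and horizon t = 10 are arbitrary choices) builds the process from property (i) and tests property (ii):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, t_max = 2.0, 10.0            # emission rate and time horizon (arbitrary)

def poisson_path(rng, mu, t_max):
    """Emission times in [0, t_max], built from property (i):
    independent exponential gaps with expected value 1/mu."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / mu)
        if t > t_max:
            return np.array(times)
        times.append(t)

paths = [poisson_path(rng, mu, t_max) for _ in range(20_000)]

# N(t_max) has the Poisson distribution with expectation mu * t_max = 20.
print(np.mean([len(p) for p in paths]))

# Property (ii): given N(t_max) = 20, the emission times behave like 20
# independent Uniform(0, t_max) points in increasing order, so their
# pooled average is close to t_max / 2 = 5.
pooled = np.concatenate([p for p in paths if len(p) == 20])
print(pooled.mean())
```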
Brownian motion process
The most important stochastic process is the Brownian motion or Wiener process. It
was first discussed by Louis Bachelier (1900), who was interested in modeling
fluctuations in prices in financial markets, and by Albert Einstein (1905), who gave
a mathematical model for the irregular motion of colloidal particles first observed by the
Scottish botanist Robert Brown in 1827. The first mathematically rigorous treatment of
this model was given by Wiener (1923). Einstein’s results led to an early, dramatic
confirmation of the molecular theory of matter in the French physicist Jean Perrin’s
experiments to determine Avogadro’s number, for which Perrin was awarded a Nobel
Prize in 1926. Today somewhat different models for physical Brownian motion are
deemed more appropriate than Einstein’s, but the original mathematical model
continues to play a central role in the theory and application of stochastic processes.
The process B(t) has many other properties, which in principle are all inherited from the
approximating random walk Bm(t). For example, if (s1, t1) and (s2, t2) are disjoint
intervals, the increments B(t1) − B(s1) and B(t2) − B(s2) are independent random
variables that are normally distributed with expectation 0 and variances equal to
σ²(t1 − s1) and σ²(t2 − s2), respectively.
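These increment properties can be seen in the random walk approximation itself. A minimal sketch (assuming NumPy; σ = 1 and the intervals (0.5, 1.5) and (2, 3) are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, dt = 1.0, 0.01                  # grid spacing; step variance sigma**2 * dt

# Approximate many Brownian paths on [0, 4] by cumulative sums of small
# independent Gaussian steps, in the spirit of the random walk Bm(t).
steps = rng.normal(0.0, sigma * np.sqrt(dt), size=(20_000, 400))
paths = np.cumsum(steps, axis=1)

def increment(s, t):
    """B(t) - B(s), read off the grid."""
    return paths[:, round(t / dt) - 1] - paths[:, round(s / dt) - 1]

inc1 = increment(0.5, 1.5)             # over (s1, t1) = (0.5, 1.5)
inc2 = increment(2.0, 3.0)             # over (s2, t2) = (2.0, 3.0)
print(inc1.var(), inc2.var())          # each close to sigma**2 * (t - s) = 1.0
print(np.corrcoef(inc1, inc2)[0, 1])   # close to 0, reflecting independence
```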
Einstein took a different approach and derived various properties of the process B(t) by
showing that its probability density function, g(x, t), satisfies the diffusion equation
∂g/∂t = D ∂²g/∂x², where D = σ²/2. The important implication of Einstein’s theory for
subsequent experimental research was that he identified the diffusion constant D in
terms of certain measurable properties of the particle (its radius) and of the medium (its
viscosity and temperature), which allowed one to make predictions and hence to
confirm or reject the hypothesized existence of the unseen molecules that were assumed
to be the cause of the irregular Brownian motion. Because of the beautiful blend of
mathematical and physical reasoning involved, a brief summary of the successor to
Einstein’s model is given below.
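Einstein’s assertion is easy to verify directly: the normal density with mean 0 and variance 2Dt = σ²t satisfies the diffusion equation, since (a standard calculus check, not a reproduction of the article’s own display)

```latex
g(x,t) = \frac{1}{\sqrt{4\pi D t}}\, e^{-x^{2}/(4Dt)}
\quad\Longrightarrow\quad
\frac{\partial g}{\partial t}
  = \left(\frac{x^{2}}{4Dt^{2}} - \frac{1}{2t}\right) g
  = D\,\frac{\partial^{2} g}{\partial x^{2}} .
```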
Unlike the path of the Poisson process, the path of a particle undergoing mathematical
Brownian motion is impossible to “draw.” Wiener (1923) showed that the
functions B(t) are continuous, as one expects, but nowhere differentiable. Thus, a
particle undergoing mathematical Brownian motion does not have a well-defined
velocity, and the curve y = B(t) does not have a well-defined tangent at any value of t. To
see why this might be so, recall that the derivative of B(t), if it exists, is the limit as h →
0 of the ratio [B(t + h) − B(t)]/h. Since B(t + h) − B(t) is normally distributed with mean
0 and standard deviation σ√h, in very rough terms B(t + h) − B(t) can be expected to
equal some multiple (positive or negative) of √h. But the limit as h → 0 of √h/h =
1/√h is infinite. A related fact that illustrates the extreme irregularity of B(t) is that in
every interval of time, no matter how small, a particle undergoing mathematical
Brownian motion travels an infinite distance. Although these properties contradict the
commonsense idea of a function—and indeed it is quite difficult to write down explicitly
a single example of a continuous, nowhere-differentiable function—they turn out to be
typical of a large class of stochastic processes, called diffusion processes, of which
Brownian motion is the most prominent member. Especially notable contributions to
the mathematical theory of Brownian motion and diffusion processes were made
by Paul Lévy and William Feller during the years 1930–60.
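The divergence of the difference quotient is easy to see numerically (a sketch assuming NumPy, with σ = 1):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0

# B(t + h) - B(t) is Normal(0, sigma**2 * h), so the difference quotient
# [B(t + h) - B(t)] / h has standard deviation sigma / sqrt(h), which
# diverges as h -> 0: the sample paths admit no derivative.
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    quotient = rng.normal(0.0, sigma * np.sqrt(h), size=100_000) / h
    print(f"h = {h:g}: sample std {quotient.std():8.1f}, "
          f"theory {sigma / np.sqrt(h):8.1f}")
```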
If one also assumes that the functions V(t) are continuous, which is certainly reasonable
from physical considerations, it follows by mathematical analysis that A(t) is a Brownian
motion process as defined above. This conclusion poses questions about the meaning of
the initial equation (18), because for mathematical Brownian motion the term dA(t)
does not exist in the usual sense of a derivative. Some additional mathematical analysis
shows that the stochastic differential equation (18) and its solution equation (19) have a
precise mathematical interpretation. The process V(t) is called the Ornstein-Uhlenbeck
process, after the physicists Leonard Salomon Ornstein and George Eugene Uhlenbeck.
The logical outgrowth of these attempts to differentiate and integrate with respect to a
Brownian motion process is the Ito (named for the Japanese mathematician Itō Kiyosi)
stochastic calculus, which plays an important role in the modern theory of stochastic
processes.
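Although equations (18) and (19) are not reproduced above, Ornstein-Uhlenbeck dynamics are commonly written in the Langevin form dV = −βV dt + σ dB, and a discretized sketch conveys the flavour of the solution (β and σ are generic placeholders here, not necessarily the article’s coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)

# Euler-Maruyama discretization of dV = -beta * V dt + sigma dB, one common
# way of writing the Ornstein-Uhlenbeck equation (placeholder coefficients).
beta, sigma, dt, n = 1.0, 1.0, 0.001, 500_000

v = np.empty(n)
v[0] = 0.0
noise = rng.normal(0.0, np.sqrt(dt), size=n - 1)
for i in range(n - 1):
    v[i + 1] = v[i] - beta * v[i] * dt + sigma * noise[i]

# The long-run variance settles near sigma**2 / (2 * beta), the analogue
# of the stationary variance sigma**2/(2mf) quoted below for V(0).
print(v[n // 2:].var(), sigma**2 / (2 * beta))
```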
The displacement at time t of the particle whose velocity is given by equation (19) is
For t large compared with β, the first and third terms in this expression are small
compared with the second. Hence, X(t) − X(0) is approximately equal to A(t)/f, and the
mean square displacement, E{[X(t) − X(0)]²}, is approximately σ²/f² = RT/(3πaηN).
These final conclusions are consistent with Einstein’s model, although here they arise as
an approximation to the model obtained from equation (19). Since it is primarily the
conclusions that have observational consequences, there are essentially no new
experimental implications. However, the analysis arising directly out of Newton’s
second law, which yields a process having a well-defined velocity at each point, seems
more satisfactory theoretically than Einstein’s original model.
Stochastic processes
A stochastic process is a family of random variables X(t) indexed by a parameter t,
which usually takes values in the discrete set Τ = {0, 1, 2,…} or the continuous set Τ = [0,
+∞). In many cases t represents time, and X(t) is a random variable observed at time t.
Examples are the Poisson process, the Brownian motion process, and the Ornstein-
Uhlenbeck process described in the preceding section. Considered as a totality, the
family of random variables {X(t), t ∊ Τ} constitutes a “random function.”
Stationary processes
The mathematical theory of stochastic processes attempts to define classes of processes
for which a unified theory can be developed. The most important classes are stationary
processes and Markov processes. A stochastic process is called stationary if, for
all n, t1 < t2 <⋯< tn, and h > 0, the joint distribution of X(t1 + h),…, X(tn + h) does not
depend on h. This means that in effect there is no origin on the time axis; the stochastic
behaviour of a stationary process is the same no matter when the process is observed. A
sequence of independent identically distributed random variables is an example of a
stationary process. A rather different example is defined as follows: U(0) is uniformly
distributed on [0, 1]; for each t = 1, 2,…, U(t) = 2U(t − 1) if U(t − 1) ≤ 1/2, and U(t) =
2U(t − 1) − 1 if U(t − 1) > 1/2. The marginal distributions of U(t), t = 0, 1,… are
uniformly distributed on [0, 1], but, in contrast to the case of independent identically
distributed random variables, the entire sequence can be predicted from knowledge of U(0).
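A quick numerical check (assuming NumPy; ten doubling steps stay well within double precision) illustrates both halves of this statement:

```python
import numpy as np

rng = np.random.default_rng(4)

# U(t) = 2U(t-1) mod 1 is deterministic given U(0): each step simply
# shifts the binary expansion of U(0) one place to the left.
u = rng.uniform(0.0, 1.0, size=100_000)       # many draws of U(0)
for _ in range(10):
    u = np.where(u <= 0.5, 2 * u, 2 * u - 1)  # one step for every draw

# Yet the marginal distribution of U(t) remains uniform on [0, 1].
print(u.mean(), u.var())                      # close to 1/2 and 1/12
```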
Markovian processes
A stochastic process is called Markovian (after the Russian mathematician Andrey
Andreyevich Markov) if at any time t the conditional probability of an arbitrary future
event given the entire past of the process—i.e., given X(s) for all s ≤ t—equals the
conditional probability of that future event given only X(t). Thus, in order to make a
probabilistic statement about the future behaviour of a Markov process, it is no more
helpful to know the entire history of the process than it is to know only its current state.
The conditional distribution of X(t + h) given X(t) is called the transition probability of
the process. If this conditional distribution does not depend on t, the process is said to
have “stationary” transition probabilities. A Markov process with stationary transition
probabilities may or may not be a stationary process in the sense of the preceding
paragraph. If Y1, Y2,… are independent random variables and X(t) = Y1 +⋯+ Yt, the
stochastic process X(t) is a Markov process. Given X(t) = x, the conditional probability
that X(t + h) belongs to an interval (a, b) is just the probability that Yt + 1 +⋯
+ Yt + h belongs to the translated interval (a − x, b − x); and because of independence this
conditional probability would be the same if the values of X(1),…, X(t − 1) were also
given. If the Ys are identically distributed as well as independent, this transition
probability does not depend on t, and then X(t) is a Markov process
with stationary transition probabilities. Sometimes X(t) is called a random walk, but this
terminology is not completely standard. Since both the Poisson process and Brownian
motion are created from random walks by simple limiting processes, they, too, are
Markov processes with stationary transition probabilities. The Ornstein-Uhlenbeck
process defined as the solution (19) to the stochastic differential equation (18) is also a
Markov process with stationary transition probabilities.
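For the simplest case Yi = ±1 with probability 1/2 each, both the stationarity of the transition probabilities and the Markov property can be checked empirically (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)

# X(t) = Y1 + ... + Yt with iid steps +1 or -1: a Markov process with
# stationary transition probabilities.
steps = rng.choice([-1, 1], size=(200_000, 30))
x = np.cumsum(steps, axis=1)                  # column t-1 holds X(t)

# P{X(t + 1) = 3 | X(t) = 2} equals 1/2 whatever the value of t ...
for t in (10, 20):
    given = x[:, t - 1] == 2
    print((x[given, t] == 3).mean())          # ~ 0.5 for both t

# ... and conditioning on more of the past changes nothing.
given_past = (x[:, 19] == 2) & (x[:, 18] == 1)
print((x[given_past, 20] == 3).mean())        # still ~ 0.5
```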
The Ornstein-Uhlenbeck process and many other Markov processes with stationary
transition probabilities behave like stationary processes as t → ∞. Roughly speaking, the
conditional distribution of X(t) given X(0) = x converges as t → ∞ to a distribution,
called the stationary distribution, that does not depend on the starting value X(0) = x.
Moreover, with probability 1, the proportion of time the process spends in any subset of
its state space converges to the stationary probability of that set; and, if X(0) is given the
stationary distribution to begin with, the process becomes a stationary process. The
Ornstein-Uhlenbeck process defined in equation (19) is stationary if V(0) has a normal
distribution with mean 0 and variance σ²/(2mf).
The long run behaviour of the Ehrenfest process can be inferred from general theorems
about Markov processes in discrete time with discrete state space and stationary
transition probabilities. Let T(j) denote the first time t ≥ 1 such that X(t) = j and set T(j)
= ∞ if X(t) ≠ j for all t. Assume that for all states i and j it is possible for the process to go
from i to j in some number of steps—i.e., P{T(j) < ∞|X(0) = i} > 0. If the equations
Q(j) = ΣiQ(i)P{X(1) = j|X(0) = i} for all states j (20)
have a solution Q(j) that is a probability distribution—
i.e., Q(j) ≥ 0, and ΣQ(j) = 1—then that solution is unique and is the stationary
distribution of the process. Moreover, Q(j) = 1/E{T(j)|X(0) = j}; and, for any initial
state j, the proportion of time t that X(t) = i converges with probability 1 to Q(i).
For the special case of the Ehrenfest process, assume that N is large and X(0) = 0.
According to the deterministic prediction of the second law of thermodynamics,
the entropy of this system can only increase, which means that X(t) will steadily increase
until half the molecules are on each side of the membrane. Indeed, according to the
stochastic model described above, there is overwhelming probability that X(t) does
increase initially. However, because of random fluctuations, the system occasionally
moves from configurations having large entropy to those of smaller entropy and
eventually even returns to its starting state, in defiance of the second law of
thermodynamics.
The accepted resolution of this contradiction is that the length of time such a system
must operate in order that an observable decrease of entropy may occur is so
enormously long that a decrease could never be verified experimentally. To consider
only the most extreme case, let T denote the first time t ≥ 1 at which X(t) = 0—i.e., the
time of first return to the starting configuration having all molecules on the right-hand
side of the membrane. It can be verified by substitution in equation (20) that the
stationary distribution of the Ehrenfest model is the binomial distribution Q(j) =
{N!/[j!(N − j)!]}(1/2)^N and hence E(T) = 1/Q(0) = 2^N. For example, if N is only 100 and transitions occur
at the rate of 10⁶ per second, E(T) is of the order of 10¹⁵ years. Hence, on the macroscopic
scale, on which experimental measurements can be made, the second law of
thermodynamics holds.
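A small simulation makes these magnitudes concrete (assuming NumPy; the step rule below, moving one uniformly chosen molecule across the membrane, is the standard description of the Ehrenfest model, and N = 20 keeps 2^N manageable):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(6)

# Ehrenfest model with N molecules; X(t) counts those on the left side.
# One molecule is chosen uniformly at random and moved across, so X steps
# down with probability X/N and up with probability 1 - X/N.
N, T = 20, 1_000_000
x, visits = 0, np.zeros(N + 1)
for _ in range(T):
    x += -1 if rng.random() < x / N else 1
    visits[x] += 1

# The time proportions approach the stationary binomial distribution ...
Q = np.array([comb(N, j) for j in range(N + 1)]) / 2**N
print(np.abs(visits / T - Q).max())           # small

# ... and the mean return time to the empty state is 1/Q(0) = 2**N.
print(2**N)                                   # 1,048,576 steps even for N = 20
```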
The symmetric random walk
A Markov process that behaves in quite different and surprising ways is the symmetric
random walk. A particle occupies a point with integer coordinates in d-
dimensional Euclidean space. At each time t = 1, 2,… it moves from its present location
to one of its 2d nearest neighbours with equal probabilities 1/(2d), independently of its
past moves. For d = 1 this corresponds to moving a step to the right or left according to
the outcome of tossing a fair coin. It may be shown that for d = 1 or 2 the particle returns
with probability 1 to its initial position and hence to every possible position infinitely
many times, if the random walk continues indefinitely. In three or more dimensions, at
any time t the number of possible steps that increase the distance of the particle from
the origin is much larger than the number decreasing the distance, with the result that
the particle eventually moves away from the origin and never returns. Even in one or
two dimensions, although the particle eventually returns to its initial position, the
expected waiting time until it returns is infinite, there is no stationary distribution, and
the proportion of time the particle spends in any state converges to 0!
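The contrast between d ≤ 2 and d ≥ 3 is visible in modest simulations (a sketch assuming NumPy; a finite horizon can only suggest recurrence, not prove it):

```python
import numpy as np

rng = np.random.default_rng(7)

def returns_to_origin(d, n_steps=20_000):
    """Does one d-dimensional symmetric random walk revisit the origin
    within n_steps? Each step is +/- a coordinate vector, prob. 1/(2d)."""
    eye = np.eye(d, dtype=np.int64)
    moves = np.concatenate([eye, -eye])
    path = np.cumsum(moves[rng.integers(0, 2 * d, size=n_steps)], axis=0)
    return bool((np.abs(path).sum(axis=1) == 0).any())

for d in (1, 2, 3):
    freq = np.mean([returns_to_origin(d) for _ in range(400)])
    print(f"d = {d}: returned within 20,000 steps in {freq:.0%} of walks")

# d = 1, 2: the fraction keeps climbing toward 1 as the horizon grows;
# d = 3: it stabilizes near Polya's return probability of about 0.34.
```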
Queuing models
The simplest service system is a single-server queue, where customers arrive, wait their
turn, are served by a single server, and depart. Related stochastic processes are the
waiting time of the nth customer and the number of customers in the queue at time t.
For example, suppose that customers arrive at times 0 = T0 < T1 < T2 <⋯ and wait in
a queue until their turn. Let Vn denote the service time required by the nth customer, n =
0, 1, 2,…, and set Un = Tn − Tn − 1. The waiting time, Wn, of the nth customer satisfies the
relation W0 = 0 and, for n ≥ 1, Wn = max(0, Wn − 1 + Vn − 1 − Un). To see this, observe that
the nth customer must wait for the same length of time as the (n − 1)th customer plus
the service time of the (n − 1)th customer minus the time between the arrival of the (n −
1)th and nth customer, during which the (n − 1)th customer is already waiting but
the nth customer is not. An exception occurs if this quantity is negative, and then the
waiting time of the nth customer is 0. Various assumptions can be made about the input
and service mechanisms. One possibility is that customers arrive according to a Poisson
process and their service times are independent, identically distributed random
variables that are also independent of the arrival process. Then, in terms of Yn = Vn − 1 − Un,
which are independent, identically distributed random variables, the recursive relation
becomes Wn = max(0, Wn − 1 + Yn), as illustrated in the sketch below.
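A direct simulation of the recursion (assuming NumPy; the parameters are hypothetical: Poisson arrivals at rate 1 and exponential service times with mean 0.8, a stable M/M/1 queue):

```python
import numpy as np

rng = np.random.default_rng(8)

# Waiting times from W_n = max(0, W_{n-1} + Y_n), Y_n = V_{n-1} - U_n
# (illustrative rates: lambda = 1, service rate mu = 1.25, so the
# traffic intensity rho = lambda/mu = 0.8 and the queue is stable).
n = 500_000
U = rng.exponential(1.0, size=n)    # interarrival times U_n
V = rng.exponential(0.8, size=n)    # service times V_n

W = np.zeros(n)
for i in range(1, n):
    W[i] = max(0.0, W[i - 1] + V[i - 1] - U[i])

# For this M/M/1 queue the long-run mean wait is rho/(mu - lambda) = 3.2.
print(W[n // 2:].mean())
```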
Martingale theory
A stochastic process fn, where each fn is a function of the observations X1,…, Xn, is
called a martingale if, for every n, E(fn + 1|X1,…, Xn) = fn. If the Xn are the gains
in successive trials of a game of chance and fn is the fortune of a gambler after the nth
trial, then the martingale condition says that the game is absolutely fair in the sense
that, no matter what the past history of the game, the gambler’s conditional expected
fortune after one more trial is exactly equal to his present fortune. For example,
let X0 = x, and for n ≥ 1 let Xn equal 1 or −1 according as a coin having probability p of
heads and q = 1 − p of tails turns up heads or tails on the nth toss. Let Sn = X0 +⋯+ Xn.
Then fn = Sn − n(p − q) and fn = (q/p)^Sn are martingales. One of the basic results of
martingale theory is that, if the gambler is free to quit the game at any time using any
strategy whatever, provided only that this strategy does not foresee the future, then the
game remains fair. This means that, if N denotes the stopping time at which the
gambler’s strategy tells him to quit the game, so that his final fortune is fN, then
E(fN) = f0. (21)
Strictly speaking, this result is not true without some additional conditions that must be
verified for any particular application. To see how efficiently it works, consider once
again the problem of gambler’s ruin and let N be the first value of n such that Sn = 0
or m; i.e., N denotes the random time at which ruin first occurs and the game ends. In
the case p = 1/2, application of equation (21) to the martingale fn = Sn, together with the
observation that fN = either 0 or m, yields the equalities x = f0 = E(fN|f0 = x) = m[1
− Q(x)], which can be immediately solved to give the answer in equation (6). For p ≠
1/2, one uses the martingale fn = (q/p)^Sn and similar reasoning to obtain
Q(x) = [(q/p)^x − (q/p)^m]/[1 − (q/p)^m].
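The stopping argument is easy to test by simulation (assuming NumPy; p = 0.45, x = 5, and m = 10 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)

p, x, m = 0.45, 5, 10
q = 1 - p
r = q / p

def final_fortune(rng):
    """Play until S_n first hits 0 (ruin) or m; return that final value."""
    s = x
    while 0 < s < m:
        s += 1 if rng.random() < p else -1
    return s

finals = np.array([final_fortune(rng) for _ in range(100_000)])

# Optional stopping for the martingale f_n = (q/p)**S_n: E(f_N) = f_0.
print((r ** finals).mean(), r ** x)

# Ruin probability from the same identity, versus the simulated frequency.
print((finals == 0).mean(), (r**x - r**m) / (1 - r**m))
```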
Basic martingale theory and many of its applications were developed by the American
mathematician Joseph Leo Doob during the 1940s and ’50s following some earlier
results due to Paul Lévy. Subsequently it has become one of the most powerful tools
available to study stochastic processes.
David O. Siegmund