Fundamental stochastic processes

Mark Cerenzia∗

1 From Bernoulli to Poisson Processes


Recall X has the Bernoulli distribution with parameter p ∈ [0, 1], written X ∼ Bernoulli(p), if
X = 1 with probability p and X = 0 with probability 1 − p. It serves as a model for whether a flip of a p-coin lands heads. In this section, we interpret heads as a “success” and tails as a “failure”. For repeated flips of the p-coin, we ask: how many successes do we have by the nth flip? How long until the first success? How long between successes?
Let X_1, X_2, . . . be i.i.d. random variables with the Bernoulli(p) distribution. The sequence
X_1, X_2, . . . is sometimes referred to as a Bernoulli process, a sequence of successes and failures. To
record the number of successes by the nth flip, let1

S_n := \sum_{k=1}^{n} X_k, \quad n ≥ 1.

We saw that S_n has the Binomial distribution with parameters n ∈ N and p ∈ [0, 1], written
S_n ∼ Binomial(n, p), which recall is given by2

P(S_n = k) = \binom{n}{k} p^k (1 − p)^{n−k}, \quad k = 0, 1, . . . , n.
Using linearity of expectation, we also computed the expected number of successes by the nth flip
as E[Sn ] = n · p.
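To make this concrete, here is a minimal simulation sketch (in Python, assuming NumPy is available; the parameter values and variable names are my own choices, not from the notes) of a Bernoulli process and the success count S_n, checking E[S_n] = np empirically.

import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.3, 50, 100_000

# Each row is one Bernoulli process X_1, ..., X_n of p-coin flips.
X = rng.random((trials, n)) < p

# S_n = number of successes by the nth flip, for each simulated process.
S_n = X.sum(axis=1)

print("empirical mean of S_n:", S_n.mean())   # should be close to n*p = 15
print("theoretical mean n*p :", n * p)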
Interpreting n here as (discrete) “time”, {Sn }n≥1 is our first example of a stochastic process.3 To
determine a stochastic process in general, it is not enough to give the fixed-time distributions µ_{S_n}
of S_n for each n ∈ N. One further requires the joint distributions of the process at various times,
which explain how the values at different times are related.
For example, given n, m ∈ N, what is the distribution of Sn+m − Sn ? If all we knew was that
Sn ∼ Bin(n, p) and Sn+m ∼ Bin(n + m, p), then we would not be able to determine how Sn+m − Sn
is distributed. However, we understand the random model that produced Sn , so we have
S_{n+m} − S_n = \sum_{k=1}^{n+m} X_k − \sum_{k=1}^{n} X_k = \sum_{k=n+1}^{n+m} X_k ∼ Binomial(m, p).


∗ Although the choice of presentation and much of the commentary is my own, these notes draw heavily from many
sources: Gabor Székely’s “Paradoxes in Probability Theory and Mathematical Statistics”, Ramon van Handel’s ORF 309
notes, and Saeed Ghahramani’s “Fundamentals of Probability, with Stochastic Processes”.
1 Think “S” for “S”uccesses.
2 The fact that these probabilities sum to 1 (based on probabilistic intuition) is a proof of the Binomial theorem,
i.e., (p + (1 − p))^n = 1 = \sum_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k}.
3 Loosely speaking, a stochastic process is simply a collection of random variables (X_t)_{t∈T} indexed by some set T
representing time; usually, T = N or T = [0, +∞).

In words, Sn+m − Sn counts successes during the time period (n, n + m] ∩ N. There are at least two
important observations to make here:
1. (Stationarity) The distribution Binomial(m, p) only depends on the length of this time in-
terval, i.e., Length((n, n + m]) = (n + m) − n = m, and the given parameter p. Moreover,
Binomial(m, p) is the common distribution of S_{n+m} − S_n and S_m for any n ≥ 1. Intuitively,
if we arrived late to counting successes, say at some later time n > 1, then we will only have
observed the subsequent m trials, whose success count is exactly S_{n+m} − S_n; but this should
have the same distribution as the success count S_m from watching the first m trials.
2. (Independent Increments) The quantity Sn+m − Sn is called the increment on the time period
(n, n+m], and it is independent of any other increment Sv −Su on some other non-overlapping
time interval (u, v]. This is simply because the two increments are made up of sums of trials
that are independent of each other.
WARNING: Although the second property guarantees that Sn+m − Sn is independent of Sn ,
this does not imply that Sn+m is independent of Sn ! However, the former allows us to compute the
joint distribution of the latter. Indeed, we have for 0 ≤ k ≤ ℓ ≤ n + m with ℓ − k ≤ m,
P(k successes by time n and ℓ by time n + m) = P(S_n = k, S_{n+m} = ℓ)
= P(S_n = k, S_{n+m} − S_n = ℓ − k)
(Independent increments) = P(S_n = k) · P(S_{n+m} − S_n = ℓ − k)
(Stationarity) = P(S_n = k) · P(S_m = ℓ − k)
(Binomial distribution) = \binom{n}{k} p^k (1 − p)^{n−k} · \binom{m}{ℓ−k} p^{ℓ−k} (1 − p)^{m−(ℓ−k)}.
Hence, the joint probability mass function is given by

f_{(S_n, S_{n+m})}(k, ℓ) := P(S_n = k, S_{n+m} = ℓ) = \binom{n}{k} \binom{m}{ℓ−k} p^{ℓ} (1 − p)^{n+m−ℓ}, \quad 0 ≤ k ≤ ℓ ≤ n + m, \ ℓ − k ≤ m.
Notice even though we do not have independence for Sn and Sn+m , we nevertheless exploited an
underlying independence to compute their joint distribution.
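As an illustrative sanity check (not part of the original notes), one can compare this joint pmf formula against a Monte Carlo estimate; the sketch below assumes NumPy and uses arbitrary choices of n, m, p, k, ℓ.

import numpy as np
from math import comb

rng = np.random.default_rng(1)
n, m, p = 10, 5, 0.4
k, l = 3, 5            # we estimate P(S_n = k, S_{n+m} = l)

X = rng.random((200_000, n + m)) < p
Sn, Snm = X[:, :n].sum(axis=1), X.sum(axis=1)

empirical = np.mean((Sn == k) & (Snm == l))
formula = comb(n, k) * comb(m, l - k) * p**l * (1 - p)**(n + m - l)

print(empirical, formula)   # the two numbers should agree closely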

Arrival times and Geometric distribution


For each k ≥ 1, let Tk denote the first time of the kth success. More formally,
Tk := min{n ∈ N : Sn = k}.
Using the concepts just reviewed above, we can quickly compute its distribution: for n ≥ k,

f_{T_k}(n) = P(T_k = n) = P(S_{n−1} = k − 1, X_n = 1) \overset{\text{ind.}}{=} P(S_{n−1} = k − 1) · P(X_n = 1) = \binom{n−1}{k−1} p^k (1 − p)^{n−k}.
Thus, T_k has the negative Binomial distribution with parameters k ∈ N and p ∈ [0, 1]. Notice we arrived
at the interesting formula \sum_{n=k}^{∞} \binom{n−1}{k−1} p^k (1 − p)^{n−k} = 1 purely through probabilistic reasoning.
The random variable T_1 (the case k = 1) has the simple form f_{T_1}(n) = (1 − p)^{n−1} p, which we already
derived in an example above. It is called the geometric distribution with parameter p, written T_1 ∼
Geom(p). A basic computation with this distribution serves to corroborate a well-known adage:
(Murphy’s Law) “Anything that can go wrong will go wrong.”

If we masochistically think of a “success” as something “going wrong,” then the event that something
goes wrong is exactly {T1 < +∞}. The statement “can go wrong” is given by the assumption that
p ∈ (0, 1]. Intuitively, we can reason this event is sure to happen as follows:
P(T_1 < +∞) = 1 − P(X_k = 0 for all k) \overset{\text{ind.}}{=} 1 − \lim_{n→∞} \prod_{k=1}^{n} P(X_k = 0) = 1 − \lim_{n→∞} (1 − p)^n = 1 − 0 = 1.

However, this argument also shows that if there is no chance of success, i.e., p = 0, then of course
a success will never occur, so P(T1 < +∞) = 0 and P(T1 = +∞) = 1.
To take Murphy’s law with a grain of salt, notice that its conclusion requires an experiment to
be allowed to run indefinitely, which is not even approximately obtainable for many improbable real
world possibilities.
But there is something useful Murphy’s law has given us: a derivation of the geometric series
formula! Letting q := 1 − p, we know that if p > 0, or equivalently q < 1, then we have that
1 = P(T_1 < +∞) = \sum_{k=1}^{∞} q^{k−1} p \quad \Longrightarrow \quad \frac{1}{1 − q} = \sum_{k=0}^{∞} q^k.

This formula in turn allows us to compute its expectation: if p > 0 (q < 1), then
E[T_1] = \sum_{n=1}^{∞} n · P(T_1 = n) = \sum_{n=1}^{∞} n · q^{n−1} p = p · \sum_{n=1}^{∞} \frac{d}{dq}(q^n) = p · \frac{d}{dq} \sum_{n=1}^{∞} q^n = p · \frac{d}{dq}\left(\frac{q}{1 − q}\right) = \frac{p}{(1 − q)^2} = \frac{1}{p}.

We emphasize the conclusion of these manipulations separately because it is so important.

⋆ How long we expect to wait before a success is equal to one over the chance of success. ⋆

Fortunately, you do not need to remember all these manipulations to remember this result: very
loosely, p is the fraction of successes per time, so 1/p should represent amount of time per success!
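A quick simulation sketch corroborates the starred statement (assuming NumPy; the value of p is an arbitrary choice).

import numpy as np

rng = np.random.default_rng(2)
p, trials = 0.2, 100_000

# rng.geometric returns the number of flips until (and including) the first
# success, i.e., a sample of T_1 with pmf (1-p)^{n-1} p on n = 1, 2, ...
T1 = rng.geometric(p, size=trials)

print("empirical  E[T_1]:", T1.mean())
print("predicted  1 / p :", 1 / p)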

Interarrival times
Next, we consider the interarrival times of successes, given by Tk − Tk−1 for k ≥ 1 with T0 ≡ 0. In
fact we do not have much to observe beyond the calculation
P(T_1 = n_1, T_2 − T_1 = n_2, . . . , T_k − T_{k−1} = n_k) = (q^{n_1−1} p) · (q^{n_2−1} p) · · · (q^{n_k−1} p) = \prod_{ℓ=1}^{k} P(T_1 = n_ℓ).

We emphasize the conclusion of this result separately because it is so important.

⋆ The interarrival times of successes Tk − Tk−1 are i.i.d. with common distribution Geom(p). ⋆

Remark. The two results emphasized above in the starred displays are extremely useful for per-
forming computations and will be required for many problems. The reader should make sure to
remember them, even if all the details are not so clear at this stage for you.

Before we leave this section, there are two interesting applications of our work above:

• If p > 0 (q < 1), then Murphy’s law assures us there will be a success eventually...but will
that be the only one? Because the counting of successes starts “afresh” after the first one,
this law applies again to assure us of a second success...and a third and a fourth....To sketch,
P(T_k − T_{k−1} < ∞ for all k) \overset{\text{ind.}}{=} \lim_{n→∞} \prod_{k=1}^{n} P(T_k − T_{k−1} < ∞) \overset{\text{ident. dist.}}{=} \lim_{n→∞} \prod_{k=1}^{n} P(T_1 < ∞) = \lim_{n→∞} 1^n = 1.

Thus, successes will occur “infinitely often” (or things can go wrong over and over...).

• As a second application, we compute E[Tk ]:4


E[T_k] = E[(T_k − T_{k−1}) + · · · + (T_2 − T_1) + T_1] = \underbrace{1/p + · · · + 1/p}_{k \text{ times}} = \frac{k}{p}.

Poisson Limit Theorem5


We now want to use the tools above to model the real world situation of random arrivals, say, of
customers at a store.6 Let λ > 0 denote the average number of customers to arrive in a given hour.
But what are the distributions of the number of customer arrivals in a given hour and of the
interarrival times? Denote the former random number by N_1 (“1” for one unit of time, which we
take to be hours here).
We need to make more precise modeling assumptions to derive these distributions:

⋆ Arrivals occur with equal probability and independently “at every time” ⋆

However, this assumption is hard to pin down mathematically because one thinks of time as a
continuum, so “at every time” is a difficult requirement to meet. But we will endeavor to achieve
it as a limit of its discrete counterpart. Namely, suppose we divide a given hour into n equal
subintervals and let Xi be i.i.d. Bernoulli(p) random variables indicating whether a customer
arrives in the ith subinterval, for some p ∈ [0, 1]. Then the number of customers who arrive in
the given hour is given by S_n := \sum_{k=1}^{n} X_k ∼ Binomial(n, p). Consequently, the average number of
customers to arrive in the hour is E[S_n] = np. But we already assumed at the outset that λ > 0
is this average, so we have the relationship λ = np, or p = λ/n. Since we require that p ∈ [0, 1], we
must have λ ≤ n (but we intend on taking n large soon, so this is no issue).
Hence, we have an approximation of the number N1 of customers and thus an approximation of
its distribution: for any k ≥ 0,
P(N_1 = k) ≈ P(S_n = k) \overset{\text{Binomial}(n,p)}{=} \binom{n}{k} \left(\frac{λ}{n}\right)^k \left(1 − \frac{λ}{n}\right)^{n−k}
= \frac{λ^k}{k!} · \frac{n}{n} · \frac{n−1}{n} · · · \frac{n−(k−1)}{n} · \left(1 − \frac{λ}{n}\right)^{n} · \left(1 − \frac{λ}{n}\right)^{−k}
\xrightarrow{\, n→∞ \,} \frac{λ^k}{k!} · 1 · \exp(−λ) · 1.
4 Notice this would be much harder to compute using the distribution of T_k that we derived above!
5 Some authors refer to this as the “law of rare events” since, as the reader will see, we take the parameter p of
success to decrease as the time interval decreases.
6 Though there are many other types of arrivals: typos as you read a book, shots fired by an enemy, buses and
passengers at a stop, etc.

Hence, the distribution of the number N1 of arrivals of customers in a given unit of time (an hour
in this case) should be given by
P(N_1 = k) = e^{−λ} \frac{λ^k}{k!}, \quad k ≥ 0.
This is called the Poisson distribution of rate or intensity λ > 0, written N1 ∼ Poisson(λ). Notice
that we started this section with the assumption E[N1 ] = λ, and it is an exercise for the reader to
check our derivation preserves this point.
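The convergence can also be checked numerically by comparing Binomial(n, λ/n) probabilities with Poisson(λ) probabilities for a large n. The following sketch uses only the Python standard library; λ = 3 and n = 1000 are arbitrary choices of mine.

from math import comb, exp, factorial

lam, n = 3.0, 1000
p = lam / n

# Compare P(S_n = k) for S_n ~ Binomial(n, lam/n) with the Poisson(lam) pmf.
for k in range(8):
    binom_pmf = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson_pmf = exp(-lam) * lam**k / factorial(k)
    print(k, round(binom_pmf, 6), round(poisson_pmf, 6))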
Of course, we can repeat this argument for any amount of time t and ask about the number N_t of
arrivals by time t. It is an exercise for the reader to repeat the above derivation in this case to see
that λ will be scaled by a factor of t, i.e., N_t ∼ Poisson(λt), so that

P(N_t = k) = e^{−λt} \frac{(λt)^k}{k!}, \quad k ≥ 0.
Thus, thought of as a stochastic process, N_t inherits the main properties of stationary and
independent increments that we saw S_n possessed; namely, the distribution

Nt − Ns ∼ Nt−s ∼ Poisson(λ(t − s))

only depends on the length of the interval (s, t], 0 ≤ s ≤ t, and these increments are independent for
non-overlapping time intervals, i.e., Nt − Ns is independent of Nv − Nu whenever (s, t] ∩ (u, v] = ∅.
We now ask different questions: how long τ1 do we have to wait before the first customer arrives
or how long σk = τk − τk−1 between customer arrival times τk ? For τ1 , the answer is easy: for any
time t ≥ 0,
P(τ_1 > t) = P(N_t = 0) = \exp(−λt) · \frac{(λt)^0}{0!} = \exp(−λt).
Similarly, letting τk , k ≥ 1, denote the arrival time of the kth customer, we have
P(τ_k > t) = P(N_t < k) = \exp(−λt) · \sum_{ℓ=0}^{k−1} \frac{(λt)^ℓ}{ℓ!}.

We have thus tripped over our first random variables that are “continuous”, i.e., they may assume
any of a continuum of real numbers. This suggests we should posit that there exists a function
f_{τ_1}(x) such that P(τ_1 ≤ t) = \int_{−∞}^{t} f_{τ_1}(x) dx. By the fundamental theorem of calculus, we identify
this function as f_{τ_1}(t) = \frac{d}{dt} P(τ_1 ≤ t) = \frac{d}{dt} [1 − P(τ_1 > t)]. We record the results of this observation
as a definition.
Definition 1.1. A random variable τ has the Exponential distribution with rate λ > 0, written
τ ∼ Exp(λ), if it has pdf
fτ (t) := λ exp(−λt) 1[0,+∞) (t).
A random variable γ has the Gamma distribution with shape k ≥ 1 and rate λ > 0, written
γ ∼ Γ(k, λ), if it has pdf
f_γ(t) := \frac{t^{k−1}}{(k−1)!} · λ^k \exp(−λt) 1_{[0,+∞)}(t).
Notice in particular that Γ(1, λ) = Exp(λ). Also, in both cases, the indicator “1[0,+∞) (t)” serves to
tell us the random variables have the positive real line as their supports.

A straightforward calculation gives the expected amount of time before a customer arrives:
E[τ_1] = \int_{−∞}^{∞} t f_{τ_1}(t) dt = \int_{0}^{∞} t λe^{−λt} dt \overset{\text{integration by parts}}{=} 1/λ.

This last calculation is also quite intuitive: our original assumption that “λ customers per unit
of time” can be rewritten “1 customer per 1/λ unit of time,” which in turn can be rewritten “we
expect 1/λ units of time to pass before a customer arrives.”
However, it is much harder to compute E[τk ] using the Γ(k, λ) density. Fortunately, using our
probabilistic intuition, we know that we can write (setting τ0 ≡ 0)
τ_k = \sum_{ℓ=1}^{k} [τ_ℓ − τ_{ℓ−1}],

and since, by our work in the discrete case, the interarrival times σ_k := τ_k − τ_{k−1} should all be distributed
as τ_1 ∼ Exp(λ), we have

E[τ_k] = E\left[\sum_{ℓ=1}^{k} σ_ℓ\right] = \sum_{ℓ=1}^{k} E[τ_ℓ − τ_{ℓ−1}] = \sum_{ℓ=1}^{k} \frac{1}{λ} = \frac{k}{λ}.

Remark. The reader should pause for a moment and reflect on what just occurred. It is a regular
theme of probability theory that random variables, despite having a more complicated distribution,
can admit an interpretation that reveals they are built from parts with simpler distributions that are good
enough for answering simple questions (think about the quantities of the birthday problem examples
above, which admitted an expression as sums of certain indicators, and thus their expectations were
easy to compute).
In this case, the gamma distribution with shape k and rate λ of τk is realized as a sum of i.i.d.
random variables σℓ , 1 ≤ ℓ ≤ k, each with exponential distribution. The astute reader may also
notice this is just a continuous analog of how the negative binomial distribution with parameters k
and p is a sum of k many Geometric(p) random variables, which we saw before.
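The remark can be checked by simulation: summing k i.i.d. Exp(λ) interarrival times produces a Γ(k, λ) sample, whose mean is k/λ. A sketch assuming NumPy; the values of λ and k below are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
lam, k, trials = 2.0, 4, 100_000

# Each tau_k is a sum of k i.i.d. Exp(lam) interarrival times sigma_1,...,sigma_k.
sigma = rng.exponential(scale=1 / lam, size=(trials, k))
tau_k = sigma.sum(axis=1)

print("empirical E[tau_k]:", tau_k.mean())
print("predicted  k / lam:", k / lam)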

Example 1.1. Suppose that at a certain bus station, buses arrive according to a Poisson process
with rate λ = 2.

1. If we wait for two hours, what is the probability that we have not seen a bus yet?

2. Given that no bus has arrived in the first two hours, what is the conditional probability that
a bus will arrive in the next hour?

3. What is the expected time that the second bus arrives?

Let A_t := the number of buses that have arrived by hour t, so that A_t ∼ Pois(2t). Then

P(A_2 = 0) = \frac{4^0 e^{−4}}{0!} = e^{−4}.
Notice that A_3 − A_2 is independent of the event {A_2 = 0}. Recall A_t − A_s ∼ Pois(2(t − s)).

P(A3 − A2 ≥ 1|A2 = 0) = P(A3 − A2 ≥ 1) = 1 − P(A3 − A2 < 1) = 1 − P(A3 − A2 = 0) = 1 − e−2 .

Let τi = time of the i−th arrival. Then P(τ1 > t) = P(At = 0) = e−2t , and P(τ1 ≤ t) = 1 − e−2t .
Thus the probability density function is
f_{τ_1}(t) = \begin{cases} 2e^{−2t}, & t ≥ 0, \\ 0, & \text{otherwise.} \end{cases}

Similarly, τ2 −τ1 has the same exponential distribution as τ1 , and hence the same probability density
function. Hence,
E(τ2 ) = E(τ2 − τ1 ) + E(τ1 ) = 1/2 + 1/2 = 1.
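For readers who like to double-check, here is a simulation sketch of this bus example (assuming NumPy), comparing the three answers e^{−4}, 1 − e^{−2}, and E[τ_2] = 1 with empirical estimates.

import numpy as np

rng = np.random.default_rng(4)
lam, trials = 2.0, 200_000

# Interarrival times are i.i.d. Exp(2); arrival times are their partial sums.
gaps = rng.exponential(scale=1 / lam, size=(trials, 2))
arrivals = np.cumsum(gaps, axis=1)

no_bus_in_2h = arrivals[:, 0] > 2          # {A_2 = 0} = {tau_1 > 2}
bus_in_next_hour = arrivals[:, 0] <= 3     # a bus arrives during (2, 3]

print("P(A_2 = 0):", no_bus_in_2h.mean(), "vs", np.exp(-4))
print("P(arrival in (2,3] | A_2 = 0):",
      (no_bus_in_2h & bus_in_next_hour).mean() / no_bus_in_2h.mean(),
      "vs", 1 - np.exp(-2))
print("E[tau_2]:", arrivals[:, 1].mean(), "vs", 1.0)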

Let us explicitly argue that the inter-arrival times σk := τk − τk−1 are independent and show
that each is distributed as Exp(λ). First note that

P(τ1 > t) = P(Nt = 0) = e−λt ,

so that τ_1 is exponentially distributed with parameter λ > 0. Intuitively,7 Y_t^1 := N_{τ_1+t} − N_{τ_1}, t ≥ 0,
forms a Poisson process of parameter λ > 0 independent of what happened during the period [0, τ_1].
Furthermore, σ_2 is the first arrival time of Y^1, so it is thus exponentially distributed with parameter
λ and independent of τ_1. Continuing in this manner, we argue that the σ_k are i.i.d. since they are
arrival times of Poisson processes during periods where their behavior is independent.

2 Superposition and Thinning of Poisson Processes


Let us first summarize the formalization of the Poisson process.

Definition 2.1. A Poisson process (Nt )t≥0 is a continuous time stochastic process satisfying
(a) N0 = 0;
(b) (Independent Increments) Nt − Ns is independent of Nv − Nu , for all 0 ≤ s ≤ t ≤ u ≤ v.
(c) (Stationarity) Nt − Ns is distributed as Poisson(λ(t − s)) for all 0 ≤ s ≤ t;
(d) (RCLL sample paths) t 7→ Nt is P-almost surely right continuous with left limits.

There are some basic applications from the mere existence of such a stochastic process. Consider
the following claim, which we will use below:

If X and Y are independent Poisson random variables with respective parameters µ, ν > 0,
then X + Y is Poisson of parameter µ + ν.

Simply note that, for a Poisson process (N_t)_{t≥0} of intensity 1, X ∼ N_µ ∼ Poisson(µ) and Y ∼ N_{µ+ν} − N_µ ∼ Poisson(ν),
and these two increments are independent. But then we know that X + Y ∼ N_µ + (N_{µ+ν} − N_µ) = N_{µ+ν} ∼ Poisson(µ + ν), as required.
Let us now pursue a topic to exercise our probabilistic intuition with the machinery we have
reviewed and developed. We start by establishing the following principles for Poisson processes that
are converses of one another:

1. Principle of Superposition: If (N_t^1)_{t≥0}, . . . , (N_t^k)_{t≥0} are independent Poisson processes with
respective intensities λ_1, . . . , λ_k > 0, then, defining N_t := \sum_{ℓ=1}^{k} N_t^ℓ, the process (N_t)_{t≥0} forms a
Poisson process with intensity λ := λ_1 + · · · + λ_k.
7 Rigorously establishing this result follows from our recent developments on the strong Markov property.

2. Principle of Thinning: Suppose (Nt )t≥0 is a Poisson process with intensity λ > 0 and each
arrival is assigned, independently of the Poisson process and each other, one of k types with re-
spective probabilities p1 , . . . , pk with p1 +· · ·+pk = 1. Then the processes (Nt1 )t≥0 , . . . , (Ntk )t≥0
counting the arrivals of each type are independent Poisson processes with respective rates
p1 λ, . . . , pk λ.
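Both principles are easy to check over a single unit of time by simulation: superpose by adding independent Poisson counts, and thin by labeling each arrival independently. The sketch below assumes NumPy; the rates and the thinning probability are arbitrary choices of mine.

import numpy as np

rng = np.random.default_rng(5)
trials = 100_000

# Superposition: counts of two independent Poisson processes over [0, 1].
N1 = rng.poisson(1.5, size=trials)   # intensity lambda_1 = 1.5
N2 = rng.poisson(2.5, size=trials)   # intensity lambda_2 = 2.5
print("superposed mean/var:", (N1 + N2).mean(), (N1 + N2).var(), "vs 4.0")

# Thinning: each arrival of a rate-4 process is type 1 with probability p = 0.25.
N = rng.poisson(4.0, size=trials)
type1 = rng.binomial(N, 0.25)
print("thinned mean/var   :", type1.mean(), type1.var(), "vs 1.0")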
Example 2.1. Consider the following problem.
Customers depart from a bookstore according to a Poisson process with rate λ > 0 per hour.
Each customer buys a book with probability p, independent of everything else.
[a] Find the distribution of the time until the first sale of a book.
[b] Find the probability that no books are sold during a particular hour.
[c] Find the expected number of customers who buy a book during a particular hour.
Solution Let (Nt )t≥0 represent the number of arrivals of customers, which is a Poisson process
of parameter λ > 0. Then by the principle of thinning, the process (Bt )t≥0 representing the number
of arrivals of customers that buy a book is a p-thinning of (Nt )t≥0 , i.e., (Bt )t≥0 is a Poisson process
of parameter pλ. Hence, we have the following answers:
[a] The time β_1 until the first sale of a book is the first arrival time of the process (B_t)_{t≥0}, and
hence it has distribution Exp(pλ).
[b] P(B_{t+1} − B_t = 0) = P(B_1 = 0) = e^{−pλ}
[c] E[B_{t+1} − B_t] = E[B_1] = pλ
Example 2.2. Suppose that σ1 , σ2 , ..., σn are independent with distributions σi ∼ Exp(λi ) with
possibly different rates λi > 0, i = 1, . . . , n.
1. What is the distribution of the smallest, σ := min{σ_1, . . . , σ_n}?
2. Compute P(σ = σi ) for i = 1, . . . , n.
Solution:
1. Realize the σ_i, 1 ≤ i ≤ n, as the first arrival times of independent Poisson processes (N_t^i)_{t≥0} of
respective parameters λ_i > 0. By the principle of superposition, N_t := \sum_{k=1}^{n} N_t^k is a Poisson
process of parameter λ := λ_1 + · · · + λ_n. But σ is just the first arrival time of N_t, and thus σ
is exponentially distributed with parameter λ = λ_1 + · · · + λ_n.8
2. Conversely, by the principle of thinning, Nti is a pi -thinning of Nt , for some pi , with Poisson
rate pi λ. But we know its rate is λi , so we must have pi λ = λi , i.e., we must have pi := λi /λ.
The principle of thinning also interprets pi as the probability a given arrival is of type i. But
{σ = σi } is simply the event that the first arrival has label i. Hence, we have

P(σ = σ_i) = p_i = \frac{λ_i}{λ} = \frac{λ_i}{λ_1 + · · · + λ_n}.
8 We could also just observe that, by independence,

P(σ > t) = P\left(\cap_{k=1}^{n} \{σ_k > t\}\right) = \prod_{k=1}^{n} P(σ_k > t) = \exp(−(λ_1 + · · · + λ_n)t),

giving another proof that σ is exponentially distributed with parameter λ1 + · · · + λn .
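Both conclusions of this example can be corroborated numerically; the sketch below assumes NumPy and uses arbitrary rates λ_i.

import numpy as np

rng = np.random.default_rng(6)
rates = np.array([1.0, 2.0, 3.0])        # lambda_1, lambda_2, lambda_3
trials = 200_000

sigma = rng.exponential(scale=1 / rates, size=(trials, rates.size))
minimum = sigma.min(axis=1)
argmin = sigma.argmin(axis=1)

print("E[min]    :", minimum.mean(), "vs", 1 / rates.sum())        # Exp(sum of rates)
print("P(min=σ_i):", np.bincount(argmin) / trials, "vs", rates / rates.sum())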

Example 2.3 (Impatiently waiting for Paella). Suppose that a food counter in New York City
serves Paella at a rate λ > 0. Alice is first in line while Ben is second in line. Each of them has an
exponential impatience clock of respective rates µA , µB > 0 that is independent of each other and
the arrivals of Paellas. A person leaves the line either if they are in front and a Paella is served or
if their impatience clock rings.
Question: What is the probability that Ben receives Paella before his impatience clock rings?
Let Pt represent the number of Paellas to be served by time t ≥ 0. We realize the impatience
clocks as first arrivals of independent Poisson processes denoted At , Bt . Consider the superposition
of all these arrivals, namely,
Rt := Pt + At + Bt .
Further, once Alice leaves the line, we will need to consider the superposition of just the arrivals of
Ben’s impatience ticks and the Paellas:

St := Pt + Bt .

Next, we need to define a number of first arrival times: let τ, τP , τA , τS be the first arrival times of
Rt , Pt , At , St , respectively. This allows us to define the subsequent arrivals processes

Y_t := R_{τ+t} − R_τ = Y_t^P + Y_t^B,

where

Y_t^P := P_{τ+t} − P_τ, \qquad Y_t^B := B_{τ+t} − B_τ.

Denote their first arrival times by σ, σ_P. The key point is that the processes Y_t, Y_t^P, Y_t^B are inde-
pendent of everything that occurred on the time interval [0, τ]. Hence, the desired probability is
computed rigorously as follows:

P(Ben receives Paella before getting impatient) = P({τ = min{τ_P, τ_A}} ∩ {σ = σ_P})

(Independence) = P(τ = min{τ_P, τ_A}) · P(σ = σ_P)

= \frac{λ + µ_A}{λ + µ_A + µ_B} · \frac{λ}{λ + µ_B}.

Here the first factor is the probability that the first of the three clocks to ring is not Ben’s impatience, and the
second factor is the probability that, in the restarted race between the Paellas (rate λ) and Ben’s impatience
clock (rate µ_B), a Paella arrives first.
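A direct simulation of the two-stage race (using memorylessness to restart the clocks once Alice leaves) agrees with this answer. This is only a sketch under the stated model, assuming NumPy and arbitrary rates.

import numpy as np

rng = np.random.default_rng(7)
lam, muA, muB, trials = 1.0, 0.5, 0.7, 500_000

paella = rng.exponential(1 / lam, trials)       # next Paella arrival
aliceclock = rng.exponential(1 / muA, trials)   # Alice's impatience clock
benclock = rng.exponential(1 / muB, trials)     # Ben's impatience clock

# Stage 1: Alice leaves (served or impatient) before Ben's clock rings.
alice_leaves_first = np.minimum(paella, aliceclock) < benclock

# Stage 2: by memorylessness, draw fresh clocks for the Paella-vs-Ben race.
paella2 = rng.exponential(1 / lam, trials)
benclock2 = rng.exponential(1 / muB, trials)
ben_served = alice_leaves_first & (paella2 < benclock2)

theory = (lam + muA) / (lam + muA + muB) * lam / (lam + muB)
print(ben_served.mean(), "vs", theory)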
Proof of Superposition and Thinning principles. To prove the Principle of Superposition, note that
N_t − N_s = \sum_{ℓ=1}^{k} (N_t^ℓ − N_s^ℓ) is a sum of independent random variables, each of which is independent
of {N_r}_{r≤s}. Consequently, N_t − N_s is also independent of {N_r}_{r≤s} and has distribution Pois(λ_1(t −
s) + · · · + λ_k(t − s)) = Pois(λ(t − s)). Hence, (N_t)_{t≥0} has stationary independent Poisson-distributed
increments, i.e., (N_t)_{t≥0} forms a Poisson process with intensity λ := λ_1 + · · · + λ_k.
The Principle of Thinning is a bit trickier to prove. Let us denote the k types by the labels
m_1, . . . , m_k, and let (Y_n)_{n≥1} be the sequence of independent random variables where Y_n gives the
type of the nth arrival, so Y_n is an {m_1, . . . , m_k}-valued random variable with P(Y_n = m_i) = p_i.
Then for each 1 ≤ ℓ ≤ k, consider the number of arrivals of type m_ℓ:

N_t^ℓ := \sum_{n=1}^{N_t} 1_{\{Y_n = m_ℓ\}}.

Notice that, conditionally on N_t − N_s, the increment N_t^ℓ − N_s^ℓ = \sum_{n=N_s+1}^{N_t} 1_{\{Y_n = m_ℓ\}} is Binomial(N_t − N_s, p_ℓ), so we can compute

P(N_t^ℓ − N_s^ℓ = n) = \sum_{r=0}^{∞} P(N_t^ℓ − N_s^ℓ = n | N_t − N_s = r) P(N_t − N_s = r)
= \sum_{r=n}^{∞} \binom{r}{n} p_ℓ^n (1 − p_ℓ)^{r−n} · e^{−λ(t−s)} \frac{(λ(t−s))^r}{r!}
= \sum_{r=n}^{∞} \frac{r!}{n!(r−n)!} p_ℓ^n (1 − p_ℓ)^{r−n} · e^{−λ(t−s)} \frac{(λ(t−s))^r}{r!}
= e^{−λ(t−s)} \frac{(p_ℓ λ(t−s))^n}{n!} \sum_{r=n}^{∞} \frac{(λ(t−s)(1 − p_ℓ))^{r−n}}{(r−n)!}
= e^{−p_ℓ λ(t−s)} \frac{(p_ℓ λ(t−s))^n}{n!}.
Hence, Ntℓ − Nsℓ has the desired Poisson(pℓ λ(t − s)) distribution. Further Ntℓ − Nsℓ is independent
of {Nr }r≤s since it is Binomial(Nt − Ns , pℓ ), which is independent of {Nr }r≤s since Nt − Ns is. In
summary, (Ntℓ )t≥0 is a Poisson process of intensity pℓ λ.
It remains to check that (Nt1 )t≥0 , . . . , (Ntk )t≥0 are independent. We only provide the check in
the case k = 2: for any r, s ∈ N0 ,

P(N_t^1 = r, N_t^2 = s) = P(N_t^1 = r, N_t = r + s)
= P\left(\sum_{n=1}^{r+s} 1_{\{Y_n = m_1\}} = r, \ N_t = r + s\right)
= P\left(\sum_{n=1}^{r+s} 1_{\{Y_n = m_1\}} = r\right) P(N_t = r + s)
= \binom{r+s}{r, s} (p_1)^r (p_2)^s · e^{−λt} \frac{(λt)^{r+s}}{(r+s)!}
= \frac{(r+s)!}{r! s!} (p_1)^r (p_2)^s · e^{−λt} \frac{(λt)^{r+s}}{(r+s)!}
= e^{−p_1 λt} \frac{(p_1 λt)^r}{r!} · e^{−p_2 λt} \frac{(p_2 λt)^s}{s!}
= P(N_t^1 = r) · P(N_t^2 = s).

The general case then follows by induction on the number of types k ≥ 1, and as indicated, the
calculation can be done with the use of the multinomial coefficient.

Remark. Probably the most questionable assumption above is that the rate at which customers
arrive is the same irrespective of the time of day, which goes against our experience of most real-
world arrival processes. For example, most restaurants have much higher rates of arrival during
lunch or dinner time than in the middle of the afternoon or late at night.
Mathematically, one way to deal with this limitation of our current arrival process model is to
allow the rate λ to depend on time, i.e., λ : R_+ → R_+, t ↦ λ_t. Then we simply replace
item (c) (stationarity) with the following generalization:

(c*) (Stationarity) N_t − N_s is distributed as Poisson\left(\int_s^t λ_r \, dr\right) for all 0 ≤ s ≤ t;
Such a general Poisson Process is called time inhomogeneous. When λt is constant, i.e., λt ≡ λ for
all t ≥ 0, then this reduces to the time homogeneous case we studied originally.
Example 2.4 (Waiting time paradox9). Let the interarrival times between buses at a stop be i.i.d.
with the same distribution as a random variable σ > 0. We expect the time between buses to be
µ = Eσ > 0. Suppose we arrive at the bus stop “randomly” at some time T (all that matters is
that this time T is independent of the arrivals of the buses, so we may as well take it as fixed). We
reason that we are equally likely to have just missed the most recent bus as be nearby the next one.
Thus, on average, we expect to wait µ/2 amount of time before the next bus arrives.
This argument is wrong: the waiting time may not only exceed µ/2, it can even be infinite!
To see this, let σ1 , σ2 , . . . denote i.i.d. interarrival times (not necessarily exponential), and for any
given time t ≥ 0, let Nt denote the number of arrivals by time t (again, not necessarily a Poisson
process), so that σNT +1 is the length of the interarrival interval containing T .
The key point of this example is that, given T ≥ 0, the object of interest is the distribution of
the random interarrival ω 7→ ρ(ω) := σNT (ω)+1 (ω). If we imagine the process has been running for
a “long time” before we select our arrival time T , then for any r > 0, we have
P(ρ > r) ≈ relative length of interarrival intervals greater than r = \lim_{n→∞} \frac{\sum_{k=1}^{n} σ_{k+1} 1_{\{σ_{k+1} > r\}}}{\sum_{k=1}^{n} σ_k}
= \frac{\lim_{n→∞} \frac{1}{n} \sum_{k=1}^{n} σ_{k+1} 1_{\{σ_{k+1} > r\}}}{\lim_{n→∞} \frac{1}{n} \sum_{k=1}^{n} σ_k}
(Strong law of large numbers) = \frac{E[σ · 1_{\{σ>r\}}]}{E[σ]}
= \frac{1}{Eσ} \int_r^{∞} s f_σ(s) \, ds.
Hence, the density of the random interarrival ω ↦ ρ(ω) = σ_{N_T(ω)+1}(ω) is (in the long time limit)
f_ρ(s) = \frac{1}{Eσ} s f_σ(s), giving us Eρ = \frac{1}{Eσ} \int s^2 f_σ(s) \, ds = \frac{E[σ^2]}{µ}. Thus the average waiting time is10

Average waiting time = E[ρ]/2 = \frac{1}{2} \frac{E[σ^2]}{Eσ} = \frac{1}{2} \frac{[Eσ]^2 + Var(σ)}{Eσ} = \frac{µ}{2} + \frac{Var(σ)}{2µ}.

In summary, we see that

⋆ Average waiting time depends on both the mean and variance of the interarrival distribution ⋆

Consequently, if interarrival intervals have high dispersion (high variance), very few people will
have a short wait; most will wait a long time! At the one extreme, our naive answer of “µ/2” is
only true if Var(σ) = 0, i.e., σ is constant; while at the other extreme, it may well be possible that
Var(σ) = +∞ so that average waiting time is infinite. As an intermediate scenario, if σ ∼ Exp(λ)
(like for a Poisson process), then we just computed Var(σ) = 1/λ2 , so that the average waiting time
is
\frac{1}{2}\left(µ + \frac{Var(σ)}{µ}\right) = \frac{1}{2}\left(1/λ + \frac{1/λ^2}{1/λ}\right) = 1/λ,
9 Also referred to as the “renewal-theory paradox” or “inspection paradox”. Many anomalous results in renewal
or queueing theory can find their origins with this paradox.
10 Apparently this is an instance of the Pollaczek-Khintchine formula from queueing theory.

which is exactly the full expected length of an interarrival interval.
The intuition for this result is actually quite simple: it is more likely the time T is covered
by a large interarrival interval than a small one. This simple intuition actually underlies common
statistical anomalies. For example, if we sample students at a university and ask their class size,
they might give some number that is much larger than if we asked the omniscient dean the same
question. The issue is that we are more likely to sample a student from a large class than a small
one! This naturally leads to the concept of the “biased distribution of an observer” versus the actual
distribution.
The above results are exact when (N_t)_{t≥0} is a Poisson process and σ ∼ Exp(λ). Write

ρ = σ_{N_t+1} = (τ_{N_t+1} − t) + (t − τ_{N_t}) =: Y_t + A_t,

where A_t is the age and Y_t is the excess life of the interarrival interval. Note that, by the memoryless
property, A_t and Y_t are independent, and Y_t ∼ Exp(λ). Similarly, we have for r ≥ 0

P(A_t > r) = P(N_t − N_{t−r} = 0) = e^{−λr} \implies A_t ∼ Exp(λ).

Hence, ρ is a sum of two independent exponentials, and so has distribution Γ(2, λ):

f_ρ(r) = \frac{λ^k}{(k−1)!} · r^{k−1} · e^{−λr} 1_{[0,∞)}(r) \overset{k=2}{=} λ^2 · r · e^{−λr} 1_{[0,∞)}(r) = \frac{r}{1/λ} · λe^{−λr} 1_{[0,∞)}(r) = \frac{r}{Eσ} f_σ(r),
which recovers exactly our long time average reasoning above.
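The paradox is easy to observe in a simulation: generate a long Poisson stream, inspect it at uniformly random times, and record the length of the interarrival interval covering each inspection time. A sketch assuming NumPy, with λ = 1 chosen so that the covering interval should average about E[ρ] = 2/λ = 2, twice the mean interarrival time.

import numpy as np

rng = np.random.default_rng(8)
lam, horizon = 1.0, 10_000.0

# Build one long Poisson stream of arrival times on [0, horizon].
gaps = rng.exponential(1 / lam, size=int(2 * lam * horizon))
arrivals = np.cumsum(gaps)
arrivals = arrivals[arrivals < horizon]

# Inspect at many independent uniform times and measure the covering interval.
T = rng.uniform(0, arrivals[-1], size=100_000)
idx = np.searchsorted(arrivals, T)            # index of the next arrival after T
prev = np.where(idx > 0, arrivals[idx - 1], 0.0)
covering = arrivals[idx] - prev

print("mean covering interval:", covering.mean(), "vs E[rho] = 2/lam =", 2 / lam)
print("mean interarrival time:", gaps.mean(), "vs 1/lam =", 1 / lam)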

3 Brownian motion
The i.i.d. sum S_n = \sum_{k=1}^{n} X_k also arises as a natural and famous toy model in physics.
In 1828, Robert Brown looked under a microscope at pollen particles suspended in water and was
surprised by their chaotic, jittery movements. He rejected that the pollen was alive after observing
the same phenomenon occurred for other objects in water, such as glass and minerals. In 1905,
Einstein argued that water is not a continuous fluid, like many top physicists thought at the time,
and that these unexplained movements were the aggregate effect of the many chaotic bombardments
of water molecules.
Suppose for simplicity that our i.i.d. sequence X_k, k ≥ 1, has mean µ = 0 and variance σ^2 > 0.
These will represent the random displacements of an object suspended in water. Accordingly, the
position of the object at time t should be given by

B_t^N := ϵ \sum_{k=1}^{⌊tN⌋} X_k.

We think of N as the number of bombardments per unit time and ϵ > 0 as representing that each
N 2 2
bombardment causes a small√ displacement. Notice that Var(Bt ) = ϵ · N · σ t, which will be of
constant order if ϵ = 1/ N . Intuitively, we can interpret the number N of bombardments per
unit time getting larger as including more and more molecules nearby, which requires them to be
smaller, and thus
√ their impact should also get smaller, i.e., the displacement ϵ should shrink. The
N
scaling ϵ = 1/ N is chosen so as to get a nontrivial limit Bt = limN →∞ Bt . Pn
Just as in the Poisson limit theorem to derive a Poisson process Nt from Sn = k=1 Xk ,
which works in a different, microscopic scale where the particles remain well-separated, the limiting
process Bt inherits the properties of independent increments and stationarity, and by the central
limit theorem, the increments have the normal distribution Bt − Bs ∼ N (0, σ 2 (t − s)). The process
(Bt )t≥0 is called Brownian motion with variance σ 2 > 0.
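The scaling limit can be visualized by simulating the rescaled walk B_t^N = (1/\sqrt{N}) \sum_{k ≤ ⌊tN⌋} X_k for a large N. The sketch below assumes NumPy and uses fair ±1 steps (so σ^2 = 1), an arbitrary choice of mean-zero increments.

import numpy as np

rng = np.random.default_rng(9)
N, paths = 2_000, 2_000

# Mean-zero, variance-1 increments: fair +/-1 steps.
X = rng.choice([-1.0, 1.0], size=(paths, N))

# B^N evaluated on the grid t = k/N, k = 1,...,N (one path per row).
B = np.cumsum(X, axis=1) / np.sqrt(N)

print("Var(B_1^N)  :", B[:, -1].var(), "vs sigma^2 * 1 = 1")
print("Var(B_0.5^N):", B[:, N // 2 - 1].var(), "vs sigma^2 * 0.5 = 0.5")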

Discussion
There are three probabilistic limit theorems at play above: the Poisson limit theorem for the
binomial distribution, the law of large numbers, and finally the central limit theorem. Curiously,
all three involve a sum S_n = \sum_{k=1}^{n} X_k of i.i.d. random variables...so why are we getting different
results?
A quick answer to this question is that we are looking at S_n under different scales: the law of large
numbers looks at S_n on a macroscopic scale, the central limit theorem on a mesoscopic scale11, and
the Poisson limit theorem on a microscopic scale (zoomed in to see the well-separated “points”).
This statistical mechanics point of view should offer a way to remember these limit theorems and
to compare their statements.
11 Though still macroscopic, relatively speaking. It is for this reason that some authors do not like the word
“mesoscopic” being used here.

A Some generalities for stochastic processes (optional)
A way to illustrate how σ-algebras can encode information dynamically is through the following
concept of “dynamic refinement.”
Definition A.1. Consider a probability space (Ω, F, P). Let T be an ordered set (usually either
T = Z≥0 or T = R≥0 ). A collection F = {Ft }t∈T of sub σ-algebras of F is called a filtration if
it respects the ordering of T with respect to set inclusion, in the sense

Fs ⊂ Ft , for all s ≤ t with s, t ∈ T .

Definition A.2. Given an ordered set T and a filtration F = {Ft }t∈T of sub σ-algebras of F,
a stochastic process {Xt , t ∈ T } = (Xt )t∈T is a collection of functions on the filtered probability
space (Ω, F, F = (Ft )t∈T , P) such that Xt is an Ft -random element taking values in some state
space S (endowed with some σ-algebra S). This measurability property is often referred to by
saying {Xt , t ∈ T } is F-adapted. It may be readily achieved by making the canonical choice
Ft := σ({Xs }s∈T,s≤t ).
Usually, T = Z≥0 := {0, 1, 2, 3, . . .} (discrete time) or T = R≥0 := [0, ∞) (continuous time), so
Xt is interpreted as a random evolution. The state space S where the process lives is usually R or
Rn or some discrete subset of those spaces.

Markov Processes
Definition A.3. Let T be either discrete time Z≥0 or continuous time R≥0 . Let (Xt )t∈T be a
stochastic process taking values in a discrete state space or in Rk . Then we say that (Xt )t∈T is a
Markov process if

P(Xt ∈ A|Fs ) = P(Xt ∈ A|Xs ), for any s ≤ t with s, t ∈ T and any open set A ⊂ S, (1)

We refer to (1) as the Markov property. Intuitively, it says that given the present, the past and the
future are independent.

Martingales
Definition A.4. A pair (Mt , Ft )t∈T is a martingale if for each t ≥ 0 in T , Mt is an Ft -random
variable (so S = R) satisfying E|Mt | < ∞ and for all 0 ≤ s ≤ t in T ,

E(Mt |Fs ) = Ms , or equivalently, E(Mt − Ms |Fs ) = 0. (2)

Similarly, (Mt , Ft )t∈T is a sub(/super)martingale if E(Mt |Fs ) ≥ (/ ≤)Ms .

Prototype processes
These notes reviewed three prototypes of stochastic processes with stationary independent incre-
ments. They are
1. the counts (Sn )n≥0 of successes for Bernoulli trials X1 , X2 , . . . (discrete time, discrete space);

2. the Poisson process (Nt )t≥0 of intensity λ > 0 (continuous time, discrete space);

3. the Brownian motion (Wt )t≥0 (continuous time, continuous space).

We close these notes by recasting these processes in the general framework for stochastic pro-
cesses above. The remainder of the course will involve identifying the above two key properties,
namely, the Markov and martingale properties, and using them to derive significant consequences.

Definition A.5. A discrete time counting process {(Sn , Fn ) : n ≥ 0} with success probability
p ∈ [0, 1] on a filtered probability space (Ω, F, F = (Fn )n≥0 , P) is an N0 -valued stochastic process
satisfying
(a) S0 = 0;
(b) (Independent Increments) Sn − Sm is independent of Fm for all n > m ≥ 0.
(c) (Stationarity) Sn − Sm is distributed as Binomial(n − m, p) for all n > m ≥ 0;

Because this process runs in discrete time, we do not require any sample-path regularity conditions.

Definition A.6. A Poisson process {(Nt , Ft ) : t ≥ 0} with intensity λ > 0 on a filtered probability
space (Ω, F, F = (Ft )t≥0 , P) is a continuous time N0-valued stochastic process satisfying
(a) N0 = 0;
(b) (Independent Increments) Nt − Ns is independent of Fs for all t > s ≥ 0.
(c) (Stationarity) Nt − Ns is distributed as Poisson(λ(t − s)) for all t > s ≥ 0;
(d) (RCLL sample paths) t 7→ Nt is P-almost surely right continuous with left limits

Definition A.7. A d-dimensional standard Brownian motion or Wiener process {(Wt , Ft ) : t ≥ 0}


on a filtered probability space (Ω, F, F = (Ft )t≥0 , P), is a continuous time Rd -valued stochastic
process satisfying
(a) W0 = 0;
(b) (Independent Increments) Wt − Ws is independent of Fs for all t > s ≥ 0.
(c) (Stationarity) Wt − Ws is distributed as Nd (⃗0, (t − s)Id ) for all t > s ≥ 0;12
(d) (Continuous sample paths) t 7→ Wt is P-almost surely continuous, i.e., there is a set Ω′ ⊂ Ω
with P(Ω′ ) = 1 such that for all ω ∈ Ω′ , the function t 7→ Wt (ω) is continuous on [0, ∞).

Remark. Just like a simple random walk, a Brownian motion/Wiener process is spatially homo-
geneous, time homogeneous, and Markovian.

Remark. Notice the filtration F = (Ft )t≥0 is part of the definition. In our course, the filtrations
are almost always taken to be the natural one generated from observing all the relevant processes
of the problem; in this case, FW = (FtW )t≥0 where FtW := σ(Ws ; 0 ≤ s ≤ t) is the information
available from observing the process up to time t.
However, the freedom of filtration in the definition plays an important role for the solution
theory to stochastic differential equations that are “driven by a Brownian motion”. Indeed, there,
one may need to work with a filtration strictly larger than FW in order to achieve well-posedness
of such equations.

Remark. More generally, a Lévy process is a stochastic process with stationary, independent incre-
ments, possibly admitting both discrete and continuous motions. A simple example is to take the
sum of a Poisson process and a Brownian motion.

12 This notation means a d-dimensional normal with mean ⃗0 and positive definite covariance matrix (t − s)Id .
