TMRCA Estimates
2 A Bayesian Method
We wish to write down the posterior probability of TMRCA, which we will
call T, given the mutation rates. We will start with one allele at a time.
Since we will assume that mutations in different alleles are independent, the
probabilities for all alleles are just products of the individual probabilities.
Likewise, the probability distribution of a difference is given by the correlation of the two distributions, $P(W \equiv X - Y) = \sum_i P(X_i)\,P(W + X_i)$. For distributions, like ours, that are symmetric about zero, it is easy to show that convolution and correlation are the same.
Using this information we can immediately write down the probability distribution of the markers, $M = M_0 + D$, after T generations: it is the single-generation mutation distribution convolved with itself T times, and convolutions become simple products under discrete Fourier transforms. And so if P is the probability distribution of mutations,

$P(D|T, \vec\mu) = \mathrm{DFT}^{-1}\!\left(\mathrm{DFT}(P)^T\right)$  (2)

and for calculating the difference between the haplotypes of two descendants, you would just replace T in the above equation with 2T.
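A concrete sketch of equation (2), again assuming the illustrative three-point single-step model; the grid size N only needs to be large enough that the distribution does not wrap around:

```python
import numpy as np

def p_d_given_t(mu, T, N=256):
    # Single-generation mutation distribution on a circular grid;
    # index 0 is D = 0, index N-1 is D = -1, and so on.
    p = np.zeros(N)
    p[0] = 1 - mu
    p[1] = mu / 2
    p[-1] = mu / 2
    # Equation (2): the T-fold convolution is a T-th power in Fourier space.
    return np.real(np.fft.ifft(np.fft.fft(p) ** T))

dist = p_d_given_t(mu=0.002, T=500)
print(dist[0], dist[1])   # P(D = 0), P(D = +1)
# For the difference between two descendants, use 2*T in place of T.
```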
Generally the mutations are symmetric, and so the DFT becomes the DCT, the discrete cosine transform, which we define as

$F_k = \mathrm{DCT}[f](k) = \sum_j f_j \cos(2\pi jk/N)$  (3)
This is true for any N, but we can also take N arbitrarily large, and this becomes an integral:

$P(x|T) = (1-\mu)^T\,\frac{1}{2\pi}\int_0^{2\pi} d\theta\;\cos(x\theta)\,(1+\mu\cos\theta)^T$  (10)

$\phantom{P(x|T)} = (1-\mu)^T\,\frac{1}{\pi}\int_0^{\pi} d\theta\;\cos(x\theta)\,(1+\mu\cos\theta)^T$  (11)
This integral can be evaluated in various ways. For small µ, $(1-\mu)^T \approx e^{-\mu T}$ and $(1+\mu\cos\theta)^T \approx e^{\mu T\cos\theta}$, so it can be written

$P(x|T) \approx \exp(-\mu T)\,\frac{1}{\pi}\int_0^\pi d\theta\;\cos(x\theta)\,\exp(\mu T\cos\theta)$  (12)

$\phantom{P(x|T)} = \exp(-\mu T)\, I_x(\mu T)$  (13)

where $I_x$ is the modified Bessel function of the first kind.
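Equation (13) is simple to evaluate with standard libraries; a minimal sketch (parameter values are arbitrary) comparing it against direct quadrature of equation (11):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import iv   # modified Bessel function I_x

mu, T, x = 0.002, 500, 3

# Small-mu approximation, equation (13)
approx = np.exp(-mu * T) * iv(x, mu * T)

# Direct numerical integration of equation (11)
integrand = lambda th: np.cos(x * th) * (1 + mu * np.cos(th)) ** T
exact = (1 - mu) ** T / np.pi * quad(integrand, 0, np.pi)[0]

print(approx, exact)   # nearly identical for small mu
```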
Alternatively, one can use the Chebyshev identity

$\cos(x\theta) = T_x(\cos\theta) = \sum_{i=0}^{x} a_{x,i}\,\cos^i(\theta)$  (14)

where $a_{x,i}$ are the Chebyshev polynomial coefficients. Using this and the binomial theorem, we have

$P(x|T) = (1-\mu)^T\,\frac{1}{\pi}\int_0^\pi d\theta\; T_x(\cos\theta)\,(1+\mu\cos\theta)^T$  (15)

$= (1-\mu)^T\,\frac{1}{\pi}\sum_{i=0}^{x} a_{x,i}\int_0^\pi d\theta\;\cos^i(\theta)\sum_{j=0}^{T}\binom{T}{j}\,\mu^j\cos^j(\theta)$  (16)

$= (1-\mu)^T\,\frac{1}{\pi}\sum_{i=0}^{x}\sum_{j=0}^{T} a_{x,i}\,\binom{T}{j}\,\mu^j\int_0^\pi d\theta\;\cos^{i+j}(\theta)$  (17)
Generally, the x values are small, especially for small T where this expansion is useful, so one need only tabulate the first few Chebyshev polynomial coefficients.
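A sketch of equation (17), pulling the power-basis Chebyshev coefficients from numpy rather than a hand-tabulated list, and using the classical result that the integral of cos^n(θ) over [0, π] vanishes for odd n and equals π (n−1)!!/n!! for even n:

```python
import math
from numpy.polynomial import chebyshev

def cos_power_integral(n):
    # Integral of cos^n(theta) over [0, pi]: 0 for odd n,
    # pi * (n-1)!!/n!! for even n (Wallis formula).
    if n % 2:
        return 0.0
    val = math.pi
    while n > 1:
        val *= (n - 1) / n
        n -= 2
    return val

def p_x_given_t(x, T, mu):
    # a_{x,i}: coefficients of the Chebyshev polynomial T_x in powers of cos(theta)
    a = chebyshev.cheb2poly([0] * x + [1])
    total = sum(a[i] * math.comb(T, j) * mu**j * cos_power_integral(i + j)
                for i in range(x + 1) for j in range(T + 1))
    return (1 - mu) ** T / math.pi * total

print(p_x_given_t(x=2, T=50, mu=0.002))
```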
When µ is very poorly known, one might wish to integrate over µ and write

$P(x|T) = \int_0^\infty d\mu\; P(\mu)\,\exp(-\mu T)\, I_x(\mu T)$  (20)

This can easily be done numerically for any distribution P(µ). Generally it has very little effect unless the uncertainty in µ is quite large. The integral can be computed analytically (albeit with some difficulty) when P(µ) is a Gamma distribution, a not unreasonable choice: $P(\mu) = \mu^{k-1}\,(\Gamma(k)\,\theta^k)^{-1}\exp(-\mu/\theta)$. This has mean kθ and variance kθ², so k is given by the square of the mean over the variance and θ is given by the variance over the mean. Using this we have
$P(x|T) = (\Gamma(k)\,\theta^k)^{-1}\int_0^\infty d\mu\;\mu^{k-1}\exp(-\mu/\theta)\,\exp(-\mu T)\, I_x(\mu T)$  (21)

$\phantom{P(x|T)} = (\Gamma(k)\,\theta^k)^{-1}\, T^{-k}\int_0^\infty d\tau\;\tau^{k-1}\exp(-\tau s)\, I_x(\tau)$  (22)

where we have substituted $\tau = \mu T$ and defined

$s \equiv 1 + 1/(\theta T)$  (23)
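Numerically, the µ-integration is a one-liner with quadrature; a minimal sketch (the assumed mean and spread of µ are purely illustrative):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import iv
from scipy.stats import gamma

mu_mean, mu_sd = 0.002, 0.0005       # illustrative uncertainty on mu
k = mu_mean**2 / mu_sd**2            # shape: mean squared over variance
theta = mu_sd**2 / mu_mean           # scale: variance over mean

def p_x_given_t(x, T):
    # Equation (20) with a Gamma prior on mu
    f = lambda mu: gamma.pdf(mu, a=k, scale=theta) * np.exp(-mu * T) * iv(x, mu * T)
    return quad(f, 0, np.inf)[0]

print(p_x_given_t(x=2, T=500))
```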
2.4 Variances
Since our distributions are all symmetric about zero, the mean of the distribution is always zero. The next moment of interest is the variance. Another relevant fact is that the variance of a convolution of two functions is the sum of their variances. This fact allows us to write down the variance of the distribution $P(D|T,\vec\mu)$ as

$\mathrm{Var}[P(D|T,\vec\mu)] = T\,\mathrm{Var}[P]$

where Var[P] is the variance of the original mutation distribution. For single branching, Var[P] = µ, and so $\mathrm{Var}[P(D|T,\vec\mu)] = T\mu$. As usual, we replace T with 2T when talking about the variance between two descendants rather than the variance between a descendant and the ancestor.
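A quick Monte Carlo check that the variance really is µT under the single-step mutation model (a sketch; all parameter values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, T, n_lines = 0.002, 500, 100_000

# Mutation count per line is Binomial(T, mu); each mutation is +1 or -1.
n_mut = rng.binomial(T, mu, size=n_lines)
D = 2 * rng.binomial(n_mut, 0.5) - n_mut

print(D.var(), mu * T)   # both close to 1.0 for these values
```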
It is well known that the sample mean is unbiased for the true mean but that the sample variance is not unbiased for the true variance: the expectation value for $s^2$ is $(N-1)/N\,\sigma^2$. But this bias is harmless, because the corrected statistic $N/(N-1)\,s^2$ is an unbiased estimator for $\sigma^2$.
These well-known results, however, make the assumption that the data are independent. If the data are not independent but are, rather, correlated, then this result for the sample variance is changed as follows. For clarity, we will drop the i subscript for the moment and reintroduce it later when needed. So we are just discussing the data in one allele.
$s^2 \equiv \frac{1}{N}\sum_j (D_j - m)^2$  (27)

$\phantom{s^2} = \frac{1}{N}\sum_j D_j^2 - \frac{2}{N}\sum_j D_j\, m + m^2$  (28)

Plugging in $m = N^{-1}\sum_k D_k$, this becomes

$s^2 = \frac{1}{N}\sum_j D_j^2 - \frac{2}{N^2}\sum_{jk} D_j D_k + \frac{1}{N^2}\sum_{jk} D_j D_k$  (30)

$\phantom{s^2} = \frac{1}{N}\sum_j D_j^2 - \frac{1}{N^2}\sum_{jk} D_j D_k$  (31)
Let µ be the population mean (just for now; we will use µ for mutation rates later). The data values are $D_j = \mu + \epsilon_j$, where the $\epsilon_j$ are the random deviations from the mean due to random mutations. The expectation values are $\langle\epsilon_j\rangle = 0$ (required if µ is to be the mean), and the expectation value of $\epsilon_j\epsilon_k$ defines the covariance matrix, $C_{jk} = \langle\epsilon_j\epsilon_k\rangle$. So now, we can write
$\frac{1}{N}\sum_j D_j^2 = \frac{1}{N}\sum_j (\mu + \epsilon_j)^2 = \frac{1}{N}\sum_j \left(\mu^2 + 2\mu\epsilon_j + \epsilon_j^2\right)$  (32)

The expectation value of this is $\mu^2 + \frac{1}{N}\sum_j C_{jj} = \mu^2 + N^{-1}\,\mathrm{Tr}(C)$, where Tr(C) is the trace (sum of diagonals) of the covariance matrix. Similarly, the expectation value of the second term is $\langle N^{-2}\sum_{jk} D_j D_k\rangle = \mu^2 + N^{-2}\sum_{jk} C_{jk}$, and
so we can finally write down the expectation value of the sample variance for
correlated data,
$\langle s^2\rangle = \frac{1}{N}\,\mathrm{Tr}(C) - \frac{1}{N^2}\sum_{jk} C_{jk}$  (34)

For the special case (uncorrelated and equal variances) $C_{jk} = \sigma^2\delta_{jk}$, we recover the usual result $\langle s^2\rangle = \sigma^2 - N^{-1}\sigma^2 = (N-1)/N\,\sigma^2$.
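Equation (34) is easy to verify by simulation; a sketch with an arbitrary, randomly generated covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5

A = rng.normal(size=(N, N))
C = A @ A.T                    # an arbitrary valid covariance matrix

# Many zero-mean correlated samples, each of length N
data = rng.multivariate_normal(np.zeros(N), C, size=500_000)
s2 = data.var(axis=1).mean()   # Monte Carlo estimate of <s^2>

predicted = np.trace(C) / N - C.sum() / N**2   # equation (34)
print(s2, predicted)           # agree to Monte Carlo precision
```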
Now, let's apply this to the STR data for a clade. We only need to know the covariance matrix of $D_j$. We are still just working with one allele, so we will suppress the i subscript. We already know that the variances are µT. From now on, µ will refer to mutation rates, not the mean. But what about the off-diagonal values? Here, we need to remember that the mutations are assumed
to be independent events. If two people have a pairwise TMRCA of Tjk , it
means that those people shared the exact same mutation events before that
time and, after that time, experienced independent (uncorrelated) mutations.
So it is clear that the off-diagonal covariances are given by µ(T − Tjk ). So
now, we can write down the expectation value of the sample variance for
STR marker data.
$\langle s^2\rangle = \frac{1}{N}\,\mathrm{Tr}(C) - \frac{1}{N^2}\sum_{jk} C_{jk}$  (35)

$\phantom{\langle s^2\rangle} = \mu T - \frac{\mu}{N^2}\sum_{jk}\,(T - T_{jk})$  (36)

$\phantom{\langle s^2\rangle} = \frac{\mu}{N^2}\sum_{jk} T_{jk}$  (37)
Note that the diagonals of $T_{jk}$ are zero and there are N(N−1) off-diagonal terms, so we can write this as

$\langle s^2\rangle = \mu\,\frac{N-1}{N}\,T_P$  (38)

where $T_P$ is the mean pairwise TMRCA,

$T_P = \frac{1}{N(N-1)}\sum_{jk} T_{jk}$  (39)
So, at last, we have shown that the corrected sample variance $N/(N-1)\,s^2$ is not in fact an unbiased estimator of µT but is instead an unbiased estimator of $\mu T_P$, the mutation rate times the mean pairwise TMRCA. This $T_P$ is of course always less than T. The ratio $T/T_P$ will depend on the structure and branching times of the particular tree but will usually be in the range 1 to 3.
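As a closing illustration, a sketch on an assumed toy genealogy (four descendants: two pairs that each coalesce 100 generations ago, joined at the root T = 500 generations ago) showing that the corrected sample variance tracks µT_P rather than µT:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, T, n_trials = 0.002, 500, 200_000

def drift(gens, size):
    # Net STR displacement after `gens` generations of single-step mutation
    n_mut = rng.binomial(gens, mu, size=size)
    return 2 * rng.binomial(n_mut, 0.5) - n_mut

# Shared drift along each subclade's stem (root down to the pair's ancestor),
# then private drift along each leaf's final 100 generations.
a = drift(T - 100, n_trials)
b = drift(T - 100, n_trials)
leaves = np.stack([a + drift(100, n_trials), a + drift(100, n_trials),
                   b + drift(100, n_trials), b + drift(100, n_trials)])

N = 4
s2_corrected = leaves.var(axis=0).mean() * N / (N - 1)
T_P = (4 * 100 + 8 * T) / (N * (N - 1))    # mean pairwise TMRCA, about 367

print(s2_corrected, mu * T_P, mu * T)      # s^2 matches mu*T_P, not mu*T
```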
3 References

Walsh, B. 2001. Genetics 158(2):897 (The Genetics Society of America). http://www.genetics.org/cgi/reprint/158/2/897

http://en.wikipedia.org/wiki/Gamma_distribution

http://mathworld.wolfram.com/SampleVarianceDistribution.html