BMC Bioinformatics
BioMed Central
Open Access
Methodology article
HaploRec: efficient and accurate large-scale reconstruction of
haplotypes
Lauri Eronen*1, Floris Geerts2 and Hannu Toivonen1,3
Address: 1HIIT-BRU, Department of Computer Science, University of Helsinki, Finland, 2Laboratory for Foundations of Computer Science,
University of Edinburgh, UK and 3Department of Computer Science, University of Freiburg, Germany
Email: Lauri Eronen* - lauri.eronen@cs.helsinki.fi; Floris Geerts - fgeerts@inf.ed.ac.uk; Hannu Toivonen - hannu.toivonen@cs.helsinki.fi
* Corresponding author
Published: 22 December 2006
BMC Bioinformatics 2006, 7:542
doi:10.1186/1471-2105-7-542
Received: 29 June 2006
Accepted: 22 December 2006
This article is available from: http://www.biomedcentral.com/1471-2105/7/542
© 2006 Eronen et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: Haplotypes extracted from human DNA can be used for gene mapping and other
analysis of genetic patterns within and across populations. A fundamental problem is, however, that
current practical laboratory methods do not give haplotype information. Estimation of phased
haplotypes of unrelated individuals given their unphased genotypes is known as the haplotype
reconstruction or phasing problem.
Results: We define three novel statistical models and give an efficient algorithm for haplotype
reconstruction, jointly called HaploRec. HaploRec is based on exploiting local regularities
conserved in haplotypes: it reconstructs haplotypes so that they have maximal local coherence.
This approach – not assuming statistical dependence for remotely located markers – has two useful
properties: it is well-suited for sparse marker maps, such as those used in gene mapping, and it can
actually take advantage of long maps.
Conclusion: Our experimental results with simulated and real data show that HaploRec is a
powerful method for the large scale haplotyping needed in association studies. With sample sizes
large enough for gene mapping it appeared to be the best compared to all other tested methods
(Phase, fastPhase, PL-EM, Snphap, Gerbil; simulated data), with small samples it was competitive
with the best available methods (real data). HaploRec is several orders of magnitude faster than
Phase and comparable to the other methods; the running times are roughly linear in the number
of subjects and the number of markers. HaploRec is publicly available at http://www.cs.helsinki.fi/
group/genetics/haplotyping.html.
Background
The problem we consider is haplotype reconstruction: given
the genotypes of a sample of individuals, the task is to predict the most likely haplotype pair for each individual.
Computational haplotype reconstruction methods are
based on statistical dependency between closely located
markers, known as linkage disequilibrium. Many computational methods have been developed for the reconstruc-
tion of haplotypes. Some of these methods do not rely on
the statistical modeling of the haplotypes [1-3], but most
of them, like our proposed algorithm HaploRec, do [410]. For a review of these and other haplotyping methods
we refer to [11-13]. Laboratory techniques are being
developed for direct molecular haplotyping (see, e.g.,
[14,15]), but these techniques are not mature yet, and are
currently time consuming and expensive.
Page 1 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
The need for a new haplotyping method is motivated by
high throughput association analysis, where the goal is to
locate a disease susceptibility gene by finding a haplotype
fragment that is associated with the disease being studied.
More and more often gene mapping studies use a large
marker map spread over a long genomic region. A typical
strategy for computationally haplotyping a long map is to
first divide the map to small, overlapping windows, to
reconstruct the haplotypes in each window separately,
and then to combine haplotypes from the windows [16].
HaploRec is aimed to have the following important properties. First, increasing the window size should give relatively more accurate results since large windows contain
more information, i.e., adding markers should improve
accuracy (in the phases between the markers that already
were there), whether the new markers are added between
the old ones or not [17]. Second, the time complexity of
the algorithm should be close to linear in the number of
markers, in order to avoid unnecessary compromises
when choosing the window size, and also close to linear
in the number of genotypes to allow sample sizes of hundreds to thousands of individuals, as required by association analysis [18]. HaploRec produces accurate haplotype
reconstructions, and scales to long marker maps (or windows) that span long genetic regions.
While the statistical models of HaploRec are novel, some
of the algorithmic principles are similar to earlier work.
HaploRec follows a likelihood-based expectation-maximization (EM) haplotype inference strategy which was
introduced in [4]. PL-EM [5,6] overcomes the computational complexity of the basic EM approach by using a a
pruning strategy on the possible haplotype resolutions,
called Partition-Ligation (PL). The Snphap algorithm of
Clayton [19] is also based on the EM algorithm, but uses
a sequential pruning strategy; HaploRec also uses a similar
pruning approach. The PL strategy is also used in the current version of Phase [7,8]. PL-EM and Snphap are based
on a multinomial haplotype probability model with a
uniform Dirichlet prior. Phase, however, uses a prior distribution based on coalescent theory (see [20] for a
review) and uses Bayesian inference implemented with
Gibbs sampling instead of EM. The underlying idea for
our statistical models for haplotypes is that we derive an
overall probability for a haplotype from the probabilities
of its local fragments. We propose three probability models based on this idea: high-order Markov chains, variableorder Markov chains, and a segmentation-based model.
Moreover, a novelty is that the lengths of the haplotype
fragments utilized in the models vary, governed by a frequency threshold. In contrast to most other approaches,
long-range dependencies between markers are not
required for the methods to work, but they can be utilized
where they do exist. Thus all these models scale naturally
http://www.biomedcentral.com/1471-2105/7/542
to the long and sparse marker maps often used in gene
mapping.
Our segmentation-based model bears some resemblance
to methods which combine haplotype block finding and
haplotyping [9,10]. However, whereas these models place
universal block boundaries across the whole population,
our model averages over all possible segmentations for
each haplotype, without any fixed block boundaries.
Quite recently, Scheet and Stephens [21] introduced fastPhase, which models the population with a set of founder
haplotypes (or clusters); the cluster memberships are
allowed to change continuously along the chromosome,
according to a hidden Markov model. Similar to our models, fastPhase allows for both "block-like" patterns and
gradual decline of linkage disequilibrium with distance.
In our extensive experimental evaluation in the Results
section we will compare HaploRec with Phase, fastPhase,
PL-EM, Snphap, and Gerbil [10]. In the simulated settings, where sample sizes (number of subjects) are large
enough for gene mapping, we observe the concrete benefits of the ability of HaploRec to improve its performance
by haplotyping more markers at a time. The models we
describe are relatively simple and are slightly outperformed by Phase when the number of markers is small,
but when our method is given a longer map to be haplotyped, it can actually utilize the information contained in
the additional markers to outperform Phase. HaploRec is
in practice several orders of magnitude faster than Phase
and has running times comparable to the other methods.
We next define the necessary notation.
Notation and problem statement
We assume a set (map, or window) M of ᐍ markers 1,...,ᐍ
and denote the set of all observed alleles of marker i by Ai.
A haplotype H over M is then a vector of alleles: H ∈ Πi =
1,...,ᐍ Ai. A genotype G over M is a vector of (unordered)
allele pairs: G ∈ Πi = 1,...,ᐍ{{a1, a2} | a1, a2 ∈ Ai}. For SNP
markers, |Ai| = 2. Assuming that alleles are labeled "1" and
"2", SNP haplotypes are vectors in {1, 2}ᐍ and SNP genotypes are vectors in {{1, 1}, {1, 2}, {2, 2}}ᐍ. We thus use
terms haplotype and genotype to refer to data over the
whole marker map, and not e.g. to just one marker.
The allele of haplotype H at marker i is denoted by H(i).
Similarly, the unordered allele pair of a genotype G at
marker i is denoted by G(i). Given a pair of haplotypes
{H1, H2} and a genotype G such that G(i) = {H1(i), H2(i)}
for all i, we say that {H1, H2} is compatible with G, or that
{H1, H2} is a (possible) haplotype configuration for genotype G. Two haplotypes determine a unique compatible
Page 2 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
genotype in the obvious way. A genotype, on the other
hand, can have several compatible haplotype configurations. For a genotype G with k heterozygous markers (i.e., k
= |{G(i) = {a1, a2} | a1 ≠ a2}|; a marker is homozygous if it
is not heterozygous), there are 2k - 1 different haplotype
configurations. The set of all haplotype configurations for
a genotype G is denoted by CG, with |CG| = 2k - 1. A single
haplotype H is said to be compatible with a genotype G, if
there exists a haplotype H' such that {H, H'} is compatible
with G. The set of input genotypes is denoted by . The
haplotype reconstruction problem is now defined as finding for each genotype G ∈ the most plausible haplotype
configuration
{H1,
H2}
∈
C G.
Here,
the
interpretation of "most plausible" depends on each different haplotyping method; in most methods the most plausible haplotype configuration is the one with the highest
estimated probability under the prior assumptions made
by the method. In the case that multiple haplotype configurations are equally probable, we simply select one randomly.
Results
Haplotype probability models
Our focus is on data sets with a large number of relatively
sparsely spaced markers. Under these conditions, recombinations between markers are common, and linkage disequilibrium between distant markers is weak. In this case
it cannot be expected that complete haplotypes are shared
between subjects; instead we aim to discover and utilize
local regularities, or patterns in the haplotypes. We restrict
our attention to a simple class of patterns: a haplotype fragment is a haplotype restricted to a (continuous) sub-range
of the original marker map. The idea is to model local
linkage disequilibrium by estimating the frequencies of
haplotype fragments, and to combine those frequencies
into a probability model for complete haplotypes. We
here very briefly describe three different haplotype probability models based on this idea; full specifications are
given in the Methods section. We introduced the ideas for
two of them (the Markovian models) in a preliminary
conference paper [22], one (the segmentation model) is
completely novel. The actual haplotyping algorithm has
also been greatly improved, resulting in significantly more
accurate results and reduced running times, while scaling
to much larger data sets. An overview of the algorithm is
given in the Methods section; a detailed description with
pseudo-code and complexity analysis is given in Additional file 1.
Let H(i, j) denote the sequence, or haplotype fragment,
from the ith to the jth marker in a given haplotype H. In
the variable-order Markov model the conditional probabili-
ties at each marker i are estimated from fragments H(si, i 1) of varying length:
P(H) = P(H(1))
∏
i =2,...,̀
P(H(i) | H( si , i − 1)).
(1)
The length of the context H(si, i - 1), and thus the order of
the Markov chain, is individually adjusted for each position and each haplotype by choosing the longest matching context that has a predetermined minimum
frequency.
The segmentation model considers each haplotype as a
sequence of independent, non-overlapping fragments,
and defines the probability of a haplotype to be the product of fragment probabilities. A robust estimate is
obtained by averaging over the set of all possible segmentations of H into frequent fragments:
P(H ) = C −1
∑ q|S|−1 ⋅ ∏
S∈
P(H( si , ei )),
(2)
( si ,ei )∈S
where S is a segmentation of H into (non-overlapping)
segments (si, ei), q is a parameter for penalizing large numbers of segments, |S| is the number of segments in segmentation S, and C is a normalization factor,
As a simpler alternative, we also consider d-order Markov
models with
P(H) = P(H(1, d))
∏
P(H(i) | H(i − d, i − 1)).
( 3)
i = d +1,...,̀
More details on each of these models are given in the
Methods section.
Experimental setting
Data simulation
Our main target application is data sets involving a large
number of subjects (hundreds or thousands) and markers
(hundreds or thousands per chromosome), because these
are needed for association-based gene mapping [18]. Such
data is not yet publicly available for benchmarking, and
thus our experiments are based mainly on simulated data.
Real data from the HapMap project [23] is used to validate
the general observations from simulations, and to test the
method in slightly different settings, especially with small
sample sizes.
We used Hudson's coalescence simulator [24] to simulate
data sets of 1000 genotypes. Our simulated settings range
from 5 to 500 markers, with average marker spacings
between 6.6 and 166 kb and map lengths between 166 kb
and 16.6 Mb. These marker distances correspond to
Page 3 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
genome-wide studies with 500 k to 20 k markers; the average linkage disequilibrium between neighboring markers,
measured with Lewontins |D'| measure, ranges respectively from 0.88 to 0.36 (Table 1). More details on the
data simulation procedure are given in the Methods section.
subsequently abbreviated as HaploRec-VMM, HaploRecS, and HaploRec-FMM. The most important parameter of
HaploRec is the one that indirectly specifies the size of the
data structure used by the model (subsequently called
model complexity parameter). For the variable-order Markov
model and the segmentation model this parameter is the
fragment frequency threshold, with the fixed-order
Markov model, it is the order of the Markov model (for
details, see the Methods section).
All experiments are run separately for 10 independently
simulated data sets, and we report average results over
them. Results from the different replicates are quite similar: while there is some variance in the accuracy between
the different replicas, the relative performances of different methods are extremely similar. (See Additional file 2
for example results from all 10 replicas.) Unless otherwise
stated, 100 markers were used in the experiments with
HaploRec alone, and 30 markers in comparisons to other
methods, in order to keep running times of some of the
other methods reasonable.
The optimal value of the model complexity parameter
depends on the amount of linkage disequilibrium, so a
fixed value does not give comparable results across different marker spacings. In the first experiment, we compare
the proposed models to each other, by testing each model
for a large range of different values (minimum frequency
1–15 for the variable-order Markov and segmentation
models, order 1–17 for the fixed-order Markov model).
The optimal value was then chosen for each marker spacing separately. In the experiment, the accuracy of the models under different amounts of linkage disequilibrium was
tested using 100 markers with average spacing between
6.6 and 166 kb (Figure 1)
Performance measure
As an accuracy measure, we primarily use relative switch
accuracy, which is defined as the fraction of neighboring
phases (between each pair of consecutive heterozygous
markers) reconstructed correctly. An alternative measure
would be absolute accuracy, which is the fraction of haplotype pairs reconstructed completely correctly (ignoring
missing alleles). Absolute accuracy is problematic with
long or sparse haplotypes, where some switch errors can
be inevitable. It gives little information about the quality
of a haplotype, except whether it is exactly correct or not;
the switch measure is much more informative in this
respect. Chromosome-wide studies search for relatively
short disease-associated haplotype fragments, and switch
accuracy is almost directly related to the number of fragments correctly reconstructed. However, we also give
some examples of absolute errors since they are widely
used in the field (for short haplotypes).
The segmentation model performs here best, variableorder Markov model is slightly less accurate, and the fixedorder model is clearly inferior. This and other experiments
(not shown) show that the more complex models are
good alternatives over the simple Markov model. For clarity of exposition, the fixed-order Markov model is
excluded from the rest of the results as markedly inferior.
We also considered a variant of the segmentation model
where the maximum over different segmentations was
used, instead of averaging over different segmentations.
This alternative was slightly but consistently outperformed by the averaging model (results not shown).
Default values for model complexity parameters
To make later comparisons to other methods fair, we must
fix the minimum frequency parameter to a fixed default
value for each of the models. Consider now the accuracies
and model sizes (number of stored haplotype fragments)
as functions of the minimum frequency threshold, for
three different marker spacings: 33, 100 and 166 kb (Fig-
Experimental results
Performance across different amounts of linkage disequilibrium
The three proposed haplotype probability models,
Markov chain of variable order (VMM), the segmentation
model (S), and as simpler baseline the Markov chain of
fixed order (FMM), lead to different variants of HaploRec,
Table 1: Correspondence between marker spacing and linkage disequilibrium.
Marker spacing (kb)
|D'|
number of markers
6.6 kb
20 kb
33 kb
100 kb
166 kb
0.88
0.73
0.64
0.45
0.36
500 k
166 k
100 k
33 k
20 k
Average marker distance, average linkage disequilibrium (|D'|) between adjacent markers, and the corresponding number of markers in a genomewide study; simulated data.
Page 4 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
1
1
HaploRec-S
HaploRec-VMM
HaploRec-FMM
0.95
HaploRec-S
HaploRec-VMM
HaploRec-FMM
0.9
0.8
0.9
Absolute accuracy
Switch accuracy
0.7
0.85
0.8
0.75
0.6
0.5
0.4
0.3
0.7
0.2
0.65
0.1
0.6
0
0
20
40
60
80
100
120
140
160
180
0
20
40
60
Marker spacing (kb)
80
100
120
140
160
180
Marker spacing (kb)
Figure 1 of FMM, VMM and segmentation models
Comparison
Comparison of FMM, VMM and segmentation models. Switch accuracy on the left, absolute accuracy on the right.
ure 2). With the segmentation model, decreasing the
threshold always improves results. A threshold of 1.2 was
chosen for the subsequent experiments as a compromise
between efficiency and accuracy (note that the frequency
threshold does not need to be an integer, as frequencies of
fragments are estimated from reconstructed haplotypes
which may have non-integer probabilities; see Methods
for more details). With the variable-order Markov model
the optimal choice seems to depend more on the amount
of linkage disequilibrium. For the Markov model, a
threshold of 2 was chosen for the rest of the experiments.
Comparison of HaploRec with fastPhase, Gerbil, Phase, PL-EM, and
Snphap
For the rest of this section we study how HaploRec performs in comparison to existing statistical haplotyping
methods. Five publicly available haplotyping programs
were chosen for benchmarking: fastPhase [21] (version
1.1.3), Gerbil [10] (version 1.0), Phase [7,8] (version 2.0),
PL-EM [6] (version 1.5, kindly provided by Zhaohui S.
Qin) and Snphap [19] (version 1.3.1). The current version
of HaploRec, 2.1, was used in the experiments. FastPhase,
Gerbil and Snphap were run with their default parameter
1
2e+06
HaploRec-S, 33kb
HaploRec-VMM, 33kb
HaploRec-S, 166kb
HaploRec-VMM, 166kb
1.8e+06
0.9
1.6e+06
0.8
Model size
Switch accuracy
1.4e+06
0.7
1e+06
800000
600000
HaploRec-S, 33kb
HaploRec-VMM, 33kb
HaploRec-S, 100kb
HaploRec-VMM, 100kb
HaploRec-S, 166kb
HaploRec-VMM, 166kb
0.6
1.2e+06
400000
200000
0.5
0
0
2
4
6
8
10
Minimum fragment frequency
12
14
16
0
2
4
6
8
10
Minimum fragment frequency
12
14
16
Figureof2minimum frequency on the VMM and segmentation models
Effect
Effect of minimum frequency on the VMM and segmentation models. Switch accuracy on the left, model size on the
right. (We use red color throughout this paper for HaploRec; solid line for the segmentation model and dashed line for the
variable order model.)
Page 5 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
10000
0.95
1000
Running time per marker (seconds)
1
Switch accuracy
0.9
0.85
0.8
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.75
0.7
0
10
20
30
40
50
60
Number of markers
70
80
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
100
10
1
0.1
0.01
90
100
0
10
20
30
40
50
60
Number of markers
70
80
90
100
Figure 3 of methods, variable number of markers
Comparison
Comparison of methods, variable number of markers. Switch accuracy on the left, running times (plotted with logarithmic scale) on the right.
values. For PL-EM, buffer size was set to 300, number of
iterations to 20, and parsize to 2. For Phase, the number
of iterations and burn-in iterations were both set to 100,
the thinning interval was set to 1, and the recombination
model introduced in version 2.0 was used. All experiments were run on PCs with a 2.8 GHz Pentium 4 processor and 1 GB of memory. All methods were tested on a
range of data sets containing 5 to 100 markers with 33 kb
spacing (Figure 3; Table 2). HaploRec-S reaches the best
accuracy, 99.0% (with windows of 100 markers), closely
followed by Phase (98.1% accuracy with windows of 30
markers) and HaploRec-VMM (98.0% accuracy with windows of 100 markers); the HaploRec variants are several
orders of magnitude faster. PL-EM, Snphap, and Gerbil are
clearly less accurate but faster; fastPhase is both less accurate and slower than HaploRec. Phase aborted because of
running out of memory (1 GB) when there were more
than 40 markers.
An observation from this experiment is that the number
of markers can have a significant effect on haplotyping
accuracy. With the smallest tested numbers of markers, 5
and 10, all the methods achieve only a mediocre accuracy.
When the number of markers is increased to 20 or 30, the
accuracies of all methods except Gerbil improve clearly.
The accuracy of both HaploRec variants increases monotonically with the number of markers. None of the other
methods show this property. Phase's accuracy decreases
slightly after 30 markers (and it was not able to handle
more than 40 markers). Snphap and PL-EM gain less from
the increase in the number of markers, and also start to
actually lose accuracy when the number of markers
increases over 30. FastPhase only gains little and levels
after 20 markers.
Snphap, the HaploRec variants, and Gerbil are relatively
close to having linear running times in the number of
Table 2: Comparison of methods on simulated data sets with 33 kb marker spacing.
Method
HaploRec-S
HaploRec-VMM
(HaploRec-S)
(HaploRec-VMM)
Phase
fastPhase
PL-EM
Snphap
Gerbil
Switch accuracy
No. of markers
Time per marker (sec.)
0.990
0.980
(0.969)
(0.970)
0.981
0.858
0.938
0.921
0.745
100
100
(30)
(30)
30
20
30
30
5
19.2
5.2
(6.6)
(2.5)
1585
61.3
0.9
0.4
2.6
For each method, the table gives the accuracy, the number of markers to be haplotyped at a time to achieve that accuracy, and the running time per
marker. As 30 marker windows are optimal for many methods, results for them are given in parenthesis also for the HaploRec variants.
Page 6 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
markers (i.e., close to a horizontal line on the scale where
y-axis gives the time per marker; Figure 3, right panel).
Snphap is 1–2 orders of magnitude faster than PL-EM,
HaploRec and Gerbil; they are an order of magnitude
faster than fastPhase, which in turn is 0–2 orders of magnitude faster than Phase. For clarity of exposition and fairness of comparison, we use 30 marker windows in the rest
of the experiments for all methods. Based on the results
above, it is about an optimal choice for Phase, fastPhase,
PL-EM, and Snphap, slightly suboptimal for HaploRec,
and unfortunately bad for Gerbil. In particular, note that
if a larger window size was chosen for HaploRec, HaploRec-S could in many cases have given more accurate
results than Phase (cf. Table 2).
Effect of marker density
We next evaluated the performance of the methods as a
function of the amount of linkage disequilibrium, resulting from varying marker spacing between 6.6 and 166 kb
(Figure 4, cf. Table 1). The accuracy of all methods
decreases with increasing marker spacing, as expected.
Phase and the two HaploRec variants are clearly more
resistant to decreasing linkage disequilibrium than the
other methods. FastPhase and Gerbil have a drop in accuracy at first, but then come close to PL-EM and Snphap
with the sparsest maps. Marker spacing also has a clear
effect on running times for some of the methods (not
shown): for Phase, PL-EM and Gerbil, the difference of
running times between the densest (6.6 kb) and sparsest
(133 kb) settings is approximately 40-fold. For HaploRec
and Snphap, running times only vary by a factor of 1.5 –
3 (in the case of HaploRec, this is explained by the
increased number of iterations needed for convergence),
while fastPhase's running times are practically same for all
spacings.
Effect of sample size
The available sample size obviously has an effect on the
difficulty of the task. Most of the evaluations of haplotyping methods in the literature have been made on very
small data sets, and thus it can be interesting to see how
sample size effects different methods. Tests with sample
sizes between 25 and 1000 individuals clearly show that
all methods perform better with larger sample sizes, but
there are marked differences between methods (Figure 5;
30 markers with 33 kb spacing). Phase, HaploRec, PL-EM
and Snphap are able to benefit from larger sample sizes.
FastPhase and Gerbil gain much less from larger samples:
while they perform approximately equally to HaploRec
with very small sample sizes, HaploRec is more accurate
by a large margin with larger sample sizes. Also, Phase
seems to work relatively well also with the smallest sample sizes.
http://www.biomedcentral.com/1471-2105/7/542
For most methods, the running times are relatively close
to linear in the number of genotypes, but Phase's running
times increase somewhat more rapidly. We performed
some additional experiments with fastPhase, trying out
different parameter values (not shown). In particular,
increasing the maximum number of clusters from the
default value of 15 to 20 or 30 improved the accuracy at
the cost of longer running times, but the accuracies did
not reach those of HaploRec.
Effect of missing alleles and genotyping errors
Two sets of experiments were performed to see how up to
5% missing alleles or up to 5% genotyping errors affect
the accuracy. According to our experience, such numbers
are typical for current high-quality genotypes. The
number of markers was again fixed to 30 and marker spacing was 33 kb (Figure 6). Phase failed to complete when
the error fraction was more than 2%, due to increased
memory usage resulting from the increased haplotype
diversity. Also, the running time of Phase increased rapidly with increasing error fraction; already with 2%, Phase
ran for approximately 4 days for a single replica. For this
reason, results for Phase are not shown in the error fraction figure. Overall, missing alleles have only a slightly
negative effect on accuracy of all tested methods. HaploRec-S seems to be slightly more sensitive to missing data
than the other methods. Genotyping errors, on the other
hand, have a more clear effect on accuracy. HaploRecVMM, PL-EM and Snphap experience a clear drop in accuracy, whereas HaploRec-S, fastPhase and Gerbil are more
robust against genotyping errors. In this setting, the results
start to be unacceptable between about 1% (Snphap, PLEM, fastPhase) and 2% (HaploRec) of errors.
Tests with larger data sets
The applicability of the methods for larger amounts of
markers and longer genomic regions was tested by varying
the number of markers from 10 to 500 (with the map
growing at the same time from 330 kb to 16.5 Mb, with a
fixed marker spacing of 33 kb; Figure 7). When the length
of haplotype fragments was not limited, HaploRec reaches
the 1 GB memory limit at about 150 markers. To work
with a larger number of markers, HaploRec was also run
with the maximum fragment length set to 30, which significantly reduced memory usage, enabling it to handle all
tested window sizes. Also some of the other methods had
a practical maximum number of markers they could handle on these data sets; 200 for PL-EM and 300 for Gerbil.
Phase was excluded from this experiment, as it was
already observed that it was unable to handle data sets
with more than 40 markers.
HaploRec-S without the maximum pattern length constraint has the best accuracy, but its high memory usage
makes it unusable for the larger window sizes. On the
Page 7 of 18
(page number not for citation purposes)
http://www.biomedcentral.com/1471-2105/7/542
1
1
0.9
0.8
Absolute accuracy
Switch accuracy
BMC Bioinformatics 2006, 7:542
0.8
0.7
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.6
0.5
0
20
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.6
0.4
0.2
0
40
60
80
100
120
140
160
180
0
20
40
60
Marker spacing (kb)
80
100
120
140
160
180
Marker spacing (kb)
Figure 4 of methods, variable marker density
Comparison
Comparison of methods, variable marker density. Switch accuracy on the left, absolute accuracy on the right.
other hand, the segmentation model does not work quite
as well with the pattern length constraint. HaploRec-VMM
has a fairly good accuracy, and is almost unaffected by the
pattern length constraint (the lines for the two variants are
indistinguishable in the figure), making it practical for
large data sets. Its accuracy stays approximately constant
with 100 markers or more. FastPhase and PL-EM maintain
a lower but constant accuracy. Gerbil works up to 300
markers and maintains its (low) level of accuracy. While
Snphap quickly loses accuracy with an increasing number
of markers, it is very fast (its running times per marker are
below a second). The running times of the HaploRec variants and PL-EM are slightly superlinear (i.e., approaching
a horizontal line; note that the y-axis shows the time per
marker) and Gerbil clearly superlinear.
Experimental results with real data
To complement the systematic experimental analysis with
several replicates of simulated data and reasonably sized
samples, the methods were also tested on publicly available real data from the HapMap project [23].
The HapMap data we used consists of two separate populations: 30 trios from the Yoruba population in Ibadan,
Nigeria, and another 30 trios from the CEPH population
(Utah residents with ancestry from northern and western
Europe). Both data sets (downloaded from the HapMap
web site [25]) have the same set of 3.8 million SNPs
spread over the whole genome. (For information on how
HapMap data was processed for the experiments, see the
Methods section.)
1
100
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
Running time per genotype (seconds)
0.95
0.9
Switch accuracy
0.85
0.8
0.75
0.7
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.65
0.6
0.55
0
100
200
300
400
500
600
Number of genotypes
700
800
10
1
0.1
0.01
0.001
900
1000
0
100
200
300
400
500
600
Number of genotypes
700
800
900
1000
Figureof5sample size
Effect
Effect of sample size. Switch accuracy on the left, running times (on a logarithmic scale) on the right.
Page 8 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
1
1
HaploRec-S
HaploRec-VMM
fastPhase
PL-EM
Snphap
Gerbil
0.95
0.95
0.9
0.85
Switch accuracy
Switch accuracy
0.9
0.85
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.8
0.75
0.8
0.75
0.7
0.65
0.6
0.7
0.55
0
0.01
0.02
0.03
Fraction of missing alleles
0.04
0
0.05
0.01
0.02
0.03
Genotyping error probability
0.04
0.05
Figureof6missing alleles (left) and genotyping errors (right)
Effect
Effect of missing alleles (left) and genotyping errors (right).
In the following experiments with this data, HaploRec
was run with the same parameter values as before, chosen
based on the first experiments with large samples of simulated data (Figures 1 and 2).
In the first experiment, the marker spacing was fixed to 6
kb (which corresponds roughly to a genome-wide study
with 500 k markers). The number of markers was varied
from 10 to 100 to test the relationship between the
number of markers and haplotyping accuracy (Figure 8,
Yoruba data on the left, CEPH data on the right).
The younger CEPH population is clearly easier than the
Yoruba population for all haplotyping methods. However, for both data sets, even the best methods only
achieve a relatively low accuracy, which can probably be
attributed to the small sample size. For the Yoruba data,
Phase performs best across the range of tests, followed by
the HaploRec variants and fastPhase, which are followed
by Gerbil, PL-EM and Snphap. For the CEPH data, the differences between the best methods (Phase, HaploRec-S
and fastPhase) are very small.
The above results show that HaploRec is competitive also
on real data sets having a small sample size. On the other
hand, the number of marker has a smaller effect on accuracy than in the simulated data sets. Also, increasing the
number of markers to more than 40 does not improve
accuracy here, unlike in the experiments with simulated
data sets, where accuracy always improves with increasing
1
100
HaploRec-S
HaploRec-S, maxlen=30
HaploRec-VMM
HaploRec-VMM, maxlen=30
fastPhase
PL-EM
Snphap
Gerbil
Running time per marker (seconds)
0.95
Switch accuracy
0.9
0.85
HaploRec-S
HaploRec-S, maxlen=30
HaploRec-VMM
HaploRec-VMM, maxlen=30
fastPhase
PL-EM
Snphap
Gerbil
0.8
0.75
0.7
80
60
40
20
0.65
0.6
0
0
50
100
150
200
250
300
Number of markers
350
400
450
500
0
50
100
150
200
250
300
Number of markers
350
400
450
500
Figure 7 and running times for large data sets
Accuracies
Accuracies and running times for large data sets. Switch accuracy on the left, running times on the right.
Page 9 of 18
(page number not for citation purposes)
http://www.biomedcentral.com/1471-2105/7/542
1
1
0.95
0.95
0.9
0.9
Switch accuracy
Switch accuracy
BMC Bioinformatics 2006, 7:542
0.85
0.8
0.75
0.65
10
20
0.8
0.75
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.7
0.85
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.7
0.65
30
40
50
60
Number of markers
70
80
90
10
100
20
30
40
50
60
Number of markers
70
80
90
100
Figure 8 on HapMap data, variable number of markers
Accuracies
Accuracies on HapMap data, variable number of markers. Marker spacing 6.0 kb. Yoruba on the left, CEPH on the
right.
number of markers. We believe the reason for this is the
small sample size (60 genotypes vs. 1000 genotypes in the
simulated data). In a larger sample, longer shared ancestral haplotype fragments (which are rarer than shorter
ones) can be detected and utilized, leading to increased
accuracy. To test this hypothesis, we ran HaploRec on
(sparse) simulated data sets consisting of various numbers
of genotypes and evaluated the effect of the number of
markers (data not shown). The same effect was visible
there as well: the improvement in accuracy gained by
using more markers decreases with decreasing sample
size.
Discussion
The HaploRec models and algorithm introduced in this
paper are designed for haplotyping data sets in large-scale
disease association studies. The statistical models are relatively simple. Their central idea is to represent the probability of a long haplotype as a function of the probabilities
of its local fragments. Our experimental results confirm
1
1
0.95
0.95
0.9
0.9
Switch accuracy
Switch accuracy
In a second experiment with the HapMap data, we varied
marker spacing between 1.5 and 15 kb, while the number
of markers was fixed to 30 (Figure 9, Yoruba data on the
left, CEPH data on the right). The accuracy drops with
increasing marker spacing, as expected, but the methods
roughly maintain their relative performances, in the order
Phase, HaploRec-S, fastPhase, HaploRec-VMM, Gerbil,
PL-EM and Snphap. Again, for the CEPH data, the best
three methods have practically identical accuracies.
0.85
0.8
0.75
0.65
0
2
0.8
0.75
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.7
0.85
HaploRec-S
HaploRec-VMM
Phase
fastPhase
PL-EM
Snphap
Gerbil
0.7
0.65
4
6
8
10
Marker spacing (kb)
12
14
16
0
2
4
6
8
10
Marker spacing (kb)
12
14
16
Figure 9 on HapMap data, variable marker spacing
Accuracies
Accuracies on HapMap data, variable marker spacing. 30 markers. Yoruba on the left, CEPH on the right.
Page 10 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
that locality and simplicity were successful design choices:
locality allows the models to actually benefit from large
numbers of markers even when they are sparsely located,
while simplicity allows an efficient implementation. In
combination these properties result in a method that
accurately and quickly haplotypes large windows of markers at a time.
We presented three different haplotype probability models: Markov chain of a fixed order, Markov chain of a variable order (HaploRec-VMM), and a model based on
segmenting each haplotype in frequent fragments (HaploRec-S). Our experimental results show that the simple
fixed-order Markov chain is clearly inferior, even if its
order is chosen optimally for the data set. In our results for
the more flexible and "self-adjusting" variable-order and
segmentation models, the segmentation model gave generally more accurate results than the variable-order
Markov model both in the simulated and real data sets,
with the downside of being somewhat slower. Our experiments also indicated another interesting difference
between the models: while the segmentation model is
more sensitive to missing data than the variable-order
Markov model, it is more robust against genotyping
errors.
In experimental comparisons with existing haplotyping
methods, the HaploRec models scale in a unique way to
large data sets: Their accuracies improve both in the
number of markers (Figure 3) and in the sample size (Figure 5) in an unparalleled way (Phase being an exception
in some aspects), while being robust to sparse marker
maps (Figure 4). This combination of properties makes
HaploRec especially suitable for chromosome or genomewide association analysis, where large numbers of sparsely
located markers are analyzed for hundreds or thousands
of individuals. In such settings, HaploRec can outperform
Phase in accuracy while being 2–3 orders of magnitude
faster. Although Phase is very accurate and can also benefit from large samples, it is computationally very intensive, restricting its usability for large data sets. The recently
proposed fastPhase method scales well but for large data
sets, it seems to be clearly less accurate and also clearly
slower than HaploRec.
The performance differences can be largely understood
from the properties of the different models. Snphap and
PL-EM are based on a multinomial model, which is suitable when all markers are in strong linkage disequilibrium,
but does not work well when the number of markers is
large or the markers are spaced far apart. Gerbil solves this
problem by dividing the marker map into blocks and
modeling each block by a set of common "founder" haplotypes. This approach assumes strong linkage disequilibrium within each block, and does not account for more
http://www.biomedcentral.com/1471-2105/7/542
flexible patterns of linkage disequilibrium across block
boundaries. Gerbil's relatively bad performance on simulated data can probably be attributed to two factors.
Firstly, it only considers a small number of common haplotypes for each block, and thus cannot utilize the rarer
shared haplotypes found in large samples. Secondly, there
are no recombination hot-spots in the simulated data. In
fastPhase, each haplotype is a mosaic of founder haplotypes. This allows for both block-like patterns of linkage
disequilibrium and gradual decay of linkage disequilibrium with distance. However, it does not account for the
fact that shared haplotypes are formed also in the later
stages of the population history, forming more recent
(and less frequent) patterns of linkage disequilibrium.
The segmentation model of HaploRec bears some resemblance to the model used in fastPhase: each haplotype is a
mosaic of frequent haplotype fragments, which may now
be ancestral haplotype fragments from different stages of
the population history. As old haplotypes are split by
recombinations and new haplotypes are formed during
the population history, longer shared fragments probably
are of more recent origin and have a smaller frequency
than shorter ones. HaploRec is not constrained to a fixed
set of founders, but can more flexibly utilize (local) patterns of linkage disequilibrium present in the data using
fragments of varying lengths. Being able to utilize the
information contained in rare shared fragments is one
possible explanation for the good performance of HaploRec on large samples. Phase works consistently well
across most settings. Also its model accounts for recombinations, which explains why it performs well also on
sparser marker maps. Its good accuracy especially on
small samples is probably due to its more realistic prior
compared to other methods, which becomes less important with increasing sample size. The accurate model
employed by Phase has the downside of being computationally very intensive. As a result, the window size
(number markers) it can handle is very limited, especially
when the sample size is large. In the experiments with variable window size, we also observed that Phase's accuracy
decreases when increasing the number of markers above
30; this may be because its default number of iterations is
not sufficient to obtain mixing for larger window sizes.
Results with data from the HapMap project demonstrate
that HaploRec works well also with real data (Figures 8
and 9). Unlike the assumed primary applications of HaploRec, these samples are very small (only 60 genotypes).
Still, HaploRec is very competitive, even with its default
parameter values chosen based on the simulated, larger
data sets with sparser marker spacing. Experiments with
the dense Daly data [26] (results not shown) indicate the
same: among the tested methods, HaploRec was second
only to Phase and fastPhase while being fastest of all
methods. Future experiments with large real data sets, as
Page 11 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
they become publicly available, will be used to test the
hypothesis that HaploRec will benefit more than the competing methods from larger sample sizes and from more
markers.
For evaluating performance on reasonably large samples
(1000 individuals), we used simulated data due to the
lack of suitable public real data sets. Another option
would have been to use pseudo-simulated data, by randomly combining pairs of real haplotypes to form new
pseudo-individuals (as done, e.g., in [10]). For the genetically long marker maps of association studies, this
method would generate unrealistic data: multiple copies
of complete haplotypes will appear in the final sample,
when in reality the haplotypes would mostly have been
segmented by recombinations. Such data would be unrealistically easy to haplotype, and already a simple multinomial model can give almost completely correct
solutions.
Most haplotype-based association analysis methods
assume haplotypes as input. There are several major fundamental issues related to this. The first questions the
need to haplotype longer windows of markers maps at all,
since gene mapping studies search for disease associations
of relatively short genetic regions and for this it is sufficient to haplotype such a shorter segment at a time. The
computational complexity can on one hand decrease –
especially for methods that are not close to linear in the
number of markers, such as Phase – but on the other hand
there is an additional cost of redundantly haplotyping
overlapping segments. Additionally, haplotyping accuracy improves with window size (cf. Figures 3 and 8), indicating that too short windows should be avoided.
The second issue relates to the use of estimated haplotypes
in association analysis. It seems obvious that haplotype
reconstruction tends to exaggerate linkage disequilibrium
since haplotyping methods more or less directly aim to
maximize it. It has been shown that estimated haplotypes
can, indeed, lead to false positives [27,28]. On the other
hand, this does not always have to be the case. A simulation study shows that in association analysis, the haplotypes produced by HaploRec can be equally powerful to
the true haplotypes, despite some inevitable phasing
errors [29]. The locality of the statistical models has a subtle role here: in the case-control settings normally used in
association studies, linkage disequilibrium is increased in
the cases in the vicinity of the disease gene, making this
most critical part of the marker map easier to haplotype.
More work is needed to identify when statistically predicted haplotypes are useful and when not.
Another view to this issue is that some information is lost.
As an extreme example, there may be several roughly
http://www.biomedcentral.com/1471-2105/7/542
equally likely haplotype configurations of which just one
is chosen to the output. Since association analysis methods are typically based on frequencies of haplotypes (or
fragments), the frequencies of different possible haplotypes – rather than the single most likely ones – should be
more informative and fairly easily usable for many association methods. Most statistical haplotyping methods
internally estimate haplotype frequencies; haplotype resolution can be seen as an extra step based on these frequencies. This also holds for HaploRec which actually already
estimates frequencies of all haplotype fragments that have
(an estimated) frequency above a small threshold. Ultimately, haplotype frequency estimation and association
analysis could be combined into one model and process
[27].
There are several possible directions for future work on
improving the haplotype probability models. Currently,
the variable-order Markov model only uses a simple frequency threshold to determine the context lengths. The
set of contexts could be pruned further using the accuracy
of predicting the next allele as a selection criterion [30].
Another possibility for refinement is to smooth the probability over several context lengths simultaneously [31].
Ideas from the models of Phase and Gerbil could be used
to better account for mutations and genotyping errors. A
possible approach would be to allow for a small number
of mismatches between the haplotype fragments and
complete haplotypes in the parameter estimation step,
dividing some of the probability mass to fragments that
are similar, but not identical with the observed ones.
Conclusion
Genotyping hundreds or even thousands of subjects for
hundreds of thousands of markers is becoming technologically and economically feasible. It is estimated that
data sets of this size start to be sufficiently powerful for
genome-wide disease association studies, depending on
the disease and the population [18,32]. However, many
methods for association-based gene mapping assume
haplotype data. It has been shown, too, that haplotypes
can be more powerful than single markers [33].
We presented models and methods for statistical haplotype reconstruction from genotypes of unrelated individuals, and specifically targeted large and sparse data sets,
such as those needed in chromosome or genome-wide
disease association studies. We introduced three different
haplotype probability models: Markov chain of a fixed
order, Markov chain of a variable order, and a model
based on segmenting each haplotype into frequent fragments. In Methods we give full specifications of the models and an concise description of the HaploRec algorithm;
Additional file 1 contains a more detailed description of
the algorithm and complexity analysis.
Page 12 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
Experiments with simulated and real data demonstrate
that these models and methods, collectively called HaploRec, are competitive with existing methods in terms of
accuracy while being several orders of magnitude faster
than the most accurate competitors. Of the two HaploRec
models, the segmentation model is recommended as the
default choice, as it generally gives more accurate results.
However, for very large data sets, or when there is much
missing data, the variable-order Markov model may be a
better alternative, due to its smaller computational
demands and smaller sensitivity to missing data.
Methods
Haplotype probability models
Recall that H(i, j) denotes the sequence (haplotype fragment) from the ith to the jth marker in a given haplotype
H. We use the alternative notation frag(h, i, j) to denote a
haplotype fragment from i to j, consisting of marker string
h, when the fragment is not a projection from any particular haplotype H. We will denote H(i, i), a fragment consisting of a single marker, simply by H(i). Similarly, G(i, j)
denotes the sequence of allele pairs from the ith to the jth
marker in genotype G, called genotype fragment. Again, G(i,
i) is denoted by G(i). We say that a fragment H(i, j) and a
haplotype H' match if H(k) = H'(k) for all k : i ≤ k ≤ j. We
say that a fragment H(i, j) and a genotype G match if there
exists a string H ∈ Πk = i,...,j Ak such that {H(i, j), H } is
compatible with G(i, j). Given a set of haplotypes, the frequency of a fragment H(i, j) is defined as the number of
haplotypes matching the fragment, and is denoted by
(H(i, j)).
A simple model for haplotype probability is to consider
the haplotype as a (first-order) Markov chain in which the
probability for a marker having a certain allele depends
only on the preceding marker:
P(H) = P(H(1))
∏
P(H(i) | H(i − 1)).
(4)
i =2,...,̀
∏
Given a frequency threshold minfr, we first compute the
set of most frequent haplotype fragments, denoted by
minfr, which will determine the sizes of contexts:
minfr
minfr},
= {frag(h, i, j) | 1 ≤ i ≤ j ≤ ᐍ, (frag(h, i, j)) ≤
(6)
where h ranges over all possible fragments. Given a haplotype H, the longest matching fragments in minfr are then
used to estimate the conditional probabilities at each
marker i:
P(H ) = P(H(1))
∏
i = 2,...,̀
The obvious shortcoming of this model is that although
linkage disequilibrium is normally strongest between
neighbors, it is not limited to the immediate neighboring
markers; a neighborhood of several markers is thus potentially more informative. More power in predicting the
next allele can thus be obtained by increasing the order d
of the Markov model:
P(H) = P(H(1, d))
problem. Increasing the order increases accuracy of predicting the next allele, but only to a certain extent; at some
point, the conditional probability can no longer be reliably estimated from a limited sample of haplotypes.
Another problem with fixing d is the fact that linkage disequilibrium may vary within the marker map; it is thus
possible that no single value of d is suitable for all parts of
the map. A more flexible alternative to the fixed-order
Markov model is to use a variable-order Markov model,
where the context is adjusted for each marker and haplotype individually. Informally, the goal is to find a flexible
balance between generality and informativeness. We propose the following solution. When we estimate the probability of a haplotype H and consider the variable-order
Markovian distribution at marker i, we find the longest
observed fragment that (1) matches haplotype H and
ends at marker i - 1, and (2) has a frequency exceeding
some given threshold, minfr. The fragments whose frequency does not exceed this threshold are considered
uninformative. Using a frequency threshold is motivated
by the fact that it is not likely that a long fragment of haplotypes is shared by different individuals unless it is inherited from the same ancestor; thus using only frequent
fragments gives increased confidence in the fragments
being identical by descent.
P(H(i) | H(i − d, i − 1)).
( 5)
i = d +1,...,̀
With d = 1 we obviously have the standard Markov chain
as a special case. Selecting a suitable value for d can be a
P(H(i) | H( si , i − 1)),
where si = min{s | H(s, i - 1) ∈
minfr}.
(7)
The order of the
Markov chain is thus individually adjusted for each position and each haplotype.
Although both fixed and variable-order Markov models
have been extensively studied and used for many applications (see [34] for a review of variable-order Markov models), we are not aware of any previous applications to
haplotype reconstruction. There is also a subtle difference
between these models and typical applications of Markov
chains. The models employed here are inhomogeneous, i.e.,
each marker has its own states and transition probabiliPage 13 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
ties, whereas usually they are not dependent on the location in the sequence. Also unlike typical applications, we
are simultaneously modeling two sequences whose
entries are observed together as unordered pairs.
Another alternative of building a haplotype probability
model from local fragments is to think of a complete haplotype as a mosaic of frequent fragments, originating from
different founders or via different coalescence histories. In
the segmentation model we consider non-overlapping
fragments to be independent, and consequently define
the probability of a haplotype to be the product of fragment probabilities:
P(H) =
∏
P(H( si , ei )),
( si ,ei )∈S
where S = (s1, e1), (s2, e2),..., (sn, en) is a segmentation of H
into consecutive non-overlapping fragments (such that s1
= 1, si = ei - 1 + 1 for all 1 <i ≤ n, and en = ᐍ).
The above formula leaves open the actual segmentation
used for each haplotype. As the recombination history of
the haplotypes is unknown, we of course have no way of
deducing the "correct" segmentation. As a solution, we
propose a model which averages over all possible segmentations:
P( H ) =
∑ S∈ ∏( s ,e )∈S P(H( si , ei ))
i
i
| ′|
,
(8)
where is the set of all possible segmentations of H:
minfr for all(s, e) ∈ S}, and ′ is the set
= {S : H(s, e) ∈
of all possible segmentations of the marker map:
′ = {S : for all(s, e) ∈ S there exists some frag(h, s, e) ∈
minfr} (note that the normalization factor is independent of the haplotype H). The above formula can be interpreted as having a set of all possible block models, each
with a uniform prior probability, over which the probability is averaged. Note that in contrast to widely used block
models, there are no preferences for shared block boundaries between individuals; instead, all possible segmentations are considered as equally viable alternatives.
A potential weakness of the model defined by Equation 8
is that there are more segmentations with a large number
of segments than ones with a smaller number. To favor
segmentations with less and longer segments, we introduce a penalty factor (controlled by an additional parameter q) for each additional fragment included in the
segmentation:
P( H ) =
∑ S∈ q|S|−1.∏( s ,e )∈S P(H( si , ei ))
,
∑ S∈′ q|S|−1
i
i
(9)
where 0 <q ≤ 1, and |S| is the number of segments in segmentation S. Setting a smaller value to q will cause segmentations with a large number of fragments to have a
smaller probability. Setting q = 1 corresponds to no penalty, in which case the formulation is equivalent to Equation 8. By experimenting we found that q = 0.1 works
reasonably well in different settings. This value was used
in all our experiments.
The HaploRec algorithm
Given one of the models we defined – fixed-order Markov
chain, variable-order Markov chain, or the segmentation
model – and its parameters, we can compute a probability
for any given haplotype.
Assuming independence between the two haplotypes of
an individual (Hardy-Weinberg equilibrium), the probability of any haplotype configuration for a genotype is just
the product of the probabilities of its constituent haplotypes. The algorithmic problem is twofold: the haplotype
reconstruction method has to simultaneously learn the
model parameters and reconstruct the haplotypes of each
individual.
Our algorithm, HaploRec, is a modified version of the the
EM algorithm introduced in [4]. In the EM framework, the
haplotype configuration underlying each genotype is considered as a latent variable, and the goal is to find a maximum-likelihood estimate for the model parameters .
The likelihood of a single genotype is a sum of the probabilities of all its possible haplotype configurations (the
values of latent variables), and the likelihood of the whole
data is then just a product over all genotypes:
L( | ) =
∏
∑
G∈ {H1 ,H2}∈CG
P({H1 , H2} | ),
( 10 )
where P({H1, H2} | ) = P(H1| )·P(H2| ) is the
probability of the haplotype pair {H1, H2}, given the
model parameters . The EM algorithm works by iteratively improving estimates of the model parameters (the
M-step) and values of the latent variables (the E-step), until
a (local) maximum of the likelihood is reached. The individual reconstructed haplotypes can then be obtained by
just selecting, for each genotype, the compatible haplotype pair that has maximal probability according to the
Page 14 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
obtained maximum-likelihood estimate of the parameters.
The original algorithm [4] uses a multinomial model,
where parameters are the frequencies of complete haplotypes. Our modifications to the EM algorithm consist of
(1) replacing the multinomial model with one of our fragment-based probability models and (2) using a sequential
pruning strategy to overcome the exponential computational complexity of the E-step. Below, we give an outline
of the HaploRec algorithm; a more detailed description,
including pseudo-code, handling of missing data and
complexity analysis, is given in Additional file 1.
Representation of model parameters
Parameters of the three models are slightly different. In
the Markov models, the parameters are the conditional
allele probabilities; in the segmentation model, the
parameters are the fragment probabilities. In practice, the
conditional probabilities are derived from fragment probabilities as follows:
P(H(i) | H(i − d , i − 1)) =
(H(i − d , i))
,
(H(i − d , i − 1))
where d is the number of previous markers conditioned
on and (H) denotes the estimated probability (frequency) of fragment H. As conditional probabilities can
be straightforwardly derived from fragment frequencies,
we always use the set of fragments as a representation for
the model parameters, also in the case of the Markov
models.
E-step
In the E-step of the EM algorithm, the aim is to calculate,
for each G ∈ , and each compatible haplotype pair {H1,
H2} ∈ CG, the probability that the genotype actually con-
for each genotype) is built up marker by marker, starting
from a partial configuration containing only the allele
pair at the leftmost marker. At each step, all partial configurations are extended with the allele pair at the next
marker (for homozygous markers, each configuration is
extended with the same allele pair; for heterozygous
markers, there are two possible extensions for each configuration, and only the the B most probable configurations
are propagated to the next step). In the final step, the C(≤
B) most probable configurations are returned. The
approach is greedy; it is not guaranteed that the set of
returned configurations consists exactly of the most probable ones. In preliminary experiments it was found that B
= 25 and C = 10 give a reasonable computational efficiency, while increasing the parameters beyond these values does not significantly improve accuracy. These values
were used in the experiments. Using a sequential pruning
strategy to implement the E-step is conceptually simple,
and has already been used in [19]. However, implementing it efficiently for our models is not trivial. In Additional
file 1, we provide a detailed description of the data structures and algorithms for implementing the E-step for each
of proposed models.
M-step
In the M-step of the EM algorithm, the model parameters
are re-estimated (based on the current haplotype estimates from the previous E-step), such that the likelihood
of the data is improved from the previous iteration. The
haplotype estimates from the previous E-step give the
expected frequency of any haplotype in a given genotype.
The estimated frequency of a haplotype fragment in a single genotype is obtained by a sum over all the haplotypes
in its set of possible configurations that match the fragment, and the overall estimated frequency of a fragment h
is obtained as an average over all genotypes:
sists of that haplotype pair:
P(H1 | t −1 )P(H2 | t −1 )
.
Pt ({H1 , H2} | G) = P({H1 , H2} | G, t −1 ) =
∑{H ,H}∈C P(H | t −1 )P(H | t −1 )
G
The model parameters,
t - 1,
t (h) =
1
∑
2 || G∈ {H
∑
1 ,H 2 }∈CG
Pt ({H1 , H2} | G)δ h,{H1 ,H2} ,
where δ h,{H1 ,H2} = |{i ∈ {1, 2} | h matches Hi}| ∈ {0, 1,
are obtained from the pre-
2} is the number of haplotypes in configuration {H1, H2}
vious parameter estimation step. The normalization just
transforms the (prior) probabilities given by the model
into a (posterior) probability distribution for configurations of the single genotype G. Exhaustively going through
all possible configurations is feasible only when the
number of heterozygous markers is small (≅ 20 or less).
With more markers, we use a pruning strategy in which
the set of possible haplotype configurations (separately
that match h. To compute the fragment frequencies, we
first combine the sets of most probable haplotype configurations (from the previous E-step) from all genotypes
into a weighted (multi-)set of haplotypes, where the
weight of each haplotype is the sum of the probabilities of
its occurrences in the reconstructed configurations (note
that the weights are thus mostly non-integer). Computing
the set of frequent fragments is then done by by depth-
Page 15 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
first search in the fragment containment lattice. First, all
possible fragments of length one are generated, after
which they are recursively extended to the right as long as
the frequencies of resulting fragments stay above the given
minimum frequency threshold and the fragments do not
exceed a given maximum length. To improve the efficiency of the algorithm, the list of matching haplotypes is
stored with each fragment during the execution of the
algorithm (as depth-first search is used, this does not significantly improve memory usage, because the list can be
freed when the fragment is no longer in the stack). This
way, the frequency of a each fragment can be computed by
only matching its last allele to the list of haplotypes
matching its prefix. The algorithm is guaranteed to find all
frequent fragments, as frequency decreases monotonically
when a fragment is extended.
Initialization
The initial frequencies are computed somewhat similarly
as in the parameter estimation step. The difference is that
there is not yet any information about the probability of
different haplotype configurations, and thus all configurations of a genotype must be considered equally likely. The
number of haplotype configurations compatible with a
genotype G is exponential in the number of heterozygous
markers in G, and enumerating all the possible configurations in CG is thus infeasible. Fortunately, the initial frequencies can be counted directly from the genotype data
as follows, without explicitly generating the elements of
CG:
( frag(h, i, j)) =
1
1− k
2 G( i , j ) ,
∑
2 || G∈ ,G matches frag(h,i , j)
( 11 )
where kG(i, j) is the number of heterozygous markers in
genotype fragment G(i, j) (note that a homozygous genotype has two identical haplotypes both matching the fragment, and thus weight 2). Computing the initial model
can be implemented by a slightly modified version of the
above-described depth-first search algorithm.
Using overlapping windows to haplotype very long marker
maps
The HaploRec algorithm described above is directly suitable for windows (marker maps) containing up to 500–
1500 markers. This practical limit depends on the number
of genotypes, amount of missing data, complexity of the
haplotype distribution, program parameters and the
amount of available main memory. We have extended the
implementation of the method by a simple but efficient
feature that allows it to handle an arbitrarily large number
of markers. Briefly, the idea is to sequentially haplotype a
window of w markers at a time. In each window except the
first one, phased haplotypes for the first o markers are
obtained from the haplotypes in the previous, overlapping window. After a window is haplotyped, r of its last
markers are discarded; they are only used to get better estimates for the w - o - r markers in the middle of the window. I.e., each window has length w = o + u + o + r, where
u is the number of markers unique to this window, the
first o represents the haplotypes obtained from the previous window and the second o those that are used by the
next window. The window is moved o + u markers at a
time. In experiments with Yoruba data (results not
shown), HAPLOREC had practically identical accuracies
with this extension and without it, at the expense of
slightly larger running times per marker resulting from the
overlapping windows.
Simulation of data
We used Hudson's coalescence simulator [24] to generate
chromosomes under the standard Wright-Fisher neutral
model of genetic variation with recombination. A long
chromosomal region with 16666666 base pairs was simulated. The probability of a mutation in each base pair
was set to 10-8 per generation, and the probability of crossover between adjacent base pairs was set to 10-8. These values give the mutation probability for the entire chromosomal region μ = 0.16666666, and cross-over probability
ρ = 0.16666665. The diploid population size, N0, was set
to the standard 10000, giving the mutation parameter θ =
4N0μ = 6666.6666, and the recombination parameter r =
6666.6665. This standard model simulates recombinations with a uniform distribution. As a result of the stochastic coalescence, however, the recombinations in the
final population tend not to be uniformly distributed.
A sample of 2000 chromosomes was generated, and these
were paired to form 1000 genotypes. On the average, one
simulation produced approximately 55000 segregating
sites. Markers were chosen from the segregating sites with
minor allele frequency at least 5%, such that marker spacing was as uniform as possible. The actual number of
markers and the position of the markers varied with different test settings. The amount of linkage disequilibrium
between markers is perhaps the most important factor
affecting the accuracy of population-based haplotyping
methods. Under the assumption of uniform recombination the amount of linkage disequilibrium is governed by
the average distance between adjacent markers. To assess
haplotyping performance with different amounts of linkage disequilibrium, marker spacing was varied between
6.6 and 166 kb. The average linkage disequilibrium
between neighboring markers, measured with Lewontins
|D'| measure, ranges respectively from 0.88 to 0.36 (Table
1). With 30 markers (the number of markers most commonly used in the following experiments), total map
lengths are between 200 and 5000 kb. The largest data sets
Page 16 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
have 500 markers picked from the whole simulated
region, giving an average marker spacing of 33 kb.
In real data, a fraction of alleles is practically always missing, and there may be genotyping errors. Therefore, in
some of the experiments, part of the alleles were masked
as missing. Either both or none of the homologous alleles
of each marker were masked. This was done by fixing for
each allele pair (single marker of a single genotype) a
probability for having missing data. Likewise, genotyping
errors were simulated by randomly changing each allele
according to a given probability.
http://www.biomedcentral.com/1471-2105/7/542
The same procedures were performed separately for the
Yoruba and CEPH data sets, and the same set of experiments were performed for both populations. For each
population, the reported experimental results are averaged over the 50 data sets to reduce variance in the results.
Availability and requirements
• Project name: HaploRec
• Project home page: http://www.cs.helsinki.fi/group/
genetics/haplotyping.html.
• Operating system: Platform independent
Processing of HapMap data
We used 30 trios from the Yoruba population in Ibadan,
Nigeria, and another 30 trios from the CEPH population.
Both data sets were downloaded from the HapMap web
site [25].
The haplotypes of the children were inferred from the
trios, and the non-transmitted parental chromosomes of
each trio were combined to form additional artificial haplotype pairs (as is common in association studies if trios
are available), resulting in a set of 60 genotypes for each
population. Markers for which the phase could not be
inferred, when all members of the trio were heterozygous,
are included in the resulting data sets, but are not used in
the switch accuracy calculations. In markers where only
one allele was missing, the other was marked missing as
well, since some of the tested programs could not handle
markers where only one allele is missing. Markers with
minor allele frequency less than 5% were discarded.
• Programming language: Java
• Other requirements: Java 1.5 or higher
• License: Free for educational, research and non-profit
purposes
• Any restrictions to use by non-academics: License
required
Authors' contributions
All authors contributed to the formulation of the models
and to drafting of the manuscript. LE developed the algorithms and carried out the experiments. FG participated in
the design of the study. HT conceived the study and participated in its design. All authors read and approved the
final manuscript.
Additional material
For testing, we first sampled 50 sets of markers from distinct regions of chromosome 1. The sampling was done by
systematically taking (from the set of markers fulfilling
the minor allele frequency threshold) markers 1–500 to
the first data set, markers 501–1000 to the second, etc. In
the resulting data sets, the average marker spacing is
approximately 1.5 kb and the fraction of missing alleles is
approximately 3.6%. Data sets with sparser maps were
obtained by sampling markers from these; e.g., an average
distance of 6 kb between markers was obtained by systematically choosing every fourth marker from the original
samples.
For experiments with different numbers of markers, the
markers were picked from the middle of each set of 500
markers. To reduce variance in results, caused by differences in the difficulty of haplotyping different markers,
accuracy was always evaluated only on the 10 middlemost
markers common to all the sets. This way, in each run, the
test is the same and results are better comparable.
Additional file 1
Detailed algorithms, proof of convergence, and complexity analysis.
Click here for file
[http://www.biomedcentral.com/content/supplementary/14712105-7-542-S1.pdf]
Additional file 2
Example results from the 10 data replicates.
Click here for file
[http://www.biomedcentral.com/content/supplementary/14712105-7-542-S2.pdf]
Acknowledgements
This research has been financially supported by the Finnish Funding Agency
for Technology and Innovation (Tekes). HT has been supported also by
Alexander von Humboldt Foundation.
Page 17 of 18
(page number not for citation purposes)
BMC Bioinformatics 2006, 7:542
http://www.biomedcentral.com/1471-2105/7/542
References
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
Clark AG: Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biological Evolution 1990,
7:111-122.
Gusfleld D: Haplotype Inference by Pure Parsimony. Proceedings of the 14:th Annual Symposium on Combinatorial Pattern Matching
2003.
Ding Z, Filkov V, Gusfleld D: A Linear-Time Algorithm for the
Perfect Phylogeny Haplotyping (PPH) Problem. International
Conference on Research in Computational Molecular Biology 2005
(RECOMB 2005), Volume 3500 of Lecture Notes in Computer Science,
Springer 2005:585-600.
Excoffier L, Slatkin M: Maximum-Likelihood estimation of
molecular haplotype frequencies in a diploid population.
Molecular Biological Evolution 1995, 12(5):921-927.
Niu T, Qin ZS, Liu JS: Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms. Am J Hum Genet
2002, 70:17-169.
Qin ZS, Niu T, Liu JS: Partition-Ligation-Expectation-Maximization Algorithm for Haplotype Inference with SingleNucleotide Polymorphisms.
Am J Hum Genet 2002,
71:1242-1247.
Stephens M, Smith NJ, Donnelly P: A New Statistical Method for
Haplotype Reconstruction from Population Data. Am J Hum
Genet 2001, 68:978-989.
Stephens M, Scheet P: Accounting for Decay of Linkage Disequilibrium in Haplotype Inference and Missing-Data Imputation. Am J Hum Genet 2005, 76(3):449-462.
Greenspan G, Geiger D: Model-based inference of haplotype
block variation. In Proceedings of the seventh annual international conference on Computational molecular biology ACM Press; 2003:131-137.
Kimmel G, Shamir R: GERBIL: Genotype resolution and block
identification using likelihood. Proceedings of the National Academy of Sciences 2005, 102:158-162.
Halldórsson BV, Bafna V, Edwards N, Lippert R, Yoosepth S, Istrail S:
A Survey of Computational Methods for Determining Haplotypes. In Computational Methods for SNPs and Haplotype Inference,
Volume 2983 of Lecture Notes in Computer Science Berlin: Springer;
2004:26-47.
Gusfleld D: An Overview of Combinatorial Methods for Haplotype Inference. In Computational Methods for SNPs and Haplotype
Inference, Volume 2983 of Lecture Notes in Computer Science Berlin:
Springer; 2004:9-25.
Salem R, Wessel J, Schork N: A comprehensive literature review
of haplotyping software and methods for use with unrelated
individuals. Human Genomics 2005, 2:39-66.
Yan H, Papadopoulos N, Marra G, Perrera C, Josef Jiricny CR, Boland
, Lynch HT, Chadwick RB, de la Chapelle A, Berg K, Eshleman JR,
Yuan W, Markowitz S, Laken SJ, Lengauer C, Kinzler KW, Vogelstein
B: Conversion of diploidy to haploidy.
Nature 2000,
403:723-724.
Zhang K, Zhu J, Shendure J, Porreca GJ, Aach JD, Mitra RD, Church
GM: Long-range polony haplotyping of individual human
chromosome molecules. Nature Genetics 2006, 38:382-387.
Eskin E, Sharan R, Halperin E: A Note on Phasing Long Genomic
Regions using Local Haplotype Predictions. Journal of Bioinformatics and Computational Biology 2006, 4(3):639-647.
Coulonges C, Delaneau O, Girard M, Do H, Adkins R, Spadoni JL,
Zagury JF: Computation of haplotypes on SNPs subsets:
advantage of the "global method". BMC Genetics 2006, 7:50.
Wang WY, Barratt BJ, Clayton DG, Todd JA: Genome-wide association studies: theoretical and practical concerns. Nature
Reviews Genetics 2005, 6:109-118.
Clayton D: SNPHAP: A program for estimating frequencies
of large haplotypes of SNPs. [http://www-gene.cimr.cam.ac.uk/
clayton/software/snphap.txt].
Hudson RR: Gene genealogies and the coalescent process. In
Oxford surveys in evolutionary biology Volume 7. Edited by: Futuyma D,
Antonovics J. Oxford University Press, Oxford; 1990:1-44.
Scheet P, Stephens M: A Fast and Flexible Statistical Model for
Large-Scale Population Genotype Data: Applications to
Inferring Missing Genotypes and Haplotypic Phase. Am J Hum
Genet 2006, 78(4):629-644.
Eronen L, Geerts F, Toivonen H: A Markov chain approach to
reconstruction of long haplotypes. In Pacific Symposium on Bio-
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
computing 2004 (PSB 2004) Hawaii, USA: World Scientific;
2004:104-115.
The International HapMap Consortium: The International HapMap project. Nature 2003, 426:789-796.
Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18:337-338.
The HapMap database [http://www.hapmap.org/genotypes/200510/non-redundant]
Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nature Genetics 2001, 29:229-232.
Morris A, Whittaker J, Balding D: Little loss of information due
to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data.
The American Journal of Human Genetics 2004, 74:945-953.
Curtis D, Sham PC: Estimated Haplotype Counts from CaseControl Samples Cannot Be Treated as Observed Counts.
American Journal of Human Genetics 2006, 78:729-730.
Hintsanen P, Sevon P, Onkamo P, Eronen L, Toivonen H: An empirical comparison of case – control and trio-based study
designs in high-throughput association mapping. Journal of
Medical Genetics 2006, 43:617-624.
Ron D, Singer Y, Tishby N: The Power of Amnesia. In Advances in
Neural Information Processing Systems Volume 6. Edited by: Cowan JD,
Tesauro G, Alspector J. Morgan Kaufmann Publishers, Inc;
1994:176-183.
Kermorvant C, Dupont P: Improved Smoothing for Probabilistic Suffix Trees Seen as Variable Order Markov Chains. Proceedings of the 12th European Conference on Machine Learning, Springer
2002:185-194.
Kruglyak L: Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genetics 1999,
22:139-144.
Akey J, Jin L, Xiong M: Haplotypes vs single marker linkage disequilibrium tests: what do we gain? European Journal of Human
Genetics 2001, 9:291-300.
Begleiter R, El-Yaniv R, Yona G: On Prediction Using Variable
Order Markov Models. Journal of Artificial Intelligence Research
2004, 22:385-421.
Publish with Bio Med Central and every
scientist can read your work free of charge
"BioMed Central will be the most significant development for
disseminating the results of biomedical researc h in our lifetime."
Sir Paul Nurse, Cancer Research UK
Your research papers will be:
available free of charge to the entire biomedical community
peer reviewed and published immediately upon acceptance
cited in PubMed and archived on PubMed Central
yours — you keep the copyright
BioMedcentral
Submit your manuscript here:
http://www.biomedcentral.com/info/publishing_adv.asp
Page 18 of 18
(page number not for citation purposes)