BIOINFORMATICS
Vol. 20 no. 15 2004, pages 2421–2428
doi:10.1093/bioinformatics/bth266
How independent are the appearances of
n-mers in different genomes?
Yuriy Fofanov1, ∗, Yi Luo1 , Charles Katili1 , Jim Wang1 ,
Yuri Belosludtsev3 , Thomas Powdrill3 , Chetan Belapurkar1 ,
Viacheslav Fofanov1 , Tong-Bin Li1 , Sergey Chumakov1,4 and
B. Montgomery Pettitt1,2
1 Department of Computer Science and 2 Department of Chemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204-3010, USA, 3 Vitruvius Biosciences, The Woodlands, TX, USA and 4 Department of Physics, University of Guadalajara, Guadalajara, Mexico
ABSTRACT
Motivation: Analysis of statistical properties of DNA
sequences is important for evolutionary biology as well as for
DNA probe and PCR technologies. These technologies, in
turn, can be used for organism identification, which implies
applications in the diagnosis of infectious diseases, environmental studies, etc.
Results: We present results of the correlation analysis of
distributions of the presence/absence of short nucleotide subsequences of different length (‘n-mers’, n = 5–20) in more
than 1500 microbial and virus genomes, together with five
genomes of multicellular organisms (including human). We determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the more commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are
not close relatives of each other, the presence/absence of
different 7–20mers in their genomes are not correlated. For
close biological relatives, some correlation of the presence of
n-mers in this range appears, but is not as strong as expected.
Suppressed correlations among the n-mers present in different genomes lead to the possibility of using random sets of
n-mers (with appropriately chosen n) to discriminate genomes
of different organisms and possibly individual genomes of the
same species including human with a low probability of error.
Contact: yfofanov@uh.edu.
Supplementary information: Supplementary data is available at http://www.bioinfo.uh.edu/publications/independence_genomes/.
INTRODUCTION
Several hundred sequencing projects have been already completed, and several complete genomes of large multicellular
∗ To whom correspondence should be addressed.
Bioinformatics 20(15) © Oxford University Press 2004; all rights reserved.
organisms have become available. Many sequencing projects
are progressing but the number of species and variations is
so large that comparative genomics is just now beginning to
be feasible. A relevant question arises as to whether there is
sufficient material to look at them from a statistical viewpoint
(Vainrub et al., 2003).
Statistical analysis of the appearance of short subsequences
of length n called motifs or n-mers in different DNA sequences
(see, e.g. Karlin, 2001), from individual genes to full genomes,
is of interest in terms of evolutionary biology. In addition,
knowledge of the distribution of appearance of n-mers is
necessary for PCR primer (Fislage et al., 1997; Fislage,
1998) and microarray probe design (Southern, 2001). Several
attempts (Nussinov, 1984; Karlin and Ladunga, 1994; Karlin
et al., 1997; Nakashima et al., 1997, 1998; Deschavanne et al.,
1999; Sandberg et al., 2001) have been made to employ the
distributions of appearance for n-mers to identify species
with relatively short genome sizes (microbial). In such an
approach, the shapes of the frequency distributions for particular short subsequences [2–4mers (Nussinov, 1984; Karlin and
Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997,
1998; Campbell et al., 1999) and 8–9mers (Deschavanne et al.,
1999; Sandberg et al., 2001)] have been proposed as a measure
to decide what microbial genome we are dealing with, based
on a given piece of genome or a whole genome.
The above-mentioned papers deal with the frequency of appearance when n is small, such that the total number of n-mers, 4^n, is smaller than the genome sequence length M, i.e. 4^n < M. It is clear that the distributions of appearance of n-mers in this range are essentially different from those for random sequences of the same lengths. Here, we determine whether a given n-mer is present or absent in a given genome (frequency of presence), rather than the commonly calculated number of appearances of n-mers in one or more genomes (frequency of appearance). We consider the distribution of
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on February 24, 2016
Received on October 16, 2003; revised on March 9, 2004; accepted on April 1, 2004
Advance Access publication April 15, 2004
Y.Fofanov et al.
METHODS
For our analysis, we picked all genomes available at the time (May 2003) from the NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome), including microbial (110), viral (1405) and multicellular organism (5) genomes, with sizes ranging from 0.44 kb (Rice yellow mottle virus satellite) to 2.87 Gb (human). A complete list of all genomes and the complete results of the analysis discussed below are available as supplementary material at http://bioinfo.uh.edu/publications/independence_genomes/.
For our computations with multicellular, microbial and viral genomes, we used both complementary strands, because this is how the sequences are observed with present technology (PCR, cDNA microarrays, etc.). This trivially increases the amount of analyzed material by a factor of two. To take this into account for normalization, we use the term ‘total sequence length’ (TSL), equal to twice the genome length. We denote the TSL so defined by M.
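As a minimal sketch of the quantity defined above, the following Python counts the distinct n-mers on both strands of a sequence and divides by 4^n. The function names and the toy sequence are illustrative assumptions, and a plain Python set is used for clarity; it is not the data structure used for the genome-scale computations reported here.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def distinct_nmers(seq: str, n: int) -> set[str]:
    """Collect every distinct n-mer seen on either strand of `seq`."""
    seq = seq.upper()
    found = set()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - n + 1):
            word = strand[i:i + n]
            # Skip windows containing ambiguous bases such as N.
            if set(word) <= set("ACGT"):
                found.add(word)
    return found


def frequency_of_presence(seq: str, n: int) -> float:
    """p = N(n, G) / 4^n, the fraction of all possible n-mers that occur."""
    return len(distinct_nmers(seq, n)) / 4 ** n


toy = "ACGTTGCA"              # hypothetical toy sequence, not a real genome
tsl = 2 * len(toy)            # total sequence length M = twice the genome length
p = frequency_of_presence(toy, 2)
```

Each n-mer is counted once however often it occurs, which is what separates the frequency of presence from the frequency of appearance.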
The calculation of the frequency of presence of n-mers
for n > 10 in large genome sequences is challenging
because of exponential growth of time/memory usage in brute
force algorithms. To be able to perform calculations for
n > 11, new algorithms and special data structures have been
developed and implemented (Fofanov et al., 2002a,b), see
http://bioinfo.uh.edu/publications/ for details.
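To illustrate why memory becomes the limiting factor, here is a sketch of a straightforward presence bitmap with one bit per possible n-mer. This is an assumed baseline for illustration only, not the algorithms of Fofanov et al. (2002a,b), whose details are not reproduced in this paper. The bitmap needs 4^n bits, i.e. 2 MiB for n = 12 but 128 GiB for n = 20.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def presence_bitmap(seq: str, n: int) -> bytearray:
    """Mark which of the 4^n possible n-mers occur in `seq` (one strand).

    Each n-mer is mapped to an integer in [0, 4^n) using 2 bits per base,
    and the integer is updated in O(1) as the window slides by one base.
    """
    bitmap = bytearray((4 ** n + 7) // 8)   # one bit per possible n-mer
    mask = 4 ** n - 1
    value = 0
    valid = 0                               # consecutive unambiguous bases seen
    for base in seq.upper():
        code = CODE.get(base)
        if code is None:                    # ambiguous base: restart the window
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= n:
            bitmap[value >> 3] |= 1 << (value & 7)
    return bitmap


def count_present(bitmap: bytearray) -> int:
    """N(n, G): the number of set bits in the bitmap."""
    return sum(bin(byte).count("1") for byte in bitmap)
```

Running the same function on the reverse complement and OR-ing the two bitmaps would give the both-strand count used in this study.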
In this study, we examined the presence/absence of short subsequences in more than one genome simultaneously, obtaining a frequency of presence/absence across multiple genomes. This distribution is not related to how many occurrences of an n-mer are in a particular genomic set. We performed such analyses separately in four different sets of genomes: RNA-based viruses (789 genomic sequences), DNA-based viruses (616 genomic sequences), microorganisms (110 genomes) and human. In each of the first three groups, the number of simultaneously present 5–18mers was calculated for each pair of genomes. The fourth group contains 24 human chromosomes, for which the number of simultaneously present 7–20mers was calculated for each pair of chromosomes.
RESULTS
Frequencies of presence of n-mers in different
genomes
As the first step of our analysis, we calculated the number, N(n, G), of distinct n-mers of length 5–20 present in each of the 1500+ genomes (G) considered. The corresponding results for 114 microbial genomes are shown in Figure 1. The value of N(n, G) depends on two parameters: 4^n, the total number of all possible n-mers, and the genome length, M. In Figure 1, we show the frequency of presence of different n-mers, p = N(n, G)/4^n, as a function of the ratio 4^n/M. Note that 4^n grows very fast as n increases. For short n-mers, n < 7, and long sequences, M > 4^n, a kind of ‘saturation’ can be observed, in which all or almost all possible n-mers are present in the sequence. In turn, when M ≪ 4^n, only a small part of the total number of n-mers appears (and, for instance, in microbial genomes most n-mers appear only once). The results for different M and n form a well-defined pattern. The upper bound of this pattern is given by a simple analytical formula, which can be found under the assumption of the purely random appearance of n-mers in genomes (see Appendix A for details):
$$p = \frac{1}{1 + (4^n/M)}. \qquad (1)$$
This upper bound is shown in the figure as a solid line. Similar results for DNA- and RNA-based viruses and multicellular
organisms can be found in the supplementary data. It is worth
noting that such a pattern for multicellular organisms is located
notably below the expected upper bound, which is in agreement with a significant presence of repeated parts in these
genomes.
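The two regimes described above (saturation for small n, sparse presence for large n) can be seen by tabulating Equation (1) for a fixed M. The sketch below assumes the total sequence length of S.typhi reported in Table 1; the function name is illustrative.

```python
def random_boundary(n: int, M: int) -> float:
    """Equation (1): p = 1 / (1 + 4^n / M)."""
    return 1.0 / (1.0 + 4 ** n / M)


M = 9_618_074  # total sequence length of S.typhi, taken from Table 1
for n in (5, 7, 10, 12, 14, 16):
    print(f"n={n:2d}  4^n/M={4 ** n / M:10.4g}  p={random_boundary(n, M):.4f}")
```

For this M the boundary stays close to 1 for the shortest n-mers and falls below 1% by n = 16, which is why intermediate values of n are the informative ones.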
presence/absence of n-mers in various species in the case of larger n, such that the condition 4^n ≫ M holds. One should expect that distributions of presence of longer n-mers are also not random, since genomes (especially larger ones) contain structural repeats. In addition, the occurrence statistics for short oligonucleotides (2 and 3mers) are found to be not random, and this affects the occurrence distributions for longer n-mers, since they contain 2 and 3mers as structural elements. However, we found a remarkable similarity of the distributions of presence of n-mers in different genomes to the corresponding random distribution (‘random boundary’). Deviations from a purely random distribution are small for viruses and bacterial genomes. Although the difference grows for multicellular organisms, our results show that the frequency of presence of n-mers in large genomes (such as rice and human) also resembles the random boundary.
In this paper, we examine the relationships between the distributions of the presence/absence of all possible short nucleotide subsequences of various lengths, 5 < n < 20, in more than 1500 different genomes, from viruses and microbes to multicellular organisms. We found no such studies in the literature for n > 11. Indeed, this type of calculation is challenging for larger n because of the exponential growth of time/memory usage in brute force algorithms.
There are two aspects in which the present work differs from previous studies. First, we consider larger values of subsequence length, to satisfy the condition 4^n ≫ M. In particular, n up to 20 is well within the range of computational convenience on a reasonably powerful workstation. In addition, we concentrate our attention on the presence/absence of n-mers in different genomes (independent of the number of their appearances), instead of the more commonly studied frequency distributions of the appearance of n-mers. The properties of the frequency of presence may be important for biosensor design, as discussed below.
Independence of appearance of n-mers in genomes
Correlations of presence of n-mers in different
genomes
The principal goal of our research was to find out how independent/correlated the appearances of n-mers are in different genomes. One possible way to approach this question is to use the well-known multiplication property for the joint probability of the intersection of events, according to which two events A and B can be treated as independent if p(A ∩ B) = p(A)p(B).
Consider a simple example based on three different genomes: (1) Salmonella typhi (NC_003198), (2) Mycobacterium tuberculosis H37Rv (NC_000962) and (3) Bacillus subtilis (NC_000964). A complete set of n-mers contains 4^n n-mers, which, for n = 12, is 4^12 = 16 777 216. We use both strands of the complete genome sequences for our calculations. In the text below, M represents the TSL, and N(n, G) stands for the number of different n-mers in genome G. In Table 1, we present the number of different 12mers that occur in each of these three genomes together with the corresponding frequency of presence [i.e. the probability of finding a randomly picked 12mer in each genome, p = N(12, G)/4^12].
The number N(n, G1, G2) of n-mers (n = 12) that appear in each pair of genomes (G1, G2) was also computed (Table 2). Based on this, one can compare the probabilities of finding randomly picked 12mers in two genomes simultaneously with the probabilities calculated using the multiplication rule. As shown in Table 2, the actual and calculated (expected) probabilities do not differ greatly from each other. This allows us to treat the presence/absence of randomly picked 12mers in these three genomes as independent events.
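The multiplication-rule comparison can be reproduced directly from the counts reported in Tables 1 and 2. The sketch below is illustrative (the dictionary layout and function name are assumptions); it returns the ratio of the actual joint probability to the product of the two single-genome probabilities, which equals 1 for exactly independent events.

```python
TOTAL = 4 ** 12  # number of possible 12-mers

# N(12, G) from Table 1 and N(12, G1, G2) from Table 2.
single = {
    "S.typhi": 5_813_330,
    "M.tuberculosis": 4_361_508,
    "B.subtilis": 5_346_103,
}
joint = {
    ("S.typhi", "M.tuberculosis"): 1_943_814,
    ("S.typhi", "B.subtilis"): 2_335_710,
    ("M.tuberculosis", "B.subtilis"): 1_334_288,
}


def independence_ratio(n1: int, n2: int, n12: int, total: int) -> float:
    """Actual joint probability divided by the product of the marginals."""
    actual = n12 / total
    expected = (n1 / total) * (n2 / total)
    return actual / expected


ratios = {
    pair: independence_ratio(single[pair[0]], single[pair[1]], n12, TOTAL)
    for pair, n12 in joint.items()
}
for pair, ratio in ratios.items():
    print(pair, f"{ratio:.2f}")
```

With real genomes, the joint count would come from intersecting the two sets of present n-mers rather than from a table.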
We calculated the actual and expected probabilities for each
pair of genomes in the three groups (1 000 000+ pairs in total).
Table 1. The frequency of presence of 12mers within the three microbial genomes

| Genome (G) | Genome length (bp) | Total sequence length (bp) | Number of different 12mers present in genome: N(12, G) | p = N(12, G)/4^n (%) |
| --- | --- | --- | --- | --- |
| (1) S.typhi | 4 809 037 | 9 618 074 | 5 813 330 | 34.65 |
| (2) M.tuberculosis H37Rv | 4 411 529 | 8 823 058 | 4 361 508 | 26.00 |
| (3) B.subtilis | 4 214 814 | 8 429 628 | 5 346 103 | 31.87 |
Table 2. Actual and predicted simultaneous presence of 12mers within the three microbial genomes: (1) S.typhi, (2) M.tuberculosis H37Rv and (3) B.subtilis

| Case | Number of 12mers | N(n, G1, G2)/4^n (%) | Calculated probability assuming independence (%) |
| --- | --- | --- | --- |
| Present in genomes (1) and (2) | 1 943 814 | 11.6 | 9.0 |
| Present in genomes (1) and (3) | 2 335 710 | 13.9 | 11.0 |
| Present in genomes (2) and (3) | 1 334 288 | 8.0 | 8.3 |
We were especially interested in the range of n for which the frequency of presence, p∗, of different n-mers in the genome lies between 5% and 50% of the total number of possible n-mers (4^n). This range for different microbial
Fig. 1. The frequency of presence of different n-mers, p = N(n, G)/4^n, as a function of the ratio 4^n/M.
Table 3. The optimal length of n-mers (n∗) for different genome sizes and frequencies of presence (p∗)

| Total sequence length (Mb) | n∗ determined for frequency of presence 50% (p∗ = 0.5) | n∗ determined for frequency of presence 5% (p∗ = 0.05) |
| --- | --- | --- |
| 0.8 | 9.80 | 11.93 |
| 2.0 | 10.47 | 12.59 |
| 10.0 | 11.63 | 13.75 |
$$n^* = \frac{\log[M(1 - p^*)/p^*]}{\log(4)}. \qquad (2)$$
This formula works well for all three groups of genomes (viruses, microbes and multicellular organisms). The upper and lower bounds of n∗ for genome sizes between 0.8 and 10 Mb, which are typical for microbial genomes, are shown in Table 3. In accordance with this, the value n = 12 seems to be the most reasonable one for all microbial genomes. For viral genomes the appropriate value was found to be n = 7.
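As a quick check of Equation (2), the following sketch recomputes the entries of Table 3 from the total sequence length and the target frequency of presence. The function name is an assumption made for illustration.

```python
import math


def optimal_n(M: float, p_star: float) -> float:
    """Equation (2): n* = log[M (1 - p*) / p*] / log 4."""
    return math.log(M * (1.0 - p_star) / p_star) / math.log(4.0)


for M in (0.8e6, 2.0e6, 10.0e6):      # total sequence lengths from Table 3
    print(f"{M / 1e6:5.1f} Mb  n*(50%)={optimal_n(M, 0.5):.2f}  "
          f"n*(5%)={optimal_n(M, 0.05):.2f}")
```

Because n must be an integer in practice, any integer between the two bounds for a given genome size keeps the frequency of presence within the 5–50% window.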
We found that for all 11 990 pairs of microbial genomes and n = 12, the average ratio of actual to expected probabilities is 1.37 ± 0.67. For viral genomes and n = 7, the average ratio of actual to expected probabilities was found to be 1.07 ± 0.12 for 387 840 genome pairs of DNA-based viruses and 1.10 ± 0.12 for 621 732 genome pairs of RNA-based viruses. Thus, we conclude that for this range of n the presence of n-mers in different genomes can, to a good approximation, be treated as independent events.
The highest deviations between the expected and actual
probabilities were found among closely related genomes. For
instance, using 7mers, a high ratio (185%) was found for Duck
hepatitis B virus (NC_001344) versus Stork hepatitis B virus
(NC_003325) with 8.1% expected and 15.0% actual.
An example of closely related microbial genomes would be Staphylococcus aureus N315 (NC_002745) versus S.aureus Mu50 (NC_002758), with 4.0% expected and 19.7% actual, i.e. a ratio of about 491%. Another extreme case was found for three microbial genomes: Chlamydophila pneumoniae CWL029 (NC_000922), C.pneumoniae AR39 (NC_002179) and C.pneumoniae J138 (NC_002491), which have the highest (8-fold) ratio of actual and expected probabilities for 12mers (1.5% expected and 12.3% actual). The results for these three microbial genomes are presented in Table 4.
| Case | Number of 12mers | Number of 12mers/4^n (%) | Calculated probability assuming independence (%) |
| --- | --- | --- | --- |
| Present in genome (a) and absent in genome (b) | 7 712 | 0.046 | |
| Absent in genome (a) and present in genome (b) | 7 214 | 0.043 | |
| Present in genomes (a) and (b) | 2 058 304 | 12.268 | 1.52 |
| Present in genome (a) and absent in genome (c) | 11 526 | 0.069 | |
| Absent in genome (a) and present in genome (c) | 10 706 | 0.064 | |
| Present in genomes (a) and (c) | 2 054 490 | 12.246 | 1.52 |
| Present in genome (b) and absent in genome (c) | 6 939 | 0.041 | |
| Absent in genome (b) and present in genome (c) | 6 617 | 0.039 | |
| Present in genomes (b) and (c) | 2 058 579 | 12.270 | 1.52 |
We performed the same calculation for the 24 human chromosomes pairwise. The average ratio of actual to expected probabilities for 14mers is 1.91 ± 0.16, with the maximum ratio found for chromosomes 20 and Y (expected 2.9% versus actual 6.9%).
DISCUSSION
Microbial/viral fingerprints using random
subsets of n-mers
It may be assumed that our results for 1500+ genomes can
be extended to other genomes (many yet to be sequenced).
In this case one may use relatively small sets of randomly
picked n-mers for differentiating between different viruses
and organisms.
This idea can be illustrated by continuing our example for three microbial genomes. Let n∗ be an n-mer length for which between 5% and 50% of all possible n-mers are present, for the desired range of genome lengths. In accordance with Table 3, we may choose the value n∗ = 12. Let us randomly pick L 12mers (say, L = 1000). For example, L can be the number of probes placed on a microarray. Given a genome G1 with the frequency of presence of n-mers p1, we expect that K = p1 L n-mers present in G1 will also appear in our random set, forming a ‘fingerprint’ of G1 (in our example, we expect 50 < K < 500). The probability, ε, that the fingerprint of G1 will exactly coincide with the fingerprint
genome sizes can be numerically determined from Figure 1.
The corresponding frequency of presence for purely random
sequences (random boundary) is shown in Figure 1 by a solid
line. The analytical formula for the random boundary can be
used to estimate this range analytically:
Table 4. Actual and predicted simultaneous presence of 12mers within the
three extremely close microbial genomes: (a) C.pneumoniae CWL029, (b)
C.pneumoniae AR39 and (c) C.pneumoniae J138.
of some other genome G2 (with the frequency of presence of
n-mers p2 ) is found in Appendix B. The result is
$$\varepsilon = (1 - p_1 - p_2 + 2p_{12})^{L}. \qquad (3)$$
Here p12 is the probability for the n-mer to be present in both genomes simultaneously. Let us consider the numeric example from Tables 1 and 2 of two species that are far from each other, S.typhi versus M.tuberculosis H37Rv: p1 = 0.3465, p2 = 0.2600, p12 = 0.1160. With L = 1000, a remarkable accuracy of ε = 1.7 × 10^−204 can theoretically be achieved.
Given a desirable probability of error, ε, one can determine
the appropriate size, L, of a random set of n-mers which can
be used for reliable identification of genomes as
$$L = \frac{\log \varepsilon}{\log(1 - p_1 - p_2 + 2p_{12})}. \qquad (4)$$
For related organisms, the genomes may contain large common parts. This means that p12 may be close to p1 and p2 .
To give a numeric example of close relatives, let us consider
S.aureus N315 versus S.aureus Mu50. Now p1 = 0.198,
p2 = 0.203, p12 = 0.197 and an accuracy of ε = 10^−10
can be achieved with L = 3278. We would like to stress the
logarithmic dependence of the sampling size or the number
of probes L, on the error probability, ε. This feature is of
principal importance for our discussion.
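The two numeric examples above can be checked with a few lines of Python. This is a sketch with illustrative function names; the error probability is computed as a base-10 logarithm, a choice made here so that very small values of ε do not underflow for larger L.

```python
import math


def log10_error(p1: float, p2: float, p12: float, L: int) -> float:
    """log10 of Equation (3): epsilon = (1 - p1 - p2 + 2 p12)^L."""
    return L * math.log10(1.0 - p1 - p2 + 2.0 * p12)


def probes_needed(p1: float, p2: float, p12: float, eps: float) -> float:
    """Equation (4): L = log(eps) / log(1 - p1 - p2 + 2 p12)."""
    return math.log(eps) / math.log(1.0 - p1 - p2 + 2.0 * p12)


# Distant pair (S.typhi vs M.tuberculosis H37Rv), values quoted in the text.
distant = log10_error(0.3465, 0.2600, 0.1160, 1000)
# Close pair (S.aureus N315 vs S.aureus Mu50), values quoted in the text.
close = probes_needed(0.198, 0.203, 0.197, 1e-10)
```

Since L depends on log ε, tightening the tolerance from 10^−10 to 10^−20 only doubles the number of probes.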
Therefore, we can use practically any sufficiently random subset of n-mers of appropriate size to construct a microarray to diagnose to which organism a given DNA/RNA sample belongs. Different sizes of n-mers must be employed for recognition of different organisms based on their genome lengths. Values of n that correspond to given intervals of genome lengths can be easily calculated using the above formulas. In fact, only 11 different n values, 7 ≤ n ≤ 17, would be enough to cover a large variety of genome sizes from 1 kb to 9 Gb.
The important advantage of such an approach is that it can be used without a priori knowledge of the sequence itself. This implies that there is no need to perform the expensive and time-consuming process of sequencing before array design. It is enough to obtain the purified DNA, hybridize it on a sufficiently random microarray chip and check which n-mers show up. Taking into account how accessible the DNA of thousands of microbes and viruses is, how easily each microarray can be produced, and the fact that we do not need to determine quantitative values of expression (we need just a yes/no answer), it should be possible to produce an essentially universal microbial/viral DNA chip.
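The yes/no fingerprint described above can be sketched as follows. Everything here is an illustrative assumption: the probe set is drawn with Python's pseudo-random generator, the two "genomes" are made-up n-mer sets, and hybridization is idealized as exact set membership.

```python
import random


def random_probe_set(n: int, L: int, seed: int = 0) -> list[str]:
    """Draw L distinct random n-mers to serve as hypothetical array probes."""
    rng = random.Random(seed)
    probes = set()
    while len(probes) < L:
        probes.add("".join(rng.choice("ACGT") for _ in range(n)))
    return sorted(probes)


def fingerprint(genome_nmers: set[str], probes: list[str]) -> tuple[bool, ...]:
    """Yes/no pattern: which probes are present in the genome's n-mer set."""
    return tuple(probe in genome_nmers for probe in probes)


# Toy illustration with made-up n-mer sets standing in for two genomes.
probes = random_probe_set(n=3, L=20, seed=1)
genome_a = set(probes[:8])
genome_b = set(probes[4:12])
same = fingerprint(genome_a, probes) == fingerprint(genome_b, probes)
```

An identification error in the sense of Equation (3) corresponds to `same` being true for two different genomes.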
Fingerprints of closely related organisms
We next consider what happens when we try to compare closely related organisms using this approach (e.g. different types of influenza or different strains of the same microbes). We assume that two genomes G1 and G2 almost coincide and differ only in m randomly located nucleotides. This situation
$$L = \frac{\log \varepsilon}{p \log(1 - mn/N)} \le \frac{M |\log \varepsilon|}{p\, m\, n}. \qquad (5)$$
Here, N is the number of different n-mers contained in G1 (which is approximately equal to the number of different n-mers contained in G2).
Such an approach can provide the level of accuracy necessary for individual human fingerprints. Let us assume that the differences between individual human beings appear only because of SNPs, which have equal probability and are randomly located in the genome. According to literature estimates (Weiner and Hudson, 2002), the total number of SNPs in the human genome is expected to be ∼3 000 000. Then, calculating the necessary size for the random microarray (m/M ∼ 0.1%, ε = 10^−10, n = 17, p = 0.284), we have L ∼ 4769. This rough estimation is promising and indicates that this possibility deserves a proper experimental study.
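The estimate L ∼ 4769 follows from the right-hand side of Equation (5) once it is written in terms of the SNP density m/M. The sketch below uses the parameter values quoted above; the function and argument names are illustrative.

```python
import math


def probes_for_snps(snp_fraction: float, n: int, p: float, eps: float) -> float:
    """Right-hand side of Equation (5), M |log eps| / (p m n),
    expressed through the SNP density snp_fraction = m / M."""
    return abs(math.log(eps)) / (p * snp_fraction * n)


L_human = probes_for_snps(snp_fraction=0.001, n=17, p=0.284, eps=1e-10)
```

The required array size scales inversely with the SNP density, so halving m/M doubles L.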
We recall that our theoretical estimations have been made for randomly picked sets of n-mers. A further possibility is to start with a larger than necessary random set of n-mers (say, L = 10 000) and then to decrease the microarray size by experimenting with the desired set of genomes (using, for instance, a simple optimization approach).
CONCLUSIONS
We presented results of a correlation analysis for distributions of the presence/absence of short subsequences of different length (n-mers, n = 5–20) in more than 1500 microbial and viral genomes together with five genomes of multicellular organisms (including human). Our results show that for organisms that are not close relatives of each other, a range of values of n can be found such that the presence/absence of different n-mers in different genomes is practically uncorrelated (within a probabilistic tolerance, ε). For close relatives such correlations do appear, but are not as strong as might be expected.
The size of the correlations among the n-mers present in different genomes leads to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate between different microbial and viral genomes and, possibly, individual human beings with a convenient number of combinatorial experiments. The formulas derived yield the size of a combinatorial experiment designed to identify an organism, given the length of its genome, a convenient length of probe, n, and a tolerance or error, ε.
simulates the existence of point mutations or single nucleotide
polymorphisms (SNPs). Let L be the size (number of probes)
of the microarray and p, the frequency of presence of n-mers
in a genome with a TSL value M. The value of L, necessary to distinguish the fingerprints of these two genomes with
the error probability ε, can be estimated by the formula (see
Appendix B):
ACKNOWLEDGEMENTS
The authors thank Prof. M. Hogan for interesting conversations. S.C., B.M.P. and Y.F. thank TLCC for partial funding
of this work. T.-B.L. was supported by a training fellowship
from the W.M. Keck Foundation to the Gulf Coast Consortia through the Keck Center for Computational and Structural
Biology. B.M.P. and Y.F. thank the NIH for partial support
of this work and NPACI for computational support. S.C.
is grateful to the University of Houston Computer Science
Department for hospitality.
REFERENCES
Campbell,A., Mrazek,J. and Karlin,S. (1999) Genome signature
comparisons among prokaryote, plasmid, and mitochondrial
DNA. Proc. Natl Acad. Sci., USA, 96, 9184–9189.
Deschavanne,P.J., Giron,A., Vilain,J., Fagot,G. and Fertil,B. (1999)
Genomic signature: characterization and classification of species
assessed by chaos game representation of sequences. Mol. Biol.
Evol., 16, 1391–1399.
Fislage,R. (1998) Differential display approach to quantitation of
environmental stimuli on bacterial gene expression. Electrophoresis, 19, 613–616.
Fislage,R., Berceanu,M., Humboldt,Y., Wendt,M. and Oberender,H.
(1997) Primer design for a prokaryotic differential display RT–
PCR. Nucleic Acids Res., 25, 1830–1835.
Fofanov,V., Fofanov,Y. and Pettitt,B.M. (2002a) Fast subsequence
search using incomplete search trees. The Seventh Structural
Biology Symposium of Sealy Center for Structural Biology. The
University of Texas Medical Branch, Galveston, TX, p. 51.
Fofanov,V., Fofanov,Y. and Pettitt,B.M. (2002b) Counting array
algorithms for the problem of finding appearances of all possible patterns of size n in a sequence. The 2002 Bioinformatics Symposium, Keck/GCC Bioinformatics Consortium. W.M.
Keck Center for Computational and Structural Biology, Houston,
TX, p. 14.
Karlin,S. (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends in Microbiol.,
9, 335–343.
Karlin,S. and Ladunga,I. (1994) Comparisons of eukaryotic genomic
sequences. Proc. Natl Acad. Sci., USA, 91, 12832–12836.
Karlin,S., Mrazek,J. and Campbell,A.M. (1997) Compositional
biases of bacterial genomes and evolutionary implications.
J. Bacteriol., 179, 3899–3913.
Nakashima,H., Nishikawa,K. and Ooi,T. (1997) Differences in
dinucleotide frequencies of human, yeast, and Escherichia coli
genes. DNA Res., 4, 185–192.
Nakashima,H., Ota,M., Nishikawa,K. and Ooi,T. (1998) Genes
from nine genomes are separated into their organisms in the
dinucleotide composition space. DNA Res., 5, 251–259.
Nussinov,R. (1984) Doublet frequencies in evolutionary distinct
groups. Nucleic Acids Res., 12, 1749–1763.
Sandberg,R., Winberg,G., Branden,C.I., Kaske,A., Ernberg,I. and
Coster,J. (2001) Capturing whole-genome characteristics in short
sequences using a naive Bayesian classifier. Genome Res., 11,
1404–1409.
Southern,E.M. (2001) DNA microarrays. History and overview.
Meth. Mol. Biol., 170, 1–15.
Vainrub,A., Li,T.-B, Fofanov,Y. and Pettitt,B.M. (2003) Theoretical
considerations for the efficient design of DNA arrays. In: Moore,J.
and Zouridakis,G. (eds.), Biomedical Technology and Devices
Handbook. CRC Press, pp. 14.11–14.14.
Weiner,M.P. and Hudson,T.J. (2002) Introduction to SNPs: discovery
of markers for disease. Biotechniques, 10(Suppl. 4–7), 12–13.
APPENDIX A
Here, we will analytically estimate the frequency of presence
of n-mers in a genome of length M. Let us apply the logic of
the example shown in Tables 1 and 3 to autocorrelations, i.e.
let us check whether the appearances of distinct n-mers are
independent or correlated within a single genome. Assume
that the multiple appearances of a given n-mer at different
locations within the same genome are also independent events.
Then, the probability of an n-mer to appear once is p, twice is p^2, thrice is p^3 and so on. The total number of n-mers in the genome, taking into account multiple appearances, is
$$M \approx 4^n (p + p^2 + p^3 + \cdots) = \frac{4^n p}{(1 - p)}, \qquad (A1)$$
from which one obtains,
$$p \approx \frac{M}{(M + 4^n)}. \qquad (A2)$$
This formula has been presented in the text, and is shown in Figure 1 by a solid line. One may also compare it with the experimental values from the last column of Table 1. In accordance with Equation (1) we have for S.typhi p = 36.44%, for M.tuberculosis H37Rv p = 34.46% and for B.subtilis p = 33.44%. This corresponds better to experimental values (34.65, 26.00 and 31.87%, respectively)
Clearly, additional experimental study (including, e.g.
hybridization of microbial samples on random microarrays) is
necessary to verify if the statistical features described above
can lead to the creation of a real biosensor (as it is suggested by our in silico experiments). Future studies should
take into account errors in the course of hybridization. Rough theoretical estimation, assuming independent probabilities of hybridization error at different microarray sites, suggests that the total number of hybridization errors on an array of size L is proportional to √L. Thus, the total relative error due to imperfect hybridization can be made small by increasing the number of probes on the microarray, L. On the other hand, it is
currently not clear to what degree the genomes are correctly
assembled. Possible errors in sequences may have affected
our results. We however believe that the parts of genomic
sequences that have been correctly reconstructed are significant enough to determine the statistical properties described
above.
than the estimation without taking into account multiple
appearances,
$$p \approx \frac{M}{4^n}, \qquad (A3)$$
which leads to the probabilities 57.3, 52.6 and 50.2%, respectively. This fact is in accordance with the conclusion about
the apparently nearly random statistical character of the
appearance of n-mers in a single genome.
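The comparison between the two estimates and the observed values can be recomputed from the total sequence lengths in Table 1. The sketch below is illustrative; the function names are assumptions, and the observed percentages are copied from Table 1.

```python
TOTAL = 4 ** 12

# (total sequence length M, observed frequency of presence in %) from Table 1.
genomes = {
    "S.typhi": (9_618_074, 34.65),
    "M.tuberculosis H37Rv": (8_823_058, 26.00),
    "B.subtilis": (8_429_628, 31.87),
}


def with_repeats(M: int, total: int) -> float:
    """Equation (A2): p = M / (M + 4^n)."""
    return M / (M + total)


def without_repeats(M: int, total: int) -> float:
    """Equation (A3): p = M / 4^n."""
    return M / total


for name, (M, observed) in genomes.items():
    a2 = 100 * with_repeats(M, TOTAL)
    a3 = 100 * without_repeats(M, TOTAL)
    print(f"{name:22s} observed={observed:5.2f}%  A2={a2:5.2f}%  A3={a3:5.2f}%")
```

For all three genomes the (A2) value lies closer to the observed frequency than the (A3) value does.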
APPENDIX B
$$\frac{N_{12}}{N_1 + N_2 - N_{12}}. \qquad (A4)$$
An error, E, occurs when two genomes share the same fingerprint, i.e. all n-mers that form the fingerprint represent the
intersection region. This will happen with probability
$$P(E \mid k) = \left(\frac{N_{12}}{N_1 + N_2 - N_{12}}\right)^{k}. \qquad (A5)$$
In fact, this is a conditional probability of an error, E, if we
have a fingerprint of length k.
We now need to calculate an average with respect to all possible fingerprints. There are C_L^k = L!/[k!(L − k)!] different fingerprints of size k, which appear with equal probabilities [P(S ∈ G1 ∪ G2)]^k [1 − P(S ∈ G1 ∪ G2)]^(L−k), where P(S ∈ G1 ∪ G2) is the probability for an n-mer S to find itself in the union G1 ∪ G2 when sampling L times. Therefore, we come to a binomial distribution of fingerprint sizes,
$$P(k) = \frac{L!}{k!(L - k)!} \left(\frac{N_1 + N_2 - N_{12}}{4^n}\right)^{k} \qquad (A6)$$
$$\times \left(1 - \frac{N_1 + N_2 - N_{12}}{4^n}\right)^{L-k}. \qquad (A7)$$
Calculating the average error we have,
$$P(E) = \sum_k P(E \mid k) P(k) = (1 - p_1 - p_2 + 2p_{12})^{L}. \qquad (A8)$$
Here, pj = Nj/4^n is the probability of presence in Gj (j = 1, 2), and p12 = N12/4^n is the probability of presence in the
$$L = \frac{\log \varepsilon}{\log(1 - p_1 - p_2 + 2p_{12})}. \qquad (A9)$$
We would like to again stress the logarithmic dependence of
the microarray size L on the error level ε. This feature is of
principal importance for the analysis under discussion. The
following three cases will be considered separately.
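The closed form in Equation (A8) follows from the binomial theorem, and it can be verified numerically by summing P(E | k) P(k) term by term. The sketch below does this for the S.typhi versus M.tuberculosis values of Tables 1 and 2 with a small, arbitrarily chosen L; the function names are illustrative.

```python
import math


def average_error_sum(p1: float, p2: float, p12: float, L: int) -> float:
    """Sum over k of P(E | k) P(k), with P(k) binomial in the union probability."""
    union = p1 + p2 - p12          # probability that a probe hits G1 ∪ G2
    shared = p12 / union           # Equation (A4): N12 / (N1 + N2 - N12)
    return sum(
        shared ** k * math.comb(L, k) * union ** k * (1.0 - union) ** (L - k)
        for k in range(L + 1)
    )


def average_error_closed(p1: float, p2: float, p12: float, L: int) -> float:
    """Closed form (A8): (1 - p1 - p2 + 2 p12)^L."""
    return (1.0 - p1 - p2 + 2.0 * p12) ** L


args = (0.3465, 0.2600, 0.1160, 50)   # Table 1 and 2 values, small L for speed
```

The two functions agree to within floating-point rounding, because each term of the sum is C(L, k) p12^k (1 − p1 − p2 + p12)^(L−k).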
Essentially different organisms
In this case, in accordance with the discussion in the text, the
presence/absence of n-mers in one genome is not correlated
with the presence/absence of n-mers in another genome and
we can write p12 ≈ p1 p2 . Taking, for simplicity, p1 ≈ p2 ≈
p, we obtain,
$$L = \frac{\log \varepsilon}{\log(1 - 2p + 2p^2)}. \qquad (A10)$$
For instance, if ε = 10^−10 and p = 0.05, we obtain L = 230.
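A short sketch of Equation (A10) shows how the required array size depends on the frequency of presence when the two genomes are uncorrelated. The function name is an assumption, and the extra values of p are chosen here only for illustration.

```python
import math


def probes_unrelated(p: float, eps: float) -> float:
    """Equation (A10): L = log(eps) / log(1 - 2p + 2p^2), for p1 = p2 = p."""
    return math.log(eps) / math.log(1.0 - 2.0 * p + 2.0 * p * p)


for p in (0.05, 0.25, 0.5):
    print(f"p={p:.2f}  L={probes_unrelated(p, 1e-10):.1f}")
```

The base 1 − 2p + 2p^2 is smallest at p = 0.5, so within the 5–50% window fewer probes are needed as the frequency of presence approaches 50%.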
Related organisms
Now, p12 ≠ p1 p2. Assuming that the intersection G1 ∩ G2 almost coincides with the union, G1 ∪ G2, or
$$N_1 + N_2 - N_{12} > N_{12} \gg N_1 + N_2 - 2N_{12}, \qquad (A11)$$
one can rewrite Equation (A9) in a slightly different form.
Starting once again with Equations A7–A9 and approximating the binomial distribution by a Gaussian of width s = √(LP(1 − P)), centered at k̄ = LP, where P = (N1 + N2 − N12)/4^n is the probability for an n-mer to be present in the union G1 ∪ G2, we find,
$$P(E) = \sum_k e^{-\alpha k}\, \frac{1}{s\sqrt{2\pi}}\, e^{-(k - \bar{k})^2/2s^2}, \qquad e^{-\alpha} = \frac{N_{12}}{N_1 + N_2 - N_{12}}. \qquad (A12)$$
Provided that α ≪ 1 [which follows from inequality (5)] and k̄ ≫ 1 (which is consistent with a small error level), one can change the summation to integration and obtain immediately,
$$P(E) = \frac{1}{s\sqrt{2\pi}} \int e^{-\alpha k - (k - \bar{k})^2/2s^2}\, dk = e^{-\alpha \bar{k} + \alpha^2 s^2/2}. \qquad (A13)$$
Finally,
$$P(E) \approx \left(\frac{N_{12}}{N_1 + N_2 - N_{12}}\right)^{\bar{k}}. \qquad (A14)$$
Now we can find the relation between the error level and the microarray size in the form,
$$\bar{k} = PL = \frac{\log \varepsilon}{\log[N_{12}/(N_1 + N_2 - N_{12})]}. \qquad (A15)$$
Here, P, the probability of presence of an n-mer in the union of the two genomes (which almost coincides with their intersection), is given by
Here, we will estimate the probability of making an error when discriminating organisms by their ‘fingerprints’ on a random microarray that consists of L n-mers. Assume that we need to discriminate between two genomes G1 and G2 of sizes M1 and M2, respectively. Let G1 (G2) contain N1 (N2) different n-mers, and let N12 = N(n, G1, G2) n-mers be present simultaneously in both genomes (this is the size of the intersection of the two sets of n-mers corresponding to the ‘n-mer contents’ of G1 and G2; we denote this set as G1 ∩ G2). The union G1 ∪ G2 contains N1 + N2 − N12 n-mers. Let us consider a fingerprint of the union of the two genomes, G1 ∪ G2. For every n-mer appearing in this fingerprint, the probability that it occurs in the intersection region, G1 ∩ G2, is
intersection G1 ∩ G2 . Given a desirable level of tolerance
or error, P (E) ∼ ε, one can now estimate the appropriate
combinatorial experiment (array) size:
P = (N1 + N2 − N12)/4^n ∼ p1 ∼ p2. The last formula leads to numerical values similar to those of Equation (A9) if N12 ≫ N1 + N2 − 2N12. For example, for P = 0.05, N12/(N1 + N2 − N12) = 0.9 and ε = 10^−10, we have L = 4371.
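To see how close the Gaussian approximation is to the exact binomial result, the sketch below evaluates both for the numerical example above. It uses the identity 1 − p1 − p2 + 2p12 = 1 − P(1 − q), where q = N12/(N1 + N2 − N12) is an assumed shorthand for the shared fraction; the function names are illustrative.

```python
import math


def probes_gaussian(P: float, shared: float, eps: float) -> float:
    """Equation (A15) solved for L: L = log(eps) / (P log(shared)),
    where shared = N12 / (N1 + N2 - N12)."""
    return math.log(eps) / (P * math.log(shared))


def probes_exact(P: float, shared: float, eps: float) -> float:
    """Equation (A9) rewritten with 1 - p1 - p2 + 2 p12 = 1 - P (1 - shared)."""
    return math.log(eps) / math.log(1.0 - P * (1.0 - shared))


approx = probes_gaussian(0.05, 0.9, 1e-10)
exact = probes_exact(0.05, 0.9, 1e-10)
```

For these values the approximate size is about 5% smaller than the value obtained from the exact expression.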
Closely related organisms
Let us assume that two genomes G1 and G2 almost coincide
and differ only in m randomly located characters (nucleotides). This situation simulates the existence of SNPs. For
simplicity, let us assume, that N1 = N2 = N. Every character that is different in G1 and G2 belongs simultaneously to n
different n-mers, and the size of the subset in G1 ∪ G2 which
consists of the n-mers that are different in G1 and G2 has a
size, nm = 2N − 2N12 . Then,
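$$N_{12} = N - \frac{mn}{2}, \quad \text{or} \quad N_1 + N_2 - N_{12} = N + \frac{mn}{2},$$
$$P(E) \approx \left(1 - \frac{nm}{N}\right)^{\bar{k}} = \varepsilon. \qquad (A16)$$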
Taking into account, that N ≤ M, we arrive at the estimation,
$$L = \frac{\bar{k}}{P} = \frac{\log \varepsilon}{P \log(1 - mn/N)} \le \frac{M |\log \varepsilon|}{P\, m\, n}. \qquad (A17)$$