BIOINFORMATICS Vol. 20 no. 15 2004, pages 2421–2428
doi:10.1093/bioinformatics/bth266

How independent are the appearances of n-mers in different genomes?

Yuriy Fofanov1,∗, Yi Luo1, Charles Katili1, Jim Wang1, Yuri Belosludtsev3, Thomas Powdrill3, Chetan Belapurkar1, Viacheslav Fofanov1, Tong-Bin Li1, Sergey Chumakov1,4 and B. Montgomery Pettitt1,2

1 Department of Computer Science and 2 Department of Chemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204-3010, USA, 3 Vitruvius Biosciences, The Woodlands, TX, USA and 4 Department of Physics, University of Guadalajara, Guadalajara, Mexico

ABSTRACT

Motivation: Analysis of statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, which implies applications in the diagnosis of infectious diseases, environmental studies, etc.

Results: We present results of the correlation analysis of distributions of the presence/absence of short nucleotide subsequences of different length ('n-mers', n = 5–20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We calculate whether a given n-mer is present or absent (frequency of presence) in a given genome, which is not the usually calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7–20mers in their genomes are not correlated. For close biological relatives, some correlation of the presence of n-mers in this range appears, but is not as strong as expected.
Suppressed correlations among the n-mers present in different genomes lead to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms and possibly individual genomes of the same species, including human, with a low probability of error.

Contact: yfofanov@uh.edu.

Supplementary information: Supplementary data is available at http://www.bioinfo.uh.edu/publications/independence_genomes/.

∗ To whom correspondence should be addressed.
Bioinformatics 20(15) © Oxford University Press 2004; all rights reserved.

INTRODUCTION

Several hundred sequencing projects have already been completed, and several complete genomes of large multicellular organisms have become available. Many sequencing projects are progressing, but the number of species and variations is so large that comparative genomics is just now beginning to be feasible. A relevant question arises as to whether there is sufficient material to look at them from a statistical viewpoint (Vainrub et al., 2003). Statistical analysis of the appearance of short subsequences of length n, called motifs or n-mers, in different DNA sequences (see, e.g. Karlin, 2001), from individual genes to full genomes, is of interest in terms of evolutionary biology. In addition, knowledge of the distribution of appearance of n-mers is necessary for PCR primer (Fislage et al., 1997; Fislage, 1998) and microarray probe design (Southern, 2001). Several attempts (Nussinov, 1984; Karlin and Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997, 1998; Deschavanne et al., 1999; Sandberg et al., 2001) have been made to employ the distributions of appearance for n-mers to identify species with relatively short genome sizes (microbial).
In such an approach, the shapes of the frequency distributions for particular short subsequences [2–4mers (Nussinov, 1984; Karlin and Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997, 1998; Campbell et al., 1999) and 8–9mers (Deschavanne et al., 1999; Sandberg et al., 2001)] have been proposed as a measure to decide what microbial genome we are dealing with, based on a given piece of genome or a whole genome. The above-mentioned papers deal with the frequency of appearance when n is small, such that the total number of n-mers, $4^n$, is smaller than the genome sequence length, M: $4^n < M$. It is clear that distributions of appearance of n-mers in this range are essentially different from those for random sequences of the same lengths.

Here, we calculate whether a given n-mer is present or absent (frequency of presence) in a given genome, which is not the usually calculated number of appearances of n-mers in one or more genomes (frequency of appearance). We consider the distribution of presence/absence of n-mers in various species in the case of larger n, such that the condition $4^n \gg M$ holds. One should expect that distributions of presence of longer n-mers are also not random. Consider that genomes (especially larger ones) contain structural repeats. In addition, the occurrence statistics for short oligonucleotides (2 and 3mers) are found to be not random, and this affects the occurrence distributions for longer n-mers, since they contain 2 and 3mers as structural elements. However, we found a remarkable similarity of the distributions of presence of n-mers in different genomes to the corresponding random distribution ('random boundary'). Deviations from a purely random distribution are very small for viral and bacterial genomes. Although the difference grows for multicellular organisms, our results show that the frequency of presence of n-mers in large genomes (such as rice and human) also resembles the random boundary.

In this paper, we examine the relationships between the distributions of the presence/absence of all possible short nucleotide subsequences of various length, 5 < n < 20, in more than 1500 different genomes, from viruses and microbes to multicellular organisms. We found no such studies in the literature for n > 11. Indeed, this type of calculation is challenging for larger n because of the exponential growth of time/memory usage in brute force algorithms.

There are two aspects in which the present work differs from the previous studies. First, we consider larger values of subsequence lengths, to satisfy the condition $4^n \gg M$. In particular, n up to 20 is well within the range of computational convenience on a reasonably powerful workstation. In addition, we concentrate our attention on the presence/absence of n-mers in different genomes (independent of the number of their appearances), instead of the more commonly studied frequency distributions of the appearance of n-mers. The properties of the frequency of presence may be important for biosensor design, as discussed below.

Received on October 16, 2003; revised on March 9, 2004; accepted on April 1, 2004
Advance Access publication April 15, 2004
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on February 24, 2016
Y.Fofanov et al.

METHODS

For our analysis, we picked all genomes available at the time (May 2003) from the NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome), including microbial (110), viral (1405) and multicellular organism (5) genomes, with sizes ranging from 0.44 kb (Rice yellow mottle virus satellite) to 2.87 Gb (human). A complete list of all genomes and the complete results of the analysis discussed below are available as supplementary material at http://bioinfo.uh.edu/publications/independence_genomes/.

For our computations with multicellular, microbial and viral genomes, we used both complementary strands, because this is how the sequences are observed with present technology (PCR, cDNA microarrays, etc.). This trivially increases the amount of analyzed material by a factor of two. To take this fact into account for normalization, we will use the term 'total sequence length' (TSL), equal to twice the genome length. We will denote the TSL so defined by M.

The calculation of the frequency of presence of n-mers for n > 10 in large genome sequences is challenging because of the exponential growth of time/memory usage in brute force algorithms. To be able to perform calculations for n > 11, new algorithms and special data structures have been developed and implemented (Fofanov et al., 2002a,b); see http://bioinfo.uh.edu/publications/ for details.

In this study, we examined the presence/absence of short subsequences in more than one genome simultaneously, obtaining a frequency of presence/absence across multiple genomes. This distribution is not related to how many occurrences of an n-mer are in a particular genomic set. We performed such analyses separately in four different sets of genomes: RNA-based viruses (789 genomic sequences), DNA-based viruses (616 genomic sequences), microorganisms (110 genomes) and human. In each group, the numbers of simultaneously present 5–18mers were calculated for each pair of genomes. The fourth group contains 24 human chromosomes, for which the numbers of simultaneously present 7–20mers were calculated for each pair of chromosomes.

RESULTS

Frequencies of presence of n-mers in different genomes

As the first step of our analysis we calculated the number, N(n, G), of distinct n-mers of length 5–20 present in each of the 1500+ genomes considered (G). The corresponding results for 114 microbial genomes are shown in Figure 1. The value of N(n, G) depends on two parameters: $4^n$, the total number of all possible n-mers, and the genome length, M. In Figure 1, we show the frequency of presence of different n-mers, $p = N(n, G)/4^n$, as a function of the ratio $4^n/M$. Note that $4^n$ grows very fast when n increases. For short n-mers, n < 7, and long sequences, $M > 4^n$, a kind of 'saturation' can be observed, when all or almost all possible n-mers are present in the sequence. In turn, when $M \ll 4^n$, only a small part of the total number of n-mers appears (and, for instance, in microbial genomes most n-mers appear only once). The results for different M and n form a well-defined pattern. The upper bound of this pattern is given by a simple analytical formula, which can be found under the assumption of the purely random appearance of n-mers in genomes (see Appendix A for details):

$$p = \frac{1}{1 + (4^n/M)}. \tag{1}$$

This upper bound is shown in the figure as a solid line. Similar results for DNA- and RNA-based viruses and multicellular organisms can be found in the supplementary data. It is worth noting that such a pattern for multicellular organisms is located notably below the expected upper bound, which is in agreement with a significant presence of repeated parts in these genomes.

Independence of appearance of n-mers in genomes

Correlations of presence of n-mers in different genomes

The principal goal of our research was to find out how independent/correlated the appearances of n-mers are in different genomes. One of the possible ways to approach this question is by using the well-known multiplication property for the joint probability of the intersection of events, according to which two events A and B can be treated as independent if $p(A \cap B) = p(A)p(B)$. Consider a simple example based on three different genomes: (1) Salmonella typhi (NC_003198), (2) Mycobacterium tuberculosis H37Rv (NC_000962) and (3) Bacillus subtilis (NC_000964).
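The multiplication test described above can be sketched on toy data. The following is a minimal illustration, not the authors' implementation (which relies on the special data structures cited in Methods): it collects the distinct n-mers on both strands with a plain Python set and compares the joint frequency of presence with the product of the individual frequencies. The function names and the random test sequences are assumptions made for this example, and sequences are assumed to contain only the letters A, C, G and T.

```python
import random

# Translation table for complementing a DNA strand (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read in the 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def present_nmers(seq, n):
    """Set of distinct n-mers found on either strand of seq."""
    found = set()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - n + 1):
            found.add(strand[i:i + n])
    return found


def independence_ratio(seq_a, seq_b, n):
    """Ratio of actual to expected joint frequency of presence.

    A value close to 1 means p(A and B) is close to p(A) * p(B).
    """
    total = 4 ** n
    a = present_nmers(seq_a, n)
    b = present_nmers(seq_b, n)
    p_a = len(a) / total
    p_b = len(b) / total
    p_ab = len(a & b) / total
    return p_ab / (p_a * p_b)


# Hypothetical stand-ins for two unrelated genomes: random 20 kb sequences.
rng = random.Random(0)
genome_1 = "".join(rng.choice("ACGT") for _ in range(20000))
genome_2 = "".join(rng.choice("ACGT") for _ in range(20000))

ratio_unrelated = independence_ratio(genome_1, genome_2, 8)
ratio_identical = independence_ratio(genome_1, genome_1, 8)
print(ratio_unrelated, ratio_identical)
```

For the two unrelated random sequences the ratio should stay close to 1, while comparing a sequence with itself gives 1/p, the extreme case of correlated presence. A set of strings is only practical for short sequences and small n; it is used here purely for readability.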
A complete set of n-mers would contain $4^n$ n-mers, which, for n = 12, is $4^{12}$ = 16 777 216. We use both strands of the complete genome sequences for our calculations. In the text below M represents the TSL, and N(n, G) stands for the number of different n-mers in genome G. In Table 1, we present the number of different 12mers that occur in each of these three genomes together with the corresponding frequency of presence [i.e. the probability of finding randomly picked 12mers in each genome, $p = N(12, G)/4^{12}$]. The number $N(n, G_1, G_2)$ of n-mers (n = 12) that appear in each pair of genomes $(G_1, G_2)$ was also computed (Table 2). Based on this, one can compare the probabilities of finding randomly picked 12mers in two genomes simultaneously with the probabilities calculated using the multiplication rule. As shown in Table 2, the actual and calculated (expected) probabilities do not differ greatly from each other. This allows us to treat the presence/absence of randomly picked 12mers in these three genomes as independent events. We calculated the actual and expected probabilities for each pair of genomes in the three groups (1 000 000+ pairs in total).

Table 1. The frequency of presence of 12mers within the three microbial genomes

| Genome (G) | Genome length (bp) | Total sequence length (bp) | Number of different 12mers present in genome: N(12, G) | $p = N(12, G)/4^n$ (%) |
|---|---|---|---|---|
| (1) S.typhi | 4 809 037 | 9 618 074 | 5 813 330 | 34.65 |
| (2) M.tuberculosis H37Rv | 4 411 529 | 8 823 058 | 4 361 508 | 26.00 |
| (3) B.subtilis | 4 214 814 | 8 429 628 | 5 346 103 | 31.87 |

Table 2. Actual and predicted simultaneous presence of 12mers within the three microbial genomes: (1) S.typhi, (2) M.tuberculosis H37Rv and (3) B.subtilis

| Case | Number of 12mers | $N(n, G_1, G_2)/4^n$ (%) | Calculated probability assuming independence (%) |
|---|---|---|---|
| Present in genomes (1) and (2) | 1 943 814 | 11.6 | 9.0 |
| Present in genomes (1) and (3) | 2 335 710 | 13.9 | 11.0 |
| Present in genomes (2) and (3) | 1 334 288 | 8.0 | 8.3 |

We were especially interested in the range of n which gives rise to a frequency of presence, $p^*$, of different n-mers in the genome between 5% and 50% of the total number of possible n-mers ($4^n$). This range for different microbial genome sizes can be numerically determined from Figure 1. The corresponding frequency of presence for purely random sequences (random boundary) is shown in Figure 1 by a solid line. The analytical formula for the random boundary can be used to estimate this range analytically:

$$n^* = \frac{\log[M(1 - p^*)/p^*]}{\log(4)}. \tag{2}$$

Fig. 1. The frequency of presence of different n-mers, $p = N(n, G)/4^n$, as a function of the ratio $4^n/M$.

This formula works well for all three groups of genomes (viruses, microbes and multicellular organisms). The upper and lower bounds of $n^*$ for genome sizes between 0.8 and 10 Mb, which are typical for microbial genomes, are shown in Table 3. In accordance with this, the value n = 12 seems to be the most reasonable one for all microbial genomes. For viral genomes the appropriate value was found to be n = 7.

Table 3. The optimal length of n-mers ($n^*$) for different genome sizes and frequencies of presence ($p^*$)

| Total sequence length (Mb) | $n^*$ determined for frequency of presence 50% ($p^* = 0.5$) | $n^*$ determined for frequency of presence 5% ($p^* = 0.05$) |
|---|---|---|
| 0.8 | 9.80 | 11.93 |
| 2.0 | 10.47 | 12.59 |
| 10.0 | 11.63 | 13.75 |

We found that for all 11 990 pairs of microbial genomes and the value of n = 12 the average ratio of actual and expected probabilities is 1.37 ± 0.67. For viral genomes and the value of n = 7, the average ratio of actual and expected probabilities was found to be 1.07 ± 0.12 for 387 840 genome pairs of DNA-based viruses and 1.10 ± 0.12 for 621 732 genome pairs of RNA-based viruses. Thus, we conclude that for this range of n the presence of n-mers in different genomes can, to a good approximation, be treated as independent events.

The highest deviations between the expected and actual probabilities were found among closely related genomes. For instance, using 7mers, a high ratio (185%) was found for Duck hepatitis B virus (NC_001344) versus Stork hepatitis B virus (NC_003325), with 8.1% expected and 15.0% actual. An example of closely related microbial genomes would be Staphylococcus aureus N315 (NC_002745) versus S.aureus Mu50 (NC_002758), with 4.0% expected and 19.7% actual, i.e. 491% of the expected value. Another extreme case was found for three microbial genomes: Chlamydophila pneumoniae CWL029 (NC_000922), C.pneumoniae AR39 (NC_002179) and C.pneumoniae J138 (NC_002491), which have the highest (8-fold) ratio of actual and expected probabilities for 12mers (1.5% expected and 12.3% actual). The results for these three microbial genomes are presented in Table 4.

Table 4. Actual and predicted simultaneous presence of 12mers within the three extremely close microbial genomes: (a) C.pneumoniae CWL029, (b) C.pneumoniae AR39 and (c) C.pneumoniae J138.

| Case | Number of 12mers | Number of 12mers/$4^n$ (%) | Calculated probability assuming independence (%) |
|---|---|---|---|
| Present in genome (a) and absent in genome (b) | 7 712 | 0.046 | |
| Absent in genome (a) and present in genome (b) | 7 214 | 0.043 | |
| Present in genomes (a) and (b) | 2 058 304 | 12.268 | 1.52 |
| Present in genome (a) and absent in genome (c) | 11 526 | 0.069 | |
| Absent in genome (a) and present in genome (c) | 10 706 | 0.064 | |
| Present in genomes (a) and (c) | 2 054 490 | 12.246 | 1.52 |
| Present in genome (b) and absent in genome (c) | 6 939 | 0.041 | |
| Absent in genome (b) and present in genome (c) | 6 617 | 0.039 | |
| Present in genomes (b) and (c) | 2 058 579 | 12.270 | 1.52 |

We performed the same calculation for the 24 human chromosomes pairwise. The average ratio of actual and expected probabilities of 14mers is 1.91 ± 0.16, the maximum ratio being found for the 20th and Y chromosomes (expectation 2.9% versus actual 6.9%).

DISCUSSION

Microbial/viral fingerprints using random subsets of n-mers

It may be assumed that our results for 1500+ genomes can be extended to other genomes (many yet to be sequenced). In this case one may use relatively small sets of randomly picked n-mers for differentiating between different viruses and organisms. This idea can be illustrated by continuing our example for three microbial genomes. Let $n^*$ be the n-mer length for which between 5% and 50% of all possible n-mers are present over the desired range of genome lengths. In accordance with Table 3, we may choose the value $n^* = 12$. Let us randomly pick L 12mers (say, L = 1000). For example, L can be the number of probes placed on a microarray. Given a genome $G_1$ with the frequency of presence of n-mers $p_1$, we expect that $K = p_1 L$ n-mers present in $G_1$ will also appear in our random set, forming a 'fingerprint' of $G_1$ (in our example, we expect 50 < K < 500). The probability, ε, that the fingerprint of $G_1$ will exactly coincide with the fingerprint of some other genome $G_2$ (with the frequency of presence of n-mers $p_2$) is found in Appendix B. The result is

$$\varepsilon = (1 - p_1 - p_2 + 2p_{12})^L. \tag{3}$$

Here $p_{12}$ is the probability for the n-mer to be present in both genomes simultaneously.
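Equations (1)–(3) are simple enough to evaluate directly. The sketch below is an illustration only: the function names are chosen for this example, and the inputs are the values quoted in Tables 1–3. Equation (3) is evaluated as a base-10 logarithm so that very small probabilities stay easy to read.

```python
import math


def random_boundary(n, tsl):
    """Equation (1): frequency of presence expected for a random sequence."""
    return 1.0 / (1.0 + 4 ** n / tsl)


def optimal_length(tsl, p_star):
    """Equation (2): n-mer length giving frequency of presence p_star."""
    return math.log(tsl * (1.0 - p_star) / p_star) / math.log(4)


def log10_coincidence(p1, p2, p12, probes):
    """Base-10 logarithm of Equation (3) for a random set of `probes` n-mers."""
    return probes * math.log10(1.0 - p1 - p2 + 2.0 * p12)


# Random boundary for M.tuberculosis H37Rv (TSL from Table 1), n = 12.
p_random = random_boundary(12, 8_823_058)

# Bounds on n* for total sequence lengths of 0.8 and 10 Mb (Table 3).
n_lower = optimal_length(0.8e6, 0.5)
n_upper = optimal_length(10.0e6, 0.05)

# S.typhi versus M.tuberculosis H37Rv with L = 1000 probes (Tables 1 and 2).
log_eps = log10_coincidence(0.3465, 0.2600, 0.1160, 1000)

print(f"p_random = {p_random:.4f}")
print(f"n* between {n_lower:.2f} and {n_upper:.2f}")
print(f"log10(eps) = {log_eps:.2f}")
```

The two values of n* bracket the choice n = 12 made in the text for microbial genomes, and the logarithm of ε corresponds to a coincidence probability of the order of $10^{-204}$.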
Let us consider the numeric example from Tables 1 and 2 of two species that are far from each other, S.typhi versus M.tuberculosis H37Rv: $p_1 = 0.3465$, $p_2 = 0.2600$, $p_{12} = 0.1160$. With L = 1000, a remarkable accuracy of $\varepsilon = 1.7 \times 10^{-204}$ can theoretically be achieved. Given a desirable probability of error, ε, one can determine the appropriate size, L, of a random set of n-mers which can be used for reliable identification of genomes as

$$L = \frac{\log \varepsilon}{\log(1 - p_1 - p_2 + 2p_{12})}. \tag{4}$$

For related organisms, the genomes may contain large common parts. This means that $p_{12}$ may be close to $p_1$ and $p_2$. To give a numeric example of close relatives, let us consider S.aureus N315 versus S.aureus Mu50. Now $p_1 = 0.198$, $p_2 = 0.203$, $p_{12} = 0.197$ and an accuracy of $\varepsilon = 10^{-10}$ can be achieved with L = 3278. We would like to stress the logarithmic dependence of the sampling size, or the number of probes L, on the error probability, ε. This feature is of principal importance for our discussion.

Therefore, we can use practically any sufficiently random subset of n-mers of appropriate size to construct a microarray to diagnose to which organism a given DNA/RNA sample belongs. Different sizes of n-mers must be employed for recognition of different organisms based on their genome lengths. Values of n that correspond to given intervals of genome lengths can be easily calculated using the above formulas. In fact, only 11 different n values, 7 ≤ n ≤ 17, would be enough to cover a large variety of genome sizes from 1 kb to 9 Gb.

The important advantage of such an approach is that it can be used without a priori knowledge of the sequence itself. This implies that there is no need to perform the expensive and time-consuming process of sequencing before array design. It is enough to obtain the purified DNA, hybridize it on a sufficiently random microarray chip and check which n-mers show up.
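The array sizes quoted with Equation (4) can be reproduced with a few lines of arithmetic. This is a sketch under the stated formula only; the function name is an assumption made for the example, and the inputs are the probabilities given in the text.

```python
import math


def probes_needed(eps, p1, p2, p12):
    """Equation (4): size L of a random n-mer set for error probability eps."""
    return math.log(eps) / math.log(1.0 - p1 - p2 + 2.0 * p12)


# Close relatives: S.aureus N315 versus S.aureus Mu50.
close = probes_needed(1e-10, 0.198, 0.203, 0.197)

# Distant species: S.typhi versus M.tuberculosis H37Rv.
distant = probes_needed(1e-10, 0.3465, 0.2600, 0.1160)

# Squaring the error probability only doubles the required number of probes.
close_stricter = probes_needed(1e-20, 0.198, 0.203, 0.197)

print(round(close), round(distant), round(close_stricter))
```

The last call illustrates the logarithmic dependence stressed above: tightening ε from $10^{-10}$ to $10^{-20}$ doubles L rather than increasing it by ten orders of magnitude.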
Taking into account how accessible the DNA of thousands of microbes and viruses is, how easily each microarray can be produced, and the fact that we do not need to determine quantitative values of expression (we need just a yes/no answer), it should be possible to produce an essentially universal microbial/viral DNA chip.

Fingerprints of closely related organisms

We next consider what happens when we try to compare closely related organisms using this approach (e.g. different types of influenza or different strains of the same microbes). We assume that two genomes $G_1$ and $G_2$ almost coincide and differ only in m randomly located nucleotides. This situation simulates the existence of point mutations or single nucleotide polymorphisms (SNPs). Let L be the size (number of probes) of the microarray and p the frequency of presence of n-mers in a genome with a TSL value M. The value of L necessary to distinguish the fingerprints of these two genomes with the error probability ε can be estimated by the formula (see Appendix B):

$$L = \frac{\log \varepsilon}{p \log(1 - mn/N)} \leq \frac{M |\log \varepsilon|}{pmn}. \tag{5}$$

Here, N is the number of different n-mers contained in $G_1$ (which is approximately equal to the number of different n-mers contained in $G_2$). Such an approach can provide the level of accuracy necessary for individual human fingerprints. Let us assume that the differences between individual human beings appear only because of SNPs, which have equal probability and are randomly located in the genome. According to literature estimates (Weiner and Hudson, 2002), the total number of SNPs in the human genome is expected to be ∼3 000 000. Then, calculating the necessary size for the random microarray (m/M ∼ 0.1%, $\varepsilon = 10^{-10}$, n = 17, p = 0.284) we have L ∼ 4769. This rough estimation is promising and indicates that this possibility deserves a proper experimental study.

We would like to recall that our theoretical estimations have been made for randomly picked sets of n-mers. The further possibility exists to start with a larger than necessary random set of n-mers (say, L = 10 000) and then to decrease the microarray size by experimenting with the desired set of genomes (using, for instance, a simple optimization approach).

CONCLUSIONS

We presented results of a correlation analysis for distributions of the presence/absence of short subsequences of different length (n-mers, n = 5–20) in more than 1500 microbial and viral genomes together with five genomes of multicellular organisms (including human). Our results show that for organisms that are not close relatives of each other, a range of values of n can be found such that the presence/absence of different n-mers in different genomes are practically not correlated (within a probabilistic tolerance, ε). For close relatives such correlations do appear, but are not as strong as might be expected.

The size of the correlations among the n-mers present in different genomes leads to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate between different microbial and viral genomes, and possibly individual human beings, with a convenient number of combinatorial experiments. The formulas derived yield the size of a combinatorial experiment designed to identify an organism given the length of its genome, a convenient probe length, n, and a tolerance or error, ε.

Clearly, additional experimental study (including, e.g. hybridization of microbial samples on random microarrays) is necessary to verify whether the statistical features described above can lead to the creation of a real biosensor (as is suggested by our in silico experiments). Future studies should take into account errors in the course of hybridization. A rough theoretical estimation, assuming independent probabilities of hybridization error at different microarray sites, suggests that the total number of hybridization errors on an array of size L is proportional to $\sqrt{L}$. Thus, the total relative error due to imperfect hybridization can be made small by increasing the number of probes on the microarray, L.

On the other hand, it is currently not clear to what degree the genomes are correctly assembled. Possible errors in sequences may have affected our results. We believe, however, that the parts of genomic sequences that have been correctly reconstructed are significant enough to determine the statistical properties described above.

ACKNOWLEDGEMENTS

The authors thank Prof. M. Hogan for interesting conversations. S.C., B.M.P. and Y.F. thank TLCC for partial funding of this work. T.-B.L. was supported by a training fellowship from the W.M. Keck Foundation to the Gulf Coast Consortia through the Keck Center for Computational and Structural Biology. B.M.P. and Y.F. thank the NIH for partial support of this work and NPACI for computational support. S.C. is grateful to the University of Houston Computer Science Department for hospitality.

REFERENCES

Campbell,A., Mrazek,J. and Karlin,S. (1999) Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl Acad. Sci., USA, 96, 9184–9189.

Deschavanne,P.J., Giron,A., Vilain,J., Fagot,G. and Fertil,B. (1999) Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol., 16, 1391–1399.

Fislage,R. (1998) Differential display approach to quantitation of environmental stimuli on bacterial gene expression. Electrophoresis, 19, 613–616.

Fislage,R., Berceanu,M., Humboldt,Y., Wendt,M. and Oberender,H. (1997) Primer design for a prokaryotic differential display RT–PCR. Nucleic Acids Res., 25, 1830–1835.

Fofanov,V., Fofanov,Y. and Pettitt,B.M. (2002a) Fast subsequence search using incomplete search trees. The Seventh Structural Biology Symposium of Sealy Center for Structural Biology. The University of Texas Medical Branch, Galveston, TX, p. 51.

Fofanov,V., Fofanov,Y. and Pettitt,B.M. (2002b) Counting array algorithms for the problem of finding appearances of all possible patterns of size n in a sequence. The 2002 Bioinformatics Symposium, Keck/GCC Bioinformatics Consortium. W.M. Keck Center for Computational and Structural Biology, Houston, TX, p. 14.

Karlin,S. (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends in Microbiol., 9, 335–343.

Karlin,S. and Ladunga,I. (1994) Comparisons of eukaryotic genomic sequences. Proc. Natl Acad. Sci., USA, 91, 12832–12836.

Karlin,S., Mrazek,J. and Campbell,A.M. (1997) Compositional biases of bacterial genomes and evolutionary implications. J. Bacteriol., 179, 3899–3913.

Nakashima,H., Nishikawa,K. and Ooi,T. (1997) Differences in dinucleotide frequencies of human, yeast, and Escherichia coli genes. DNA Res., 4, 185–192.

Nakashima,H., Ota,M., Nishikawa,K. and Ooi,T. (1998) Genes from nine genomes are separated into their organisms in the dinucleotide composition space. DNA Res., 5, 251–259.

Nussinov,R. (1984) Doublet frequencies in evolutionary distinct groups. Nucleic Acids Res., 12, 1749–1763.

Sandberg,R., Winberg,G., Branden,C.I., Kaske,A., Ernberg,I. and Coster,J. (2001) Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res., 11, 1404–1409.

Southern,E.M. (2001) DNA microarrays. History and overview. Meth. Mol. Biol., 170, 1–15.

Vainrub,A., Li,T.-B., Fofanov,Y. and Pettitt,B.M. (2003) Theoretical considerations for the efficient design of DNA arrays. In: Moore,J. and Zouridakis,G. (eds.), Biomedical Technology and Devices Handbook. CRC Press, pp. 14.11–14.14.

Weiner,M.P. and Hudson,T.J. (2002) Introduction to SNPs: discovery of markers for disease. Biotechniques, 10(Suppl. 4–7), 12–13.

APPENDIX A

Here, we will analytically estimate the frequency of presence of n-mers in a genome of length M. Let us apply the logic of the example shown in Tables 1 and 3 to autocorrelations, i.e. let us check whether the appearances of distinct n-mers are independent or correlated within a single genome. Assume that the multiple appearances of a given n-mer at different locations within the same genome are also independent events. Then, the probability of an n-mer appearing once is p, twice is $p^2$, thrice is $p^3$ and so on. The total number of n-mers in the genome, taking into account multiple appearances, is

$$M \approx 4^n (p + p^2 + p^3 + \cdots) = \frac{4^n p}{1 - p}, \tag{A1}$$

from which one obtains

$$p \approx \frac{M}{M + 4^n}. \tag{A2}$$

This formula has been presented in the text, and is shown in Figure 1 by a solid line. One may also compare it with the experimental values from the last column of Table 1. In accordance with Equation (1) we have for S.typhi p = 36.44%, for M.tuberculosis H37Rv p = 34.46% and for B.subtilis p = 33.44%. This corresponds better to the experimental values (34.65, 26.00 and 31.87%, respectively) than the estimation without taking into account multiple appearances,

$$p \approx \frac{M}{4^n}, \tag{A3}$$

which leads to the probabilities 57.3, 52.6 and 50.2%, respectively. This fact is in accordance with the conclusion about the apparently nearly random statistical character of the appearance of n-mers in a single genome.

APPENDIX B

Here, we will estimate the probability of making an error when discriminating organisms by their analysis ('fingerprints') in a random microarray, which consists of L n-mers. Assume that we need to discriminate between the two genomes $G_1$ and $G_2$ of sizes $M_1$ and $M_2$, respectively. Let $G_1$ ($G_2$) contain $N_1$ ($N_2$) different n-mers, and let $N_{12} = N(n, G_1, G_2)$ n-mers be present simultaneously in both genomes (this is the size of the intersection of the two sets of n-mers corresponding to the 'n-mer contents' of $G_1$ and $G_2$; we denote this set as $G_1 \cap G_2$). The union $G_1 \cup G_2$ contains $N_1 + N_2 - N_{12}$ n-mers. Let us consider a fingerprint of the union of the two genomes, $G_1 \cup G_2$. For every n-mer appearing in this fingerprint, the probability that it occurs in the intersection region, $G_1 \cap G_2$, is

$$\frac{N_{12}}{N_1 + N_2 - N_{12}}. \tag{A4}$$

An error, E, occurs when two genomes share the same fingerprint, i.e. all n-mers that form the fingerprint represent the intersection region. This will happen with probability

$$P(E \mid k) = \left(\frac{N_{12}}{N_1 + N_2 - N_{12}}\right)^k. \tag{A5}$$

In fact, this is a conditional probability of an error, E, if we have a fingerprint of length k. We now need to calculate an average with respect to all possible fingerprints. There are $C_L^k = L!/[k!(L - k)!]$ different fingerprints of the size k, which appear with equal probabilities $[P(S \in G_1 \cup G_2)]^k [1 - P(S \in G_1 \cup G_2)]^{L-k}$, where $P(S \in G_1 \cup G_2)$ is the probability for n-mer S to find itself in the union $G_1 \cup G_2$ when sampling L times. Therefore, we come to a binomial distribution of fingerprint sizes,

$$P(k) = \frac{L!}{k!(L - k)!} \left(\frac{N_1 + N_2 - N_{12}}{4^n}\right)^k \tag{A6}$$

$$\times \left(1 - \frac{N_1 + N_2 - N_{12}}{4^n}\right)^{L-k}. \tag{A7}$$

Calculating the average error we have

$$P(E) = \sum_k P(E \mid k) P(k) = (1 - p_1 - p_2 + 2p_{12})^L. \tag{A8}$$

Here, $p_j = N_j/4^n$ is the probability of presence in $G_j$ (j = 1, 2), and $p_{12} = N_{12}/4^n$ is the probability of presence in the intersection $G_1 \cap G_2$. Given a desirable level of tolerance or error, $P(E) \sim \varepsilon$, one can now estimate the appropriate combinatorial experiment (array) size:

$$L = \frac{\log \varepsilon}{\log(1 - p_1 - p_2 + 2p_{12})}. \tag{A9}$$

We would like to again stress the logarithmic dependence of the microarray size L on the error level ε. This feature is of principal importance for the analysis under discussion. The following three cases will be considered separately.

Essentially different organisms

In this case, in accordance with the discussion in the text, the presence/absence of n-mers in one genome is not correlated with the presence/absence of n-mers in another genome and we can write $p_{12} \approx p_1 p_2$. Taking, for simplicity, $p_1 \approx p_2 \approx p$, we obtain

$$L = \frac{\log \varepsilon}{\log(1 - 2p + 2p^2)}. \tag{A10}$$

For instance, if $\varepsilon = 10^{-10}$ and p = 0.05, we obtain L = 230.

Related organisms

Now, $p_{12} \neq p_1 p_2$. Assuming that the intersection $G_1 \cap G_2$ almost coincides with the union $G_1 \cup G_2$, or

$$N_1 + N_2 - N_{12} > N_{12} \gg N_1 + N_2 - 2N_{12}, \tag{A11}$$

one can rewrite Equation (A9) in a slightly different form. Starting once again with Equations A7–A9 and approximating the binomial distribution by the Gaussian of width $s = \sqrt{LP(1 - P)}$, centered at $\bar{k} = LP$, where $P = (N_1 + N_2 - N_{12})/4^n$ is the probability for an n-mer to be present in the union $G_1 \cup G_2$, we find

$$P(E) = \sum_k e^{-\alpha k} \frac{1}{s\sqrt{2\pi}} e^{-(k - \bar{k})^2/2s^2}, \qquad e^{-\alpha} = \frac{N_{12}}{N_1 + N_2 - N_{12}}. \tag{A12}$$

Provided that $\alpha \ll 1$ [which follows from inequality (5)] and $\bar{k} \gg 1$ (which is consistent with a small error level), one can change the summation to integration and obtain immediately

$$P(E) = \frac{1}{s\sqrt{2\pi}} \int e^{-\alpha k - (k - \bar{k})^2/2s^2}\, dk = e^{-\alpha \bar{k} + \alpha^2 s^2/2}. \tag{A13}$$

Finally,

$$P(E) \approx \left(\frac{N_{12}}{N_1 + N_2 - N_{12}}\right)^{\bar{k}}. \tag{A14}$$

Now we can find the relation between the error level and the microarray size in the form

$$\bar{k} = PL = \frac{\log \varepsilon}{\log[N_{12}/(N_1 + N_2 - N_{12})]}. \tag{A15}$$

Here, P, the probability of presence of an n-mer in the union of the two genomes, is given by $P = (N_1 + N_2 - N_{12})/4^n \sim p_1 \sim p_2$. The last formula leads to similar numerical values as Equation (A1) if $N_{12} \gg N_1 + N_2 - 2N_{12}$. Say, for P = 0.05, $N_{12}/(N_1 + N_2 - N_{12}) = 0.9$, $\varepsilon = 10^{-10}$, we have L = 4371.

Closely related organisms

Let us assume that two genomes $G_1$ and $G_2$ almost coincide and differ only in m randomly located characters (nucleotides). This situation simulates the existence of SNPs. For simplicity, let us assume that $N_1 = N_2 = N$. Every character that is different in $G_1$ and $G_2$ belongs simultaneously to n different n-mers, and the subset of $G_1 \cup G_2$ which consists of the n-mers that are different in $G_1$ and $G_2$ has the size $nm = 2N - 2N_{12}$. Then,

$$N_{12} = N - \frac{mn}{2}, \quad \text{or} \quad N_1 + N_2 - N_{12} = N + \frac{mn}{2},$$

$$P(E) \approx \left(1 - \frac{nm}{N}\right)^{\bar{k}} = \varepsilon. \tag{A16}$$

Taking into account that $N \leq M$, we arrive at the estimation

$$L = \frac{\bar{k}}{P} = \frac{\log \varepsilon}{P \log(1 - mn/N)} \leq \frac{M |\log \varepsilon|}{Pmn}. \tag{A17}$$
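As a numerical check of the two estimates for related genomes, Equation (A15) and the upper bound in Equation (A17) can be evaluated with the example values quoted in this paper. This is a minimal sketch; the function and parameter names are assumptions made for illustration, and `diff_fraction` stands for the ratio m/M.

```python
import math


def probes_related(eps, union_p, overlap):
    """Equation (A15): L = log(eps) / (P * log[N12 / (N1 + N2 - N12)]).

    union_p is P, the probability of presence in the union of the two
    genomes; overlap is the ratio N12 / (N1 + N2 - N12).
    """
    return math.log(eps) / (union_p * math.log(overlap))


def probes_snp_bound(eps, p, n, diff_fraction):
    """Upper bound of Equation (A17): M |log(eps)| / (P m n), with m/M given."""
    return abs(math.log(eps)) / (p * diff_fraction * n)


# Related organisms: P = 0.05, overlap ratio 0.9, eps = 1e-10.
related = probes_related(1e-10, 0.05, 0.9)

# Individual human genomes: m/M ~ 0.1%, n = 17, p = 0.284, eps = 1e-10.
human = probes_snp_bound(1e-10, 0.284, 17, 0.001)

print(round(related), round(human))
```

Both results agree with the array sizes given in the text (L = 4371 for the related-organism example and L ∼ 4769 for the human SNP estimate).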