Long-range correlation in the whole human genome

R. Mansilla, N. Del Castillo, T. Govezensky, P. Miramontes, M. José and G. Cocho

Abstract

We calculate the mutual information function for each of the 24 chromosomes in the human genome. The same correlation pattern is observed regardless of the individual functional features of each chromosome. Moreover, correlations at different length scales are detected, depicting a multifractal scenario. This fact suggests a unique mechanism of structural evolution. We propose that such a mechanism could be an expansion-modification dynamical system.

I Introduction

The sequencing of the human genome is probably the most important episode in nucleic acid biology since the discovery of the DNA double-helix structure by J.D. Watson and F.H.C. Crick. The availability of the human genome has opened new fields in medicine, forensic science and anthropology. Nonetheless, the relationship between functional and structural traits remains largely unexplored.

It is well known that human chromosomes differ greatly from one another. Apart from the obvious differences in size, they also differ in the density and spatial localization of genes, the organization of Alu repeats [1], the distribution of CpG islands [2], and so on. For instance, chromosome 19 consists of 57% repetitive elements, whereas chromosomes 2, 8, 10, 13 and 18 have 36%. Chromosome 19 also has the highest gene density (around 29%, depending on the method), against 10% in chromosomes 4, 13 and 18.

Many efforts have been made to unveil the nature of genome evolution. Gene duplication is a well-accepted mechanism, but it cannot explain the evolution of non-coding DNA. The challenge is then to find an evolutionary mechanism compatible with both the protein-coding requirements and the structure of non-coding DNA.
The aim of this paper is to study the long-range correlation properties of the complete human genome using the mutual information function, to discuss the biological implications, and to propose a model that accounts for the observed correlations. The first studies on the correlation structure of DNA were reported in 1959 [3], [4]. More than thirty years later [5]-[9], a burst of interest in the concept of long-range correlation opened a fruitful line of research. In [10] a mechanism that explains this property is proposed. Those studies were inconclusive in the case of human DNA because the sequence data were incomplete. The human genome is now completely available, and we present our first results here.

The structure of the paper is as follows. In Section II, the mathematical tools and concepts are introduced. Section III presents the results of our analysis of the 24 human chromosomes. In Section IV, the results obtained are discussed.

II Mutual Information Function.

As noted above, the statistical analysis of symbolic sequences flourished when the first DNA sequence data became available [5]-[9]. By applying a set of techniques, such as entropies and spectral analysis, the existence of long-range correlation, at least in noncoding sectors of DNA, was established beyond any reasonable doubt. We would like to stress here that about 95% of the human genome consists of sectors of this type.

The concept of the mutual information function first appears in Claude Shannon's seminal 1948 paper [11]. His results were generalized to abstract alphabets by several Soviet mathematicians, culminating in the work of R.L. Dobrushin [12]. The original idea was to measure the difference between the average uncertainty in the input of an information channel before and after the output is received.
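In standard information-theoretic notation, this difference can be written as shown below. The symbols $X$ (channel input), $Y$ (channel output), $H$ (entropy) and $I$ (mutual information) are introduced here only for illustration and are not part of the original text.

```latex
% Mutual information as the reduction in uncertainty about the input X
% once the output Y is known (natural logarithm, so units are nats).
\begin{align*}
H(X) &= -\sum_{x} p(x)\,\ln p(x), \\
I(X;Y) &= H(X) - H(X \mid Y)
        = \sum_{x,y} p(x,y)\,\ln\frac{p(x,y)}{p(x)\,p(y)} .
\end{align*}
```

If $X$ and $Y$ are independent, then $p(x,y) = p(x)\,p(y)$ and the quantity vanishes; any statistical dependence between them makes it positive.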
In recent years the mutual information function has been used to study properties of regular languages and cellular automata [13], DNA long-range correlations [5]-[9], [14], and properties of strange attractors [15]. A comprehensive discussion of the properties of this function can be found in [16]-[18]. We use the mutual information function to study long-range correlation properties because, as proved in [17], it is a more sensitive measure than the autocorrelation function.

Let $\mathcal{A} = \{A, C, G, T\}$ denote an alphabet and $s = (\ldots, a_0, a_1, \ldots)$ an infinite string with $a_i \in \mathcal{A}$, $i \in \mathbb{Z}$, where $\mathbb{Z}$ represents the set of all integers. The mutual information function of the string $s$ is defined as:

$$M(d, s) = \sum_{\alpha, \beta \in \mathcal{A}} P_{\alpha,\beta}(d, s) \, \ln \frac{P_{\alpha,\beta}(d, s)}{P_{\alpha}(s)\, P_{\beta}(s)} \qquad (3)$$

where the sum is over all pairs $(\alpha, \beta) \in \mathcal{A}^2$, $P_{\alpha,\beta}(d, s)$ is the joint probability of having the symbol $\alpha$ followed $d$ sites away by the symbol $\beta$ on the string $s$, and $P_{\alpha}(s)$ is the density of the symbol $\alpha$ in the string $s$.

The human genome is obviously not an infinite string, but it is large enough to guarantee the accuracy of the statistics. For instance, chromosome 1 has about 233,819,029 bp.

III Main results.

The results of our calculations are shown in Fig. 1, which plots the mutual information function for the 24 human chromosomes. Although we calculated the values of the mutual information function for $d = 1, \ldots, 150{,}000$, we show them only in the interval $d = 1, \ldots, 1024$. As can be seen, the shape of the curve is the same for all chromosomes. The only difference is in the height of the graphs. The order in which the graphs appear from top to bottom is: 4, 13, Y, X, 6, 5, 18, 3, 8, 2, 7, 12, 21, 14, 9, 10, 11, 1, 20, 15, 16, 17, 22, 19. Each number stands for the corresponding chromosome number; the letters X and Y stand for chromosomes X and Y.

IV Discussion.

We find it very striking that the mutual information functions of all chromosomes have the same shape.
The peaks are located at the same positions and only the heights differ, reflecting different strengths of correlation. This suggests that the correlation among bases follows the same pattern in all chromosomes, in spite of the differences in function among them. This fact suggests a unique mechanism of structural evolution. In some sense, it is like a novel in which each chapter develops a different theme, but the correlation among the letters is the same in every chapter.

Another interesting fact is that the slope differs in different sectors of the curve. Because the graph is plotted on logarithmic scales, this suggests the existence of more than one scaling exponent. It is proved in [10], for a binary alphabet, that a mechanism yielding this behavior is an expansion-modification system [19]. We propose that such a mechanism should account for the behavior of the human genome [20].

References

[1] Grover D., Majumder P.P., Rao C.B., Brahmachari S.K., Mukerji M.; “Nonrandom distribution of Alu elements in genes of various functional categories: insight from analysis of human chromosomes 21 and 22”, Mol. Biol. Evol. 2003 Sep; 20(9):1420-4.
[2] Chen C., Gentles A. J., Jurka J., Karlin S.; “Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22”, Proc. Natl. Acad. Sci. U S A, 2002 Mar 5; 99(5):2930-5.
[3] Noboru Sueoka; “A statistical analysis of deoxyribonucleic acid distribution in density gradient centrifugation”, Proceedings of the National Academy of Sciences, 45(10):1480-1490, 1959.
[4] N. Sueoka, J. Marmur, P. Doty (1959), “Heterogeneity in deoxyribonucleic acids: II. Dependence of the density of deoxyribonucleic acids on guanine-cytosine content”, Nature, 183:1429-1433.
[5] To the interested reader we suggest the web site: http://www.nslij-genetics.org/dnacorr/.
[6] Li, W., Kaneko, K.; “Long-range correlation and partial 1/f spectrum in a noncoding DNA sequence”, Europhysics Letters, 17 (1992) 655-660.
[7] Li, W., Kaneko, K.; “DNA correlations”, Nature, 360 (1992) 635-636.
[8] Peng, C. K. et al.; “Long-range correlations in nucleotide sequences”, Nature, 356 (1992) 168-170.
[9] Peng, C. K. et al.; “Finite size effects on long-range correlations: Implications for analyzing DNA sequences”, Physical Review E, 47 (1993) 3730-3733.
[10] Mansilla, R., Cocho, G.; “Multiscaling in expansion-modification system: an explanation for long-range correlation in DNA”, Complex Systems, 12 (2000) 207-240.
[11] Shannon, C.; “A mathematical theory of communication”, Bell System Technical Journal 27 (1948) 379-423.
[12] Dobrushin, R.; “General formulation of Shannon’s main theorem in Information Theory”, Uspekhi Matematicheskikh Nauk 14 (1959) 1-104. English translation in American Mathematical Society Translations 33 (1959) 323-438.
[13] Li, W.; “Power spectra of regular languages and cellular automata”, Complex Systems 1 (1987) 107-113.
[14] Mansilla, R., Mateo-Reig, R.; “On the mathematical modelling of intronic sectors of the DNA molecule”, International Journal of Bifurcation and Chaos 5 (1995) 1235-1241.
[15] Fraser, A., Swinney, H.; “Independent coordinates for strange attractors from mutual information”, Phys. Rev. A 33 (1986) 1134-1151.
[16] Blahut, R.; “Principles and practice of information theory”, Addison-Wesley (1987).
[17] Li, W.; “Mutual information function versus correlation function”, Journal of Statistical Physics 60 (1990) 823-831.
[18] Renyi, A.; “The probability theory”, Pergamon Press (1970).
[19] Li, W.; “Expansion-modification systems: a model for spatial 1/f spectra”, Phys. Rev. A, 43 (1991) 5240-5260.
[20] R. Mansilla, N. Del Castillo, T. Govezensky, P. Miramontes, M. José and G. Cocho (forthcoming paper).

Fig. 1: Graph of the mutual information function for the 24 human chromosomes. The X axis shows the logarithm of the distance and the Y axis the logarithm of the mutual information function.
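
The definition of M(d, s) in Section II can be estimated on a finite sequence by replacing the probabilities with observed frequencies. The following is a minimal Python sketch of that idea, not the implementation used for the results above. The function names `mutual_information` and `mutual_information_curve` are hypothetical, and the sketch assumes that the input string contains only the symbols of the alphabet (ambiguous bases such as N would have to be removed beforehand).

```python
from collections import Counter
from math import log


def mutual_information(seq, d):
    """Plug-in estimate of M(d, s) for a finite string seq at distance d.

    Assumption: seq contains only alphabet symbols (e.g. A, C, G, T).
    """
    n_pairs = len(seq) - d
    if d < 1 or n_pairs < 1:
        raise ValueError("d must satisfy 1 <= d < len(seq)")

    # Symbol densities P_alpha(s), taken over the whole string.
    density = {sym: count / len(seq) for sym, count in Counter(seq).items()}

    # Joint frequencies P_{alpha,beta}(d, s): alpha at site i, beta at site i + d.
    pair_counts = Counter(zip(seq, seq[d:]))

    m = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        # Pairs that never occur are absent from the Counter and contribute 0.
        m += p_ab * log(p_ab / (density[a] * density[b]))
    return m


def mutual_information_curve(seq, max_d):
    """Return [M(1, s), ..., M(max_d, s)] for plotting against the distance d."""
    return [mutual_information(seq, d) for d in range(1, max_d + 1)]


if __name__ == "__main__":
    # A perfectly periodic toy sequence: the symbol 4 sites away is fully
    # determined, so the estimate at d = 4 is close to ln 4 (about 1.386).
    periodic = "ACGT" * 100
    print(mutual_information(periodic, 4))
```

This pure-Python version is meant only to make the definition concrete. For sequences of chromosome length and distances up to d = 150,000, an array-based implementation would be needed in practice.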