Abstract
Biological sequence data analysis has become an indispensable tool in macromolecular biology and is key to any detailed understanding of the living cell. We give a brief survey of the biological macromolecules and their functions, and introduce sequence data analysis as a basic tool for the experimental bench biologist. So far, most queries for such analyses are issued against flat files and static indices. We discuss position tree structures and their potential in sequence data analysis, and introduce the hash position tree as a persistent, dynamic data structure for pattern searches in large biological sequence databases.
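To make the idea of a position tree concrete, the following is a minimal in-memory sketch in Python. It is not the hash position tree of the paper, whose persistent and dynamic design is not described in the abstract; it only illustrates the textbook notion that every text position is indexed by the shortest substring that identifies it uniquely. The names `Node`, `build_position_tree` and `find`, the `$` terminator, and the naive quadratic-time construction are all assumptions made for illustration.

```python
class Node:
    """Trie node; a leaf stores the text position it identifies."""
    def __init__(self, pos=None):
        self.children = {}
        self.pos = pos  # None for internal nodes


def build_position_tree(text, terminator="$"):
    """Naively build a position tree for text + terminator.

    Each leaf sits at the end of the shortest substring that starts
    at its position and occurs nowhere else in the terminated text.
    """
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    root = Node()
    for i in range(len(s)):
        node, depth = root, 0
        while True:
            child = node.children.get(s[i + depth])
            if child is None:
                # Unseen branch: position i is identified here.
                node.children[s[i + depth]] = Node(pos=i)
                break
            if child.pos is not None:
                # Collision with the leaf of an earlier position j:
                # push both leaves down until the suffixes diverge.
                j, child.pos = child.pos, None
                d, cur = depth + 1, child
                while s[i + d] == s[j + d]:
                    cur.children[s[i + d]] = Node()
                    cur = cur.children[s[i + d]]
                    d += 1
                cur.children[s[i + d]] = Node(pos=i)
                cur.children[s[j + d]] = Node(pos=j)
                break
            node, depth = child, depth + 1
    return root, s


def find(root, s, pattern):
    """Return the sorted start positions of a non-empty pattern."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    node = root
    for ch in pattern:
        if node.pos is not None:
            # Reached a leaf early: only one candidate position is
            # left, so verify the pattern directly against the text.
            return [node.pos] if s.startswith(pattern, node.pos) else []
        node = node.children.get(ch)
        if node is None:
            return []
    # Pattern exhausted: every leaf below this node is a match.
    hits, stack = [], [node]
    while stack:
        n = stack.pop()
        if n.pos is not None:
            hits.append(n.pos)
        stack.extend(n.children.values())
    return sorted(hits)


root, s = build_position_tree("ACGTACGA")
print(find(root, s, "ACG"))
```

For the toy sequence above, the query walks three edges and then collects the leaves below the reached node. A pattern longer than a position identifier reaches a leaf early and is checked against the text at that single candidate position.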
© 1995 Springer-Verlag Berlin Heidelberg
Cite this paper
Mewes, H.W., Heumann, K. (1995). Genome analysis: Pattern search in biological macromolecules. In: Galil, Z., Ukkonen, E. (eds) Combinatorial Pattern Matching. CPM 1995. Lecture Notes in Computer Science, vol 937. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60044-2_48
DOI: https://doi.org/10.1007/3-540-60044-2_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60044-2
Online ISBN: 978-3-540-49412-6
eBook Packages: Springer Book Archive