DNA Sequence Data Analysis: Steps Toward Computer Analysis of Nucleotide Sequences
The search for rules to describe the relation between structures and their properties is a fundamental goal of scientific
endeavor. For the molecular biologist,
this means understanding the information encoded in the sequences of nucleic
acid molecules, a quest requiring both
the elucidation and analysis of these sequences. A problem that long thwarted
these efforts was the extreme length of
plausible that we would be able to progress much further in the collation of our
sequence data, much less be in a position
to analyze it adequately. It is the intent
of this article to report on the developing
role of computer technology in this field.
very thin urea-containing polyacrylamide gels (9) has permitted the resolution of products, resulting from a single
sequencing reaction, up to a chain length
of 250 to 300 nucleotides per loading.
The partial sequences, read from this
and similar gels, are then recorded and
compared in order to reconstruct the entire sequence. Two sources of trivial error are associated with this stage: one involves careless reading of the gel (for example, mistaking channels, skipping a
band); the other involves errors introduced when manually recording the
data.
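The comparison step described above, in which partial sequences read from separate gels are joined wherever they overlap, can be illustrated with a minimal sketch. It is not the method of any program cited in this article. The function name `merge_readings` and the `min_overlap` threshold are assumptions made for illustration, and only exact overlaps between the end of one reading and the start of the next are considered.

```python
from typing import Optional


def merge_readings(left: str, right: str, min_overlap: int = 10) -> Optional[str]:
    """Join two gel readings if a suffix of `left` equals a prefix of `right`.

    Hypothetical helper: tries the longest possible overlap first and
    returns None when no exact overlap of at least `min_overlap` exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            # Keep the shared region once, then append the new bases.
            return left + right[size:]
    return None


if __name__ == "__main__":
    print(merge_readings("ACGTTGCA", "TGCAGGTC", min_overlap=4))
```

A reading error of the kind mentioned above (a skipped band, for instance) would defeat an exact comparison like this one, which is one reason practical programs tolerate mismatches.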
Fig. 2. A simple, semiautomated gel reading station. This station has a translucent digitizing
tablet and a cathode ray tube terminal used for display. An autoradiograph is positioned on the
surface of the tablet and the sequence is read from the autoradiograph (by the signal pen) directly into memory. The nucleotide sequence is displayed on the terminal screen. The limit of
resolution for this tablet is 0.1 mm. A menu of other functions (editing, homology searches) is
encoded on the surface of the tablet in an area not covered by the autoradiograph. These functions can be activated by touching the designated areas with the signal pen.
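How pen positions on such a tablet might be turned into a sequence can be sketched as follows. This is an illustration only, not the software of the station shown in Fig. 2. It assumes four single-base channels with known horizontal boundaries (the hypothetical `LANES` table, in millimeters), and it assumes that the vertical coordinate is measured upward from the bottom of the autoradiograph, so that bands are read from the shortest fragment to the longest.

```python
from typing import List, Tuple

# Hypothetical channel layout: (base, x_min, x_max) in millimeters.
LANES = [("G", 0.0, 10.0), ("A", 10.0, 20.0), ("T", 20.0, 30.0), ("C", 30.0, 40.0)]


def lane_for(x: float, lanes=LANES) -> str:
    """Return the base whose channel contains x, or N if x is outside every channel."""
    for base, x_min, x_max in lanes:
        if x_min <= x < x_max:
            return base
    return "N"


def read_gel(touches: List[Tuple[float, float]], lanes=LANES) -> str:
    """Convert (x, y) pen touches into a sequence.

    Assumes y increases from the bottom of the gel upward, so sorting by y
    orders the bands from the shortest fragment to the longest.
    """
    ordered = sorted(touches, key=lambda point: point[1])
    return "".join(lane_for(x, lanes) for x, _ in ordered)


if __name__ == "__main__":
    print(read_gel([(5.0, 12.0), (35.0, 3.1), (15.0, 7.5)]))
```

Because the bases are ordered by position rather than by the order in which they were touched, a band that was skipped can be added later without retyping the reading.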
OVERLAPPING SEGMENTS C-1 (09/18/79) AND C-44B (10/08/79)

Number of overlapping segments: 7

Start in C-1    Start in C-44B    Length
      1               42            44
     48               88            21
     70              109             6
     77              115             5
     84              124            11
     96              135            16
    112              152             8

[Aligned sequences of C-1 and C-44B, with mismatched regions set in brackets and asterisks above unidentified nucleotides; the printed sequences are not legible here.]
Fig. 3. A sample printout of the results generated by the nucleotide sequence assembly program
called ASSEMBLER (12). A table detailing the positions of the overlapping sequences is presented
above the printed sequences. In this example, this table indicates that there are seven regions of
homology between the two sequences (C-1 and C-44B) that are being compared. These overlapping sequences begin with nucleotide 1 of C-1 and nucleotide 42 of C-44B. This match lasts
for a length of 44 bases when it is interrupted by a mismatch of three nucleotides at position 45
of C-1. Such areas of mismatch are denoted in the aligned sequences printed below this table by
a set of brackets. The rest of the overlapping segments between C-1 and C-44B are summarized
by the remaining lines of the table. An asterisk is placed above the sequence whenever an N (an
unidentified nucleotide) occurs in either or both sequence elements.
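A table of this kind can be produced, in simplified form, by walking along two sequences from known starting positions and recording each run of agreement. The sketch below is an illustration and not the algorithm used by ASSEMBLER. The function name `matching_segments` is hypothetical, the comparison is gapless (so mismatched regions must be of equal length in both sequences, unlike the example in Fig. 3), and treating N as compatible with any nucleotide is an assumption.

```python
from typing import List, Tuple


def matching_segments(a: str, b: str, start_a: int, start_b: int) -> List[Tuple[int, int, int]]:
    """List runs of agreement as (start in a, start in b, length), 1-based.

    Gapless comparison beginning at start_a in `a` and start_b in `b`.
    Assumption: an N in either sequence is counted as agreeing.
    """
    segments = []
    run_start = None
    n = min(len(a) - (start_a - 1), len(b) - (start_b - 1))
    for k in range(n):
        x, y = a[start_a - 1 + k], b[start_b - 1 + k]
        same = x == y or "N" in (x, y)
        if same and run_start is None:
            run_start = k  # a new run of agreement begins here
        elif not same and run_start is not None:
            segments.append((start_a + run_start, start_b + run_start, k - run_start))
            run_start = None
    if run_start is not None:
        # Close the run that reaches the end of the shorter sequence.
        segments.append((start_a + run_start, start_b + run_start, n - run_start))
    return segments


if __name__ == "__main__":
    for row in matching_segments("ACGTACGTAA", "TTACGTTCGNAA", 1, 3):
        print(*row)
```

Each printed row corresponds to one line of the table: the start of the segment in the first sequence, its start in the second, and its length.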
Program name*    Program language    Reference
Presequencing preparations
"GA1"†
Least-squares method for restriction mapping‡
"REVTRANS"
Nucleic acid sequence analysis‖
SAIL
FORTRAN
FORTRAN
PL/I
(6)
(8)
(13)
"READ"
FORTRAN
"ASSEMBLER"††
FORTRAN
(30)
(12)
Overlap-meld‡‡
FORTRAN
(11)
(13)
(16)
FORTRAN
FORTRAN
FORTRAN
APL, FORTRAN
FORTRAN and BLISS
(2,5)
Scanning program**
"'tRNA"*:~
Secondary structure program
Secondary structure program‖‖
3D Molecular modeling
(28)
(18)
(17, 19)
(40)
*Listing in quotation marks is a program title; the other names are brief descriptions.
†Sumex System, Stanford University, P. Friedland, D. Brutlag, L. Kedes.
‡University of Wisconsin, J. L. Schroeder, F. Blattner.
§Cold Spring Harbor Laboratory, R. Blumenthal, R. J. Roberts (unpublished).
‖National Institutes of Health, C. Queen, L. Korn.
¶Cold Spring Harbor Laboratory, T. R. Gingeras, P. Rice, R. J. Roberts (unpublished).
**European Molecular Biology Laboratory, Heidelberg, Germany, S. Provencher, R. Vogel, V. Dovi, H. Lehrach.
††Cold Spring Harbor Laboratory, T. R. Gingeras, J. Milazzo, R. J. Roberts.
‡‡MRC Laboratory of Molecular Biology, Cambridge, England, R. Staden.
§§Syracuse University, G. Pavlakis, J. Vournakis.
‖‖University of California at Los Angeles, G. M. Studnicka, G. M. Rahn, I. W. Cummings, W. A. Salser.
¶¶National Institutes of Health, R. J. Feldmann.
***University of Wisconsin, C. Fuchs, E. C. Rosenvold, A. Honigman, W. Szybalski.
†††This version is available from the Sumex System at Stanford University.
Conclusions
Only a casual look at the literature is
required to appreciate the growth of interest in DNA, RNA, and protein sequences. There is no indication that this
interest is about to subside. Rather, it
seems to be gaining momentum (41).
This poses severe problems for those
who wish to analyze the data since few,
if any, of us have the capacity to absorb
the subtle features inherent in each individual sequence. It is inevitable, there-
1. M. Zabeau and R. J. Roberts, in Molecular Genetics, J. H. Taylor, Ed. (Academic Press, New York, 1979), vol. 3, p. 1.
2. V. Ling, J. Mol. Biol. 64, 87 (1972); Proc. Natl. Acad. Sci. U.S.A. 69, 742 (1972).
3. A. M. Maxam and W. Gilbert, Proc. Natl. Acad. Sci. U.S.A. 74, 560 (1977); Methods Enzymol. 65, 499 (1980).
4. F. Sanger, S. Nicklen, A. R. Coulson, Proc. Natl. Acad. Sci. U.S.A. 74, 5463 (1977).
5. A. J. H. Smith, Nucleic Acids Res. 6, 831 (1979).
6. M. Stefik, Artif. Intell. 11, 85 (1978).
7. B. G. Buchanan and E. A. Feigenbaum, ibid., p. 5.
8. J. L. Schroeder and F. Blattner, Gene 4, 167 (1978).
9. F. Sanger and A. R. Coulson, FEBS Lett. 87, 107 (1978).
10. J. I. Garrels, J. Biol. Chem. 254, 7961 (1979); J. Taylor, N. L. Anderson, B. P. Coulter, A. E. Scandora, N. G. Anderson, in Electrophoresis '79, B. J. Radola, Ed. (Gruyter, Berlin, 1980).
11. R. Staden, Nucleic Acids Res. 6, 2601 (1979).
12. T. R. Gingeras, J. P. Milazzo, D. Sciaky, R. J. Roberts, ibid. 7, 529 (1979).
13. L. J. Korn, C. L. Queen, M. N. Wegman, Proc. Natl. Acad. Sci. U.S.A. 74, 4401 (1977); C. L. Queen and L. J. Korn, Methods Enzymol. 65, 595 (1980).
14. D. Brutlag, personal communication.
15. D. McCallum and M. Smith, J. Mol. Biol. 116, 29 (1977).
16. R. Staden, Nucleic Acids Res. 4, 4037 (1977); ibid. 5, 1013 (1978).
17. W. M. Fitch, J. Mol. Evol. 1, 185 (1972); J. M. Pipas and J. E. McMahon, Proc. Natl. Acad. Sci. U.S.A. 72, 2017 (1975); G. M. Studnicka, G. M. Rahn, I. W. Cummings, W. A. Salser, Nucleic Acids Res. 5, 3365 (1978).
18. G. Pavlakis and J. Vournakis, personal communication.
19. R. Nussinov and G. Pieczenik, personal communication; L. Garber, G. Garber, R. Nussinov, G. Pieczenik, personal communication; R. Nussinov, G. Pieczenik, J. Griggs, D. Kleitman, SIAM (Soc. Ind. Appl. Math.) J. Appl. Math. 35, 68 (1978).
20. M. Philipp, D. Ballinger, H. Seliger, Naturwissenschaften 65, 388 (1978).
21. J. Gralla and D. M. Crothers, J. Mol. Biol. 73, 497 (1973); I. Tinoco, O. C. Uhlenbeck, M. Levine, Nature (London) 230, 362 (1971); I. Tinoco, P. W. Borer, B. Dengler, M. D. Levine, O. C. Uhlenbeck, D. M. Crothers, J. Gralla, Nature (London) New Biol. 246, 40 (1973).
22. G. Pavlakis, R. E. Lockard, N. Vamvakopoulos, L. Rieser, U. L. RajBhandary, J. N. Vournakis, Cell 19, 91 (1980).
23. R. W. Holley et al., Science 147, 1462 (1965).