Kyte Doolitle
1. Introduction
One of the most persistent and absorbing problems in protein chemistry has been
the unraveling of the various forces involved in folding polypeptide chains into
their unique conformations. Insight into this question has been gained both from a
consideration of non-covalent forces as they apply in model systems and from a
detailed examination of the actual structures of protein molecules. It is generally
accepted that, to a rough approximation, two opposing, but not independent,
tendencies are reflected in the final structure of a protein when it folds. The
resulting compromise allows hydrophilic side-chains access to the aqueous solvent
while at the same time minimizing contact between hydrophobic side-chains and
0022-2836/82/130105-28 $03.00/0 © 1982 Academic Press Inc. (London) Ltd.
106 J. KYTE AND R. F. DOOLITTLE
water. Recently, however, it has been noticed that there are important subtle
deviations from these expectations (Lifson & Sander, 1979; Janin & Chothia, 1980),
suggesting that the extent to which residues are buried depends not only upon
strict hydrophobicity but also upon steric effects that determine packing between
the secondary structures in the crowded interior of the macromolecule.
Nevertheless, if one could evaluate the contrary forces of hydrophobicity and
hydrophilicity inherent within the residues themselves, then it would be possible,
perhaps, at least to distinguish the exterior portions of a protein from the interior
ones, on the basis of the amino acid sequence alone. Moreover, in the case of a
protein that interacts directly with the alkane portion of a phospholipid bilayer in a
membrane, there is general agreement that the amino acid side-chains involved are
chiefly hydrophobic. Once again, an appropriate evaluation of a given amino acid
sequence should be able to predict whether or not a given peptide segment is
sufficiently hydrophobic to interact with or reside within the interior of the
membrane.
Considerable effort has already been expended in devising schemes for predicting
three-dimensional aspects from amino acid sequences alone. The most notable of
these have dealt with the prediction of local secondary structure (Chou & Fasman,
1973; Wu & Kabat, 1973; Garnier et al., 1978). These are empirical methods in that
they utilize a library of known protein structures from which the distribution of the
20 amino acids among various conformational settings is rigorously tallied. If the
frequency with which the individual amino acids or short peptides occur in
α-helices, β-sheets or reverse turns is known, any sequence can be systematically
scanned and the probability of those secondary structures can be evaluated.
Interestingly, even earlier attempts had been made to predict the general shape of a
protein on the basis of the types of amino acids it contained. Thus, in the light of
the general observation that the interiors of water-soluble proteins are
predominantly composed of hydrophobic amino acids, while the hydrophilic side-
chains are on the exterior where they can interact with water, Fisher (1964) and
Bigelow (1967) tried to correlate the sizes and shapes of proteins with their overall
amino acid compositions.
Recently, a method for displaying the distribution of hydrophobicity over the
sequence of a protein was presented by Rose (1978) and Rose & Roy (1980). This
procedure combines the progressive-evaluation approach of the secondary
structure predictions with the earlier empirical observation that the hydrophobic
side-chains tend to be buried within the native structure (Chothia, 1976). Rose &
Roy (1980) also have demonstrated convincingly that this approach can distinguish
regions of interior sequence from regions of exterior sequence.
In this paper we describe a simple computer program, similar to that employed
by Rose (1978) and Rose & Roy (1980), that systematically evaluates the
hydrophilic and hydrophobic tendencies of a polypeptide chain. The present
program uses a hydropathy scale in which each amino acid has been assigned a value
reflecting its relative hydrophilicity and hydrophobicity. The program
continuously determines the average hydropathy of a moving segment as it
advances through the sequence from the amino to the carboxy terminus. As such,
the procedure gives a graphic visualization of the hydropathic character of the
EVALUATION OF PROTEIN HYDROPATHY 107
chain from one end to the other, tracking the hydrophilic and hydrophobic regions
relative to a universal midline. We have examined in detail the profiles of several
proteins whose three-dimensional structures are known, and have found excellent
agreement between the observed interiors and the calculated hydrophobic regions,
on the one hand, and the observed exterior portions and the calculated hydrophilic
regions, on the other. We have also examined a number of membrane proteins and
have been able to identify membrane-spanning segments, as well as those
hydrophobic regions that anchor certain proteins in membranes.
2. Experimental Procedures
(a) The computer program
The computer program, SOAP, assigns the appropriate hydropathy value to each residue
in a given amino acid sequence and then successively sums those values, starting at the
amino terminus, within overlapping segments displaced from each other by one residue.
Although a segment of any size can be chosen, ordinarily spans of 7, 9, 11 or 13 were
employed, odd numbers being used so that a given sum could be plotted above the middle
residue of the segment. Thus, in the case of SOAP-7 the first value corresponds to the sum of
the hydropathies of residues 1 to 7 and is plotted at location 4, the second value corresponds
to the sum for residues 2 to 8 and is plotted at location 5, and so on.
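The summation described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' C program: the function name `soap`, the test peptide and the dictionary `KD` are assumptions introduced here, and the scale values are the published Kyte-Doolittle numbers, supplied because the rows of Table 2 do not appear in this text.

```python
# Published Kyte-Doolittle hydropathy values, keyed by one-letter code.
# Assumption: supplied from the published scale; the rows of Table 2 are
# not reproduced in this text.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def soap(sequence, span=7):
    """Sum hydropathies over overlapping windows displaced by one residue.

    Returns (location, window_sum) pairs, where location is the 1-based
    number of the middle residue of each window.
    """
    if span < 1 or span % 2 == 0:
        raise ValueError("span must be a positive odd number")
    values = [KD[residue] for residue in sequence]
    half = span // 2
    return [
        (start + half + 1, sum(values[start:start + span]))
        for start in range(len(values) - span + 1)
    ]


# Hypothetical 11-residue peptide, used only to show the bookkeeping.
profile = soap("ACDEFGHIKLM", span=7)
for location, window_sum in profile:
    print(location, round(window_sum, 1))
```

With a span of 7, the first pair is reported at location 4 and the second at location 5, matching the plotting convention described for SOAP-7.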
The program was written originally in the language C (Kernighan & Ritchie, 1978) for use
in the software system Unix, which is leased from the Western Electric Co. Because Unix is
now widely used, and because C compilers are now available for many computers, the
original program is supplied as a short Appendix to this paper so that interested readers may
employ it directly. Plots may be obtained from any terminal that prints a standard 199
character output. The program has also been modified for use with a more sophisticated
system linked to a Zeta Plotter (we are grateful to S. Dempsey, Department of Chemistry,
University of California, San Diego, for these modifications). In this latter format the values
are presented as averages rather than sums, and all the figures accompanying this paper were
obtained with this system.
\Delta G_{\mathrm{transfer}} = -RT \ln (N_v / N_w)
where N_w = equilibrium mole fraction in aqueous phase, N_v = equilibrium mole fraction in the vapor at
standard temperature and pressure, and γ, from which this ratio is obtained, is the partition coefficient
in the units M M⁻¹ as tabulated by Hine & Mookerjee (1975). The advantage of this choice of standard
states is that free energies of solvation are directly presented and only the molecular interactions
between water and the solutes are reflected in the values.
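As a rough numerical illustration of this footnote, the sketch below assumes that the relation reads ΔG(transfer) = −RT ln(N_v/N_w), with N_v and N_w the equilibrium mole fractions in the vapor and in water. The function name, the temperature of 25°C and the example ratio are hypothetical; the conversion from the tabulated γ to a mole-fraction ratio is not attempted.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1


def transfer_free_energy(n_vapor, n_water, temperature_k=298.15):
    """Free energy (kcal/mol) of transfer from water into the vapor.

    Assumes delta G = -RT ln(N_v / N_w), with both arguments given as
    equilibrium mole fractions.
    """
    return -R_KCAL * temperature_k * math.log(n_vapor / n_water)


# Hypothetical solute that is 1000-fold more concentrated in water.
delta_g = transfer_free_energy(1e-3, 1.0)
print(round(delta_g, 2))
```

A solute that strongly prefers water gives a positive (unfavorable) free energy of transfer into the vapor, and a solute that prefers the vapor gives a negative one.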
TABLE 1
Free energies of transfer for the side-chains of the amino acids between various phases (kcal mol⁻¹)
                                        ΔG(transfer)
Side-chain    φ (cm³ mol⁻¹)    Water into condensed vapor    Water into ethanol    Ethanol into condensed vapor
The apparent molal volumes (φ) at 25°C of model compounds for the side-chains (Wolfenden et al.,
1979) of the structure RH, where R is the side-chain of a given amino acid (⁺H₃NCH(R)COO⁻) are
either the observed values (a) tabulated by Cohn et al. (1934) or values calculated (b) by the methods of
Cohn et al. (1934), which themselves were adapted from Traube (1899). The water-vapor partition
coefficients for the various model compounds are available in the Tables published by (c) Hine &
Mookerjee (1975) or (d) Wolfenden et al. (1979). The standard states chosen for the free energies are 1.0
mole fraction for the solution and the condensed vapor at a volume equal to its apparent molal
volume (φ). The water-ethanol transfer free energies were copied directly from the tabulations of (e)
Cohn & Edsall (1943) or (f) Nozaki & Tanford (1971). The standard states in each solvent are 1.0 mole
fraction. The transfer free energies for the ionized side-chains (g) were corrected to pH 7.0 (Wolfenden
et al., 1979) using the following pKa values (Tanford, 1968): lysine, 10.4; histidine, 6.4; glutamic acid,
4.5; and aspartic acid, 4.1.
hydropathy profiles, and as a result we did not hesitate to adjust the values subjectively
when only this level of accuracy was in question. Nevertheless, we tried to derive the best
numbers we could from the data listed in the last 3 columns of Table 2. The hydropathy
values for valine, phenylalanine, threonine, serine and histidine were simple averages of the 3
other numbers in the Table. When 1 of the 3 numbers for a given amino acid was
significantly different from the other 2, the mean of the other 2 was used. This was done for
cysteine/cystine, methionine and isoleucine. After a good deal of futile discussion concerning
the differences among glutamic acid, aspartic acid, asparagine and glutamine, we came to
the conclusion that they all had indistinguishable hydropathies and set their hydropathy
value by averaging all of the normalized water-vapor transfer free energies and the
normalized fractions of side-chains 100% buried. Because the structural information was so
uncertain, tryptophan was simply assigned its normalized transfer free energy. Glycine was
arbitrarily assigned the hydropathy value which was the weighted mean of the hydropathy
values for all of the sequences in our data base because it was clear from a careful analysis of
the actual distribution of glycine that it is not hydropathic; that is to say, it does not have
strong feelings about water. On the basis of both the transfer free energy scale and the
fraction buried, alanine ought to be more hydrophobic on our scale, its value exceeding that
TABLE 2
Hydropathy scale and information used in the assignments
Side-chain    Hydropathy index    ΔG(transfer) (water-vapor)ᵃ    Fraction of side-chains 100% buriedᵇ    Fraction of side-chains 95% buriedᶜ
All values in the last 3 columns result from arbitrary normalization to spread them between -4.5
and +4.5. The normalization functions were:
a -0.679(ΔG(transfer); Table 1) + 2.32.
b 48.1(fraction 100% buried; Chothia, 1976) - 4.50.
c 16.45(fraction 95% buried; Chothia, 1976) - 4.71.
of leucine. We find it difficult to accept that a single methyl group can elicit more
hydrophobic force than a cluster of 4 methyl groups, and for that reason we have arbitrarily
lowered the hydropathy value of the alanine side-chain to a point half-way between the
hydropathy value of glycine and the value determined for alanine when the transfer energy
and its distribution were used. No suitable model exists for proline, and in terms of its
tendency to become buried it is fairly hydrophilic. Its hydropathy value was made
somewhat more hydrophobic than this consideration because of its 3 methylene groups. The
hydropathy value for arginine was arbitrarily assigned to the lowest point of the scale.
Because it was difficult to accept the fact that tyrosine is a hydrophilic amino acid, even
though the available data in Table 2 indicate that it is, its hydropathy value was
subjectively raised to one closer to the water-vapor transfer free energy than the structural
data would have yielded. Similarly, the hydropathy value for leucine was also raised above
the average of the structural data and the transfer free energy, and the hydropathy value for
lysine was lowered. None of these last 3 adjustments, the result of personal bias and heated
discussion between the authors, affects the hydropathy profiles in any significant way.
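The normalization and averaging steps described in this section can be illustrated as follows. This is only a sketch under stated assumptions: the coefficients are those of the footnotes to Table 2 as read here (the decimal points are damaged in this copy, so 48.1 and 4.50 are a reading, not a certainty), the input numbers are hypothetical, and `mean_of_closest_pair` is one possible way to express "the mean of the other 2" when one value stands apart; the authors made that judgment subjectively.

```python
def normalize_transfer(delta_g):
    """Footnote a: normalized water-vapor transfer free energy."""
    return -0.679 * delta_g + 2.32


def normalize_buried_100(fraction):
    """Footnote b: normalized fraction of side-chains 100% buried."""
    return 48.1 * fraction - 4.50


def normalize_buried_95(fraction):
    """Footnote c: normalized fraction of side-chains 95% buried."""
    return 16.45 * fraction - 4.71


def simple_average(values):
    """Plain mean, as used for valine, phenylalanine, threonine, etc."""
    return sum(values) / len(values)


def mean_of_closest_pair(values):
    """Average the two of three values that lie closest together.

    Assumption: a stand-in for the authors' subjective outlier rule.
    """
    low, mid, high = sorted(values)
    if (mid - low) <= (high - mid):
        return (low + mid) / 2
    return (mid + high) / 2


# Hypothetical inputs: -2.0 kcal/mol, 15% fully buried, 50% buried to 95%.
normalized = [
    normalize_transfer(-2.0),
    normalize_buried_100(0.15),
    normalize_buried_95(0.50),
]
print([round(v, 3) for v in normalized])
print(round(simple_average(normalized), 3))
print(round(mean_of_closest_pair(normalized), 3))
```

The subjective adjustments described for alanine, proline, arginine, tyrosine, leucine and lysine are deliberately not modeled here.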
3. Results
(a) Choice of parameters
The effectiveness of the program and the progressive-evaluation approach in
general depend upon two decisions. First, we had to determine how large a span of
consecutive residues yields a hydropathy profile that most consistently reflects the
exterior and interior portions of proteins. Second, we had to determine how critical
the hydropathy assignments for the individual amino acids are to the outcome of
the calculations. For example, is the profile of a given sequence radically changed if
the hydropathy values for one or more residues are changed by an arbitrary factor?
We met these problems directly by examining the same protein sequences under a
variety of conditions.
[Figure 1. Panels: CHYM 5, CHYM 9, CHYM 13.]
FIG. 1. SOAP profiles of bovine chymotrypsinogen (CHYM) at 3 different span settings (5, 9 and 13).
The solid bars above the midpoint line on the SOAP-9 profile denote interior regions as determined by
crystallography (Freer et al., 1970). Similarly, the solid bars below the midpoint line indicate regions
that are on the outside of the molecule.
With respect to the most effective choice of span, we compared the hydropathy
profiles of a number of different proteins over a range of spans from 3 to 21 residues.
Selected profiles from two of these surveys, for chymotrypsin and lactate
dehydrogenase, respectively, are shown in Figures 1 and 2. Naturally, the
hydropathy profiles using the shortest spans are noisier than intermediate spans,
and runs employing spans less than seven residues were generally unsatisfactory.
Long spans on the other hand tended to miss small, consistent features. Frequent
and subjective analysis of the degree of correlation of the profiles with the exteriors
and interiors of globular proteins (see below), as well as the resolution of the profile
itself, revealed that information content was maximized when the spans were set at
7 to 11 residues.
The impact of the choice of hydropathy values was examined in two different
[Figure 2. Panel: LDH 9.]
FIG. 2. SOAP profiles of dogfish lactate dehydrogenase (LDH) at 3 different span settings (5, 9 and
13). The solid bars above the midpoint line on the SOAP-9 profile denote interior regions as determined
by the crystallographic study of the protein (Eventhoff et al., 1977). Similarly, the solid bars below the
midpoint line indicate regions that are known to be on the outside of the molecule.
ways. As an initial test, the 20 side-chains were assigned to three groups according
to their rank on the hydropathy scale (Table 2). Thus, arginine, lysine, asparagine,
aspartic acid, glutamine, glutamic acid and histidine were assigned to cluster I;
proline, tyrosine, serine, tryptophan, threonine and glycine to cluster II; and
alanine, methionine, cysteine/cystine, phenylalanine, leucine, valine and isoleucine
to cluster III. The individual values contributing to each cluster were averaged
(cluster I, -3.7, cluster II, -1.0 and cluster III, +3.0) and the mean values
incorporated into a modified SOAP program called LARD. Comparisons of LARD
against SOAP in the cases of chymotrypsinogen and lactic dehydrogenase are
shown in Figures 3 and 4. Although the patterns exhibit some general similarities,
as might be expected since the moving average itself tends to have a leveling
aspect, an experimental approach loses nothing by using the best values available
rather than settling for less precise estimates.
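The clustering used for LARD can be reproduced numerically. The sketch below is an illustration only: the names `KD`, `CLUSTERS`, `cluster_means` and `lard_scale` are hypothetical, and the per-residue values are the published Kyte-Doolittle numbers, supplied because the rows of Table 2 are not present in this text.

```python
# Published Kyte-Doolittle hydropathy values, keyed by one-letter code.
# Assumption: supplied from the published scale, not from this text.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

# Cluster membership as listed in the text (one-letter codes).
CLUSTERS = {
    "I": "RKNDQEH",
    "II": "PYSWTG",
    "III": "AMCFLVI",
}


def cluster_means(scale, clusters):
    """Average the individual values contributing to each cluster."""
    return {
        name: sum(scale[res] for res in members) / len(members)
        for name, members in clusters.items()
    }


def lard_scale(scale, clusters):
    """Replace every residue's value by the mean of its cluster."""
    means = cluster_means(scale, clusters)
    return {
        res: means[name]
        for name, members in clusters.items()
        for res in members
    }


means = cluster_means(KD, CLUSTERS)
lard = lard_scale(KD, CLUSTERS)
print({name: round(value, 2) for name, value in means.items()})
```

With these inputs the three means come out at about -3.66, -0.95 and +3.07, which agrees with the reported -3.7, -1.0 and +3.0 to within rounding.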
As a second test, the values of four of the most controversial assignments were
shifted radically in order to assess the impact on the hydropathy profile. Thus, the
values for tyrosine, histidine, proline and tryptophan, all of which have arguably
(Nozaki & Tanford, 1971) low hydropathy scores (Table 2), were arbitrarily
increased by 3.0 units. When the same two proteins were examined with this
modified scale there was a noticeable if modest change in the patterns (Figs 3
and 4). That the change was modest is partly due to the fact that histidine and
tryptophan are among the least common amino acids.
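A sketch of this second test follows. It assumes the averaged form of the profile, a span of 9, and the published Kyte-Doolittle values; the function names and the short test sequence are hypothetical. It shows why a single shifted residue moves a nine-residue average by only 3.0/9 units.

```python
# Published Kyte-Doolittle hydropathy values, keyed by one-letter code.
# Assumption: supplied from the published scale, not from this text.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def moving_average(sequence, scale, span=9):
    """Average hydropathy of each overlapping window of `span` residues."""
    values = [scale[res] for res in sequence]
    return [
        sum(values[start:start + span]) / span
        for start in range(len(values) - span + 1)
    ]


def shifted_scale(scale, residues="YHPW", shift=3.0):
    """Copy of the scale with the chosen residues raised by `shift`."""
    return {
        res: value + (shift if res in residues else 0.0)
        for res, value in scale.items()
    }


# Hypothetical sequence containing a single tyrosine.
sequence = "AAAAYAAAAGG"
standard = moving_average(sequence, KD)
modified = moving_average(sequence, shifted_scale(KD))
largest_change = max(abs(m - s) for m, s in zip(modified, standard))
print(round(largest_change, 3))
```

Because each window average divides the shift by the span, the effect of a rare residue on the profile is diluted, which is in line with the modest changes reported.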
[Figure 3. Horizontal axis: sequence number, 20 to 240.]
FIG. 3. SOAP profiles of bovine chymotrypsinogen (CHYM) using different hydropathy values for the
20 amino acids. In the top panel (9L) the program (LARD) used a set of clustered values in which case
the 20 amino acids were divided into 3 sets (hydrophobic, neutral and hydrophilic). In the lower panel
(9S), the program used radically different weighting factors for some of the more controversial amino
acid assignments, including those of histidine, tryptophan, tyrosine and proline. In the middle panel the
program used the standard set of assignments presented in Table 2. All plots utilize a span setting of 9.
runs along the exterior of the protein, even though the hydropathy profile shows it
to have a very hydrophobic character. This nine-residue sequence, Lys-Leu-Lys-
Ile-Ala-Lys-Val-Phe-Lys, contains four positive charges intermingled with five
very hydrophobic residues. The contradiction arises from the fact that the high
concentration of positive charge does not weigh heavily enough in the moving
[Figure 4. Panels: LDH 7L, LDH 7, LDH 7S. Horizontal axis: sequence number.]
FIG. 4. SOAP profiles of dogfish lactate dehydrogenase in which different hydropathies were used for
the 20 amino acids. All plots utilize a span setting of 7. See legend to Fig. 3 for meanings of 7L, 7S and 7.
average to offset the alkane side-chains present in this rather unusual sequence. A
close examination of the model reveals that the five hydrophobic side-chains are all
directed toward the interior while the lysine side-chains point out into the aqueous
environment.
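The arithmetic behind this observation can be checked directly. The sketch below uses the published Kyte-Doolittle values (an assumption in the sense that Table 2's rows are not reproduced in this text) and the nine-residue sequence quoted above in one-letter code; the variable names are hypothetical.

```python
# Published Kyte-Doolittle hydropathy values, keyed by one-letter code.
# Assumption: supplied from the published scale, not from this text.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

# Lys-Leu-Lys-Ile-Ala-Lys-Val-Phe-Lys in one-letter code.
segment = "KLKIAKVFK"

lysine_total = sum(KD[res] for res in segment if res == "K")
other_total = sum(KD[res] for res in segment if res != "K")
window_sum = lysine_total + other_total
window_mean = window_sum / len(segment)

print(round(lysine_total, 1), round(other_total, 1))
print(round(window_sum, 1), round(window_mean, 2))
```

The four lysines contribute -15.6 and the five alkane side-chains +17.1, so the nine-residue sum remains positive even though almost half of the residues are charged.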
In the case of dogfish lactate dehydrogenase (Fig. 2), the designations of external
and internal residues had already been made and published (Eventhoff et al., 1977),
and the hydropathy profile correlates well with these crystallographic findings. The
five major regions of the profile that lie below the midpoint line are all external
sequences in the native protein and six of the eight major regions above the midline
are internal sequences. Again, as with chymotrypsinogen, the profile was least
successful in evaluating those regions in which the main chain is only partly buried,
such as the regions between residues 66 and 78, and 112 and 126, where the
backbone repeatedly passes in and out of the aqueous phase.
[Figure 5. Horizontal axis: sequence number.]
FIG. 5. SOAP profiles for 3 different proteins that have membrane affiliation. In the upper panel, the
plot is that of erythrocyte glycophorin (GLYC), which has an easily recognized membrane-spanning
segment in the region of residues 75 to 94. In the middle panel, rabbit cytochrome b5 (CB5R) is depicted.
In this case there is a membrane-anchoring unit involving the 20-residue carboxy-terminal segment. In
the lower panel, the carboxy-terminal region of vesicular stomatitis virus glycoprotein (VSVG) is
shown; a membrane-spanning segment is clearly evident from residues 470 to 490. All profiles are at
span settings of 7.
[Figure 6. Panel: RHOD 7.]
FIG. 6. SOAP profile of bacteriorhodopsin (RHOD) at a span setting of 7. Five of the well-known 7
transmembrane shafts (Henderson & Unwin, 1975) are clearly delineated; the separation point for the
remaining 2 is not so clear and has been set arbitrarily at about residue 200.
TABLE 3
Transmembrane sequences of bacteriorhodopsin,
aligned next to each other, in correct polarity
A B C D E F G
examined and the most hydrophobic region from each was picked. From this
preliminary collection, a group of twelve 20-residue sequences, which were judged to
be the most hydrophobic of the lot, was chosen for closer inspection. It was assumed
that, since these were in each case the most hydrophobic region in the entire
sequence of a given protein, they would serve as the most extreme models for a
peptide that traverses the interior of a protein. From these 12 proteins, the most
hydrophobic segment of each span-length from 9 to 21 residues was identified and
its average hydropathy tabulated. The collected values for each span length were
compared directly with those of the most hydrophobic segments of the same span
length taken from bacteriophage M13 coat protein (Nakashima & Konigsberg,
1974), glycophorin, and the seven transmembrane sequences of bacterial rhodopsin
(Table 3). These nine hydrophobic sequences, each of which is known to span the
membrane, were chosen as models for a sequence which in the native protein is
within the bilayer.
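The search for the most hydrophobic segment of each span length can be sketched as below. This is an illustration under assumptions, not the procedure actually coded by the authors: the function name, the running-sum implementation and the test sequence are hypothetical, and the scale is the published Kyte-Doolittle set of values.

```python
# Published Kyte-Doolittle hydropathy values, keyed by one-letter code.
# Assumption: supplied from the published scale, not from this text.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def most_hydrophobic_segment(sequence, span):
    """Return (first, last, mean) for the window with the highest average.

    Positions are 1-based and inclusive; the first such window wins ties.
    """
    values = [KD[res] for res in sequence]
    if span < 1 or span > len(values):
        raise ValueError("span must be between 1 and the sequence length")
    window = sum(values[:span])
    best_start, best_sum = 0, window
    for start in range(1, len(values) - span + 1):
        # Slide the window by one residue: add the new one, drop the old one.
        window += values[start + span - 1] - values[start - 1]
        if window > best_sum:
            best_start, best_sum = start, window
    return best_start + 1, best_start + span, best_sum / span


# Hypothetical sequence: nine isoleucines flanked by aspartic acids.
sequence = "D" * 10 + "I" * 9 + "D" * 10
for span in range(9, 22, 2):
    print(span, most_hydrophobic_segment(sequence, span))
```

Tabulating the returned mean for each span length and each protein gives the kind of collection that is compared between the soluble and membrane-spanning groups.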
The discrimination between the segments from the soluble proteins as a group
and those from the membrane-spanning sequences was most unequivocal when the
span was lengthened to 19 residues (Fig. 7). This may be due to the fact that
protein-spanning sequences passing through the interior are usually shorter than
membrane-spanning sequences. Nevertheless, from an examination of Table 4 it
can be concluded that when the hydropathy of a given 19-residue segment averages
greater than + 1.6 there is a high probability that it will be one of the sequences in a
[Figure 7. Horizontal axis: span (11, 15, 19).]
FIG. 7. Comparison between the hydropathy of sequences that span membranes and the hydropathy
of those that span proteins. The most hydrophobic sequences from 9 globular proteins (Table 4), except
for lactic dehydrogenase, were compared with 9 membrane-spanning sequences (Table 4). For each span
length, the average hydropathies of the most hydrophobic segments were collected and the means and
standard deviations of the 9 values were calculated. These are presented as a function of the span length
for the membrane-spanning group (-O-O-) and the protein-spanning group (-o-O-).
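The criterion mentioned in the text, an average hydropathy above +1.6 over a 19-residue segment, can be applied as a simple scan. The sketch below is hedged accordingly: the function name and the test sequence are hypothetical, the scale is the published Kyte-Doolittle set of values, and a window that passes the threshold is only a candidate, not a demonstrated membrane-spanning sequence.

```python
# Published Kyte-Doolittle hydropathy values, keyed by one-letter code.
# Assumption: supplied from the published scale, not from this text.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def candidate_segments(sequence, span=19, threshold=1.6):
    """List (first, last, mean) for every window averaging above threshold.

    Positions are 1-based and inclusive.
    """
    values = [KD[res] for res in sequence]
    hits = []
    for start in range(len(values) - span + 1):
        mean = sum(values[start:start + span]) / span
        if mean > threshold:
            hits.append((start + 1, start + span, mean))
    return hits


# Hypothetical sequence: 19 leucines flanked by 10 lysines on each side.
sequence = "K" * 10 + "L" * 19 + "K" * 10
hits = candidate_segments(sequence)
print([(first, last, round(mean, 2)) for first, last, mean in hits])
```

Overlapping windows around a hydrophobic stretch all pass together, so in practice the hits would be merged, or only the highest-scoring window kept.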
TABLE 4
Nineteen-residue hydropathy averages for the most hydrophobic sequences from various proteins
Protein                                            Length   Sequence position   Mean hydropathy
Soluble
  Dogfish lactate dehydrogenase                    329      23-41               2.26
  Klebsiella aerogenes ribitol dehydrogenase       247      142-160             1.52
  Human transferrin                                676      c:53-c71            1.32
  Rabbit phosphorylase                             841      139-157             1.18
  Bovine chymotrypsinogen A                        245      51-69               1.14
  Lobster glyceraldehyde-3-P dehydrogenase         333      14-32               1.04
  Bovine prothrombin                               582      350-368             0.99
  Bacillus stearothermophilus phosphofructokinase  316      213-231             0.96
  Human carbonic anhydrase B                       260      135-153             0.88
  Escherichia coli dihydrofolate reductase         156      81-99               0.81
  Bovine carboxypeptidase A                        307      95-113              0.81
  Bovine proalbumin                                588      25-43               0.53
  1.09 ± 0.22 (n = 9)
Membrane-spanning
  M13 coat protein                                 50       21-39
  Human glycophorin                                131      73-91
  Halobacterium halobium bacteriorhodopsin         248      11-29
                                                            44-62
                                                            83-101
                                                            108-126
                                                            136-154
                                                            177-195
                                                            206-224
TABLE 5
Candidates for membrane-spanning sequences in cytochrome oxidase
Sequence    Average hydropathy (n = 19)    Sequence    Average hydropathy (n = 19)
[Figure 8. GRAVY plot. Horizontal axis: length (residues).]
FIG. 8. Plot of mean hydropathies (GRAVY scores) of various proteins against their lengths: ( x ) 84
fully-sequenced soluble enzymes whose amino acid sequences have been taken from the recent
literature; (0) 8 membrane-embedded proteins whose sequences have been determined
(bacteriorhodopsin; yeast mitochondrial cytochrome oxidase subunits I to III, cytochrome b, and the
(oli-2, oli-4) ATPase subunit; and 2 carbodiimide-sensitive mitochondrial proteins); (A) 8 putative
proteins inferred from the unidentified reading frames found in the DNA of human mitochondria
(Anderson et al., 1981).
TABLE 6
Average hydropathy (GRAVY) for the entire amino acid composition
of a collection of membrane-spanning proteins
a From sequence.
b From composition.
c From all subunits listed in Table 5.
compared to the spread of the values for an array of soluble proteins (Fig. 8) the
claim that membrane-bound proteins can be distinguished from soluble proteins by
their amino acid compositions alone (Capaldi & Vanderkooi, 1972) appears tenuous.
There remains the possibility, however, that the unexpected hydrophilicity of the
membrane-spanning proteins whose compositions are known only from amino acid
analysis may actually be due to a failure to hydrolyze membrane-spanning
sequences completely even after 72 hours at 108°C.
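The GRAVY score is simply the mean hydropathy of the whole chain, and it can be obtained either from a sequence or from an amino acid composition, as distinguished in Table 6. A minimal sketch, assuming the published Kyte-Doolittle values and hypothetical function names and inputs:

```python
# Published Kyte-Doolittle hydropathy values, keyed by one-letter code.
# Assumption: supplied from the published scale, not from this text.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def gravy_from_sequence(sequence):
    """Mean hydropathy over every residue of the sequence."""
    return sum(KD[res] for res in sequence) / len(sequence)


def gravy_from_composition(counts):
    """Mean hydropathy from residue counts, e.g. {"A": 12, "K": 7}."""
    total = sum(counts.values())
    return sum(KD[res] * n for res, n in counts.items()) / total


# Hypothetical inputs, chosen only to show that both routes agree.
score_seq = gravy_from_sequence("AARR")
score_comp = gravy_from_composition({"A": 2, "R": 2})
print(round(score_seq, 2), round(score_comp, 2))
```

Both routes give the same number when the composition is exact; a composition from amino acid analysis that under-reports poorly hydrolyzed hydrophobic segments would bias the second route toward hydrophilic values, which is the possibility raised above.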
4. Discussion
The equilibrium that determines the unique molecular structure of a protein is
the one that exists between it and a random coil (Anfinsen, 1973). It is generally
assumed that this process can be described as a simple two-state equilibrium
between the native structure and the random coil, and experimental results
consistent with this assumption have been presented (Tanford, 1968). If this is
indeed the case, the individual contributions to the overall free energy change for
this isomerization would be the most critical factors in determining the outcome,
rather than any kinetic features of the reaction. These thermodynamic forces, by
the very nature of the process, must be non-covalent interactions. Several
provocative discussions of these matters have been presented (Cohn & Edsall, 1943;
Kauzmann, 1959; Jencks, 1969; Chothia, 1976). Moreover, it has been
demonstrated definitively, by experimental observation, that neither hydrogen
bonds (Klotz & Franzen, 1962), nor ionic interactions (Cohn & Edsall, 1943), nor
dispersion forces (Deno & Berkheimer, 1963) can provide any net favorable free
energy for the formation of the native structure in aqueous solution. Therefore, by
exclusion, and perhaps for the lack of a better candidate, hydrophobic forces
(Kauzmann, 1959) have attracted the most attention in discussions of this process.
Felicitously, this has drawn attention to the significant role of the aqueous
solvent, per se. The hydrophobic force is simply that force, arising from the strong
cohesion of the solvent, which drives molecules lacking any favorable interactions
with the water molecules themselves from the aqueous phase (Jencks, 1969). In the
case of the formation of the native structure from the random coil, this force
participates in the reaction because hydrophobic side-chains, which are exposed to
water in the extended coil, are removed to the interior of the protein during the
folding of the native structure (Chothia, 1976). This transfer appears to provide the
only favorable free energy available to drive the reaction to completion. Therefore,
the more aversion water has for a given amino acid side-chain, the more free energy
is gained when that residue or a portion of it ends up inside the native structure.
Conversely, and of equal importance, it is also the case that the more attraction
water has for a functional group on an amino acid side-chain, the more free energy
is lost when that functional group is removed from water during the folding
process. This point becomes clear upon examination of the data in Table 1, when it
is realized that most of the free energies of transfer from water to the condensed
vapor are actually unfavorable, many by a considerable amount. This is due, of
course, to the fact that water participates in strong interactions with hydrogen-
bond donors and acceptors (Klotz & Farnham, 1968), as well as to the need to
neutralize charged side-chains. As a result, one of the major free energy deficits in
the folding of a protein results from the requirement to unsolvate those hydrophilic
functional groups destined for the interior. Some of this investment is returned
when hydrogen bonds are formed in the interior. Nevertheless, because of
geometric constraints, the hydrophilic side-chains in the center of a protein
participate in far fewer hydrogen bonds than they would in the unfolded and
exposed random coil, where both the donors and acceptors interact fully with
water. As such, there is a high probability that significant free energy will be lost
whenever a hydrophilic residue is removed from water during the folding process.
It is undeniably the case therefore that both the hydrophobicity and the
hydrophilicity of a given sequence of amino acids affect the outcome of the
equilibrium between the random coil and the native structure. Although one or the
other of these two properties is often emphasized to make a particular point,
neither is more important than the other. For example, it is often stated that the
interior of a protein is formed from its hydrophobic sequences, but it is seldom
pointed out that the interior of the protein is also formed because the hydrophilic
sequences cannot be buried. Thus, it can be concluded that any description of the
folding process that fails to consider either hydrophobicity or hydrophilicity is
discarding half of the information contained within the sequence of the protein.
It has also been pointed out (Lifson & Sander, 1979; Janin & Chothia, 1980;
Chothia & Janin, 1981) that the packing properties of residues such as leucine,
isoleucine and valine might have an effect, independent of hydropathy, on the
folding process as the interior of the protein is fitted together. Unfortunately, very
little is known about the features of this steric interplay, and our understanding of
the folding process does not extend beyond the conclusion that hydropathy is of
central importance to it.
The conclusion that can be drawn from all of these considerations is that, to a
first approximation, the native structure of a protein molecule will be that
structure that permits the removal of the greatest amount of hydrophobic surface
area and the smallest number of hydrophilic positions from exposure to water
(Bigelow, 1967; Fisher, 1964; Chothia, 1976). The obvious prediction that follows
from this conclusion is that the most hydrophobic sequences in a protein will be
found in the interior of the native structure and the most hydrophilic sequences will
be found on the exterior. In order to exploit this prediction with the greatest
success, the most accurate evaluations of the hydrophobicity and hydrophilicity of
each amino acid side-chain should be formulated.
To this end, a number of hydropathy scales have been proposed in other
publications, but, in our view, they all suffer from serious drawbacks. Those based
on water-ethanol transfer free energies (Nozaki & Tanford, 1971; Segrest &
Feldman, 1974; Rose, 1978) are imperfect due to the peculiarities of ethanol as a
solvent, which seem almost as unusual as those of water itself (Table 1). A scale
based on the partition coefficient between the bulk aqueous phase and the air-water
interface (Bull & Breese, 1974) also seems a poor choice, because the
hydrogen bonds that must be broken and the charges that must be neutralized to
remove a residue from the aqueous phase during the formation of the native
structure probably remain intact at the air-water interface and are thus not a
others. The most unequivocal values are those associated with leucine, isoleucine,
valine, phenylalanine, methionine, threonine, serine, lysine, glutamine and
asparagine. These ten residues together comprise slightly more than half (52% of
the present census) of the amino acids found in proteins. Most of these side-chains
have partial specific volumes between 50 and 100 cm³ mol⁻¹, which suggests that
dispersion forces may not influence their rank. Their relative positions change little
from the fraction 95% buried, to the fraction 100% buried, to the free energy of
transfer (Table 2). As such, these residues anchor the scale and are probably those
most responsible for its success.
There is a group of amino acids that are less reliable: cysteine is complicated by
the problem of disulfide bonds; proline, by the lack of an adequate model
compound in the transfer free energies as well as its tendency to participate in
β-turns on the exterior; and aspartic acid, glutamic acid and tyrosine, by the large
differences between their tendency to be buried and their free energies of transfer.
Certain amino acids (tryptophan, tyrosine, glutamic acid and histidine) are very
reluctant to bury the last 5% of their surface area while some (alanine, glycine and
cysteine) are far more likely than the others to become fully buried. Finally,
arginine was arbitrarily assigned a parameter of -4.5, even though no arginine was
found to be even as much as 95% buried (Chothia, 1976) and no model compound
for arginine was employed in the water-vapor transfer studies of Wolfenden et al.
(1979). It is possible that the parameter for this side-chain should be even more
negative (Wolfenden et al., 1981).
Glycine and alanine are especially difficult to categorize. Both lack satisfactory
model compounds for phase-transfer studies. Because methane is such a small
molecule, its relative hydrophobicity is probably seriously overestimated by its
water-vapor transfer free energy, owing to the ambiguity introduced by dispersion
forces. Indeed, the use of hydrogen gas as a model compound for glycine
(Wolfenden et al., 1979) is such an extreme case of the problem that arises from the
contributions of the dispersion forces when molecules of such radically different
electron densities are compared, that its water-vapor transfer free energy is
probably a meaningless number in this context, and it has not been included in
Table 1. On the other hand, both alanine and glycine are quite insensitive to
becoming fully buried (Table 2), which suggests that the side-chains contribute
little energy one way or the other to protein folding because they are not
hydropathic. The conclusion from these arguments is that the more alanine and
glycine a segment contains the more equivocal its hydropathy becomes.
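One simple way to act on this caution is to report, alongside the hydropathy of a segment, how much of that segment is alanine or glycine. The following C sketch is an illustration only: the function name `ala_gly_fraction`, its arguments and the idea of expressing the result as a fraction are assumptions made for this example and are not part of the method described in this paper. The sequence is assumed to be given in the one-letter code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): fraction of the `span`
   residues starting at seq[start] that are alanine ('A') or glycine ('G').
   The caller must make sure that start + span does not run past the end
   of the sequence. */
double ala_gly_fraction(const char *seq, size_t start, size_t span)
{
    size_t k, count = 0;

    assert(span > 0);
    for (k = 0; k < span; k++) {
        char c = seq[start + k];
        if (c == 'A' || c == 'G')
            count++;
    }
    return (double)count / (double)span;
}
```

A segment for which this fraction is high would be one whose hydropathy should be read with particular caution; no threshold is proposed here.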
A rather interesting and unforeseen feature of the interaction between the
various side-chains and water is that the aromatic amino acids, tryptophan,
tyrosine and histidine, are far more polar than previously thought (Nozaki &
Tanford, 1971). It has been noted that aromatic compounds are more soluble in
water, by an order of magnitude, than their surface areas would indicate
(Hermann, 1972). The phenylalanine side-chain, however, is much less hydrophilic
than the other three and this suggests that it is the heteroatoms in the latter that
are the major contributors to their hydrophilicity.
In this context, tryptophan is one of the most difficult residues to which to assign
a hydropathy index. It has a fairly positive water-vapor transfer free energy
(Table 1), but much of this may result from large, favorable dispersion forces due to
the residue's large volume, the opposite problem to the one experienced with
glycine and alanine. Examination of the actual location of tryptophan in a number
of proteins (Chothia, 1976), however, clearly indicates that this side-chain is
infrequently totally buried (Table 2). In the specific case of the interaction of
gramicidin with a phospholipid bilayer, it should be mentioned that its tryptophan
residues are clustered at the two ends of the pore rather than being distributed
evenly throughout, again suggesting an unexpected hydrophilicity for these
residues and a reluctance to bury the last 5% of surface (Table 2). Although the
hydropathy of tryptophan is relatively unimportant in considerations of soluble
proteins, since its frequency is only about 1-2%, there are indications that it may be
very significant in membrane-affiliated sequences. In particular, 18 of the 19
tryptophan residues in Ca²⁺-ATPase seem to be within sequences directly
associated with the bilayer (Allen et al., 1980). Another instance is the tryptophan
cluster in cytochrome b5, noted above, which the hydropathy profile clearly
positions at the aqueous interface (Fig. 5). Finally, earlier claims that tryptophan
was the most hydrophobic of the amino acids were based entirely on transfer free
energies between water and ethanol or dioxane (Nozaki & Tanford, 1971). It was
not recognized at that time that when the tryptophan side-chain, which possesses a
hydrogen-bond donor only, is transferred between water, a solvent with equal
numbers of donors and acceptors, and ethanol or dioxane, solvents with excesses of
hydrogen-bond acceptors, there is a net increase of one mole of hydrogen bond
(mole indole)⁻¹ formed during the transfer, causing the side-chain to appear much
more hydrophobic than it actually is. Using the same logic, it is clear that in a
protein solution, which necessarily contains more acceptors than donors, the
removal of the donor on tryptophan from access to the solvent is a significantly
unfavorable reaction. It seems, when all points are considered objectively, that
tryptophan is a fairly hydrophilic side-chain.
The particular values chosen for the amino acid hydropathies embody one of the
major differences between the method presented here and a similar one proposed
earlier by Rose (1978). He chose to employ water-ethanol transfer free energies in
his scale, the disadvantages of which are noted above. He also chose to ignore, at
least in principle, the hydrophilic force, the attraction that the aqueous solvent
exhibits for many side-chains, by simply assigning a value of zero to all side-chains
for which partition free energies were not listed in the Tables of Nozaki & Tanford
(1971). In addition, Rose's curve-smoothing procedure, although mathematically
sound, tends to remove a great deal of the simplicity and clarity of the
unsmoothed moving average. In the program described in this paper the meaning
of each value is clearly understood and a more distinct and graphic rendering of the
sequence is obtained.
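The unsmoothed moving average referred to here can be sketched in a few lines of C. This is a minimal illustration and not the program of the Appendix: the function name `moving_average`, its argument order and the use of `double` are assumptions made for the example, and the input may be any array of per-residue hydropathy values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): writes the mean of every run
   of `span` consecutive values into out[] and returns the number of means
   written, which is n - span + 1, or 0 when the sequence is shorter than
   the span. out[] must have room for that many entries. */
size_t moving_average(const double *value, size_t n, size_t span, double *out)
{
    size_t i, k;

    assert(span > 0);
    if (n < span)
        return 0;
    for (i = 0; i + span <= n; i++) {
        double total = 0.0;
        for (k = 0; k < span; k++)
            total += value[i + k];
        out[i] = total / (double)span;
    }
    return n - span + 1;
}
```

Each output is the plain mean of `span` neighboring inputs, so every plotted value can be traced directly to the residues of one segment; no weighting or curve fitting intervenes.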
In addition, we have extended the use of this approach to the area of membrane-
spanning segments of protein sequences. In this regard, the most novel feature of
the approach is that membrane-spanning segments can be identified and
distinguished from sequences that merely pass through the interior of a protein
(Table 4). Since it is these membrane-spanning portions of sequences that have
proven to be most difficult to study (Allen et al., 1980), a method for their
weakens the arguments used earlier (Engelman et al., 1980) to correlate the
sequence of bacteriorhodopsin with the electron density profile. In the first place,
the total scattering power of the A sequence in Table 3 is not less than those of the
others, and this would make its correlation with a low-intensity shaft unnecessary.
More to the point of the present discussion, the model preferred in the earlier study
(Engelman et al., 1980) places sequence B, one of the most hydrophobic (Table 4),
at a location where it is completely surrounded by protein; and sequence F, clearly
the most hydrophilic (Table 4), at a location well-exposed to the alkane of the
bilayer. These considerations demonstrate that an examination of the hydropathy
of a given sequence may provide additional information to the crystallographer in
situations where structural decisions are ambiguous. An assignment that is
different from the previous one and that satisfies the demands of hydropathy more
successfully would be to place sequence A into shaft 5, B into shaft 6, C into
shaft 2, D into shaft 3, E into shaft 4, F into shaft 7 and G into shaft 1, in the
enumeration of Engelman et al. (1980). This assignment positions the most
hydrophilic sequence, F, in the location most shielded from the alkane, and the
most hydrophobic, G, in the location most exposed to the alkane; juxtaposes
sequences F, G, C and B, forming the proton relay system as well as permitting the
strong hydrogen bonds mentioned earlier; and places the retinal that is attached to
lysine 216 (Bayley et al., 1981) in the very center of the carboxylate relay system, as
well as at the location that it occupies within the projected neutron density profile
(King et al., 1980). Furthermore, no crossovers of the sequence are required, and
the only long connection coincides fortuitously with the longest stretch of polar
sequence, residues 158 to 175. This elaboration, as well as others presented earlier,
is an example of the information that might be gained from an informed
consideration of a hydropathy profile.
APPENDIX
A C program for evaluating the hydropathic character of sequence segments
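#include <stdio.h>

#define MAXSEQ 1099             /* longest sequence accepted */

/* One-letter residue codes and the value used for each. Arginine, assigned
   -4.5 in the text, is 0.0 here, so no value in the table is negative. */
char code[] = "RKDBNSEHZQTGXAPVYCMILWF";
float factor[] = {0.0, 0.6, 1.0, 1.0, 1.0, 3.6, 1.0, 1.3, 1.0, 1.0, 3.8, 4.1,
                  4.1, 6.3, 2.9, 8.7, 3.2, 7.0, 6.4, 9.0, 8.2, 3.6, 7.2};

int main(void)
{
    int i, j, k, c;
    float total;
    char sequence[MAXSEQ];
    float value[MAXSEQ];

    j = 0;
    /* Skip the first line of the input. */
    while ((c = getchar()) != '\n' && c != EOF)
        ;
    /* Every later line: skip 11 leading characters, then read residues,
       each followed by one separator character, up to the end of the line. */
    while (c != EOF && j < MAXSEQ) {
        for (i = 0; i < 11 && c != EOF; i++)
            c = getchar();
        while (c != EOF && j < MAXSEQ) {
            if ((c = getchar()) == EOF || c == '\n')
                break;
            sequence[j++] = c;
            if ((c = getchar()) == '\n')
                break;
        }
    }
    /* Look up the value of each residue; unknown characters count as 0. */
    for (i = 0; i < j; i++) {
        value[i] = 0.0;
        for (k = 0; k < 23; k++)
            if (sequence[i] == code[k])
                value[i] = factor[k];
    }
    /* Sum each run of 7 residues and plot the total beside its middle residue. */
    for (i = 0; i < (j - 6); i++) {
        total = 0;
        for (k = 0; k < 7; k++)
            total = total + value[i + k];
        printf("%4d %c %6.1f ", i + 4, sequence[i + 3], total);
        for (k = 0; k < total; k++) {
            if (k == 29)
                printf(".");
            else
                printf(" ");
        }
        printf("X\n");
    }
    printf("\n");
    return 0;
}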
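The totals printed by the program are sums over seven residues of table values that run from 0.0 upward, rather than averages on the hydropathy scale of the text. The sketch below shows one way such a total could be computed for a single segment and converted back to a mean hydropathy. It is an illustration under stated assumptions: the function names `segment_total` and `mean_hydropathy` are hypothetical, and the offset of 4.5 is inferred only from the fact that arginine is 0.0 in the program's table and -4.5 in the text.

```c
#include <assert.h>
#include <string.h>

/* Assumed offset between the table used by the program and the hydropathy
   scale of the text: arginine is 0.0 in the table and -4.5 in the text. */
#define HYDROPATHY_OFFSET 4.5f

/* Hypothetical helper: sum of the table values for the `span` residues
   starting at seq[start]. `code` is a string of one-letter codes and
   factor[k] is the value for code[k]. Letters missing from `code` add 0.
   The sum stops early if the sequence ends inside the segment; `start`
   itself must lie within the sequence. */
float segment_total(const char *seq, int start, int span,
                    const char *code, const float *factor)
{
    float total = 0.0f;
    int k;

    for (k = 0; k < span; k++) {
        char c = seq[start + k];
        const char *p;

        if (c == '\0')
            break;
        p = strchr(code, c);
        if (p != NULL)
            total += factor[p - code];
    }
    return total;
}

/* Convert a segment total back to a mean value on the scale of the text. */
float mean_hydropathy(float total, int span)
{
    assert(span > 0);
    return total / (float)span - HYDROPATHY_OFFSET;
}
```

Under this assumption a total of 0 corresponds to a segment of seven arginines (mean -4.5), and a total of 63 to seven isoleucines at 9.0 each (mean +4.5).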
We thank many of our colleagues including J. Kraut and S. J. Singer for helpful
discussions about many of the matters discussed in this paper. We are also grateful to
S. Dempsey for assistance with various aspects of programming for the graph plotter. This
work was supported by National Institutes of Health grants HL18576, HL26873, HL17879,
RR00757 and by National Science Foundation grant PCM78-24284.
REFERENCES
Allen, G., Trinnaman, B. J. & Green, N. M. (1980). Biochem. J. 187, 591-616.
Anderson, S., Bankier, A. T., Barrell, B. G., de Bruijn, M. H. L., Coulson, A. R., Drouin, J.,
Eperon, I. C., Nierlich, D. P., Roe, B. A., Sanger, F., Schreier, P. H., Smith, A. J. H.,
Staden, R. & Young, I. G. (1981). Nature (London), 290, 457-465.
Anfinsen, C. B. (1973). Science, 181, 223-230.
Bayley, H., Huang, K. S., Radhakrishnan, R., Ross, A. H., Takagaki, Y. & Khorana, H. G.
(1981). Proc. Nat. Acad. Sci., U.S.A. 78, 2225-2229.
Bigelow, C. C. (1967). J. Theoret. Biol. 16, 187-211.
Bonitz, S. G., Coruzzi, G., Thalenfeld, B. E. & Tzagoloff, A. (1980). J. Biol. Chem. 255,
11927-11941.
Bull, H. B. & Breese, K. (1974). Arch. Biochem. Biophys. 161, 665-670.
Buse, G. & Steffens, G. J. (1978). Hoppe-Seyler's Z. Physiol. Chem. 359, 1005-1009.
Capaldi, R. A. & Vanderkooi, G. (1972). Proc. Nat. Acad. Sci., U.S.A. 69, 930-932.
Chothia, C. (1976). J. Mol. Biol. 105, 1-14.
Chothia, C. & Janin, J. (1981). Proc. Nat. Acad. Sci., U.S.A. 78, 4146-4150.
Chou, P. Y. & Fasman, G. (1973). J. Mol. Biol. 74, 263-281.
Cohn, E. J. & Edsall, J. T. (1943). Proteins, Amino Acids and Peptides as Ions and Dipolar
Ions, Reinhold, New York.
Cohn, E. J., McMeekin, T. L., Edsall, J. T. & Blanchard, M. H. (1934). J. Amer. Chem. Soc.
56, 784-794.
Coruzzi, G. & Tzagoloff, A. (1979). J. Biol. Chem. 254, 9324-9330.
Dayhoff, M. O. (1978). Atlas of Protein Sequence and Structure, vol. 5, supp. 1-3, National
Biomedical Research Foundation, Washington, D.C.
Deno, N. C. & Berkheimer, H. E. (1963). J. Org. Chem. 28, 2143-2144.
Doolittle, R. F. (1981). Science, 214, 149-159.
Drickamer, L. K. (1977). J. Biol. Chem. 252, 6909-6917.
Engelman, D. M., Henderson, R., McLachlan, A. D. & Wallace, B. A. (1980). Proc. Nat.
Acad. Sci., U.S.A. 77, 2023-2027.
Eventhoff, W., Rossmann, M. G., Taylor, S. S., Torff, H.-J., Meyer, H., Keil, W. & Kiltz,
H.-H. (1977). Proc. Nat. Acad. Sci., U.S.A. 74, 2677-2681.
Fisher, H. F. (1964). Proc. Nat. Acad. Sci., U.S.A. 51, 1285-1291.
Fleming, P. J., Koppel, D. E., Lau, A. L. Y. & Strittmatter, P. (1979). Biochemistry, 18,
5458-5464.
Freer, S. T., Kraut, J., Robertus, J. D., Wright, H. T. & Xuong, Ng. H. (1970). Biochemistry,
9, 1997-2008.
Fuller, S. D., Capaldi, R. A. & Henderson, R. (1979). J. Mol. Biol. 134, 305-327.
Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). J. Mol. Biol. 120, 97-120.
Heller, J. (1968). Biochemistry, 7, 2906-2913.
Henderson, R. & Unwin, N. (1975). Nature (London), 257, 28-32.
Hermann, R. B. (1972). J. Phys. Chem. 76, 2754-2759.
Hine, J. & Mookerjee, P. K. (1975). J. Org. Chem. 40, 292-298.
Ho, M. K. & Guidotti, G. (1975). J. Biol. Chem. 250, 675-683.
Janin, J. & Chothia, C. (1980). J. Mol. Biol. 143, 95-128.
Jencks, W. P. (1969). Catalysis in Chemistry and Enzymology, McGraw-Hill, New York.
Kauzmann, W. (1959). Advan. Protein Chem. 14, 1-63.
Kernighan, B. W. & Ritchie, D. M. (1978). The C Programming Language, Prentice-Hall,
Englewood Cliffs, N.J.
Khorana, H. G., Gerber, G. E., Herlihy, W. C., Gray, C. P., Anderegg, R. J., Nihei, K. &
Biemann, K. (1979). Proc. Nat. Acad. Sci., U.S.A. 76, 5046-5050.
King, G. I., Mowery, P. C., Stoeckenius, W., Crespi, H. L. & Schoenborn, B. P. (1980). Proc.
Nat. Acad. Sci., U.S.A. 77, 4726-4730.
Klotz, I. M. & Farnham, S. B. (1968). Biochemistry, 7, 3879-3881.
Klotz, I. M. & Franzen, J. S. (1962). J. Amer. Chem. Soc. 84, 3461-3466.
Kristof, W. & Zundel, G. (1980). Biophys. Struct. Mech. 6, 209-225.
Kyte, J. (1972). J. Biol. Chem. 247, 7642-7649.
Lifson, S. & Sander, C. (1979). Nature (London), 282, 109-111.
Nakashima, Y. & Konigsberg, W. (1974). J. Mol. Biol. 88, 598-600.
Nobrega, F. G. & Tzagoloff, A. (1980). J. Biol. Chem. 255, 9828-9837.
Nozaki, Y. & Tanford, C. (1971). J. Biol. Chem. 246, 2211-2217.
Parsegian, A. (1969). Nature (London), 221, 844-846.
Pober, J. S. & Stryer, L. (1975). J. Mol. Biol. 95, 477-481.
Rose, G. D. (1978). Nature (London), 272, 586-590.
Rose, G. D. & Roy, S. (1980). Proc. Nat. Acad. Sci., U.S.A. 77, 4643-4647.
Rose, J. K., Welch, W. J., Sefton, B. M., Esch, F. S. & Ling, N. C. (1980). Proc. Nat. Acad.
Sci., U.S.A. 77, 3884-3888.
Sacher, R., Steffens, G. J. & Buse, G. (1979). Hoppe-Seyler's Z. Physiol. Chem. 360,
1385-1392.
Segrest, J. P. & Feldman, R. J. (1974). J. Mol. Biol. 87, 853-858.
Sogin, D. C. & Hinkle, P. C. (1978). J. Supramol. Struct. 8, 447-453.
Steck, T. L., Koziarz, J. J., Singh, M. K., Reddy, G. & Köhler, H. (1978). Biochemistry, 17,
1216-1222.
Steffens, G. J. & Buse, G. (1979). Hoppe-Seyler's Z. Physiol. Chem. 360, 613-619.
Strittmatter, P., Rogers, M. J. & Spatz, L. (1972). J. Biol. Chem. 247, 7188-7194.
Tanaka, M., Haniu, M., Yasunobu, K. T., Yu, C. A., Yu, L., Wei, Y. H. & King, T. E. (1979).
J. Biol. Chem. 254, 3879-3885.
Tanford, C. (1968). Advan. Protein Chem. 23, 121-282.
Thalenfeld, B. E. & Tzagoloff, A. (1980). J. Biol. Chem. 255, 6173-6180.
Tomita, M. & Marchesi, V. (1975). Proc. Nat. Acad. Sci., U.S.A. 72, 2964-2968.
Tomita, M., Furthmayr, H. & Marchesi, V. (1978). Biochemistry, 17, 4756-4770.
Traube, J. (1899). Samml. Chem. Chem.-Tech. Vortr. 4, 19-332.
Vandlen, R. L., Wu, W. C. S., Eisenach, J. C. & Raftery, M. A. (1979). Biochemistry, 18,
1845-1854.
von Heijne, G. & Blomberg, C. (1979). Eur. J. Biochem. 97, 175-181.
Wang, J. H. (1968). Science, 161, 328-334.
Wolfenden, R. V., Cullis, P. M. & Southgate, C. C. F. (1979). Science, 206, 575-577.
Wolfenden, R., Andersson, L., Cullis, P. M. & Southgate, C. C. B. (1981). Biochemistry, 20,
849-855.
Wu, T. T. & Kabat, E. A. (1973). J. Mol. Biol. 75, 13-31.
Zimmerman, J. M., Eliezer, N. & Simha, R. (1968). J. Theoret. Biol. 21, 170-201.
Edited by M. F. Moody
Note added in proof: A similar prediction method has recently been reported by Hopp &
Woods (1981). Proc. Nat. Acad. Sci., U.S.A. 78, 3824-3828.