Intro To Protein Folding
arXiv:0705.1845v1 [physics.bio-ph] 13 May 2007
Introduction to protein folding for physicists
Pablo Echenique
and π_μ, respectively, with μ = 1, …, N) and those belonging to the water molecules (denoted by X_m and Π_m, with m = N + 1, …, N + N_w). The whole set of microscopic states shall be called phase space and denoted by Ω = Ω_p × Ω_w, explicitly indicating that it is formed as the direct product of the protein phase space Ω_p and the water molecules one, Ω_w.
The central physical object that determines the time behaviour of the system is the Hamiltonian (or energy) function

H(x_\mu, X_m, \pi_\mu, \Pi_m) = \sum_\mu \frac{\pi_\mu^2}{2 M_\mu} + \sum_m \frac{\Pi_m^2}{2 M_m} + V(x_\mu, X_m) ,    (4.1)

where M_μ and M_m denote the atomic masses and V(x_μ, X_m) is the potential energy.
After equilibrium has been attained at temperature T, the microscopic details of the time trajectories can be forgotten and the average behaviour can be described by the laws of statistical mechanics. In the canonical ensemble, the partition function [116] of the system, which is the basic object from which the rest of the relevant thermodynamic quantities may be extracted, is given by
Z = \frac{1}{h^{N+N_w} N_w!} \int_\Omega \exp\left[ -\beta H(x_\mu, X_m, \pi_\mu, \Pi_m) \right] dx_\mu \, dX_m \, d\pi_\mu \, d\Pi_m ,    (4.2)
25. At this point of the discussion, the possible presence of non-zero ionic strength is considered to be a secondary issue.
26. Although non-relativistic quantum mechanics may be considered a much more precise theory with which to study the problem, the computer simulation of the dynamics of a system with so many particles using a quantum mechanical description lies far in the future. Nevertheless, this more fundamental theory can be used to design better classical potential energy functions (which is one of the main long-term goals of the research performed in our group).
27. Sometimes, the term Cartesian is used instead of Euclidean. Here, we prefer to use the latter, since it additionally implies the existence of a mass metric tensor that is proportional to the identity matrix, whereas the Cartesian label only asks the n-tuples in the set of coordinates to be bijective with the abstract points of the space [117].
where h is Planck's constant, we adhere to the standard notation β := 1/RT (per-mole energy units are used all throughout this work, so R is preferred over k_B), and N_w! is a combinatorial factor that accounts for the quantum indistinguishability of the N_w water molecules. Additionally, as we have anticipated, the multiplicative factor outside the integral sign is a constant that divides out in any observable averages and represents just a change of reference in the Helmholtz free energy. Therefore, we will drop it from the previous expression, and the notation Z will be kept for convenience.
Next, since the principal interest lies in the conformational behaviour of the polypeptide chain, and seeking to develop clearer images and, if possible, to reduce the computational demands, the water coordinates and momenta are customarily averaged (or integrated) out [115, 118], leaving an effective Hamiltonian H_eff(x_μ, π_μ; T) whose potential-energy part W(x_μ; T) is called potential of mean force or effective potential energy.

This effective Hamiltonian may be either empirically designed from scratch (which is the common practice in the classical force fields typically used to perform molecular dynamics simulations [106, 107, 119-128]) or obtained from the more fundamental, original Hamiltonian H(x_μ, X_m, π_μ, Π_m) by actually performing the averaging-out process. In statistical mechanics, the theoretical steps that must be followed if one chooses this second option are very straightforward (at least formally):
The integration over the water momenta Π_m in equation (4.2) yields a T-dependent factor that includes the masses M_m and that shall be dropped by the same considerations stated above. On the other hand, the integration over the water coordinates X_m is not so trivial and, except in the case of very simple potentials, it can only be performed formally. To do this, we define the potential of mean force or effective potential energy by

W(x_\mu; T) := -RT \ln \int \exp\left[ -\beta V(x_\mu, X_m) \right] dX_m ,    (4.3)
and simply rewrite Z as

Z = \int \exp\left[ -\beta H_{\mathrm{eff}}(x_\mu, \pi_\mu; T) \right] dx_\mu \, d\pi_\mu ,    (4.4)

with the effective Hamiltonian being

H_{\mathrm{eff}}(x_\mu, \pi_\mu; T) = \sum_\mu \frac{\pi_\mu^2}{2 M_\mu} + W(x_\mu; T) .    (4.5)

At this point, the protein momenta π_μ may also be integrated out, yielding another T-dependent factor that we drop by the same considerations as above, so that

Z = \int_\Omega \exp\left[ -\beta W(x_\mu; T) \right] dx_\mu ,    (4.6)

where Ω now denotes the positions part of the protein phase space.
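The averaging-out in equation (4.3) is easy to sketch numerically. The following minimal Python illustration uses a toy model with one "protein" coordinate and one "solvent" coordinate; the potential, the constants and every name in it are our own illustrative assumptions, not anything defined in the text. Because the toy model is Gaussian in the solvent coordinate, the quadrature can be checked against the exact analytic result.

```python
import numpy as np

# Sketch of eq. (4.3): the potential of mean force W(x;T) obtained by
# integrating a "solvent" coordinate X out of a toy microscopic potential.
# Model, constants and names are our own illustrative choices.
R = 8.314e-3          # kJ/(mol K); per-mole units, as the text prefers R over k_B
T = 300.0             # K
beta = 1.0 / (R * T)  # the standard notation beta := 1/RT
k, c = 2.0, 0.7       # coupling constants of the toy model

def V(x, X):
    """Microscopic potential: protein coordinate x plus one solvent coordinate X."""
    return 0.5 * x**2 + 0.5 * k * (X - c * x)**2

def W_numeric(x, n=4001, L=50.0):
    """W(x;T) = -RT ln (integral of exp(-beta V(x,X)) over X), by quadrature."""
    X = np.linspace(-L, L, n)
    dX = X[1] - X[0]
    return -R * T * np.log(np.sum(np.exp(-beta * V(x, X))) * dX)

# For this Gaussian toy model the X-integral is analytic:
#   W(x;T) = 0.5*x**2 - RT*ln(sqrt(2*pi/(beta*k)))
x0 = 1.3
W_exact = 0.5 * x0**2 - R * T * np.log(np.sqrt(2.0 * np.pi / (beta * k)))
print(W_numeric(x0), W_exact)   # the two values agree
```

Note that the additive, T-dependent constant coming from the Gaussian integral is exactly the kind of term that, as discussed above, can be dropped without affecting any observable average.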
Some remarks are in order at this point. On the one hand, if one further assumes that the original potential energy V(x_μ, X_m) separates as a sum of intra-protein, intra-water and water-protein interaction terms, the effective potential energy W(x_μ; T) may be split into the intra-protein potential plus a solvation term. On the other hand, the dependence of W(x_μ; T) on the temperature T (see equation (4.3)), and the associated fact that it contains the entropy of the water molecules, justifies its alternative denomination of internal or effective free energy, and also the suggestive notation F(x_μ) := W(x_μ; T) used in some works [130]. Here, however, we prefer to save the name free energy for the one that contains some amount of protein conformational entropy and that may be assigned to finite subsets (states) of the conformational space of the chain (see equation (4.10) and the discussion below).
Finally, we will stick to the notational practice of dropping (but remembering) the temperature T from W and H_eff. This is consistent with the situation of constant T that we wish to investigate, and it is also very natural and common in the literature. In fact, most Hamiltonian functions (and their respective potentials) that are considered fundamental actually come from averaging out degrees of freedom more microscopic than the ones regarded as relevant, and, as a result, the coupling constants contained in them are not really constant, but dependent on the temperature T.
Now, from the probability density function (PDF) in the protein conformational space Ω, given by

p(x_\mu) = \frac{\exp\left[ -\beta W(x_\mu; T) \right]}{Z} ,    (4.7)

we can tell that W(x_μ; T) completely determines the equilibrium conformational preferences of the chain. Moreover, if a finite subset (a state) Ω_i ⊂ Ω is defined, its partition function reads

Z_i := \int_{\Omega_i} \exp\left[ -\beta W(x_\mu; T) \right] dx_\mu ,    (4.8)

so that the probability of Ω_i is given by

P_i := \frac{Z_i}{Z} .    (4.9)
The Helmholtz free energy F_i of this state is

F_i := -RT \ln Z_i ,    (4.10)

and the following relation for the free energy differences is satisfied:

\Delta F_{ij} = F_j - F_i = -RT \ln \frac{Z_j}{Z_i} = -RT \ln \frac{P_j}{P_i} = -RT \ln \frac{[j]}{[i]} = -RT \ln K_{ij} ,    (4.11)
where [i] denotes the concentration (in chemical jargon) of the species i, and K_ij is the reaction constant (again borrowing images from chemistry) of the i ⇌ j equilibrium. It is precisely this dependence on the concentrations, together with the approximate equivalence between ΔF and ΔG at physiological conditions (where the term PΔV is negligible [115]), that renders equation (4.11) very useful and ultimately justifies this point of view based on states, since it relates the quantity that describes protein stability and may be estimated theoretically (the folding free energy at constant temperature and constant pressure, ΔG_fold := G_N − G_U) with the observables that are commonly measured in the laboratory (the concentrations [N] and [U] of the native and unfolded states) [24, 54, 131].
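The chain of equalities in (4.11) can be verified numerically for a toy one-dimensional potential of mean force. In the sketch below, the tilted double well, the state boundaries and all the names are our own illustrative assumptions; the point is only that the free energy difference computed from the partition functions Z_i coincides with the one computed from the probabilities (the "concentrations").

```python
import numpy as np

# Sketch of eqs. (4.6)-(4.11): two finite "states" in a toy 1-D potential of
# mean force, with the free energy difference computed two equivalent ways.
R, T = 8.314e-3, 300.0          # kJ/(mol K), K
beta = 1.0 / (R * T)

def W(x):
    # tilted double well (our own choice); minima near x = -1 and x = +1,
    # the left one being deeper because of the linear tilt
    return 2.0 * (x**2 - 1.0)**2 + 0.8 * x

x = np.linspace(-4.0, 4.0, 200001)
dx = x[1] - x[0]
boltz = np.exp(-beta * W(x))

Z  = np.sum(boltz) * dx              # eq. (4.6), whole conformational space
Z1 = np.sum(boltz[x < 0.0]) * dx     # eq. (4.8), state 1: the basin x < 0
Z2 = np.sum(boltz[x >= 0.0]) * dx    # eq. (4.8), state 2: the basin x >= 0
P1, P2 = Z1 / Z, Z2 / Z              # eq. (4.9)

dF_Z = -R * T * np.log(Z2 / Z1)      # eq. (4.11) via partition functions
dF_P = -R * T * np.log(P2 / P1)      # eq. (4.11) via probabilities; identical
print(P1 + P2, dF_Z, dF_P)           # P1 + P2 = 1; the tilt favours state 1 (dF > 0)
```

Cutting the space at the barrier top is itself an arbitrary choice of partition, exactly in the spirit of the finite states discussed in appendix A.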
The next step in developing this state-centred formalism is to define the microscopic PDF in Ω_i as the original one in equation (4.7), conditioned on the knowledge that the conformation x_μ lies in Ω_i:

p_i(x_\mu) := p(x_\mu \,|\, x_\mu \in \Omega_i) = \frac{p(x_\mu)}{P_i} = \frac{\exp\left[ -\beta W(x_\mu; T) \right]}{Z_i} .    (4.12)
Now, using this probability measure in Ω_i, we may calculate the internal energy U_i as the average potential energy in this state:

U_i := \langle W \rangle_i = \int_{\Omega_i} W(x_\mu) \, p_i(x_\mu) \, dx_\mu ,    (4.13)
and also define the entropy of Ω_i as

S_i := -R \int_{\Omega_i} p_i(x_\mu) \ln p_i(x_\mu) \, dx_\mu .    (4.14)
Finally, ending our statistical mechanics reminder, one can show that the natural thermodynamic relation among the different state functions is recovered:

\Delta F_{ij} = \Delta U_{ij} - T \Delta S_{ij} \; , \qquad \Delta G_{ij} = \Delta H_{ij} - T \Delta S_{ij} ,    (4.15)

where H is the enthalpy, whose differences ΔH_ij may be approximated by ΔU_ij, again neglecting the term PΔV.
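The definitions (4.8), (4.10), (4.12)-(4.14) make the relation in (4.15) easy to verify numerically for a single state, in the form F_i = U_i − T S_i. The sketch below uses a toy one-dimensional effective potential; the potential and the state boundaries are our own illustrative choices.

```python
import numpy as np

# Numerical check of the thermodynamic relation recovered in eq. (4.15),
# in the one-state form F_i = U_i - T*S_i, from the definitions
# (4.8), (4.10), (4.12), (4.13) and (4.14). Toy potential, our own choice.
R, T = 8.314e-3, 300.0
beta = 1.0 / (R * T)

def W(x):
    return (x**2 - 1.0)**2

x = np.linspace(0.0, 3.0, 100001)        # the state Omega_i: the basin (0, 3)
dx = x[1] - x[0]
boltz = np.exp(-beta * W(x))

Zi = np.sum(boltz) * dx                  # eq. (4.8), state partition function
Fi = -R * T * np.log(Zi)                 # eq. (4.10), Helmholtz free energy
pi = boltz / Zi                          # eq. (4.12), microscopic PDF in the state
Ui = np.sum(W(x) * pi) * dx              # eq. (4.13), internal energy
Si = -R * np.sum(pi * np.log(pi)) * dx   # eq. (4.14), entropy of the state
print(Fi, Ui - T * Si)                   # the two numbers coincide
```

The agreement is exact (to machine precision) because, substituting ln p_i = −βW − ln Z_i into (4.14), the potential-energy terms cancel and U_i − T S_i reduces to −RT ln Z_i, which is F_i.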
Retaking the discussion about the mechanisms of protein folding, we see (again) in equation (4.7) that the potential of mean force W(x_μ) completely determines the conformational preferences of the polypeptide chain in thermodynamic equilibrium. Nevertheless, it is often useful to investigate the underlying microscopic dynamics as well. The effective potential energy W(x_μ) in equation (4.3) has been simply obtained in the previous paragraphs using the tools of statistical mechanics; the dynamical averaging out of the solvent degrees of freedom in order to describe the time evolution of the protein subsystem, on the other hand, is a much more complicated (and certainly different) task [132-136]. However, if the relaxation of the solvent is fast compared to the motion of the polypeptide chain, the function W(x_μ) may also be regarded as the effective potential governing the dynamics. A central question is then whether the native state corresponds to the global minimum of the effective potential energy W(x_μ) of the protein (in which case the folding process is said to be thermodynamically controlled) or is just the lowest-lying kinetically accessible local minimum (in which case we talk about kinetic control) [115]. This question was raised by Anfinsen [76], who assumed the first case to be the correct answer and called the assumption the thermodynamic hypothesis. Although Levinthal pointed out a few years later that this was not necessary and that kinetic control was perfectly possible [140], and also despite some indications against it [151, 152], it is now widely accepted that the thermodynamic hypothesis is fulfilled most of the time, and almost always for small single-domain proteins [24, 77, 81, 115]. Of course, nothing fundamental changes in the overall picture if the energy landscape is funneled towards a local minimum of W(x_μ) instead of the global one. In the first case, the native structure may be found by direct minimization (for example, using simulated annealing [153, 154] or similar schemes), whereas, if the thermodynamic hypothesis is broken, the native structure may still be found by performing molecular dynamics simulations, but minimization procedures could be misleading and technically problematic³⁰. This is so because, although local minima may also be found and described, knowing towards which one of them the protein trajectories converge depends on kinetic information, which is absent from typical minimization algorithms.
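The kind of minimization scheme mentioned above can be sketched in a few lines. Below is a minimal simulated-annealing example on a rugged one-dimensional toy landscape; the landscape, the cooling schedule and all the parameters are our own illustrative assumptions, not a real protein energy function.

```python
import math, random

# Minimal simulated-annealing sketch [153, 154] on a rugged 1-D toy landscape:
# a smooth "funnel" towards x = 2 plus cosine ruggedness (many local minima).
# All choices below are our own illustrative assumptions.
random.seed(0)

def W(x):
    return 0.3 * (x - 2.0)**2 + 0.5 * math.cos(5.0 * x)

x = -4.0                            # start far from the global minimum
for step in range(20000):
    T = 2.0 * 0.9995**step          # slow geometric cooling schedule
    x_new = x + random.gauss(0.0, 0.3)
    dW = W(x_new) - W(x)
    # Metropolis criterion: always accept downhill moves; accept uphill moves
    # with probability exp(-dW/T), so the early (hot) stages can cross the
    # barriers between local minima that would trap a plain descent
    if dW <= 0.0 or random.random() < math.exp(-dW / T):
        x = x_new
print(x, W(x))   # with slow enough cooling this typically lands near x = 1.9
```

On a genuinely non-funneled (glassy) landscape, no practical cooling schedule is slow enough, which is precisely the problem discussed in the text.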
Now, even though a funneled energy function provides the only consistent image that accounts for all the experimental facts about protein folding, one must still explain why the landscape is like that. If one looks at a protein as if for the first time, one sees that it is a heteropolymer made up of twenty different types of amino acid monomers (see section 2). Such a system, due to its many degrees of freedom, the constraints imposed by chain connectivity and the different affinities that the monomers show for their neighbours and for the environment, presents a large degree of frustration; that is, there is not a single conformation of the chain which optimizes all the interactions at the same time³¹. For the vast majority of sequences, this

30. See appendix A for some technical but relevant remarks about the minimization of the effective potential energy function.
31. In order to be entitled to give such a simple definition, we need the effective potential energy of the system to separate as a sum of terms with minima at different points (either because it is split into few-body terms, or because it is split into different types of interactions,
would lead to a rugged energy landscape with many low-energy states, high barriers, strong traps, etc.; up to a certain degree, a landscape similar to that of spin glasses. A landscape in which fast folding to a unique three-dimensional structure is impossible!
However, a protein is not a random heteropolymer. Its sequence has been selected and improved along thousands of millions of years by natural selection³², and the score function that decided the contest, the fitness that drove the process, is precisely its ability to fold into a well-defined native structure in a biologically reasonable time³³. Hence, the energy landscape of a protein is not like that of the majority of heteropolymers: proteins are a selected minority of heteropolymers for which there exists a privileged structure (the native one) such that, at every point of the conformational space, it is more stabilizing, on average, to form native contacts than to form non-native ones (an image radically implemented by Go-type models [156]). Bryngelson and Wolynes [146] have termed this property of having fewer conflicting interactions than typically expected the principle of minimal frustration, and it takes us to a natural definition of a protein (as opposed to a general polypeptide): a protein is a polypeptide chain whose sequence has been naturally selected to satisfy the principle of minimal frustration.
Now, we should note that this funneled shape emerges from a very delicate balance. Proteins are only marginally stable in solution, with an unfolding free energy ΔG_unfold typically in the 5-15 kcal/mol range. However, if we split this relatively small value into its enthalpic and entropic contributions, using equation (4.15) and the already mentioned fact that the term PΔV is negligible at physiological conditions [115],

\Delta G_{\mathrm{unfold}} = \Delta H_{\mathrm{unfold}} - T \Delta S_{\mathrm{unfold}} ,    (4.16)

we find that it is made up of the difference between two quantities (ΔH_unfold and TΔS_unfold) that are typically an order of magnitude larger than ΔG_unfold itself [115, 157]; i.e., the native state is enthalpically favoured by hundreds of kilocalories per mole and entropically penalized by approximately the same amount. In addition, both quantities are strongly dependent on the details of the effective potential energy W(x_μ) (on which, in turn, the funneled shape of the landscape implicitly depends).
For the same reasons, if the folding process is to be simulated theoretically, the chances of missing the native state and (what is even worse) of producing a non-funneled landscape, which is very difficult to explore using conventional molecular dynamics or minimization algorithms, are very high if poor energy functions are used [144, 158, 159]. Therefore, it is not surprising that current force fields [106, 107, 119-128], which include a number of strong assumptions (additivity of the interactions, mostly pairwise terms, simple functional forms, etc.), are widely recognized to be incapable of folding proteins [24, 86, 100, 102, 160-163].

The improvement of the effective potential energy functions describing polypeptides, with the long-term goal of reliable ab initio folding, is one of the main objectives pursued in our group, and probably one of the central issues that must be solved before the wider framework of the protein folding problem can be tackled. The enormous mathematical and computational complexity that the study of these topics entails renders the incorporation of the physicists' community essential for future advances in molecular biology. That the boundaries of what is normally considered physics are expanding is obvious, and the investigation of the behaviour of biological macromolecules is a very appealing part of the new territory to explore.
Acknowledgments
I wish to thank J. L. Alonso, J. Sancho and I. Calvo for illuminating discussions and for their invaluable help in performing the transition mentioned in the title of this work.

This work has been supported by the research projects E24/3 and PM048 (Aragón Government), MEC (Spain) FIS2006-12781-C02-01 and MCyT (Spain) FIS2004-05073-C04-01. P. Echenique is supported by a BIFI research contract.
A Probability density functions
Let us define a stochastic or random variable³⁴ as a pair (X, p), with X a subset of R^n for some n, and p a function that takes n-tuples x ≡ (x_1, …, x_n) ∈ X to positive real numbers,

p : X → [0, ∞)
x ↦ p(x) .

34. See Van Kampen [164] for a more complete introduction to probability theory.

Then, X is called range, sample space or phase space, and p is termed probability distribution or probability density function (PDF). The phase space can be
discrete, a case with which we shall not deal here, or continuous, so that p(x) dx (with dx := dx_1 ⋯ dx_n) represents the probability of occurrence of some n-tuple in the set defined by (x, x + dx) := (x_1, x_1 + dx_1) × ⋯ × (x_n, x_n + dx_n), and the following normalization condition is satisfied:

\int_X p(x) \, dx = 1 .    (A.1)
It is precisely in the continuous case that the interpretation of the function p(x) alone is a bit problematic, and playing intuitively with the concepts derived from it becomes dangerous. On one side, it is obvious that p(x) is not the probability of the value x happening, since the probability of any specific point in a continuous space must be zero (what is the probability of selecting a random number between 3 and 4 and obtaining exactly π?). In fact, the correct way of using p(x) to assign probabilities to the n-tuples in X is to multiply it by differentials and say that it is the probability that any point in a differentially small interval occurs (as we have done in the paragraph above equation (A.1)). The reason for this may be expressed in many ways: one may say that p(x) is an object that only makes sense under an integral sign (like a Dirac delta), or one may realize that only probabilities of finite subsets of X can have any meaning. In fact, it is this last statement that focuses the attention on the fact that, if we decide to reparameterize X and perform a change of variables x′(x), what should not change are the integrals over finite subsets of X; therefore, p(x) cannot transform as a scalar quantity (i.e., satisfying p′(x′) = p(x(x′))), but according to a different rule.
If we denote the Jacobian matrix of the change of variables by ∂x/∂x′, we must have that

p'(x') = \left| \det \frac{\partial x}{\partial x'} \right| \, p(x(x')) ,    (A.2)

so that, for any finite set Y ⊂ X (with its image under the transformation denoted by Y′), and indicating the probability of a set with a capital P, we have the necessary property

P(Y) := \int_Y p(x) \, dx = \int_{Y'} p'(x') \, dx' =: P'(Y') .    (A.3)
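Equations (A.2) and (A.3) can be checked numerically. The sketch below uses the same change of variables as in the worked example that follows (x = x′², figure A.1), with the underlying PDF recovered from equation (A.5); the finite test set Y is our own arbitrary choice.

```python
import numpy as np

# Numerical illustration of eqs. (A.2)-(A.3): the density transforms with the
# Jacobian, while the probability of a finite set is invariant.
a = 1.0
p = lambda x: (6.0 / a**3) * x * (a - x)   # a PDF on X = (0, a)

# change of variables x = x'**2, so dx/dx' = 2x'; eq. (A.2) gives
p_prime = lambda xp: 2.0 * xp * p(xp**2)

# probability of the finite set Y = (0.2, 0.5), computed in both coordinates
x  = np.linspace(0.2, 0.5, 100001)
xp = np.linspace(np.sqrt(0.2), np.sqrt(0.5), 100001)
P_Y  = np.sum(p(x)) * (x[1] - x[0])            # integral of p  over Y
P_Yp = np.sum(p_prime(xp)) * (xp[1] - xp[0])   # integral of p' over Y'
print(P_Y, P_Yp)   # both equal 0.396: eq. (A.3) holds
```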
All in all, the object that has meaning content is P and not p. If one needs to talk about things such as the most probable regions, or the most probable states, or the most probable points, or if one needs to compare in any other way the relative probabilities of different parts of the phase space X, an arbitrary partition of X into finite subsets (X_1, …, X_i, …) must be defined³⁵. These X_i should be considered more useful states than the individual points x ∈ X, and their probabilities P(X_i), contrary to p(x), do not depend on the coordinates chosen.

35. Two additional reasonable properties should be asked of such a partition: (i) the sets in it must be exclusive, i.e., X_i ∩ X_j = ∅ for i ≠ j, and (ii) they must fill the phase space, ∪_i X_i = X.
[Figure A.1: probability density functions p(x) and p′(x′(x)) of the example below, plotted against x; the maximum of p lies at x = a/2, while that of p′ projects onto the different point x = 3a/5.]

For example, consider the PDF

p(x) = \frac{6}{a^3} \, x \, (a - x) ,    (A.4)

defined on X = (0, a). Its maximum lies at x = a/2, which one may (incorrectly) be tempted to call the most probable value of x. Next, we may change to a new coordinate x′, whose relation to x is, say, x = x′², and find the PDF in terms of x′ using equation (A.2):

p'(x') = \frac{12}{a^3} \, x'^3 \, (a - x'^2) .    (A.5)
Now, insisting on the mistake, we may find the maximum of p′(x′), which lies at x′ = (3a/5)^{1/2} (see figure A.1), and declare it the most probable value of x′. However, the point x′ = (3a/5)^{1/2} corresponds to x = x′² = 3a/5 and, certainly, it is not possible that x = a/2 and x = 3a/5 are the most probable values of x at the same time!

To sum up, only finite regions of continuous phase spaces can be termed states and meaningfully assigned a probability that does not depend on the coordinates chosen. In order to do that, an arbitrary partition of the phase space must be defined.
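The coordinate dependence of the "most probable point" is easy to reproduce numerically. The sketch below uses the example's PDF (recovered from equation (A.5) and the change x = x′²) and locates both maxima on a fine grid; all names in it are our own.

```python
import numpy as np

# The "most probable point" is not invariant under a change of variables:
# p(x) proportional to x(a-x) on (0,a), and the change x = x'**2.
a = 1.0
p  = lambda x:  (6.0 / a**3) * x * (a - x)            # density in x
pp = lambda xp: (12.0 / a**3) * xp**3 * (a - xp**2)   # eq. (A.5), density in x'

x  = np.linspace(0.0, a, 1000001)
xp = np.linspace(0.0, np.sqrt(a), 1000001)

x_max  = x[np.argmax(p(x))]       # maximum of p:  x  = a/2
xp_max = xp[np.argmax(pp(xp))]    # maximum of p': x' = (3a/5)**0.5
print(x_max, xp_max**2)           # 0.5 versus 0.6: two different "most probable" x
```

The probability of any finite set, on the other hand, comes out the same in both coordinate systems, which is exactly the point of equation (A.3).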
Far from being an academic remark, this is relevant in the study of the equilibrium of proteins, where, very commonly, Anfinsen's thermodynamic hypothesis is invoked (see section 4). Loosely speaking, it says that the functional native state of proteins lies at the minimum of the effective potential energy (i.e., at the maximum of the associated Boltzmann PDF, proportional to e^{−βW}, in equation (4.7)); but, according to the properties of PDFs described in the previous paragraphs, much more qualifying is needed.
First, one must note that all complications arise from the choice of integrating out the momenta (for example, in equation (4.6)) in order to describe the equilibrium distribution of the system with a PDF that depends only on the potential energy. If the momenta were kept and the PDF expressed in terms of the complete Hamiltonian, as p(q^μ, π_μ) = e^{−βH(q^μ, π_μ)}/Z, then it would be invariant under canonical changes of coordinates (which are the physically allowed ones), since the Jacobian determinant that appears in equation (A.2) equals unity in such a case. If we now look, using this complete description in terms of H, for the most probable point (q^μ, π_μ), we find that it has all momenta set to zero (since the kinetic energy is a positive definite quadratic form in the π_μ) and its positions at the minimum of the potential energy, denoted by q^μ_min. If we now perform a point transformation, which is a particular case of the larger group of canonical transformations [165],

q'^\mu = q'^\mu(q^\nu) \qquad \text{and} \qquad \pi'_\mu = \frac{\partial q^\nu}{\partial q'^\mu} \, \pi_\nu ,    (A.6)

the most probable point in the new coordinates turns out to be the same one, i.e., the point (q'^μ, π'_μ) = (q'^μ(q_min), 0), and all the insights about the problem are consistent.
However, if one decides to integrate out the momenta, the marginal PDF on the positions that remains has a more complicated meaning than the joint one on the whole phase space, and it lacks the reasonable properties discussed above. The central issue is that the marginal PDF assigns to each differential region (q^μ, q^μ + dq^μ) a probability that depends on the determinant of the mass metric tensor, which, apart from the potential energy, also enters the coordinate PDF.

If, despite these inconveniences, the description in terms of only the positions q^μ