Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Protein flexibility predictions using graph theory

2001, … : Structure, Function, and …

PROTEINS: Structure, Function, and Genetics 44:150 –165 (2001) Protein Flexibility Predictions Using Graph Theory Donald J. Jacobs,1 A.J. Rader,1 Leslie A. Kuhn,2* and M.F. Thorpe1 Department of Physics and Astronomy, Michigan State University, East Lansing, Michigan 2 Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 1 ABSTRACT Techniques from graph theory are applied to analyze the bond networks in proteins and identify the flexible and rigid regions. The bond network consists of distance constraints defined by the covalent and hydrogen bonds and salt bridges in the protein, identified by geometric and energetic criteria. We use an algorithm that counts the degrees of freedom within this constraint network and that identifies all the rigid and flexible substructures in the protein, including overconstrained regions (with more crosslinking bonds than are needed to rigidify the region) and underconstrained or flexible regions, in which dihedral bond rotations can occur. The number of extra constraints or remaining degrees of bond-rotational freedom within a substructure quantifies its relative rigidity/flexibility and provides a flexibility index for each bond in the structure. This novel computational procedure, first used in the analysis of glassy materials, is approximately a million times faster than molecular dynamics simulations and captures the essential conformational flexibility of the protein main and side-chains from analysis of a single, static threedimensional structure. This approach is demonstrated by comparison with experimental measures of flexibility for three proteins in which hinge and loop motion are essential for biological function: HIV protease, adenylate kinase, and dihydrofolate reductase. Proteins 2001;44:150 –165. © 2001 Wiley-Liss, Inc. Key words: conformational change; mobility and dynamics; coupled/collective motions; hydrogen-bond networks; distance constraints; dihedral angle constraints and rotations; structural stability; dihydrofolate reductase; adenylate kinase INTRODUCTION More than 15,000 protein structures have been determined to date, using X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.1 Such structures are deduced from sophisticated refinement procedures in conjunction with stereochemical modeling.2 Proteins can be described as a collection of stable fragments,3 ranging in size from a small number of residues to an entire domain. Different packings of the protein molecules in alternative crystal forms can trap the protein in different conformational states, providing snapshots of some of the conformations accessible to the protein.3 NMR spectros© 2001 WILEY-LISS, INC. copy can also indicate dynamic regions of proteins by showing that several conformations are consistent with the experimental constraints, typically inter-proton distances.4 Other computational procedures have been developed5–10 to characterize the intrinsic flexibility and rigidity within a protein. These procedures fall within three classes. One approach compares different conformational states, e.g., from different crystallographic or NMR conformations observed for the protein, to deduce which regions in the protein are flexible or rigid.5–7 A second approach focuses on simulating a protein’s motion, by means of molecular dynamics calculations, using a forcefield describing interatomic potentials.11–14 The third class focuses on identifying rigid protein domains or flexible hinge joints8 –10,15,16 based on a single conformation. Each method has its limitations. In the first class, the methods are limited by the diversity of the conformational states that are available from experiment for comparison, whereas the second class is limited by the computational time involved, meaning that large motions of the protein are usually not sampled. The third class, to which our approach belongs, defines the rigid and flexible regions in the protein from a single conformation and can be used as the starting point to sample a range of motions. These methods are generally computationally fast and can provide a starting point for more efficient molecular dynamics or Monte Carlo sampling of protein conformations, as the number of degrees of freedom in the protein is reduced significantly by defining which regions are rigid. A significant question is, as always: to what extent do such methods for defining rigid/flexible regions correlate with what is observed experimentally? The bonds within a protein can be represented as a network in which the covalent forces and strong hydrogen bonds are modeled as distance constraints between atoms. The mechanical stability of the corresponding bond net- Grant sponsor: National Science Foundation; Grant number: DMR-96 32182; Grant number: DBI-96 00831; Grant sponsor: National Institutes of Health; Grant number: R43 GM58337-01; Grant sponsor: Michigan State University; Grant sponsor: American Heart Association; Grant number: 9940091N. Donald J. Jacobs is currently at the Department of Physics and Astronomy, California State University Northridge, Northridge, CA 91330. Leslie A. Kuhn and M.F. Thorpe are contributing authors. *Correspondence to: Leslie A. Kuhn, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI. E-mail: kuhn@agua.bch.msu.edu. Received 23 September 2000; Accepted 15 March 2001 Published online 00 Month 2001 PROTEIN FLEXIBILITY PREDICTIONS USING GRAPH THEORY work of a protein can then be analyzed using new graph theoretical techniques that were originally developed for analyzing the rigidity of substructures within covalent network glasses.17–19 The computer implementation20 of this procedure is referred to as Floppy Inclusion and Rigid Substructure Topography (FIRST), which provides a realtime tool for evaluating the intrinsic flexibility within a protein. FIRST gives the precise mechanical properties of a protein structure under a given set of constraints. This approach defines not only the rigid regions in a protein, but also those regions that move collectively (whose motions are coupled), as well as those that move independently of other regions in the structure. Furthermore, the relative flexibility or rigidity of each region is quantified, based on the density of bonds remaining rotatable in each flexible region. This article applies this approach to defining the flexible regions and their relationship to ligand binding in three diverse proteins. METHODS Both bonding and nonbonding forces play an important role in determining the structure of a protein and the dynamics about the native fold. The nonbonding forces are both short- and long-range. Although both hydrophobic interactions and van der Waals forces, as well as longrange electrostatic forces, play an important role in stabilizing a protein structure in the native state,21 these forces are generally weak between pairs of atoms and not highly directional, and thus are not represented as interatomic distance constraints. Strong bonding forces, such as covalent and hydrogen bonds, tend to be directional and are modeled in the present study as distance and angle constraints. Constraints The covalent bonding within the protein resulting from bond-stretching (central), bond-bending, and torsional forces defines a natural set of interactions that can be modeled as distance constraints. It is common practice to represent the degrees of freedom accessible to a protein by fixing the covalent bond lengths and associated bond angles, while allowing the dihedral angles to rotate. The torsional forces associated with peptide bonds and the other partial-double and double bonds in proteins effectively prevent dihedral rotation about the bond and are also represented as constraints. Using the rotatable dihedral angles as a set of internal coordinates, the number of degrees of freedom to describe the flexibility of a protein is typically reduced by a factor of about 7 relative to a Cartesian representation.22 In addition to covalent bonding forces, hydrogen bonds (Fig. 1) have high directional dependence and act over short distances, in contrast to the hydrophobic forces, which are less specific and may be regarded as slippery. By this, we mean that the energy associated with the hydrophobic forces does not change significantly for gentle conformational shifts away from the native state. It is reasonable to expect that the buried hydrogen bonds will be substantially maintained as the protein undergoes 151 Fig. 1. Model of a hydrogen bond, involving donor and acceptor atoms, shown as nitrogen and oxygen, respectively. Covalent bonds are shown as thin black lines. The hydrogen bond is modeled as three distance constraints, consisting of a nearest-neighbor central-force constraint shown as a thick solid line (top center), and two next-nearestneighbor bond-bending force constraints (constraining the donor– hydrogen–acceptor angle), shown as dashed lines. Each hydrogen bond is also associated with three a priori rotatable dihedral angles, indicated by the arrows. conformational changes near its native structure. Recent results by Lu and Schulten23 suggest that the breaking of a hydrogen bond occurs as a well-defined event, involving going over an energy barrier, as opposed to a continuous stretching until a feeble final breaking occurs. These investigators suggest that hydrogen bonds typically break one by one as the protein unfolds, while in some cases a consortium of hydrogen bonds break simultaneously, giving a much higher effective barrier. Rigidity in Networks Our graph-theory algorithm20,24 for analyzing proteins is a three-dimensional (3-D) extension and implementation of results in mathematical rigidity theory that have developed over the past few years. The roots of this work go back to Lagrange’s25 introduction of constraints on the motion of mechanical systems during the late eighteenth century, which Maxwell26 used during the late nineteenth century to determine whether structures were stable or deformable. The applications of this type of work have traditionally been to solve problems in engineering, such as the structural stability of different truss configurations in bridges. A very significant advance occurred with Laman’s theorem27 in 1970, which precisely determines the degrees of freedom within two-dimensional (2-D) networks, and allows the rigid regions and flexible joints between them to be found. The most general type of 3-D network for which results can be calculated are the so-called bond-bending networks (or, equivalently, molecular frameworks), in which vertices are connected by edges and in which every angle between edges is defined. A broader class of networks is the bar-joint framework, in which the angles are not specified; analyzing flexibility and rigidity for this case remains a significant unsolved problem in mathematics. For 3-D bond-bending networks, the flexibility in the system derives from dihedral or torsional rotations of the bonds that are not locked in by the network, and this kind of framework is used to represent the constraints within a protein. First, we present a general algorithm using brute-force matrix diagonalization to identify all the rigid clusters in a 152 D.J. JACOBS ET AL. framework consisting of a network of atoms with interconnecting bonds. The fundamental step on which all such calculations are based is the ability to test whether a constraint (bond between atoms) is redundant or independent. A constraint is considered redundant if breaking it has no effect on the flexibility of the network, and independent if its breakage does affect the flexibility. This can be determined using the following procedure: 1. Define a network of atoms and distance constraints between them. 2. Replace each distance constraint by a spring. All spring constants are set to unity in arbitrary units, and the natural length of each spring is set equal to the distance between the associated pair of atoms. 3. Construct a dynamical matrix28 for the spring network, and calculate all the normal mode frequencies (eigenvalues) of this network. The eigenvalues indicate whether the mode of vibration has a finite or zero frequency. 4. Count the number of zero eigenvalues, corresponding to zero-frequency vibrational modes, or floppy modes, of the system. 5. Add to the system the distance constraint being tested for redundancy or independence, and repeat the above steps. 6. If the number of zero eigenvalues remains the same, the added constraint is redundant; otherwise, it is independent. This procedure is repeated, using steps 5 and 6 above, for each constraint in the network that is to be tested for redundancy or independence. Afterwards, in order to identify all the rigid clusters of atoms within the network, called a rigid cluster decomposition, a test constraint is placed between each pair of atoms. Only test constraints found to be redundant are added to the network. Once all pairs of atoms have been tested and all redundant constraints added, the rigid clusters are identified by selecting all atoms that are connected by a contiguous path of face-sharing tetrahedrons, which form a rigid pathway. Tetrahedra are the 3-D generalization of triangles, where a pathway of edge-sharing triangles forms a rigid path in two dimensions. The implementation described above can be used on any type of 3-D framework, but the resulting computational performance scales as O(N2 3 N3), where N is the number of vertices (atoms) in the framework, rendering it useless for application to protein structures with thousands of atoms. Also of interest are the regions that have more distance constraints than are required for rigidity, that is, that have redundant constraints within the rigid clusters. When redundant constraints are present, this creates regions of internal strain. In order to measure the strain, an additional separate numerical algorithm must be used to relax the network and calculate the stress in the springs. These methods, such as conjugate gradient, typically scale as O(N2). Clearly, the distance constraint approach will only be useful in as much as the calculation for identifying rigid clusters and flexible regions is fast and scales linearly or nearly linearly with system size. The two main contributions of the present article are to show how to use concepts from graph rigidity29 to make this type of calculation feasible and then to present applications of this approach to three diverse proteins, showing that their predicted conformational flexibility correlates well with known, biologically relevant flexible regions in these proteins. Generic Rigidity A generic structure is one that has no special symmetries, such as parallel bonds or bond angles of 180°, that could create geometric singularities.30 Under these conditions, the study of rigidity becomes much easier because the properties of network rigidity depend only on the connectivity as determined by an underlying graph. Laman’s theorem27 states how generic rigidity within any 2-D generic framework can be characterized completely by applying constraint counting to all the subgraphs within the framework. Applying Laman’s theorem directly leads to a combinatorial calculation that scales exponentially in the number of atoms within the framework. However, by applying Laman’s theorem recursively, a very fast and efficient integer algorithm (the so-called “pebble game”) has been constructed17,31,32 for identifying rigid clusters and stressed regions. The computational complexity of the pebble game scales in the worst case as O(N2) for pathological networks, where N is the number of vertices or atoms in the network, and scales linearly in practice. The algorithm also scales linearly with N in computer memory. Unfortunately, it has been a longstanding problem that in 3-D generic bar-joint frameworks, constraint counting over the subgraphs is known to fail in general; that is, Laman’s theorem does not simply generalize to 3-D barjoint frameworks. However, for the special class of truss frameworks, in which the bars and joints have extent and the bars are subject to bond-bending angle constraints, Laman’s theorem can be generalized18,24 to three dimensions. In addition, the molecular framework conjecture proposed by Tay and Whiteley33,34 indicates that Laman constraint counting extends to nongeneric molecular models in which all bonds (hinges) of an atom pass through a single central point of the atom. Although the molecular framework conjecture requires a rigorous proof, there are no known exceptions after years of exact testing. Thus, we model the microstructure of a protein using distance constraints that define a bond-bending network and apply the 3-D pebble game to determine precisely the rigid clusters, stressed regions, and internal degrees of freedom in terms of dihedral angles. Pebble Game At the heart of the FIRST computer program designed to analyze protein structures is the 3-D pebble game algorithm, constructed in a very similar way to its 2-D counterpart.17,18,31,32 The pebble game is an implementation of the counting implicit in Laman’s theorem in two dimensions, as well as its 3-D generalization. Here, three pebbles are assigned to each vertex or atom, representing the 3 PROTEIN FLEXIBILITY PREDICTIONS USING GRAPH THEORY 153 be moved to that constraint from one of its vertices. If the new distance constraint is independent, it is then covered by a pebble; otherwise, it is not covered. This process continues until all distance constraints within the network have been placed and tested for independence. The algorithm can be sketched as follows: Fig. 2. Diagrammatic representation of the final pebble covering of a simple network. Free pebbles are placed directly on vertices and denote degrees of freedom (E). Pebbles covering a bond are placed directly on the edges of the graph and represent distance constraints (F). The pebble covering is not unique, because pebbles can be rearranged according to a few simple rules as explained in the text. An example is shown of how an elementary pebble exchange works (two arrows). A free pebble can be moved onto a covered edge (adjacent to a vertex), provided that a corresponding pebble, presently covering the edge and associated with the neighboring vertex, is moved off. degrees of freedom of the atom. Each bond is represented by a distance constraint, or edge. For each independent edge between two vertices, a pebble from one of the two vertices must be used to cover the edge. Pebbles that remain associated with vertices are called free pebbles, and they represent the independent degrees of freedom remaining to this vertex within the framework. Each independent edge thus uses up one independent degree of freedom from a vertex. Then the following covering rule is applied: once an edge is covered by a pebble (forming an independent constraint), it must always remain covered by one of the pebbles associated with its incident vertices. Rearrangement of pebbles throughout the graph is possible, provided that this covering rule is maintained and that three pebbles remain associated with each vertex, as free pebbles or as pebbles associated with edges coming from the vertex. Figure 2 shows a pebble covering for a network. The pebble covering is a convenient way to indicate the degrees of freedom accessible to atoms, and to mark which constraints are independent and which are redundant in the bond network. Essentially the redistribution of pebbles from vertices to bonds, by a series of stepwise pebble exchanges, is an intuitive way of representing a dynamically changing directed graph.24,31 The 3-D pebble game is a recursive algorithm. The framework is built up by adding one distance constraint at a time, until the final framework is complete. To maintain a bond-bending network, each bond-stretching distance constraint having incident vertices v1 and v2 has associated with it angular (i.e., second-nearest-neighbor) constraints about both vertices. In the case of proteins, these angular constraints correspond to the angles of bond coordination about an atom, defined by its chemistry. For each new independent distance constraint introduced, pebbles are rearranged to test whether the new constraint is independent; this is indicated by whether a pebble can 1. Begin with a set of unconstrained vertices, each with three pebbles, representing the three degrees of freedom for each vertex in 3-D space. 2. Place a central-force distance constraint between vertices v1 and v2, thereby building the framework up one distance constraint at a time. 3. Rearrange the pebble covering, if necessary, to collect three pebbles on vertex v1. 4. Rearrange the pebble covering to collect as many pebbles as possible (where three is the maximum) on vertex v2, while holding the three pebbles at vertex v1. 5. If the number of pebbles on vertex v2 is two, the edge is redundant. Otherwise, three pebbles reside at vertices (v1 and v2) and both remain independently mobile. Continue to rearrange the pebble coverings: a. Hold the three pebbles on vertex v1 and the three pebbles on vertex v2. b. For each neighbor of vertex v2, attempt to collect a pebble from a neighboring bond, restoring a pebble to that bond from its other vertex. c. If for any neighbor of vertex v2 a pebble cannot be obtained, the edge joining the two vertices is redundant. d. If the edge is not redundant, cover it with a pebble from vertex v2. Unlike the 2-D pebble game, the distance constraints cannot be placed in random order. The first distance constraint that is introduced must correspond to a centralforce constraint. After each central-force distance constraint is placed, all its associated angular or bond-bending constraints (nextnearest-neighbor distance constraints) must be placed before another central-force constraint can be introduced. Within this restriction, the order of placing either central force or the associated bond-bending constraints is completely arbitrary, and the resulting decomposition of the network into rigid clusters is unique. This restriction on recursively placing constraints is sufficient18,24 for constraint counting to remain valid in characterizing the rigidity of 3-D bond-bending networks within proteins or other structures. Finally, torsional constraints for the peptide and resonant bonds are fixed by third-nearestneighbor distance constraints (e.g., fixing the distance between the amide H and carbonyl O in peptide bonds). These are most conveniently placed after the central and bond-bending distance constraints. After all distance constraints have been placed, the number of free pebbles remaining on the vertices gives the total number of degrees of freedom required to describe the motion of the framework. This includes the six trivial rigid body translational and rotational degrees of freedom of the 154 D.J. JACOBS ET AL. whole network. The free pebbles can be rearranged but are restricted to certain regions because of the pebble-covering rule. For example, no more than six free pebbles can be found within a rigid cluster. Based on the location and number of free pebbles throughout the framework, one can identify overconstrained regions, rigid clusters, and underconstrained regions, as described below. ing that in contrast to the numerical method, the 3-D pebble game can be used on networks with more than 10 million atoms with the security of knowing the results are exact, because the pebble game is an integer counting algorithm, eliminating the possibility of any numerical round-off errors. Therefore, even the largest proteins and protein complexes can be analyzed both precisely and rapidly. Strained Regions A redundant constraint is identified when a failed pebble search occurs. A failed pebble search consists of a set of vertices that have no extra free pebbles to give up. This physically corresponds to placing an additional distance constraint between a pair of atoms that have a predefined fixed distance. Placing a distance constraint between this pair of atoms generally causes a length mismatch and leads this region to become internally strained. Thus, a failed pebble search identifies overconstrained regions. Overconstrained regions always consist of closed loops. Because distance constraints are added to the framework, more overconstrained regions will be found, and generally these regions will overlap. Overlapping, overconstrained regions merge together into a single overconstrained region. As these frameworks are generic, stress will propagate and redistribute throughout the merged overconstrained regions. Redundant bonds reside within overconstrained regions. Therefore, the more redundant bonds that are present within a given rigid region, the more stable that region will be against removal of constraints, e.g., crosslinking salt bridges or hydrogen bonds. Rigid Cluster Decomposition The method used to identify the rigid clusters, including overconstrained regions, is very simple once all edges in the graph are in place and the pebble game is finished. All rigid clusters can at most have six free pebbles distributed over the vertices within the cluster. Therefore, to identify these clusters, select a vertex and two of its bonded nearest-neighbor vertices. Collect three pebbles on the selected vertex and two pebbles and one pebble, respectively, on its two neighboring vertices. Because we are considering bond-bending networks, it is always possible to collect these six pebbles, and never any more. Mark these three vertices. Then, iteratively in a breadth-first search, check all bonded, unmarked nearest neighbors to the current set of marked vertices, to see whether a free pebble can be obtained. If a free pebble cannot be obtained, mark the new vertex, and note that it is part of the same rigid cluster. This method works because all rigid clusters in bond-bending networks are contiguous through bonded nearest neighbors17,20; this point is essential and implicit in the generalization of Laman’s theorem to 3-D bondbending networks. The rigid cluster decomposition using the 3-D pebble game has been compared with the numerical brute force method described earlier. For a variety of generic bondbending networks containing as many as 450 atoms, exact agreement has always been found.35 It is worth mention- Underconstrained Regions Locating rotatable bond dihedral angles in the protein, which correspond to hinge joints in the network, is an easy task after the rigid cluster decomposition is made. Note that a hinge joint can never occur about a next-nearestneighbor distance constraint, which corresponds to a bondbending angle.17 If the two incident vertices of a bondstretching constraint belong to different rigid clusters, a dihedral angle rotation is possible, and the bond is recorded as a hinge joint; otherwise, the dihedral angle motion is locked, as it is part of a rigid cluster. The number of rotable dihedral angles will generally be considerably more than the number of residual internal degrees of freedom in the network. Not all the rotatable dihedral angles associated with hinge joints are independent, as they are part of a ring of bonds (e.g., within loops formed by crosslinking hydrogen bonds). Collective motions consist of coupled dihedral angles within the protein and take place in underconstrained regions. Distinct underconstrained regions are partitioned such that collective motions can occur within one underconstrained region without directly affecting internal coordinates within all the other underconstrained regions. The underconstrained regions are identified by attempting to specify a value for each dihedral angle and determining whether it can be satisfied. Specifying a dihedral angle is equivalent to placing an external torsional constraint to lock in this choice of angle. Independent, externally imposed torsional constraints represent independent degrees of freedom available to the system, while redundant, externally imposed constraints indicate the angle is predetermined as part of a collective motion. Therefore, the algorithm for finding these distinct underconstrained regions is the same as that for finding the overconstrained regions, except that now the only constraints placed in the network are the external torsional constraints. Proteins as Bond-Bending Networks The connectivity of a bond-bending network is completely defined by the nearest-neighbor bond-stretching (central-force) constraints. Next-nearest-neighbor distance constraints represent the angular bond-bending constraints due to the atom’s chemistry. Dihedral rotation about the central-force bonds is the elementary flexible element in this type of network. Rotation through a dihedral angle is possible a priori but may be locked because of crosslinking bonds in the network. Furthermore, dihedral angles associated with double or partialdouble bonds are represented as fixed by using a thirdnearest-neighbor distance constraint. For instance, to lock PROTEIN FLEXIBILITY PREDICTIONS USING GRAPH THEORY the peptide bonds within a protein structure, the distance between the carbonyl oxygen and amide hydrogen is fixed. Therefore, along the backbone of a protein the F, C dihedral angles are a priori allowed to rotate, but the peptide bonds and other double-bonded groups are kept planar in this way. As shown in Figure 1, the hydrogen bond is modeled similarly to a covalent bond, in which the donor, hydrogen, and acceptor atoms are typically generic and hence noncolinear. Each hydrogen bond will introduce three distance constraints, corresponding to one central force between the hydrogen and acceptor atoms and two bond-bending forces associated with the hydrogen and acceptor atoms. This model for a hydrogen bond allows the protein structure to be described as a bond-bending network and is physically reasonable because hydrogen bonds are almost never precisely linear; the three dihedral angle degrees of freedom associated with this representation of the hydrogen bond also allow it to have some flexibility. Modeling the hydrogen bond to be more or less constrained than this has been tested, and the model in Figure 1 provides a good balance between neither over- nor underrepresenting the flexibility of a hydrogen bond. This model results in ideal rigid a-helices, and b-sheets ranging from rigid to somewhat flexible, depending on size and the regularity of their hydrogen-bonding patterns. Moreover, protein structures typically show a substantial proportion of rigid regions, while having regions that remain flexible. Determination of constrained and rotatable dihedrals, as implemented in the FIRST software’s implementation of the pebble game, has been tested against exact counting and shown to agree for all the elementary structures: a-helices, parallel and antiparallel b-sheets, and reverse turns. APPLICATION TO PROTEINS Detailed information about the mechanical stability of a protein under a fixed set of distance constraints is provided by FIRST analysis. All overconstrained regions, rigid clusters, and underconstrained regions are determined and can be colored and viewed using standard molecular rendering packages, including Rasmol and InsightII. To analyze a protein, it must first be decided which hydrogen bonds to include and model as distance constraints. Hydrogen Bonding Beyond covalent bonds, salt bridges and then hydrogen bonds form the next strongest interactions within proteins (Fig. 3). Hydrogen bonds vary in strength from nearly as strong as the covalent bonds to as weak as the van der Waals interactions.36,37 Hydrogen bonds form directional crosslinks in the bond-bending network that lead to largescale rigid regions. In proteins, the regular hydrogenbonding patterns between main-chain amide and carbonyl groups form the regular secondary structures: a-helices, b-sheets, and reverse turns. Hydrogen bonds also stabilize the tertiary structure of proteins through side-chain interactions that interlock parts of the protein chain distant in sequence. For protein structures in which the hydrogen atom positions are not experimentally defined (the case for most 155 Fig. 3. Schematic representation of the ordering of microscopic forces, from strongest to weakest. Distance constraints are used in FIRST to model strong bonding forces to the left of a sliding pointer. This approach defines a network of covalent and hydrogen bonds and salt bridges in the protein. crystallographic structures determined by X-ray rather than neutron diffraction), the WhatIf software package is used to assign polar hydrogen atoms positioned such that their hydrogen-bonding opportunities are optimized.38 Because FIRST results will depend on the accuracy of placement of polar hydrogen atoms (because of their influence on hydrogen bonds being assigned, or not), we have tested how well WhatIf-defined hydrogen positions compare with the experimentally determined positions in five neutron diffraction structures from the Protein Data Bank (PDB; www.rcsb.org/pdb).1 Five neutron structures available in the PDB are lysozyme (PDB entry 1lzn), trypsin (1ntp), insulin (3ins), myoglobin (2mb5), and ribonuclease A (5rsa). For each structure, we processed the file through FIRST using two different hydrogen-bond energy threshold or cutoff values, as discussed below. We then created a modified version of the neutron structure by stripping the hydrogen atoms from it and adding new hydrogens with WhatIf. This new structure was then processed through FIRST for comparison with the results obtained using neutron diffraction-defined hydrogen atom positions. Table I contains results for these five neutron structures at two different energy cutoff values: 20.1 kcal/mol and 20.6 kcal/mol (the latter corresponding to thermal fluctuation at room temperature). The comparative results show only slight differences due to a few hydrogens placed differently in the two structures. The percentages shown are calculated by dividing twice the number of hydrogen bonds in common by the total number of hydrogen bonds for both versions of the protein structure. While the energy threshold of 20.6 kcal/mol is more restrictive and includes 156 D.J. JACOBS ET AL. TABLE I. Effect of Hydrogen Atom Placement on the Number of Hydrogen Bonds Identified* PDB Code H-Bond Energy 1lzn 1ntp 2mb5 3ins 5rsa Total 129 1762 1.70 223 1790 1.80 153 1836 1.80 102 1305 1.50 124 1556 2.00 — — — No. common to both No. unique to neutrona No. unique to modifiedb % in common 184 3 11 96.3 216 9 17 94.3 249 13 6 96.3 108 3 6 96.0 140 4 4 97.2 897 Common to both Unique to neutrona Unique to modifiedb % in common 130 7 6 95.2 168 16 9 93.1 182 22 8 92.4 80 3 6 94.7 116 3 4 97.1 676 No. of residues No. of protein atoms Resolution (Å) No. of H-bonds with E # 2 0.1 kcal/mol No. of H-bonds with E # 2 0.6 kcal/mol 76 95.9 84 94.2 *Experimentally defined hydrogen positions and resulting hydrogen bonds of five neutron structures are compared with the hydrogen bonds resulting from assignment of hydrogen positions by WhatIf in the same five structures. a Rows labeled unique to neutron include the number of hydrogen bonds from experimentally determined hydrogen atom positions. b Rows labeled unique to modified include the number of hydrogen bonds from WhatIf calculated hydrogen atom positions in neutron diffraction structures from which the experimentally determined hydrogen atom positions had been removed. Fig. 4. Geometry used in the hydrogen bond energy potential. u is the donor– hydrogen–acceptor angle; f is the hydrogen–acceptor– base angle, where base is the atom (C, in this case) covalently bonded to the acceptor; d is the donor–acceptor distance; r is the hydrogen–acceptor distance; and w (not shown) is the angle between the normals of the planes defined by the covalent bonds of the donor and base atoms (e.g., the planes defined by the two sp2 centers, N and C, in this case). only the strongest hydrogen bonds, there was slightly less overall agreement (94%) in the hydrogen bonds at this energy threshold in the WhatIf and neutron versions of the structure than was found at the chosen threshold of 20.1 kcal/mol (96%). Thus, on average, only 4% of the hydrogen bonds were assigned differently in the two types of structures. Not surprisingly, many of the hydrogen-bond differences resulted from different placement of hydrogens on histidine residues. Histidine has two side-chain nitrogen atoms that can bond to 0, 1, or 2 hydrogens, depending on the local environment. Overall, we conclude that the WhatIf software package positions hydrogen atoms sufficiently accurately to permit analysis of the resulting hydrogen-bond network. To define the hydrogen bonds and salt bridges for inclusion in the analysis of rigid and flexible regions, the geometry and energy of these interactions is assessed. A superset of possible hydrogen bonds is assigned based on meeting the following geometric39,40 criteria: the donor– acceptor distance, d # 3.6 Å, the hydrogen-acceptor distance, r # 2.6 Å, and the donor– hydrogen-acceptor angle, u, falls between 90° and 180° (Fig. 4). Salt-bridge (ion-pair) interactions are considered as a special case of hydrogen bonds. Salt bridges are similar to hydrogen bonds, with a more significant ionic or Coulombic component, which is less geometrically sensitive. Our identification of such salt bridges follows previous studies41– 43 by extending the maximum distance between donor and acceptor to 4.6 Å and softening the angular dependence such that u falls between 80° and 180°. Because the strength of hydrogen bonds and salt bridges depends on the chemistry of the particular donor and acceptor atoms as well as their orientation, an energy function44 is then used to rank hydrogen bonds: HS D S DJ EHB 5 V0 5 d0 d 12 26 d0 d 10 F~u, f, w! (1) where sp3 donor–sp3 acceptor sp3 donor–sp2 acceptor sp2 donor–sp3 acceptor sp2 donor–sp2 acceptor V0 5 8kcal/mol and d0 5 2.8 Å F 5 cos2ucos2(f2109.5) F 5 cos2ucos2f F 5 cos4u F 5 cos2ucos2(max[f,w]) The hydrogen bond energy (EHB) is a function of the equilibrium hydrogen bond distance, d0, and well depth, V0. The donor to acceptor distance is d. The angular dependence of the function, F, is dependent on the hybridization of the donor and acceptor atoms. The four possible cases shown rely on the angular terms u, f, and w. Angle u defines the donor atom– hydrogen–acceptor atom angle, and f is the angle between the hydrogen atom, the acceptor, and the base atom bonded to the acceptor. The w angle is between the normals of the two planes defined by the sp2 centers. If f is less than 90°, the supplement of the angle is used. Figure 4 illustrates how these parameters are defined. Salt bridges can be viewed as strong hydrogen bonds36 with average energies of 26 (64) kcal/mol.45 Salt bridges have broader distance and angular distributions than are PROTEIN FLEXIBILITY PREDICTIONS USING GRAPH THEORY found for nonionic hydrogen bonds, and these observed distributions are not well reflected by the hydrogen-bond energy functions we have tested. Salt bridges within the above specified geometric ranges generally have stronger interactions than those exhibited by hydrogen bonds. Therefore, salt bridges meeting the above geometric criteria are always included in the bond network analyzed for flexibility. For hydrogen bonds, we can tune the energy threshold (the sliding pointer in Fig. 3) used to define which hydrogen bonds are included in the network. Setting the threshold at less negative (less favorable) energy values includes weaker hydrogen bonds, which tend to be common in proteins and have a significant influence on structural stabilization. The ability to select hydrogen bonds based on strength allows investigation of how the stability in each region of the protein varies as the hydrogen-bond network is strengthened or weakened. By changing the criteria for modeling a hydrogen bond as a constraint, a protein can be substructured from containing a few, large rigid clusters down to being completely floppy, with many small rigid clusters involving single atoms with their covalent bonds acting as rotatable dihedral angle hinges. Individual hydrogen bonds or small sets of hydrogen bonds that form critical crosslinks can therefore be identified by shifting the energy threshold and observing which hydrogen bonds, when included or omitted, have a large effect on the rigidity of the network. Ideally, we would like to be able to set this energy cutoff value, perform a single FIRST analysis, and know that the output describes the physically relevant flexibility of a protein. One way of setting the energy threshold objectively is to choose it such that maximum agreement in the hydrogen-bond network is obtained for pairs of independently determined structures for a protein (e.g., by different researchers or in different crystallographic packings) in which the main-chain conformations are the same. This ensures that the results of FIRST analysis are not sensitive to the sorts of fluctuations known to occur within protein structures. Since the energy cutoff is the tunable parameter in using FIRST, we tested which setting gave the most similar results between pairs of such structures. Such a cutoff value should naturally be below when all hydrogen bonds are present (Ecut # 0.0 kcal/mol) and above when the protein substructures into many tiny rigid clusters (Esplinter). A very natural place to look at the behavior of these proteins is near room temperature which corresponds to Ert 5 20.6 kcal/mol. However, because the energy function is approximate (does not take into account the effect of more distant neighboring atoms on hydrogenbond strength), it is important not to take these energies too literally, and to consider them as relative rather than absolute. To determine a reasonable default energy threshold for hydrogen bonds, we evaluated which threshold best conserves the hydrogen bonds within a family of protein structures. Multiple structures within four different protein families were studied to find such a threshold. The PDB codes used for each family are as follows: trypsin 157 Fig. 5. Histogram of energies for hydrogen bonds. Distribution of hydrogen bond energy for three structures of human immunodeficiency virus protease (HIVP) (PDB codes 1dif, 1hhp, 1htg). Hydrogen positions were established by WhatIf. The inset expands the low-energy (weak hydrogen bond) region between 20.2 and 0 kcal/mol. An energy threshold of 20.1 kcal/mol is used to eliminate the large number of very weak hydrogen bonds in the spike near 0 kcal/mol. (1tpo, 2ptn, 3ptn), trypsin inhibitor (4pti, 5pti, 6pti, 9pti), adenylate kinase (1zin, 1zio, 1zip), and human immunodeficiency virus protease (HIVP) (1dif, 1hhp, 1htg). Figure 5 shows the hydrogen-bond energy distribution for one of these families, namely the three HIVP structures. A large spike in the distribution of possible bonds located between 20.1 kcal/mol and 0.0 kcal/mol for the number of hydrogen bonds in the three structures appears in Figure 5. This spike is largely due to the fact that quite generous definitions of hydrogen bonds are allowed initially (donor– hydrogen–acceptor angle, u $ 90° and donor–acceptor distance, d # 3.6 Å, as shown in Fig. 4). The inset in Figure 5 expands the region near 0.0 kcal/mol, demonstrating how a large number of very weak hydrogen bonds, often with u angles near 90°, can be removed by setting Ecut # 20.1 kcal/mol. Thus, the generous hydrogen bond distance and angle screening criteria can be effectively filtered by setting Ecut. When these geometric criteria and an energy threshold of 20.1 kcal/mol are applied to analyze the hydrogen bonds and salt bridges in five neutron diffraction structures, a Gaussian distribution is observed for the number of hydrogen bonds as a function of donor–acceptor distance, with virtually all hydrogen bonds and salt bridges having distances between 2.6 and 3.6 Å. The distribution in donor– hydrogen–acceptor angles is bimodal, with a strong, Gaussian peak between 130° and 180° and a weaker peak between 90° and 130°. In the choice of protein structures to analyze, the stereochemical quality of the structure can have a significant influence on the definition of its network of hydrogen bonds, due to their angular dependence. The result is that FIRST analysis on a structure with poor stereochemistry is likely to indicate the protein as being more flexible than it actually is, due to missing hydrogen bonds. It is advisable to assess the main-chain stereochemistry through a 158 D.J. JACOBS ET AL. F, C plot, as well as focus on high-resolution, well-refined structures for FIRST analysis, to avoid this possibility of missing hydrogen bonds due to the misorientation of main-chain hydrogen-bonding groups. Flexibility Index A flexible region consisting of many interconnected rigid clusters within a protein may define a collective motion having only a few independent degrees of freedom. Although underconstrained, this region could be nearly rigid and thus mechanically stable. An isostatically rigid region, however, which contains no redundant constraints and is only rigid, is not expected to be as stable as an overconstrained region. Overconstrained regions have more constraints than necessary to be rigid, and therefore are considered more stable. Because of this continuum between rigidity and flexibility, a continuous index is useful. The total number of floppy modes in a protein, denoted by F, corresponds to the number of independent, internal degrees of freedom. To obtain F, the six trivial rigid body degrees of freedom are subtracted from the total number of independent degrees of freedom. The global count of the number of floppy modes gives a good sense of overall intrinsic flexibility. However, a more useful measure is to track how the degrees of freedom are spatially distributed throughout the protein. In particular, we are interested in locating the underconstrained or flexible regions. The quantity, fi, is defined as a flexibility index that characterizes the degree of flexibility of the ith centralforce bond in the protein. Let Hk and Fk, respectively, denote the number of hinge joints (rotatable bonds) and the number of floppy modes within the kth underconstrained region. Let Cj and Rj, denote the number of central-force bonds and the number of redundant constraints within the jth overconstrained region, respectively. The flexibility index provides a quantitative range from most to least constrained and is given by fi ; 5 Fk in an underconstrained region Hk 0 in an isostatically rigid region 2Rj in an overconstrained region Cj (2) When the ith central-force bond is a hinge joint, the flexibility index is defined to be given by the the number of floppy modes divided by the total number of hinges within the underconstrained region. Because the number of independent dihedral rotations must be less than or equal to the number of hinge joints, the flexibility index is always #1. When the ith central-force bond is not a hinge joint, it is part of a rigid cluster. If the central-force bond is within an overconstrained region, the flexibility index is assigned a negative value with magnitude given by the number of redundant constraints divided by the total number of central-force bonds within the region. This number becomes more negative as the region becomes more overconstrained. As a simple example, consider a single n-fold ring of atoms that are connected by covalent bonds. From con- straint counting, the number of degrees of freedom minus the number of constraints is given by F 5 n 2 6. The number of hinge joints is simply given by n. Therefore, the flexibility index for a n-fold ring is given by fi 5 n26 for each central-force bond in a n-fold ring n (3) Note that as the ring becomes larger, the flexibility index goes to the limit of 11; in this case, each dihedral angle is nearly independent, and the ring is almost as flexible as a linear chain. For a six-fold ring, the flexibility index is zero, indicating an isostatically rigid structure. For a three-fold ring, the ring is overconstrained, and the flexibility index is 20.2. The flexibility index of a protein can be plotted as a function of residue number, and regions within the plot (corresponding to segments within the sequence) can be colored according to whether they are coupled in motion (Fig. 6). A nice property of the flexibility index is that it varies gradually as hydrogen bond constraints are added or removed. RESULTS AND DISCUSSION An important feature of FIRST is that it can predict the intrinsic flexibility of a protein given a single 3-D structure. However, it is typical that upon ligand binding, the hydrogen-bond pattern will change. Therefore, the predicted conformational flexibility using FIRST will depend on whether the structure being analyzed is an open (ligand-free) form, or a closed (ligand-bound) form. Crystal contacts can also influence the flexibility of a protein, and their influence can be assessed in two ways by FIRST: (1) by analyzing the flexibility of the protein independent of its crystal lattice neighbors (in which case the effects of intermolecular hydrogen bonds are removed from analysis); and (2) by comparing the flexible regions found for the same protein crystallized in different lattice packings. However, the general features of flexible and rigid regions found by FIRST are remarkably consistent among different 3-D structures (in the same ligand-binding state) for a protein, as will be shown for HIVP, dihydrofolate reductase, and adenylate kinase. HIV Protease An initial application is to HIVP, a major inhibitory drug target for current acquired immunodeficiency syndrome Fig. 6. A: Rigid cluster decomposition of the open conformation of human immunodeficiency virus protease (HIVP) (PDB code 1hhp). B: Flexibility index, color-mapped onto the same HIVP structure. Four regions of interest, a, b, g, and d, are identified for one of the dimers. C: Flexibility index plotted versus residue number. Of the four regions, a, b, and g are most flexible (shown red in B), while d is rigid (blue) in this open conformation of HIVP. Parts of the sequence that are coupled in motion are plotted in the same color; the same regions in D and E are then colored accordingly. D: Mobility plotted versus residue for this conformation. Mobility is determined as the average crystallographic temperature factor (B-value, or Debye–Waller factor) divided by the average atomic occupancy, for the main-chain atoms in each residue. E: Dihedral angle changes between the main chains of the above open conformation and the closed conformation (Fig. 7, PDB code 1htg). The three flexible regions identified by FIRST are also those with the greatest experimentally defined mobility values and dihedral angle changes. PROTEIN FLEXIBILITY PREDICTIONS USING GRAPH THEORY Figure 6. 159 160 D.J. JACOBS ET AL. (AIDS) therapy. Two ligand-free X-ray crystal structures available for HIV protease, PDB entries 1hhp and 3phv, are superficially very similar in structure and have similar resolution and crystallographic residual error (2.7-Å resolution for both, and an R-factor of 0.190 for 1hhp and 0.191 for 3phv). However, PROCHECK46 showed that 3phv had significantly fewer residues with stereochemically favored F, C values, which results in distorted main-chain hydrogen-bond geometries; therefore we chose 1hhp to represent the open HIVP conformation. The open form of the protein (PDB code 1hhp) is dominated by a single rigid cluster shown in blue (Fig. 6A), including the base and walls of the substrate and inhibitor binding site (cavity at center), and three flexible regions shown as alternatingcolored bonds (each color indicating a rigid microcluster within the flexible region). The ends of the flaps (b region labeled in Figure 6A, residues 45–56) are known from crystallographic and NMR structures to be important for closing over and binding inhibitors47 and appear as the most flexible (red) regions when the structure is characterized by the flexibility index (Fig. 6B,C). Other flexible regions include the base of each flap (region a, residues 39 – 42), which may act as a cantilever, and the g region. The FIRST results for HIVP have been compared with experimental measures of protein flexibility. The major peaks in main-chain thermal mobility (B-value), measured crystallographically and shown in Figure 6D, correlate directly with the a, b, and g flexible regions predicted by FIRST. The region labeled d is the dimer interface, formed by the N- and C-termini of the two, identical protein chains. It should be noted that for proteins with mobile domains or other moving rigid bodies, such as a-helices, the crystallographic mobility and FIRST results will not necessarily compare well with B-values. Crystallographically, they appear as mobile regions, whereas in FIRST they appear as rigid regions flanked by flexible loops (allowing the motion). This confusion can be avoided when NMR order parameters are available for comparison, since they also indicate moving rigid bodies as rigid regions flanked by flexible loops. The Indiana Dynamical Database48 contains such data for a number of proteins, including HIVP. These data are provided for PDB entry 1bvg, a ligand-bound form; as in the FIRST results for a different ligand-bound structure described in Figure 7, the base of the flaps are the most flexible region. HIVP has been crystallized with various inhibitors bound, resulting in a closed conformation with the flaps lowered. The main-chain dihedral angle (F, C) changes (similar to the analysis reported by Korn and Rose49) observed for crystal structures of the open (entry 1hhp) and closed (entry 1htp) conformations are shown in Figure 6E. The FIRST-predicted flexible regions also directly correspond with the regions of greatest dihedral angle change. In the three flexible regions (a, b, and g), the flexibility is associated with a flip in at least one dihedral angle (defined as a change of more than 60°) within a rigid b-turn in the center of each flexible region (Fig. 6A,E). The results here are consistent with the motion observed by interpolation between different HIVP crystal structures50 and an earlier dihedral analysis for a different pair of HIVP structures49 indicating that large changes at residues 40, 50, and 51 in the a and b regions result in a large, concerted movement of the flaps. Flexibility of the g region has not been emphasized in other studies of HIVP; however, it is known that drug-resistant mutants of the protease include two residues that pack against the g region, 63 and 71, with residue 63 proposed to induce a conformational perturbation.51,52 Thus, conformational coupling between the g region and the flaps, through the g–a loop interactions, may explain why mutations in the g region, which are distal from the active site, cause resistance to drug binding. Ligand binding restricts the motion of the flaps through new hydrogen bonds linking the two flaps to each other and to the ligand. Some of these hydrogen bonds between the flaps and ligand are mediated by an conserved water molecule found in retroviral but not mammalian homologs of HIVP,53 providing a useful basis for designing more HIV-specific drugs. To compare the influence of ligands on HIVP flexibility, there were a number of ligand-bound structures of good stereochemistry from which to choose. For brevity, here we show the results from PDB entry 1htg, with GR137615 bound, to represent the closed form of HIVP. (We have also analyzed two other ligand-bound structures, 1hiv and 1dif, and found the influence of these ligands on protein flexibility to be substantially similar.) Unlike the open form, the closed structures were resolved crystallographically as dimers; thus, independent structural information is available for the two subunits of the dimer. This means it is possible to assess the influence of different side-chain conformations in the two halves (due to ligand interactions, thermal fluctuations and environmental differences) in terms of their effects on the hydrogen-bonding network and flexibility. The left and right sides of HIVP in Figure 7 indicate that the only substantial difference in their flexibility is caused by the asymmetry of the ligand bound (at center). Comparison of this ligand-bound structure with the open HIVP also demonstrates how a ligand can rigidify part of the protein through new hydrogen bonds even though the ligand itself is not rigid (note black bonds indicating H-bonds between the protease flaps in Figure 7A, and that the flaps are now rigidified), while making other parts of the protein more flexible. In particular, note the dimer interface, where inter-subunit rotation occurs upon ligand binding, breaking some of the interfacial stabilizing hydrogen bonds, and the loop to the right of the binding cavity, shown as a flexible (orange) region of the main-chain ribbon in Figure 7B. This loop flexibility is not reflected in the other HIVP subunit, due to ligand asymmetry. Flexibility of the dimer interface in a ligand-bound structure is also a prominent feature found by NMR54 and MD analyses55; MD also identifies flap flexibility in the ligand-free conformation. PROTEIN FLEXIBILITY PREDICTIONS USING GRAPH THEORY Fig. 7. A: Rigid cluster decomposition of the closed conformation of human immunodeficiency virus protease (HIVP) (PDB code 1htg). B: Flexibility index plot of the same, closed conformation of HIVP (PDB code 1htg). Four regions of interest, a, b, g, and d, are identified for one of the monomers. Contrast the change in rigidity for these regions between this ligand-bound structure and the ligand-free case (Fig. 6). The influence of water is easily seen in a comparison of the liganded cases. A specific water molecule (WAT301) positioned between the b flaps and the ligand is conserved in most HIVP ligand-bound structures.53 In this rigid region analysis, we found the inclusion of this water essential to rigidify the b flaps. This particular water molecule serves as a hydrogen acceptor from residue 50 of each protein chain and a hydrogen donor to the ligands. Adding one water molecule introduces four hydrogen bonds to the structure, with energies within the range of 21.5 to 27.5 kcal/mol. For both 1htg and 1dif, the b flaps become flexible without the intermolecular hydrogen bonds created by this water molecule. We have only included buried water molecules making intermolecular hydrogen bonds in HIVP, as these tend to be reliably assigned between structures, whereas surface water molecules are unevenly assigned in many of the crystal structures of HIVP, as well as being variable in crystallographic temperature factor. 161 Fig. 8. Flexibility map of dihydrofolate reductase for the open conformation (PDB code 1ra1) (A) and for the closed conformation (PDB code 1rx1) (B). C: Occluded conformation (PDB code 1rx6). Shown in green are the ligands bound in these reaction pathway intermediates. Two loops experimentally determined to be flexible, M20 and bF–bG, are also noted. The motion of the M20 loop is essential to accommodate a variety of ligands during catalysis. The flexible bF–bG loop participates in ligandinduced conformational changes. At the top of the graph is the scale for the flexibility index used to map color onto the Ca trace. The scale runs from red (flexible) through gray (isostatic) to blue (rigid). Dihydrofolate Reductase By trapping different ligand-bound states crystallographically, Sawaya and Kraut56 noted five conformational states for Escherichia coli dihydrofolate reductase (DHFR) during its catalytic cycle. We analyzed three of these crystallographic structures (PDB codes: 1ra1, 1rx1, and 1rx6) with the FIRST algorithm. These three structures correspond to the open, occluded, and closed conformations of DHFR, as shown in Figure 8. The Ca traces are colored according to each residue’s flexibility index, fi, with the most flexible regions colored red and the most rigid colored blue. According to the previous study,56 the regions of greatest interest are the rotations of two major subdo- 162 D.J. JACOBS ET AL. Fig. 9. Flexibility index plot for the three conformations of dihydrofolate reductase (DHFR) shown in Figure 8. The value of the flexibility index as defined in eq. (2), plotted versus residue number (a monochrome version of the type of plot shown for human immunodeficiency virus protease (HIVP) in Fig. 6C). The two experimentally determined loops, M20 and bF–bG, are shown at the top of the plot, and correspond to highly flexible regions. For each collective motion within the structure, a unique symbol is plotted. mains: the adenoside binding subdomain and the loop subdomain. The M20 and bF–bG loops of the loop subdomain are denoted in Figures 8 and 9. Studies by Miller and Benkovic57 concluded that the flexibility of these loops is interrelated, such that the flexibility of the outer bF–bG loops guides the conformation of the M20 loop. It is this correlated flexibility that gives DHFR ligand specificity. We would expect the regions that move the most, and that are important for binding, to appear flexible in the FIRST analysis, at least in the open conformation. Figure 8A shows that the M20 loop is detected by FIRST to be fully flexible, but in the closed form (Fig. 8B), this loop has moved and become partially locked into place. Comparing all three conformations quantitatively in Figure 9, the residues within these two mobile loops tend to be most flexible, and this flexibility is fairly independent of conformation being analyzed. This relates to a functional requirement for these loops to remain flexible during the catalytic cycle. The flexible region around residue 88 in all three conformations of Figure 9 corresponds to a hinge between the two subdomains. Similarly, the flexible region found by FIRST around residue 70 (open triangles in the flexibility index plot, Fig. 9) is within the adenoside binding subdomain and has also been identified as flexible by NMR techniques.58 Plots of crystallographic mobility and regions of greatest main-chain dihedral angle change (data not shown) for the three DHFR structures in Figure 8 are remarkably consistent between the structures and also agree with FIRST results that the most flexible regions are in the M20 loop and the bF–bG loop. We have analyzed several other DHFR structures by FIRST (PDB entries 1rx2, 3, 4, and 5) and have found substantial agreement with the above conclusions. In general, changes in ligands between these structures can influence the flexibility of their neighborhood in the protein, but the major features of flexibility remain consistent. PROTEIN FLEXIBILITY PREDICTIONS USING GRAPH THEORY 163 Fig. 10. Flexibility plot of adenylate kinase for the open conformation (PDB code 1dvr) (A) for the closed conformation (PDB code 1aky) (B). C: Difference in main-chain dihedral angles between these two conformations, indicating the locations of large, localized conformational changes. Ligands bound to these structures are shown in green tubes. The open state (A) has only adenosine triphosphate (ATP) bound. In the closed state (B), the ligand, P1,P5-bis(adenosine-59-)pentaphosphate (AP5A) mimics the roles of AMP and ATP binding concurrently. Dark blue corresponds to highly overconstrained and rigid, with a flexibility index; bright red corresponds to highly flexible. All labeled regions are discussed in the text. Adenylate Kinase Another protein whose motion has been studied experimentally is adenylate kinase.59 – 61 Previous work indicates that adenylate kinase uses hinges rather than shear motions for conformational change. This intrinsic, largescale hinge motion upon ligand binding is easily identifiable in both the open and closed conformations analyzed by FIRST, as indicated in Figure 10 by gold arrows. Several helices at the extreme right of Figure 10 move in like fingers via hinges defined as the most flexible (red) portions of the protein by FIRST. Adenylate kinase binds two ligands, ATP and AMP, in a two-step mechanism. Unfortunately, structures of adenylate kinase from the same species are not available for all three steps (ligand-free, followed by two binding/conformational change events). By comparing enzymes from differ- 164 D.J. JACOBS ET AL. ent species, four hinges were previously defined to contribute to the conformational change between open and ligandbound forms.59 These hinges move in a concerted way to account for the large conformational change closing the lid domain (residues 131–165) over the ligand. Figure 10A,B shows structures with ligands bound to this domain and not this initial ligand-binding step.59 However, the peak in main-chain dihedral change shown around residue 167 in Figure 10C corresponds to a conformational switch made by the red flexible loop (part of the lid) in Figure 10A between structures. Crystallographic mobility analysis for the open structure (10A) and the closed structure (10B) are generally in good agreement with FIRST results, the only difference being that regions between residues 135 and 160 appear crystallographically mobile in the closed structure. Closing of the lid domain is associated with the binding of ATP (green tubes at center in Fig. 10A). Binding of the ligand AP5A (green tubes, Fig. 10B) produces the fully closed conformation of adenylate kinase and locks many of the domain linking hinges (a–f), as seen by the transition between flexible (red) and rigid (blue) regions in Figure 10A,B. The NMPbind site is where the part of AP5A that is nonoverlapping with ATP binds, and this site is formed by the interface between the two domains as they clamp down on the inhibitor.61 Comparing these structures, FIRST shows that the flexibility of the NMPbind domain (especially hinges a, e, and f) decreases upon AP5A binding to this domain. The flexible linkage (b) between helices a3 and a4 around residue 62 (identified in the Fig. 10C plot of change in F, C angles between the open and closed structures in Fig. 10A,B) seems to account for a large part of the motion transforming the open to the closed conformation. This region (b in Fig. 10A,B) is found to decrease in size but remains flexible in the closed conformation. The persistent flexibility in the closed conformation hints at the reversibility of the motion required for catalytic turnover. Even in the ATP-bound closed lid conformation exhibited by both of these structures, certain key hinges59,60 remain flexible. For example, hinges c and d between helices a4 and a5 remain flexible in both states. Thus, FIRST results correlate well with the crystallographically observed conformational changes upon ligand binding for the complex motions within adenylate kinase. SUMMARY A novel distance constraint approach is introduced for characterizing the intrinsic flexibility of a protein. The underlying physical and mathematical assumptions are outlined and implemented computationally in the FIRST software. FIRST determines the floppy inclusion and rigid substructure topography of a given protein structure, based on a set of distance constraints determined by the covalent and hydrogen bonding network within a single conformation of the protein. A flexibility index is introduced as a continuous measure for quantifying the flexibility or stability of each bond within the protein, based on the FIRST identification of the density of floppy modes as well as the density of redundant bonds within the protein. There are several advantages of FIRST relative to previous methods for analyzing protein flexibility. FIRST calculations can be done virtually in real time (a few seconds of CPU time) once the bond network has been defined. Analysis of a single protein structure is able to indicate regions likely to undergo conformational change as part of the protein’s function. For a given set of distance constraints, the rigid regions and the flexible joints between them are determined precisely. The ability to determine coupled motions very quickly among the dihedral angles of a flexible region gives FIRST an advantage over other methods. Collective motions, in which changing one dihedral angle will influence the other dihedral angles within the region, are localized within the flexible regions of the protein. Analysis of the relative flexibility within HIV protease, dihydrofolate reductase, and adenylate kinase, even when performed on a single structure, captures much of the functionally important conformational flexibility observed experimentally between different ligand-bound states. A distribution version of the FIRST software is in preparation and will be available through the authors. ACKNOWLEDGMENTS The authors thank Yuquing Xiao for help with programming and Vishal Thakkar and Brendan Hespenheide for their contribution to molecular graphics. REFERENCES 1. Bernstein FC, Koetzle TF, Williams GJB, Meyer EF Jr., Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M. The Protein Data Bank: a computer based archival file for macromolecular structures. J Mol Biol 1977;112:535–542. 2. Ringe D, Petsko GA. A consumer’s guide to protein crystallography. In: Protein engineering and design. San Diego: Academic Press, 1996; p 210 –229. 3. Bennett W, Huber R. Structural and functional aspects of domain motions in proteins. Crit Rev Biochem 1984;15:291–384. 4. Wuthrich K, Wagner G. Internal motion in globular proteins. Trends Biochem Sci 1978;3:227–230. 5. Nichols WL, Rose GD, Ten Eyck LF, Zimm BH. Rigid domains in proteins: an algorithmic approach to their identification. Proteins 1995;23:38 – 48. 6. Siddiqui AS, Barton GJ. Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definition. Protein Sci 1995;4:872– 884. 7. Boutonnet N, Rooman M, Wodak S. Automatic analysis of protein conformational changes by multiple linkage clustering. J Mol Biol 1995;253:633– 647. 8. Holm L, Sander C. Parser for protein folding units. Proteins 1994;19:256 –268. 9. Zehfus MH, Rose GD. Compact units in proteins. Biochemistry 1986;25:5759 –5765. 10. Karplus PA, Schulz GE. Prediction of chain flexibility in proteins. Naturwissenschaften 1985;72:212–213. 11. Duan Y, Kollman PA. Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 1998;282:740 –744. 12. Ma J, Karplus M. The allosteric mechanism of the chaperonin GroEL: a dynamic analysis. Proc Natl Acad Sci USA 1998;95:8502– 8507. 13. Case DA. Molecular dynamics and normal mode analysis of biomolecular rigidity. In: Thorpe M, Duxbury P, editors. Rigidity theory and applications. Kluwer Academic/Plenum, 1999. 14. Keskin O, Jernigan RL, Bahar I. Proteins with similar architecture exhibit similar large-scale dynamic behavior. Biophys J 2000;78:2093–2106. 15. Janin J, Wodak S. Structural domains in proteins and their role in PROTEIN FLEXIBILITY PREDICTIONS USING GRAPH THEORY 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. the dynamics of protein function. Prog Biophys Mol Biol 1983;42: 21–78. Maiorov V, Abagyan R. A new method for modeling large-scale rearrangements of protein domains. Proteins 1997;27:410 – 424. Jacobs DJ, Thorpe MF. Generic rigidity percolation: The pebble game. Phys Rev Letters 1995;75:4051– 4054. Jacobs DJ. Generic rigidity in three-dimensional bond-bending networks. J Phys A Math Gen 1998;31:6653– 6668. Thorpe MF, Jacobs D, Chubynsky N, Rader A. Generic rigidity of network glasses. In: Thorpe MF, Duxbury P, editors. Rigidity theory and applications. New York: Kluwer Academic/Plenum; 1999. Jacobs D, Thorpe MF. Computer-implemented system for analyzing rigidity of substructures within a macromolecule. US Patent number 1998:6,014,449. Dill KA. Dominant forces in protein folding. Biochemistry 1990;29: 7133–7155. Abagyan R, Totrov M, Kuznetsov D. ICM—a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation. J Comput Chem 1994;15:488 –506. Lu H, Schulten K. Steered molecular dynamics simulation of conformational changes in immunoglobulin domain 127 interpret atomic force microscopy observations. Chem Phys 1999;247:141– 153. Jacobs D, Kuhn LA, Thorpe MF. Flexible and rigid regions in proteins. In: Thorpe MF, Duxbury P, editors. Rigidity theory and applications. New York: Kluwer Academic/Plenum; 1999. Lagrange J. Mećanique analytique. Paris, 1788. Maxwell JC. On the calculation of the equilibrium and stiffness of frames. Philos Mag 1864;27:294 –299. Laman G. On graphs and rigidity of plane skeletal structures. J Eng Math 1970;4:331–340. Ashcroft N, Mermin N. Solid state physics. Fort Worth, TX: Saunders College Publishing; 1976. Graver J, Servatius B, Servatius H. Combinatorial rigidity. Graduate studies in mathematics. Providence, RI: American Mathematical Society; 1993. Guyon E, Roux S, Hansen A, Bibeau D, Troadec J, Crapo H. Non-local and non-linear problems in the mechanics of disordered systems: application to granular media and rigidity problems. Rep Prog Phys 1990;53:373– 419. Jacobs DJ, Hendrickson B. An algorithm for two-dimensional rigidity percolation: the pebble game. J Comput Phys 1997;137: 346 –365. Jacobs DJ, Thorpe MF. Generic rigidity percolation in two dimensions. Phys Rev E 1996;53:3683–3693. Tay T-S, Whiteley W. Recent advances in generic rigidity of structures. Struct Topol 1985;9:31–38. Whiteley W. Rigidity of molecular structures: generic and geometric analysis. In: Thorpe M, Duxbury P, editors. Rigidity theory and application. New York: Kluwer Academic/Plenum; 1999. Xiao Y, Jacobs D, Thorpe M. Unpublished results; 1997. Jeffrey GA. An introduction to hydrogen bonding. New York: Oxford University Press; 1997. Fersht AR. The hydrogen bond in molecular recognition. Trends Biochem Sci 1987;12:301–304. Vriend G. WHAT IF: a molecular modeling and drug design program. J Mol Graph 1990;8:52–56. http://swift.embl-heidelberg.de/whatif/ Stickle D, Presta L, Dill K, Rose G. Hydrogen bonding in globular proteins. J Mol Biol 1992;226:1143–1159. McDonald I, Thorton J. Satisfying hydrogen bonding potential in proteins. J Mol Biol 1991;238:777–793. Barlow DJ, Thorton JM. Ion-pairs in proteins. J Mol Biol 1983;168: 867– 885. 165 42. Gandini D, Gogioso L, Bolognesi M, Bordo D. Patterns in ionizable side chain interactions in protein structures. Proteins 1996;24:439 – 449. 43. Xu D, Tsai C-J, Nussinov R. Hydrogen bonds and salt bridges across protein–protein interfaces. Protein Eng 1997;10:999 –1012. 44. Dahiyat BI, Gordon DB, Mayo SL. Automated design of the surface positions of protein helices. Protein Sci 1997;6:1333–1337. 45. Kumar S, Nussinov R. Salt bridge stability in monomeric proteins. J Mol Biol 1999;293:1241–1255. 46. Laskowski RA, MacArthur MW, Moss DS, Thornton JM. PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr 1993;26:283–291. ftp.biochem. ucl.ac.uk/pub/procheck/tar3_4/procheck.tar.Z 47. Nicholson LK, Yamazaki T, Torchia DA, Grzesiek S, Bax A, Stahl SJ, Kaufman JD, Wingfield PT, Yam PYS, Jadhav PK, Hodge CN, Domaille PJ, Chang C-H. Flexibility and function in HIV-1 protease. Nature Struct Biol 1995;2:274 –280. 48. Goodman J, Pagel M, Stone M. Relationships between protein structure and dynamics from a database of NMR-derived backbone order parameters. J Mol Biol 2000;295:963–978. http:// pooh.chem.indiana.edu/IDD.html 49. Korn AP, Rose DR. Torsion angle differences as a means of pinpointing local polypeptide chain trajectory changes for identical proteins in different conformational states. Protein Eng 1994; 7:961–967. 50. Gerstein M, Krebs W. A database of molecular motions. Nucleic Acids Res 1998;26:4280 – 4290. 51. Chen Z, Li Y, Schock HB, Hall D, Chen E, Kuo LC. Three dimensional structure of a mutant HIV-1 protease displaying cross-resistance to all protease inhibitors in clinical trials. J Biol Chem 1995;270:21433–21436. 52. Patrick A, Rose R, Greytok J, Bechtold C, Hermsmeier M, Chen P, Barrish J, Zahler R, Colonno R, Lin P. Characterization of a human immunodeficiency virus type 1 variant with reduced sensitivity to an aminodiol protease inhibitor. J Virol 1995;69: 2148 –2152. 53. Wlodawer A, Erickson J. Structure-based inhibitors of HIV-1 protease. Annu Rev Biochem 1993;62:543–585. 54. Ishima R, Freedber D, Wang Y-X, Louis J, Torchia D. Flap opening and dimer–interface flexibility in the free and inhibitorbound HIV protease, and their implications for function. Struct Fold Design 1999;7:1047–1055. 55. Scott W, Schiffer C. Curling of flap tips in HIV-1 protease as a mechanism for substrate entry and tolerance of drug resistance. Struct Fold Design 2000;9:1259 –1265. 56. Sawaya M, Kraut J. Loop and subdomain movements in the mechanism of Escherichia coli dihydrofolate reductase: crystallographic evidence. Biochemistry 1997;36:586 – 603. 57. Miller G, Benkovic S. Stretching exercises—flexibility in dihydrofolate reductase catalysis. Chem Biol 1998;5:R105–R113. 58. Epstein D, Benkovic S, Wright P. Dynamics of the dihydrofolate reductase–folate complex: catalytic sites and regions known to undergo conformational change exhibit diverse dynamical features. Biochemistry 1995;34:11037–11048. 59. Gerstein M, Schulz G, Chothia C. Domain closure in adenylate kinase: joints on either side of two helices close like neighboring fingers. J Mol Biol 1993;229:494 –501. 60. Schlauderer GJ, Proba K, Schulz GE. Structure of a mutant adenylate kinase ligated with an ATP-analogue showing domain closure over ATP. J Mol Biol 1996;256:223–227. 61. Zhang H-J, Sheng X-R, Pan X-M, Zhou J-M. Activation of adenylate kinase by denaturants is due to the increasing conformational flexibility at its active sites. Biochem Biophys Res Commun 1997;238:382–386.