1. Introduction
The conventional description of protein dynamics asserts that proteins possess intrinsic conformational states [1]. An enzyme may cycle between catalytic and open states [2]. An ion channel may open and close its central pore [3]. A chaperone protein assists the transformation of large, hydrophobic proteins from initial, linear shapes to folded ones [4]. X-ray crystallography and cryo-electron microscopy reveal conformations with small motions on the 1–5 Ångstrom level for those proteins that crystallize [5]. Neutron scattering and nuclear magnetic resonance structures of room-temperature proteins show greater shape variability, but are usually able to classify structures into a few ‘canonical’ structures.
Advances in molecular modeling have made it possible to simulate the protein folding process, generating very large numbers of samples, free energy landscapes, and information on kinetics. Nevertheless, computational identification of distinct conformational states from molecular simulations has remained an active area of methodological research. Approaches fall roughly into two classes—linear and nonlinear. Linear methods work with the real 3N-dimensional vector of atomic coordinates as a linear space, using pairwise alignments and Euclidean (RMSD) distances between structures. Nonlinear methods work with ‘features’ derived from nonlinear functions of the atomic coordinates—like pairwise distances between alpha-carbons. Well-known linear techniques include principal component analysis and linkage clustering based on pair-RMSDs. Nonlinear techniques include methods based on internal coordinates (relative distances and angles between groups of atoms) and t-distributed stochastic neighbor embedding [6].
Both linear and nonlinear techniques can be applied to subsets or groups of atoms to classify site-specific or inter-domain structures, respectively. This can be done trivially by handing off different point-sets to the analysis. Thus, most methods can be adapted to run with reasonable compute time and yield classifications most relevant to a motion under investigation. Automatically detecting protein secondary and tertiary structures, however, remains a challenging problem which we cannot address here.
One of the principal methods developed for visualizing domain motions in proteins is DynDom [7,8]. DynDom works from two input structures and determines relative domain rotations (using internal coordinates). This allows predicting transition motions from experimentally observed conformers, but not conformations from observed motions. On the other hand, many structure-to-structure similarity classification methods have emerged for this problem [9,10,11]. A recent review of dimensionality reduction methods for protein conformational spaces noted that nonlinear methods generally outperform Cartesian, linear ones, but that the complexity of the assumptions behind those models makes them difficult to work with and adapt [6,12].
This work presents a complete inference method derived from one single statistical hypothesis: that conformational states are defined by sets of contacting residues. Specifically, we hypothesize that the conformational state, k, uniquely determines which pairs of residues, (u, v), will be touching. Like a weighted coin flip, each contact is made with a state-dependent probability—independently from all the other contacting pairs. Each conformational state is thus characterized by a vector, $p_k$, encoding the set of contacting pairs in state k.
The statistical model derived from this problem statement is termed a Bernoulli mixture model for binary feature classification [13]. The problem setup is similar to the naïve Bayes method [14]. However, because the categories are not known in advance, this is an unsupervised learning and classification problem.
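As a concrete illustration, data from such a Bernoulli mixture can be generated as follows. This is a minimal sketch: the sizes K, M, N and all parameter values are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 3, 20, 100                      # categories, binary features, samples (illustrative sizes)
pi = rng.dirichlet(np.ones(K))            # category weights, sum to 1
p = rng.uniform(0.05, 0.95, size=(K, M))  # per-category contact probabilities

z = rng.choice(K, size=N, p=pi)           # latent category of each sample
x = (rng.uniform(size=(N, M)) < p[z]).astype(np.uint8)  # independent Bernoulli draws
```

Unsupervised inference then means recovering the weights, the contact probabilities, and the assignments z from the bit-matrix x alone.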
Bernoulli mixture models have been applied extensively to text subject analysis [15], optical character recognition [14], and image feature classification [13]. Essentially all of these applications succeeded in building highly accurate classification models. The latter work also presents a thorough summary of sampling methods.
However, there remain difficulties sampling the distribution over categories, z, especially when the number of categories and reference classifications are not known in advance. The well-known expectation-maximization (EM) algorithm [16] is available in principle, but is not a replacement for sampling. Theoretical work on the EM method [17] shows that redundant categories will result in many circumstances. In this work, we have introduced a prior that eliminates redundant categorizations.
This work is structured as follows. Section 2 presents the underlying probability distribution of categories, then outlines a novel method for quickly sampling parameter space—achieving category inference. Full technical details are presented in Appendix A.1 and Appendix A.2. Section 3 describes test problems on which the method is demonstrated. Section 4 presents results demonstrating that the method creates structurally meaningful categories with >90% accuracy. Although the potential application space is vast, this work focuses on proving method robustness using well-defined synthetic test problems. Each follows a time sequence mimicking domain motions in proteins—so that the classification accuracy can be judged by correctly assigning categories in time-order. For the practitioner interested in trying the method directly, full source code and scripts reproducing the test cases in this work are available under the GPLv3 license (Supplementary Materials or [18]).
2. Theory
A naïve Bayes model (for bit-strings) assumes that structural input samples, $x_i$, are generated by first selecting a conformational state, $z_i = k$, with probability $\pi_k$, and then independently deciding whether each point-to-point contact, $j$, is made with probability $p_{kj}$. If contact $j$ is made in sample number $i$, then $x_{ij} = 1$. The bit-vector, $x_i$, is said to possess feature $j$. Otherwise $x_{ij} = 0$, and the feature is absent. The model parameters are thus $\theta = \{\pi, p\}$.
It leads to a sampling distribution,

$$P(x \mid \theta) = \prod_{i=1}^{N} \sum_{z_i=1}^{K} \pi_{z_i} \prod_{j=1}^{M} p_{z_i j}^{x_{ij}} (1 - p_{z_i j})^{1 - x_{ij}}$$
$$= \sum_{z} \prod_{k=1}^{K} \pi_k^{n_k} \prod_{j=1}^{M} p_{kj}^{m_{kj}} (1 - p_{kj})^{n_k - m_{kj}}.$$

The second line above notes that, once the categorization, z, is known, the sampling distribution is easy to express in terms of feature counts in $S_k$—the set of samples assigned to category k,

$$n_k = |S_k|, \qquad m_{kj} = \sum_{i \in S_k} x_{ij}.$$

The first is the number of samples in set k, and the second is the number of times each contact is seen in that set.
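These count statistics can be computed directly from a bit-matrix and an assignment vector; a short NumPy sketch (the names n and m follow the definitions above):

```python
import numpy as np

def count_stats(x, z, K):
    """Sufficient statistics for a Bernoulli mixture given a categorization z.

    n[k]    -- number of samples assigned to category k
    m[k, j] -- number of those samples in which contact j is present
    """
    N, M = x.shape
    n = np.bincount(z, minlength=K)
    m = np.zeros((K, M))
    for k in range(K):
        m[k] = x[z == k].sum(axis=0)
    return n, m

x = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1]])
z = np.array([0, 0, 1])
n, m = count_stats(x, z, K=2)
# n -> [2, 1]; m -> [[2, 1, 2], [0, 0, 1]]
```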
According to Bayes’ theorem [19], we can turn this around to predict two important things—the probability that sample i belongs to category k, $P(z_i = k \mid x, \theta)$ (read: z given x and $\theta$),

$$P(z_i = k \mid x_i, \theta) = \frac{\pi_k \prod_j p_{kj}^{x_{ij}} (1 - p_{kj})^{1 - x_{ij}}}{\sum_{k'} \pi_{k'} \prod_j p_{k'j}^{x_{ij}} (1 - p_{k'j})^{1 - x_{ij}}},$$
and also the probability distribution over all possible parameters,

$$P(\theta, z \mid x) = \frac{P(x, z \mid \theta)\, P(\theta)}{Z(x)},$$

where $Z(x)$ is an x-dependent normalization constant. Sampling this distribution provides everything—the categorizations, z, the conformational states, $p_k$, and even a predicted number of categories, K.
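The category-membership probability above can be evaluated stably in log-space; a minimal NumPy sketch of this step (the function name and the clipping constant eps are our own):

```python
import numpy as np

def responsibilities(x, pi, p, eps=1e-12):
    """P(z_i = k | x_i, theta) for a Bernoulli mixture, computed in log-space."""
    log_on = np.log(p + eps)            # log p_kj
    log_off = np.log(1.0 - p + eps)     # log (1 - p_kj)
    # log pi_k + sum_j [ x_ij log p_kj + (1 - x_ij) log (1 - p_kj) ]
    logw = np.log(pi) + x @ log_on.T + (1 - x) @ log_off.T   # shape (N, K)
    logw -= logw.max(axis=1, keepdims=True)                  # avoid underflow
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)

x = np.array([[1, 1, 1], [0, 0, 0]])
pi = np.array([0.5, 0.5])
p = np.array([[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]])
w = responsibilities(x, pi, p)
# the all-ones sample lands overwhelmingly in the high-contact category
```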
In Bayesian probability, a prior distribution has to be assumed by the researcher. The prior characterizes the parameter space, independently from any sampled data. Our prior distribution over parameters, introduced below, is $P_U(\theta)$. Since the parameters directly determine the sampling distribution, the prior does not affect it. Note that this work juggles between two different priors, I and U, because the inference problem is simpler using $P_I$, but $P_U$ eliminates redundant solutions.
We choose a prior probability,

$$P_U(\theta) = P_I(\theta) \prod_{k < k'} \left[ 1 - BC(p_k, p_{k'}) \right],$$

that enforces uniqueness of the categories. Here $BC(p, q)$ is the Bhattacharyya similarity between distributions p and q,

$$BC(p, q) = \prod_{j=1}^{M} \left[ \sqrt{p_j q_j} + \sqrt{(1 - p_j)(1 - q_j)} \right].$$

If two categories share the same distribution, then $BC(p_k, p_{k'}) = 1$, and $P_U(\theta) = 0$. This forces our estimator to return zero likelihood that $p_k = p_{k'}$ for any $k' \ne k$.
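Under the independent-contact assumption, the Bhattacharyya similarity factorizes over features; a small sketch of this reading (the function name is our own):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya similarity of two independent-Bernoulli feature vectors.
    Each feature contributes sqrt(p_j q_j) + sqrt((1-p_j)(1-q_j)) <= 1,
    with equality only when p_j == q_j, so the product is 1 iff p == q."""
    return float(np.prod(np.sqrt(p * q) + np.sqrt((1 - p) * (1 - q))))

p = np.array([0.2, 0.8, 0.5])
q = np.array([0.9, 0.1, 0.5])
```

The uniqueness prior then multiplies in a factor of 1 − BC for each pair of categories, vanishing whenever two categories coincide.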
The second part of Equation (7), $P_I(\theta)$, is a conventional prior used for Bernoulli mixture inference in the literature. We use the same hyperparameter values throughout this work. Appendix A.1 contains a detailed justification for this choice. Essentially, it forces the likelihood for dividing a category into two parts to be asymptotically insensitive to the number of features, M. A proof of this fact, as well as a useful connection to the information entropy of compression, is also presented in Appendix A.1.
Sampling Method
Our sampling method is traditional Markov chain Monte Carlo using four types of moves: a recategorization move, where the categories, z, are assigned according to $P(z_i = k \mid x, \theta)$; a reclassification move, where a trial $\theta'$ is sampled from a simplified distribution, $\tilde P(\theta \mid x, z)$, and accepted with probability $\min(1, P_U(\theta')\tilde P(\theta) / [P_U(\theta)\tilde P(\theta')])$; and one split and one join rule. The function, $\tilde P$, referred to here is just $P_U$ without the Bhattacharyya distance terms, and with a different constant prefactor.
For the split trial move, one of the categories, k, is split into two new categories. Every member of k is re-assigned into one of two new sets, labeled L or R. Join moves are the opposite of split moves. This re-categorization changes the set labels, z. Specifically, $z_i$ goes from k to L or R for every i formerly assigned to k.
Generating split or join trial moves was done by randomly choosing either one category to split or two categories to join. For splits, member i of category k is moved to set L with a fixed probability. We used an even split, but any probability in (0, 1) should work in principle. If all elements end up in L or R, the partitioning is re-done. To concentrate splitting on productive cases, we did not attempt to split categories that were too small. Immediately after splitting or joining categories, a reclassification move (re-assigning $\theta$) was performed. Category split moves were accepted using the Metropolis criterion, which is the smaller of the acceptance ratio or one. Explicit formulas for the move generation probabilities are provided in Appendix A.2.
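A minimal sketch of the split-proposal step described above (the function and its probability parameter q are our own names; the Metropolis acceptance test is omitted):

```python
import numpy as np

def propose_split(members, rng, q=0.5):
    """Partition the members of one category into two non-empty sets, L and R.
    Each member lands in L with probability q; degenerate partitions
    (everything in L or everything in R) are simply re-drawn."""
    members = np.asarray(members)
    while True:
        to_L = rng.uniform(size=len(members)) < q
        if 0 < to_L.sum() < len(members):
            return members[to_L], members[~to_L]

rng = np.random.default_rng(1)
L, R = propose_split(np.arange(10), rng)
```

The corresponding join move simply concatenates the members of two categories back into one.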
Figure 1 provides a graphical summary of this inference scheme. Each conformational sample is mapped to a bit-string, which is used as the basis for inferring $\theta$. Inference proceeds by sampling potential parameters until a good explanation for the data is found. Trial moves that re-categorize and update $\theta$ look horizontally to find better category prototypes. Trial moves that split categories based on the presence or absence of some features allow us to traverse category space vertically.
3. Test Systems
The ability of our sampling procedure to predict categories was tested on three geometrical systems: ‘chomp’ (a closing angle), ‘heli’ (a rotating line), and ‘glob’ (three rotating spheres). Each system was generated as a time-series of 1000 frames for approximately P total particles in 2- or 3-dimensional space. After generation, Gaussian random noise of fixed width was added to every degree of freedom.
Figure 2 shows images of these three test trajectories. A complete description of the coordinate generation methods is presented in Appendix A.3. All three systems were processed into binary feature data by calculating pairwise distances between all points. Pairs of points within 2 distance units were translated to 1 (representing contact). When forming the feature vectors, x, we removed features, j, for which every sample showed the same result (all contacting or all disconnected). It is important to note that this removal changes the M seen by the algorithm—usually decreasing it well below $P(P-1)/2$.
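The featurization step can be sketched as follows (a simplified version; the array shapes and function name are illustrative):

```python
import numpy as np

def contact_features(traj, cutoff=2.0):
    """Map a trajectory (N frames, P points, D dims) to binary contact features.
    Features that never vary (always contacting or always disconnected) are
    removed, so the effective M falls below P*(P-1)/2."""
    N, P, _ = traj.shape
    iu, ju = np.triu_indices(P, k=1)                        # all point pairs
    d = np.linalg.norm(traj[:, iu] - traj[:, ju], axis=-1)  # (N, P*(P-1)/2)
    x = (d < cutoff).astype(np.uint8)
    keep = x.min(axis=0) != x.max(axis=0)                   # varying features only
    return x[:, keep]

# two frames of three 2-D points; only two of the three pairs ever change state
traj = np.array([[[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]],
                 [[0.0, 0.0], [3.5, 0.0], [5.0, 0.0]]])
x = contact_features(traj)   # shape (2, 2)
```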
Critically, we repeated these classifications for a range of material points, P. This tested the robustness of the unsupervised classification with respect to the number of features available. Adding more points without changing the geometry of the problem should not change the number of categories detected.
For each run, five independent MCMC chains were started from an assignment of all points to a single category. Each chain ran for 1000 steps. Samples were collected every 10 steps—starting from step 500. The acceptance probabilities for category split/join Monte Carlo moves varied around 10–13%.
4. Results
We analyzed the results of MC in two different ways. First, the categories assigned were tested for grouping in time. Since the contact lists (on which the categorization is based) varied slowly over time, we expect categories to come in ‘runs’. Second, we computed histograms over the number of categories, K. This is a strong test of the method’s sensitivity to the number of material points, P.
As expected, we found a high degree of correlation between categories and time for every case. Similar time-points were grouped into similar categories. To quantify these results, we counted transitions between category indices in time-order. For a perfect categorization, the number of transitions should equal the number of categories minus one. We computed the categorization accuracy in two ways. For each system, the left columns of Table 1 and Table 2 show 100 minus the percentage of mis-categorized frames. For lossy categorization at time-boundaries, we expect oscillation between two category values. We quantified this by forming a transition matrix between categories and removing transitions along the ‘most likely path’. Excluding this boundary oscillation, we found nearly 100% accuracy for the categorizations. Those results are shown in the right columns of Table 1 and Table 2.
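Our reading of this time-order criterion can be sketched as a simple transition count (the function name is ours):

```python
import numpy as np

def transition_count(z):
    """Number of category changes along the time-ordered assignment z.
    A perfect categorization of a monotone motion yields K - 1 transitions;
    boundary oscillation between two categories inflates the count."""
    z = np.asarray(z)
    return int(np.sum(z[1:] != z[:-1]))
```

For example, [0, 0, 1, 1, 2] gives 2 transitions (equal to K − 1 for K = 3 categories), while a noisy boundary like [0, 1, 0, 1] gives 3.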
Integrating the posterior probability (Equation (2) times Equation (7)) over $\theta$ leads to combinatorial factors, which seem prohibitively costly as N increases. We therefore wanted to check that the number of categories does not decrease as features or samples are added.
Figure 3 shows the sampled probability distributions over K, the number of categories, for increasing P (Figure 3a) and N (Figure 3b). Interestingly, for every system, about five conformational states were deduced at every P. As N increased, however, more categories were deduced and the distribution spread to higher numbers. This is probably reasonable, since more values of the ‘time’ coordinate generated a more fine-grained motion.
Adenylate Kinase Open/Closed Transition
Finally, we tested the classification method against a protein with a well-characterized conformational transition. Adenylate kinase (ADK) interconverts ATP, ADP, and AMP by closing around its substrate molecules [20]. The transition from closed to open was simulated in Reference [5] using steered molecular dynamics on a reaction coordinate interpolating between the electron density maps of PDB IDs 1AKE (Reference [20], closed) and 4AKE (Reference [2], open). The simulation data we used did not contain ligands, but did contain water and ions. Our analysis only made use of the alpha carbon (Cα) positions.
Features were calculated for each of the equally spaced frames sampled during steered dynamics by testing whether Cα-to-Cα distances were less than 5 Å. Sampling was carried out as described in Section 3, but 8 independent MC chains were sampled for 1250 steps (instead of 5 for 1000). The acceptance probability of split/join moves was 17%. During sampling, we saved the parameters, $\theta$, possessing maximum likelihood values at each K.
Our implementation [18] is parallelized so that each thread carries out an independent Monte Carlo chain. The run-time for the ADK example was less than 30 min. Larger runs on proteins up to 400 residues using 75,000 sampled protein conformations have been carried out with this method in under an hour. The implementation also contains a series of visualization tools to highlight the protein regions responsible for detected conformation-to-conformation differences.
We then extracted the conformational states with the highest probability of landing in each category as representative points for that category. Since the reference open and closed PDB conformations formed extreme points, our representative structures approached them more nearly as K increased. Figure 4 shows that the two end-point conformations ended up very close to the open and closed states from the PDB.
5. Conclusions
The method developed here is ideally suited for the unsupervised structural classification problem. It has been derived from a first-principles Bayesian analysis of the set of atoms which interact within a structure. This work solved the central problem of defining an appropriate prior distribution over parameter space and implementing an efficient sampling method.
Tests on sample conformational transitions identified more categories than naïvely expected, because the method generated milestones along the motion’s time-coordinate. However, all categorizations were shown to have excellent accuracy, as judged by picking out the correct time-sequence. Finally, a test on adenylate kinase verified that these conclusions generalize readily to protein motions.
The central results of this work showed that the method behaves well under large variations in the number of features and samples. These observations validate our choice of the prior distribution, since changes in the prior would have large effects on the distribution over the number of categories, K. They also show the robustness of the MC sampling method itself, since relatively high acceptance rates were achieved.
Many future applications and further developments of this method are possible. Changes in the pair contacts between states could be analyzed more thoroughly, as in the DynDom method [7]. Probabilities of assignment to each conformational state can be used as reaction coordinates. We are presently applying the method to classify chemical compound space using binary MACCS fingerprints [21], and to characterize the conformational space of SARS-CoV-2 proteins simulated using replica exchange molecular dynamics [22].
Now that the principle has been demonstrated, more informative classification schemes can be devised. Adding information like hydrogen bonding, salt bridge formation, and secondary structure annotation will allow the Bayesian framework to recognize categories more like a biochemist would. The method can also be focused on active domains and binding sites by adding more points and shorter distance cutoffs for key residues. The insensitivity to M shown in this work provides a high degree of confidence that additional data will improve the overall categorization without spoiling the classification already achieved using coarser-level data.