1. Introduction
The conventional description of protein dynamics asserts that proteins possess intrinsic conformational states [1]. An enzyme may cycle between catalytic and open states [2]. An ion channel may open and close its central pore [3]. A chaperone protein assists the transformation of large, hydrophobic proteins from initial, linear shapes to folded ones [4]. X-ray crystallography and cryo-electron microscopy reveal conformations with small motions on the 1–5 Ångstrom level for those proteins that crystallize [5]. Neutron scattering and nuclear magnetic resonance structures of room-temperature proteins show greater shape variability, but are usually able to classify structures into a few ‘canonical’ structures.
Advances in molecular modeling have made it possible to simulate the protein folding process, generating very large numbers of samples, free energy landscapes, and information on kinetics. Nevertheless, computational identification of distinct conformational states from molecular simulations has remained an active area of methodological research. Approaches fall roughly into two classes—linear and nonlinear. Linear methods work with the real 3N-dimensional vector of atomic coordinates as a linear space, using pairwise alignments and Euclidean (RMSD) distances between structures. Nonlinear methods work with ‘features’ derived from nonlinear functions of the atomic coordinates—like pairwise distances between alpha-carbons. Well-known linear techniques include principal component analysis and linkage clustering based on pair-RMSDs. Nonlinear techniques include methods based on internal coordinates (relative distances and angles between groups of atoms) and t-distributed stochastic neighbor embedding [6].
Both linear and nonlinear techniques can be applied to subsets or groups of atoms to classify site-specific or inter-domain structures, respectively. This can be done trivially by handing off different point-sets to the analysis. Thus, most methods can be adapted to run with reasonable compute time and yield classifications most relevant to a motion under investigation. Automatically detecting protein secondary and tertiary structures, however, remains a challenging problem which we cannot address here.
One of the principal methods developed for visualizing domain motions in proteins is DynDom [7,8]. DynDom works from two input structures and determines relative domain rotations (using internal coordinates). This allows predicting transition motions from experimentally observed conformers, but not conformations from observed motions. On the other hand, many structure-to-structure similarity classification methods have emerged for this problem [9,10,11]. A recent review of dimensionality reduction methods for protein conformational spaces noted that nonlinear methods generally outperform Cartesian, linear ones, but that the complexity of the assumptions behind those models makes them difficult to work with and adapt [6,12].
This work presents a complete inference method derived from one single statistical hypothesis: that conformational states are defined by sets of contacting residues. Specifically, we hypothesize that the conformational state, k, uniquely determines which pairs of residues, (u, v), will be touching. Like a weighted coin flip, each contact is made with a state-dependent probability—independently from all the other contacting pairs. Each conformational state is thus characterized by a vector, $p_k$, encoding the set of contacting pairs in state k.
The statistical model derived from this problem statement is termed a Bernoulli mixture model for binary feature classification [13]. The problem setup is similar to the naïve Bayes method [14]. However, because the categories are not known in advance, this is an unsupervised learning and classification problem.
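As a concrete illustration, data from such a Bernoulli mixture can be generated as follows. This is a minimal sketch: the sizes K, M, N and all parameter values are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 3, 20, 100                      # categories, binary features, samples (illustrative sizes)
pi = rng.dirichlet(np.ones(K))            # category weights, sum to 1
p = rng.uniform(0.05, 0.95, size=(K, M))  # per-category contact probabilities

z = rng.choice(K, size=N, p=pi)           # latent category of each sample
x = (rng.uniform(size=(N, M)) < p[z]).astype(np.uint8)  # independent Bernoulli draws
```

Unsupervised inference then means recovering the weights, the contact probabilities, and the assignments z from the bit-matrix x alone.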
Bernoulli mixture models have been applied extensively to text subject analysis [15], optical character recognition [14], and image feature classification [13]. Essentially all of these applications succeeded in building highly accurate classification models. The latter work also presents a thorough summary of sampling methods.
However, there remain difficulties sampling the distribution over categories, z, especially when the number of categories and reference classifications are not known in advance. The well-known expectation-maximization (EM) algorithm [16] is available in principle, but is not a replacement for sampling. Theoretical work on the EM method [17] shows that redundant categories will result in many circumstances. In this work, we have introduced a prior that eliminates redundant categorizations.
This work is structured as follows. Section 2 presents the underlying probability distribution of categories, then outlines a novel method for quickly sampling parameter space—achieving category inference. Full technical details are presented in Appendix A.1 and Appendix A.2. Section 3 describes test problems on which the method is demonstrated. Section 4 presents results demonstrating that the method creates structurally meaningful categories with >90% accuracy. Although the potential application space is vast, this work focuses on proving method robustness using well-defined synthetic test problems. Each follows a time sequence mimicking domain motions in proteins—so that the classification accuracy can be judged by correctly assigning categories in time-order. For the practitioner interested in trying the method directly, full source code and scripts reproducing the test cases in this work are available under the GPLv3 license (Supplementary Materials or [18]).
2. Theory
A naïve Bayes model (for bit-strings) assumes that structural input samples, $x_i$, are generated by first selecting a conformational state, $z_i = k$, with probability $\pi_k$, and then independently deciding whether each point-to-point contact, $j$, is made with probability $p_{kj}$. If contact $j$ is made in sample number $i$, then $x_{ij} = 1$. The bit-vector, $x_i$, is said to possess feature $j$. Otherwise $x_{ij} = 0$, and the feature is absent. The model parameters are thus $\theta = \{\pi, p\}$.
It leads to a sampling distribution,

$$P(x \mid \theta) = \prod_{i=1}^{N} \sum_{z_i=1}^{K} \pi_{z_i} \prod_{j=1}^{M} p_{z_i j}^{x_{ij}} (1 - p_{z_i j})^{1 - x_{ij}}$$
$$= \sum_{z} \prod_{k=1}^{K} \pi_k^{n_k} \prod_{j=1}^{M} p_{kj}^{m_{kj}} (1 - p_{kj})^{n_k - m_{kj}}.$$

The second line above notes that, once the categorization, z, is known, the sampling distribution is easy to express in terms of feature counts in $S_k$—the set of samples assigned to category k,

$$n_k = |S_k|, \qquad m_{kj} = \sum_{i \in S_k} x_{ij}.$$

The first is the number of samples in set k, and the second is the number of times each contact is seen in that set.
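These count statistics can be computed directly from a bit-matrix and an assignment vector; a short NumPy sketch (the names n and m follow the definitions above):

```python
import numpy as np

def count_stats(x, z, K):
    """Sufficient statistics for a Bernoulli mixture given a categorization z.

    n[k]    -- number of samples assigned to category k
    m[k, j] -- number of those samples in which contact j is present
    """
    N, M = x.shape
    n = np.bincount(z, minlength=K)
    m = np.zeros((K, M))
    for k in range(K):
        m[k] = x[z == k].sum(axis=0)
    return n, m

x = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1]])
z = np.array([0, 0, 1])
n, m = count_stats(x, z, K=2)
# n -> [2, 1]; m -> [[2, 1, 2], [0, 0, 1]]
```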
According to Bayes’ theorem [19], we can turn this around to predict two important things—the probability that sample i belongs to category k, $P(z_i = k \mid x, \theta)$ (read: z given x and $\theta$),

$$P(z_i = k \mid x_i, \theta) = \frac{\pi_k \prod_j p_{kj}^{x_{ij}} (1 - p_{kj})^{1 - x_{ij}}}{\sum_{k'} \pi_{k'} \prod_j p_{k'j}^{x_{ij}} (1 - p_{k'j})^{1 - x_{ij}}},$$
and also the probability distribution over all possible parameters,

$$P(\theta, z \mid x) = \frac{P(x, z \mid \theta)\, P(\theta)}{Z(x)},$$

where $Z(x)$ is an x-dependent normalization constant. Sampling this distribution provides everything—the categorizations, z, the conformational states, $p_k$, and even a predicted number of categories, K.
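The category-membership probability above can be evaluated stably in log-space; a minimal NumPy sketch of this step (the function name and the clipping constant eps are our own):

```python
import numpy as np

def responsibilities(x, pi, p, eps=1e-12):
    """P(z_i = k | x_i, theta) for a Bernoulli mixture, computed in log-space."""
    log_on = np.log(p + eps)            # log p_kj
    log_off = np.log(1.0 - p + eps)     # log (1 - p_kj)
    # log pi_k + sum_j [ x_ij log p_kj + (1 - x_ij) log (1 - p_kj) ]
    logw = np.log(pi) + x @ log_on.T + (1 - x) @ log_off.T   # shape (N, K)
    logw -= logw.max(axis=1, keepdims=True)                  # avoid underflow
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)

x = np.array([[1, 1, 1], [0, 0, 0]])
pi = np.array([0.5, 0.5])
p = np.array([[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]])
w = responsibilities(x, pi, p)
# the all-ones sample lands overwhelmingly in the high-contact category
```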
In Bayesian probability, a prior distribution has to be assumed by the researcher. The prior characterizes the parameter space, independently from any sampled data. Our prior distribution over parameters, introduced below, is $P_U(\theta)$. Since the parameters directly determine the sampling distribution, the prior does not affect it. Note that this work juggles between two different priors, I and U, because the inference problem is simpler using $P_I$, but $P_U$ eliminates redundant solutions.
We choose a prior probability,

$$P_U(\theta) = P_I(\theta) \prod_{k < k'} \left[ 1 - BC(p_k, p_{k'}) \right],$$

that enforces uniqueness of the categories. Here $BC(p, q)$ is the Bhattacharyya similarity between distributions p and q,

$$BC(p, q) = \prod_{j=1}^{M} \left[ \sqrt{p_j q_j} + \sqrt{(1 - p_j)(1 - q_j)} \right].$$

If two categories share the same distribution, then $BC(p_k, p_{k'}) = 1$, and $P_U(\theta) = 0$. This forces our estimator to return zero likelihood that $p_k = p_{k'}$ for any $k' \ne k$.
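Under the independent-contact assumption, the Bhattacharyya similarity factorizes over features; a small sketch of this reading (the function name is our own):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya similarity of two independent-Bernoulli feature vectors.
    Each feature contributes sqrt(p_j q_j) + sqrt((1-p_j)(1-q_j)) <= 1,
    with equality only when p_j == q_j, so the product is 1 iff p == q."""
    return float(np.prod(np.sqrt(p * q) + np.sqrt((1 - p) * (1 - q))))

p = np.array([0.2, 0.8, 0.5])
q = np.array([0.9, 0.1, 0.5])
```

The uniqueness prior then multiplies in a factor of 1 − BC for each pair of categories, vanishing whenever two categories coincide.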
The second part of Equation (7), $P_I(\theta)$, is a conventional prior used for Bernoulli mixture inference in the literature. We use the same hyperparameter values throughout this work. Appendix A.1 contains a detailed justification for this choice. Essentially, it forces the likelihood for dividing a category into two parts to be asymptotically insensitive to the number of features, M. A proof of this fact, as well as a useful connection to the information entropy of compression, is also presented in Appendix A.1.
Sampling Method
Our sampling method is traditional Markov chain Monte Carlo using four types of moves: a recategorization move, where the categories, z, are assigned according to $P(z_i = k \mid x, \theta)$; a reclassification move, where a trial $\theta'$ is sampled from a simplified distribution, $\tilde P(\theta \mid x, z)$, and accepted with probability $\min(1, P_U(\theta')\tilde P(\theta) / [P_U(\theta)\tilde P(\theta')])$; and one split and one join rule. The function, $\tilde P$, referred to here is just $P_U$ without the Bhattacharyya distance terms, and with a different constant prefactor.
For the split trial move, one of the categories, k, is split into two new categories. Every member of k is re-assigned into one of two new sets, labeled L or R. Join moves are the opposite of split moves. This re-categorization changes the set labels, z. Specifically, $z_i$ goes from k to L or R for every i formerly assigned to k.
Generating split or join trial moves was done by randomly choosing either one category to split or two categories to join. For splits, member i of category k is moved to set L with a fixed probability. We used an even split, but any probability in (0, 1) should work in principle. If all elements end up in L or R, the partitioning is re-done. To concentrate splitting on productive cases, we did not attempt to split categories that were too small. Immediately after splitting or joining categories, a reclassification move (re-assigning $\theta$) was performed. Category split moves were accepted using the Metropolis criterion, which is the smaller of the acceptance ratio or one. Explicit formulas for the move generation probabilities are provided in Appendix A.2.
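A minimal sketch of the split-proposal step described above (the function and its probability parameter q are our own names; the Metropolis acceptance test is omitted):

```python
import numpy as np

def propose_split(members, rng, q=0.5):
    """Partition the members of one category into two non-empty sets, L and R.
    Each member lands in L with probability q; degenerate partitions
    (everything in L or everything in R) are simply re-drawn."""
    members = np.asarray(members)
    while True:
        to_L = rng.uniform(size=len(members)) < q
        if 0 < to_L.sum() < len(members):
            return members[to_L], members[~to_L]

rng = np.random.default_rng(1)
L, R = propose_split(np.arange(10), rng)
```

The corresponding join move simply concatenates the members of two categories back into one.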
Figure 1 provides a graphical summary of this inference scheme. Each conformational sample is mapped to a bit-string, which is used as the basis for inferring $\theta$. Inference proceeds by sampling potential parameters until a good explanation for the data is found. Trial moves that re-categorize and update $\theta$ look horizontally to find better category prototypes. Trial moves that split categories based on the presence or absence of some features allow us to traverse category space vertically.
3. Test Systems
The ability of our sampling procedure to predict categories was tested on three geometrical systems: ‘chomp’ (a closing angle), ‘heli’ (a rotating line), and ‘glob’ (three rotating spheres). Each system was generated as a time-series of 1000 frames for approximately P total particles in 2- or 3-dimensional space. After generation, Gaussian random noise of fixed width was added to every degree of freedom.
Figure 2 shows images of these three test trajectories. A complete description of the coordinate generation methods is presented in Appendix A.3. All three systems were processed into binary feature data by calculating pairwise distances between all points. Pairs of points within 2 distance units were translated to 1 (representing contact). When forming the feature vectors, x, we removed features, j, for which every sample showed the same result (all contacting or all disconnected). It is important to note that this removal changes the M seen by the algorithm—usually decreasing it well below $P(P-1)/2$.
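The featurization step can be sketched as follows (a simplified version; the array shapes and function name are illustrative):

```python
import numpy as np

def contact_features(traj, cutoff=2.0):
    """Map a trajectory (N frames, P points, D dims) to binary contact features.
    Features that never vary (always contacting or always disconnected) are
    removed, so the effective M falls below P*(P-1)/2."""
    N, P, _ = traj.shape
    iu, ju = np.triu_indices(P, k=1)                        # all point pairs
    d = np.linalg.norm(traj[:, iu] - traj[:, ju], axis=-1)  # (N, P*(P-1)/2)
    x = (d < cutoff).astype(np.uint8)
    keep = x.min(axis=0) != x.max(axis=0)                   # varying features only
    return x[:, keep]

# two frames of three 2-D points; only two of the three pairs ever change state
traj = np.array([[[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]],
                 [[0.0, 0.0], [3.5, 0.0], [5.0, 0.0]]])
x = contact_features(traj)   # shape (2, 2)
```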
Critically, we repeated these classifications for a range of material points, P. This tested the robustness of the unsupervised classification with respect to the number of features available. Adding more points without changing the geometry of the problem should not change the number of categories detected.
For each run, five independent MCMC chains were started from an assignment of all points to a single category. Each chain ran for 1000 steps. Samples were collected every 10 steps—starting from step 500. The acceptance probabilities for category split/join Monte Carlo moves varied around 10–13%.
4. Results
We analyzed the results of MC in two different ways. First, the categories assigned were tested for grouping in time. Since the contact lists (on which the categorization is based) varied slowly over time, we expect categories to come in ‘runs’. Second, we computed histograms over the number of categories, K. This is a strong test of the method’s sensitivity to the number of material points, P.
As expected, we found a high degree of correlation between categories and time for every case. Similar time-points were grouped into similar categories. To quantify these results, we counted transitions between category indices in time-order. For a perfect categorization, the number of transitions should equal the number of categories minus one. We computed the categorization accuracy in two ways. For each system, the left columns of Table 1 and Table 2 show 100 minus the percentage of mis-categorized frames. For lossy categorization at time-boundaries, we expect oscillation between two category values. We quantified this by forming a transition matrix between categories and removing transitions along the ‘most likely path’. Excluding this boundary oscillation, we found nearly 100% accuracy for the categorizations. Those results are shown in the right columns of Table 1 and Table 2.
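Our reading of this time-order criterion can be sketched as a simple transition count (the function name is ours):

```python
import numpy as np

def transition_count(z):
    """Number of category changes along the time-ordered assignment z.
    A perfect categorization of a monotone motion yields K - 1 transitions;
    boundary oscillation between two categories inflates the count."""
    z = np.asarray(z)
    return int(np.sum(z[1:] != z[:-1]))
```

For example, [0, 0, 1, 1, 2] gives 2 transitions (equal to K − 1 for K = 3 categories), while a noisy boundary like [0, 1, 0, 1] gives 3.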
Integrating the posterior probability (Equation (2) times Equation (7)) over $\theta$ leads to combinatorial factors, which seem prohibitively costly as N increases. We therefore wanted to check that the number of categories does not decrease as features or samples are added.
Figure 3 shows the sampled probability distributions over K, the number of categories, for increasing P (Figure 3a) and N (Figure 3b). Interestingly, for every system, about five conformational states were deduced at every P. As N increased, however, more categories were deduced and the distribution spread to higher numbers. This is probably reasonable, since more values of the ‘time’ coordinate generated a more fine-grained motion.
Adenylate Kinase Open/Closed Transition
Finally, we tested the classification method against a protein with a well-characterized conformational transition. Adenylate kinase (ADK) interconverts ATP, ADP, and AMP by closing around its substrate molecules [20]. The transition from closed to open was simulated in Reference [5] using steered molecular dynamics on a reaction coordinate interpolating between the electron density maps of PDB IDs 1AKE (Reference [20], closed) and 4AKE (Reference [2], open). The simulation data we used did not contain ligands, but did contain water and ions. Our analysis only made use of the alpha carbon (Cα) positions.
Features were calculated for each of the equally spaced frames sampled during steered dynamics by testing whether Cα-to-Cα distances were less than 5 Å. Sampling was carried out as described in Section 3, but 8 independent MC chains were sampled for 1250 steps (instead of 5 for 1000). The acceptance probability of split/join moves was 17%. During sampling, we saved the parameters, $\theta$, possessing maximum likelihood values at each K.
Our implementation [18] is parallelized so that each thread carries out an independent Monte Carlo chain. The run-time for the ADK example was less than 30 min. Larger runs on proteins up to 400 residues using 75,000 sampled protein conformations have been carried out with this method in under an hour. The implementation also contains a series of visualization tools to highlight the protein regions responsible for detected conformation-to-conformation differences.
We then extracted the conformational states with the highest probability of landing in each category as representative points for that category. Since the reference open and closed PDB conformations formed extreme points, our representative structures approached them more nearly as K increased. Figure 4 shows that the two end-point conformations ended up very close to the open and closed states from the PDB.
5. Conclusions
The method developed here is ideally suited for the unsupervised structural classification problem. It has been derived from a first-principles Bayesian analysis of the set of atoms which interact within a structure. This work solved the central problem of defining an appropriate prior distribution over parameter space and implementing an efficient sampling method.
Tests on sample conformational transitions identified more categories than naïvely expected, because the method generated milestones along the motion’s time-coordinate. However, all categorizations were shown to have excellent accuracy, as judged by picking out the correct time-sequence. Finally, a test on adenylate kinase verified that these conclusions generalize readily to protein motions.
The central results of this work showed that the method behaves well under large variations in the number of features and samples. These observations validate our choice of the prior distribution, since changes in the prior would have large effects on the distribution over the number of categories, K. They also show the robustness of the MC sampling method itself, since relatively high acceptance rates were achieved.
Many future applications and further developments of this method are possible. Changes in the pair contacts between states could be analyzed more thoroughly, as in the DynDom method [7]. Probabilities of assignment to each conformational state can be used as reaction coordinates. We are presently applying the method to classify chemical compound space using binary MACCS fingerprints [21], and to characterize the conformational space of SARS-CoV-2 proteins simulated using replica exchange molecular dynamics [22].
Now that the principle has been demonstrated, more informative classification schemes can be devised. Adding information like hydrogen bonding, salt bridge formation, and secondary structure annotation will allow the Bayesian framework to recognize categories more like a biochemist would. The method can also be focused on active domains and binding sites by adding more points and shorter distance cutoffs for key residues. The insensitivity to M shown in this work provides a high degree of confidence that additional data will improve the overall categorization without spoiling the classification already achieved using coarser-level data.