
BEES: Bayesian Ensemble Estimation from SAS

2018


bioRxiv preprint doi: https://doi.org/10.1101/400168. This version posted February 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.

Samuel Bowerman1,2, Joseph E. Curtis3, Joseph Clayton1, Emre H. Brookes4, and Jeff Wereszczynski1

1 Department of Physics and the Center for Molecular Study of Condensed Soft Matter, Illinois Institute of Technology, Chicago, IL 60616
2 Current address: Department of Biochemistry and Howard Hughes Medical Institute, University of Colorado Boulder, Boulder, CO 80309
3 NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899
4 University of Texas Health Science Center, San Antonio, TX 78229

February 4, 2019

1 Abstract

Many biomolecular complexes exist in a flexible ensemble of states in solution which are necessary to perform their biological function. Small angle scattering (SAS) measurements are a popular method for characterizing these flexible molecules due to their relative ease of use and ability to simultaneously probe the full ensemble of states. However, SAS data is typically low-dimensional and difficult to interpret without the assistance of additional structural models. In theory, experimental SAS curves can be reconstituted from a linear combination of theoretical models, although this procedure carries significant risk of overfitting the inherently low-dimensional SAS data. Previously, we developed a Bayesian-based method for fitting ensembles of model structures to experimental SAS data that rigorously avoids overfitting.
However, we have found that these methods can be difficult to incorporate into typical SAS modeling workflows, especially for users who are not experts in computational modeling. To this end, we present the “Bayesian Ensemble Estimation from SAS” (BEES) program. Two forks of BEES are available: the primary one exists as a module for the SASSIE webserver, and a developmental version is a standalone Python program. BEES allows users to exhaustively sample ensemble models constructed from a library of theoretical states and to interactively analyze and compare each model’s performance. The fitting routine also allows for secondary data sets to be supplied, thereby simultaneously fitting models to both SAS data and orthogonal information. The flexible ensemble of K63-linked ubiquitin trimers is presented as an example of BEES’ capabilities.

2 Introduction

Biological molecules rely heavily on their conformational dynamics to conduct their cellular function, and the characterization of these flexible ensembles of states remains a key challenge in modern biophysics [1]. As a result, many different experimental and computational techniques have been developed to probe and model configurational ensembles. Of these, small angle scattering (SAS) measurements are an increasingly popular technique due to their relative ease of use and ability to simultaneously probe the full solution ensemble [2,3]. Moreover, SAS measurements are able to probe systems at room temperature, free from packing forces induced by the lattice and cryogenic effects of crystallography, and they can measure the solution of states in both equilibrium ensembles and time-dependent processes [4], such as protein and RNA folding [5,6], or the allosteric coupling of enzymatic activity and large-scale domain movement [7,8].
However, the low-dimensional nature of SAS data often makes the interpretation of scattering profiles difficult, and reconstituting a three-dimensional molecular structure solely from scattering curves can be misleading, as multiple reconstitutions of varying shapes may result from the same scattering profile. In contrast, model structures can also be identified from all-atom or coarse-grained simulations, and their calculated scattering profiles can be compared against empirical curves [9–12]. Since SAS profiles are measurements of the full solution ensemble and therefore may not be fully described by a single structural state, these in silico profiles can also serve as a basis set to construct an ensemble model through a linear combination of states [13–16]. While this ensemble reconstitution approach is conceptually straightforward, in practice it can be quite difficult to identify the “best” ensemble model. For instance, it is not known a priori what the number of underlying states should be in the ensemble. It is also possible for ensemble models to overfit experimental data through the inclusion of too many underlying populations. Furthermore, altogether different combinations of states may yield similarly performing models with respect to their goodness-of-fit values. For these reasons, a Bayesian-based approach has many advantages over more traditional methods. For instance, Markov Chain Monte Carlo posterior sampling methods will not only estimate model parameters but will also allow for the direct assessment of their errors [17]. Moreover, Bayesian formalism allows for the comparison of a population of models as a solution to parameterization, rather than only identifying a single set of parameters [18–21]. This is exceptionally useful for SAS modeling, where information regarding the model is underdetermined.
However, the ability to construct a large population of solutions can also be a disadvantage, as both the computational resources needed to construct a complete array of model parameters and the tools for comparing models can be daunting for many systems.

To this end, we previously developed an iterative Bayesian method to use small angle scattering (SAS) profiles, either of x-rays (SAXS) or of neutrons (SANS), to re-weight the population of states from simulated models. This approach, which is an extension of the BSS-SAXS technique [13], compares solution ensembles of a variety of sub-ensembles from a combination of potential scattering states. Originally, we used this method to fit ensembles of covalently linked ubiquitin trimers, and we observed that the algorithm could produce ensemble models that robustly resisted overfitting [22]. Here, we present an update to this method as an open source program called “Bayesian Ensemble Estimation from SAS” (BEES, henceforth). Two versions of this code have been developed. The primary version is an open-access module on the SASSIE-web server (http://sassie-web.chem.utk.edu/sassie2/), which provides a graphical user interface for controlling the module [23,24]. The BEES-SASSIE module is designed for both new and experienced users of biophysical modeling, and, through SASSIE, it provides access to the computational resources required to calculate and analyze large combinations of states. The second, developmental, version is a stand-alone Python code that is designed to be run from the command line and is intended for experienced computational scientists.
We also provide two example use cases, one in which we fit profiles of K63-linked ubiquitin trimers to SAXS data alone and another in which we add a second data set to the fitting procedure.

3 Methods

3.1 BEES Algorithm

The BEES algorithm is designed to find the theoretical solution ensemble that uses the fewest number of populations to accurately describe the experimental data. This algorithm is briefly presented here (Fig 1), but further details can be found in the supplemental text and elsewhere [22]. In short, experimental data are gathered and post-processed prior to using the BEES module. For example, users may wish to screen their data for low-q beam smearing effects or to extrapolate their scattering profile to I(0). A collection of theoretical profiles for candidate solution states is also input to BEES; these profiles can be computed by standalone programs such as Crysol [25] or FoXS [26], in SASSIE via the “SasCalc” module [27], or with many other scattering prediction programs [28–31]. Once initiated, the BEES routine first determines the goodness-of-fit values of each individual profile. It then identifies all possible sub-bases containing combinations of two theoretical profiles, and it conducts a Bayesian Monte Carlo routine on each combination to identify the population of states in each sub-basis. Each Monte Carlo routine is conducted according to user-defined parameters: number of independent Monte Carlo parameter fittings per sub-basis, number of iterations per Monte Carlo fitting, and amount of population change per iteration. Notably, the BEES likelihood function (L) includes the ability to simultaneously fit the scattering profiles and an auxiliary set of measurements:

$$L = e^{-\chi^2_{\mathrm{total}}/2} \qquad (1)$$

where the total model goodness-of-fit ($\chi^2_{\mathrm{total}}$) is the linear combination of the model scattering goodness-of-fit ($\chi^2_{\mathrm{SAS}}$) and the model goodness-of-fit to the auxiliary data set ($\chi^2_{\mathrm{aux}}$): $\chi^2_{\mathrm{total}} = \chi^2_{\mathrm{SAS}} + \chi^2_{\mathrm{aux}}$.
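The sub-basis enumeration and Monte Carlo population search described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the BEES implementation: the function names (`chi2`, `ensemble_profile`, `likelihood`, `fit_sub_basis`, `all_sub_bases`) are hypothetical, the goodness-of-fit is a plain reduced chi-square rather than the $\chi^2_{\mathrm{free}}$ metric reported in the Results, the proposal move (shifting a small amount of population between two states) is one simple choice among many, and the total chi-square follows Eq. 1 as printed.

```python
import math
import random
from itertools import combinations


def chi2(model, observed, sigma):
    """Reduced chi-square between a model curve and observed points (assumed metric)."""
    terms = [((m - o) / s) ** 2 for m, o, s in zip(model, observed, sigma)]
    return sum(terms) / len(terms)


def ensemble_profile(weights, profiles):
    """Population-weighted linear combination of theoretical profiles."""
    n_points = len(profiles[0])
    return [sum(w * p[i] for w, p in zip(weights, profiles)) for i in range(n_points)]


def likelihood(chi2_sas, chi2_aux=0.0):
    """Eq. 1 as printed: L = exp(-chi2_total / 2), chi2_total = chi2_SAS + chi2_aux."""
    return math.exp(-(chi2_sas + chi2_aux) / 2.0)


def all_sub_bases(n_profiles, size):
    """Every combination of `size` candidate states, as tuples of profile indices."""
    return list(combinations(range(n_profiles), size))


def fit_sub_basis(profiles, observed, sigma, n_iter=2000, max_shift=0.05, seed=0):
    """Toy Metropolis search over population weights for one sub-basis (needs >= 2 states)."""
    rng = random.Random(seed)
    k = len(profiles)
    weights = [1.0 / k] * k
    current = likelihood(chi2(ensemble_profile(weights, profiles), observed, sigma))
    best_w, best_l = list(weights), current
    for _ in range(n_iter):
        i, j = rng.sample(range(k), 2)
        # Move a small amount of population from state i to state j;
        # capping at weights[i] keeps all weights non-negative and the sum at 1.
        shift = min(rng.uniform(0.0, max_shift), weights[i])
        trial = list(weights)
        trial[i] -= shift
        trial[j] += shift
        trial_l = likelihood(chi2(ensemble_profile(trial, profiles), observed, sigma))
        # Metropolis acceptance on the likelihood ratio.
        if trial_l >= current or rng.random() < trial_l / current:
            weights, current = trial, trial_l
            if current > best_l:
                best_w, best_l = list(weights), current
    return best_w, best_l
```

In this sketch, `n_iter` and `max_shift` stand in for the user-defined "number of iterations per Monte Carlo fitting" and "amount of population change per iteration"; running `fit_sub_basis` with several seeds would mimic the independent fittings per sub-basis.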
Once the ensemble of states for each two-member sub-basis has been identified, the best two-member state is selected in accordance with the information criterion (IC) selected by the user, either the Akaike Information Criterion [32] or the Bayesian Information Criterion [33] (see Section 3.2 for more details). If the IC value of the best two-member state is worse than that of any single theoretical profile, then the module reports the best single profile as the most likely model. However, if the IC value of this two-member state is instead an improvement over all individual profiles, then the BEES module conducts the Bayesian Monte Carlo routine on every three-member sub-basis, and the best three-state IC value is similarly compared to the two-state ensemble. This iterative increase in sub-basis size and comparison of IC values is conducted until either the IC metric does not improve or every possible combination of states is considered. Alternatively, users also have the option to override the IC-comparison and force the construction of all combinations of sub-ensembles.

[Figure 1 flowchart. Inputs: Experimental Spectra; Theoretical Profiles; User-Defined BMC Parameters. BEES Routine: Evaluate Individual Members; Increase Basis Size; Run Separate BMC for Each Combination; Has IC Improved? (Yes/No); Identify Best IC-Evaluated Model. Outputs: Print Best Model Information; Save Interactive Bokeh Plots to HTML File; Store Model Information to Filesystem.]

Figure 1: Workflow schematic of the BEES routine. Users supply empirical data and the collection of theoretical profiles for potential ensemble members, as well as set several parameters associated with the Bayesian Monte Carlo (BMC) parameter search. After the performance of each individual theoretical state is evaluated, ensemble populations are fit by BMC routines conducted iteratively on increasingly sized sub-ensembles, until the addition of another member population does not improve the IC value and overfitting is observed. Alternatively, users can bypass the IC-comparison step to compare all possible combinations of states. The routine then relays information regarding the resulting models to the command terminal (stand-alone version) or GUI (SASSIE-web version) and further stores model information in several file locations for further review by users.

Once the desired number of models have been identified, the BEES module will also calculate each model’s “relative performance” metric to determine its likelihood over the best IC-identified model (Section 3.2) [34]:

$$RP(m) = e^{-(IC_m - IC_o)/2} \qquad (2)$$

where RP(m) and $IC_m$ are the relative performance and IC values of model m, and $IC_o$ is the minimum IC value of all observed models. The relative performance metric is more commonly known as the relative likelihood of a model. Here, we opt for the changed nomenclature to assist non-experts in the interpretation of the metric, as well as to avoid confusion with the likelihood function used by the Bayesian Monte Carlo fitting routine. While the relative performance provides a quantitative result, it is admittedly an approximation of the more rigorous Bayes Factor [33,35]. As such, it is intended to be interpreted loosely and to assist the user in applying their intuition toward the performance of alternative ensembles relative to the best identified one. Once the best model has been identified, BEES outputs information regarding ensemble members of the IC-identified model, its model population weights, goodness-of-fit information for the full ensemble model and each individual member, and the IC value of the model.
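As a minimal sketch of Eq. 2, the relative performance of a set of models can be computed directly from their IC values. The function name `relative_performance` is hypothetical and is not taken from the BEES code; the example inputs are the two BIC values reported for the SAS-only example in Section 4.1 (5.55 for the best two-state model and 8.47 for the lowest-$\chi^2_{\mathrm{free}}$ four-state model).

```python
import math


def relative_performance(ic_values):
    """Relative performance (relative likelihood) of each model, per Eq. 2.

    The model with the minimum IC gets RP = 1.0; all others fall between 0 and 1.
    """
    ic_min = min(ic_values)
    return [math.exp(-(ic - ic_min) / 2.0) for ic in ic_values]


# IC values quoted in Section 4.1: best two-state model and the four-state model.
rp = relative_performance([5.55, 8.47])
print([round(value, 2) for value in rp])
```

With these inputs the second entry evaluates to exp(-1.46), which rounds to the relative performance of 0.23 quoted in Section 4.1.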
Beyond the best identified model, information regarding every model identified for each sub-basis is also saved. Plots of the model ensemble fit to the experimental data, along with the associated residual errors, are automatically created once the fitting routine is completed. These plots are included in a multi-tab HTML page that provides graphical and table presentations so that users can compare different models and their performances.

3.2 Comparing Model Performances with Information Criteria

The rigorous comparison of theoretical ensembles to experimental data requires creating models that are rich enough to describe the underlying physical structures that generated the data while simultaneously avoiding overfitting. Biomolecules exist in an ensemble of conformations in solution; therefore, an ensemble of theoretical structures is typically required to interpret SAS data. However, it is imperative that the final model does not achieve a strong goodness-of-fit value through inclusion of an arbitrary number of parameters (here, the number of scattering profiles). As a result, the true “best model” must be a balance between optimizing the goodness-of-fit metric and minimizing the number of underlying scattering states. To this end, the BEES module uses an information criterion (IC) to penalize model goodness-of-fit values according to their ensemble size.
Users have the option to use one of two different IC metrics during fitting: the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) [32,33]:

$$\mathrm{AIC} = 2k - 2 \cdot \log\left(\hat{L}\right) \qquad (3)$$

$$\mathrm{BIC} = \log(n) \cdot k - 2 \cdot \log\left(\hat{L}\right) \qquad (4)$$

Here, k is the number of model parameters (number of scattering states), $\hat{L}$ is the maximum observed likelihood value during the Bayesian Monte Carlo parameter fitting, and n is the number of points in the experimental data set. Both the BIC and AIC have forms that reward models with improved experimental fits (higher values of $\hat{L}$) and penalize those with more parameters (higher values of k). The BIC is closely related to the AIC; however, it is derived from Bayesian principles rather than the frequentist foundation of the AIC. In both metrics, smaller values are indicative of better model performance, with the defining separation between them being the strength of the penalty term. In the AIC, the penalty is always double the number of states, whereas the BIC penalty will become increasingly larger for a larger number of data points. In reality, both metrics are an approximate way to identify the true model, and the AIC may be more prone to false positive estimations (including too many states), while the BIC metric may be more prone to false negatives (rejecting too many states), depending on the number of experimental data points. However, it is often possible that both metrics converge upon the same solution, as is the case with the K63 example presented here. The model with the minimum IC value can be interpreted as the most likely, best performing, model. While it may be tempting to accept this model and reject all others, Bayesian principles dictate that there is a possibility that one of these other models might actually be more accurate to the true nature of the system, even though each one possesses a weaker IC value.
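Eqs. 3 and 4 translate directly into code. The sketch below is illustrative only: the function names `aic`, `bic`, and `best_model` are hypothetical, the candidate models are made-up placeholders, and natural logarithms are assumed, which is the standard convention for both criteria.

```python
import math


def aic(k, log_l_hat):
    """Eq. 3: AIC = 2k - 2 log(L_hat)."""
    return 2.0 * k - 2.0 * log_l_hat


def bic(k, n, log_l_hat):
    """Eq. 4: BIC = log(n) k - 2 log(L_hat)."""
    return math.log(n) * k - 2.0 * log_l_hat


def best_model(models, criterion):
    """Return the (name, k, log_l_hat) entry with the smallest IC value."""
    return min(models, key=lambda m: criterion(m[1], m[2]))


# Placeholder candidates: (label, number of states k, log of maximum likelihood).
candidates = [("one-state", 1, -3.0), ("two-state", 2, -0.4), ("three-state", 3, -0.35)]
n_points = 50  # assumed number of experimental data points

print(best_model(candidates, aic)[0])
print(best_model(candidates, lambda k, ll: bic(k, n_points, ll))[0])
```

The per-state penalties differ only in their prefactor: 2 for the AIC and log(n) for the BIC. Because log(n) exceeds 2 once n is larger than e² ≈ 7.39, the BIC penalizes additional states more strongly than the AIC for any data set with eight or more points.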
The probability that a model is, in fact, a better assessment of the data can be calculated by comparing the model IC values to the lowest IC value, as previously stated (Eqn 2) [34]. Because the BIC and AIC apply different penalties to the number of states, they may also produce different relative performance values for the same set of models. Depending on the number of independent data points, the BIC will produce relative performance values for competing models that are either closer to (n ≤ 7) or further from (n ≥ 8) the performance of the model with the lowest BIC. This threshold arises because the per-state BIC penalty, log(n), exceeds the AIC penalty of 2 only when n > e² ≈ 7.4. That is, if the number of independent data points is seven or fewer, then more models will have a relative performance closer to 1.0 than if evaluated by AIC. On the other hand, if the number of observed data points is eight or greater, then more models will have relative performances closer to 0.0 if they are evaluated by the BIC in place of the AIC. In the end, the choice of BIC vs AIC evaluation is up to the user, and it may sometimes be appropriate to use both to determine upper and lower bounds for relative model performances.

4 Results

Here, we describe a sample usage of BEES and its resulting data. The necessary data files for this test set are included in the Supporting Information. Users can thereby re-create the analyses presented here by unpacking the archive locally and uploading the relevant files for each case to the BEES module in SASSIE-web, or by following the shell scripts provided alongside the stand-alone version (https://github.com/WereszczynskiGroup/BEES/tree/master/examples). In the first example, we model the populations of states of K63-linked ubiquitin trimers using clusters identified from accelerated molecular dynamics trajectories [22]. In the second example, we showcase the effects of simultaneously fitting the SAS spectra and an auxiliary data set by including simulated measurements of an inter-domain distance and angle.
Figure 2: (A) Example of output from BEES program, as used on SASSIE-web GUI. (top) Text output displaying the contributing populations of the best IC-identified ensemble and the associated error in population estimates, as well as goodness-of-fit for each member. Total model goodness-of-fit and IC value are also printed by the module. (middle) Ensemble scattering profile of the best identified model, shown in blue, fit to the experimental spectrum, shown in black. (bottom) Residual errors of the best model against the experiment. (B) The third tab of the BEES-output HTML file (“Compare All Models”), which contains the relative performances histogram as well as a table of all the constructed ensemble models and their relative performance, ensemble size, selected IC metric, and goodness-of-fit values. Selecting a particular model in the table will also visualize the constituent populations on the bar graph below (best identified model selected here). The full interactive HTML file can be accessed by downloading the “K63 sas only plots.html” file from the example files contained within the Supporting Material. A similar file for the inclusion of auxiliary data can be found in “K63 with aux.html”, also included in the Supporting Material example files.

Figure 3: A visual representation of the two auxiliary measurements included in the second BEES routine.
Both the distal monomer separation distance (d) and angle (θ) are measured in accordance with each monomer’s center-of-mass.

4.1 Building Ensembles of SAS Data

BEES requires the user to supply the experimental scattering curve along with theoretical scattering curves for candidate structures. In addition to providing these data, users must also define the Dmax of the molecule, which can be determined from the experimental profile using pre-existing software [36]. Here, a Dmax of 83.6 Å was determined using the Shanum program of the ATSAS package [37]. Furthermore, five Monte Carlo walkers were used for each sub-basis ensemble, and each walker was run for 10,000 iterations. The first 1,000 iterations were neglected when determining the model populations so as to remove any influence of the randomly selected initial values from the final result. Parallel processing can also be used (here, six processors were used), but using multiple processors will only enhance the speed of the calculation and has no effect on the final result (see Supporting Information for more information). In addition, the full array of sub-ensembles has been calculated to display the depth of analysis available. In this example, truncation of the algorithm via the IC parameter would save a significant amount of computational time without affecting the best IC-identified model; however, models with lower $\chi^2_{\mathrm{free}}$ would not have been observed. At the conclusion of the BEES routine, the best identified model is reported (Fig 2A), and an interactive plot interface is created (Fig 2B). In this example, the best model is a two-state solution that is approximately equal parts clusters 2 and 9. This model has a $\chi^2_{\mathrm{free}}$ of 0.79 and a BIC value of 5.55.
While this is the best model according to BIC comparisons, roughly 50 models of varying sizes possess better $\chi^2_{\mathrm{free}}$ values, and the model with the best goodness-of-fit ($\chi^2_{\mathrm{free}}$ = 0.74) is a 4-member state composed of clusters 2 (∼45%), 4 (∼22%), 10 (∼15%), and 11 (∼18%). This lowest $\chi^2_{\mathrm{free}}$ model has an IC value of 8.47, which yields a relative performance of 0.23 when compared to the IC-identified two-state model. As such, the improved $\chi^2_{\mathrm{free}}$ value of this model is unwarranted, as it is likely the result of overfitting by too many basis members. Indeed, inspection of the model performance histogram (Figure 2B, top) shows that the best performing models are largely two-state solutions, but some three-state solutions perform moderately well. Furthermore, many of the two- and three-state solutions are a significant improvement over each of the single-state models.

4.2 Building Ensembles with Auxiliary Data

Some users may desire to use BEES to build theoretical solution states by fitting solely to SAS data, and then use these states to predict measurements of future experiments. However, others may already possess such data and may prefer to create models that are consistent with both these measurements and the observed SAS profiles. For example, an experimenter may desire to simultaneously model both a scattering profile and a catalogue of NMR-derived distances. For the benefit of this class of users, we have included this functionality within BEES.
To demonstrate how including such data might affect the modeling results, we discuss here an extension of the previous tri-ubiquitin example in which we provide a simulated data set that contains the ensemble-averaged center-of-mass distance between distal monomers and the angle formed by the trimer arrangement (Fig 3). These data were created by taking the ensemble-averaged measures of the best model from the previous example with the inclusion of a Gaussian noise factor, resulting in a target distance of 53.0 ± 1.6 Å and a target angle of 117.7 ± 8.3°. Inputs to the BEES routine are identical to the previous example, with the exception of the auxiliary data set. With the addition of the distance and angle measurements, we find a shift in the best IC-identified model. While still a two-state solution, the contributing members are now clusters 3 (43 ± 5%) and 4 (56 ± 5%). This model yields a $\chi^2_{\mathrm{total}}$ of 0.80, with a $\chi^2_{\mathrm{SAS}}$ of 0.96 and a $\chi^2_{\mathrm{aux}}$ of 0.38. As was the case in the last example, there are a plethora of models containing three or more members in which better goodness-of-fit values are observed, and the best goodness-of-fit model is a mixture of clusters 2, 4, and 11 and has a $\chi^2_{\mathrm{total}}$ of 0.65. While this model is arguably a better fit to the data than the two-state ensemble of clusters 3 and 4, the IC value of this model is larger due to the addition of a third population. As such, this model is only the eighth most probable model, and possesses a relative performance of 0.63. When we inspect the ten best ensembles, we once again find the best model from the previous example, which possesses a $\chi^2_{\mathrm{total}}$ of 0.81, a $\chi^2_{\mathrm{SAS}}$ of 0.81, and a $\chi^2_{\mathrm{aux}}$ of 0.83. Differences between the exact values of the $\chi^2_{\mathrm{SAS}}$ metric in this example and the previous example are a result of the random-sampling nature of the $\chi^2_{\mathrm{free}}$ metric, but these values are statistically indistinguishable.
Similarly, the total goodness-of-fit in the clusters 3 and 4 ensemble is comparable to the ensemble containing clusters 2 and 9. As both models are two-state solutions, this results in very similar IC metrics and a relative performance value of 0.94, which suggests that neither model is significantly more accurate than the other. However, the 3+4 ensemble significantly outperforms the 2+9 ensemble in the context of the distance and angle measurements, while the 2+9 ensemble is a better fit to the scattering curve.

5 Discussion

Here, we have presented the Bayesian Ensemble Estimation from SAS (BEES) program and highlighted its use with two example use cases. In the first example, we used the module to reweight states of K63-linked triubiquitin that were obtained from accelerated molecular dynamics simulations. The BEES module identified a two-state solution as the model that best balanced the fit to experimental data with the fewest number of states. However, the analysis also found a plethora of models that had improved goodness-of-fit values against the experimental scattering profile, but each of these models had more ensemble members than the two-state solution. The BEES module provides users with a convenient interface to both find and compare these other candidate ensembles with the IC-identified best state. This allows researchers the option to either rigorously trust the IC statistics to identify the most appropriate scattering model or to use the “ensemble of ensembles” constructed by the BEES module to guide their understanding of datasets separate from the fitting procedure. The second use case discussed here demonstrated how BEES performs when simultaneously fitting populations to both SAXS and auxiliary data (here, simulated distance and angle measurements). In this example, the best identified model was still a two-state solution.
However, a three-member ensemble was observed to have a better goodness-of-fit, but the improvement to $\chi^2_{\mathrm{total}}$ was not sufficient to also improve the IC parameter, yielding a relative performance of 0.63. Since the two-state solution has strong agreement with both measurements ($\chi^2_{\mathrm{free}}$, $\chi^2_{\mathrm{aux}}$ < 1.0), this relative performance value suggests that a conservative estimate for the solution ensemble would favor the two-state model over the three-state case. However, the performance is of high enough quality that this ensemble could also be considered as a solution for future measurements. In this way, we emphasize that the relative performance metric should aid the intuition of researchers, rather than completely replace it.

BEES seeks to identify the theoretical ensemble of states that uses the fewest number of populations to accurately describe the experimentally measured solution ensemble. In doing so, BEES is biased toward fitting the minimum amount of information contained within the experimental data, so as to avoid potential over-fitting. In contrast, other methods such as genetic algorithms and maximum entropy approaches will seek to use the full information of each scattering point [14,15,38]. While these methods may result in overfitting, BEES is also susceptible to under-fitting when utilizing SAXS data alone. As a result, the most accurate model to the true solution ensemble is likely one that is of a size between the smallest and largest ensembles identified by these methods. Furthermore, accurate use of any of these fitting methods is reliant on high-quality theoretical profiles; inaccurate theoretical states will likely lead to incorrect models.
Therefore, users should be very careful when selecting scattering calculator programs and parameters, and special attention should be paid to accurately accounting for hydration layer effects [39].

BEES can be used to construct ensemble models of scattering data from a library of candidate states, and the iterative algorithm of BEES quantitatively resists overfitting of the data from the addition of unnecessary populations. The program is available as a module on SASSIE (https://sassie-web.chem.utk.edu/sassie2/), as well as in a stand-alone form (https://github.com/WereszczynskiGroup/BEES). BEES is designed for use by both new and expert users of computational ensemble modeling, and the GUI-based module for the SASSIE-web platform provides structural and computational biophysicists with the resources necessary to construct molecular models in a Bayesian-based manner. Furthermore, BEES provides visual tools for quickly interpreting not only the quality of the best IC-identified model, but also for the full ensemble of sub-basis models available from the candidate populations. This feature allows users to inspect many different potential solutions and to compare their ability to model both SAS and auxiliary data sets. In this way, BEES serves the intuition of structural researchers in building ensembles of states for their systems of interest.

6 Author Contributions

SB and JW designed the BEES routine; SB and JC wrote the code of the BEES routine; SB, JEC, and EHB designed and wrote the SASSIE-web implementation of BEES; SB, JC, and JW analyzed the data sets; SB and JW wrote the first manuscript draft, and all authors contributed to editing of the manuscript.

7 Acknowledgements

The authors would like to thank Dr. Susan Krueger for valuable discussions in designing the plotting interface. EHB’s work is supported by National Science Foundation grant number OAC-1740097 and NIH grant GM120600.
SB, JC and JW are supported by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health under award number R35GM119647. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work benefited from CCP-SAS software developed through a joint EPSRC (EP/K039121/1) and NSF (CHE-1265821) grant, as well as interactions and data collection at the Biophysics Collaborative Access Team, which is supported by NIGMS grant P41GM103622. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562 [40]. Certain commercial equipment, instruments, or materials are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.

References

[1] K. Henzler-Wildman and D. Kern. Dynamic personalities of proteins. Nature, 450(7172):964–972, Dec 2007.
[2] L. Boldon, F. Laliberte, and L. Liu. Review of the fundamental theories behind small angle X-ray scattering, molecular dynamics simulations, and relevant integrated application. Nano Rev., 6:25661, 2015.
[3] J. Trewhella. Small-angle scattering and 3D structure interpretation. Curr. Opin. Struct. Biol., 40:1–7, Oct 2016.
[4] R. Graceffa, R. P. Nobrega, R. A. Barrea, S. V. Kathuria, S. Chakravarthy, O. Bilsel, and T. C. Irving. Sub-millisecond time-resolved SAXS using a continuous-flow mixer and X-ray microbeam. J. Synchrotron Radiat., 20(Pt 6):820–825, Nov 2013.
[5] A. Nasedkin, M. Marcellini, T. L. Religa, S. M. Freund, A. Menzel, A. R. Fersht, P. Jemth, D. van der Spoel, and J. Davidsson. Deconvoluting Protein (Un)folding Structural Ensembles Using X-Ray Scattering, Nuclear Magnetic Resonance Spectroscopy and Molecular Dynamics Simulation. PLoS ONE, 10(5):e0125662, 2015.
[6] A. Plumridge, A. M. Katz, G. D. Calvey, R. Elber, S. Kirmizialtin, and L. Pollack. Revealing the distinct folding phases of an RNA three-helix junction. Nucleic Acids Res., May 2018.
[7] P. J. Cross, R. C. Dobson, M. L. Patchett, and E. J. Parker. Tyrosine latching of a regulatory gate affords allosteric control of aromatic amino acid biosynthesis. J. Biol. Chem., 286(12):10216–10224, Mar 2011.
[8] L. Fetler, E. R. Kantrowitz, and P. Vachette. Direct observation in solution of a preexisting structural equilibrium for a mutant of the allosteric aspartate transcarbamoylase. Proc. Natl. Acad. Sci. U.S.A., 104(2):495–500, Jan 2007.
[9] S. C. Howell, X. Qiu, and J. E. Curtis. Monte Carlo simulation algorithm for B-DNA. J. Comput. Chem., 37(29):2553–2563, Nov 2016.
[10] S. A. Datta, J. E. Curtis, W. Ratcliff, P. K. Clark, R. M. Crist, J. Lebowitz, S. Krueger, and A. Rein. Conformation of the HIV-1 Gag protein in solution. J. Mol. Biol., 365(3):812–824, Jan 2007.
[11] P. C. Chen and J. S. Hub. Validating solution ensembles from molecular dynamics simulation by wide-angle X-ray scattering data. Biophys. J., 107(2):435–447, Jul 2014.
[12] J. S. Hub. Interpreting solution X-ray scattering data using molecular simulations. Curr. Opin. Struct. Biol., 49:18–26, Apr 2018.
[13] S. Yang, L. Blachowicz, L. Makowski, and B. Roux. Multidomain assembled states of Hck tyrosine kinase in solution. Proc. Natl. Acad. Sci. U.S.A., 107(36):15757–15762, Sep 2010.
[14] G. Tria, H. D. Mertens, M. Kachala, and D. I. Svergun. Advanced ensemble modelling of flexible macromolecules using X-ray solution scattering. IUCrJ, 2(Pt 2):207–217, Mar 2015.
[15] M. Pelikan, G. L. Hura, and M. Hammel. Structure and flexibility within proteins as identified through small angle X-ray scattering. Gen. Physiol. Biophys., 28(2):174–189, Jun 2009.
[16] D. Schneidman-Duhovny, M. Hammel, J. A. Tainer, and A. Sali. FoXS, FoXSDock and MultiFoXS: Single-state and multi-state structural modeling of proteins and their complexes based on SAXS profiles. Nucleic Acids Res., 44(W1):W424–429, Jul 2016.
[17] K. E. Hines. A primer on Bayesian inference for biophysical systems. Biophys. J., 108(9):2103–2113, May 2015.
[18] C. K. Fisher, A. Huang, and C. M. Stultz. Modeling intrinsically disordered proteins with Bayesian statistics. J. Am. Chem. Soc., 132(42):14919–14927, Oct 2010.
[19] V. A. Voelz and G. Zhou. Bayesian inference of conformational state populations from computational models and sparse experimental observables. J. Comput. Chem., 35(30):2215–2224, Nov 2014.
[20] Y. Ge and V. A. Voelz. Model Selection Using BICePs: A Bayesian Approach for Force Field Validation and Parameterization. J. Phys. Chem. B, 122(21):5610–5622, May 2018.
[21] W. Potrzebowski, J. Trewhella, and I. Andre. Bayesian inference of protein conformational ensembles from limited structural data. PLoS Comput. Biol., 14(12):e1006641, Dec 2018.
[22] S. Bowerman, A. S. J. B. Rana, A. Rice, G. H. Pham, E. R. Strieter, and J. Wereszczynski. Determining Atomistic SAXS Models of Tri-Ubiquitin Chains from Bayesian Analysis of Accelerated Molecular Dynamics Simulations. J. Chem. Theory Comput., 13(6):2418–2429, Jun 2017.
[23] Emre H. Brookes, Nadeem Anjum, Joseph E. Curtis, Suresh Marru, Raminder Singh, and Marlon Pierce. The GenApp framework integrated with Airavata for managed compute resource submissions. Concurrency and Computation: Practice and Experience, 27(16):4292–4303, May 2015.
[24] S. J. Perkins, D. W. Wright, H. Zhang, E. H. Brookes, J. Chen, T. C. Irving, S. Krueger, D. J. Barlow, K. J. Edler, D. J. Scott, N. J. Terrill, S. M. King, P. D. Butler, and J. E. Curtis. Atomistic modelling of scattering data in the Collaborative Computational Project for Small Angle Scattering (CCP-SAS). J. Appl. Crystallogr., 49(Pt 6):1861–1875, Dec 2016.
[25] D. Svergun, C. Barberato, and M. H. J. Koch. CRYSOL – a program to evaluate X-ray solution scattering of biological macromolecules from atomic coordinates. J. Appl. Crystallogr., 28(6):768–773, Dec 1995.
[26] D. Schneidman-Duhovny, M. Hammel, and A. Sali. FoXS: a web server for rapid computation and fitting of SAXS profiles. Nucleic Acids Res., 38(Web Server issue):W540–544, Jul 2010.
[27] Max C. Watson and Joseph E. Curtis. Rapid and accurate calculation of small-angle scattering profiles using the golden ratio. J. Appl. Crystallogr., 46(4):1171–1177, Aug 2013.
[28] K. Stovgaard, C. Andreetta, J. Ferkinghoff-Borg, and T. Hamelryck. Calculation of accurate small angle X-ray scattering curves from coarse-grained protein models. BMC Bioinformatics, 11:429, Aug 2010.
[29] K. M. Ravikumar, W. Huang, and S. Yang. Fast-SAXS-pro: a unified approach to computing SAXS profiles of DNA, RNA, protein, and their complexes. J. Chem. Phys., 138(2):024112, Jan 2013.
[30] J. J. Virtanen, L. Makowski, T. R. Sosnick, and K. F. Freed. Modeling the hydration layer around proteins: applications to small- and wide-angle X-ray scattering. Biophys. J., 101(8):2061–2069, Oct 2011.
[31] P. C. Chen and J. S. Hub. Interpretation of solution X-ray scattering by explicit-solvent molecular dynamics. Biophys. J., 108(10):2573–2584, May 2015.
[32] Hirotugu Akaike. Akaike’s Information Criterion, page 25. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[33] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[34] Kenneth P. Burnham and David R. Anderson. Information and Likelihood Theory: A Basis for Model Selection and Inference, pages 76–123. Springer, 2nd edition, 2002.
[35] Robert E. Kass and Adrian E. Raftery. Bayes Factors. Journal of the American Statistical Association, 90(430):773–795, June 1995.
[36] D. Franke, M. V. Petoukhov, P. V. Konarev, A. Panjkovich, A. Tuukkanen, H. D. T. Mertens, A. G. Kikhney, N. R. Hajizadeh, J. M. Franklin, C. M. Jeffries, and D. I. Svergun. ATSAS 2.8: a comprehensive data analysis suite for small-angle scattering from macromolecular solutions. J. Appl. Crystallogr., 50(Pt 4):1212–1225, Aug 2017.
[37] P. V. Konarev and D. I. Svergun. A posteriori determination of the useful data range for small-angle scattering experiments on dilute monodisperse systems. IUCrJ, 2(Pt 3):352–360, May 2015.
[38] Bartosz Różycki and Evzen Boura. Large, dynamic, multi-protein complexes: a challenge for structural biology. J. Phys. Condens. Matter, 26(46):463103, Oct 2014.
[39] João Henriques, Lise Arleth, Kresten Lindorff-Larsen, and Marie Skepö. On the calculation of SAXS profiles of folded and intrinsically disordered proteins from computer simulations. J. Mol. Biol., 430(16):2521–2539, 2018. Intrinsically Disordered Proteins.
[40] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr. XSEDE: Accelerating Scientific Discovery. Comput. Sci. Eng., 16(5):62–74, Sep 2014.