Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

ProtoMD: A Prototyping Toolkit for Multiscale Molecular Dynamics

arXiv (Cornell University), 2013
...Read more
ProtoMD : A Prototyping Toolkit for Multiscale Molecular Dynamics Endre Somogyi b,c,** , Andrew Abi Mansour a,c,** , Peter J. Ortoleva a,c,* a Department of Chemistry, Indiana University, Bloomington b Department of Physics, Indiana University, Bloomington c Center for Theoretical and Computational Nanoscience, Indiana University, Bloomington Abstract ProtoMD is a toolkit that facilitates the development of algorithms for mul- tiscale molecular dynamics (MD) simulations. It is designed for multiscale methods which capture the dynamic transfer of information across multiple spatial scales, such as the atomic to the mesoscopic scale, via coevolving mi- croscopic and coarse-grained (CG) variables. ProtoMD can be also be used to calibrate parameters needed in traditional CG-MD methods. The toolkit inte- grates ‘GROMACS wrapper’ to initiate MD simulations, and ‘MDAnalysis’ to analyze and manipulate trajectory files. It facilitates experimentation with a spectrum of coarse-grained variables, prototyping rare events (such as chemi- cal reactions), or simulating nanocharacterization experiments such as terahertz spectroscopy, AFM, nanopore, and time-of-flight mass spectroscopy. ProtoMD is written in python and is freely available under the GNU General Public Li- cense from github.com/CTCNano/proto md. Keywords: python, molecular dynamics, multiscale, coarse-graining PROGRAM SUMMARY/NEW VERSION PROGRAM SUMMARY Manuscript Title: ProtoMD: A Prototyping Toolkit for Multiscale Molecular Dynam- ics Authors: Endre Somogyi, Andrew Abi Mansour, and Peter J. Ortoleva Program Title: ProtoMD Journal Reference: Catalogue identifier: * Corresponding author: ortoleva@indiana.edu ** Both authors contributed equally to this work Preprint submitted to Elsevier January 11, 2016 arXiv:1309.6397v6 [physics.comp-ph] 8 Jan 2016
Licensing provisions: GPL v3 (or above) Programming language: python 2.7.3 Computer: x86 / x86 64 Operating system: Linux RAM: Depends on the size of the system being simulated and duration of the simula- tion (few MBs to TBs) Number of processors used: 12 - 128 Supplementary material: Keywords: python, molecular dynamics, multiscale, coarse-graining Classification: Computational Methods External routines/libraries: GROMACS, MDAnalysis, GromacsWrapper, numpy, scipy Subprograms used: Nature of problem: Prototyping multiscale coarse-grained algorithms for molecular dynamics Solution method: Combining the open-source GROMACS molecular dynamics package and the python- based MDAnalysis library for running, debugging, and analyzing multiscale simula- tions Restrictions: The system under study must be characterized by a clear separation of timescales; otherwise, the multiscale algorithm fails to capture the slowly-varying modes. Unusual features: Additional comments: Running time: Depending on the problem size, simulations can take few hours to months. 1. Introduction Supramolecular assemblies of contemporary interest include viruses, vac- cine nanoparticles, nanocapsules for drug delivery, nanomaterials, and light- harvesting fibers. These systems have been studied via all-atom [1, 2, 3, 4, 5] 2
ProtoMD: A Prototyping Toolkit for Multiscale Molecular Dynamics arXiv:1309.6397v6 [physics.comp-ph] 8 Jan 2016 Endre Somogyib,c,∗∗, Andrew Abi Mansoura,c,∗∗, Peter J. Ortolevaa,c,∗ a Department of Chemistry, Indiana University, Bloomington of Physics, Indiana University, Bloomington c Center for Theoretical and Computational Nanoscience, Indiana University, Bloomington b Department Abstract ProtoMD is a toolkit that facilitates the development of algorithms for multiscale molecular dynamics (MD) simulations. It is designed for multiscale methods which capture the dynamic transfer of information across multiple spatial scales, such as the atomic to the mesoscopic scale, via coevolving microscopic and coarse-grained (CG) variables. ProtoMD can be also be used to calibrate parameters needed in traditional CG-MD methods. The toolkit integrates ‘GROMACS wrapper’ to initiate MD simulations, and ‘MDAnalysis’ to analyze and manipulate trajectory files. It facilitates experimentation with a spectrum of coarse-grained variables, prototyping rare events (such as chemical reactions), or simulating nanocharacterization experiments such as terahertz spectroscopy, AFM, nanopore, and time-of-flight mass spectroscopy. ProtoMD is written in python and is freely available under the GNU General Public License from github.com/CTCNano/proto md. Keywords: python, molecular dynamics, multiscale, coarse-graining PROGRAM SUMMARY/NEW VERSION PROGRAM SUMMARY Manuscript Title: ProtoMD: A Prototyping Toolkit for Multiscale Molecular Dynamics Authors: Endre Somogyi, Andrew Abi Mansour, and Peter J. Ortoleva Program Title: ProtoMD Journal Reference: Catalogue identifier: ∗ Corresponding ∗∗ Both author: ortoleva@indiana.edu authors contributed equally to this work Preprint submitted to Elsevier January 11, 2016 Licensing provisions: GPL v3 (or above) Programming language: python 2.7.3 Computer: x86 / x86 64 Operating system: Linux RAM: Depends on the size of the system being simulated and duration of the simulation (few MBs to TBs) Number of processors used: 12 - 128 Supplementary material: Keywords: python, molecular dynamics, multiscale, coarse-graining Classification: Computational Methods External routines/libraries: GROMACS, MDAnalysis, GromacsWrapper, numpy, scipy Subprograms used: Nature of problem: Prototyping multiscale coarse-grained algorithms for molecular dynamics Solution method: Combining the open-source GROMACS molecular dynamics package and the pythonbased MDAnalysis library for running, debugging, and analyzing multiscale simulations Restrictions: The system under study must be characterized by a clear separation of timescales; otherwise, the multiscale algorithm fails to capture the slowly-varying modes. Unusual features: Additional comments: Running time: Depending on the problem size, simulations can take few hours to months. 1. Introduction Supramolecular assemblies of contemporary interest include viruses, vaccine nanoparticles, nanocapsules for drug delivery, nanomaterials, and lightharvesting fibers. These systems have been studied via all-atom [1, 2, 3, 4, 5] 2 and coarse-grained (CG) [6, 7, 8, 9] molecular dynamics (MD). MD software uses physical concepts and software engineering to make simulations accurate and efficient via modern hardware (e.g., GPUs and multicore platforms). However, simulating supramillion-atom systems over long time periods still presents a challenge for MD. As a result, many multiscale coarse-grained MD methods have been proposed in the past. These methods can be generally divided into two classes. The first is based on coevolving both the atomistic and CG states simultaneously [10, 11, 12, 13, 14]; in these methods, the atomic resolution is never lost. The second class coarse-grains the atomistic structure and evolves the CG variables based on dynamical equations that depend on parameters to be determined [15, 16, 17, 18, 19, 20]. These parameters are usually calibrated from experimental or MD simulation data. In this work, we present ProtoMD, a multiscale toolkit for prototyping both classes of multiscale MD algorithms. ProtoMD can be used as a program to run multiscale simulations or as a library to debug a multiscale algorithm or perform coarse-grained analysis. ProtoMD uses GROMACS [21] to perform energy minimization and run MD simulations. This is needed for completing the atomistic phase in coevolution methods and for calibrating system-specific parameters for CG-MD methods. An overview of the theoretical methods on which ProtoMD is based is presented in sec. (2), a discussion on the implementation of these methods and the design of ProtoMD are provided in sec. (3), and a general user guide is covered in sec. (4). 2. Theoretical framework While ProtoMD is a general-purpose toolkit that can be used to prototype CG-MD methods (if calibration is based on data collected from MD simula- 3 tions), here we emphasize the implementation of multiscale coevolution algorithms using the space-warping method [10, 22] as a coarse-graining technique. Conventional MD is designed to solve Newton’s equations for a set of N atoms interacting via conventional or custom force fields. Coevolution methods take advantage of exntensive investment made in the computational and software engineering of conventional MD codes. This is achieved by integrating MD into one complete multiscale algorithm that evolves an atomistic system via MD, followed by coarse-grained dynamics. Coevolution methods have many of the same goals as conventional MD, primarily to simulate the evolving state of the N atoms with atomic precision. However, these methods are able to achieve this with greater efficiency in two distinct ways. The first follows the evolution of an N atom system along a time course that starts with user specified initial data. The second is the analogue of ensemble MD where hundreds of MD simulations are followed to provide an assessment of the statistical significance of the simulation. These two types of simulations (single and ensemble) are achieved via two distinct algorithms implemented as two branches of the code. Many-atom systems display cooperative behaviors ranging from mass and charge density oscillations to structural transitions between crystal phases in a polymorphic system. One challenge in many-particle physics is to develop a quantitative understanding about how such behaviors emerge from the Newtonian dynamics of many particles influencing each other via an interatomic force field such as CHARMM [23] or AMBER [24]. The standard statistical mechanical framework used to address such a challenge is the Liouville equation (LE) for the dynamics of the N -particle density, ρ(Γ, t), for the 6N particle positions and momenta, Γ. The pathway to achieving this quantitative understanding considered here is to unfold the dependencies of ρ on the microstate, i.e., on both Γ and a set of coarse-grained variables, Φ. Since Φ is constructed from Γ, 4 then Φ obeys the equation of motion Φ̇ = Π, where Π can also be expressed in terms of Γ through Newton’s equations. With this, and the usual arguments of probability conservation used to derive the LE, one finds that ρ(Γ, Φ, t) satisfies the LE. Thus, an extended description of the system state is introduced such that it includes the microstate (described by Γ), and the mesostate (described by Φ). At first, this seems like a backward step in the quest for dimensionality reduction. However, this expanded representation facilitates theoretical developments and an associated conceptual picture, which ultimately imply efficient and accurate computational algorithms. This extended description introduces approximations that enter only via well-founded multiscale perturbation [25, 26, 11, 13] and Trotter factorization [14, 27] methods. Besides the computational reduction, there are several advantages to our approach. Key features of the 6N dimensional representation are provided, and codevelopment of a multiscale perturbation technique with the Trotter scheme yields insights into the physical meaning of the latter. Both methods enable the use of well-established interatomic force fields directly, avoiding the need to conjecture the form of CG governing equations. 2.0.1. Multiscale factorization This technique is employed for the generation of a single N −atom trajectory while an N −atom system simultaneously evolves on multiple scales. This introduces a computational challenge for a time stepping algorithm, i.e. capturing the slow modes as well as rapidly fluctuating ones restricts the timestep. The latter determine the time stepping of conventional MD integrators. However, the characteristic time of the collective modes can be orders of magnitude larger than that of atomic fluctuations. Since the collective behaviors are a result of many individual atomic motions, the collective and fast-atom modes are coupled, leading to the slowness of conventional MD. This is overcome in 5 multiscale factorization. Rather than attempting to untangle the collective and fast particle modes directly, a Trotter factorization strategy is used [28]. In this approach, the untangling of the collective and fast-atom modes is achieved via an alternating stepping evolution procedure [14, 27]. In each step, the collective and fast-particle modes are updated: the former via the collective integration of the momenta constructed from the MD phase, and the latter by conventional MD. This is justified via the stationarity hypothesis, which states that the time evolution of the momenta conjugate to the collective variables can be represented by a stationary process which expresses a representative ensemble of values over a period of time short relative to the characteristic CG timescale [14]. Then if the characteristic timescale of the collective behavior is ∆, and the MD duration needed to generate the representative ensemble of conjugate momenta states is δ, the theoretical efficiency over MD is expected to be ∆/δ. 2.0.2. Multiscale perturbation This method generates an ensemble of N -atom trajectories and is therefore useful as an accelerated trajectory sampling technique, i.e. ensemble MD. Starting from the Liouville equation, Langevin type equations for a variety of CG variables are derived [25, 29, 26]. The multiscale approach also provides prescriptions for constructing all factors (such as diffusion coefficients and thermalaverage forces) in these CG governing equations, criteria for completeness of the set of CG variables, and algorithms for efficient simulation of N -atom systems based on coevolution of CG variables and an ensemble of all-atom configurations evolving with them [11, 13, 30, 31]. The theoretical speedup over ensemble MD in this case is also ∆/δ, with δ being the length of the short MD runs constituting the microscopic ensemble needed to advance the mesoscopic state in time over a period of ∆. 6 2.1. Coarse graining and subsystems Interatomic forces in many-atom systems induce collective (i.e., coherent) behaviors which characterize the long space-time dynamics. Identifying these collective variables is a first step in a multiscale simulation. Mesoscopic systems also manifest single atomic modes (such as collisions or vibrations) that are needed for a complete understanding of the system behavior. While a variety of CG variables have been introduced in the context of many-atom simulations [32, 33], here the focus is on slowly varying ones [10, 29, 22] since they provide an appropriate starting point for multiscale analysis. The general notion is to consider a set of CG variables φ, related to the N atom positions r, via a dimensionality reduction map. Complex systems such as viruses and other macromolecular assemblies can often be separated into a collection of subsystems. Therefore, it is convenient to apply this dimensionality reduction locally to each subsystem of the entire structure. In the subsystem decomposition approach, subsystem l has its own set of CG variables, denoted φl . For example, a virus capsid can often be decomposed into energy-defined capsomers, notably pentamers or hexamers of a protein (Fig. (1)). Specifically, let the center of mass of subsystem l be denoted Rl and the subsystem in which atom i resides be l(i). Let ri be the position of atom i and si its position relative to Rl(i) , i.e., ri = Rl(i) + si . (1) Using this notation, one can introduce CG variables describing the position, orientation, and deformation of each subsystem. For example, using the space warping method [10, 22], basis functions Ukli are introduced for atom i in subsystem l, where k is the order of the polynomial spatial function used in defining the CG variables. For example, if Legendre polynomials are used in constructing 7 Figure 1: The capsid-like structure of L1 Human Papilimano virus (HPV) proteins (left) is divided into twelve pentameric subsystems (right), each of which has its own set of CG variables when subsystem decomposition is used in ProtoMD. Ukli , then each subsystem is embedded in a cube that contains the subsystem and within which Legendre polynomials are used to define the CG variables in terms of a reference atomic-configuration [30]. Let Φkl be the k th CG variable for subsystem l, then the CG to all-atom map is taken to be a Fourier-like expansion si = X Φkl Ukli + σi . (2) k The k−sum on the RHS describes the more coherent (i.e. collective) dynamics of the system while the residual σi accounts for the small scale, stochastic motion of atom i over-and-above the coherent motion. The basis functions Ukli are taken to be orthogonal polynomials. The specific relation between Φkl and the atomic positions is determined by maximizing the information in Φkl . This is accomplished by minimizing the mass-weighted sum of square residuals [29], restricted to atoms in the given subsystem l. As system complexity (e.g., the number of subsystems and their internal structure) increases, one may increase the number of CG variables, e.g., the range of the sum in Eq. (2). As the number of CG variables increases, smaller 8 scale features are captured, but the characteristic timescale decreases. This restricts the timestep used in advancing the CG state. Too coarse a description might miss a given pathway of, e.g., self-assembly or disassembly. Thus, there is a trade-off between completeness and numerical efficiency. However, this strategy has been shown to be successful in arriving at an accurate and efficient algorithm for simulating bionanosystems [30, 22, 34] since the CG variables guide the overall dynamics, while much of the atomic-scale structure is captured by the fine-scale aspect of the coevolution. The implementation of both multiscale algorithms and the subsystem decomposition outlined here is discussed in the next section that covers the technical aspects of the design of ProtoMD. 3. Design and implementation 3.1. Strategy One of the key goals in the design of ProtoMD is to allow users to readily experiment and develop their own CG variable definitions and dynamical equations. The overall structure of ProtoMD is based on an object-oriented approach, with strict enforcement that all functions are guaranteed to be side effect free. Side effect producing functions can modify global variables or input parameters without the user being aware of it. Such side effects are eliminated with the approach adopted here, which proved to be practical when dealing with large sets of numerical data (atomic coordinates, velocities, etc., for supramillion atom systems). All state management is isolated in the top level System class. The System object contains a number of other objects. However, once these objects are initialized, they are immutable. For example, the integrators do not have any internal mutable state, i.e., they simply take an existing atomic state (from the System.current timestep context) and calculate the next state. 9 3.2. Functions and classes A system is a set of atoms that can be delineated in a number of ways. We specify subsystem delineation with an MDAnalysis selection string. MDAnalysis provides a full atomic selection language comparable to VMD [35] or CHARMM [1]. Subsystem CG variables are handled with the proto md.subsystems.SubSystem class. A SubSystem references a specific set of atoms, calculates its associated CG variables, or conversely it can take new CG variables and construct the micro state, or an ensemble thereof, consistent with the new CG state. SubSystems are notified by the System of various events such as when the energy of the atomic state is minimized, the system is equilibrated, or connectivity has changed. The atoms contained in a SubSystem are stored in an MDAnalysis AtomGroup object, i.e. each SubSystem contains an AtomGroup. SubSystems are created by a factory function. The factory creates objects without specifying the exact class of object that will be created. The essence of this pattern is to define an interface for creating an object, and let the classes that implement the interface decide which object to instantiate. Therefore, the factory method lets a class defer instantiation to subclasses. A SubSystem factory function must have a signature of def SubsystemFactory(system, selects, *args), where system is a reference to the System object, selects is a list of MDAnalysis Atomselect statements, and args is a list of arbitrary user specified data. The full textual name of the subsystem factory function is stored in the database, e.g. proto md.subsystems.RigidSubsystemFactory. The System object’s initializer dynamically instantiates an instance of the named factory function and calls it to create a list of SubSystems. This approach to object creation maximizes the flexibility and the ease with which one can extend ProtoMD. All one needs to do is provide the textual name of a factory function. This named function does not need to be part of ProtoMD. The named factory 10 function simply needs to be reference-able in the python path. So this approach allows one to extend ProtoMD without having to change any ProtoMD source code. This could prove to be very useful in a situation where one is developing some new type of CG variable and one does not have administrator privileges, and ProtoMD is installed in a system path. Users can develop a new SubSystem object, store it in their home directory, and simply specify the name of their subsystem factory in the ProtoMD database; ProtoMD will automatically use their new SubSystem object. A SubSystem derived object must implement the SubSystem interface. Because the implementation is in python, it is technically not a requirement for subsystems to be derived from the SubSystem class; they simply must define and implement all the methods defined by the SubSystem class. The proto md.subsystems.SubSystem class only defines the interface that subsystem classes must implement. • def universe changed(self, universe): changes the universe. This occurs when the number of atoms has changed (on startup, in the event of solvation or chemical reactions). So, init will be called when a new Sub- System derived object is created, then after the universe is created, it will call universe changed. • def frame(self ): notifies the subsystem that a new frame is ready. It should return a tuple (position, velocity, force) of CG variables for the current frame. If the CG positions are stored in a multi-dimensional tensor (e.g., 3 × 3 × 3), then they need to be unfolded and returned as a 1 × 27 row vector, and similarly for the CG velocity and force variables. • def translate(self, values): translates all atomic positions given a coarse grained variable. For example, if the CG variable is the center of mass of a macromolecule, then translate would be given a 1 × 3 vector which 11 represents how much to shift the center of mass. Next, the subsystem would simply add this value to all atomic positions. The values argument is the difference in CG variable space between the current configuration and a deformed configuration. This is given as an unfolded 1 × n column vector, and if the CG variables are tensorial, then values needs to be folded. • def minimized(self ): is called after the atomic system is minimized; no return value is required. This method is provided if the CG variables must be modified in the event of minimization. • def equilibrated(self ): is called after the atomic system is relaxed (thermalized in isothermal simulations); no return value is required. • def md(self ): is called after the MD trajectories are processed; no return value is required. ProtoMD currently provides two types of SubSystems. The first uses the center of mass as the CG variable and is denoted proto md.subsystems.RigidSubsystem. This subsystem serves as a demonstration, because it is probably not useful beyond spherically symmetric systems. A more sophisticated and useful sybsystem is proto md.subsystems.SpaceWarpingSubSystem, which constructs a set of CG variables based on an orthogonal polynomial basis set [10, 22]. In principle, any set of orthogonal polynomials may be used, but currently only Legendre polynomials are implemented. 3.3. Coevolution integration The Integrator class performs the time evolution of the system with the application of the classical propagator Γ(tn+1 ) = eiLt Γ(tn ). The Integrator currently has two concrete implementations actualizing the physical ideas presented in sec. (2): the LangevinIntegrator implements the concepts presented 12 in sec. (2.0.1) while the FactorizationIntegrator implements those presented in sec. (2.0.2). 3.4. Key objects and packages ProtoMD consists of a number objects, the key constituents are discussed here, and are displayed in Fig. (2). The principle object is conveniently called the System. This object manages the state of the simulation and is roughly analogous to OpenGL context (Fig. (3)). All state variables are contained entirely in the System. Additionally, all other objects are contained within the System. system.System system.System is the top level class, it is a container class for all the other classes. It is also the single place where ProtoMD handles data persistence. The System class can be used either for loading and simulation, or loading and analysis. The System constructor takes the path to a database file, and a flag indicating the mode to open the file, defaults to “r” - read only. For running a simulation, it should be “a”. This flag is relevant in that a simulation can be running in one process (in read / write mode), and another process can safely open the database in read only mode and not interfere with a running simulation. This is useful for checking the state of a running simulation. All simulation data is stored in a single database file, and the System class provides a programatic interface. All completed timesteps are accessible via the timesteps property. 3.4.1. State management and data persistence The present state of a running simulation is contained wholly within the System.current timestep and System.universe objects. Molecular simulation programs have typically stored their data in a variety of proprietary, non-standardized, and incompatible file formats. For example, 13 System universe subsystems timesteps cg_positions cg_velocities cg_forces current_timestep begin_timestep() end_timestep() translate(cg) SubSystem universe_changed(universe) frame() translate(cg) minimized() equilibriated() TimeStep atomic_minimized_positions atomic_equilibriated_positions atomic_starting_positions atomic_final_positions timestep_begin timestep_end cg_positions cg_velocities cg_forces cg_translate Integrator system run() atomistic_step() cg_step() Figure 2: A UML diagram of the key objects and their key attributes in ProtoMD. GROMACS [2] uses 31 different file formats, some of which are deprecated or used only as input. Even though most MD file formats are documented, accessing them entails developing a parser for each format. Rather than invent yet another format, ProtoMD used the standardized and widely supported hierarchical data format version 5 (HDF). HDF is a self-describing file format designed for the general exchange of numerical data. All information relevant to a ProtoMD simulation is stored in a single HDF5 database. This approach allows preparation of a database on a completely different computer than the one on which the simulation is intended to be run. All molecular dynamics topologies, force field parameters, simulation parameters, coordinates, etc., are stored in the HDF5 database. One could even start a simulation on one compute cluster, run for a while, and restart the simulation on a different cluster by simply copying a single file. 14 Considering the environment in which MD simulations are run, a compute cluster is sometimes unreliable. Compute nodes and data storage frequently crash or are rebooted. Programs are often terminated by the job scheduler. Flaws or faults in external programs become evident, and countless other environmental conditions can cause termination of executing programs. Molecular dynamics simulations produce a considerable amount of data. ProtoMD also produces and uses a good deal of disparate data, ranging from simulation parameters, archived molecular dynamics input files, and atomic trajectory time series. Therefore, an efficient, fault-tolerant data storage system which readily allows restarts is a key, if not the core, requirement. For on disk file storage, the HDF5 file format was chosen. HDF stands for “hierarchical data format.” HDF5 is an open standard file format for the high performance storage and retrieval of an unlimited variety of datatypes. HDF5 provides a library that allows access in a variety of languages. In this sense, it can be thought of as similar to XML. However, XML is plain text (though it can be compressed), it is nonetheless inefficient for storage of large amounts of numerical data. HDF5 can can be accessed natively in a variety of different mathematical and analysis packages such as Mathematica, MATLAB, and R. A data-persistence model was chosen that is similar to that of most graphics libraries such as Cairo, OpenGl, Cocoa, or Windows. All of these libraries function essentially the same way. One creates or obtains a graphics context for an appropriate output device, be it the screen, a printer, or an image buffer. The graphics context is a data structure which maintains information about the current drawing state. Then there are drawing functions, such as drawRect, drawPoly, and so forth, which add graphics primitives to the context. In the case of OpenGl, one adds a sequence of 3D information to the context. The key point here is that all of these drawing operations operate on the data structure, 15 void d i s p l a y ( ) { g l B e g i n ( GL LINES ) ; glVertex3f (1.0 , 1.0 , 0 . 0 ) ; g l V e r t e x 3 f ( −1.0 , −1.0 , 0 . 0 ) ; glEnd ( ) ; glFlush ( ) ; Figure 3: A sample OpenGL code that displays a line on the screen. Objects in ProtoMD manage the state of the simulation in a similar way to that employed in OpenGL. and not the output device directly. Once all of the drawing operations are complete, then the graphics context is passed to the output device in a single atomic operation. In contrast, if one were to output each drawing operation immediately to the output device, it would be very cost prohibitive as that output device must be locked, data written to it, and then the device must be unlocked again. It would be a similar case if every function that produced some relevant output information wrote that information directly to disk and flushed the IO buffers. Therefore, all operations that affect the state of a TimeStep write to the TimeStep data structure, which is a memory resident object. The TimeStep object is created by the System.begin timestep(), accessed with the System.current timestep, and atomically persisted to disk and deleted from memory via System.end timestep(). The TimeStep object behaves as a data structure, however, it is actually implemented as an HDF5 data set. The reason for this is that the data stored in a TimeStep object can potentially be huge. Therefore, the underlying HDF5 implementation determines how to best store this data, e.g., all data directly in memory, or some portions stored on disk. 3.5. Simulation The command-line interface of ProtoMD can be accessed via python -m proto md. This front end program is intended for initial database generation 16 and running the simulation. There is an extensive built in help system accessible via: python -m proto md –help. Available options will likely change in the future, so it is best to consult the help system for the latest complete up-to-date information. This subsection will describe how a simulation is performed from both implementation and user perspectives. 3.5.1. Initial database creation The first step in setting up a simulation is the creation of a database. We have provided a command line program that front ends the config.create sim(...) function. This function (program) reads a set of parameters such as simulation parameters, subsystem definition and atomic structure files. It can optionally generate a topology (using pdb2gmx), perform a series of validations, and finally package everything up into a ProtoMD database file, from which a ProtoMD simulation can be run. 3.5.2. Stand-alone programs and libraries ProtoMD can be used either as a stand alone program, or as a library. There are many instances where it is more convenient to use ProtoMD as a program, e.g. submission of simulations to a queuing system, or quick and simple analysis tasks. The stand alone program interface to ProtoMD is made possible by python’s module mechanism, i.e. when a package is installed with a main .py file. 3.5.3. Interfacing with molecular dynamics programs ProtoMD requires external MD programs to perform a number of calculations, and it interfaces with external MD programs using a very transparent approach; ProtoMD handles all interaction with external programs through context 17 managers. All of the MD related functionality is contained in the proto md.md package. Programmatically, interfacing with external MD programs is frequently fraught with error primarily due to the large number of input files that these MD programs require to be present on the file system. Most MD packages do not provide an application program interface (API), and are intended to be userinvoked programs. This presents difficulty to calling programs, as they must determine the current state of the file system, copy and convert large portions of internal state into MD input files, call the MD program, parse the output, and finally clean up after the MD program is run. Significant sources of error arise when the MD program reads state from locations that the calling program is unaware of, for example, external configuration files, and force field parameter files. Furthermore, a more common source of error is the MD program reading previous input files that perhaps were not deleted properly. In effect, most of these errors arise from the fact that input state required by MD can be scattered throughout the file system. This can be analogous to a programming language with no variable scoping, i.e. all variables are global: in such situations, one loses the concept of functions, and all subroutine calls consist purely of sideeffects. Yet another error occurs when concurrent copies of the calling program are run: if these programs have to use the file system to pass state to an MD program, they will overwrite each others’ file system state. To circumvent or sequester these errors, ProtoMD generates temporary, randomly named directories into which the requisite MD input state variables are serialized, the necessary files are copied, and various GROMACS programs are run via GROMACS wrapper [36]. The use of temporary randomly named directories to store MD state minimizes the possibility of using other files or input that the calling program is unaware of. These temporary directories only ex- 18 ist for the lifetime of the MD program call. In effect, we have created stack frame, the set of information required for, and existing only for the lifetime of procedure call. The lifetime management of these temporary file system based variables is implemented using a Resource Acquisition Is Initialization (RAII) idiom [37]. This is a programming idiom, widely attributed to Bjarne Stroustrup, which states that a class with a constructor that acquires the resource (either by creating it, or obtaining it via a parameter) has a destructor that always releases the allocated resource. During the object lifetime, the resource may be accessed via instance variables, and is guaranteed to be freed when the object is either deleted or goes out of scope. RAII is a widely used pattern in C++: it is a safe way to deal with resources and also makes the code much cleaner as it eliminates the need to mix error handling code with functionality. This idiom was traditionally not used in non-deterministic, garbage collected languages like python. In C++, the object is destroyed when it either goes out of scope (if stack allocated), or deleted (if heap allocated). In python, however, there is no explicit control over when the object is destroyed. Therefore, if it contained a resource, that resource may potentially exist for the lifetime of the program. This limitation was overcome with PEP343, where context managers, provide enter () and exit () methods that are explicitly invoked on entry to and exit from the body of the with statement. For instance, the open statement returns a context manager containing an open file handle, it is kept open as long as the execution is in the context of the with statement where it is used, and closed when execution leaves the block, irrespective of whether the exit was because of an exception or during regular control flow. The with statement can thus be used in ways similar to the RAII pattern in C++: some resource is acquired by the with statement and released when the with block is left. 19 ProtoMD wraps all external MD calculations into a set of functions which return a proto md.md.MDManager object. This is a context manager which holds all of the file system objects required for, and returned by, MD programs in the file system directory. Calling functions can access any of the MD results through a dictionary interface. For example, the function md.solvate(...) accepts an MDAnalysis.Universe object, along with other parameters, creates a random temporary directory, write the contents of the universe object, performs a solvation using external GROMACS programs, and finally returns a context manager containing this directory with the solvated structure and topology files. Calling functions can access these files with the “struct” and “top” keys, i.e. with md. s o l v a t e ( . . . ) a s s o l : d o s o m e t h i n g ( s o l [ ” s t r u c t ” ] , s o l [ ” top ” ] ) As soon as the with block is exited, the context manager deletes the temporary directory, and all its contents. Therefore, we have eliminated the chance that subsequent calls to external MD programs could pick up input files that were left in this directory. Furthermore, as each call to an MD function creates a unique and random directory, multiple copies of ProtoMD may be concurrently executed. 3.5.4. Solvation The CG variables used in ProtoMD are functions of atomic coordinates of the macromolecule of interest (although in principle, one could define the solvent as a subsystem). In MD simulations, the solvent atoms typically account for at least 50% to 75% by particle count of the total system. After the end of each CG timestep, ProtoMD extracts the macromolecule, which is then resolvated using Gromacs, and counter-ions (by default NaCl) are added to the system. The reason resolvation is done is to avoid bad atom contacts between 20 the macromolecule and the solvent. In the case of simulations done in vacuum or using implicit solvent models, solvation is omitted. 4. User guide The purpose of this section is to describe how users can launch and analyze multiscale simulations via the ProtoMD platform. ProtoMD can be used as an executable module, i.e. a program with command-line arguments. In this mode, a GROMACS compatible structure input file can be used to initiate a multiscale simulation. The user must also supply GROMACS-specific parameters for running MD simulations (used for the microscopic phase), and ProtoMDspecific parameters needed for the mesoscopic phase wherein the CG variables are advanced in time and new microstates are generated. The output is an hdf file that contains information including the trajectory at user-specified CG time steps. In the second mode, ProtoMD can be used as a library containing modules that can be assembled into programs; this mode facilitates debugging runs and analyzing results. 4.1. Genaral concepts ProtoMD is built on the concept that everything related to a simulation (including configuration parameters and output) is stored in a single database file. Most traditional MD programs rely on a large number of configuration and parameter files. Even though having a number of files in different locations makes it relatively easy for users to modify these files, it is more convenient to reproduce a single input file when rerunning a simulation at either a later date or on another computer. Since ProtoMD is a wrapper around GROMACS, it requires GROMACSspecific files. When a simulation is first prepared, all these external GROMACS configuration files are collected and stored in the ProtoMD database file. These 21 files are automatically extracted and passed to GROMACS when the simulation is run. This ensures that results are always reproducible because there are no external dependencies. Furthermore, this approach facilitates moving or restarting simulations on different computers. Subsection 4.3 shows how to use the ProtoMD command line interface to prepare an initial database. One can iterate through the output trajectory frames in a very similar manner to that used in MDAnalysis [38]. Details of how one can programmatically access the ProtoMD database including trajectory analysis is covered in section 4.4. Nearly all ProtoMD objects and methods have descriptive docstrings. These strings form the contents of the interactive python help system, and it can be accessed by typing the name of the class or method followed by a “?” in the interactive python interpreter. 4.2. Selections A key concept in ProtoMD is atomic selections. Atomic selections are a group or subset of atoms selected from the whole system based on some criteria. They are used throughout ProtoMD, particularly in defining subsystems (sec. 2.1). MDAnalysis [38] provides a selection language with a syntax similar to that of CHARMM [1] or VMD [35]. One can think of this atomic selection language as “SQL for atoms”. The atomic selection language provides a way to select atoms based on a variety of criteria such as boolean combinations of simple keywords, geometric or connectivity measures, index, name, residue ID, or segment names. Examples on selection statements using MDAnalysis are as follows: # s e l e c t a l l DMPC l i p i d atoms u n i v e r s e . s e l e c t A t o m s ( ” s e g i d DMPC” ) # s e l e c t a l l atoms w i t h i n 5 . 0 Angstroms from ( 0 , 0 , 0 ) 22 universe . selectAtoms ( ” point 0.0 0.0 0.0 5.0 ” ) # s e l e c t a l l atoms b e l o n g i n g t o r e s i d u e s 1 t o 10 universe . selectAtoms ( ” r e s i d 1:10 ” ) In ProtoMD, the user creates a list of selections to define the subsystems. These selections must not overlap, however, because each subsystem must constitute one or more segment(s). For example, it is natural to define the subsystems at the pentamer or hexamer level for a virus-like particle (Fig. (1)). 4.3. Preparing a system ProtoMD can be run from the command line as follows: python -m proto md command [args], where command can be one of the following: • config: creates a database from which a multiscale simulation can be run • top: auto-generates a topology file • solvate: attempts auto-solvation • run: runs or resumes a simulation • runsol: like run, but performs only the auto-solvation step • mn: performs energy minimization • mneq: performs energy minimization followed by equilibriation • mneqmd: performs energy minimization, equilibriation, and molecular dynamics • eq: performs equilibriation • atomistic step: performs a full atomistic step (microphase) • step: performs a single complete coarse-grained step (mesophase) 23 • md: performs only an MD step with the starting structure • cg step: performs only the coarse grained portion of the time step 4.3.1. Configuring the system The first step in running a multiscale simulation is to create the initial database. This can be accomplished by using the following commands: python -m proto md config [args] where args can be one of the following: • -o fid FID name of the simulation file to be created. • -struct STRUCT starting structure name. • -box BOX BOX BOX x,y,z values of the system box in Angstroms. If box is not given, the system size is read from the CRYST line in the structure pdb. • -top TOP name of the topology file. If this is not given, ProtoMD attempts to generate a topology file using pdb2gmx. • -posres POSRES name of a position restraints file, optional. • -I INCLUDE DIRS name of the directories to search for topology file includes. There can be many additional include directories, just like gcc, but UNLIKE GCC, there must be a space between the -I and the dir, for example -I /home/foo -I /home/foo/bar. • -temperature TEMPERATURE is the temperature (in Kelvin) at which to run the simulation. The default temperature is 300 K. • -subsystem factory SUBSYSTEM FACTORY is a fully qualified function name which can create a list of subsystems. The function defaults to proto md.subsystems.SpaceWarpingSubsystemFactory which uses the space 24 warping method for constructing the CG variables for each subsystem. Currently, only Legendre polynomials are supported. • -subsystem selects SUBSYSTEM SELECTS [SUBSYSTEM SELECTS ...] is a list of MDAnalysis select statements, one for each subsystem. For example, if the system is divided into 10 subsystems, subsystem selects takes a list of length 10, with each entry in the list defining a given subsystem. • -anion anion name to use for ionization (if system is not neutral). The default anion is Cl− . • -cation cation name to use for ionization (if system is not neutral). The default cation is Na+ . • -concentration concentration (in mol/L) of ions to use for ionization. The default is 0.15 M. • -subsystem args SUBSYSTEM ARGS [SUBSYSTEM ARGS ...] is a list of additional arguments passed to the subsystem factory, the second item of the list may be the string resid unique, which creates a separate subsystem for each residue. The most commonly used subsystem is SpaceWarpingSubsystemFactory. The args for this subsystem are [kmax, OPTIONAL(“resid unique”)]. kmax provides the upper bound on the sum of the three Legendre indices, and the last argument is the optional string “resid unique” to make a unique subsystem for each residue. • -integrator INTEGRATOR is a fully qualified name of the integrator function, defaults to proto md.integrators.LangevinIntegrator, and the other integrator provided is proto md.integrators.FactorizationIntegrator. • -integrator args INTEGRATOR ARGS are additional arguments passed to the integrator function. 25 • -cg steps CG STEPS is the number of coarse grained time steps taken over the duration of a multiscale simulation. • -dt DT is the size of the coarse grained time step in picoseconds. • -mn steps MN STEPS is the number of MD steps used in performing energy minimization. • -md steps MD STEPS is the number of MD steps used for performing the MD run of the microphase • -multi MULTI is the number of parallel MD runs used to construct microstate ensembles for the Langevin integrator. • -mn args MN ARGS is a dictionary of GROMACS-specific (mdp) parameters used in minimization. • -md args MD ARGS is a dictionary of GROMACS-specific (mdp) parameters used in MD. • -ndx NDX is the name of GROMACS-specific index file. • -solvate is a boolean flag. If it is set and the system is to be auto-solvated, the initial structure must not contain any solvent. This flag default to False. To enable solvation, add ‘-solvate’ with no args. • -debug enables debug mode (all simulation directories are not deleted). • -mainselection MAINSELECTION is the name of make ndx group which is used for the main selection. This should be a group that consists of the non solvent molecules. “Protein” usually works when simulating protein. However, another selection command is required when simulating lipids. To find which selection to make, either run ProtoMD config, and an error 26 will pop up informing the user of the available groups, or run make ndx to get a list of possible selections. 4.3.2. Examples Algorithms on which ProtoMD is based has been demonstrated for a number of systems [11, 14, 34, 27, 39]. The following example illustrates the use of the ProtoMD command line interface to generate a simulation database. In the following example, a database file named “1LFG-open.hdf” is created using the starting structure of Lactoferrin, “1LFG-open.gro”. ProtoMD generates a single-trajectory simulation (multi 1) for a series of 30 CG time steps, each of length 1 ps (dt 1). The order of Legendre polynomials to be used is less than or equal to 1 (subsystem args 1), which implies a total of 4 × 3 CG variables are used to coarse-grain the protein. Furthermore, this example generates a single subsystem which is specified by the “not segid SOL” selection statement. This statement specifies that the subsystem should consist of all the atoms present in the system which are not solvent atoms (i.e., protein atoms). Note that here no topology file is specified; in this case, ProtoMD uses the ‘pdb2gmx” program to generate a topology file, which is stored in the hdf database. 4.4. Accessing the database and analyzing simulations proto md.System object is the main class in ProtoMD responsible for loading and accessing the underlying hdf5 database, and for orchestrating the various MD programs used to advance the simulation. The System object allows one to open the underlying hdf5 database and access all pertinent data as native python objects. One can create a System object in either read/write or readonly mode. One typically creates a System object in read-only mode to analyze and access simulation results, and only the actual running simulation creates the System object in read/write mode. There can be multiple instances of the 27 #! / b i n / sh python −m proto md c o n f i g \ −o 1LFG−open . hdf \ −s t r u c t 1LFG−open . g r o \ −t e m p e r a t u r e 3 0 0 . 0 \ −s u b s y s t e m s e l e c t s ” not s e g i d SOL” \ −c g s t e p s 30 \ −dt 1 \ −mn steps 5000 \ −md steps 500 \ −m u l t i 1 \ −i n t e g r a t o r \ proto md . i n t e g r a t o r s . F a c t o r i z a t i o n I n t e g r a t o r \ −s u b s y s t e m f a c t o r y \ proto md . s u b s y s t e m s . SpaceWarpingSubsystemFactory \ −s u b s y s t e m a r g s 1 System object being open in the same database. For example, a simulation can be currently running (on another process or even another node in a compute cluster), and a user can create a System object to access the simulation. Provided the user creates the System object in read-only mode, it will not interfere with the currently running simulation. The System object has a number of important properties and methods. However, from an analysis perspective the most important is the System.timesteps property. This is stored in the database and a series of hdf5 groups, but is presented to the user as a list of Timestep objects. The Timestep object provides access to all of the calculated variables in a CG step as Numpy arrays, and the Timestep object can even create an MDAnalysis Universe object from the CG timestep data. The Timestep object provides the following properties: • atomic minimized positions • atomic equilibriated positions 28 • atomic starting positions • atomic final positions • timestep begin • timestep end • cg coords • cg velocities • cg forces • cg translate The Timestep object is also capable of writing itself to the hdf5 database. This is accomplished with the Timestep.flush() method. Writing to the database is normally only performed by a simulation, and will fail if the System was opened in read-only mode. For analysis, the Timestep object provides a useful “create universe()” method, which creates an MDAnalysis Universe from the timestep object. Once the Universe is created, it can be loaded with the atomic positions listed above. Syntactically, it would have been simpler if each timestep possessed a universe property; however, the implementation of such a system would have entailed either creating a new universe each time the universe property is called (computationally expensive), or create a caching system. It is relatively simple to first obtain a universe object then set the atomic properties during the course of iterating over the timesteps. 4.4.1. Examples This section provides an example on analyzing a ProtoMD trajectory using the universe object and the timesteps. 29 • Radius Of Gyration: This example first obtains a universe object, then sets the atomic positions with the values stored in each timestep. Finally, it performs the calculation and appends the value to a list. # i m p o r t t h e proto md p a c k a g e import proto md # c r e a t e a system from t h e f i l e ‘ m y s i m u l a t i o n . h d f ’ s y s = proto md . System ( ‘ m y s i m u l a t i o n . hdf ’ ) # c r e a t e a new u n i v e r s e o b j e c t , t h i s i s now # s e t to the i n i t i a l c o o r d i n a t e s u = s . universe # c r e a t e a new empty l i s t l = l i s t () # i t e r a t e over the timesteps for ts in sys . timesteps : # s e t t h e u n i v e r s e atomic p o s i t i o n s # to the values stored in the database timestep . u . atoms . p o s i t i o n s = t s . a t o m i c f i n a l p o s i t i o n s # c a l c u l a t e the r adius of gyration r o g = u . atoms . r a d i u s O f G y r a t i o n s ( ) # append r o g t o t h e l i s t l . append ( r o g ) • Radius Of Gyration (alternative version): perhaps a more ‘pythonic’ method of calculating the radius of gyration (or any other metric) would be to create a higher order function once the universe is obtained and populate it with the atomic positions and return it. In this approach, list comprehension can be used to create a list of values using a single list comprehension statement: # i m p o r t t h e proto md p a c k a g e import proto md 30 # c r e a t e a system from t h e f i l e ‘ m y s i m u l a t i o n . h d f ’ s y s = proto md . System ( ‘ m y s i m u l a t i o n . hdf ’ ) # c r e a t e a new u n i v e r s e o b j e c t , t h i s i s now # s e t to the i n i t i a l c o o r d i n a t e s u = sys . universe # f u n c t i o n which t a k e s an a r r a y o f atomic p o s i t i o n s , # s e t s t h e u n i v e r s e atomic s t a t e and r e t u r n s t h e # universe object . def u f ( pos ) : # s e t t h e u n i v e r s e ’ s atomic p o s i t i o n s u . atoms . p o s i t i o n s = pos return u # c a l c u l a t e the value in a s i n g l e l i s t statement . r o g = [ u f ( t s . a t o m i c f i n a l p o s i t i o n s ) . atoms . radiusOfG− y r a t i o n ( ) f o r t s in s y s . t i m e s t e p s ] • Distance Between Proteins: This example is taken directly from the MDAnalysis website, but here MDAnalysis is used to analyze a coarse grained time series. In this example, the end to end distance of a protein and the radius of gyration are calculated. This is assuming that the protein in question has the segment identifier “S4AKE”. This example uses MDAnalysis atom groups. Note, any analysis of ProtoMD timesteps performed using MDAnalisys virtually identical to conventional MD analysis, the only difference being that one must to set the atomic positions from the ProtoMD timestep values. # i m p o r t t h e proto md p a c k a g e import proto md # c r e a t e a system from t h e f i l e ‘mySim . h d f ’ s y s = proto md . System ( ‘ mySim . hdf ’ ) # c r e a t e a new u n i v e r s e o b j e c t , t h i s i s now # s e t to the i n i t i a l c o o r d i n a t e s u = s . universe 31 # t h e f i r s t n i t r o g e n atom i n t h e 4AKE segment . nterm = u . s4AKE .N [ 0 ] # l a s t carbon atom i n t h i s segment cterm = u . s4AKE . C[ −1] # a s e l e c t i o n ( a AtomGroup ) bb = u . s e l e c t A t o m s ( ‘ p r o t e i n and backbone ’ ) # iterate f o r t s in # # # t h r o u g h a l l frames sys . timesteps : s e t t h e u n i v e r s e atomic p o s i t i o n s t h i s i s t h e o n l y d i f f e r e n t a s p e c t from c o n v e n t i o n a l MD time s e r i e s a n a l y s i s . u . atoms . p o s i t i o n s = t s . a t o m i c f i n a l p o s i t i o n s # end−to−end v e c t o r from atom p o s i t i o n s r = cterm . pos − nterm . pos # end−to−end d i s t a n c e d = numpy . l i n a l g . norm ( r ) # AtomGroup i s a ‘ l i v e ’ o b j e c t , i t i s u p d a t e d # each time new c o o r d i n a t e s a r e l o a d e d . r g y r = bb . r a d i u s O f G y r a t i o n ( ) print ” frame = %d : d = %f Angstrom , Rgyr = ” \ ”%f Angstrom ” % ( t s . frame , d , r g y r ) 4.5. Running simulations Once an hdf file is created, a multiscale simulation is initiated by invoking: python -m proto md run file.hdf [-debug]. If the user supplies the optional parameter -debug, ProtoMD will not delete any of the directories created to run MD. This is useful for debugging and tracking problems with unstable simulations. As an example on running a multiscale simulation using the Trotter integrator, pertussis toxin (PDB ID: 1PRT), a protein-based AB5-type exotoxin [40] 32 was simulated under NVT conditions at 300K. The CHARMM27 force field [23] was used in explicit solvent using the TIP3P model [41]. NaCl counter-ions of concentration 0.15 M were added for charge neutrality. The system consisted of 603, 775 atoms in a box of dimensions 16nm × 16nm × 24nm. The CG timestep was set to 0.5 ps with the micro (MD) phase taken to be 100 fs and the equilibriation time 50 fs. To prepare the system, the following script was run: #! / b i n / sh python −m proto md c o n f i g \ −o 1PRT. hdf \ −s t r u c t 1PRT. g r o \ −t e m p e r a t u r e 3 0 0 . 0 \ −s u b s y s t e m s e l e c t s ” p r o t e i n ” \ −c g s t e p s 7400 \ −dt 0 . 5 \ −mn steps 0 \ −e q s t e p s 50 \ −md steps 100 \ −t o p a r g s { ’ f f ’ : ’ charmm27 ’ } \ −m u l t i 1 \ −i n t e g r a t o r \ proto md . i n t e g r a t o r s . F a c t o r i z a t i o n I n t e g r a t o r \ −s o l v a t e \ −s u b s y s t e m f a c t o r y \ proto md . s u b s y s t e m s . SpaceWarpingSubsystemFactory \ −s u b s y s t e m a r g s 1 The last argument ‘subsystem args 1’ signifies that only first order Legendre 33 polynomials are used in constructing a total of 4×3 space-warping CG variables (sec. 2.1). Once the hdf file 1PRT.hdf was created, a multiscale simulation was initiated with the following command: python −m proto md run 1PRT. hdf The simulation performs a series of 7, 400 CG steps, writing the minimized, equilibriated, and extrapolated coordinates of all protein atoms at the end of each CG timestep. The protein undergoes a conformational change over this time period, which is visualized by plotting the root-mean-square deviation (RMSD) of the protein atomic positions with the reference configuration set to the protein positions at t = 0 ns. For comparison, two MD runs were also done (starting from the same initial conditions) at the same thermodynamic conditions. import proto md import numpy a s np from MDAnalysis . a n a l y s i s . a l i g n import ∗ s y s = proto md . System ( ‘ 1PRT. hdf ’ ) # c r e a t e a new u n i v e r s e o b j e c t , t h i s i s now # s e t to the i n i t i a l c o o r d i n a t e s mob = s . u n i v e r s e # create a reference r e f = proto md . System ( ‘ 1PRT. hdf ’ ) f o r t s in s y s . t i m e s t e p s : # a l i g n the mobile to the reference s t r u c t u r e 34 a l i g n t o (mob , r e f , s e l e c t=” p r o t e i n ” , m a s s w e i g h t e d=True ) # s a v e rmsd t o f i l e r m s d l i s t . append ( rmsd (mob , r e f ) ) # w r i t e rmsd t o f i l e r m s d l i s t = np . a r r a y ( r m s d l i s t ) np . s a v e t x t ( ’ rmsd . dat ’ , r m s d l i s t ) 15 RMSD (Å) 10 5 MF MD1 MD2 0 0 1 2 Time (ns) 3 4 Figure 4: Variation in the RMSD of pertussis toxin protein as a function of time, using multiscale factorization (MF) and molecular dynamics (MD1 and MD2). 5. Acknowledgments This research was supported in part by the following: the NSF INSPIRE program (grant 1344263), the NSF division of material science research (grant 35 1533988), and the Indiana University information technology services (UITS) for high performance computing resources. References [1] Bernard R. Brooks, Robert E. Bruccoleri, Barry D. Olafson, David J. States, S. Swaminathan, and Martin Karplus. Charmm: A program for macromolecular energy, minimization, and dynamics calculation. J. Comput. Chem., 4:187–217, 1983. [2] David van Der Spoel, Erik Lindahl, Berk Hess, Gerrit Groenhof, Alan E. Mark, and Herman J. C. Berendsen. Gromacs: Fast, flexible, and free. J. Comput. Chem., 26:1701–1718, 2005. [3] James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant Kale, and Klaus Schulten. Scalable molecular dynamics with namd. J. Comput. Chem., 26:1781–1802, 2005. [4] S. Plimpton. Fast parallel algorithms for short-range molecular dynamics rigid bodies. J. Comp. Phys., 117:1–19, 1995. [5] David A. Pearlman, David A. Case, James W. Caldwell, Wilson S. Ross, Thomas E. Cheatham III, Steve DeBolt, David Ferguson, George Seibel, and Peter Kollman. Amber, a package of computer programs for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to simulate the structural and energetic properties of molecules. Comput. Phys. Commun., 91:1–41, 1995. [6] Robert E. Rudd and Jeremy Q. Broughton. Coarse-grained molecular dynamics and the atomic limit of finite elements. Phys. Rev. B, 58:5893–5896, 1998. 36 [7] Amy Y. Shiha, Anton Arkhipov, Peter L. Freddolino, and Klaus Schulten. Coarse grained protein-lipid model with application to lipoprotein particles. J. Phys. Chem. B, 110:3674–3684, 2006. [8] Amy Y. Shiha, Peter L. Freddolinoa, Anton Arkhipova, and Klaus Schulten. Assembly of lipoprotein particles revealed by coarse-grained molecular dynamics simulations. J. Struct. Biol., 157:579–592, 2007. [9] Peter J. Bond, John Holyoake, Anthony Ivetac, Syma Khalid, and Mark S.P. Sansom. Coarse-grained molecular dynamics simulations of membrane proteins and peptides. J. Struct. Biol., 157:593–605, 2007. [10] Khuloud Jaqaman and Peter J. Ortoleva. New space warping method for the simulation of large-scale macromolecular conformational changes. J. Comput. Chem., 23:484–491, 2002. [11] Srinath Cheluvaraja and Peter J. Ortoleva. Thermal nanostructure: An order parameter multiscale ensemble approach. J. Chem. Phys., 132:75102– 75110, 2010. [12] Yinglong Miao, J. E. Johnson, and Peter J. Ortoleva. All-atom multiscale simulation of ccmv capsid swelling. J. Phys. Chem. B, 114:11181–11195, 2010. [13] Harshad Joshi, Abhishek Singharoy, Yuriy V. Sereda, Srinath C. Cheluvaraja, and Peter J. Ortoleva. Multiscale simulation of microbe structure and dynamics. Prog. Biophys. Mol. Biol., 107:200–217, 2011. [14] Andrew Abi Mansour and Peter Ortoleva. Multiscale factorization method for simulating mesoscopic systems with atomic precision. J. Chem. Theory Comput., 10:518–523, 2014. 37 [15] I. Bahar, R. A. Atilgan, and B. Erman. Direct evaluation of thermal fluctuations in protein using a single parameter harmonic potential. Folding and Design, 2:173–181, 1997. [16] T. Haliloglu, I. Bahar, and B. Erman. Gaussian dynamics of folded proteins. Phys. Rev. Letters, 79:3090–3093, 1997. [17] M. K. Kim, R. L. Jernigan, and G. S. Chirikjian. Rigid-cluster models of conformational transitions in macromolecular machines and assemblies. J. Biophys., 89:43–55, 2005. [18] Siewert J. Marrink, Alex H. de Vries, and Alan E. Mark. Coarse grained model for semiquantitative lipid simulations. J. Phys. Chem. B, 108:750– 760, 2004. [19] Teemu Murtola, Emma Falck, Mikko Karttunen, and Ilpo Vattulainen. Coarse-grained model for phospholipid/cholesterol bilayer employing inverse monte carlo with thermodynamic constraints. J. Chem. Phys., 126:75101–75114, 2007. [20] W. G. Noid, Jhih-Wei Chu, Gary S. Ayton, Vinod Krishna, Sergei Izvekov, Gregory A. Voth, Avisek Das, and Hans C. Andersen. The multiscale coarse-graining method. i. a rigorous bridge between atomistic and coarsegrained models. J. Chem. Phys., 128:244114–244124, 2008. [21] J. C. Berendsen, David van der Spoel, and R. van Drunen. Gromacs: A message-passing parallel molecular dynamics implementation. Comput. Phys. Commun., 91:43–56, 2013. [22] Abhishek Singharoy, Harshad Joshi, Yinglong Miao, and Peter J. Ortoleva. Space warping order parameters and symmetry: Application to multiscale 38 simulation of macromolecular assemblies. J. Phys. Chem. B, 116:8423– 8434, 2012. [23] Jr. AD MacKerell, N. Banavali N, and N. Foloppe. Development and current status of the charmm force field for nucleic acids. Biopolymers, 56:257– 265, 2001. [24] Wendy D. Cornell, Piotr Cieplak P, Christopher I. Bayly, Ian R. Gould, Kenneth M. Merz Jr., David M. Ferguson, David C. Spellmeyer, Thomas Fox, James W. Caldwell, and Peter A. Kollman. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc, 117:5179–5197, 1995. [25] Peter J. Ortoleva. Nanoparticle dynamics: A multiscale analysis of the liouville equation. J. Chem. Phys., 109:2770–2771, 2005. [26] Stephen Pankavich, Zeina Shreif, Yinglong Miao, and Peter J. Ortoleva. Self-assembly of nanocomponents into composite structures: Derivation and simulation of langevin equations. J. Chem. Phys., 130:194115–194124, 2009. [27] Yuriy V. Sereda, John M. Espinosa-Duran, , and Peter J. Ortoleva. Energy transfer between a nanosystem and its host fluid: A multiscale factorization approach. J. Chem. Phys., 140:074102, 2014. [28] H. F. Trotter. On the product of semi-groups of operators. Proceedings of the American Mathematical Society, 10:545–551, 1959. [29] Stephen Pankavich, YingLong Miao, Jamil Ortoleva, Zeina Shreif, and Peter J. Ortoleva. Stochastic dynamics of bionanosystems: Multiscale analysis and specialized ensembles. J. Chem. Phys., 128:234908–234921, 2008. 39 [30] Abhishek Singharoy, Srinad Cheluvaraja, and Peter J. Ortoleva. Order parameters for macromolecules: Application to multiscale simulation. J. Chem. Phys., 134:44104–44120, 2011. [31] A. Singharoy, H. Joshi, and Peter J. Ortoleva. Multiscale macromolecular simulation: role of evolving ensembles. J. Chem. Inf. Model., 52:2638–2686, 2012. [32] Mary A. Rohrdanz, Wenwei Zheng, Mauro Maggioni, and Cecilia Clementi. Determination of reaction coordinates via locally scaled diffusion map. J. Chem. Phys., 134:124116–244124, 2011. [33] F Muller-Plathe. Coarse-graining in polymer simulation: From the atomistic to the mesoscopic scale and back. ChemPhysChem., 3:754–769, 2002. [34] Jing Yang, Abhishek Singharoy, Yuriy V. Sereda, and Peter J. Ortoleva. Quasiequivalence of multiscale coevolution and ensemble md simulations: A deomnstration with lactoferrin. Chem. Phys. Lett., 616-617:154–160, 2014. [35] William Humphrey, Andrew Dalkea, and Klaus Schulten. Vmd: Visual molecular dynamics. J. Mol. Graph. Model., 14:33–38, 1996. [36] Oliver Beckstein. Gromacswrapper. @ONLINE. [37] Bjarne Stroustrup. The design and evolution of c++. In The Design and Evolution of C++. Addison-Wesley, 1994. [38] N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. Mdanalysis: A toolkit for the analysis of molecular dynamics simulations. J. Comput. Chem., 32:2319–2327, 2011. 40 [39] Andrew Abi Mansour, Yuriy V. Sereda, Jing Yang, and Peter J. Ortoleva. Prospective on multiscale simulation of virus-like particles: Application to computer-aided vaccine design. Vaccine, 33:5890–5896, 2015. [40] Penelope E. Stein, Amechand Boodhoo, Glen D. Armstrong, Stephen A. Cockle, Michel H. Klein, and Randy J. Read. The crystal structure of pertussis toxin. Structure, 2:45–57, 1994. [41] W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys, 79:926–935, 1983. 41