Temporal Subspace Clustering for Molecular Dynamics Data

Anna Beer University of ViennaViennaAustria anna.beer@univie.ac.at , Martin Heinrigs LMU MunichMunichGermany heinrigs-martin@t-online.de , Claudia Plant University of ViennaViennaAustria plant@univie.ac.at and Ira Assent Aarhus UniversityAarhusDenmark ira@cs.au.dk

Abstract.

We introduce MOSCITO (MOlecular Dynamics Subspace Clustering with Temporal Observance), a subspace clustering for molecular dynamics data. MOSCITO groups those timesteps of a molecular dynamics trajectory together into clusters in which the molecule has similar conformations. In contrast to state-of-the-art methods, MOSCITO takes advantage of sequential relationships found in time series data. Unlike existing work, MOSCITO does not need a two-step procedure with tedious post-processing, but directly models essential properties of the data. Interpreting clusters as Markov states allows us to evaluate the clustering performance based on the resulting Markov state models. In experiments on 60 trajectories and 4 different proteins, we show that the performance of MOSCITO achieves state-of-the-art performance in a novel single-step method. Moreover, by modeling temporal aspects, MOSCITO obtains better segmentation of trajectories, especially for small numbers of clusters.

Clustering, Molecular Dynamics, Subspace Clustering

1. Introduction

Proteins play an important role in every living organism. Analyzing the transition between different shapes of a protein, also known as folding, is important for understanding the structure and function of proteins, e.g., for drug design. Many diseases like Alzheimer, Parkinson, and different types of cancer are connected to misfolding of certain proteins (Hartl, 2017). Thus, understanding such folding processes may help to prevent or cure diseases.

Molecular dynamics data contains trajectories of proteins that change their shape over time. It is usually very high-dimensional, with thousands of time steps and hundreds of atoms for a single protein. Analyzing molecular dynamics data with traditional clustering is typically not meaningful due to the curse of dimensionality. To mitigate issues with high dimensionality, subspace clustering methods detect clusters in lower-dimensional subspaces (Kriegel et al., 2012). However, existing subspace clustering methods do not cater for the characteristics of molecular dynamics data.

In this paper, we introduce MOSCITO (Molecular Dynamics Subspace Clustering with Temporal Observance), a subspace clustering algorithm that uses temporal regularization to exploit sequential relationships found in molecular dynamics trajectories that describe folding activities of single molecules. MOSCITO is the first one-step approach to handle this complex data type by identifying the best subspace projection. It finds features tailored specifically to molecular dynamics data. Leveraging methods from video analytics, we show that MOSCITO finds temporally contiguous groups of of protein conformations.

In trajectories, data points that are near in time are usually more closely related than those that are further apart. MOSCITO allows users to choose different sizes of time windows and different weighting methods for temporal regularization: binary, Gaussian, exponential, or logarithmic. Interestingly, the resulting clusters can be interpreted as states in a Markov State Model (MSM), which we use for comparing MOSCITO to approaches like SSC, PCA + k-Means, and TICA + k-Means. As we demonstrate in thorough empirical evaluation, MOSCITO successfully clusters molecular dynamics data in a model that directly captures the essential features and temporal relationships. Our main contributions are as follows:

•

We introduce MOSCITO, a subspace clustering method for molecular dynamics data that makes use of temporal relations between time steps that are close
•

MOSCITO finds temporally contiguous groups of protein conformations that are similar
•

In contrast to currently used methods that require several steps, the clusters found by MOSCITO can directly be used as states of a Markov State Model describing the trajectory, yielding state-of-the-art quality with a one-in-all holistic approach

2. Clustering trajectories of molecular dynamics data

In this paper, we study the trajectories of a single molecule over time. A trajectory is given by the three-dimensional Cartesian coordinates of all atoms of the molecule for a number of time steps with fixed intervals. Where most traditional clustering methods target numerical data in $\mathbb{R}^{d}$ , clustering MD data brings some unique challenges as we outline in the following.

Due to the size of MD data, automated methods for analyzing them are required. These often involve clustering methods to automatically group similar groups of atoms or timesteps.

In this paper, we focus on simulations of very long molecular dynamics trajectories that show folding processes of single molecules (Shaw et al., 2014). Based on the starting positions of a molecule’s atoms, physical forces on all other atoms can be calculated, yielding simulated trajectories with a temporal resolution that usually lies in the femtosecond range. While the overall shape of the molecules is decisive for its function, several features are shown to be relevant, e.g., the torsions between $C^{\alpha}$ atoms that form the backbone of the protein, distances between specific atoms, or the solvent accessible surface area.

There are different use cases for which a clustering of MD trajectories is required. E.g., clusters of atoms that move similarly over time can be used to find dynamic domains, i.e., fragments of a protein that are internally rigid (Romanowska et al., 2012). Clustering time steps in which the conformation of the protein, i.e., the arrangement of its atoms in space, is similar is used for detecting relevant states of the protein. Those states can be linked to (local) valleys in the energy landscape of the protein and are e.g. used for developing Markov models that describe a protein trajectory. Thus, one goal is to separate metastable states of a protein and the transitions between those states.

3. Markov state modeling

Markov state models (MSMs) can be used to describe the dynamics of a MD trajectory by using a $n\times n$ transition matrix. To generate an MSM from an MD trajectory $X=\{x_{1},...,x_{t}\}$ , discrete states of the trajectory need to be identified. State-of-the-art methods use a two-step approach in which they first reduce the dimensionality of the data by using PCA or TICA and then clustering it with $k$ -Means into $n$ discrete states $S=\{S_{1},...S_{n}\}$ (Wehmeyer et al., 2019). These states are used to calculate the transition probability matrix for a given lag time $\tau$ by counting the transitions from time step $t$ to $t+\tau$ :

(1)

p_{ij}(\tau)=\mathbb{P}(x_{t+\tau}\in S_{j}|x_{t}\in S_{i})

The resulting MSM can be used for further analysis of the trajectory. In this paper, we analyze the clustering performance of our method in comparison to state-of-the-art methods for clustering MD trajectories based on their suitability as Markov states.

The variational approach for Markov models (VAMP)(Wu and Noé, 2017) is a generalization of the Variational approach to conformational dynamics (VAC) (Nuske et al., 2014), (Noé and Nuske, 2013) and provides a tool to evaluate MSMs. The VAC allows the approximation of eigenvalues and eigenfunctions with statistical observables (Noé and Nuske, 2013). It also defines a family of scoring functions, called the $VAMP-r$ , which can be calculated from data. This score can be used to compare the choices of hyper-parameters, like the featurization of the data or the used clustering, used for constructing the MSM. The score is calculated from the singular value decomposition of the Koopman operator. The Koopman operator is explained below. It is important to note that the VAMP score can only be compared for MSMs constructed with the same lag time $\tau$ and cannot be used to find the optimal lag time $\tau$ itself.

The Koopman operator gives the conditional expected value of an arbitrary observable $g$ at time $t+\tau$ for a given $x_{t}$ . The Koopman operator $K_{\tau}$ of a Markov process is defined as follows (Wu and Noé, 2017):

(2)

\textit{K}_{\tau}g(x)=\mathbb{E}[g(x_{t+\tau}|x_{t}=x]

When choosing the Dirac delta function $\delta_{y}$ centered at y, applying the Koopman operator evaluates the transition density of the dynamics (Wu and Noé, 2017).

(3)

K(\tau)=C_{00}^{-\frac{1}{2}}C_{01}C_{11}^{-\frac{1}{2}}\approx U_{m}\Sigma_{m% }V_{m}^{T},

where $\Sigma_{m}=diag(\sigma_{1},...,\sigma_{m})$ are the first $m$ singular values.

VAMP-r Score

The VAMP-score (Wu and Noé, 2017) defines a function that can be used to score the quality of MSMs for a set lag-time $\tau$ . This score can be used to compare the methods used to create the discrete trajectories from the same featured trajectory (Suárez et al., 2021).

With $\Sigma_{m}$ as defined above (see Equation 3), the VAMP-r score can be written as:

(4)

\text{VAMP-}r=\sum_{i=1}^{m}\sigma_{i}^{r}

The VAMP-2-score is based on the sum of squared eigenvalues of the transition matrix and can be interpreted as kinetic content (Suárez et al., 2021). In this paper, we use the VAMP-2 score to score the performance of different discretization methods for MD trajectories since there usually is no given ground truth to compare to. The most commonly used scores are the VAMP-1, which is analogous to the Rayleigh trace, and the VAMP-2, which is analogous to the kinetic variance (Scherer et al., 2019).

4. Subspace clustering

In high-dimensional datasets where the curse of dimensionality renders traditional clustering infeasible, subspace clustering finds clusters in a union of lower dimensional subspaces (Qu et al., 2023). For a dataset $X=[x_{1},x_{2},...,x_{n}]\in\mathbb{R}^{d}$ it is often assumed that vectors $x_{j}$ are drawn from a union of $k$ subspaces $\{S_{i}\}_{i=1}^{k}$ of unknown dimensionality (Li et al., 2015). As MD data is typically very high dimensional, subspace clustering is a suitable approach. For details on subspace clustering, see (Parsons et al., 2004; Vidal, 2011). In this paper, we focus on spectral-based subspace clustering.

The advantages of spectral subspace clustering methods include a relatively simple implementation and robustness regarding data corruption and initialization (Qu et al., 2023). Spectral subspace clustering methods usually perform two steps. First, a subspace representation of the original data is learned and used to construct an affinity matrix. Then spectral clustering is used on the affinity matrix to obtain the final clustering. Many different spectral algorithms like sparse subspace clustering (SSC) (Elhamifar and Vidal, 2013) or Low-Rank Representation (LRR) (Liu et al., 2012) have been proposed, for an overview see (Vidal, 2011). While spectral clustering has been successfully applied to numeric data, it lacks the ability to handle temporal data. We, thus, turn our attention to methods that have been developed with a different application goal in mind, but, as we argue, with similarities that make them attractive for the design of our MD subspace clustering method.

Temporal subspace clustering (TSC) (Li et al., 2015) is a subspace clustering algorithm designed to segment video sequences into separate human motions. Video sequences are high-dimensional time series, with one frame of the video at each timestep. Significant changes in a video sequence usually are not immediate, but take place over many frames. Neighboring time frames are more correlated to each other than more distant frames. In that sense, molecular dynamics data is similar to video data. Unlike classical subspace clustering methods, which do not take any relations between the dimensions into account, TSC is designed to take advantage of the sequential relationship in time series data. TSC uses a dictionary learning approach together with a temporal Laplacian regularization function to encode those relationships in the dataset. Dictionary learning is a method that learns a set of basis vectors, a so-called dictionary (Tošić and Frossard, 2011). The original data can then be represented as linear combinations of the basis vectors. TSC combines the dictionary learning approach with a temporal regularization function to encode the sequential relationships in the data. A non-negative dictionary and a corresponding coding are learned from the given data creating a robust affinity graph.

5. Moscito: Temporal Subspace Clustering for MD Data

Refer to caption — Figure 1. MOSCITO derives MD features from input trajectories. Temporal regularization provides the means for effective subspace clustering with temporal relations.

We are now ready to describe our MOSCITO (MOlecular dynamics Subspace Clustering with Temporal Observance) approach, the first subspace clustering method for molecular dynamics data that captures temporal relations. Existing methods (Wehmeyer et al., 2019) often rely on a two- or three-step approach consisting of 1) dimensionality reduction, 2) clustering, and 3) finding macrostates by combining several clusters. As these steps are handled in isolation, their separate results may not be optimal overall. To avoid such undesirable side effects, MOSCITO takes a holistic approach: Firstly, in subspace clustering, it inherently combines clustering with dimensionality reduction. Secondly, MOSCITO incorporates temporal relation by weighting sequential neighbors in a time window. In contrast, existing subspace clustering algorithms cannot capture e.g., correlation between consecutive time steps.

For our new method MOSCITO, we exploit the similarity between MD data and video data (see Section 4) as well as the advances in the field of data mining on videos: TSC (Li et al., 2015) clusters video data while incorporating the temporal relations between time steps as well as allowing clusters in subspaces of the data. Thus, we base MOSCITO on TSC. However, TSC is not able to handle MD data, because despite some similarities between the videos and MD data (e.g., smooth temporal transitions between time steps and potential clusters in subspaces), they also have strong differences: for every time step , MD data describes a chain of atoms with its 3d coordinates, while video data contains colors for every pixel, where pixels are in a 2d grid. Thus, we need to change several steps and include knowledge from the field of MD as shown in Figure 1, which provides an overview of MOSCITO. The blue parts are equivalent to those of TSC (Li et al., 2015), see Section 7. The green parts are adapted or changed completely. At first, essential MD features are extracted from the MD trajectory. From the features and a temporal regularization function, we learn a dictionary and the corresponding representation coding matrix. We then construct an affinity graph, and generate temporal segments by clustering the affinity graph.

Feature extraction

From the input MD trajectors, we extract features that capture essential properties.
Cartesian coordinates: The $C^{\alpha}$ atoms determine the shape and function of a protein, thus we use the common approach to focus solely on their 3d coordinates, rather than considering all atoms. For each frame, coordinates are centered and aligned to the first frame of the trajectory.
Backbone torsions: The backbone torsions determine the main structure of a protein. The torsion angles $\phi$ and $\psi$ describe the dihedral angles between atoms of the backbone. Backbone torsions better describe the structure of a protein than absolute coordinates (Vaidehi and Jain, 2015) as internal angles are independent of the global rotation of the protein, removing the need to align all frames to a fixed rotation.
Distance based: Heavy atoms are all atoms that are not hydrogen. We extract the minimal distance between heavy atoms (Bonomi et al., 2009) (we exclude the pairs at indices(i, i+1) and (i, i+2), as their distance is largely determined by their chemical bonds). Following (Scherer et al., 2019), distances $d$ are transformed to the negative exponential function $e^{-d}$
Flexible torsions: The flexible torsion feature consists of the $\chi_{1}$ to $\chi_{5}$ side chain torsions. The side chains of a protein are chains of atoms, connected to the backbone. The angles are transformed into sine and cosine angles.
Solvent accessible surface area: For each frame of the trajectory, the solvent accessible surface area (SASA) of each residue is calculated using the Shrake-Rupley algorithm (Shrake and Rupley, 1973). The SASA of a residue is calculated by summing up the SASA of each of the residues atoms
3D shape histogram: Based on (Ankerst et al., 1999), a combined 3D shape histogram for each frame of the trajectory is created. The state space is separated into five equally spaced shells around the origin of the coordinate system. The shells are split up further by combining them with 12 sectors into 60 cells. Histograms are created by counting the number of atoms in each cell. Unlike (Ankerst et al., 1999), all atoms instead of only the surface atoms were used as we are not only interested in the shape and function of the protein at a certain time step, but also in its internal folded structure. Final histograms are normalized between 0 and 1.

Temporal regularization

For trajectories, data points at neighboring time steps are usually more closely related than those at more distant time steps. Let the $i$ -th column of the coding matrix $Z$ a representation of vector $x_{i}$ , then the encodings of its sequential neighbors, e.g., $z_{i-1},z_{i+1}$ should be similar to $z_{i}$ . To achieve this, we use a temporal Laplacian regularization function $f(Z)$ .

(5)

f(Z)=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}w_{i}j||z_{i}-z_{j}||_{2}^{2}=tr(% ZL_{T}Z^{T})

$L_{T}$ is the temporal Laplacian matrix defined as $L_{T}=\tilde{D}-W$ . $W$ is the weight matrix, capturing the sequential relationship in $X$ , and $\tilde{D}_{ii}=\sum_{j=1}^{n}w_{ij}$ is a diagonal matrix with the sum of the rows of $W$ on its diagonal.

The temporal regularization in MOSCITO weighs the sequential neighbors using a weight matrix $W\in\mathbb{R}^{n\times n}$ for a trajectory with $n$ timesteps. Row $i$ of $W$ represents the weights of $x_{i}$ ’s neighbors. In the base case, binary weights are used. This weighs all $s$ sequential neighbors of $x_{i}$ the same, independent of the actual temporal distance between the data points. In contrast to TSC, MOSCITO offers several Naturally, points in close temporal proximity are more strongly correlated than points further apart in time. To model the decrease in correlation, we introduce different decaying weightings:
Binary: weights of 1 for points in the neighborhood and 0 outside
Gaussian: Gaussian curve fitted to the neighborhood
Logarithmic: Logarithmic decay with a value of 1 for the direct neighbors dropping logarithmically to 0 for $x_{i+s}$ and $x_{i-s}$
Exponential: Exponential decay w.r.t. the temporal distance

Dictionary learning

Similar to TSC (Li et al., 2015), MOSCITO learns to represent dataset $X$ by a dictionary $D$ and a coding matrix $Z$ , such that $X\approx DZ$ (Li et al., 2015).

Adopting a least square regression approach to dictionary learning, we use the following objective function (Li et al., 2015)

(6)

\begin{split}\min_{Z,D}||X-DZ||_{F}^{2}+\lambda_{1}||Z||_{F}^{2}\\ \text{s.t. }Z\geq 0,D\geq 0,||d_{i}||_{2}^{2}\leq 1,\end{split}

where $||Z||_{F}^{2}$ denotes the Frobenius norm, and we use $\lambda_{1}$ for balancing both terms.

Importantly, MOSCITO integrates temporal regularization to capture relationships between nearby time steps. We achieve this by adding temporal regularization function $f(Z)$ such that our new objective function is the same as in TSC (Li et al., 2015)

(7)

\begin{split}\min_{Z,D}||X-DZ||_{F}^{2}+\lambda_{1}||Z||_{F}^{2}+\lambda_{2}f(% Z)\\ \text{s.t. }Z\geq 0,D\geq 0,||d_{i}||_{2}^{2}\leq 1,\end{split}

with $\lambda_{2}$ tradeoff parameter (Li et al., 2015).

To solve the dictionary learning problem efficiently, we follow the alternating direction method of multipliers (ADMM) (Li et al., 2015). It decomposes the problem into subproblems that are more easy to solve. Using auxiliary variables $U,V$ , Eq. 7 is equivalent to

(8)

\begin{split}\min_{Z,D,U,V}||X-UV||_{F}^{2}+\lambda_{1}||V||_{F}^{2}+\lambda_{% 2}f(V)\\ \text{s.t. }U=D,V=Z,Z\geq 0,D\geq 0,||d_{i}||_{2}^{2}\leq 1.\end{split}

The augmented Lagrangian of Eq. 8 is:

(9)

\begin{split}\mathcal{L}&=\frac{1}{2}||X-UV||_{F}^{2}+\lambda_{1}||V||_{F}^{2}% +\lambda_{2}tr(VL_{T}V^{T})\\ &+\langle\Lambda,U_{D}\rangle+\langle\Pi,V-Z\rangle+\frac{\alpha}{2}||U-D||_{F% }^{2}+\frac{\beta}{2}||V-Z||_{F}^{2},\end{split}

with $\Lambda$ and $\Pi$ Lagrangian multipliers (Li et al., 2015), which can now be solved by alternatingly minimizing $\mathcal{L}$ respective to $V,U,Z$ and $D$ .

Updating $\bf V$ : To find the minimum of $\mathcal{L}$ , its derivative with respect to $V$ is set to zero, resulting in the following equation (Li et al., 2015):

(10)

[I\otimes(U^{T}U+(\lambda_{1}+\alpha)I)+\lambda_{2}L_{T}\otimes I]vec(V)=vec(U% ^{T}X-\Pi+\alpha Z)

Updating $\bf U$ : As for $V$ , set derivative of $\mathcal{L}$ wrt. $U$ to zero (Li et al., 2015):

(11)

U=(XV^{T}-\Lambda+\alpha D)(VV^{T}+\alpha I)^{-1}

Updating $\bf Z$ and $\bf D$ : The update rules for $Z$ and $D$ are (Li et al., 2015):

(12)

Z=F_{+}(V+\frac{\Pi}{\beta}),\qquad D=F_{+}(U+\frac{\Lambda}{\alpha})

where $(F_{+}(A))_{ij}=max\{A_{ij},0\}$ , non-negativity constraint. Additionally, each column of $D$ is normalized to unit length $||d_{i}||_{2}^{2}\leq 1$ .

Clustering

Using the learned coding matrix $Z$ , an affinity graph is constructed. To capture sequential relationships, the following similarity measure is used (Li et al., 2015):

(13)

G(i,j)=\frac{z_{i}^{T}z_{j}}{||z_{i}||_{2}||z_{j}||_{2}}

Spectral clustering on graph $G$ yields the final clusters.

input : Trajectory

traj

, dictionary size

d

, sequential neighbors

s

, number of clusters

k

, parameters

\lambda_{1}

\lambda_{2}

\alpha

\beta

output : Discrete trajectory vector

d\_traj

2 Extract features

X

from traj;

3 while not converged do

4 Update

V_{k+1}

according to Eq. 10 while fixing others;

5 Update

U_{k+1}

according to Eq. 11 while fixing others;

6 Update

Z_{k+1}

according to Eq. 12 while fixing others;

7 Update

D_{k+1}

according to Eq. 12 while fixing others;

8 Update

\Pi_{k+1}:\Pi_{k+1}=\Pi_{k}+\nu\alpha(V_{k+1}-Z_{k-1})

;

9 Update

\Lambda_{k+1}:\Lambda_{k+1}=\lambda_{k}+\nu\beta(U_{k+1}-D_{k-1})

;

k=k+1

;

12 end while

13Generate affinity graph

G

using Eq. 13;

14 Perform spectral clustering on

G

, get

d\_traj

;

Algorithm 1 MOSCITO

6. Experiments

Implementation

We extract features of MD trajectories using libraries PyEMMA (Scherer et al., 2015) and MDTraj¹¹1https://www.mdtraj.org. Matrices are represented as NUMPY arrays and sparse matrices by scipy sparse matrices. We efficiently solve linear equations of the form $Ax=b$ using the PyPardiso 0.4.2²²2https://github.com/haasad/PyPardisoProject (version 2021.4) wrapper for the Intel oneMKL PARDISO library. For spectral clustering, we use the scikit-learn implementation. Our Python code for MOSCITO is publicly available ³³3https://github.com/Mhae98/MOSCITO. For PCA, TICA and $k$ -Means we use the implementations in PyEMMA. The implementation of SSC is based on (Syzonenko and Phillips, 2018) and a Python translation of SSC ⁴⁴4https://github.com/abhinav4192/sparse-subspace-clustering-python All experiments run on a virtual machine with 64 CPUs and 512GB of RAM running Ubuntu 22.04.

Datasets and parameter settings

We use MD trajectories of four proteins: 2F4K, Ace, 2WXC, and Savinase, which differ in the number of atoms $\mid A\mid$ , number of C $\alpha$ atoms $\mid C\alpha\mid$ , and number of residues $\mid R\mid$ , as summarized in Table 1, together with number of trajectories and timesteps in each trajectory. If not specified otherwise, parameter default values are $d=60$ , $s=3$ , $\lambda_{1}=0.01$ , $\lambda_{2}=15$ , $a=0.1$ , $b=0.1$ .

Protein	$\mid$ A $\mid$	$\mid$ C $\alpha\mid$	$\mid$ R $\mid$	$\mid$ T $\mid$	# Trajs	Source
2F4K	577	35	68	10.000	20	(Shaw et al., 2014)
Ace	264	21	23	10.000	28	(McGibbon, 2014)
2WXC	710	47	47	10.000	10	⁵⁵5Birgit Schiøtt’s Biomodelling group, Department of Chemistry, Aarhus University
Savinase	3.725	269	269	$\sim$ 1500	2	5

Table 1. Molecular dynamics data

6.1. MOSCITO settings and features

In this section, we analyze the parameter sensitivity and impact of feature extraction for MOSCITO. For the following experiments, 10 trajectories of the 2WAV-protein are used. The Markov state models are constructed for $k\in\{10,50,100,200\}$ clusters with a lag time $\tau\in\{1,10,100\}$ . The average VAMP2-score of the 10 trajectories is calculated for each choice of clusters $k$ and lag time $\tau$ . The VAMP2-score is calculated for five eigenvalues. It is important to keep in mind that the VAMP2-score for different MSM lag times is not directly comparable, only for different numbers of clusters.

6.1.1. Dictionary size

$d$ defines the dimensionality of the learned dictionary and thus the number of subspaces. As such, it is crucial for the clustering and runtime performance of MOSCITO, where larger dictionary size $d$ provides more subspaces to represent the data. We study $d\in\{10,20,...,100\}$ , and additionally $s\in\{2,3,4\}$ number of sequential neighbors. Figure 2 shows VAMP2-scores for different numbers of clusters (rows) and MSM lag times (columns): clustering performance improves for larger dictionary size $d$ until it converges. For smaller MSM lag times, the VAMP2-score converges faster than for larger lag times, independent of the number of clusters. With an increasing number of sequential neighbors $s$ , the overall shape of the curves is highly similar, with a slight preference for more neighbors that capture more of the temporal relationships.

6.1.2. Sequential neighbors

Setting the number of sequential neighbors $s$ in the temporal regularization to 0, the temporal aspect is not considered, with growing $s$ longer temporal relationships are captured. The scores for $s\in\{0,...,12\}$ are shown in Figure 3 for different MSM lags (columns) and number of clusters (rows). The number of sequential neighbors $s$ clearly affects clustering performance, with too few or too neighbors reducing performance, but with a broad stable range where temporal relationships are successfully captured.

Figure 4 shows the clustering of a 2F4K protein trajectory with a long folded state into 5 clusters represented by different colors. With no (top) or few (center) sequential neighbors, time intervals comprised by a cluster can be very short. Increasing the number of sequential neighbors (bottom) leads to continuous clusters with mostly uninterrupted blocks.

6.1.3. $\lambda$ parameters

We study settings of the tradeoff parameters $\lambda_{1}\in\{0.01,0.05,\dotsc,15\}$ for sparsity of the coding matrix $Z$ , $\lambda_{2}\in\{0.01,0.05,\dotsc,35\}$ for the temporal regularization function. Figure 5 shows that best performance is achieved for low values of $\lambda_{1}$ and higher values of $\lambda_{2}$ (light areas in the heatmap). This shows that temporal regularization is the more important factor for optimal clustering performance.

6.1.4. Learning rates

$\alpha,\beta\in\{0.01,0.05,\dotsc,15\}$ define how much Lagrangian multipliers $\Pi$ and $\Lambda$ change in each iteration of the algorithm. Heat maps in Figure 6 show that the results are stable across parameter settings, with VAMP2-scores in each heatmap relatively close, meaning our default values work well.

6.1.5. Weighting mode

All weighting modes for the Laplacian regularization function: binary, Gaussian, logarithmic, and exponential, are tested for different numbers of sequential neighbors. In Figure 7 we see that for few sequential neighbors, the performance for all weighting modes is similar. For many sequential neighbors, performance starts to decrease. In most cases, the performance when using the binary weighting mode decreases earlier than for the other methods, but all reliably reach the same maximal score.

6.1.6. Feature comparison

We analyze the above described features and their combinations via simple concatenation. Figure 8 shows the average VAMP2-score for 10 trajectories of the 2WAX-protein, using the five largest eigenvalues for scoring.

Best results for individual features are achieved with backbone torsions. Overall, combinations of backbone torsions, SASA, and shape histograms yield the best results. Figure 9 shows the importance that MOSCITO gives to each of the features in its dictionary when clustering the full-dimensional feature space. While, e.g., the feature ”flex” (flexible torsions) yields the worst VAMP2 score (left), it still has a high influence on the results (right), explaining the suboptimal VAMP2 score when using the full feature set.

6.2. Clustering performance

In this section, the clustering performance of MOSCITO is compared to that of state-of-the-art approaches that are currently used in the field. We compare to PCA with $k$ -Means and to TICA with $k$ -Means because of their widespread use (Wehmeyer et al., 2019). Additionally, we compare against the sparse spectral clustering algorithm (Syzonenko and Phillips, 2018), a subspace clustering approach for molecular dynamics data. For PCA and TICA, the default parameter settings of PyEmma (Wehmeyer et al., 2019) were used. SSC is a subspace clustering algorithm, thus it does not need any dimensionality reduction in the preprocessing.

A general challenge in this kind of evaluation is the fact that no ground truth is given. As such, clustering performance is compared by constructing a MSM and scoring it using the VAMP2-score, in order to obtain quantitative insights. Still, visual inspection plays an important role in assessing the clustering performance, in particular, the temporal connectivity of clusters.

Quantitative analysis

Figure 10 shows the average VAMP2-score for 20 trajectories of the 2F4K protein in the first row. For MSM lag time 1, the performance of all methods is comparable with the exception of 10 clusters. There, MOSCITO and TICA + $k$ -Means deliver the best performances. For longer MSM lag times, the scores separate with SSC performing better.

In the second row of Figure 10, we see the average VAMP2-score for 10 trajectories of the 2WXC protein. For the MSM lag times 1 and 10, the performance for all methods is relatively similar for large cluster sizes. For small cluster sizes, SSC performs poorly, while the other methods already reach close to their maximal score for only 10 clusters. For MSM lag time of 100, PCA + $k$ -Means drastically drops in performance, while the other methods maintain comparable scores to each other.

The third row of Figure 10 shows the average VAMP2-score for 28 trajectories of the Ace protein. Similar to the results of 2WXC, the performance for a MSM lag time of 1 and 10 is comparable between all methods for a large number of clusters. For few clusters, SSC again performs significantly worse. For MSM lag time of 100, the gap between the methods grows, with SSC performing best.

The last row of Figure 10 shows the average VAMP2-score for 2 trajectories of Savinase. For all MSM lag times, SSC performs best while TICA + $k$ -Means performs worst. Unlike for the other proteins, all methods require a large number of clusters to reach the maximal VAMP2-score. This might be due to the large size of Savinase and shorter trajectories compared to the other proteins.

Qualitative analysis

The purpose of modeling temporal aspects is to obtain clusters with larger continuous segments, ideally retrieving macrostates directly. Figure 11 visually compares the segmentation of clustering methods, either by directly clustering the trajectory into five clusters or, alternatively by coarse-graining 50 clusters into five clusters using PCCA+ (Röblitz and Weber, 2013) (see Section 7). We regard a 2F4K trajectory with a known long folded segment.

Directly clustering, i.e. without PCCA+, PCA + $k$ -Means and SSC struggle to find the large folded state as one large block. MOSCITO and TICA+ $k$ -Means find similar clusterings and succeed to uncover the large folded state. When applying PCCA+, PCA + $k$ -Means, TICA + $k$ -Means, and MOSCITO generate similar clusterings, comparable to the ones of MOSCITO and TICA + $k$ -Means without PCCA+. In this specific case, SSC fails to find the large segment, which is reflected in the respective VAMP2-score for this specific trajectory which is unusually low.

6.3. Clustering runtime

In this section, the runtime of MOSCITO is studied and compared to the runtime of PCA + $k$ -Means, TICA + $k$ -Means, and SSC. The runtimes for MOSCITO are measured for 20 iterations. The following runtimes are the average of 10 2WXC-trajectories clustered into ten clusters, using all atom coordinates as features. The main factors influencing the runtime of MOSCITO, besides the size of the input data, are the dictionary size $d$ and the number of sequential neighbors $s$ . The effect of different dictionary sizes $[10,...,100]$ on the runtime is shown left in Figure 12. The effect of different numbers of sequential neighbors $[0,...,25]$ is on the right.

The runtime of MOSCITO grows with increasing dictionary sizes. For optimal cluster performance, a large enough dictionary size has to be selected. Large values do not affect clustering performance negatively, but directly impact runtime. A dictionary size between 60 and 80 provides a good tradeoff. The runtime generally grows with increasing number of regarded sequential neighbors. The overall run-to-run variance for the different trajectories is close to zero, indicating a stable runtime for a fixed number of iterations.

For the comparison of the different clustering methods, backbone torsions were used. The comparison between the different clustering algorithms is shown in Figure 13. PCA + $k$ -Means and TICA + $k$ -Means are both much faster than the subspace clustering methods. Compared to SSC, MOSCITO is substantially faster.

6.4. Discussion

We conclude that MOSCITO provides at least comparable performance to state-of-the-art clustering methods for MD data. We evaluated the clustering performance by the clusters’ suitability as Markov states using the VAMP2-score. The visual analysis of MOSCITOs clustering results shows a better clustering for higher numbers of sequential neighbors. Thus, MOSCITO is a very promising approach for creating MSMs in a one-step-approach.

MOSCITO’s parameter robustness shows that there are good tradeoffs in dictionary size between runtime and clustering performance. The number of sequential neighbors matters, and our experiments indicate a good default value is between 3 and 5 to capture nearby temporal relations. In terms of feature extraction, MOSCITO on backbone torsions leads to best results, followed by C $\alpha$ coordinates and SASA. Combining features does not improve the overall clustering performance. While simpler non-subspace methods are even faster, MOSCITO provides a faster runtime than SSC, and state-of-the-art quantitative results, with qualitative analysis showing the promise of the approach.

7. Related Work

Distance-based clustering algorithms suffer from the curse of dimensionality when it comes to high-dimensional data like MD data. Thus, the dimensionality is often reduced in a preprocessing step, e.g., with principal component analysis (PCA) or time-lagged independent component analysis (TICA) (Molgedey and Schuster, 1994). Both are widely used for analyzing MD data (Palma and Pierdominici-Sottile, 2023; Schultze and Grubmüller, 2021). PCA transforms the data onto the axes with the highest variance, also known as principal components. PCA is a commonly used for MD data, e.g., in PyEMMA (Scherer et al., 2015), MSMBuilder ⁶⁶6http://www.msmbuilder.org, or MDAnalysis ⁷⁷7https://www.mdanalysis.org/. TICA (Time-lagged independent component analysis) (Molgedey and Schuster, 1994) linearly projects the data to independent, uncorrelated components (IC). The ICs are computed by maximizing the autocovariance for a fixed lagtime $\tau$ , and are frequently used for MD data ((Schwantes and Pande, 2013), (Pérez-Hernández et al., 2013)).

State-of-the-art methods often follow a two-step approach, where the trajectories are discretized into a large number of microstates, and subsequently combined to the relevant macrostates. Since the transitions between the metastable states are taking place over multiple timesteps it is hard to assign time steps that are part of a transition to a metastable state clearly. Perron-Cluster Cluster Analysis (PCCA) ((Weber et al., 2004),(Röblitz and Weber, 2013)) and a more robust version PCCA+ ((Weber and Galliat, 2002), (Deuflhard and Weber, 2005)) fuzzily assign the microstates to their corresponding macrostates based on spectral methods for transition matrices:

The transition matrix $P$ of a decomposable MSM with $n_{c}$ macrostates can be reordered to a block-diagonal matrix $\tilde{P}$ with $n_{c}$ blocks. For the matrix $X$ containing the $n_{c}$ dominant eigenvectors of $\tilde{P}$ , the rows belonging to states in the same block are identical (Frank et al., 2022). These rows represent the $n_{c}$ corners of a ( $n_{c}$ -1) dimensional simplex.

For MSMs with transition states, i.e., nearly decomposable MSMs, the structure of the simplex is only slightly perturbed. To assign the microstates to the macrostates, they are fuzzily assigned to the $n_{c}$ corners of the simplex based on their position in the simplex. The goal is to find a linear transformation $A\in\mathbb{R}^{n_{c}\times n_{c}}$ so that $\chi=XA$ where $\chi_{i}$ are the membership vectors. Those memberships all have to be positive and sum up to 1. The resulting macrostates generally cannot be considered to be a Markov chain (Röblitz and Weber, 2013).

Time series data can be found in many fields, like traffic analysis, geology, computer vision, or molecular dynamics data. Many existing subspace clustering algorithms do not take the sequential properties of time series data into account (Wu et al., 2015). One of the first subspace clustering methods taking advantage of sequential relationships in data was SpatSC introduced by (Guo et al., 2014). A penalty term used to preserve sequential relationships in the learned affinity graph was added to the subspace clustering problem. Based on the same idea, OSC, a more robust subspace clustering method for sequential data was introduced by (Tierney et al., 2014). TSC (Li et al., 2015) further improved on the idea by defining a Laplacian regularization to encode the temporal information. Our method MOSCITO builds upon TSC.

To the best of our knowledge, only one subspace clustering method has been used on molecular dynamics data, namely SSC (Syzonenko and Phillips, 2018), which was first introduced in the field of image processing (Elhamifar and Vidal, 2013). We include SSC in our experiments as comparative method, thus we introduce it in detail in the following. Note that SSC does not take advantage of any properties that distinguish time series from other high-dimensional data, leading to suboptimal use of the available information in MD data. The main idea behind SSC is to take advantage of the self-expressiveness of the data, i.e., assuming that every point $y_{i}\in Y$ can be rewritten as a linear combination of other points in the dataset. Each datapoint $y\in\cup_{i=1}^{n}S_{i}$ can be written as

(14)

y_{i}=Yc_{i},\;\;c_{ii}=0,

where $Y=[y_{1},...,y_{N}]$ and $c_{i}:=[c_{i1}\;c_{i2}\;...\;c_{iN}]^{T}$ with the constraint $c_{ii}=0$ eliminating the trivial solution of representing the point as itself. Since each subspace usually has more data points than dimensions the self-expressive dictionary $Y$ is generally not unique. Among all those possible solutions for Equation 14 there is a sparse solution so that the non-zero entries of $c_{i}$ correspond to datapoints in the same subspace as $y_{i}$ . This solution can be found by minimizing the $l_{1}$ -norm of $c_{i}$ , which can be solved efficiently and is known to prefer sparse solutions (Elhamifar and Vidal, 2013).

These sparse representations are transferred to an undirected graph where points that are part of another point’s sparse representation are connected. Based on this graph, spectral clustering detects the desired subspace clusters.

8. Conclusion

We introduced MOSCITO, a subspace clustering method for high-dimensional molecular dynamics data. It uses temporal regularization to capture the relationship of sequential neighbors in time-series data. While traditional and state-of-the-art methods rely on coarse-graining using PCCA+ in a two-step approach, MOSCITO achieves similar overall performance in a single clustering step. Especially for small numbers of clusters, the temporal regularization in MOSCITO yields clusters consisting of connected time steps, fitting the real world more closely than alternative methods without relying on any postprocessing steps. Compared to current subspace clustering methods on MD data, MOSCITO provides a better runtime. In future work, the integration of a multi-view approach into MOSCITO might take advantage of the additional information provided by using multiple features.

References

(1)
Ankerst et al. (1999) Mihael Ankerst, Gabi Kastenmüller, Hans-Peter Kriegel, and Thomas Seidl. 1999. 3D shape histograms for similarity search and classification in spatial databases. In Advances in Spatial Databases: 6th International Symposium, SSD’99 Hong Kong, China, July 20—23, 1999 Proceedings 6. Springer, 207–226.
Bonomi et al. (2009) Massimiliano Bonomi, Davide Branduardi, Giovanni Bussi, Carlo Camilloni, Davide Provasi, Paolo Raiteri, Davide Donadio, Fabrizio Marinelli, Fabio Pietrucci, Ricardo A. Broglia, and Michele Parrinello. 2009. PLUMED: A portable plugin for free-energy calculations with molecular dynamics. Computer Physics Communications 180, 10 (2009), 1961–1972. https://doi.org/10.1016/j.cpc.2009.05.011
Deuflhard and Weber (2005) Peter Deuflhard and Marcus Weber. 2005. Robust Perron cluster analysis in conformation dynamics. Linear algebra and its applications 398 (2005), 161–184.
Elhamifar and Vidal (2013) Ehsan Elhamifar and René Vidal. 2013. Sparse Subspace Clustering: Algorithm, Theory, and Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2765–2781. https://doi.org/10.1109/TPAMI.2013.57
Frank et al. (2022) Anna-Simone Frank, Alexander Sikorski, and Susanna Röblitz. 2022. Spectral clustering of Markov chain transition matrices with complex eigenvalues. J. Comput. Appl. Math. (2022). under review.
Guo et al. (2014) Yi Guo, Junbin Gao, and Feng Li. 2014. Spatial subspace clustering for drill hole spectral data. Journal of Applied Remote Sensing 8, 1 (2014), 083644–083644.
Hartl (2017) F Ulrich Hartl. 2017. Protein misfolding diseases. Annual review of biochemistry 86 (2017), 21–26.
Kriegel et al. (2012) Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. 2012. Subspace clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 4 (2012), 351–364.
Li et al. (2015) Sheng Li, Kang Li, and Yun Fu. 2015. Temporal subspace clustering for human motion segmentation. In Proceedings of the IEEE international conference on computer vision. 4453–4461.
Liu et al. (2012) Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. 2012. Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence 35, 1 (2012), 171–184.
McGibbon (2014) Robert T. McGibbon. 2014. Fs MD Trajectories. (5 2014). https://doi.org/10.6084/m9.figshare.1030363.v1
Molgedey and Schuster (1994) Lutz Molgedey and Heinz Georg Schuster. 1994. Separation of a mixture of independent signals using time delayed correlations. Physical review letters 72, 23 (1994), 3634.
Noé and Nuske (2013) Frank Noé and Feliks Nuske. 2013. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Modeling & Simulation 11, 2 (2013), 635–655.
Nuske et al. (2014) Feliks Nuske, Bettina G Keller, Guillermo Pérez-Hernández, Antonia SJS Mey, and Frank Noé. 2014. Variational approach to molecular kinetics. Journal of chemical theory and computation 10, 4 (2014), 1739–1752.
Palma and Pierdominici-Sottile (2023) Juliana Palma and Gustavo Pierdominici-Sottile. 2023. On the Uses of PCA to Characterise Molecular Dynamics Simulations of Biological Macromolecules: Basics and Tips for an Effective Use. ChemPhysChem 24, 2 (2023), e202200491. https://doi.org/10.1002/cphc.202200491 arXiv:https://chemistry-europe.onlinelibrary.wiley.com/doi/pdf/10.1002/cphc.202200491
Parsons et al. (2004) Lance Parsons, Ehtesham Haque, and Huan Liu. 2004. Subspace clustering for high dimensional data: a review. Acm sigkdd explorations newsletter 6, 1 (2004), 90–105.
Pérez-Hernández et al. (2013) Guillermo Pérez-Hernández, Fabian Paul, Toni Giorgino, Gianni De Fabritiis, and Frank Noé. 2013. Identification of slow molecular order parameters for Markov model construction. The Journal of chemical physics 139, 1 (2013), 07B604_1.
Qu et al. (2023) Wentao Qu, Xianchao Xiu, Huangyue Chen, and Lingchen Kong. 2023. A Survey on High-Dimensional Subspace Clustering. Mathematics 11, 2 (Jan 2023), 436. https://doi.org/10.3390/math11020436
Romanowska et al. (2012) Julia Romanowska, Krzysztof S Nowinski, and Joanna Trylska. 2012. Determining geometrically stable domains in molecular conformation sets. Journal of Chemical Theory and Computation 8, 8 (2012), 2588–2599.
Röblitz and Weber (2013) Susanna Röblitz and Marcus Weber. 2013. Fuzzy spectral clustering by PCCA+: Application to Markov state models and data classification. Advances in Data Analysis and Classification 7 (2013), 147–179. Issue 2. https://doi.org/10.1007/s11634-013-0134-6
Scherer et al. (2019) Martin K. Scherer, Brooke E. Husic, Moritz Hoffmann, Fabian Paul, Hao Wu, and Frank Noé. 2019. Variational selection of features for molecular kinetics. The Journal of Chemical Physics 150, 19 (may 2019), 194108. https://doi.org/10.1063/1.5083040
Scherer et al. (2015) Martin K. Scherer, Benjamin Trendelkamp-Schroer, Fabian Paul, Guillermo Pérez-Hernández, Moritz Hoffmann, Nuria Plattner, Christoph Wehmeyer, Jan-Hendrik Prinz, and Frank Noé. 2015. PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models. Journal of Chemical Theory and Computation 11 (Oct. 2015), 5525–5542. https://doi.org/10.1021/acs.jctc.5b00743
Schultze and Grubmüller (2021) Steffen Schultze and Helmut Grubmüller. 2021. Time-Lagged Independent Component Analysis of Random Walks and Protein Dynamics. Journal of Chemical Theory and Computation 17, 9 (2021), 5766–5776. https://doi.org/10.1021/acs.jctc.1c00273 arXiv:https://doi.org/10.1021/acs.jctc.1c00273 PMID: 34449229.
Schwantes and Pande (2013) Christian R Schwantes and Vijay S Pande. 2013. Improvements in Markov state model construction reveal many non-native interactions in the folding of NTL9. Journal of chemical theory and computation 9, 4 (2013), 2000–2009.
Shaw et al. (2014) David E. Shaw, J.P. Grossman, Joseph A. Bank, Brannon Batson, J. Adam Butts, Jack C. Chao, Martin M. Deneroff, Ron O. Dror, Amos Even, Christopher H. Fenton, et al. 2014. Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer. In SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 41–53. https://doi.org/10.1109/SC.2014.9
Shrake and Rupley (1973) A. Shrake and J.A. Rupley. 1973. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. Journal of Molecular Biology 79, 2 (1973), 351–371. https://doi.org/10.1016/0022-2836(73)90011-9
Suárez et al. (2021) Ernesto Suárez, Rafal P. Wiewiora, Chris Wehmeyer, Frank Noé, John D. Chodera, and Daniel M. Zuckerman. 2021. What Markov State Models Can and Cannot Do: Correlation versus Path-Based Observables in Protein-Folding Models. Journal of Chemical Theory and Computation 17 (5 2021), 3119–3133. Issue 5. https://doi.org/10.1021/acs.jctc.0c01154
Syzonenko and Phillips (2018) Ivan Syzonenko and Joshua L. Phillips. 2018. Hybrid Spectral/Subspace Clustering of Molecular Dynamics Simulations. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (Washington, DC, USA) (BCB ’18). Association for Computing Machinery, New York, NY, USA, 325–330. https://doi.org/10.1145/3233547.3233595
Tierney et al. (2014) Stephen Tierney, Junbin Gao, and Yi Guo. 2014. Subspace clustering for sequential data. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1019–1026.
Tošić and Frossard (2011) Ivana Tošić and Pascal Frossard. 2011. Dictionary learning. IEEE Signal Processing Magazine 28, 2 (2011), 27–38.
Vaidehi and Jain (2015) Nagarajan Vaidehi and Abhinandan Jain. 2015. Internal coordinate molecular dynamics: A foundation for multiscale dynamics. The Journal of Physical Chemistry B 119, 4 (2015), 1233–1242.
Vidal (2011) René Vidal. 2011. Subspace clustering. IEEE Signal Processing Magazine 28, 2 (2011), 52–68.
Weber and Galliat (2002) Marcus Weber and Tobias Galliat. 2002. Characterization of transition states in conformational dynamics using Fuzzy sets. (2002).
Weber et al. (2004) Marcus Weber, Wasinee Rungsarityotin, and Alexander Schliep. 2004. Perron cluster analysis and its connection to graph partitioning for noisy data. Citeseer.
Wehmeyer et al. (2019) Christoph Wehmeyer, Martin K. Scherer, Tim Hempel, Brooke E. Husic, Simon Olsson, and Frank Noé. 2019. Introduction to Markov state modeling with the PyEMMA software [Article v1.0]. Living Journal of Computational Molecular Science 1 (2019). Issue 1. https://doi.org/10.33011/livecoms.1.1.5965
Wu et al. (2015) Fei Wu, Yongli Hu, Junbin Gao, Yanfeng Sun, and Baocai Yin. 2015. Ordered subspace clustering with block-diagonal priors. IEEE transactions on cybernetics 46, 12 (2015), 3209–3219.
Wu and Noé (2017) Hao Wu and Frank Noé. 2017. Variational approach for learning Markov processes from time series data. https://doi.org/10.48550/ARXIV.1707.04659