
RNA-FrameFlow: Flow Matching for de novo
3D RNA Backbone Design

Rishabh Anand∗,1  Chaitanya K. Joshi∗,2  Alex Morehead4  Arian R. Jamasb2,3  Charles Harris2
Simon V. Mathis2  Kieran Didi2  Bryan Hooi1  Pietro Liò2
1National University of Singapore, Singapore  2University of Cambridge, UK
3Prescient Design, Genentech, Roche  4University of Missouri, USA  ∗Equal contribution

Open-source code: github.com/rish-16/rna-backbone-design
Abstract

We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design. We build upon $SE(3)$ flow matching for protein backbone generation and establish protocols for data preparation and evaluation to address unique challenges posed by RNA modeling. We formulate RNA structures as a set of rigid-body frames and associated loss functions which account for larger, more conformationally flexible RNA backbones (13 atoms per nucleotide) vs. proteins (4 atoms per residue). Toward tackling the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations. Additionally, we define a suite of evaluation metrics to measure whether the generated RNA structures are globally self-consistent (via inverse folding followed by forward folding) and locally recover RNA-specific structural descriptors. The most performant version of RNA-FrameFlow generates locally realistic RNA backbones of 40-150 nucleotides, over 40% of which pass our validity criteria as measured by a self-consistency TM-score $\geq 0.45$, at which two RNAs have the same global fold.

1 Introduction

Designing RNA structures. Proteins, and the diverse structures they can adopt, drive essential biological functions in cells. Deep learning has led to breakthroughs in structural modeling and design of proteins (Jumper et al., 2021, Dauparas et al., 2022, Watson et al., 2023), driven by the abundance of 3D data from the Protein Data Bank (PDB). Concurrently, there has been a surge of interest in Ribonucleic Acids (RNA) and RNA-based therapeutics for gene editing, gene silencing, and vaccines (Doudna and Charpentier, 2014, Metkar et al., 2024). RNAs play a dual role as carriers of genetic information coding for proteins (mRNAs) as well as performing functions driven by their tertiary structural interactions (riboswitches and ribozymes). While there is growing interest in designing structured RNAs for a range of applications in biotechnology and medicine (Mulhbacher et al., 2010, Damase et al., 2021), the current toolkit for 3D RNA design uses classical algorithms and heuristics to assemble RNA motifs as building blocks (Han et al., 2017, Yesselman et al., 2019). However, hand-crafted heuristics are not always broadly applicable across multiple tasks and rigid motifs may not fully capture the conformational dynamics that govern RNA functionality (Ganser et al., 2019, Li et al., 2023a). This presents an opportunity for deep generative models to learn data-driven design principles from existing 3D RNA structures.

Figure 1: The RNA-FrameFlow pipeline for 3D backbone generation. Our implementation establishes RNA-specific protocols for data preparation and evaluation for FrameFlow (Yim et al., 2023a). (1) Each nucleotide in the RNA backbone is converted into a frame to parameterize the placement of $C4'$ by a translation, $C3'-C4'-O4'$ by a rotation, and the rest of the atoms via 8 torsion angles $\Phi$. (2) We train generative models on all RNA structures of length 40-150 nucleotides from RNAsolo (Adamczyk et al., 2022). We also explore training with structural clustering and cropping augmentations to tackle the lack of diversity in 3D RNA datasets. (3) We introduce evaluation metrics to measure the recovery of local structural descriptors and global self-consistency of designed structures via inverse-folding with gRNAde (Joshi et al., 2023) followed by forward-folding with RhoFold (Shen et al., 2022).

What makes deep learning for RNA hard? The primary challenge is the paucity of raw 3D RNA structural data, manifesting as an absence of ML-ready datasets for model development (Joshi et al., 2023). Protein structure is primarily driven by hydrogen bonding along the backbone, and current geometric deep learning models incorporate this inductive bias through backbone frames to represent residues (Jumper et al., 2021). RNA structure, however, is often more conformationally flexible and driven by base pairing interactions across strands as well as base stacking between rings of adjacent nucleotides (Vicens and Kieft, 2022), all of which can only be learnt implicitly at present (see Eric Westhof’s talk contrasting RNA and protein structure).

Additionally, RNA nucleotides, the equivalent of amino acids in proteins, include significantly more atoms as part of the backbone (13 compared to 4) which necessitates a generalization of backbone frames where the placement of most atoms needs to be parameterized by torsion angles. These complexities have contributed to relatively poor performance of deep learning for RNA structure prediction compared to proteins (Kretsch et al., 2023, Abramson et al., 2024). Additionally, structure prediction models cannot directly be used for designing or generating novel RNA structures with desired constraints, which our work aims to do.

Our contributions. We develop RNA-FrameFlow, the first generative model for 3D RNA backbone design, illustrated in Figure 1. We adapt FrameFlow (Yim et al., 2023a), an $SE(3)$ equivariant flow matching model for proteins, to RNA. We introduce RNA-specific modifications to the data preparation and loss formulation, including representing RNA nucleotides as rigid-body frames that parameterize all 13 atoms. We also introduce an evaluation pipeline to benchmark RNA backbone design models’ capabilities at recovering local and global structure. Our best model is trained on RNAs of lengths 40-150 from the PDB and can unconditionally sample locally plausible backbones with over 40% validity as measured by a self-consistency TM-score $\geq 0.45$.

Through this study, we aimed to evaluate the extent to which generative models for proteins can be adapted for RNA. This brought up critical challenges and limitations of deep learning for RNA modelling, such as a lack of explicit representations of the physical interactions that drive RNA structure as well as biases in 3D RNA datasets, which we have made preliminary efforts towards addressing. Together with our engineering contributions, we hope this work will stimulate future research in generative models for RNA design.

2 The RNA-FrameFlow Pipeline

Overview. We are concerned with building a generative model that unconditionally outputs all-atom RNA backbones, sampled from a distribution of realistic 3D RNA structures. Formally, given an RNA sequence length of $N_{\text{nt}}$ nucleotides, we aim to generate a real-valued tensor $\mathbf{X}$ of shape $N_{\text{nt}} \times 13 \times 3$ representing 3D atomic coordinates for each of the 13 backbone atoms per nucleotide. In the following sections, we will describe how we adapt FrameFlow (Yim et al., 2023a), an $SE(3)$ equivariant flow matching model for protein backbones, to our setting.

2.1 Representing RNA Backbones as Frames

As shown in Figure 1, the RNA backbone consists of nucleotides with a phosphate group ($P, OP1, OP2, O5'$), a ribose sugar ($C1'-C5', O2', O3', O4'$), and a nitrogen atom $N$ at the stem of the base. We represent the group of atoms within each nucleotide as a rigid-body frame. Frames enable inferring the positions of all atoms within the nucleotide via a frame center and orientation (described subsequently). However, an RNA backbone nucleotide has 13 atoms, significantly more than the 4 atoms of a protein residue ($C_\alpha, N, C, O$). In proteins, it is standard to represent each residue by a frame centered at $C_\alpha$ with vectors along $C_\alpha - N$ and $C_\alpha - C$, and $O$ is placed assuming an idealised planar geometry (Jumper et al., 2021). No such canonical frame representation exists for RNAs.

RNA frames. We use the $C4'$, $C3'$, and $O4'$ atoms to create the frame for each nucleotide, as in Morehead et al. (2023). All other backbone atoms are inferred with 8 torsions $\Phi = \{\phi_1 \rightarrow \phi_8\}, \phi_i \in SO(2)$ that are predicted post-hoc after frame generation. The Gram-Schmidt process is used on $v_1, v_2$ defined by the vectors along the $C4'-O4'$ and $C4'-C3'$ bonds; $C5'$ is imputed based on the positions of the other 3 atoms and tetrahedral geometry. Given the 8 torsion angles, we autoregressively place non-frame atoms in order of the torsions $\Phi$ in Figure 1, constructing the final all-atom RNA backbone structure.
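
To make the frame construction concrete, the following is a minimal sketch of Gram-Schmidt orthonormalisation applied to the two bond vectors, written in dependency-free Python. The function name `frame_from_atoms`, the direction of the bond vectors (pointing away from C4'), and the choice to store the basis vectors as matrix columns are assumptions made for illustration and may differ from the released code.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(a):
    norm = math.sqrt(dot(a, a))
    return [c / norm for c in a]

def frame_from_atoms(c4, o4, c3):
    """Build (rotation, translation) for one nucleotide from three atoms.

    rotation: 3x3 nested list whose columns are the orthonormal basis
              vectors e1, e2, e3 (assumed convention).
    translation: the C4' position, used as the frame origin.
    """
    v1 = sub(o4, c4)  # along the C4'-O4' bond (direction is an assumption)
    v2 = sub(c3, c4)  # along the C4'-C3' bond (direction is an assumption)
    e1 = normalize(v1)
    # Remove the component of v2 that lies along e1, then normalise.
    proj = dot(e1, v2)
    e2 = normalize([v2[i] - proj * e1[i] for i in range(3)])
    # The cross product completes a right-handed orthonormal basis.
    e3 = cross(e1, e2)
    rotation = [[e1[i], e2[i], e3[i]] for i in range(3)]
    return rotation, list(c4)
```

Because `e3` is the cross product of `e1` and `e2`, the resulting matrix has determinant +1 and is therefore a valid element of $SO(3)$ whenever the two bond vectors are not collinear.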

Criteria on choosing frame atoms. We had two main considerations for selecting the subset of atoms to create RNA frames: (1) the atoms should have roughly the same spatial orientation w.r.t. each other; and (2) the atoms should be reasonably close to the centroid in the nucleotide to reduce error accumulation when placing the furthest non-frame atoms. We choose $C3'$, $C4'$, and $O4'$ as these atoms spatially shift the least in naturally occurring RNA (Harvey and Prabhakaran, 1986). The non-frame backbone atoms (the remaining atoms in the ribose sugar ring and the phosphate group atoms) are parameterized by torsion angles to account for their relative conformational flexibility. This choice of frame enables models to learn ring puckering, the planar rotation of the ribose sugar ring about the $C4'-C5'$ bond which affects how the RNA interacts with partners to form complexes (Clay et al., 2017). We are actively evaluating alternate choices of RNA frame atoms.

2.2 $SE(3)$ Flow Matching on RNA Frames

Input. Given a set of 3D coordinates, a simultaneous rotation and translation $(r, x)$ forms an orientation-preserving rigid-body transformation of the coordinates. The set of all such transformations in 3D is the Special Euclidean group $SE(3)$, which combines the group of 3D rotations $SO(3)$ and 3D translations in $\mathbb{R}^3$.

We can represent an RNA frame $T = (r, x)$ as a translation $x \in \mathbb{R}^3$ from the global origin to place $C4'$ and a rotation $r \in SO(3)$ to orient $C3'-C4'-O4'$. Compared to working with raw 3D coordinates for each backbone atom, using the frame representation entails performing flow matching on the space of $SE(3)$. This is an inductive bias to reduce the degrees of freedom the generative model needs to learn. Instead of predicting 13 correlated 3D coordinates independently (39 quantities) for each nucleotide, we predict one 3D coordinate (of $C4'$) and one $3 \times 3$ rotation matrix (12 quantities). We follow the framework of Chen and Lipman (2024) and Yim et al. (2023a) for flow matching on $SE(3)$, which we summarise subsequently.

Overview. Flow matching generates or learns how to place and orient a set of $N$ frames $\mathbf{T} = \{T^{(n)}\}_{n=1}^{N}$, where $T^{(n)} = (r^{(n)}, x^{(n)})$, to form an RNA backbone of length $N$. To do so, we initialize frames at random in 3D space at time $t=0$, and train a denoiser or flow model to iteratively refine the location and orientation of each frame for a specified number of steps until time $t=1$.
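
As an illustration of this iterative refinement, the sketch below integrates frame translations from $t=0$ to $t=1$ with fixed-size Euler steps. It is a toy under stated assumptions: `oracle_denoiser` is a hypothetical stand-in for the trained flow model that simply returns the clean positions, the velocity $(\hat{x}_1 - x_t)/(1-t)$ is a common choice in flow matching samplers rather than something specified in this section, and rotations are omitted for brevity.

```python
import random

def oracle_denoiser(x_t, t, x_target):
    """Hypothetical stand-in for the trained flow model.

    A real model would predict clean frame positions from (x_t, t);
    here we return the known target so the loop can be checked exactly.
    """
    return x_target

def sample_translations(x_target, num_steps=50, seed=0):
    """Euler integration of frame centres from noise (t=0) to data (t=1)."""
    rng = random.Random(seed)
    n_frames = len(x_target)
    # t = 0: frame centres start at random positions in 3D space.
    x_t = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n_frames)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x1_hat = oracle_denoiser(x_t, t, x_target)
        # Assumed velocity: (x1_hat - x_t) / (1 - t), so one Euler step
        # moves x_t a fraction dt / (1 - t) of the way to the prediction.
        scale = dt / (1.0 - t)
        x_t = [[x_t[i][k] + scale * (x1_hat[i][k] - x_t[i][k])
                for k in range(3)]
               for i in range(n_frames)]
    return x_t
```

With the oracle in place of a learned model, the final step lands on the target positions up to floating-point error, which makes the loop easy to sanity-check before substituting a real network.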

Suppose $p_0(T_0)$ and $p_1(T_1)$ are the marginal distributions of randomly oriented and ground truth frames from our dataset of RNA structures, respectively. Suppose a non-unique time-dependent vector field $u_t$ leads to an ODE between the two distributions $p_0$ and $p_1$, i.e., assume there is a way to map from noisy samples to the corresponding true samples. This solution forms a ground truth probability path $p_t$ between the two distributions at time $t \in [0, 1]$, which we can use to transform samples from noise to the true distribution. The continuity equation $\frac{\partial p}{\partial t} = -\nabla \cdot (p_t u_t)$ relates the vector field $u_t$ to the evolution of the probability path $p_t$.

Given a noisy frame $T_0$ sampled from $p_0(T_0)$ and the corresponding ground truth frame $T_1$ sampled from $p_1(T_1)$, we construct a flow $T_t$ by following the probability path $p_t$ between $T_0$ and $T_1$ for any time step $t$ sampled from $\mathcal{U}(0, 1)$. As shown by Chen and Lipman (2024) for the $SE(3)$ group (and other manifolds), the shortest path between the two states $T_0$ and $T_1$ can be used to define an interpolation:

$$T_t = \exp_{T_0}\big(t \cdot \log_{T_0}(T_1)\big). \tag{1}$$

Here, $\exp(\cdot)$ and $\log(\cdot)$ are the exponential and logarithmic maps that enable moving (taking random walks) on curved manifolds such as the $SE(3)$ group. As we can decompose a frame $T = (r, x)$ into separate rotation and translation terms, we can obtain closed-form interpolations for the group of rotations $SO(3)$ and translations $\mathbb{R}^3$. This gives us two independent flows:

$$\text{Translations:} \quad x_t = t x_1 + (1 - t) x_0, \tag{2}$$
$$\text{Rotations:} \quad r_t = \exp_{r_0}\big(t \cdot \log_{r_0}(r_1)\big). \tag{3}$$

The random translation $x_0$ is sampled from a zero-centered Gaussian distribution $\mathcal{N}(0, \mathbf{I})$ in $\mathbb{R}^3$, and the random rotation $r_0$ is sampled from $\mathcal{U}(SO(3))$, a generalization of the uniform distribution for the group of rotations, $SO(3)$. For an RNA backbone consisting of a set of $N$ frames $\mathbf{T} = \{T^{(n)}\}_{n=1}^{N}$, we can define the interpolation for each frame in parallel via the aforementioned procedure.
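
The noise sampling and the two interpolations in Equations 2 and 3 can be sketched for a single frame as follows. This is an illustrative, dependency-free Python version and not the authors' implementation: the helper names are hypothetical, rotations are handled as 3×3 nested lists, the uniform rotation is drawn via a normalised 4D Gaussian quaternion (one standard way to sample uniformly from $SO(3)$), and the logarithmic map assumes the relative rotation angle is strictly below $\pi$.

```python
import math
import random

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def sample_prior_frame(rng):
    """Draw (r_0, x_0): a uniform rotation and a standard-normal translation."""
    x0 = [rng.gauss(0.0, 1.0) for _ in range(3)]
    # A normalised 4D Gaussian is a uniformly distributed unit quaternion.
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    r0 = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]
    return r0, x0

def log_map(r):
    """Rotation matrix -> rotation vector (axis times angle), angle < pi."""
    trace = r[0][0] + r[1][1] + r[2][2]
    theta = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    if theta < 1e-8:
        return [0.0, 0.0, 0.0]
    k = theta / (2.0 * math.sin(theta))
    return [k * (r[2][1] - r[1][2]),
            k * (r[0][2] - r[2][0]),
            k * (r[1][0] - r[0][1])]

def exp_map(v):
    """Rotation vector -> rotation matrix via Rodrigues' formula."""
    theta = math.sqrt(sum(c * c for c in v))
    if theta < 1e-8:
        return [row[:] for row in IDENTITY]
    x, y, z = (c / theta for c in v)
    K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    K2 = matmul(K, K)
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[IDENTITY[i][j] + s * K[i][j] + c * K2[i][j] for j in range(3)]
            for i in range(3)]

def interpolate_frame(r0, x0, r1, x1, t):
    """Return (r_t, x_t) following Equations 2 and 3."""
    # Equation 2: straight line between the two translations.
    x_t = [t * x1[k] + (1.0 - t) * x0[k] for k in range(3)]
    # Equation 3: geodesic on SO(3), r_t = r0 * exp(t * log(r0^T r1)).
    relative = matmul(transpose(r0), r1)
    r_t = matmul(r0, exp_map([t * c for c in log_map(relative)]))
    return r_t, x_t
```

At $t=0$ the function returns the noisy frame and at $t=1$ the ground truth frame; intermediate values of $t$ rotate about a fixed axis at constant angular speed, which is the shortest path between the two orientations.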

Training. During training, we would like to learn a parameterized vector field $v_\theta(\mathbf{T}_t, t)$, a deep neural network with parameters $\theta$, which takes as input the intermediate frames $\mathbf{T}_t$ at time $t$ sampled from $\mathcal{U}(0, 1)$, and predicts the final frames $\hat{\mathbf{T}} = \{\hat{T}^{(n)}\}_{n=1}^{N}$, where $\hat{T}^{(n)} = (\hat{r}_t^{(n)}, \hat{x}_t^{(n)})$. The ground truth vector field $u_t$ for mapping from the intermediate frames $\mathbf{T}_t$ to the ground truth frames $\mathbf{T}_1$ can also be decomposed into a ground truth rotation and translation for each frame $T^{(n)}$:

$$\text{Translations:} \quad u_t(x^{(n)} \mid x_0^{(n)}, x_1^{(n)}) = x_1^{(n)}, \tag{4}$$
$$\text{Rotations:} \quad u_t(r^{(n)} \mid r_0^{(n)}, r_1^{(n)}) = \log_{r_t^{(n)}}(r_1^{(n)}). \tag{5}$$

To train the model $v_\theta$, we compute separate losses for the predicted rotation $\hat{r}_t \in SO(3)$ and translation $\hat{x}_t \in \mathbb{R}^3$. The combined $SE(3)$ flow matching loss over $N$ frames is as follows:

$$\mathcal{L}_{SE(3)} = \mathbb{E}_{t,\, p_0(\mathbf{T}_0),\, p_1(\mathbf{T}_1)} \Bigg[ \frac{1}{(1-t)^2} \sum_{n=1}^{N} \underbrace{\Big\| \hat{x}_t^{(n)} - x_1^{(n)} \Big\|^2_{\mathbb{R}^3}}_{\mathcal{L}_{\mathbb{R}^3}^{(n)}} + \underbrace{\Big\| \log_{r_t^{(n)}}(\hat{r}_1^{(n)}) - \log_{r_t^{(n)}}(r_1^{(n)}) \Big\|^2_{SO(3)}}_{\mathcal{L}_{SO(3)}^{(n)}} \Bigg]. \tag{6}$$
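
For a single sampled time step, the quantity inside the expectation of Equation 6 can be sketched as below. This is a hedged, dependency-free illustration rather than the training code: the function name is hypothetical, the $1/(1-t)^2$ weight is read as applying to both the translation and rotation terms, $\log_{r_t}(r)$ is represented by the rotation vector of $r_t^\top r$, the $SO(3)$ norm is taken to be the Euclidean norm of that vector, and relative rotation angles are assumed to stay below $\pi$.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def log_map(r):
    """Rotation matrix -> rotation vector (axis times angle), angle < pi."""
    trace = r[0][0] + r[1][1] + r[2][2]
    theta = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    if theta < 1e-8:
        return [0.0, 0.0, 0.0]
    k = theta / (2.0 * math.sin(theta))
    return [k * (r[2][1] - r[1][2]),
            k * (r[0][2] - r[2][0]),
            k * (r[1][0] - r[0][1])]

def se3_flow_matching_loss(x_hat, r_hat, x1, r1, r_t, t):
    """Single-sample estimate of the bracketed term in Equation 6.

    x_hat, x1: per-frame predicted / ground truth translations (N x 3).
    r_hat, r1, r_t: per-frame predicted / ground truth / intermediate
                    rotations (N x 3 x 3).
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1) so the weight stays finite")
    weight = 1.0 / (1.0 - t) ** 2
    total = 0.0
    for n in range(len(x1)):
        # Translation term: squared Euclidean error in R^3.
        trans = sum((x_hat[n][k] - x1[n][k]) ** 2 for k in range(3))
        # Rotation term: compare tangent vectors at r_t pointing to the
        # predicted and ground truth rotations.
        rt_inv = transpose(r_t[n])
        v_hat = log_map(matmul(rt_inv, r_hat[n]))
        v_true = log_map(matmul(rt_inv, r1[n]))
        rot = sum((v_hat[k] - v_true[k]) ** 2 for k in range(3))
        total += trans + rot
    return weight * total
```

The weight grows rapidly as $t$ approaches 1, so errors made close to the data distribution are penalised far more heavily than errors made near the noise distribution.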

The architecture of the flow model $v_\theta$ is similar to the structure module from AlphaFold2, comprising Invariant Point Attention layers interleaved with standard Transformer encoder layers, following Yim et al. (2023a, b). We use an MLP head to predict torsion angles $\Phi$.

Auxiliary losses. Adding auxiliary loss terms to the objective in Equation 6 can be seen as a way of injecting domain knowledge into the training process (Yim et al., 2023b). We include 3 additional losses that operate on the all-atom structure inferred from the predicted frames, weighted by tunable coefficients to modulate their contribution to the total loss:

$$\mathcal{L}_{\text{tot}} = \mathcal{L}_{SE(3)} + \mathcal{L}_{\text{bb}} + \mathcal{L}_{\text{dist}} + \mathcal{L}_{\text{tors}}. \tag{7}$$

Suppose $S = \{C4', C3', O4'\}$ is the set of frame atoms (in Section C.1, we show how including all backbone atoms better accounts for larger RNA nucleotides and improves validity of generated samples) and the sequence length is $N$. We summarise the auxiliary losses subsequently.

  • Coordinate MSE bbsubscriptbb\mathcal{L}_{\text{bb}}caligraphic_L start_POSTSUBSCRIPT bb end_POSTSUBSCRIPT: A direct all-atom MSE is computed between generated and ground truth coordinates. Here, a,a^𝑎^𝑎a,\hat{a}italic_a , over^ start_ARG italic_a end_ARG are the ground truth and predicted atomic coordinates for the frame atoms:

    bb=1|S|Nn=1NaSa(n)a^(n)2.subscriptbb1𝑆𝑁subscriptsuperscript𝑁𝑛1subscript𝑎𝑆superscriptnormsuperscript𝑎𝑛superscript^𝑎𝑛2\displaystyle\mathcal{L}_{\text{bb}}=\frac{1}{|S|N}\sum^{N}_{n=1}\ \sum_{a\in S% }\|a^{(n)}-\hat{a}^{(n)}\|^{2}.caligraphic_L start_POSTSUBSCRIPT bb end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | italic_S | italic_N end_ARG ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_a ∈ italic_S end_POSTSUBSCRIPT ∥ italic_a start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT - over^ start_ARG italic_a end_ARG start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (8)
  • Distogram loss $\mathcal{L}_{\text{dist}}$: A distogram $D\in\mathbb{R}^{NS\times NS}$ containing all-to-all coordinate differences between the atoms in an RNA structure is computed. Let $D^{(nm)}_{ab}=\|a^{(n)}-b^{(m)}\|$ be the elements of the distogram for the ground truth structure. Here, atom $a$ belongs to nucleotide $n$ and atom $b$ to nucleotide $m$. Given the corresponding predicted distogram $\hat{D}^{(nm)}_{ab}$, we compute another difference between the tensors:

    $$\mathcal{L}_{\text{dist}}=\frac{1}{(|S|N)^{2}-N}\sum_{\substack{n,m=1\\ n\neq m}}^{N}\ \sum_{a,b\in S}\|D^{(nm)}_{ab}-\hat{D}^{(nm)}_{ab}\|^{2}. \tag{9}$$
  • Torsional loss $\mathcal{L}_{\text{tors}}$: An angular loss between the 8 torsions predicted by the auxiliary MLP head and the angles from the ground truth all-atom structure. Given ground truth and predicted torsion angles $\phi\in\Phi_{n}$ and $\hat{\phi}\in\hat{\Phi}_{n}$ for residue $n$, we compute:

    $$\mathcal{L}_{\text{tors}}=\frac{1}{8N}\sum_{n=1}^{N}\ \sum_{\phi\in\Phi_{n}}\Big(\|\phi-\hat{\phi}\|^{2}\Big). \tag{10}$$
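The three auxiliary losses in Eqs. (8)-(10) can be sketched in NumPy as follows. This is a minimal illustration, not the released training code: the function names, the `(N, S, 3)` array layout, and the option of passing a `k`-dimensional angle encoding to `torsion_loss` are assumptions made for the example.

```python
import numpy as np


def backbone_mse(coords_true, coords_pred):
    """Eq. (8): squared coordinate error averaged over N nucleotides and |S| frame atoms.

    Both inputs are arrays of shape (N, S, 3).
    """
    sq_err = np.sum((coords_true - coords_pred) ** 2, axis=-1)  # (N, S)
    return float(sq_err.mean())  # mean over N * |S| entries


def distogram(coords):
    """Pairwise distances D[n, a, m, b] = ||a^(n) - b^(m)|| for coords of shape (N, S, 3)."""
    diff = coords[:, :, None, None, :] - coords[None, None, :, :, :]
    return np.linalg.norm(diff, axis=-1)  # (N, S, N, S)


def distogram_loss(coords_true, coords_pred):
    """Eq. (9): squared distogram error over atom pairs from distinct nucleotides (n != m)."""
    n, s, _ = coords_true.shape
    sq_err = (distogram(coords_true) - distogram(coords_pred)) ** 2
    distinct = ~np.eye(n, dtype=bool)  # (N, N) mask, False where n == m
    total = np.sum(sq_err * distinct[:, None, :, None])
    return float(total / ((s * n) ** 2 - n))  # normaliser as written in Eq. (9)


def torsion_loss(phi_true, phi_pred):
    """Eq. (10): squared torsion error averaged over N nucleotides and 8 torsions.

    Inputs have shape (N, 8) for raw angles, or (N, 8, k) for a k-dimensional
    encoding of each angle (an assumption; the encoding is not stated above).
    Raw angle differences ignore periodicity.
    """
    sq_err = (phi_true - phi_pred) ** 2
    sq_err = sq_err.reshape(sq_err.shape[0], sq_err.shape[1], -1).sum(axis=-1)  # (N, 8)
    return float(sq_err.mean())
```

Because the distogram depends only on inter-atomic distances, `distogram_loss` is unchanged when the predicted structure is rigidly translated or rotated, whereas `backbone_mse` is not.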

Sampling. To generate or unconditionally sample an RNA backbone of length $N$, we initialize a random point cloud of frames. We use our trained flow model $v_{\theta}$ within an ODE solver to iteratively transform the noisy frames into a realistic RNA backbone. For each nucleotide, we begin with a noisy frame $T_{0}=(r_{0},x_{0})$ at time step $t=0$, and integrate to $t=1$ using the Euler method for a specified number of steps $N_{T}$, with step size $\Delta t=1/N_{T}$. At each step $t$, the flow model $v_{\theta}$ predicts updates for the frames via a rotation $\hat{r}_{1}$ and translation $\hat{x}_{1}$:

$$\text{Translations:}\quad x_{t+\Delta t}\ =\ x_{t}+\Delta t\cdot(\hat{x}_{1}-x_{t})\ , \tag{11}$$
$$\text{Rotations:}\quad r_{t+\Delta t}\ =\ \exp_{r_{t}}\big(c\,\Delta t\cdot\log_{r_{t}}(\hat{r}_{1})\big)\ , \tag{12}$$

where $c=10$ is a tunable hyperparameter governing the exponential sampling schedule for rotations.
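The Euler updates in Eqs. (11)-(12) can be sketched with rotation matrices in NumPy. This is an illustration under stated assumptions rather than the authors' sampler: `predict` is a hypothetical stand-in for the trained flow model, the random initial rotations are a simple placeholder prior, and the Riemannian exponential and logarithm at $r_t$ are written as $r_t \cdot \mathrm{Exp}(\cdot)$ and $\mathrm{Log}(r_t^{\top}\hat{r}_1)$.

```python
import numpy as np


def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def so3_exp(w):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3) + hat(w)  # first-order approximation near the identity
    k = hat(np.asarray(w) / theta)
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)


def so3_log(rot):
    """Rotation matrix -> rotation vector (numerically unstable very close to angle pi)."""
    cos_theta = np.clip((np.trace(rot) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([rot[2, 1] - rot[1, 2], rot[0, 2] - rot[2, 0], rot[1, 0] - rot[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w


def euler_step(x_t, r_t, x1_hat, r1_hat, dt, c=10.0):
    """One Euler update of a single frame, following Eqs. (11) and (12) as written."""
    x_next = x_t + dt * (x1_hat - x_t)
    r_next = r_t @ so3_exp(c * dt * so3_log(r_t.T @ r1_hat))
    return x_next, r_next


def sample(predict, n_res, n_steps, c=10.0, seed=0):
    """Integrate n_res noisy frames from t=0 to t=1 in n_steps Euler steps.

    `predict(x, r, t)` is a hypothetical callable returning the predicted
    translations (n_res, 3) and rotations (n_res, 3, 3).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_res, 3))
    # Placeholder prior over rotations (not necessarily uniform on SO(3)).
    r = np.stack([so3_exp(rng.normal(size=3)) for _ in range(n_res)])
    dt = 1.0 / n_steps
    for step in range(n_steps):
        x1_hat, r1_hat = predict(x, r, step * dt)
        for i in range(n_res):
            x[i], r[i] = euler_step(x[i], r[i], x1_hat[i], r1_hat[i], dt, c)
    return x, r
```

With $c=10$ and $N_T=10$, the product $c\,\Delta t$ equals 1, so a single rotation update already lands on the predicted $\hat{r}_1$; larger $N_T$ gives proportionally smaller rotation steps.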

Conditional generation. The unconditional sampling strategy described above aims to generate realistic RNA backbone structures sampled from the training distribution. However, using generative models in real-world design tasks entails conditional generation based on specified design constraints or requirements (Ingraham et al., 2022, Watson et al., 2023), which we are currently exploring. For example, unconditional models can leverage inference-time guidance strategies (Wu et al., 2024), be fine-tuned conditionally (Denker et al., 2024) or in an amortized fashion for motif-scaffolding (Didi et al., 2023). For sequence conditioning and structure prediction, we can incorporate embeddings from language models (Penic et al., 2024, He et al., 2024).

3 Experiments

3D RNA structure dataset. RNAsolo (Adamczyk et al., 2022) is a recent dataset of RNA 3D structures extracted from isolated RNAs, protein-RNA complexes, and DNA-RNA hybrids from the Protein Data Bank (as of January 5, 2024). The dataset contains 14,366 structures at resolution $\leq 4$ Å ($1$ Å = 0.1 nm). We select sequences of lengths between 40 and 150 nucleotides (5,319 in total), as we expect this size range to contain structured RNAs of interest for design tasks.

Figure 2: Structural self-consistency evaluation. We sample a backbone from our model and pass it through an inverse folding model (gRNAde) to obtain $N_{\text{seq}}=8$ sequences. Each sequence is fed into a structure prediction model (RhoFold) to get the predicted all-atom backbone. Self-consistency between each predicted backbone and the generated sample is measured with TM-score (we also report RMSD and GDT_TS). For a given generated sample, we thus have $N_{\text{seq}}=8$ TM-scores of which we take the maximum as the scTM score for that sample.

Evaluation metrics. We evaluate our models for unconditional RNA backbone generation, analogous to recent work in protein design (Yim et al., 2023b, a, Bose et al., 2023, Lin and AlQuraishi, 2023). We generate 50 backbones for each target length between 40 and 150, at intervals of 10. We then compute the following indicators of quality for these backbones:

  • Validity (scTM $\geq 0.45$): We inverse fold each generated backbone using gRNAde (Joshi et al., 2023) and pass $N_{\text{seq}}=8$ generated sequences into RhoFold (Shen et al., 2022). We then compute the self-consistency TM-score (scTM) between the predicted RhoFold structure and our backbone at the $C4'$ level. We say a backbone is valid if scTM $\geq 0.45$; this threshold corresponds to roughly the same fold between two RNA strands (Zhang et al., 2022). We expand on this framework in Figure 2.

  • Diversity: Among the valid samples, we compute the number of unique structural clusters formed using qTMclust (Zhang et al., 2022) and take the ratio to the total number of samples. Two structures are considered similar if their TM-score $\geq 0.45$. This metric shows how much each generated sample varies from others across various sequence lengths.

  • Novelty: Among the valid samples, we use US-align (Zhang et al., 2022) at the $C4'$ level to compute how structurally dissimilar the generated backbones are from the training distribution. For a set of samples for a given sequence length, we compute the TM-score between all pairs of generated backbones and training samples, and for each generated backbone, we assign the highest TM-score. We call the average across this set pdbTM.

  • Local structural measurements: We measure the similarity between bond distances, bond angles, and dihedral angles from the set of generated samples and the training set. To do so, we compute histograms for each of the local structural metrics and use 1D Earth Mover’s distance to measure the similarity between generated and training distributions.
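The aggregation behind the validity, diversity, and novelty metrics above can be sketched in plain Python, assuming the TM-scores and cluster counts have already been computed by external tools (gRNAde, RhoFold, US-align, qTMclust), which are not called here. The function names are hypothetical, and the denominator of the diversity ratio is left as an explicit argument.

```python
def sc_tm(tm_scores):
    """scTM of one generated backbone: the maximum over its N_seq refolded structures."""
    return max(tm_scores)


def validity(sc_tm_scores, threshold=0.45):
    """Fraction of generated backbones whose scTM meets the threshold."""
    return sum(score >= threshold for score in sc_tm_scores) / len(sc_tm_scores)


def diversity(num_clusters, num_samples):
    """Ratio of unique structural clusters to the chosen number of samples."""
    return num_clusters / num_samples


def pdb_tm(tm_to_training):
    """Average, over generated backbones, of the highest TM-score to any training structure.

    `tm_to_training[i][j]` is the TM-score between generated backbone i and
    training structure j.
    """
    best_matches = [max(row) for row in tm_to_training]
    return sum(best_matches) / len(best_matches)
```

For example, `validity([sc_tm(scores) for scores in per_sample_scores])` gives the fraction of valid samples, where `per_sample_scores` is a hypothetical list holding the 8 TM-scores of each generated backbone.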

Hyperparameters. We use 6 IPA blocks in our flow model, with an additional 3-layer torsion predictor MLP that takes in node embeddings from the IPA module. Our final model contains 16.8M trainable parameters. We use the Adam optimizer with learning rate $0.0001$, $\beta_{1}=0.9$, $\beta_{2}=0.999$. We train for 120K gradient update steps on four NVIDIA GeForce RTX 3090 GPUs for about 15 hours with a batch size $B=20$. Each batch contains samples of the same sequence length to avoid padding. Further hyperparameters are listed in Appendix B.1.
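One way to build batches of equal-length samples, as described above, is to bucket the dataset by sequence length before slicing it into batches. The sketch below is an assumption about how such batching could be done, not the authors' data loader; `same_length_batches` is a hypothetical helper.

```python
import random
from collections import defaultdict


def same_length_batches(lengths, batch_size=20, seed=0):
    """Group sample indices into batches that share one sequence length.

    `lengths[i]` is the number of nucleotides in sample i. The last batch of
    each length bucket may hold fewer than `batch_size` samples.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for index, length in enumerate(lengths):
        buckets[length].append(index)

    batches = []
    for indices in buckets.values():
        rng.shuffle(indices)  # randomise order within a length bucket
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    rng.shuffle(batches)  # randomise the order in which lengths are visited
    return batches
```

Since every sample in a batch has the same length, the coordinates can be stacked into a dense tensor without padding masks.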

4 Results

4.1 Global evaluation of generated RNA backbones

We begin by analyzing RNA-FrameFlow’s samples using the aforementioned evaluation metrics. For validity, we report the percentage of samples with scTM $\geq 0.45$; for diversity, we report the ratio of unique structural clusters to total valid samples; and for novelty, we report pdbTM, the average over samples of the highest TM-score to a match from the PDB. For each sequence length between 40 and 150, at intervals of 10, we generate 50 backbones. Table 1 reports these metrics for different numbers of denoising steps $N_{T}$. We compare our model to the protein-RNA-DNA complex co-design model MMDiff (Morehead et al., 2023), which is a diffusion model. As the original version of MMDiff was trained on shorter RNA sequences, we retrain it on our training set. Additionally, we inverse-fold MMDiff’s backbones using gRNAde.

We identify $N_{T}=50$ as the best-performing model that balances validity, diversity, and novelty; furthermore, it takes 4.74 seconds (averaged over 5 runs) to sample a backbone of length 100, as opposed to 27.3 seconds for MMDiff with 100 diffusion steps. We note that increasing $N_{T}$ does not improve validity despite allowing the model to perform more updates to atomic coordinate placements. Our model also outperforms MMDiff. On manual inspection, samples from MMDiff had significant chain breaks and disconnected floating strands; see Appendix D.1.

Table 1: Unconditional RNA backbone generation. We evaluate the performance of RNA-FrameFlow for multiple values for denoising steps $N_{T}$. The best-performing model uses $N_{T}=50$ steps, taking 4.74s to sample a backbone of length 100. Diffusion-based MMDiff generated no valid backbones and took $5\times$ longer to sample.

| Model | Timesteps $N_{T}$ | % Validity ↑ | Diversity ↑ | Novelty ↓ |
|---|---|---|---|---|
| RNA-FrameFlow | 10 | 16.7 | 0.62 | 0.70 |
| | 50 | 41.0 | 0.61 | 0.54 |
| | 100 | 20.0 | 0.61 | 0.69 |
| | 500 | 20.0 | 0.57 | 0.67 |
| MMDiff | 100 | 0.0 | - | - |

Figure 3: Validity and novelty of generated backbones. (Left) scTM of backbones of lengths 40-150 with the mean and spread of scTM for each length; we select the top 10 structures with the best validation scores per length. (Middle) Scatter plot of self-consistency TM-score (scTM) and novelty (pdbTM) across lengths. Vertical and horizontal dotted lines represent TM-score thresholds of 0.45. (Right) Selected samples with high pdbTM scores (colored) with the closest, aligned match from the PDB (gray). Our model generates valid backbones for certain sequence lengths and tends to recapitulate the most frequent folds in the PDB (e.g., tRNAs, small rRNAs).

4.2 Local evaluation with structural measurements

For our best-performing model with number of timesteps $N_{T}=50$, we plot histograms of bond distances, bond angles, and dihedral angles in Figure 4. We include the Earth Mover’s distance (EMD) between measurements from the training and generated distributions as an indicator of local realism (using 30 bins for each quantity). An ideal generative model will score an EMD close to 0.0 (i.e., consistent with the training set comprising naturally occurring RNA). In Table 2, we observe EMD values from our best-performing model’s backbones being significantly closer to 0.0 compared to MMDiff and random Gaussian all-atom point clouds (akin to an untrained model), which serve as sanity checks. We include histograms for MMDiff in Appendix D.1.
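A histogram-based 1D Earth Mover's distance of the kind described above can be sketched in NumPy. The exact binning range and normalisation used for the reported numbers are not specified here, so the shared bin edges and the unit-mass normalisation below are assumptions; `histogram_emd` is a hypothetical helper.

```python
import numpy as np


def histogram_emd(samples_a, samples_b, bins=30):
    """1D Earth Mover's distance between two samples after histogramming.

    Both samples are binned on a shared, equally spaced grid; for 1D histograms
    the EMD is the bin width times the summed absolute difference of the CDFs.
    Angular periodicity is ignored in this sketch.
    """
    a = np.asarray(samples_a, dtype=float)
    b = np.asarray(samples_b, dtype=float)
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    if hi == lo:
        return 0.0  # all values identical, nothing to move

    edges = np.linspace(lo, hi, bins + 1)
    hist_a, _ = np.histogram(a, bins=edges)
    hist_b, _ = np.histogram(b, bins=edges)
    p = hist_a / hist_a.sum()  # normalise each histogram to unit mass
    q = hist_b / hist_b.sum()

    width = edges[1] - edges[0]
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * width)
```

The returned value is in the units of the measured quantity (for example Å for bond distances), so values for distances and angles are not directly comparable with each other.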

We also show RNA Ramachandran angle plots for generated samples and the training distribution in Figure 4. Keating et al. (2011) introduced $\eta$-$\theta$ plots, similar to Ramachandran angle plots for proteins, that track the separate dihedral angles formed by $\{C4'_{i},P_{i+1},C4'_{i+1},P_{i+2}\}$ and $\{P_{i},C4'_{i},P_{i+1},C4'_{i+1}\}$ respectively, for each nucleotide $i$ along the chain. We observe that the dihedral angle distribution from RNA-FrameFlow closely recapitulates the distribution of naturally occurring RNA structures from the training set.
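The pseudo-torsions plotted in an $\eta$-$\theta$ diagram are ordinary dihedral angles over four consecutive backbone atoms. The NumPy sketch below follows the atom quadruples exactly as listed in the paragraph above; the function names and the `(N, 3)` array layout for the P and C4' coordinates are assumptions for the example.

```python
import numpy as np


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.arctan2(y, x))


def pseudo_torsions(p, c4):
    """Eta and theta pseudo-torsions from P and C4' coordinates of shape (N, 3).

    eta[i]   uses (C4'_i, P_{i+1}, C4'_{i+1}, P_{i+2}) and exists for i < N - 2;
    theta[i] uses (P_i, C4'_i, P_{i+1}, C4'_{i+1})   and exists for i < N - 1.
    """
    n = len(p)
    eta = np.array([dihedral(c4[i], p[i + 1], c4[i + 1], p[i + 2]) for i in range(n - 2)])
    theta = np.array([dihedral(p[i], c4[i], p[i + 1], c4[i + 1]) for i in range(n - 1)])
    return eta, theta
```

Plotting `eta[i]` against `theta[i]` for the indices where both are defined gives one point per nucleotide in the RNA-centric Ramachandran plot.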

4.3 Generation quality across sequence lengths

We next investigate how sequence length affects the global realism of generated samples (measured by scTM). Figure 3 (Left) shows the performance of RNA-FrameFlow for different sequence lengths. We observe that our model generates samples with high scTM for specific sequence lengths such as 50, 60, 70, and 120, while generating poorer-quality structures for other lengths. We believe the fluctuation of TM-scores may be due to certain lengths being over-represented in the training distribution. We can also partially attribute this to the inherent length bias of RhoFold; see Appendix B.2. With a better structure predictor, we expect an increase in valid samples that meet the 0.45 TM-score threshold.

We also analyze the novelty of our generated samples (measured by pdbTM) in Figure 3 (Middle). We are particularly interested in samples that lie in the right half with high scTM and low pdbTM, which means that the designs are highly likely to fold back into the sampled backbone but are structurally dissimilar to any RNAs in the training set. It is worth noting that our training set has high structural similarity among samples: running qTMclust on our training dataset revealed only 342 unique clusters from 5,319 samples, which indicates that the model does not encounter a diverse set of samples during training. This contributes to many generated samples from our model looking similar to samples from the training distribution. We include two such examples in Figure 3 (Right). Both generated RNAs yield relatively high pdbTM scores and look similar to their respective closest matching chain from the training set: a tRNA at length 70 and a 5S ribosomal RNA at length 120, respectively. We include comparative results on validity and novelty for MMDiff in Appendix D.1, finding that MMDiff does not generate any samples that pass the validity criteria.

Table 2: Local structural metrics. Earth Mover’s Distance for local structural measurements compared to ground truth measurements from RNAsolo. Our model ($N_{T}=50$) shows improved recapitulation of local structural descriptors compared to baselines.

| Model | dist (EMD ↓) | angles (EMD ↓) | torsions (EMD ↓) |
|---|---|---|---|
| RNA-FrameFlow | 0.17 | 0.11 | 2.36 |
| MMDiff (original) | 1.38 | 0.43 | 3.06 |
| MMDiff (retrained) | 0.39 | 0.21 | 3.23 |
| Gaussian noise | 29.00 | 6.35 | 4.37 |

Table 3: Impact of data preparation strategies. Increasing the diversity of the training dataset using a combination of strategies improves diversity and novelty of generated structures but leads to fewer designs passing the validity threshold.

| Model | % Validity ↑ | Diversity ↑ | Novelty ↓ |
|---|---|---|---|
| Base | 41.0 | 0.62 | 0.54 |
| + Clustering | 12.0 | 0.88 | 0.49 |
|     + Cropping | 11.0 | 0.85 | 0.47 |

Figure 4: Local structural metrics from 600 generated backbone samples, compared to random Gaussian point cloud as a sanity check. Our model can recapitulate local structural descriptors. (Subplots 1-3) Histograms of inter-nucleotide bond distances, bond angles between nucleotide triplets, and torsion angles between every four nucleotides. (Subplot 4): RNA-centric Ramachandran plot of structures from the training set (purple) and generated backbones (green).

4.4 Data preparation protocols

Due to the overrepresentation of RNA strands of certain lengths (mostly corresponding to tRNA or 5S ribosomal RNA) in our training set, our models generate close likenesses for those lengths that achieve high self-consistency but are not novel folds. To avoid this memorized recapitulation and promote increased diversity among samples, we sought to develop data preparation protocols to balance RNA folds across sequence lengths.

  • Structural clustering: We cluster our training set using qTMclust. When creating a training batch, we sample random clusters and, from each, a random structure. This ensures that a batch neither consists solely of samples of a single sequence length nor is dominated by over-represented folds. There are only 342 structural clusters for the 5,319 samples within sequence lengths 40-150, highlighting the lack of diversity in RNA structural data.

  • Cropping augmentation: We expand our training set by cropping RNA strands longer than 150 nucleotides: we sample a random crop length in $[40,150]$ and extract a contiguous segment of that length from the larger chain. As cropped RNAs are not standalone molecules and serve only to augment the dataset, we add a randomly chosen set of crops equal to 20% of the training set size to balance uncropped and cropped samples; this gives 1,063 extra cropped samples.
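The two data preparation protocols above can be sketched in plain Python. This is an illustration under assumptions, not the authors' pipeline: the helper names are hypothetical, clusters are drawn with replacement (the text does not say either way), and crop positions are drawn uniformly.

```python
import random


def sample_cluster_batch(clusters, batch_size, rng):
    """Pick random clusters, then one random structure from each.

    `clusters` maps a cluster id to the list of structure ids it contains.
    Clusters are drawn with replacement here, which is an assumption.
    """
    cluster_ids = list(clusters)
    chosen = [rng.choice(cluster_ids) for _ in range(batch_size)]
    return [rng.choice(clusters[cluster_id]) for cluster_id in chosen]


def random_crop(chain, rng, min_len=40, max_len=150):
    """Extract a contiguous segment of random length in [min_len, max_len]."""
    if len(chain) <= max_len:
        raise ValueError("only chains longer than max_len are cropped")
    crop_len = rng.randint(min_len, max_len)  # inclusive on both ends
    start = rng.randint(0, len(chain) - crop_len)
    return chain[start:start + crop_len]


def num_crops(train_size, percent=20):
    """Number of cropped samples added, as a percentage of the training set size."""
    return train_size * percent // 100
```

With the 5,319 training structures reported above, `num_crops(5319)` evaluates to 1063, matching the 1,063 extra cropped samples.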

We train identical models on these data splits for 120K gradient steps, with evaluation results reported in Table 3 showing improved diversity and novelty in the generated samples, at the cost of reduced validity. For structural clustering, each batch comprises padded samples up to a maximum length of 150 from randomly selected structural clusters across sequence lengths. See Section D.2 for full results for the two alternate data preparation protocols.

5 Limitations and Discussions

Altogether, our experiments demonstrate that the $SE(3)$ flow matching framework is sufficiently expressive for learning the distribution of 3D RNA structure and generating realistic RNA backbones similar to well-represented RNA folds in the PDB. Select examples are shown in the additional samples figure (fig:addition-samples). We have also identified notable limitations and avenues for future work, which we highlight below.

Physical violations. While well-trained models usually generate realistic RNA backbones, we do observe some physical violations: generated backbones sometimes have chains that are either too close together or directly clash with one another, are highly coiled, have excessive loops and unrealistically intertwined helices, or have chain breaks. We highlight these limitations in Figure 5. RNA tertiary structure folding is driven by base pairing and base stacking (Vicens and Kieft, 2022), which influence the formation of helices, loops, and other tertiary motifs. Base pairing refers to nucleotides along adjacent chains forming hydrogen bonds, while base stacking involves interactions between rings of adjacent nucleotide bases along a chain. To our knowledge, all current deep learning models operate on individual nucleotides, only implicitly learning base pairing and stacking. Developing explicit representations of these interactions as part of the architecture may further minimize physical violations and provide stronger inductive biases to learn complex tertiary RNA motifs.

Generalisation and novelty. We observed that the best designs from our models (as measured by scTM score) are sampled at lengths 70-80 and 120-130, and often have closely matching structures in the PDB (high TM-scores). This suggests that models can recapitulate well-represented RNA folds in their training distribution (e.g., both tRNAs at length 70-90 and small 5S ribosomal RNAs at length 120 are very frequent). However, self-consistency metrics were relatively poorer for less frequent lengths, suggesting that models are currently not designing novel folds.

We would also like to note that the models we use for structure prediction and inverse folding may be similarly biased to perform well for certain sequence lengths, leading to the overall pipeline being reliable for commonly occurring lengths and unreliable for less frequent ones (see Appendix B.2 for an analysis on RhoFold). We evaluated preliminary strategies for structural clustering and cropping augmentations during training, which improved the novelty of designed structures but led to fewer designs passing the validity filter. Overall, the relative scarcity of RNA structural data compared to proteins necessitates greater care in preparing data pipelines for scaling up training and/or incorporating inductive biases into generative models, which we hope to continue exploring.

6 Conclusion

We introduce RNA-FrameFlow, a generative model for 3D RNA backbone design. Our evaluations show that our model can design locally realistic and moderately novel backbones of length 40-150 nucleotides. We achieve a validity score of 41.0% and relatively strong diversity and novelty scores compared to diffusion model baselines and ablated variants. While generative models can successfully recapitulate well-represented RNA folds (e.g., tRNAs, small rRNAs), the lack of diversity in the training data may hinder broad generalization at present. We are actively exploring improved data preparation strategies combined with inductive biases that explicitly incorporate physical interactions that drive RNA structure. We hope RNA-FrameFlow and the associated evaluation framework can serve as foundations for the community to explore 3D RNA design, towards developing conditional generative models for real-world design scenarios.

Acknowledgements

We would like to thank Jason Yim and Emile Mathieu for helpful comments and discussions. CKJ was supported by the A*STAR Singapore National Science Scholarship (PhD). AM was supported by a U.S. NSF grant (DBI2308699) and two U.S. NIH grants (R01GM093123 and R01GM146340). SVM was supported by the UKRI Centre for Doctoral Training in Application of Artificial Intelligence to the study of Environmental Risks (EP/S022961/1). This research was partially supported by a Cambridge Dawn Supercomputer Pioneer Project compute grant.

Figure 5: Physical violations in generated samples. (A) Steric clashes between generated chains (highlighted in yellow). (B) Chain breaks and stray strands (highlighted in yellow). (C)-(E) Excessive loops and intertwined helices.

References

  • Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zidek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 2021.
  • Dauparas et al. (2022) Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 2022.
  • Watson et al. (2023) Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, 2023.
  • Doudna and Charpentier (2014) Jennifer A Doudna and Emmanuelle Charpentier. The new frontier of genome engineering with crispr-cas9. Science, 2014.
  • Metkar et al. (2024) Mihir Metkar, Christopher S Pepin, and Melissa J Moore. Tailor made: the art of therapeutic mrna design. Nature Reviews Drug Discovery, 23(1):67–83, 2024.
  • Mulhbacher et al. (2010) Jerome Mulhbacher, Patrick St-Pierre, and Daniel A Lafontaine. Therapeutic applications of ribozymes and riboswitches. Current opinion in pharmacology, 2010.
  • Damase et al. (2021) Tulsi Ram Damase, Roman Sukhovershin, Christian Boada, Francesca Taraballi, Roderic I Pettigrew, and John P Cooke. The limitless future of rna therapeutics. Frontiers in bioengineering and biotechnology, 9:628137, 2021.
  • Han et al. (2017) Dongran Han, Xiaodong Qi, Cameron Myhrvold, Bei Wang, Mingjie Dai, Shuoxing Jiang, Maxwell Bates, Yan Liu, Byoungkwon An, Fei Zhang, et al. Single-stranded dna and rna origami. Science, 2017.
  • Yesselman et al. (2019) Joseph D Yesselman, Daniel Eiler, Erik D Carlson, Michael R Gotrik, Anne E d’Aquino, Alexandra N Ooms, Wipapat Kladwang, Paul D Carlson, Xuesong Shi, David A Costantino, et al. Computational design of three-dimensional rna structure and function. Nature nanotechnology, 2019.
  • Ganser et al. (2019) Laura R Ganser, Megan L Kelly, Daniel Herschlag, and Hashim M Al-Hashimi. The roles of structural dynamics in the cellular functions of rnas. Nature reviews Molecular cell biology, 2019.
  • Li et al. (2023a) Yueyi Li, Anibal Arce, Tyler Lucci, Rebecca A Rasmussen, and Julius B Lucks. Dynamic rna synthetic biology: new principles, practices and potential. RNA biology, 2023a.
  • Yim et al. (2023a) Jason Yim, Andrew Campbell, Andrew Y. K. Foong, Michael Gastegger, José Jiménez-Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S. Veeling, Regina Barzilay, Tommi Jaakkola, and Frank Noé. Fast protein backbone generation with se(3) flow matching, 2023a.
  • Adamczyk et al. (2022) Bartosz Adamczyk, Maciej Antczak, and Marta Szachniuk. RNAsolo: a repository of cleaned PDB-derived RNA 3D structures. Bioinformatics, 2022.
  • Joshi et al. (2023) Chaitanya K. Joshi, Arian R. Jamasb, Ramon Viñas, Charles Harris, Simon Mathis, Alex Morehead, Rishabh Anand, and Pietro Liò. grnade: Geometric deep learning for 3d rna inverse design, 2023.
  • Shen et al. (2022) Tao Shen, Zhihang Hu, Zhangzhi Peng, Jiayang Chen, Peng Xiong, Liang Hong, Liangzhen Zheng, Yixuan Wang, Irwin King, Sheng Wang, Siqi Sun, and Yu Li. E2efold-3d: End-to-end deep learning method for accurate de novo rna 3d structure prediction, 2022.
  • Vicens and Kieft (2022) Quentin Vicens and Jeffrey S Kieft. Thoughts on how to think (and talk) about rna structure. Proceedings of the National Academy of Sciences, 2022.
  • Kretsch et al. (2023) Rachael C Kretsch, Ebbe S Andersen, Janusz M Bujnicki, Wah Chiu, Rhiju Das, Bingnan Luo, Benoît Masquida, Ewan KS McRae, Griffin M Schroeder, Zhaoming Su, et al. Rna target highlights in casp15: Evaluation of predicted models by structure providers. Proteins: Structure, Function, and Bioinformatics, 2023.
  • Abramson et al. (2024) Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 2024.
  • Morehead et al. (2023) Alex Morehead, Jeffrey Ruffolo, Aadyot Bhatnagar, and Ali Madani. Towards joint sequence-structure generation of nucleic acid and protein complexes with se(3)-discrete diffusion, 2023.
  • Harvey and Prabhakaran (1986) Stephen C Harvey and M Prabhakaran. Ribose puckering: structure, dynamics, energetics, and the pseudorotation cycle. Journal of the American Chemical Society, 108(20):6128–6136, 1986.
  • Clay et al. (2017) Mary C Clay, Laura R Ganser, Dawn K Merriman, and Hashim M Al-Hashimi. Resolving sugar puckers in rna excited states exposes slow modes of repuckering dynamics. Nucleic acids research, 45(14):e134–e134, 2017.
  • Chen and Lipman (2024) Ricky T. Q. Chen and Yaron Lipman. Flow matching on general geometries, 2024.
  • Yim et al. (2023b) Jason Yim, Brian L. Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, and Tommi Jaakkola. Se(3) diffusion model with application to protein backbone generation, 2023b.
  • Ingraham et al. (2022) John Ingraham, Max Baranov, Zak Costello, Vincent Frappier, Ahmed Ismail, Shan Tie, Wujie Wang, Vincent Xue, Fritz Obermeyer, Andrew Beam, et al. Illuminating protein space with a programmable generative model. bioRxiv, pages 2022–12, 2022.
  • Wu et al. (2024) Luhuan Wu, Brian Trippe, Christian Naesseth, David Blei, and John P Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
  • Denker et al. (2024) Alexander Denker, Francisco Vargas, Shreyas Padhy, Kieran Didi, Simon Mathis, Vincent Dutordoir, Riccardo Barbano, Emile Mathieu, Urszula Julia Komorowska, and Pietro Lio. Deft: Efficient finetuning of conditional diffusion models by learning the generalised $h$-transform. arXiv preprint arXiv:2406.01781, 2024.
  • Didi et al. (2023) Kieran Didi, Francisco Vargas, Simon Mathis, Vincent Dutordoir, Emile Mathieu, Urszula Julia Komorowska, and Pietro Lio. A framework for conditional diffusion modelling with applications in motif scaffolding for protein design. In NeurIPS 2023 Machine Learning for Structural Biology Workshop, 2023.
  • Penic et al. (2024) Rafael Josip Penic, Tin Vlasic, Roland G Huber, Yue Wan, and Mile Sikic. RiNALMo: General-purpose RNA language models can generalize well on structure prediction tasks. arXiv preprint, 2024.
  • He et al. (2024) Shujun He, Rui Huang, Jill Townley, Rachael C Kretsch, Thomas G Karagianes, David BT Cox, Hamish Blair, Dmitry Penzar, Valeriy Vyaltsev, Elizaveta Aristova, et al. Ribonanza: deep learning of RNA structure through dual crowdsourcing. bioRxiv, 2024.
  • Bose et al. (2023) Avishek Joey Bose, Tara Akhound-Sadegh, Kilian Fatras, Guillaume Huguet, Jarrid Rector-Brooks, Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, and Alexander Tong. SE(3)-stochastic flow matching for protein backbone generation, 2023.
  • Lin and AlQuraishi (2023) Yeqing Lin and Mohammed AlQuraishi. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds, 2023.
  • Zhang et al. (2022) Chengxin Zhang, Morgan Shine, Anna Marie Pyle, and Yang Zhang. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nature methods, 2022.
  • Keating et al. (2011) Kevin S Keating, Elisabeth L Humphris, and Anna Marie Pyle. A new way to see RNA. Quarterly reviews of biophysics, 44(4):433–466, 2011.
  • Baek et al. (2022) Minkyung Baek, Ryan McHugh, Ivan Anishchenko, David Baker, and Frank DiMaio. Accurate prediction of nucleic acid and protein-nucleic acid complexes using RoseTTAFoldNA. bioRxiv, 2022.
  • Li et al. (2023b) Yang Li, Chengxin Zhang, Chenjie Feng, Robin Pearce, P Lydia Freddolino, and Yang Zhang. Integrating end-to-end learning with deep geometrical potentials for ab initio RNA structure prediction. Nature Communications, 2023b.
  • Townshend et al. (2021) Raphael JL Townshend, Stephan Eismann, Andrew M Watkins, Ramya Rangan, Maria Karelina, Rhiju Das, and Ron O Dror. Geometric deep learning of RNA structure. Science, 2021.
  • Boniecki et al. (2016) Michal J Boniecki, Grzegorz Lach, Wayne K Dawson, Konrad Tomala, Pawel Lukasz, Tomasz Soltysinski, Kristian M Rother, and Janusz M Bujnicki. SimRNA: a coarse-grained method for RNA folding simulations and 3D structure prediction. Nucleic acids research, 2016.
  • Watkins et al. (2020) Andrew Martin Watkins, Ramya Rangan, and Rhiju Das. FARFAR2: improved de novo Rosetta prediction of complex global RNA folds. Structure, 2020.
  • Tan et al. (2023) Cheng Tan, Yijie Zhang, Zhangyang Gao, Hanqun Cao, and Stan Z. Li. Hierarchical data-efficient representation learning for tertiary structure-based RNA design, 2023.
  • Shulgina et al. (2024) Yekaterina Shulgina, Marena I Trinidad, Conner J Langeberg, Hunter Nisonoff, Seyone Chithrananda, Petr Skopintsev, Amos J Nissley, Jaymin Patel, Ron S Boger, Honglue Shi, et al. RNA language models predict mutations that improve RNA function. bioRxiv, 2024.
  • Nori and Jin (2024) Divya Nori and Wengong Jin. RNAFlow: RNA structure & sequence co-design via inverse folding-based flow matching. In ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design, 2024.

Appendix A Related Work

Here, we summarize recent developments in deep learning for 3D RNA modeling and design. Recent end-to-end RNA structure prediction tools include RhoFold (Shen et al., 2022), RoseTTAFold2NA (Baek et al., 2022), DRFold (Li et al., 2023b), and AlphaFold3 (Abramson et al., 2024), each with varying performance that is yet to match the current state-of-the-art for proteins. Other approaches use GNNs as ranking functions (Townshend et al., 2021) together with sampling algorithms (Boniecki et al., 2016, Watkins et al., 2020). However, structure prediction tools are not directly capable of designing new structures, which this work aims to address by adapting an $SE(3)$ flow matching framework for proteins (Yim et al., 2023a). MMDiff (Morehead et al., 2023), a diffusion model for protein-nucleic acid complex generation, can also sample RNA-only structures in principle. Our evaluation shows that our flow matching model significantly outperforms both the original and RNA-only versions of MMDiff that we re-trained for fair comparison.

Joshi et al. (2023) introduce gRNAde, a GNN-based encoder-decoder for 3D RNA inverse folding, a closely related task of designing new sequences conditioned on backbone structures. Tan et al. (2023) and Shulgina et al. (2024) have also developed GNNs for 3D RNA inverse folding. In our evaluation pipeline, we use gRNAde (Joshi et al., 2023) to inverse fold generated backbones, followed by RhoFold (Shen et al., 2022) to forward fold the designed sequences and measure structural self-consistency.

Independently of and concurrently with our work, Nori and Jin (2024) propose RNAFlow, an $SE(3)$ flow matching model to co-design RNA sequence and structure conditioned on protein partners. At each denoising step, RNAFlow uses a protein-conditioned variant of gRNAde (Joshi et al., 2023) to inverse fold noised structures, followed by RoseTTAFold2NA (Baek et al., 2022) to predict the structure of the designed sequence. The performance of RNAFlow is upper-bounded by RoseTTAFold2NA as a pre-trained structure generator, which is kept frozen and was not developed for designed RNAs, which lack co-evolutionary MSA information. Our work tackles de novo 3D RNA backbone generation, an orthogonal design task of sampling RNA backbone structures. We train RNA structure generation models from scratch, akin to recent developments in protein design (Yim et al., 2023b, a, Bose et al., 2023, Lin and AlQuraishi, 2023). Backbone generation followed by inverse folding has shown experimental success in designing functional proteins (Dauparas et al., 2022, Watson et al., 2023, Ingraham et al., 2022), as the framework is flexible for including specific structural motifs and sequence constraints.

Appendix B Additional Experimental Details

B.1 Hyperparameters for denoiser

Table 4: Hyperparameters for denoiser model.
| Category | Hyperparameter | Value |
|---|---|---|
| Invariant Point Attention (IPA) | Atom embedding dimension $D_h$ | 256 |
| | Hidden dimension $D_z$ | 128 |
| | Number of blocks | 6 |
| | Query and key points | 8 |
| | Number of heads | 8 |
| | Key points | 12 |
| Transformer | Number of heads | 4 |
| | Number of layers | 2 |
| Torsion Prediction MLP | Input dimension | 256 |
| | Hidden dimension | 128 |
| Schedule | Translations (training) | linear |
| | Rotations (training) | linear |
| | Translations (sampling) | linear |
| | Rotations (sampling) | exponential |
| | Number of denoising steps $N_T$ | $[10, \mathbf{50}, 100, 500]$ |
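For readers who want to reproduce this setup, the values in Table 4 can be collected into a single configuration object. The following is a minimal sketch only: the class and field names are hypothetical and are not taken from the released code, and we assume that the bolded entry (50) is the number of denoising steps used by default.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class IPAConfig:
    # Field names are illustrative; values are copied from Table 4.
    atom_embed_dim: int = 256      # D_h
    hidden_dim: int = 128          # D_z
    num_blocks: int = 6
    num_qk_points: int = 8         # "Query and key points"
    num_heads: int = 8
    num_key_points: int = 12       # "Key points", as listed in Table 4


@dataclass(frozen=True)
class TransformerConfig:
    num_heads: int = 4
    num_layers: int = 2


@dataclass(frozen=True)
class TorsionMLPConfig:
    input_dim: int = 256
    hidden_dim: int = 128


@dataclass(frozen=True)
class ScheduleConfig:
    train_translations: str = "linear"
    train_rotations: str = "linear"
    sample_translations: str = "linear"
    sample_rotations: str = "exponential"
    candidate_steps: Tuple[int, ...] = (10, 50, 100, 500)
    num_denoising_steps: int = 50  # assumed default (bolded in Table 4)


@dataclass(frozen=True)
class DenoiserConfig:
    ipa: IPAConfig = field(default_factory=IPAConfig)
    transformer: TransformerConfig = field(default_factory=TransformerConfig)
    torsion_mlp: TorsionMLPConfig = field(default_factory=TorsionMLPConfig)
    schedule: ScheduleConfig = field(default_factory=ScheduleConfig)


cfg = DenoiserConfig()
```

Freezing the dataclasses keeps a run's hyperparameters immutable once constructed, which makes it harder to accidentally change a setting between training and sampling.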

B.2 RhoFold Length Bias

We investigate the performance of RhoFold on the training dataset used for our generative model. Figure 11 shows that RhoFold has a sequence length bias where it predicts good structures with low RMSDs for specific sequence lengths (like 70, 100, and 120) while predicting poor structures for other lengths with larger RMSDs. The performance across lengths is disparate and may influence what is considered ‘valid’ in our unconditional generation benchmarks.

Figure 11: RhoFold length bias. RhoFold has a strong bias for certain sequence lengths over others. This affects its efficacy when used to compute the 3D self-consistency of generated backbones. The blue dotted line represents the median RMSD of RhoFold predictions to the samples. To minimize the influence of this length bias, we use TM-score for self-consistency because it does not penalize flexible regions as much as RMSD.
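The per-length analysis behind Figure 11 amounts to grouping prediction errors by sequence length and comparing each group against the overall median. A minimal sketch, assuming the errors are available as (sequence length, RMSD) pairs; the function name and the toy numbers below are illustrative and are not values from our experiments:

```python
from collections import defaultdict
from statistics import median


def median_rmsd_by_length(records):
    """Group (sequence_length, rmsd) pairs and return the median RMSD per length."""
    by_length = defaultdict(list)
    for length, rmsd in records:
        by_length[length].append(rmsd)
    return {length: median(vals) for length, vals in sorted(by_length.items())}


# Toy data for illustration only (not measurements from the paper).
records = [(70, 2.1), (70, 3.0), (70, 2.5), (85, 14.2), (85, 11.8), (100, 3.3)]

per_length = median_rmsd_by_length(records)
overall_median = median(rmsd for _, rmsd in records)

# Lengths whose median RMSD exceeds the overall median are where the
# folding model performs comparatively poorly.
hard_lengths = [n for n, m in per_length.items() if m > overall_median]
```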

Appendix C Ablation Study

C.1 Composition of Backbone Coordinate Loss

We also analyze how changing the composition of atoms considered in the inter-atom losses affects performance. We increase the number of atoms being supervised in the $\mathcal{L}_{\text{bb}}$ loss described above. Aside from the frame comprising $C3'$, $C4'$, and $O4'$, we try two settings with 3 and 7 additional non-frame atoms included in the loss. For the 3 non-frame atoms, we additionally choose $C1'$, $P$, and $O3'$, and for the 7 non-frame atoms, we choose a superset $C1'$, $P$, $O3'$, $C5'$, $OP1$, $OP2$, and $N1/N9$. We posit the additional supervision may increase the local structural realism, which may further improve validity, as shown in Table 5.

We indeed observe that validity increases as we supervise more atoms in the auxiliary backbone loss. RMSD contributions from disordered fragments of the RNA may be minimal, which would account for greater likeness to the RhoFold-predicted structures and relatively higher scTM scores. However, the original frame-only baseline model has better diversity and novelty, which we attribute to high local variation in atomic placements. This variation causes two generated structures of the same sequence length to look very different at an all-atom resolution.

Table 5: Ablating composition of backbone loss $\mathcal{L}_{\text{bb}}$. Supervising more non-frame atoms improves validity but worsens diversity and novelty. Best per-column result is bolded.

| Frame composition in $\mathcal{L}_{\text{bb}}$ | % Validity ↑ | Diversity ↑ | Novelty ↓ |
|---|---|---|---|
| Frame only (baseline) | 41.0 | **0.62** | **0.54** |
| Frame and 3 non-frame | 45.0 | 0.28 | 0.79 |
| Frame and 7 non-frame | **46.7** | 0.35 | 0.85 |
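The three settings differ only in which atoms enter the coordinate loss. The sketch below illustrates this with a plain mean squared error over the selected atoms. It is an assumption-laden illustration: the exact form of $\mathcal{L}_{\text{bb}}$ (normalization, masking, time-dependent weighting) is not restated in this appendix, and the function name and data layout are hypothetical.

```python
FRAME_ATOMS = ("C3'", "C4'", "O4'")
EXTRA_3 = ("C1'", "P", "O3'")
EXTRA_7 = EXTRA_3 + ("C5'", "OP1", "OP2", "N1/N9")


def backbone_loss(pred, target, atom_names):
    """Mean squared coordinate error over the selected atoms.

    pred, target: lists of nucleotides, each a dict mapping an atom name
    to an (x, y, z) tuple. Assumed MSE form, not the paper's exact loss.
    """
    total, count = 0.0, 0
    for pred_nt, target_nt in zip(pred, target):
        for name in atom_names:
            p, t = pred_nt[name], target_nt[name]
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += 1
    return total / count


# Toy example: one nucleotide whose frame atoms are predicted exactly,
# but whose phosphorus atom is displaced by 1 Angstrom along x.
all_atoms = FRAME_ATOMS + EXTRA_7
target = [{name: (0.0, 0.0, 0.0) for name in all_atoms}]
pred = [dict(target[0])]
pred[0]["P"] = (1.0, 0.0, 0.0)

loss_frame_only = backbone_loss(pred, target, FRAME_ATOMS)
loss_frame_plus_3 = backbone_loss(pred, target, FRAME_ATOMS + EXTRA_3)
loss_frame_plus_7 = backbone_loss(pred, target, FRAME_ATOMS + EXTRA_7)
```

In this toy case the frame-only loss is blind to the misplaced phosphorus atom, whereas both extended settings penalize it, which is the intuition behind supervising non-frame atoms.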

C.2 Composition of Auxiliary Loss

We ablate the inclusion of different auxiliary loss terms that guide our $SE(3)$ flow matching setup; results are in Table 6. Removing distance-based losses such as the backbone coordinate loss ($\mathcal{L}_{\text{bb}}$) and the all-to-all pairwise distance loss ($\mathcal{L}_{\text{dist}}$) increases the EMD for bond distances. However, the model still learns realistic distributions when individual loss terms are removed, indicating that the remaining losses partly compensate for the missing one. Nevertheless, the best model uses all losses, and removing any of them causes a drop in validity.

Further inspecting the samples from the models without each loss term reveals structural deformities at the all-atom level. Figure 12 shows such artifacts resulting from not enforcing geometric constraints through explicit losses.

Table 6: Ablations of loss terms on Earth Mover’s Distance scores for structural measurements compared to ground truth measurements from the training set. The first row corresponds to the baseline model. Distance-based losses like the backbone coordinate loss ($\mathcal{L}_{\text{bb}}$) and all-to-all pairwise distance loss ($\mathcal{L}_{\text{dist}}$) are necessary to learn geometric properties like bond distances adequately.
$\mathcal{L}_{\text{bb}}$ $\mathcal{L}_{\text{dist}}$ $\mathcal{L}_{SO(3)}$ EMD (distance) ↓ EMD (angles) ↓ EMD (torsions) ↓ % Validity ↑
0.17 0.11 2.36 41.0
0.18 0.14 3.85 35.0
0.23 0.11 3.72 13.3
0.18 0.18 3.59 16.7
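To make the EMD columns concrete, the sketch below computes the Earth Mover's Distance (Wasserstein-1) between two one-dimensional samples of a structural measurement, such as bond distances from generated and reference structures. This is a simplified illustration, not our exact evaluation code: it assumes both samples have the same size, in which case the distance reduces to the mean absolute difference between sorted values, and it ignores the periodicity of angular measurements. The function name and toy numbers are hypothetical.

```python
def emd_1d(samples_a, samples_b):
    """Wasserstein-1 distance between two equal-size 1D empirical samples."""
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("expected two non-empty samples of equal size")
    a, b = sorted(samples_a), sorted(samples_b)
    # For equal-size samples, the optimal transport plan matches sorted values.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy bond distances in Angstrom (illustrative, not from the paper).
reference = [3.8, 3.9, 4.0]
generated = [4.1, 3.9, 4.0]

emd_distance = emd_1d(reference, generated)
```

A lower value means the generated measurements are distributed more like the reference ones, which is why the EMD columns are marked with a downward arrow.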
Figure 12: Not including auxiliary losses causes structural deformities in generated RNAs. (A) RNA backbone from our baseline model with expected adherence to bonding between nucleotides. (B) Not including the rotation loss $\mathcal{L}_{SO(3)}$ causes nucleotides to have random orientations, preventing them from connecting contiguously. (C) Not including the backbone atom loss $\mathcal{L}_{\text{bb}}$ causes intra-residue atoms to be placed too close to one another, resulting in bonds that should not exist. (D) Not including the all-to-all pairwise distance loss $\mathcal{L}_{\text{dist}}$ causes deformations and fusing between adjacent frames, and unrealistic nucleotide placements, especially along helices and loops.

Appendix D Additional Results

D.1 Evaluation of MMDiff Samples

Here, we document global and local metrics from samples generated by MMDiff. MMDiff has a validity score of 0.0% as all the samples have a poor scTM score below the 0.45 threshold to the RhoFold predicted backbones. Even though none of the samples are valid, we show the average pdbTM scores for the samples, which are trivially low as there are no structures from the PDB that match them due to poor quality.
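The validity percentage reported here is the share of generated backbones whose scTM reaches the 0.45 threshold. A minimal sketch of that computation, assuming one scTM value per generated backbone is already available; the function name and the example scores are illustrative only:

```python
SCTM_THRESHOLD = 0.45  # self-consistency TM-score threshold used for validity


def percent_valid(sctm_scores, threshold=SCTM_THRESHOLD):
    """Percentage of backbones with scTM greater than or equal to the threshold."""
    if not sctm_scores:
        raise ValueError("need at least one scTM score")
    n_valid = sum(1 for score in sctm_scores if score >= threshold)
    return 100.0 * n_valid / len(sctm_scores)


# Toy scores (not measured values): every sample falls below the threshold,
# so validity is 0.0%, as in the MMDiff case described above.
all_below = percent_valid([0.12, 0.30, 0.44])
mixed = percent_valid([0.50, 0.45, 0.20, 0.10])
```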

While MMDiff’s samples locally resemble RNA structures, manual inspection reveals multiple chain breaks and disconnected floating strands, resulting in 0.0% validity. In Figure 14 (Subplot 1), we see that inter-residue $C4'$ distances vary slightly, which manifests as chain breaks and clashes. Furthermore, the Ramachandran plot in Figure 14 (Subplot 4) reveals a more complex angular distribution than found in the training set, which may be a consequence of excessively folded regions or substructures that have folded in on themselves.
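Chain breaks of this kind can be flagged automatically from the distances between C4' atoms of consecutive nucleotides. The sketch below is an illustration under stated assumptions rather than part of our evaluation pipeline: the function names are hypothetical, and the cutoff `max_dist` must be supplied by the caller because no specific threshold is defined in this work.

```python
import math


def consecutive_distances(c4_coords):
    """Distances between C4' atoms of consecutive nucleotides."""
    return [math.dist(a, b) for a, b in zip(c4_coords, c4_coords[1:])]


def find_chain_breaks(c4_coords, max_dist):
    """Indices i where the gap between nucleotides i and i+1 exceeds max_dist."""
    gaps = consecutive_distances(c4_coords)
    return [i for i, gap in enumerate(gaps) if gap > max_dist]


# Toy coordinates in Angstrom (illustrative only): the last nucleotide
# sits far away from its predecessor, mimicking a floating strand.
coords = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (12.0, 0.0, 0.0), (30.0, 0.0, 0.0)]

gaps = consecutive_distances(coords)
breaks = find_chain_breaks(coords, max_dist=7.0)  # 7.0 is an arbitrary cutoff
```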

Figure 13: Validity and novelty of retrained MMDiff’s top-10 generated backbones. (Left) scTM of backbones of lengths 40-150 with the mean and spread of scTM for each length. (Middle) Scatter plot of self-consistency TM-score (scTM) and novelty (pdbTM) across lengths. Vertical and horizontal dotted lines represent TM-score thresholds of 0.45. Overall, MMDiff retrained on our training set does not generate realistic RNA structures.
Figure 14: Structural measurements from samples generated by MMDiff. (Subplots 1-3) Left: histogram of inter-nucleotide bond distances in Angstrom. Middle: histogram of bond angles between nucleotide triplets. Right: histogram of torsion (dihedral) angles between every four nucleotides. (Subplot 4): RNA-centric Ramachandran plot of structures from the training set (purple) and MMDiff’s generated backbones (green).

D.2 Evaluation of Data Preparation Strategies

We include global evaluation metrics for the two data preparation strategies presented in the main text, namely structural clustering and cropping augmentation.

Figure 15: Validity and novelty of top-10 generated backbones from the model trained with only structural clustering. (Left) scTM of backbones of lengths 40-150 with the mean and spread of scTM for each length. (Middle) Scatter plot of self-consistency TM-score (scTM) and novelty (pdbTM) across lengths. Vertical and horizontal dotted lines represent TM-score thresholds of 0.45.
Figure 16: Validity and novelty of top-10 generated backbones from the model trained with structural clustering and cropping. (Left) scTM of backbones of lengths 40-150 with the mean and spread of scTM for each length. (Middle) Scatter plot of self-consistency TM-score (scTM) and novelty (pdbTM) across lengths. Vertical and horizontal dotted lines represent TM-score thresholds of 0.45.