figurec \addbibresourcebibliography.bib
A compact model of Escherichia coli core and biosynthetic metabolism
Abstract
Metabolic models condense biochemical knowledge about organisms in a structured and standardised way. As large-scale network reconstructions are readily available for many organisms of interest, genome-scale models are being widely used among modellers and engineers. However, these large models can be difficult to analyse and visualise, and occasionally generate hard-to-interpret or even biologically unrealistic predictions. Out of the thousands of enzymatic reactions in a typical bacterial metabolism, only a few hundred comprise the metabolic pathways essential to produce energy carriers and biosynthetic precursors. These pathways carry relatively high flux, are central to maintaining and reproducing the cell, and provide precursors and energy to engineered metabolic pathways. Here, focusing on these central metabolic subsystems, we present a manually-curated medium-scale model of energy and biosynthesis metabolism for the well-studied prokaryote Escherichia coli K-12 MG1655. The model is a sub-network of the most recent genome-scale reconstruction, iML1515, and comes with an updated layer of database annotations, as well as a range of metabolic maps for visualisation. We enriched the stoichiometric network with extensive biological information and quantitative data, enhancing the scope and applicability of the model. In addition, here we assess the properties of this model in relation to its genome-scale parent and demonstrate the use of the network and supporting data in various scenarios, including enzyme-constrained flux balance analysis, elementary flux mode analysis, and thermodynamic analysis. Overall, we believe this model holds the potential to become a reference medium-scale metabolic model for E. coli.
Keywords: Metabolic network, functional annotation, gene-reaction mapping, Elementary Flux Mode, biochemical thermodynamics
Abbreviations: EFM: Elementary Flux Mode; FBA: Flux Balance Analysis; GEM: Genome-scale models; MDF: Max-Min Driving Force; RMSE: Root Mean Squared Error; 3GP: glycerol 3-phosphate; 3hmrsACP: (R)-3-Hydroxytetradecanoyl-ACP; 3MOB: 3-Methyl-2-oxobutanoate; Ala: L-alanine; Asp: L-aspartate; ATP: adenosine triphosphate; F6P: fructose 6-phosphate; G3P: glyceraldehyde 3-phosphate; HdeACP: cis-hexadec-9-enoyl-ACP; palmACP: palmitoyl-ACP; PEP: phosphoenolpyruvate; Ru5P: D-ribulose 5-phosphate.
1 Introduction
Metabolic models are a valuable tool for biologists and biotechnologists aiming to elucidate and engineer cell metabolism [sarkar_engineering_2019, gu_current_2019, lu_silico_2023]. In their simplest form, such models encode basic biochemical knowledge, such as network structure, reaction stoichiometries, or known reaction directionalities, in a structured and standardised format. However, the scope of these models is often wider, including information about catalysing enzymes and kinetic parameters, as well as annotations that link model elements to external databases. The fast development of high-throughput experimental and computational pipelines has led to genome-scale reconstructions of metabolism now existing for a wide range of microorganisms of interest [gu_current_2019, fang_reconstructing_2020]. One of them, Escherichia coli, has been the most studied prokaryotic organism and, as such, its metabolism has been the subject of extensive modelling efforts spanning over three decades [reed_thirteen_2003, feist_genomescale_2007, orth_comprehensive_2011, monk_iml1515_2017]. In particular, for the common laboratory strain E. coli K-12 MG1655, the most recent genome-scale reconstruction, iML1515, accounts for 2712 enzyme-catalysed reactions mapped in detail to 1515 genes [monk_iml1515_2017].
Genome-scale metabolic network models (GEMs) provide a comprehensive picture of cell metabolism, and constraint-based modelling algorithms using these models have shown remarkable predictive power, for example, when predicting gene essentiality [bernstein_evaluating_2023]. Nevertheless, working with such large models comes with some disadvantages. In the absence of sufficient constraining or parametrisation, simulations based on large networks can easily lead to biologically unrealistic solutions, which may confound the interpretation of modelling results. For example, when designing and testing gene knockout strategies, genome-scale networks often wrongly predict unphysiological metabolic bypasses which have to be manually inspected and filtered out. Another issue is that, owing to their size and complexity, analysis of genome-scale networks is often limited to relatively simple modelling frameworks, such as flux balance analysis (FBA), that can only answer a limited range of questions. More complex methods, including kinetic modelling, sampling of metabolic flux distributions, or elementary flux mode (EFM) analysis [wortel_metabolic_2018], can be used to gain additional insight into the governing principles and constraints of microbial metabolism, but are difficult to apply to large models. Finally, genome-scale models are often hard to visualise comprehensively, which can make the interpretation of computed flux distributions cumbersome and unintuitive.
For all these reasons, small-scale models of E. coli metabolism are commonly used both for strain design and for the development of novel modelling frameworks. Among these, the E. coli core model (ECC) developed by \citeauthor*orth_reconstruction_2010 [orth_reconstruction_2010] has been widely used in the literature. While being popular as an educational and benchmark tool, ECC is limited in scope: it lacks, among others, most biosynthesis pathways, which would be relevant to many metabolic engineering applications. This limitation was later addressed by \citeauthorhadicke_ecolicore2_2017 [hadicke_ecolicore2_2017], who constructed a medium-scale model, E. coli core 2 (ECC2), as a subnetwork of iJO1366, the most up-to-date GEM available at the time [orth_comprehensive_2011]. ECC2 was obtained through an algorithmic reduction [erdrich_algorithm_2015] that iteratively prunes reactions from a template model while retaining user-specified structural and phenotypic features, such as the ability to grow on a defined set of conditions. However, to enforce the desired phenotypes the algorithm relied only on steady-state stoichiometric modelling and did not account for other important factors, such as thermodynamics, kinetics, or regulatory effects, which are relevant under physiological conditions. Therefore, while the resulting submodel satisfies the imposed stoichiometric constraints by construction, further manual curation is often needed depending on the application at hand.
Here, we introduce iCH360 (named, as per convention, by the initials of the authors followed by the number of genes covered by the model), a manually curated, “Goldilocks-sized” model of E. coli K-12 MG1655 energy and biosynthesis metabolism. The model was derived from the most recent genome-scale reconstruction (iML1515 [monk_iml1515_2017]) and includes all pathways required for energy production and the biosynthesis of the main biomass building blocks, such as amino acids, nucleotides, and fatty acids, but accounts only implicitly for the conversion of these precursors into more complex biomass components. We extended the coverage of annotations pointing to external databases from iML1515, and built custom metabolic maps to facilitate visualising the model and its subsystems [jahan_development_2016]. We complemented the stoichiometric network structure with a curated layer of biological information about catalytic function, protein complex composition, and small-molecule regulation. Finally, we enriched the model with useful quantitative data, including thermodynamic and kinetic constants. Thanks to all these extra layers of information, our model can support a wide range of modelling efforts beyond simple stoichiometric ones.
Below we present the model and demonstrate several use-cases across various modelling scenarios, including enzyme-allocation predictions, EFM analysis, and thermodynamic analysis. The model is freely available in SBML, JSON, and SBTab standard formats, and can be directly used with popular metabolic modelling tools such as the COBRA toolbox [ebrahim_cobrapy_2013].
2 Results
2.1 A compact model of E. coli energy and biosynthesis metabolism
To assemble the iCH360 model, we started from the core metabolic reactions present in ECC and extended them with the pathways required for the biosynthesis of the main biomass building blocks, including the twenty amino acids, the four nucleotides, and both saturated and unsaturated fatty acids (Fig. 1, Table 1, and Supplementary Figures 1-4). In addition, while not performing a comprehensive review of iML1515, we applied a small number of corrections to the original reactions based on literature knowledge (Supplementary Information A.1).
Subsystem | Description | Metabolic map |
---|---|---|
Carbon uptake and transport | Reactions required for the uptake and assimilation of the following carbon sources: glucose, fructose, ribose, xylose, lactate, acetate, gluconate, pyruvate, glycerol, glycerate, succinate, 2-ketoglutarate | Figure 1 |
Central carbon metabolism | Glycolysis, pentose phosphate pathway, pyruvate fermentation, TCA cycle, oxidative phosphorylation | Figure 1 |
Amino acids biosynthesis | Biosynthesis of all 20 amino acids from core metabolism precursors | Supplementary Figure 1 |
Nucleotide biosynthesis | Biosynthesis of purine and pyrimidine nucleotides (and deoxynucleotides) from core and amino acid metabolism | Supplementary Figure 2 |
Fatty-acids biosynthesis | Biosynthesis of saturated and unsaturated fatty acids present in E. coli from acetyl-CoA | Supplementary Figure 3 |
C1 metabolism | One-carbon metabolism | Supplementary Figure 4 |
The final assembled model (Figure 1) comprises compartment-specific metabolites and metabolic reactions mapped to genes, thus qualifying as a medium-scale model ranging in between ECC and iML1515 (Supplementary Figure 5). While similar in scale, our model and ECC2 present a fundamental structural difference. ECC2 was built by systematically removing reactions from its genome-scale parent (iJO1366) [hadicke_ecolicore2_2017]. Thus, its metabolic space spans the production of all compounds consumed in the iJO1336 biomass reaction. In contrast, the metabolic space of iCH360 only reaches biomass building blocks, without explicitly considering peripheral pathways, such as assembly of cell-membrane components, de novo synthesis of cofactors, and active transport of ions in the cell.
To make the model comparable to its parent model iML1515, we computed an equivalent biomass reaction, wherein all biomass requirements not present in our model are converted into an equivalent metabolic cost in terms of precursors, based on manually curated bioproduction routes (Table 2, Supplementary Information A.2).
Pathway | Precursor in iCH360 |
---|---|
Biosynthesis of phosphatidylethanolamine (C16:0 and C16:1) | 3GP, PalmACP (C16:0), HdeACP (C16:1) |
Biosynthesis of KDO2-Lipid-A | F6P, Ru5P, 3hmrsACP |
Murein Biosynthesis | G3P, PEP, F6P, Ala |
NAD/NADP de novo biosynthesis | Asp, DHAP |
FAD de novo biosynthesis | GTP, Ru5P |
CoA de novo biosynthesis | 3MOB, Asp |
Active transport of ions | ATP |
2.2 Range of metabolic conversions described by production envelopes
To check how its significantly smaller size and complexity affects the solution space of our model, we first looked at the maximally achievable biomass production flux under a range of growth conditions (Supplementary Figure 6). Across most conditions considered, the model can achieve biomass fluxes comparable to its genome-scale parent. The main differences exist in anaerobic growth on fumarate, alpha-ketoglutarate (AKG) and glycerol, where our model predicts zero growth, while the GEM achieves some (albeit small) biomass production rate. In practice, fermentation in these scenarios is in fact biologically unrealistic (e.g. [boecker_enabling_2022]). The reduced model is thus a good basis for metabolic simulation frameworks with few constraints such as FBA, since the reduction procedure, at least in this test, limits the original solution space of the GEM to more physiologically relevant regions.
To investigate the metabolic capabilities of our model even further, we looked at production envelopes (projections of the model’s solution space onto a smaller set of dimensions, also known as phenotypic phase planes), describing production rates for biomass and a range of metabolites. Figure 2 shows the computed envelopes for aerobic growth on glucose. The reduced model has a production envelope comparable with iML1515, where the main apparent difference is in the production of acetate. These differences can be ascribed to the absence of a series of degradation pathways in our model. These pathways allow the genome-scale model to channel additional carbon flux towards acetyl-CoA and, as a result, to reach higher theoretical acetate production yields of up to mole acetate per mole of glucose. Since the physiological relevance of such high acetate yields, in the absence of heterologously expressed pathways, has already been questioned [hadicke_ecolicore2_2017], we chose not to include in the model those additional reactions that would allow for higher acetate yields. A similar pattern of agreement with iML1515 was found when comparing production envelopes under different growth conditions (Supplementary Figure 7).
2.3 Connecting reactions to their catalysing enzymes and the enzymes’ protein components
Metabolic models often contain annotations that connect model elements to entries in biological databases, such as EcoCyc [keseler_ecocyc_2017], KEGG [kanehisa_kegg_2017] and MetaNetX [moretti_metanetxmnxref_2021]. However, even for the subset of reactions included in iCH360, the annotations present in iML1515 were incomplete and in part outdated. To fill these gaps, we extended and corrected the original annotations through a mixture of automated querying and manual curation (Figure 3A). Notably, the annotations to EcoCyc, the main knowledgebase for E. coli, are complete so that every metabolic reaction in iCH360 is uniquely mapped to a corresponding EcoCyc ID (with the exception of three reactions, for which a match on the database could only be found for a different choice of redox cofactor, namely NADH instead of NADPH). Additionally, we found deprecated annotations pointing to the MetaNetX namespace, which were consequently updated with the most up-to-date IDs.
Leveraging the complete mapping with the EcoCyc database, we parsed, assembled, and manually curated a data structure that enhances the stoichiometric model with a detailed layer of information about enzymes, polypeptides, and genes related to the network (Methods). This data structure takes the form of a weighted graph, wherein nodes represent biological entities (reactions, proteins, genes or compounds) and edges represent (potentially quantitative) functional relationships between them (Methods, Figure 3B, Supplementary Tables 1 and 2), such as catalysis, regulation, protein modification, and protein subunit composition. This graph contains the gathered information in a unified form, allowing users to perform a number of tasks related to metabolic modelling applications. For example, by explicitly mapping reactions to their catalysing enzymes rather than to the associated genes (which may additionally include protein activators or cofactors), it simplifies the definition of meaningful enzyme capacity constraints in the model. Similarly, since the graph topology implicitly defines associations between reactions and genes, boolean gene-protein-reaction (GPR) rules needed for in silico knockout studies can be generated accounting for both catalytic and non-catalytic requirements for each reaction (Supplementary Information A.3). Crucially, these GPR rules can be re-generated as needed whenever the graph is updated with new nodes or edges. Finally, we employed it to estimate the abundance of protein complexes included in the model from measured polypeptide abundances (Supplementary Information A.4) using an automated procedure (see Result section 2.5).
2.4 Primary and secondary catalysis
Through this annotation graph, metabolic reactions in the model are linked to catalysing enzymes, with more than of the reactions being catalysed by multiple enzymes (isozymes). Such enzymatic redundancy plays an important role, for example, when designing metabolic engineering strategies to prevent flux through a pathway. However, the different isoenzymes of a reaction need not all be equally important, and treating them as completely equivalent may generate inaccuracies in some phenotypic predictions. For example, while phosphofructokinase activity in E. coli (reaction PFK) is known to be carried by two isozymes, pfkA and pfkB, the latter is known to be responsible only for minor activity under normal physiological conditions [babul_phosphofructokinases_1978]. If these differences are not taken into account, metabolic models can overpredict redundancy in the network, for instance when assessing phenotypic predictions in response to gene knockouts [monk_iml1515_2017].
To address this issue and make the model usable in a wider range of modelling scenarios, we qualified catalytic edges in the graph as either primary or secondary catalysis (Methods, Figure 3C). A catalytic relationship between a reaction and an enzyme was annotated as secondary whenever the enzyme, according to experiments, accounts only for negligible activity for the reaction in the wild-type strain. As experimental evidence we considered in vitro and in vivo complementation studies. Through this curation process, a total of catalytic edges were functionally annotated as secondary.
To test how this functional annotation can support phenotypic predictions, we investigated the disruption of each essential reaction in the model resulting from the knockout of each one of its associated genes (Methods). We classified the outcome of each possible knockout as: (i) complete disruption, if a reaction loses all of its catalytic edges, (ii) full primary disruption, if a reaction loses all of its primary catalysis edges, but secondary ones are left, (iii) partial primary disruption, if the reaction loses some, but not all, of its primary catalysis edges, or (iv) secondary disruption, if some or all secondary edges are disrupted, but none of the primary ones. We classified the effect of in silico knockouts for aerobic growth on different carbon sources and compared the results against a large dataset of mutant fitness data, obtained via competitive fitness assays, from [price_mutant_2018].
Based on this analysis, we found that disruptions of primary edges as a result of a knockout were significantly associated with greater fitness losses than disruptions of secondary edges (Wilcoxon rank-sum test, , Figure 3D). In addition, primary disruptions that were only partial were associated with more contained fitness losses. Finally, minor fitness gains could be identified when mutants were associated with secondary and partial primary disruptions, but not when complete primary disruptions occurred. We did not find as significant differences between complete disruptions and full primary disruptions (Supplementary Figure 8), supporting the idea that catalytic relationships annotated as secondary are unlikely to be strong enough to stand-in for disrupted primary ones under normal physiological conditions.
2.5 Enzyme-constrained Flux Balance Analysis with EC-iCH360
To extend the scope and applicability of iCH360, we constructed a version of the model that allows for enzyme-constrained flux simulations, which we denote as EC-iCH360. We constructed EC-iCH360 in the sMOMENT [bekiaris_automatic_2020] format (Methods). The sMOMENT framework is inherently simple and generates the same solution space as more complex model formats, such as GECKO, unless reaction-specific capacity constraints are specified in the latter [sanchez_improving_2017, bekiaris_automatic_2020]. Since it requires unique reaction-enzyme mappings, we used the knowledge graph to remove all secondary catalytic relationships in the model (Section 2.4).
We first parametrized the model by defining a flux-specific enzyme cost for each reaction (in units of grams of enzyme per unit flux), using as values the estimated in vivo turnover number from [heckmann_kinetic_2020]. Using this enzyme-constrained model, we then predicted enzyme abundances and compared them to experimental enzyme abundances, estimated from a dataset of proteomic measurements for aerobic growth on eight different carbon sources [schmidt_quantitative_2016](Methods). This analysis led to generally good predictions, with root mean squared error (RMSE, computed for log10-transformed enzyme abundances) ranging from to (Supplementary Figure 9). In order to assess the nature of residuals between measurements and predictions, we investigated the geometric mean of enzyme abundances across all eight conditions (Supplementary Figure 9, bottom right panel). Notably, averaging predicted and measured enzyme abundances, respectively, across conditions did not significantly improve the RMSE, indicating that the abundances of individual enzymes were systematically over- or under-predicted across conditions.
To increase the predictive ability of the model, we thus adjusted the turnover numbers by fitting them to experimental measurements through a custom heuristic (Methods, Supplementary Information A.5). By simultaneously fitting all available conditions, we ensured that our adjustment procedure is robust to condition-specific biases. Further, by introducing regularisation within our adjustment scheme, we penalised large deviations of parameters from the original dataset, increasing the robustness of the procedure to overfitting. Our adjusted parameter set shows a mean absolute deviation (computed for log10-transformed turnover values) from the original parameter set of (Supplementary Figure 10) and results in significantly better agreement with experimental measurements across conditions (Figure 4A). Further, mean enzyme abundances across conditions are very well predicted with the adjusted parameter set (Figure 4B), implying that residuals between measurements and predictions are now to be attributed to variability in enzyme cost across conditions (which simple frameworks such as enzyme-constrained FBA cannot account for), rather than systematic over- or under-predictions. Thus, each adjusted turnover parameter can be thought of as a ”typical” apparent value, incorporating average saturation trends for an enzyme across growth conditions.
2.6 Elementary Flux Modes in the reduced model variant iCH360red
Despite the small size of iCH360, we found the explicit enumeration of its elementary flux modes (EFMs) to be intractable. This is not necessarily surprising, since the topology of a metabolic network is a much better determinant of EFM count than its sheer size. Metabolic networks can possess different types of redundancies, such as the presence of alternative pathways for the production of the same metabolite, the use of alternative cofactors for the same catalytic step in a pathway, or the presence of alternative transporters for the uptake/excretion of a compound. While knowledge about these redundancies is often valuable, including them in the model can increase the number of EFMs exponentially, hampering or even preventing EFM-based analyses.
To address this issue, we identified and removed a small set of such redundancies in iCH360, using available information from literature, whenever possible, to ensure the most physiologically relevant alternative was maintained (Supplementary Table 3). This resulted a metabolic submodel of iCH360, a model variant that we denote by iCH360red. iCH360red contains metabolic reactions ( less than iCH360) and shares the same production envelopes with its parent model for a number of metabolites of interest (Supplementary Figure 12). While the number of EFMs in iCH360red is still relatively large ( millions for aerobic growth on glucose, see Supplementary Table 4), it is not prohibitive for most types of EFM-based analysis, and their explicit enumeration does not require high-performance computing (Methods).
We used the obtained EFMs to study the possible combinations of achievable growth rates and yields in the network [wortel_metabolic_2018]. To this end, we considered growth on glucose as a scenario and computed, for each EFM from iCH360red, its yield and its achievable cell growth rate, which we estimate based on the enzyme costs defined for the enzyme-constrained model (Methods). Based on this analysis, we identified a front of Pareto-optimal EFMs, along which any increase in the growth rate will necessarily lead to a reduction in yield (Figure 5). Along the Pareto front, we observe a transition from a purely respiratory mode at maximum yield (Supplementary Figure 13) to a mixed respiratory-fermentative mode at maximum growth (Supplementary Figure 14). Quantitatively, the extent of this tradeoff was rather modest: the EFM with maximal yield reaches almost the maximal growth rate, so only minor gains in growth rate can be achieved by using other, fermentative modes along the Pareto front. However, it is worth noting that this analysis was performed using a simple capacity-based enzyme cost function, which summarises different determinants of enzyme cost such as driving force and substrate saturation [noor_protein_2016] in one single value, assumed to be constant. Hence, repeating the analysis with a more complete enzyme-cost function [noor_protein_2016] could help elucidate the nature of this tradeoff. To demonstrate that the shape and size of the Pareto front depend strongly on growth conditions, we also simulated an environment with very low oxygen levels. We implemented this by increasing the flux-specific enzyme cost of the oxygen-dependent reaction in the respiratory chain, equivalent to assuming a lower enzyme efficiency due to a lower oxygen level (Methods). Results (Supplementary Figure 15) show a much broader front of Pareto optimal EFMs, indicating that the nature of the observed tradeoff is indeed condition-dependent. Notably, a similar dependence of the Pareto front on extracellular oxygen availability has been previously observed in a small-scale model of E. coli core metabolism [wortel_metabolic_2018].
2.7 Saturation FBA and modelling of overflow metabolism
In order to study the effect of external conditions on optimal metabolic strategies in more detail, we used another framework that does not require an enumeration of EFMs and that allows for additional flux bounds, for example, a lower bound on the ATP consumption for cell maintenance. The saturation FBA (satFBA) framework [muller_resource_2015] is a variant of enzyme-constrained FBA, wherein a fixed enzyme cost per flux is assumed for all metabolic reactions in a model, except for the substrate transporter, for which a complete kinetic law is used (Methods). Since the external substrate concentration is a simple parameter, screening this concentration is equivalent to screening the values of the transporter efficiency. Here we used satFBA to simulate how the growth-maximising solution of the network varies in response to changing extracellular glucose concentration. By solving the satFBA problem for a range of glucose concentrations, we predicted the dependence of the cell’s growth rate on substrate concentration, resulting in the typically observed Monod curve (Figure 6A). If the problem does not contain any further flux bounds (so that the magnitude of fluxes at the optimum is solely limited by the maximum enzyme availability) the solution of satFBA problems will be an elementary flux mode [muller_resource_2015]. Hence, in this case we can use satFBA to explore how a cell should switch across elementary modes as a function of the growth environment.
At low glucose concentrations, the glucose transporter operates at low saturation and glucose uptake is enzymatically expensive, leading to a high-yield, purely respiratory metabolic mode at the optimum (Figure 6B and C). As substrate availability is increased, the cost of substrate uptake decreases and higher growth rates are achieved by switching to lower yield, acetate-secreting modes, in line with typical observations of overflow metabolism in E. coli [basan_overflow_2015, chen_energy_2019]. However, the satFBA formalism can also be used with additional flux bounds, thus going beyond EFM-based analysis. For example, if a positive lower bound on ATP hydrolysis is added as a maintenance requirement, optimal solutions to the satFBA problem will no longer be elementary modes, and the yield of the optimal solution no longer follows a piecewise constant profile (Supplementary Figure 16)
2.8 Equilibrium constants, thermodynamic forces, and thermodynamically feasible states
Living systems operate out of thermodynamic equilibrium, and thermodynamics places strong constraints on the operation of metabolic systems. In any metabolic state, the flux directions must follow the signs of thermodynamic forces, which depend on metabolite concentrations and equilibrium constants. To provide these constants as parts of our model, we used the component contribution framework [noor_consistent_2013] to estimate the standard Gibbs free energy () of each reaction (Methods). These estimates account for compartment-specific chemical environment parameters, such as pH, pMg, and ionic strength, and were corrected to account for protons and charge translocation in multi-compartment reactions.
The resulting parameter set covers the vast majority of the model (over 97 % of metabolic reactions, with the remaining being not covered by the component contribution database used here) and accounts for the uncertainty in the estimates through a multivariate covariance matrix. Accounting for correlations between the different values becomes important when imposing thermodynamic constraints on the model. For example, the fatty acid biosynthesis subsystem in the model consists of repeated elongation cycles, where a short sequence of chemical transformations is repeatedly performed on a growing carbon chain. As a result, even if the for each reaction in the pathway is known with some uncertainty, this uncertainty is tightly correlated across the reactions, which constrains the set of achievable thermodynamic states in the network.
Using this set of thermodynamic constants, we first tested whether some typical flux distributions obtained from the model are thermodynamically feasible. To this end, we considered flux distributions generated by parsimonious Flux Balance Analysis (pFBA) across growth conditions and computed their max-min driving force (MDF) [noor_pathway_2014], accounting for uncertainty in the estimates (Methods). We found a positive MDF for each of the flux distributions, indicating that all pFBA solutions tested are thermodynamically feasible. Notably, we found that the computed MDF values cluster very clearly in three groups, corresponding to aerobic growth on glycolytic substrates, aerobic growth on gluconeogenic substrates, and anaerobic conditions (Figure 7A).
Having confirmed that our reference FBA-derived flux distributions are thermodynamically feasible, we then employed an alternative flux prediction method that ensures thermodynamically feasibilty by construction. The probabilistic metabolic optimisation (PMO) framework [gollub_probabilistic_2021] uses a mixed-integer quadratic programming approach to compute a set of fluxes, metabolite concentrations, and reaction driving forces which is probabilistically most in agreement with experimentally measured metabolite concentrations (Methods). Using this framework, we computed a maximum-likelihood thermodynamic state for the model, simulating aerobic growth on glucose. In the thermodynamic state computed by PMO, we found that all metabolites lie within physiologically reasonable concentrations (1 M – 1 mM). In addition, all ”anomalous concentrations” identified by the framework (metabolite concentrations lying more than one standard deviation away from the mean experimental value) had been identified and explained previously [gollub_probabilistic_2021], and can potentially be addressed by lumping together those reactions for which substrate channelling is known to happen in literature.
We used the PMO-derived thermodynamic state to identify candidate bottlenecks, in terms of enzyme demand, across the network. To this end, we first computed the flux-force efficacy of each reaction in the thermodynamic state [noor_pathway_2014] (Methods). The flux-force efficacy is a unitless quantity, ranging from to , denoting the ratio between net flux (forward minus backward flux) and total flux (forward plus backward flux) of a reaction. Reactions operating at low flux-force efficacy have a lower net flux due to two reasons: (1) the forward flux is lower in absolute terms, and (2) the higher backward flux is counter-productive and subtracts from the forward flux. Therefore, to achieve a given required net flux, the cell has to invest more resources in maintaining a higher enzyme level. To identify potential thermodynamic bottlenecks leading to high enzyme costs in specific reactions, we thus screened all reactions for their predicted flux-force efficacy and flux (Figure 7B) and identified those predicted to operate at low efficacy, while carrying significant flux (Figure 7B, labelled points). To assess the extent to which the thermodynamic operating states predicted by PTA are predictive of enzyme investment, we split reactions in two groups based on whether their predicted flux-force efficacy sits above or below , and compared the distributions of measured enzyme abundances between the two groups (Methods, Supplementary Figure 17). While there are many determinants of enzyme abundance beyond thermodynamics, including enzyme turnover, affinity, or regulation, we observed a significant difference between the two distributions (, two-sided Wilcox sum-rank test), with the low efficacy having approximately a 3-fold higher median enzyme abundance than the high efficacy group.
3 Discussion
\RaggedRight Model structure | Notes | Data source |
\RaggedRight Network (reaction stoichiometries) | Selected reactions from iML1515, hand-curated | [monk_iml1515_2017] |
\RaggedRight Annotations to external databases | Parsed from iML1515, extended, and updated | [monk_iml1515_2017], manual curation |
\RaggedRight Network graphics | Escher maps of the full model and its subsystems | |
\RaggedRight Biological knowledge supporting the stoichiometric model | Catalytic relationships, protein complex composition, small-molecule regulations, and others | [keseler_ecocyc_2017] |
\RaggedRight Physico-chemical parameters mapped to model | ||
\RaggedRight Thermodynamic constants () | Account for compartment-specific chemical environment (pH, pMg, ionic strength, and potential) and include corrections for reactions occurring across compartments. | [noor_consistent_2013], manual curation |
\RaggedRight values | in vivo estimates of catalytic turnover numbers | [heckmann_kinetic_2020] |
\RaggedRight Typical values | Adjusted estimates of turnover numbers, fitted to proteomic data, accounting for typical saturation levels across growth conditions | [heckmann_kinetic_2020] [schmidt_quantitative_2016] |
\RaggedRight protein MWs | molecular mass for all proteins/protein complexes covered by the model | [keseler_ecocyc_2017] |
\RaggedRight Cell-state data mapped to the model | ||
\RaggedRight Protein abundances | Abundance across different growth conditions for proteins/protein complexes covered by the model, estimated from experimentally measured polypeptide abundances | [schmidt_quantitative_2016] |
\RaggedRight Metabolite concentrations | Measured metabolite concentrations across growth conditions | [gerosa_pseudo-transition_2015] |
\RaggedRight Metabolic fluxes | Measured metabolite fluxes for aerobic growth on glucose | [heckmann_kinetic_2020], [long_metabolic_2019] |
\RaggedRight Example applications shown | ||
\RaggedRight Phenotypic phase plane analysis | See Figures 2, S7 | |
\RaggedRight Enzyme-constrained FBA | With enzyme-constrained version of the model, EC-iCH360, constructed in sMOMENT format [bekiaris_automatic_2020] | |
\RaggedRight EFM analysis | With reduced model variant iCH360red; see Figures 5 | |
\RaggedRight Saturation FBA | Predicts metabolic fluxes as a function of external substrate concentration | |
\RaggedRight Max-Min Driving force | Formulation from [noor_pathway_2014], extended to account for correlated uncertainty in the themodynamic constants estimates. | |
\RaggedRight Probabilistic Metabolic Optimisation | Prediction of thermodynamic states (metabolite concentrations and relative fluxes) maximally consistent with measured metabolite concentrations [gollub_probabilistic_2021] |
Here we presented iCH360, a medium-scale metabolic model of E. coli covering central and biosynthesis metabolism, together with the associated data and metabolic maps and results from several analysed use-cases. Similarly to previously constructed core models [orth_reconstruction_2010, hadicke_ecolicore2_2017], this model trades metabolic coverage for usability, interpretability, and ease of visualisation. It is well suited whenever a relatively small, highly curated network is desired, when computationally demanding analyses are to be performed, or as an educational tool in the field of metabolic modelling. When comparing some key properties of this model with those of its parent genome-scale model, we observed only small differences in the achievable biomass and product yields across a range of growth conditions, validating that, despite its contained size, the model captures the most salient metabolic features of the genome-scale network.
To make this model easily usable in a variety of applications, we enriched the stoichiometric network structure by a curated layer of biological knowledge in the form of a graph data structure. This graph encodes information about biological entities in the network in a structured, ready-to-use format, including catalytic relationships between reactions and enzymes, the stoichiometric composition of protein complexes, and small-molecule regulation interactions. Further, we mapped to the model a range of quantitative parameters, including in vivo turnover number estimates and thermodynamic constants, extending the use of the model beyond simple stoichiometric analysis. A summary of the biological knowledge captured by iCH360 is shown in Table 3.
Owing to its medium size and the high level of curation, iCH360 lends itself to a wide range of modelling frameworks. Here, we demonstrated some representative examples. These include modelling the allocation of metabolic proteome via the enzyme-constraint variant of the model EC-iCH360, enumerating and analysing elementary flux modes in the network via the reduced model variant iCH360red, and performing thermodynamic-based analysis using the provided set of thermodynamic constants.
While, at this stage, the parameterisation of the model is limited to turnover numbers and thermodynamic constants, we anticipate that additional parameter sets can easily be mapped to the model. Facilitated by extensive annotations present in the model and by the recent development of machine-learning-enabled kinetic constant estimators [kroll_deep_2021, li_deep_2022, kroll_turnover_2023, gollub_enkie_2023], a complete kinetic parameterisation of iCH360 is thus a valuable potential future development. In addition, probabilistic estimates of kinetic parameters [gollub_enkie_2023, lubitz_parameter_2019] can be combined with our existing thermodynamic parameterisation, making it possible to account for (potentially correlated) parameter uncertainty throughout the kinetic modelling process.
In light of the above results, we believe that iCH360 holds the potential to become a reference metabolic model for E. coli.
4 Methods
4.1 Model assembly and curation
All the relevant pathways included in the model were assembled and curated based on information available in the EcoCyc [keseler_ecocyc_2017] and KEGG [kanehisa_kegg_2017] databases. The respective reactions were then extracted from iML1515 and parsed into a new model. In order to compute an equivalent biomass reaction, we first collected all the pathways required for the production of any component present in the iML1515 biomass reaction, but not in our model. These additional pathways (available in the repository supporting this manuscript) were manually curated based on available literature and database annotation to ensure they represent the biologically most relevant bioproduction route for each biomass component. By adding these pathways to iCH360, we obtained an extended model that was able to predict growth rates directly through the original iML1515 biomass reaction. The equivalent biomass function was then computed based on a reference flux distribution computed on this extended model, as explained in Supplementary Information A.2. Model assembly, manipulation and validation were performed using the COBRA Toolbox [ebrahim_cobrapy_2013]. Extension of database cross-annotation for the model reactions was performed through a mixture of automated database query and manual curation.
4.2 Network graphics
iCH360 can be visualised through a series of custom-built maps using the metabolic visualisation tool Escher [king_escher_2015] (see Figure 1). There are three main ways to visualise the model or solutions thereof. First, a complete map of the model, including all of its reactions, can be used (Figure 1). In order to provide a more compact representation of the network, a compressed second variant of the same map was constructed (Supplementary Figure 18). Here, long biosynthetic linear pathways were lumped into single pseudo-reactions, which only show the net production or consumption of metabolites by the pathway while omitting intermediates. Finally, individual maps for each of the main subsystems in the model are provided.
4.3 Graph data structure for linking reactions to enzymes and proteins
Information about the enzymes and proteins behind the stoichiometric model was collected in a knowledge graph. To build this graph, all available data on reaction-protein association and subunit composition were retrieved by automatic querying of the BioCyc database through its REST-based data retrieval API (https://biocyc.org/web-services.shtml). This information was then extended and curated based on a comparison with existing iML1515 GPR annotations. The resulting data were used to generate a directed graph wherein nodes represent biological entities (such as reactions, proteins, genes, and metabolites) and edges represent functional dependencies across them, including catalysis, subunit composition, post-translational modifications, and others. A complete list of node and edge types in the graph are provided in Supplementary Tables 1 and 2, respectively. All polypeptide nodes were annotated with their molecular mass (parsed from EcoCyc), enabling recursive computation of the molecular mass of any protein node in the graph (see Supplementary Information A.3). All manipulation and analyses of the graph data structure were performed using the NetworkX Python package [hagberg_exploring_2008], and the final data structure provided to the user in both Cytoscape and GML formats.
4.4 Primary and secondary catalytic edges
Catalytic edges in the graph, i.e. edges connecting reaction and enzyme nodes, were manually annotated as either primary or secondary, based on available evidence for the activity of each enzyme in the model with respect to its associated reactions. More specifically, a catalytic edge between a reaction-enzyme pair was labelled as secondary whenever the enzyme had been shown in literature to account for only minor catalytic activity for a reaction when compared to another isozyme. In this case, the references to the relevant literature were included as metadata for that edge. Whenever sufficient information was not available, all isozyme edges for a reaction were conservatively treated as primary.
4.5 Catalytic disruption analysis
For the catalytic disruption test, we identified condition-specific essential reactions in the following way: in the model, for each condition considered, we determined the reactions whose knockout led to inability to produce biomass. A small number of known false-positives, which are essential in iCH360 only due to its lack of certain reactions or pathways, were excluded from the analysis. For each essential reaction, we considered all associated genes and investigated the removal of each of them, individually, from the graph. The effect of the simulated knockout was propagated across the graph by removing all nodes from the graph for which the gene is required according to the nodes’ boolean GPR rules (see Supplementary Information A.3). Finally, the outcome of each simulated knockout was catalogued based on the reaction-level disruption it caused (see main text). Whenever multiple reactions were disrupted by the knockout of a gene, the strongest disruption among them was assigned to the knockout, in the following precedence order: complete disruption, full primary disruption, partial primary disruption, secondary disruption. The analysis was repeated for a total of growth conditions, and each condition-gene pair, labelled with the assigned disruption type, was compared with experimentally measured mutant relative fitness (averaged across available replicates for that condition-gene combination), using the dataset from [price_mutant_2018].
4.6 Construction of the enzyme-constrained metabolic model
In the sMOMENT formalism, positive and negative fluxes in each reaction are formally described, respectively, as positive fluxes in separate “forward” and “backward” versions of the reaction. To construct the enzyme-constrained model in sMOMENT format, reversible reactions in the model were therefore duplicated in the model to separately represent fluxes in forward or backward direction. Direction-specific turnover number estimates were parsed from [heckmann_kinetic_2020] and a default value of was used for transporters as in [heckmann_machine_2018]. To account for the fact that [heckmann_kinetic_2020] reports values as turnover numbers per polypeptide, the values were multiplied by the number of polypeptide subunits in each enzyme. For each reaction-enzyme pair, an enzyme cost per unit flux (in units of ) was then defined as:
(1) |
where is the enzyme cost of reaction-enzyme pair , is the turnover rate estimate for the pair (here, in units of ), is the molecular mass of the enzyme involved (in ), and is a unitless condition-specific scaling factor (typically interpreted as an average enzyme saturation value). Then, a unique enzyme was assigned to each reaction. To this end, secondary catalytic relationships were first discarded and, for reactions with multiple annotated primary isoenzymes, the enzyme with the highest measured abundance in the integrated PAX Database [huang_paxdb_2023] was heuristically chosen. Based on these costs per unit flux, an enzyme capacity constraint was introduced as:
(2) |
where denotes the flux in reaction (in ) and is a parameter denoting the total amount of enzyme mass (in g/gDW) that can be allocated to the flux mode. The constraint is enforced by augmenting the stoichiometric matrix of the model with an additional enzyme supply pseudoreaction (upper-bounded by ) and an enzyme pool pseudometabolite consumed in each reaction with stoichiometry [bekiaris_automatic_2020].
4.7 Adjustment of turnover numbers across conditions
Turnover numbers from [heckmann_kinetic_2020] were adjusted based on the nonlinear programming (NLP) formulation detailed in Supplementary Information A.5). The reference flux distributions required by the procedure were obtained using the original (unadjusted) set of turnover numbers and the bounds on allowable adjustments ( and in Eq. 29) set to (corresponding to a maximal 100-fold increase or reduction of each parameter from the original value). The linear program used to obtain reference flux distributions was formulated and solved with GUROBI [gurobi], while the nonlinear program used for turnover adjustment was formulated and solved with the open-source optimisation package CasADi [Andersson2019]. To investigate the effect of the ridge regularisation hyperparameter ( in equation 29), we solved the adjustment problem for a broad range of values of this parameter and, each time, we computed the RMSE between measurements and predictions of enzyme abundance, as well as the mean absolute deviation of between original and adjusted turnover values, both computed for log10-transformed data (Supplementary Figure 11). Based on this information, a value of , after which any further decrease in the amount of regularisation results in marginal reduction of the RMSE, was chosen to compute the final set of adjusted turnover numbers. Upon reparametrisation into apparent turnover numbers, (see Supplementary Information A.5.3), this set of adjusted parameters was used to parametrise the enzyme-constrained model variant EC-iCH360.
4.8 Enzyme allocation predictions
To validate enzyme allocation predictions against experimental values, we first retrieved measured polypeptide abundances for each growth condition from [schmidt_quantitative_2016] and imputed missing values using, whenever available, abundance values from the PAX Database [huang_paxdb_2023]. Then we estimated enzyme counts across conditions from polypeptide counts using non-negative least-squares estimation (Supplementary Information A.4). Next, we converted them into mass abundances (in units of ) based on the molecular mass of each complex (see Supplementary Information A.3) and assuming a cell mass of (BIONUMBER ID 103904 [milo_bionumbers--database_2010]). Enzyme allocation predictions for each condition were then computed via EC-iCH360 by fixing the growth rate to the experimentally measured one from [schmidt_quantitative_2016] and minimising the total enzyme cost. To compute predictions using the turnover number estimates from [heckmann_kinetic_2020], the average saturation coefficient was estimated, for each condition as:
(3) |
where denotes the index set of model enzymes for which measurements are available and is the total measured model enzyme abundance for that condition. This value of was then used to scale predicted enzyme abundances before comparing them to the experimentally measured ones. To compute predictions with the adjusted parameter set, the scaling factors for each condition were obtained as part of the fitting procedure (see Supplementary Information A.5). All reported root mean squared errors were computed on log10 transformed enzyme abundances, excluding zero-predictions from the computation.
4.9 Enumeration of Elementary Flux Modes
Elementary Flux Modes (EFMs) enumeriation for the submodel iCH360red was performed, for each growth condition, using EFMtools [terzer_large-scale_2008]. Filtered modes (Supplementary Table 4) were defined as those supporting nonzero biomass flux and, for aerobic modes, nonzero oxygen uptake. In addition, in the aerobic case, filtered modes exclude those carrying flux in either of three reactions – Pyruvate-Formate Lyase (PFL), Fumarate Reductase (FRD2), and the menaquinone-dependent Dihydroorotate dehydrogenase (DHORD5) – known to be only physiologically active under anaerobic conditions [zhang_inactivation_2001, cecchini_succinate_2002, andrews_anaerobic_1977]
4.10 Growth/yield tradeoff analysis
To analyse trade-offs between growth rate and yield, we computed the yield of each EFM as the ratio of its biomass flux to its glucose uptake flux. In FBA models, fluxes are given in units of mmol/gDW/h, except for the biomass flux, whose conventional unit is because the biomass flux in these models is scaled to describe the biomass production rate per amount of biomass present. This means that our yields are given in units of gDW/mmol. To determine the cell growth rate allowed by a flux distribution, we first computed its absolute enzyme cost
(4) |
where is the enzyme cost of the flux distribution (an enzyme mass, measured in g/gDW), is the flux of reaction , and is the enzyme cost per unit flux in reaction , computed as per equation 1 using the set of adjusted turnover numbers. Clearly, both and depend on the particular choice of scaling of the mode (though their ratio, , doesn’t). In order to obtain an estimate for the growth rate that is scale-invariant, we thus rescaled all modes to the same enzyme cost (in g/gDW), and looked at the resulting flux through the biomass reaction. More formally, the achievable growth rate (in ) for an (arbitrarily scaled) EFM with biomass flux and enzyme cost was computed as:
(5) |
For simplicity, we approximate by a constant value of g/gDW, which we obtained by taking the minimum enzyme investment required by the enzyme-constrained model to support the experimentally measured growth rate reported in the proteomic dataset [schmidt_quantitative_2016] for the condition of interest in this analysis (aerobic growth on glucose).
To simulate low-oxygen conditions, the cost per unit flux of all oxygen-consuming reactions (CYTBO3_4pp, CYTBDpp, CYTBD2pp) was increased by a -fold, mimicking the physiological state in which these reaction operate at low substrate saturation.
4.11 Saturation FBA analysis
Saturation FBA calculations were performed by optimising biomass production in the enzyme-constrained model, setting the saturation coefficient ( in Eq. (1) to the value fitted as part of the turnover number adjusting procedure (Section 4.7) for all reactions except the glucose transporter (GLCptspp). The saturation of the glucose transporter, , was computed as a function of external glucose concentration assuming irreversible Michaelis Mentens kinetics, so that:
(6) |
where is the external glucose concentration (in mM) and a value of mM was used for the Michaelis constant [wortel_metabolic_2018]
4.12 Component contribution estimates of thermodynamic constants
Estimates of the free energies of reactions, and their uncertainties, were obtained using the component contribution framework previously described in [noor_consistent_2013]. Several reactions in the model involve protein side groups, such as the acyl-carrying protein (ACP), or cofactors, such as glutaredoxin, for which a decomposition in terms of chemical groups is not available. As a result, thermodynamic constants for these reactions cannot be directly estimated through database-based implementations of the component contribution method, such as eQuilibrator [beber_equilibrator_2022]. However, since these non-decomposable protein groups are conserved in all reactions within our model, their net contribution to the reaction thermodynamics is, at least from a group contribution perspective, null. Hence, we first augmented the group incidence matrix of the eQuilibrator database (see Section S1.1 in [beber_equilibrator_2022]) by treating protein groups as non-decomposable units and manually curated the group decomposition of the metabolites containing them. The mean and covariance of the Gibbs standard free energy of reaction () estimates for all reactions in the model were then computed using this augmented matrix as described in [beber_equilibrator_2022].
In order to compute transformed Gibbs free energies of reactions (), for which an exact chemical definition of each metabolite is required, protein side groups were replaced by an appropriate chemical moiety best approximating the chemical environment of the metabolite. Specifically, ACP- groups were replaced by a phosphopantetheine group, the natural prosthetic group of acyl-carrier proteins, with a methyl group at the attachment site to the protein scaffold. Similarly, glutaredoxin was replaced by its Cys-Pro-Tyr-Cys active site, with the two cysteines being either free (for the reduced form of the cofactor) or linked by a disulfide bridge (for the oxidised state).
Corrections for reactions occurring across different compartments were computed as described in [beber_equilibrator_2022], using compartment-specific pH, pMg, ionic strength and potentials from [gollub_enkie_2023].
4.13 Max-min driving force computation
The max-min driving force (MDF) of each reference flux distribution was computed by extending the original formulation described in [noor_pathway_2014] to account for correlated uncertainty in the estimates of the thermodynamic constants. Let be a random vector representing the (uncertain) standard Gibbs free energy of reaction for the reactions in the network. Importantly, this vector includes only balanced metabolic reactions and excludes pseudoreactions such as exchange reactions. To describe our uncertain knowledge, we assume that the vector follows a multivariate normal distribution:
(7) |
where and are the mean and covariance of the estimates obtained through the component contribution methods. This random vector can equivalently be expressed as:
(8) |
where is a standard normal random vector in , with , and is a square root of the covariance matrix, i.e. it satisfies:
(9) |
In order to integrate this probabilistic description within a typical constrained-optimisation formulation, we define the set as the -level confidence region around the mean of . Using Eq. (7), and noting that the squared norm of a standard normal random variable is known to be Chi-squared distributed, we can represent this set as:
(10) |
where is a free parameter vector and denotes the quantile function (inverse cumulative distribution function) of a chi-squared distribution with degrees of freedom. We can thus account for the uncertainty in thermodynamic estimates by treating the free energies of reaction as decision variables, rather than known parameters, and constraining their value to belong to (). For a given reference flux distribution , this results in the following quadratically-constrained program (QCP):
(11) | ||||||
s.t. | ||||||
where is the min driving force (or the MDF after optimisation), is a vector of log-metabolite concentrations, are (log) lower and upper bounds on these concentrations, is the stoichiometric matrix of the model, is the Boltzmann gas constant, and is the temperature used for the computation of the free energy estimates. For our MDF calculations, we used a confidence level of and set the bounds on the concentration of all metabolites to the physiologically plausible range of (, ). The above quadratically constrained program was formulated and solved the the GUROBI package [gurobi].
4.14 Probabilistic thermodynamic analysis
Probabilistic metabolic optimisation (PMO) of the model was performed using the PTA python package [gollub_probabilistic_2021], providing the software with the curated thermodynamic estimates generated for this model. The default values available through the package for growth on M9 medium with glucose (which include measurements from [gerosa_pseudo-transition_2015]) were used as priors for the concentration of metabolites, and the growth rate was bounded below by the reported value in [gerosa_pseudo-transition_2015]. Further, for this analysis, which requires every flux in the solution to have a well-defined directionality, two transhydrogenase reactions (NADH17pp, and THD2pp) were allowed to operate reversibly [gollub_probabilistic_2021].
To analyse the thermodynamic state computed by PTA, we compute the flux-force efficacy for each reaction as (see Figure 8):
(12) |
where is the Gibbs free energy of reaction in the thermodynamic state, is the universal gas constant and is the temperature considered for the analysis (). To compare the PTA-predicted flux force efficacies with experimental measurements of enzyme abundance, we selected reactions with carrying at least of the glucose uptake flux, and pooled these into two groups, corresponding to ( reactions) and ( reactions). corresponds to kJ/mol.
5 Author contributions
A. B.-E. conceived the project and supervised its initial stage. M.C and H.H. assembled and curated the model. M.C. wrote software, performed simulations and analysed results. W.L., E.N. and H.H. supervised the project and contributed to the analysis and interpretation of results. M.C., H.H., W.L., and E.N. wrote the manuscript.
6 Competing interests
The authors declare no competing interest.
7 Code availability
The model, together with all relevant data, is available on Github at https://github.com/marco-corrao/iCH360. The code and files required to reproduce all analysis in this manuscript are available on Github at https://github.com/marco-corrao/iCH360_paper.
Appendix A Supplementary Information
A.1 Changes to reactions from the model iML1515
After assembling the model as a subset of reactions from the genome-scale model iML1515, a few minor corrections were applied to some of the reactions based on evidence gathered from the literature. Note, however, that these changes did not result from an exhaustive process of review of the parent model. Such corrections include:
-
•
In iCH360, the membrane-bound transhydrogenase reaction (THD2pp) only translocates 1 proton across the periplasmic membrane, as opposed to the 2 protons translocated in the iML1515 reaction [bizouarn_nucleotide_2005]
-
•
The gene-protein rule (GPR) of gene glxK (b0514) was reassigned to GLYCK2, which produces 2-phosphoglycerate. Meanwhile, the reaction GLYCK in from iML1515 (which produces 3-phosphoglycerate) was removed [zelcbuch_vivo_2015, bartsch_only_2008]
-
•
The NAPH-dependent homoserine dehydrogenase reaction (HSDy) was made irreversible towards the homoserine production direction [he_optimized_2020]. To avoid having a reaction irreversible in the backward direction, substrate and products of the reactions were flipped.
-
•
The succinate transport reaction SUCCt1pp was made irreversible in the export direction, enforcing the use of the more thermodynamically favourable SUCCt2_2pp (which translocates two protons instead of one) for succinate import. To avoid having a reaction irreversible in the backward direction, substrate and products of SUCCt1pp were flipped.
A.2 Computation of an equivalent biomass reaction
In constraint-based models of metabolism, the biomass reaction summarises the production of all molecules that are not explicitly described by the model. These are typically macromolecules such as proteins or polynucleotides. Which metabolites are drained from the model network towards these other parts of metabolism, and in what proportions, depends on what compounds are described by the model. Hence, starting from an existing model, taking only a subset of the reactions, but leaving the biomass reaction unchanged would lead to inconsistencies.
To construct a biomass reaction for iCH360 that corresponds equivalently to the biomass reaction in iML1515, the following method was used. First, we collected all the pathways required for the production of any component present in the iML1515 biomass reaction, but not in our model. These additional pathways (available in the repository supporting this manuscript) were manually curated based on available literature and database annotation to ensure they represent the most biologically-relevant bioproduction route for each biomass component. By adding these pathways to iCH360, we obtained an extended model (iCH360ext), which was able to grow directly through the original iML1515 biomass reaction. Let and denote the number of reactions in iCH360 and iCH360ext, respectively. Next, a reference flux distribution is computed on the extended model through FBA. We can express this flux vector as:
(13) |
where is the subset of its fluxes corresponding to reactions present in iCH360, is the subset of fluxes corresponding to the reactions added to create the extended model iCH360ext, and is the flux through the biomass reaction. If the fluxes were to be imposed on iCH360, a number of metabolites would necessarily remain unbalanced, that is:
(14) |
where denotes the stoichiometric matrix of iCH360. Hence, we compute the equivalent biomass reaction as the additional reaction (column of ) required to balance the system while achieving the same biomass flux of the reference distribution. That is:
(15) |
which implies:
(16) |
The stoichiometry of this equivalent biomass reaction is dependent on the choice of reference flux distribution and, generally, may be different for different growth conditions. This results from the fact that, depending on the condition, the extended model may produce the same biomass component through different pathways, which would then be converted by our procedure into different equivalent costs of precursors in the sub-model. Nevertheless, the additional biosynthesis pathways we used to form the extended model do not allow for alternative routes to biomass, hence we found the equivalent biomass reaction to be, in this case, unique across conditions.
A.3 Recursive computation of attributes in the annotation graph
Using the knowledge graph supporting the stoichiometric model of iCH360 (Section 3 in the man text), a number of useful properties can be computed based on simple recursive operations. Here, we outline how such an approach was used to compute i) the molecular mass of all enzyme complexes in the model, based on known masses for all polypeptides and ii) the computation of boolean rules (GPRs) linking all reaction and protein nodes in the graph to the model genes. For a description of the different types of nodes and edges mentioned in this section, see Supplementary Tables 1 and 2.
A.3.1 Computation of molecular masses for all protein nodes
In order to compute the molecular mass of all protein nodes in the graph, the protein nodes corresponding to polypeptides were first annotated with their molecular masses, readily available on the EcoCyc database. Then the molecular mass of all other protein nodes is estimated recursively as follows. Let denotes the index set of all protein nodes and the index set of protein components of node , i.e. all nodes connected to node by a subunit composition relationship. The molecular mass of any protein node , , is computed as:
(17) |
where denotes the (known) molecular mass of polypeptide node and denotes the weight of the edge between and .
A.3.2 Computation of gene-protein-reaction Rules
Similarly, boolean gene-protein-reaction (GPR) rules can also be computed recursively. We compute the GPR expression of a node depending on its type (reaction, protein, or logical) and the type of edges connecting it to its neighbouring nodes. Particularly, the GPR expression of node , , is computed according to the following rules:
-
•
If node is a logical AND (logical OR) node, its GPR is the AND (OR) of the GPR of its children nodes.
-
•
If node is a polypeptide, its GPR expression is, trivially, the gene associated to the node.
-
•
If node is a multimeric protein, its GPR is computed as the AND of the GPR expressions of the children nodes connected to by a ”subunit composition” edge.
-
•
If node is a modified protein, its GPR is computed as the AND of the children nodes connected to by a ”protein modification” or ”protein modification requirement” edge.
-
•
If node is a reaction, its GPR is first constructed as the OR of the GPRs of its catalytic children (nodes connected to via a catalytic edge, representing isoenzymes for the reaction). To account for non-catalytic protein requirements of the reaction this expression is then ANDed with the GPR expressions of all non-catalytic children (nodes connected to via a ”non-catalytic requirement” edge). More formally, denoting with and the index sets of catalytic and non-catalytic children nodes of , respectively, we have:
(18)
A.4 Estimation of enzyme complex abundances
In order to estimate the abundance of all enzyme in the model from proteomics data, we use the model graph (Main Text, Section 4.3) to construct a stoichiometric map between enzymes and polypeptides. this map takes the form of a matrix , where is the number of enzymes in the model and the number of polypeptides, such that denotes the stoichiometry of polypeptide in enzyme . Some polypeptides may be part of additional complexes which are not part of the model. Hence, using the available annotation to the EcoCyc database, we identify these instances and construct a matrix by augmenting with additional rows corresponding to these out-of-model complexes.
With this mapping at hand, we assume that polypeptide abundances, are related to enzyme abundances (including the required additional complexes not accounted for in the model) by
(19) |
Hence, given a vector of experimental measurements of polypeptide abundances , we estimate enzyme abundances by solving the nonnegative least square (NNLS) problem:
(20) | |||
A.5 Adjustment of turnover numbers based on proteomics measurements across conditions
In this section, we outline the procedure used to adjust the turnover numbers in EC-iCH360 by fitting proteomic measurements across conditions (see Section 2.5 in the main text). Briefly, our aim is to adjust the turnover numbers that parametrise the model so that enzyme allocation predictions obtained through the enzyme-constrained formulation of FBA (see Methods in the main text) more closely match experimental measurements of enzyme abundances. By simultaneously fitting experimental measurements across many growth conditions, we improve the robustness of the fitting procedure to experimental error and generate a condition-independent parameter set that predicts average trends of enzyme allocation across conditions.
Adjusting turnover numbers is, in general, not simple. Due to the linear programming formulation underlying the enzyme-constrained FBA problem, optimal flux distributions (and, by direct consequence, enzyme allocation predictions) are discontinuous over the turnover parameter space, making derivative-based searches through this space problematic from a numerical perspective. Similarly, the high dimensionality of the parameter space limits the applicability of gradient-free optimisation algorithms. Hence, we restrict ourselves to the (comparatively simpler) problem of adjusting turnover parameters for a fixed set of reference flux distributions across growth conditions, constraining our search to the portion of parameter space in which these reference flux vectors remain optimal for their respective conditions. Our adjustment heuristic thus consists of two steps, which we detail in the remainder of this section
A.5.1 Obtaining a set of reference flux distributions
Following the sMOMENT formulation of enzyme-constrained FBA [bekiaris_automatic_2020], we consider a metabolic network with reactions and metabolites where all metabolic fluxes are positive (i.e. reversible reactions are split into forward and backwards components) and at most one enzyme is associated with each reaction. Hence, we assume that the enzyme cost required to sustain flux for a given growth condition is given by , where the cost per unit flux is given by:
(21) |
Here, is the molecular weight of the enzyme associated with the reaction, is the turnover number for the enzyme-reaction pair, and is a condition-specific coefficient (typically interpreted as an average enzyme saturation level) scaling the total enzyme abundance predicted for a flux distribution. Given an initial vector of turnover numbers to be adjusted, we compute a reference flux distribution for the th growth condition, by fixing the biomass flux, to the experimentally measured rate and minimizing the total enzyme cost:
(22) | |||||
Here, the objective is the total enzyme cost for this condition, is the experimentally measured growth rate for the th growth condition, is the stoichiometric matrix of the network and and are a matrix and a vector, respectively, encoding any desired upper bound (or positive lower bound) on the fluxes for the th growth condition. Noting that constraint 22c can be equivalently cast as a double inequality, we rewrite the problem in the more general form:
(23) | |||||
where the biomass flux is assumed to be the last component of the flux vector and the augmented matrices:
(24) |
were introduced. Finally, from (21) we note that the solution of problem (23) is independent of the choice of the coefficient , as this merely amounts to a scaling of the objective function. Hence, without loss of generality, we can simply set for all growth conditions in this step and fit a condition-specific value for this parameter in the next half of the procedure. Solving (23 for each of the growth conditions available in the experimental dataset, we obtain a set of optimal flux distributions , which we will use as a reference in the next step.
A.5.2 Adjusting turnover numbers for the reference set of flux distributions
We now turn to fit the turnover and saturation parameters parameters against experimental measurements of enzyme abundances. For this purpose, we denote with a vector of log10 turnover parameters (one for each enzyme-reaction pair in the model) and with a vector of condition-specific log10-scaling factors (i.e. and ). Further, we denote with the vector of original log-turnover numbers (i.e. the one used to obtain the reference flux distributions). For the choice of reference flux distribution computed in the previous step, the abundance of the th enzyme in condition , , is then a function of and :
(25) |
where the summation index runs across all reactions catalysed by enzyme and the flux cost of reaction in condition , , is computed as in (21). From the formulation of the linear program (23), it is clear that there must exist a region of log-turnover parameter space such that and that, for every , is the optimal solution of problem (23) for its growth condition. Hence, in this step of the adjustment procedure, we aim to find a set of adjusted log turnover parameters, and log scaling factors, by minimising the discrepancy between enzyme abundance predictions and measurements, constraining our search to this region of the parameter space where all reference flux distributions are optimal for their respective growth condition:
(26) | ||||
where is the number of enzyme-condition pairs, is the number of turnover parameters, is the experimental measurement of enzyme in condition , and are bounds on the allowable adjustment, is a scalar hyperparameter, and the function is defined as:
(27) |
The objective function of the nonlinear program (26) is a combination of two terms. The first term penalises the mean squared deviation between measurements and predictions of log-enzyme abundance. Note that the above definition of implies that, for each condition, only enzymes with nonzero predicted abundance are included in this term. The second term is a regularisation expression (whose strength is controlled by the hyperparameter ) penalising mean squared deviation from the original parameter set. The latter term mainly serves two purposes: first, it ensures that, whenever a turnover parameter is ”free” in the problem (which can happen if its associated reaction fluxes are across all conditions, or if no experimental measurements are available for its associated enzyme), it will be kept at its original value; secondly, it provides a mean to tune the strength of the adjustment procedure.
In order to define the region , we shall exploit the sufficient optimality conditions of LP (23). Introducing, for each growth condition, the vectors of dual variables, and , corresponding to constraints 23b and 23c, respectively, we note that problem 23 admits (for the th growth condition) a dual problem in the form:
(28) | |||||
A well-known result in linear programming duality theory [maranas_zomorrodi_2016] states that a flux distribution is the optimal solution of the primal problem 23, if, and only if, the dual problem 28 is feasible (dual feasibility) and its optimal objective coincides with the primal optimal objective, that is (strong duality). Taken together, dual feasibility and strong duality thus define the region of optimality in turnover parameter space of each reference flux distribution. Hence, we can integrate the above definition of within problem 26 by introducing the two vectors of dual variables ( and ) for each condition as additional optimisation variables, and simultaneously enforcing the optimality for each reference distribution. By doing this, we obtain the final formulation of the nonlinear program for turnover number adjustment, which we solved to obtain the results shown in the main text:
(29) | |||||
A.5.3 Conversion to apparent turnover numbers
The above procedure produces (after conversion back to a linear scale) a set of adjusted turnover numbers, and scalings, . In order to facilitate the physical interpretation of the resulting parameter set, we express the fitted scalings for the th condition, as the product of two terms:
(30) |
Here, is the geometric mean of the scalings across conditions, which we interpret as typical enzyme saturation level, while is a residual scaling factor fluctuating around which required to account for differences in total measured enzymes across conditions. Based on this reparametrisation of the scaling, we now define a set of adjusted apparent turnover numbers, , which now incorporate saturation information, as:
(31) |
A.5.4 Potential extensions
We conclude this section by noting that procedure described above assumes that the original parameter set is sufficiently ”good” to produce a realistic flux distribution to use as a reference for the adjustment step. If this is not the case, then a multi-start approach can be implemented, where multiple turnover parameter sets are first generated by perturbing the original parameter set, and each is used to generate a separate set of flux distributions. Each reference set is then provided as an input to problem (29), and the solution achieving the lowest objective chosen in the end. The perturbed parameter set may be generated either randomly (for example, by adding log-normal noise to the original turnover parameter vector) or systematically. The latter could be achieved, for example, by identifying reactions with zero-predicted flux but high measured abundance of the associated enzyme. By systematically increasing the corresponding turnover number, one can ”encourage” these reactions to be included in the reference set, potentially leading to the exploration of more relevant reference set than achieved by random perturbation.
Note that, for the purposes of this work, we limited ourselves to the original parameter set and thus did not explore this potential heuristic.
Appendix B Supplementary Figures
Appendix C Supplementary Tables
\RaggedRight Node type | \RaggedRight Node subtype | Description | Example |
\RaggedRight reaction | \RaggedRight | a mass balanced chemical or biochemical reaction | bigg:GAPD |
\RaggedRight protein | \RaggedRight polypeptide | a single polypeptide coded by a gene | GAPDH-A-MONOMER |
\RaggedRight | \RaggedRight multimer | a complex formed by stoichiometric binding of different polypeptides and/or other complexes | GAPDH-A-CPLX |
\RaggedRight | \RaggedRight modified protein | a polypeptide or multimer which underwent post-translational modification | LIPOYL-GCVH |
\RaggedRight gene | \RaggedRight | a gene | b1779 |
\RaggedRight metabolite | \RaggedRight | an organic or inorganic molecule | L-ASPARTATE |
\RaggedRight
logical-AND/
logical-OR |
\RaggedRight | used as intermediates node to encode arbitrary logical rules in GPRs generated thorough the graph (e.g. one of two proteins being required by another node) | THIOREDOXINS |
Edge type | parent node type(s) | child node type(s) | description | \RaggedRight weight | example |
---|---|---|---|---|---|
catalysis | reaction | protein | a catalytic relationship between a reaction and an enzyme which catalyzes it. Further annotated as either primary or secondary catalysis (see main text and Methods) | \RaggedRight NA |
bigg:PFK 6PFK-1-CPX (primary)
bigg:PFK 6PFK-2-CPX (secondary) |
subunit composition | protein | protein | indicates that the child node is a subunit of the parent node | \RaggedRight the stoichiometry of the subunit in the complex. | FABA-CPLXFABA-MONOMER |
accessory component | protein | protein | Indicates that the child protein is an accessory subunit of the parent protein, i.e. can be part of the complex but is not strictly required for correct function of the complex. | \RaggedRight the stoichiometry of the subunit in the complex. | ATPSYN-CPLXEG10106-MONOMER |
protein modification | protein | protein | Indicates that the parent protein is a obtained by post-translational modification of the child protein. | \RaggedRight NA | PYRUVFORMLY-CPLXPYRUVFORMLY-INACTIVE-CPLX |
protein modification requirement | protein | protein | indicates that the child protein is required to accomplish the post-translational modification leading to the parent protein. | \RaggedRight NA | PYRUVFORMLY-CPLXPFLACTENZ-MONOMER |
non-catalytic requirement | reaction | protein | Indicates that the child protein is required by the parent reaction, although not as a catalyst. Typical examples include proteins used as cofactors (e.g. glutaredoxins) in the reaction or featuring as prosthetic groups for a metabolite involved in the reaction (e.g. Acyl-Carrier-protein) | \RaggedRight NA | bigg:ACOATAACP-MONOMER |
putative association | reaction | protein | Indicates that a putative association between the reaction and the protein has been proposed in literature | \RaggedRight NA | bigg:PFLEG11910-MONOMER |
inactive catalysis | reaction | protein | Indicates that the child protein is an enzyme for the parent reaction, but it’s inactive in this strain due e.g. to a frameshift mutation. | \RaggedRight NA | bigg:ACHBSACETOLACTSYNII-CPLX |
regulatory interaction | protein | metabolite, protein | Indicate that the child metabolite or protein is a regulator for the enzyme. Information about the regulation mode (activation vs inhibition), the regulation mechanism (competitive vs allosteric) and the regulated reaction (if the enzyme catalyses multiple) is provided as edge metadata whenever available. | \RaggedRight If present, denotes the activation/inhibition constant for the interaction as reported in EcoCyc. | SHIKIMATEAROE-MONOMER |
coding relation | protein | gene | Indicate that the child gene codes for the parent polypeptide | \RaggedRight NA | RIBOKIN-MONOMERb3752 |
Reaction set | Pruned in iCH360red | Notes |
---|---|---|
EAR(n)x, EAR(n)y * | EAR(n)y | Enzyme FabI can work with both NADH/NADPH, but higher activity was found with NADH [bergler_enoyl-acyl-carrier-protein_1996] |
ACOATA, KAS14, KAS15 | ACOATA, KAS14 | Initiation of fatty acid biosynthesis can occur by either direct condensation of acetyl-CoA with malonyl-ACP (KAS15) or by transacylation of acetyl-CoA followed by condensation with malonyl-ACP (ACOATA + KAS15). Because the transacylacylase activity of FabH (ACOATA) has been to be significantly lower than its condensation activity (KAS15) [tsay_isolation_1992], only the former pathway is maintained in iCH360red. |
VALTA, VPAMTr | VPAMTr | These are both routes to production of valine. We keep VALTA (ilvE) as it is the last step in the canonical valine biosynthesis route. |
RNDR1, RNDR2, RNDR1b, RNDR2b | RNDR1b, RNDR2b | The ribonucleoside diphosphate reductase can work both with the thioredoxin and glutaredoxin redox systems. iCH360 retains only the thioredoxin version. |
SULabcpp, SO4t2pp | SO4t2pp | SULabcpp is an ATP-mediated active transport of sulfate in the cell via an ATP-binding-cassette (ABC) transporter [sirko_sulfate_1995], while SO42tpp (cysZ) is a proton symporter [zhang_escherichia_2014]. We maintain the former as its impairment was shown to lead to cysteine auxotrophy [sirko_sulfate_1995]. |
*
number of EFMs (filtered) | ||
---|---|---|
Aerobic | Anaerobic | |
Glucose | 13468719 (1035696)) | 204028 (195670)) |
Pyruvate | 1763631 (135266)) | 6949 (6480)) |
Glycerol | 922217 (82112)) | NA |
Acetate | 38099 (7596) | NA |
Lactate | 1270315 (5897) ) | 1497 (1424)) |