Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
\sidecaptionvpos

figurec \addbibresourcebibliography.bib

A compact model of Escherichia coli core and biosynthetic metabolism

Marco Corrao1,∗ Hai He2 Wolfram Liebermeister3 Elad Noor4 Arren Bar-Even5,∓
1 Department of Engineering Science
University of Oxford Parks Road OX1 3PJ Oxford UK
2 Max Planck Institute for Terrestrial Microbiology
Karl-von-Frisch-Str. 10 35043 Marburg Germany
3 Université Paris-Saclay
INRAE MaIAGE 78350 Jouy-en-Josas France
4 Department of Plant and Environmental Sciences
Weizmann Institute of Science 7610001 Rehovot Israel
5 Max Planck Institute of Molecular Plant Physiology
Am Mühlenberg 1 14476 Potsdam-Golm Germany
Corresponding author: Marco Corrao; email: marco.corrao@eng.ox.ac.uk
minus-or-plus{\mp} Deceased in 2020
Abstract

Metabolic models condense biochemical knowledge about organisms in a structured and standardised way. As large-scale network reconstructions are readily available for many organisms of interest, genome-scale models are being widely used among modellers and engineers. However, these large models can be difficult to analyse and visualise, and occasionally generate hard-to-interpret or even biologically unrealistic predictions. Out of the thousands of enzymatic reactions in a typical bacterial metabolism, only a few hundred comprise the metabolic pathways essential to produce energy carriers and biosynthetic precursors. These pathways carry relatively high flux, are central to maintaining and reproducing the cell, and provide precursors and energy to engineered metabolic pathways. Here, focusing on these central metabolic subsystems, we present a manually-curated medium-scale model of energy and biosynthesis metabolism for the well-studied prokaryote Escherichia coli K-12 MG1655. The model is a sub-network of the most recent genome-scale reconstruction, iML1515, and comes with an updated layer of database annotations, as well as a range of metabolic maps for visualisation. We enriched the stoichiometric network with extensive biological information and quantitative data, enhancing the scope and applicability of the model. In addition, here we assess the properties of this model in relation to its genome-scale parent and demonstrate the use of the network and supporting data in various scenarios, including enzyme-constrained flux balance analysis, elementary flux mode analysis, and thermodynamic analysis. Overall, we believe this model holds the potential to become a reference medium-scale metabolic model for E. coli.

Keywords: Metabolic network, functional annotation, gene-reaction mapping, Elementary Flux Mode, biochemical thermodynamics

Abbreviations: EFM: Elementary Flux Mode; FBA: Flux Balance Analysis; GEM: Genome-scale models; MDF: Max-Min Driving Force; RMSE: Root Mean Squared Error; 3GP: glycerol 3-phosphate; 3hmrsACP: (R)-3-Hydroxytetradecanoyl-ACP; 3MOB: 3-Methyl-2-oxobutanoate; Ala: L-alanine; Asp: L-aspartate; ATP: adenosine triphosphate; F6P: fructose 6-phosphate; G3P: glyceraldehyde 3-phosphate; HdeACP: cis-hexadec-9-enoyl-ACP; palmACP: palmitoyl-ACP; PEP: phosphoenolpyruvate; Ru5P: D-ribulose 5-phosphate.

1 Introduction

Metabolic models are a valuable tool for biologists and biotechnologists aiming to elucidate and engineer cell metabolism [sarkar_engineering_2019, gu_current_2019, lu_silico_2023]. In their simplest form, such models encode basic biochemical knowledge, such as network structure, reaction stoichiometries, or known reaction directionalities, in a structured and standardised format. However, the scope of these models is often wider, including information about catalysing enzymes and kinetic parameters, as well as annotations that link model elements to external databases. The fast development of high-throughput experimental and computational pipelines has led to genome-scale reconstructions of metabolism now existing for a wide range of microorganisms of interest [gu_current_2019, fang_reconstructing_2020]. One of them, Escherichia coli, has been the most studied prokaryotic organism and, as such, its metabolism has been the subject of extensive modelling efforts spanning over three decades [reed_thirteen_2003, feist_genomescale_2007, orth_comprehensive_2011, monk_iml1515_2017]. In particular, for the common laboratory strain E. coli K-12 MG1655, the most recent genome-scale reconstruction, iML1515, accounts for 2712 enzyme-catalysed reactions mapped in detail to 1515 genes [monk_iml1515_2017].

Genome-scale metabolic network models (GEMs) provide a comprehensive picture of cell metabolism, and constraint-based modelling algorithms using these models have shown remarkable predictive power, for example, when predicting gene essentiality [bernstein_evaluating_2023]. Nevertheless, working with such large models comes with some disadvantages. In the absence of sufficient constraining or parametrisation, simulations based on large networks can easily lead to biologically unrealistic solutions, which may confound the interpretation of modelling results. For example, when designing and testing gene knockout strategies, genome-scale networks often wrongly predict unphysiological metabolic bypasses which have to be manually inspected and filtered out. Another issue is that, owing to their size and complexity, analysis of genome-scale networks is often limited to relatively simple modelling frameworks, such as flux balance analysis (FBA), that can only answer a limited range of questions. More complex methods, including kinetic modelling, sampling of metabolic flux distributions, or elementary flux mode (EFM) analysis [wortel_metabolic_2018], can be used to gain additional insight into the governing principles and constraints of microbial metabolism, but are difficult to apply to large models. Finally, genome-scale models are often hard to visualise comprehensively, which can make the interpretation of computed flux distributions cumbersome and unintuitive.

For all these reasons, small-scale models of E. coli metabolism are commonly used both for strain design and for the development of novel modelling frameworks. Among these, the E. coli core model (ECC) developed by \citeauthor*orth_reconstruction_2010 [orth_reconstruction_2010] has been widely used in the literature. While being popular as an educational and benchmark tool, ECC is limited in scope: it lacks, among others, most biosynthesis pathways, which would be relevant to many metabolic engineering applications. This limitation was later addressed by \citeauthorhadicke_ecolicore2_2017 [hadicke_ecolicore2_2017], who constructed a medium-scale model, E. coli core 2 (ECC2), as a subnetwork of iJO1366, the most up-to-date GEM available at the time [orth_comprehensive_2011]. ECC2 was obtained through an algorithmic reduction [erdrich_algorithm_2015] that iteratively prunes reactions from a template model while retaining user-specified structural and phenotypic features, such as the ability to grow on a defined set of conditions. However, to enforce the desired phenotypes the algorithm relied only on steady-state stoichiometric modelling and did not account for other important factors, such as thermodynamics, kinetics, or regulatory effects, which are relevant under physiological conditions. Therefore, while the resulting submodel satisfies the imposed stoichiometric constraints by construction, further manual curation is often needed depending on the application at hand.

Here, we introduce iCH360 (named, as per convention, by the initials of the authors followed by the number of genes covered by the model), a manually curated, “Goldilocks-sized” model of E. coli K-12 MG1655 energy and biosynthesis metabolism. The model was derived from the most recent genome-scale reconstruction (iML1515 [monk_iml1515_2017]) and includes all pathways required for energy production and the biosynthesis of the main biomass building blocks, such as amino acids, nucleotides, and fatty acids, but accounts only implicitly for the conversion of these precursors into more complex biomass components. We extended the coverage of annotations pointing to external databases from iML1515, and built custom metabolic maps to facilitate visualising the model and its subsystems [jahan_development_2016]. We complemented the stoichiometric network structure with a curated layer of biological information about catalytic function, protein complex composition, and small-molecule regulation. Finally, we enriched the model with useful quantitative data, including thermodynamic and kinetic constants. Thanks to all these extra layers of information, our model can support a wide range of modelling efforts beyond simple stoichiometric ones.

Below we present the model and demonstrate several use-cases across various modelling scenarios, including enzyme-allocation predictions, EFM analysis, and thermodynamic analysis. The model is freely available in SBML, JSON, and SBTab standard formats, and can be directly used with popular metabolic modelling tools such as the COBRA toolbox [ebrahim_cobrapy_2013].

2 Results

2.1 A compact model of E. coli energy and biosynthesis metabolism

To assemble the iCH360 model, we started from the core metabolic reactions present in ECC and extended them with the pathways required for the biosynthesis of the main biomass building blocks, including the twenty amino acids, the four nucleotides, and both saturated and unsaturated fatty acids (Fig. 1, Table 1, and Supplementary Figures 1-4). In addition, while not performing a comprehensive review of iML1515, we applied a small number of corrections to the original reactions based on literature knowledge (Supplementary Information A.1).

Escher map of iCH360
Refer to caption

Figure 1: Metabolic map of the iCH360 model, built with the metabolic visualisation tool Escher [king_escher_2015] and showing the metabolic subsystems included in the model. Shaded areas denote metabolic subsystems already present in the ECC model [orth_reconstruction_2010]. Reaction and metabolite names were omitted from the plot for clarity. Overlaid onto the map is a flux distribution for aerobic growth on glucose computed via parsimonious FBA.
Table 1: The main metabolic subsystems covered by iCH360.
Subsystem Description Metabolic map
Carbon uptake and transport Reactions required for the uptake and assimilation of the following carbon sources: glucose, fructose, ribose, xylose, lactate, acetate, gluconate, pyruvate, glycerol, glycerate, succinate, 2-ketoglutarate Figure 1
Central carbon metabolism Glycolysis, pentose phosphate pathway, pyruvate fermentation, TCA cycle, oxidative phosphorylation Figure 1
Amino acids biosynthesis Biosynthesis of all 20 amino acids from core metabolism precursors Supplementary Figure 1
Nucleotide biosynthesis Biosynthesis of purine and pyrimidine nucleotides (and deoxynucleotides) from core and amino acid metabolism Supplementary Figure 2
Fatty-acids biosynthesis Biosynthesis of saturated and unsaturated fatty acids present in E. coli from acetyl-CoA Supplementary Figure 3
C1 metabolism One-carbon metabolism Supplementary Figure 4

The final assembled model (Figure 1) comprises 304304304304 compartment-specific metabolites and 323323323323 metabolic reactions mapped to 360360360360 genes, thus qualifying as a medium-scale model ranging in between ECC and iML1515 (Supplementary Figure 5). While similar in scale, our model and ECC2 present a fundamental structural difference. ECC2 was built by systematically removing reactions from its genome-scale parent (iJO1366) [hadicke_ecolicore2_2017]. Thus, its metabolic space spans the production of all compounds consumed in the iJO1336 biomass reaction. In contrast, the metabolic space of iCH360 only reaches biomass building blocks, without explicitly considering peripheral pathways, such as assembly of cell-membrane components, de novo synthesis of cofactors, and active transport of ions in the cell.

To make the model comparable to its parent model iML1515, we computed an equivalent biomass reaction, wherein all biomass requirements not present in our model are converted into an equivalent metabolic cost in terms of precursors, based on manually curated bioproduction routes (Table 2, Supplementary Information A.2).

Table 2: Biosynthesis pathways outside of the iCH360 model that were considered to construct a biomass reaction equivalent to the biomass reaction in the iML1515 model. The right column shows the main precursors present in iCH360. Note that only the main precursors are shown here, but the equivalent biomass reaction computed also accounts for any net production or consumption of metabolites in the reduced model.
Pathway Precursor in iCH360
Biosynthesis of phosphatidylethanolamine (C16:0 and C16:1) 3GP, PalmACP (C16:0), HdeACP (C16:1)
Biosynthesis of KDO2-Lipid-A F6P, Ru5P, 3hmrsACP
Murein Biosynthesis G3P, PEP, F6P, Ala
NAD/NADP de novo biosynthesis Asp, DHAP
FAD de novo biosynthesis GTP, Ru5P
CoA de novo biosynthesis 3MOB, Asp
Active transport of ions ATP

2.2 Range of metabolic conversions described by production envelopes

To check how its significantly smaller size and complexity affects the solution space of our model, we first looked at the maximally achievable biomass production flux under a range of growth conditions (Supplementary Figure 6). Across most conditions considered, the model can achieve biomass fluxes comparable to its genome-scale parent. The main differences exist in anaerobic growth on fumarate, alpha-ketoglutarate (AKG) and glycerol, where our model predicts zero growth, while the GEM achieves some (albeit small) biomass production rate. In practice, fermentation in these scenarios is in fact biologically unrealistic (e.g. [boecker_enabling_2022]). The reduced model is thus a good basis for metabolic simulation frameworks with few constraints such as FBA, since the reduction procedure, at least in this test, limits the original solution space of the GEM to more physiologically relevant regions.

To investigate the metabolic capabilities of our model even further, we looked at production envelopes (projections of the model’s solution space onto a smaller set of dimensions, also known as phenotypic phase planes), describing production rates for biomass and a range of metabolites. Figure 2 shows the computed envelopes for aerobic growth on glucose. The reduced model has a production envelope comparable with iML1515, where the main apparent difference is in the production of acetate. These differences can be ascribed to the absence of a series of degradation pathways in our model. These pathways allow the genome-scale model to channel additional carbon flux towards acetyl-CoA and, as a result, to reach higher theoretical acetate production yields of up to 3absent3\approx 3≈ 3 mole acetate per mole of glucose. Since the physiological relevance of such high acetate yields, in the absence of heterologously expressed pathways, has already been questioned [hadicke_ecolicore2_2017], we chose not to include in the model those additional reactions that would allow for higher acetate yields. A similar pattern of agreement with iML1515 was found when comparing production envelopes under different growth conditions (Supplementary Figure 7).

Aerobic growth on glucose
Refer to caption

Figure 2: The iCH360 model shows similar but more precise metabolic capabilities than iML1515. Considering glucose as feedstock and studying ethanol, lactate, and succinate production, a production envelope analysis yields similar results in the two models (note that the dashed line representing the production envelope of iML1515 is sometimes hidden behind the coloured lines). Growth rate and production fluxes were computed by bounding the glucose uptake rate to a maximum of 10 mmol/gDW/h. In the scenario of acetate production (top right panel) iCH360 does not yield an unrealistically high production flux [hadicke_ecolicore2_2017] as predicted iML1515.

2.3 Connecting reactions to their catalysing enzymes and the enzymes’ protein components

Metabolic models often contain annotations that connect model elements to entries in biological databases, such as EcoCyc [keseler_ecocyc_2017], KEGG [kanehisa_kegg_2017] and MetaNetX [moretti_metanetxmnxref_2021]. However, even for the subset of reactions included in iCH360, the annotations present in iML1515 were incomplete and in part outdated. To fill these gaps, we extended and corrected the original annotations through a mixture of automated querying and manual curation (Figure 3A). Notably, the annotations to EcoCyc, the main knowledgebase for E. coli, are complete so that every metabolic reaction in iCH360 is uniquely mapped to a corresponding EcoCyc ID (with the exception of three reactions, for which a match on the database could only be found for a different choice of redox cofactor, namely NADH instead of NADPH). Additionally, we found 134134134134 deprecated annotations pointing to the MetaNetX namespace, which were consequently updated with the most up-to-date IDs.

Refer to caption
Figure 3: Layers of annotation and biological knowledge supporting the stoichiometric model in iCH360. A: Annotations for the model reactions point to the EcoCyc, metaNetX, and KEGG databases. Bars show the numbers of annotations, highlighting the share of annotations that were added to or updated from the parent model iML1515. B: Some of the biological knowledge parsed from EcoCyc (and manually curated) included in the model-supporting functional annotation graph. The graph captures catalytic relationships between reactions and enzymes, protein subunit compositions, protein-gene mappings, and small-molecule regulation interactions, among others. Shown here as an example are the branches of the graph corresponding to the Glutamate Dehydrogenase (GLUDy) and Glutamate Synthase (GLUSy) reactions. C: Examples of catalytic relationships functionally annotated as either primary or secondary in the graph. D: Functional annotation of catalytic edges as primary or secondary supports phenotypic predictions. Left: Classification of catalytic edge disruptions in the network resulting from simulated knockout of genes associated with essential reactions in the model across 9999 growth conditions (see main text for a description of each disruption class). Right: comparison of predicted disruption outcomes against a large dataset of mutant fitness data [price_mutant_2018] shows that significantly different fitness changes are associated with the different types of disruption in the network.

Leveraging the complete mapping with the EcoCyc database, we parsed, assembled, and manually curated a data structure that enhances the stoichiometric model with a detailed layer of information about enzymes, polypeptides, and genes related to the network (Methods). This data structure takes the form of a weighted graph, wherein nodes represent biological entities (reactions, proteins, genes or compounds) and edges represent (potentially quantitative) functional relationships between them (Methods, Figure 3B, Supplementary Tables 1 and 2), such as catalysis, regulation, protein modification, and protein subunit composition. This graph contains the gathered information in a unified form, allowing users to perform a number of tasks related to metabolic modelling applications. For example, by explicitly mapping reactions to their catalysing enzymes rather than to the associated genes (which may additionally include protein activators or cofactors), it simplifies the definition of meaningful enzyme capacity constraints in the model. Similarly, since the graph topology implicitly defines associations between reactions and genes, boolean gene-protein-reaction (GPR) rules needed for in silico knockout studies can be generated accounting for both catalytic and non-catalytic requirements for each reaction (Supplementary Information A.3). Crucially, these GPR rules can be re-generated as needed whenever the graph is updated with new nodes or edges. Finally, we employed it to estimate the abundance of protein complexes included in the model from measured polypeptide abundances (Supplementary Information A.4) using an automated procedure (see Result section 2.5).

2.4 Primary and secondary catalysis

Through this annotation graph, 318318318318 metabolic reactions in the model are linked to 289289289289 catalysing enzymes, with more than 25%percent2525\%25 % of the reactions being catalysed by multiple enzymes (isozymes). Such enzymatic redundancy plays an important role, for example, when designing metabolic engineering strategies to prevent flux through a pathway. However, the different isoenzymes of a reaction need not all be equally important, and treating them as completely equivalent may generate inaccuracies in some phenotypic predictions. For example, while phosphofructokinase activity in E. coli (reaction PFK) is known to be carried by two isozymes, pfkA and pfkB, the latter is known to be responsible only for minor activity under normal physiological conditions [babul_phosphofructokinases_1978]. If these differences are not taken into account, metabolic models can overpredict redundancy in the network, for instance when assessing phenotypic predictions in response to gene knockouts [monk_iml1515_2017].

To address this issue and make the model usable in a wider range of modelling scenarios, we qualified catalytic edges in the graph as either primary or secondary catalysis (Methods, Figure 3C). A catalytic relationship between a reaction and an enzyme was annotated as secondary whenever the enzyme, according to experiments, accounts only for negligible activity for the reaction in the wild-type strain. As experimental evidence we considered in vitro and in vivo complementation studies. Through this curation process, a total of 72727272 catalytic edges were functionally annotated as secondary.

To test how this functional annotation can support phenotypic predictions, we investigated the disruption of each essential reaction in the model resulting from the knockout of each one of its associated genes (Methods). We classified the outcome of each possible knockout as: (i) complete disruption, if a reaction loses all of its catalytic edges, (ii) full primary disruption, if a reaction loses all of its primary catalysis edges, but secondary ones are left, (iii) partial primary disruption, if the reaction loses some, but not all, of its primary catalysis edges, or (iv) secondary disruption, if some or all secondary edges are disrupted, but none of the primary ones. We classified the effect of in silico knockouts for aerobic growth on 9999 different carbon sources and compared the results against a large dataset of mutant fitness data, obtained via competitive fitness assays, from [price_mutant_2018].

Based on this analysis, we found that disruptions of primary edges as a result of a knockout were significantly associated with greater fitness losses than disruptions of secondary edges (Wilcoxon rank-sum test, p<106𝑝superscript106p<10^{-6}italic_p < 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT, Figure 3D). In addition, primary disruptions that were only partial were associated with more contained fitness losses. Finally, minor fitness gains could be identified when mutants were associated with secondary and partial primary disruptions, but not when complete primary disruptions occurred. We did not find as significant differences between complete disruptions and full primary disruptions (Supplementary Figure 8), supporting the idea that catalytic relationships annotated as secondary are unlikely to be strong enough to stand-in for disrupted primary ones under normal physiological conditions.

2.5 Enzyme-constrained Flux Balance Analysis with EC-iCH360

To extend the scope and applicability of iCH360, we constructed a version of the model that allows for enzyme-constrained flux simulations, which we denote as EC-iCH360. We constructed EC-iCH360 in the sMOMENT [bekiaris_automatic_2020] format (Methods). The sMOMENT framework is inherently simple and generates the same solution space as more complex model formats, such as GECKO, unless reaction-specific capacity constraints are specified in the latter [sanchez_improving_2017, bekiaris_automatic_2020]. Since it requires unique reaction-enzyme mappings, we used the knowledge graph to remove all secondary catalytic relationships in the model (Section 2.4).

We first parametrized the model by defining a flux-specific enzyme cost for each reaction (in units of grams of enzyme per unit flux), using as values the estimated in vivo turnover number from [heckmann_kinetic_2020]. Using this enzyme-constrained model, we then predicted enzyme abundances and compared them to experimental enzyme abundances, estimated from a dataset of proteomic measurements for aerobic growth on eight different carbon sources [schmidt_quantitative_2016](Methods). This analysis led to generally good predictions, with root mean squared error (RMSE, computed for log10-transformed enzyme abundances) ranging from 0.530.530.530.53 to 0.620.620.620.62 (Supplementary Figure 9). In order to assess the nature of residuals between measurements and predictions, we investigated the geometric mean of enzyme abundances across all eight conditions (Supplementary Figure 9, bottom right panel). Notably, averaging predicted and measured enzyme abundances, respectively, across conditions did not significantly improve the RMSE, indicating that the abundances of individual enzymes were systematically over- or under-predicted across conditions.

To increase the predictive ability of the model, we thus adjusted the turnover numbers by fitting them to experimental measurements through a custom heuristic (Methods, Supplementary Information A.5). By simultaneously fitting all available conditions, we ensured that our adjustment procedure is robust to condition-specific biases. Further, by introducing regularisation within our adjustment scheme, we penalised large deviations of parameters from the original dataset, increasing the robustness of the procedure to overfitting. Our adjusted parameter set shows a mean absolute deviation (computed for log10-transformed turnover values) from the original parameter set of 0.22absent0.22\approx 0.22≈ 0.22 (Supplementary Figure 10) and results in significantly better agreement with experimental measurements across conditions (Figure 4A). Further, mean enzyme abundances across conditions are very well predicted with the adjusted parameter set (Figure 4B), implying that residuals between measurements and predictions are now to be attributed to variability in enzyme cost across conditions (which simple frameworks such as enzyme-constrained FBA cannot account for), rather than systematic over- or under-predictions. Thus, each adjusted turnover parameter can be thought of as a ”typical” apparent kcatsubscript𝑘catk_{\rm cat}italic_k start_POSTSUBSCRIPT roman_cat end_POSTSUBSCRIPT value, incorporating average saturation trends for an enzyme across growth conditions.

Refer to caption
Figure 4: Enzyme allocation predictions obtained with the model variant EC-iCH360 after adjusting the turnover parameters. A: Predicted vs measured enzyme abundances for aerobic growth on eight different carbon sources. Each data point represents an enzyme-condition pair. A total of 325325325325 data points corresponding to zero predictions (enzymes associated with zero-flux in the enzyme-constrained FBA solution for a given condition) were omitted from the plot. B: Geometric mean across conditions of predicted vs measured enzyme abundances. For each enzyme, the geometric mean was computed across the conditions with non-zero predicted abundance. A total of 27272727 data points, corresponding to enzymes with zero predictions across all conditions, were omitted from the plot.

2.6 Elementary Flux Modes in the reduced model variant iCH360red

Despite the small size of iCH360, we found the explicit enumeration of its elementary flux modes (EFMs) to be intractable. This is not necessarily surprising, since the topology of a metabolic network is a much better determinant of EFM count than its sheer size. Metabolic networks can possess different types of redundancies, such as the presence of alternative pathways for the production of the same metabolite, the use of alternative cofactors for the same catalytic step in a pathway, or the presence of alternative transporters for the uptake/excretion of a compound. While knowledge about these redundancies is often valuable, including them in the model can increase the number of EFMs exponentially, hampering or even preventing EFM-based analyses.

To address this issue, we identified and removed a small set of such redundancies in iCH360, using available information from literature, whenever possible, to ensure the most physiologically relevant alternative was maintained (Supplementary Table 3). This resulted a metabolic submodel of iCH360, a model variant that we denote by iCH360red. iCH360red contains 305305305305 metabolic reactions (18181818 less than iCH360) and shares the same production envelopes with its parent model for a number of metabolites of interest (Supplementary Figure 12). While the number of EFMs in iCH360red is still relatively large (13.5absent13.5\approx 13.5≈ 13.5 millions for aerobic growth on glucose, see Supplementary Table 4), it is not prohibitive for most types of EFM-based analysis, and their explicit enumeration does not require high-performance computing (Methods).

We used the obtained EFMs to study the possible combinations of achievable growth rates and yields in the network [wortel_metabolic_2018]. To this end, we considered growth on glucose as a scenario and computed, for each EFM from iCH360red, its yield and its achievable cell growth rate, which we estimate based on the enzyme costs defined for the enzyme-constrained model (Methods). Based on this analysis, we identified a front of Pareto-optimal EFMs, along which any increase in the growth rate will necessarily lead to a reduction in yield (Figure 5). Along the Pareto front, we observe a transition from a purely respiratory mode at maximum yield (Supplementary Figure 13) to a mixed respiratory-fermentative mode at maximum growth (Supplementary Figure 14). Quantitatively, the extent of this tradeoff was rather modest: the EFM with maximal yield reaches almost the maximal growth rate, so only minor gains in growth rate can be achieved by using other, fermentative modes along the Pareto front. However, it is worth noting that this analysis was performed using a simple capacity-based enzyme cost function, which summarises different determinants of enzyme cost such as driving force and substrate saturation [noor_protein_2016] in one single value, assumed to be constant. Hence, repeating the analysis with a more complete enzyme-cost function [noor_protein_2016] could help elucidate the nature of this tradeoff. To demonstrate that the shape and size of the Pareto front depend strongly on growth conditions, we also simulated an environment with very low oxygen levels. We implemented this by increasing the flux-specific enzyme cost of the oxygen-dependent reaction in the respiratory chain, equivalent to assuming a lower enzyme efficiency due to a lower oxygen level (Methods). Results (Supplementary Figure 15) show a much broader front of Pareto optimal EFMs, indicating that the nature of the observed tradeoff is indeed condition-dependent. Notably, a similar dependence of the Pareto front on extracellular oxygen availability has been previously observed in a small-scale model of E. coli core metabolism [wortel_metabolic_2018].

Refer to caption
Figure 5: Growth rates and biomass yields achieved by different elementary flux modes of iCH360red for growth on glucose. The inset on the top left (corresponding to the area of the plot enclosed by the red rectangle) highlights the front of Pareto-optimal EFMs (squares), with the maximum-growth and maximum yield modes laying at the extremes of the front. The growth rate of each mode was estimated by assuming that the metabolic enzymes in the model occupy, by mass, a constant fraction of the cell’s dry weight (see Methods).

2.7 Saturation FBA and modelling of overflow metabolism

In order to study the effect of external conditions on optimal metabolic strategies in more detail, we used another framework that does not require an enumeration of EFMs and that allows for additional flux bounds, for example, a lower bound on the ATP consumption for cell maintenance. The saturation FBA (satFBA) framework [muller_resource_2015] is a variant of enzyme-constrained FBA, wherein a fixed enzyme cost per flux is assumed for all metabolic reactions in a model, except for the substrate transporter, for which a complete kinetic law is used (Methods). Since the external substrate concentration is a simple parameter, screening this concentration is equivalent to screening the values of the transporter efficiency. Here we used satFBA to simulate how the growth-maximising solution of the network varies in response to changing extracellular glucose concentration. By solving the satFBA problem for a range of glucose concentrations, we predicted the dependence of the cell’s growth rate on substrate concentration, resulting in the typically observed Monod curve (Figure 6A). If the problem does not contain any further flux bounds (so that the magnitude of fluxes at the optimum is solely limited by the maximum enzyme availability) the solution of satFBA problems will be an elementary flux mode [muller_resource_2015]. Hence, in this case we can use satFBA to explore how a cell should switch across elementary modes as a function of the growth environment.

At low glucose concentrations, the glucose transporter operates at low saturation and glucose uptake is enzymatically expensive, leading to a high-yield, purely respiratory metabolic mode at the optimum (Figure 6B and C). As substrate availability is increased, the cost of substrate uptake decreases and higher growth rates are achieved by switching to lower yield, acetate-secreting modes, in line with typical observations of overflow metabolism in E. coli [basan_overflow_2015, chen_energy_2019]. However, the satFBA formalism can also be used with additional flux bounds, thus going beyond EFM-based analysis. For example, if a positive lower bound on ATP hydrolysis is added as a maintenance requirement, optimal solutions to the satFBA problem will no longer be elementary modes, and the yield of the optimal solution no longer follows a piecewise constant profile (Supplementary Figure 16)

Refer to caption
Figure 6: Saturation FBA enables the exploration of the optimal switching across elementary modes as a function of the growth environment. A: satFBA predictions for the growth rate as a function of external glucose concentration, showing a typical Monod curve. Note that, satFBA computes the cell’s growth rate by assuming a fixed total enzyme mass budget while varying the saturation of the substrate transporter as a function of external substrate concentration. Importantly, although the curve is continuous and smooth, it comprises many smaller sections where each one has a different dominating elementary mode. B: satFBA predictions for the acetate excretion flux, showing progressively higher use of fermentative metabolism in the optimal solution as external glucose availability increases. C: The yield of the optimal satFBA solution, computed as the ratio of biomass flux to glucose uptake, progressively decreases as external glucose availability increases.

2.8 Equilibrium constants, thermodynamic forces, and thermodynamically feasible states

Living systems operate out of thermodynamic equilibrium, and thermodynamics places strong constraints on the operation of metabolic systems. In any metabolic state, the flux directions must follow the signs of thermodynamic forces, which depend on metabolite concentrations and equilibrium constants. To provide these constants as parts of our model, we used the component contribution framework [noor_consistent_2013] to estimate the standard Gibbs free energy (ΔrGsubscriptΔ𝑟superscript𝐺\Delta_{r}G^{\prime\circ}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT) of each reaction (Methods). These estimates account for compartment-specific chemical environment parameters, such as pH, pMg, and ionic strength, and were corrected to account for protons and charge translocation in multi-compartment reactions.

The resulting parameter set covers the vast majority of the model (over 97 % of metabolic reactions, with the remaining being not covered by the component contribution database used here) and accounts for the uncertainty in the estimates through a multivariate covariance matrix. Accounting for correlations between the different ΔrGsubscriptΔ𝑟superscript𝐺\Delta_{r}G^{\prime\circ}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT values becomes important when imposing thermodynamic constraints on the model. For example, the fatty acid biosynthesis subsystem in the model consists of repeated elongation cycles, where a short sequence of chemical transformations is repeatedly performed on a growing carbon chain. As a result, even if the ΔrGsubscriptΔ𝑟superscript𝐺\Delta_{r}G^{\prime\circ}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT for each reaction in the pathway is known with some uncertainty, this uncertainty is tightly correlated across the reactions, which constrains the set of achievable thermodynamic states in the network.

Using this set of thermodynamic constants, we first tested whether some typical flux distributions obtained from the model are thermodynamically feasible. To this end, we considered flux distributions generated by parsimonious Flux Balance Analysis (pFBA) across 12121212 growth conditions and computed their max-min driving force (MDF) [noor_pathway_2014], accounting for uncertainty in the estimates (Methods). We found a positive MDF for each of the flux distributions, indicating that all pFBA solutions tested are thermodynamically feasible. Notably, we found that the computed MDF values cluster very clearly in three groups, corresponding to aerobic growth on glycolytic substrates, aerobic growth on gluconeogenic substrates, and anaerobic conditions (Figure 7A).

Having confirmed that our reference FBA-derived flux distributions are thermodynamically feasible, we then employed an alternative flux prediction method that ensures thermodynamically feasibilty by construction. The probabilistic metabolic optimisation (PMO) framework [gollub_probabilistic_2021] uses a mixed-integer quadratic programming approach to compute a set of fluxes, metabolite concentrations, and reaction driving forces which is probabilistically most in agreement with experimentally measured metabolite concentrations (Methods). Using this framework, we computed a maximum-likelihood thermodynamic state for the model, simulating aerobic growth on glucose. In the thermodynamic state computed by PMO, we found that all metabolites lie within physiologically reasonable concentrations (1 μ𝜇\muitalic_μM – 1 mM). In addition, all ”anomalous concentrations” identified by the framework (metabolite concentrations lying more than one standard deviation away from the mean experimental value) had been identified and explained previously [gollub_probabilistic_2021], and can potentially be addressed by lumping together those reactions for which substrate channelling is known to happen in literature.

We used the PMO-derived thermodynamic state to identify candidate bottlenecks, in terms of enzyme demand, across the network. To this end, we first computed the flux-force efficacy of each reaction in the thermodynamic state [noor_pathway_2014] (Methods). The flux-force efficacy is a unitless quantity, ranging from 00 to 1111, denoting the ratio between net flux (forward minus backward flux) and total flux (forward plus backward flux) of a reaction. Reactions operating at low flux-force efficacy have a lower net flux due to two reasons: (1) the forward flux is lower in absolute terms, and (2) the higher backward flux is counter-productive and subtracts from the forward flux. Therefore, to achieve a given required net flux, the cell has to invest more resources in maintaining a higher enzyme level. To identify potential thermodynamic bottlenecks leading to high enzyme costs in specific reactions, we thus screened all reactions for their predicted flux-force efficacy and flux (Figure 7B) and identified those predicted to operate at low efficacy, while carrying significant flux (Figure 7B, labelled points). To assess the extent to which the thermodynamic operating states predicted by PTA are predictive of enzyme investment, we split reactions in two groups based on whether their predicted flux-force efficacy sits above or below 50%percent5050\%50 %, and compared the distributions of measured enzyme abundances between the two groups (Methods, Supplementary Figure 17). While there are many determinants of enzyme abundance beyond thermodynamics, including enzyme turnover, affinity, or regulation, we observed a significant difference between the two distributions (p<0.001𝑝0.001p<0.001italic_p < 0.001, two-sided Wilcox sum-rank test), with the low efficacy having approximately a 3-fold higher median enzyme abundance than the high efficacy group.

Refer to caption
Figure 7: Thermodynamic analysis of the model via the curated thermodynamic parameter set. A Probabilistic Min-max driving force analysis of flux distributions obtained by parsimonious flux balance analysis for a total of 12121212 growth conditions. All flux distributions tested have a positive MDF, implying they are thermodynamically feasible under reasonable physiological metabolite concentration ranges. The computed MDF values cluster in three groups, corresponding to glycolytic aerobic, gluconeogenic aerobic, and anaerobic growth conditions. B: Fluxes relative to the glucose uptake flux (EX_glc__D_e) and flux-force efficacies computed by probabilistic metabolic optimisation (PMO). The labelled data points represent examples of reactions (excluding transport and spontaneous reactions) with low predicted flux-force efficacy (here, below 20%percent2020\%20 %), but carrying high relative flux in the optimal solution (here, more than 5%percent55\%5 % of the glucose uptake flux). xyl__D: D-xylose; glc__D: D-glucose; rib__D: D-ribose; glyc: glycerol; akg: alpha-keto-glutarate; ac: acetate; pyr: pyruvate; lac__D: D-lactate; succ: succinate. fum: fumarate

.

3 Discussion

Table 3: A summary of knowledge captured by the iCH360 model, as well as example simulations and analyses shown in this article.
\RaggedRight Model structure Notes Data source
\RaggedRight Network (reaction stoichiometries) Selected reactions from iML1515, hand-curated [monk_iml1515_2017]
\RaggedRight Annotations to external databases Parsed from iML1515, extended, and updated [monk_iml1515_2017], manual curation
\RaggedRight Network graphics Escher maps of the full model and its subsystems
\RaggedRight Biological knowledge supporting the stoichiometric model Catalytic relationships, protein complex composition, small-molecule regulations, and others [keseler_ecocyc_2017]
\RaggedRight Physico-chemical parameters mapped to model
\RaggedRight Thermodynamic constants (ΔG0Δsuperscript𝐺0\Delta G^{0}roman_Δ italic_G start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT) Account for compartment-specific chemical environment (pH, pMg, ionic strength, and potential) and include corrections for reactions occurring across compartments. [noor_consistent_2013], manual curation
\RaggedRight kappmaxsuperscriptsubscript𝑘appmaxk_{\rm app}^{\rm max}italic_k start_POSTSUBSCRIPT roman_app end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_max end_POSTSUPERSCRIPT values in vivo estimates of catalytic turnover numbers [heckmann_kinetic_2020]
\RaggedRight Typical kappsubscript𝑘appk_{\rm app}italic_k start_POSTSUBSCRIPT roman_app end_POSTSUBSCRIPT values Adjusted estimates of turnover numbers, fitted to proteomic data, accounting for typical saturation levels across growth conditions [heckmann_kinetic_2020] [schmidt_quantitative_2016]
\RaggedRight protein MWs molecular mass for all proteins/protein complexes covered by the model [keseler_ecocyc_2017]
\RaggedRight Cell-state data mapped to the model
\RaggedRight Protein abundances Abundance across different growth conditions for proteins/protein complexes covered by the model, estimated from experimentally measured polypeptide abundances [schmidt_quantitative_2016]
\RaggedRight Metabolite concentrations Measured metabolite concentrations across growth conditions [gerosa_pseudo-transition_2015]
\RaggedRight Metabolic fluxes Measured metabolite fluxes for aerobic growth on glucose [heckmann_kinetic_2020], [long_metabolic_2019]
\RaggedRight Example applications shown
\RaggedRight Phenotypic phase plane analysis See Figures 2, S7
\RaggedRight Enzyme-constrained FBA With enzyme-constrained version of the model, EC-iCH360, constructed in sMOMENT format [bekiaris_automatic_2020]
\RaggedRight EFM analysis With reduced model variant iCH360red; see Figures 5
\RaggedRight Saturation FBA Predicts metabolic fluxes as a function of external substrate concentration
\RaggedRight Max-Min Driving force Formulation from [noor_pathway_2014], extended to account for correlated uncertainty in the themodynamic constants estimates.
\RaggedRight Probabilistic Metabolic Optimisation Prediction of thermodynamic states (metabolite concentrations and relative fluxes) maximally consistent with measured metabolite concentrations [gollub_probabilistic_2021]

Here we presented iCH360, a medium-scale metabolic model of E. coli covering central and biosynthesis metabolism, together with the associated data and metabolic maps and results from several analysed use-cases. Similarly to previously constructed core models [orth_reconstruction_2010, hadicke_ecolicore2_2017], this model trades metabolic coverage for usability, interpretability, and ease of visualisation. It is well suited whenever a relatively small, highly curated network is desired, when computationally demanding analyses are to be performed, or as an educational tool in the field of metabolic modelling. When comparing some key properties of this model with those of its parent genome-scale model, we observed only small differences in the achievable biomass and product yields across a range of growth conditions, validating that, despite its contained size, the model captures the most salient metabolic features of the genome-scale network.

To make this model easily usable in a variety of applications, we enriched the stoichiometric network structure by a curated layer of biological knowledge in the form of a graph data structure. This graph encodes information about biological entities in the network in a structured, ready-to-use format, including catalytic relationships between reactions and enzymes, the stoichiometric composition of protein complexes, and small-molecule regulation interactions. Further, we mapped to the model a range of quantitative parameters, including in vivo turnover number estimates and thermodynamic constants, extending the use of the model beyond simple stoichiometric analysis. A summary of the biological knowledge captured by iCH360 is shown in Table 3.

Owing to its medium size and the high level of curation, iCH360 lends itself to a wide range of modelling frameworks. Here, we demonstrated some representative examples. These include modelling the allocation of metabolic proteome via the enzyme-constraint variant of the model EC-iCH360, enumerating and analysing elementary flux modes in the network via the reduced model variant iCH360red, and performing thermodynamic-based analysis using the provided set of thermodynamic constants.

While, at this stage, the parameterisation of the model is limited to turnover numbers and thermodynamic constants, we anticipate that additional parameter sets can easily be mapped to the model. Facilitated by extensive annotations present in the model and by the recent development of machine-learning-enabled kinetic constant estimators [kroll_deep_2021, li_deep_2022, kroll_turnover_2023, gollub_enkie_2023], a complete kinetic parameterisation of iCH360 is thus a valuable potential future development. In addition, probabilistic estimates of kinetic parameters [gollub_enkie_2023, lubitz_parameter_2019] can be combined with our existing thermodynamic parameterisation, making it possible to account for (potentially correlated) parameter uncertainty throughout the kinetic modelling process.

In light of the above results, we believe that iCH360 holds the potential to become a reference metabolic model for E. coli.

4 Methods

4.1 Model assembly and curation

All the relevant pathways included in the model were assembled and curated based on information available in the EcoCyc [keseler_ecocyc_2017] and KEGG [kanehisa_kegg_2017] databases. The respective reactions were then extracted from iML1515 and parsed into a new model. In order to compute an equivalent biomass reaction, we first collected all the pathways required for the production of any component present in the iML1515 biomass reaction, but not in our model. These additional pathways (available in the repository supporting this manuscript) were manually curated based on available literature and database annotation to ensure they represent the biologically most relevant bioproduction route for each biomass component. By adding these pathways to iCH360, we obtained an extended model that was able to predict growth rates directly through the original iML1515 biomass reaction. The equivalent biomass function was then computed based on a reference flux distribution computed on this extended model, as explained in Supplementary Information A.2. Model assembly, manipulation and validation were performed using the COBRA Toolbox [ebrahim_cobrapy_2013]. Extension of database cross-annotation for the model reactions was performed through a mixture of automated database query and manual curation.

4.2 Network graphics

iCH360 can be visualised through a series of custom-built maps using the metabolic visualisation tool Escher [king_escher_2015] (see Figure 1). There are three main ways to visualise the model or solutions thereof. First, a complete map of the model, including all of its reactions, can be used (Figure 1). In order to provide a more compact representation of the network, a compressed second variant of the same map was constructed (Supplementary Figure 18). Here, long biosynthetic linear pathways were lumped into single pseudo-reactions, which only show the net production or consumption of metabolites by the pathway while omitting intermediates. Finally, individual maps for each of the main subsystems in the model are provided.

4.3 Graph data structure for linking reactions to enzymes and proteins

Information about the enzymes and proteins behind the stoichiometric model was collected in a knowledge graph. To build this graph, all available data on reaction-protein association and subunit composition were retrieved by automatic querying of the BioCyc database through its REST-based data retrieval API (https://biocyc.org/web-services.shtml). This information was then extended and curated based on a comparison with existing iML1515 GPR annotations. The resulting data were used to generate a directed graph wherein nodes represent biological entities (such as reactions, proteins, genes, and metabolites) and edges represent functional dependencies across them, including catalysis, subunit composition, post-translational modifications, and others. A complete list of node and edge types in the graph are provided in Supplementary Tables 1 and 2, respectively. All polypeptide nodes were annotated with their molecular mass (parsed from EcoCyc), enabling recursive computation of the molecular mass of any protein node in the graph (see Supplementary Information A.3). All manipulation and analyses of the graph data structure were performed using the NetworkX Python package [hagberg_exploring_2008], and the final data structure provided to the user in both Cytoscape and GML formats.

4.4 Primary and secondary catalytic edges

Catalytic edges in the graph, i.e. edges connecting reaction and enzyme nodes, were manually annotated as either primary or secondary, based on available evidence for the activity of each enzyme in the model with respect to its associated reactions. More specifically, a catalytic edge between a reaction-enzyme pair was labelled as secondary whenever the enzyme had been shown in literature to account for only minor catalytic activity for a reaction when compared to another isozyme. In this case, the references to the relevant literature were included as metadata for that edge. Whenever sufficient information was not available, all isozyme edges for a reaction were conservatively treated as primary.

4.5 Catalytic disruption analysis

For the catalytic disruption test, we identified condition-specific essential reactions in the following way: in the model, for each condition considered, we determined the reactions whose knockout led to inability to produce biomass. A small number of known false-positives, which are essential in iCH360 only due to its lack of certain reactions or pathways, were excluded from the analysis. For each essential reaction, we considered all associated genes and investigated the removal of each of them, individually, from the graph. The effect of the simulated knockout was propagated across the graph by removing all nodes from the graph for which the gene is required according to the nodes’ boolean GPR rules (see Supplementary Information A.3). Finally, the outcome of each simulated knockout was catalogued based on the reaction-level disruption it caused (see main text). Whenever multiple reactions were disrupted by the knockout of a gene, the strongest disruption among them was assigned to the knockout, in the following precedence order: complete disruption, full primary disruption, partial primary disruption, secondary disruption. The analysis was repeated for a total of 9999 growth conditions, and each condition-gene pair, labelled with the assigned disruption type, was compared with experimentally measured mutant relative fitness (averaged across available replicates for that condition-gene combination), using the dataset from [price_mutant_2018].

4.6 Construction of the enzyme-constrained metabolic model

In the sMOMENT formalism, positive and negative fluxes in each reaction are formally described, respectively, as positive fluxes in separate “forward” and “backward” versions of the reaction. To construct the enzyme-constrained model in sMOMENT format, reversible reactions in the model were therefore duplicated in the model to separately represent fluxes in forward or backward direction. Direction-specific turnover number estimates were parsed from [heckmann_kinetic_2020] and a default value of 65s165superscripts165~{}\rm{s}^{-1}65 roman_s start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT was used for transporters as in [heckmann_machine_2018]. To account for the fact that [heckmann_kinetic_2020] reports values as turnover numbers per polypeptide, the values were multiplied by the number of polypeptide subunits in each enzyme. For each reaction-enzyme pair, an enzyme cost per unit flux (in units of ghmmol1ghsuperscriptmmol1{\rm g\cdot h\cdot mmol^{-1}}roman_g ⋅ roman_h ⋅ roman_mmol start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT) was then defined as:

ai=Mikcat,iσsubscript𝑎𝑖subscript𝑀𝑖subscript𝑘cat𝑖𝜎a_{i}=\frac{M_{i}}{k_{\text{cat},i}~{}\sigma}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_k start_POSTSUBSCRIPT cat , italic_i end_POSTSUBSCRIPT italic_σ end_ARG (1)

where aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the enzyme cost of reaction-enzyme pair i𝑖iitalic_i, kcat,isubscript𝑘cat𝑖k_{\text{cat},i}italic_k start_POSTSUBSCRIPT cat , italic_i end_POSTSUBSCRIPT is the turnover rate estimate for the pair (here, in units of h1superscripth1\rm h^{-1}roman_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT), Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the molecular mass of the enzyme involved (in kDakDa{\rm kDa}roman_kDa), and σ𝜎\sigmaitalic_σ is a unitless condition-specific scaling factor (typically interpreted as an average enzyme saturation value). Then, a unique enzyme was assigned to each reaction. To this end, secondary catalytic relationships were first discarded and, for reactions with multiple annotated primary isoenzymes, the enzyme with the highest measured abundance in the integrated PAX Database [huang_paxdb_2023] was heuristically chosen. Based on these costs per unit flux, an enzyme capacity constraint was introduced as:

iaiviEtotsubscript𝑖subscript𝑎𝑖subscript𝑣𝑖subscript𝐸tot\sum_{i}a_{i}~{}v_{i}\leq E_{\text{tot}}∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≤ italic_E start_POSTSUBSCRIPT tot end_POSTSUBSCRIPT (2)

where visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the flux in reaction i𝑖iitalic_i (in mmolh1gDW1mmolsuperscripth1superscriptgDW1{\rm mmol~{}h^{-1}~{}gDW^{-1}}roman_mmol roman_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT roman_gDW start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT) and Etotsubscript𝐸totE_{\text{tot}}italic_E start_POSTSUBSCRIPT tot end_POSTSUBSCRIPT is a parameter denoting the total amount of enzyme mass (in g/gDW) that can be allocated to the flux mode. The constraint is enforced by augmenting the stoichiometric matrix of the model with an additional enzyme supply pseudoreaction (upper-bounded by Etotsubscript𝐸totE_{\text{tot}}italic_E start_POSTSUBSCRIPT tot end_POSTSUBSCRIPT) and an enzyme pool pseudometabolite consumed in each reaction with stoichiometry eisubscript𝑒𝑖e_{i}italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [bekiaris_automatic_2020].

4.7 Adjustment of turnover numbers across conditions

Turnover numbers from [heckmann_kinetic_2020] were adjusted based on the nonlinear programming (NLP) formulation detailed in Supplementary Information A.5). The reference flux distributions required by the procedure were obtained using the original (unadjusted) set of turnover numbers and the bounds on allowable adjustments (𝐥𝐥\mathbf{l}bold_l and 𝐮𝐮\mathbf{u}bold_u in Eq. 29) set to ±2plus-or-minus2\pm 2± 2 (corresponding to a maximal 100-fold increase or reduction of each parameter from the original value). The linear program used to obtain reference flux distributions was formulated and solved with GUROBI [gurobi], while the nonlinear program used for turnover adjustment was formulated and solved with the open-source optimisation package CasADi [Andersson2019]. To investigate the effect of the ridge regularisation hyperparameter (ρ𝜌\rhoitalic_ρ in equation 29), we solved the adjustment problem for a broad range of values of this parameter and, each time, we computed the RMSE between measurements and predictions of enzyme abundance, as well as the mean absolute deviation of between original and adjusted turnover values, both computed for log10-transformed data (Supplementary Figure 11). Based on this information, a value of ρ=1𝜌1\rho=1italic_ρ = 1, after which any further decrease in the amount of regularisation results in marginal reduction of the RMSE, was chosen to compute the final set of adjusted turnover numbers. Upon reparametrisation into apparent turnover numbers, (see Supplementary Information A.5.3), this set of adjusted parameters was used to parametrise the enzyme-constrained model variant EC-iCH360.

4.8 Enzyme allocation predictions

To validate enzyme allocation predictions against experimental values, we first retrieved measured polypeptide abundances for each growth condition from [schmidt_quantitative_2016] and imputed missing values using, whenever available, abundance values from the PAX Database [huang_paxdb_2023]. Then we estimated enzyme counts across conditions from polypeptide counts using non-negative least-squares estimation (Supplementary Information A.4). Next, we converted them into mass abundances (in units of ggDW1gsuperscriptgDW1{\rm g~{}gDW^{-1}}roman_g roman_gDW start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT) based on the molecular mass of each complex (see Supplementary Information A.3) and assuming a cell mass of 2.8×1013g2.8superscript1013g2.8\times 10^{-13}\rm{g}2.8 × 10 start_POSTSUPERSCRIPT - 13 end_POSTSUPERSCRIPT roman_g (BIONUMBER ID 103904 [milo_bionumbers--database_2010]). Enzyme allocation predictions for each condition were then computed via EC-iCH360 by fixing the growth rate to the experimentally measured one from [schmidt_quantitative_2016] and minimising the total enzyme cost. To compute predictions using the turnover number estimates from [heckmann_kinetic_2020], the average saturation coefficient σ𝜎\sigmaitalic_σ was estimated, for each condition as:

σ=ieiEtot¯𝜎subscript𝑖subscript𝑒𝑖¯subscript𝐸tot\sigma=\frac{\sum_{i\in\mathcal{M}}e_{i}}{\bar{E_{\rm{tot}}}}italic_σ = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_M end_POSTSUBSCRIPT italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG over¯ start_ARG italic_E start_POSTSUBSCRIPT roman_tot end_POSTSUBSCRIPT end_ARG end_ARG (3)

where \mathcal{M}caligraphic_M denotes the index set of model enzymes for which measurements are available and Etot¯¯subscript𝐸tot\bar{E_{\rm{tot}}}over¯ start_ARG italic_E start_POSTSUBSCRIPT roman_tot end_POSTSUBSCRIPT end_ARG is the total measured model enzyme abundance for that condition. This value of σ𝜎\sigmaitalic_σ was then used to scale predicted enzyme abundances before comparing them to the experimentally measured ones. To compute predictions with the adjusted parameter set, the scaling factors for each condition were obtained as part of the fitting procedure (see Supplementary Information A.5). All reported root mean squared errors were computed on log10 transformed enzyme abundances, excluding zero-predictions from the computation.

4.9 Enumeration of Elementary Flux Modes

Elementary Flux Modes (EFMs) enumeriation for the submodel iCH360red was performed, for each growth condition, using EFMtools [terzer_large-scale_2008]. Filtered modes (Supplementary Table 4) were defined as those supporting nonzero biomass flux and, for aerobic modes, nonzero oxygen uptake. In addition, in the aerobic case, filtered modes exclude those carrying flux in either of three reactions – Pyruvate-Formate Lyase (PFL), Fumarate Reductase (FRD2), and the menaquinone-dependent Dihydroorotate dehydrogenase (DHORD5) – known to be only physiologically active under anaerobic conditions [zhang_inactivation_2001, cecchini_succinate_2002, andrews_anaerobic_1977]

4.10 Growth/yield tradeoff analysis

To analyse trade-offs between growth rate and yield, we computed the yield of each EFM as the ratio of its biomass flux vBMsubscript𝑣BMv_{\rm BM}italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT to its glucose uptake flux. In FBA models, fluxes are given in units of mmol/gDW/h, except for the biomass flux, whose conventional unit is h1superscripth1\rm{h}^{-1}roman_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT because the biomass flux in these models is scaled to describe the biomass production rate per amount of biomass present. This means that our yields are given in units of gDW/mmol. To determine the cell growth rate allowed by a flux distribution, we first computed its absolute enzyme cost

cenz=iaivisubscript𝑐enzsubscript𝑖subscript𝑎𝑖subscript𝑣𝑖c_{\rm enz}=\sum_{i}{a_{i}~{}v_{i}}italic_c start_POSTSUBSCRIPT roman_enz end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (4)

where cenzsubscript𝑐enzc_{\rm enz}italic_c start_POSTSUBSCRIPT roman_enz end_POSTSUBSCRIPT is the enzyme cost of the flux distribution (an enzyme mass, measured in g/gDW), visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the flux of reaction i𝑖iitalic_i, and aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the enzyme cost per unit flux in reaction i𝑖iitalic_i, computed as per equation 1 using the set of adjusted turnover numbers. Clearly, both vBMsubscript𝑣BMv_{\rm BM}italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT and cenzsubscript𝑐enzc_{\rm enz}italic_c start_POSTSUBSCRIPT roman_enz end_POSTSUBSCRIPT depend on the particular choice of scaling of the mode (though their ratio, vBM/cenzsubscript𝑣BMsubscript𝑐enzv_{\rm BM}/c_{\rm enz}italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT / italic_c start_POSTSUBSCRIPT roman_enz end_POSTSUBSCRIPT, doesn’t). In order to obtain an estimate for the growth rate that is scale-invariant, we thus rescaled all modes to the same enzyme cost f𝑓fitalic_f (in g/gDW), and looked at the resulting flux through the biomass reaction. More formally, the achievable growth rate μ𝜇\muitalic_μ (in h1superscripth1\rm h^{-1}roman_h start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT) for an (arbitrarily scaled) EFM with biomass flux vBMsubscript𝑣BMv_{\rm BM}italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT and enzyme cost cenzsubscript𝑐enzc_{\rm enz}italic_c start_POSTSUBSCRIPT roman_enz end_POSTSUBSCRIPT was computed as:

μ=fvBMcenz𝜇𝑓subscript𝑣BMsubscript𝑐enz\mu=f~{}\frac{v_{\rm BM}}{c_{\rm enz}}italic_μ = italic_f divide start_ARG italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT end_ARG start_ARG italic_c start_POSTSUBSCRIPT roman_enz end_POSTSUBSCRIPT end_ARG (5)

For simplicity, we approximate f𝑓fitalic_f by a constant value of f=0.285𝑓0.285f=0.285italic_f = 0.285 g/gDW, which we obtained by taking the minimum enzyme investment required by the enzyme-constrained model to support the experimentally measured growth rate reported in the proteomic dataset [schmidt_quantitative_2016] for the condition of interest in this analysis (aerobic growth on glucose).

To simulate low-oxygen conditions, the cost per unit flux of all oxygen-consuming reactions (CYTBO3_4pp, CYTBDpp, CYTBD2pp) was increased by a 1000100010001000-fold, mimicking the physiological state in which these reaction operate at low substrate saturation.

4.11 Saturation FBA analysis

Saturation FBA calculations were performed by optimising biomass production in the enzyme-constrained model, setting the saturation coefficient (σ𝜎\sigmaitalic_σ in Eq. (1) to the value fitted as part of the turnover number adjusting procedure (Section 4.7) for all reactions except the glucose transporter (GLCptspp). The saturation of the glucose transporter, σupsubscript𝜎up\sigma_{\rm up}italic_σ start_POSTSUBSCRIPT roman_up end_POSTSUBSCRIPT, was computed as a function of external glucose concentration assuming irreversible Michaelis Mentens kinetics, so that:

σup=[Glc]Km+[Glc]subscript𝜎updelimited-[]Glcsubscript𝐾mdelimited-[]Glc\sigma_{\rm up}=\frac{\rm[Glc]}{K_{\rm m}+{\rm[Glc]}}italic_σ start_POSTSUBSCRIPT roman_up end_POSTSUBSCRIPT = divide start_ARG [ roman_Glc ] end_ARG start_ARG italic_K start_POSTSUBSCRIPT roman_m end_POSTSUBSCRIPT + [ roman_Glc ] end_ARG (6)

where [Glc]delimited-[]Glc\rm[Glc][ roman_Glc ] is the external glucose concentration (in mM) and a value of 0.1160.1160.1160.116 mM was used for the Michaelis constant Kmsubscript𝐾mK_{\rm m}italic_K start_POSTSUBSCRIPT roman_m end_POSTSUBSCRIPT [wortel_metabolic_2018]

4.12 Component contribution estimates of thermodynamic constants

Estimates of the free energies of reactions, and their uncertainties, were obtained using the component contribution framework previously described in [noor_consistent_2013]. Several reactions in the model involve protein side groups, such as the acyl-carrying protein (ACP), or cofactors, such as glutaredoxin, for which a decomposition in terms of chemical groups is not available. As a result, thermodynamic constants for these reactions cannot be directly estimated through database-based implementations of the component contribution method, such as eQuilibrator [beber_equilibrator_2022]. However, since these non-decomposable protein groups are conserved in all reactions within our model, their net contribution to the reaction thermodynamics is, at least from a group contribution perspective, null. Hence, we first augmented the group incidence matrix of the eQuilibrator database (see Section S1.1 in [beber_equilibrator_2022]) by treating protein groups as non-decomposable units and manually curated the group decomposition of the metabolites containing them. The mean and covariance of the Gibbs standard free energy of reaction (ΔrGsubscriptΔ𝑟superscript𝐺\Delta_{r}G^{\circ}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT) estimates for all reactions in the model were then computed using this augmented matrix as described in [beber_equilibrator_2022].

In order to compute transformed Gibbs free energies of reactions (ΔrGsubscriptΔ𝑟superscript𝐺\Delta_{r}G^{\prime\circ}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT), for which an exact chemical definition of each metabolite is required, protein side groups were replaced by an appropriate chemical moiety best approximating the chemical environment of the metabolite. Specifically, ACP- groups were replaced by a phosphopantetheine group, the natural prosthetic group of acyl-carrier proteins, with a methyl group at the attachment site to the protein scaffold. Similarly, glutaredoxin was replaced by its Cys-Pro-Tyr-Cys active site, with the two cysteines being either free (for the reduced form of the cofactor) or linked by a disulfide bridge (for the oxidised state).

Corrections for reactions occurring across different compartments were computed as described in [beber_equilibrator_2022], using compartment-specific pH, pMg, ionic strength and potentials from [gollub_enkie_2023].

4.13 Max-min driving force computation

The max-min driving force (MDF) of each reference flux distribution was computed by extending the original formulation described in [noor_pathway_2014] to account for correlated uncertainty in the estimates of the thermodynamic constants. Let ΔGrNΔsuperscriptsubscript𝐺𝑟superscript𝑁\Delta G_{r}^{\prime\circ}\in\mathbb{R}^{N}roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT be a random vector representing the (uncertain) standard Gibbs free energy of reaction for the reactions in the network. Importantly, this vector includes only balanced metabolic reactions and excludes pseudoreactions such as exchange reactions. To describe our uncertain knowledge, we assume that the vector ΔGrΔsuperscriptsubscript𝐺𝑟\Delta G_{r}^{\prime\circ}roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT follows a multivariate normal distribution:

ΔGr𝒩(ΔGr¯,Σ)similar-toΔsuperscriptsubscript𝐺𝑟𝒩¯Δsuperscriptsubscript𝐺𝑟Σ\Delta G_{r}^{\prime\circ}\sim\mathcal{N}(\bar{\Delta G_{r}^{\prime\circ}},\Sigma)roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT ∼ caligraphic_N ( over¯ start_ARG roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT end_ARG , roman_Σ ) (7)

where ΔGr¯¯Δsuperscriptsubscript𝐺𝑟\bar{\Delta G_{r}^{\prime\circ}}over¯ start_ARG roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT end_ARG and ΣΣ\Sigmaroman_Σ are the mean and covariance of the estimates obtained through the component contribution methods. This random vector can equivalently be expressed as:

ΔGr=ΔGr¯+𝐐𝐳Δsuperscriptsubscript𝐺𝑟¯Δsuperscriptsubscript𝐺𝑟𝐐𝐳\Delta G_{r}^{\prime\circ}=\bar{\Delta G_{r}^{\prime\circ}}+\mathbf{Q}~{}% \mathbf{z}roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT = over¯ start_ARG roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT end_ARG + bold_Q bold_z (8)

where 𝐳𝐳\mathbf{z}bold_z is a standard normal random vector in qsuperscript𝑞\mathbb{R}^{q}blackboard_R start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT, with q=rank(Σ)𝑞rankΣq=\text{rank}(\Sigma)italic_q = rank ( roman_Σ ), and 𝐐N×q𝐐superscript𝑁𝑞\mathbf{Q}\in\mathbb{R}^{N\times q}bold_Q ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_q end_POSTSUPERSCRIPT is a square root of the covariance matrix, i.e. it satisfies:

Σ=𝐐𝐐T.Σ𝐐superscript𝐐T\Sigma=\rm\mathbf{Q}~{}\mathbf{Q}^{T}.roman_Σ = bold_Q bold_Q start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT . (9)

In order to integrate this probabilistic description within a typical constrained-optimisation formulation, we define the set 𝒟αsubscript𝒟𝛼\mathcal{D}_{\alpha}caligraphic_D start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT as the α𝛼\alphaitalic_α-level confidence region around the mean of ΔGrΔsuperscriptsubscript𝐺𝑟\Delta G_{r}^{\prime\circ}roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT. Using Eq. (7), and noting that the squared norm of a standard normal random variable is known to be Chi-squared distributed, we can represent this set as:

𝒟α={𝐱N|𝐱=ΔGr¯+Q𝐦,𝐦22χq;α2}subscript𝒟𝛼conditional-set𝐱superscript𝑁formulae-sequence𝐱¯Δsuperscriptsubscript𝐺𝑟𝑄𝐦superscriptsubscriptnorm𝐦22subscriptsuperscript𝜒2𝑞𝛼\mathcal{D}_{\alpha}=\{\mathbf{x}\in\mathbb{R}^{N}\hskip 2.84544pt|\hskip 2.84% 544pt\mathbf{x}=\bar{\Delta G_{r}^{\prime\circ}}+Q~{}\mathbf{m},\quad||\mathbf% {m}||_{2}^{2}\leq\chi^{2}_{q;\alpha}\}caligraphic_D start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT = { bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT | bold_x = over¯ start_ARG roman_Δ italic_G start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ∘ end_POSTSUPERSCRIPT end_ARG + italic_Q bold_m , | | bold_m | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_q ; italic_α end_POSTSUBSCRIPT } (10)

where 𝐦q𝐦superscript𝑞\mathbf{m}\in\mathbb{R}^{q}bold_m ∈ blackboard_R start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT is a free parameter vector and χq(.)\chi_{q}(.)italic_χ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( . ) denotes the quantile function (inverse cumulative distribution function) of a chi-squared distribution with q𝑞qitalic_q degrees of freedom. We can thus account for the uncertainty in thermodynamic estimates by treating the free energies of reaction as decision variables, rather than known parameters, and constraining their value to belong to (𝒟αsubscript𝒟𝛼\mathcal{D}_{\alpha}caligraphic_D start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT). For a given reference flux distribution 𝐯𝐯\mathbf{v}bold_v, this results in the following quadratically-constrained program (QCP):

max𝐦,b𝐦𝑏max\displaystyle\underset{\mathbf{m},b}{\text{max}}start_UNDERACCENT bold_m , italic_b end_UNDERACCENT start_ARG max end_ARG b𝑏\displaystyle bitalic_b (11)
s.t. ΔrGo=ΔrGo¯+𝐐𝐦subscriptΔ𝑟superscript𝐺𝑜¯subscriptΔ𝑟superscript𝐺𝑜𝐐𝐦\displaystyle\Delta_{r}G^{\prime o}=\bar{{\Delta_{r}G^{\prime o}}}+\mathbf{Q}% \mathbf{m}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ italic_o end_POSTSUPERSCRIPT = over¯ start_ARG roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ italic_o end_POSTSUPERSCRIPT end_ARG + bold_Qm
ΔrG=ΔrGo+RT𝐒𝐜subscriptΔ𝑟superscript𝐺subscriptΔ𝑟superscript𝐺𝑜RTsuperscript𝐒top𝐜\displaystyle\Delta_{r}G^{\prime}=\Delta_{r}G^{\prime o}+\rm{RT}~{}\mathbf{S}^% {\top}\mathbf{c}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ italic_o end_POSTSUPERSCRIPT + roman_RT bold_S start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_c
sign(vi)ΔrGi>b,if vi0formulae-sequencesignsubscript𝑣𝑖subscriptΔ𝑟subscriptsuperscript𝐺𝑖𝑏if subscript𝑣𝑖0\displaystyle-\text{sign}(v_{i})~{}\Delta_{r}G^{\prime}_{i}>b,\hskip 14.22636% pt\text{if }v_{i}\neq 0- sign ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > italic_b , if italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≠ 0
𝐦22χq;α2superscriptsubscriptnorm𝐦22subscriptsuperscript𝜒2𝑞𝛼\displaystyle||\mathbf{m}||_{2}^{2}\leq\chi^{2}_{q;\alpha}| | bold_m | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_q ; italic_α end_POSTSUBSCRIPT
𝐥𝐜𝐜𝐮𝐜subscript𝐥𝐜𝐜subscript𝐮𝐜\displaystyle\mathbf{l_{c}}\leq\mathbf{c}\leq\mathbf{u_{c}}bold_l start_POSTSUBSCRIPT bold_c end_POSTSUBSCRIPT ≤ bold_c ≤ bold_u start_POSTSUBSCRIPT bold_c end_POSTSUBSCRIPT

where b𝑏bitalic_b is the min driving force (or the MDF after optimisation), 𝐜m𝐜superscript𝑚\mathbf{c}\in\mathbb{R}^{m}bold_c ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is a vector of log-metabolite concentrations, 𝐥𝐜;𝐮𝐜msubscript𝐥𝐜subscript𝐮𝐜superscript𝑚\mathbf{l_{c}};\mathbf{u_{c}}\in\mathbb{R}^{m}bold_l start_POSTSUBSCRIPT bold_c end_POSTSUBSCRIPT ; bold_u start_POSTSUBSCRIPT bold_c end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT are (log) lower and upper bounds on these concentrations, 𝐒m×N𝐒superscript𝑚𝑁\mathbf{S}\in\mathbb{R}^{m\times N}bold_S ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_N end_POSTSUPERSCRIPT is the stoichiometric matrix of the model, RR{\rm R}roman_R is the Boltzmann gas constant, and TT\rm Troman_T is the temperature used for the computation of the free energy estimates. For our MDF calculations, we used a confidence level of 90%percent9090\%90 % and set the bounds on the concentration of all metabolites to the physiologically plausible range of (1μM1μM1\upmu\text{M}1 roman_μ M, 10mM10mM10\text{mM}10 mM). The above quadratically constrained program was formulated and solved the the GUROBI package [gurobi].

4.14 Probabilistic thermodynamic analysis

Probabilistic metabolic optimisation (PMO) of the model was performed using the PTA python package [gollub_probabilistic_2021], providing the software with the curated thermodynamic estimates generated for this model. The default values available through the package for growth on M9 medium with glucose (which include measurements from [gerosa_pseudo-transition_2015]) were used as priors for the concentration of metabolites, and the growth rate was bounded below by the reported value in [gerosa_pseudo-transition_2015]. Further, for this analysis, which requires every flux in the solution to have a well-defined directionality, two transhydrogenase reactions (NADH17pp, and THD2pp) were allowed to operate reversibly [gollub_probabilistic_2021].

To analyse the thermodynamic state computed by PTA, we compute the flux-force efficacy η𝜂\etaitalic_η for each reaction as (see Figure 8):

η=eΔrGRT1eΔrGRT+1𝜂superscript𝑒subscriptΔ𝑟superscript𝐺RT1superscript𝑒subscriptΔ𝑟superscript𝐺RT1\eta=\frac{e^{-\frac{\Delta_{r}G^{\prime}}{\rm RT}}-1}{e^{-\frac{\Delta_{r}G^{% \prime}}{\rm RT}}+1}italic_η = divide start_ARG italic_e start_POSTSUPERSCRIPT - divide start_ARG roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG start_ARG roman_RT end_ARG end_POSTSUPERSCRIPT - 1 end_ARG start_ARG italic_e start_POSTSUPERSCRIPT - divide start_ARG roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG start_ARG roman_RT end_ARG end_POSTSUPERSCRIPT + 1 end_ARG (12)

where ΔrGsubscriptΔ𝑟superscript𝐺\Delta_{r}G^{\prime}roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is the Gibbs free energy of reaction in the thermodynamic state, RR{\rm R}roman_R is the universal gas constant and TT{\rm T}roman_T is the temperature considered for the analysis (310.15K310.15𝐾310.15~{}K310.15 italic_K). To compare the PTA-predicted flux force efficacies with experimental measurements of enzyme abundance, we selected reactions with carrying at least 2.5%percent2.52.5\%2.5 % of the glucose uptake flux, and pooled these into two groups, corresponding to η>0.5𝜂0.5\eta>0.5italic_η > 0.5 (72727272 reactions) and η<0.5𝜂0.5\eta<0.5italic_η < 0.5 (31313131 reactions). η=0.5𝜂0.5\eta=0.5italic_η = 0.5 corresponds to ΔrG1.1RT=2.8subscriptΔ𝑟superscript𝐺1.1RT2.8\Delta_{r}G^{\prime}\approx-1.1~{}{\rm RT}=-2.8roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≈ - 1.1 roman_RT = - 2.8 kJ/mol.

00222244446666000.50.50.50.51111ΔrGRTsubscriptΔ𝑟superscript𝐺RT-\frac{\Delta_{r}G^{\prime}}{\rm RT}- divide start_ARG roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG start_ARG roman_RT end_ARGη𝜂\etaitalic_η
Figure 8: The flux force efficacy (η𝜂\etaitalic_η) as a function of the (scaled) negative Gibbs free energy of reaction, ΔrGRTsubscriptΔ𝑟superscript𝐺RT-\frac{\Delta_{r}G^{\prime}}{\rm RT}- divide start_ARG roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG start_ARG roman_RT end_ARG. The flux force efficacy corresponds to the ratio between net flux (forward minus backward flux) and total flux (forward plus backward flux) of a reaction, which approaches 1111 for reactions operating far from chemical equilibrium (ΔrG0much-less-thansubscriptΔ𝑟superscript𝐺0\Delta_{r}G^{\prime}\ll 0roman_Δ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≪ 0).

5 Author contributions

A. B.-E. conceived the project and supervised its initial stage. M.C and H.H. assembled and curated the model. M.C. wrote software, performed simulations and analysed results. W.L., E.N. and H.H. supervised the project and contributed to the analysis and interpretation of results. M.C., H.H., W.L., and E.N. wrote the manuscript.

6 Competing interests

The authors declare no competing interest.

7 Code availability

The model, together with all relevant data, is available on Github at https://github.com/marco-corrao/iCH360. The code and files required to reproduce all analysis in this manuscript are available on Github at https://github.com/marco-corrao/iCH360_paper.

\printbibliography

Appendix A Supplementary Information

A.1 Changes to reactions from the model iML1515

After assembling the model as a subset of reactions from the genome-scale model iML1515, a few minor corrections were applied to some of the reactions based on evidence gathered from the literature. Note, however, that these changes did not result from an exhaustive process of review of the parent model. Such corrections include:

  • In iCH360, the membrane-bound transhydrogenase reaction (THD2pp) only translocates 1 proton across the periplasmic membrane, as opposed to the 2 protons translocated in the iML1515 reaction [bizouarn_nucleotide_2005]

  • The gene-protein rule (GPR) of gene glxK (b0514) was reassigned to GLYCK2, which produces 2-phosphoglycerate. Meanwhile, the reaction GLYCK in from iML1515 (which produces 3-phosphoglycerate) was removed [zelcbuch_vivo_2015, bartsch_only_2008]

  • The NAPH-dependent homoserine dehydrogenase reaction (HSDy) was made irreversible towards the homoserine production direction [he_optimized_2020]. To avoid having a reaction irreversible in the backward direction, substrate and products of the reactions were flipped.

  • The succinate transport reaction SUCCt1pp was made irreversible in the export direction, enforcing the use of the more thermodynamically favourable SUCCt2_2pp (which translocates two protons instead of one) for succinate import. To avoid having a reaction irreversible in the backward direction, substrate and products of SUCCt1pp were flipped.

A.2 Computation of an equivalent biomass reaction

In constraint-based models of metabolism, the biomass reaction summarises the production of all molecules that are not explicitly described by the model. These are typically macromolecules such as proteins or polynucleotides. Which metabolites are drained from the model network towards these other parts of metabolism, and in what proportions, depends on what compounds are described by the model. Hence, starting from an existing model, taking only a subset of the reactions, but leaving the biomass reaction unchanged would lead to inconsistencies.

To construct a biomass reaction for iCH360 that corresponds equivalently to the biomass reaction in iML1515, the following method was used. First, we collected all the pathways required for the production of any component present in the iML1515 biomass reaction, but not in our model. These additional pathways (available in the repository supporting this manuscript) were manually curated based on available literature and database annotation to ensure they represent the most biologically-relevant bioproduction route for each biomass component. By adding these pathways to iCH360, we obtained an extended model (iCH360ext), which was able to grow directly through the original iML1515 biomass reaction. Let N𝑁Nitalic_N and M𝑀Mitalic_M denote the number of reactions in iCH360 and iCH360ext, respectively. Next, a reference flux distribution 𝐯refMsubscriptsuperscript𝐯𝑟𝑒𝑓superscript𝑀\mathbf{v}^{*}_{ref}\in\mathbb{R}^{M}bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT is computed on the extended model through FBA. We can express this flux vector as:

𝐯ref=(𝐯𝐯+vBM)subscriptsuperscript𝐯𝑟𝑒𝑓matrixsuperscript𝐯subscriptsuperscript𝐯superscriptsubscript𝑣BM\mathbf{v}^{*}_{ref}=\begin{pmatrix}\mathbf{v}^{*}\\ \mathbf{v}^{*}_{+}\\ v_{\rm BM}^{*}\end{pmatrix}bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT = ( start_ARG start_ROW start_CELL bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ) (13)

where 𝐯superscript𝐯\mathbf{v}^{*}bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the subset of its fluxes corresponding to reactions present in iCH360, 𝐯+subscriptsuperscript𝐯\mathbf{v}^{*}_{+}bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT + end_POSTSUBSCRIPT is the subset of fluxes corresponding to the reactions added to create the extended model iCH360ext, and vBMsuperscriptsubscript𝑣BMv_{\rm BM}^{*}italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the flux through the biomass reaction. If the fluxes 𝐯superscript𝐯\mathbf{v}^{*}bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT were to be imposed on iCH360, a number of metabolites would necessarily remain unbalanced, that is:

𝐒𝐯𝟎superscript𝐒𝐯0{\rm\mathbf{S}}\mathbf{v}^{*}\neq\mathbf{0}bold_Sv start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ≠ bold_0 (14)

where 𝐒𝐒{\rm\mathbf{S}}bold_S denotes the stoichiometric matrix of iCH360. Hence, we compute the equivalent biomass reaction as the additional reaction (column of 𝐒𝐒{\rm\mathbf{S}}bold_S) required to balance the system while achieving the same biomass flux of the reference distribution. That is:

[𝐒|req](𝐯vBM)=𝟎delimited-[]conditional𝐒subscript𝑟𝑒𝑞matrixsuperscript𝐯superscriptsubscript𝑣BM0\Big{[}{\rm\mathbf{S}}|r_{eq}\Big{]}*\begin{pmatrix}\mathbf{v}^{*}\\ v_{\rm BM}^{*}\end{pmatrix}=\mathbf{0}[ bold_S | italic_r start_POSTSUBSCRIPT italic_e italic_q end_POSTSUBSCRIPT ] ∗ ( start_ARG start_ROW start_CELL bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ) = bold_0 (15)

which implies:

req=1vBM𝐒𝐯subscript𝑟𝑒𝑞1superscriptsubscript𝑣BMsuperscript𝐒𝐯r_{eq}=-\frac{1}{v_{\rm BM}^{*}}{\rm\mathbf{S}}\mathbf{v}^{*}italic_r start_POSTSUBSCRIPT italic_e italic_q end_POSTSUBSCRIPT = - divide start_ARG 1 end_ARG start_ARG italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_ARG bold_Sv start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT (16)

The stoichiometry of this equivalent biomass reaction is dependent on the choice of reference flux distribution 𝐯refsubscriptsuperscript𝐯𝑟𝑒𝑓\mathbf{v}^{*}_{ref}bold_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT and, generally, may be different for different growth conditions. This results from the fact that, depending on the condition, the extended model may produce the same biomass component through different pathways, which would then be converted by our procedure into different equivalent costs of precursors in the sub-model. Nevertheless, the additional biosynthesis pathways we used to form the extended model do not allow for alternative routes to biomass, hence we found the equivalent biomass reaction to be, in this case, unique across conditions.

A.3 Recursive computation of attributes in the annotation graph

Using the knowledge graph supporting the stoichiometric model of iCH360 (Section 3 in the man text), a number of useful properties can be computed based on simple recursive operations. Here, we outline how such an approach was used to compute i) the molecular mass of all enzyme complexes in the model, based on known masses for all polypeptides and ii) the computation of boolean rules (GPRs) linking all reaction and protein nodes in the graph to the model genes. For a description of the different types of nodes and edges mentioned in this section, see Supplementary Tables 1 and 2.

A.3.1 Computation of molecular masses for all protein nodes

In order to compute the molecular mass of all protein nodes in the graph, the protein nodes corresponding to polypeptides were first annotated with their molecular masses, readily available on the EcoCyc database. Then the molecular mass of all other protein nodes is estimated recursively as follows. Let \mathcal{I}caligraphic_I denotes the index set of all protein nodes and 𝒞(i)𝒞𝑖\mathcal{C}(i)\in\mathcal{I}caligraphic_C ( italic_i ) ∈ caligraphic_I the index set of protein components of node i𝑖iitalic_i, i.e. all nodes connected to node i𝑖iitalic_i by a subunit composition relationship. The molecular mass of any protein node i𝑖iitalic_i, Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, is computed as:

Mi={M¯iif node i is a polypeptidek𝒞(i)wikMkotherwisesubscript𝑀𝑖casessubscript¯𝑀𝑖if node i is a polypeptidesubscript𝑘𝒞𝑖subscript𝑤𝑖𝑘subscript𝑀𝑘otherwiseM_{i}=\begin{cases}\bar{M}_{i}\qquad&\text{if node $i$ is a polypeptide}\\ \sum_{k\in\mathcal{C}(i)}w_{ik}M_{k}\qquad&\text{otherwise}\\ \end{cases}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { start_ROW start_CELL over¯ start_ARG italic_M end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_CELL start_CELL if node italic_i is a polypeptide end_CELL end_ROW start_ROW start_CELL ∑ start_POSTSUBSCRIPT italic_k ∈ caligraphic_C ( italic_i ) end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_CELL start_CELL otherwise end_CELL end_ROW (17)

where M¯isubscript¯𝑀𝑖\bar{M}_{i}over¯ start_ARG italic_M end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the (known) molecular mass of polypeptide node i𝑖iitalic_i and wiksubscript𝑤𝑖𝑘w_{ik}italic_w start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT denotes the weight of the edge between i𝑖iitalic_i and k𝑘kitalic_k.

A.3.2 Computation of gene-protein-reaction Rules

Similarly, boolean gene-protein-reaction (GPR) rules can also be computed recursively. We compute the GPR expression of a node depending on its type (reaction, protein, or logical) and the type of edges connecting it to its neighbouring nodes. Particularly, the GPR expression of node i𝑖iitalic_i, Gisubscript𝐺𝑖G_{i}italic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, is computed according to the following rules:

  • If node i𝑖iitalic_i is a logical AND (logical OR) node, its GPR is the AND (OR) of the GPR of its children nodes.

  • If node i𝑖iitalic_i is a polypeptide, its GPR expression is, trivially, the gene associated to the node.

  • If node i𝑖iitalic_i is a multimeric protein, its GPR is computed as the AND of the GPR expressions of the children nodes connected to i𝑖iitalic_i by a ”subunit composition” edge.

  • If node i𝑖iitalic_i is a modified protein, its GPR is computed as the AND of the children nodes connected to i𝑖iitalic_i by a ”protein modification” or ”protein modification requirement” edge.

  • If node i𝑖iitalic_i is a reaction, its GPR is first constructed as the OR of the GPRs of its catalytic children (nodes connected to i𝑖iitalic_i via a catalytic edge, representing isoenzymes for the reaction). To account for non-catalytic protein requirements of the reaction this expression is then ANDed with the GPR expressions of all non-catalytic children (nodes connected to i𝑖iitalic_i via a ”non-catalytic requirement” edge). More formally, denoting with (i)𝑖\mathcal{E}(i)caligraphic_E ( italic_i ) and 𝒩(i)𝒩𝑖\mathcal{N}(i)caligraphic_N ( italic_i ) the index sets of catalytic and non-catalytic children nodes of i𝑖iitalic_i, respectively, we have:

    Gi=[j(i)Gj]catalytic[k𝒩(i)Gk]non-catalyticsubscript𝐺𝑖subscriptdelimited-[]subscript𝑗𝑖subscript𝐺𝑗catalyticsubscriptdelimited-[]subscript𝑘𝒩𝑖subscript𝐺𝑘non-catalyticG_{i}=\underbrace{\left[~{}\bigvee_{j\in\mathcal{E}(i)}G_{j}~{}\right]}_{\text% {catalytic}}\wedge\underbrace{\left[~{}\bigwedge_{k\in\mathcal{N}(i)}G_{k}~{}% \right]}_{\text{non-catalytic}}italic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = under⏟ start_ARG [ ⋁ start_POSTSUBSCRIPT italic_j ∈ caligraphic_E ( italic_i ) end_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ] end_ARG start_POSTSUBSCRIPT catalytic end_POSTSUBSCRIPT ∧ under⏟ start_ARG [ ⋀ start_POSTSUBSCRIPT italic_k ∈ caligraphic_N ( italic_i ) end_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] end_ARG start_POSTSUBSCRIPT non-catalytic end_POSTSUBSCRIPT (18)

A.4 Estimation of enzyme complex abundances

In order to estimate the abundance of all enzyme in the model from proteomics data, we use the model graph (Main Text, Section 4.3) to construct a stoichiometric map between enzymes and polypeptides. this map takes the form of a matrix 𝐄n×m𝐄superscript𝑛𝑚{\rm\mathbf{E}}\in\mathbb{R}^{n\times m}bold_E ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_m end_POSTSUPERSCRIPT, where n𝑛nitalic_n is the number of enzymes in the model and mn𝑚𝑛m\geq nitalic_m ≥ italic_n the number of polypeptides, such that 𝐄i,jsubscript𝐄𝑖𝑗{\rm\mathbf{E}}_{i,j}bold_E start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT denotes the stoichiometry of polypeptide j𝑗jitalic_j in enzyme i𝑖iitalic_i. Some polypeptides may be part of additional complexes which are not part of the model. Hence, using the available annotation to the EcoCyc database, we identify these instances and construct a matrix 𝐄^^𝐄{\rm\mathbf{\hat{E}}}over^ start_ARG bold_E end_ARG by augmenting 𝐄𝐄{\rm\mathbf{E}}bold_E with additional rows corresponding to these out-of-model complexes.

With this mapping at hand, we assume that polypeptide abundances, 𝐩𝐩\mathbf{p}bold_p are related to enzyme abundances 𝐞𝐞\mathbf{e}bold_e (including the required additional complexes not accounted for in the model) by

𝐩=𝐄^𝐞𝐩superscript^𝐄top𝐞\mathbf{p}={\rm\mathbf{\hat{E}}}^{\top}\mathbf{e}bold_p = over^ start_ARG bold_E end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_e (19)

Hence, given a vector of experimental measurements of polypeptide abundances 𝐩¯¯𝐩\bar{\mathbf{p}}over¯ start_ARG bold_p end_ARG, we estimate enzyme abundances by solving the nonnegative least square (NNLS) problem:

min𝐞𝐩¯𝐄¯𝐞22subscript𝐞superscriptsubscriptnorm¯𝐩superscript¯𝐄top𝐞22\displaystyle\min_{\mathbf{e}}\quad||\bar{\mathbf{p}}-{\rm\mathbf{\bar{E}}}^{% \top}\mathbf{e}||_{2}^{2}roman_min start_POSTSUBSCRIPT bold_e end_POSTSUBSCRIPT | | over¯ start_ARG bold_p end_ARG - over¯ start_ARG bold_E end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_e | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (20)
s.t𝐞0formulae-sequencest𝐞0\displaystyle\rm{s.t}\qquad\mathbf{e}\geq 0roman_s . roman_t bold_e ≥ 0

A.5 Adjustment of turnover numbers based on proteomics measurements across conditions

In this section, we outline the procedure used to adjust the turnover numbers in EC-iCH360 by fitting proteomic measurements across conditions (see Section 2.5 in the main text). Briefly, our aim is to adjust the turnover numbers that parametrise the model so that enzyme allocation predictions obtained through the enzyme-constrained formulation of FBA (see Methods in the main text) more closely match experimental measurements of enzyme abundances. By simultaneously fitting experimental measurements across many growth conditions, we improve the robustness of the fitting procedure to experimental error and generate a condition-independent parameter set that predicts average trends of enzyme allocation across conditions.

Adjusting turnover numbers is, in general, not simple. Due to the linear programming formulation underlying the enzyme-constrained FBA problem, optimal flux distributions (and, by direct consequence, enzyme allocation predictions) are discontinuous over the turnover parameter space, making derivative-based searches through this space problematic from a numerical perspective. Similarly, the high dimensionality of the parameter space limits the applicability of gradient-free optimisation algorithms. Hence, we restrict ourselves to the (comparatively simpler) problem of adjusting turnover parameters for a fixed set of reference flux distributions across growth conditions, constraining our search to the portion of parameter space in which these reference flux vectors remain optimal for their respective conditions. Our adjustment heuristic thus consists of two steps, which we detail in the remainder of this section

A.5.1 Obtaining a set of reference flux distributions

Following the sMOMENT formulation of enzyme-constrained FBA [bekiaris_automatic_2020], we consider a metabolic network with N𝑁Nitalic_N reactions and M𝑀Mitalic_M metabolites where all metabolic fluxes are positive (i.e. reversible reactions are split into forward and backwards components) and at most one enzyme is associated with each reaction. Hence, we assume that the enzyme cost required to sustain flux visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for a given growth condition j𝑗jitalic_j is given by aijvisubscript𝑎𝑖𝑗subscript𝑣𝑖a_{ij}~{}v_{i}italic_a start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, where the cost per unit flux aijsubscript𝑎𝑖𝑗a_{ij}italic_a start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is given by:

{Miσjkcat,iif reaction i is enzymatic 0otherwisecasessubscript𝑀𝑖subscript𝜎𝑗subscript𝑘cat𝑖if reaction i is enzymatic 0otherwise\begin{cases}\frac{M_{i}}{\sigma_{j}~{}k_{\text{cat},i}}\quad&\text{if % reaction $i$ is enzymatic }\\ 0\quad&\text{otherwise}\end{cases}{ start_ROW start_CELL divide start_ARG italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT cat , italic_i end_POSTSUBSCRIPT end_ARG end_CELL start_CELL if reaction italic_i is enzymatic end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise end_CELL end_ROW (21)

Here, Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the molecular weight of the enzyme associated with the reaction, kcat,isubscript𝑘cat𝑖k_{\text{cat},i}italic_k start_POSTSUBSCRIPT cat , italic_i end_POSTSUBSCRIPT is the turnover number for the enzyme-reaction pair, and σjsubscript𝜎𝑗\sigma_{j}italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is a condition-specific coefficient (typically interpreted as an average enzyme saturation level) scaling the total enzyme abundance predicted for a flux distribution. Given an initial vector of turnover numbers to be adjusted, we compute a reference flux distribution for the j𝑗jitalic_jth growth condition, 𝐯jsuperscriptsubscript𝐯𝑗\mathbf{v}_{j}^{*}bold_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT by fixing the biomass flux, vBMsubscript𝑣BMv_{\rm BM}italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT to the experimentally measured rate and minimizing the total enzyme cost:

𝐯j=argmin𝐯superscriptsubscript𝐯𝑗subscriptargmin𝐯\displaystyle\mathbf{v}_{j}^{*}=\operatorname*{arg\,min}_{\mathbf{v}}bold_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT bold_v end_POSTSUBSCRIPT 𝐚j𝐯superscriptsubscript𝐚𝑗top𝐯\displaystyle\quad\mathbf{a}_{j}^{\top}\mathbf{v}bold_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_v (22)
s.t.formulae-sequencest\displaystyle\rm{s.t.}roman_s . roman_t . 𝐒𝐯=𝟎𝐒𝐯0\displaystyle\quad{\rm\mathbf{S}}~{}\mathbf{v}=\mathbf{0}bold_S bold_v = bold_0 (a)a\displaystyle\qquad\qquad\rm{(a)}( roman_a )
𝐁j𝐯𝐛jsubscript𝐁𝑗𝐯subscript𝐛𝑗\displaystyle\quad{\rm\mathbf{B}}_{j}~{}\mathbf{v}\leq\mathbf{b}_{j}bold_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT bold_v ≤ bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (b)b\displaystyle\qquad\qquad\rm{(b)}( roman_b )
vBM=v¯BM,jsubscript𝑣BMsubscript¯𝑣BM𝑗\displaystyle\quad v_{\rm BM}=\bar{v}_{\mathrm{BM},j}italic_v start_POSTSUBSCRIPT roman_BM end_POSTSUBSCRIPT = over¯ start_ARG italic_v end_ARG start_POSTSUBSCRIPT roman_BM , italic_j end_POSTSUBSCRIPT (c)c\displaystyle\qquad\qquad\rm{(c)}( roman_c )
𝐯𝟎𝐯0\displaystyle\quad\mathbf{v}\geq\mathbf{0}bold_v ≥ bold_0

Here, the objective 𝐚j𝐯superscriptsubscript𝐚𝑗top𝐯\mathbf{a}_{j}^{\top}\mathbf{v}bold_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_v is the total enzyme cost for this condition, v¯BM,jsubscript¯𝑣BM𝑗\bar{v}_{\mathrm{BM},j}over¯ start_ARG italic_v end_ARG start_POSTSUBSCRIPT roman_BM , italic_j end_POSTSUBSCRIPT is the experimentally measured growth rate for the j𝑗jitalic_jth growth condition, 𝐒M×N𝐒superscriptMN{\rm\mathbf{S}}\in\mathbb{R}^{\rm{M}\times\rm{N}}bold_S ∈ blackboard_R start_POSTSUPERSCRIPT roman_M × roman_N end_POSTSUPERSCRIPT is the stoichiometric matrix of the network and 𝐁jP×Nsubscript𝐁𝑗superscriptPN{\rm\mathbf{B}}_{j}\in\mathbb{R}^{\rm{P}\times\rm{N}}bold_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT roman_P × roman_N end_POSTSUPERSCRIPT and 𝐛jPsubscript𝐛𝑗superscriptP\mathbf{b}_{j}\in\mathbb{R}^{\rm{P}}bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT roman_P end_POSTSUPERSCRIPT are a matrix and a vector, respectively, encoding any desired upper bound (or positive lower bound) on the fluxes for the j𝑗jitalic_jth growth condition. Noting that constraint 22c can be equivalently cast as a double inequality, we rewrite the problem in the more general form:

𝐯j=argmin𝐯superscriptsubscript𝐯𝑗subscriptargmin𝐯\displaystyle\mathbf{v}_{j}^{*}=\operatorname*{arg\,min}_{\mathbf{v}}bold_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT bold_v end_POSTSUBSCRIPT 𝐚j𝐯superscriptsubscript𝐚𝑗top𝐯\displaystyle\quad\mathbf{a}_{j}^{\top}\mathbf{v}bold_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_v (23)
s.t.formulae-sequencest\displaystyle\rm{s.t.}roman_s . roman_t . 𝐒𝐯=𝟎𝐒𝐯0\displaystyle\quad{\rm\mathbf{S}}\mathbf{v}=\mathbf{0}bold_Sv = bold_0 (a)a\displaystyle\qquad\qquad\rm{(a)}( roman_a )
𝐁^j𝐯𝐛^jsubscript^𝐁𝑗𝐯subscript^𝐛𝑗\displaystyle\quad\hat{{\rm\mathbf{B}}}_{j}\mathbf{v}\leq\hat{\mathbf{b}}_{j}over^ start_ARG bold_B end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT bold_v ≤ over^ start_ARG bold_b end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (b)b\displaystyle\qquad\qquad\rm{(b)}( roman_b )
𝐯𝟎𝐯0\displaystyle\quad\mathbf{v}\geq\mathbf{0}bold_v ≥ bold_0

where the biomass flux is assumed to be the last component of the flux vector and the augmented matrices:

𝐁^j(𝐁𝐣[0,,1][0,,1])𝐛^j(𝐛jv¯BM,kv¯BM,k)formulae-sequencesubscript^𝐁𝑗matrixsubscript𝐁𝐣0101subscript^𝐛𝑗matrixsubscript𝐛𝑗subscript¯𝑣BM𝑘subscript¯𝑣BM𝑘\hat{{\rm\mathbf{B}}}_{j}\equiv\begin{pmatrix}{\rm\mathbf{B_{j}}}\\ [0,\cdots,\hphantom{-}1]\\ [0,\cdots,-1]\end{pmatrix}\qquad\hat{\mathbf{b}}_{j}\equiv\begin{pmatrix}% \mathbf{b}_{j}\\ \bar{v}_{\mathrm{BM},k}\\ -\bar{v}_{\mathrm{BM},k}\end{pmatrix}over^ start_ARG bold_B end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≡ ( start_ARG start_ROW start_CELL bold_B start_POSTSUBSCRIPT bold_j end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL [ 0 , ⋯ , 1 ] end_CELL end_ROW start_ROW start_CELL [ 0 , ⋯ , - 1 ] end_CELL end_ROW end_ARG ) over^ start_ARG bold_b end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≡ ( start_ARG start_ROW start_CELL bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL over¯ start_ARG italic_v end_ARG start_POSTSUBSCRIPT roman_BM , italic_k end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL - over¯ start_ARG italic_v end_ARG start_POSTSUBSCRIPT roman_BM , italic_k end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) (24)

were introduced. Finally, from (21) we note that the solution of problem (23) is independent of the choice of the coefficient σ𝜎\sigmaitalic_σ, as this merely amounts to a scaling of the objective function. Hence, without loss of generality, we can simply set σ=1𝜎1\sigma=1italic_σ = 1 for all growth conditions in this step and fit a condition-specific value for this parameter in the next half of the procedure. Solving (23 for each of the J𝐽Jitalic_J growth conditions available in the experimental dataset, we obtain a set of optimal flux distributions 𝒱={𝐯1,,𝐯J}superscript𝒱superscriptsubscript𝐯1superscriptsubscript𝐯𝐽\mathcal{V^{*}}=\{\mathbf{v}_{1}^{*},\cdots,\mathbf{v}_{J}^{*}\}caligraphic_V start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = { bold_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , ⋯ , bold_v start_POSTSUBSCRIPT italic_J end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT }, which we will use as a reference in the next step.

A.5.2 Adjusting turnover numbers for the reference set of flux distributions

We now turn to fit the turnover and saturation parameters parameters against experimental measurements of enzyme abundances. For this purpose, we denote with 𝐩𝐩\mathbf{p}bold_p a vector of log10 turnover parameters (one for each enzyme-reaction pair in the model) and with 𝐬𝐬\mathbf{s}bold_s a vector of condition-specific log10-scaling factors (i.e. pi=log10(kcat,i)subscript𝑝𝑖subscript10subscript𝑘catip_{i}=\log_{10}({k_{\rm{cat},i}})italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( italic_k start_POSTSUBSCRIPT roman_cat , roman_i end_POSTSUBSCRIPT ) and sj=log10(σj)subscript𝑠𝑗subscript10subscript𝜎𝑗s_{j}=\log_{10}({\sigma_{j}})italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )). Further, we denote with 𝐩¯¯𝐩\bar{\mathbf{p}}over¯ start_ARG bold_p end_ARG the vector of original log-turnover numbers (i.e. the one used to obtain the reference flux distributions). For the choice of reference flux distribution computed in the previous step, the abundance of the i𝑖iitalic_ith enzyme in condition j𝑗jitalic_j, eijsubscript𝑒𝑖𝑗e_{ij}italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT, is then a function of 𝐩𝐩\mathbf{p}bold_p and 𝐬𝐬\mathbf{s}bold_s:

eij(𝐩,𝐬)=kakjvkjsubscript𝑒𝑖𝑗𝐩𝐬subscript𝑘subscript𝑎𝑘𝑗subscriptsuperscript𝑣𝑘𝑗e_{ij}(\mathbf{p},\mathbf{s})=\sum_{k}a_{kj}v^{*}_{kj}italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( bold_p , bold_s ) = ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_k italic_j end_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k italic_j end_POSTSUBSCRIPT (25)

where the summation index k𝑘kitalic_k runs across all reactions catalysed by enzyme i𝑖iitalic_i and the flux cost of reaction k𝑘kitalic_k in condition j𝑗jitalic_j, akjsubscript𝑎𝑘𝑗a_{kj}italic_a start_POSTSUBSCRIPT italic_k italic_j end_POSTSUBSCRIPT, is computed as in (21). From the formulation of the linear program (23), it is clear that there must exist a region 𝒮jsubscript𝒮𝑗\mathcal{S}_{j}caligraphic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT of log-turnover parameter space such that 𝐩¯𝒮j¯𝐩subscript𝒮𝑗\bar{\mathbf{p}}\in\mathcal{S}_{j}over¯ start_ARG bold_p end_ARG ∈ caligraphic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and that, for every 𝐩𝒮j𝐩subscript𝒮𝑗\mathbf{p}\in\mathcal{S}_{j}bold_p ∈ caligraphic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, 𝐯j𝒱superscriptsubscript𝐯𝑗superscript𝒱\mathbf{v}_{j}^{*}\in\mathcal{V}^{*}bold_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∈ caligraphic_V start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the optimal solution of problem (23) for its growth condition. Hence, in this step of the adjustment procedure, we aim to find a set of adjusted log turnover parameters, 𝐩superscript𝐩\mathbf{p}^{*}bold_p start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and log scaling factors, 𝐬superscript𝐬\mathbf{s}^{*}bold_s start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT by minimising the discrepancy between enzyme abundance predictions and measurements, constraining our search to this region of the parameter space 𝒮j𝒮j𝒮subscript𝑗subscript𝒮𝑗\mathcal{S}\equiv\bigcap\limits_{j}\mathcal{S}_{j}caligraphic_S ≡ ⋂ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT caligraphic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT where all reference flux distributions are optimal for their respective growth condition:

(𝐩,𝐬)=argmin𝐩,𝐬superscript𝐩superscript𝐬subscriptargmin𝐩𝐬\displaystyle(\mathbf{p}^{*},\mathbf{s}^{*})=\operatorname*{arg\,min}_{\mathbf% {p},~{}\mathbf{s}}( bold_p start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , bold_s start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT bold_p , bold_s end_POSTSUBSCRIPT 1Nei,j[l(eij)l(e¯ij)]2+ρNpi(pip¯i)21subscript𝑁esubscript𝑖𝑗superscriptdelimited-[]𝑙subscript𝑒𝑖𝑗𝑙subscript¯𝑒𝑖𝑗2𝜌subscript𝑁psubscript𝑖superscriptsubscript𝑝𝑖subscript¯𝑝𝑖2\displaystyle\quad\frac{1}{N_{\rm e}}\sum_{i,j}~{}\big{[}~{}l(e_{ij})-l(\bar{e% }_{ij})~{}\big{]}^{2}+\frac{\rho}{N_{\rm p}}\sum_{i}\big{(}~{}p_{i}-\bar{p}_{i% }~{}\big{)}^{2}divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_e end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT [ italic_l ( italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) - italic_l ( over¯ start_ARG italic_e end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG italic_ρ end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (26)
s.t.formulae-sequencest\displaystyle\rm{s.t.}roman_s . roman_t . 𝐥𝐩𝐩¯𝐮𝐥𝐩¯𝐩𝐮\displaystyle\quad\mathbf{l}\leq\mathbf{p}-\bar{\mathbf{p}}\leq\mathbf{u}bold_l ≤ bold_p - over¯ start_ARG bold_p end_ARG ≤ bold_u
𝐩𝒮𝐩𝒮\displaystyle\quad\mathbf{p}\in\mathcal{S}bold_p ∈ caligraphic_S

where Nesubscript𝑁eN_{\rm e}italic_N start_POSTSUBSCRIPT roman_e end_POSTSUBSCRIPT is the number of enzyme-condition pairs, Npsubscript𝑁pN_{\rm p}italic_N start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT is the number of turnover parameters, e¯ijsubscript¯𝑒𝑖𝑗\bar{e}_{ij}over¯ start_ARG italic_e end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is the experimental measurement of enzyme i𝑖iitalic_i in condition j𝑗jitalic_j, 𝐥𝐥\mathbf{l}bold_l and 𝐮𝐮\mathbf{u}bold_u are bounds on the allowable adjustment, ρ>0𝜌0\rho>0italic_ρ > 0 is a scalar hyperparameter, and the function l()𝑙l(\cdot)italic_l ( ⋅ ) is defined as:

l(x)={log10(x)x>00x=0𝑙𝑥casessubscript10𝑥𝑥00𝑥0l(x)=\begin{cases}\log_{10}(x)\quad&x>0\\ 0\quad&x=0\end{cases}italic_l ( italic_x ) = { start_ROW start_CELL roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( italic_x ) end_CELL start_CELL italic_x > 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_x = 0 end_CELL end_ROW (27)

The objective function of the nonlinear program (26) is a combination of two terms. The first term penalises the mean squared deviation between measurements and predictions of log-enzyme abundance. Note that the above definition of l()𝑙l(\cdot)italic_l ( ⋅ ) implies that, for each condition, only enzymes with nonzero predicted abundance are included in this term. The second term is a regularisation expression (whose strength is controlled by the hyperparameter ρ𝜌\rhoitalic_ρ) penalising mean squared deviation from the original parameter set. The latter term mainly serves two purposes: first, it ensures that, whenever a turnover parameter is ”free” in the problem (which can happen if its associated reaction fluxes are 00 across all conditions, or if no experimental measurements are available for its associated enzyme), it will be kept at its original value; secondly, it provides a mean to tune the strength of the adjustment procedure.

In order to define the region 𝒮𝒮\mathcal{S}caligraphic_S, we shall exploit the sufficient optimality conditions of LP (23). Introducing, for each growth condition, the vectors of dual variables, 𝝀jMsubscript𝝀𝑗superscript𝑀\boldsymbol{\lambda}_{j}\in\mathbb{R}^{M}bold_italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT and 𝝁jPsubscript𝝁𝑗superscript𝑃\boldsymbol{\mu}_{j}\in\mathbb{R}^{P}bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT, corresponding to constraints 23b and 23c, respectively, we note that problem 23 admits (for the j𝑗jitalic_jth growth condition) a dual problem in the form:

max𝝀j,𝝁jsubscriptsubscript𝝀𝑗subscript𝝁𝑗\displaystyle\max_{\boldsymbol{\lambda}_{j},~{}\boldsymbol{\mu}_{j}}roman_max start_POSTSUBSCRIPT bold_italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT 𝐛j𝝁jsuperscriptsubscript𝐛𝑗topsubscript𝝁𝑗\displaystyle\quad-\mathbf{b}_{j}^{\top}\boldsymbol{\mu}_{j}- bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (28)
s.t.formulae-sequencest\displaystyle\rm{s.t.}roman_s . roman_t . 𝐒𝝀j+𝐁j𝝁j+𝐚𝟎superscript𝐒topsubscript𝝀𝑗superscriptsubscript𝐁𝑗topsubscript𝝁𝑗𝐚0\displaystyle\quad{\rm\mathbf{S}}^{\top}\boldsymbol{\lambda}_{j}~{}+{\rm% \mathbf{B}}_{j}^{\top}\boldsymbol{\mu}_{j}+\mathbf{a}\geq\mathbf{0}bold_S start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + bold_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + bold_a ≥ bold_0 (a)a\displaystyle\qquad\qquad\rm{(a)}( roman_a )
𝝁j𝟎subscript𝝁𝑗0\displaystyle\quad\boldsymbol{\mu}_{j}\geq\mathbf{0}bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≥ bold_0

A well-known result in linear programming duality theory [maranas_zomorrodi_2016] states that a flux distribution is the optimal solution of the primal problem 23, if, and only if, the dual problem 28 is feasible (dual feasibility) and its optimal objective coincides with the primal optimal objective, that is 𝐚j𝐯j=𝐛j𝝁jsuperscriptsubscript𝐚𝑗topsuperscriptsubscript𝐯𝑗superscriptsubscript𝐛𝑗topsubscript𝝁𝑗\mathbf{a}_{j}^{\top}\mathbf{v}_{j}^{*}=-\mathbf{b}_{j}^{\top}\boldsymbol{\mu}% _{j}bold_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = - bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (strong duality). Taken together, dual feasibility and strong duality thus define the region of optimality in turnover parameter space of each reference flux distribution. Hence, we can integrate the above definition of 𝒮𝒮\mathcal{S}caligraphic_S within problem 26 by introducing the two vectors of dual variables (𝝀jsubscript𝝀𝑗\boldsymbol{\lambda}_{j}bold_italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and 𝝁jsubscript𝝁𝑗\boldsymbol{\mu}_{j}bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT) for each condition as additional optimisation variables, and simultaneously enforcing the optimality for each reference distribution. By doing this, we obtain the final formulation of the nonlinear program for turnover number adjustment, which we solved to obtain the results shown in the main text:

(𝐩,𝐬)=argmin𝐩,𝐬superscript𝐩superscript𝐬subscriptargmin𝐩𝐬\displaystyle(\mathbf{p}^{*},\mathbf{s}^{*})=\operatorname*{arg\,min}_{\mathbf% {p},~{}\mathbf{s}}( bold_p start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , bold_s start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT bold_p , bold_s end_POSTSUBSCRIPT 1Nei,j[l(eij)l(e¯ij)]2+ρNpi(pip¯i)21subscript𝑁esubscript𝑖𝑗superscriptdelimited-[]𝑙subscript𝑒𝑖𝑗𝑙subscript¯𝑒𝑖𝑗2𝜌subscript𝑁psubscript𝑖superscriptsubscript𝑝𝑖subscript¯𝑝𝑖2\displaystyle\quad\frac{1}{N_{\rm e}}\sum_{i,j}~{}\big{[}~{}l(e_{ij})-l(\bar{e% }_{ij})~{}\big{]}^{2}+\frac{\rho}{N_{\rm p}}\sum_{i}\big{(}~{}p_{i}-\bar{p}_{i% }~{}\big{)}^{2}divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_e end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT [ italic_l ( italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) - italic_l ( over¯ start_ARG italic_e end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG italic_ρ end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_p end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over¯ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (29)
s.t.formulae-sequencest\displaystyle\rm{s.t.}roman_s . roman_t . 𝐥𝐩𝐩¯𝐮𝐥𝐩¯𝐩𝐮\displaystyle\quad\mathbf{l}\leq\mathbf{p}-\bar{\mathbf{p}}\leq\mathbf{u}bold_l ≤ bold_p - over¯ start_ARG bold_p end_ARG ≤ bold_u
𝐒𝝀j+𝐁j𝝁j+𝐚j𝟎superscript𝐒topsubscript𝝀𝑗superscriptsubscript𝐁𝑗topsubscript𝝁𝑗subscript𝐚𝑗0\displaystyle\quad{\rm\mathbf{S}}^{\top}\boldsymbol{\lambda}_{j}~{}+{\rm% \mathbf{B}}_{j}^{\top}\boldsymbol{\mu}_{j}+\mathbf{a}_{j}\geq\mathbf{0}bold_S start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_λ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + bold_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + bold_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≥ bold_0 j=1,,J𝑗1𝐽\displaystyle\quad j=1,\dotsc,Jitalic_j = 1 , … , italic_J
𝝁j𝟎subscript𝝁𝑗0\displaystyle\quad\boldsymbol{\mu}_{j}\geq\mathbf{0}bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≥ bold_0 j=1,,J𝑗1𝐽\displaystyle\quad j=1,\dotsc,Jitalic_j = 1 , … , italic_J
𝐚j𝐯j=𝐛j𝝁jsuperscriptsubscript𝐚𝑗topsuperscriptsubscript𝐯𝑗superscriptsubscript𝐛𝑗topsubscript𝝁𝑗\displaystyle\quad\mathbf{a}_{j}^{\top}\mathbf{v}_{j}^{*}=-\mathbf{b}_{j}^{% \top}\boldsymbol{\mu}_{j}bold_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = - bold_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT j=1,,J𝑗1𝐽\displaystyle\quad j=1,\dotsc,Jitalic_j = 1 , … , italic_J

A.5.3 Conversion to apparent turnover numbers

The above procedure produces (after conversion back to a linear scale) a set of adjusted turnover numbers, 𝐤catsuperscriptsubscript𝐤cat\mathbf{k}_{\rm cat}^{*}bold_k start_POSTSUBSCRIPT roman_cat end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and scalings, 𝝈superscript𝝈\boldsymbol{\sigma}^{*}bold_italic_σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. In order to facilitate the physical interpretation of the resulting parameter set, we express the fitted scalings for the j𝑗jitalic_jth condition, σjsuperscriptsubscript𝜎𝑗\sigma_{j}^{*}italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT as the product of two terms:

σjσ¯σ^jsuperscriptsubscript𝜎𝑗superscript¯𝜎superscriptsubscript^𝜎𝑗\sigma_{j}^{*}\equiv\bar{\sigma}^{*}~{}\hat{\sigma}_{j}^{*}italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ≡ over¯ start_ARG italic_σ end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT (30)

Here, σ¯superscript¯𝜎\bar{\sigma}^{*}over¯ start_ARG italic_σ end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the geometric mean of the scalings across conditions, which we interpret as typical enzyme saturation level, while σ^jsuperscriptsubscript^𝜎𝑗\hat{\sigma}_{j}^{*}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is a residual scaling factor fluctuating around 1111 which required to account for differences in total measured enzymes across conditions. Based on this reparametrisation of the scaling, we now define a set of adjusted apparent turnover numbers, 𝐤catappsuperscriptsubscript𝐤catapp\mathbf{k}_{\rm cat}^{\rm app}bold_k start_POSTSUBSCRIPT roman_cat end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_app end_POSTSUPERSCRIPT, which now incorporate saturation information, as:

𝐤catappσ¯𝐤catsuperscriptsubscript𝐤catappsuperscript¯𝜎superscriptsubscript𝐤cat\mathbf{k}_{\rm cat}^{\rm app~{}*}\equiv\bar{\sigma}^{*}~{}\mathbf{k}_{\rm cat% }^{*}bold_k start_POSTSUBSCRIPT roman_cat end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_app ∗ end_POSTSUPERSCRIPT ≡ over¯ start_ARG italic_σ end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT bold_k start_POSTSUBSCRIPT roman_cat end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT (31)

A.5.4 Potential extensions

We conclude this section by noting that procedure described above assumes that the original parameter set is sufficiently ”good” to produce a realistic flux distribution to use as a reference for the adjustment step. If this is not the case, then a multi-start approach can be implemented, where multiple turnover parameter sets are first generated by perturbing the original parameter set, and each is used to generate a separate set of flux distributions. Each reference set is then provided as an input to problem (29), and the solution achieving the lowest objective chosen in the end. The perturbed parameter set may be generated either randomly (for example, by adding log-normal noise to the original turnover parameter vector) or systematically. The latter could be achieved, for example, by identifying reactions with zero-predicted flux but high measured abundance of the associated enzyme. By systematically increasing the corresponding turnover number, one can ”encourage” these reactions to be included in the reference set, potentially leading to the exploration of more relevant reference set than achieved by random perturbation.

Note that, for the purposes of this work, we limited ourselves to the original parameter set and thus did not explore this potential heuristic.

Appendix B Supplementary Figures

Refer to caption
Supplementary Figure 1: Biosynthesis of amino acids from core metabolism precursors in iCH360. The Escher maps (including Suppl. Fig. 2 to 4) are deposited along with the code.
Refer to caption
Supplementary Figure 2: Biosynthesis of pyrimidine and purine (deoxy)ribonucleotides from in iCH360.
Refer to caption
Supplementary Figure 3: Biosynthesis of saturated and unsaturated fatty acids from acetyl-CoA. The map for the saturated fatty acids is taken from Escher [king_escher_2015].
Refer to caption
Supplementary Figure 4: Metabolism of one-carbon compounds in iCH360.
Refer to caption
Supplementary Figure 5: Comparison of model size between ECC, ECC2, iCH360 and iML1515. To allow for a fair comparison, pseudo-reactions (e.g. exchange reactions) were excluded from the count. Note that gene annotation were not available in the SBML model of ECC2 accompanying its publication[hadicke_ecolicore2_2017].
Refer to caption
Supplementary Figure 6: Maximal biomass fluxes achieved by iCH360 and the parent model iML1515 for aerobic and anaerobic growth across multiple carbon sources. In all cases, the substrate uptake flux was bounded to 10101010 mmol/gDW/h.
Refer to caption
Supplementary Figure 7: Comparison of production envelopes between iCH360 and the parent model iML1515. Top: production of ethanol, acetate and succinate underaerobic growth on lactate. Bottom: production of ethanol, acetate, lactate and succinate under aerobic growth on glycerol. Note, that the dashed line representing the production envelope of iML1515 is sometimes hidden behind the blue line corresponding to iCH360.
Refer to caption
Supplementary Figure 8: Fitness loss associated with network disruptions classified as complete disruptions (disruption of all catalytic edges for a reaction) and full primary disruption (disruption of all primary catalytic associations, but with remaining secondary ones) are not significantly different (Wilcoxon rank-sum test, p=0.063𝑝0.063p=0.063italic_p = 0.063).
Refer to caption
Supplementary Figure 9: Proteome allocation predictions across conditions obtained with the turnover parameter set from [heckmann_kinetic_2020]. The bottom-right panel shows the geometric mean of measurements and predictions across conditions.
Refer to caption
Supplementary Figure 10: Turnover parameters used in EC-iCH360 before and after the adjustment procedure (see section 2.5 and Methods in the main text, as well as Supplementary Information A.5 for details). The original parameter set was directly parsed from [heckmann_kinetic_2020].
Refer to caption
Supplementary Figure 11: Impact of the magnitude of the ridge regularisation coefficient (ρ𝜌\rhoitalic_ρ in Eq. 29) on the outcome of the turnover number adjustment procedure. The blue curve represents the RMSE between measured and predicted log-enzyme abundances and decreases monotonically as less less regularisation is applied to the problem. The red curve represents the mean absolute deviation between original and adjusted log-turnover numbers, which follows the opposite trend and converges to 00 as more and more regularisation is applied.
Refer to caption
Supplementary Figure 12: Comparison of production envelopes for growth on glucose between iCH360 and the reduced variant iCH360red, used for elementary flux mode enumeration and analysis. The two models have virtually identical solution spaces for the products and growth conditions shown.
Refer to caption
Supplementary Figure 13: Flux distribution in the network corresponding to the maximum yield elementary flux mode (see main text, section 2.6). The mode is purely respiratory, with no excretion of typical fermentation by-products such as acetate, ethanol, or lactate. The graphics was produced in Escher [king_escher_2015].
Refer to caption
Supplementary Figure 14: Flux distribution in the network corresponding to the maximum growth elementary flux mode (see main text, section 2.6). The mode shows a mixed respiratory/fermentative phenotype, with significant acetate excretion, in line with typical observations of overflow metabolism. The graphics was produced in Escher [king_escher_2015].
Refer to caption
Supplementary Figure 15: EFM growth-yield tradeoff front under simulated low oxygen conditions, obtained by setting 1000-fold higher cost for the cytochrome reactions. The higher enzyme cost is used to mimic the growth condition in which the oxygen-consuming reaction must operate at very low enzyme saturation, thus requiring higher enzyme-mass investment per unit flux. Black squares denote Pareto optimal EFMs. The pareto front is much broader than the one obtained with the original enzyme cost for the oxygen-consuming reactions (see figure 5 in main text).
Refer to caption
Supplementary Figure 16: satFBA results (corresponding to Figure 6 in the main text) obtained by enforcing a lower bound on the ATP maintenance flux equal to 6.866.866.866.86 mmol/gDW/h (taken directly from the parent model iML1515). In this case, the optimal solution to the EFM is no longer an EFM, as evident from the yield profile (C), which is not piecewise constant with respect to the external glucose concentration.
Refer to caption
Supplementary Figure 17: Empirical cumulative distribution functions (CDF) of measured enzyme abundances for reactions with predicted flux force efficacy above or below 50%percent5050\%50 % (blue and red curves, respectively). The two are significantly different (p<0.001𝑝0.001p<0.001italic_p < 0.001, two-sided Wilcox rank-sum test), with the low-efficacy group being associated to an approximately 3333-fold higher median median enzyme abundance than the high efficacy group.
Refer to caption
Supplementary Figure 18: Compressed metabolic map of iCH360, wherein linear pathways longer than two reactions were lumped together. The map was produced in Escher [king_escher_2015] and can thus be use to visualise fluxes, metabolite concentration, and gene expression data.

Appendix C Supplementary Tables

Supplementary Table 1: Description of node types in the knowledge graph supporting the stoichiometric model. The IDs used in the ”Example” column refer to node identifiers in the graph.
\RaggedRight Node type \RaggedRight Node subtype Description Example
\RaggedRight reaction \RaggedRight a mass balanced chemical or biochemical reaction bigg:GAPD
\RaggedRight protein \RaggedRight polypeptide a single polypeptide coded by a gene GAPDH-A-MONOMER
\RaggedRight \RaggedRight multimer a complex formed by stoichiometric binding of different polypeptides and/or other complexes GAPDH-A-CPLX
\RaggedRight \RaggedRight modified protein a polypeptide or multimer which underwent post-translational modification LIPOYL-GCVH
\RaggedRight gene \RaggedRight a gene b1779
\RaggedRight metabolite \RaggedRight an organic or inorganic molecule L-ASPARTATE
\RaggedRight logical-AND/
logical-OR
\RaggedRight used as intermediates node to encode arbitrary logical rules in GPRs generated thorough the graph (e.g. one of two proteins being required by another node) THIOREDOXINS
Supplementary Table 2: Classification and properties of edge types (and biological meaning of their associated weights, when applicable) in the iCH360 graph data structure supporting the stoichiometric model.
Edge type parent node type(s) child node type(s) description \RaggedRight weight example
catalysis reaction protein a catalytic relationship between a reaction and an enzyme which catalyzes it. Further annotated as either primary or secondary catalysis (see main text and Methods) \RaggedRight NA bigg:PFK \rightarrow 6PFK-1-CPX (primary)
bigg:PFK \rightarrow 6PFK-2-CPX (secondary)
subunit composition protein protein indicates that the child node is a subunit of the parent node \RaggedRight the stoichiometry of the subunit in the complex. FABA-CPLX22\hskip 1.42271pt\xrightarrow{2}\hskip 1.42271ptstart_ARROW over2 → end_ARROWFABA-MONOMER
accessory component protein protein Indicates that the child protein is an accessory subunit of the parent protein, i.e. can be part of the complex but is not strictly required for correct function of the complex. \RaggedRight the stoichiometry of the subunit in the complex. ATPSYN-CPLX11\hskip 1.42271pt\xrightarrow{1}\hskip 1.42271ptstart_ARROW over1 → end_ARROWEG10106-MONOMER
protein modification protein protein Indicates that the parent protein is a obtained by post-translational modification of the child protein. \RaggedRight NA PYRUVFORMLY-CPLX\hskip 1.42271pt\rightarrow\hskip 1.42271ptPYRUVFORMLY-INACTIVE-CPLX
protein modification requirement protein protein indicates that the child protein is required to accomplish the post-translational modification leading to the parent protein. \RaggedRight NA PYRUVFORMLY-CPLX\hskip 1.42271pt\rightarrow\hskip 1.42271ptPFLACTENZ-MONOMER
non-catalytic requirement reaction protein Indicates that the child protein is required by the parent reaction, although not as a catalyst. Typical examples include proteins used as cofactors (e.g. glutaredoxins) in the reaction or featuring as prosthetic groups for a metabolite involved in the reaction (e.g. Acyl-Carrier-protein) \RaggedRight NA bigg:ACOATA\hskip 1.42271pt\rightarrow\hskip 1.42271ptACP-MONOMER
putative association reaction protein Indicates that a putative association between the reaction and the protein has been proposed in literature \RaggedRight NA bigg:PFL\hskip 1.42271pt\rightarrow\hskip 1.42271ptEG11910-MONOMER
inactive catalysis reaction protein Indicates that the child protein is an enzyme for the parent reaction, but it’s inactive in this strain due e.g. to a frameshift mutation. \RaggedRight NA bigg:ACHBS\hskip 1.42271pt\rightarrow\hskip 1.42271ptACETOLACTSYNII-CPLX
regulatory interaction protein metabolite, protein Indicate that the child metabolite or protein is a regulator for the enzyme. Information about the regulation mode (activation vs inhibition), the regulation mechanism (competitive vs allosteric) and the regulated reaction (if the enzyme catalyses multiple) is provided as edge metadata whenever available. \RaggedRight If present, denotes the activation/inhibition constant for the interaction as reported in EcoCyc. SHIKIMATE160.0μM160.0𝜇M\hskip 1.42271pt\xrightarrow{\rm 160.0\mu M}\hskip 1.42271ptstart_ARROW start_OVERACCENT 160.0 italic_μ roman_M end_OVERACCENT → end_ARROWAROE-MONOMER
coding relation protein gene Indicate that the child gene codes for the parent polypeptide \RaggedRight NA RIBOKIN-MONOMER\hskip 1.42271pt\rightarrow\hskip 1.42271ptb3752
Supplementary Table 3: Manual curation of the reaction pruning process used to construct iCH360red. Each reaction set represents a set of alternative routes for the production of the same metabolite (but using, for example, different cofactors). For each set, the most physiologically relevant option, based on available literature, was preserved in the reduced model variant.
Reaction set Pruned in iCH360red Notes
EAR(n)x, EAR(n)y * EAR(n)y Enzyme FabI can work with both NADH/NADPH, but higher activity was found with NADH [bergler_enoyl-acyl-carrier-protein_1996]
ACOATA, KAS14, KAS15 ACOATA, KAS14 Initiation of fatty acid biosynthesis can occur by either direct condensation of acetyl-CoA with malonyl-ACP (KAS15) or by transacylation of acetyl-CoA followed by condensation with malonyl-ACP (ACOATA + KAS15). Because the transacylacylase activity of FabH (ACOATA) has been to be significantly lower than its condensation activity (KAS15) [tsay_isolation_1992], only the former pathway is maintained in iCH360red.
VALTA, VPAMTr VPAMTr These are both routes to production of valine. We keep VALTA (ilvE) as it is the last step in the canonical valine biosynthesis route.
RNDR1, RNDR2, RNDR1b, RNDR2b RNDR1b, RNDR2b The ribonucleoside diphosphate reductase can work both with the thioredoxin and glutaredoxin redox systems. iCH360 retains only the thioredoxin version.
SULabcpp, SO4t2pp SO4t2pp SULabcpp is an ATP-mediated active transport of sulfate in the cell via an ATP-binding-cassette (ABC) transporter [sirko_sulfate_1995], while SO42tpp (cysZ) is a proton symporter [zhang_escherichia_2014]. We maintain the former as its impairment was shown to lead to cysteine auxotrophy [sirko_sulfate_1995].

* n(60,80,100,120,140,160,180,121,141,161,181)𝑛6080100120140160180121141161181n\in(60,80,100,120,140,160,180,121,141,161,181)italic_n ∈ ( 60 , 80 , 100 , 120 , 140 , 160 , 180 , 121 , 141 , 161 , 181 )

Supplementary Table 4: Numbers of elementary flux modes enumerated for the reduced model variant iCH360red under different growth conditions. Numbers in brackets represent the numbers of EFMs after filtering. Filtered modes include only those supporting Biomass flux and, for aerobic conditions, those that a) have nonzero oxygen uptake and b) do not use either of three reactions (PFL, DHORD5, FRD2), which are known to be only physiologically active under anaerobic conditions
number of EFMs (filtered)
Aerobic Anaerobic
Glucose 13468719 (1035696)) 204028 (195670))
Pyruvate 1763631 (135266)) 6949 (6480))
Glycerol 922217 (82112)) NA
Acetate 38099 (7596) NA
Lactate 1270315 (5897) ) 1497 (1424))