Cancer evolution lays the groundwork for predictive oncology. Testing evolutionary metrics requires quantitative measurements in controlled clinical trials. We mapped genomic intratumor heterogeneity in locally advanced prostate cancer using 642 samples from 114 individuals enrolled in clinical trials with a 12-year median follow-up. We concomitantly assessed morphological heterogeneity using deep learning in 1,923 histological sections from 250 individuals. Genetic and morphological (Gleason) diversity were independent predictors of recurrence (hazard ratio (HR) = 3.12 and 95% confidence interval (95% CI) = 1.34–7.3; HR = 2.24 and 95% CI = 1.28–3.92). Combined, they identified a group with half the median time to recurrence. Spatial segregation of clones was also an independent marker of recurrence (HR = 2.3 and 95% CI = 1.11–4.8). We identified copy number changes associated with Gleason grade and found that chromosome 6p loss correlated with reduced immune infiltration. Matched profiling of relapse, decades after diagnosis, confirmed that genomic instability is a driving force in prostate cancer progression. This study shows that combining genomics with artificial intelligence-aided histopathology leads to the identification of clinical biomarkers of evolution.
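As a reading aid for the hazard ratios above: a Wald-type confidence interval is symmetric on the log scale, so the point estimate should sit near the geometric mean of its bounds. The sketch below assumes the reported intervals were built that way (the abstract does not say) and recovers the standard error of log(HR) that each interval would imply. The function names and the pairing of each HR with a label (taken from the order of mention) are illustrative assumptions.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def implied_log_se(lower, upper, z=Z_95):
    """SE of log(HR) implied by a CI, ASSUMING a Wald interval on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def geometric_midpoint(lower, upper):
    """Value on which a log-symmetric interval is centred."""
    return math.sqrt(lower * upper)


# (reported HR, lower bound, upper bound), copied from the abstract.
# The labels follow the order in which the abstract lists the predictors.
REPORTED = {
    "genetic diversity": (3.12, 1.34, 7.3),
    "morphological diversity": (2.24, 1.28, 3.92),
    "spatial segregation": (2.3, 1.11, 4.8),
}

for label, (hr, lower, upper) in REPORTED.items():
    mid = geometric_midpoint(lower, upper)
    se = implied_log_se(lower, upper)
    print(f"{label}: reported HR={hr}, log-scale midpoint={mid:.2f}, SE(log HR)={se:.3f}")
```

Midpoints that land close to the reported HRs are consistent with log-symmetric intervals, but they do not establish how the study actually computed them.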
The patterns by which primary tumors spread to metastatic sites remain poorly understood. Here, we define patterns of metastatic seeding in prostate cancer (PCa) using a novel injection-based mouse model — EvoCaP (Evolution in Cancer of the Prostate), featuring aggressive metastatic cancer to bone, liver, lungs, and lymph nodes. To define migration histories between primary and metastatic sites, we used our EvoTraceR pipeline to track distinct tumor clones containing recordable barcodes. We detected widespread intratumoral heterogeneity from the primary tumor in metastatic seeding, with few clonal populations (CPs) instigating most migration. Metastasis-to-metastasis seeding was uncommon, as most cells remained confined within the tissue. Migration patterns in our model were congruent with human PCa seeding topologies. Our findings support the view of metastatic PCa as a systemic disease driven by waves of aggressive clones expanding their niche, infrequently overcoming constraints that otherwise keep them confined in the primary or metastatic site.
Background
Copy number alterations (CNAs) are genetic variations that cause an abnormal increase or decrease in the number of copies of a genomic region, and they are commonly detected in cancer. CNAs can affect various regions of the genome, including broad regions that encompass multiple genes, individual genes, or even non-coding RNA molecules of small size. CNAs contribute to tumorigenesis and can have a significant impact
A group of 27 patients diagnosed with metastatic triple-negative breast cancer (mTNBC) was randomly distributed into two groups that underwent different lines of metronomic treatment (mCHT). The first group (N = 14) received first-line mCHT and showed a higher overall survival rate than the second group (N = 13), which underwent second-line mCHT. One patient from the first group, diagnosed with mTNBC in 2019 and still alive, showed a complete metabolic response (CMR) after a composite approach involving first-line mCHT followed by second-line epirubicin and third-line nab-paclitaxel, and was chosen for subsequent molecular characterization. We found altered expression of the cancer stemness-associated gene NOTCH-1 and its corresponding protein. Additionally, we found changes in the expression of oncogenes, such as MYC and AKT, along with their respective proteins. Overall, our data suggest that a first-line treatment with mCHT followed by MTD might be effective by negatively regulating stemness traits usually associated with the emergence of drug resistance.
Polycythemia Vera (PV) is typically caused by V617F or exon 12 JAK2 mutations. Little is known about polycythemia cases where no JAK2 variants can be detected and no other causes identified. This condition is defined as idiopathic erythrocytosis (IE). We evaluated clinical-laboratory parameters of a cohort of 56 IE patients and determined their molecular profile at diagnosis with paired blood/buccal-DNA exome sequencing coupled with a high-depth targeted OncoPanel to identify a possible underlying germline or somatic cause. We demonstrated that most of our cohort (40/56: 71.4%) showed no evidence of clonal hematopoiesis, suggesting that IE is, in large part, a germline disorder. We identified 20 low mutation burden somatic variants (variant allelic fraction, VAF, < 10%) in only 14 (25%) patients, principally involving DNMT3A and TET2. Only 2 patients presented high mutation burden somatic variants, involving DNMT3A, TET2, ASXL1 and WT1. We identified recurrent germline variants in 42 (75%) patients, occurring mainly in the JAK/STAT, hypoxia and iron metabolism pathways, among them JAK3-V722I and HIF1A-P582S; a high fraction of patients (48.2%) also carried the H63D or C282Y variant of the homeostatic iron regulatory gene HFE. By generating cellular models, we showed that JAK3-V722I causes activation of the JAK-STAT5 axis and upregulation of EPAS1/HIF2A, while HIF1A-P582S causes suppression of hepcidin mRNA synthesis, suggesting a major role for these variants in the onset of IE.
SETBP1 mutations are found in various clonal myeloid disorders. However, it is unclear whether they can initiate leukemia, as SETBP1 mutations typically appear as later events during oncogenesis. To answer this question, we generated a mouse model expressing mutated SETBP1 in hematopoietic tissue: this model showed profound alterations in the differentiation program of hematopoietic progenitors and developed a myeloid neoplasm with megakaryocytic dysplasia, splenomegaly, and bone marrow fibrosis, prompting us to investigate SETBP1 mutations in a cohort of 36 triple-negative primary myelofibrosis (TN-PMF) cases. We identified two distinct subgroups, one carrying SETBP1 mutations and the other completely devoid of somatic variants. Clinically, a striking difference in disease aggressiveness was noted, with SETBP1-mutated patients showing a much worse clinical course. In contrast to myelodysplastic/myeloproliferative neoplasms, where SETBP1 mutations are mostly found as a late clonal event, single-cell clonal hierarchy reconstruction in three TN-PMF patients from our cohort revealed SETBP1 to be a very early event, suggesting that the phenotype of the different SETBP1+ disorders may be shaped by the opposite hierarchy of the same clonal SETBP1 variants.
In a first-of-its-kind study, we assessed the capabilities of large language models (LLMs) in making complex decisions in haematopoietic stem cell transplantation. The evaluation covered not only Generative Pre-trained Transformer 4 (GPT-4) but also other artificial intelligence models: PaLM 2 and Llama-2. Using detailed haematological histories that include clinical, molecular and donor data, we conducted a triple-blind survey to compare LLMs to haematology residents. We found that residents significantly outperformed LLMs (p = 0.02), particularly in transplant eligibility assessment (p = 0.01). Our triple-blind methodology aimed to mitigate potential biases in evaluating LLMs and revealed both their promise and limitations in deciphering complex haematological clinical scenarios.
The dominant mutational signature in colorectal cancer genomes is C > T deamination (COSMIC Signature 1) and, in a small subgroup, mismatch repair signature (COSMIC signatures 6 and 44). Mutations in common colorectal cancer driver genes are often not consistent with those signatures. Here we perform whole-genome sequencing of normal colon crypts from cancer patients, matched to a previous multi-omic tumour dataset. We analyse normal crypts that were distant vs adjacent to the cancer. In contrast to healthy individuals, normal crypts of colon cancer patients have a high incidence of pks+ (polyketide synthases) E. coli (Escherichia coli) mutational and indel signatures, and this is confirmed by metagenomics. These signatures are compatible with many clonal driver mutations detected in the corresponding cancer samples, including in chromatin modifier genes, supporting their role in early tumourigenesis. These results provide evidence that pks+ E. coli is a potential driver of carcinogenesis in the human gut.
Recurring sequences of genomic alterations occurring across patients can highlight repeated evolutionary processes with significant implications for predicting cancer progression. Leveraging the ever-increasing availability of cancer omics data, here we unveil cancer’s evolutionary signatures tied to distinct disease outcomes, representing “favored trajectories” of acquisition of driver mutations detected in patients with similar prognosis. We present a framework named ASCETIC (Agony-baSed Cancer EvoluTion InferenCe) to extract such signatures from sequencing experiments generated by different technologies such as bulk and single-cell sequencing data. We apply ASCETIC to (i) single-cell data from 146 myeloid malignancy patients and bulk sequencing from 366 acute myeloid leukemia patients, (ii) multi-region sequencing from 100 early-stage lung cancer patients, (iii) exome/genome data from 10,000+ Pan-Cancer Atlas samples, and (iv) targeted sequencing from 25,000+ MSK-MET metastatic patients, revealing subtype-specific single-nucleotide variant signatures associated with distinct prognostic clusters. Validations on several datasets underscore the robustness and generalizability of the extracted signatures.
Mantle-cell lymphoma (MCL) is a B-cell non-Hodgkin Lymphoma (NHL) with a poor prognosis, at high risk of relapse after conventional treatment. The MCL-associated tumour microenvironment (TME) is characterized by M2-like tumour-associated macrophages (TAMs), able to interact with cancer cells, providing tumour survival and resistance to immuno-chemotherapy. Likewise, monocyte-derived nurse-like cells (NLCs) present an M2-like profile and provide proliferation signals to chronic lymphocytic leukaemia (CLL), a B-cell malignancy sharing some biological and phenotypic features with MCL. Antibodies against TAMs targeted CD47, a ‘don't eat me’ signal (DEMs) able to quench phagocytosis by TAMs within the TME, with clinical effectiveness when combined with Rituximab in pretreated NHL. Recently, CD24 was identified as a valid DEMs in solid cancer. Since CD24 is expressed during B-cell differentiation, we investigated and identified consistent CD24 expression in MCL, CLL and primary human samples. Phagocytosis increased when M2-like macrophages were co-cultured with cancer cells, particularly in the case of paired DEMs blockade (i.e. anti-CD24 + anti-CD47) combined with Rituximab. Similarly, unstimulated CLL patient-derived NLCs provided increased phagocytosis when DEMs blockade occurred. Since high levels of CD24 were associated with worse survival in both MCL and CLL, anti-CD24-induced phagocytosis could be considered for future clinical use, particularly in association with other agents such as Rituximab.
Cancer patients show heterogeneous phenotypes and very different outcomes and responses even to common treatments, such as standard chemotherapy. This state of affairs has motivated the need for the comprehensive characterization of cancer phenotypes and fueled the generation of large omics datasets, comprising multiple omics data reported for the same patients, which might now allow us to start deciphering cancer heterogeneity and implement personalized therapeutic strategies. In this work, we analyzed four cancer types obtained from the latest efforts by The Cancer Genome Atlas, for which seven distinct omics data were available for each patient, in addition to curated clinical outcomes. We applied a uniform pipeline for raw data preprocessing and adopted the Cancer Integration via MultIkernel LeaRning (CIMLR) integrative clustering method to extract cancer subtypes. We then systematically reviewed the discovered clusters for the considered cancer types, highlighting novel associations between the different omics and prognosis.
In recent years, many algorithmic strategies have been developed to exploit single-cell mutational profiles generated via sequencing experiments of cancer samples and return reliable models of cancer evolution. Here, we introduce the COB-tree algorithm, which summarizes the solutions explored by state-of-the-art methods for clonal tree inference, to return a unique consensus optimum branching tree. The method proves to be highly effective in detecting pairwise temporal relations between genomic events, as demonstrated by extensive tests on simulated datasets. We also provide a new method to visualize and quantitatively inspect the solution space of the inference methods, via Principal Coordinate Analysis. Finally, the application of our method to a single-cell dataset of patient-derived melanoma xenografts shows significant differences between the COB-tree solution and the maximum likelihood ones.
Background
Longitudinal single-cell sequencing experiments of patient-derived models are increasingly employed to investigate cancer evolution. In this context, robust computational methods are needed to properly exploit the mutational profiles of single cells generated via variant calling, in order to reconstruct the evolutionary history of a tumor and characterize the impact of therapeutic strategies, such as the administration of drugs. To this end, we have recently developed the LACE framework for the Longitudinal Analysis of Cancer Evolution.

Results
The LACE 2.0 release, aimed at inferring longitudinal clonal trees, enhances the original framework with new key functionalities: improved data management for preprocessing of standard variant calling data, a reworked inference engine, and direct connection to public databases.

Conclusions
All of this is accessible through a new and interactive Shiny R graphical interface offering the possibility to apply filters helpful in discriminating relevant or potential driver mutations, set up inferential parameters, and visualize the results. The software is available at: github.com/BIMIB-DISCo/LACE.
Recent investigations have improved our understanding of the molecular aberrations supporting Waldenström Macroglobulinemia (WM) biology; however, whether the immune microenvironment contributes to WM pathogenesis remains unanswered. We first showed how a transgenic murine model of human-like lymphoplasmacytic lymphoma/WM exhibits an increased number of regulatory T (Treg) cells with respect to control mice. These findings were translated into the WM clinical setting, where the transcriptomic profiling of WM patient-derived regulatory T cells (Tregs) unveiled a peculiar WM-devoted mRNA signature, with significant enrichment for NF-κB-mediated TNF-α signaling-, MAPK- and PI3K/AKT-related genes, paralleled by a different Treg functional phenotype. We demonstrated significantly higher Treg induction, expansion and proliferation triggered by WM cells as compared to their normal cellular counterpart, with a more profound effect within the context of CXCR4 C1013G-mutated WM cells. By investigating the B-to-T cell cross-talk at single-cell level, we identified the CD40/CD40-ligand as a potentially relevant axis supporting WM cell-Treg cell interaction. Our findings demonstrate the existence of a Treg-mediated immunosuppressive phenotype in WM, which can be therapeutically reversed by blocking the CD40L/CD40 axis to inhibit WM cell growth.
We present a large-scale analysis of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) substitutions, considering 1,585,456 high-quality raw sequencing samples, aimed at investigating the existence and quantifying the effect of mutational processes causing mutations in SARS-CoV-2 genomes when interacting with the human host. As a result, we confirmed the presence of three well-differentiated mutational processes likely ruled by reactive oxygen species (ROS), apolipoprotein B editing complex (APOBEC), and adenosine deaminase acting on RNA (ADAR). We then evaluated the activity of these mutational processes in different continental groups, showing that some samples from Africa present a significantly higher number of substitutions, most likely due to higher APOBEC activity. We finally analyzed the activity of mutational processes across different SARS-CoV-2 variants, and we found a significantly lower number of mutations attributable to APOBEC activity in samples assigned to the Omicron variant.
We outline the features of the R package SparseSignatures and its application to determine the signatures contributing to mutation profiles of tumor samples. We describe installation details and illustrate a step-by-step approach to (1) prepare the data for signature analysis, (2) determine the optimal parameters, and (3) employ them to determine the signatures and related exposure levels in the point mutation dataset.

For complete details on the use and execution of this protocol, please refer to Lal et al. (2021).
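The signature-extraction step of such a protocol can be pictured as factoring a samples × mutation-categories count matrix into non-negative exposures and signatures. The sketch below uses plain non-negative matrix factorization with multiplicative updates in NumPy on synthetic data. It is not the SparseSignatures API (that package is written in R), and it leaves out the parameter selection and sparsity handling that the protocol covers; the function name, matrix sizes and synthetic generator are all illustrative assumptions.

```python
import numpy as np


def nmf_signatures(counts, n_signatures, n_iter=500, seed=0, eps=1e-12):
    """Factor counts (samples x categories) ~= exposures @ signatures.

    Plain NMF with Lee-Seung multiplicative updates (Frobenius loss).
    Hypothetical helper for illustration, not the SparseSignatures method.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_categories = counts.shape
    exposures = rng.random((n_samples, n_signatures)) + eps
    signatures = rng.random((n_signatures, n_categories)) + eps
    for _ in range(n_iter):
        exposures *= (counts @ signatures.T) / (
            exposures @ signatures @ signatures.T + eps
        )
        signatures *= (exposures.T @ counts) / (
            exposures.T @ exposures @ signatures + eps
        )
    # Rescale so each signature sums to 1; exposures absorb the scale,
    # leaving the product exposures @ signatures unchanged.
    scale = np.maximum(signatures.sum(axis=1, keepdims=True), eps)
    return exposures * scale.T, signatures / scale


# Synthetic example: 40 samples, 96 mutation categories, 3 hidden signatures.
rng = np.random.default_rng(42)
true_signatures = rng.dirichlet(np.ones(96), size=3)
true_exposures = rng.gamma(shape=2.0, scale=2000.0, size=(40, 3))
counts = rng.poisson(true_exposures @ true_signatures).astype(float)

exposures, signatures = nmf_signatures(counts, n_signatures=3)
rel_error = np.linalg.norm(counts - exposures @ signatures) / np.linalg.norm(counts)
print(f"relative reconstruction error: {rel_error:.3f}")
```

In this toy setup the number of signatures is fixed in advance; choosing it from the data is the "optimal parameters" step of the protocol and is not attempted here.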
A key task of genomic surveillance of infectious viral diseases lies in the early detection of dangerous variants. Unexpected help to this end is provided by the analysis of deep sequencing data of viral samples, which are typically discarded after creating consensus sequences. Such analysis allows one to detect intra-host low-frequency mutations, which are a footprint of mutational processes underlying the origination of new variants. Their timely identification may improve public-health decision-making compared with traditional approaches exploiting consensus sequences. We present the analysis of 220,788 high-quality deep sequencing SARS-CoV-2 samples, showing that many spike and nucleocapsid mutations of interest associated with the most widely circulating variants, including Beta, Delta, and Omicron, might have been intercepted several months in advance. Furthermore, we show that a refined genomic surveillance system leveraging deep sequencing data might allow one to pinpoint emerging mutation patterns, providing automated data-driven support to virologists and epidemiologists.
We describe the procedures to perform the following: (1) the de novo discovery of mutational signatures from raw sequencing data of viral samples and (2) the association of existing viral mutational signatures to the samples of a given dataset. The goal is to identify and characterize the nucleotide substitution patterns related to the mutational processes that underlie the origination of variants in viral genomes. The VirMutSig protocol is available at this link: https://github.com/BIMIB-DISCo/VirMutSig.

For complete information on the theoretical aspects of this protocol, please refer to Graudenzi et al. (2021).
Genetic and epigenetic variation, together with transcriptional plasticity, contribute to intratumour heterogeneity. The interplay of these biological processes and their respective contributions to tumour evolution remain unknown. Here we show that intratumour genetic ancestry only infrequently affects gene expression traits and subclonal evolution in colorectal cancer (CRC). Using spatially resolved paired whole-genome and transcriptome sequencing, we find that the majority of intratumour variation in gene expression is not strongly heritable but rather ‘plastic’. Somatic expression quantitative trait loci analysis identified a number of putative genetic controls of expression by cis-acting coding and non-coding mutations, the majority of which were clonal within a tumour, alongside frequent structural alterations. Consistently, computational inference on the spatial patterning of tumour phylogenies finds that a considerable proportion of CRCs did not show evidence of subclonal selection, with only a subset of putative genetic drivers associated with subclone expansions. Spatial intermixing of clones is common, with some tumours growing exponentially and others only at the periphery. Together, our data suggest that most genetic intratumour variation in CRC has no major phenotypic consequence and that transcriptional plasticity is, instead, widespread within a tumour.
Colorectal malignancies are a leading cause of cancer-related death and have undergone extensive genomic study. However, DNA mutations alone do not fully explain malignant transformation. Here we investigate the co-evolution of the genome and epigenome of colorectal tumours at single-clone resolution using spatial multi-omic profiling of individual glands. We collected 1,370 samples from 30 primary cancers and 8 concomitant adenomas and generated 1,207 chromatin accessibility profiles, 527 whole genomes and 297 whole transcriptomes. We found positive selection for DNA mutations in chromatin modifier genes and recurrent somatic chromatin accessibility alterations, including in regulatory regions of cancer driver genes that were otherwise devoid of genetic mutations. Genome-wide alterations in accessibility for transcription factor binding involved CTCF, downregulation of interferon and increased accessibility for SOX and HOX transcription factor families, suggesting the involvement of developmental genes during tumourigenesis. Somatic chromatin accessibility alterations were heritable and distinguished adenomas from cancers. Mutational signature analysis showed that the epigenome in turn influences the accumulation of DNA mutations. This study provides a map of genetic and epigenetic tumour heterogeneity, with fundamental implications for understanding colorectal cancer biology.
Activation-induced cytidine deaminase, AICDA or AID, is a driver of somatic hypermutation and class-switch recombination in immunoglobulins. In addition, this deaminase belonging to the APOBEC family may have off-target effects genome-wide, but its effects at pan-cancer level are not well elucidated. Here, we used different pan-cancer datasets, totaling more than 50,000 samples analyzed by whole-genome, whole-exome, or targeted sequencing. AID mutations are present at pan-cancer level with higher frequency in hematological cancers and higher presence at transcriptionally active TAD domains. AID synergizes initial hotspot mutations by a second composite mutation. AID mutational load was found to be independently associated with a favorable outcome in immune-checkpoint inhibitors (ICI) treated patients across cancers after analyzing 2000 samples. Finally, we found that AID-related neoepitopes, resulting from mutations at more frequent hotspots if compared to other mutational signatures, enhance CXCL13/CCR5 expression, immunogenicity, and T-cell exhaustion, which may increase ICI sensitivity.
Motivation: Driver (epi)genomic alterations underlie the positive selection of cancer subpopulations, which promotes drug resistance and relapse. Even though substantial heterogeneity is witnessed in most cancer types, mutation accumulation patterns can be regularly found and can be exploited to reconstruct predictive models of cancer evolution. Yet, available methods cannot infer logical formulas connecting events to represent alternative evolutionary routes or convergent evolution.
Results: We introduce PMCE, an expressive framework that leverages mutational profiles from cross-sectional sequencing data to infer probabilistic graphical models of cancer evolution including arbitrary logical formulas, and which outperforms the state-of-the-art in terms of accuracy and robustness to noise, on simulations.
The application of PMCE to 7866 samples from the TCGA database allows us to identify a highly significant correlation between the predicted evolutionary paths and the overall survival in 7 tumor types, proving that our approach can effectively stratify cancer patients in reliable risk groups.
Availability: PMCE is freely available at https://github.com/BIMIB-DISCo/PMCE, in addition to the code to replicate all the analyses presented in the manuscript.
Contacts: daniele.ramazzotti@unimib.it, alex.graudenzi@ibfm.cnr.it.
Many large national and transnational studies have been dedicated to the analysis of SARS-CoV-2 genome, most of which focused on missense and nonsense mutations. However, approximately 30% of the SARS-CoV-2 variants are synonymous, therefore changing the target codon without affecting the corresponding protein sequence.
By performing a large-scale analysis of sequencing data generated from almost 400,000 SARS-CoV-2 samples, we show that silent mutations increasing the similarity of viral codons to the human ones tend to fixate in the viral genome over time. This indicates that SARS-CoV-2 codon usage is adapting to the human host, likely improving its effectiveness in using the human aminoacyl-tRNA set through the accumulation of deceitfully neutral silent mutations.
Matters Arising from: Sharma, A., Cao, E.Y., Kumar, V. et al. Longitudinal single-cell RNA sequencing of patient-derived primary cells reveals drug-induced infidelity in stem cell hierarchy. Nat Commun 9, 4931 (2018). https://doi.org/10.1038/s41467-018-07261-3. In Sharma, A. et al. Nat Commun 9, 4931 (2018) the authors employ longitudinal single-cell transcriptomic data from patient-derived primary and metastatic oral squamous cell carcinoma cell lines to investigate possible divergent modes of chemo-resistance in tumor cell subpopulations. We integrated the analyses presented in the manuscript by performing variant calling from scRNA-seq data via GATK Best Practices. As a main result, we show that an extremely high number of single-nucleotide variants representative of the identity of a specific patient is unexpectedly found in the scRNA-seq data of the cell line derived from a second patient, and vice versa. This finding likely suggests the existence of a sample swap, thus jeopardizing some of the translational conclusions of the article. Our results prove the efficacy of a joint analysis of the genotypic and transcriptomic identity of single cells.
Atypical chronic myeloid leukemia (aCML) is a BCR-ABL1-negative clonal disorder that belongs to the myelodysplastic/myeloproliferative group. This disease is characterized by recurrent somatic mutations in SETBP1, ASXL1 and ETNK1 genes, as well as high genetic heterogeneity, thus posing a great therapeutic challenge. To provide a comprehensive genomic characterization of aCML we applied a high-throughput sequencing strategy to 43 aCML samples, including both whole-exome and RNA-sequencing data. Our dataset identifies ASXL1, SETBP1, and ETNK1 as the most frequently mutated genes, at frequencies of 43.2%, 29.7% and 16.2%, respectively. We characterized the clonal architecture of 7 aCML patients by means of colony assays and targeted resequencing. The results indicate that ETNK1 variants occur early in the clonal evolution history of aCML, while SETBP1 mutations often represent a late event. The presence of actionable mutations conferred both ex vivo and in vivo sensitivity to specific inhibitors, with evidence of strong in vitro synergism in case of multiple targeting. In one patient, a clinical response was obtained. Stratification based on RNA-sequencing identified two different populations in terms of overall survival, and differential gene expression analysis identified 38 significantly overexpressed genes in the worse outcome group. Three genes correctly classified patients for overall survival.
To dissect the mechanisms underlying the inflation of variants in the SARS-CoV-2 genome, we present one of the largest up-to-date analyses of intra-host genomic diversity, which reveals that most samples present heterogeneous genomic architectures, due to the interplay between host-related mutational processes and transmission dynamics. The deconvolution of the set of intra-host minor variants unveils the existence of non-overlapping mutational signatures related to specific nucleotide substitutions, which prove that distinct hosts respond differently to SARS-CoV-2 infections, and which are likely ruled by APOBEC, Reactive Oxygen Species (ROS) and ADAR. Thanks to a corrected-for-signatures dN/dS analysis, we demonstrate that the mutational processes underlying such signatures are affected by purifying selection, with important exceptions. In fact, several mutations linked to low-rate mutational processes appear to transit to clonality in the population, eventually leading to the definition of new clonal genotypes and to a statistically significant increase of overall genomic diversity. Importantly, the analysis of the phylogenetic model shows the presence of multiple homoplasies, due to mutational hotspots, phantom mutations or positive selection, and supports the hypothesis of transmission of minor variants during infections. Overall, the results of this study pave the way for the integrated characterization of intra-host genomic diversity and clinical outcome of SARS-CoV-2 hosts.
We introduce VERSO, a two-step framework for the characterization of viral evolution from sequencing data of viral genomes, which improves over phylogenomic approaches for consensus sequences. VERSO exploits an efficient algorithmic strategy to return robust phylogenies from clonal variant profiles, even under sampling limitations. It then leverages variant frequency patterns to characterize the intra-host genomic diversity of samples, revealing undetected infection chains and pinpointing variants likely involved in homoplasies. On simulations, VERSO outperforms state-of-the-art tools for phylogenetic inference. Notably, the application to 6726 Amplicon and RNA-seq samples refines the estimation of SARS-CoV-2 evolution, while co-occurrence patterns of minor variants unveil undetected infection paths, which are validated with contact tracing data. Finally, the analysis of the SARS-CoV-2 mutational landscape uncovers a temporal increase of overall genomic diversity, and highlights variants transiting from minor to clonal state and homoplastic variants, some of which fall on the spike gene. Available at: https://github.com/BIMIB-DISCo/VERSO.
The rise of longitudinal single-cell sequencing experiments on patient-derived cell cultures, xenografts and organoids is opening new opportunities to track cancer evolution, assess the efficacy of therapies and identify resistant subclones. We introduce LACE, the first algorithmic framework that processes single-cell mutational profiles from samples collected at different time points to reconstruct longitudinal models of cancer evolution. The approach maximizes a weighted likelihood function computed on longitudinal data points to solve a Boolean matrix factorization problem, via Markov chain Monte Carlo sampling. On simulations, LACE outperforms state-of-the-art methods for both bulk and single-cell sequencing data with respect to the reconstruction of the ground-truth clonal phylogeny and dynamics, also in conditions of unbalanced datasets, significant rates of sequencing errors and sampling limitations. As the results are robust with respect to data-specific errors, LACE is effective with mutational profiles generated by calling variants from (full-length) scRNA-seq data, and this allows one to investigate the relation between genomic and phenotypic evolution of tumors at the single-cell level. Here, we apply LACE to a longitudinal scRNA-seq dataset of patient-derived xenografts of BRAF V600E/K mutant melanomas, dissecting the impact of BRAF/MEK-inhibition on clonal evolution, also in terms of clone-specific gene expression dynamics. Furthermore, the analysis of breast cancer PDXs from longitudinal targeted scDNA-sequencing experiments delivers a high-resolution temporal characterization of intra-tumor heterogeneity.
The metabolic processes related to the synthesis of the molecules needed for a new round of cell division underlie the complex behaviour of cell populations in multi-cellular systems, such as tissues and organs, whereas their deregulation can lead to pathological states, such as cancer. Even within genetically homogeneous populations, complex dynamics, such as population oscillations or the emergence of specific metabolic and/or proliferative patterns, may arise, and this aspect is highly amplified in systems characterized by extreme heterogeneity. To investigate the conditions and mechanisms that link metabolic processes to cell population dynamics, we here employ a previously introduced multi-scale model of multi-cellular systems, named FBCA (Flux Balance Analysis with Cellular Automata), which couples biomass accumulation, simulated via Flux Balance Analysis of a metabolic network, with the simulation of population and spatial dynamics via Cellular Potts Models. In this work, we investigate the influence that different modes of nutrient diffusion within the system may have on the emerging behaviour of cell populations. In our model, metabolic communication among cells is allowed by letting secreted metabolites diffuse over the lattice, in addition to diffusion of nutrients from given sources. The inclusion of the diffusion processes in the model proved its effectiveness in characterizing plausible biological scenarios.
Patients admitted to the intensive care unit frequently have anemia and impaired renal function, but often lack historical blood results to contextualize the acuteness of these findings. Using data available within two hours of ICU admission, we developed machine learning models that accurately (AUC 0.86–0.89) classify an individual patient’s baseline hemoglobin and creatinine levels. Compared to assuming the baseline to be the same as the admission lab value, machine learning performed significantly better at classifying acute kidney injury regardless of initial creatinine value, and significantly better at predicting baseline hemoglobin value in patients with admission hemoglobin of <10 g/dl.
Background. A large number of algorithms are being developed to reconstruct evolutionary models of individual tumours from genome sequencing data. Most methods can analyze multiple samples collected either through bulk multi-region sequencing experiments or the sequencing of individual cancer cells. However, the same method can rarely support both data types.
Results. We introduce TRaIT, a computational framework to infer mutational graphs that model the accumulation of multiple types of somatic alterations driving tumour evolution. Compared to other tools, TRaIT supports multi-region and single-cell sequencing data within the same statistical framework, and delivers expressive models that capture many complex evolutionary phenomena. TRaIT improves accuracy, robustness to data-specific errors and computational complexity compared to competing methods.
Conclusions. We show that the application of TRaIT to single-cell and multi-region cancer datasets can produce accurate and reliable models of single-tumour evolution, quantify the extent of intra-tumour heterogeneity and generate new testable experimental hypotheses.
Background. Germline mutations in the BRCA1 and BRCA2 genes predispose carriers to breast and ovarian cancer, and there remains a need to identify the specific genomic mechanisms by which cancer evolves in these patients. Here we present a systematic genomic analysis of breast tumors with BRCA1 and BRCA2 mutations.

Methods. We analyzed genomic data from breast tumors, with a focus on comparing tumors with BRCA1/BRCA2 gene mutations with common classes of sporadic breast tumors.

Results. We identify differences between BRCA-mutated and sporadic breast tumors in patterns of point mutation, DNA methylation and structural variation. We show that structural variation disproportionately affects tumor suppressor genes and identify specific driver gene candidates that are enriched for structural variation.

Conclusions. Compared to sporadic tumors, BRCA-mutated breast tumors show signals of reduced DNA methylation, more ancestral cell divisions, and elevated rates of structural variation that tend to disrupt highly expressed protein-coding genes and known tumor suppressors. Our analysis suggests that BRCA-mutated tumors are more aggressive than sporadic breast cancers because loss of the BRCA pathway causes multiple processes of mutagenesis and gene dysregulation.
Over the past decades, both critical care and cancer care have improved substantially. Due to increased cancer-specific survival, we hypothesized that both the number of cancer patients admitted to the ICU and overall survival have increased since the millennium change. MIMIC-III, a freely accessible critical care database of Beth Israel Deaconess Medical Center, Boston, USA was used to retrospectively study trends and outcomes of cancer patients admitted to the ICU between 2002 and 2011. Multiple logistic regression analysis was performed to adjust for confounders of 28-day and 1-year mortality.
Out of 41,468 unique ICU admissions, 1,100 hemato-oncologic, 3,953 oncologic and 49 patients with both a hematological and solid malignancy were analyzed. Hematological patients had higher critical illness scores than non-cancer patients, while oncologic patients had similar APACHE-III and SOFA-scores compared to non-cancer patients. In the univariate analysis, cancer was strongly associated with mortality (OR= 2.74, 95%CI: 2.56, 2.94). Over the 10-year study period, 28-day mortality of cancer patients decreased by 30%. This trend persisted after adjustment for covariates, with cancer patients having significantly higher mortality (OR=2.63, 95%CI: 2.38, 2.88). Between 2002 and 2011, both the adjusted odds of 28-day mortality and the adjusted odds of 1-year mortality for cancer patients decreased by 6% (95%CI: 4%, 9%). Having cancer was the strongest single predictor of 1-year mortality in the multivariate model (OR=4.47, 95%CI: 4.11, 4.84).
Background. Critically ill patients may die despite invasive intervention. In this study, we examine trends in the application of two such treatments over a decade, namely, endotracheal ventilation and vasopressors and inotropes administration, as well as the impact of these trends on survival durations in patients who die within a month of ICU admission.

Methods. We considered observational data available from the MIMIC-III open-access ICU database, collected within a study period between 2002 and 2011. If a patient had multiple admissions to the ICU during the 30 days before death, only the first stay was analyzed, leading to a final set of 6,436 unique ICU admissions during the study period. We tested two hypotheses: (i) administration of invasive intervention during the ICU stay immediately preceding end-of-life would decrease over the study time period and (ii) time-to-death from ICU admission would also decrease, due to the decrease in invasive intervention administration. To investigate the latter hypothesis, we performed a subgroup analysis by considering patients with lowest and highest severity. To do so, we stratified the patients based on their SAPS I scores, and we considered patients within the first and the third tertiles of the score. We then assessed differences in trends within these groups between years 2002–05 vs. 2008–11.

Results. Comparing the period 2002–2005 vs. 2008–2011, we found a reduction in endotracheal ventilation among patients who died within 30 days of ICU admission (120.8 vs. 68.5 hours for the lowest severity patients, p<0.001; 47.7 vs. 46.0 hours for the highest severity patients, p = 0.004). This is explained in part by an increase in the use of non-invasive ventilation. Comparing the period 2002–2005 vs. 2008–2011, we found a reduction in the use of vasopressors and inotropes among patients with the lowest severity who died within 30 days of ICU admission (41.8 vs. 36.2 hours, p<0.001) but not among those with the highest severity. Despite a reduction in the use of invasive interventions, we did not find a reduction in the time to death between 2002–2005 vs. 2008–2011 (7.8 days vs. 8.2 days for the lowest severity patients, p = 0.32; 2.1 days vs. 2.0 days for the highest severity patients, p = 0.74).

Conclusion. We found that the reduction in the use of invasive treatments over time in patients with very poor prognosis did not shorten the time-to-death. These findings may be useful for goals of care discussions.
Mastering the dynamics of social influence requires separating, in a database of information propagation traces, the genuine causal processes from temporal correlation, homophily and other spurious causes. However, most of the studies to characterize social influence and, in general, most data-science analyses focus on correlations, statistical independence, conditional independence etc.; only recently, there has been a resurgence of interest in "causal data science," e.g., grounded on causality theories. In this paper we adopt a principled causal approach to the analysis of social influence from information-propagation data, rooted in probabilistic causal theory. Our approach consists of two phases. In the first step, in order to avoid the pitfalls of misinterpreting causation when the data spans a mixture of several subtypes ("Simpson's paradox"), we partition the set of propagation traces in groups, in such a way that each group is as little contradictory as possible in terms of the hierarchical structure of information propagation. For this goal we borrow from the literature the notion of "agony" [26] and define the Agony-bounded Partitioning problem, which we prove to be hard, and for which we develop two efficient algorithms with approximation guarantees. In the second step, for each group from the first phase, we apply a constrained MLE approach to ultimately learn a minimal causal topology. Experiments on synthetic data show that our method is able to retrieve the genuine causal arcs w.r.t. a known ground-truth generative model. Experiments on real data show that, by focusing only on the extracted causal structures instead of the whole social network, we can improve the effectiveness of predicting influence spread.
Bayesian Networks have been widely used in the last decades in many fields, to describe statistical dependencies among random variables. In general, learning the structure of such models is a problem with considerable theoretical interest that poses many challenges. On the one hand, it is a well-known NP-complete problem, practically hardened by the huge search space of possible solutions. On the other hand, the phenomenon of I-equivalence, i.e., different graphical structures underpinning the same set of statistical dependencies, may lead to multimodal fitness landscapes further hindering maximum likelihood approaches to solve the task. Despite all these difficulties, greedy search methods based on a likelihood score coupled with a regularization term to account for model complexity have been shown to be surprisingly effective in practice. In this paper, we consider the formulation of the task of learning the structure of Bayesian Networks as an optimization problem based on a likelihood score, without complexity terms to regularize it. In particular, we exploit the NSGA-II multi-objective optimization procedure in order to explicitly account for both the likelihood of a solution and the number of selected arcs, by setting these as the two objective functions of the method. The aim of this work is to investigate the behavior of NSGA-II and analyse the quality of its solutions. We thus thoroughly examined the optimization results obtained on a wide set of simulated data, by considering both the goodness of the inferred solutions in terms of the objective function values achieved, and by comparing the retrieved structures with the ground truth, i.e., the networks used to generate the target data. Our results show that NSGA-II can converge to solutions characterized by better likelihood and fewer arcs than classic approaches, although paradoxically characterized in many cases by a lower similarity with the target network.
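As an illustration of the two-objective formulation described above (maximize likelihood, minimize the number of arcs), the following minimal sketch filters a set of candidate structures down to its non-dominated (Pareto) front. It is not the NSGA-II procedure itself and is not taken from the paper: the `(log_likelihood, n_arcs)` tuples, the function names and the candidate values are hypothetical, chosen only to show what "better on both objectives" means.

```python
from typing import List, Tuple

# Hypothetical representation: each candidate network is summarized as
# (log_likelihood, n_arcs). Higher likelihood is better, fewer arcs is better.
Solution = Tuple[float, int]


def dominates(a: Solution, b: Solution) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Made-up candidate structures, for illustration only.
candidates = [(-120.0, 8), (-118.5, 12), (-125.0, 8), (-118.5, 15), (-130.0, 5)]
front = pareto_front(candidates)
print(front)
```

Here `(-125.0, 8)` is dropped because `(-120.0, 8)` has the same number of arcs and a higher likelihood, and `(-118.5, 15)` is dropped because `(-118.5, 12)` reaches the same likelihood with fewer arcs. The remaining solutions represent different trade-offs between fit and sparsity.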
Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.
Outcomes for cancer patients vary greatly even within the same tumor type, and characterization of molecular subtypes of cancer holds important promise for improving prognosis and personalized treatment. This promise has motivated recent efforts to produce large amounts of multidimensional genomic (‘multi-omic’) data, but current algorithms still face challenges in the integrated analysis of such data. Here we present Cancer Integration via Multikernel Learning (CIMLR), a new cancer subtyping method that integrates multi-omic data to reveal molecular subtypes of cancer. We apply CIMLR to multi-omic data from 36 cancer types and show significant improvements in both computational efficiency and ability to extract biologically meaningful cancer subtypes. The discovered subtypes exhibit significant differences in patient survival for 27 of 36 cancer types. Our analysis reveals integrated patterns of gene expression, methylation, point mutations and copy number changes in multiple cancers and highlights patterns specifically associated with poor patient outcomes.
Identification of modules in molecular networks is at the core of many current analysis methods in biomedical research. However, how well different approaches identify disease-relevant modules in different types of gene and protein networks remains poorly understood. We launched the “Disease Module Identification DREAM Challenge”, an open competition to comprehensively assess module identification methods across diverse protein-protein interaction, signaling, gene co-expression, homology, and cancer-gene networks. Predicted network modules were tested for association with complex traits and diseases using a unique collection of 180 genome-wide association studies (GWAS). Our critical assessment of 75 contributed module identification methods reveals novel top-performing algorithms, which recover complementary trait-associated modules. We find that most of these modules correspond to core disease-relevant pathways, which often comprise therapeutic targets and correctly prioritize candidate disease genes. This community challenge establishes benchmarks, tools and guidelines for molecular network analysis to study human disease biology (https://synapse.org/modulechallenge).
Carcinogenesis is an evolutionary process driven by the accumulation of genomic aberrations. Recurrent sequences of genomic changes, both between and within patients, reflect repeated evolution that is valuable for anticipating cancer progression. Multi-region sequencing and phylogenetic analysis allow inference of the partial temporal order of genomic changes within a patient's tumour. However, the inherent stochasticity of the evolutionary process makes phylogenetic trees from different patients appear very distinct, preventing the robust identification of recurrent evolutionary trajectories. Here we present a novel quantitative method based on a machine learning approach called Transfer Learning (TL) that allows overcoming the stochastic effects of cancer evolution and highlighting hidden recurrences in cancer patient cohorts. When applied to multi-region sequencing datasets from lung, breast and renal cancer (708 samples from 160 patients), our method detected repeated evolutionary trajectories that determine novel patient subgroups, which reproduce in large single-sample cohorts (n=2,641) and have prognostic value. Our method provides a novel patient classification measure that is grounded in the cancer evolution paradigm, and which reveals repeated evolution during tumorigenesis, with implications for our ability to anticipate malignant evolution.
Learning the structure of dependencies among multiple random variables is a problem of considerable theoretical and practical interest. Within the context of Bayesian Networks, a practical and surprisingly successful solution to this learning problem is achieved by adopting score-function optimisation schemes, augmented with multiple restarts to avoid local optima. Yet, the conditions under which such strategies work well are poorly understood, and there are also some intrinsic limitations to learning the directionality of the interaction among the variables. Following an early intuition of Friedman and Koller, we propose to decouple the learning problem into two steps: first, we identify a partial ordering among input variables which constrains the structural learning problem, and then propose an effective bootstrap-based algorithm to simulate augmented data sets, and select the most important dependencies among the variables. By using several synthetic data sets, we show that our algorithm yields better recovery performance than the state of the art, increasing the chances of identifying a globally-optimal solution to the learning problem, and also solving well-known identifiability issues that affect the standard approach. We use our new algorithm to infer statistical dependencies between cancer driver somatic mutations detected by high-throughput genome sequencing data of multiple colorectal cancer patients. In this way, we also show how the proposed methods can provide new insights into cancer initiation and progression. Code: https://github.com/caravagn/Bootstrap-based-Learning
The increasing availability of sequencing data of cancer samples is fueling the development of algorithmic strategies to investigate tumor heterogeneity and infer reliable models of cancer evolution. We here build upon previous work on cancer progression inference from genomic alteration data, to deliver two distinct Cytoscape-based applications, which allow users to produce, visualize and manipulate cancer evolution models, also by interacting with public genomic and proteomics databases. In particular, we here introduce cyTRON, a stand-alone Cytoscape app, and cyTRON/JS, a web application which employs the functionalities of Cytoscape/JS.

cyTRON was developed in Java; the code is available at https://github.com/BIMIB-DISCo/cyTRON and on the Cytoscape App Store http://apps.cytoscape.org/apps/cytron. cyTRON/JS was developed in JavaScript and R; the source code of the tool is available at https://github.com/BIMIB-DISCo/cyTRON-js and the tool is accessible from https://bimib.disco.unimib.it/cytronjs/welcome.
One of the most challenging tasks when adopting Bayesian Networks (BNs) is the one of learning their structure from data. This task is complicated by the huge search space of possible solutions, and by the fact that the problem is NP-hard. Hence, full enumeration of all the possible solutions is not always feasible and approximations are often required. However, to the best of our knowledge, a quantitative analysis of the performance and characteristics of the different heuristics to solve this problem has never been done before.

For this reason, in this work, we provide a detailed comparison of many different state-of-the-art methods for structural learning on simulated data, considering BNs with both discrete and continuous variables, and with different rates of noise in the data. In particular, we investigate the performance of different widespread scores and algorithmic approaches proposed for the inference and the statistical pitfalls within them.
Motivation

We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogeneous samples. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via a better visualization.
Availability and Implementation

SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on http://bioconductor.org.
The complicated, evolving landscape of cancer mutations makes it a formidable challenge to identify cancer genes among the large lists of mutations typically generated in NGS experiments. The ability to prioritize these variants is therefore of paramount importance. To address this issue we developed OncoScore, a text-mining tool that ranks genes according to their association with cancer, based on available biomedical literature. Receiver operating characteristic curve and area under the curve (AUC) metrics on manually curated datasets confirmed the excellent discriminating capability of OncoScore (OncoScore cut-off threshold = 21.09; AUC = 90.3%, 95% CI: 88.1-92.5%), indicating that OncoScore provides useful results in cases where an efficient prioritization of cancer-associated genes is needed.
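As an illustration of the AUC metric mentioned above, the following is a minimal sketch of how an AUC can be computed for any score-based ranking. It is not the OncoScore implementation; the function name `roc_auc` and the toy scores and labels are hypothetical and were chosen only to show the arithmetic.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive item
    receives a higher score than a randomly chosen negative item.
    Ties count as half a win. Labels are 1 (positive) or 0 (negative)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: two cancer-associated genes (label 1) and
# two non-associated genes (label 0), with made-up scores.
toy_scores = [30.0, 20.0, 25.0, 10.0]
toy_labels = [1, 1, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
print(toy_auc)
```

In this toy case three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75. An AUC of 1.0 means every positive item outranks every negative one, and 0.5 corresponds to a random ranking.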
The most recent financial upheavals have cast doubt on the adequacy of some of the conventional quantitative risk management strategies, such as VaR (Value at Risk), in many common situations. Consequently, there has been an increasing need for verisimilar financial stress testing, namely simulating and analyzing financial portfolios in extreme, albeit rare, scenarios. Unlike conventional risk management, which exploits statistical correlations among financial instruments, here we focus our analysis on the notion of probabilistic causation, which is embodied by Suppes-Bayes Causal Networks (SBCNs); SBCNs are probabilistic graphical models that have many attractive features in terms of more accurate causal analysis for generating financial stress scenarios.

In this paper, we present a novel approach for conducting stress testing of financial portfolios based on SBCNs in combination with classical machine learning classification tools. The resulting method is shown to be capable of correctly discovering the causal relationships among the financial factors that affect the portfolios, and thus of simulating stress testing scenarios with higher accuracy and lower computational complexity than conventional Monte Carlo simulations.
Structural learning of Bayesian Networks (BNs) is an NP-hard problem, which is further complicated by many theoretical issues, such as the I-equivalence among different structures. In this work, we focus on a specific subclass of BNs, named Suppes-Bayes Causal Networks (SBCNs), which include specific structural constraints based on Suppes’ probabilistic causation to efficiently model cumulative phenomena. Here we compare the performance, via extensive simulations, of various state-of-the-art search strategies, such as local search techniques and Genetic Algorithms, as well as of distinct regularization methods. The assessment is performed on a large number of simulated datasets from topologies with distinct levels of complexity, various sample sizes and different rates of errors in the data. Among the main results, we show that the introduction of Suppes’ constraints dramatically improves the inference accuracy, by reducing the solution space and providing a temporal ordering on the variables. We also report on trade-offs among different search techniques that can be efficiently employed in distinct experimental settings. This manuscript is an extended version of the paper “Structural Learning of Probabilistic Graphical Models of Cumulative Phenomena” presented at the 2018 International Conference on Computational Science.
Several statistical techniques have been recently developed for the inference of cancer progression models from the increasingly available NGS cross-sectional mutational profiles. A particular algorithm, CAPRI, was proven to be the most efficient with respect to sample size and level of noise in the data. The algorithm combines structural constraints based on Suppes' theory of probabilistic causation and maximum likelihood fit with regularization, and defines constrained Bayesian networks, named Suppes-Bayes Causal Networks (SBCNs), which account for the selective advantage relations among genomic events. In general, SBCNs are effective in modeling any phenomenon driven by cumulative dynamics, as long as the modeled events are persistent. We here discuss the effectiveness of the SBCN theoretical framework and we investigate the influence of: (i) the priors based on Suppes' theory and (ii) different maximum likelihood regularization parameters on the inference performance estimated on large synthetically generated datasets.
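The abstract above does not spell out the structural constraints, so the following is only a minimal sketch under a stated assumption: that Suppes' probabilistic causation is read as two conditions on a candidate edge a → b, namely temporal priority (approximated on cross-sectional data by P(a) > P(b)) and probability raising (P(b | a) > P(b | not a)). The function name `suppes_candidate_edges` and the toy dataset are hypothetical; this is not the CAPRI implementation, and the maximum likelihood fit with regularization is omitted.

```python
from itertools import permutations


def suppes_candidate_edges(samples):
    """Return the ordered pairs (a, b) that pass both assumed conditions.

    samples: list of rows, one per sample; each row is a list of 0/1 flags
    saying whether each event was observed in that sample.
    """
    n_samples = len(samples)
    n_events = len(samples[0])
    edges = []
    for a, b in permutations(range(n_events), 2):
        count_a = sum(row[a] for row in samples)
        count_b = sum(row[b] for row in samples)
        count_ab = sum(row[a] and row[b] for row in samples)
        # Both conditional probabilities must be defined.
        if count_a == 0 or count_a == n_samples:
            continue
        p_b_given_a = count_ab / count_a
        p_b_given_not_a = (count_b - count_ab) / (n_samples - count_a)
        temporal_priority = count_a > count_b  # P(a) > P(b)
        probability_raising = p_b_given_a > p_b_given_not_a
        if temporal_priority and probability_raising:
            edges.append((a, b))
    return edges


# Hypothetical toy data: event 0 is frequent, events 1 and 2 occur only
# in samples that already carry event 0.
toy_samples = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]
toy_edges = suppes_candidate_edges(toy_samples)
print(toy_edges)
```

On the toy data only the edges from event 0 to events 1 and 2 survive. Pruning the candidate edges in this way is what shrinks the search space before any score-based structure search is run.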
Single-cell RNA-seq technologies enable high-throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. Here, we propose a novel similarity-learning framework, SIMLR (single-cell interpretation via multi-kernel learning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization applications. Benchmarking against state-of-the-art methods for these applications, we used SIMLR to re-analyse seven representative single-cell data sets, including high-throughput droplet-based data sets with tens of thousands of cells. We show that SIMLR greatly improves clustering sensitivity and accuracy, as well as the visualization and interpretability of the data.
