Giorgio Valentini

    • Full professor in Computer Science at the University of Milano (Italy). My research interests are in Artificial In...
    The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility (P. aeruginosa only). We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. We conclude that, w...
    A personalized approach is strongly advocated for treatment selection in Multiple Sclerosis patients due to the high number of available drugs. Machine learning methods proved to be valuable tools in the context of precision medicine. In the present work, we applied machine learning methods to identify a combined clinical and genetic signature of response to fingolimod that could support the prediction of drug response. Two cohorts of fingolimod-treated patients from Italy and France were enrolled and divided into training, validation, and test sets. Random forest training and robust feature selection were performed in the first two sets respectively, and the independent test set was used to evaluate model performance. A genetic-only model and a combined clinical–genetic model were obtained. Overall, 381 patients were classified according to the NEDA-3 criterion at 2 years; we identified a genetic model, including 123 SNPs, that was able to predict fingolimod response with an AUROC= ...
    Background: Cis-regulatory regions (CRRs) are non-coding regions of the DNA that finely control the spatio-temporal pattern of transcription; they are involved in a wide range of pivotal processes such as the development of specific cell-lines/tissues and the dynamic cell response to physiological stimuli. Recent studies showed that genetic variants occurring in CRRs are strongly correlated with pathogenicity or deleteriousness. Considering the central role of CRRs in the regulation of physiological and pathological conditions, the correct identification of CRRs and of their tissue-specific activity status through Machine Learning methods plays a major role in dissecting the impact of genetic variants on human diseases. Unfortunately, the problem is still open, though some promising results have already been reported by (deep) machine-learning based methods that predict active promoters and enhancers in specific tissues or cell lines by encoding epigenetic or spectral features directly...
    The detection of pathogenic genomic variants associated with genetic or cancer diseases represents an open problem in the context of Genomic Medicine. In particular, the detection of mutations in the non-coding regions of the human genome is a particularly challenging machine learning problem, since neutral variants largely outnumber the pathogenic ones, thus resulting in highly imbalanced classification problems. We applied neural networks to the detection of pathogenic regulatory genomic variants in Mendelian diseases and we showed that, by leveraging imbalance-aware techniques and deep learning algorithms, we can obtain state-of-the-art results using a less complex model than those proposed in the literature for this challenging prediction task.
    Motivation: Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). ‘Hierarchy-unaware’ classifiers, also known as ‘flat’ methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while ‘hierarchy-aware’ approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO. Results: To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide ‘TPR-safe’ predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real...
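    As a rough illustration of what a ‘TPR-safe’ set of predictions looks like, the sketch below caps each term's score at the lowest corrected score among its parents, so that no term scores higher than any of its ancestors. This simple top-down capping rule is an assumption made for illustration; it is not the HEMDAG algorithm, which the abstract describes as combining isotonic regression and TPR learning strategies. The function name, the toy DAG, and the scores are all hypothetical.

```python
from typing import Dict, List


def tpr_consistent_scores(scores: Dict[str, float],
                          parents: Dict[str, List[str]]) -> Dict[str, float]:
    """Cap each term's flat score at the minimum corrected score of its parents.

    `parents` maps a term to its direct parents and is assumed to describe
    a DAG (no cycles), as in the Gene Ontology.
    """
    corrected: Dict[str, float] = {}

    def visit(term: str) -> float:
        # Memoize so each term is corrected once, even with multiple paths.
        if term in corrected:
            return corrected[term]
        value = scores[term]
        for parent in parents.get(term, []):
            value = min(value, visit(parent))
        corrected[term] = value
        return value

    for term in scores:
        visit(term)
    return corrected


# Toy DAG with hypothetical term names: "d" has two parents, "b" and "c".
toy_parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
toy_scores = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.8}
print(tpr_consistent_scores(toy_scores, toy_parents))
```

    In the toy example the flat score of "d" (0.8) exceeds that of its parent "b" (0.4), which violates the TPR; the capping rule lowers it to 0.4 and leaves the other scores unchanged.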
    Membrane transport systems comprise roughly 10% of all proteins in a cell and play a critical role in many biological processes [1]. Improving and expanding their classification is an important goal that can affect studies involving comparative and functional genomics, probing molecular mechanisms of diseases and metabolic processes, and searching for new therapeutic targets and pharmacologically relevant transport proteins. In this context, a relevant classification problem is the characterization of transport proteins according to the Transporter Classification Database (TCDB). Indeed, by exploiting this hierarchical taxonomy, which includes thousands of families and subfamilies of transporters, we implicitly predict the mode of action of the transport activity, the energy coupling mechanism used for the transport, the phylogenetic grouping of the proteins and their substrate specificity [2].
    This is the Supplementary Information of the paper: M. Schubach, M. Re, P.N. Robinson and G. Valentini, Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants, Scientific Reports, Nature Publishing, 7:2959, 2017
    Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference
    Known pathogenic variants associated with genetic Mendelian diseases represent a tiny minority of the overall genetic variation that characterizes the human genome. In this context classical imbalance-aware machine learning methods are unable to distinguish pathogenic from benign variants, since they are severely biased toward the majority (benign) class. Recent works based on ensemble and hyper-ensemble methods showed that by adopting sampling techniques we can significantly improve performance on this challenging task. Inspired by these findings and by recent successful applications of deep learning to Precision Medicine, we propose two learning techniques for neural networks designed to assure a certain balancing between pathogenic and benign variants during the training phase, or to assure that with high probability at least one pathogenic variant is included in the training mini-batch set of examples. The experimental prediction of non-coding mutations associated with Mendelian...
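    The two mini-batch ideas mentioned in this abstract can be illustrated with a small sketch. It assumes mini-batches are drawn uniformly at random, so the chance that a batch of size b contains at least one pathogenic example is approximately 1 - (1 - p)^b, where p is the fraction of pathogenic examples. The function names and the sampling scheme below are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence


def prob_at_least_one_positive(pos_fraction: float, batch_size: int) -> float:
    """Chance that a uniformly drawn mini-batch holds at least one positive.

    Assumes independent draws (a good approximation when the dataset is
    much larger than the batch).
    """
    return 1.0 - (1.0 - pos_fraction) ** batch_size


def balanced_batch(positives: Sequence[int], negatives: Sequence[int],
                   batch_size: int, pos_per_batch: int,
                   rng: random.Random) -> List[int]:
    """Build one mini-batch with a fixed number of positive examples.

    Positives are drawn with replacement because they are scarce;
    negatives are drawn without replacement.
    """
    chosen_pos = [rng.choice(positives) for _ in range(pos_per_batch)]
    chosen_neg = rng.sample(list(negatives), batch_size - pos_per_batch)
    batch = chosen_pos + chosen_neg
    rng.shuffle(batch)
    return batch


# With 1 positive per 1000 examples, a plain batch of 32 rarely sees one.
print(prob_at_least_one_positive(0.001, 32))

# Hypothetical example indices: 5 positives, 1000 negatives.
pos_ids = list(range(5))
neg_ids = list(range(5, 1005))
print(balanced_batch(pos_ids, neg_ids, batch_size=8, pos_per_batch=2,
                     rng=random.Random(0)))
```

    The first function shows why plain random batching starves the network of pathogenic examples at extreme imbalance ratios; the second forces a fixed quota of positives into every batch, which is one possible way to enforce a chosen balance during training.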
    The Automated Functional Prediction (AFP) of proteins has become a challenging problem in bioinformatics and biomedicine, aiming at handling and interpreting the extremely large proteomes of several eukaryotic organisms. A central issue in AFP is the absence, in public repositories for protein functions such as the Gene Ontology (GO), of well-defined sets of negative examples to learn accurate classifiers for AFP. In this paper we investigate the Query by Committee paradigm of active learning to select the negatives most informative for the classifier and the protein function to be inferred. We validated our approach in predicting the Gene Ontology function for the S. cerevisiae proteins.
    Methods for phenotype and outcome prediction are largely based on inductive supervised models that use selected biomarkers to make predictions, without explicitly considering the functional relationships between individuals. We introduce a novel network-based approach named Patient-Net (P-Net) in which biomolecular profiles of patients are modeled in a graph-structured space that represents gene expression relationships between patients. Then a kernel-based semi-supervised transductive algorithm is applied to the graph to explore the overall topology of the graph and to predict the phenotype/clinical outcome of patients. Experimental tests involving several publicly available datasets of patients afflicted with pancreatic, breast, colon and colorectal cancer show that our proposed method is competitive with state-of-the-art supervised and semi-supervised predictive systems. Importantly, P-Net also provides interpretable models that can be easily visualized to gain clues about the re...
    Wnt/Fzd signaling has been implicated in hematopoietic stem cell maintenance and in acute leukemia establishment. In our previous work, we described a recurrent rearrangement involving the WNT10B locus (WNT10BR), characterized by the expression of WNT10BIVS1 transcript variant, in acute myeloid leukemia. To determine the occurrence of WNT10BR in T‐cell acute lymphoblastic leukemia (T‐ALL), we retrospectively analyzed an Italian cohort of patients (n = 20) and detected a high incidence (13/20) of WNT10BIVS1 expression. To identify genes involved in the WNT10B molecular response, we designed a Wnt‐targeted RNA sequencing panel. Identifying Wnt agonists and antagonists, we found that the expression of FZD6, LRP5, and PROM1 genes stands out in WNT10BIVS1 positive patients compared to negative ones. Using MOLT4 and MUTZ‐2 as leukemic cell models, which are characterized by the expression of WNT10BIVS1, we have observed that WNT10B drives major Wnt activation to the FZD6 receptor comple...
    Bias-variance analysis provides a tool to study learning algorithms and can be used to properly design ensemble methods well tuned to the properties of a specific base learner. Indeed the effectiveness of ensemble methods critically depends on accuracy, diversity and learning characteristics of base learners. We present an extended experimental analysis of bias-variance decomposition of the error in Support Vector Machines (SVMs), considering Gaussian, polynomial and dot product kernels. A characterization of the error decomposition is provided, by means of the analysis of the relationships between bias, variance, kernel type and its parameters, offering insights into the way SVMs learn. The results show that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected, especially in Gaussian and polynomial kernels. We show that the bias-variance decomposition offers a rationale to develop ensemble methods using SVMs as base ...
    Theoretical and experimental analyses of bagging indicate that it is primarily a variance reduction technique. This suggests that bagging should be applied to learning algorithms tuned to minimize bias, even at the cost of some increase in variance. We test this idea with Support Vector Machines (SVMs) by employing out-of-bag estimates of bias and variance to tune the SVMs. Experiments indicate that bagging of low-bias SVMs (the "lobag" algorithm) never hurts generalization performance and often improves it compared with well-tuned single SVMs and with bags of individually well-tuned SVMs.
    Several works showed that biomolecular data integration is a key issue to improve the prediction of gene functions. Quite surprisingly, little attention has been devoted to data integration for gene function prediction through ensemble methods. In this work we show that relatively simple ensemble methods are competitive and in some cases are also able to outperform state-of-the-art data integration techniques for gene function prediction.
    The aim of this retrospective study is to assess any association between abdominal CT findings and the radiological stage of COVID-19 pneumonia, pulmonary embolism and patient outcomes. We included 158 adult hospitalized COVID-19 patients between 1 March 2020 and 1 March 2021 who underwent 206 abdominal CTs. Two radiologists reviewed all CT images. Pathological findings were classified as acute or not. A subset of patients with inflammatory pathology in ACE2 organs (bowel, biliary tract, pancreas, urinary system) was identified. The radiological stage of COVID pneumonia, pulmonary embolism, overall days of hospitalization, ICU admission and outcome were registered. Univariate statistical analysis coupled with explainable artificial intelligence (AI) techniques were used to discover associations between variables. The most frequent acute findings were bowel abnormalities (n = 58), abdominal fluid (n = 42), hematomas (n = 28) and acute urologic conditions (n = 8). According to univari...
    The multi-label hierarchical prediction of gene functions at genome and ontology-wide level is a central problem in bioinformatics, and raises challenging questions from a machine learning standpoint. In this context, multi-label hierarchical ensemble methods that take into account the hierarchical relationships between functional classes have been recently proposed. Various studies also showed that the integration of multiple sources of data is one of the key issues to significantly improve gene function prediction. We propose an integrated approach that combines local data fusion strategies with global hierarchical multi-label methods. The label unbalance typically occurring in gene functional classes is taken into account through the use of cost-sensitive techniques. Ontology-wide results with the yeast model organism, using the FunCat taxonomy, show the effectiveness of the proposed methodological approach.
    The annotation and characterization of tissue-specific cis-regulatory elements (CREs) in non-coding DNA represents an open challenge in computational genomics. Several prior works show that machine learning methods, using epigenetic or spectral features directly extracted from DNA sequences, can predict active promoters and enhancers in specific tissues or cell lines. In particular, very recently deep-learning techniques obtained state-of-the-art results in this challenging computational task. In this study, we provide additional evidence that Feed Forward Neural Networks (FFNN) trained on epigenetic data and one-dimensional convolutional neural networks (CNN) trained on DNA sequence data can successfully predict active regulatory regions in different cell lines. We show that model selection by means of Bayesian optimization applied to both FFNN and CNN models can significantly improve deep neural network performance, by automatically finding models that best fit the data. Further, ...
    Disease and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with imbalanced data that naturally arise in several genome-wide variant scoring problems, thus resulting in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach th...
    Background: Several prediction problems in computational biology and genomic medicine are characterized by both big data as well as a high imbalance between examples to be learned, whereby positive examples can represent a tiny minority with respect to negative examples. For instance, deleterious or pathogenic variants are overwhelmed by the sea of neutral variants in the non-coding regions of the genome: thus, the prediction of deleterious variants is a challenging, highly imbalanced classification problem, and classical prediction tools fail to detect the rare pathogenic examples among the huge amount of neutral variants or undergo severe restrictions in managing big genomic data. Results: To overcome these limitations we propose parSMURF, a method that adopts a hyper-ensemble approach and oversampling and undersampling techniques to deal with imbalanced data, and parallel computational techniques to both manage big genomic data and substantially speed up the computation. The synerg...
    Missing data imputation has been a hot topic in the past decade, and many state-of-the-art works have been presented to propose novel, interesting solutions that have been applied in a variety of fields. In the past decade, the successful results achieved by deep learning techniques have opened the way to their application for solving difficult problems where human skill is not able to provide a reliable solution. Not surprisingly, some deep learners, mainly exploiting encoder-decoder architectures, have also been designed and applied to the task of missing data imputation. However, most of the proposed imputation techniques have not been designed to tackle “complex data”, that is high dimensional data belonging to datasets with huge cardinality and describing complex problems. Precisely, they often need critical parameters to be manually set or exploit complex architecture and/or training phases that make their computational load impracticable. In this paper, after clustering the s...
    The regulatory code that determines whether and how a given genetic variant affects the function of a regulatory element remains poorly understood for most classes of regulatory variation. Indeed the large majority of bioinformatics tools have been developed to predict the pathogenicity of genetic variants in coding sequences or conserved splice sites. Computational algorithms for the prediction of non-coding deleterious variants associated with rare genetic diseases are faced with special challenges owing to the rarity of confirmed pathogenic mutations. Indeed in this context classical machine learning methods are biased toward neutral variants that constitute the large majority of genetic variation, and are not able to detect the potential deleterious variants that constitute only a tiny minority of all known genetic variation. We recently proposed hyperSMURF, hyper-ensemble of SMOTE Undersampled Random Forests, an ensemble approach explicitly designed to deal with the huge imbala...
    Several problems in network biology and medicine can be cast into a framework where entities are represented through partially labeled networks, and the aim is inferring the labels (usually binary) of the unlabeled part. Connections represent functional or genetic similarity between entities, while the labellings often are highly unbalanced, that is one class is largely under-represented: for instance in the automated protein function prediction (AFP) for most Gene Ontology terms only few proteins are annotated, or in the disease-gene prioritization problem only few genes are actually known to be involved in the etiology of a given disease. Imbalance-aware approaches to accurately predict node labels in biological networks are thereby required. Furthermore, such methods must be scalable, since input data can be large-sized as, for instance, in the context of multi-species protein networks. We propose a novel semi-supervised parallel enhancement of COSNET, an imbalance-aware algorith...
    The prediction of human gene-abnormal phenotype associations is a fundamental step toward the discovery of novel genes associated with human disorders, especially when no genes are known to be associated with a specific disease. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of the abnormalities associated with human diseases. While the problem of the prediction of gene-disease associations has been widely investigated, the related problem of gene-phenotypic feature (i.e., HPO term) associations has been largely overlooked, even if for most human genes no HPO term associations are known and despite the increasing application of the HPO to relevant medical problems. Moreover, most of the methods proposed in the literature are not able to capture the hierarchical relationships between HPO terms, thus resulting in inconsistent and relatively inaccurate predictions. We present two hierarchical ensemble methods that we formally prove to provide biologica...
    A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis ...
    The interpretation of non-coding variants still constitutes a major challenge in the application of whole-genome sequencing in Mendelian disease, especially for single-nucleotide and other small non-coding variants. Here we present Genomiser, an analysis framework that is able not only to score the relevance of variation in the non-coding genome, but also to associate regulatory variants to specific Mendelian diseases. Genomiser scores variants through either existing methods such as CADD or a bespoke machine learning method and combines these with allele frequency, regulatory sequences, chromosomal topological domains, and phenotypic relevance to discover variants associated to specific Mendelian disorders. Overall, Genomiser is able to identify causal regulatory variants as the top candidate in 77% of simulated whole genomes, allowing effective detection and discovery of regulatory variants in Mendelian disease.
    ABSTRACT Objective: The ultimate goal of any genome-scale experiment is to provide a functional interpretation of the results, relating the available genomic information to the hypotheses that originated the experiment. Methods and results: ...
