Professor Alexander N. Gorban has held a personal chair in Applied Mathematics at the University of Leicester since 2004. He previously worked for the Russian Academy of Sciences, Siberian Branch (Krasnoyarsk, Russia), and ETH Zürich (Switzerland), and was a visiting professor and research scholar at the Clay Mathematics Institute (Cambridge, MA), IHES (Bures-sur-Yvette, Île-de-France), the Courant Institute of Mathematical Sciences (New York), and the Isaac Newton Institute for Mathematical Sciences (Cambridge, UK). His main research interests are the dynamics of systems of physical, chemical, and biological kinetics; biomathematics; data mining; and model reduction problems.
Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, frequently making them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points, and generalizes the supervised extension of principal component analysis. DAPCA is an iterat...
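DAPCA's iterations and its source-target weighting scheme are specific to the paper; as a minimal sketch of the underlying building block only — principal component analysis generalized to signed weights between pairs of points — the following maximizes the weighted sum of squared distances between projections, which reduces to an eigenproblem for a weighted scatter matrix. The ±1 weights by label agreement are an illustrative choice, not the DAPCA weighting:

```python
import numpy as np

def weighted_pca(X, W, n_components=2):
    """PCA generalized to signed pairwise weights (a sketch, not DAPCA itself).

    Maximizes sum_ij W[i, j] * ||V^T x_i - V^T x_j||^2 over orthonormal V.
    For symmetric W this reduces to the top eigenvectors of the weighted
    scatter matrix S = 2 * (X^T diag(W.sum(1)) X - X^T W X).
    """
    D = np.diag(W.sum(axis=1))
    S = 2.0 * (X.T @ D @ X - X.T @ W @ X)
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :n_components]    # top eigenvectors of S
    return X @ V, V

# Toy usage: same-label pairs attract (negative weight), different-label
# pairs repel (positive weight), so classes separate in the projection.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)
W = np.where(y[:, None] == y[None, :], -1.0, 1.0)
np.fill_diagonal(W, 0.0)
Z, V = weighted_pca(X - X.mean(axis=0), W)
```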
One major problem in Natural Language Processing is the automatic analysis and representation of human language. Human language is ambiguous, and deeper understanding of semantics and the creation of human-to-machine interaction have required efforts to create schemes for the act of communication and to build common-sense knowledge bases for the ‘meaning’ in texts. This paper introduces computational methods for semantic analysis and for quantifying the meaning of short scientific texts. Computational methods for extracting semantic features are used to analyse the relations between texts of messages and ‘representations of situations’ for a newly created large collection of scientific texts, the Leicester Scientific Corpus. The representation of scientific-specific meaning is standardised by replacing the situation representations, rather than psychological properties, with the vectors of some attributes: a list of scientific subject categories that the text belongs to. First, this paper introd...
Drug use disorder is characterised by several terms: addiction, dependence, and abuse. We discuss the notion of psychoactive substance and the relations between the existing definitions. The personality traits which may be important for predisposition to use of drugs are introduced: the Five-Factor Model, impulsivity, and sensation-seeking. A number of studies have illustrated that personality traits are associated with drug consumption. The previous pertinent results are reviewed. A database with information on 1,885 respondents and their usage of 18 drugs is introduced. The results of our study are briefly outlined: the personality traits (Five-Factor Model, impulsivity, and sensation-seeking) together with simple demographic data make possible the prediction of the risk of consumption of individual drugs; there are distinct personality profiles for users of different drugs. In particular, the groups of heroin and ecstasy users are significantly different, and there exist three correlation pleiades of drugs. These are clusters of drugs with correlated consumption, centred around heroin, ecstasy, and benzodiazepines.
Large datasets represented by multidimensional data point clouds often possess non-trivial distributions with branching trajectories and excluded regions, with the recent single-cell transcriptomic studies of the developing embryo being notable examples. Reducing the complexity and producing compact and interpretable representations of such data remains a challenging task. Most of the existing computational methods are based on exploring the local data point neighbourhood relations, a step that can perform poorly in the case of multidimensional and noisy data. Here we present ElPiGraph, a scalable and robust method for approximation of datasets with complex structures which does not require computing the complete data distance matrix or the data point neighbourhood graph. This method is able to withstand high levels of noise and is capable of approximating complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently ...
Can the analysis of the semantics of words used in the text of a scientific paper predict its future impact measured by citations? This study details examples of automated text classification that achieved an 80% success rate in distinguishing between highly-cited and little-cited articles. Automated intelligent systems allow the identification of promising works that could become influential in the scientific community. The problems of quantifying the meaning of texts and representation of human language have been clear since the inception of Natural Language Processing. This paper presents a novel method for vector representation of text meaning based on information theory and shows how this informational semantics is used for text classification on the basis of the Leicester Scientific Corpus. We describe the experimental framework used to evaluate the impact of scientific articles through their informational semantics. Our interest is in citation classification to discover how impor...
We present ElPiGraph, a method for approximating data distributions having non-trivial topological features such as the existence of excluded regions or branching structures. Unlike many existing methods, ElPiGraph is not based on the construction of a k-nearest neighbour graph, a procedure that can perform poorly in the case of multidimensional and noisy data. Instead, ElPiGraph constructs elastic principal graphs in a more robust way by minimizing elastic energy, applying graph grammars and explicitly controlling topological complexity. Using a trimmed approximation error function makes ElPiGraph extremely robust to the presence of background noise without decreasing computational performance and allows it to deal with complex cases of manifold learning (for example, ElPiGraph can learn disconnected intersecting manifolds). Thanks to the quasi-quadratic nature of the elastic function, ElPiGraph performs almost as fast as a simple k-means clustering and, therefore, is much more scala...
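The elastic energy that drives the fitting can be sketched as the sum of three terms: a (possibly trimmed) approximation term plus penalties on edge stretching and star bending. This is a schematic reading of the method, not ElPiGraph's exact functional; the weighting and normalisation in ElPiGraph differ, and lam, mu and trim_radius here are illustrative placeholders:

```python
import numpy as np

def elastic_energy(X, nodes, edges, lam=0.01, mu=0.1, trim_radius=None):
    """Energy of an elastic graph fitted to data (schematic, after ElPiGraph).

    X: (n, d) data points; nodes: (k, d) graph node positions;
    edges: list of (i, j) node index pairs.
    """
    # Approximation term: squared distance from each point to its nearest
    # node; trimming caps the contribution of far-away background noise.
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)   # (n, k)
    nearest = d2.min(axis=1)
    if trim_radius is not None:
        nearest = np.minimum(nearest, trim_radius ** 2)
    approx = nearest.mean()

    # Edge stretching term: penalizes long edges.
    stretch = sum(((nodes[i] - nodes[j]) ** 2).sum() for i, j in edges)

    # Star bending term: a node of degree >= 2 should sit near the mean
    # of its neighbours, penalizing sharp bends and wild branching.
    nbrs = {i: [] for i in range(len(nodes))}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    bend = sum(((nodes[c] - nodes[ns].mean(axis=0)) ** 2).sum()
               for c, ns in nbrs.items() if len(ns) >= 2)

    return approx + lam * stretch + mu * bend
```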
2018 International Joint Conference on Neural Networks (IJCNN), 2018
Defining an error function (a measure of deviation of a model prediction from the data) is a critical step in any optimization-based data analysis method, including regression, clustering and dimension reduction. The usual quadratic error function, in the case of real-life high-dimensional and noisy data, suffers from non-robustness to the presence of outliers. Therefore, using non-quadratic error functions in data analysis and machine learning (such as those based on the L1 norm) is an active field of modern research, but the majority of suggested methods are either slow or imprecise (use arbitrary heuristics). We suggest a flexible and highly performant approach to generalize most existing data analysis methods to an arbitrary error function of subquadratic growth. For this purpose, we exploit PQSQ functions (piece-wise quadratic of subquadratic growth), which can be minimized by a simple and fast splitting-based iterative algorithm. The theoretical basis of the PQSQ approach is an application of min-plus...
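A minimal sketch of the idea, under stated assumptions: build the piecewise-quadratic coefficients that match a subquadratic potential (here the L1 potential) at a set of knots, with a constant tail beyond the last knot, then compute a robust central point by the splitting algorithm. Knot values and the 1-D setting are illustrative choices, not the paper's configuration:

```python
import numpy as np

def pqsq_coeffs(f, knots):
    """PQSQ coefficients for a subquadratic potential f (e.g. f = abs).

    On |x| in [knots[k], knots[k+1]) the potential is a[k]*x^2 + b[k],
    matching f at the knots; beyond the last knot it stays constant,
    which is what makes the resulting error function trimmed.
    """
    r = np.asarray(knots, dtype=float)        # 0 = r_0 < r_1 < ... < r_p
    a = (f(r[1:]) - f(r[:-1])) / (r[1:] ** 2 - r[:-1] ** 2)
    b = f(r[:-1]) - a * r[:-1] ** 2
    return a, b

def pqsq_mean(x, knots, n_iter=50):
    """Central point of a 1-D sample under the PQSQ version of the L1 error,
    via the splitting algorithm: alternately (1) assign each residual to a
    quadratic piece, (2) solve the weighted quadratic problem in closed form."""
    a, _ = pqsq_coeffs(np.abs, knots)
    r = np.asarray(knots, dtype=float)
    c = np.median(x)
    for _ in range(n_iter):
        res = np.abs(x - c)
        k = np.searchsorted(r[1:-1], res, side='right')   # piece index
        w = a[k] * (res < r[-1])          # beyond the last knot: weight 0
        if w.sum() == 0:
            break
        c = (w * x).sum() / w.sum()
    return c

data = np.concatenate([np.random.normal(0, 1, 100), [50.0, 60.0]])
print(pqsq_mean(data, knots=[0, 0.5, 1, 2, 4, 8]))   # ignores the outliers
```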
In this paper we argue that (lexical) meaning in science can be represented in a 13-dimensional Meaning Space. This space is constructed using principal component analysis (singular value decomposition) on the matrix of word–category relative information gains, where the categories are those used by the Web of Science, and the words are taken from a reduced word set from texts in the Web of Science. We show that this reduced word set plausibly represents all texts in the corpus, so that the principal component analysis has some objective meaning with respect to the corpus. We argue that 13 dimensions are adequate to describe the meaning of scientific texts, and hypothesise about the qualitative meaning of the principal components.
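The pipeline — a words-by-categories matrix of relative information gains, followed by PCA via SVD — can be sketched as below. The binary RIG definition is one plausible reading of the construction, and the data is random; the real construction uses the Leicester Scientific Corpus and Web of Science categories:

```python
import numpy as np

def rig(word_in_text, text_in_cat):
    """Relative information gain of a binary word indicator about a binary
    category indicator: I(C; W) / H(C)."""
    def H(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    h_c = H(np.array([text_in_cat.mean(), 1 - text_in_cat.mean()]))
    if h_c == 0:
        return 0.0
    h_c_given_w = 0.0
    for w in (True, False):
        mask = word_in_text == w
        if mask.any():
            p1 = text_in_cat[mask].mean()
            h_c_given_w += mask.mean() * H(np.array([p1, 1 - p1]))
    return (h_c - h_c_given_w) / h_c

# Meaning Space sketch: rows = words, columns = subject categories,
# entries = RIG(word; category); PCA via SVD of the centered matrix.
rng = np.random.default_rng(1)
word_occ = rng.random((200, 2000)) < 0.1    # 200 words x 2000 texts
cat_memb = rng.random((2000, 30)) < 0.2     # 2000 texts x 30 categories
M = np.array([[rig(word_occ[i], cat_memb[:, j])
               for j in range(cat_memb.shape[1])]
              for i in range(word_occ.shape[0])])
U, s, Vt = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
word_coords = U[:, :13] * s[:13]            # words in a 13-dim Meaning Space
```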
The cell cycle is the most fundamental biological process underlying the existence and propagation of life in time and space. It has long been an object of mathematical modeling, with several alternative mechanistic modeling principles suggested, describing the known molecular mechanisms in more or less detail. Recently, the cell cycle has been investigated at the single-cell level in snapshots of unsynchronized cell populations, exploiting the new methods for transcriptomic and proteomic molecular profiling. This raises a need for simplified semi-phenomenological cell cycle models, in order to formalize the processes underlying the cell cycle at a higher level of abstraction. Here we suggest a modeling framework recapitulating the most important properties of the cell cycle as a limit trajectory of a dynamical process characterized by several internal states with switches between them. In the simplest form, this leads to a limit cycle trajectory, composed of linear segments in logarithmic ...
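The paper's kinetic models are not reproduced here; the following toy system merely illustrates the general picture it describes — several internal states, constant rates within each state in logarithmic coordinates, and threshold-triggered switches, whose limit trajectory is a closed polygon of linear segments. All rates and thresholds are arbitrary illustrative values:

```python
import numpy as np

# Four internal states with constant (log-)rates for two species; each
# threshold crossing switches to the next state. The trajectory settles
# onto the unit square, a polygonal limit cycle in log coordinates.
rates = [np.array([+1.0, 0.0]), np.array([0.0, +1.0]),
         np.array([-1.0, 0.0]), np.array([0.0, -1.0])]

def switched(state, z):
    return [z[0] > 1, z[1] > 1, z[0] < 0, z[1] < 0][state]

z, state, dt, traj = np.array([0.3, 0.6]), 0, 0.001, []
for _ in range(20000):
    z = z + dt * rates[state]       # linear motion in log coordinates
    if switched(state, z):
        state = (state + 1) % 4     # cyclic switching of internal states
    traj.append(z.copy())
traj = np.array(traj)               # converges to the square limit cycle
```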
The concept of biological adaptation was closely connected to some mathematical and engineering ideas from the very beginning. Cannon’s homeostasis is, in its essence, automatic stabilisation of the body. Selye discovered the phases and limits of adaptation to harmful conditions. His model of General Adaptation Syndrome (GAS) states that an event that threatens an organism’s well-being leads to a three-stage bodily response: Alarm – Resistance – Exhaustion. He concluded from his experiments that the regulatory mechanisms need some “adaptation resource” (adaptability) and demonstrated that the adaptability decreases in the course of adaptation. This adaptability is a hypothetical extensive variable, adaptation energy. In GAS, the adaptation resource is spent on continuous neutralisation of a harmful factor which affects the organism. In addition, Selye demonstrated that adaptability is spent on training: a rat was trained for resistance to one factor but lost the ability to train for resistance to another factor. There is a fundamental difference between the resource (adaptation energy) spent in GAS and the resource spent on training. The second type is the change of the resistivity landscape. The domain of values of the harmful factors where the organism can survive is changing. The volume of this domain can be extended, but not for free: its extension requires adaptation energy. The logarithm of this volume is an entropy, and we can call it the “adaptation entropy”. Thus, analysis of Selye’s experiments and physiological hypotheses leads us to the notion of adaptation entropy and, in combination with adaptation energy, to the “adaptation free energy”. We present a new family of dynamical models of physiological adaptation based on the notion of the adaptation free energy. This is a new class of “top-down” thermodynamic models for physiology.
A lecture given at the 2016 summer school on "Interscale interactions in fluid mechanics and beyond" organised by the EPSRC Centre for Doctoral Training in Fluid Dynamics Across the Scales of Imperial College London.
Hilbert's 6th problem concerns the axiomatization of those parts of physics which are ready for a rigorous mathematical approach. D. Hilbert attracted special attention to the rigorous theory of limiting processes "which lead from the atomistic view to the laws of motion of continua". We formalise this question as a problem of slow invariant manifolds for kinetic equations. We review a few instances where such hydrodynamic manifolds were found by the direct solution of the invariance equation. The dynamic equations on these manifolds give us a clue about the proper asymptotics of the continuum mechanics equations for rarefied non-equilibrium gases.
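For context, the invariance equation mentioned here can be stated in its standard form from the invariant-manifold literature (the notation below is ours, not quoted from this abstract):

```latex
% Kinetic equation and the invariance condition for an ansatz manifold
% f_M parametrized by macroscopic variables M:
\[
  \frac{df}{dt} = J(f), \qquad (1 - P_M)\, J(f_M) = 0 ,
\]
% where P_M is the projector onto the tangent space of the manifold at
% f_M. A manifold satisfying this condition is invariant: the vector
% field J is tangent to it, so trajectories starting on it stay on it.
```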
We consider multiscale networks of transport processes and approximate their dynamics by systems of simple dominant networks. The dominant systems can be used for direct computation of steady states and relaxation dynamics, especially when kinetic information is incomplete. They can serve as a robust first approximation in perturbation theory or for preconditioning. Many of the parameters of the initial model are no longer present in the dominant system: these parameters are non-critical. Parameters of dominant systems indicate putative targets to change the behavior of the large network and answer an important question: given a network model, which are its critical parameters? The dominant system is, by definition, the system that gives us the main asymptotic terms of the stationary state and relaxation in the limit of well-separated rate constants. The theory of dominant systems for networks with first-order kinetics and Markov chains is well developed [1, 2]. We found the explicit asymptotics of eigenvectors and eigenvalues. All algorithms are represented topologically by transformation of the graph of networks (labeled by the transport coefficients). In the simplest cases, the dominant system can be represented as a dominant path in the network. In the general case, a hierarchy of dominant paths in the hierarchy of lumped networks is needed. Accuracy of the estimates is proven. Performance of the algorithms is demonstrated on simple benchmarks and on multiscale biochemical networks. These methods are applied, in particular, to the analysis of microRNA-mediated mechanisms of translation repression. For nonlinear networks, we present a new heuristic algorithm for calculation of the hierarchy of dominant paths. The results of the analysis of the dominant systems often support the observation by Kruskal: “And the answer quite generally has the form of a new system (well posed problem) for the solution to satisfy, although this is sometimes obscured because the new system is so easily solved that one is led directly to the solution without noticing the intermediate step.”
References:
[1] A.N. Gorban, O. Radulescu, A.Y. Zinovyev, Asymptotology of chemical reaction networks, Chemical Engineering Science 65 (2010), 2310–2324.
[2] A.N. Gorban, O. Radulescu, Dynamic and static limitation in multiscale reaction networks, revisited, Advances in Chemical Engineering 34 (2008), 103–173.
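As a small illustration of the first step in building a dominant system for first-order kinetics with well-separated constants, the sketch below keeps, for every node, only its fastest outgoing reaction (the auxiliary discrete dynamical system of [1]); the subsequent steps — gluing cycles and rebuilding the hierarchy of lumped networks — are omitted:

```python
def auxiliary_system(rates):
    """From each node keep only the fastest outgoing reaction.

    rates: dict mapping (source, target) -> rate constant.
    Returns the map node -> dominant successor.
    """
    fastest = {}
    for (src, dst), k in rates.items():
        if src not in fastest or k > fastest[src][1]:
            fastest[src] = (dst, k)
    return {src: dst for src, (dst, _) in fastest.items()}

# Toy network with well-separated constants; the dominant path structure
# is read off from the fastest outgoing edges.
network = {('A', 'B'): 100.0, ('A', 'C'): 1.0,
           ('B', 'C'): 0.01, ('B', 'A'): 10.0,
           ('C', 'A'): 1000.0}
print(auxiliary_system(network))   # {'A': 'B', 'B': 'A', 'C': 'A'}
```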
An invited talk at the international conference Modelling Biological Evolution 2015: Linking Mathematical Theories with Empirical Realities.
In 1938, H. Selye proposed the notion of adaptation energy and published some “Experimental evidence supporting the conception of adaptation energy”. This idea was widely criticized and its use nowadays is rather limited. Nevertheless, the response to many harmful factors often has a general, non-specific form, and we can guess that the mechanisms of physiological adaptation admit a very general and nonspecific description. We assume that natural selection plays a key role in the evolution of mechanisms of physiological adaptation and apply optimality models to the description of these mechanisms. In the light of the optimality models, the mechanisms of adaptation are represented as the optimal distribution of resources for the neutralization of harmful factors. We study the dynamics of resource redistribution and revisit the theory of the general adaptation syndrome. Adaptation energy is considered as an internal coordinate on the ‘dominant path’ in the model of adaptation. The phenomenon of ‘oscillating death’ is predicted on the basis of the dynamical models of adaptation.
Conference of the International Federation of Classification Societies, University of Bologna, 7th July 2015.
The problem of identification of pairs of loci associated with heat tolerance in yeasts is considered. Interactions of Quantitative Trait Loci (QTL) in heat-selected yeast are analysed by comparing them to an unselected pool of random individuals. Data on individual F12 progeny selected for heat tolerance, which have been genotyped at 25 locations identified by sequencing a selected pool, are re-examined. 960 individuals were genotyped at these locations and multi-locus genotype frequencies were compared to 172 sequenced individuals from the original unselected pool. We use Relative Information Gain (RIG) for the analysis of associations between loci. Correlation analysis in many pairs of loci requires multiple-testing methods. Two multiple-testing approaches are applied for the selection of associations: the False Discovery Rate (FDR) method in the version suggested by J.D. Storey and R. Tibshirani, and a specially developed Bootstrap Test of ordered RIG (BToRIG). BToRIG demonstrates slightly higher sensitivity than the FDR approach does for FDR=1. The statistical analysis of entropy and RIG in genotypes of a selected population reveals more interactions than previously seen. Importantly, this is done in comparison to the unselected population’s genotypes to account for inherent biases in the original population.
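The RIG statistic for a pair of loci, and a simple null test for it, can be sketched as follows. The permutation test below is a stand-in for illustration only: the actual BToRIG test bootstraps ordered RIG values and is not reproduced here.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def rig(x, y):
    """Relative Information Gain of locus x about locus y:
    the fraction of H(y) removed by knowing x."""
    h_y = entropy(y)
    if h_y == 0:
        return 0.0
    h_y_x = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return (h_y - h_y_x) / h_y

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Null distribution of RIG under independence of the two loci,
    obtained by permuting one genotype vector."""
    rng = np.random.default_rng(seed)
    observed = rig(x, y)
    null = np.array([rig(rng.permutation(x), y) for _ in range(n_perm)])
    return (np.sum(null >= observed) + 1) / (n_perm + 1)
```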
A talk given at the Conference of the International Federation of Classification Societies, University of Bologna, 8th July 2015. The problem of evaluating an individual’s risk of drug consumption and misuse is highly important and novel. An online survey methodology was employed to collect data including personality traits (NEO-FFI-R), impulsivity (BIS-11), sensation seeking (ImpSS), and demographic information. The data set contained information on the consumption of 18 central nervous system psychoactive drugs. Correlation analysis using a relative information gain model demonstrates the existence of a group of drugs (amphetamines, cannabis, cocaine, ecstasy, legal highs, LSD, and magic mushrooms) with strongly correlated consumption. An exhaustive search was performed to select the most effective subset of input features and data mining methods to classify users and non-users for each drug. A number of classification methods were employed (decision tree, random forest, k-nearest...
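A minimal sketch of this classification setup with scikit-learn: a decision tree predicting user vs. non-user of one drug from personality and demographic scores, scored by cross-validation. The feature list in the comment and the random labels are illustrative, not the actual dataset schema or results:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1885, 8))   # e.g. N, E, O, A, C, BIS-11, ImpSS, age
y = rng.integers(0, 2, 1885)     # user / non-user of one drug (toy labels)

clf = DecisionTreeClassifier(max_depth=4, class_weight='balanced')
print(cross_val_score(clf, X, y, cv=5, scoring='balanced_accuracy').mean())
```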
This book discusses the psychological traits associated with drug consumption through the statistical analysis of a new database with information on 1885 respondents and their use of 18 drugs. After reviewing published works on the psychological profiles of drug users and describing the data mining and machine learning methods used, it demonstrates that the personality traits (Five-Factor Model, impulsivity, and sensation seeking), together with simple demographic data, make it possible to predict the risk of consumption of individual drugs with a sensitivity and specificity above 70% for most drugs. It also analyzes the correlations of use of different substances and describes the groups of drugs with correlated use, identifying significant differences in personality profiles for users of different drugs.
The book is intended for advanced undergraduates and first-year PhD students, as well as researchers and practitioners. Although no previous knowledge of machine learning, advanced data mining concepts or modern psychology of personality is assumed, familiarity with basic statistics and some experience in the use of probabilities would be helpful.