Professor Alexander N. Gorban has held a personal chair in Applied Mathematics at the University of Leicester since 2004. He previously worked for the Siberian Branch of the Russian Academy of Sciences (Krasnoyarsk, Russia) and ETH Zürich (Switzerland), and was a visiting professor and research scholar at the Clay Mathematics Institute (Cambridge, MA), IHES (Bures-sur-Yvette, Île-de-France), the Courant Institute of Mathematical Sciences (New York), and the Isaac Newton Institute for Mathematical Sciences (Cambridge, UK). His main research interests are the dynamics of systems of physical, chemical and biological kinetics; biomathematics; data mining; and model reduction problems.
One major problem in Natural Language Processing is the automatic analysis and representation of human language. Human language is ambiguous, and a deeper understanding of semantics and the creation of human-to-machine interaction have required efforts to create schemes for the act of communication and to build common-sense knowledge bases for the 'meaning' in texts. This paper introduces computational methods for semantic analysis and for quantifying the meaning of short scientific texts. Computational methods for extracting semantic features are used to analyse the relations between texts of messages and 'representations of situations' for a newly created large collection of scientific texts, the Leicester Scientific Corpus. The representation of science-specific meaning is standardised by replacing the situation representations, rather than psychological properties, with vectors of attributes: a list of scientific subject categories that the text belongs to. First, this paper introduces the 'Meaning Space', in which the informational representation of meaning is extracted from the occurrence of words in texts across the scientific categories; i.e., the meaning of a word is represented by a vector of Relative Information Gain about the subject categories. Then, the meaning space is statistically analysed for the Leicester Scientific Dictionary-Core, and we investigate the 'Principal Components of the Meaning' to describe the adequate dimensions of meaning. The research in this paper lays the foundation for the geometric representation of the meaning of texts.
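As a hedged sketch of the construction this abstract describes (the toy corpus and all names below are invented for illustration; this is not the authors' released code), the Relative Information Gain of a word about a subject category can be computed from word presence/absence across texts:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a Bernoulli variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def rig(word_in_text, text_in_category):
    """Relative Information Gain of a word about a category.

    Both arguments are boolean arrays indexed by text.
    """
    h_c = entropy(text_in_category.mean())
    if h_c == 0.0:
        return 0.0
    h_cond = 0.0
    for present in (True, False):
        mask = word_in_text == present
        if mask.any():
            h_cond += mask.mean() * entropy(text_in_category[mask].mean())
    return (h_c - h_cond) / h_c          # normalised to [0, 1]

# toy corpus of 1000 'texts': word occurrence correlated with category membership
rng = np.random.default_rng(0)
word = rng.random(1000) < 0.3
category = word ^ (rng.random(1000) < 0.1)
print("RIG:", round(rig(word, category), 3))
```

Repeating this over all words and all categories yields the word-by-category matrix whose rows are the meaning vectors analysed in the paper.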
In 1938, H. Selye proposed the notion of adaptation energy and published "Experimental evidence supporting the conception of adaptation energy". Adaptation of an animal to different factors appears as the spending of a single resource. Adaptation energy is a hypothetical extensive quantity spent for adaptation. This term causes much debate when one takes it literally, as a physical quantity, i.e. a sort of energy. The controversial points of view impede the systematic use of the notion of adaptation energy despite the experimental evidence. Nevertheless, the response to many harmful factors often has a general, non-specific form, and we suggest that the mechanisms of physiological adaptation admit a very general and non-specific description. We aim to demonstrate that Selye's adaptation energy is the cornerstone of the top-down approach to modelling non-specific adaptation processes. We analyse Selye's axioms of adaptation energy together with Goldstone's modifications and propose a series of models for the interpretation of these axioms. Adaptation energy is considered as an internal coordinate on the 'dominant path' in the model of adaptation. The phenomena of 'oscillating death' and 'oscillating remission' are predicted on the basis of the dynamical models of adaptation. Natural selection plays a key role in the evolution of mechanisms of physiological adaptation. We use the fitness optimization approach to study the distribution of resources for the neutralization of harmful factors during adaptation to a multifactor environment, and analyse the optimal strategies for different systems of factors.
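Purely as an illustrative aside (this toy model is not the authors' published system of equations; all parameters are arbitrary), the basic bookkeeping of an adaptation resource spent against a harmful load can be sketched as a pair of ODEs:

```python
import numpy as np

def simulate(load=1.0, recovery=0.02, spend=0.5, damage_gain=0.3, repair=0.05,
             e0=1.0, dt=0.01, steps=50000):
    """Toy dynamics: adaptation energy e is consumed to neutralise a constant
    harmful load; damage d accumulates whenever neutralisation falls short."""
    e, d = e0, 0.0
    for _ in range(steps):
        neutralised = spend * e
        de = recovery * (e0 - e) - spend * e * load      # resource spent on defence
        dd = damage_gain * max(load - neutralised, 0.0) - repair * d
        e += dt * de
        d += dt * dd
    return e, d

print("long-run (energy, damage):", simulate())
```

The oscillatory regimes ('oscillating death', 'oscillating remission') studied in the paper arise from richer dynamics along the dominant path; the sketch only shows the resource-depletion mechanism.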
Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, frequently making them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points, and generalizes the supervised extension of principal component analysis. DAPCA is an iterat...
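A minimal sketch of the kind of construction the abstract describes (pairwise weights generalizing PCA); the weighting scheme and the toy check below are illustrative, not the published DAPCA algorithm:

```python
import numpy as np

def weighted_pair_pca(X, W, n_components=2):
    """Leading directions maximising sum_ij W_ij * ||proj(x_i - x_j)||^2.

    Positive W_ij spreads a pair apart in the projection; negative W_ij pulls
    it together. With W_ij = 1 for all pairs this reduces to ordinary PCA of
    centred data. W must be symmetric.
    """
    # S = sum_ij W_ij (x_i - x_j)(x_i - x_j)^T, expanded to avoid a double loop
    row = W.sum(axis=1)
    S = 2 * (X.T @ np.diag(row) @ X - X.T @ W @ X)
    S = (S + S.T) / 2                        # symmetrise against round-off
    vals, vecs = np.linalg.eigh(S)
    return vecs[:, np.argsort(vals)[::-1][:n_components]]

# toy check: uniform weights reproduce the ordinary PCA subspace of centred data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ np.diag([3, 2, 1, 0.5, 0.1])
comps = weighted_pair_pca(X - X.mean(0), np.ones((200, 200)))
print(comps.shape)
```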
2019 International Joint Conference on Neural Networks (IJCNN), 2019
The curse of dimensionality causes well-known and widely discussed problems for machine learning methods. There is a hypothesis that use of the Manhattan distance and even fractional quasinorms lp (for p less than 1) can help to overcome the curse of dimensionality in classification problems. In this study, we systematically test this hypothesis for 37 binary classification problems on 25 databases. We confirm that fractional quasinorms have a greater relative contrast or coefficient of variation than the Euclidean norm l2, but we also demonstrate that the distance concentration shows qualitatively the same behaviour for all tested norms and quasinorms, and the difference between them decays as the dimension tends to infinity. Estimation of classification quality for kNN based on different norms and quasinorms shows that a greater relative contrast does not imply better classifier performance, and the worst performance for different databases was shown by different norms (quasinorms). A systematic comparison shows that the difference in performance of kNN based on lp for p = 2, 1, and 0.5 is statistically insignificant.
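A small self-contained probe of the distance-concentration effect the abstract tests (uniform random data; relative contrast is taken in its usual form (D_max - D_min) / D_min, and the dimensions below are arbitrary):

```python
import numpy as np

def relative_contrast(X, query, p):
    """(D_max - D_min) / D_min for l_p 'distances' from query to rows of X."""
    d = (np.abs(X - query) ** p).sum(axis=1) ** (1.0 / p)
    return (d.max() - d.min()) / d.min()

rng = np.random.default_rng(42)
for dim in (10, 100, 1000):
    X = rng.random((500, dim))
    q = rng.random(dim)
    rc = [relative_contrast(X, q, p) for p in (0.5, 1.0, 2.0)]
    print(dim, ["%.3f" % v for v in rc])
```

Running this shows both halves of the abstract's finding: contrast is larger for smaller p at any fixed dimension, yet it shrinks for every p as the dimension grows.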
This chapter includes results of data analysis. The relationship between personality profiles and drug consumption is described, and the individual drug consumption risks for different drugs are evaluated. Significant differences between groups of drug users and non-users are identified. Machine learning algorithms solve the user/non-user classification problem for many drugs with impressive sensitivity and specificity. Analysis of correlations between the use of different drugs reveals the existence of clusters of substances with highly correlated use, which we term correlation pleiades. It is proven that the mean profiles of users of different drugs are significantly different (for benzodiazepines, ecstasy, and heroin). Visualisation of risk by risk maps is presented. The difference between users of different drugs is analysed, and three distinct types of users are identified for benzodiazepines, ecstasy, and heroin.

Keywords: Risk analysis • Psychological profiles • Discriminant analysis • Correlation pleiades • Drug clustering

4.1 Descriptive Statistics and Psychological Profile of Illicit Drug Users

The data set contains seven categories of drug users: 'Never used', 'Used over a decade ago', 'Used in last decade', 'Used in last year', 'Used in last month', 'Used in last week', and 'Used in last day'. A respondent selected their category for every drug from the list. We formed four classification problems based on the following classes (see section 'Drug use'): the decade-, year-, month-, and week-based user/non-user separations. We have identified the relationship between personality profiles (NEO-FFI-R) and drug consumption for the decade-, year-, month-, and week-based classification problems. We have evaluated the risk of drug consumption for each individual according to their personality profile. This evaluation was performed separately for each drug for the decade-based user/non-user separation. We have also analysed the interrelations between the individual drug consumption risks for different drugs. Part of these results has been presented in [1] (and in more detail in the 2015 technical
Drug use disorder is characterised by several terms: addiction, dependence, and abuse. We discuss the notion of a psychoactive substance and the relations between the existing definitions. The personality traits which may be important for predisposition to drug use are introduced: the Five-Factor Model, impulsivity, and sensation-seeking. A number of studies have illustrated that personality traits are associated with drug consumption. The previous pertinent results are reviewed. A database with information on 1,885 respondents and their usage of 18 drugs is introduced. The results of our study are briefly outlined: the personality traits (Five-Factor Model, impulsivity, and sensation-seeking) together with simple demographic data make possible the prediction of the risk of consumption of individual drugs; personality profiles differ between users of different drugs (in particular, the groups of heroin and ecstasy users are significantly different); and there exist three correlation pleiades of drugs. These are clusters of drugs with correlated consumption, centred around heroin, ecstasy, and benzodiazepines.
Large datasets represented by multidimensional data point clouds often possess non-trivial distributions with branching trajectories and excluded regions, with the recent single-cell transcriptomic studies of the developing embryo being notable examples. Reducing the complexity and producing compact and interpretable representations of such data remains a challenging task. Most of the existing computational methods are based on exploring the local data point neighbourhood relations, a step that can perform poorly in the case of multidimensional and noisy data. Here we present ElPiGraph, a scalable and robust method for approximation of datasets with complex structures which does not require computing the complete data distance matrix or the data point neighbourhood graph. This method is able to withstand high levels of noise and is capable of approximating complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently ...
Can the analysis of the semantics of words used in the text of a scientific paper predict its future impact measured by citations? This study details examples of automated text classification that achieved an 80% success rate in distinguishing between highly-cited and little-cited articles. Automated intelligent systems allow the identification of promising works that could become influential in the scientific community. The problems of quantifying the meaning of texts and the representation of human language have been clear since the inception of Natural Language Processing. This paper presents a novel method for vector representation of text meaning based on information theory and shows how this informational semantics is used for text classification on the basis of the Leicester Scientific Corpus. We describe the experimental framework used to evaluate the impact of scientific articles through their informational semantics. Our interest is in citation classification to discover how impor...
We present ElPiGraph, a method for approximating data distributions having non-trivial topological features such as the existence of excluded regions or branching structures. Unlike many existing methods, ElPiGraph is not based on the construction of a k-nearest neighbour graph, a procedure that can perform poorly in the case of multidimensional and noisy data. Instead, ElPiGraph constructs elastic principal graphs in a more robust way by minimizing elastic energy, applying graph grammars and explicitly controlling topological complexity. The use of a trimmed approximation error function makes ElPiGraph extremely robust to the presence of background noise without decreasing computational performance and allows it to deal with complex cases of manifold learning (for example, ElPiGraph can learn disconnected intersecting manifolds). Thanks to the quasi-quadratic nature of the elastic function, ElPiGraph performs almost as fast as simple k-means clustering and, therefore, is much more scala...
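A compact sketch of the elastic-energy functional that elastic principal graphs minimise (approximation error plus edge-stretching and star-bending penalties); the coefficients and the toy chain below are illustrative, and the graph-grammar growth and trimming steps of ElPiGraph are omitted:

```python
import numpy as np

def elastic_energy(X, nodes, edges, stars, lam=0.01, mu=0.1):
    # approximation term: mean squared distance to the nearest graph node
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)
    mse = d2.min(axis=1).mean()
    # edge-stretching term penalises long edges
    stretch = sum(((nodes[i] - nodes[j]) ** 2).sum() for i, j in edges)
    # star-bending term: the leaves of a star should balance around its centre
    bend = sum(((nodes[leaves].mean(0) - nodes[c]) ** 2).sum()
               for c, leaves in stars)
    return mse + lam * stretch + mu * bend

# toy example: a 3-node chain fitted to noisy points along a line
rng = np.random.default_rng(3)
X = np.linspace(0, 1, 200)[:, None] * [1.0, 0.5] + rng.normal(0, 0.02, (200, 2))
nodes = np.array([[0.0, 0.0], [0.5, 0.25], [1.0, 0.5]])
print(elastic_energy(X, nodes, edges=[(0, 1), (1, 2)], stars=[(1, [0, 2])]))
```

Optimising the node positions of this functional (and growing the graph by grammar rules) is what produces the principal graphs discussed in the abstract.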
2018 International Joint Conference on Neural Networks (IJCNN), 2018
Defining an error function (a measure of deviation of a model prediction from the data) is a critical step in any optimization-based data analysis method, including regression, clustering and dimension reduction. The usual quadratic error function suffers, in the case of real-life high-dimensional and noisy data, from non-robustness to the presence of outliers. Therefore, using non-quadratic error functions in data analysis and machine learning (such as those based on the L1 norm) is an active field of modern research, but the majority of suggested methods are either slow or imprecise (they use arbitrary heuristics). We suggest a flexible and highly performant approach to generalize most existing data analysis methods to an arbitrary error function of subquadratic growth. For this purpose, we exploit PQSQ functions (piece-wise quadratic of subquadratic growth), which can be minimized by a simple and fast splitting-based iterative algorithm. The theoretical basis of the PQSQ approach is an application of min-plus...
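A minimal instance of the splitting idea, reduced here to a robust one-dimensional location estimate (a simplified illustration, not the full PQSQ library): the subquadratic target f(u) = u is majorised by piecewise quadratics on threshold intervals, and the iteration alternates between assigning points to quadratic pieces and a weighted least-squares update, trimming points beyond the last threshold:

```python
import numpy as np

def pqsq_mean(x, thresholds=(0.5, 1.0, 2.0), iters=50):
    r = np.concatenate([[0.0], np.asarray(thresholds)])
    f = r.copy()                                         # target potential f(u) = u
    a = (f[1:] - f[:-1]) / (r[1:] ** 2 - r[:-1] ** 2)    # quadratic coefficients
    m = np.median(x)                                     # reasonable start
    for _ in range(iters):
        u = np.abs(x - m)
        k = np.searchsorted(r[1:], u)                    # piece index per point
        w = np.where(k < len(a), a[np.minimum(k, len(a) - 1)], 0.0)  # trim tail
        if w.sum() == 0:
            break
        m = (w * x).sum() / w.sum()                      # weighted least squares
    return m

x = np.concatenate([np.random.default_rng(0).normal(0, 0.1, 100), [50.0, 60.0]])
print(pqsq_mean(x))   # close to 0 despite the gross outliers
```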
In this book a story is told about the psychological traits associated with drug consumption. The book includes:
• A review of published works on the psychological profiles of drug users.
• Analysis of a new original database with information on 1885 respondents and usage of 18 drugs. (The database is available online.)
• An introductory description of the data mining and machine learning methods used for the analysis of this dataset.
• A demonstration that the personality traits (five factor model, impulsivity, and sensation seeking), together with simple demographic data, make it possible to predict the risk of consumption of individual drugs with sensitivity and specificity above 70% for most drugs.
• An analysis of the correlations of use of different substances and a description of the groups of drugs with correlated use (correlation pleiades).
• Proof of significant differences between the personality profiles of users of different drugs, explicitly established for benzodiazepines, ecstasy, and heroin.
• Tables of personality profiles for users and non-users of 18 substances.
The book is aimed at advanced undergraduates or first-year PhD students, as well as researchers and practitioners. No previous knowledge of machine learning, advanced data mining concepts or modern psychology of personality is assumed. For a more detailed introduction to statistical methods we recommend several undergraduate textbooks. Familiarity with basic statistics and some experience in the use of probabilities would be helpful, as would some basic technical understanding of psychology.
In this paper we argue that (lexical) meaning in science can be represented in a 13-dimensional Meaning Space. This space is constructed using principal component analysis (singular value decomposition) on the matrix of word-category relative information gains, where the categories are those used by the Web of Science and the words are taken from a reduced word set from texts in the Web of Science. We show that this reduced word set plausibly represents all texts in the corpus, so that the principal component analysis has some objective meaning with respect to the corpus. We argue that 13 dimensions are adequate to describe the meaning of scientific texts, and hypothesise about the qualitative meaning of the principal components.
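To make the construction concrete, here is a sketch with synthetic data standing in for the Web of Science RIG matrix (the matrix, its size, and the 95% variance threshold below are placeholders, not the paper's actual values):

```python
import numpy as np

rng = np.random.default_rng(7)
n_words, n_categories, true_rank = 2000, 250, 13
# synthetic low-rank RIG-like matrix plus noise, values clipped to [0, 1]
R = rng.random((n_words, true_rank)) @ rng.random((true_rank, n_categories))
R = np.clip(R / R.max() + rng.normal(0, 0.01, R.shape), 0, 1)

Rc = R - R.mean(axis=0)                  # centre columns before PCA
U, s, Vt = np.linalg.svd(Rc, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
print("components retained:", k)
word_coords = U[:, :k] * s[:k]           # word coordinates in the Meaning Space
```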
IEEE Transactions on Geoscience and Remote Sensing, 2021
Artificial light-at-night (ALAN), emitted from the ground and visible from space, marks human presence on Earth. Since the launch of the Suomi National Polar-orbiting Partnership satellite with the Visible Infrared Imaging Radiometer Suite Day/Night Band (VIIRS/DNB) onboard, global nighttime images have significantly improved; however, they have remained panchromatic. Although multispectral images are also available, they are either commercial or, if free of charge, sporadic. In this paper, we use several machine learning techniques, such as linear, kernel and random forest regressions and the elastic map approach, to transform panchromatic VIIRS/DNB images into Red-Green-Blue (RGB) images. To validate the proposed approach, we analyze RGB images for eight urban areas worldwide. We link RGB values, obtained from ISS photographs, to panchromatic ALAN intensities, their pixel-wise differences, and several land-use type proxies. Each dataset is used for model training, while the other datasets are used for model validation. The analysis shows that model-estimated RGB images demonstrate a high degree of correspondence with the original RGB images from the ISS database. Yet, estimates based on linear, kernel and random forest regressions provide better correlations, contrast similarity and lower WMSE levels, while RGB images generated using the elastic map approach provide higher consistency of predictions.
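Schematically, and with toy arrays standing in for the ISS photographs and VIIRS/DNB rasters (the features and targets below are invented for illustration), the per-pixel regression setup looks like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_pixels = 5000
pan = rng.gamma(2.0, 2.0, n_pixels)                 # panchromatic intensity
landuse = rng.integers(0, 4, n_pixels)              # toy land-use proxy
X = np.column_stack([pan, np.gradient(pan), landuse])
# toy ground truth: RGB channels as different nonlinear responses to intensity
Y = np.column_stack([np.sqrt(pan), pan, pan ** 1.2]) + rng.normal(0, 0.1, (n_pixels, 3))

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=50, random_state=0)):
    model.fit(X[:4000], Y[:4000])                   # train on one 'city'
    r2 = model.score(X[4000:], Y[4000:])            # validate on held-out pixels
    print(type(model).__name__, round(r2, 3))
```

The paper's cross-city validation corresponds to holding out whole urban areas rather than random pixels.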
Automatic grading is not a new approach, but the need to adapt the latest technology to automatic grading has become very important. As technology has rapidly become more powerful at scoring exams and essays, especially from the 1990s onwards, partially or wholly automated grading systems using computational methods have evolved and become a major area of research. In particular, the demand for scoring of natural language responses has created a need for tools that can be applied to automatically grade these responses. In this paper, we focus on the concept of automatic grading of short answer questions, such as are typical in the UK GCSE system, and on providing students with useful feedback on their answers. We present experimental results on a dataset from an introductory computer science class at the University of North Texas. We first apply standard data mining techniques to the corpus of student answers for the purpose of measuring similarity between the student answers and the model answer. This is based on the number of common words. We then evaluate the relation between these similarities and the marks awarded by scorers. We consider an approach that groups student answers into clusters. Each cluster would be awarded the same mark, and the same feedback would be given to each answer in a cluster. In this manner, we demonstrate that clusters indicate groups of students who are awarded the same or similar scores. Words in each cluster are compared to show that clusters are constructed based on how many, and which, words of the model answer have been used. The main novelty of this paper is that we design a model to predict marks based on the similarities between the student answers and the model answer. We argue that computational methods should be used to enhance the reliability of human scoring, not to replace it. Humans are required to calibrate the system and to deal with situations that are challenging. Computational methods can provide insight into which student answers will be found challenging and thus where human judgement is required.
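A minimal sketch of that pipeline (toy answers only; a real system would normalise text and calibrate clusters against human marks): similarity as the fraction of model-answer words used, followed by clustering of the similarity scores:

```python
import numpy as np
from sklearn.cluster import KMeans

model_answer = "a stack is a last in first out data structure"
answers = [
    "a stack is last in first out",
    "stack stores data last in first out order",
    "a queue is first in first out",
    "it is a data structure",
]

def common_word_similarity(answer, model):
    a, m = set(answer.split()), set(model.split())
    return len(a & m) / len(m)            # fraction of model-answer words used

sims = np.array([[common_word_similarity(ans, model_answer)] for ans in answers])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(sims)
for ans, s, lab in zip(answers, sims[:, 0], labels):
    print(f"cluster {lab}  sim {s:.2f}  {ans}")
```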
The cell cycle is the most fundamental biological process underlying the existence and propagation of life in time and space. It has long been an object of mathematical modeling, with several alternative mechanistic modeling principles suggested, describing in more or less detail the known molecular mechanisms. Recently, the cell cycle has been investigated at the single-cell level in snapshots of unsynchronized cell populations, exploiting the new methods for transcriptomic and proteomic molecular profiling. This raises a need for simplified semi-phenomenological cell cycle models, in order to formalize the processes underlying the cell cycle at a higher level of abstraction. Here we suggest a modeling framework recapitulating the most important properties of the cell cycle as a limit trajectory of a dynamical process characterized by several internal states with switches between them. In the simplest form, this leads to a limit cycle trajectory, composed of linear segments in logarithmic ...
What can the randomness of missing values tell you about clinical practice in large data sets of children's vital signs?
The synthesis of proteins is one of the most fundamental biological processes and consumes a significant amount of cellular resources. Despite many efforts to produce detailed mechanistic mathematical models of translation, no basic and simple kinetic model of the mRNA lifecycle (transcription, translation and degradation) exists. We build such a model by lumping multiple states of translated mRNA into a few dynamical variables and introducing a pool of translating ribosomes. The basic and simple model can be extended, if necessary, to take into account various phenomena such as the interaction between translating ribosomes or the regulation of translation by microRNA. The model can be used as a building block (translation module) for more complex models of cellular processes.
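A toy kinetic model in the spirit described, with free and ribosome-occupied mRNA as lumped variables and a conserved ribosome pool (the rate constants and the exact reaction set are illustrative assumptions, not the paper's fitted model):

```python
from scipy.integrate import solve_ivp

def rhs(t, y, k_tx=1.0, k_deg=0.1, k_init=0.5, k_elong=0.2, R_total=10.0):
    m_free, m_busy = y                      # free and ribosome-occupied mRNA
    r_free = max(R_total - m_busy, 0.0)     # conservation of the ribosome pool
    # transcription, degradation, initiation, and completion (ribosome recycling)
    dm_free = k_tx - k_deg * m_free - k_init * m_free * r_free + k_elong * m_busy
    dm_busy = k_init * m_free * r_free - k_elong * m_busy
    return [dm_free, dm_busy]

sol = solve_ivp(rhs, (0.0, 100.0), [0.0, 0.0])
print("steady state (free mRNA, translating mRNA):", sol.y[:, -1])
```

The protein synthesis rate in this sketch is proportional to the occupied-mRNA variable, which is what a microRNA-regulation extension would modulate.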
Living neuronal networks in dissociated neuronal cultures are widely known for their ability to generate highly robust spatiotemporal activity patterns in various experimental conditions. These include neuronal avalanches satisfying the power scaling law and thereby exemplifying self-organized criticality in living systems. A crucial question is how these patterns can be explained and modeled in a way that is biologically meaningful, mathematically tractable and yet broad enough to account for neuronal heterogeneity and complexity. Here we propose a simple model which may offer an answer to this question. Our derivations are based on just a few phenomenological observations concerning the input-output behavior of an isolated neuron. A distinctive feature of the model is that at the simplest level of description it comprises only two variables: a network activity variable and an exogenous variable corresponding to the energy needed to sustain the activity and modulate the efficacy of signal transmission. Strikingly, this simple model is already capable of explaining the emergence of network spikes and bursts in developing neuronal cultures. The model's behavior and predictions are supported by empirical observations and published experimental evidence on the behavior of cultured neurons exposed to oxygen and energy deprivation. At the larger, network scale, the introduction of the energy-dependent regulatory mechanism enables the network to balance on the edge of the network percolation transition. Network activity in this state shows population bursts satisfying the scaling avalanche conditions. This network state is self-sustaining and represents a balance between global network-wide processes and the spontaneous activity of individual elements.
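As an illustration of the two-variable idea (these are not the paper's exact equations; all parameters are arbitrary), activity that consumes a slowly recovering energy resource already produces recurring bursts:

```python
import numpy as np

def simulate(T=2000.0, dt=0.01):
    A, E = 0.01, 1.0                  # network activity and energy resource
    out = []
    rng = np.random.default_rng(0)
    for _ in range(int(T / dt)):
        drive = E * A * (1.0 - A)                 # transmission gated by energy
        dA = 5.0 * drive - A + 0.002 * rng.random()   # decay plus weak noise
        dE = 0.01 * (1.0 - E) - 0.5 * E * A       # slow recovery, fast consumption
        A = min(max(A + dt * dA, 0.0), 1.0)
        E = min(max(E + dt * dE, 0.0), 1.0)
        out.append((A, E))
    return np.array(out)

traj = simulate()
bursts = (traj[1:, 0] > 0.5) & (traj[:-1, 0] <= 0.5)
print("number of burst onsets:", bursts.sum())
```

The separation of time scales (fast activity, slow energy recovery) is what turns the pair into a relaxation oscillator: each burst depletes the resource and silences the network until the energy variable recovers.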
The concept of biological adaptation was closely connected to some mathematical and engineering ideas from the very beginning. Cannon’s homeostasis is, in its essence, automatic stabilisation of the body. Selye discovered the phases and limits of adaptation to harmful conditions. His model of the General Adaptation Syndrome (GAS) states that an event that threatens an organism’s well-being leads to a three-stage bodily response: Alarm – Resistance – Exhaustion. He concluded from his experiments that the regulatory mechanisms need some “adaptation resource” (adaptability) and demonstrated that adaptability decreases in the course of adaptation. This adaptability is a hypothetical extensive variable, adaptation energy. In GAS, the adaptation resource is spent on continuous neutralisation of a harmful factor which affects the organism. In addition, Selye demonstrated that adaptability is spent on training: a rat trained for resistance to one factor lost the ability to train for resistance to another factor. There is a fundamental difference between the resource (adaptation energy) spent in GAS and the resource spent on training. The second type is a change of the resistivity landscape: the domain of values of the harmful factors where the organism can survive changes. The volume of this domain can be extended, but not for free; its extension requires adaptation energy. The logarithm of this volume is an entropy, and we can call it the “adaptation entropy”. Thus, analysis of Selye’s experiments and physiological hypotheses leads us to the notion of adaptation entropy and, in combination with adaptation energy, to the “adaptation free energy”. We present a new family of dynamical models of physiological adaptation based on the notion of the adaptation free energy. This is a new class of “top-down” thermodynamic models for physiology.
A lecture given at the 2016 summer school on "Interscale interactions in fluid mechanics and beyo... more A lecture given at the 2016 summer school on "Interscale interactions in fluid mechanics and beyond" organised by the EPSRC Centre for Doctoral Training in Fluid Dynamics Across the Scales of Imperial College London.
Hilbert's 6th problem concerns the axiomatization of those parts of physics which are ready for a rigorous mathematical approach. D. Hilbert attracted special attention to the rigorous theory of limiting processes "which lead from the atomistic view to the laws of motion of continua". We formalise this question as a problem of slow invariant manifolds for kinetic equations. We review a few instances where such hydrodynamic manifolds were found by the direct solution of the invariance equation. The dynamic equations on these manifolds give us a clue about the proper asymptotics of the continuum mechanics equations for rarefied non-equilibrium gases.
We consider multiscale networks of transport processes and approximate their dynamics by systems of simple dominant networks. The dominant systems can be used for direct computation of steady states and relaxation dynamics, especially when kinetic information is incomplete. They can serve as a robust first approximation in perturbation theory or for preconditioning. Many of the parameters of the initial model are no longer present in the dominant system: these parameters are non-critical. Parameters of dominant systems indicate putative targets for changing the behavior of the large network and answer an important question: given a network model, which are its critical parameters? The dominant system is, by definition, the system that gives us the main asymptotic terms of the stationary state and relaxation in the limit of well-separated rate constants. The theory of dominant systems for networks with first-order kinetics and Markov chains is well developed [1, 2]. We found the explicit asymptotics of eigenvectors and eigenvalues. All algorithms are represented topologically by transformations of the network graph (labeled by the transport coefficients). In the simplest cases, the dominant system can be represented as a dominant path in the network. In the general case, a hierarchy of dominant paths in a hierarchy of lumped networks is needed. The accuracy of the estimates is proven. Performance of the algorithms is demonstrated on simple benchmarks and on multiscale biochemical networks. These methods are applied, in particular, to the analysis of microRNA-mediated mechanisms of translation repression. For nonlinear networks, we present a new heuristic algorithm for calculating the hierarchy of dominant paths. The results of the analysis of the dominant systems often support the observation by Kruskal: “And the answer quite generally has the form of a new system (well posed problem) for the solution to satisfy, although this is sometimes obscured because the new system is so easily solved that one is led directly to the solution without noticing the intermediate step.” References: [1] A.N. Gorban, O. Radulescu, A.Y. Zinovyev, Asymptotology of chemical reaction networks, Chemical Engineering Science 65 (2010), 2310–2324. [2] A.N. Gorban and O. Radulescu, Dynamic and Static Limitation in Multiscale Reaction
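A naive toy of the dominant-system idea for first-order kinetics (this "keep the fastest outgoing edge" pruning is only a caricature; the full theory of [1, 2] handles cycle gluing and hierarchies of lumped networks), comparing relaxation eigenvalues of the full and pruned networks:

```python
import numpy as np

def kinetic_matrix(n, rates):
    """K[i, j] = rate of j -> i off the diagonal; columns sum to zero."""
    K = np.zeros((n, n))
    for (j, i), k in rates.items():
        K[i, j] += k
        K[j, j] -= k
    return K

# well-separated rate constants on a 3-state network
rates = {(0, 1): 1e3, (1, 2): 1.0, (2, 0): 1e-3, (1, 0): 1e-6, (2, 1): 1e-2}

# caricature of a dominant system: keep only the fastest edge out of each state
fastest = {}
for (j, i), k in rates.items():
    if j not in fastest or k > fastest[j][1]:
        fastest[j] = ((j, i), k)
dom_rates = {edge: k for edge, k in fastest.values()}

full = np.sort(np.linalg.eigvals(kinetic_matrix(3, rates)).real)
dom = np.sort(np.linalg.eigvals(kinetic_matrix(3, dom_rates)).real)
print("relaxation eigenvalues, full network     :", full)
print("relaxation eigenvalues, dominant network :", dom)
```

With well-separated constants the pruned network reproduces the leading relaxation eigenvalues while carrying far fewer parameters, which is the sense in which the dropped parameters are non-critical.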
An invited talk at the international conference Modelling Biological Evolution 2015: Linking Mathematical Theories with Empirical Realities.
In 1938, H. Selye proposed the notion of adaptation energy and published “Experimental evidence supporting the conception of adaptation energy”. This idea was widely criticized, and its use nowadays is rather limited. Nevertheless, the response to many harmful factors often has a general, non-specific form, and we can guess that the mechanisms of physiological adaptation admit a very general and non-specific description. We assume that natural selection plays a key role in the evolution of mechanisms of physiological adaptation and apply optimality models to the description of these mechanisms. In the light of the optimality models, the mechanisms of adaptation are represented as the optimal distribution of resources for the neutralization of harmful factors. We study the dynamics of resource redistribution and revisit the theory of the general adaptation syndrome. Adaptation energy is considered as an internal coordinate on the ‘dominant path’ in the model of adaptation. The phenomenon of ‘oscillating death’ is predicted on the basis of the dynamical models of adaptation.
Conference of the International Federation of Classification Societies, University of Bologna, 7th July 2015.
The problem of identifying pairs of loci associated with heat tolerance in yeasts is considered. Interactions of Quantitative Trait Loci (QTL) in heat-selected yeast are analysed by comparing them to an unselected pool of random individuals. Data on individual F12 progeny selected for heat tolerance, which have been genotyped at 25 locations identified by sequencing a selected pool, are re-examined. 960 individuals were genotyped at these locations, and multi-locus genotype frequencies were compared to 172 sequenced individuals from the original unselected pool. We use Relative Information Gain (RIG) for the analysis of associations between loci. Correlation analysis over many pairs of loci requires multiple-testing methods. Two multiple-testing approaches are applied for the selection of associations: the False Discovery Rate (FDR) method in the version suggested by J.D. Storey and R. Tibshirani, and a specially developed Bootstrap Test of ordered RIG (BToRIG). BToRIG demonstrates slightly higher sensitivity than the FDR approach does for FDR=1. The statistical analysis of entropy and RIG in genotypes of the selected population reveals further interactions than previously seen. Importantly, this is done in comparison to the unselected population’s genotypes to account for inherent biases in the original population.
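A sketch of the core association test (toy binary genotypes; the FDR and BToRIG multiple-testing machinery is omitted): RIG between two loci, with a simple permutation null standing in for the study's bootstrap:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def rig(a, b):
    """Relative information gain of locus b about locus a."""
    h = entropy(a)
    if h == 0:
        return 0.0
    h_cond = sum((b == v).mean() * entropy(a[b == v]) for v in np.unique(b))
    return (h - h_cond) / h

rng = np.random.default_rng(5)
locus1 = rng.integers(0, 2, 960)
# second locus linked to the first with 80% fidelity
locus2 = np.where(rng.random(960) < 0.8, locus1, rng.integers(0, 2, 960))

observed = rig(locus1, locus2)
null = np.array([rig(locus1, rng.permutation(locus2)) for _ in range(1000)])
p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)
print("RIG =", round(observed, 3), " permutation p ~", p_value)
```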
A talk given at the Conference of the International Federation of Classification Societies, Unive... more A talk given at the Conference of the International Federation of Classification Societies, University of Bologna, 8th July 2015. The problem of evaluating an individual’s risk of drug consumption and misuse is highly important and novel. An online survey methodology was employed to collect data including personality traits (NEO-FFI-R), impulsivity (BIS-11), sensation seeking (ImpSS), and demographic information. The data set contained information on the consumption of 18 central nervous system psychoactive drugs. Correlation analysis using a relative information gain model demonstrates the existence of a group of drugs (amphetamines, cannabis, cocaine, ecstasy, legal highs, LSD, and magic mushrooms) with strongly correlated consumption. An exhaustive search was performed to select the most effective subset of input features and data mining methods to classify users and non-users for each drug. A number of classification methods were employed (decision tree, random forest, k-nearest...
This book discusses the psychological traits associated with drug consumption through the statistical analysis of a new database with information on 1885 respondents and their use of 18 drugs. After reviewing published works on the psychological profiles of drug users and describing the data mining and machine learning methods used, it demonstrates that the personality traits (five factor model, impulsivity, and sensation seeking), together with simple demographic data, make it possible to predict the risk of consumption of individual drugs with a sensitivity and specificity above 70% for most drugs. It also analyzes the correlations of use of different substances and describes the groups of drugs with correlated use, identifying significant differences in personality profiles for users of different drugs.
The book is intended for advanced undergraduates and first-year PhD students, as well as researchers and practitioners. Although no previous knowledge of machine learning, advanced data mining concepts or modern psychology of personality is assumed, familiarity with basic statistics and some experience in the use of probabilities would be helpful.