-
Restoring balance: principled under/oversampling of data for optimal classification
Authors:
Emanuele Loffredo,
Mauro Pastore,
Simona Cocco,
Rémi Monasson
Abstract:
Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.
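As a rough illustration of what a mixed under/oversampling strategy looks like in practice, the sketch below shrinks the majority class by drawing without replacement and grows the minority class by adding draws with replacement. It is plain random resampling under assumed target sizes, not the optimal prescription derived in the paper; the function names `resample` and `mixed_resampling` and the parameters `n_major` and `n_minor` are hypothetical.

```python
import random


def resample(items, target_size, rng):
    """Return a list of `target_size` examples drawn from `items`.

    If the target is not larger than the class, undersample without
    replacement; otherwise keep every original example and add extra
    draws with replacement (random oversampling).
    """
    if target_size <= len(items):
        return rng.sample(items, target_size)
    extra = rng.choices(items, k=target_size - len(items))
    return list(items) + extra


def mixed_resampling(majority, minority, n_major, n_minor, seed=0):
    """Resample both classes to the (assumed) target sizes."""
    rng = random.Random(seed)
    return resample(majority, n_major, rng), resample(minority, n_minor, rng)


if __name__ == "__main__":
    majority = list(range(100))       # 100 majority-class examples
    minority = list(range(100, 110))  # 10 minority-class examples
    maj, mino = mixed_resampling(majority, minority, n_major=40, n_minor=20)
    print(len(maj), len(mino))
```

How far each class should be shrunk or grown is exactly the question the abstract addresses; here the target sizes are simply passed in by hand.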
Submitted 15 May, 2024;
originally announced May 2024.
-
Unlearning regularization for Boltzmann Machines
Authors:
Enrico Ventura,
Simona Cocco,
Rémi Monasson,
Francesco Zamponi
Abstract:
Boltzmann Machines (BMs) are graphical models with interconnected binary units, employed for the unsupervised modeling of data distributions. When trained on real data, BMs show the tendency to behave like critical systems, displaying a high susceptibility of the model under a small rescaling of the inferred parameters. This behaviour is not convenient for the purpose of generating data, because it slows down the sampling process and induces the model to overfit the training data. In this study, we introduce a regularization method for BMs to improve the robustness of the model under rescaling of the parameters. The new technique shares formal similarities with the unlearning algorithm, an iterative procedure used to improve memory associativity in Hopfield-like neural networks. We test our unlearning regularization on synthetic data generated by two simple models, the Curie-Weiss ferromagnetic model and the Sherrington-Kirkpatrick spin glass model. We show that it outperforms $L_p$-norm schemes and discuss the role of parameter initialization. Finally, the method is applied to learn the activity of real neuronal cells, confirming its efficacy at shifting the inferred model away from criticality and emerging as a powerful candidate for practical scientific applications.
Submitted 15 May, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Evolutionary Dynamics of a Lattice Dimer: a Toy Model for Stability vs. Affinity Trade-offs in Proteins
Authors:
Emanuele Loffredo,
Elisabetta Vesconi,
Rostam Razban,
Orit Peleg,
Eugene Shakhnovich,
Simona Cocco,
Rémi Monasson
Abstract:
Understanding how a stressor applied on a biological system shapes its evolution is key to achieving targeted evolutionary control. Here we present a toy model of two interacting lattice proteins to quantify the response to the selective pressure defined by the binding energy. We generate sequence data of proteins and study how the sequence and structural properties of dimers are affected by the applied selective pressure, both during the evolutionary process and in the stationary regime. In particular we show that internal contacts of native structures lose strength, while inter-structure contacts are strengthened due to the folding-binding competition. We discuss how dimerization is achieved through enhanced mutability on the interacting faces, and how the designability of each native structure changes upon introduction of the stressor.
Submitted 5 December, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
Statistical-physics approaches to RNA molecules, families and networks
Authors:
Simona Cocco,
Andrea De Martino,
Andrea Pagnani,
Martin Weigt
Abstract:
This contribution focuses on the fascinating RNA molecule, its sequence-dependent folding driven by base-pairing interactions, the interplay between these interactions and natural evolution, and its multiple regulatory roles. The four of us have dug into these topics using the tools and the spirit of the statistical physics of disordered systems, and in particular the concept of a disordered (energy/fitness) landscape. After an introduction to RNA molecules and the perspectives they open not only in evolutionary and synthetic biology but also in medicine, we will introduce the important notions of energy and fitness landscapes for these molecules. In Section III we will review some models and algorithms for RNA sequence-to-secondary-structure mapping. Section IV discusses how the secondary-structure energy landscape can be derived from unzipping data. Section V deals with the inference of RNA structure from evolutionary sequence data sampled in different organisms. This will shift the focus from the `sequence-to-structure' mapping described in Section III to a `sequence-to-function' landscape that can be inferred from laboratory evolutionary data on DNA aptamers. Finally, in Section VI, we shall discuss the rich theoretical picture linking networks of interacting RNA molecules to the organization of robust, systemic regulatory programs. Along this path, we will therefore explore phenomena across multiple scales in space, number of molecules and time, showing how the biological complexity of the RNA world can be captured by the unifying concepts of statistical physics.
Submitted 27 July, 2022;
originally announced July 2022.
-
Disentangling representations in Restricted Boltzmann Machines without adversaries
Authors:
Jorge Fernandez-de-Cossio-Diaz,
Simona Cocco,
Remi Monasson
Abstract:
A goal of unsupervised machine learning is to build representations of complex high-dimensional data, with simple relations to their properties. Such disentangled representations make it easier to interpret the significant latent factors of variation in the data, as well as to generate new data with desirable features. Methods for disentangling representations often rely on an adversarial scheme, in which representations are tuned to prevent discriminators from being able to reconstruct information about the data properties (labels). Unfortunately, adversarial training is generally difficult to implement in practice. Here we propose a simple, effective way of disentangling representations without any need to train adversarial discriminators, and apply our approach to Restricted Boltzmann Machines (RBM), one of the simplest representation-based generative models. Our approach relies on the introduction of adequate constraints on the weights during training, which allows us to concentrate information about labels on a small subset of latent variables. The effectiveness of the approach is illustrated with four examples: the CelebA dataset of facial images, the two-dimensional Ising model, the MNIST dataset of handwritten digits, and the taxonomy of protein families. In addition, we show how our framework allows for analytically computing the cost, in terms of log-likelihood of the data, associated with the disentanglement of their representations.
Submitted 8 March, 2023; v1 submitted 23 June, 2022;
originally announced June 2022.
-
Mutational paths with sequence-based models of proteins: from sampling to mean-field characterisation
Authors:
Eugenio Mauri,
Simona Cocco,
Rémi Monasson
Abstract:
Identifying and characterizing mutational paths is an important issue in evolutionary biology and in bioengineering. We here introduce a generic description of mutational paths in terms of the goodness of sequences and of the mutational dynamics (how sequences change) along the path. We first propose an algorithm to sample mutational paths, which we benchmark on exactly solvable models of proteins in silico, and apply to data-driven models of natural proteins learned from sequence data with Restricted Boltzmann Machines. We then use mean-field theory to characterize the properties of mutational paths for different mutational dynamics of interest, and show how it can be used to extend Kimura's estimate of evolutionary distances to sequence-based epistatic models of selection.
Submitted 27 March, 2023; v1 submitted 22 April, 2022;
originally announced April 2022.
-
Barriers and Dynamical Paths in Alternating Gibbs Sampling of Restricted Boltzmann Machines
Authors:
Clément Roussel,
Simona Cocco,
Rémi Monasson
Abstract:
Restricted Boltzmann Machines (RBM) are bi-layer neural networks used for the unsupervised learning of model distributions from data. The bipartite architecture of RBM naturally defines an elegant sampling procedure, called Alternating Gibbs Sampling (AGS), where the configurations of the latent-variable layer are sampled conditional on the data-variable layer, and vice versa. We study here the performance of AGS on several analytically tractable models borrowed from statistical mechanics. We show that standard AGS is not more efficient than classical Metropolis-Hastings (MH) sampling of the effective energy landscape defined on the data layer. However, RBM can identify meaningful representations of training data in their latent space. Furthermore, using these representations and combining Gibbs sampling with the MH algorithm in the latent space can enhance the sampling performance of the RBM when the hidden units encode weakly dependent features of the data. We illustrate our findings on three datasets: Bars and Stripes and MNIST, well known in machine learning, and the so-called Lattice Proteins, introduced in theoretical biology to study the sequence-to-structure mapping in proteins.
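For readers unfamiliar with AGS, a minimal sketch follows. It assumes an RBM with binary 0/1 units on both layers, a weight matrix `W` stored as a list of rows (one per visible unit), visible fields `b` and hidden fields `c`; these names and the 0/1 convention are assumptions for illustration, not taken from the paper. Each step samples the whole hidden layer given the visible layer, then the whole visible layer given the hidden layer.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def alternating_gibbs(v, W, b, c, n_steps, rng):
    """Run `n_steps` of alternating Gibbs sampling on a binary (0/1) RBM.

    v: initial visible configuration, W[i][j]: weight between visible i
    and hidden j, b: visible fields, c: hidden fields.
    Returns the final (visible, hidden) configurations.
    """
    n_vis, n_hid = len(b), len(c)
    h = [0] * n_hid
    for _ in range(n_steps):
        # Hidden units are conditionally independent given the visible layer.
        h = [
            int(rng.random() < sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(n_vis))))
            for j in range(n_hid)
        ]
        # Visible units are conditionally independent given the hidden layer.
        v = [
            int(rng.random() < sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(n_hid))))
            for i in range(n_vis)
        ]
    return v, h


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]  # 3 visible, 2 hidden units
    print(alternating_gibbs([0, 1, 0], W, b=[0.0] * 3, c=[0.0] * 2, n_steps=10, rng=rng))
```

The layer-wise updates are possible because the bipartite architecture has no connections within a layer, so each conditional distribution factorizes over units.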
Submitted 21 October, 2021; v1 submitted 13 July, 2021;
originally announced July 2021.
-
Gaussian Closure Scheme in the Quasi-Linkage Equilibrium Regime of Evolving Genome Populations
Authors:
Eugenio Mauri,
Simona Cocco,
Rémi Monasson
Abstract:
Describing the evolution of a population of genomes evolving in a complex fitness landscape is generally very hard. We here introduce an approximate Gaussian closure scheme to characterize analytically the statistics of a genomic population in the so-called Quasi-Linkage Equilibrium (QLE) regime, applicable to generic values of the rates of mutation or recombination and fitness functions. The Gaussian approximation is illustrated on a short-range fitness landscape with two far away and competing maxima. It unveils the existence of a phase transition from a broad to a polarized distribution of genomes as the strength of epistatic couplings is increased, characterized by slow coarsening dynamics of competing allele domains. Results of the closure scheme are corroborated by numerical simulations.
Submitted 13 October, 2020;
originally announced October 2020.
-
Inferring epistasis from genomic data with comparable mutation and outcrossing rate
Authors:
Hong-Li Zeng,
Eugenio Mauri,
Vito Dichio,
Simona Cocco,
Remi Monasson,
Erik Aurell
Abstract:
We consider a population evolving due to mutation, selection and recombination, where selection includes single-locus terms (additive fitness) and two-loci terms (pairwise epistatic fitness). We further consider the problem of inferring fitness in the evolutionary dynamics from one or several snapshots of the distribution of genotypes in the population. In the recent literature this has been done by applying the Quasi-Linkage Equilibrium (QLE) regime first obtained by Kimura in the limit of high recombination. Here we show that the approach also works in the interesting regime where the effects of mutations are comparable to or larger than recombination. This leads to a modified main epistatic fitness inference formula where the rates of mutation and recombination occur together. We also derive this formula using a previously developed Gaussian closure that formally remains valid when recombination is absent. The findings are validated through numerical simulations.
Submitted 4 May, 2021; v1 submitted 30 June, 2020;
originally announced June 2020.
-
'Place-cell' emergence and learning of invariant data with restricted Boltzmann machines: breaking and dynamical restoration of continuous symmetries in the weight space
Authors:
Moshir Harsh,
Jérôme Tubiana,
Simona Cocco,
Remi Monasson
Abstract:
Distributions of data or sensory stimuli often enjoy underlying invariances. How and to what extent those symmetries are captured by unsupervised learning methods is a relevant question in machine learning and in computational neuroscience. We study here, through a combination of numerical and analytical tools, the learning dynamics of Restricted Boltzmann Machines (RBM), a neural network paradigm for representation learning. As learning proceeds from a random configuration of the network weights, we show the existence of, and characterize, a symmetry-breaking phenomenon in which the latent variables acquire receptive fields focusing on limited parts of the invariant manifold supporting the data. The symmetry is restored at large learning times through the diffusion of the receptive field over the invariant manifold; hence, the RBM effectively spans a continuous attractor in the space of network weights. This symmetry-breaking phenomenon takes place only if the amount of data available for training exceeds some critical value, depending on the network size and the intensity of symmetry-induced correlations in the data; below this 'retarded-learning' threshold, the network weights are essentially noisy and overfit the data.
Submitted 30 December, 2019;
originally announced December 2019.
-
Inference of compressed Potts graphical models
Authors:
Francesca Rizzato,
Alice Coucke,
Eleonora de Leonardis,
J. P. Barton,
Jérôme Tubiana,
Remi Monasson,
Simona Cocco
Abstract:
We consider the problem of inferring a graphical Potts model on a population of variables, with a non-uniform number of Potts colors (symbols) across variables. This inverse Potts problem generally involves the inference of a large number of parameters, often larger than the number of available data, and, hence, requires the introduction of regularization. We study here a double regularization scheme, in which the number of colors available to each variable is reduced, and interaction networks are made sparse. To achieve this color compression scheme, only Potts states with large empirical frequency (exceeding some threshold) are explicitly modeled on each site, while the others are grouped into a single state. We benchmark the performance of this mixed regularization approach, with two inference algorithms, the Adaptive Cluster Expansion (ACE) and the PseudoLikelihood Maximization (PLM), on synthetic data obtained by sampling disordered Potts models on Erdős-Rényi random graphs. We show in particular that color compression does not affect the quality of reconstruction of the parameters corresponding to high-frequency symbols, while drastically reducing the number of the other parameters and thus the computational time. Our procedure is also applied to multi-sequence alignments of protein families, with similar results.
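The color compression step described in the abstract can be sketched for a single site as follows. This is a minimal illustration under stated assumptions: the function name `compress_colors`, the strict "frequency greater than threshold" rule, and the convention of giving the grouped state the last index are all hypothetical choices, not taken from the paper.

```python
from collections import Counter


def compress_colors(column, threshold):
    """Compress the Potts colors observed at one site.

    Symbols whose empirical frequency exceeds `threshold` keep their own
    index; all remaining symbols are grouped into one shared state.
    Returns the recoded column and the sorted list of kept symbols.
    """
    counts = Counter(column)
    n = len(column)
    kept = sorted(s for s, k in counts.items() if k / n > threshold)
    index = {s: i for i, s in enumerate(kept)}
    other = len(kept)  # shared index for every low-frequency symbol
    return [index.get(s, other) for s in column], kept


if __name__ == "__main__":
    # One alignment column: A has frequency 0.5, C 0.25, D and E 0.125 each.
    recoded, kept = compress_colors(list("AAAACCDE"), threshold=0.2)
    print(recoded, kept)
```

In this toy column, `A` and `C` stay explicit while `D` and `E` collapse into one state, so the site goes from four colors to three; applying the same recoding site by site is what reduces the number of parameters to infer.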
Submitted 3 January, 2020; v1 submitted 30 July, 2019;
originally announced July 2019.
-
Adaptive Cluster Expansion for Ising spin models
Authors:
Simona Cocco,
Giancarlo Croce,
Francesco Zamponi
Abstract:
We propose an algorithm to obtain numerically approximate solutions of the direct Ising problem, that is, to compute the free energy and the equilibrium observables of spin systems with arbitrary two-spin interactions. To this purpose we use the Adaptive Cluster Expansion method, originally developed to solve the inverse Ising problem, that is, to infer the interactions from the equilibrium correlations. The method consists in iteratively constructing and selecting clusters of spins, computing their contributions to the free energy and discarding clusters whose contribution is lower than a fixed threshold. The properties of the cluster expansion and its performance are studied in detail on one-dimensional, two-dimensional, random and fully connected graphs with homogeneous or heterogeneous fields and couplings. We discuss the differences between different representations (Boolean and Ising) of the spin variables.
Submitted 10 October, 2019; v1 submitted 13 June, 2019;
originally announced June 2019.
-
Inverse Statistical Physics of Protein Sequences: A Key Issues Review
Authors:
Simona Cocco,
Christoph Feinauer,
Matteo Figliuzzi,
Remi Monasson,
Martin Weigt
Abstract:
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and how statistical-mechanics inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
Submitted 3 March, 2017;
originally announced March 2017.
-
Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models
Authors:
Hugo Jacquin,
Amy Gilson,
Eugene Shakhnovich,
Simona Cocco,
Rémi Monasson
Abstract:
Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However, the underlying assumptions of the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use a lattice protein (LP) model to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonian for design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. Those are remarkable results as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.
Submitted 15 November, 2016;
originally announced November 2016.
-
On the entropy of protein families
Authors:
John Barton,
Arup Chakraborty,
Simona Cocco,
Hugo Jacquin,
Rémi Monasson
Abstract:
Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copying, transport, etc. The protein sequences realizing a given function may largely vary across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the fixation of one particular amino acid on a specific site, and relate this notion to the escape probability of the HIV virus. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed.
Submitted 26 December, 2015;
originally announced December 2015.
-
Learning probabilities from random observables in high dimensions: the maximum entropy distribution and others
Authors:
Tomoyuki Obuchi,
Simona Cocco,
Rémi Monasson
Abstract:
We consider the problem of learning a target probability distribution over a set of $N$ binary variables from the knowledge of the expectation values (with this target distribution) of $M$ observables, drawn uniformly at random. The space of all probability distributions compatible with these $M$ expectation values within some fixed accuracy, called version space, is studied. We introduce a biased measure over the version space, which gives a boost increasing exponentially with the entropy of the distributions and with an arbitrary inverse `temperature' $Γ$. The choice of $Γ$ allows us to interpolate smoothly between the unbiased measure over all distributions in the version space ($Γ=0$) and the pointwise measure concentrated at the maximum entropy distribution ($Γ\to \infty$). Using the replica method we compute the volume of the version space and other quantities of interest, such as the distance $R$ between the target distribution and the center-of-mass distribution over the version space, as functions of $α=(\log M)/N$ and $Γ$ for large $N$. Phase transitions at critical values of $α$ are found, corresponding to qualitative improvements in the learning of the target distribution and to the decrease of the distance $R$. However, for fixed $α$, the distance $R$ does not vary with $Γ$, which means that the maximum entropy distribution is not closer to the target distribution than any other distribution compatible with the observable values. Our results are confirmed by Monte Carlo sampling of the version space for small system sizes ($N\le 10$).
Submitted 21 July, 2015; v1 submitted 10 March, 2015;
originally announced March 2015.
-
Stochastic Ratchet Mechanisms for Replacement of Proteins Bound to DNA
Authors:
Simona Cocco,
John F. Marko,
Remi Monasson
Abstract:
Experiments indicate that unbinding rates of proteins from DNA can depend on the concentration of proteins in nearby solution. Here we present a theory of multi-step replacement of DNA-bound proteins by solution-phase proteins. For four different kinetic scenarios we calculate the dependence of protein unbinding and replacement rates on solution protein concentration. We find (1) strong effects of progressive 'rezipping' of the solution-phase protein onto DNA sites liberated by 'unzipping' of the originally bound protein; (2) that a model in which solution-phase proteins bind non-specifically to DNA can describe experiments on exchanges between the non-specific DNA-binding proteins Fis-Fis and Fis-HU; (3) that a binding specific model describes experiments on the exchange of CueR proteins on specific binding sites.
Submitted 26 May, 2014;
originally announced May 2014.
-
Large Pseudo-Counts and $L_2$-Norm Penalties Are Necessary for the Mean-Field Inference of Ising and Potts Models
Authors:
J. P. Barton,
S. Cocco,
E. De Leonardis,
R. Monasson
Abstract:
Mean field (MF) approximation offers a simple, fast way to infer direct interactions between elements in a network of correlated variables, a common, computationally challenging problem with practical applications in fields ranging from physics and biology to the social sciences. However, MF methods achieve their best performance with strong regularization, well beyond Bayesian expectations, an empirical fact that is poorly understood. In this work, we study the influence of pseudo-count and $L_2$-norm regularization schemes on the quality of inferred Ising or Potts interaction networks from correlation data within the MF approximation. We argue, based on the analysis of small systems, that the optimal value of the regularization strength remains finite even if the sampling noise tends to zero, in order to correct for systematic biases introduced by the MF approximation. Our claim is corroborated by extensive numerical studies of diverse model systems and by the analytical study of the $m$-component spin model, for large but finite $m$. Additionally we find that pseudo-count regularization is robust against sampling noise, and often outperforms $L_2$-norm regularization, particularly when the underlying network of interactions is strongly heterogeneous. Much better performances are generally obtained for the Ising model than for the Potts model, for which only couplings incoming onto medium-frequency symbols are reliably inferred.
Submitted 1 May, 2014;
originally announced May 2014.
-
From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction
Authors:
Simona Cocco,
Remi Monasson,
Martin Weigt
Abstract:
Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.
Submitted 27 August, 2013; v1 submitted 13 December, 2012;
originally announced December 2012.
-
Charge dynamics of a single donor coupled to a few electrons quantum dot in silicon
Authors:
G. Mazzeo,
E. Prati,
M. Belli,
G. Leti,
S. Cocco,
M. Fanciulli,
F. Guagliardo,
G. Ferrari
Abstract:
We study the charge transfer dynamics between a silicon quantum dot and an individual phosphorus donor using the conduction through the quantum dot as a probe for the donor ionization state. We use a silicon n-MOSFET (metal oxide semiconductor field effect transistor) biased near threshold in the single electron transistor (SET) regime with two side gates to control both the device conductance and the donor charge. A tunneling time independent of temperature and magnetic field is measured. We measure the statistics of the transfer of electrons observed when the ground state D0 of the donor is aligned with the SET states.
Submitted 23 March, 2012;
originally announced March 2012.
-
Few Electron Limit of n-type Metal Oxide Semiconductor Single Electron Transistors
Authors:
Enrico Prati,
Marco De Michielis,
Matteo Belli,
Simone Cocco,
Marco Fanciulli,
Dharmraj Kotekar-Patil,
Matthias Ruoff,
Dieter P. Kern,
David A. Wharam,
Arjan Verduijn,
Giuseppe Tettamanzi,
Sven Rogge,
Benoit Roche,
Romain Wacquez,
Xavier Jehl,
Maud Vinet,
Marc Sanquer
Abstract:
We report electronic transport on n-type silicon Single Electron Transistors (SETs) fabricated in Complementary Metal Oxide Semiconductor (CMOS) technology. The n-MOSSETs are built within a pre-industrial Fully Depleted Silicon On Insulator (FDSOI) technology with a silicon thickness down to 10 nm on 200 mm wafers. The nominal channel size of 20 $\times$ 20 nm$^{2}$ is obtained by employing electron beam lithography for active and gate levels patterning. The Coulomb blockade stability diagram is precisely resolved at 4.2 K and it exhibits large addition energies of tens of meV. The confinement of the electrons in the quantum dot has been modeled by using a Current Spin Density Functional Theory (CS-DFT) method. CMOS technology enables massive production of SETs for ultimate nanoelectronics and quantum variables based devices.
Submitted 20 March, 2012;
originally announced March 2012.
-
Adaptive cluster expansion for the inverse Ising problem: convergence, algorithm and tests
Authors:
Simona Cocco,
Rémi Monasson
Abstract:
We present a procedure to solve the inverse Ising problem, that is to find the interactions between a set of binary variables from the measure of their equilibrium correlations. The method consists in constructing and selecting specific clusters of variables, based on their contributions to the cross-entropy of the Ising model. Small contributions are discarded to avoid overfitting and to make the computation tractable. The properties of the cluster expansion and its performances on synthetic data are studied. To make the implementation easier we give the pseudo-code of the algorithm.
Submitted 25 October, 2011;
originally announced October 2011.
-
High-Dimensional Inference with the generalized Hopfield Model: Principal Component Analysis and Corrections
Authors:
Simona Cocco,
Remi Monasson,
Vitor Sessak
Abstract:
We consider the problem of inferring the interactions between a set of $N$ binary variables from the knowledge of their frequencies and pairwise correlations. The inference framework is based on the Hopfield model, a special case of the Ising model where the interaction matrix is defined through a set of patterns in the variable space, and is of rank much smaller than $N$. We show that Maximum Likelihood inference is deeply related to Principal Component Analysis when the amplitude of the pattern components, $\xi$, is negligible compared to $N^{1/2}$. Using techniques from statistical mechanics, we calculate the corrections to the patterns to the first order in $\xi/N^{1/2}$. We stress that it is important to generalize the Hopfield model and include both attractive and repulsive patterns, to correctly infer networks with sparse and strong interactions. We present a simple geometrical criterion to decide how many attractive and repulsive patterns should be considered as a function of the sampling noise. We moreover discuss how many sampled configurations are required for a good inference, as a function of the system size, $N$, and of the amplitude, $\xi$. The inference approach is illustrated on synthetic and biological data.
Submitted 19 April, 2011;
originally announced April 2011.
-
Adaptive Cluster Expansion for Inferring Boltzmann Machines with Noisy Data
Authors:
Simona Cocco,
Rémi Monasson
Abstract:
We introduce a procedure to infer the interactions among a set of binary variables, based on their sampled frequencies and pairwise correlations. The algorithm builds the clusters of variables contributing most to the entropy of the inferred Ising model, and rejects the small contributions due to the sampling noise. Our procedure successfully recovers benchmark Ising models even at criticality and in the low temperature phase, and is applied to neurobiological data.
Submitted 16 February, 2011;
originally announced February 2011.
-
On the trajectories and performance of Infotaxis, an information-based greedy search algorithm
Authors:
Carlo Barbieri,
Simona Cocco,
Rémi Monasson
Abstract:
We present a continuous-space version of Infotaxis, a search algorithm where a searcher greedily moves to maximize the gain in information about the position of the target to be found. Using a combination of analytical and numerical tools we study the nature of the trajectories in two and three dimensions. The probability that the search is successful and the running time of the search are estimated. A possible extension to non-greedy search is suggested.
Submitted 16 March, 2011; v1 submitted 13 October, 2010;
originally announced October 2010.
-
Adiabatic Charge Control in a Single Donor Atom Transistor
Authors:
Enrico Prati,
Matteo Belli,
Simone Cocco,
Guido Petretto,
Marco Fanciulli
Abstract:
We charge an individual donor with electrons stored in a quantum dot in its proximity. A silicon quantum device containing a single arsenic donor and an electrostatic quantum dot in parallel is realized in a nanometric field effect transistor. The different coupling capacitances of the donor and the quantum dot with the control and the back gates are exploited to generate a relative rigid shift of their energy spectrum as a function of the back gate voltage, causing the crossing of the energy levels. We observe the sequential tunneling through the $D^{2-}$ and the $D^{3-}$ energy levels of the donor hybridized at the oxide interface at 4.2 K. Their respective states form a honeycomb pattern with the quantum dot states. It is therefore possible to control the exchange coupling of an electron of the quantum dot with the electrons bound to the donor, thus realizing a physical qubit for quantum information processing applications.
Submitted 10 August, 2010; v1 submitted 18 May, 2010;
originally announced June 2010.
-
Dynamical modelling of molecular constructions and setups for DNA unzipping
Authors:
Carlo Barbieri,
Simona Cocco,
Remi Monasson,
Francesco Zamponi
Abstract:
We present a dynamical model of DNA mechanical unzipping under the action of a force. The model includes the motion of the fork in the sequence-dependent landscape, the trap(s) acting on the bead(s), and the polymeric components of the molecular construction (unzipped single strands of DNA, and linkers). Different setups are considered to test the model, and the outcome of the simulations is compared to simpler dynamical models existing in the literature where polymers are assumed to be at equilibrium.
Submitted 5 December, 2008;
originally announced December 2008.
-
Inferring DNA sequences from mechanical unzipping data: the large-bandwidth case
Authors:
Valentina Baldazzi,
Serena Bradde,
Simona Cocco,
Enzo Marinari,
Remi Monasson
Abstract:
The complementary strands of DNA molecules can be separated when stretched apart by a force; the unzipping signal is correlated to the base content of the sequence but is affected by thermal and instrumental noise. We consider here the ideal case where opening events are known to a very good time resolution (very large bandwidth), and study how the sequence can be reconstructed from the unzipping data. Our approach relies on the use of statistical Bayesian inference and of the Viterbi decoding algorithm. Performances are studied numerically on Monte Carlo generated data, and analytically. We show how multiple unzippings of the same molecule may be exploited to improve the quality of the prediction, and calculate analytically the number of required unzippings as a function of the bandwidth, the sequence content, and the elasticity parameters of the unzipped strands.
Submitted 19 April, 2007;
originally announced April 2007.
-
Reconstructing a Random Potential from its Random Walks
Authors:
Simona Cocco,
Remi Monasson
Abstract:
The problem of how many trajectories of a random walker in a potential are needed to reconstruct the values of this potential is studied. We show that this problem can be solved by calculating the probability of survival of an abstract random walker in a partially absorbing potential. The approach is illustrated on the discrete Sinai (random force) model with a drift. We determine the parameter (temperature, duration of each trajectory, ...) values making reconstruction as fast as possible.
Submitted 19 April, 2007;
originally announced April 2007.
-
Protein-Mediated DNA Loops: Effects of Protein Bridge Size and Kinks
Authors:
Nicolas Douarche,
Simona Cocco
Abstract:
This paper focuses on the probability that a portion of DNA closes on itself through thermal fluctuations. We investigate the dependence of this probability upon the size r of a protein bridge and/or the presence of a kink at half DNA length. The DNA is modeled by the Worm-Like Chain model, and the probability of loop formation is calculated in two ways: exact numerical evaluation of the constrained path integral and the extension of the Shimada and Yamakawa saddle point approximation. For example, we find that the looping free energy of a 100-base-pair DNA decreases from 24 kT to 13 kT when the loop is closed by a protein of length r = 10 nm. It further decreases to 5 kT when the loop has a kink of 120 degrees at half-length.
Submitted 16 November, 2005; v1 submitted 11 August, 2005;
originally announced August 2005.
-
Inferring DNA sequences from mechanical unzipping: an ideal-case study
Authors:
V. Baldazzi,
S. Cocco,
E. Marinari,
R. Monasson
Abstract:
We introduce and test a method to predict the sequence of DNA molecules from in silico unzipping experiments. The method is based on Bayesian inference and on the Viterbi decoding algorithm. The probability of misprediction decreases exponentially with the number of unzippings, with a decay rate depending on the applied force and the sequence content.
Submitted 5 July, 2005; v1 submitted 9 June, 2005;
originally announced June 2005.
-
Role of calcium and noise in the persistent activity of an isolated neuron
Authors:
Simona Cocco
Abstract:
The activity of an isolated and auto-connected neuron is studied using Hodgkin-Huxley and Integrate-and-Fire frameworks. The main ingredients of the model are the auto-stimulating autaptic current observed in experiments, a spontaneous synaptic liberation noise, and a calcium-dependent negative feedback mechanism. The distributions of inter-spike intervals and burst durations are analytically calculated, and show a good agreement with experimental data.
Submitted 3 June, 2004;
originally announced June 2004.
-
Heuristic average-case analysis of the backtrack resolution of random 3-Satisfiability instances
Authors:
Simona Cocco,
Remi Monasson
Abstract:
An analysis of the average-case complexity of solving random 3-Satisfiability (SAT) instances with backtrack algorithms is presented. We first interpret previous rigorous works in a unifying framework based on the statistical physics notions of dynamical trajectories, phase diagram and growth process. It is argued that, under the action of the Davis-Putnam-Loveland-Logemann (DPLL) algorithm, 3-SAT instances are turned into 2+$p$-SAT instances whose characteristic parameters (ratio $\alpha$ of clauses per variable, fraction $p$ of 3-clauses) can be followed during the operation, and define resolution trajectories. Depending on the location of trajectories in the phase diagram of the 2+$p$-SAT model, easy (polynomial) or hard (exponential) resolutions are generated. Three regimes are identified, depending on the ratio $\alpha$ of the 3-SAT instance to be solved. Lower sat phase: for small ratios, DPLL almost surely finds a solution in a time growing linearly with the number $N$ of variables. Upper sat phase: for intermediate ratios, instances are almost surely satisfiable but finding a solution requires exponential time ($2^{N\omega}$ with $\omega > 0$) with high probability. Unsat phase: for large ratios, there is almost always no solution and proofs of refutation are exponential. An analysis of the growth of the search tree in both upper sat and unsat regimes is presented, and allows us to estimate $\omega$ as a function of $\alpha$. This analysis is based on an exact relationship between the average size of the search tree and the powers of the evolution operator encoding the elementary steps of the search heuristic.
Submitted 14 January, 2004;
originally announced January 2004.
-
Approximate analysis of search algorithms with "physical" methods
Authors:
Simona Cocco,
Remi Monasson,
Andrea Montanari,
Guilhem Semerjian
Abstract:
An overview of some methods of statistical physics applied to the analysis of algorithms for optimization problems (satisfiability of Boolean constraints, vertex cover of graphs, decoding, ...) with distributions of random inputs is presented. Two types of algorithms are analyzed: complete procedures with backtracking (Davis-Putnam-Loveland-Logemann algorithm) and incomplete, local search procedures (gradient descent, random walksat, ...). The study of complete algorithms makes use of physical concepts such as phase transitions, dynamical renormalization flow, growth processes, ... As for local search procedures, the connection between computational complexity and the structure of the cost function landscape is questioned, with emphasis on the notion of metastability.
Submitted 3 February, 2003;
originally announced February 2003.
-
Slow nucleic acid unzipping kinetics from sequence-defined barriers
Authors:
S. Cocco,
J. F. Marko,
R. Monasson
Abstract:
Recent experiments on unzipping of RNA helix-loop structures by force have shown that about 40-base molecules can undergo kinetic transitions between two well-defined 'open' and 'closed' states, on a timescale of about 1 sec [Liphardt et al., Science 292, 733-737 (2001)]. Using a simple dynamical model, we show that these phenomena result from the slow kinetics of crossing large free energy barriers which separate the open and closed conformations. The dependence of barriers on sequence along the helix, and on the size of the loop(s) is analyzed. Some DNA and RNA sequences that could show dynamics on different time scales, or three (or more)-state unzipping, are proposed.
Submitted 25 July, 2002;
originally announced July 2002.
-
Unzipping Dynamics of Long DNAs
Authors:
S. Cocco,
R. Monasson,
J. F. Marko
Abstract:
The two strands of the DNA double helix can be 'unzipped' by application of 15 pN force. We analyze the dynamics of unzipping and rezipping, for the case where the molecule ends are separated and re-approached at constant velocity. For unzipping of 50 kilobase DNAs at less than about 1000 bases per second, thermal equilibrium-based theory applies. However, for higher unzipping velocities, rotational viscous drag creates a buildup of elastic torque to levels above $k_B T$ in the dsDNA region, causing the unzipping force to be well above or well below the equilibrium unzipping force during unzipping and rezipping, respectively, in accord with recent experimental results of Thomen et al. [Phys. Rev. Lett. 88, 248102 (2002)]. Our analysis includes the effect of sequence on unzipping and rezipping, and the transient delay in buildup of the unzipping force due to the approach to the steady state.
Submitted 20 July, 2002;
originally announced July 2002.
-
Restart method and exponential acceleration of random 3-SAT instances resolutions: a large deviation analysis of the Davis-Putnam-Loveland-Logemann algorithm
Authors:
S. Cocco,
R. Monasson
Abstract:
The analysis of the solving complexity of random 3-SAT instances using the Davis-Putnam-Loveland-Logemann (DPLL) algorithm slightly below threshold is presented. While finding a solution for such instances demands exponential effort with high probability, we show that an exponentially small fraction of resolutions require a computation scaling linearly in the size of the instance only. We compute analytically this exponentially small probability of easy resolutions from a large deviation analysis of DPLL with the Generalized Unit Clause search heuristic, and show that the corresponding exponent is smaller (in absolute value) than the growth exponent of the typical resolution time. Our study therefore gives some quantitative basis to heuristic restart solving procedures, and suggests a natural cut-off cost (the size of the instance) for the restart.
Submitted 13 June, 2002;
originally announced June 2002.
-
Rigorous decimation-based construction of ground pure states for spin glass models on random lattices
Authors:
S. Cocco,
O. Dubois,
J. Mandler,
R. Monasson
Abstract:
A constructive scheme for determining pure states (clusters) at very low temperature in the 3-spin glass model on a random lattice is provided, in full agreement with Parisi's one-step replica symmetry breaking (RSB) scheme. The proof is based on the analysis of an exact decimation procedure. When the number $c$ of couplings per spin is smaller than some critical value $c_d$, all spins are eliminated at the end of decimation (RS phase). In the range $c_d < c < c_s$, a reduced Hamiltonian is left; each ground state (GS) of the latter is a "seed" from which a cluster of GS of the original Hamiltonian can be reconstructed. Above $c_s$, GS are frustrated with an energy per spin larger than $-c$. The number of GS in each cluster, the number of clusters, and the distances between GS are calculated and correspond to RSB predictions.
Submitted 18 July, 2002; v1 submitted 13 June, 2002;
originally announced June 2002.
-
Theoretical models for single-molecule DNA and RNA experiments: from elasticity to unzipping
Authors:
S. Cocco,
J. F. Marko,
R. Monasson
Abstract:
We review statistical-mechanical theories of single-molecule micromanipulation experiments on nucleic acids. First, models for describing polymer elasticity are introduced. We then review how these models are used to interpret single-molecule force-extension experiments on single-stranded and double-stranded DNA. Depending on the force and the molecules used, both smooth elastic behaviors and abrupt structural transitions are observed. Third, we show how combining the elasticity of two single nucleic acid strands with a description of the base-pairing interactions between them explains much of the phenomenology and kinetics of RNA and DNA 'unzipping' experiments.
Submitted 13 June, 2002;
originally announced June 2002.
-
Exponentially hard problems are sometimes polynomial, a large deviation analysis of search algorithms for the random Satisfiability problem, and its application to stop-and-restart resolutions
Authors:
S. Cocco,
R. Monasson
Abstract:
A large deviation analysis of the solving complexity of random 3-Satisfiability instances slightly below threshold is presented. While finding a solution for such instances demands an exponential effort with high probability, we show that an exponentially small fraction of resolutions require a computation scaling linearly in the size of the instance only. This exponentially small probability of easy resolutions is analytically calculated, and the corresponding exponent shown to be smaller (in absolute value) than the growth exponent of the typical resolution time. Our study therefore gives some theoretical basis to heuristic stop-and-restart solving procedures, and suggests a natural cut-off (the size of the instance) for the restart.
Submitted 1 March, 2002;
originally announced March 2002.
-
Force and kinetic barriers in unzipping of DNA
Authors:
S. Cocco,
R. Monasson,
J. Marko
Abstract:
A theory of the unzipping of double-stranded (ds) DNA is presented, and is compared to recent micromanipulation experiments. It is shown that the interactions which stabilize the double helix and the elastic rigidity of single strands (ss) simply determine the sequence dependent =12 pN force threshold for DNA strand separation. Using a semi-microscopic model of the binding between nucleotide str…
▽ More
A theory of the unzipping of double-stranded (ds) DNA is presented, and is compared to recent micromanipulation experiments. It is shown that the interactions which stabilize the double helix and the elastic rigidity of single strands (ss) simply determine the sequence dependent =12 pN force threshold for DNA strand separation. Using a semi-microscopic model of the binding between nucleotide strands, we show that the greater rigidity of the strands when formed into dsDNA, relative to that of isolated strands, gives rise to a potential barrier to unzipping. The effects of this barrier are derived analytically. The force to keep the extremities of the molecule at a fixed distance, the kinetic rates for strand unpairing at fixed applied force, and the rupture force as a function of loading rate are calculated. The dependence of the kinetics and of the rupture force on molecule length is also analyzed.
Submitted 26 February, 2002;
originally announced February 2002.
-
Analysis of the computational complexity of solving random satisfiability problems using branch and bound search algorithms
Authors:
Simona Cocco,
Remi Monasson
Abstract:
The computational complexity of solving random 3-Satisfiability (3-SAT) problems is investigated. 3-SAT is a representative example of hard computational tasks; it consists in deciding whether a set of alpha N randomly drawn logical constraints involving N Boolean variables can all be satisfied simultaneously. Widely used solving procedures, such as the Davis-Putnam-Logemann-Loveland (DPLL) algorithm, perform a systematic search for a solution through a sequence of trials and errors represented by a search tree. In the present study, we identify, using theory and numerical experiments, easy (size of the search tree scaling polynomially with N) and hard (exponential scaling) regimes as a function of the ratio alpha of constraints per variable. The typical complexity is explicitly calculated in the different regimes, in very good agreement with numerical simulations. Our theoretical approach is based on the analysis of the growth of the branches in the search tree under the operation of DPLL. On each branch, the initial 3-SAT problem is dynamically turned into a more generic 2+p-SAT problem, where p and 1-p are the fractions of constraints involving three and two variables respectively. The growth of each branch is monitored by the dynamical evolution of alpha and p and is represented by a trajectory in the static phase diagram of the random 2+p-SAT problem. Depending on whether or not the trajectories cross the boundary between phases, single branches or full trees are generated by DPLL, resulting in easy or hard resolutions.
Submitted 11 December, 2000;
originally announced December 2000.
-
Trajectories in phase diagrams, growth processes and computational complexity: how search algorithms solve the 3-Satisfiability problem
Authors:
Simona Cocco,
Remi Monasson
Abstract:
Most decision and optimization problems encountered in practice fall into one of two categories with respect to any particular solving method or algorithm: either the problem is solved quickly (easy) or else demands an impractically long computational effort (hard). Recent investigations on model classes of problems have shown that some global parameters, such as the ratio between the constraints to be satisfied and the adjustable variables, are good predictors of problem hardness and, moreover, have an effect analogous to thermodynamical parameters, e.g. temperature, in predicting phases in condensed matter physics [Monasson et al., Nature 400 (1999) 133-137]. Here we show that changes in the values of such parameters can be tracked during a run of the algorithm, defining a trajectory through the parameter space. Focusing on 3-Satisfiability, a recognized representative of hard problems, we analyze trajectories generated by search algorithms using the statistical physics of growth processes. These trajectories can cross well-defined phases, corresponding to domains of easy or hard instances, and allow one to successfully predict the times of resolution.
Submitted 26 September, 2000;
originally announced September 2000.
-
Theoretical study of collective modes in DNA at ambient temperature
Authors:
Simona Cocco,
Remi Monasson
Abstract:
The instantaneous normal modes corresponding to base pair vibrations (radial modes) and twist angle fluctuations (angular modes) of a DNA molecule model at ambient temperature are theoretically investigated. Due to thermal disorder, normal modes are not plane waves with a single wave number q but have a finite and frequency-dependent damping width. The density of modes rho(nu), the average dispersion relation nu(q), and the coherence length xi(nu) are analytically calculated. The Gibbs averaged resolvent is computed using a replicated transfer matrix formalism and variational wave functions for the ground and first excited state. Our results for the density of modes are compared to Raman spectroscopy measurements of the collective modes for DNA in solution and show good agreement with experimental data in the low frequency regime nu < 150 cm^{-1}. Radial modes extend over frequencies ranging from 50 cm^{-1} to 110 cm^{-1}. Angular modes, related to helical axis vibrations, are limited to nu < 25 cm^{-1}. Normal modes are highly disordered and coherent over a few base pairs only (xi < 2 nm), in good agreement with neutron scattering experiments.
Submitted 2 November, 1999;
originally announced November 1999.
-
Statistical Mechanics of Torque Induced Denaturation of DNA
Authors:
Simona Cocco,
Remi Monasson
Abstract:
A unifying theory of the denaturation transition of DNA, driven by temperature T or induced by an external mechanical torque Gamma, is presented. Our model couples the hydrogen-bond opening and the untwisting of the helicoidal molecular structure. We show that denaturation corresponds to a first-order phase transition from B-DNA to d-DNA phases and that the coexistence region is naturally parametrized by the degree of supercoiling sigma. The denaturation free energy, the temperature dependence of the twist angle, the phase diagram in the (T, Gamma) plane and isotherms in the (sigma, Gamma) plane are calculated and show good agreement with experimental data.
Submitted 24 September, 1999; v1 submitted 20 April, 1999;
originally announced April 1999.
-
Analytical and Numerical Study of Internal Representations in Multilayer Neural Networks with Binary Weights
Authors:
Simona Cocco,
Remi Monasson,
Riccardo Zecchina
Abstract:
We study the weight space structure of the parity machine with binary weights by deriving the distribution of volumes associated to the internal representations of the learning examples. The learning behaviour and the symmetry breaking transition are analyzed and the results are found to be in very good agreement with extended numerical simulations.
Submitted 16 April, 1996;
originally announced April 1996.