Restoring balance: principled under/oversampling of data for optimal classification

Authors: Emanuele Loffredo, Mauro Pastore, Simona Cocco, Rémi Monasson

Abstract: Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.

Submitted 15 May, 2024; originally announced May 2024.

Comments: 9 pages + appendix, 3 figures

arXiv:2311.09418 [pdf, other]

doi 10.1088/2632-2153/ad5a5f

Unlearning regularization for Boltzmann Machines

Authors: Enrico Ventura, Simona Cocco, Rémi Monasson, Francesco Zamponi

Abstract: Boltzmann Machines (BMs) are graphical models with interconnected binary units, employed for the unsupervised modeling of data distributions. When trained on real data, BMs show the tendency to behave like critical systems, displaying a high susceptibility of the model under a small rescaling of the inferred parameters. This behaviour is not convenient for the purpose of generating data, because it slows down the sampling process and induces the model to overfit the training data. In this study, we introduce a regularization method for BMs to improve the robustness of the model under rescaling of the parameters. The new technique shares formal similarities with the unlearning algorithm, an iterative procedure used to improve memory associativity in Hopfield-like neural networks. We test our unlearning regularization on synthetic data generated by two simple models, the Curie-Weiss ferromagnetic model and the Sherrington-Kirkpatrick spin glass model. We show that it outperforms $L_p$-norm schemes and discuss the role of parameter initialization. Finally, the method is applied to learn the activity of real neuronal cells, confirming its efficacy at shifting the inferred model away from criticality and emerging as a powerful candidate for practical scientific applications.

Submitted 15 May, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: 18 pages, 13 figures

Journal ref: Machine Learning: Science and Technology 5, 025078 (2024)

arXiv:2303.12431 [pdf, other]

doi 10.1088/1751-8121/acfddc

Evolutionary Dynamics of a Lattice Dimer: a Toy Model for Stability vs. Affinity Trade-offs in Proteins

Authors: Emanuele Loffredo, Elisabetta Vesconi, Rostam Razban, Orit Peleg, Eugene Shakhnovich, Simona Cocco, Rémi Monasson

Abstract: Understanding how a stressor applied on a biological system shapes its evolution is key to achieving targeted evolutionary control. Here we present a toy model of two interacting lattice proteins to quantify the response to the selective pressure defined by the binding energy. We generate sequence data of proteins and study how the sequence and structural properties of dimers are affected by the applied selective pressure, both during the evolutionary process and in the stationary regime. In particular we show that internal contacts of native structures lose strength, while inter-structure contacts are strengthened due to the folding-binding competition. We discuss how dimerization is achieved through enhanced mutability on the interacting faces, and how the designability of each native structure changes upon introduction of the stressor.

Submitted 5 December, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

Comments: New mutational protocol based on the background amino acids distribution. Updated Fig. 7-8. Added new Fig. 14 that shows residual frustration on couplings before and after dimerization

Journal ref: Journal of Physics A: Mathematical and Theoretical, Volume 56, Number 45, Published 13 October 2023 Random Landscapes and Dynamics in Evolution, Ecology and Beyond

arXiv:2207.13402 [pdf, other]

Statistical-physics approaches to RNA molecules, families and networks

Authors: Simona Cocco, Andrea De Martino, Andrea Pagnani, Martin Weigt

Abstract: This contribution focuses on the fascinating RNA molecule, its sequence-dependent folding driven by base-pairing interactions, the interplay between these interactions and natural evolution, and its multiple regulatory roles. The four of us have dug into these topics using the tools and the spirit of the statistical physics of disordered systems, and in particular the concept of a disordered (energy/fitness) landscape. After an introduction to RNA molecules and the perspectives they open not only in evolutionary and synthetic biology but also in medicine, we will introduce the important notions of energy and fitness landscapes for these molecules. In Section III we will review some models and algorithms for RNA sequence-to-secondary-structure mapping. Section IV discusses how the secondary-structure energy landscape can be derived from unzipping data. Section V deals with the inference of RNA structure from evolutionary sequence data sampled in different organisms. This will shift the focus from the 'sequence-to-structure' mapping described in Section III to a 'sequence-to-function' landscape that can be inferred from laboratory evolutionary data on DNA aptamers. Finally, in Section VI, we shall discuss the rich theoretical picture linking networks of interacting RNA molecules to the organization of robust, systemic regulatory programs. Along this path, we will therefore explore phenomena across multiple scales in space, number of molecules and time, showing how the biological complexity of the RNA world can be captured by the unifying concepts of statistical physics.

Submitted 27 July, 2022; originally announced July 2022.

Comments: 19 pages, 6 figures, to appear in "Spin Glass Theory and Far Beyond - Replica Symmetry Breaking after 40 years" (edited by P Charbonneau, E Marinari, G Parisi, F Ricci Tersenghi, G Sicuro and F Zamponi)

arXiv:2206.11600 [pdf, other]

doi 10.1103/PhysRevX.13.021003

Disentangling representations in Restricted Boltzmann Machines without adversaries

Authors: Jorge Fernandez-de-Cossio-Diaz, Simona Cocco, Remi Monasson

Abstract: A goal of unsupervised machine learning is to build representations of complex high-dimensional data, with simple relations to their properties. Such disentangled representations make it easier to interpret the significant latent factors of variation in the data, as well as to generate new data with desirable features. Methods for disentangling representations often rely on an adversarial scheme, in which representations are tuned to prevent discriminators from being able to reconstruct information about the data properties (labels). Unfortunately, adversarial training is generally difficult to implement in practice. Here we propose a simple, effective way of disentangling representations without any need to train adversarial discriminators, and apply our approach to Restricted Boltzmann Machines (RBM), one of the simplest representation-based generative models. Our approach relies on the introduction of adequate constraints on the weights during training, which allows us to concentrate information about labels on a small subset of latent variables. The effectiveness of the approach is illustrated with four examples: the CelebA dataset of facial images, the two-dimensional Ising model, the MNIST dataset of handwritten digits, and the taxonomy of protein families. In addition, we show how our framework allows for analytically computing the cost, in terms of log-likelihood of the data, associated with the disentanglement of their representations.

Submitted 8 March, 2023; v1 submitted 23 June, 2022; originally announced June 2022.

Comments: Minor corrections. Accepted for publication in Physical Review X

arXiv:2204.10553 [pdf, other]

Mutational paths with sequence-based models of proteins: from sampling to mean-field characterisation

Authors: Eugenio Mauri, Simona Cocco, Rémi Monasson

Abstract: Identifying and characterizing mutational paths is an important issue in evolutionary biology and in bioengineering. We here introduce a generic description of mutational paths in terms of the goodness of sequences and of the mutational dynamics (how sequences change) along the path. We first propose an algorithm to sample mutational paths, which we benchmark on exactly solvable models of proteins in silico, and apply to data-driven models of natural proteins learned from sequence data with Restricted Boltzmann Machines. We then use mean-field theory to characterize the properties of mutational paths for different mutational dynamics of interest, and show how it can be used to extend Kimura's estimate of evolutionary distances to sequence-based epistatic models of selection.

Submitted 27 March, 2023; v1 submitted 22 April, 2022; originally announced April 2022.

arXiv:2107.06013 [pdf, other]

doi 10.1103/PhysRevE.104.034109

Barriers and Dynamical Paths in Alternating Gibbs Sampling of Restricted Boltzmann Machines

Authors: Clément Roussel, Simona Cocco, Rémi Monasson

Abstract: Restricted Boltzmann Machines (RBM) are bi-layer neural networks used for the unsupervised learning of model distributions from data. The bipartite architecture of RBM naturally defines an elegant sampling procedure, called Alternating Gibbs Sampling (AGS), where the configurations of the latent-variable layer are sampled conditional to the data-variable layer, and vice versa. We study here the performance of AGS on several analytically tractable models borrowed from statistical mechanics. We show that standard AGS is not more efficient than classical Metropolis-Hastings (MH) sampling of the effective energy landscape defined on the data layer. However, RBM can identify meaningful representations of training data in their latent space. Furthermore, using these representations and combining Gibbs sampling with the MH algorithm in the latent space can enhance the sampling performance of the RBM when the hidden units encode weakly dependent features of the data. We illustrate our findings on three datasets: Bars and Stripes and MNIST, well known in machine learning, and the so-called Lattice Proteins, introduced in theoretical biology to study the sequence-to-structure mapping in proteins.

Submitted 21 October, 2021; v1 submitted 13 July, 2021; originally announced July 2021.

Journal ref: Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, American Physical Society, 2021

arXiv:2010.06220 [pdf, other]

doi 10.1209/0295-5075/132/56001

Gaussian Closure Scheme in the Quasi-Linkage Equilibrium Regime of Evolving Genome Populations

Authors: Eugenio Mauri, Simona Cocco, Rémi Monasson

Abstract: Describing the evolution of a population of genomes evolving in a complex fitness landscape is generally very hard. We here introduce an approximate Gaussian closure scheme to characterize analytically the statistics of a genomic population in the so-called Quasi-Linkage Equilibrium (QLE) regime, applicable to generic values of the rates of mutation or recombination and fitness functions. The Gaussian approximation is illustrated on a short-range fitness landscape with two far away and competing maxima. It unveils the existence of a phase transition from a broad to a polarized distribution of genomes as the strength of epistatic couplings is increased, characterized by slow coarsening dynamics of competing allele domains. Results of the closure scheme are corroborated by numerical simulations.

Submitted 13 October, 2020; originally announced October 2020.

Comments: EPL - Europhysics Letters, European Physical Society/EDP Sciences/Società Italiana di Fisica/IOP Publishing, In press

arXiv:2006.16735 [pdf, other]

doi 10.1088/1742-5468/ac0f64

Inferring epistasis from genomic data with comparable mutation and outcrossing rate

Authors: Hong-Li Zeng, Eugenio Mauri, Vito Dichio, Simona Cocco, Remi Monasson, Erik Aurell

Abstract: We consider a population evolving due to mutation, selection and recombination, where selection includes single-locus terms (additive fitness) and two-loci terms (pairwise epistatic fitness). We further consider the problem of inferring fitness in the evolutionary dynamics from one or several snapshots of the distribution of genotypes in the population. In the recent literature this has been done by applying the Quasi-Linkage Equilibrium (QLE) regime first obtained by Kimura in the limit of high recombination. Here we show that the approach also works in the interesting regime where the effects of mutations are comparable to or larger than recombination. This leads to a modified main epistatic fitness inference formula where the rates of mutation and recombination occur together. We also derive this formula using a previously developed Gaussian closure that formally remains valid when recombination is absent. The findings are validated through numerical simulations.

Submitted 4 May, 2021; v1 submitted 30 June, 2020; originally announced June 2020.

Comments: 16 pages, 9 figures. Substantial revision from second version, previous suggestions and comments gratefully acknowledged

arXiv:1912.12942 [pdf, other]

doi 10.1088/1751-8121/ab7d00

'Place-cell' emergence and learning of invariant data with restricted Boltzmann machines: breaking and dynamical restoration of continuous symmetries in the weight space

Authors: Moshir Harsh, Jérôme Tubiana, Simona Cocco, Remi Monasson

Abstract: Distributions of data or sensory stimuli often enjoy underlying invariances. How and to what extent those symmetries are captured by unsupervised learning methods is a relevant question in machine learning and in computational neuroscience. We study here, through a combination of numerical and analytical tools, the learning dynamics of Restricted Boltzmann Machines (RBM), a neural network paradigm for representation learning. As learning proceeds from a random configuration of the network weights, we show the existence of, and characterize, a symmetry-breaking phenomenon, in which the latent variables acquire receptive fields focusing on limited parts of the invariant manifold supporting the data. The symmetry is restored at large learning times through the diffusion of the receptive field over the invariant manifold; hence, the RBM effectively spans a continuous attractor in the space of network weights. This symmetry-breaking phenomenon takes place only if the amount of data available for training exceeds some critical value, depending on the network size and the intensity of symmetry-induced correlations in the data; below this 'retarded-learning' threshold, the network weights are essentially noisy and overfit the data.

Submitted 30 December, 2019; originally announced December 2019.

arXiv:1907.12793 [pdf, other]

doi 10.1103/PhysRevE.101.012309

Inference of compressed Potts graphical models

Authors: Francesca Rizzato, Alice Coucke, Eleonora de Leonardis, J. P. Barton, Jérôme Tubiana, Remi Monasson, Simona Cocco

Abstract: We consider the problem of inferring a graphical Potts model on a population of variables, with a non-uniform number of Potts colors (symbols) across variables. This inverse Potts problem generally involves the inference of a large number of parameters, often larger than the number of available data, and, hence, requires the introduction of regularization. We study here a double regularization scheme, in which the number of colors available to each variable is reduced, and interaction networks are made sparse. To achieve this color compression scheme, only Potts states with large empirical frequency (exceeding some threshold) are explicitly modeled on each site, while the others are grouped into a single state. We benchmark the performance of this mixed regularization approach, with two inference algorithms, the Adaptive Cluster Expansion (ACE) and the PseudoLikelihood Maximization (PLM), on synthetic data obtained by sampling disordered Potts models on Erdős-Rényi random graphs. We show in particular that color compression does not affect the quality of reconstruction of the parameters corresponding to high-frequency symbols, while drastically reducing the number of the other parameters and thus the computational time. Our procedure is also applied to multi-sequence alignments of protein families, with similar results.

Submitted 3 January, 2020; v1 submitted 30 July, 2019; originally announced July 2019.

Journal ref: Phys. Rev. E 101, 012309 (2020)

arXiv:1906.05805 [pdf, other]

doi 10.1140/epjb/e2019-100313-9

Adaptive Cluster Expansion for Ising spin models

Authors: Simona Cocco, Giancarlo Croce, Francesco Zamponi

Abstract: We propose an algorithm to obtain numerically approximate solutions of the direct Ising problem, that is, to compute the free energy and the equilibrium observables of spin systems with arbitrary two-spin interactions. To this purpose we use the Adaptive Cluster Expansion method, originally developed to solve the inverse Ising problem, that is, to infer the interactions from the equilibrium correlations. The method consists in iteratively constructing and selecting clusters of spins, computing their contributions to the free energy and discarding clusters whose contribution is lower than a fixed threshold. The properties of the cluster expansion and its performance are studied in detail on one-dimensional, two-dimensional, random and fully connected graphs with homogeneous or heterogeneous fields and couplings. We discuss the differences between different representations (Boolean and Ising) of the spin variables.

Submitted 10 October, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

Comments: 27 pages, 14 figures

Journal ref: Eur. Phys. J. B (2019) 92: 259

arXiv:1703.01222 [pdf, other]

doi 10.1088/1361-6633/aa9965

Inverse Statistical Physics of Protein Sequences: A Key Issues Review

Authors: Simona Cocco, Christoph Feinauer, Matteo Figliuzzi, Remi Monasson, Martin Weigt

Abstract: In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e., evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and how statistical-mechanics inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.

Submitted 3 March, 2017; originally announced March 2017.

Comments: 18 pages, 7 figures

Journal ref: Rep. Prog. Phys. 81, 032601 (2018)

arXiv:1611.05082 [pdf, other]

doi 10.1371/journal.pcbi.1004889

Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models

Authors: Hugo Jacquin, Amy Gilson, Eugene Shakhnovich, Simona Cocco, Rémi Monasson

Abstract: Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However, the underlying assumptions of the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use the lattice protein (LP) model to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonian for design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. Those are remarkable results as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.

Submitted 15 November, 2016; originally announced November 2016.

Comments: Supplementary Information available at http://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1004889

Journal ref: PLoS Comput. Biol. 12(5): e1004889 (2016)

arXiv:1512.08101 [pdf, other]

doi 10.1007/s10955-015-1441-4

On the entropy of protein families

Authors: John Barton, Arup Chakraborty, Simona Cocco, Hugo Jacquin, Rémi Monasson

Abstract: Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copy, transport, ... The protein sequences realizing a given function may largely vary across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the fixation of one particular amino-acid on a specific site, and relate this notion to the escape probability of the HIV virus. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed.

Submitted 26 December, 2015; originally announced December 2015.

Comments: to appear in Journal of Statistical Physics

arXiv:1503.02802 [pdf, ps, other]

doi 10.1007/s10955-015-1341-7

Learning probabilities from random observables in high dimensions: the maximum entropy distribution and others

Authors: Tomoyuki Obuchi, Simona Cocco, Rémi Monasson

Abstract: We consider the problem of learning a target probability distribution over a set of $N$ binary variables from the knowledge of the expectation values (with this target distribution) of $M$ observables, drawn uniformly at random. The space of all probability distributions compatible with these $M$ expectation values within some fixed accuracy, called version space, is studied. We introduce a biased measure over the version space, which gives a boost increasing exponentially with the entropy of the distributions and with an arbitrary inverse 'temperature' $\Gamma$. The choice of $\Gamma$ allows us to interpolate smoothly between the unbiased measure over all distributions in the version space ($\Gamma=0$) and the pointwise measure concentrated at the maximum entropy distribution ($\Gamma\to \infty$). Using the replica method we compute the volume of the version space and other quantities of interest, such as the distance $R$ between the target distribution and the center-of-mass distribution over the version space, as functions of $\alpha=(\log M)/N$ and $\Gamma$ for large $N$. Phase transitions at critical values of $\alpha$ are found, corresponding to qualitative improvements in the learning of the target distribution and to the decrease of the distance $R$. However, for fixed $\alpha$, the distance $R$ does not vary with $\Gamma$, which means that the maximum entropy distribution is not closer to the target distribution than any other distribution compatible with the observable values. Our results are confirmed by Monte Carlo sampling of the version space for small system sizes ($N\le 10$).

Submitted 21 July, 2015; v1 submitted 10 March, 2015; originally announced March 2015.

Comments: 30 pages, 13 figures

arXiv:1405.6673 [pdf, ps, other]

doi 10.1103/PhysRevLett.112.238101

Stochastic Ratchet Mechanisms for Replacement of Proteins Bound to DNA

Authors: Simona Cocco, John F. Marko, Remi Monasson

Abstract: Experiments indicate that unbinding rates of proteins from DNA can depend on the concentration of proteins in nearby solution. Here we present a theory of multi-step replacement of DNA-bound proteins by solution-phase proteins. For four different kinetic scenarios we calculate the dependence of protein unbinding and replacement rates on solution protein concentration. We find (1) strong effects of progressive 'rezipping' of the solution-phase protein onto DNA sites liberated by 'unzipping' of the originally bound protein; (2) that a model in which solution-phase proteins bind non-specifically to DNA can describe experiments on exchanges between the non-specific DNA-binding proteins Fis-Fis and Fis-HU; (3) that a binding-specific model describes experiments on the exchange of CueR proteins on specific binding sites.

Submitted 26 May, 2014; originally announced May 2014.

Comments: to appear in Phys. Rev. Lett., June 2014

arXiv:1405.0233 [pdf, other]

doi 10.1103/PhysRevE.90.012132

Large Pseudo-Counts and $L_2$-Norm Penalties Are Necessary for the Mean-Field Inference of Ising and Potts Models

Authors: J. P. Barton, S. Cocco, E. De Leonardis, R. Monasson

Abstract: Mean field (MF) approximation offers a simple, fast way to infer direct interactions between elements in a network of correlated variables, a common, computationally challenging problem with practical applications in fields ranging from physics and biology to the social sciences. However, MF methods achieve their best performance with strong regularization, well beyond Bayesian expectations, an empirical fact that is poorly understood. In this work, we study the influence of pseudo-count and $L_2$-norm regularization schemes on the quality of inferred Ising or Potts interaction networks from correlation data within the MF approximation. We argue, based on the analysis of small systems, that the optimal value of the regularization strength remains finite even if the sampling noise tends to zero, in order to correct for systematic biases introduced by the MF approximation. Our claim is corroborated by extensive numerical studies of diverse model systems and by the analytical study of the $m$-component spin model, for large but finite $m$. Additionally we find that pseudo-count regularization is robust against sampling noise, and often outperforms $L_2$-norm regularization, particularly when the underlying network of interactions is strongly heterogeneous. Much better performances are generally obtained for the Ising model than for the Potts model, for which only couplings incoming onto medium-frequency symbols are reliably inferred.

Submitted 1 May, 2014; originally announced May 2014.

Comments: 25 pages, 17 figures

Journal ref: Phys Rev E 90 (2014) 012132

arXiv:1212.3281 [pdf, ps, other]

doi 10.1371/journal.pcbi.1003176

From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction

Authors: Simona Cocco, Remi Monasson, Martin Weigt

Abstract: Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.

Submitted 27 August, 2013; v1 submitted 13 December, 2012; originally announced December 2012.

Comments: Supporting information can be downloaded from: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003176

Journal ref: PLoS Computational Biology 9, 8 (2013) e1003176

arXiv:1203.5264 [pdf, ps, other]

doi 10.1063/1.4721433

Charge dynamics of a single donor coupled to a few electrons quantum dot in silicon

Authors: G. Mazzeo, E. Prati, M. Belli, G. Leti, S. Cocco, M. Fanciulli, F. Guagliardo, G. Ferrari

Abstract: We study the charge transfer dynamics between a silicon quantum dot and an individual phosphorus donor using the conduction through the quantum dot as a probe for the donor ionization state. We use a silicon n-MOSFET (metal-oxide-semiconductor field-effect transistor) biased near threshold in the SET regime with two side gates to control both the device conductance and the donor charge. We measure a tunneling time that is independent of temperature and magnetic field. We measure the statistics of the transfer of electrons observed when the ground state D0 of the donor is aligned with the SET states.

Submitted 23 March, 2012; originally announced March 2012.

arXiv:1203.4811 [pdf, ps, other]

doi 10.1088/0957-4484/23/21/215204

Few Electron Limit of n-type Metal Oxide Semiconductor Single Electron Transistors

Authors: Enrico Prati, Marco De Michielis, Matteo Belli, Simone Cocco, Marco Fanciulli, Dharmraj Kotekar-Patil, Matthias Ruoff, Dieter P. Kern, David A. Wharam, Arjan Verduijn, Giuseppe Tettamanzi, Sven Rogge, Benoit Roche, Romain Wacquez, Xavier Jehl, Maud Vinet, Marc Sanquer

Abstract: We report electronic transport on n-type silicon Single Electron Transistors (SETs) fabricated in Complementary Metal Oxide Semiconductor (CMOS) technology. The n-MOSSETs are built within a pre-industrial Fully Depleted Silicon On Insulator (FDSOI) technology with a silicon thickness down to 10 nm on 200 mm wafers. The nominal channel size of 20 $\times$ 20 nm$^{2}$ is obtained by employing electron beam lithography for active and gate levels patterning. The Coulomb blockade stability diagram is precisely resolved at 4.2 K and it exhibits large addition energies of tens of meV. The confinement of the electrons in the quantum dot has been modeled by using a Current Spin Density Functional Theory (CS-DFT) method. CMOS technology enables massive production of SETs for ultimate nanoelectronics and quantum variables based devices.

Submitted 20 March, 2012; originally announced March 2012.

Comments: 4 Figures

arXiv:1110.5416 [pdf, ps, other]

doi 10.1007/s10955-012-0463-4

Adaptive cluster expansion for the inverse Ising problem: convergence, algorithm and tests

Authors: Simona Cocco, Rémi Monasson

Abstract: We present a procedure to solve the inverse Ising problem, that is to find the interactions between a set of binary variables from the measure of their equilibrium correlations. The method consists in constructing and selecting specific clusters of variables, based on their contributions to the cross-entropy of the Ising model. Small contributions are discarded to avoid overfitting and to make the computation tractable. The properties of the cluster expansion and its performances on synthetic data are studied. To make the implementation easier we give the pseudo-code of the algorithm.

Submitted 25 October, 2011; originally announced October 2011.

Comments: Paper submitted to Journal of Statistical Physics

arXiv:1104.3665 [pdf, ps, other]

doi 10.1103/PhysRevE.83.051123

High-Dimensional Inference with the generalized Hopfield Model: Principal Component Analysis and Corrections

Authors: Simona Cocco, Remi Monasson, Vitor Sessak

Abstract: We consider the problem of inferring the interactions between a set of N binary variables from the knowledge of their frequencies and pairwise correlations. The inference framework is based on the Hopfield model, a special case of the Ising model where the interaction matrix is defined through a set of patterns in the variable space, and is of rank much smaller than N. We show that Maximum Likelihood inference is deeply related to Principal Component Analysis when the amplitude of the pattern components, xi, is negligible compared to N^{1/2}. Using techniques from statistical mechanics, we calculate the corrections to the patterns to the first order in xi/N^{1/2}. We stress that it is important to generalize the Hopfield model and include both attractive and repulsive patterns, to correctly infer networks with sparse and strong interactions. We present a simple geometrical criterion to decide how many attractive and repulsive patterns should be considered as a function of the sampling noise. We moreover discuss how many sampled configurations are required for a good inference, as a function of the system size, N, and of the amplitude, xi. The inference approach is illustrated on synthetic and biological data.

Submitted 19 April, 2011; originally announced April 2011.

Comments: Physical Review E: Statistical, Nonlinear, and Soft Matter Physics (2011) to appear

arXiv:1102.3260 [pdf, ps, other]

doi 10.1103/PhysRevLett.106.090601

Adaptive Cluster Expansion for Inferring Boltzmann Machines with Noisy Data

Authors: Simona Cocco, Rémi Monasson

Abstract: We introduce a procedure to infer the interactions among a set of binary variables, based on their sampled frequencies and pairwise correlations. The algorithm builds the clusters of variables contributing most to the entropy of the inferred Ising model, and rejects the small contributions due to the sampling noise. Our procedure successfully recovers benchmark Ising models even at criticality and in the low temperature phase, and is applied to neurobiological data.

Submitted 16 February, 2011; originally announced February 2011.

Comments: Accepted for publication in Physical Review Letters (2011)

arXiv:1010.2728 [pdf, other]

doi 10.1209/0295-5075/94/20005

On the trajectories and performance of Infotaxis, an information-based greedy search algorithm

Authors: Carlo Barbieri, Simona Cocco, Rémi Monasson

Abstract: We present a continuous-space version of Infotaxis, a search algorithm where a searcher greedily moves to maximize the gain in information about the position of the target to be found. Using a combination of analytical and numerical tools we study the nature of the trajectories in two and three dimensions. The probability that the search is successful and the running time of the search are estimated. A possible extension to non-greedy search is suggested.

Submitted 16 March, 2011; v1 submitted 13 October, 2010; originally announced October 2010.

Comments: 6 pages, 7 figures, accepted for publication in EPL

Journal ref: EPL, 94 (2011) 20005

arXiv:1006.5406 [pdf, ps, other]

doi 10.1063/1.3551735

Adiabatic Charge Control in a Single Donor Atom Transistor

Authors: Enrico Prati, Matteo Belli, Simone Cocco, Guido Petretto, Marco Fanciulli

Abstract: We charge an individual donor with electrons stored in a quantum dot in its proximity. A Silicon quantum device containing a single Arsenic donor and an electrostatic quantum dot in parallel is realized in a nanometric field effect transistor. The different coupling capacitances of the donor and the quantum dot with the control and the back gates are exploited to generate a relative rigid shift of their energy spectrum as a function of the back gate voltage, causing the crossing of the energy levels. We observe the sequential tunneling through the $D^{2-}$ and the $D^{3-}$ energy levels of the donor hybridized at the oxide interface at 4.2 K. Their respective states form a honeycomb pattern with the quantum dot states. It is therefore possible to control the exchange coupling of an electron of the quantum dot with the electrons bound to the donor, thus realizing a physical qubit for quantum information processing applications.

Submitted 10 August, 2010; v1 submitted 18 May, 2010; originally announced June 2010.

Comments: 12 pages, 5 figures

arXiv:0812.1180 [pdf, ps, other]

doi 10.1088/1478-3975/6/2/025003

Dynamical modelling of molecular constructions and setups for DNA unzipping

Authors: Carlo Barbieri, Simona Cocco, Remi Monasson, Francesco Zamponi

Abstract: We present a dynamical model of DNA mechanical unzipping under the action of a force. The model includes the motion of the fork in the sequence-dependent landscape, the trap(s) acting on the bead(s), and the polymeric components of the molecular construction (unzipped single strands of DNA, and linkers). Different setups are considered to test the model, and the outcome of the simulations is compared to simpler dynamical models existing in the literature where polymers are assumed to be at equilibrium.

Submitted 5 December, 2008; originally announced December 2008.

Journal ref: Phys. Biol. 6, 025003 (2009)

arXiv:0704.2547 [pdf, ps, other]

doi 10.1103/PhysRevE.75.011904

Inferring DNA sequences from mechanical unzipping data: the large-bandwidth case

Authors: Valentina Baldazzi, Serena Bradde, Simona Cocco, Enzo Marinari, Remi Monasson

Abstract: The complementary strands of DNA molecules can be separated when stretched apart by a force; the unzipping signal is correlated to the base content of the sequence but is affected by thermal and instrumental noise. We consider here the ideal case where opening events are known to a very good time resolution (very large bandwidth), and study how the sequence can be reconstructed from the unzipping data. Our approach relies on the use of statistical Bayesian inference and of the Viterbi decoding algorithm. Performances are studied numerically on Monte Carlo generated data, and analytically. We show how multiple unzippings of the same molecule may be exploited to improve the quality of the prediction, and calculate analytically the number of required unzippings as a function of the bandwidth, the sequence content, and the elasticity parameters of the unzipped strands.

Submitted 19 April, 2007; originally announced April 2007.

Journal ref: Phys. Rev. E 75 (2007) 011904

arXiv:0704.2539 [pdf, ps, other]

doi 10.1209/0295-5075/81/20002

Reconstructing a Random Potential from its Random Walks

Authors: Simona Cocco, Remi Monasson

Abstract: The problem of how many trajectories of a random walker in a potential are needed to reconstruct the values of this potential is studied. We show that this problem can be solved by calculating the probability of survival of an abstract random walker in a partially absorbing potential. The approach is illustrated on the discrete Sinai (random force) model with a drift. We determine the parameter (temperature, duration of each trajectory, ...) values making reconstruction as fast as possible.

Submitted 19 April, 2007; originally announced April 2007.

Journal ref: Europhysics Letters (EPL) (2008) 81, 20002

arXiv:cond-mat/0508288 [pdf, ps, other]

doi 10.1103/PhysRevE.72.061902

Protein-Mediated DNA Loops: Effects of Protein Bridge Size and Kinks

Authors: Nicolas Douarche, Simona Cocco

Abstract: This paper focuses on the probability that a portion of DNA closes on itself through thermal fluctuations. We investigate the dependence of this probability upon the size r of a protein bridge and/or the presence of a kink at half DNA length. The DNA is modeled by the Worm-Like Chain model, and the probability of loop formation is calculated in two ways: exact numerical evaluation of the constrained path integral and the extension of the Shimada and Yamakawa saddle point approximation. For example, we find that the looping free energy of a 100-base-pair DNA decreases from 24 kT to 13 kT when the loop is closed by a protein of length r = 10 nm. It further decreases to 5 kT when the loop has a kink of 120 degrees at half-length.

Submitted 16 November, 2005; v1 submitted 11 August, 2005; originally announced August 2005.

Comments: corrected typos and figures, references updated; 13 pages, 7 figures, accepted for publication in Phys. Rev. E

Journal ref: Physical Review E 72, 061902 (2005)

arXiv:cond-mat/0506221 [pdf, ps, other]

doi 10.1103/PhysRevLett.96.128102

Inferring DNA sequences from mechanical unzipping: an ideal-case study

Authors: V. Baldazzi, S. Cocco, E. Marinari, R. Monasson

Abstract: We introduce and test a method to predict the sequence of DNA molecules from in silico unzipping experiments. The method is based on Bayesian inference and on the Viterbi decoding algorithm. The probability of misprediction decreases exponentially with the number of unzippings, with a decay rate depending on the applied force and the sequence content.

Submitted 5 July, 2005; v1 submitted 9 June, 2005; originally announced June 2005.

Comments: Source as TeX file with ps figures

arXiv:q-bio/0406010 [pdf, ps, other]

Role of calcium and noise in the persistent activity of an isolated neuron

Authors: Simona Cocco

Abstract: The activity of an isolated and auto-connected neuron is studied using Hodgkin-Huxley and Integrate-and-Fire frameworks. The main ingredients of the modeling are the auto-stimulating autaptic current observed in experiments, with a spontaneous synaptic liberation noise and a calcium-dependent negative feedback mechanism. The distributions of inter-spike intervals and burst durations are analytically calculated, and show a good agreement with experimental data.

Submitted 3 June, 2004; originally announced June 2004.

Comments: date written: 30/5/2004

arXiv:cs/0401011 [pdf, ps, other]

Heuristic average-case analysis of the backtrack resolution of random 3-Satisfiability instances

Authors: Simona Cocco, Remi Monasson

Abstract: An analysis of the average-case complexity of solving random 3-Satisfiability (SAT) instances with backtrack algorithms is presented. We first interpret previous rigorous works in a unifying framework based on the statistical physics notions of dynamical trajectories, phase diagram and growth process. It is argued that, under the action of the Davis-Putnam-Loveland-Logemann (DPLL) algorithm, 3-SAT instances are turned into 2+p-SAT instances whose characteristic parameters (ratio alpha of clauses per variable, fraction p of 3-clauses) can be followed during the operation, and define resolution trajectories. Depending on the location of trajectories in the phase diagram of the 2+p-SAT model, easy (polynomial) or hard (exponential) resolutions are generated. Three regimes are identified, depending on the ratio alpha of the 3-SAT instance to be solved. Lower sat phase: for small ratios, DPLL almost surely finds a solution in a time growing linearly with the number N of variables. Upper sat phase: for intermediate ratios, instances are almost surely satisfiable but finding a solution requires exponential time (2^(N omega) with omega > 0) with high probability. Unsat phase: for large ratios, there is almost always no solution and proofs of refutation are exponential. An analysis of the growth of the search tree in both upper sat and unsat regimes is presented, and allows us to estimate omega as a function of alpha.
This analysis is based on an exact relationship between the average size of the search tree and the powers of the evolution operator encoding the elementary steps of the search heuristic.

Submitted 14 January, 2004; originally announced January 2004.

Comments: to appear in Theoretical Computer Science

ACM Class: A.0

Journal ref: Theoretical Computer Science (2004) A 320, 345

arXiv:cs/0302003 [pdf, ps, other]

Approximate analysis of search algorithms with "physical" methods

Authors: Simona Cocco, Remi Monasson, Andrea Montanari, Guilhem Semerjian

Abstract: An overview of some methods of statistical physics applied to the analysis of algorithms for optimization problems (satisfiability of Boolean constraints, vertex cover of graphs, decoding, ...) with distributions of random inputs is proposed. Two types of algorithms are analyzed: complete procedures with backtracking (Davis-Putnam-Loveland-Logemann algorithm) and incomplete, local search procedures (gradient descent, random walksat, ...). The study of complete algorithms makes use of physical concepts such as phase transitions, dynamical renormalization flow, growth processes, ... As for local search procedures, the connection between computational complexity and the structure of the cost function landscape is questioned, with emphasis on the notion of metastability.

Submitted 3 February, 2003; originally announced February 2003.

Comments: 28 pages, 23 figures

ACM Class: F.2.2

arXiv:cond-mat/0207609 [pdf, ps, other]

doi 10.1140/epje/e2003-00019-8

Slow nucleic acid unzipping kinetics from sequence-defined barriers

Authors: S. Cocco, J. F. Marko, R. Monasson

Abstract: Recent experiments on unzipping of RNA helix-loop structures by force have shown that about 40-base molecules can undergo kinetic transitions between two well-defined 'open' and 'closed' states, on a timescale = 1 sec [Liphardt et al., Science 297, 733-737 (2001)]. Using a simple dynamical model, we show that these phenomena result from the slow kinetics of crossing large free energy barriers which separate the open and closed conformations. The dependence of barriers on sequence along the helix, and on the size of the loop(s) is analyzed. Some DNA and RNA sequences that could show dynamics on different time scales, or three (or more)-state unzipping, are proposed.

Submitted 25 July, 2002; originally announced July 2002.

Comments: 8 pages Revtex, including 4 figures

arXiv:cond-mat/0207499 [pdf, ps, other]

doi 10.1103/PhysRevE.66.051914

Unzipping Dynamics of Long DNAs

Authors: S. Cocco, R. Monasson, J. F. Marko

Abstract: The two strands of the DNA double helix can be 'unzipped' by application of 15 pN force. We analyze the dynamics of unzipping and rezipping, for the case where the molecule ends are separated and re-approached at constant velocity. For unzipping of 50 kilobase DNAs at less than about 1000 bases per second, thermal equilibrium-based theory applies. However, for higher unzipping velocities, rotational viscous drag creates a buildup of elastic torque to levels above kBT in the dsDNA region, causing the unzipping force to be well above or well below the equilibrium unzipping force during respectively unzipping and rezipping, in accord with recent experimental results of Thomen et al. [Phys. Rev. Lett. 88, 248102 (2002)]. Our analysis includes the effect of sequence on unzipping and rezipping, and the transient delay in buildup of the unzipping force due to the approach to the steady state.

Submitted 20 July, 2002; originally announced July 2002.

Comments: 15 pages Revtex file including 9 figures

arXiv:cond-mat/0206242 [pdf, ps, other]

Restart method and exponential acceleration of random 3-SAT instances resolutions: a large deviation analysis of the Davis-Putnam-Loveland-Logemann algorithm

Authors: S. Cocco, R. Monasson

Abstract: The analysis of the solving complexity of random 3-SAT instances using the Davis-Putnam-Loveland-Logemann (DPLL) algorithm slightly below threshold is presented. While finding a solution for such instances demands exponential effort with high probability, we show that an exponentially small fraction of resolutions require a computation scaling linearly in the size of the instance only. We compute analytically this exponentially small probability of easy resolutions from a large deviation analysis of DPLL with the Generalized Unit Clause search heuristic, and show that the corresponding exponent is smaller (in absolute value) than the growth exponent of the typical resolution time. Our study therefore gives some quantitative basis to heuristic restart solving procedures, and suggests a natural cut-off cost (the size of the instance) for the restart.

Submitted 13 June, 2002; originally announced June 2002.

Comments: submitted to Annals of Math and Artificial Intelligence

arXiv:cond-mat/0206239 [pdf, ps, other]

doi 10.1103/PhysRevLett.90.047205

Rigorous decimation-based construction of ground pure states for spin glass models on random lattices

Authors: S. Cocco, O. Dubois, J. Mandler, R. Monasson

Abstract: A constructive scheme for determining pure states (clusters) at very low temperature in the 3-spins glass model on a random lattice is provided, in full agreement with Parisi's one step replica symmetry breaking (RSB) scheme. Proof is based on the analysis of an exact decimation procedure. When the number c of couplings per spin is smaller than some critical value c_d, all spins are eliminated at the end of decimation (RS phase). In the range c_d<c<c_s, a reduced Hamiltonian is left; each ground state (GS) of the latter is a "seed" from which a cluster of GS of the original Hamiltonian can be reconstructed. Above c_s, GS are frustrated with an energy per spin larger than -c. The number of GS in each cluster, the number of clusters, the distances between GS are calculated and correspond to RSB predictions.

Submitted 18 July, 2002; v1 submitted 13 June, 2002; originally announced June 2002.

arXiv:cond-mat/0206238 [pdf, ps, other]

Theoretical models for single-molecule DNA and RNA experiments: from elasticity to unzipping

Authors: S. Cocco, J. F. Marko, R. Monasson

Abstract: We review statistical-mechanical theories of single-molecule micromanipulation experiments on nucleic acids. First, models for describing polymer elasticity are introduced. We then review how these models are used to interpret single-molecule force-extension experiments on single-stranded and double-stranded DNA. Depending on the force and the molecules used, both smooth elastic behaviors and abrupt structural transitions are observed. Third, we show how combining the elasticity of two single nucleic acid strands with a description of the base-pairing interactions between them explains much of the phenomenology and kinetics of RNA and DNA `unzipping' experiments.

Submitted 13 June, 2002; originally announced June 2002.

Comments: to appear in CRAS, special issue dedicated to Single Molecule Experiments

arXiv:cond-mat/0203012 [pdf, ps, other]

doi 10.1103/PhysRevE.66.037101

Exponentially hard problems are sometimes polynomial, a large deviation analysis of search algorithms for the random Satisfiability problem, and its application to stop-and-restart resolutions

Authors: S. Cocco, R. Monasson

Abstract: A large deviation analysis of the solving complexity of random 3-Satisfiability instances slightly below threshold is presented. While finding a solution for such instances demands an exponential effort with high probability, we show that an exponentially small fraction of resolutions require a computation scaling linearly in the size of the instance only. This exponentially small probability of easy resolutions is analytically calculated, and the corresponding exponent shown to be smaller (in absolute value) than the growth exponent of the typical resolution time. Our study therefore gives some theoretical basis to heuristic stop-and-restart solving procedures, and suggests a natural cut-off (the size of the instance) for the restart.

Submitted 1 March, 2002; originally announced March 2002.

Comments: Revtex file, 4 figures

arXiv:cond-mat/0202466 [pdf, ps, other]

doi 10.1073/pnas.151257598

Force and kinetic barriers in unzipping of DNA

Authors: S. Cocco, R. Monasson, J. Marko

Abstract: A theory of the unzipping of double-stranded (ds) DNA is presented, and is compared to recent micromanipulation experiments. It is shown that the interactions which stabilize the double helix and the elastic rigidity of single strands (ss) simply determine the sequence dependent ≈12 pN force threshold for DNA strand separation. Using a semi-microscopic model of the binding between nucleotide strands, we show that the greater rigidity of the strands when formed into dsDNA, relative to that of isolated strands, gives rise to a potential barrier to unzipping. The effects of this barrier are derived analytically. The force to keep the extremities of the molecule at a fixed distance, the kinetic rates for strand unpairing at fixed applied force, and the rupture force as a function of loading rate are calculated. The dependence of the kinetics and of the rupture force on molecule length is also analyzed.

Submitted 26 February, 2002; originally announced February 2002.

Comments: Revtex file + 6 eps Figures; published in Proc. Natl. Acad. Sci. USA 98, 8608 (2001)

arXiv:cond-mat/0012191 [pdf, ps, other]

doi 10.1007/s100510170101

Analysis of the computational complexity of solving random satisfiability problems using branch and bound search algorithms

Authors: Simona Cocco, Remi Monasson

Abstract: The computational complexity of solving random 3-Satisfiability (3-SAT) problems is investigated. 3-SAT is a representative example of hard computational tasks; it consists in knowing whether a set of alpha N randomly drawn logical constraints involving N Boolean variables can be satisfied altogether or not. Widely used solving procedures, such as the Davis-Putnam-Loveland-Logemann (DPLL) algorithm, perform a systematic search for a solution, through a sequence of trials and errors represented by a search tree. In the present study, we identify, using theory and numerical experiments, easy (size of the search tree scaling polynomially with N) and hard (exponential scaling) regimes as a function of the ratio alpha of constraints per variable. The typical complexity is explicitly calculated in the different regimes, in very good agreement with numerical simulations. Our theoretical approach is based on the analysis of the growth of the branches in the search tree under the operation of DPLL. On each branch, the initial 3-SAT problem is dynamically turned into a more generic 2+p-SAT problem, where p and 1-p are the fractions of constraints involving three and two variables respectively. The growth of each branch is monitored by the dynamical evolution of alpha and p and is represented by a trajectory in the static phase diagram of the random 2+p-SAT problem. Depending on whether or not the trajectories cross the boundary between phases, single branches or full trees are generated by DPLL, resulting in easy or hard resolutions.

Submitted 11 December, 2000; originally announced December 2000.

Comments: 37 RevTeX pages, 15 figures; submitted to Phys.Rev.E

arXiv:cond-mat/0009410 [pdf, ps, other]

doi 10.1103/PhysRevLett.86.1654

Trajectories in phase diagrams, growth processes and computational complexity: how search algorithms solve the 3-Satisfiability problem

Authors: Simona Cocco, Remi Monasson

Abstract: Most decision and optimization problems encountered in practice fall into one of two categories with respect to any particular solving method or algorithm: either the problem is solved quickly (easy) or else demands an impractically long computational effort (hard). Recent investigations on model classes of problems have shown that some global parameters, such as the ratio between the constraints to be satisfied and the adjustable variables, are good predictors of problem hardness and, moreover, have an effect analogous to thermodynamical parameters, e.g. temperature, in predicting phases in condensed matter physics [Monasson et al., Nature 400 (1999) 133-137]. Here we show that changes in the values of such parameters can be tracked during a run of the algorithm defining a trajectory through the parameter space. Focusing on 3-Satisfiability, a recognized representative of hard problems, we analyze trajectories generated by search algorithms using growth processes statistical physics. These trajectories can cross well defined phases, corresponding to domains of easy or hard instances, and allow to successfully predict the times of resolution.

Submitted 26 September, 2000; originally announced September 2000.

Comments: Revtex file + 4 eps figures

arXiv:cond-mat/9911008 [pdf, ps, other]

doi 10.1063/1.481646

Theoretical study of collective modes in DNA at ambient temperature

Authors: Simona Cocco, Remi Monasson

Abstract: The instantaneous normal modes corresponding to base pair vibrations (radial modes) and twist angle fluctuations (angular modes) of a DNA molecule model at ambient temperature are theoretically investigated. Due to thermal disorder, normal modes are not plane waves with a single wave number q but have a finite and frequency dependent damping width. The density of modes rho(nu), the average dispersion relation nu(q) as well as the coherence length xi(nu) are analytically calculated. The Gibbs averaged resolvent is computed using a replicated transfer matrix formalism and variational wave functions for the ground and first excited state. Our results for the density of modes are compared to Raman spectroscopy measurements of the collective modes for DNA in solution and show a good agreement with experimental data in the low frequency regime nu < 150 cm^{-1}. Radial modes extend over frequencies ranging from 50 cm^{-1} to 110 cm^{-1}. Angular modes, related to helical axis vibrations are limited to nu < 25 cm^{-1}. Normal modes are highly disordered and coherent over a few base pairs only (xi < 2 nm) in good agreement with neutron scattering experiments.

Submitted 2 November, 1999; originally announced November 1999.

Comments: 20 pages + 13 ps figures

arXiv:cond-mat/9904277 [pdf, ps, other]

doi 10.1103/PhysRevLett.83.5178

Statistical Mechanics of Torque Induced Denaturation of DNA

Authors: Simona Cocco, Remi Monasson

Abstract: A unifying theory of the denaturation transition of DNA, driven by temperature T or induced by an external mechanical torque Gamma is presented. Our model couples the hydrogen-bond opening and the untwisting of the helicoidal molecular structure. We show that denaturation corresponds to a first-order phase transition from B-DNA to d-DNA phases and that the coexistence region is naturally parametrized by the degree of supercoiling sigma. The denaturation free energy, the temperature dependence of the twist angle, the phase diagram in the T,Gamma plane and isotherms in the sigma, Gamma plane are calculated and show a good agreement with experimental data.

Submitted 24 September, 1999; v1 submitted 20 April, 1999; originally announced April 1999.

Comments: 5 pages, 3 figures, model improved

arXiv:cond-mat/9604102 [pdf, ps, other]

doi 10.1103/PhysRevE.54.717

Analytical and Numerical Study of Internal Representations in Multilayer Neural Networks with Binary Weights

Authors: Simona Cocco, Remi Monasson, Riccardo Zecchina

Abstract: We study the weight space structure of the parity machine with binary weights by deriving the distribution of volumes associated to the internal representations of the learning examples. The learning behaviour and the symmetry breaking transition are analyzed and the results are found to be in very good agreement with extended numerical simulations.

Submitted 16 April, 1996; originally announced April 1996.

Comments: revtex, 20 pages + 9 figures, to appear in Phys. Rev. E

Showing 1–46 of 46 results for author: Cocco, S