Post-transcriptional autoregulation of gene expression is common in bacterial systems but many fewer examples are known in eukaryotes. We used the yeast collection of genes fused to GFP as a rapid screen for examples of feedback regulation in ribosomal proteins by overexpressing a non-regulatable version of a gene and observing the effects on the expression of the GFP-fused version. We tested 95 ribosomal protein genes and found that 21 of them showed at least a threefold repression. Two genes, RPS22B and RPL1B, showed over a 10-fold repression. In both cases the cis-regulatory segment resides in the 5' UTR of the gene as shown by placing that segment of the mRNA upstream of GFP alone and demonstrating it is sufficient to cause repression of GFP when the protein is over-expressed. Further analyses showed that the intron in the 5' UTR of RPS22B is required for regulation, presumably because the protein inhibits splicing that is necessary for translation. The 5' UTR of RPL1B contains a sequence and structure motif that is conserved in the binding sites of Rpl1 orthologs from bacteria to mammals, and mutations within the motif eliminate repression.
The goal of this chapter is to provide evidence and justification for the hypothesis that autogenous, posttranscriptional regulation of gene expression is common. Several examples are known, mostly from bacteria, bacteriophage, and yeast species. Each was identified either by accident or by a concerted effort to understand the regulation of specific genes. The paucity of examples from higher eukaryotes may be due to the difficulty of identifying them using common approaches for uncovering regulatory interactions. An alternative approach is proposed that can fill the gap.
The human genome contains about 800 C2H2 zinc finger proteins (ZFPs), and most of them are composed of long arrays of zinc fingers. The standard ZFP recognition model asserts that longer finger arrays should recognize longer DNA-binding sites. However, recent experimental efforts to identify in vivo ZFP binding sites contradict this assumption, with many exhibiting short motifs. Here we use ZFY, CTCF, ZIM3, and ZNF343 as examples to address three closely related questions: What are the reasons that impede current motif discovery methods? What are the functions of those seemingly unused fingers? And how can we improve the motif discovery algorithms based on long ZFPs’ biophysical properties? Using ZFY, we employed a variety of methods and found evidence for ‘dependent recognition’, where downstream fingers can recognize some previously undiscovered motifs only in the presence of an intact core site. For CTCF, high-throughput measurements revealed its upstream specificity profile depends on the s...
Motivation: Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatoric or probabilistic approaches, which can be biased by heuristics such as substring-masking for multiple motif discovery. In recent years, deep neural networks have become increasingly popular for motif discovery, as they are capable of capturing complex patterns in data. Nonetheless, inferring motifs from neural networks remains a challenging problem, both from a modeling and computational standpoint, despite the success of these networks in supervised learning tasks. Results: We present a principled representation learning approach based on a hierarchical sparse representation for motif discovery. Our method effectively discovers gapped, long, or overlapping motifs that we show to commonly exist in next-generation sequencing datasets, in addition to the short a...
We present a principled representation learning approach based on convolutional dictionary learning (CDL) for motif discovery. We unroll an iterative algorithm that optimizes CDL as a forward pass in a neural network, resulting in a network that is fully interpretable, fast, and capable of finding motifs in large datasets. Simulated data show that our network is more sensitive and specific for discovering binding sites that exhibit complex binding patterns than popular motif discovery methods such as STREME and HOMER. Our network reveals statistically significant motifs and their diverse binding modes from the JASPAR database that are currently not reported.
Table S2. Effect of Method for Generating Negative Sequences on Training and Testing Scores. (DOCX 13 kb)
The Role of the thi Box. The mRNA for the thiCOGE operon in Rhizobium etli contains a 211-base leader with two features that are likely to be important. One is a hairpin structure that is just 5' of the thiC-initiating AUG and overlapping the ribosome-binding site (RBS), where one expects it would inhibit ribosome binding. The other feature is called the "thi box," which is a site of 38 bases that is highly conserved in the region 5' to the start of many genes, from many species, that are involved in thiamin biosynthesis. That the thi box is important in the mRNA, rather than the DNA, is demonstrated by comparison of the different occurrences of it. Only about half of the positions are completely conserved, but all of the sites can fold into the same RNA secondary structure because of compensating changes that maintain complementarity. Because this structure is conserved in different
Transcription factor-driven cell fate engineering in pluripotency induction, transdifferentiation, and forward reprogramming requires efficiency, speed, and maturity for widespread adoption and clinical translation. Here, we used Oct4, Sox2, Klf4, and c-Myc driven pluripotency reprogramming to evaluate methods for enhancing and tailoring cell fate transitions, through directed evolution with iterative screening of pooled mutant libraries and phenotypic selection. We identified an artificially evolved and enhanced POU factor (ePOU) that substantially outperforms wild-type Oct4 in terms of reprogramming speed and efficiency. In contrast to Oct4, not only can ePOU induce pluripotency with Sox2 alone, but it can also do so in the absence of Sox2 in a three-factor ePOU/Klf4/c-Myc cocktail. Biochemical assays combined with genome-wide analyses showed that ePOU possesses a new preference to dimerize on palindromic DNA elements. Yet, the moderate capacity of Oct4 to function as a pioneer fa...
Despite enormous progress in understanding the fundamentals of bacterial gene regulation, our knowledge remains limited when compared with the number of bacterial genomes and regulatory systems to be discovered. Derived from a small number of initial studies, classic definitions for concepts of gene regulation have evolved as the number of characterized promoters has increased. Together with discoveries made using new technologies, this knowledge has led to revised generalizations and principles. In this Expert Recommendation, we suggest precise, updated definitions that support a logical, consistent conceptual framework of bacterial gene regulation, focusing on transcription initiation. The resulting concepts can be formalized by ontologies for computational modelling, laying the foundation for improved bioinformatics tools, knowledge-based resources and scientific communication. Thus, this work will help researchers construct better predictive models, with different formalisms, that will be useful in engineering, synthetic biology, microbiology and genetics.
On behalf of the Health and Environmental Research Advisory Committee (HERAC), I am pleased to submit to you the enclosed Report on the Human Genome Initiative. This was prepared by a subcommittee under the chairmanship of Dr. Ignacio Tinoco, University of California, Berkeley, and is in response to a charge by you. It has been strongly endorsed by the parent committee. The report urges DOE and the Nation to commit to a large, multi-year, multidisciplinary, technological undertaking to order and sequence the human genome. This effort will first require significant innovation in general capability to manipulate DNA, major new analytical methods for ordering and sequencing, theoretical developments in computer science and mathematical biology, and great expansions in our ability to store and manipulate the information and to interface it with other large and diverse genetic databases. The actual ordering and sequencing involves the coordinated processing of some 3 billion bases from a reference human genome. Science is poised on the rudimentary edge of being able to read and understand human genes. A concerted, broadly based, scientific effort to provide new methods of sufficient power and scale should transform this activity from an inefficient one-gene-at-a-time, single laboratory effort into a coordinated, worldwide, comprehensive reading of "the book of man". The effort will be extraordinary in scope and magnitude, but so will be the benefit to biological understanding, new technology and the diagnosis and treatment of human disease.
Propanediol is degraded by a B12-dependent pathway in Salmonella typhimurium. The enzymes for this pathway are encoded in a small region (minute 41) that includes the pdu operon (controlling B12-dependent degradation of propanediol) and the divergent cob operon (controlling synthesis of cobalamin, B12). Expression of both operons is induced by propanediol and globally controlled by the ArcA and Crp systems. The region between the two operons encodes two proteins, PduF, a transporter of propanediol, and PocR, which mediates the induction of the regulon by propanediol. Insertion mutations between the pdu and cob operons have been characterized, and their exact positions have been correlated with mutant phenotypes. The region includes five promoters, four of which are controlled by the PocR protein and induced by propanediol. The cob and pdu operons each have one regulated promoter; the pduF gene is expressed from two regulated promoters (P1 and P2). The P1 and P2 transcripts extend be...
We have developed a system which uses neural networks and dynamic programming (DP) to identify protein coding regions in genomic DNA sequences. Nine scores are calculated on all subintervals of the sequence which evaluate the likelihood that the subinterval belongs to one of four classes: first, last or internal exon, or intron. These scores are weighted by a neural network and used as input to a DP algorithm. DP is used to find the highest scoring combination of introns and exons subject to a few simple constraints on gene structure. The neural network weights are optimized by training on input vectors which measure the difference between the predicted optimal solution by DP and the biologically correct solution. The system is trained by maximizing the difference between the correct parse and a sample of incorrect parses. On a test set of genomic sequences from GenBank, we obtained correlation coefficients for exon nucleotide prediction as high as 0.94. This is superior to the results obtained by purely rule-based systems.
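The DP step described in this abstract can be illustrated with a minimal sketch. It is not the published system: it assumes only two segment labels (exon and intron) rather than the four classes, replaces the neural-network-weighted scores with caller-supplied functions, and imposes a single structural constraint (the parse starts and ends with an exon and the labels alternate). The names `best_parse`, `exon_score` and `intron_score` are hypothetical.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, int, int]  # (label, start, end), end exclusive
Score = Callable[[int, int], float]


def best_parse(n: int, exon_score: Score, intron_score: Score) -> Tuple[float, List[Segment]]:
    """Highest-scoring alternating exon/intron parse of positions [0, n).

    Assumption for this sketch: the parse begins and ends with an exon.
    """
    neg = float("-inf")
    # best_e[j] / best_i[j]: (score, start of last segment) for the best
    # parse of the prefix [0, j) whose last segment is an exon / intron.
    best_e = [(neg, -1)] * (n + 1)
    best_i = [(neg, -1)] * (n + 1)
    for j in range(1, n + 1):
        cand_e = (exon_score(0, j), 0)  # a single exon covering the prefix
        cand_i = (neg, -1)
        for i in range(1, j):
            if best_i[i][0] > neg:
                s = best_i[i][0] + exon_score(i, j)
                if s > cand_e[0]:
                    cand_e = (s, i)
            s = best_e[i][0] + intron_score(i, j)
            if s > cand_i[0]:
                cand_i = (s, i)
        best_e[j], best_i[j] = cand_e, cand_i

    # Trace back from the final exon, alternating labels.
    segments: List[Segment] = []
    j, label = n, "exon"
    while j > 0:
        i = (best_e if label == "exon" else best_i)[j][1]
        segments.append((label, i, j))
        j, label = i, ("intron" if label == "exon" else "exon")
    segments.reverse()
    return best_e[n][0], segments
```

With a toy per-position signal (positive values favouring exon, negative values favouring intron), the scores can be passed as `lambda i, j: sum(signal[i:j])` and its negation. The double loop is quadratic in sequence length, which matches the idea of scoring all subintervals but would need pruning for long genomic sequences.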
Covariation analysis of sets of aligned sequences of protein molecules is successful in certain instances in elucidating certain structural and functional links [Korber(1993)], but in general, pairs of sites displaying highly covarying mutations in protein sequences do not necessarily correspond to sites that are spatially close in the protein structure [Gobel(1994)], [Clarke(1995)], [Shindyalov(1994)], [Thomas(1996)], [Taylor(1994)], [Neher(1994)]. In contrast, covariation analysis of sets of aligned sequences for RNA molecules is relatively successful in elucidating RNA secondary structure, as well as some aspects of tertiary structure [Gutell(1992)]. The goals of this paper are to (1) present the problem, (2) develop the mathematical formalism for solving the problem, and (3) validate the resulting algorithms on simulated data. Extensive application to biological sequences will be presented elsewhere.
bioRxiv (Cold Spring Harbor Laboratory), Mar 14, 2024
Paired-class homeodomain transcription factors (HD TFs) play essential roles in vertebrate development, and their mutations are linked to human diseases. One unique feature of paired-class HD is cooperative dimerization on specific palindrome DNA sequences. Yet, the functional significance of HD cooperative dimerization in animal development and its dysregulation in diseases remain elusive. Using the retinal TF Cone-rod Homeobox (CRX) as a model, we have studied how blindness-causing mutations in the paired HD, p.E80A and p.K88N, alter CRX's cooperative dimerization and lead to gene misexpression and photoreceptor developmental deficits in a dominant manner. CRX E80A maintains binding at monomeric WT CRX motifs but is deficient in cooperative binding at dimeric motifs. CRX E80A's cooperativity defect impacts the exponential increase of photoreceptor gene expression in terminal differentiation and produces immature, nonfunctional photoreceptors in the Crx E80A retinas. CRX K88N is highly cooperative and localizes to ectopic genomic sites with strong enrichment of dimeric HD motifs. CRX K88N's altered biochemical properties disrupt CRX's ability to direct dynamic chromatin remodeling during development to activate photoreceptor differentiation programs and silence progenitor programs. Our study here provides in vitro and in vivo molecular evidence that paired-class HD cooperative dimerization regulates neuronal development and dysregulation of cooperative binding contributes to severe dominant blinding retinopathies.
We have identified the binding site on the bacteriophage T4 gene 32 mRNA responsible for autogenous translational regulation. We demonstrate that this site is largely unstructured and overlaps the initiation codon of gene 32 as previously predicted. Cooperative binding of gene 32 protein to this site specifically blocks the formation of 30 S-tRNAy-gene 32 mRNA ternary complexes and initiation of translation. The translational operator is bound cooperatively by gene 32 protein and this binding is facilitated by a nucleation site far upstream from the initiation codon. A similar unstructured mRNA lacking this nucleation site is also bound cooperatively, but only at concentrations of gene 32 protein higher than those needed to repress binding of ribosomes to the gene 32 mRNA. Some sequence-specific interactions may also influence this binding. Comparison of the bacteriophage T2, T4 and T6 gene 32 operator sequences leads us to propose that the nucleation site is a pseudoknot.
Matrices can be used to evaluate sequences for functional activity. Multiple regression can solve for the matrix that gives the best fit between sequence evaluations and quantitative activities. This analysis shows that the best model for context effects on suppression by suS involves primarily the two nucleotides 3' to the amber codon, and that their contributions are independent and additive. Context effects on 2AP mutagenesis also involve the two nucleotides 3' to the 2AP insertion, but their effects are not independent. In a construct for producing β-galactosidase, the effects on translational yields of the tri-nucleotide 5' to the initiation codon are dependent on the entire triplet. Models based on these quantitative results are presented for each of the examples.
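As an illustration of fitting a matrix by multiple regression, the sketch below solves an ordinary least-squares problem for an independent, additive model. It is an assumption-laden example rather than the authors' procedure: base A is arbitrarily treated as the reference at every position (so each position contributes three indicator columns plus a shared intercept), the normal equations are solved by plain Gaussian elimination, and the names `encode`, `solve` and `fit_matrix` are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def encode(seq):
    """Intercept plus indicator columns for C, G, T at each position (A is the reference)."""
    row = [1.0]
    for base in seq:
        row.extend(1.0 if base == b else 0.0 for b in BASES[1:])
    return row


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_matrix(seqs, activities):
    """Least-squares weights for an additive model via the normal equations X'X w = X'y."""
    X = [encode(s) for s in seqs]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(X, activities)) for i in range(k)]
    return solve(xtx, xty)


if __name__ == "__main__":
    # Hypothetical data: all 16 dinucleotide contexts with made-up activities.
    seqs = ["".join(p) for p in product(BASES, repeat=2)]
    activities = [float(len(set(s))) for s in seqs]
    print(fit_matrix(seqs, activities))
```

A large residual under this additive model would suggest that the positions do not contribute independently, which is the distinction the abstract draws between the suppression and 2AP mutagenesis data; modelling such cases would need interaction terms (for example, one column per dinucleotide or triplet).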
INTRODUCTION. Regulation of protein synthesis at the level of translation is one of the many ways cells control the concentrations of gene products. We define (for this review) translational regulation to be the variation from “constitutive” levels of translation caused by the action of specific proteins on specific mRNAs. The constitutive efficiencies of translation of mRNAs are determined by both the sequence and the structure of their initiation regions (Steitz 1979; Gold et al. 1981; Stormo et al. 1982). mRNAs are in excess of ribosomes; i.e., free ribosomes are less abundant than are open initiation regions on mRNA. Selection of messages for translation occurs according to the relative strengths of their initiation domains. Primary sequence features important for recognition by ribosomes include the polypurine Shine and Dalgarno region, the initiation codon, and other nearby nucleotides (Gold et al. 1981). Initiation regions possess varying degrees of structure, from the unstruc...
Post-transcriptional autoregulation of gene expression is common in bacterial systems but many fe... more Post-transcriptional autoregulation of gene expression is common in bacterial systems but many fewer examples are known in eukaryotes. We used the yeast collection of genes fused to GFP as a rapid screen for examples of feedback regulation in ribosomal proteins by overexpressing a non-regulatable version of a gene and observing the effects on the expression of the GFP-fused version. We tested 95 ribosomal protein genes and found that 21 of them showed at least a threefold repression. Two genes, RPS22B and RPL1B, showed over a 10-fold repression. In both cases the cis-regulatory segment resides in the 5' UTR of the gene as shown by placing that segment of the mRNA upstream of GFP alone and demonstrating it is sufficient to cause repression of GFP when the protein is over-expressed. Further analyses showed that the intron in the 5' UTR of RPS22B is required for regulation, presumably because the protein inhibits splicing that is necessary for translation. The 5' UTR of RPL1B contains a sequence and structure motif that is conserved in the binding sites of Rpl1 orthologs from bacteria to mammals, and mutations within the motif eliminate repression. .
The goal of this chapter is to provide evidence and justification for the hypothesis that autogen... more The goal of this chapter is to provide evidence and justification for the hypothesis that autogenous, posttranscriptional regulation of gene expression is common. Several examples are known, mostly from bacteria, bacteriophage, and yeast species. Each was identified either by accident or by a concerted effort to understand the regulation of specific genes. The paucity of examples from higher eukaryotes may be due to the difficulty of identifying them using common approaches for uncovering regulatory interactions. An alternative approach is proposed that can fill the gap.
The human genome contains about 800 C2H2 zinc finger proteins (ZFPs), and most of them are compos... more The human genome contains about 800 C2H2 zinc finger proteins (ZFPs), and most of them are composed of long arrays of zinc fingers. Standard ZFP recognition model asserts longer finger arrays should recognize longer DNA-binding sites. However, recent experimental efforts to identify in vivo ZFP binding sites contradict this assumption, with many exhibiting short motifs. Here we use ZFY, CTCF, ZIM3, and ZNF343 as examples to address three closely related questions: What are the reasons that impede current motif discovery methods? What are the functions of those seemingly unused fingers and how can we improve the motif discovery algorithms based on long ZFPs’ biophysical properties? Using ZFY, we employed a variety of methods and find evidence for ‘dependent recognition’ where downstream fingers can recognize some previously undiscovered motifs only in the presence of an intact core site. For CTCF, high-throughput measurements revealed its upstream specificity profile depends on the s...
Motivation Motifs play a crucial role in computational biology, as they provide valuable informat... more Motivation Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatoric or probabilistic approaches, which can be biased by heuristics such as substring-masking for multiple motif discovery. In recent years, deep neural networks have become increasingly popular for motif discovery, as they are capable of capturing complex patterns in data. Nonetheless, inferring motifs from neural networks remains a challenging problem, both from a modeling and computational standpoint, despite the success of these networks in supervised learning tasks. Results We present a principled representation learning approach based on a hierarchical sparse representation for motif discovery. Our method effectively discovers gapped, long, or overlapping motifs that we show to commonly exist in next-generation sequencing datasets, in addition to the short a...
We present a principled representation learning approach based on convolutional dictionary learni... more We present a principled representation learning approach based on convolutional dictionary learning (CDL) for motif discovery. We unroll an iterative algorithm that optimizes CDL as a forward pass in a neural network, resulting in a network that is fully interpretable, fast, and capable of finding motifs in large datasets. Simulated data show that our network is more sensitive and specific for discovering binding sites that exhibit complex binding patterns than popular motif discovery methods such as STREME and HOMER. Our network reveals statistically significant motifs and their diverse binding modes from the JASPAR database that are currently not reported.
Table S2. Effect of Method for Generating Negative Sequences on Training and Testing Scores. (DOC... more Table S2. Effect of Method for Generating Negative Sequences on Training and Testing Scores. (DOCX 13 kb)
)of the promoter The Role of the thi Box. The mRNA for the [A made. Regu- thiCOGE operon in Rhizo... more )of the promoter The Role of the thi Box. The mRNA for the [A made. Regu- thiCOGE operon in Rhizobium etli conn to be posttran- tains a 211-base leader with two features The amount of that are likely to be important. One is a ie promoter ap- hairpin structure that is just 5' of the )resence of thia- thiC-initiating AUG and overlapping the of the mRNA is ribosome-binding site (RBS), where one tor structure in expects it would inhibit ribosome binding. ie of the operon, The other feature is called the "thi box," :he trp operon is which is a site of 38 bases that is highly onuation process conserved in the region 5' to the start of P protein binds many genes, from many species, that are nd to the mRNA involved in thiamin biosynthesis. That the ire termination, thi box is important in the mRNA, rather of all the genes than the DNA, is demonstrated by comsynthesis. E. coli parison of the different occurrences of it. process as a sec- Only about half of the positions are comth the ribosome pletely conserved, but all of the sites can of tryptophan in fold into the same RNA secondary struconcentration of ture because of compensating changes that maintain complementarity. Because the attenuating this structure is conserved in different
Transcription factor-driven cell fate engineering in pluripotency induction, transdifferentiation... more Transcription factor-driven cell fate engineering in pluripotency induction, transdifferentiation, and forward reprogramming requires efficiency, speed, and maturity for widespread adoption and clinical translation. Here, we used Oct4, Sox2, Klf4, and c-Myc driven pluripotency reprogramming to evaluate methods for enhancing and tailoring cell fate transitions, through directed evolution with iterative screening of pooled mutant libraries and phenotypic selection. We identified an artificially evolved and enhanced POU factor (ePOU) that substantially outperforms wild-type Oct4 in terms of reprogramming speed and efficiency. In contrast to Oct4, not only can ePOU induce pluripotency with Sox2 alone, but it can also do so in the absence of Sox2 in a three-factor ePOU/Klf4/c-Myc cocktail. Biochemical assays combined with genome-wide analyses showed that ePOU possesses a new preference to dimerize on palindromic DNA elements. Yet, the moderate capacity of Oct4 to function as a pioneer fa...
Despite enormous progress in understanding the fundamentals of bacterial gene regulation, our kno... more Despite enormous progress in understanding the fundamentals of bacterial gene regulation, our knowledge remains limited when compared with the number of bacterial genomes and regulatory systems to be discovered. Derived from a small number of initial studies, classic definitions for concepts of gene regulation have evolved as the number of characterized promoters has increased. Together with discoveries made using new technologies, this knowledge has led to revised generalizations and principles. In this Expert Recommendation, we suggest precise, updated definitions that support a logical, consistent conceptual framework of bacterial gene regulation, focusing on transcription initiation. The resulting concepts can be formalized by ontologies for computational modelling, laying the foundation for improved bioinformatics tools, knowledge-based resources and scientific communication. Thus, this work will help researchers construct better predictive models, with different formalisms, that will be useful in engineering, synthetic biology, microbiology and genetics.
On behalf of the Health and Environmental Research Advisory Committee (HERAC), I am pleased to su... more On behalf of the Health and Environmental Research Advisory Committee (HERAC), I am pleased to submit to you the enclosed Report on the Human Genome Initiative. This was prepared by a subcommittee under the chairmanship of Dr. Ignacio Tinoco. University of California. Berkeley. and is in response to a charge by you. It has been strongly endorsed by the parent committee. The report urges DOE and the Nation to commit to a large. multi-year. multidisciplinary. technological undertaking to order and sequence the human genome. This effort will first require significant innovation in general capability to manipulate DNA. major new analytical methods for ordering and sequencing. theoretical developments in computer science and mathematical biology, and great expansions in our ability to store and manipulate the information and to interface it with other large and diverse genetic databases. The actual ordering and sequencing involves the coordinated processing of some 3 billion bases from a reference human genome. Science is poised on the rudimentary edge of being able to read and understand human genes. A concerted. broadly based. scientific effort to provide new methods of sufficient power and scale should transform this activity from an inefficient one-gene-at-a-time. single laboratory effort into a coordinated. worldwide. comprehensive reading of "the book of man". The effort will be extraordinary in scope and magnitude. but so will be the benefit to biological understanding. new technology and the diagnosis and treatment of human disease.
Propanediol is degraded by a B12-dependent pathway in Salmonella typhimurium. The enzymes for thi... more Propanediol is degraded by a B12-dependent pathway in Salmonella typhimurium. The enzymes for this pathway are encoded in a small region (minute 41) that includes the pdu operon (controlling B12-dependent degradation of propanediol) and the divergent cob operon (controlling synthesis of cobalamin, B12). Expression of both operons is induced by propanediol and globally controlled by the ArcA and Crp systems. The region between the two operons encodes two proteins, PduF, a transporter of propanediol, and PocR, which mediates the induction of the regulon by propanediol. Insertion mutations between the pdu and cob operons have been characterized, and their exact positions have been correlated with mutant phenotypes. The region includes five promoters, four of which are controlled by the PocR protein and induced by propanediol. The cob and pdu operons each have one regulated promoter; the pduF gene is expressed from two regulated promoters (P1 and P2). The P1 and P2 transcripts extend be...
We have developed a system which uses neural networks and dynamic programming (DP) to identify pr... more We have developed a system which uses neural networks and dynamic programming (DP) to identify protein coding regions in genomic DNA sequences. Nine scores are calculated on all subintervals of the sequence which evaluate the likelihood that the subinterval belongs to one of four classes; first, last or internal exon or intron. These scores are weighted by a neural network and used as input to a DP algorithm. DP is used to find the highest scoring combination of introns and exons subject to a few simple constraints on gene structure. The neural network weights are optimized by training on input vectors which measure the difference between the predicted optimal solution by DP and the biologically correct solution. The system is trained by maximizing the difference between the correct parse and a sample of incorrect parses. On a test set of genomic sequences from GenBank, we obtained correlation coefficients for exon nucleotide prediction as high as 0.94. This is superior to the results obtained by purely rule-based systems.
Covariation analysis of sets of aligned sequences of protein" molecules is successful 'in certain... more Covariation analysis of sets of aligned sequences of protein" molecules is successful 'in certain instances in elucidating certain structural and functional links [Korber(1993)], but in general, pairs of sites displaying highly covarying mutations in protein sequences do not necessarily correspond to sites that are spatially close in the protein structure [Gobel(1994)], [Clarke(1995)], [Shindyalov(1994)], [Thomas(1996)], [Taylor(1994)], [Neher(1994)]. In contrast, covariation analysis of sets of aligned sequences for RNA molecules is relatively successful in elucidating RNA secondary structure, as well as some aspects of tertiary structure [Gutell(1992)]. The goals of this paper are to (1) present the problem, (2) develop the mathematical formalism-for solving the problem, and (3) validate the resulting algorithms on simulated data. Extensive application to biological sequences will be presented elsewhere.
bioRxiv (Cold Spring Harbor Laboratory), Mar 14, 2024
Paired-class homeodomain transcription factors (HD TFs) play essential roles in vertebrate development, and their mutations are linked to human diseases. One unique feature of paired-class HD is cooperative dimerization on specific palindrome DNA sequences. Yet, the functional significance of HD cooperative dimerization in animal development and its dysregulation in diseases remain elusive. Using the retinal TF Cone-rod Homeobox (CRX) as a model, we have studied how blindness-causing mutations in the paired HD, p.E80A and p.K88N, alter CRX's cooperative dimerization and lead to gene misexpression and photoreceptor developmental deficits in a dominant manner. CRX E80A maintains binding at monomeric WT CRX motifs but is deficient in cooperative binding at dimeric motifs. CRX E80A's cooperativity defect impacts the exponential increase of photoreceptor gene expression in terminal differentiation and produces immature, nonfunctional photoreceptors in the Crx E80A retinas. CRX K88N is highly cooperative and localizes to ectopic genomic sites with strong enrichment of dimeric HD motifs. CRX K88N's altered biochemical properties disrupt CRX's ability to direct dynamic chromatin remodeling during development to activate photoreceptor differentiation programs and silence progenitor programs. Our study here provides in vitro and in vivo molecular evidence that paired-class HD cooperative dimerization regulates neuronal development and dysregulation of cooperative binding contributes to severe dominant blinding retinopathies.
We have identified the binding site on the bacteriophage T4 gene 32 mRNA responsible for autogenous translational regulation. We demonstrate that this site is largely unstructured and overlaps the initiation codon of gene 32 as previously predicted. Cooperative binding of gene 32 protein to this site specifically blocks the formation of 30 S-tRNAfMet-gene 32 mRNA ternary complexes and initiation of translation. The translational operator is bound cooperatively by gene 32 protein and this binding is facilitated by a nucleation site far upstream from the initiation codon. A similar unstructured mRNA lacking this nucleation site is also bound cooperatively, but only at concentrations of gene 32 protein higher than those needed to repress binding of ribosomes to the gene 32 mRNA. Some sequence-specific interactions may also influence this binding. Comparison of the bacteriophage T2, T4 and T6 gene 32 operator sequences leads us to propose that the nucleation site is a pseudoknot.
Matrices can be used to evaluate sequences for functional activity. Multiple regression can solve for the matrix that gives the best fit between sequence evaluations and quantitative activities. This analysis shows that the best model for context effects on suppression by suS involves primarily the two nucleotides 3' to the amber codon, and that their contributions are independent and additive. Context effects on 2AP mutagenesis also involve the two nucleotides 3' to the 2AP insertion, but their effects are not independent. In a construct for producing β-galactosidase, the effects on translational yields of the tri-nucleotide 5' to the initiation codon are dependent on the entire triplet. Models based on these quantitative results are presented for each of the examples.
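The distinction between independent, additive contributions and non-independent ones can be sketched numerically. The example below is an illustration under stated assumptions, not the regression procedure from the paper: it assumes a complete, balanced set of all 16 dinucleotides, for which the least-squares additive matrix reduces to per-position means, and the activity values, `BASES`, and `fit_additive` are made up for the example.

```python
from itertools import product

BASES = "ACGU"

def fit_additive(activity):
    """Fit an additive two-position matrix to dinucleotide activities.

    `activity` maps every dinucleotide (e.g. 'AG') to a measured value.
    For a complete balanced design, the least-squares additive fit is
    grand mean + per-position base effects (base mean minus grand mean).
    Returns (grand_mean, effects, max_abs_residual).
    """
    grand = sum(activity.values()) / len(activity)
    effects = []
    for pos in range(2):
        eff = {}
        for b in BASES:
            vals = [v for k, v in activity.items() if k[pos] == b]
            eff[b] = sum(vals) / len(vals) - grand
        effects.append(eff)
    predicted = {
        k: grand + effects[0][k[0]] + effects[1][k[1]] for k in activity
    }
    residual = max(abs(activity[k] - predicted[k]) for k in activity)
    return grand, effects, residual

# Hypothetical data built to be exactly additive across the two positions.
p0 = {"A": 1.0, "C": 2.0, "G": 3.0, "U": 4.0}
p1 = {"A": 0.5, "C": 0.0, "G": -0.5, "U": 1.0}
additive = {a + b: p0[a] + p1[b] for a, b in product(BASES, repeat=2)}
print(fit_additive(additive)[2])      # residual near zero: positions independent

# Perturb one dinucleotide to mimic a context-dependent interaction.
coupled = dict(additive)
coupled["AA"] += 2.0
print(fit_additive(coupled)[2])       # large residual: additive matrix fits poorly
```

A residual near zero means a single matrix of independent position effects explains the data. A large residual is the kind of signal that would point to non-independent effects, as the abstract reports for 2AP mutagenesis and the initiation-codon triplet.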
INTRODUCTION Regulation of protein synthesis at the level of translation is one of the many ways cells control the concentrations of gene products. We define (for this review) translational regulation to be the variation from “constitutive” levels of translation caused by the action of specific proteins on specific mRNAs. The constitutive efficiencies of translation of mRNAs are determined by both the sequence and the structure of their initiation regions (Steitz 1979; Gold et al. 1981; Stormo et al. 1982). mRNAs are in excess of ribosomes; i.e., free ribosomes are less abundant than are open initiation regions on mRNA. Selection of messages for translation occurs according to the relative strengths of their initiation domains. Primary sequence features important for recognition by ribosomes include the polypurine Shine and Dalgarno region, the initiation codon, and other nearby nucleotides (Gold et al. 1981). Initiation regions possess varying degrees of structure, from the unstruc...
Papers by Gary Stormo