Summary: Amplicons to Global Gene (A2G2) is a Python wrapper that uses MAFFT and an “Amplicon to Gene” strategy to align very large numbers of sequences while improving alignment accuracy. It was developed specifically for conserved genes, where traditional aligners introduce a significant number of gaps. A2G2 leverages the add-sequences option of MAFFT to align the sequences to a global reference gene and a local reference region. Both of these references can be consensus sequences from trusted sources. Efficient parallelization of these tasks allows A2G2 to align a very large number of sequences (> 500K) in a reasonable amount of time. A2G2 can be imported in Python for easier integration with other software, or run from the command line. Availability: A2G2 is implemented in Python 3 (3.6) and depends on MAFFT being available. Other package requirements can be found in the requirements.txt file at https://github.com/jshleap/A2G. A2G2 is also available via PyPI (https://pypi.org/pr...
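As a rough illustration of the add-sequences step described in this abstract, the sketch below only builds a MAFFT command line; it does not run MAFFT and it is not A2G2's own code or API. The function name, file names, and the choice to pass `--keeplength` are assumptions made for the example; `--add`, `--keeplength`, and `--thread` are standard MAFFT options.

```python
from typing import List


def build_mafft_add_command(
    reference_alignment: str,
    new_sequences: str,
    threads: int = 1,
    keep_length: bool = True,
) -> List[str]:
    """Build (but do not run) a MAFFT call that adds sequences to a reference.

    Hypothetical helper for illustration; A2G2's real interface may differ.
    """
    command = ["mafft", "--thread", str(threads), "--add", new_sequences]
    if keep_length:
        # --keeplength asks MAFFT to keep the reference alignment length fixed.
        command.append("--keeplength")
    # The existing (reference) alignment is the final positional argument.
    command.append(reference_alignment)
    return command


if __name__ == "__main__":
    # File names are placeholders, not files shipped with A2G2.
    cmd = build_mafft_add_command("global_reference.fasta", "amplicons.fasta", threads=4)
    print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` with standard output redirected to a file, since MAFFT writes the alignment to standard output.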
Abstract: The effective use of metabarcoding in biodiversity science has brought important analytical challenges due to the need to generate accurate taxonomic assignments. The assignment of sequences to the genus or species level is critical for biodiversity surveys and biomonitoring, but it is particularly challenging. Researchers must select the approach that best recovers information on species composition. This study evaluates the performance and accuracy of seven methods in recovering the species composition of mock communities that vary in species number and specimen abundance, while holding upstream molecular and bioinformatic variables constant. It also evaluates the impact of parameter optimization on the quality of the predictions. Despite the general belief that BLAST top hit underperforms newer methods, our results indicate that it competes well with more complex approaches if optimized for the mock community under study. For example, the two machine learning methods tha...
Assessing protein modularity is important for understanding protein evolution. We propose a graph-theory approach with significance and power testing to identify sub-domain architecture (SDA): (1) clusters are determined by maximizing the modularity score; (2) clusters are tested for significance. We present a methodology that identifies SDA in a way that is robust, biologically meaningful, and statistically supported. Robustness is tested using simulated data with known modularity. Modules are correctly identified even with low correlation within a module. We also analyzed two real datasets: the amylase catalytic domain and the NPC1 protein N-terminal domain. The amylase contains a TIM barrel with polysaccharide cleavage sites and a calcium-binding domain. In this dataset we identified four robust evolutionary modules, one of which forms the minimal functional TIM barrel. The NPC1 protein is involved in intracellular lipid metabolism, coordinating sterol trafficking. NPC1 is the first luminal domain ...
There is a relationship between function, stability and the evolutionary process. One barrier to establishing this relationship is defining functional and evolutionary units, or modules, in 3D data. It is possible to evaluate the correlation of residues across a sample of aligned homologous structures. Here, we define an evolutionary module as a subset of residues that correlate significantly more within the set than with the rest of the configuration across samples. We develop a graph-theoretic approach based on community detection, which identifies clusters of landmarks by optimizing the community modularity index (Q) on coordinate correlation data. In this approach, the nodes of the graph are landmarks, and an edge is drawn only if a significant correlation is found and it is above a certain threshold. We further present a strategy to test the significance of these clusters, assess the statistical power of the clustering and test the robustness of the final results against sampling error. This ...
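The graph construction and the modularity index (Q) mentioned in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes an unweighted graph, uses only a correlation threshold (the significance test on each correlation is omitted), and evaluates Newman's Q for a partition that is already given instead of searching for the optimal one. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def threshold_graph(corr: List[List[float]], threshold: float) -> Set[Tuple[int, int]]:
    """Return undirected edges (i, j) with i < j where |corr[i][j]| >= threshold."""
    n = len(corr)
    return {(i, j) for i, j in combinations(range(n), 2) if abs(corr[i][j]) >= threshold}


def modularity(edges: Set[Tuple[int, int]], membership: Dict[int, int]) -> float:
    """Newman's Q for an unweighted, undirected graph and a given partition.

    Q = sum over communities c of [ L_c / m - (d_c / (2 m))^2 ], where m is the
    number of edges, L_c the edges inside c, and d_c the summed degree of c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal: Dict[int, int] = {}
    degree_sum: Dict[int, int] = {}
    for i, j in edges:
        ci, cj = membership[i], membership[j]
        # Each edge adds one to the degree of both of its endpoints.
        degree_sum[ci] = degree_sum.get(ci, 0) + 1
        degree_sum[cj] = degree_sum.get(cj, 0) + 1
        if ci == cj:
            internal[ci] = internal.get(ci, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree_sum.items())


if __name__ == "__main__":
    # Toy correlation matrix: two tightly correlated triplets joined by one link.
    corr = [[0.1] * 6 for _ in range(6)]
    for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
        corr[i][j] = corr[j][i] = 0.9
    edges = threshold_graph(corr, threshold=0.5)
    print(modularity(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}))
```

A community-detection routine would search over partitions for the one with the highest Q; the function above only scores a candidate partition.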
Here we define modules as substructures whose shape variation across samples of homologous proteins is significantly small. These modules are relevant to understanding the interdependence across sites in evolving protein folds. Our approach is inspired by the analysis of shapes in geometric morphometrics, but applied to molecular data in three dimensions. We demonstrate that the classic approach to testing modularity is inadequate as a strategy to define modules in protein structures. Here we present a graph-theoretic approach based on community detection. This approach identifies positions in tertiary structures that correlate across homologous structures. We further present a strategy to test the significance of these modules. This novel approach is applied to both classic animal shape datasets and Homstrad multiple protein alignments. Introduction: The evolution of shapes is a key issue of interest in understanding the structural determinants of evolution [1,2]. In biolog...
Defining modules in biological datasets is of increasing interest in several branches of the biological and medical sciences. Here, an evolutionary module is defined as a subset of landmarks that correlate significantly more with each other than with the rest of the configuration (i.e. shape) in an evolutionary sampling (sampling across a phylogenetic tree). These modules are relevant to understanding the paths of phenotypic evolution in animal and plant shapes as well as in protein structures. We demonstrate that the classical approach to testing modularity is inadequate as a strategy to define modules in datasets with a high number of variables (e.g. protein structures). Here we present a graph-theoretic approach based on community detection. This approach identifies clusters of landmarks that correlate across shapes. We further present a strategy to test the significance of these clusters and therefore their definition as modules. This novel approach is applied to simulations, classic an...
A central strategy in structural biology is to find the optimal alignment for a set of many homologous protein structures. This task is not trivial and still does not scale well to large datasets. Several structural alignment programs have been developed in the last few years. They mainly differ in the definition of homology or in the optimization of the fit among structures. We propose here a practical and scalable strategy to align large datasets of protein structures. This strategy is based on aligning n-1 structures against a single reference structure. Here we show that 1) the choice of reference from a dataset significantly affects the overall RMSD of the alignment; 2) although searching for the best reference in big datasets is an O(An^2) problem (A being the complexity of a pairwise alignment), it is possible to define a heuristic to select a suitable reference structure from a large dataset. We test these using a large dataset, the GP120 family, which is rich in disord...
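To make the O(An^2) cost of the exhaustive reference search concrete, here is a minimal sketch that scores every candidate reference by its summed pairwise distance to all other structures. It is an illustration only: the selection criterion (lowest summed distance), the function name, and the toy distance are assumptions, and the heuristic proposed in the paper is not reproduced here. In practice `distance` would be the RMSD returned by a pairwise structural alignment.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_reference(items: Sequence[T], distance: Callable[[T, T], float]) -> int:
    """Return the index of the item with the lowest summed distance to the rest.

    Calls `distance` once per unordered pair, i.e. n * (n - 1) / 2 times, which
    is where the quadratic factor in O(A n^2) comes from.
    """
    n = len(items)
    if n == 0:
        raise ValueError("need at least one structure")
    totals = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(items[i], items[j])  # one pairwise alignment, cost A
            totals[i] += d
            totals[j] += d
    # Ties are resolved in favour of the lowest index.
    return min(range(n), key=totals.__getitem__)


if __name__ == "__main__":
    # Toy stand-in: "structures" are numbers and the distance is |a - b|.
    toy_structures = [0.0, 1.0, 2.0, 3.0, 10.0]
    print(best_reference(toy_structures, lambda a, b: abs(a - b)))
```

For n structures this performs n(n-1)/2 pairwise comparisons, which is why a cheaper heuristic becomes attractive for large datasets.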
Urotrygon rogersi, a stingray of the order Myliobatiformes with benthic habits and a trophic preference for crustaceans, is frequently caught in trawling fisheries. Such frequent capture could change the morphometric relationships (possible associations between variables) in this species; therefore, the 41 morphometric variables most commonly used in batoids were studied. Collinearity and correlation were evaluated among mothers, among offspring, and between offspring and mothers. The variables were subjected to descriptive statistics, variable transformation, collinearity tests, multivariate statistics, and linear and polynomial regressions. In this study, 83% of the variables showed collinearity with high values of the variance inflation factor (ranging from 5.69 to 606.72), which can affect the phenotypic variance decomposition in quantitative analyses. Despite this, all variables (except 6 in mothers and 4 in pups) were linear, which makes the estimation of quantitative parameters easier. In 2...
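For readers unfamiliar with the variance inflation factor (VIF) cited in this abstract, the sketch below computes it in the simplest case of two variables, where VIF = 1 / (1 - r^2) and r is the Pearson correlation between them. It is an illustration under that simplifying assumption, not the study's procedure: with 41 variables, each VIF would instead come from regressing one variable on all the others. The function names and data are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_variables(x: Sequence[float], y: Sequence[float]) -> float:
    """VIF of either variable when the other is the only additional predictor."""
    r_squared = pearson_r(x, y) ** 2
    if r_squared >= 1.0:
        # Perfect collinearity: the variance inflation is unbounded.
        return float("inf")
    return 1.0 / (1.0 - r_squared)


if __name__ == "__main__":
    # Made-up measurements, not data from the study.
    length = [1.0, 2.0, 3.0, 4.0, 5.0]
    width = [2.0, 1.0, 4.0, 3.0, 5.0]
    print(vif_two_variables(length, width))
```

A VIF of 1 means no collinearity, and the value grows without bound as the two variables approach a perfect linear relationship.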
Huperzia brevifolia is one of the dominant species of the genus Huperzia living in the paramos and superparamos of the Colombian Andes. A detailed study of sporangium ontogeny and sporogenesis was carried out using specimens collected at 4200 m above sea level in Parque Natural Nacional El Cocuy, Colombia. Small pieces of caulinar axis bearing sporangia were fixed, dehydrated, paraffin embedded, sectioned in a rotary microtome, and stained using the common Safranin O-Fast Green technique; hand-cut cross sections were also made and stained with aqueous Toluidine Blue (TBO). The sporangia develop basipetally, a condition that allows observation of all the developmental stages taking place along the caulinar axis of adult plants. Each sporangium originates from a group of epidermal cells axillary to the microphylls. These cells undergo active mitosis and produce new external and internal cellular groups. The sporangium wall and the tapetum originate from the external group of cells, while the internal cellular group gives rise to the sporogenous tissue. Meiosis occurs in the sporocytes and produces tetrads of the simultaneous type, each giving rise to four trilete spores with foveolate ornamentation. During sporangium ripening, the outermost layer of the wall develops anticlinal and inner periclinal thickenings, and the innermost layer functions as a secretory tapetum, which persists until the spores are completely mature. All other cellular layers collapse.
The 5S rDNA gene encodes a non-coding RNA and can be found in two copies (type I and type II) in bony and cartilaginous fish. Previous studies have pointed out that the type II gene is a paralog derived from type I. We analyzed the molecular organization of 5S rDNA type II in elasmobranchs. Although the structure of the 5S rDNA is supposed to be highly conserved, our results show that the secondary structure in this group possesses some variability and differs from the consensus secondary structure. One of these differences in Selachii is an internal loop at nucleotides 7 and 112. These mutations observed in the transcribed region suggest an independent origin of the gene among Batoids and Selachii. All promoters were highly conserved with the exception of BoxA, possibly due to its affinity to polymerase III. This enzyme recognizes a dT4 sequence as a stop signal; however, in Rajiformes this signal was doubled in length to dT8. This could be an adaptation towards a higher efficiency in the termination process. Our results suggest that there is no TATA box in the NTS region in elasmobranchs. We also provide some evidence suggesting that the complexity of the microsatellites present in the NTS region plays an important role in the 5S rRNA gene, since it is significantly correlated with the length of the NTS.
Papers by Jose S Hleap