This paper describes an approach to multiple alignment that can exploit any available secondary structure information. In particular, given the sequences to be aligned, their secondary structure (either available or predicted) is used to perform an ...
Multiple sequence alignment is one of the important research topics of bioinformatics. The objective is to maximize the similarities between the sequences by adding and shuffling gaps. We propose a hybrid algorithm based on genetic algorithms (GAs) and 2-optimal algorithms. We use a permutation coding to represent the solution, and we study the scoring function for multiple alignments, that is used
The metallo-β-lactamases fall into two groups: Ambler class B subgroups B1 and B2 and Ambler class B subgroup B3. The two groups are so distantly related that there is no detectable sequence homology between members of the two different groups, but homology is clearly detectable at the protein structure level. The multiple structure alignment program MAPS has been used to align the structures of eight metallo-β-lactamases and five structurally homologous proteins from the metallo-β-lactamase superfamily, and that alignment has been used to construct a phylogenetic tree of the metallo-β-lactamases. The presence of genes from Eubacteria, Archaebacteria, and Eukaryota on that tree is consistent with a very ancient origin of the metallo-β-lactamase family.
Gene prospection is one of the current challenges for science. Molecular biologists and bioinformaticians need to analyze a large amount of information. Gene prospection still constitutes an area to be explored, mainly in eukaryotes. It combines bioinformatics, through in silico analyses, with molecular biology, through the construction and screening of cDNA libraries. The strategy of gene prospection begins with the construction of cDNA libraries, followed by a search for homology in public databases. Conserved regions are then identified in the selected sequences through multiple alignments. With this information, degenerate primers may be constructed for cDNA screening and used to screen for one or more genes, using labeled probes or DNA amplification by PCR. The spectacular progress of bioinformatics in the last few years, together with growing computational capacity and faster Internet access, has made new analyses possible. The development of m...
We present a new algorithm, based on the multidimensional QR factorization, to remove redundancy from a multiple structural alignment by choosing representative protein structures that best preserve the phylogenetic tree topology of the homologous group. The classical QR factorization with pivoting, developed as a fast numerical solution to eigenvalue and linear least-squares problems of the form Ax = b, was designed to
Since the publication of the first draft of the human genome in 2000, bioinformatic data have been accumulating at an overwhelming pace. Currently, more than 3 million sequences and 35 thousand structures of proteins and nucleic acids are available in public databases. Finding correlations in and between these data to answer critical research questions is extremely challenging. This problem needs to be approached from several directions: information science to organize and search the data; information visualization to assist in recognizing correlations; mathematics to formulate statistical inferences; and biology to analyze chemical and physical properties in terms of sequence and structure changes. Here we present MultiSeq, a unified bioinformatics analysis environment that allows one to organize, display, align and analyze both sequence and structure data for proteins and nucleic acids. While special emphasis is placed on analyzing the data within the framework of evolutionary bio...
We use a quantitative definition of specificity to develop a neural network for the identification of common protein binding sites in a collection of unaligned DNA fragments. We demonstrate the equivalence of the method to maximizing Information Content of the aligned sites when simple models of the binding energy and the genome are employed. The network method subsumes those simple models and is capable of working with more complicated ones. This is demonstrated using a Markov model of the E. coli genome and a sampling method to approximate the partition function. A variation of Gibbs' sampling aids in avoiding local minima.
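The information content that the abstract above refers to can be illustrated with a minimal sketch. It assumes a uniform background frequency of 0.25 per base (the abstract itself uses a Markov model of the E. coli genome) and applies no small-sample correction; the function name `information_content` is hypothetical and is not taken from the paper.

```python
import math
from collections import Counter


def information_content(sites, background=None):
    """Sum over alignment columns of sum_b f_b * log2(f_b / p_b), in bits.

    `sites` is a list of equal-length aligned DNA strings.
    `background` maps each base to its background frequency; the uniform
    default is an assumption made only for this illustration.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    total = 0.0
    for column in range(length):
        counts = Counter(site[column] for site in sites)
        for base, count in counts.items():
            frequency = count / len(sites)
            total += frequency * math.log2(frequency / background[base])
    return total
```

A fully conserved column contributes 2 bits under the uniform background, and a column in which all four bases are equally frequent contributes 0 bits.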
We propose several preprocessing steps to be used before biomarker clustering or classification for high-throughput Mass Spectrometry (MS) data. These preprocessing steps for the mass spectra are multiple alignment of technical replicates, baseline correction and normalization along the mass/charge axis. While the benefits of baseline correction and alignment seem obvious, we studied more carefully the benefit of normalization using some human prostate cancer SELDI TOF MS data (obtained from the Virginia Prostate Center Tissue and body Fluid Bank and approved by the Eastern Virginia Medical School). We show on these data that our global normalization by scaling helps in distinguishing between different cancer groups as well as between cancer and non-cancer groups. We used the between-to-within sum-of-squares ratio introduced by Fisher, as well as visual inspection, to illustrate the improvement brought by the normalization.
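The two quantities named in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: "global normalization by scaling" is taken here to mean rescaling each spectrum so that its total intensity equals the mean total across spectra, which may differ from the authors' exact procedure, and the function names `scale_normalize` and `bw_ratio` are hypothetical.

```python
from statistics import fmean


def scale_normalize(spectra):
    """Rescale each spectrum so its total intensity equals the mean total.

    Assumption: this is one common form of global scaling, not necessarily
    the one used in the paper.
    """
    totals = [sum(spectrum) for spectrum in spectra]
    target = fmean(totals)
    return [[x * target / total for x in spectrum]
            for spectrum, total in zip(spectra, totals)]


def bw_ratio(values, labels):
    """Between-group to within-group sum-of-squares ratio for one feature."""
    overall = fmean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = sum(len(vs) * (fmean(vs) - overall) ** 2 for vs in groups.values())
    within = sum((v - fmean(vs)) ** 2 for vs in groups.values() for v in vs)
    return between / within  # raises ZeroDivisionError if groups have no spread
```

A larger ratio for a given mass/charge position indicates that the groups are better separated at that position relative to their internal variability.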
We determined the nucleotide sequences of blaCARB-4 encoding CARB-4 and deduced a polypeptide of 288 amino acids. The gene was characterized as a variant of group 2c carbenicillin-hydrolyzing beta-lactamases such as PSE-4, PSE-1, and CARB-3. The level of DNA homology between the bla genes for these beta-lactamases varied from 98.7 to 99.9%, while that between these genes and blaCARB-4 encoding CARB-4 was 86.3%. The blaCARB-4 gene was acquired from some other source because it has a G+C content of 39.1%, compared to a G+C content of 67% for typical Pseudomonas aeruginosa genes. DNA sequencing revealed that blaAER-1 shared 60.8% DNA identity with blaPSE-3 encoding PSE-3. The deduced AER-1 beta-lactamase peptide was compared to class A, B, C, and D enzymes and had 57.6% identity with PSE-3, including an STHK tetrad at the active site. For CARB-4 and AER-1, conserved canonical amino acid boxes typical of class A beta-lactamases were identified in a multiple alignment. Analysis of the DN...
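The G+C content comparison used in the abstract above is a simple base-composition percentage. A minimal sketch follows; the function name `gc_content` is hypothetical, and ignoring ambiguous bases such as N is an assumption made for this illustration.

```python
def gc_content(sequence):
    """Return the G+C percentage of a DNA sequence.

    Only unambiguous bases (A, C, G, T) are counted; anything else,
    such as N, is ignored (an assumption of this sketch).
    """
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (sequence.count("G") + sequence.count("C")) / counted
```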
Multiple sequence alignments are the usual starting point for analyses of protein structure and evolution. For proteins with repeated, shuffled and missing domains, however, traditional multiple sequence alignment algorithms fail to provide an accurate view of homology between related proteins, because they either assume that the input sequences are globally alignable or require locally alignable regions to appear in the same order in all sequences. In this paper, we present ProDA, a novel system for automated detection and ...
DNA-to-protein translation is performed by detecting an open reading frame (ORF) in an input DNA coding sequence (CDS). The sequence is then converted into amino acids by reading three nucleotides (one codon) at a time. Each codon specifies an amino acid. 3 frames for ...
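Codon-by-codon translation and ORF detection, as described in the abstract above, can be sketched in a few lines. The sketch assumes the standard genetic code and scans only the three forward reading frames; the abstract is truncated at this point, so the authors' exact procedure is not known, and the names `translate` and `find_orf` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=0):
    """Translate a DNA (or RNA) string codon by codon, starting at `frame`.

    Stop codons become '*'; codons with unknown bases become 'X'.
    """
    sequence = sequence.upper().replace("U", "T")
    codons = (sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_orf(sequence):
    """Return the longest M...stop protein over the three forward frames."""
    best = ""
    for frame in range(3):
        protein = translate(sequence, frame)
        start = protein.find("M")
        while start != -1:
            stop = protein.find("*", start)
            if stop == -1:
                break  # no stop codon after this start: not a complete ORF
            if stop - start > len(best):
                best = protein[start:stop]
            start = protein.find("M", stop)
    return best
```

For example, `translate("ATGGCCTAA")` yields `"MA*"`. A full six-frame search would also translate the reverse complement, which this sketch leaves out.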
We introduce a vector-space embedding of protein sequences which will allow us to find the motifs in a set of proteins. Our method can also be used for the multiple alignment of more than two proteins. It is superior to the existing methods that depend on the order of proteins ...
Structure-based RNA multiple alignment is particularly challenging because covarying mutations make sequence information alone insufficient. Existing tools for RNA multiple alignment first generate pairwise RNA structure alignments and then build the multiple alignment using only sequence information. Here we present PMFastR, an algorithm which iteratively uses a sequence-structure alignment procedure to build a structure-based RNA multiple alignment from one sequence with known structure and a database of sequences from the same family. PMFastR also has low memory consumption, allowing for the alignment of large sequences such as 16S and 23S rRNA. The algorithm also provides a method to utilize a multi-core environment. We present results on benchmark data sets from BRAliBase, which show that PMFastR performs comparably to other state-of-the-art programs. Finally, we regenerate 607 Rfam seed alignments and show that our automated process creates multiple alignments similar to the manual...
Motivation: With the increasing availability of large protein-protein interaction networks, the question of protein network alignment is becoming central to systems biology. Network alignment is further delineated into two sub-problems: local alignment, to find small conserved motifs across ...
We define and prove properties of the consensus shape for a family of proteins, a protein-like structure that provides a compact summary of the significant structural information for a protein family. If all members of a protein family exhibit a geometric relationship between corresponding alpha carbons then that relationship is preserved in the consensus shape. In particular, distances and angles
Background: Aminopeptidase B (Ap-B; EC 3.4.11.6) catalyzes the cleavage of basic residues at the N-terminus of peptides and processes glucagon into miniglucagon. The enzyme exhibits, in vitro, a residual ability to hydrolyze leukotriene A4 into the pro-inflammatory lipid mediator leukotriene B4. The potential bi-functional nature of Ap-B is supported by close structural relationships with LTA4 hydrolase (LTA4H; EC 3.3.2.6). A structure-function analysis is necessary for the detailed understanding of the enzymatic mechanisms of Ap-B and to design inhibitors, which could be used to determine the complete in vivo functions of the enzyme. Results: The rat Ap-B cDNA was expressed in E. coli and the purified recombinant enzyme was characterized. Eighteen mutants of the H325EXXHX18E348 Zn2+-binding motif were constructed and expressed. All mutations were found to abolish the aminopeptidase activity. A multiple alignment of 500 sequences of the M1 family of aminopeptidases was performed to i...
We have built a specialized relational database and a search tool for natural mutants of protein C. It contains 195 entries that include 182 missense and 13 stop mutations. A menu-driven search engine allows the user to retrieve stored information for each variant, that includes genetic ...
We propose a new alignment procedure that is capable of aligning protein sequences and structures in a unified manner. Recursive dynamic programming (RDP) is a hierarchical method which, on each level of the hierarchy, identifies locally optimal solutions and assembles them into partial alignments of sequences and/or structures. In contrast to classical dynamic programming, RDP can also handle alignment problems that use objective functions not obeying the principle of prefix optimality, e.g. scoring schemes derived from energy potentials of mean force. For such alignment problems, RDP aims at computing solutions that are near-optimal with respect to the involved cost function and biologically meaningful at the same time. Towards this goal, RDP maintains a dynamic balance between different factors governing alignment fitness such as evolutionary relationships and structural preferences. As in the RDP method gaps are not scored explicitly, the problematic assignment of gap cost param...