Mass spectrometry is currently a very popular method for protein and peptide identification. The volume of data generated in this way grows exponentially every year. Although algorithms for interpreting mass spectra exist, demand for faster and more accurate approaches remains.
We propose an approach for preprocessing the protein sequence database based on metric access methods. This approach makes it possible to select only a small set of suitable peptide sequence candidates, which can then be compared with experimental spectra using more sophisticated algorithms. We define a logarithmic distance for selecting peptide sequence candidates and also outline possibilities of using the interval query to search for posttranslational modifications.
The experimental results show that our approach is comparable in precision with today's most widely used public tools, and they outline possible directions for further research.
Tandem mass spectrometry is a widely used method for identifying protein and peptide sequences. Since mass spectra contain up to 80% noise and many other inaccuracies, there is still a need for more accurate algorithms for mass spectra interpretation.
Protein databases are growing rapidly, and methods for indexing these databases in order to interpret mass spectra are becoming very popular. This paper presents the parametrised Hausdorff distance, which is suitable for non-metric search. It models the similarity among tandem mass spectra very well, and in many cases it is able to match a spectrum to the correct peptide sequence without any post-processing scoring system.
In biological applications, tandem mass spectrometry is a widely used method for determining protein and peptide sequences from an "in vitro" sample. The sequences are not determined directly; they must be interpreted from the mass spectra, which are the output of the mass spectrometer. This work focuses on a similarity-search approach to mass spectra interpretation, where the parametrized Hausdorff distance (dHP) is used as the similarity measure. To provide an efficient similarity search under dHP, metric access methods and the TriGen algorithm (which controls the metricity of dHP) are employed. We show that similarity search using dHP interprets peptide mass spectra more correctly than the cosine similarity commonly mentioned in the mass spectrometry literature.
Moreover, the search model using the dHP distance could be extended to support chemical modifications in the query mass spectra, which is typically a problem when the cosine similarity is used. Our approach can be utilized as a coarse filter by any other database approach for mass spectra interpretation.
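To make the two kinds of measure concrete, the following minimal sketch contrasts a binned cosine similarity with the classic (unparametrized) Hausdorff distance between two lists of peak m/z values. It is only an illustration: the exact definition of dHP is not given in this abstract and is not reproduced here, and the function names, the bin width, and the example peaks are all assumptions.

```python
import math
from collections import Counter

def cosine_similarity(a, b, bin_width=1.0):
    """Cosine similarity of two peak lists (m/z values) after binning.

    Peaks are counted per bin; intensities are ignored in this sketch.
    """
    va = Counter(int(mz // bin_width) for mz in a)
    vb = Counter(int(mz // bin_width) for mz in b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def directed_hausdorff(a, b):
    """Largest gap from a peak in `a` to its nearest peak in `b`."""
    return max(min(abs(x - y) for y in b) for x in a)

def hausdorff(a, b):
    """Classic symmetric Hausdorff distance between two peak lists."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical peak lists: the second is a slightly shifted copy of the first.
query = [100.0, 200.0]
candidate = [100.5, 203.0]
print(cosine_similarity(query, candidate), hausdorff(query, candidate))
```

In this toy case the peak at 200.0 moves to a different bin in the candidate, so the binned cosine similarity drops, while the Hausdorff distance simply reports the largest nearest-peak gap. This may hint at why a distance working directly on peak positions can be easier to adapt to mass shifts than a binned vector comparison.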
Tandem mass spectrometry is a fast and modern method for determining protein and peptide sequences from an "in vitro" sample. A mass spectrometer outputs mass spectra, which are used for sequence identification. The successful methods for mass spectra interpretation are based on searching databases of already known or predicted protein sequences. Since the number of protein sequences in the databases grows exponentially, several indexing approaches have been proposed to speed up the identification of the sequences corresponding to the mass spectra. However, many of these approaches do not support, or support only poorly, the interpretation of spectra affected by posttranslational modifications (PTMs), which occur very often in real-world conditions.
We propose a promising method for dealing with PTMs in mass spectra, including efficient similarity search employing metric indexing. In this paper, we generalize a previously proposed method based on the parametrized Hausdorff distance, which can be used as a coarse filter for any other database-based mass spectra interpretation method.
We have generalized a method for tandem mass spectra interpretation based on the parameterized Hausdorff distance. Instead of just peptides (short pieces of proteins), in this paper we describe the interpretation of whole protein sequences. For this purpose, we employ the recently introduced NM-tree to index the database of hypothetical mass spectra for exact or fast approximate search. The NM-tree combines the M-tree with the TriGen algorithm in a way that allows the retrieval precision to be controlled dynamically at query time. A scheme for protein sequence identification using the NM-tree is proposed.
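The idea behind controlling metricity can be illustrated with a small sketch. It assumes a concave power modifier of the form x^(1/(1+w)) applied to a distance normalized to [0, 1], and measures how many point triplets violate the triangle inequality for a given weight w. The toy distance (a squared gap on a line), the sample points, and all names are hypothetical; this is not the NM-tree or the TriGen implementation.

```python
from itertools import combinations

def fp_modifier(x, w):
    """Concave power modifier on [0, 1]; w = 0 leaves the distance unchanged."""
    return x ** (1.0 / (1.0 + w))

def violation_rate(points, dist, w, eps=1e-12):
    """Fraction of point triplets whose modified distances break the
    triangle inequality (the two shorter sides sum to less than the longest)."""
    triples = list(combinations(points, 3))
    bad = 0
    for p, q, r in triples:
        a, b, c = sorted(
            fp_modifier(dist(u, v), w) for u, v in ((p, q), (q, r), (p, r))
        )
        if a + b < c - eps:
            bad += 1
    return bad / len(triples)

def squared_gap(u, v):
    """Toy semi-metric: squared distance on a line, normalized to [0, 1]
    for points in the range 0..4. It violates the triangle inequality."""
    return ((u - v) / 4.0) ** 2

points = [0, 1, 2, 3, 4]
print(violation_rate(points, squared_gap, w=0))  # unmodified distance
print(violation_rate(points, squared_gap, w=1))  # square-root modifier
```

Raising w makes the modified distance behave more like a metric, which is what lets a metric index such as the M-tree prune safely. The usual trade-off is that a more concave modifier flattens the distance distribution and tends to make indexing less efficient, so the weight acts as a knob between retrieval precision and speed.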
In biological applications, tandem mass spectrometry is a widely used method for determining protein and peptide sequences from an “in vitro” sample. The sequences are not determined directly; they must be interpreted from the mass spectra, which are the output of the mass spectrometer. This work focuses on a similarity-search approach to mass spectra interpretation, where the parameterized Hausdorff distance (dHP) is used as the similarity measure. To provide an efficient similarity search under dHP, metric access methods and the TriGen algorithm (which controls the metricity of dHP) are employed. Moreover, the search model based on dHP supports posttranslational modifications (PTMs) in the query mass spectra, which is typically a problem when an indexing approach is used. Our approach can be utilized as a coarse filter by any other database approach for mass spectra interpretation.
SimTandem is a tool for fast identification of protein and peptide sequences from tandem mass spectra. The identification is based on a similarity search of spectra captured by a tandem mass spectrometer in databases of theoretical mass spectra generated from databases of known protein sequences. Since the number of protein sequences in the databases grows rapidly and a sequential scan over the entire database of spectra is time-consuming, non-metric access methods are employed as database indexing techniques. SimTandem is based on a previously proposed method and is freely available at http://www.simtandem.org or http://www.siret.cz/simtandem.
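As a rough illustration of what a theoretical mass spectrum generated from a known sequence can look like, the sketch below computes singly charged b- and y-ion m/z values for a peptide from standard monoisotopic residue masses. The abstract does not state which ion types, charge states, or modifications SimTandem actually generates, so the choice of b and y ions at charge 1 and the function name are assumptions.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton (charge carrier)

def theoretical_spectrum(peptide):
    """Return sorted m/z values of singly charged b and y fragment ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        b_ion = sum(masses[:i]) + PROTON          # N-terminal fragment
        y_ion = sum(masses[i:]) + WATER + PROTON  # C-terminal fragment
        peaks.extend((b_ion, y_ion))
    return sorted(peaks)

# Hypothetical example peptide.
print(theoretical_spectrum("PEPTIDE"))
```

A database of such peak lists, one per candidate peptide, is the kind of object that the similarity search compares against each experimental spectrum.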
Tandem mass spectrometry is a well-known technique for identifying protein sequences from an "in vitro" sample. To identify the sequences from spectra captured by a spectrometer, a similarity search in a database of hypothetical mass spectra is often used. For this purpose, a database of known protein sequences is used to generate the hypothetical spectra. Since the number of sequences in the databases grows rapidly over time, several approaches have been proposed to index the databases of mass spectra. In this paper, we improve an approach based on non-metric similarity search, where the M-tree and the TriGen algorithm are employed for fast approximate search. We show that preprocessing the mass spectra by clustering speeds up the identification of sequences by more than 100x with respect to a sequential scan of the entire database. Moreover, when the protein candidates are refined by a sequential scan in the postprocessing step, the whole approach exhibits precision similar to that of a sequential scan over the entire database (over 90%).
With emerging applications that deal with complex multimedia retrieval, such as multimedia exploration, appropriate indexing structures need to be designed. A formalism for compact metric region description can significantly simplify the design of algorithms for such indexes, so that more complex and efficient metric indexes can be developed. In this paper, we introduce cut-regions, which are suitable for compact metric region description, and we discuss their basic operations. To demonstrate the power of cut-regions, we redefine the PM-Tree using the cut-region formalism and, moreover, we use the formalism to describe our new improvements to the PM-Tree construction techniques. Our experiments show that the improved construction techniques achieve query performance that was previously obtained only with expensive construction techniques. In comparison with other metric and spatial access methods, the revisited PM-Tree also proved its benefits.
Similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identifying peptides from query mass spectra generated by shotgun proteomics. Since query spectra contain many inaccuracies and the sizes of databases have grown rapidly in recent years, more accurate mass spectra similarities and the use of database indexing techniques remain in demand. We present a statistical comparison of the parameterized Hausdorff distance with the freely available tools OMSSA and X!Tandem and with the cosine similarity. We show that a precursor mass filter in combination with a modification of the previously proposed parameterized Hausdorff distance outperforms state-of-the-art tools in both search speed and the number of identified peptide sequences (even though the q-value is only 0.001). Our method is implemented in the freely available application SimTandem, which can be used in the TOPP framework based on OpenMS.
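A precursor mass filter of the kind mentioned above can be sketched as a range lookup over candidate peptide masses kept in sorted order: only candidates whose mass lies within a tolerance of the query precursor mass are passed on to the more expensive spectrum comparison. The function name, the absolute tolerance in daltons, and the example masses are assumptions for illustration; the abstract does not describe how SimTandem implements this filter.

```python
from bisect import bisect_left, bisect_right

def precursor_filter(sorted_masses, query_mass, tolerance):
    """Return indices of candidates with |mass - query_mass| <= tolerance.

    `sorted_masses` must be sorted in ascending order, so the lookup is a
    pair of binary searches rather than a scan of the whole database.
    """
    lo = bisect_left(sorted_masses, query_mass - tolerance)
    hi = bisect_right(sorted_masses, query_mass + tolerance)
    return list(range(lo, hi))

# Hypothetical candidate peptide masses in daltons, sorted ascending.
masses = [500.0, 800.2, 800.4, 1200.0]
print(precursor_filter(masses, query_mass=800.3, tolerance=0.15))
```

Only the surviving candidates would then be scored with the spectrum similarity, which is where most of the speed-up over comparing every theoretical spectrum would come from.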