Jiri Novak

Academy of Sciences of the Czech Republic, Institute of Microbiology, Post-Doc

Czech Technical University in Prague, Faculty of Information Technology, Department of Software Engineering, Faculty Member

Address: Institute of Microbiology
Academy of Sciences of the Czech Republic
Videnska 1083
142 20 Prague
Czech Republic

Papers by Jiri Novak

D. Luptakova, T. Pluhacek, M. Petrik, J. Novak, A. Palyzova, L. Sokolova, A. Skriba, B. Sediva, K. Lemr, V. Havlicek. Non-invasive and invasive diagnoses of aspergillosis in a rat model by mass spectrometry

Scientific Reports, 2017

Invasive pulmonary aspergillosis results in 450,000 deaths per year and complicates cancer chemotherapy, transplantations and the treatment of other immunosuppressed patients. Using a rat model of experimental aspergillosis, the fungal siderophores ferricrocin and triacetylfusarinine C were identified as markers of aspergillosis and quantified in urine, serum and lung tissues. Biomarkers were analyzed by matrix-assisted laser desorption ionization (MALDI) and electrospray ionization mass spectrometry using a 12T SolariX Fourier transform ion cyclotron resonance (FTICR) mass spectrometer. The limits of detection of the ferri-forms of triacetylfusarinine C and ferricrocin in the rat serum were 0.28 and 0.36 ng/mL, respectively. In the rat urine the respective limits of detection achieved 0.02 and 0.03 ng/mL. In the sera of infected animals, triacetylfusarinine C was not detected but ferricrocin concentration fluctuated in the 3–32 ng/mL range. Notably, the mean concentrations of triacetylfusarinine C and ferricrocin in the rat urine were 0.37 and 0.63 μg/mL, respectively. The MALDI FTICR mass spectrometry imaging illustrated the actual microbial ferricrocin distribution in the lung tissues and resolved the false-positive results obtained by the light microscopy and histological staining. Ferricrocin and triacetylfusarinine C detection in urine represents an innovative non-invasive indication of Aspergillus infection in a host.

J. Novak, V. Havlicek. Dereplication and Visualization of Fungal Siderophores by CycloBranch

Mycoses, 2017

J. Novak, L. Sokolova, K. Lemr, T. Pluhacek, A. Palyzova, V. Havlicek. Batch-processing of Imaging or Liquid-Chromatography Mass Spectrometry Datasets and De Novo Sequencing of Polyketide Siderophores

Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 2017

The open-source and cross-platform software CycloBranch was utilized for dereplication of organic compounds from mass spectrometry imaging imzML datasets and its functions were illustrated on microbial siderophores. The pixel-to-pixel batch-processing was analogous to liquid chromatography mass spectrometry data. Each data point represented here by accurate m/z values and the corresponding ion intensities was matched against integrated compound libraries. The fine isotopic structure matching was also embedded into CycloBranch dereplication process. The siderophores' characterization from single-pixel mass spectra was further supported by their de novo sequencing. New ketide building block library was utilized by CycloBranch to characterize the siderophores in images and mixtures and nomenclature of fragment ion series of linear and cyclic polyketide siderophores was proposed. The software is freely available at http://ms.biomed.cas.cz/cyclobranch.
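
The matching of accurate m/z values against a compound library, as described in the abstract above, can be illustrated with a minimal Python sketch. This is not CycloBranch code: the function name `match_peaks`, the 5 ppm tolerance, and every compound name and m/z value are hypothetical placeholders chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def match_peaks(library, observed_mz, ppm=5.0):
    """Map each observed m/z to the library entries within a ppm tolerance.

    library: dict of compound name -> theoretical m/z (hypothetical values).
    observed_mz: iterable of measured m/z values.
    """
    # Sort the library once by m/z so each lookup is a binary search.
    entries = sorted(library.items(), key=lambda kv: kv[1])
    masses = [mz for _, mz in entries]

    hits = {}
    for mz in observed_mz:
        tol = mz * ppm * 1e-6  # absolute tolerance at this m/z
        lo = bisect_left(masses, mz - tol)
        hi = bisect_right(masses, mz + tol)
        if lo < hi:
            hits[mz] = [entries[i][0] for i in range(lo, hi)]
    return hits


# Made-up library and peak list, not real compound masses.
library = {"compound_A": 500.0000, "compound_B": 750.2500, "compound_C": 750.2530}
observed = [500.0010, 750.2510, 900.0000]
hits = match_peaks(library, observed, ppm=5.0)
print(hits)
```

In an imaging or LC-MS batch run, a loop of this kind would be repeated for the spectrum of every pixel or scan; intensity thresholds and isotopic pattern checks are left out of the sketch.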

J. Prichystal, K. A. Schug, K. Lemr, J. Novak, V. Havlicek. Structural analysis of natural products

Analytical Chemistry, 2016

Current mass spectrometry, nuclear magnetic resonance spectroscopy and X-ray diffraction are presented as structure elucidation tools for analytical chemistry of natural products. Discovering new molecular entities combined with dereplication of known organic compounds represent prerequisites for biological assays and for respective applications as pharmaceuticals or molecular markers. Liquid chromatography is briefly addressed with respect to its use in mass spectrometry- and nuclear magnetic resonance-based metabolomics studies.

J. Novak, K. Lemr, K. A. Schug, V. Havlicek. CycloBranch: De Novo Sequencing of Nonribosomal Peptides from Accurate Product Ion Mass Spectra

Journal of The American Society for Mass Spectrometry, 2015

Nonribosomal peptides have a wide range of biological and medical applications. Their identification by tandem mass spectrometry remains a challenging task. A new open-source de novo peptide identification engine CycloBranch was developed and successfully applied in identification or detailed characterization of 11 linear, cyclic, branched, and branch-cyclic peptides. CycloBranch is based on annotated building block databases the size of which is defined by the user according to ribosomal or nonribosomal peptide origin. The current number of involved nonisobaric and isobaric building blocks is 287 and 521, respectively. Contrary to all other peptide sequencing tools utilizing either peptide libraries or peptide fragment libraries, CycloBranch represents a true de novo sequencing engine developed for accurate mass spectrometric data. It is a stand-alone and cross-platform application with a graphical and user-friendly interface; it supports mzML, mzXML, mgf, txt, and baf file formats and can be run in parallel on multiple threads. It can be downloaded for free from http://ms.biomed.cas.cz/cyclobranch/, where the User’s manual and video tutorials can be found.

T. Pluhacek, K. Lemr, D. Ghosh, D. Milde, J. Novak, V. Havlicek. Characterization of microbial siderophores by mass spectrometry

Mass Spectrometry Reviews, 2016

Siderophores play important roles in microbial iron piracy, and are applied as infectious disease biomarkers and novel pharmaceutical drugs. Inductively coupled plasma and molecular mass spectrometry (ICP-MS) combined with high resolution separations allow characterization of siderophores in complex samples taking advantages of mass defect data filtering, tandem mass spectrometry, and iron-containing compound quantitation. The enrichment approaches used in siderophore analysis and current ICP-MS technologies are reviewed. The recent tools for fast dereplication of secondary metabolites and their databases are reported. This review on siderophores is concluded with their recent medical, biochemical, geochemical, and agricultural applications in mass spectrometry context.

J. Novak, T. Sachsenberg, D. Hoksza, T. Skopal, O. Kohlbacher. On Comparison of SimTandem with State-of-the-Art Peptide Identification Tools, Efficiency of Precursor Mass Filter and Dealing with Variable Modifications

Journal of Integrative Bioinformatics, 2013

The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra produced by shotgun proteomics. Growing protein sequence databases and noisy query spectra demand database indexing techniques and better similarity measures for the comparison of theoretical spectra against query spectra. We employ a modification of previously proposed parameterized Hausdorff distance for comparisons of mass spectra. The new distance outperforms the original distance, the angle distance and state-of-the-art peptide identification tools OMSSA and X!Tandem in the number of identified peptides even though the q-value is only 0.001. When a precursor mass filter is used as a database indexing technique, our method outperforms OMSSA in the speed of search. When variable modifications are not searched, the search time is similar to X!Tandem. We show that the precursor mass filter is an efficient database indexing technique for high-accuracy data even though many variable modifications are being searched. We demonstrate that the number of identified peptides is bigger when variable modifications are searched separately by more search runs of a peptide identification engine. Otherwise, the false discovery rates are affected by mixing unmodified and modified spectra together resulting in a lower number of identified peptides. Our method is implemented in the freely available application SimTandem which can be used in the framework TOPP based on OpenMS.
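
The idea of a precursor mass filter used as a database index can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SimTandem's implementation: the class `PrecursorMassIndex`, its method names, the tolerance, the mass shift, and all peptide labels and masses are hypothetical.

```python
from bisect import bisect_left, bisect_right


class PrecursorMassIndex:
    """Sorted-array index over peptide masses (illustrative only)."""

    def __init__(self, peptides):
        # peptides: iterable of (label, mass) pairs, sorted once by mass.
        self._entries = sorted(peptides, key=lambda p: p[1])
        self._masses = [mass for _, mass in self._entries]

    def candidates(self, precursor_mass, tolerance):
        """Return labels whose mass lies within +/- tolerance of the precursor."""
        lo = bisect_left(self._masses, precursor_mass - tolerance)
        hi = bisect_right(self._masses, precursor_mass + tolerance)
        return [label for label, _ in self._entries[lo:hi]]

    def candidates_with_shifts(self, precursor_mass, tolerance, shifts):
        """Repeat the lookup once per assumed modification mass shift."""
        found = []
        for shift in shifts:
            for label in self.candidates(precursor_mass - shift, tolerance):
                found.append((label, shift))
        return found


# Made-up peptide labels and masses.
index = PrecursorMassIndex([
    ("PEP_A", 800.40), ("PEP_B", 1000.50), ("PEP_C", 1000.52), ("PEP_D", 1200.60),
])
unmodified = index.candidates(1000.51, tolerance=0.02)
modified = index.candidates_with_shifts(1016.51, tolerance=0.02, shifts=[0.0, 16.0])
print(unmodified, modified)
```

Only the candidates that survive the filter would then be scored against the query spectrum with a more expensive similarity measure. With high-accuracy data the tolerance window is narrow, so each additional mass shift adds only a small number of candidates.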

J. Novak, T. Sachsenberg, D. Hoksza, T. Skopal, O. Kohlbacher. A Statistical Comparison of SimTandem with State-of-the-Art Peptide Identification Tools

7th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB), 2013

The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra generated by shotgun proteomics. Since query spectra contain many inaccuracies and the sizes of databases grow rapidly in recent years, demands on more accurate mass spectra similarities and on the utilization of database indexing techniques are still desirable. We propose a statistical comparison of parameterized Hausdorff distance with freely available tools OMSSA, X!Tandem and with the cosine similarity. We show that a precursor mass filter in combination with a modification of previously proposed parameterized Hausdorff distance outperforms state-of-the-art tools in both - the speed of search and the number of identified peptide sequences (even though the q-value is only 0.001). Our method is implemented in the freely available application SimTandem which can be used in the framework TOPP based on OpenMS.

J. Novak, J. Galgonek, D. Hoksza, T. Skopal. SimTandem: Similarity Search in Tandem Mass Spectra

5th International Conference on Similarity Search and Applications (SISAP), 2012

SimTandem is a tool for fast identification of protein and peptide sequences from tandem mass spectra. The identification is based on similarity search of spectra captured by a tandem mass spectrometer in databases of theoretical mass spectra generated from databases of known protein sequences. Since the number of protein sequences in the databases grows rapidly and a sequential scan over the entire database of spectra is time-consuming, the non-metric access methods are employed as the database indexing techniques. SimTandem is based on a previously proposed method and is freely available at http://www.simtandem.org or http://www.siret.cz/simtandem.

J. Lokoc, P. Cech, J. Novak, T. Skopal. Cut-Region: A Compact Building Block for Hierarchical Metric Indexing

5th International Conference on Similarity Search and Applications (SISAP), 2012

With the emerging applications dealing with complex multimedia retrieval, such as the multimedia exploration, appropriate indexing structures need to be designed. A formalism for compact metric region description can significantly simplify the design of algorithms for such indexes, thus more complex and efficient metric indexes can be developed. In this paper, we introduce the cut-regions that are suitable for compact metric region description and we discuss their basic operations. To demonstrate the power of cut-regions, we redefine the PM-Tree using the cut-region formalism and, moreover, we use the formalism to describe our new improvements of the PM-Tree construction techniques. We have experimentally evaluated that the improved construction techniques lead to query performance originally obtained just using expensive construction techniques. Also in comparison with other metric and spatial access methods, the revisited PM-Tree proved its benefits.

J. Novak, D. Hoksza, J. Lokoc, T. Skopal. On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

8th International Symposium on Bioinformatics Research and Applications (ISBRA), 2012

Tandem mass spectrometry is a well-known technique for identification of protein sequences from an "in vitro" sample. To identify the sequences from spectra captured by a spectrometer, the similarity search in a database of hypothetical mass spectra is often used. For this purpose, a database of known protein sequences is utilized to generate the hypothetical spectra. Since the number of sequences in the databases grows rapidly over the time, several approaches have been proposed to index the databases of mass spectra. In this paper, we improve an approach based on the non-metric similarity search where the M-tree and the TriGen algorithm are employed for fast and approximative search. We show that preprocessing of mass spectra by clustering speeds up the identification of sequences more than 100x with respect to the sequential scan of the entire database. Moreover, when the protein candidates are refined by sequential scan in the postprocessing step, the whole approach exhibits precision similar to that of sequential scan over the entire database (over 90%).

J. Novak, T. Skopal, D. Hoksza, J. Lokoc. Non-metric Similarity Search of Tandem Mass Spectra Including Posttranslational Modifications

Journal of Discrete Algorithms, 2012

In biological applications, the tandem mass spectrometry is a widely used method for determining protein and peptide sequences from an “in vitro” sample. The sequences are not determined directly, but they must be interpreted from the mass spectra, which is the output of the mass spectrometer. This work is focused on a similarity-search approach to mass spectra interpretation, where the parameterized Hausdorff distance (dHP) is used as the similarity. In order to provide an efficient similarity search under dHP, the metric access methods and the TriGen algorithm (controlling the metricity of dHP) are employed. Moreover, the search model based on the dHP supports posttranslational modifications (PTMs) in the query mass spectra, what is typically a problem when an indexing approach is used. Our approach can be utilized as a coarse filter by any other database approach for mass spectra interpretation.

J. Novak, T. Skopal, D. Hoksza, J. Lokoc, J. Galgonek. Protein Sequences Identification using NM-tree

Proceedings of the 4th International Conference on Similarity Search and Applications (SISAP), 2011

We have generalized a method for tandem mass spectra interpretation, based on the parameterized Hausdorff distance. Instead of just peptides (short pieces of proteins), in this paper we describe the interpretation of whole protein sequences. For this purpose, we employ the recently introduced NM-tree to index the database of hypothetical mass spectra for exact or fast approximate search. The NM-tree combines the M-tree with the TriGen algorithm in a way that allows to dynamically control the retrieval precision at query time. A scheme for protein sequences identification using the NM-tree is proposed.

J. Novak, D. Hoksza. Similarity Search and Posttranslational Modifications in Tandem Mass Spectra

Proceedings of Bioinformatics and Biomedicine Workshops (BIBMW), 2010

Tandem mass spectrometry is a fast and modern method for determining protein and peptide sequences from an "in vitro" sample. A mass spectrometer outputs mass spectra, which are utilized for sequences identification. The successful methods for mass spectra interpretation are based on search in databases of already known or predicted protein sequences. Since the amount of protein sequences in the databases grows exponentially, couple of indexing approaches have been proposed to speed up the identification of the sequences corresponding to the mass spectra. However, many of these approaches do not (or poorly) support interpretation of the spectra contaminated with posttranslational modifications (PTMs), which in real-world conditions occur very often.

We propose a promising method for dealing with PTMs in mass spectra, including efficient similarity search employing metric indexing. In this paper, we generalize a previously proposed method based on the parametrized Hausdorff distance, which can be used as a coarse filter for any other database-based mass spectra interpretation method.

J. Novak, T. Skopal, D. Hoksza, J. Lokoc. Improving the similarity search of tandem mass spectra using metric access methods

Proceedings of the 3rd International Conference on Similarity Search and Applications (SISAP), 2010

In biological applications, the tandem mass spectrometry is a widely used method for determining protein and peptide sequences from an "in vitro" sample. The sequences are not determined directly, but they must be interpreted from the mass spectra, which is the output of the mass spectrometer. This work is focused on a similarity-search approach to mass spectra interpretation, where the parametrized Hausdorff distance (dHP) is used as the similarity. In order to provide an efficient similarity search under dHP, the metric access methods and the TriGen algorithm (controlling the metricity of dHP) are employed. We show that similarity search using dHP exhibits better correctness of peptide mass spectra interpretation than the cosine similarity commonly mentioned in mass spectrometry literature.

Moreover, the search model using the dHP distance could be extended to support chemical modifications in the query mass spectra, which is typically a problem when the cosine similarity is used. Our approach can be utilized as a coarse filter by any other database approach for mass spectra interpretation.

J. Novak, D. Hoksza. Parametrised Hausdorff Distance as a Non-Metric Similarity Model for Tandem Mass Spectrometry

Proceedings of the 10th Annual International Workshop on Databases, Texts, Specifications, and Objects (DATESO), 2010

Tandem mass spectrometry is a widely used method for protein and peptide sequences identification. Since the mass spectra contain up to 80% of noise and many other inaccuracies, there still exists a need for more accurate algorithms for mass spectra interpretation.

The sizes of protein databases grow rapidly and the methods for indexing these databases in order to interpret mass spectra become very popular. The parametrised Hausdorff distance, suitable for non-metric search, is presented in this paper. It models the similarity among tandem mass spectra very well and it is able to match the spectrum to correct peptide sequence in many cases without any post-processing scoring system.
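
For orientation, the classical (unparameterized) Hausdorff distance between two peak lists can be written as a short Python sketch. This is the textbook definition only and is not the parametrised distance proposed in the paper, whose exact form is not given in the abstract; the function names and the peak values below are hypothetical.

```python
def directed_hausdorff(a, b):
    """Largest distance from a peak in a to its nearest peak in b.

    a, b: non-empty lists of m/z values.
    """
    return max(min(abs(x - y) for y in b) for x in a)


def hausdorff(a, b):
    """Classical symmetric Hausdorff distance between two peak lists."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Made-up peak lists: two peaks agree closely, one peak is far off.
query = [100.0, 200.0, 300.0]
theoretical = [100.5, 200.0, 350.0]
distance = hausdorff(query, theoretical)
print(distance)
```

In this example the single poorly matched peak alone determines the result, which illustrates how sensitive the classical definition is to outlying peaks in noisy spectra.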

J. Novak, D. Hoksza. An application of the metric access methods to the mass spectrometry data

Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2009

Mass spectrometry is a very popular method for protein and peptide identification nowadays. Abundance of data generated in this way grows exponentially every year. Although there exist algorithms for interpreting mass spectra, demand for faster and more accurate approaches remains.

We propose an approach for preprocessing the protein sequence database based on metric access methods. This approach allows to select only a small set of suitable peptide sequence candidates, which can be then compared with experimental spectra using more sophisticated algorithms. We define logarithmic distance for selecting peptide sequence candidates and also outline possibilities of using the interval query for searching posttranslational modifications.

The experimental results show that our approach is comparable in precision with nowadays most widely used public tools and outline possible directions for further research.

Conference Presentations by Jiri Novak

33rd Informal Meeting on Mass Spectrometry - De Novo Identification of Nonribosomal Peptides from Accurate Product Ion Mass Spectra

4th Conference of the Czech Society for Mass Spectrometry - Software-aided identification of complex nonribosomal peptides from accurate tandem mass spectra

3rd Conference of the Czech Society for Mass Spectrometry - Identification of Peptide Sequences by SimTandem

D. Luptakova, T. Pluhacek, M. Petrik, J. Novak, A. Palyzova, L. Sokolova, A. Skriba, B. Sediva, K. Lemr, V. Havlicek. Non-invasive and invasive diagnoses of aspergillosis in a rat model by mass spectrometry

Scientific Reports, 2017

Invasive pulmonary aspergillosis results in 450,000 deaths per year and complicates cancer chemot... more Invasive pulmonary aspergillosis results in 450,000 deaths per year and complicates cancer chemotherapy, transplantations and the treatment of other immunosuppressed patients. Using a rat model of experimental aspergillosis, the fungal siderophores ferricrocin and triacetylfusarinine C were identified as markers of aspergillosis and quantified in urine, serum and lung tissues. Biomarkers were analyzed by matrix-assisted laser desorption ionization (MALDI) and electrospray ionization mass spectrometry using a 12T SolariX Fourier transform ion cyclotron resonance (FTICR) mass spectrometer. The limits of detection of the ferri-forms of triacetylfusarinine C and ferricrocin in the rat serum were 0.28 and 0.36 ng/mL, respectively. In the rat urine the respective limits of detection achieved 0.02 and 0.03 ng/mL. In the sera of infected animals, triacetylfusarinine C was not detected but ferricrocin concentration fluctuated in the 3–32 ng/mL range. Notably, the mean concentrations of triacetylfusarinine C and ferricrocin in the rat urine were 0.37 and 0.63 μg/mL, respectively. The MALDI FTICR mass spectrometry imaging illustrated the actual microbial ferricrocin distribution in the lung tissues and resolved the false-positive results obtained by the light microscopy and histological staining. Ferricrocin and triacetylfusarinine C detection in urine represents an innovative non-invasive indication of Aspergillus infection in a host.

J. Novak, V. Havlicek. Dereplication and Visualization of Fungal Siderophores by CycloBranch

Mycoses, 2017

J. Novak, L. Sokolova, K. Lemr, T. Pluhacek, A. Palyzova, V. Havlicek. Batch-processing of Imaging or Liquid-Chromatography Mass Spectrometry Datasets and De Novo Sequencing of Polyketide Siderophores

Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 2017

The open-source and cross-platform software CycloBranch was utilized for dereplication of organic... more The open-source and cross-platform software CycloBranch was utilized for dereplication of organic compounds from mass spectrometry imaging imzML datasets and its functions were illustrated on microbial siderophores. The pixel-to-pixel batch-processing was analogous to liquid chromatography mass spectrometry data. Each data point represented here by accurate m/z values and the corresponding ion intensities was matched against integrated compound libraries. The fine isotopic structure matching was also embedded into CycloBranch dereplication process. The siderophores' characterization from single-pixel mass spectra was further supported by their de novo sequencing. New ketide building block library was utilized by CycloBranch to characterize the siderophores in images and mixtures and nomenclature of fragment ion series of linear and cyclic polyketide siderophores was proposed. The software is freely available at http://ms.biomed.cas.cz/cyclobranch.

J. Prichystal, K. A. Schug, K. Lemr, J. Novak, V. Havlicek. Structural analysis of natural products

Analytical Chemistry, 2016

Current mass spectrometry, nuclear magnetic resonance spectroscopy and X-ray diffraction are pres... more Current mass spectrometry, nuclear magnetic resonance spectroscopy and X-ray diffraction are presented as structure elucidation tools for analytical chemistry of natural products. Discovering new molecular entities combined with dereplication of known organic compounds represent prerequisites for biological assays and for respective applications as pharmaceuticals or molecular markers. Liquid chromatography is briefly addressed with respect to its use in mass spectrometry- and nuclear magnetic resonance-based metabolomics studies.

J. Novak, K. Lemr, K. A. Schug, V. Havlicek. CycloBranch: De Novo Sequencing of Nonribosomal Peptides from Accurate Product Ion Mass Spectra

Journal of The American Society for Mass Spectrometry, 2015

Nonribosomal peptides have a wide range of biological and medical applications. Their identificat... more Nonribosomal peptides have a wide range of biological and medical applications. Their identification by tandem mass spectrometry remains a challenging task. A new open-source de novo peptide identification engine CycloBranch was developed and successfully applied in identification or detailed characterization of 11 linear, cyclic, branched, and branch-cyclic peptides. CycloBranch is based on annotated building block databases the size of which is defined by the user according to ribosomal or nonribosomal peptide origin. The current number of involved nonisobaric and isobaric building blocks is 287 and 521, respectively. Contrary to all other peptide sequencing tools utilizing either peptide libraries or peptide fragment libraries, CycloBranch represents a true de novo sequencing engine developed for accurate mass spectrometric data. It is a stand-alone and cross-platform application with a graphical and user-friendly interface; it supports mzML, mzXML, mgf, txt, and baf file formats and can be run in parallel on multiple threads. It can be downloaded for free from http://ms.biomed.cas.cz/cyclobranch/, where the User’s manual and video tutorials can be found.

T. Pluhacek, K. Lemr, D. Ghosh, D. Milde, J. Novak, V. Havlicek. Characterization of microbial siderophores by mass spectrometry

Mass Spectrometry Reviews, 2016

Siderophores play important roles in microbial iron piracy, and are applied as infectious disease... more Siderophores play important roles in microbial iron piracy, and are applied as infectious disease biomarkers and novel pharmaceutical drugs. Inductively coupled plasma and molecular mass spectrometry (ICP-MS) combined with high resolution separations allow characterization of siderophores in complex samples taking advantages of mass defect data filtering, tandem mass spectrometry, and iron-containing compound quantitation. The enrichment approaches used in siderophore analysis and current ICP-MS technologies are reviewed. The recent tools for fast dereplication of secondary metabolites and their databases are reported. This review on siderophores is concluded with their recent medical, biochemical, geochemical, and agricultural applications in mass spectrometry context.

J. Novak, T. Sachsenberg, D. Hoksza, T. Skopal, O. Kohlbacher. On Comparison of SimTandem with State-of-the-Art Peptide Identification Tools, Efficiency of Precursor Mass Filter and Dealing with Variable Modifications

Journal of Integrative Bioinformatics, 2013

The similarity search in theoretical mass spectra generated from protein sequence databases is a ... more The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra produced by shotgun proteomics. Growing protein sequence databases and noisy query spectra demand database indexing techniques and better similarity measures for the comparison of theoretical spectra against query spectra. We employ a modification of previously proposed parameterized Hausdorff distance for comparisons of mass spectra. The new distance outperforms the original distance, the angle distance and state-of-the-art peptide identification tools OMSSA and X!Tandem in the number of identified peptides even though the q-value is only 0.001. When a precursor mass filter is used as a database indexing technique, our method outperforms OMSSA in the speed of search. When variable modifications are not searched, the search time is similar to X!Tandem. We show that the precursor mass filter is an efficient database indexing technique for high-accuracy data even though many variable modifications are being searched. We demonstrate that the number of identified peptides is bigger when variable modifications are searched separately by more search runs of a peptide identification engine. Otherwise, the false discovery rates are affected by mixing unmodified and modified spectra together resulting in a lower number of identified peptides. Our method is implemented in the freely available application SimTandem which can be used in the framework TOPP based on OpenMS.

J. Novak, T. Sachsenberg, D. Hoksza, T. Skopal, O. Kohlbacher. A Statistical Comparison of SimTandem with State-of-the-Art Peptide Identification Tools

7th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB), 2013

The similarity search in theoretical mass spectra generated from protein sequence databases is a ... more The similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identification of peptides from query mass spectra generated by shotgun proteomics. Since query spectra contain many inaccuracies and the sizes of databases grow rapidly in recent years, demands on more accurate mass spectra similarities and on the utilization of database indexing techniques are still desirable. We propose a statistical comparison of parameterized Hausdorff distance with freely available tools OMSSA, X!Tandem and with the cosine similarity. We show that a precursor mass filter in combination with a modification of previously proposed parameterized Hausdorff distance outperforms state-of-the-art tools in both - the speed of search and the number of identified peptide sequences (even though the q-value is only 0.001). Our method is implemented in the freely available application SimTandem which can be used in the framework TOPP based on OpenMS.

J. Novak, J. Galgonek, D. Hoksza, T. Skopal. SimTandem: Similarity Search in Tandem Mass Spectra

5th International Conference on Similarity Search and Applications (SISAP), 2012

SimTandem is a tool for fast identification of protein and peptide sequences from tandem mass spe... more SimTandem is a tool for fast identification of protein and peptide sequences from tandem mass spectra. The identification is based on similarity search of spectra captured by a tandem mass spectrometer in databases of theoretical mass spectra generated from databases of known protein sequences. Since the number of protein sequences in the databases grows rapidly and a sequential scan over the entire database of spectra is time-consuming, the non-metric access methods are employed as the database indexing techniques. SimTandem is based on a previously proposed method and is freely available at http://www.simtandem.org or http://www.siret.cz/simtandem.

J. Lokoc, P. Cech, J. Novak, T. Skopal. Cut-Region: A Compact Building Block for Hierarchical Metric Indexing

5th International Conference on Similarity Search and Applications (SISAP), 2012

With emerging applications that deal with complex multimedia retrieval, such as multimedia exploration, appropriate indexing structures need to be designed. A formalism for compact metric region description can significantly simplify the design of algorithms for such indexes, so that more complex and efficient metric indexes can be developed. In this paper, we introduce cut-regions, which are suitable for compact metric region description, and we discuss their basic operations. To demonstrate the power of cut-regions, we redefine the PM-Tree using the cut-region formalism and, moreover, we use the formalism to describe our new improvements to the PM-Tree construction techniques. Our experiments show that the improved construction techniques achieve query performance that was originally obtained only with expensive construction techniques. The revisited PM-Tree also proved beneficial in comparison with other metric and spatial access methods.

J. Novak, D. Hoksza, J. Lokoc, T. Skopal. On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

8th International Symposium on Bioinformatics Research and Applications (ISBRA), 2012

Tandem mass spectrometry is a well-known technique for identification of protein sequences from an "in vitro" sample. To identify the sequences from spectra captured by a spectrometer, a similarity search in a database of hypothetical mass spectra is often used. For this purpose, a database of known protein sequences is utilized to generate the hypothetical spectra. Since the number of sequences in the databases grows rapidly over time, several approaches have been proposed to index the databases of mass spectra. In this paper, we improve an approach based on the non-metric similarity search where the M-tree and the TriGen algorithm are employed for fast approximate search. We show that preprocessing of mass spectra by clustering speeds up the identification of sequences more than 100x with respect to the sequential scan of the entire database. Moreover, when the protein candidates are refined by a sequential scan in the postprocessing step, the whole approach exhibits precision similar to that of a sequential scan over the entire database (over 90%).

J. Novak, T. Skopal, D. Hoksza, J. Lokoc. Non-metric Similarity Search of Tandem Mass Spectra Including Posttranslational Modifications

Journal of Discrete Algorithms, 2012

In biological applications, tandem mass spectrometry is a widely used method for determining protein and peptide sequences from an “in vitro” sample. The sequences are not determined directly; they must be interpreted from the mass spectra, which are the output of the mass spectrometer. This work is focused on a similarity-search approach to mass spectra interpretation, where the parameterized Hausdorff distance (dHP) is used as the similarity. In order to provide an efficient similarity search under dHP, the metric access methods and the TriGen algorithm (controlling the metricity of dHP) are employed. Moreover, the search model based on dHP supports posttranslational modifications (PTMs) in the query mass spectra, which is typically a problem when an indexing approach is used. Our approach can be utilized as a coarse filter by any other database approach for mass spectra interpretation.

J. Novak, T. Skopal, D. Hoksza, J. Lokoc, J. Galgonek. Protein Sequences Identification using NM-tree

Proceedings of the 4th International Conference on Similarity Search and Applications (SISAP), 2011

We have generalized a method for tandem mass spectra interpretation, based on the parameterized Hausdorff distance. Instead of just peptides (short pieces of proteins), in this paper we describe the interpretation of whole protein sequences. For this purpose, we employ the recently introduced NM-tree to index the database of hypothetical mass spectra for exact or fast approximate search. The NM-tree combines the M-tree with the TriGen algorithm in a way that allows the retrieval precision to be controlled dynamically at query time. A scheme for protein sequence identification using the NM-tree is proposed.

J. Novak, D. Hoksza. Similarity Search and Posttranslational Modifications in Tandem Mass Spectra

Proceedings of Bioinformatics and Biomedicine Workshops (BIBMW), 2010

Tandem mass spectrometry is a fast and modern method for determining protein and peptide sequences from an "in vitro" sample. A mass spectrometer outputs mass spectra, which are utilized for sequence identification. The successful methods for mass spectra interpretation are based on search in databases of already known or predicted protein sequences. Since the number of protein sequences in the databases grows exponentially, a couple of indexing approaches have been proposed to speed up the identification of the sequences corresponding to the mass spectra. However, many of these approaches do not support, or only poorly support, interpretation of spectra contaminated with posttranslational modifications (PTMs), which occur very often in real-world conditions.

We propose a promising method for dealing with PTMs in mass spectra, including efficient similarity search employing metric indexing. In this paper, we generalize a previously proposed method based on the parametrized Hausdorff distance, which can be used as a coarse filter for any other database-based mass spectra interpretation method.

J. Novak, T. Skopal, D. Hoksza, J. Lokoc. Improving the similarity search of tandem mass spectra using metric access methods

Proceedings of the 3rd International Conference on Similarity Search and Applications (SISAP), 2010

In biological applications, tandem mass spectrometry is a widely used method for determining protein and peptide sequences from an "in vitro" sample. The sequences are not determined directly; they must be interpreted from the mass spectra, which are the output of the mass spectrometer. This work is focused on a similarity-search approach to mass spectra interpretation, where the parametrized Hausdorff distance (dHP) is used as the similarity. In order to provide an efficient similarity search under dHP, the metric access methods and the TriGen algorithm (controlling the metricity of dHP) are employed. We show that similarity search using dHP interprets peptide mass spectra more correctly than the cosine similarity commonly mentioned in the mass spectrometry literature.

Moreover, the search model using the dHP distance could be extended to support chemical modifications in the query mass spectra, which is typically a problem when the cosine similarity is used. Our approach can be utilized as a coarse filter by any other database approach for mass spectra interpretation.

J. Novak, D. Hoksza. Parametrised Hausdorff Distance as a Non-Metric Similarity Model for Tandem Mass Spectrometry

Proceedings of the 10th Annual International Workshop on Databases, Texts, Specifications, and Objects (DATESO), 2010

Tandem mass spectrometry is a widely used method for protein and peptide sequence identification. Since the mass spectra contain up to 80% of noise and many other inaccuracies, there still exists a need for more accurate algorithms for mass spectra interpretation.

The sizes of protein databases grow rapidly, and methods for indexing these databases in order to interpret mass spectra have become very popular. The parametrised Hausdorff distance, suitable for non-metric search, is presented in this paper. It models the similarity among tandem mass spectra very well and in many cases it is able to match a spectrum to the correct peptide sequence without any post-processing scoring system.

J. Novak, D. Hoksza. An application of the metric access methods to the mass spectrometry data

Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2009

Mass spectrometry is currently a very popular method for protein and peptide identification. The amount of data generated in this way grows exponentially every year. Although there exist algorithms for interpreting mass spectra, demand for faster and more accurate approaches remains.

We propose an approach for preprocessing the protein sequence database based on metric access methods. This approach makes it possible to select only a small set of suitable peptide sequence candidates, which can then be compared with experimental spectra using more sophisticated algorithms. We define a logarithmic distance for selecting peptide sequence candidates and also outline possibilities of using the interval query for searching posttranslational modifications.

The experimental results show that our approach is comparable in precision with today's most widely used public tools and outline possible directions for further research.

33rd Informal Meeting on Mass Spectrometry - De Novo Identification of Nonribosomal Peptides from Accurate Product Ion Mass Spectra

4th Conference of the Czech Society for Mass Spectrometry - Software-aided identification of complex nonribosomal peptides from accurate tandem mass spectra

3rd Conference of the Czech Society for Mass Spectrometry - Identification of Peptide Sequences by SimTandem

6th OpenMS User Meeting - Identification of Peptide Sequences by SimTandem

PACBB 2013 - A Statistical Comparison of SimTandem with State-of-the-Art Peptide Identification Tools

ISBRA 2012 - On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

4th MaxQuant Summer School 2012 - Identification of Protein Sequences from Tandem Mass Spectra using Non-metric Access Methods

ENBIK 2012 (CZ) - Identifikace proteinových sekvencí s využitím nemetrického indexování databází hmotnostních spekter (Identification of Protein Sequences Using Non-metric Indexing of Mass Spectra Databases)

SISAP 2011 - Protein Sequences Identification using NM-tree

BIBM 2010 - Similarity Search and Posttranslational Modifications in Tandem Mass Spectra

SISAP 2010 - Improving the similarity search of tandem mass spectra using metric access methods

DATESO 2010 - Parametrised Hausdorff Distance as a Non-Metric Similarity Model for Tandem Mass Spectrometry

CIBCB 2009 - An application of the metric access methods to the mass spectrometry data

Similarity Search in Mass Spectra Databases

Aplikace metrických indexovacích metod na data získaná hmotnostní spektrometrií (Application of Metric Indexing Methods to Data Obtained by Mass Spectrometry)