-
AIRI: Predicting Retention Indices and their Uncertainties using Artificial Intelligence
Authors:
Lewis Y. Geer,
Stephen E. Stein,
William Gary Mallard,
Douglas J. Slotta
Abstract:
The Kováts retention index (RI) is a quantity measured using gas chromatography and commonly used in the identification of chemical structures. Creating libraries of observed RI values is a laborious task, so we explore the use of a deep neural network for predicting RI values from structure for standard semipolar columns. This network generated predictions with a mean absolute error of 15.1 and, in a quantification of the tail of the error distribution, a 95th percentile absolute error of 46.5. Because of the accuracy of the Artificial Intelligence Retention Indices (AIRI) network, it was used to predict RI values for the NIST EI-MS spectral libraries. These RI values are used to improve chemical identification methods and the quality of the library. Estimating uncertainty is an important practical need when using prediction models. To quantify the uncertainty of our network for each individual prediction, we used the outputs of an ensemble of 8 networks to calculate a predicted standard deviation for each RI value prediction. This predicted standard deviation was corrected to follow the error between observed and predicted RI values. The Z scores using these predicted standard deviations had a standard deviation of 1.52 and a 95th percentile absolute Z score corresponding to a mean RI value of 42.6.
Submitted 17 January, 2024; v1 submitted 2 January, 2024;
originally announced January 2024.
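The AIRI abstract describes deriving a per-prediction standard deviation from an ensemble of 8 networks and then examining Z scores and error percentiles. The following is a minimal sketch of that general idea, not the authors' implementation. The function names, the use of the sample standard deviation across ensemble members, and the single multiplicative `scale` correction are all assumptions made for illustration; the abstract does not state how the correction was performed.

```python
import numpy as np


def ensemble_mean_std(member_preds):
    """Combine ensemble outputs into a prediction and an uncertainty.

    member_preds: array-like of shape (n_members, n_compounds), one row of
    RI predictions per ensemble member (hypothetical layout).
    Returns the per-compound mean and sample standard deviation.
    """
    preds = np.asarray(member_preds, dtype=float)
    mean = preds.mean(axis=0)
    std = preds.std(axis=0, ddof=1)  # sample std across members
    return mean, std


def z_scores(observed, mean, std, scale=1.0):
    """Z score of each observed RI against its predicted distribution.

    `scale` is a hypothetical correction factor applied to the predicted
    standard deviation; the real correction may take another form.
    """
    return (np.asarray(observed, dtype=float) - mean) / (scale * std)


def error_summary(observed, predicted):
    """Mean absolute error and 95th percentile absolute error."""
    abs_err = np.abs(np.asarray(observed, dtype=float) - predicted)
    return abs_err.mean(), np.percentile(abs_err, 95)


# Toy data: 8 ensemble members, 2 compounds (values are made up).
toy_preds = [
    [996.0, 1500.0],
    [998.0, 1502.0],
    [1000.0, 1498.0],
    [1002.0, 1501.0],
    [1004.0, 1499.0],
    [1000.0, 1500.0],
    [1000.0, 1503.0],
    [1000.0, 1497.0],
]
toy_observed = [1005.0, 1500.0]

toy_mean, toy_std = ensemble_mean_std(toy_preds)
toy_z = z_scores(toy_observed, toy_mean, toy_std)
toy_mae, toy_p95 = error_summary(toy_observed, toy_mean)
```

If the predicted standard deviations were well calibrated, the Z scores over a large test set would have a standard deviation close to 1; the abstract reports 1.52 for the corrected values.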
-
AIomics: exploring more of the proteome using mass spectral libraries extended by AI
Authors:
Lewis Y. Geer,
Joel Lapin,
Douglas J. Slotta,
Tytus D. Mak,
Stephen E. Stein
Abstract:
The unbounded permutations of biological molecules, including proteins and their constituent peptides, present a dilemma in identifying the components of complex biosamples. Sequence search algorithms used to identify peptide spectra can be expanded to cover larger classes of molecules, including more modifications, isoforms, and atypical cleavage, but at the cost of false positives or false negatives due to the simplified spectra they compute from sequence records. Spectral library searching can help solve this issue by precisely matching experimental spectra to library spectra with excellent sensitivity and specificity. However, compiling spectral libraries that span entire proteomes is difficult in practice. Neural networks that predict complete spectra containing a full range of annotated and unannotated ions can be used to replace these simplified spectra with libraries of fully predicted spectra, including modified peptides. Using such a network, we created predicted spectral libraries that were used to rescore matches from a sequence search done over a large search space, including a large number of modifications. Rescoring improved the separation of true and false hits by 82%, yielding an 8% increase in peptide identifications, including a 21% increase in nonspecifically cleaved peptides and a 17% increase in phosphopeptides.
Submitted 16 May, 2023;
originally announced May 2023.
-
Searching by index for similar sequences: the SEQR algorithm
Authors:
David I. Hurwitz,
Lianyi Han,
Lewis Y. Geer
Abstract:
This paper describes a method to efficiently retrieve protein database sequences similar to a query sequence, while allowing for significant numbers of mutations. We call this method SEQR, for SEQuence Retrieval. This approach increases the speed of sequence similarity searches by an order of magnitude compared to conventional algorithms, at the expense of sensitivity. Furthermore, retrieval time increases less than linearly with the number of sequences, a desirable property during an era when next-generation sequencing technologies have yielded greater-than-exponential increases in sequence records. The lower sensitivity of the algorithm for distantly related sequences compared to benchmarks is not intrinsic to the method itself, but rather due to the procedure used to construct the indexing terms, and may be improved. The indexing terms themselves can be added to standard information retrieval engines, enabling complex queries that include sequence similarity and other descriptors such as taxonomy and text descriptions.
Submitted 2 November, 2018;
originally announced November 2018.
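The SEQR abstract describes retrieving similar sequences through indexing terms that could also be stored in a standard information retrieval engine. It does not say how those terms are constructed, so the sketch below substitutes plain overlapping k-mers as a stand-in, only to illustrate index-based retrieval in general. All names (`kmer_terms`, `build_index`, `search`) and the toy sequences are hypothetical and do not reflect the SEQR term construction.

```python
from collections import Counter, defaultdict


def kmer_terms(sequence, k=3):
    """Return the set of overlapping k-mers of a sequence.

    Plain k-mers are an assumption for illustration; they are not the
    indexing terms used by SEQR.
    """
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(sequences, k=3):
    """Build an inverted index mapping each term to the ids containing it."""
    index = defaultdict(set)
    for seq_id, sequence in sequences.items():
        for term in kmer_terms(sequence, k):
            index[term].add(seq_id)
    return index


def search(index, query, k=3, top=5):
    """Rank database sequences by the number of terms shared with the query."""
    hits = Counter()
    for term in kmer_terms(query, k):
        for seq_id in index.get(term, ()):
            hits[seq_id] += 1
    return hits.most_common(top)


# Toy database of made-up protein fragments.
toy_db = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQL",  # differs from "a" by one residue
    "c": "GGGGGGGGGG",
}
toy_index = build_index(toy_db)
toy_hits = search(toy_index, "MKTAYIAKQR")
```

Because a lookup touches only the posting lists for the query's terms, its cost depends on the number of matching entries rather than on a scan of every database sequence, which is the general property that makes inverted indexes attractive for this kind of search.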
-
Open Mass Spectrometry Search Algorithm
Authors:
Lewis Y. Geer,
Sanford P. Markey,
Jeffrey A. Kowalak,
Lukas Wagner,
Ming Xu,
Dawn M. Maynard,
Xiaoyu Yang,
Wenyao Shi,
Stephen H. Bryant
Abstract:
Large numbers of MS/MS peptide spectra generated in proteomics experiments require efficient, sensitive, and specific algorithms for peptide identification. In the Open Mass Spectrometry Search Algorithm (OMSSA), specificity is calculated by a classic probability score using an explicit model for matching experimental spectra to sequences. At default thresholds, OMSSA matches more spectra from a standard protein cocktail than a comparable algorithm. OMSSA is designed to be faster than published algorithms in searching large MS/MS datasets.
Submitted 1 June, 2004;
originally announced June 2004.