data
Article
Aggregation of Multimodal ICE-MS Data into Joint Classifier
Increases Quality of Brain Cancer Tissue Classification
Anatoly A. Sorokin 1 , Denis S. Bormotov 1 , Denis S. Zavorotnyuk 1 , Vasily A. Eliferov 1 , Konstantin V. Bocharov 2 ,
Stanislav I. Pekov 3,4, * , Evgeny N. Nikolaev 3, * and Igor A. Popov 1, *
1 The Moscow Institute of Physics and Technology, National Research University, 141701 Dolgoprudny, Russia
2 V. L. Talrose Institute for Energy Problems of Chemical Physics, N. N. Semenov Federal Research Center for
Chemical Physics Russian Academy of Science, 119334 Moscow, Russia
3 Skolkovo Institute of Science and Technology, 121205 Moscow, Russia
4 Siberian State Medical University, 634050 Tomsk, Russia
* Correspondence: stanislav.pekov@forwe.ru (S.I.P.); e.nikolaev@skoltech.ru (E.N.N.);
hexapole@gmail.com (I.A.P.)
Abstract: Mass spectrometry fingerprinting combined with multidimensional data analysis has been
proposed in surgery to determine if a biopsy sample is a tumor. In the specific case of brain tumors,
control samples are difficult to obtain, leading to model overfitting due to unbalanced sample
cohorts. Usually, classifiers are trained using a single measurement regime, most notably single ion
polarity, but mass range and spectral resolution could also be varied. It is known that lipid groups
differ significantly in their ability to produce positive or negative ions; hence, using only one polarity
significantly restricts the chemical space available for sample discrimination purposes. In this work,
we have developed an approach that simultaneously employs mass spectrometry data obtained in
eight different regimes of measurement. Regime-specific classifiers are trained, then a
mixture-of-experts technique based on voting or mean probability is used to aggregate the predictions of all trained
classifiers and assign a class to the whole sample. The aggregated classifiers have shown a much
better performance than any of the single-regime classifiers and help significantly reduce the effect of
an unbalanced dataset without any augmentation.
Keywords: mass spectrometry; ensemble learners; mixture of experts; brain tumor; ambient ionization;
inline cartridge extraction
Citation: Sorokin, A.A.; Bormotov, D.S.; Zavorotnyuk, D.S.; Eliferov, V.A.; Bocharov, K.V.; Pekov, S.I.;
Nikolaev, E.N.; Popov, I.A. Aggregation of Multimodal ICE-MS Data into Joint Classifier Increases
Quality of Brain Cancer Tissue Classification. Data 2023, 8, 8.
https://doi.org/10.3390/data8010008
Academic Editors: Thompson
Sarkodie-Gyan, Wahyu Caesarendra
and Muhammad Irfan
Received: 26 October 2022
Revised: 23 December 2022
Accepted: 24 December 2022
Published: 27 December 2022
Copyright: © 2022 by the authors.
Licensee MDPI, Basel, Switzerland.
This article is an open access article
distributed under the terms and
conditions of the Creative Commons
Attribution (CC BY) license (https://
creativecommons.org/licenses/by/
4.0/).
1. Introduction
Mass spectrometry is widely used to analyze biological samples, especially since the
introduction of soft ionization methods: matrix-assisted laser desorption and ionization
(MALDI) and electrospray ionization (ESI). Aside from highly sensitive targeted analysis
of particular molecular species [1,2], mass spectrometry is capable of the rapid analysis of
complex mixtures. This capability has a variety of applications, such as proteomics [3,4]
and lipidomics [5,6]. Combined with advances in ambient ionization mass spectrometry [7–12],
as well as methods of multidimensional data processing and machine learning [13–17],
mass spectrometry enables “fingerprinting” approaches. A sample is analyzed under
reproducible conditions, then the measured mass spectra are regarded as multidimensional
data (intensity values corresponding to the m/z values) irrespective of the molecular species
responsible for the signal. Then, depending on the aims of the research, a classifier or
regression model is built using these spectra. Further samples can then be assigned a specific
class (e.g., clinical diagnosis [11,12,18,19]) by a classification model or characteristic value
(e.g., tumor cell percentage [20,21]) by a regression model using only their mass spectra,
which can be obtained relatively quickly and do not require any subjective interpretation.
Most of the time, classifiers are built using mass spectrometric data from a single
polarity, resolution, and m/z range. These specific parameters depend on the variety of
molecules corresponding to the underlying biological concept. However, such a limitation
is not necessary as multiple classes or subclasses of biomolecules could be affected by the
investigated pathological condition. In the case of cancer lipidomics, it is crucial that lipid
groups differ significantly in their ability to produce positive or negative ions, so using
one polarity only significantly restricts the chemical space used for sample discrimination.
The investigation of tumor tissue samples using inline cartridge extraction mass spectrometry (ICE-MS) demonstrates [22] that negative-ion mode spectra are more variable than
their positive-ion mode counterparts. If the spectra are measured in the narrow m/z range
(e.g., m/z 500–1000), the peaks can be attributed to lipids [23], while in the wider range
(m/z 100–2000) other molecules contribute to the signal, such as a neural tissue marker
NAA (m/z 174) [24]. The motivation for this work was to develop a classification approach
that uses MS data from different regimes to provide a more robust and reliable classifier.
In the theory of machine learning, weak learners are expected to provide classifiers
that are only slightly better than a random guess, in contrast to strong learners, which
are expected to provide highly accurate predictions. Ensemble learning methods merge
several weak learner algorithm implementations to create a strong learner [25]. This is the
idea behind bootstrap aggregating, random forests, and boosting algorithms such as the
XGBoost [26] algorithm, which was used in this work to build individual (single-regime)
classifiers. These classifiers, trained to discriminate between mass spectra of glioblastoma
and non-tumorous tissue samples, achieve more than 95% accuracy on the same data
that were used to train them; however, when classifying an “unfamiliar” data set,
the accuracy drops to 80%. This situation is called overfitting and is typical for
high-dimensional data such as mass spectra.
In this work, we apply an additional aggregation step: we combine the results
of ensemble learners built on the data for individual regimes and show that this approach
is able to overcome overfitting in a way similar to how ensemble learning overcomes the
low accuracy of weak classifiers. We use the XGBoost algorithm to model the relationship
between mass spectrometry data and the diagnosis assigned to the spectra. We implement
two ways to aggregate the XGBoost scan-level predictions into a sample-level prediction
and to aggregate predictions made from different mass spectrometry regimes into a final
sample prediction.
2. Materials and Methods
The samples of glioblastoma and non-tumorous pathologies were obtained from
the N.N. Burdenko National Scientific and Practical Center for Neurosurgery (NSPCN)
and analyzed under an approved N.N. Burdenko NSPCN Institutional Review Board protocol in accordance with the Helsinki Declaration as revised in 2013. A signed informed
consent explicitly noting that all removed tissues can be used for further research was
obtained from all patients. Brain tumor tissues were resected during elective surgery and
non-tumor pathological tissues were resected in the course of surgery for drug-resistant
epilepsy. In total, 55 biopsy samples from 41 glioblastoma patients and 8 samples from
8 non-tumor patients were obtained for training and testing subsets. An additional
26 biopsy samples with known tumor cell percentage (ranging from 0% to 100%) from 11
glioblastoma patients were taken as a validation subset.
The molecular profiles of tissue sample fragments (up to 3 fragments per sample) were
acquired using ICE-MS [12] with an LTQ XL Orbitrap mass spectrometer (Thermo Fisher
Scientific, San Jose, CA, USA) in eight different regimes of measurement (all combinations of
positive (+) or negative (−) ion mode, high (H) or low (L) resolution, and m/z range
100–2000 (w) or 500–1000 (n)) to cover as many molecular features as possible [22]. Briefly,
the sequence of regimes (30 s per regime) is Lw−, Hw−, Hn−, Ln−, Lw+, Hw+, Hn+,
Ln+, after which it repeats from the start. This sequence is defined by an instrumental method
created in the XCalibur software, and the method was kept consistent throughout
the experiments. To produce a stable spray and ionization, the following parameters
were used: solvent flow 3.0 µL/min, spray voltage 2.9 kV for positive ions and 3.5 kV
for negative ions, and capillary temperature 220 °C. The solvent consisted of 3:3:3:1 (vol.)
methanol:isopropanol:acetonitrile:water with 0.1% (vol.) acetic acid. HPLC-grade solvents and
acid were obtained from Merck (Merck KGaA, Darmstadt, Germany).
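The regime notation can be made concrete with a short sketch. It is only an illustration: the function names and the ASCII "-" used in place of the minus sign are our assumptions, and switching time between regimes is ignored.

```python
from itertools import product

# Labels follow the notation of the text: resolution H/L, m/z range w/n, polarity -/+.
RESOLUTIONS = ("H", "L")
MZ_RANGES = {"w": (100, 2000), "n": (500, 1000)}
POLARITIES = ("-", "+")

# Acquisition order quoted in the text (30 s per regime, then the cycle repeats).
SEQUENCE = ["Lw-", "Hw-", "Hn-", "Ln-", "Lw+", "Hw+", "Hn+", "Ln+"]


def all_regimes():
    """Return every resolution/range/polarity combination as a label such as 'Hw-'."""
    return {f"{res}{rng}{pol}"
            for res, rng, pol in product(RESOLUTIONS, MZ_RANGES, POLARITIES)}


def regime_at(seconds_elapsed, dwell_s=30):
    """Regime measured at a given time (hypothetical helper, no switching overhead)."""
    return SEQUENCE[int(seconds_elapsed // dwell_s) % len(SEQUENCE)]
```

The quoted sequence visits each of the eight combinations exactly once per cycle, so `set(SEQUENCE) == all_regimes()` holds.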
The data were converted from the proprietary RAW format into the CDF file format, in which
a spectrum is represented as a set of scans and each scan is a table with
two columns: m/z and the corresponding intensity values. Each scan also has
a total ion current attribute and a time attribute, i.e., the time elapsed since the start of
the measurement. In total, 165 fragments of tumor samples and 23 fragments of non-tumor
samples were measured.
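The scan layout described above can be sketched as a small data structure. The class and field names are hypothetical and do not come from the CDF specification or from the published workflow; dividing intensities by the total ion current is shown only as one common way of using that attribute.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Scan:
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs
    total_ion_current: float          # TIC attribute of the scan
    time_s: float                     # time elapsed since the start of the measurement

    def relative_intensities(self) -> List[Tuple[float, float]]:
        """Intensities divided by the total ion current (assumed normalization)."""
        if self.total_ion_current <= 0:
            raise ValueError("total ion current must be positive")
        return [(mz, inten / self.total_ion_current) for mz, inten in self.peaks]


@dataclass
class Spectrum:
    regime: str        # e.g. "Hw-"
    scans: List[Scan]  # a spectrum is a set of scans
```

For example, `Scan([(174.0, 50.0), (760.5, 150.0)], 200.0, 1.5).relative_intensities()` yields the pairs `(174.0, 0.25)` and `(760.5, 0.75)`.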
All simulations were performed in a standard machine-learning cross-validation
setup. An algorithm for producing classifier models was applied to each of the eight MS
measurement regimes as follows. The spectra were transformed into a feature matrix with
individual scans as rows in accordance with the guideline described in [21]. First, the
intensities of the spectra were calibrated with regard to the total ion current. Next, m/z
alignment and peak detection were performed, followed by the filtering out of peaks present
in fewer than 10 scans per diagnosis. The resulting list of mass spectrometry peaks was
transformed into the matrix of peak intensities. Then, all the rows corresponding to the
samples with known tumor cell percentage data were set aside and used as the validation
set. Columns with little variability (sd < 0.2) and columns with high mutual correlation
(rho > 0.8) were excluded from the matrix. The resulting matrix rows were divided into
training and testing sets in a 60/40 ratio, ensuring that scans from the same tissue fragment
were always assigned to the same set. From the training set matrix, which contained approx.
5000–7000 rows (depending on the measurement regime), 600 rows were chosen at random and
used for the optimization of the XGBoost algorithm metaparameters. The optimal set of
metaparameters was then used to build a classifier using the training set, and the classifier
performance was evaluated using the testing set. To assign a class to the spectrum,
predictions for individual scans were aggregated either using voting or averaging the class
probability (these are referred to as aggregation methods below).
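The column filtering and the fragment-aware split can be sketched as follows. This is a minimal illustration and not the published KNIME workflow: the greedy left-to-right order of the correlation filter, the use of the absolute correlation, and the split by fragment count (which only approximates a 60/40 split of rows) are our assumptions, as are all function names.

```python
import random
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def select_columns(matrix, sd_min=0.2, rho_max=0.8):
    """Indices of columns kept after the variability and correlation filters.

    sd_min must be positive so that every kept column has non-zero variance.
    """
    columns = list(zip(*matrix))
    kept = []
    for j, col in enumerate(columns):
        if stdev(col) < sd_min:
            continue  # too little variability
        if any(abs(pearson(col, columns[k])) > rho_max for k in kept):
            continue  # highly correlated with a column that is already kept
        kept.append(j)
    return kept


def grouped_split(fragment_ids, train_fraction=0.6, seed=0):
    """Row indices for training and testing; a fragment never spans both sets."""
    fragments = sorted(set(fragment_ids))
    random.Random(seed).shuffle(fragments)
    n_train = round(train_fraction * len(fragments))
    train_fragments = set(fragments[:n_train])
    train = [i for i, f in enumerate(fragment_ids) if f in train_fragments]
    test = [i for i, f in enumerate(fragment_ids) if f not in train_fragments]
    return train, test
```

Splitting by fragment rather than by row is what prevents near-duplicate scans of one fragment from appearing in both sets, which would otherwise inflate the testing accuracy.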
The 16 tables obtained (8 measurement regimes and 2 aggregation methods) were
aggregated into one, using the same method that was used for the individual scans: either
voting or a mean probability (meanP) calculation. These aggregated results characterize
clinical samples, with all spectra from a sample contributing to the final prediction.
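A minimal sketch of this sample-level aggregation is given below. It simplifies the procedure: each spectrum-level result is represented by a single tumor score, the 0.5 threshold is assumed, ties in voting are resolved in favor of the tumor class, and the function name and row layout are hypothetical.

```python
from collections import defaultdict


def aggregate_by_sample(rows, method="vote", threshold=0.5):
    """Pool spectrum-level results from all regimes into one label per sample.

    rows: iterable of (sample_id, regime, p_tumor) tuples.
    Returns a dict {sample_id: "tumor" | "non-tumor"}.
    """
    scores = defaultdict(list)
    for sample_id, _regime, p_tumor in rows:
        scores[sample_id].append(p_tumor)

    labels = {}
    for sample_id, probs in scores.items():
        if method == "vote":
            # fraction of spectra whose own prediction is "tumor"
            score = sum(p >= threshold for p in probs) / len(probs)
        elif method == "meanP":
            score = sum(probs) / len(probs)
        else:
            raise ValueError(f"unknown method: {method}")
        labels[sample_id] = "tumor" if score >= threshold else "non-tumor"
    return labels
```

Because every spectrum of a sample contributes, a regime that is unreliable on its own can be outvoted or averaged out by the others.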
The mass spectra were preprocessed in the R environment, version 4.0.4,
with the R package MALDIquant [27], and model analysis was performed with the
KNIME Analytics Platform ver. 4.5.2 [28]. The classification algorithm was implemented
as a KNIME workflow and is available at the KNIME Community Hub via the link
https://kni.me/w/dWtqs1_6S2XVP6EG (accessed on 26 December 2022).
3. Results
The data were split into three sets based on the tissue sample data: the validation set
contains all samples, for which tumor cell percentage was evaluated manually by a pathologist [21], and the rest of the data were split into training and testing subsets in a 60/40 ratio.
The results obtained for the testing subset are shown in Table 1. The voting-based classifiers
achieve higher accuracy for all regimes, with the sole exception of positive ion mode,
narrow m/z range, low-resolution regime. The probability-based classifiers show a similar
sensitivity to the voting-based classifiers; however, their specificity is markedly lower,
which indicates that the voting-based classifiers are less susceptible to overfitting. A more balanced
dataset could resolve the overfitting issue (as the number of tumor tissue samples was
five times that of non-tumor control samples). Ethics and epidemiology limit the number
of available tissue samples, especially control samples, which leads to a disproportion
between tumor and non-tumor tissue numbers in all studies dedicated to the analysis of ex
vivo human brain pathological tissues.
Table 1. The classifiers validated on the testing subset. Accuracy: the ratio of correctly identified
samples to all samples; sensitivity: the ratio of correctly identified tumor samples to all tumor
samples; specificity: the ratio of correctly identified non-tumor pathology samples to all non-tumor
pathology samples.
Resolution  Mode  m/z Range  Aggregation  Sensitivity  Specificity  F-Measure  Accuracy
High        Neg   Wide       vote         1.0          0.833        0.989      0.980
High        Neg   Wide       meanP        1.0          0.556        0.971      0.947
High        Neg   Narrow     vote         1.0          0.833        0.989      0.980
High        Neg   Narrow     meanP        1.0          0.611        0.975      0.954
High        Pos   Wide       vote         0.978        0.889        0.981      0.967
High        Pos   Wide       meanP        1.0          0.333        0.957      0.921
High        Pos   Narrow     vote         0.978        0.833        0.978      0.961
High        Pos   Narrow     meanP        0.993        0.278        0.95       0.908
Low         Neg   Wide       meanP        0.943        0.913        0.960      0.937
Low         Neg   Wide       vote         0.968        0.797        0.959      0.934
Low         Neg   Narrow     vote         0.973        0.754        0.959      0.932
Low         Neg   Narrow     meanP        0.997        0.377        0.931      0.881
Low         Pos   Wide       vote         0.996        0.879        0.984      0.974
Low         Pos   Wide       meanP        0.992        0.862        0.980      0.967
Low         Pos   Narrow     meanP        1.0          0.673        0.966      0.942
Low         Pos   Narrow     vote         0.979        0.769        0.965      0.942
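The quantities reported in Table 1 can be computed from confusion-matrix counts as sketched below, with tumor as the positive class. The caption does not define the F-measure; reading it as the F1 score is our assumption, and the counts in the usage note are made up rather than taken from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Metrics from confusion-matrix counts, with tumor as the positive class.

    tp/fn: tumor samples classified correctly/incorrectly;
    tn/fp: non-tumor samples classified correctly/incorrectly.
    Both classes must be non-empty.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # F1 score (assumed meaning of "F-Measure")
        "f_measure": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
```

With the hypothetical counts `classification_metrics(tp=45, fn=0, tn=5, fp=1)`, the rounded values are 1.0, 0.833, 0.989, and 0.980. They coincide with the first row of Table 1, which is consistent with the F1 reading but does not confirm it, since other counts give the same rounded values.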
The results obtained using the same algorithm for the validation set are shown in
Table 2. The accuracy and overall quality are significantly lower for the validation set,
and the overfitting problem is considerably more pronounced.
For more than half of all regimes, the probability-based classifiers show a specificity
of zero, meaning that no non-tumor sample was assigned the correct class. The vote-based
classifiers show better but still poor specificity.
Table 2. The classifiers validated on the validation subset. Accuracy: the ratio of correctly identified
samples to all samples; sensitivity: the ratio of correctly identified tumor samples to all tumor
samples; specificity: the ratio of correctly identified non-tumor pathology samples to all non-tumor
pathology samples.
Resolution  Mode  m/z Range  Aggregation  Sensitivity  Specificity  F-Measure  Accuracy
High        Neg   Wide       vote         1.0          0.25         0.914      0.85
High        Neg   Wide       meanP        1.0          0.0          0.889      0.8
High        Neg   Narrow     meanP        1.0          0.0          0.889      0.8
High        Neg   Narrow     vote         1.0          0.0          0.889      0.8
High        Pos   Wide       meanP        1.0          0.0          0.889      0.8
High        Pos   Wide       vote         0.813        0.75         0.867      0.8
High        Pos   Narrow     vote         1.0          0.5          0.941      0.9
High        Pos   Narrow     meanP        1.0          0.0          0.889      0.8
Low         Neg   Wide       vote         0.992        0.263        0.943      0.897
Low         Neg   Wide       meanP        0.865        0.579        0.897      0.828
Low         Neg   Narrow     vote         0.981        0.316        0.934      0.882
Low         Neg   Narrow     meanP        1.0          0.111        0.931      0.873
Low         Pos   Wide       meanP        1.0          0.438        0.958      0.924
Low         Pos   Wide       vote         1.0          0.353        0.949      0.908
Low         Pos   Narrow     meanP        1.0          0.0          0.929      0.867
Low         Pos   Narrow     vote         0.981        0.063        0.923      0.858
The aggregation of predictions for eight regimes resulted in better performance than
classifiers built using any single regime: out of 26 clinical samples from the validation
set, only one sample was classified incorrectly (the same one for the voting-based and the
mean probability-based algorithms). Notably, this sample contained 44% tumor cells and
56% non-tumor cells and therefore presents the most challenging case for the classifiers
in general, as the molecular signatures of both tumor and non-tumor cells are present in
such samples in nearly equal proportions. We have chosen the subset of samples with known
tumor cell percentage (TCP) as a validation set to be able to understand the influence of
the sample heterogeneity on the performance of the classifier. The TCP quantitatively
evaluated by histology was not accessed by the classifier during training in any way.
It would therefore be incorrect to say that an arbitrary threshold was set so that
samples with TCP higher than 50% were considered a tumor; rather, the classifier we
developed turned out to separate samples with TCP above 50% from samples
with TCP below 50%. Strictly speaking, the threshold TCP value should be chosen to be
consistent with clinical requirements and considerations that will arise when the proposed
method is applied to more specific problems. Training classifiers
for a predetermined threshold value, however, requires a much larger set of histologically evaluated
samples with a TCP estimate. For the complete set of samples, 100% sensitivity and 97.5%
specificity were obtained, corresponding to 1 wrong classification in 214 samples for both
voting- and probability-based aggregated classifiers.
4. Discussion
The most common data format in statistical analysis and machine learning is a feature
matrix. In the case of mass spectra, the columns of such matrices correspond to m/z values
and the rows correspond to the scans. Separate columns can be used to add metadata,
such as spectra identifiers, information about the tissue and the patient, etc. A column that
contains clinical diagnosis can be used to build a classifier [12], and a column containing an
estimate of TCP can be used to train a regression model that predicts this percentage [21].
Notably, the diagnosis should be associated with a clinical biopsy sample rather than a
patient because the particular sample can sometimes be taken from tumor margins and
contain non-tumor as well as tumor cells. From the molecular point of view, such samples
may be more similar to non-tumorous brain tissues.
When building a feature matrix, the intensities are often averaged over all scans of a
spectrum or even a tissue fragment (clinical sample). This reduces the noise and smooths
the signal. However, after such averaging, the feature matrix becomes too “wide”: the
number of columns is greater than the number of rows, sometimes by orders of magnitude.
Some methods of statistical analysis are not applicable to such cases. For example,
in discriminant analysis, this is called the Small Sample Size (SSS) problem.
Sampling techniques, such as bootstrapping, are used to overcome the SSS problem.
Bootstrapping is a procedure that produces a multitude of data sets from a single input data
set using sampling with replacement. Each resulting data set is used to build a separate
classifier. The outputs from these classifiers are then aggregated through averaging or
voting, producing the final result.
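The bootstrap-and-aggregate procedure described above can be sketched with the standard library alone. The base learner here is a deliberately simple one-dimensional threshold rule chosen for illustration (it assumes class 1 has the larger mean); it is not the classifier used in this work, and all names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_samples(data, n_sets, seed=0):
    """Draw n_sets resamples of len(data) items each, sampling with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_sets)]


def fit_midpoint(sample):
    """Toy base learner: threshold halfway between the two class means."""
    zeros = [v for v, y in sample if y == 0]
    ones = [v for v, y in sample if y == 1]
    if not zeros or not ones:
        only = 1 if ones else 0  # the resample happened to contain one class only
        return lambda x: only
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > cut)


def bagged_predict(train, x, fit=fit_midpoint, n_sets=25, seed=0):
    """Train one model per bootstrap resample and return the majority vote for x."""
    models = [fit(sample) for sample in bootstrap_samples(train, n_sets, seed)]
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Replacing the majority vote with the mean of predicted probabilities gives the averaging variant mentioned in the text.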
In this work, the matrix still consists of individual scans, so it is kept “long,” which
lessens the impact of the small number of biopsy samples. Variations in peak intensity between
scans have two causes: changes in the extraction process and random differences in
the ionization and ion cloud manipulation inside the mass spectrometer. The former can
provide information about the cell composition of the tissue and the sample heterogeneity.
The latter persist even in scans of a perfectly homogeneous sample and can be put to
use: when trained on a large number of scans with random noise, the resulting algorithm is
less prone to overfitting, which is a major problem of machine learning. This means that
analyzing new data that were not present at the training stage should not affect the accuracy
of prediction significantly. In this work, this proved to be the case. Due to the random
component in the peak intensity, retaining the individual scans in the feature matrix can be
considered an “instrumental bootstrapping.”
Usually, machine learning is performed using a single dataset. However, the direct
aggregation of the multimodal data into one dataset does not help with the overfitting
problem, as an overfitted weak learner trained on the merged data remains weak and overfitted.
In this work, eight datasets obtained using different MS measurement regimes (polarity,
resolution, and m/z range) were used. Since no specific method for the aggregation of such multimodal data
has been proposed yet, a joint meta-classifier was produced from all eight separate datasets
using the mixture of experts approach. Two options for obtaining the final result, namely
voting and probability averaging, were investigated.
In voting, a class was predicted for each scan (within a single-regime classifier) or for
each regime (within the aggregated classifier), and the class predicted most often became the
final prediction. In probability averaging, the probability of belonging to each class was
calculated using XGBoost for each scan or regime separately, then these probabilities were
averaged, and the data were assigned a class using probability thresholds.
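The two rules can be written compactly for the two-class case. The notation is ours, not the paper's: p_i is the tumor probability of item i (a scan or a regime), N is the number of items, and the threshold tau = 0.5 and the handling of ties are assumptions.

```latex
% p_i : tumor probability returned for item i (a scan or a regime), i = 1..N
% \tau : probability threshold (assumed to be 0.5)
\hat{y}_{\mathrm{vote}}
  = \mathbb{1}\!\left[\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\,[\,p_i \ge \tau\,] \ge \frac{1}{2}\right],
\qquad
\hat{y}_{\mathrm{meanP}}
  = \mathbb{1}\!\left[\frac{1}{N}\sum_{i=1}^{N} p_i \ge \tau\right]

% Worked example with p = (0.9, 0.4, 0.45) and \tau = 0.5:
% vote : one of three items votes tumor, 1/3 < 1/2, so \hat{y}_{\mathrm{vote}} = 0 (non-tumor)
% meanP: (0.9 + 0.4 + 0.45)/3 \approx 0.583 \ge 0.5, so \hat{y}_{\mathrm{meanP}} = 1 (tumor)
```

The example shows that the two rules can disagree on the same inputs: a single confident prediction shifts the mean probability but counts as only one vote.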
For a validation set, we used data that were previously annotated by a professional
pathologist to build a regression model for tumor cell percentage prediction [21]. This way,
additional information on tumor cell percentages is available for tumor samples. This
can be important because there were tumor samples with as little as 5% of actual tumor
cells. This quantitative value enables the investigation of algorithm robustness concerning
the heterogeneity of clinical samples. This approach has been useful because it provided a
likely explanation for the only sample for which the class was incorrectly assigned by both
meta-classifiers (voting-based and probability-based): it was highly heterogeneous, as
shown above.
It is worth noting that since eight regimes are measured instead of one, the measurement itself takes significantly more time. Under the current ICE-MS experimental protocol,
the cartridge preparation takes no more than 2 min, and one regime is measured in 30 s.
Some additional (device- and regime-specific) time needs to be added to account for switching between regimes, so for this work, eight regimes are measured in 4.8 min. Taking all
the steps together, measuring eight regimes takes slightly less than 7 min, while one regime
can be measured in 2.5 min. Performing a technical repeat immediately after measuring
the eight regimes shows that the sample is not significantly exhausted during this time [22].
Despite this increase in time, improved classification performance is important for the
development of the method, especially if it is aimed at clinical practice.
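The timing figures quoted above follow from simple arithmetic, sketched below. The total switching overhead of 48 s is not stated in the text; it is inferred here from the quoted 4.8 min for eight regimes of 30 s each, and the function name is hypothetical.

```python
def protocol_minutes(n_regimes, dwell_s=30.0, cartridge_min=2.0, switch_overhead_s=0.0):
    """Upper estimate of the total time per sample, in minutes.

    cartridge_min is the "no more than 2 min" preparation time from the text;
    switch_overhead_s is the total time lost to regime switching (device-specific).
    """
    return cartridge_min + (n_regimes * dwell_s + switch_overhead_s) / 60.0
```

Under these assumptions, `protocol_minutes(8, switch_overhead_s=48)` gives approximately 6.8 min and `protocol_minutes(1)` gives 2.5 min, in line with the values in the text.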
The obvious next step should be estimating the contribution of each measurement
regime to the prediction of the aggregated classifier so that an informed decision could be
made about which regimes should be measured. Preliminary analysis shows that positive
polarity alone gives slightly better results than negative because of a lower number of false
positives. This task, however, is not as simple as it seems. Simple sequential discard of one
regime from classification would only work in the case of a linear contribution of regimes.
Since various classes of lipids and metabolites could be detected in different polarities,
it is obvious that the aggregation of positive and negative modes data will improve the
classification results; however, the effect of spectra resolution, as well as the mass range,
is not clear. Analysis of Table 1 shows that the wider mass range provides better specificity
for all combinations of polarity and resolution, and high-resolution data give better accuracy than
low-resolution data. In some combinations, however, the contribution of polarities is
the opposite: with voting, positive mode provides better results than negative mode in
low-resolution data but worse results in high-resolution data. The most reliable way to estimate such
contributions is a calculation of each regime’s Shapley value, but this procedure is quite
time-consuming, and taking the overfitting problem into account, it should be repeated
with several train/test data splits. Such calculations are an ongoing project of the group,
but they are still far from complete.
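To illustrate why this calculation is expensive, the sketch below computes exact Shapley values for a set of regimes. The value function stands for the performance of the aggregated classifier restricted to a subset of regimes; the toy function used here is invented for illustration and does not reflect measured results.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player; value(frozenset) -> float."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # weight of coalitions of size k that do not contain p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_accuracy(regimes):
    """Hypothetical value function: combining both polarities helps."""
    if not regimes:
        return 0.5
    polarities = {label[-1] for label in regimes}
    return 0.95 if polarities == {"+", "-"} else 0.8
```

For the three hypothetical regimes `["Hw-", "Hw+", "Ln+"]`, the values are 0.2, 0.125, and 0.125, and they sum to the difference between the full and the empty coalition (0.45). With eight regimes, the value function must be evaluated on 2^8 = 256 subsets, each requiring a re-evaluation of the aggregated classifier, and the whole procedure has to be repeated for every train/test split.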
5. Conclusions
Tumor versus non-tumorous pathology classifiers using a single mass spectrometric
measurement regime provide suboptimal performance. We have shown that the performance
can be significantly improved by aggregating predictions from several such
classifiers that use measurement regimes differing in polarity, m/z range,
and resolution. Additional information on the tumor cell percentage of the samples is
beneficial for interpreting predictions obtained from such classifiers but is extremely
cumbersome to obtain. It is shown that, in contrast to vote-based classification, probability-based classifiers
result in lower accuracy due to overfitting that originates from the limited availability of
non-tumor pathological human brain tissue samples. However, the mixture of experts
approach provides the same classification efficiency for both aggregation methods. Thus,
the collection and analysis of multimode data simplify the application of machine
learning methods to the unbalanced sample cohorts typical of clinical applications. We are
estimating the contribution of each regime to the final classifier accuracy for the future
development of a reliable tumor tissue classifier.
Author Contributions: Conceptualization, A.A.S.; methodology, A.A.S.; software, A.A.S. and D.S.Z.;
validation, A.A.S., D.S.Z. and S.I.P.; formal analysis, A.A.S., D.S.B. and D.S.Z.; investigation, D.S.B.,
V.A.E. and S.I.P.; resources, E.N.N. and I.A.P.; data curation, D.S.Z.; writing: original draft preparation, A.A.S.; writing: review and editing, A.A.S., D.S.B., D.S.Z., S.I.P. and I.A.P.; supervision, A.A.S.
and I.A.P.; project administration, K.V.B.; funding acquisition, I.A.P. All authors have read and agreed
to the published version of the manuscript.
Funding: The research was supported by the Ministry of Science and Higher Education of the Russian
Federation, project no. 0714-2020-0006. The research used the equipment of the Shared Research
Facilities of the Semenov Federal Research Center for Chemical Physics RAS.
Institutional Review Board Statement: The clinical samples were analyzed under an approved N.N.
Burdenko NSPCN Institutional Review Board protocol in accordance with the Helsinki Declaration
as revised in 2013. The study was conducted in accordance with the recommendations of the ethical
committee of the N.N. Burdenko NSPCN order Nr.40 from 12.04.2016 revised by order Nr. 131 from
17.07.2018. A signed informed consent explicitly noting that all removed tissues can be used for
further research was obtained from all patients.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: The data described in the manuscript will not be made available in
accordance with the indication of the Ethics Committee. The code related to this study is available at
https://kni.me/w/dWtqs1_6S2XVP6EG.
Conflicts of Interest: The authors declare no competing interests.
References
1. Allen, D.; McWhinney, B. Quadrupole Time-of-Flight Mass Spectrometry: A Paradigm Shift in Toxicology Screening Applications.
Clin. Biochem. Rev. 2019, 40, 135–146.
2. Lange, V.; Picotti, P.; Domon, B.; Aebersold, R. Selected reaction monitoring for quantitative proteomics: A tutorial. Mol. Syst.
Biol. 2008, 4, 222.
3. Messner, C.; Demichev, V.; Bloomfield, N.; Yu, J.; White, M.; Kreidl, M.; Egger, A.-S.; Freiwald, A.; Ivosev, G.; Wasim, F.; et al.
Ultra-fast proteomics with Scanning SWATH. Nat. Biotechnol. 2021, 39, 846–854.
4. Comai, L.; Katz, J.; Mallick, P. Proteomics; Humana: New York, NY, USA, 2017; ISBN 978-1-4939-6747-6.
5. Yang, K.; Han, X. Lipidomics: Techniques, Applications, and Outcomes Related to Biomedical Sciences. Trends Biochem. Sci. 2016,
41, 954–969.
6. Pradas, I.; Huynh, K.; Cabré, R.; Ayala, V.; Meikle, P.; Jové, M.; Pamplona, R. Lipidomics Reveals a Tissue-Specific Fingerprint.
Front. Physiol. 2018, 9, 1165.
7. Alberici, R.; Simas, R.; Sanvido, G.; Romão, W.; Lalli, P.; Benassi, M.; Cunha, I.; Eberlin, M. Ambient mass spectrometry: Bringing
MS into the “real world”. Anal. Bioanal. Chem. 2010, 398, 265–294.
8. Eberlin, L.; Norton, I.; Orringer, D.; Dunn, I.; Liu, X.; Ide, J.; Jarmusch, A.; Ligon, K.L.; Jolesz, F.; Golby, A.; et al. Ambient mass
spectrometry for the intraoperative molecular diagnosis of human brain tumors. Proc. Natl. Acad. Sci. USA 2013, 110, 1611–1616.
9. Schäfer, K.-C.; Dénes, J.; Albrecht, K.; Szaniszló, T.; Balog, J.; Skoumal, R.; Katona, M.; Tóth, M.; Balogh, L.; Takáts, Z. In vivo,
in situ tissue analysis using rapid evaporative ionization mass spectrometry. Angew. Chem. Int. Ed. 2009, 48, 8240–8242.
10. Ogrinc, N.; Saudemont, P.; Balog, J.; Robin, Y.-M.; Gimeno, J.-P.; Pascal, Q.; Tierny, D.; Takats, Z.; Salzet, M.; Fournier, I.
Water-assisted laser desorption/ionization mass spectrometry for minimally invasive in vivo and real-time surface analysis using
SpiderMass. Nat. Protoc. 2019, 14, 3162–3182.
11. King, M.; Zhang, J.; Lin, J.; Garza, K.; DeHoog, R.; Feider, C.; Bensussan, A.; Sans, M.; Krieger, A.; Badal, S.; et al. Rapid diagnosis
and tumor margin assessment during pancreatic cancer surgery with the MasSpec Pen technology. Proc. Natl. Acad. Sci. USA
2021, 118, e2104411118.
12. Pekov, S.; Eliferov, V.; Sorokin, A.; Shurkhay, V.; Zhvansky, E.; Vorobyev, A.; Potapov, A.; Nikolaev, E.; Popov, I. Inline cartridge
extraction for rapid brain tumor tissue identification by molecular profiling. Sci. Rep. 2019, 9, 18960.
13. Gredell, D.; Schroeder, A.; Belk, K.; Broeckling, C.; Heuberger, A.; Kim, S.-Y.; King, D.; Shackelford, S.; Sharp, J.; Wheeler, T.; et al.
Comparison of Machine Learning Algorithms for Predictive Modeling of Beef Attributes Using Rapid Evaporative Ionization
Mass Spectrometry (REIMS) Data. Sci. Rep. 2019, 9, 5721.
14. De Bruyne, K.; Slabbinck, B.; Waegeman, W.; Vauterin, P.; De Baets, B.; Vandamme, P. Bacterial species identification from
MALDI-TOF mass spectra through data analysis and machine learning. Syst. Appl. Microbiol. 2011, 34, 20–29.
15. Ji, H.; Deng, H.; Lu, H.; Zhang, Z. Predicting a molecular fingerprint from an electron ionization mass spectrum with deep neural
networks. Anal. Chem. 2020, 92, 8649–8653.
16. Li, T.; Chen, L.; Gan, M. Quality control of imbalanced mass spectra from isotopic labeling experiments. BMC Bioinform. 2019,
20, 549.
17. Zhvansky, E.; Sorokin, A.; Shurkhay, V.; Zavorotnyuk, D.; Bormotov, D.; Pekov, S.; Potapov, A.; Nikolaev, E.; Popov, I. Comparison
of Dimensionality Reduction Methods in Mass Spectra of Astrocytoma and Glioblastoma Tissues. Mass Spectrom. (Tokyo) 2021,
10, A0094.
18. Eberlin, L.; Norton, I.; Dill, A.; Golby, A.; Ligon, K.; Santagata, S.; Cooks, R.; Agar, N. Classifying Human Brain Tumors by Lipid
Imaging with Mass Spectrometry. Cancer Res. 2012, 72, 645–654.
19. Clark, A.; Calligaris, D.; Regan, M.; Krummel, D.; Agar, J.; Kallay, L.; MacDonald, T.; Schniederjan, M.; Santagata, S.;
Pomeroy, S.; et al. Rapid discrimination of pediatric brain tumors by mass spectrometry imaging. J. Neurooncol. 2018, 140,
269–279.
20. Pirro, V.; Jarmusch, A.; Alfaro, C.; Hattab, E.; Cohen-Gadol, A.; Cooks, R. Utility of neurological smears for intrasurgical brain
cancer diagnostics and tumour cell percentage by DESI-MS. Analyst 2017, 42, 449–454.
21. Pekov, S.; Bormotov, D.; Nikitin, P.; Sorokin, A.; Shurkhay, V.; Eliferov, V.; Zavorotnyuk, D.; Potapov, A.; Nikolaev, E.; Popov, I.
Rapid estimation of tumor cell percentage in brain tissue biopsy samples using inline cartridge extraction mass spectrometry.
Anal. Bioanal. Chem. 2021, 413, 2913–2922.
22. Zhvansky, E.; Eliferov, V.; Sorokin, A.; Shurkhay, V.; Pekov, S.; Bormotov, D.; Ivanov, D.; Zavorotnyuk, D.; Bocharov, K.;
Khalliullin, I.; et al. Assessment of variation of inline cartridge extraction mass spectra. J. Mass Spectrom. 2020, 56, e4640.
23. Pekov, S.; Sorokin, A.; Kuzin, A.; Bocharov, K.; Bormotov, D.; Shivalin, A.; Shurkhay, V.; Potapov, A.; Nikolaev, E.; Popov, I.
Analysis of Phosphatidylcholines Alterations in Human Glioblastomas Ex Vivo. Biochem. Mosc. Suppl. Ser. B 2021, 15, 241–247.
24. Yannell, K.; Smith, K.; Alfaro, C.; Jarmusch, A.; Pirro, V.; Cooks, R. N-Acetylaspartate and 2-Hydroxyglutarate Assessed in
Human Brain Tissue by Mass Spectrometry as Neuronal Markers of Oncogenesis. Clin. Chem. 2017, 63, 1766–1767.
25. Schapire, R. The strength of weak learnability. Mach. Learn. 1990, 5, 197–227.
26. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD '16), San Francisco, CA, USA, 13–17 August 2016; Association for
Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
27. Gibb, S.; Strimmer, K. MALDIquant: A versatile R package for the analysis of mass spectrometry data. Bioinformatics 2012, 28,
2270–2271.
28. Berthold, M.; Cebron, N.; Dill, F.; Gabriel, T.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. KNIME: The Konstanz
Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization; Data Analysis, Machine Learning and
Applications; Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2008.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.