International Journal of Environmental Research and Public Health
Article
Detection of Major Depressive Disorder Based on a Combination of Voice Features: An Exploratory Approach
Masakazu Higuchi 1, * , Mitsuteru Nakamura 1 , Shuji Shinohara 2 , Yasuhiro Omiya 3 , Takeshi Takano 3 ,
Daisuke Mizuguchi 3 , Noriaki Sonota 1 , Hiroyuki Toda 4 , Taku Saito 4 , Mirai So 5 , Eiji Takayama 6 ,
Hiroo Terashi 7 , Shunji Mitsuyoshi 1 and Shinichi Tokuno 1,8
1 Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8656, Japan
2 School of Science and Engineering, Tokyo Denki University, Saitama 350-0394, Japan
3 PST Inc., Yokohama 231-0023, Japan
4 Department of Psychiatry, School of Medicine, National Defense Medical College, Saitama 359-8513, Japan
5 Department of Neuropsychiatry, Tokyo Dental College, Tokyo 101-0061, Japan
6 Department of Oral Biochemistry, Asahi University School of Dentistry, Gifu 501-0296, Japan
7 Department of Neurology, Tokyo Medical University, Tokyo 160-8402, Japan
8 Graduate School of Health Innovation, Kanagawa University of Human Services, Yokosuka 210-0821, Japan
* Correspondence: higuchi@bioeng.t.u-tokyo.ac.jp

Citation: Higuchi, M.; Nakamura, M.; Shinohara, S.; Omiya, Y.; Takano, T.; Mizuguchi, D.; Sonota, N.; Toda, H.; Saito, T.; So, M.; et al. Detection of Major Depressive Disorder Based on a Combination of Voice Features: An Exploratory Approach. Int. J. Environ. Res. Public Health 2022, 19, 11397.
Abstract: It is common knowledge that people’s feelings are reflected in their voice and facial expressions. This work focuses on developing techniques for detecting depression based on the acoustic properties of the voice. In this study, we developed a composite index of vocal acoustic properties that can be used for depression detection. Voice recordings were collected from patients undergoing outpatient treatment for major depressive disorder at a hospital or clinic following a physician’s diagnosis. Numerous features were extracted from the collected audio data using the openSMILE software, and qualitatively similar features were combined using principal component analysis. The resulting components were incorporated as parameters in a logistic regression-based classifier, which achieved a diagnostic accuracy of ~90% on the training set and ~80% on the test set. The proposed metric could thus serve as a new measure for the evaluation of major depressive disorder.
Keywords: voice analysis; major depressive disorder; logistic regression
Academic Editors: Paul B.
Tchounwou and Karlijn Massar
Received: 23 June 2022
Accepted: 7 September 2022
Published: 10 September 2022
1. Introduction
The importance of mental health care in managing different types of stress in modern
society is increasingly recognized around the world. Stress has negative effects on people’s
health and mood in daily life, and its accumulation can cause mental and behavioral
dysfunction in the long term [1]. Besides their impact on individuals, such disorders
result in serious economic costs to society because of their association with reduction of
lifetime earnings and labor productivity [2,3]. This state of affairs requires technologies
that are capable of easily checking for mental illnesses such as depression, for which early
intervention is associated with higher remission rates [4–6].
Previously, some researchers focused on identifying biomarkers in saliva and blood for
use in depression screening [7–9]. For example, Maes et al. proposed interleukin-1 receptor
antagonist (IL-1ra) as a diagnostic biomarker for major depressive disorder (MDD), finding
its serum concentration to be increased in affected individuals [9]. However, besides being
invasive, diagnostic body fluid testing incurs additional costs because of the need for
special measurement equipment and reagents. Self-report psychological questionnaires
such as the Patient Health Questionnaire 9 (PHQ9), General Health Questionnaire (GHQ),
and Beck Depression Inventory (BDI) are non-invasive alternatives commonly used by
Int. J. Environ. Res. Public Health 2022, 19, 11397. https://doi.org/10.3390/ijerph191811397
https://www.mdpi.com/journal/ijerph
clinicians [10–12]. They are relatively simple to administer, but suffer from the inherent drawback of reporting bias, i.e., certain symptoms/behaviors being over- and/or under-endorsed depending on respondents’ awareness of them (or lack thereof) [13]. This bias can be mitigated by assessments conducted by physicians, such as the Hamilton Depression Rating Scale (HDRS); however, the extra time involved limits the number of interviews that can be administered [14].
Feelings are reflected in people’s voice and facial expressions, and this common
knowledge has also been scientifically substantiated [15,16]. Such evidence has driven a
recent surge of research interest in acoustic biomarkers for predicting depression and stress
levels [17–19]. Simplicity is a major advantage of such approaches, i.e., voice recordings
can be collected non-invasively and remotely without any specialized equipment besides a
microphone. Furthermore, they reduce subjectivity in diagnosis since recording data are
processed algorithmically; thus, they avoid reporting bias that is inherent to self-report
assessments, holding promise for detecting a variety of mental illnesses. For example,
Mundt et al. recorded MDD patients reading a standard script via a telephone interface,
and calculated a selection of vocal acoustic properties such as the duration and ratio of
vocalizations and silences, fundamental frequency (F0), and first and second formants (F1,
F2). Several of these measures were markedly different between patients who responded to
treatment and those who did not [18]. Our research group has also focused on depression’s
association with emotional expression. In a previous study, we developed composite
metrics for quantifying mental health—“vitality” and “mental activity”—that combine
different emotional components of the voice [19]. In subsequent work, we showed evidence
for this measure’s effectiveness in detecting depression and monitoring stress due to
life events [20,21]. A weak correlation between vitality and BDI score was confirmed,
suggesting that some voice features correlated with the BDI score. Still, some limitations of
this measure have become apparent. First, since diseases besides depression affect how emotions are expressed, it is challenging to resolve whether abnormal vitality and mental activity truly indicate depression or instead reflect a different condition. Furthermore, since classification accuracy showed large variation across facilities in some cases, vitality and mental activity might depend on the recording environment.
openSMILE is a platform for deriving extensive sets of acoustic features from audio
data [22], which has been recently applied by several studies in the field of speech diagnostics. Jiang et al. developed a novel computational methodology for detecting depression
based on vocal acoustic features extracted using openSMILE from recorded speech in three
categories of emotion (positive, neutral, negative). Despite obtaining high accuracy for
depression detection, the development of separate models for men and women slightly
complicated their application in practice [23]. Faurholt-Jepsen et al. extracted openSMILE
features from voice recordings of patients with bipolar disorder, and attempted to use
them to classify their depressive and manic symptoms. Their feature-based classification accurately matched manic and depressive symptoms as measured by the Young Mania Rating Scale (YMRS [24]) and HDRS, respectively. Nevertheless, their algorithm utilizes an
immense number of features, posing a risk of overfitting [25]. Focusing on mel-frequency
cepstrum coefficients (MFCCs), Taguchi et al. reported a significant difference in the second
coefficient (MFCC2), which represents spectral energy in the 2000–3000 Hz band, between
the voices of MDD patients and healthy controls. However, their analysis included only
one type of feature, and did not combine multiple features [26]. In a previous study, we
proposed a voice index based on openSMILE features, which could accurately differentiate between three subject groups: patients with major depressive disorder, patients with
bipolar disorder, and healthy individuals. Still, the proposed measure requires further
validation because our training data were drawn from a small sample [27].
The aim of this paper is to develop a composite index based on vocal acoustic features
that can accurately differentiate patients with major depressive disorder from healthy adults.
Depressed patients and non-depressed controls were recorded reading a set of fixed phrases;
the recorded data were split into training and test datasets. Features were extracted from
the training data using openSMILE, and qualitatively similar features were mathematically
aggregated by means of principal component analysis. These components were used as predictors in a logistic regression model to classify subjects. The classification performance of our proposed indicator was tested on the recordings of the test dataset, achieving a diagnostic accuracy of approximately 80%.
2. Materials and Methods
2.1. Ethical Considerations
This study was approved by the institutional review board of The University of Tokyo
(no. 11572).
2.2. Subjects
Our study enrolled 306 subjects from five institutions in total. Depressed subjects were
recruited from individuals receiving outpatient treatment for major depressive disorder at
Ginza Taimei Clinic (“C”: 87) or National Defense Medical College Hospital (“H1”: 90). For
the control group, self-reported healthy adults were recruited from the National Defense
Medical College Hospital (14), Tokyo Medical University Hospital (“H2”: 23), Asahi
University (“U1”: 38), and The University of Tokyo (“U2”: 54). Patients gave informed
consent after receiving the study information at their first assessment. Controls gave
informed consent after receiving the study information at a health workshop held by the
authors, either individually or in small groups. Patients aged 20 years or older were enrolled
if they met the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition, Text Revision (DSM-IV-TR) diagnostic criteria for major depressive disorder [28]. Candidates
with severe physical disability or organic brain disease were excluded. The subjects
were diagnosed by a psychiatrist using the Mini-International Neuropsychiatric Interview
(M.I.N.I.) [29]. The details for the subjects recruited from each facility are summarized in
Table 1.
Table 1. Patient information by facility.

Facility  Gender   Number of Depressed Subjects  Number of Healthy Subjects  Age (Mean ± SD)
C         Male     32                            0                           32.7 ± 6.6
C         Female   55                            0                           31.6 ± 8.6
C         Total    87                            0                           32.0 ± 7.9
H1        Male     46                            10                          48.1 ± 12.9
H1        Female   44                            4                           61.2 ± 13.9
H1        Total    90                            14                          54.1 ± 14.8
H2        Male     0                             12                          47.6 ± 10.6
H2        Female   0                             11                          59.8 ± 13.3
H2        Total    0                             23                          53.4 ± 13.2
U1        Male     0                             14                          37.1 ± 18.0
U1        Female   0                             24                          37.3 ± 17.8
U1        Total    0                             38                          37.2 ± 17.6
U2        Male     0                             25                          37.9 ± 8.1
U2        Female   0                             29                          47.0 ± 11.7
U2        Total    0                             54                          42.8 ± 11.1
The severity of depression was assessed by physicians using the HDRS. Several standards for HDRS severity rating have been reported [30]; following the precedent of Riedel et al. [31], we interpreted an HDRS total score of less than 8 as indicating depression in remission and excluded such patients from the analysis. Our final study population
consisted of 102 patients (“depressed group (HDRS ≥ 8)”) and 129 healthy adults (“normal
group”). The screening outcomes of patients recruited for the “depressed group” are
summarized in Table 2.
Table 2. Clinical information of MDD patients. The remission group comprises patients with HDRS < 8; the depression group comprises patients with HDRS ≥ 8. M and F denote male and female, respectively.

Facility  Group       Number of Subjects   HDRS (Mean ± SD)  Analysis
C         Remission   10 (M: 7, F: 3)      4.8 ± 1.4         Not used
C         Depression  77 (M: 25, F: 52)    24.3 ± 8.6        Used
C         Total       87 (M: 32, F: 55)    22.1 ± 10.2       -
H1        Remission   65 (M: 27, F: 38)    2.2 ± 2.2         Not used
H1        Depression  25 (M: 11, F: 14)    15.3 ± 7.3        Used
H1        Total       90 (M: 38, F: 52)    5.8 ± 7.3         -
2.3. Voice Data
Voice recordings were made in an examination room (C, H1, H2) or conference room
(U1, U2). Each subject read aloud 10 set phrases in Japanese, given along with their English
translations in Table 3. Voice recordings were acquired at 24 bit and 96 kHz resolution
using a Roland R-26 portable digital audio recorder (Hamamatsu, Japan) and Olympus
ME52W lavalier microphone (Tokyo, Japan).
Table 3. Set phrases read aloud by subjects.

Phrase Number  Japanese Phrase                English Translation
P1             Totemo genki desu              I am very cheerful
P2             Kinō wa yoku nemuremashita     I slept well yesterday
P3             Shokuyoku ga arimasu           I have an appetite
P4             Kokoro ga odayaka desu         My heart is calm
P5             Tsukarete guttari shiteimasu   I’m dead tired
P6             Okorippoi desu                 I am irritable
P7             I–ro–ha–ni–ho–he–to            (No meaning; like “a–b–c”)
P8             Honjitsu wa seiten nari        “It is fine today” (standard radio test)
P9             Mukashi mukashi aru tokoro ni  “Once upon a time, there lived...”
P10            Garapagosu shotō               Galapagos Islands
2.4. Voice Analysis
Each audio file was first normalized to minimize differences in volume due to the recording environment, and then segmented by phrase. Each phrase was processed independently to extract vocal features. Various scripts are available for automatically calculating different sets of features from audio data using openSMILE. Our study used “the large openSMILE emotion feature set”, developed for use in emotion recognition. We used 6552 (56 × 3 × 39) audio features computed as follows:
I. 56 types of acoustic/physical quantities were calculated at the frame level as low-level descriptors: fast Fourier transform (FFT) coefficients, MFCCs, voiced speech probability, zero-crossing rate, signal energy, F0, and so on.
II. 3 types of temporal statistics were derived from these frame-level descriptors: moving average, first-order change over time (“delta”), and second-order change over time (“delta-delta”).
III. 39 types of statistical functionals were calculated at the file (phrase) level from the frame-level values: mean, maximum, minimum, centroid, quartiles, variance, kurtosis, skewness, and so on.
Each feature was averaged for every subject across the 10 set phrases to obtain mean
values for analysis. The processed data were split by facility of origin into a training set (C,
H2, U1, U2) and test set (H1). Next, the classification algorithm was trained on the training
data using the derived features.
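The three-step computation above can be illustrated with a toy stand-in. This sketch is not the actual openSMILE configuration: it uses only two illustrative low-level descriptors (log energy and zero-crossing rate) instead of 56, a hypothetical frame size, and four functionals instead of 39, but it mirrors the structure of the pipeline (frame-level descriptors, temporal variants, phrase-level functionals, per-subject averaging):

```python
import numpy as np

def frame_signal(x, frame_len=2048, hop=1024):
    """Split a waveform into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def low_level_descriptors(frames):
    """Two toy frame-level LLDs: log energy and zero-crossing rate."""
    energy = np.log(np.sum(frames**2, axis=1) + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr])               # shape: (n_lld, n_frames)

def temporal_variants(lld):
    """Step II: moving average, delta, and delta-delta of each contour."""
    kernel = np.ones(3) / 3.0
    smooth = np.stack([np.convolve(c, kernel, mode="same") for c in lld])
    delta = np.gradient(lld, axis=1)
    return np.concatenate([smooth, delta, np.gradient(delta, axis=1)])

def functionals(contours):
    """Step III: phrase-level statistics over each frame-level contour."""
    return np.concatenate([contours.mean(axis=1), contours.max(axis=1),
                           contours.min(axis=1), contours.std(axis=1)])

def phrase_features(x):
    return functionals(temporal_variants(low_level_descriptors(frame_signal(x))))

def subject_features(phrases):
    """Average each feature across a subject's 10 phrases."""
    return np.mean([phrase_features(p) for p in phrases], axis=0)
```

With 2 LLDs × 3 temporal variants × 4 functionals, this toy version yields 24 features per subject, by the same multiplication that gives 56 × 3 × 39 = 6552 in the real feature set.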
Since models tend to overfit if trained on too many variables, we reduced dimensionality through a combination of receiver operating characteristic (ROC) analysis and principal component analysis (PCA). First, each feature’s ability to independently distinguish depressed from normal adults was quantified as the area under the corresponding ROC curve (AUC); only features exceeding a certain threshold were selected. Next, highly correlated features were transformed into principal components. We deliberately selected fewer features than the sample size, since PCA cannot be applied when there are more features than subjects because the resulting correlation matrix is rank-deficient. These components’
ability to predict depression was tested by logistic regression with L2 regularization (ridge
regression) [32]. The regularization parameters were optimized by cross-validation within
the training set. Since data were split randomly during cross-validation, different values
were computed for the regularization parameters every time the algorithm was run, causing
downstream variation in regression coefficients. To stabilize the results of training, the
model was trained several times, and each coefficient was averaged across runs to obtain
the final weights.
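A minimal sketch of this screening-plus-reduction pipeline, assuming scikit-learn is available; the AUC screen, standardization, PCA, cross-validated ridge logistic regression, and coefficient averaging mirror the steps described, but the data, threshold handling, and hyperparameters here are illustrative stand-ins, not the authors' exact R workflow:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def select_by_auc(X, y, threshold=0.869):
    """Keep features whose individual ROC-AUC meets the threshold."""
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    mask = np.maximum(aucs, 1.0 - aucs) >= threshold   # direction-agnostic
    return X[:, mask], mask

def fit_index(X, y, n_components=3, n_runs=20, seed=0):
    """Standardize, project onto 3 principal components, then fit
    L2-regularized logistic regression with CV-chosen regularization;
    repeat with different random CV splits and average the coefficients
    to stabilize the final weights."""
    Z = PCA(n_components=n_components).fit_transform(
        StandardScaler().fit_transform(X))
    intercepts, coefs = [], []
    for run in range(n_runs):
        cv = StratifiedKFold(5, shuffle=True, random_state=seed + run)
        clf = LogisticRegressionCV(penalty="l2", cv=cv, max_iter=1000).fit(Z, y)
        intercepts.append(clf.intercept_[0])
        coefs.append(clf.coef_[0])
    return float(np.mean(intercepts)), np.mean(coefs, axis=0)
```

Averaging over `n_runs` random cross-validation splits corresponds to the stabilization step described above, where each run may choose a slightly different regularization strength.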
The classification model was a logistic function of the linear sum of the three PCs (parameters) weighted by their respective coefficients (weights). This function’s output was adopted as the classification index for major depressive disorder. The diagnostic performance of this metric was tested on the voice recordings of the test dataset (H1).
The training and test datasets could also have been determined by splitting the data independently of the institution. However, to facilitate future applications, we chose a split in which only some of the recording environments, rather than all of them, are represented in the data used to build the index. This allows us to assess how well the index generalizes to voices recorded in environments that were not used for training. Since H1 comprised both healthy and depressed subjects, we used the speech data of H1 subjects as the test data.
Statistical processing was conducted using the free software R (version 4.0.2) [33].
3. Results
3.1. Feature Selection
openSMILE features whose AUC met our threshold (≥0.869) were selected. The resulting model incorporated 187 features, fewer than the number of recordings in the training dataset (n = 192).
3.2. Principal Component Analysis
Through PCA, three components were extracted from the 187 features of the training
data, cumulatively accounting for 80% of the observed variance.
3.3. Logistic Regression with Regularization
Logistic regression with L2 regularization using the three components as predictors
was performed 20 times. Table 4 presents the obtained mean regression coefficients.
Table 4. Regression coefficients (mean).

Term                     Regression Coefficient
Intercept                −0.421
Principal component 1    0.0251
Principal component 2    −0.0178
Principal component 3    −0.0105
Our novel indicator—Major Depression Discrimination Index (MDDI)—is given by
the following formula:
MDDI = 1 / (1 + exp(0.421 − 0.0251 × PC1 + 0.0178 × PC2 + 0.0105 × PC3))    (1)
where PC1, PC2, and PC3 correspond to the first, second, and third PCA components,
respectively, as described above. Our classifier was trained on the training data 20 times in
total; each time, the MDDI’s ability to distinguish depressed from non-depressed subjects
was quantified by ROC analysis. The best cut-off value indicated by each curve was recorded and then averaged across the 20 trials. The confusion matrix in Table 5
summarizes the diagnostic performance of this aggregated cut-off value on the training set;
it achieved excellent diagnostic performance (sensitivity: 0.935, specificity: 0.896, accuracy:
0.911). For comparison purposes, the ROC curve in Figure 1 displays the classifier’s
performance on the training set over a range of MDDI values (AUC: 0.97). When the
classifier was run using the best cut-off value indicated by this curve, its performance was
excellent and comparable to that achieved by the mean cut-off value (sensitivity: 0.948,
specificity: 0.896, accuracy: 0.917).
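Equation (1) transcribes directly into code; the coefficients below are the Table 4 means, and the inputs are a subject's three principal component scores:

```python
import math

def mddi(pc1, pc2, pc3):
    """Major Depression Discrimination Index, Equation (1):
    a logistic function of the three PCA component scores,
    using the mean regression coefficients from Table 4."""
    return 1.0 / (1.0 + math.exp(0.421 - 0.0251 * pc1
                                       + 0.0178 * pc2
                                       + 0.0105 * pc3))
```

Because the PC1 coefficient is positive while PC2 and PC3 are negative, MDDI increases with PC1 and decreases with PC2 and PC3; at the origin (all scores zero) it evaluates to about 0.396.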
Table 5. Classification performance of MDDI on training data (confusion matrix).

                     Predicted Depressed Group  Predicted Healthy Group
Actual depressed     72                         5
Actual healthy       12                         103
Figure 1. Classification performance of MDDI on training data (ROC curve).
3.4. Model Testing
The classification performance of the MDDI cut-off value derived above in distinguishing depressed from normal subjects was tested on the test dataset. Table 6 shows the
resulting confusion matrix (sensitivity: 0.800, specificity: 0.786, accuracy: 0.795).
Table 6. Classification performance of MDDI cut-off on test data (confusion matrix).

                     Predicted Depressed Group  Predicted Healthy Group
Actual depressed     20                         5
Actual healthy       3                          11
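The reported test-set metrics follow directly from the confusion-matrix counts; a minimal check, with cell assignments inferred from the reported sensitivity and specificity:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# Test-set counts: 20 true positives, 5 false negatives,
# 3 false positives, 11 true negatives.
sens, spec, acc = metrics(20, 5, 3, 11)
```

These counts reproduce the reported values: sensitivity 20/25 = 0.800, specificity 11/14 ≈ 0.786, accuracy 31/39 ≈ 0.795.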
3.5. Effects of Recording Environment (Facility) and HDRS Score
Figure 2 shows the distributions of MDDI among subjects recruited from each facility. “H1dep” and “H1nor” denote depressed and normal subjects recruited from H1,
respectively. Normal subjects’ mean MDDI was compared between facilities using the
Steel–Dwass method for distribution-free multiple comparisons [34]. The differences between the following pairs were statistically significant: H1nor vs. U2 (p = 0.00123**), H2 vs.
U1 (p = 0.0186*), and H2 vs. U2 (p = 0.0000133**). The differences between the following
pairs were non-significant: H1nor vs. H2 (p = 0.999), H1nor vs. U1 (p = 0.0866), and U1
vs. U2 (p = 0.727).
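scipy provides no Steel–Dwass implementation, so the sketch below substitutes a related distribution-free procedure: all pairwise two-sided Mann–Whitney U tests with Holm adjustment. It illustrates the multiple-comparison structure of the analysis, not the exact test used in the paper:

```python
import itertools
import numpy as np
from scipy import stats

def pairwise_tests(groups):
    """All pairwise two-sided Mann-Whitney U tests with Holm step-down
    correction. `groups` maps a group label to a sample of MDDI values.
    Returns {(label_a, label_b): adjusted p-value}."""
    pairs = list(itertools.combinations(sorted(groups), 2))
    raw = [stats.mannwhitneyu(groups[a], groups[b],
                              alternative="two-sided").pvalue
           for a, b in pairs]
    m = len(raw)
    adj, running_max = {}, 0.0
    for rank, idx in enumerate(np.argsort(raw)):
        p = min(1.0, (m - rank) * raw[idx])
        running_max = max(running_max, p)     # enforce monotonicity
        adj[pairs[idx]] = running_max
    return adj
```

Steel–Dwass additionally accounts for the family of comparisons through a studentized-range-type distribution; the Holm-adjusted Mann–Whitney version above is more conservative in some configurations but serves the same purpose of controlling family-wise error across facility pairs.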
Figure 2. MDDI distributions by facility.
Depressed subjects’ mean MDDI was compared using Welch’s t-test (assuming unequal variances). The difference between C and H1dep was statistically significant (t(40.06) = 2.49, p = 0.0170*). Since these subgroups also had dissimilar
distributions of HDRS score, we conjectured that the observed difference in MDDI was
attributable to an underlying difference in depression severity between patients at the
two facilities, and tested our hypothesis using the analysis of covariance (ANCOVA) with
HDRS score as the covariate. Figure 3 presents scatterplots of HDRS score versus MDDI
for depressed patients at each of the two facilities. Regression lines (predicting MDDI) are
indicated in red. First, the interaction between facility and HDRS score (reflected by the degree of parallelism between the red lines in Figure 3a,b) was not significant (p = 0.577), thereby confirming a homogeneous correlation between the covariate and MDDI at each
facility (a necessary assumption of ANCOVA). Next, the residuals between observed and
predicted MDDI from the covariate were compared between facilities using ANOVA, and
the difference was not significant (p = 0.159).
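The ANCOVA logic above (test the facility-by-HDRS interaction first, then compare facilities after adjusting for the covariate) can be sketched as nested-model F-tests. This is a generic ordinary-least-squares formulation with synthetic data and a 0/1 facility dummy, not the authors' actual R code:

```python
import numpy as np
from scipy import stats

def ols_rss(X, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def f_test(X_full, X_reduced, y):
    """Nested-model F-test: does the full model fit significantly better?"""
    rss_f, rss_r = ols_rss(X_full, y), ols_rss(X_reduced, y)
    df1 = X_full.shape[1] - X_reduced.shape[1]
    df2 = len(y) - X_full.shape[1]
    F = ((rss_r - rss_f) / df1) / (rss_f / df2)
    return F, stats.f.sf(F, df1, df2)

def ancova(hdrs, facility, mddi):
    """facility: 0/1 dummy. Returns p-values for (1) the interaction,
    i.e., slope homogeneity, and (2) the facility effect given HDRS."""
    one = np.ones_like(hdrs)
    X_int  = np.column_stack([one, hdrs, facility, hdrs * facility])
    X_main = np.column_stack([one, hdrs, facility])
    X_cov  = np.column_stack([one, hdrs])
    _, p_interaction = f_test(X_int, X_main, mddi)
    _, p_facility = f_test(X_main, X_cov, mddi)
    return p_interaction, p_facility
```

A non-significant interaction licenses the common-slope model, after which the facility term tests whether a group difference in MDDI remains once HDRS is accounted for, mirroring the two-stage reasoning in the text.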
The correlation coefficient between MDDI and HDRS score was 0.186 at C (p = 0.105), 0.278 at H1dep (p = 0.178), and 0.285 overall (p = 0.00374**).
Figure 3. Scatterplots of MDDI versus HDRS score of depressed subjects by facility: (a) C; (b) H1
(H1dep).
3.6. Age Differences
The age distributions of the depressed and healthy groups differ. As seen in Table 1, almost no elderly subjects are present in the depressed group. Since voice quality differs between young and elderly age groups, this could affect classification. We therefore conducted an analysis of covariance with age as the covariate, similar to Section 3.5. Figure 4a shows the age distributions of the depressed and healthy groups, and Figure 4b shows MDDI plotted against age. The blue and green lines in Figure 4b represent the regression lines for the MDDI of the depressed and healthy groups, respectively. No significant interaction was observed between group and age (p = 0.334); accordingly, the correlation between the covariate and MDDI is consistent across groups. A significant difference (p < 0.01**) was confirmed when the residuals between the observed and predicted MDDI were compared between groups using analysis of variance.
Figure 4. (a) Age distribution and (b) scatterplot of MDDI versus age for the depressed and
healthy groups.
3.7. Gender Differences
Figure 5 shows the gender-specific distributions of MDDI among subjects recruited from each facility. Within each facility, the mean MDDI was compared between men and women using Welch’s t-test. No significant gender-related differences were observed (C: t(47.03) = −0.37, p = 0.716; H1dep: t(6.91) = 1.24, p = 0.255; H1nor: t(8.46) = −1.04, p = 0.327; H2: t(18.69) = 0.20, p = 0.843; U1: t(20.21) = 0.28, p = 0.779; U2: t(47.29) = 0.73, p = 0.469).
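The per-facility gender comparison is a plain Welch's t-test, i.e., a two-sample t-test that does not assume equal variances; a minimal sketch on hypothetical MDDI samples (the values below are synthetic, not the study data):

```python
import numpy as np
from scipy import stats

# Hypothetical MDDI samples for men and women at one facility.
rng = np.random.default_rng(1)
men = rng.normal(0.40, 0.05, 25)
women = rng.normal(0.41, 0.05, 30)

# equal_var=False selects Welch's t-test with Satterthwaite
# degrees of freedom, matching the comparison described above.
t, p = stats.ttest_ind(men, women, equal_var=False)
```

The fractional degrees of freedom reported in the text (e.g., t(47.03)) are the Welch–Satterthwaite approximation produced by this form of the test.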
Figure 5. MDDI distributions by facility (gender-specific).
4. Discussion
Early in the model development, the fact that nearly 200 features computed by openSMILE exceeded our high cut-off value for feature selection (AUC ≥ 0.869) led us to expect an abundance of major differences in the vocal acoustic qualities of depressed patients compared
with those of normal adults. However, the fact that over 80% of their variance could be
explained by just three principal components suggested that these differences could be captured by a limited set of qualities. Nevertheless, the sheer number of features with strong
loadings on each component made it difficult to interpret what specific vocal properties
each represented. The nature of the vocal properties influenced by major depressive disorder thus remains unclear; to decipher the meaning of these components, audio features would need to be mapped to physiological attributes of the voice, which remains a topic for future work. The slightly fewer than 200 selected openSMILE features did not include any features related to F0 or MFCC2, which Mundt et al. [18] and Taguchi et al. [26] demonstrated to be effective in distinguishing patients with major depressive disorder. One likely reason is the difference in voice format, since Mundt et al. analyzed telephone speech and Taguchi et al. analyzed speech in a 16-bit, 22.05 kHz PCM format. In addition, MFCCs depend strongly on the content of speech, so differences in speech content may also have had an effect.
The MDDI is calculated by applying regularized logistic regression to three components derived by PCA. This criterion demonstrated very good classification performance (AUC > 0.95), distinguishing between depressed and normal subjects with sensitivity, specificity, and accuracy close to 0.9. This finding supports our expectation of major differences in vocal qualities between depressed and non-depressed adults, and suggests that such differences were properly captured by our classification criterion. Good performance was also observed when the MDDI was used to classify subjects in the test dataset, with sensitivity, specificity, and accuracy values close to 0.8, thus confirming that our algorithm did not overfit to the training data. However, further validation seems necessary because we excluded patients who were considered to be in remission (HDRS score < 8), meaning that the sample size of the test set was not fully preserved.
Despite normalizing the audio files before analysis to minimize the effects of the recording environment, there were indications that we were unable to eliminate sources of variability other than volume. For normal adults, significant differences in mean MDDI were observed between recordings made in hospital examination rooms and university conference rooms; however, facility-related differences were not observed between comparable environments (i.e., H1nor/H2 and U1/U2). Since none of the depressed patients were recorded in a conference room, it is unclear how environmental differences could affect the proposed model’s ability to distinguish them from controls. In addition, it is noteworthy that samples recorded in conference rooms tended to have lower MDDI than those recorded in examination rooms; if the same tendency occurred among patients, it could
compromise our model’s detection performance. On the other hand, the fact that depressed patients recorded at C had a significantly higher mean MDDI than those at H1dep was not attributable to environmental differences per se; instead, it could be explained by the fact that depression was more severe (i.e., HDRS scores were higher) in patients at C than at H1dep. Indeed, the difference in mean MDDI disappeared after adjustment for HDRS score; this supports the conclusion that the facility-related difference in MDDI actually originates from a facility-related difference in HDRS score. Furthermore, MDDI appears to reflect depression severity; this hypothesis is supported by the fact that, although MDDI did not significantly correlate with HDRS score at any individual facility, it did correlate, albeit weakly, across the entire sample.
Almost no elderly patients were included in the depressed group, while some elderly subjects were included in the healthy group. Since voice quality generally changes with age, this could have contributed to the classification. We therefore adjusted the MDDI of the two groups for age. A comparison of the adjusted means still showed a significant group difference, so we conclude that the difference in age distributions does not account for the MDDI’s discrimination between the groups. The reason for this is believed to be that subjects of all ages were included in the healthy group, and any features correlated with age were eliminated during training.
Hormonal changes affect voice characteristics [35] and could therefore impact MDDI discrimination, so this phenomenon needs to be considered. Women are more susceptible to hormonal fluctuations than men owing to menstruation; we therefore compared MDDI values between genders. We could not confirm statistical differences in MDDI among depressed or control subjects within any participating facility. Extracting features with a significant male-female difference from the selected openSMILE features yielded 43 and 59 features for the depressed and healthy groups, respectively; after excluding similar feature pairs from each group, 8 and 10 features remained. Accordingly, the absence of any male-female difference in the MDDI may be due to the absence of gender-varying features in the MDDI. Hormonal changes during menstruation are known to affect shimmer and jitter features of speech [36]; we did not detect statistical gender-based hormonal differences, presumably because the MDDI excludes shimmer and jitter features. A slight dissimilarity is visible between the MDDI distributions of men and women within H1dep and H1nor; while shimmer and jitter remain unaltered, several other features may have been affected by hormonal changes. Although gender-based hormonal changes need to be considered, minimal gender-based differences were observed in the MDDI. Since our classifier is gender-neutral, it is unnecessary to switch models according to patient gender, and it thus seems easier to implement than the methodology proposed by Jiang et al. [23].
A relationship between the sound of the voice and hormonal changes has also been reported [37]; however, such a correlation was not relevant to our MDDI discrimination.
Differences in experimental conditions have a vital impact on the accuracy of research in this field; therefore, subject conditions (age, gender, recording environment, etc.) must be kept as consistent as possible. However, since the health status of healthy subjects is based on self-reporting, verifying its reliability is difficult. Moreover, defective speech recordings may occur due to errors by the recording equipment operator, and comorbidities of depression may be missed due to diagnostic failure by the physician. A limitation of this study is the difficulty of completely eliminating these factors.
5. Conclusions
This study aimed to develop a composite index based on vocal acoustic features that
can accurately distinguish patients with major depressive disorder from healthy adults.
The data were split into a training set and test set in advance. The voice data in the training
set were processed using openSMILE to derive 6552 vocal acoustic features. To prevent
overfitting, the full set of features was screened to select those that seemed most useful
for classification. Next, dimensionality was further reduced by combining and transforming qualitatively similar features using PCA. Logistic regression with regularization was
then applied using the three resulting components as model parameters. The proposed criterion, the MDDI, distinguished depressed from healthy subjects with approximately 90% sensitivity, specificity, and accuracy in the training set, and approximately 80% in the test set. The near absence of gender-related differences in the MDDI provides further support for its potential efficacy in practice.
Still, several topics require further study. Clarification is needed on the nature of vocal
properties affected by depression. Differences in recording environments, such as between
examination rooms in hospitals and conference rooms in general use buildings, should be
eliminated as much as possible. Finally, the MDDI’s diagnostic performance should be
tested on larger samples of test data.
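The pipeline summarized above (feature screening, PCA to three components, regularized logistic regression yielding a continuous index) can be illustrated with a minimal sketch. The synthetic data, the scikit-learn `LogisticRegression` standing in for the regularized model of [32], and the use of the decision function as an MDDI-like index are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical stand-in for pre-screened openSMILE features:
# rows = voice recordings, columns = selected acoustic features.
n, d = 200, 30
y = rng.integers(0, 2, size=n)                   # 1 = depressed, 0 = healthy
X = rng.normal(size=(n, d)) + y[:, None] * 0.8   # class-dependent shift

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Reduce to three components, then fit a regularized logistic model;
# the decision function plays the role of the composite index (MDDI).
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    LogisticRegression(penalty="l2", C=1.0),
)
model.fit(X_tr, y_tr)

mddi_like = model.decision_function(X_te)  # one continuous score per recording
acc = model.score(X_te, y_te)
print(round(acc, 2))
```

Thresholding the continuous score at zero reproduces the binary depressed/healthy call; sweeping the threshold instead yields the ROC curve from which sensitivity and specificity are read off.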
Author Contributions: Conceptualization, S.T.; methodology, M.H. and S.T.; validation, M.H.; formal
analysis, M.H. and N.S.; investigation, M.H. and D.M.; resources, S.T.; data curation, M.H., M.N., S.S.,
Y.O., T.T., H.T. (Hiroyuki Toda), T.S., M.S., E.T., H.T. (Hiroo Terashi), S.M. and S.T.; writing—original
draft preparation, M.H.; writing—review and editing, S.T.; visualization, M.H.; supervision, S.T.;
project administration, S.T.; funding acquisition, S.T. and M.H. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was partially supported by the Center of Innovation Program from Japan
Science and Technology Agency. This research was partially supported by JSPS KAKENHI Grant
No. 20K12688.
Institutional Review Board Statement: The study was conducted according to the guidelines of the
Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of
the University of Tokyo (protocol code 11572).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author. The data are not publicly available due to privacy and ethical restrictions.
Acknowledgments: We thank the director of Ginza Taimei Clinic, Bun Chino, for assistance with data collection, and all participants for their participation.
Conflicts of Interest: M.H., M.N. and S.T. received financial support from PST Inc. until 2019 and currently report no financial support from the company. All other authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
IL-1ra: Interleukin-1 Receptor Antagonist
MDD: Major Depressive Disorder
PHQ9: Patient Health Questionnaire 9
GHQ: General Health Questionnaire
BDI: Beck Depression Inventory
HDRS: Hamilton Depression Rating Scale
F0: Fundamental Frequency
F1: First Formant
F2: Second Formant
YMRS: Young Mania Rating Scale
MFCC: Mel-frequency Cepstrum Coefficient
DSM-IV-TR: Diagnostic and Statistical Manual of Mental Disorders, 4th Edition, Text Revision
MINI: Mini-International Neuropsychiatric Interview
FFT: Fast Fourier Transform
ROC: Receiver Operating Characteristic
PCA: Principal Component Analysis
AUC: Area Under the Curve
MDDI: Major Depression Discrimination Index
References
1. Cohen, S.; Kessler, R.C.; Gordon, L.U. Measuring Stress: A Guide for Health and Social Scientists; Oxford University Press: Oxford, UK, 1997.
2. Perkins, A. Saving money by reducing stress. Harv. Bus. Rev. 1994, 72, 12.
3. Okumura, Y.; Higuchi, T. Cost of depression among adults in Japan. Prim. Care Companion CNS Disord. 2011, 13, e1–e9. [CrossRef] [PubMed]
4. Okuda, A.; Suzuki, T.; Kishi, T.; Yamanouchi, Y.; Umeda, K.; Haitoh, H.; Hashimoto, S.; Ozaki, N.; Iwata, N. Duration of untreated illness and antidepressant fluvoxamine response in major depressive disorder. Psychiatry Clin. Neurosci. 2010, 64, 268–273. [CrossRef] [PubMed]
5. Kayser, J.; Tenke, C.E. In Search of the Rosetta Stone for Scalp EEG: Converging on Reference-free Techniques. Clin. Neurophysiol. 2010, 121, 1973–1975. [CrossRef] [PubMed]
6. Koo, P.C.; Thome, J.; Berger, C.; Foley, P.; Hoeppner, J. Current source density analysis of resting state EEG in depression: A review. J. Neural. Transm. 2017, 124 (Suppl. 1), 109–118. [CrossRef]
7. Izawa, S.; Sugaya, N.; Shirotsuki, K.; Yamada, K.C.; Ogawa, N.; Ouchi, Y.; Nagano, Y.; Suzuki, K.; Nomura, S. Salivary dehydroepiandrosterone secretion in response to acute psychosocial stress and its correlations with biological and psychological changes. Biol. Psychol. 2008, 79, 294–298. [CrossRef]
8. Suzuki, G.; Tokuno, S.; Nibuya, M.; Ishida, T.; Yamamoto, T.; Mukai, Y.; Mitani, K.; Tsumatori, G.; Scott, D.; Shimizu, K. Decreased plasma brain-derived neurotrophic factor and vascular endothelial growth factor concentrations during military training. PLoS ONE 2014, 9, e89455. [CrossRef]
9. Maes, M.; Vandoolaeghe, E.; Ranjan, R.; Bosmans, E.; Bergmans, R.; Desnyder, R. Increased serum interleukin-1-receptor-antagonist concentrations in major depression. J. Affect. Disord. 1995, 36, 29–36. [CrossRef]
10. Kroenke, K.; Spitzer, R.L.; Williams, J.B. The PHQ-9: Validity of a brief depression severity measure. J. Gen. Intern. Med. 2001, 16, 606–613. [CrossRef]
11. Goldberg, D.P. Manual of the General Health Questionnaire; NFER Publishing: Windsor, ON, Canada, 1978.
12. Beck, A.T.; Ward, C.H.; Mendelson, M.; Mock, J.; Erbaugh, J. An inventory for measuring depression. Arch. Gen. Psychiatry 1961, 4, 561–571. [CrossRef]
13. Delgado-Rodríguez, M.; Llorca, J. Bias. J. Epidemiol. Community Health 2004, 58, 635–641. [CrossRef] [PubMed]
14. Hamilton, M. A rating scale for depression. J. Neurol. Neurosurg. Psychiatry 1960, 23, 56–62. [CrossRef] [PubMed]
15. Ekman, P. Facial expressions of emotion: New findings, new questions. Psychol. Sci. 1992, 3, 34–38. [CrossRef]
16. Kitahara, Y.; Tohkura, Y. Prosodic control to express emotions for man-machine speech interaction. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 1992, 75, 155–163.
17. Jan, A.; Meng, H.; Gaus, Y.F.A.; Zhang, F.; Turabzadeh, S. Automatic depression scale prediction using facial expression dynamics and regression. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, Orlando, FL, USA, 7 November 2014; pp. 73–80.
18. Mundt, J.C.; Vogel, A.P.; Feltner, D.E.; Lenderking, W.R. Vocal acoustic biomarkers of depression severity and treatment response. Biol. Psychiatry 2012, 72, 580–587. [CrossRef] [PubMed]
19. Shinohara, S.; Nakamura, M.; Omiya, Y.; Higuchi, M.; Hagiwara, N.; Mitsuyoshi, S.; Toda, H.; Saito, T.; Tanichi, M.; Yoshino, A.; et al. Depressive mood assessment method based on emotion level derived from voice: Comparison of voice features of individuals with major depressive disorders and healthy controls. Int. J. Environ. Res. Public Health 2021, 18, 5435. [CrossRef]
20. Hagiwara, N.; Omiya, Y.; Shinohara, S.; Nakamura, M.; Higuchi, M.; Mitsuyoshi, S.; Yasunaga, H.; Tokuno, S. Validity of Mind Monitoring System as a Mental Health Indicator using Voice. Adv. Sci. Technol. Eng. Syst. J. 2017, 2, 338–344. [CrossRef]
21. Higuchi, M.; Nakamura, M.; Shinohara, S.; Omiya, Y.; Takano, T.; Mitsuyoshi, S.; Tokuno, S. Effectiveness of a voice-based mental health evaluation system for mobile devices: Prospective study. JMIR Form. Res. 2020, 4, e16455. [CrossRef]
22. Eyben, F.; Wöllmer, M.; Schuller, B. openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 1459–1462.
23. Jiang, H.; Hu, B.; Liu, Z.; Yan, L.; Wang, T.; Liu, F.; Kang, H.; Li, X. Investigation of different speech types and emotions for detecting depression using different classifiers. Speech Commun. 2017, 90, 39–46. [CrossRef]
24. Young, R.C.; Biggs, J.T.; Ziegler, V.E.; Meyer, D.A. A Rating Scale for Mania: Reliability, Validity and Sensitivity. Br. J. Psychiatry 1978, 133, 429–435. [CrossRef]
25. Faurholt-Jepsen, M.; Busk, J.; Frost, M.; Vinberg, M.; Christensen, E.M.; Winther, O.; Bardram, J.E.; Kessing, L.V. Voice analysis as an objective state marker in bipolar disorder. Transl. Psychiatry 2016, 6, e856. [PubMed]
26. Taguchi, T.; Tachikawa, H.; Nemoto, K.; Suzuki, M.; Nagano, T.; Tachibana, R.; Nishimura, M.; Arai, T. Major depressive disorder discrimination using vocal acoustic features. J. Affect. Disord. 2018, 225, 214–220. [CrossRef] [PubMed]
27. Higuchi, M.; Tokuno, S.; Nakamura, M.; Shinohara, S.; Mitsuyoshi, S.; Omiya, Y.; Hagiwara, N.; Takano, T.; Toda, H.; Saito, T.; et al. Classification of bipolar disorder, major depressive disorder, and healthy state using voice. Asian J. Pharm. Clin. Res. 2018, 11, 89–93. [CrossRef]
28. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 4th ed.; Text Revision; Amer Psychiatric Pub Inc.: Washington, DC, USA, 2000.
29. Sheehan, D.V.; Lecrubier, Y.; Sheehan, K.H.; Amorim, P.; Janavs, J.; Weiller, E.; Hergueta, T.; Baker, R.; Dunbar, G.C. The Mini-International Neuropsychiatric Interview (M.I.N.I): The development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. J. Clin. Psychiatry 1998, 59 (Suppl. 20), 22–33.
30. Carrozzino, D.; Patierno, C.; Fava, G.A.; Guidi, J. The Hamilton Rating Scales for Depression: A Critical Review of Clinimetric Properties of Different Versions. Psychother. Psychosom. 2020, 89, 133–150. [CrossRef]
31. Riedel, M.; Möller, H.J.; Obermeier, M.; Schennach-Wolff, R.; Bauer, M.; Adli, M.; Kronmüller, K.; Nickel, T.; Brieger, P.; Laux, G.; et al. Response and remission criteria in major depression—A validation of current practice. J. Psychiatr. Res. 2010, 44, 1063–1068. [CrossRef]
32. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [CrossRef]
33. R: A Language and Environment for Statistical Computing. Available online: https://www.R-project.org/ (accessed on 25 April 2022).
34. Steel, R.G.D. A rank sum test for comparing all pairs of treatments. Technometrics 1960, 2, 197–207. [CrossRef]
35. Abitbol, J.; Abitbol, P.; Abitbol, B. Sex hormones and the female voice. J. Voice 1999, 13, 424–446. [CrossRef]
36. Chae, S.W.; Choi, G.; Kang, H.J.; Choi, J.O.; Jin, S.M. Clinical analysis of voice change as a parameter of premenstrual syndrome. J. Voice 2001, 15, 278–283.
37. Seltzer, L.J.; Prososki, A.R.; Ziegler, T.E.; Pollak, S.D. Instant messages vs. speech: Hormones and why we still need to hear each other. Evol. Hum. Behav. 2012, 33, 42–45. [CrossRef] [PubMed]