Received 27 October 2022; revised 17 January 2023; accepted 25 January 2023. Date of publication 6 February 2023; date of current version 23 February 2023. The review of this article was arranged by Associate Editor Hung-yi Lee. Digital Object Identifier 10.1109/OJSP.2023.3242862

Hierarchical Multi-Class Classification of Voice Disorders Using Self-Supervised Models and Glottal Features

SASKA TIRRONEN, SUDARSANA REDDY KADIRI (Member, IEEE), AND PAAVO ALKU (Fellow, IEEE)
Department of Information and Communications Engineering, Aalto University, 02150 Espoo, Finland
CORRESPONDING AUTHOR: SASKA TIRRONEN (e-mail: saska.tirronen@aalto.fi)

This work was supported in part by the Academy of Finland under Grant 330139 and in part by Aalto University (the MEC Program for India).

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

ABSTRACT Previous studies on the automatic classification of voice disorders have mostly investigated the binary classification task, which aims to distinguish pathological voice from healthy voice. Using multi-class classifiers, however, more fine-grained identification of voice disorders can be achieved, which is more helpful for clinical practitioners. Unfortunately, there is little publicly available training data for many voice disorders, which lowers the classification performance on data from unseen speakers. Earlier studies have shown that the usage of glottal source features can reduce data redundancy in the detection of laryngeal voice disorders. Another approach to tackle the problems caused by the scarcity of training data is to utilize deep learning models, such as wav2vec 2.0 and HuBERT, that have been pre-trained on larger databases. Since the aforementioned approaches have not been thoroughly studied in the multi-class classification of voice disorders, they are jointly studied in the present work. In addition, we study a hierarchical classifier, which enables task-wise feature optimization and more efficient utilization of data. In this work, the aforementioned three approaches are compared with traditional mel frequency cepstral coefficient (MFCC) features and one-vs-rest and one-vs-one support vector machine (SVM) classifiers. The results in a 3-class classification problem between healthy voice and two laryngeal disorders (hyperfunctional dysphonia and vocal fold paresis) indicate that all the studied methods outperform the baselines. The best performance was achieved by using features from wav2vec 2.0 LARGE together with hierarchical classification. The balanced classification accuracy of the system was 62.77% for male speakers and 55.36% for female speakers, which outperformed the baseline systems by an absolute improvement of 15.76% and 6.95% for male and female speakers, respectively.

INDEX TERMS Pathological voices, voice disorders, hierarchical classification, glottal source extraction, multi-class classification, Wav2vec, HuBERT.

I. INTRODUCTION

Automatic classification of voice disorders has been studied widely in the past two decades. The focus has been on the detection of voice disorders (i.e., the binary classification problem) [1], [2], [3], [4], [5], [6], [7], [8], while the classification of multiple voice disorders (i.e., the multi-class problem) [9], [10], [11], [12], [13] has remained less studied. In the detection problem, the automatic system distinguishes disordered voice from healthy voice.
As there are many voice disorders, including both organic and functional, a multi-class classifier, which enables the classification between healthy voice and several different disorders, would be more useful for clinical practitioners. In the current study, a 3-class voice pathology classification problem is studied by investigating the classification between two laryngeal voice disorders (hyperfunctional dysphonia and vocal fold paresis) and healthy voice.

Traditionally, automatic detection systems have been constructed as pipelines that consist of separate feature extraction and classification steps [2], [3], [4], [5], [6], [7], [8]. In the feature extraction, the voice signal is mapped into a vector in a suitably designed feature space. The mapped vector representations are then used by a machine learning algorithm to separate healthy voices from disordered voices. In contrast, some recent studies have investigated deep learning-based systems that combine the feature extraction and classification steps into a single neural network, which takes a voice signal (or its spectrogram) as input and gives the classification label as output [6], [14], [15]. Such systems are often referred to as end-to-end classifiers. In general, pipeline systems require smaller amounts of training data than end-to-end systems, because the classification problem of end-to-end systems is more complex, requiring the model to learn the optimal feature mapping from the data. As the amount of available training data in voice disorder databases is typically small, the focus of this paper is on pipeline systems. However, a small amount of training data is a problem for pipeline systems too, and it may cause low classification performance on unseen data. The problems caused by the scarcity of training data are particularly severe for multi-class classification tasks that call for training data representing several voice pathologies.

In the current study, we investigate three approaches to improve the multi-class classification of voice disorders based on the pipeline system architecture. The first approach corresponds to using the glottal source signal in feature extraction. The second approach corresponds to using self-supervised models as pre-trained feature extractors. The third approach corresponds to using a hierarchical multi-class classifier architecture. The first and second approaches are related to feature extraction, and the third one is related to classification. Hypothetically, the best benefit may be gained by using the feature-based and classification-based approaches together, and combining them with data augmentation methods, as proposed in [11] and [16]. However, data augmentation is outside the scope of the current study.

The first approach aims to take advantage of the source of voiced speech, the glottal excitation, in the feature extraction. The glottal excitation is first estimated using a glottal inverse filtering algorithm, and the estimated source signal is expressed using mel frequency cepstral coefficient (MFCC) features. This approach helps the classifier learn more generalizable functions from small data sets, because vocal tract information, which is removed by glottal inverse filtering, may be mostly redundant for the classification of the selected disorders.
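To illustrate the idea, the following minimal sketch computes MFCCs from an estimated glottal source signal. It is only an illustration under assumptions: the glottal inverse filtering step is represented by a crude single-pass LPC residual (crude_inverse_filter), which is a stand-in and not the QCP method used in this work, and the MFCC settings follow the ones given later in Section II-A.

import numpy as np
import librosa
import scipy.signal

def crude_inverse_filter(signal, sr, order=24):
    # Crude stand-in for glottal inverse filtering: a single LPC analysis of
    # the whole signal, whose residual roughly approximates the glottal source.
    # A real implementation would use the QCP method itself.
    a = librosa.lpc(signal, order=order)
    return scipy.signal.lfilter(a, [1.0], signal)

def extract_glottal_mfcc(wav_path, sr=16000, n_mfcc=13):
    # Load and resample the voice signal to 16 kHz.
    signal, _ = librosa.load(wav_path, sr=sr)
    # Estimate the glottal source (placeholder for glottal inverse filtering).
    glottal = crude_inverse_filter(signal, sr)
    # 13 MFCCs with 25 ms frames and a 5 ms shift, plus deltas and delta-deltas.
    mfcc = librosa.feature.mfcc(y=glottal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.005 * sr))
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2])  # shape: (3 * n_mfcc, n_frames)

The baseline MFCC features correspond to the same computation applied directly to the speech signal instead of the estimated glottal source.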
Glottal source features have been used in a few earlier studies on the automatic detection of voice pathologies [7], [8], [17]. However, the glottal source has not been used previously in multi-class classification tasks, where the problem of small data is most severe.

In the second approach, we take advantage of wav2vec 2.0 [18] and HuBERT [19], which are frameworks for self-supervised learning of representations from raw speech signals. The self-supervised models were pre-trained on databases used in automatic speech recognition (ASR). In the pre-training phase, the models have learned to extract features that generalize well to a variety of speech-related tasks and to unseen data. Therefore, the pre-trained models are used as feature extractors by utilizing their hidden layer outputs. Pre-trained self-supervised models have been used before to improve the performance of ASR systems in the recognition of disordered speech [20], [21]. The wav2vec 2.0 models have also been used, for example, for detection of aphasia [22], for detection of stuttering [23], and for speech rating of disordered children's speech [24]. Various pre-training approaches have been used to detect Alzheimer's disease [25], [26], and heart failure [27]. However, only a few studies have applied these techniques to the multi-class classification of voice disorders. In [28], the pre-trained VGGish model was used for feature extraction in several multi-class problems. In [29], transfer learning methods between three disorders were studied. However, to the best of our knowledge, the utilization of state-of-the-art self-supervised models as feature extractors in the multi-class classification of voice disorders has not been studied before.

The third approach aims to improve the multi-class classification of voice disorders by using a hierarchical classifier architecture that combines two binary classifiers into a 3-class classifier. In the first step, a binary classification is done between healthy and disordered voices. In the second step, the samples that were classified as disordered are classified into the two selected laryngeal disorders (a minimal sketch of this decision logic is given after the summary list below). This approach is an efficient way to use training data, as each voice sample can be utilized twice in learning the two sub-problems. In this work, support vector machines (SVMs) and fine-tuned self-supervised models are used as the two binary classifiers. Hierarchical architectures have been used in earlier works for the classification of laryngeal voice disorders [28], [30], [31], and to classify between dysarthria, apraxia of speech, and neurotypical speech [12]. The work in [17] evaluated hierarchical sub-problems individually instead of the full multi-class problem. However, the classification of the two voice pathologies studied in the current paper (hyperfunctional dysphonia and vocal fold paresis) has not been investigated before using hierarchical classifier architectures. One important criterion for the selection of these disorders was the relatively small amount of training data for the two disorders, which makes them difficult to classify.

In summary, this work studies the effectiveness of three different approaches to alleviate the data scarcity problem in the multi-class classification of voice disorders. These three approaches are highlighted with green color in Fig. 1, and they are:
1) The usage of glottal source signals in feature extraction.
2) The usage of a pre-trained self-supervised model (wav2vec 2.0 and HuBERT) as a feature extractor.
3) The usage of a hierarchical classifier.
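As a minimal sketch of the hierarchical decision logic described above, the following function combines two already trained binary classifiers (scikit-learn-style objects with a predict method; the argument names are placeholders) into a 3-class prediction.

import numpy as np

def hierarchical_predict(X, clf_healthy_vs_disordered, clf_dysphonia_vs_paresis):
    # Step 1: healthy (0) vs. disordered (1).
    step1 = clf_healthy_vs_disordered.predict(X)
    labels = np.zeros(len(X), dtype=int)          # 0 = healthy
    disordered = step1 == 1
    if np.any(disordered):
        # Step 2: only the samples detected as disordered are classified
        # further, into hyperfunctional dysphonia (1) or vocal fold paresis (2).
        step2 = clf_dysphonia_vs_paresis.predict(X[disordered])
        labels[disordered] = np.where(step2 == 0, 1, 2)
    return labels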
The three approaches are compared to commonly used baselines. The glottal MFCCs and self-supervised feature extractors are compared to traditional MFCCs, which are the most popular features in voice disorder detection [2], [3], [4], [5], [7], [9], [10], [13], [17], [32]. The hierarchical classifiers are compared to SVMs in the popular one-vs-one (OvO) and one-vs-rest (OvR) architectures [9].

The remaining part of the paper is structured as follows. Section II describes the proposed methods and their technical details. Section III describes the details of the experimental setup, including the database, the training and evaluation process, and the experiment runs. The experimental results are presented in Section IV. Finally, Section V summarizes the paper and presents final conclusions.

II. PROPOSED SYSTEM

In this work, voice disorder classification is performed by using a pipeline system that consists of separate feature extraction and classification steps. The classification is performed between three voice classes (healthy voice, hyperfunctional dysphonia, and vocal fold paresis), as illustrated in Fig. 1. The following sub-sections describe the technical details of the feature extraction and classification steps.

FIGURE 1. Block diagram of the pipeline system. The proposed methods for feature extraction and classification are indicated by green color.

FIGURE 2. The hierarchical SVM classifier that is used in this work.

A. FEATURES

The voice signal is first pre-processed by re-sampling it to 16 kHz and by removing silent segments. All samples shorter than 750 ms are left out. Each utterance is normalized by dividing it by the signal's maximum absolute value. The baseline MFCC features are extracted by computing 13 coefficients with their delta and delta-delta coefficients. A frame length of 25 ms is used with a shift of 5 ms.

As described in Section I, the first proposed approach to improve the multi-class classification of voice pathologies corresponds to using MFCCs computed from glottal source waveforms (denoted as MFCC-glottal) in the feature extraction. First, the glottal source waveform is estimated using the quasi-closed phase (QCP) glottal inverse filtering method proposed in [33]. MFCCs are then extracted from the glottal waveform using the same procedure as in the baseline MFCCs.

The second approach introduced in Section I is to extract features by utilizing a self-supervised model that has been pre-trained using ASR databases. Three different self-supervised models are included in this work. Firstly, we use pre-trained wav2vec 2.0 BASE [18], which was pre-trained and fine-tuned using 960 hours of Librispeech [34]. Secondly, we use pre-trained wav2vec 2.0 LARGE [18], [35], which was pre-trained using a combination of three ASR databases (CommonVoice [36], BABEL [37], and Multilingual Librispeech [34]). Thirdly, we use HuBERT LARGE [19], which was pre-trained on the Libri-Light database [38] and further fine-tuned using 960 hours of Librispeech [34]. Both wav2vec 2.0 LARGE and HuBERT LARGE include 24 transformer blocks in their context networks and their model dimension is 1024, whereas wav2vec 2.0 BASE includes only 12 transformer blocks and its model dimension is 768. For each of these models, the feature vectors are derived by computing the temporal averages of the relative positional embeddings from the output of each transformer layer of the context network.
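A minimal sketch of this layer-wise temporal averaging is given below, assuming the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name is an illustrative assumption, and the study itself uses the pre-trained models listed above.

import torch
import librosa
from transformers import Wav2Vec2Model

# Checkpoint name is an assumption for illustration; the BASE model has
# 12 transformer blocks with a model dimension of 768.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def layer_averaged_features(wav_path):
    signal, sr = librosa.load(wav_path, sr=16000)
    inputs = torch.tensor(signal).unsqueeze(0)            # shape (1, n_samples)
    with torch.no_grad():
        out = model(inputs, output_hidden_states=True)
    # hidden_states contains the input to the first transformer layer (index 0)
    # followed by the output of every transformer layer, i.e. 13 tensors for
    # BASE and 25 for LARGE, each of shape (1, n_frames, model_dim).
    feats = [h.squeeze(0).mean(dim=0) for h in out.hidden_states]
    return torch.stack(feats)                              # (n_layers + 1, model_dim)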
A similar computation is also done for the input of the first transformer layer. Therefore, the number of feature vectors is 25 for the LARGE variants and 13 for the BASE variant, and the dimension of the feature space equals the model dimension. In order to denote the feature vectors from each layer, the feature vectors are indexed in increasing order. The input to the context network has index 0, and the output of the final embedding layer has index 24 and 12 for the LARGE and the BASE variants, respectively.

B. CLASSIFIERS

The third approach to improve the multi-class classification of voice disorders is the use of a hierarchical classifier, which consists of two binary classifiers. Fig. 2 shows an illustration of the hierarchical classifier with two binary SVMs (SVM-hier) that is used in this work. The first classifier (SVM-1) distinguishes disordered voices from healthy voices. For the voice samples detected as disordered, the second classifier (SVM-2) classifies the pathology either as hyperfunctional dysphonia or vocal fold paresis. In addition to SVM-hier, another hierarchical system is examined that uses fine-tuned wav2vec 2.0 LARGE models as binary classifiers. This model is referred to as wav2vec-LARGE-hier. The hierarchical classifiers are compared with SVMs based on OvO (SVM-OvO) and OvR (SVM-OvR), as they have been widely used in multi-class classification [9], [10].

The hierarchical classifier, as well as the baseline OvO and OvR systems, divides the full multi-class problem into less complex sub-problems. These sub-problems are solved individually by dedicated classifiers, which effectively shares the total complexity of the task between them. As a result, each training sample can be utilized in several parts of the architecture to learn different parts of the full problem, which increases the utility of each training sample. This can help to train classifiers with small databases. In addition, the modularity of hierarchical classifiers can be taken advantage of by medical practitioners. Namely, the hierarchical structure makes it possible to perform diagnosis as a sequence of increasingly detailed evaluations, starting from the detection of disordered markers and ending at a detailed diagnosis of the disorder type. Furthermore, each classifier in the hierarchy can be replaced without a need to modify or change any other classifiers. This is naturally a desirable feature of a system, as it allows for easy maintenance and continuous development.

III. EXPERIMENTAL SETUP

This section describes the experimental setup that is used in the current study. First, the voice database used in the study is described. Second, the training and testing processes of the classifiers are discussed. Finally, the details of each individual experiment are provided.

A. DATABASE

The current study uses voice data of the Saarbrücken Voice Disorders (SVD) database [39], [40]. We selected this database because it is publicly available and covers voice samples from both genders for a variety of laryngeal voice disorders. The database contains 71 different disorders. The recordings were conducted in sessions, where each speaker performed four speaking tasks: a pronunciation of the German sentence 'Guten Morgen, wie geht es Ihnen?' ('Good morning, how are you?'), and sustained pronunciations of three vowels (/a/, /i/, /u/). The vowels were pronounced by varying the pitch in four types (low, normal, high, and low-high-low).
The database contains samples from 1853 speakers recorded in 2225 sessions. This work includes samples of healthy voices, as well as pathological voice samples of hyperfunctional dysphonia and vocal fold paresis. These two voice disorders were selected because they are among the most prevalent voice disorders1 [41], [42], [43]. Another reason for the selection is the small amount of data for both of the voice disorders in SVD (213 recording sessions for hyperfunctional dysphonia and 213 recording sessions for vocal fold paresis). This enables studying the 3-class classification task using pipeline classifiers in a scenario with a small amount of training data, as discussed in Section I. Furthermore, in order to simulate a voice data scenario which is not only small in size but which could also be generalized to databases other than SVD, we only selected voices that represent one popular speaking task, namely the sustained phonation of the vowel /a/ in normal pitch. We used samples from the speakers who had not had any surgeries or voice therapy prior to the recordings, and who were 19-60 years old at the time of the recordings. In addition, we left out samples that were shorter than 750 ms. This resulted in the data subsets that are visualized in Fig. 3.

1 https://www.tgh.org/institutes-and-services/conditions/hyperfunctional-dysphonia

FIGURE 3. Number of recording sessions in the selected subset of the database. Included are healthy voices, and voices with hyperfunctional dysphonia and vocal fold paresis.

B. TRAINING AND TESTING

All classifiers were trained by using 5-fold cross-validation (CV). All samples from each speaker were always contained within a single fold, to ensure that a model does not learn to classify voice samples based on speaker identity. In each iteration, one of the folds was reserved for evaluation, and the other folds were used for training. Performance metrics were computed based on the predictions that were made on the evaluation fold. The metrics include balanced classification accuracy, class-wise precision, class-wise recall, and class-wise F1 score. The 5-fold CV was performed 4 times with different random states, to get a total of 20 evaluations. For all hierarchical classifiers, the training process was performed separately for the two binary sub-problems.

As part of the cross-validation process, the number of samples in the different classes was balanced by duplicating the samples of the smallest classes. Even though this approach is most likely not optimal, it performed better in our initial tests than balancing the classes by leaving out samples from the majority classes. The balancing was done for each fold, which resulted in balanced data for both training and evaluation.

Some aspects of the training process were different between the experiments where the SVM-based classifiers were used and the experiments where the self-supervised models were fine-tuned and used as classifiers. When SVMs were used, the training and test features were both z-score normalized with the mean and standard deviation of the training data. Also, for each of the SVM classifiers, the hyperparameters were optimized by grid search. The searched parameter values were identical to the ones used and described in [44, pp. 27-28]. For each of the fold iterations, all parameter combinations were evaluated, and the one that achieved the best mean balanced accuracy was selected.
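A minimal sketch of this per-fold SVM procedure (speaker-independent folds, duplication-based class balancing, z-score normalization fitted on the training fold, and grid search maximizing mean balanced accuracy), assuming scikit-learn and NumPy arrays for features, labels, and speaker identifiers, is given below. The fold construction and the parameter grid are illustrative placeholders and not the exact setup of [44].

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

def duplicate_to_balance(X, y, rng):
    # Balance classes by duplicating samples of the smaller classes.
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        np.concatenate([np.where(y == c)[0],
                        rng.choice(np.where(y == c)[0], n_max - n, replace=True)])
        for c, n in zip(classes, counts)
    ])
    return X[idx], y[idx]

def run_cv(X, y, speakers, param_grid, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    # Speaker-independent folds: all samples of a speaker stay in one fold.
    cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X, y, groups=speakers):
        X_tr, y_tr = duplicate_to_balance(X[train_idx], y[train_idx], rng)
        X_te, y_te = duplicate_to_balance(X[test_idx], y[test_idx], rng)
        # z-score normalization fitted on training data; SVM hyperparameters
        # chosen by grid search maximizing mean balanced accuracy.
        pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
        search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy")
        search.fit(X_tr, y_tr)
        scores.append(balanced_accuracy_score(y_te, search.predict(X_te)))
    return np.mean(scores), np.std(scores)

Here param_grid would be a dictionary such as {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}; these values are placeholders, whereas the actual searched grid follows [44].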
In contrast, when the self-supervised models were fine-tuned, the input signals were pre-processed as in the pre-processing stage of feature extraction (see Section II-A). Grid search was not applied to any hyperparameters. Fine-tuning was conducted once for each CV iteration by minimizing the cross-entropy loss with the AdamW optimizer [45] with β1 = 0.9 and β2 = 0.999. The initial learning rate was 0.0005 and it was reduced linearly. The batch size was 32 and the maximum number of epochs was 50.

C. EXPERIMENTS

The experiments consisted of two parts. The first part evaluates the two feature-based approaches (glottal MFCCs and self-supervised features), and the second part evaluates the classifier-based approach (hierarchical classification).

In the first part, the comparison of the baseline MFCC, MFCC-glottal and self-supervised features was conducted using two binary classification tasks: healthy vs. disordered, and hyperfunctional dysphonia vs. vocal fold paresis. These binary problems were selected for the feature-related experiments because they are the two sub-problems of hierarchical classification.

In the second part, hierarchical multi-class classification was examined. The comparison was first made between SVM-hier and the baseline classifiers (SVM-OvO and SVM-OvR) by using MFCCs. In addition, we examined the effect of using the self-supervised models together with hierarchical classification. For both hierarchical steps, we selected the self-supervised model that achieved the best performance in the corresponding sub-problems (SVM-1 and SVM-2) in the first part of the experiments. These models were then used in the hierarchical framework in two alternative ways. Firstly, by extracting the self-supervised features from the models and using them together with SVM-hier. In this case, the best features were selected separately for both sub-problems, based on the results of SVM-1 and SVM-2 in the first part of the experiments. Secondly, by fine-tuning the models in the two binary sub-problems and combining them into a hierarchical multi-class classifier. In the latter case, the fine-tuning effectively replaces the manual selection of the best features, as the utility of the final embedding layer was automatically maximized.

It is worth pointing out that the number of trainable classifier parameters differs between the multi-class classification systems. This is because 3 SVMs were included in SVM-OvO and SVM-OvR, but only 2 SVMs were included in SVM-hier. Also, the self-supervised models were trained on SVD data only in the final experiment, where all parameters of the self-supervised models were fine-tuned in the two binary sub-problems.
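The fine-tuning configuration described in Section III-B corresponds roughly to the following sketch, assuming a Hugging Face sequence classification head on top of wav2vec 2.0. The checkpoint name and the dummy training data are placeholders for illustration; in the study the batches come from the SVD training folds and the fine-tuned model is wav2vec 2.0 LARGE.

import torch
from transformers import Wav2Vec2ForSequenceClassification

# Checkpoint name is an assumption; the classification head is newly initialized.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base-960h", num_labels=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
num_epochs = 50
# Linear decay of the learning rate from the initial value towards zero.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=num_epochs)
loss_fn = torch.nn.CrossEntropyLoss()

# Placeholder data: equal-length dummy waveforms and binary labels.
dummy_waveforms = torch.randn(8, 16000)
dummy_labels = torch.randint(0, 2, (8,))
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(dummy_waveforms, dummy_labels), batch_size=32)

model.train()
for epoch in range(num_epochs):
    for batch_waveforms, batch_labels in train_loader:
        optimizer.zero_grad()
        logits = model(batch_waveforms).logits
        loss = loss_fn(logits, batch_labels)
        loss.backward()
        optimizer.step()
    scheduler.step()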
IV. RESULTS

This section describes the results of our experiments. First, the results regarding the self-supervised and MFCC-glottal features are discussed. This is followed by a discussion of the results of hierarchical classification.

A. MFCC-GLOTTAL AND SELF-SUPERVISED FEATURES

The obtained classification accuracies for all the features are shown in Fig. 4. They include evaluations in the two binary sub-problems of SVM-hier: healthy vs. disordered (SVM-1), and hyperfunctional dysphonia vs. vocal fold paresis (SVM-2). The performance metrics other than classification accuracy are shown in Table 1.

FIGURE 4. Classification accuracies obtained in binary tasks of healthy vs. disordered (SVM-1) and hyperfunctional dysphonia vs. vocal fold paresis (SVM-2). The green dashed line represents the baseline MFCC features. The orange dashed line represents the MFCC-glottal features. The solid lines represent the self-supervised features, with the tick labels indicating the index of the corresponding layer. Index 0 refers to the input to the first embedding layer; other indexes refer to the output of the corresponding layer. For each self-supervised feature, the best values are highlighted with larger circles.

As can be seen, the self-supervised features outperformed the baseline MFCCs consistently. Moreover, the MFCC-glottal features outperformed the baseline MFCCs in almost all scenarios. The best accuracies for male speakers were 75.65% for SVM-1 and 71.95% for SVM-2, and they were obtained using the wav2vec-LARGE-6 and wav2vec-BASE-0 features, respectively. The MFCC-glottal accuracies were 74.48% for SVM-1 and 69.05% for SVM-2, and the baseline MFCC accuracies were 72.01% for SVM-1 and 61.60% for SVM-2. The best accuracies for female speakers were 74.50% for SVM-1 and 63.06% for SVM-2, and they were obtained using the HuBERT-0 and wav2vec-LARGE-13 features, respectively. The MFCC-glottal accuracies were 66.13% for SVM-1 and 59.96% for SVM-2, and the baseline accuracies were 68.15% for SVM-1 and 57.09% for SVM-2.

As the wav2vec-LARGE features were generally the best self-supervised features, they were used in the next set of experiments with the SVM-hier classifier. All performance metrics of the best layers and their respective baselines are shown in Table 1.

TABLE 1. Performance Metrics Obtained in Binary Tasks of Healthy Vs. Disordered (SVM-1) and Hyperfunctional Dysphonia Vs. Vocal Fold Paresis (SVM-2). PREC Represents Precision, REC Represents Recall, and F1 Represents F1 Score. The Numbers 0 and 1 in the Metric Names Represent the Classes, Which are Healthy (0) and Disordered (1) for SVM-1, and Hyperfunctional Dysphonia (0) and Vocal Fold Paresis (1) for SVM-2. The Mean Values Over the Folds are Reported for All Metrics. In Addition, the Standard Deviations are Reported for Accuracy, and the Best Mean Accuracy Values are Highlighted for Each Classifier and Gender. Results of All the Self-Supervised Features are Not Included; Only the Layers With the Highest Performance are Included (See Fig. 4). In the Feature Column, the Feature Names Include Their Corresponding Layer Numbers for Self-Supervised Features.

B. HIERARCHICAL CLASSIFICATION

The results of the experiments with hierarchical classifiers are shown in Fig. 5 and Table 2.

TABLE 2. Performance Metrics for the Multi-Class Classifiers. PREC Represents Precision, REC Represents Recall, and F1 Represents F1 Score. Numbers 0, 1, and 2 in the Metric Names Represent Healthy Voice, Hyperfunctional Dysphonia, and Vocal Fold Paresis, Respectively. The Mean Values Over the Folds are Reported for All Metrics, and the Best Mean Accuracy Values are Highlighted for Each Classifier and Gender. In Addition, the Standard Deviations are Reported for Accuracy.

First, the baseline classifiers, SVM-OvO and SVM-OvR, were trained and evaluated with MFCCs. For male speakers, the baseline classification accuracies were 47.01% for SVM-OvR and 46.38% for SVM-OvO. For female speakers, the baseline classification accuracies were 47.70% for SVM-OvR and 48.41% for SVM-OvO. Then, SVM-hier was trained and evaluated with MFCCs, and the results were better than those of the baselines (i.e., SVM-OvR and SVM-OvO).
For male speakers, the accuracy was 53.76%, and for female speakers, the accuracy was 51.11%.

Then, the hierarchical classification was examined together with the self-supervised models. Wav2vec 2.0 LARGE was used as the self-supervised model, because it was generally the best self-supervised model in Section IV-A. First, SVM-hier was used and the best wav2vec-LARGE features were selected for both hierarchical steps (SVM-1 and SVM-2) separately, based on their performance in Section IV-A. For female speakers, wav2vec-LARGE-3 was used for SVM-1 and wav2vec-LARGE-13 was used for SVM-2. For male speakers, wav2vec-LARGE-6 was used for SVM-1 and wav2vec-LARGE-14 was used for SVM-2. The resulting multi-class accuracies were 61.29% and 55.36% for male and female speakers, respectively. This was the best obtained performance for female speakers.

FIGURE 5. Classification accuracies obtained in multi-class classification. Heights of the bars represent the mean accuracies over the folds and the tails represent the standard deviations. Wav2vec-LARGE-hier refers to the scenario where the fine-tuned wav2vec 2.0 LARGE model was used in hierarchical classification. For the other models, the used features are indicated within parentheses.

Finally, wav2vec 2.0 LARGE was fine-tuned for the two binary sub-problems separately, after which the fine-tuned models were combined into a hierarchical classifier, wav2vec-LARGE-hier. The obtained classification accuracies were 62.77% and 54.12% for male and female speakers, respectively. This was the best obtained performance for male speakers. Therefore, the highest absolute improvements over the baseline SVM systems were 15.76% and 6.95% for male and female speakers, respectively.

The confusion matrices for all the hierarchical systems, as well as for the SVM-OvR baseline, are visualized in Fig. 6. The values in the confusion matrices are normalized over the true values (rows).

FIGURE 6. Confusion matrices of the multi-class classification systems. The horizontal axis represents the predicted classes, and the vertical axis represents the true classes. Class labels 0, 1, and 2 represent healthy voice, hyperfunctional dysphonia, and vocal fold paresis, respectively. The values are normalized over the true values (rows). Wav2vec-LARGE-hier refers to the scenario where the fine-tuned wav2vec 2.0 model was used in hierarchical classification. For the other models, the used features are indicated within parentheses.

It can be seen that hierarchical classification mainly increases the performance of the two smallest classes. For instance, in comparison to the baseline SVM-OvR with MFCCs, SVM-hier with the wav2vec-LARGE features increased the recall of the smallest class (vocal fold paresis) from 0.28 to 0.43 for male speakers, and from 0.28 to 0.39 for female speakers. Moreover, the recall of hyperfunctional dysphonia increased from 0.30 to 0.58 for male speakers, and from 0.45 to 0.50 for female speakers. In addition, fine-tuning further improved the performance of the smallest class without largely affecting the classification accuracy. In comparison to SVM-hier with the wav2vec-LARGE features, wav2vec-LARGE-hier increased the recall of vocal fold paresis by 0.10.
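For reference, the reported metrics relate to each other as in the following minimal sketch with scikit-learn: balanced accuracy is the mean of the class-wise recalls, and the confusion matrices of Fig. 6 are normalized over the true classes (rows). The label arrays here are illustrative only; in the study they come from the cross-validation folds.

import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Illustrative labels only (0 = healthy, 1 = hyperfunctional dysphonia,
# 2 = vocal fold paresis).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Balanced accuracy is the mean of the class-wise recalls.
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Row-normalized confusion matrix, as in Fig. 6: rows are true classes,
# each row sums to 1, and the diagonal entries are the class-wise recalls.
cm = confusion_matrix(y_true, y_pred, normalize="true")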
V. CONCLUSION

In this paper, a 3-class voice pathology classification task was studied to automatically classify two laryngeal voice disorders (hyperfunctional dysphonia and vocal fold paresis) and healthy voice. Samples from the Saarbrücken Voice Disorders (SVD) database were used. The study examined three approaches that may alleviate the problem of data scarcity in the multi-class classification of voice disorders and, therefore, improve the classification performance of a pipeline classifier. For the feature extraction phase, the proposed approaches corresponded to the extraction of the MFCC-glottal features and the usage of pre-trained self-supervised models as feature extractors. For the classification phase, a hierarchical classification approach was used. Comparisons were made to commonly used baseline approaches: MFCCs for feature extraction, and SVM-OvO and SVM-OvR for classification.

The two feature-based approaches were first evaluated in the two binary sub-problems of the hierarchical classification framework. The results indicate that both the MFCC-glottal and self-supervised features increase the classification performance in most scenarios when compared to the baseline MFCCs. For male speakers, there is no large difference between MFCC-glottal and the best self-supervised features (1.17% for SVM-1 and 2.90% for SVM-2), which may imply that they are equally effective methods to capture the glottal information that is discriminative between the classes. However, the glottal MFCCs performed consistently well for male speakers but not for female speakers. In fact, the glottal MFCCs were outperformed by the baseline MFCCs for SVM-1 with female speakers, with an absolute difference of 2.02%. This difference between the genders might be caused by the fact that glottal source estimation by inverse filtering is more difficult for high-pitched speech, because fewer samples are available for the estimation of the vocal tract filter for each glottal cycle.

In general, the positive effect of the feature-based approaches is largest in the task of classification between the two pathologies, which is also the task that typically suffers most severely from the data scarcity problem. This finding supports our hypothesis that these approaches effectively alleviate the problems caused by the scarcity of training data.

The implications are similar when evaluating the hierarchical systems in the multi-class problem. In comparison to the baseline OvO and OvR approaches, all hierarchical systems increase the classification accuracy. The confusion matrices show that the performance improvements are almost completely caused by an improvement in the two smallest classes (hyperfunctional dysphonia and vocal fold paresis). For both genders, the best performance was achieved by using the self-supervised models together with hierarchical classification. For female speakers, the best classification accuracy (55.36%) was achieved by using the non-fine-tuned wav2vec-LARGE features, whereas for male speakers, the best accuracy (62.77%) resulted from using the fine-tuned wav2vec-LARGE models. Therefore, the total improvements over the baseline multi-class classifiers were 15.76% and 6.95% (absolute) for males and females, respectively. Overall, the performance difference between the fine-tuned and non-fine-tuned self-supervised models was not large (1.24% for female speakers and 1.48% for male speakers).
However, fine-tuning effectively balanced the performance differences between the unbalanced classes. In particular, fine-tuning resulted in an absolute improvement of 0.10 in the recall of the smallest class (vocal fold paresis), while keeping the balanced classification accuracy almost unchanged. This effect was similar for both genders, and it may indicate that fine-tuning further improves the system performance in small-data scenarios by balancing the performance differences between the classes of different sizes.

The obtained performance metrics in the binary detection between healthy and disordered speech are generally comparable with existing studies that have used the SVD database. For example, the best classification accuracies with recordings of the vowel /a/ in normal pitch were 75.42%, 74.32%, and 67.0% in [5], [8], and [4], respectively. Some works exist that report very high accuracies. For example, the reported classification accuracies were 96.96% in [3] and 96.5% in [2]. However, those studies did not use class balancing or the balanced classification accuracy metric, which can result in overly optimistic performance values due to over-emphasizing the largest (healthy) class. In this study, the classes were balanced in both the training and evaluation data. There is evidence showing that the performance with the SVD data can be largely dependent on the selected experimental setup [1].

The multi-class classification results of this study are not directly comparable to any existing studies, because of the differences in the used databases and included disorders. The work in [11] utilized the SVD database in multi-class classification between healthy, reflux laryngitis, hyperfunctional dysphonia and hypofunctional dysphonia, and achieved true negative and true positive rates of 92.2% and 88.9%, respectively. In [28], SVD was used in a multi-modal classification in several different multi-class classification problems, and the obtained classification accuracy in the 3-class classification was 94.3%. In [12], a 3-class classification was carried out between neurotypical speech, dysarthria and apraxia of speech, and the best balanced classification accuracy of 79.7% was reported. Similar to our work, the best performance was obtained by using a hierarchical SVM classifier.

Finally, it should be noted that this study uses non-nested cross-validation (CV) instead of nested CV, which might result in overfitting, as discussed in [46]. The usage of nested CV has not usually been reported in studies related to the detection and classification of voice disorders.

REFERENCES

[1] M. Huckvale and C. Buciuleac, "Automated detection of voice disorder in the Saarbrücken voice database: Effects of pathology subset and audio materials," in Proc. Interspeech, 2021, pp. 1399-1403.
[2] F. Amara, M. Fezari, and H. Bourouba, "An improved GMM-SVM system based on distance metric for voice pathology detection," Appl. Math, vol. 10, no. 3, pp. 1061-1070, 2016.
[3] J. Y. Lee, "A two-stage approach using Gaussian mixture models and higher-order statistics for a classification of normal and pathological voices," EURASIP J. Adv. Signal Process., vol. 1, pp. 1-8, 2012.
[4] D. Martínez, E. Lleida, A. Ortega, A. Miguel, and J. Villalba, "Voice pathology detection on the Saarbrücken voice database with calibration and fusion of scores using multifocal toolkit," in Proc. Adv. Speech Lang. Technol. Iberian Lang., 2012, pp. 99-109.
[5] J. A. Gómez-García, L. Moro-Velázquez, and J. I. Godino-Llorente, "On the design of automatic voice condition analysis systems. Part II: Review of speaker recognition techniques and study on the effects of different variability factors," Biomed. Signal Process. Control, vol. 48, pp. 128-143, 2019.
[6] P. Harar, J. B. Alonso-Hernandez, J. Mekyska, Z. Galaz, R. Burget, and Z. Smekal, "Voice pathology detection using deep learning: A preliminary study," in Proc. IEEE Int. Conf. Workshop Bioinspired Intell., 2017, pp. 1-4.
[7] M. K. Reddy and P. Alku, "A comparison of cepstral features in the detection of pathological voices by varying the input and filterbank of the cepstrum computation," IEEE Access, vol. 9, pp. 135953-135963, 2021.
[8] S. R. Kadiri and P. Alku, "Analysis and detection of pathological voice using glottal source features," IEEE J. Sel. Topics Signal Process., vol. 14, no. 2, pp. 367-379, Feb. 2019.
[9] E. Vaiciukynas, A. Verikas, A. Gelzinis, M. Bacauskiene, and V. Uloza, "Exploring similarity-based classification of larynx disorders from human voice," Speech Commun., vol. 54, no. 5, pp. 601-610, 2012.
[10] R. Behroozmand and F. Almasganj, "Comparison of neural networks and support vector machines applied to optimized features extracted from patients' speech signal for classification of vocal fold inflammation," in Proc. IEEE Int. Symp. Signal Process. Inf. Technol., 2005, pp. 844-849.
[11] K. T. Chui, M. D. Lytras, and P. Vasant, "Combined generative adversarial network and fuzzy C-means clustering for multi-class voice disorder detection with an imbalanced dataset," Appl. Sci., vol. 10, no. 13, 2020, Art. no. 4571.
[12] I. Kodrasi, M. Pernon, M. Laganaro, and H. Bourlard, "Automatic and perceptual discrimination between dysarthria, apraxia of speech, and neurotypical speech," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2021, pp. 7308-7312.
[13] A. A. Dibazar, T. W. Berger, and S. S. Narayanan, "Pathological voice assessment," in Proc. IEEE Int. Conf. Eng. Med. Biol. Soc., 2006, pp. 1669-1673.
[14] H. Wu, J. Soraghan, A. Lowit, and G. Di-Caterina, "A deep learning method for pathological voice detection using convolutional deep belief networks," in Proc. Interspeech, 2018, pp. 446-450.
[15] J. C. Vásquez-Correa, J. Fritsch, J. R. Orozco-Arroyave, E. Nöth, and M. Magimai-Doss, "On modeling glottal source information for phonation assessment in Parkinson's disease," in Proc. Interspeech, 2021, pp. 26-30.
[16] Z. Jin et al., "Adversarial data augmentation for disordered speech recognition," in Proc. Interspeech, 2021, pp. 4803-4807.
[17] P. Barche, K. Gurugubelli, and A. K. Vuppala, "Towards automatic assessment of voice disorders: A clinical approach," in Proc. Interspeech, 2020, pp. 2537-2541.
[18] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 12449-12460.
[19] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451-3460, 2021.
[20] A. Hernandez, P. A. Pérez-Toro, E. Noeth, J. R. Orozco-Arroyave, A. Maier, and S. H. Yang, "Cross-lingual self-supervised speech representations for improved dysarthric speech recognition," in Proc. Interspeech, 2022, pp. 51-55.
[21] L. P. Violeta, W.-C. Huang, and T. Toda, "Investigating self-supervised pretraining frameworks for pathological speech recognition," in Proc. Interspeech, 2022, pp. 41-45.
[22] G. Chatzoudis, M. Plitsis, S. Stamouli, A.-L. Dimou, N. Katsamanis, and V. Katsouros, "Zero-shot cross-lingual aphasia detection using automatic speech recognition," in Proc. Interspeech, 2022, pp. 2178-2182.
[23] S. P. Bayerl, D. Wagner, E. Noeth, and K. Riedhammer, "Detecting dysfluencies in stuttering therapy using wav2vec 2.0," in Proc. Interspeech, 2022, pp. 2868-2872.
[24] Y. Getman et al., "Wav2vec2-based speech rating system for children with speech sound disorder," in Proc. Interspeech, 2022, pp. 3618-3622.
[25] Y. Zhu, X. Liang, J. A. Batsis, and R. M. Roth, "Domain-aware intermediate pretraining for dementia detection with limited data," in Proc. Interspeech, 2022, pp. 2183-2187.
[26] T. Wang et al., "Conformer based elderly speech recognition system for Alzheimer's disease detection," in Proc. Interspeech, 2022, pp. 4825-4829.
[27] D. Priyasad et al., "Detecting heart failure through voice analysis using self-supervised mode-based memory fusion," in Proc. Interspeech, 2022, pp. 2848-2852.
[28] S. Bhattacharjee and W. Xu, "VoiceLens: A multi-view multi-class disease classification model through daily-life speech data," Smart Health, vol. 23, 2022, Art. no. 100233.
[29] J. Mallela et al., "Voice based classification of patients with Amyotrophic Lateral Sclerosis, Parkinson's Disease and healthy controls with CNN-LSTM using transfer learning," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2020, pp. 6784-6788.
[30] H. Cordeiro, J. Fonseca, I. Guimarães, and C. Meneses, "Hierarchical classification and system combination for automatically identifying physiological and neuromuscular laryngeal pathologies," J. Voice, vol. 31, no. 3, pp. 384.e9-384.e14, 2017.
[31] M. Nikkhah-Bahrami, H. Ahmadi-Noubari, B. S. Aghazadeh, and H. K. Heris, "Hierarchical diagnosis of vocal fold disorders," in Advances in Computer Science and Engineering, H. Sarbazi-Azad, B. Parhami, S.-G. Miremadi, and S. Hessabi, Eds. Berlin, Germany: Springer, 2009, pp. 897-900.
[32] J. Laguarta, F. Hueto, and B. Subirana, "COVID-19 artificial intelligence diagnosis using only cough recordings," IEEE Open J. Eng. Med. Biol., vol. 1, pp. 275-281, 2020.
[33] M. Airaksinen, L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, "A comparison between STRAIGHT, glottal, and sinusoidal vocoding in statistical parametric speech synthesis," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, pp. 1658-1670, Sep. 2018.
[34] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, "MLS: A large-scale multilingual dataset for speech research," in Proc. Interspeech, 2020, pp. 2757-2761.
[35] Q. Xu, A. Baevski, and M. Auli, "Simple and effective zero-shot cross-lingual phoneme recognition," in Proc. Interspeech, 2022, pp. 2113-2117.
[36] R. Ardila et al., "Common Voice: A massively-multilingual speech corpus," in Proc. Int. Conf. Lang. Resour. Eval., 2019.
[37] M. J. F. Gales, K. M. Knill, A. Ragni, and S. P. Rath, "Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED," in Proc. 4th Int. Workshop Spoken Lang. Technol. Under-Resourced Lang., 2014, pp. 16-23.
[38] J. Kahn et al., "Libri-Light: A benchmark for ASR with limited or no supervision," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2020, pp. 7669-7673.
[39] M. Pützer and W. J. Barry, "Saarbrücken Voice Database, Institute of Phonetics, Univ. of Saarland," 2007. [Online]. Available: http://www.stimmdatenbank.coli.uni-saarland.de/ (Last viewed Feb. 18, 2023).
[40] M. Pützer and W. J. Barry, "Instrumental dimensioning of normal and pathological phonation using acoustic measurements," Clin. Linguistics Phonetics, vol. 22, no. 6, pp. 407-420, 2008.
[41] R. E. Hillman, C. E. Stepp, J. H. V. Stan, M. Zañartu, and D. D. Mehta, "An updated theoretical framework for vocal hyperfunction," Amer. J. Speech Lang. Pathol., vol. 29, no. 4, pp. 2254-2260, 2020.
[42] R. Behroozmand and F. Almasganj, "Optimal selection of wavelet-packet-based features using genetic algorithm in pathological assessment of patients' speech signal with unilateral vocal fold paralysis," Comput. Biol. Med., vol. 37, no. 4, pp. 474-485, 2007.
[43] C. Walton, E. Conway, H. Blackshaw, and P. Carding, "Unilateral vocal fold paralysis: A systematic review of speech-language pathology management," J. Voice, vol. 31, no. 4, pp. 509.e7, 2017.
[44] S. Tirronen, "Detection and multi-class classification of voice disorders from speech recordings," Master's thesis, School of Science, Aalto University, Espoo, Finland, 2022.
[45] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Representations, 2019.
[46] G. C. Cawley and N. L. C. Talbot, "On over-fitting in model selection and subsequent selection bias in performance evaluation," J. Mach. Learn. Res., vol. 11, pp. 2079-2107, 2010.