
HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer’s Disease Detection from Spontaneous Speech

Abstract

Automatically detecting Alzheimer’s Disease (AD) from spontaneous speech plays an important role in its early diagnosis. Recent approaches rely heavily on Transformer architectures because of their effectiveness in modelling long-range context dependencies. However, the computational complexity of self-attention grows quadratically with the length of the audio, which poses a challenge when deploying such models on edge devices. In this context, we construct a novel framework, namely the Hierarchical Attention-Free Transformer (HAFFormer), to better deal with long speech for AD detection. Specifically, we employ an attention-free module of Multi-Scale Depthwise Convolution to replace self-attention and thus avoid its expensive computation, and a GELU-based Gated Linear Unit to replace the feedforward layer, aiming to automatically filter out redundant information. Moreover, we design a hierarchical structure that forces the model to learn information at a variety of granularities, from the frame level to the dialogue level. In extensive experiments on the ADReSS-M dataset, HAFFormer achieves results competitive with other recent work (82.6% accuracy), with a significantly lower computational complexity and model size than the standard Transformer. This shows the efficiency of HAFFormer in dealing with long audio for AD detection.

Index Terms—  Alzheimer’s Disease, Hierarchical Modelling, Attention-Free Transformer

1 Introduction

Alzheimer’s Disease (AD) is a common neurodegenerative disorder characterised by clinical features such as memory impairment, language difficulties, executive dysfunction, and cognitive decline. According to the 2020 report of the Lancet Commission [1], over 50 million people worldwide were affected by dementia in 2020, and this number is projected to reach 152 million by 2050. This will substantially increase the financial burden on individuals, families, and society. Although AD cannot be completely cured, early screening or diagnosis not only provides more treatment options but also helps slow the progression of the disease and improve patients’ quality of life. However, accurately detecting AD at an early stage is challenging because patients often lack the relevant health knowledge. As spontaneous speech is a potential biomarker related to the development of AD, automatic speech analysis has recently received significant attention, because it offers a non-invasive, cost-effective, and repeatable means of monitoring patients’ speech characteristics [2, 3].

Several studies have focused on AD detection based on spontaneous speech [4, 5, 6]. Rohanian et al. [5] integrated rule-based features, such as word probabilities, disfluency features, and pause information, with various acoustic features for AD detection. Balagopalan et al. [4] conducted an in-depth investigation of traditional acoustic features and pre-trained deep features. They found that combining the classic acoustic features with pre-trained acoustic embeddings in a classification approach can yield higher and more robust performance under an unbalanced data distribution. Additionally, Li et al. [6] explored the impact of acoustic and linguistic embeddings on AD detection tasks. These methods have primarily concentrated on exploring various speech features, such as paralinguistic features and pre-trained acoustic embeddings. Owing to the Transformer’s capability to model long-range context dependencies, some recent studies have started to employ the Transformer and its variants as classifiers for AD detection [7, 8, 9]. Specifically, Ilias et al. [7] explored the Vision Transformer (ViT) for AD detection. Jin et al. [8] integrated advanced acoustic embeddings and disfluency features, and combined them with the Swin Transformer and a Random Forest classifier, which achieved the best results in the ADReSS-M competition [3]. Mei et al. [9] also obtained promising results by fine-tuning the wav2vec 2.0 model.

However, handling long-duration spontaneous speech sequences remains a challenge. Most of the aforementioned methods rely on standard Transformer architectures, in which the computational complexity of the self-attention module grows quadratically with the input length. This makes the model computationally expensive for long sequences, and thus largely hinders the practical application of spontaneous speech analysis to automated AD detection. Although some efforts, such as HTS-AT [10] and HATN [11], have been made to sequentially downsample the audio signals, the original self-attention module is retained, so the computational burden is not significantly relieved.
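The scaling argument above can be made concrete with a rough operation count. The sketch below is illustrative only: the two cost formulas are textbook approximations (score and value products for a single attention head, one filter per channel for a depthwise convolution), and the function names, the channel width of 8, and the kernel size of 7 are assumptions chosen for the example rather than figures reported for any specific model.

```python
def attention_score_macs(seq_len: int, dim: int) -> int:
    """Approximate MACs for Q.K^T plus (attention weights).V in one single-head layer."""
    return 2 * seq_len * seq_len * dim


def depthwise_conv_macs(seq_len: int, dim: int, kernel_size: int) -> int:
    """Approximate MACs for one depthwise 1-D convolution (one filter per channel)."""
    return seq_len * dim * kernel_size


# Doubling the sequence length quadruples the attention cost,
# but only doubles the depthwise-convolution cost.
for seq_len in (800, 1600, 3200):
    att = attention_score_macs(seq_len, dim=8)
    conv = depthwise_conv_macs(seq_len, dim=8, kernel_size=7)
    print(f"L={seq_len}: attention={att:,} MACs, depthwise conv={conv:,} MACs")
```

The linear projections for queries, keys, and values are left out on purpose, since they scale linearly with the sequence length and do not change the comparison.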

Fig. 1: An overview of the proposed Hierarchical Attention-Free Transformer (HAFFormer) framework for Alzheimer’s disease detection. MSDW: multi-scale depth-wise convolution; GEGLU: GELU-based Gated Linear Units; AD/HC: Alzheimer’s Disease or Healthy Control.

To address this challenge, we designed a more efficient architecture, namely the Hierarchical Attention-Free Transformer (HAFFormer). This design is partially inspired by the MetaFormer [12] framework, in which the Transformer architecture is mainly structured as a Token Mixer (corresponding to the original self-attention module) and a Channel Mixer (corresponding to the original feedforward module). The proposed HAFFormer abandons self-attention, with its quadratic computational complexity, and instead introduces a convolutional module as the Token Mixer. This convolutional module uses efficient depthwise convolution operations, resulting in a remarkable reduction of the model size and, more importantly, of the computational complexity. Furthermore, we implement the GELU-based Gated Linear Unit (GEGLU) module as the Channel Mixer to better capture crucial information within the speech signal. Additionally, encouraged by SpeechFormer++ [13], HAFFormer is constructed in a hierarchical manner, which further reduces the processing cost for long speech data and makes it more efficient for handling spontaneous speech. Our contributions are as follows:

  • We propose an Attention-Free Transformer (AFFormer) model without an attention mechanism, which is specifically designed for dealing with long-duration speech signals in our AD detection scenario.

  • We introduce a hierarchical structure for the AFFormer, giving the model the capability to capture context information at different granularities, from fine-grained to coarse-grained.

  • We conduct extensive experiments on the publicly available ADReSS-M dataset for AD detection. The experimental results show that HAFFormer is competitive with other state-of-the-art (SOTA) approaches, but with a considerable reduction of model size and computational complexity.

2 Hierarchical Attention-Free Transformer

Figure 1 illustrates the overall framework of the proposed HAFFormer, which consists of four main components: speech preprocessing and embedding, a projection, a hierarchy of merge and AFFormer blocks, and a classification head. Specifically, as shown in Fig. 1, we introduce three hierarchies of merge and AFFormer blocks. Each hierarchy plays a different role in capturing context information, ranging from local to global. The classification head is designed for binary AD detection and is composed of two fully connected layers. In the following sections, we elaborate on the first three components in turn.

2.1 Speech Preprocessing and Embedding

To obtain rich features for AD detection from speech, we employ advanced pretrained models for acoustic feature extraction. Currently, popular speech models pretrained via Self-Supervised Learning (SSL) include Wav2Vec 2.0 [14], HuBERT [15], WavLM [16], and others. Given that the AD dataset considered is cross-lingual (see Section 3.1) and early versions of Wav2Vec 2.0, HuBERT, and WavLM were trained on English data only, we opt for Wav2Vec2 XLS-R [17]. This model shares the same architecture as Wav2Vec 2.0 but was trained on much larger datasets covering 128 languages to better capture cross-lingual speech representations.

It is worth noting that AD recordings of spontaneous speech (dialogue) tend to be long, often several minutes. However, conventional pre-trained models (e.g., Wav2Vec2 XLS-R) typically accept inputs of at most 30 seconds, or even less. To address this issue, we segment each long recording into sequential short utterances using WhisperX [18], which transcribes the signal into linguistic sentences together with their temporal boundaries. We then feed the short utterances sequentially into the Wav2Vec2 XLS-R model to extract frame-wise speech embeddings, which are subsequently concatenated. The dimensionality of the resulting speech embedding is denoted as $D = L \times N$, where $N$ is 1024 and $L$ depends on the length of the entire speech signal. In this paper, we set $L$ to 3200 (64 seconds), which covers most of the speech samples (see Section 3.1). Any embedding sequence longer than 3200 frames is truncated, while shorter ones are padded with zeros.
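The concatenate, truncate, and pad step can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the helper name `fix_length` is hypothetical, random arrays stand in for real XLS-R outputs, and the rate of 50 frames per second follows from the 20 ms frame stride of wav2vec 2.0-style encoders, which is consistent with 3200 frames covering 64 seconds.

```python
import numpy as np

FRAMES_PER_SECOND = 50   # one frame every 20 ms (wav2vec 2.0-style encoder)
MAX_FRAMES = 3200        # L in the text: 64 s * 50 frames/s
EMBED_DIM = 1024         # N in the text


def fix_length(utterance_embeddings, max_frames=MAX_FRAMES):
    """Concatenate per-utterance embeddings, then truncate or zero-pad to max_frames."""
    frames = np.concatenate(utterance_embeddings, axis=0)   # (total_frames, N)
    frames = frames[:max_frames]                             # truncate long recordings
    pad = max_frames - frames.shape[0]
    return np.pad(frames, ((0, pad), (0, 0)))                # zero-pad short recordings


# Three hypothetical utterances of 10 s, 25 s and 8 s (random stand-ins for XLS-R output).
rng = np.random.default_rng(0)
utterances = [
    rng.standard_normal((seconds * FRAMES_PER_SECOND, EMBED_DIM)).astype(np.float32)
    for seconds in (10, 25, 8)
]
fixed = fix_length(utterances)
print(fixed.shape)  # (3200, 1024)
```

In this example the three utterances give 2150 frames in total, so the last 1050 rows of `fixed` are zeros.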

2.2 Projection

Considering that AD patients’ spontaneous speech is often lengthy and the annotations are scarce, a high representation dimension would often lead to an overfitting problem. Moreover, it often inevitably brings a heavy computational load. To mitigate the risk of overfitting and reduce the computational load for subsequent AFFormer blocks, we introduce a projection layer to map the high-dimensional representations into low-dimensional ones. Assuming a speech signal corresponds to an embedding $X$ with dimension $D = 3200 \times 1024$, the projection layer is introduced to reduce it to a lower dimensionality of $D = 3200 \times 8$.

2.3 Merge

As shown in Fig. 1, each hierarchy incorporates a merge layer and one or two AFFormer block(s). The merge layer serves a dual purpose: reducing the computational complexity associated with long inputs in Transformer-style models, and removing much of the inherent redundancy in speech data through feature aggregation. Specifically, the merge blocks are also implemented by Conv1D. In our experiments, the first, second, and third merge modules downsample the data by factors of 4, 2, and 2, yielding output dimensions of $D = 800 \times 8$, $400 \times 8$, and $200 \times 8$, respectively.
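The shape bookkeeping through the projection and the three merge layers can be checked with a few lines of plain Python. The function name `hierarchy_shapes` is hypothetical, and the sketch only tracks shapes, assuming that each merge divides the sequence length exactly by its downsampling factor; it does not implement the Conv1D layers themselves.

```python
def hierarchy_shapes(seq_len=3200, embed_dim=1024, proj_dim=8, merge_factors=(4, 2, 2)):
    """Track (sequence length, channels) through the projection and each merge layer."""
    shapes = [("embedding", (seq_len, embed_dim)), ("projection", (seq_len, proj_dim))]
    for index, factor in enumerate(merge_factors, start=1):
        seq_len //= factor  # each merge shortens the sequence by its factor
        shapes.append((f"merge {index}", (seq_len, proj_dim)))
    return shapes


for name, shape in hierarchy_shapes():
    print(f"{name:<10} -> {shape}")
```

If each input frame spans 20 ms, a token after the first, second, and third merge would summarise roughly 80 ms, 160 ms, and 320 ms of speech, which is one way to read the progression from fine to coarse granularity.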

2.4 AFFormer block

Inspired by MetaFormer [12], we seek to design the most suitable Token Mixer and Channel Mixer for our AD detection task, aiming to replace the high-complexity self-attention module in the standard Transformer.

Token Mixer: We design the Token Mixer with the MSDW structure as shown in Fig. 2 (a). It consists of a two-branch convolution topology: one branch is a $1 \times 1$ depth-wise convolution; the other one is a $7 \times 1$ depth-wise convolution. Mathematically, the process of Token Mixer can be calculated by:

$$Y = \text{Conv1D}_{7 \times 1}(\text{LN}(X)) + \text{Conv1D}_{1 \times 1}(\text{LN}(X)) + X, \tag{1}$$

where $X$ and $Y$ represent the inputs and outputs of the Token Mixer, GELU and LN denote the Gaussian Error Linear Unit activation function and layer normalisation. The module is highly motivated by the Inverted Separable Convolution (ISC) module, as proposed in MobileNet V2 [19], where the depth-wise convolution (DW) and inverted residual structure are efficient in capturing the local context information while minimising model size. This design refrains from self-attention layers and thus avoids expensive computations.
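Eq. (1) can be sketched in NumPy as below. This is a framework-agnostic illustration under several assumptions that are not stated in the text: layer normalisation is applied over the channel axis without learned scale and shift, the convolutions use zero padding so the sequence length is preserved, and biases are omitted. All function names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalise each frame over the channel axis (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def depthwise_conv1d(x, weight):
    """Depthwise 1-D convolution with zero padding that keeps the sequence length.

    x: (seq_len, channels); weight: (kernel_size, channels), one filter per channel.
    """
    kernel_size = weight.shape[0]
    pad = kernel_size // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for k in range(kernel_size):
        # Each tap multiplies a shifted copy of the input, channel by channel.
        out += padded[k:k + x.shape[0]] * weight[k]
    return out


def msdw_token_mixer(x, w7, w1):
    """Eq. (1): Y = Conv1D_7(LN(X)) + Conv1D_1(LN(X)) + X."""
    normed = layer_norm(x)
    return depthwise_conv1d(normed, w7) + depthwise_conv1d(normed, w1) + x


rng = np.random.default_rng(0)
x = rng.standard_normal((800, 8))            # output of the first merge layer
w7 = 0.1 * rng.standard_normal((7, 8))       # 7-tap depthwise filters
w1 = 0.1 * rng.standard_normal((1, 8))       # 1-tap depthwise filters
y = msdw_token_mixer(x, w7, w1)
print(y.shape)  # (800, 8)
```

With 8 channels, the two branches hold only 7 * 8 + 1 * 8 = 64 weights (biases aside), which illustrates why the depthwise design keeps the mixer small.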

Channel Mixer: We construct the Channel Mixer with GEGLU as shown in Fig. 2 (b). The GLU module [20] comprises two branches: one branch contains a linear layer followed by a nonlinear activation (GELU in our case) and acts as a gating unit, while the other branch is a single linear layer without any activation function. The outputs of the two branches are then combined by element-wise multiplication. The gating unit can automatically learn to filter the output, suppressing unimportant information and retaining relevant information. This design helps the model learn long-range dependencies. Recently, the GLU module has also been increasingly used in large language models [21, 22] to replace one fully connected (FC) layer in the feedforward (FFN) module. The process of the Channel Mixer can be expressed by:

$$Y = \text{GELU}(\text{LN}(X) W_1) \odot \text{LN}(X) W_2, \tag{2}$$

where $W_1$ and $W_2$ denote the weights of two linear layers, and $X$ and $Y$ represent the inputs and outputs of the Channel Mixer.
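A literal NumPy reading of Eq. (2) is sketched below. Several details are assumptions rather than statements from the text: GELU is computed with the common tanh approximation, layer normalisation has no learned parameters, biases are omitted, and the hidden width of 16 follows the "2 times d_model" setting given in Section 3.3.3. Any output projection back to d_model or residual connection is not part of Eq. (2) and is therefore left out.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def layer_norm(x, eps=1e-5):
    """Normalise each frame over the channel axis (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def geglu_channel_mixer(x, w1, w2):
    """Eq. (2): Y = GELU(LN(X) W1) * (LN(X) W2), with element-wise multiplication."""
    normed = layer_norm(x)
    gate = gelu(normed @ w1)    # gating branch: linear layer followed by GELU
    value = normed @ w2         # value branch: linear layer only
    return gate * value


rng = np.random.default_rng(0)
d_model, hidden = 8, 16
x = rng.standard_normal((800, d_model))
w1 = 0.1 * rng.standard_normal((d_model, hidden))
w2 = 0.1 * rng.standard_normal((d_model, hidden))
y = geglu_channel_mixer(x, w1, w2)
print(y.shape)  # (800, 16)
```

Because the gate multiplies the value branch element by element, a gate activation near zero suppresses the corresponding feature, which is the filtering behaviour described above.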

Fig. 2: Detailed architecture of the MSDW module (a) and the GEGLU module (b).

3 Experiments and Results

In this section, we first introduce the dataset used, followed by the experimental setup. Finally, we present and discuss the results of HAFFormer.

3.1 Dataset

In this study, we utilised the ADReSS-M Challenge dataset [3], which is designed for multilingual AD detection through spontaneous speech. The ADReSS-M dataset consists of 291 spontaneous speech samples, where 245 samples are for training and 46 samples for testing. In the training set, there are 237 samples in English, and the remaining 8 samples are in Greek. The duration of the English training samples ranges from 22.3 seconds to 268.5 seconds, with an average duration of 75.9 seconds. In the testing set, all samples are in Greek, with a duration ranging from 11.8 seconds to 119.4 seconds and an average duration of 38.1 seconds. The training dataset was balanced with respect to age and gender to mitigate potential confounds and biases in the data.

3.2 Experimental Setup

Our implementation uses the PyTorch framework, and the model was trained with the cross-entropy loss. We used the AdamW optimiser with an initial learning rate of 2e-3 and a weight decay of 1e-5. The batch size was set to 8, and the model was trained for 80 epochs. To ensure a fair comparison with other works and assess the overall capability of the model, we use accuracy (ACC) and F1-score (F1) as evaluation metrics.
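Since the test set contains 46 samples (Section 3.1), accuracy can only take values that are multiples of 1/46, so one test sample is worth about 2.17 percentage points. The short sketch below enumerates a few of these values; the helper `truncate` is hypothetical, and the observation that the reported accuracies appear to match these fractions truncated to two decimals is an inference, not something stated in the text.

```python
TEST_SIZE = 46  # number of test samples in ADReSS-M, as stated in Section 3.1


def truncate(value, decimals=2):
    """Cut a positive number off after the given number of decimals (no rounding)."""
    factor = 10 ** decimals
    return int(value * factor) / factor


for correct in range(33, 39):
    accuracy = 100 * correct / TEST_SIZE
    print(f"{correct}/{TEST_SIZE} correct -> {accuracy:.3f}% (truncated: {truncate(accuracy):.2f})")
```

Under this reading, an accuracy of 82.60% corresponds to 38 of the 46 test recordings being classified correctly.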

3.3 Results and Discussions

3.3.1 Comparison with SOTA Approaches

To demonstrate the effectiveness of our proposed method, we compared it with other SOTA approaches. As shown in Table 1, HAFFormer outperforms most previous work [23, 9], in which various combinations of acoustic features (ComParE 2016, eGeMAPS, wav2vec2-base) and fine-tuning of pre-trained models (wav2vec2-large-xlsr-53) have been attempted.

It also achieves performance competitive with the best-performing work [24], which employs pre-training combined with mixed-batch transfer learning: the model is pre-trained on a subset of the English training data and then fine-tuned on the remaining English and Greek data. Their model consists of two fully-connected layers and an attention-pooling layer. Despite its simplicity, this architecture may lack scaling capability as the amount of training data increases.

Table 1: Performance comparison between the proposed HAFFormer and other state-of-the-art methods.
Methods                                          ACC [%]   F1 [%]
Pre-train + Mixed-batch transfer learning [24]   82.60     -
Fine-tuned wav2vec 2 [23]                        73.91     -
IS10-paralinguistics-compat+SVM [9]              69.60     -
HAFFormer                                        82.60     82.60

3.3.2 Selection of the Token Mixers

In our exploration of the MetaFormer architecture for AD detection and the search for a more efficient and effective Token Mixer, we tested six different Token Mixers: Self-Attention, Pool, Identity, ISC, DW, and MSDW, with Channel Mixer fixed with FFN. The relevant parameters were set as follows: d_model for Self-Attention was set to 8, and for ISC, DW, and MSDW, the number of convolutional kernels was set to 8. The number of neurons in FFN was set as 4 times d_model, following the standard Transformer setting. Additionally, the $1 \times 1$ convolution in ISC was replaced with an FC layer, using the same parameters as the FFN.

The results are shown in Fig. 3. With the same Channel Mixer (FFN), the much simpler Identity and Pool mixers performed at least as well as Self-Attention (see Table 2), which supports the effectiveness of the MetaFormer architecture for AD detection. Furthermore, MSDW achieved the best results with fewer parameters than Self-Attention (4.13K vs. 5.09K in Table 2). We therefore selected MSDW as the Token Mixer.

Fig. 3: Model performance (accuracy) vs model size (number of parameters) when taking different types of token mixers under the same channel mixer (i.e., FFN). MACs: Multiply–accumulate operations.

3.3.3 Selection of the Channel Mixers

In this section, we aim to select the best Channel Mixer. We evaluated four different Channel Mixers: FFN, Pool, Identity, and GEGLU. We set the number of neurons in GEGLU to 2 times d_model, with other parameters set as in Section 3.3.2. Table 2 presents the results for different Channel Mixers. The experimental results indicate that GEGLU consistently yields good results, with MSDW + GEGLU achieving the best performance. Notably, the MACs for MSDW + GEGLU are approximately $\frac{1}{20}$ of those of the standard Transformer (Self-Attention + FFN), making it highly efficient in terms of computation.

It is worth mentioning that the parameter counts and MACs reported in Section 3.3.2 and Section 3.3.3 exclude the projection layer. This is because we map the embeddings to a very low dimension, so the cost of the other modules is much smaller than that of the projection layer. For example, the total MACs for Self-Attention + FFN are 107.15M, of which the projection layer contributes 78.64M, i.e., 73.4% of the total. The total MACs for MSDW + GEGLU are 80.08M, of which the projection layer again contributes 78.64M, i.e., over 98% of the total. Therefore, the figures listed exclude the projection layer, for a fair comparison.
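The percentages quoted above follow directly from the block-only MACs in Table 2 and the projection cost, as the short check below shows. The helper names are hypothetical. The final lines go one step further and are an assumption, not a statement from the text: a dense Conv1D projection with kernel size 3 over a 3200 x 1024 input with 8 output channels happens to reproduce the 78.64M figure, which suggests, but does not confirm, such a configuration.

```python
PROJECTION_MACS_M = 78.64  # projection layer MACs in millions, from the text

# Block-only MACs in millions, taken from Table 2.
BLOCK_MACS_M = {
    "Self-Attention + FFN": 28.51,
    "MSDW + GEGLU": 1.44,
}


def with_projection(block_macs_m):
    """Return (total MACs in millions, projection share in percent)."""
    total = block_macs_m + PROJECTION_MACS_M
    return total, 100.0 * PROJECTION_MACS_M / total


for name, macs in BLOCK_MACS_M.items():
    total, share = with_projection(macs)
    print(f"{name}: total {total:.2f}M, projection share {share:.1f}%")

ratio = BLOCK_MACS_M["Self-Attention + FFN"] / BLOCK_MACS_M["MSDW + GEGLU"]
print(f"Self-Attention + FFN needs {ratio:.1f}x the block MACs of MSDW + GEGLU")


def conv1d_macs(seq_len, in_channels, out_channels, kernel_size):
    """MACs of a dense 1-D convolution that keeps the sequence length."""
    return seq_len * in_channels * out_channels * kernel_size


# ASSUMPTION: kernel size 3 is a guess that reproduces the quoted projection cost.
print(conv1d_macs(3200, 1024, 8, kernel_size=3))
```

The ratio of about 19.8 is the origin of the "approximately 1/20" statement in the previous paragraph.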

Table 2: Comparison of the model performance (accuracy [ACC] and F1), the model size (number of parameters), and the computational complexity (number of MACs) for different token mixers and channel mixers. Pool: average pooling; Identity: identity mapping (no module); ISC: inverted separable convolution; DW: depth-wise convolution; MSDW: multi-scale depth-wise convolution; FFN: feedforward layer; GEGLU: GELU-based gated linear unit.
Token Mixer      Channel Mixer   ACC [%]   F1 [%]   Params [K]   MACs [M]
Self-Attention   FFN             71.73     71.69    5.09         28.51
Self-Attention   Pool            80.43     80.29    2.33         27.18
Self-Attention   Identity        73.91     73.91    2.33         27.18
Self-Attention   GEGLU           78.26     78.01    4.45         28.18
Pool             FFN             73.91     73.91    3.65         1.6
Pool             Pool            76.08     76.08    0.89         0.27
Pool             Identity        73.91     73.91    0.89         0.27
Pool             GEGLU           76.08     76.05    3.01         1.27
Identity         FFN             71.73     71.69    3.65         1.6
Identity         Pool            80.43     80.29    0.89         0.27
Identity         Identity        73.91     73.91    0.89         0.27
Identity         GEGLU           78.26     78.01    3.01         1.27
ISC              FFN             76.08     76.08    5.49         2.56
ISC              Pool            76.08     75.68    2.73         1.23
ISC              Identity        80.43     80.29    2.73         1.23
ISC              GEGLU           80.43     80.43    4.85         2.23
DW               FFN             76.08     76.09    3.93         1.75
DW               Pool            71.73     71.69    1.17         0.42
DW               Identity        73.91     73.91    1.17         0.42
DW               GEGLU           76.08     76.08    3.29         1.42
MSDW             FFN             78.26     78.26    4.13         1.77
MSDW             Pool            76.08     76.89    1.37         0.44
MSDW             Identity        76.08     76.05    1.37         0.44
MSDW             GEGLU           82.60     82.60    3.49         1.44

3.3.4 Selection of the Number of Hierarchies

To reduce the computational complexity of the model for long sequences, a hierarchical model can be an effective approach. Currently, in both the speech [13, 10] and computer vision [12, 25] domains, many works employ a four-hierarchy (stage) model. In this section, we conduct an ablation study on the number of hierarchies.

From Fig. 4, it can be observed that Hierarchy3-1 and Hierarchy3-2 both achieved the best results. The difference between them lies in the number of AFFormer blocks in the last hierarchy: Hierarchy3-1 uses one block, while Hierarchy3-2 uses two. Consequently, Hierarchy3-1 has fewer parameters. Hierarchy4, on the other hand, showed a drop in performance, likely because of overfitting caused by the larger model size relative to the small dataset. This suggests that a moderate number of hierarchies, as in Hierarchy3-1 and Hierarchy3-2, is more effective for the given AD dataset, while increasing the model complexity beyond a certain point may lead to overfitting.

Fig. 4: Performance of the proposed HAFFormer with different numbers of hierarchies. 3-1 and 3-2 denote three hierarchies in which the last one contains one or two blocks, respectively.

4 Conclusion

To deal with long spontaneous speech in the context of Alzheimer’s disease detection, we proposed a lightweight Transformer variant, the Hierarchical Attention-Free Transformer (HAFFormer). Replacing the self-attention module with an attention-free alternative avoids its quadratic computational complexity, and the GELU-based gated linear unit, an alternative to the feedforward module, can better learn to select the most salient representations. Moreover, the introduced hierarchical architecture further lowers the processing cost of handling long speech data. The empirical results demonstrate the efficiency of the proposed model in terms of model size and complexity for AD detection. In future work, HAFFormer will be investigated on other related tasks in mobile mental health [26].

References

  • [1] G. Livingston, J. Huntley, A. Sommerlad, D. Ames, C. Ballard, S. Banerjee, C. Brayne, A. Burns, J. Cohen-Mansfield, C. Cooper et al., “Dementia prevention, intervention, and care: 2020 report of the lancet commission,” The Lancet, vol. 396, no. 10248, pp. 413–446, 2020.
  • [2] G. Gainotti, D. Quaranta, M. G. Vita, and C. Marra, “Neuropsychological predictors of conversion from mild cognitive impairment to alzheimer’s disease,” Journal of Alzheimer’s disease, vol. 38, no. 3, pp. 481–495, 2014.
  • [3] S. Luz, F. Haider, D. Fromm, I. Lazarou, I. Kompatsiaris, and B. MacWhinney, “Multilingual alzheimer’s dementia recognition through spontaneous speech: a signal processing grand challenge,” arXiv preprint arXiv:2301.05562, 2023.
  • [4] A. Balagopalan and J. Novikova, “Comparing acoustic-based approaches for alzheimer’s disease detection,” in Proc. 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH), 2021, pp. 3800–3804.
  • [5] M. Rohanian, J. Hough, and M. Purver, “Alzheimer’s dementia recognition using acoustic, lexical, disfluency and speech pause features robust to noisy inputs,” in Proc. 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH), 2021, pp. 3820–3824.
  • [6] J. Li, K. Song, J. Li, B. Zheng, D. Li, X. Wu, X. Liu, and H. Meng, “Leveraging pretrained representations with task-related keywords for alzheimer’s disease detection,” in Proc. 48th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–5.
  • [7] L. Ilias, D. Askounis, and J. Psarras, “Detecting dementia from speech and transcripts using transformers,” Computer Speech & Language, vol. 79, p. 101485, Apr 2023.
  • [8] L. Jin, Y. Oh, H. Kim, H. Jung, H. J. Jon, J. E. Shin, and E. Y. Kim, “Consen: Complementary and simultaneous ensemble for alzheimer’s disease detection and mmse score prediction,” in Proc. 48th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–2.
  • [9] K. Mei, X. Ding, Y. Liu, Z. Guo, F. Xu, X. Li, T. Naren, J. Yuan, and Z. Ling, “The ustc system for adress-m challenge,” in Proc. 48th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–2.
  • [10] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in Proc. 47th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022, pp. 646–650.
  • [11] Z. Zhao, Z. Bao, Z. Zhang, N. Cummins, H. Wang, and B. W. Schuller, “Hierarchical attention transfer networks for depression assessment from speech,” in Proc. 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020, pp. 7159–7163.
  • [12] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, “Metaformer is actually what you need for vision,” in Proc. 38th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10 819–10 829.
  • [13] W. Chen, X. Xing, X. Xu, J. Pang, and L. Du, “Speechformer++: A hierarchical efficient framework for paralinguistic speech processing,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 775–788, 2023.
  • [14] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. 34th Annual Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • [15] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [16] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  • [17] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino et al., “Xls-r: Self-supervised cross-lingual speech representation learning at scale,” in Proc. 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH), 2022, pp. 2278–2282.
  • [18] M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023.
  • [19] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proc. 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
  • [20] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. 34th International Conference on Machine Learning (ICML), 2017, pp. 933–941.
  • [21] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022.
  • [22] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language models for dialog applications,” arXiv preprint arXiv:2201.08239, 2022.
  • [23] X. Chen, Y. Pu, J. Li, and W.-Q. Zhang, “Cross-lingual alzheimer’s disease detection based on paralinguistic and pre-trained features,” in Proc. 48th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–2.
  • [24] B. Tamm, R. Vandenberghe, and H. Van Hamme, “Cross-lingual transfer learning for alzheimer’s detection from spontaneous speech,” in Proc. 48th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023, pp. 1–2.
  • [25] A. Wang, H. Chen, Z. Lin, H. Pu, and G. Ding, “Repvit: Revisiting mobile cnn from vit perspective,” arXiv preprint arXiv:2307.09283, 2023.
  • [26] J. Han, Z. Zhang, C. Mascolo, E. André, J. Tao, Z. Zhao, and B. W. Schuller, “Deep learning for mobile mental health: Challenges and recent advances,” IEEE Signal Processing Magazine, vol. 38, no. 6, pp. 96–105, 2021.