Abstract
Automatic estimation of depression level from speech is an active research topic in computational emotion recognition. One symptom commonly exhibited by patients with depression is erratic speech volume, so a patient's voice can serve as a bio-signature of depression severity. However, speech signals have time-frequency structure: different frequencies and different time instants contribute to depression detection in different ways. We therefore design a Coordinate Channel Attention (CCA) block that differentiates tensor information according to its contribution, and combine it with a dense block that extracts deep speech features to form the proposed Dense Coordinate Channel Attention Network (DCCANet). A vectorization block then fuses the resulting high-dimensional information. We split each original long recording into short audio segments of equal length, extract features, and feed the segments into the network to predict BDI-II scores; the mean of the segment scores is taken as the individual's depression level. Experiments on both the AVEC2013 and AVEC2014 datasets demonstrate the effectiveness of DCCANet, which outperforms several existing methods.
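The segment-and-average pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `model` is a hypothetical stand-in for the trained DCCANet (any callable mapping a fixed-length segment to a scalar BDI-II score), and the 3-second segment length is an assumed example value.

```python
import numpy as np

def predict_bdi2(wave, sr, model, seg_seconds=3.0):
    """Split a long recording into equal-length segments, score each
    segment with `model`, and return the mean score as the subject's
    estimated BDI-II depression level.

    wave : 1-D array of audio samples
    sr   : sample rate in Hz
    model: callable mapping a segment array to a scalar score
           (hypothetical stand-in for the trained network)
    """
    seg_len = int(seg_seconds * sr)
    n_full = len(wave) // seg_len
    # Keep only complete segments; the trailing remainder is dropped.
    segments = wave[: n_full * seg_len].reshape(n_full, seg_len)
    scores = [float(model(seg)) for seg in segments]
    return float(np.mean(scores))
```

In practice each segment would first pass through the feature-extraction front end before being scored; the averaging step is unchanged.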
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 62071330) and the Open Project Program of the State Key Laboratory of Multimodal Artificial Intelligence System (No. 202200012).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhao, Z., Liu, S., Niu, M., Wang, H., Schuller, B.W. (2025). Dense Coordinate Channel Attention Network for Depression Level Estimation from Speech. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15313. Springer, Cham. https://doi.org/10.1007/978-3-031-78201-5_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78200-8
Online ISBN: 978-3-031-78201-5
eBook Packages: Computer Science (R0)