Abstract
Speech rate conversion in screen readers is a crucial human-computer interface that allows people with visual disabilities to access audiovisual media. However, its potential utility in audio description (AD) has not been researched extensively. This study investigated the effect of AD speech rate on information access and film aesthetics. The appropriate speech rate for human-narrated and synthesized ADs of short scenes from Japanese movies was evaluated with blind, partially blind, and sighted participants, using an adapted staircase procedure and purpose-built software that enabled participants to adjust the AD speech rate through keyboard operations. For blind and partially blind participants, the mean appropriate speech rates of single ADs, whether human-narrated or synthesized, were lower than the screen-reader rates reported in previous studies but higher than the standard spontaneous speech rate. The presence of film sound did not significantly affect the mean appropriate speech rate for either voice type, although it reduced the variability of the rate chosen specifically for the synthesized AD. In addition, a statistically significant positive correlation emerged between the appropriate speech rates determined for human and synthesized ADs, and this correlation strengthened in an experiment using short movie scenes containing multiple ADs. Comments from blind and partially blind participants, together with ratings from sighted participants, demonstrated that appropriate sequential changes in speech rate enhance the sense of presence and immersion in movies. Designing the speech rate of ADs thus offers an opportunity to make ADs more comfortable to listen to and to augment the movie experience.
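The staircase procedure mentioned above can be illustrated with a minimal sketch: the presented speech rate steps up or down according to the participant's keyboard response, and the "appropriate" rate is estimated from the rates at which the response direction reverses. All parameters here (starting rate, step size, number of reversals, mora/s units) are hypothetical illustrations; the abstract does not specify the authors' exact settings.

```python
# Minimal sketch of a simple up-down staircase for estimating an
# appropriate speech rate. Hypothetical parameters: the paper's actual
# step sizes, units, and stopping rule are not given in the abstract.

def staircase(respond, start_rate=8.0, step=0.5, max_reversals=6):
    """respond(rate) -> 'faster' or 'slower', e.g. from a keypress.

    Returns an estimate of the appropriate speech rate (e.g. mora/s)
    as the mean of the rates recorded at response reversals.
    """
    rate = start_rate
    last_direction = None
    reversal_rates = []
    while len(reversal_rates) < max_reversals:
        answer = respond(rate)
        direction = +1 if answer == 'faster' else -1
        # A change in response direction marks a reversal point.
        if last_direction is not None and direction != last_direction:
            reversal_rates.append(rate)
        last_direction = direction
        rate += direction * step
    return sum(reversal_rates) / len(reversal_rates)

# Simulated participant whose preferred rate is 10.0: the estimate
# converges to the reversal points straddling that preference.
estimate = staircase(lambda r: 'faster' if r < 10.0 else 'slower')
```

In a real experiment, `respond` would block on keyboard input while the AD plays at the current rate; the simulation above merely shows the convergence behavior of the procedure.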
Acknowledgements
We thank Takeya Naono for his assistance with software development and data collection.
Funding
This study was supported by a Grant-in-Aid for Scientific Research (C) 22K12330.
Author information
Contributions
S.N. wrote the main manuscript text. K.M. and N.O. reviewed the writing of the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethical approval
This study was reviewed and approved by Akita University based on Article 12 (1) of the Code of Ethics for Human Subject Research. Informed consent was obtained from all participants before the commencement of this study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nakajima, S., Okochi, N. & Mitobe, K. Enhancing movie experience by speech rate design of audio description. Univ Access Inf Soc (2024). https://doi.org/10.1007/s10209-024-01178-z