DOI: 10.1145/3411763.3451810
poster

Automated Video Description for Blind and Low Vision Users

Published: 08 May 2021

Abstract

Video accessibility is crucial for blind and low vision users to engage equitably in education, employment, and entertainment. Despite the availability of professional description services and tools for amateur description, most human-generated descriptions are expensive and time-consuming to produce, and their rate of production cannot match the speed of video production. To close the widening gap in video accessibility, we developed a system that automatically generates descriptions for videos and answers blind and low vision users’ queries about them. Results from a pilot study with eight blind video aficionados indicate the promise of this system for meeting the need for immediate access to videos, and validate our efforts to develop tools in partnership with the individuals we aim to benefit. Although the results must be interpreted with caution because of the small sample size, participants overall reported high levels of satisfaction with the system, and all preferred using the system over having no support at all.


Published In

CHI EA '21: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems
May 2021
2965 pages
ISBN:9781450380959
DOI:10.1145/3411763
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Artificial Intelligence
  2. Blind and Low Vision Users
  3. Video Accessibility
  4. Video Description

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

CHI '21

Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%

