
Temporal Reasoning via Audio Question Answering

Published: 17 August 2020

Abstract

Multimodal question answering tasks can be used as proxy tasks to study systems that can perceive and reason about the world. Answering questions about different types of input modalities stresses different aspects of reasoning, such as visual reasoning, reading comprehension, story understanding, or navigation. In this article, we use the task of Audio Question Answering (AQA) to study the temporal reasoning abilities of machine learning models. To this end, we introduce the Diagnostic Audio Question Answering (DAQA) dataset, comprising audio sequences of natural sound events and programmatically generated questions and answers that probe various aspects of temporal reasoning. We adapt several recent state-of-the-art methods for visual question answering to the AQA task and use DAQA to demonstrate that they perform poorly on questions that require in-depth temporal reasoning. Finally, we propose a new model, Multiple Auxiliary Controllers for Linear Modulation (MALiMo), which extends the recent Feature-wise Linear Modulation (FiLM) model and significantly improves its temporal reasoning capabilities. We envisage DAQA fostering research on AQA and temporal reasoning, and MALiMo as a step towards models for AQA.
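
The abstract builds on FiLM, in which a controller network conditions a convolutional feature extractor on the question by predicting per-channel scale and shift parameters; MALiMo extends this with multiple auxiliary controllers. The PyTorch sketch below illustrates only the basic FiLM-style conditioning step on spectrogram-like audio features; the layer sizes, the single controller, and the residual structure are illustrative assumptions, not the published MALiMo architecture.

```python
# Minimal sketch of FiLM-style feature-wise linear modulation (illustrative
# shapes and layer sizes; not the published MALiMo architecture).
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # The controller maps a conditioning vector (e.g. a question embedding)
        # to per-channel scale (gamma) and shift (beta) parameters.
        self.controller = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.controller(cond).chunk(2, dim=-1)
        h = torch.relu(self.conv(x))
        # Broadcast gamma/beta over the frequency/time axes of the feature map.
        h = gamma[..., None, None] * h + beta[..., None, None]
        return torch.relu(h) + x  # residual connection, as in FiLM

# Example usage on spectrogram-like features and a question embedding:
feats = torch.randn(4, 64, 128, 400)   # (batch, channels, freq, time)
question = torch.randn(4, 256)         # e.g. output of an LSTM question encoder
out = FiLMBlock(channels=64, cond_dim=256)(feats, question)  # same shape as feats
```

A MALiMo-style extension would, roughly, add further controllers (for example, driven by the audio features as well as the question) that each produce their own modulation parameters; the paper should be consulted for the exact design.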

        Published In

        IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 28, 2020, 3123 pages
        ISSN: 2329-9290
        EISSN: 2329-9304

        Publisher

        IEEE Press

        Publication History

        Published: 17 August 2020
        Published in TASLP Volume 28

        Qualifiers

        • Research-article
