
Temporal Reasoning via Audio Question Answering

Published: 17 August 2020

Abstract

Multimodal question answering tasks can be used as proxy tasks to study systems that can perceive and reason about the world. Answering questions about different types of input modalities stresses different aspects of reasoning, such as visual reasoning, reading comprehension, story understanding, or navigation. In this article, we use the task of Audio Question Answering (AQA) to study the temporal reasoning abilities of machine learning models. To this end, we introduce the Diagnostic Audio Question Answering (DAQA) dataset, comprising audio sequences of natural sound events and programmatically generated questions and answers that probe various aspects of temporal reasoning. We adapt several recent state-of-the-art methods for visual question answering to the AQA task and use DAQA to demonstrate that they perform poorly on questions that require in-depth temporal reasoning. Finally, we propose a new model, Multiple Auxiliary Controllers for Linear Modulation (MALiMo), which extends the recent Feature-wise Linear Modulation (FiLM) model and significantly improves its temporal reasoning capabilities. We envisage DAQA fostering research on AQA and temporal reasoning, and MALiMo as a step towards models for AQA.
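
The abstract builds on FiLM, in which a controller network conditions a convolutional feature extractor on the question by predicting per-channel scale and shift parameters; MALiMo extends this with multiple auxiliary controllers. The PyTorch sketch below illustrates only the basic FiLM-style conditioning step on spectrogram-like audio features; the layer sizes, the single controller, and the residual structure are illustrative assumptions, not the published MALiMo architecture.

```python
# Minimal sketch of FiLM-style feature-wise linear modulation (illustrative
# shapes and layer sizes; not the published MALiMo architecture).
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # The controller maps a conditioning vector (e.g. a question embedding)
        # to per-channel scale (gamma) and shift (beta) parameters.
        self.controller = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.controller(cond).chunk(2, dim=-1)
        h = torch.relu(self.conv(x))
        # Broadcast gamma/beta over the frequency/time axes of the feature map.
        h = gamma[..., None, None] * h + beta[..., None, None]
        return torch.relu(h) + x  # residual connection, as in FiLM

# Example usage on spectrogram-like features and a question embedding:
feats = torch.randn(4, 64, 128, 400)   # (batch, channels, freq, time)
question = torch.randn(4, 256)         # e.g. output of an LSTM question encoder
out = FiLMBlock(channels=64, cond_dim=256)(feats, question)  # same shape as feats
```

A MALiMo-style extension would, roughly, add further controllers (for example, driven by the audio features as well as the question) that each produce their own modulation parameters; the paper should be consulted for the exact design.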

        Published In

        IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 28, 2020, 3123 pages
        ISSN: 2329-9290
        EISSN: 2329-9304

        Publisher

        IEEE Press

        Publication History

        Published: 17 August 2020
        Published in TASLP Volume 28

        Qualifiers

        • Research-article
