Visual Goal-Step Inference using wikiHow

Yang, Yue; Panagopoulou, Artemis; Lyu, Qing; Zhang, Li; Yatskar, Mark; Callison-Burch, Chris

arXiv:2104.05845 (cs)

[Submitted on 12 Apr 2021 (v1), last revised 10 Sep 2021 (this version, v2)]

Abstract: Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue: the Visual Goal-Step Inference (VGSI) task, in which a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100M, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
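
To make the four-way choice in the task description concrete, the following is a minimal sketch of how a VGSI example could be scored, assuming the goal text and each candidate image have already been embedded into a shared vector space. The function names (`cosine`, `choose_step`, `accuracy`) and the use of cosine similarity are illustrative assumptions, not the authors' model or evaluation code.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_step(goal_vec: Vector, image_vecs: Sequence[Vector]) -> int:
    """Return the index of the candidate image most similar to the goal.

    Assumption: embeddings come from some hypothetical multimodal encoder;
    the abstract does not specify the scoring function.
    """
    if len(image_vecs) != 4:
        raise ValueError("VGSI presents exactly four candidate images")
    scores = [cosine(goal_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: List[Tuple[Vector, Sequence[Vector], int]]) -> float:
    """Fraction of (goal, four images, correct index) examples answered correctly."""
    correct = sum(choose_step(goal, imgs) == label for goal, imgs, label in examples)
    return correct / len(examples)
```

Under this framing, a model that guesses uniformly at random would be expected to reach 25% accuracy, which is the natural baseline for a four-way choice.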

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2104.05845 [cs.CV]
	(or arXiv:2104.05845v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2104.05845

Submission history

From: Yue Yang
[v1] Mon, 12 Apr 2021 22:20:09 UTC (24,128 KB)
[v2] Fri, 10 Sep 2021 03:10:13 UTC (16,528 KB)
