VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

Qiucheng Wu¹, Handong Zhao², Michael Saxon¹,
Trung Bui², William Yang Wang¹, Yang Zhang³, Shiyu Chang¹
¹UC Santa Barbara, ²Adobe Research, ³MIT-IBM Watson AI Lab
qiucheng@ucsb.edu

Abstract

Vision language models (VLMs) are an exciting emerging class of language models (LMs) that have merged classic LM capabilities with those of image processing systems. However, the ways that these capabilities combine are not always intuitive and warrant direct investigation. One understudied capability in VLMs is visual spatial planning: the ability to comprehend the spatial arrangements of objects and devise action plans to achieve desired outcomes in visual scenes. In our study, we introduce VSP, a benchmark that 1) evaluates the spatial planning capability of these models in general, and 2) breaks down the visual planning task into finer-grained sub-tasks, including perception and reasoning, and measures the models’ capabilities in these sub-tasks. Our evaluation shows that both open-source and private VLMs fail to generate effective plans for even simple spatial planning tasks. Evaluations on the fine-grained analytical tasks further reveal fundamental deficiencies in the models’ visual perception and bottlenecks in reasoning abilities, explaining their poor performance in the general spatial planning tasks. Our work illuminates future directions for improving VLMs’ abilities in spatial planning. Our benchmark is publicly available at https://github.com/UCSB-NLP-Chang/Visual-Spatial-Planning.

1 Introduction

The rapid advancement of large language models has driven considerable growth in their capabilities to produce fluent text in many domains, generating outputs exhibiting potential “reasoning” and “understanding” abilities [1, 2, 3, 4]. Recently, vision language models (VLMs) have advanced on LMs through additional training on native image inputs, to achieve impressive performance generating text describing and relating to input images [5, 6, 7, 8, 9], with applications in image captioning, visual question answering, visual reasoning, and others [10, 11, 12, 13]. The swift evolution of VLMs has enabled them to tackle increasingly sophisticated tasks that require multiple emerging abilities in complex scenarios. However, as model capabilities and deployment needs advance, the challenges in usefully evaluating them grow in kind.

Planning is a fundamental capability in intelligent systems that is particularly contested in LMs [14], and is understudied in VLMs. Visual spatial planning refers to the task of comprehending the spatial arrangement of objects in a scene and designing action plans to achieve a desired outcome. For example, the classical maze problem can be considered a visual planning task, where an agent is given an input image describing the maze environment and is asked to produce a viable path to navigate the player from the starting position to the goal. This task requires two capabilities: image perception, which enables the agent to understand the objects, environment and spatial relations present in the image, and reasoning, which enables the agent to perform strategic decision-making.

Visual spatial planning is an important capability in many potential applications for VLMs, such as navigating in complex environments with autonomous driving [15, 16] or manipulating objects with robotic hands [17, 18]. Although a growing number of benchmarks evaluate the vision processing capabilities of VLMs, few systematically evaluate their capability to perform visual spatial planning tasks. As shown in Table 1, existing benchmarks mostly focus on VLMs’ ability to understand image content and perform visual logic reasoning [19, 20, 21]; however, they often overlook the ability to comprehend the spatial arrangements of entities within images and to devise spatial action plans based on practical restrictions in the visual environment. As a result, two research questions are left unanswered: ❶ How well do VLMs perform visual planning tasks? ❷ What are the bottleneck capabilities, e.g., perception or reasoning, that limit the performance of VLMs in visual planning tasks?

To this end, we introduce Visual Spatial Planning (VSP), a benchmark specifically designed to evaluate the spatial planning capabilities of VLMs. As illustrated in Figure 1 and Figure 2, the VSP benchmark is developed from classical maze navigation and block-moving games, where the entire environment is fully observable in the input images. In this benchmark, the VLMs are required to interpret the visual inputs, deduce the consequences of each action, and execute the designated tasks accordingly. To comprehensively evaluate the fine-grained capabilities needed for visual spatial planning, VSP includes 4.4K questions in 10 meticulously designed tasks that feature both simulated and photo-realistic visual environments. In addition to testing end-to-end spatial planning performance, these tasks further evaluate essential individual capabilities needed for performing visual planning, such as image perception and reasoning.

We apply the VSP benchmark to evaluate existing state-of-the-art VLMs, including both open-source and private models. Surprisingly, we find that even the most competitive VLMs sometimes struggle to perform the simplest visual planning tasks, such as a 3x3 maze problem or a one-step block-moving task. Our fine-grained capability analysis further reveals that existing VLMs have flaws in reasoning and bigger bottlenecks in perception. We believe the VSP benchmark highlights critical weaknesses in current VLMs and sheds light on future directions for enhancing their spatial understanding and planning capabilities.

Table 1: Comparison with representative existing benchmarks.

Name	Task Description	Keywords
MME [19]	Image content understanding, reasoning	perception, reasoning
MMMU [20]	College-level knowledge reasoning	multi-discipline knowledge, reasoning
MathVision [21]	Math problems with visual contexts	mathematical reasoning
SeedBench [22]	Comprehension of scene & instance in image	perception, reasoning, spatial relation
MM-Vet [23]	General problems that need integrated abilities	perception, reasoning, spatial relation
VSP	Understand & extract spatial info and plan accordingly	spatial planning, spatial perception, reasoning

2 Related Work

2.1 General planning in LMs

Planning has been a central focus of research in AI. Traditional work in AI planning includes using formal languages to represent and solve planning problems [24], and developing algorithms like dynamic programming and reinforcement learning to explore environments and formulate viable plans [25, 26]. While these works mostly focus on predefined and restricted environments, recently, with the advancement of LMs, it has become intriguing to study whether LMs, with the potential to be general intelligent agents, can perform planning in different settings and environments [27, 28, 29]. Many works explore the best ways to activate the planning capabilities of LMs, including divide and conquer [30, 31, 32, 33], grounding outputs in admissible actions [34, 35], retrospecting and refining [36, 37], and leveraging external tools [38, 39]. Meanwhile, with the increasing capabilities of LMs, growing research efforts are now dedicated to benchmark their planning capabilities in various complex environments [40, 41, 42].

2.2 Spatial and visual planning in LMs

Many general planning tasks in LMs involve understanding visual environments and comprehending spatial information. In robotics and embodied agent studies, LMs play a crucial role in grounding visual entities with references in open-domain instructions and formulating plans based on spatial constraints. Consequently, they are increasingly used in physically grounded scenarios such as object rearrangement [17, 18], cooking [43, 44], and navigation [35, 34]. LMs are also used in AIGC to propose spatial arrangements of entities following instructions [45]. While realistic planning tasks align with real needs, their complexity and expansive action spaces limit the analysis of LMs’ detailed planning capabilities. Therefore, research also focuses on LMs’ planning in simulated environments and games. For example, mystery blocksworld is a dynamically generated set of blocksworld tasks to test generalization in LMs [14]. Additionally, many text games have been introduced to test LMs’ abilities in spatial understanding and imagination [46, 40, 47, 48]. However, most of these studies transform visual information into text inputs, thus not directly measuring LMs’ visual abilities.

2.3 Benchmarks for VLMs

VLMs have inherited and advanced many intriguing features from text-only LMs [47, 49]. Benchmarks for VLMs have rapidly emerged to evaluate performance in areas such as image content understanding [19, 50], perception [51, 52], knowledge [20, 21, 53], and reasoning [19, 20, 54]. Recent benchmarks also focus on understanding multiple images in long contexts and complex realistic environments [55, 56]. While these benchmarks successfully quantify VLMs’ abilities in many fields, VLMs’ capabilities in spatial understanding and reaction remain relatively under-explored. Some benchmarks cover spatial relation understanding [22, 23], but often overlook the ability to devise complex spatial action plans based on visual environment constraints. We focus on visual spatial planning, the ability to comprehend spatial arrangements of objects and devise action plans to achieve specific outcomes. We fill the gap in benchmarking VLM abilities for visual spatial planning and highlight future directions for improving VLMs towards models with general intelligence.

Figure 1: Overview of the Maze Navigation scenario.

3 The Visual Spatial Planning Benchmark

3.1 Overview of the Benchmark

In this benchmark, our objectives are two-fold: ❶ quantify the visual spatial planning capabilities of current VLMs; and ❷ uncover current capability bottlenecks that limit the effectiveness of VLMs in visual spatial planning tasks. While the first objective can be achieved through direct measurements on corresponding tasks, the second objective requires more careful benchmark design. Specifically, performing spatial planning in visual environments requires a series of cohesive steps. For example, to generate an accurate path to navigate a player to a goal, an agent needs to be able to correctly view and understand the visual map, reason to find which actions are safe or dangerous, and come up with a detailed plan to achieve the goal. Each of these steps could be challenging for a developing VLM, and understanding which of these subtasks challenge them most will drive future improvement.

To this end, we propose the Visual Spatial Planning (VSP) benchmark, with the objective of measuring and diagnosing the capabilities of VLMs in producing accurate spatial plans in visual environments. The VSP benchmark consists of two scenarios: ❶ the simulated Maze Navigation scenario, whose main task is to move a game character through a maze, and ❷ the photo-realistic Blocks World scenario, whose main task is to move blocks from a starting configuration to a goal configuration. In each scenario, in addition to the main task, VSP introduces four sub-tasks that focus on the individual capabilities needed for the main task:

$\bullet$ T1. Single Object Perception – Determine the characteristics of a single object;

$\bullet$ T2. Spatial Relation Perception – Determine the relative positions of two objects;

$\bullet$ T3. Environment Perception – Find textual descriptions that describe the visual environment;

$\bullet$ T4. Spatial Reasoning – Determine the consequence of a series of actions or moves.

The sub-task details are designed specifically for each scenario. Furthermore, to demonstrate model performance under different levels of environmental complexity, we establish progressive difficulty settings for each task, measured by parameters such as map size and the minimum required number of actions. We provide task statistics, such as the total number of problems, in Appendix A. In what follows, we introduce each scenario in detail, as well as the data curation and task creation processes.

3.2 The Maze Navigation Scenario

The maze navigation scenario is inspired by the popular implementation [57] of a fully observable path-finding problem. As depicted on the left of Figure 1, it simulates a classical grid world environment with a designated start and goal position, where some of the grid cells contain obstacles (the “holes”) and cannot be passed through.

The main spatial planning task and the four sub-tasks are defined as follows:

$\bullet$ Main Task (Spatial Planning) – Generate a safe path to navigate from the start grid to the goal;

$\bullet$ T1 (Single Object Perception) – Determine if a specified grid is safe;

$\bullet$ T2 (Spatial Relation Perception) – Find spatial relations between the player and the goal;

$\bullet$ T3 (Environment Perception) – Find the textual description that fits the visual environment;

$\bullet$ T4 (Spatial Reasoning) – Determine the consequence of a given action series.

An example input image with its questions is shown in Figure 1. Each task has progressively adjusted difficulty settings to evaluate the model’s capability under various circumstances. For the Main Task and T1-T3, difficulty is measured by the size of the map, ranging from 3x3 to 8x8, where a larger map makes it harder to correctly perceive objects and plan accordingly. For task T4, since a longer path naturally makes reasoning harder, we adopt path length, ranging from 1 to 9, as the difficulty measure. Please refer to Appendix A for complete examples of the questions and answers in each task.
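Because a maze may admit several safe paths, a candidate plan is most naturally checked by simulating it rather than by comparing it with a reference string. The following is a minimal sketch of such a check, not the released evaluation script. The letter encoding of the grid (`S`, `G`, `H`, `F`), the action letters, and the function name `is_safe_path` are assumptions made for illustration, as is the choice to count a move off the map as a failure.

```python
from typing import List

# Assumed encoding (illustrative only): S = start, G = goal, H = hole, F = free cell.
# Assumed action letters: L/R/U/D for left/right/up/down.
MOVES = {"L": (0, -1), "R": (0, 1), "U": (-1, 0), "D": (1, 0)}


def is_safe_path(grid: List[str], actions: str) -> bool:
    """Return True if `actions` leads from S to G without entering a hole.

    Assumption: stepping outside the map counts as a failure, and the
    player must be standing on G after the final action.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    # Locate the start cell.
    row, col = next(
        (r, c) for r in range(n_rows) for c in range(n_cols) if grid[r][c] == "S"
    )
    for action in actions:
        if action not in MOVES:
            return False  # unknown action symbol
        d_row, d_col = MOVES[action]
        row, col = row + d_row, col + d_col
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            return False  # walked off the map
        if grid[row][col] == "H":
            return False  # fell into a hole
    return grid[row][col] == "G"


example_grid = ["SFF", "FHF", "FFG"]
print(is_safe_path(example_grid, "RRDD"))  # True: goes around the hole
print(is_safe_path(example_grid, "RDDR"))  # False: second move enters the hole
```

With this kind of check, two different plans for the same map (here `RRDD` and `DDRR`) are both accepted, which matches the observation that answers are often not unique.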

3.3 The Blocksworld Scenario

Blocksworld is a widely adopted planning problem [42, 58, 59]. As depicted on the left of Figure 2, in this scenario, the agent is given images containing sets of blocks in unique colors. These blocks are stacked vertically, forming multiple stacks on the table. The agent is asked to transform the blocks from the initial state to the target state through a series of moving actions. In each action, the agent can only move the top block of a stack, and it must be moved either to the table or to the top of another stack.

Similarly, the main spatial planning task and the four sub-tasks are defined as follows:

$\bullet$ Main Task (Spatial Planning) – Form a moving plan to achieve the target state of block arrangement;

$\bullet$ T1 (Single Object Perception) – Determine the color of the block at a specific position;

$\bullet$ T2 (Spatial Relation Perception) – Determine the spatial relation between two blocks;

$\bullet$ T3 (Environment Perception) – Find the text representation that fits the visual environment;

$\bullet$ T4 (Spatial Reasoning) – Determine the consequence of a given moving plan.

An example input image with its questions is shown in Figure 2. As in the maze navigation scenario, each task has progressively adjusted difficulty. Specifically, in the Main Task and T4, difficulty is measured by the number of actions involved, ranging from 1 to 7, which quantifies the complexity of the action plan. For tasks T1-T3, which focus on perception, difficulty is measured by the number of blocks present in the image, ranging from 3 to 5. Please refer to Appendix A for complete examples of the questions in each task.
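The move rule of this scenario (only the top block of a stack may move, and only onto the table or onto the top of another stack) can be written down as a small simulator. The sketch below is illustrative and is not the benchmark's own code. The state encoding (a list of stacks, each listed bottom to top), the `"table"` destination keyword, the function names, and the decision to ignore the left-to-right order of stacks when comparing states are all assumptions.

```python
from typing import FrozenSet, List, Sequence, Tuple

State = List[List[str]]  # assumed encoding: each stack is listed bottom -> top


def apply_move(state: State, block: str, dest: str) -> State:
    """Move `block` onto `dest` ("table" or another block's color).

    Raises ValueError if the move breaks the rule that only the top block
    of a stack can move, and only to the table or the top of another stack.
    """
    stacks = [list(s) for s in state]  # copy so the input is not mutated
    src = next((s for s in stacks if s and s[-1] == block), None)
    if src is None:
        raise ValueError(f"{block} is not on top of any stack")
    if dest == "table":
        src.pop()
        stacks.append([block])  # the block starts a new stack on the table
    else:
        dst = next((s for s in stacks if s is not src and s and s[-1] == dest), None)
        if dst is None:
            raise ValueError(f"{dest} is not the top of another stack")
        src.pop()
        dst.append(block)
    return [s for s in stacks if s]  # drop stacks that became empty


def canonical(state: State) -> FrozenSet[Tuple[str, ...]]:
    """Order-insensitive view of a state (assumes stack positions do not matter)."""
    return frozenset(tuple(s) for s in state if s)


def reaches_goal(initial: State, plan: Sequence[Tuple[str, str]], goal: State) -> bool:
    """Simulate `plan`; any illegal move makes the whole plan fail."""
    state = initial
    try:
        for block, dest in plan:
            state = apply_move(state, block, dest)
    except ValueError:
        return False
    return canonical(state) == canonical(goal)


initial = [["red", "blue"], ["green"]]
goal = [["green", "blue"], ["red"]]
print(reaches_goal(initial, [("blue", "green")], goal))  # True
print(reaches_goal(initial, [("red", "green")], goal))   # False: red is not a top block
```

Simulating the plan, rather than matching it against one reference plan, also accepts longer but still valid plans such as moving blue to the table first and then onto green.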

3.4 Benchmark Creation Process

Figure 3 demonstrates our 3-stage general process for benchmark creation.

First, in the left panel of Figure 3, we prepare the input images used for each task and scenario. In the maze navigation scenario, we generate input maps using the OpenAI Gym package [57], with modifications to ensure that the positions of the player, the goal, and the holes are all randomly generated. In the blocksworld scenario, we sample pairs of images from the BIRD dataset [59], ensuring there is at least one viable plan to move the blocks from the initial state to the target state. The images are prepared conditional on different levels of difficulty.
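As a rough illustration of the maze side of this stage, the sketch below places the player, the goal, and the holes at random and keeps only maps that admit a safe path. It is written in plain Python rather than against the Gym API, and it is not the authors' generation code: the hole probability, the retry loop, the solvability filter via breadth-first search, and all function names are assumptions.

```python
import random
from collections import deque
from typing import List, Optional


def random_maze(size: int, hole_prob: float, rng: random.Random) -> List[str]:
    """Place S and G on two distinct random cells; every other cell is a hole with prob. hole_prob."""
    cells = [(r, c) for r in range(size) for c in range(size)]
    start, goal = rng.sample(cells, 2)
    rows = []
    for r in range(size):
        row = ""
        for c in range(size):
            if (r, c) == start:
                row += "S"
            elif (r, c) == goal:
                row += "G"
            else:
                row += "H" if rng.random() < hole_prob else "F"
        rows.append(row)
    return rows


def shortest_path_length(grid: List[str]) -> Optional[int]:
    """Breadth-first search from S to G avoiding holes; None if no safe path exists."""
    n_rows, n_cols = len(grid), len(grid[0])
    start = next(
        (r, c) for r in range(n_rows) for c in range(n_cols) if grid[r][c] == "S"
    )
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if grid[r][c] == "G":
            return dist
        for d_r, d_c in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            n_r, n_c = r + d_r, c + d_c
            if (
                0 <= n_r < n_rows
                and 0 <= n_c < n_cols
                and (n_r, n_c) not in seen
                and grid[n_r][n_c] != "H"
            ):
                seen.add((n_r, n_c))
                queue.append(((n_r, n_c), dist + 1))
    return None


def sample_solvable_maze(
    size: int, hole_prob: float = 0.2, seed: int = 0, max_tries: int = 1000
) -> List[str]:
    """Resample until the map has at least one safe path (assumed filtering step)."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        grid = random_maze(size, hole_prob, rng)
        if shortest_path_length(grid) is not None:
            return grid
    raise RuntimeError("no solvable maze found; lower hole_prob or raise max_tries")


maze = sample_solvable_maze(size=4, seed=1)
print("\n".join(maze))
print("minimum number of moves:", shortest_path_length(maze))
```

The same breadth-first search also yields the minimum number of moves, which is one way a path-length difficulty level could be assigned to a generated map.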

Second, in the center panel of Figure 3, we formulate input prompts for each task. The prompt consists of interleaved text and images to provide sufficient information. For example, for maze navigation, we include images to show the appearance of elements in the map and provide example maps to better illustrate how the models should interpret the map. We invite native speakers to refine the prompts so that they accurately describe the task requirements. These prompts are demonstrated in Appendix A.

Finally, in the right panel of Figure 3, we evaluate the performance of VLMs under each task. It is worth noting that the answer for each task is often not unique. For example, in the blocksworld scenario, there can be many ways to move the blocks to reach the target state. As such, we develop scripts to automatically evaluate the answers for each task.
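Once each answer has been judged correct or incorrect, the per-difficulty success rates reported in the result tables follow from a simple aggregation. The sketch below shows one way to compute them; the `(difficulty, success)` record format and the function name are assumptions for illustration, not the format used by the released scripts.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_difficulty(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each difficulty level to the fraction of problems solved at that level."""
    totals: Dict[int, int] = defaultdict(int)
    solved: Dict[int, int] = defaultdict(int)
    for level, success in results:
        totals[level] += 1
        solved[level] += int(success)
    return {level: solved[level] / totals[level] for level in sorted(totals)}


# Hypothetical judgments: (difficulty level, whether the plan was valid).
records = [(3, True), (3, False), (4, False), (4, False)]
print(success_rate_by_difficulty(records))  # {3: 0.5, 4: 0.0}
```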

In addition to the steps above, some tasks require extra steps to construct meaningful questions, candidates, and answers, such as prompt design or example filtering. For example, in task T4 (Spatial Reasoning) of the blocksworld scenario, the input actions must cover various valid and invalid movements, requiring filtering and balancing. The detailed steps we followed to create each task set are provided in Appendix A. We release all images, texts, and scripts to facilitate replication and scaling.

Table 2: Zero-shot success rates for the spatial planning task, at various difficulty levels. Maze navigation difficulty levels represent the maze’s square grid length. Blocksworld difficulty levels correspond to the minimum number of steps to a solution. Results better than 30% are bolded.

	Maze Navigation						Blocksworld				Overall
Difficulty level	3	4	5	6	7	8	1	3	5	7
Gemini[7]	0.31	0.26	0.15	0.06	0.14	0.10	0.10	0.14	0.00	0.01	0.13
GPT-Vision[5]	0.55	0.36	0.27	0.13	0.17	0.10	0.50	0.17	0.03	0.00	0.23
Claude-3[60]	0.52	0.33	0.16	0.15	0.16	0.09	0.12	0.03	0.00	0.00	0.16
GPT-4o[61]	0.68	0.58	0.35	0.24	0.18	0.23	0.71	0.33	0.12	0.03	0.35
LLaVA[6]	0.03	0.03	0.02	0.08	0.09	0.04	0.04	0.01	0.00	0.00	0.04
InternLM[62]	0.27	0.16	0.06	0.05	0.04	0.07	0.10	0.03	0.00	0.00	0.08
InternLM-VL[62]	0.15	0.14	0.08	0.04	0.02	0.05	0.02	0.00	0.00	0.00	0.05
InstructBLIP[63]	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
SPHINX[64]	0.11	0.08	0.05	0.02	0.04	0.03	0.07	0.06	0.01	0.00	0.05

4 Experiments

In this section, we present evaluation results of state-of-the-art VLMs under our main tasks and sub-tasks. Our goal is to answer the following research questions: ❶ How well can state-of-the-art VLMs perform in the visual spatial planning tasks? ❷ What are the bottleneck capabilities that limit the VLMs in visual spatial planning tasks?

4.1 Baselines

We evaluate various representative VLMs including both private and open-source models.

We cover the following private models: ① Gemini [7] has demonstrated remarkable capabilities in image understanding and reasoning. We adopt Gemini-1.0-Pro-Vision in our experiments (the latest Gemini-1.5-Pro-Vision currently has a daily request limit of 50; therefore, we did not include it in our evaluation). ② GPT-4 Turbo with vision [5] inherits strong text understanding capabilities from GPT-4 and is equipped with vision capabilities. We use turbo-2024-04-09 for evaluation. ③ Claude-3 [60] is a family of VLMs strong at advanced reasoning and vision analysis. We adopt claude-3-sonnet-20240229, the default model in the chat interface, whose speed and cost are comparable to those of GPT-Vision. ④ GPT-4o [61] is a recently released multimodal LM with one of the most advanced abilities in processing combinations of text, audio, and images. We adopt gpt-4o-2024-05-13 in experiments.

We cover the following open-source models: ⑤ LLaVA [6] performs instruction tuning on LLaMA and projects the input image into the text embedding space through a pre-trained CLIP visual encoder [65]. We adopt llava-v1.6-vicuna-7b for evaluation. ⑥ InternLM-XComposer2 [62] enhances the ability to understand free-form text-image composition and surpasses GPT-4V in several tasks. The latest released checkpoints include internlm-xcomposer2-7b and internlm-xcomposer2-vl-7b, with the former focusing on general text-image composition and the latter focusing on VL benchmarks. We adopt both for evaluation. ⑦ InstructBLIP [63] is a popular VLM based on the pre-trained BLIP-2 [66] model. We adopt blip2-t5-instruct-flant5xxl for evaluation. ⑧ SPHINX [64] unfreezes the LLM during pre-training to enhance cross-modal alignment. We adopt SPHINX-v2-1k for evaluation. Additionally, we attempted to evaluate the latest CogVLM2 [67], fine-tuned on LLaMA-3 [68]. However, since its current codebase only supports single-image input, we do not include its results. For all open-source models, we use their publicly released checkpoints, code, and hyperparameter choices.

4.2 Main Task (Spatial Planning) Evaluation

First, we present the main task evaluation results for both the maze and the Blocksworld scenarios, which reflect the general spatial planning capabilities of existing VLMs. All evaluations in this section are conducted in the zero-shot setting without any fine-tuning or in-context learning. Evaluations with in-context learning and fine-tuning are presented in Sections 4.5 and 4.6.

The results are shown in Table 2. Each column represents a difficulty level, which is measured by the size of the map (3 represents 3x3 maps) in Maze Navigation and by the minimum number of steps in Blocksworld. From the table, we summarize our findings as follows:

VLMs have considerable room for improvement in spatial planning tasks. We observe that both private and open-source models exhibit sub-optimal performance in various scenarios. In particular, open-source models face significant challenges and rarely succeed in these tasks. Moreover, even the most capable private models frequently make mistakes on relatively simple tasks, such as those involving a 3x3 map or a single-step block-moving task. Considering that these tasks would be simple for humans, the VSP benchmark poses a substantial challenge to VLMs, illustrating that current VLMs have considerable potential for improvement in spatial planning tasks.

Quick performance decay as difficulty increases. We observe a significant drop in the success rates of VLMs as task difficulty escalates. For example, GPT-Vision achieves a success rate of 55% on 3x3 maps, but this plummets to just 10% on 8x8 maps. Analyzing the impact of increased difficulty, we identify two major challenges for the models. First, a larger map in the maze navigation scenario makes it harder for the model to accurately perceive the positions of elements within the map. Second, increases in both map size and the number of steps required for moving blocks make it harder for the model to reason through the entire path and devise a complete, viable solution. In the following experiments, we focus on these two factors and analyze them in depth with the subsequent tasks.

Challenges in open-source models. Finally, we note that open-source models often face challenges when evaluated on these tasks. We identify two main factors. ① Context length: Open-source models typically have significantly shorter context windows than private models. Moreover, image embeddings can occupy many tokens. Thus, these models may not have enough capacity to process the complete inputs. For example, llava-v1.6-vicuna-7b is trained with a maximum context window of 2048 tokens, while each image consumes 576 tokens. Consequently, when fed multiple images and relatively long texts in our tasks, the total token length may exceed the context length seen during training, resulting in poor performance. ② Multiple image input: Our tasks require the model to understand multiple images interleaved with text inputs, whereas many open-source models are only trained with single-image inputs, with the image positioned at the start of the input. To further explore their potential in our tasks, we assess their performance after training on our inputs in Section 4.6. Meanwhile, we suggest that future open-source models consider increasing their context length and reducing restrictions on input formats to address complex and realistic tasks effectively.
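The arithmetic behind the context-length concern can be made explicit. The short sketch below uses only the two figures quoted above (a 2048-token window and 576 tokens per image); the function name is illustrative, and the calculation ignores any special or separator tokens a particular tokenizer may add.

```python
def remaining_text_tokens(context_window: int, tokens_per_image: int, num_images: int) -> int:
    """Tokens left for text once the image embeddings are placed in the context.

    A negative value means the images alone already overflow the window.
    """
    return context_window - tokens_per_image * num_images


CONTEXT_WINDOW = 2048   # training context window quoted for llava-v1.6-vicuna-7b
TOKENS_PER_IMAGE = 576  # tokens consumed by one image, as quoted above

budget = {
    n: remaining_text_tokens(CONTEXT_WINDOW, TOKENS_PER_IMAGE, n) for n in range(1, 5)
}
print(budget)  # {1: 1472, 2: 896, 3: 320, 4: -256}
```

Under these numbers, three images leave only 320 tokens for all of the text, and four images exceed the window before any text is added.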

Table 3: Decomposed Capability Analysis. Similar to the spatial planning task, each task consists of tests with different difficulties. Results better than 70% are bolded. Please refer to Appendix E for the complete evaluation results for different difficulties.

	Maze Navigation				Blocksworld
Task	T1	T2	T3	T4	T1	T2	T3	T4
Random Guess	0.5	0.25	0.25	0.5	0.17	0.25	0.25	0.5
Gemini [7]	0.58	0.56	0.33	0.49	0.86	0.51	0.54	0.55
GPT-Vision [5]	0.56	0.27	0.46	0.56	0.73	0.80	0.70	0.71
Claude-3 [60]	0.45	0.67	0.32	0.61	0.43	0.53	0.49	0.66
GPT-4o [61]	0.58	0.67	0.58	0.74	0.95	0.90	0.90	0.76
LLaVA [6]	0.49	0.27	0.21	0.54	0.22	0.21	0.24	0.55
InternLM [62]	0.48	0.27	0.29	0.58	0.25	0.32	0.26	0.53
InternLM-VL [62]	0.41	0.20	0.17	0.47	0.22	0.20	0.20	0.53
InstructBLIP [63]	0.44	0.23	0.21	0.37	0.21	0.16	0.22	0.47
SPHINX [64]	0.56	0.28	0.32	0.59	0.24	0.33	0.27	0.58

4.3 The Perception and Reasoning Sub-tasks Evaluation

From the previous observations, we identify spatial perception and reasoning as two important capabilities for an agent to successfully perform visual spatial planning. Next, we evaluate the perception and reasoning abilities through the remaining tasks T1-T4. As before, all evaluations are conducted in the zero-shot setting.

The performance results are presented in Table 3. We observe that the recent GPT-4o and GPT-Vision achieve good performance across a series of tasks, demonstrating a decent capability in perception and reasoning. However, the overall performance of private models hovers around 50% accuracy, which is far from satisfactory for agents requiring spatial intelligence. Furthermore, the performance of open-source models is mostly close to random guessing on these tasks, indicating a significant gap compared to private models. We also note that tasks T1-T3 focus on perception abilities, whereas task T4 requires both understanding the input images and performing reasoning. We perform further analysis to disentangle these two abilities in Section 4.4.

4.4 The Effects of Visual Input Perception and Reasoning

Previous analysis shows that even current state-of-the-art models have clear deficiencies in various aspects of visual spatial planning. In this study, we focus on disentangling the effects of perception and reasoning by exploring the performance gain assuming the model had perfect perception.

The key strategy of this study is to create a scenario where the model has already acquired all the necessary information that would typically be obtained through visual perception. To this end, for every input image, we produce a corresponding textual input and use it in place of the image, as demonstrated in Figure 4. For the maze navigation scenario, we use either pure text descriptions or tables to depict the image. For the blocks world scenario, we use pure text descriptions. We do not use tables for the blocks world scenario because the stacks usually contain unequal numbers of blocks, making it difficult to form a complete table. Please refer to Appendix B for a complete example with pure text or table input.
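To make the idea of replacing an image with text concrete, the sketch below serializes a maze map in two ways, as plain sentences and as a table. The wording, the letter encoding of the grid, the 1-indexed coordinates, and the function names are illustrative assumptions; the exact formats used in the benchmark are the ones given in Appendix B.

```python
from typing import List

# Assumed grid encoding (illustrative only): S = player, G = goal, H = hole, F = free cell.


def maze_to_text(grid: List[str]) -> str:
    """Describe the map in plain sentences using 1-indexed (row, column) coordinates."""
    sentences = []
    holes = []
    for r, row in enumerate(grid, start=1):
        for c, cell in enumerate(row, start=1):
            if cell == "S":
                sentences.append(f"The player is at row {r}, column {c}.")
            elif cell == "G":
                sentences.append(f"The goal is at row {r}, column {c}.")
            elif cell == "H":
                holes.append(f"row {r}, column {c}")
    sentences.append("Holes: " + ("; ".join(holes) if holes else "none") + ".")
    return " ".join(sentences)


def maze_to_table(grid: List[str]) -> str:
    """Render the map as a pipe-separated table with row and column labels."""
    n_cols = len(grid[0])
    header = "| | " + " | ".join(f"Col {c}" for c in range(1, n_cols + 1)) + " |"
    rows = [
        f"| Row {r} | " + " | ".join(row) + " |" for r, row in enumerate(grid, start=1)
    ]
    return "\n".join([header] + rows)


example_grid = ["SFF", "FHF", "FFG"]
print(maze_to_text(example_grid))
print(maze_to_table(example_grid))
```

A table works here because every maze row has the same number of cells, which is exactly the property that unequal block stacks lack.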

The results are shown in Figure 5. We observe a clear performance improvement when using textual input across every task. This suggests image perception presents significant challenges for VLMs, and poor perception ability is a key factor in the inferior performance observed in previous tasks. Meanwhile, we observe that even with textual input, Gemini still cannot achieve decent performance on tasks that require reasoning. This indicates deficiencies in its reasoning capabilities as well.

Table 4: Effects of providing in-context examples.

	Maze Navigation					Blocksworld
Task	T1	T2	T3	T4	Main	T1	T2	T3	T4	Main
Gemini, 0-shot	0.58	0.56	0.33	0.49	0.17	0.86	0.51	0.54	0.55	0.03
Gemini, 1-shot	0.50	0.66	0.31	0.48	0.20	0.91	0.68	0.71	0.59	0.03
Gemini, 2-shot	0.53	0.68	0.31	0.51	0.21	0.90	0.76	0.70	0.61	0.03
Gemini, 4-shot	0.53	0.67	0.35	0.53	0.19	0.91	0.64	0.69	0.62	0.06
GPT-Vision, 0-shot	0.56	0.27	0.46	0.56	0.26	0.73	0.80	0.70	0.71	0.10
GPT-Vision, 1-shot	0.55	0.50	0.47	0.57	0.28	0.89	0.84	0.94	0.73	0.11
GPT-Vision, 2-shot	0.55	0.63	0.50	0.56	0.30	0.90	0.83	0.95	0.71	0.16
GPT-Vision, 4-shot	0.54	0.69	0.54	0.56	0.29	0.90	0.79	0.96	0.73	-

Table 5: Fine-tuning results for open-source models.

		Maze Navigation					Blocksworld
Model	Setting	T1	T2	T3	T4	Main	T1	T2	T3	T4	Main
LLaVA	zero-shot	0.49	0.27	0.21	0.54	0.05	0.22	0.21	0.24	0.55	0.01
LLaVA	fine-tune	0.53	0.99	0.51	0.93	0.60	1.00	1.00	1.00	1.00	0.97
InternLM	zero-shot	0.48	0.27	0.29	0.58	0.11	0.25	0.32	0.26	0.53	0.00
InternLM	fine-tune	0.52	0.59	0.91	0.59	0.17	0.29	0.44	0.69	0.62	0.09

4.5 In-context Learning in Visual Spatial Planning

In-context learning is a widely adopted method to enhance an LM’s reasoning ability [4]. In this analysis, we study whether it boosts visual spatial planning capabilities. We include varying numbers of examples for Gemini and GPT-Vision (refer to Appendix C for the input examples). The results are shown in Table 4. There are two key observations. First, in-context examples help somewhat, but the gains are not significant: introducing examples helps only in a few isolated cases, such as T2 in maze navigation and T3 in blocksworld. Second, scaling the number of in-context examples generally does not help, as illustrated by the saturated performance in each task.

4.6 Fine-tuning in VSP Tasks

Finally, we assess the capabilities of open-source models through dedicated training for each task. For each task, the model is trained on 10k data points. We use the default hyperparameters provided in each model’s official repository. The results, shown in Table 5, demonstrate clear performance improvements for both models across a series of tasks, highlighting their potential in spatial planning. Additionally, we observe that LLaVA improves more than InternLM, suggesting that different model architectures may exhibit varying levels of efficacy in spatial planning capabilities.

5 Conclusion

We present VSP, a benchmark measuring and diagnosing the visual spatial planning capabilities in VLMs. The VSP quantifies the model’s performance through a series of carefully designed tasks, with main tasks focusing on the general spatial planning abilities and sub-tasks focusing on the individual capabilities needed for the main task. Experiments on current models show that both private models and open-source models fail to generate effective plans for even simple spatial planning tasks, and further analyses expose their bottlenecks in spatial perception and reasoning abilities. Our work illuminates future directions for improving VLMs’ abilities in spatial planning.

References

Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023. URL https://arxiv.org/abs/2302.13971.
Bi et al. [2024] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. ArXiv preprint, abs/2401.02954, 2024. URL https://arxiv.org/abs/2401.02954.
Jiang et al. [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. ArXiv preprint, abs/2401.04088, 2024. URL https://arxiv.org/abs/2401.04088.
Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. ArXiv preprint, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774.
Liu et al. [2024] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. ArXiv preprint, abs/2312.11805, 2023. URL https://arxiv.org/abs/2312.11805.
Awadalla et al. [2023] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. ArXiv preprint, abs/2308.01390, 2023. URL https://arxiv.org/abs/2308.01390.
Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
Ying et al. [2024] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. ArXiv preprint, abs/2404.16006, 2024. URL https://arxiv.org/abs/2404.16006.
Yang et al. [2024] Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, and Xin Geng. Exploring diverse in-context configurations for image captioning. Advances in Neural Information Processing Systems, 36, 2024.
Shao et al. [2023] Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. Prompting large language models with answer heuristics for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14974–14983, 2023.
Zheng et al. [2023] Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang. Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. Advances in Neural Information Processing Systems, 36:5168–5191, 2023.
Valmeekam et al. [2023] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - a critical investigation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 75993–76005. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/efb2072a358cefb75886a315a6fcf880-Paper-Conference.pdf.
Tian et al. [2024] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. ArXiv preprint, abs/2402.12289, 2024. URL https://arxiv.org/abs/2402.12289.
Ma et al. [2023] Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. ArXiv preprint, abs/2312.00438, 2023. URL https://arxiv.org/abs/2312.00438.
Chang et al. [2023] Haonan Chang, Kai Gao, Kowndinya Boyalakuntla, Alex Lee, Baichuan Huang, Harish Udhaya Kumar, Jinjin Yu, and Abdeslam Boularias. Lgmcts: Language-guided monte-carlo tree search for executable semantic object rearrangement. ArXiv preprint, abs/2309.15821, 2023. URL https://arxiv.org/abs/2309.15821.
Hu et al. [2023] Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning. ArXiv preprint, abs/2311.17842, 2023. URL https://arxiv.org/abs/2311.17842.
Fu et al. [2023] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. ArXiv preprint, abs/2306.13394, 2023. URL https://arxiv.org/abs/2306.13394.
Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of CVPR, 2024.
Wang et al. [2024a] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. ArXiv preprint, abs/2402.14804, 2024a. URL https://arxiv.org/abs/2402.14804.
Li et al. [2023a] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv preprint, abs/2307.16125, 2023a. URL https://arxiv.org/abs/2307.16125.
Yu et al. [2023] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. ArXiv preprint, abs/2308.02490, 2023. URL https://arxiv.org/abs/2308.02490.
Aeronautiques et al. [1998] Constructions Aeronautiques, Adele Howe, Craig Knoblock, ISI Drew McDermott, Ashwin Ram, Manuela Veloso, Daniel Weld, David Wilkins Sri, Anthony Barrett, Dave Christianson, et al. Pddl: the planning domain definition language. Technical report, 1998.
Guo et al. [2014] Xiaoxiao Guo, Satinder P. Singh, Honglak Lee, Richard L. Lewis, and Xiaoshi Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3338–3346, 2014. URL https://proceedings.neurips.cc/paper/2014/hash/8bb88f80d334b1869781beb89f7b73be-Abstract.html.
Sutton [1991] Richard S Sutton. Planning by incremental dynamic programming. In Machine learning proceedings 1991, pages 353–357. Elsevier, 1991.
Kambhampati [2024] Subbarao Kambhampati. Can large language models reason and plan? Annals of the New York Academy of Sciences, 1534(1):15–18, 2024.
Kambhampati et al. [2024] Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. Llms can’t plan, but can help planning in llm-modulo frameworks. ArXiv preprint, abs/2402.01817, 2024. URL https://arxiv.org/abs/2402.01817.
Stechly et al. [2024] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks. ArXiv preprint, abs/2402.08115, 2024. URL https://arxiv.org/abs/2402.08115.
Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
Yao et al. [2024] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
Shen et al. [2024] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36, 2024.
Yao et al. [2022] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. ArXiv preprint, abs/2210.03629, 2022. URL https://arxiv.org/abs/2210.03629.
Ahn et al. [2022] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. ArXiv preprint, abs/2204.01691, 2022. URL https://arxiv.org/abs/2204.01691.
Hazra et al. [2024] Rishi Hazra, Pedro Zuidberg Dos Martires, and Luc De Raedt. Saycanpay: Heuristic planning with large language models using learnable domain knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 20123–20133, 2024.
Shinn et al. [2024] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
Madaan et al. [2024] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.
Guan et al. [2023] Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems, 36:79081–79094, 2023.
Ruan et al. [2023] Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu Mao, Xingyu Zeng, and Rui Zhao. Tptu: Task planning and tool usage of large language model-based ai agents. ArXiv preprint, abs/2308.03427, 2023. URL https://arxiv.org/abs/2308.03427.
Wu et al. [2023] Yue Wu, Xuan Tang, Tom M Mitchell, and Yuanzhi Li. Smartplay: A benchmark for llms as intelligent agents. ArXiv preprint, abs/2310.01557, 2023. URL https://arxiv.org/abs/2310.01557.
Xie et al. [2024] Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. Travelplanner: A benchmark for real-world planning with language agents. ArXiv preprint, abs/2402.01622, 2024. URL https://arxiv.org/abs/2402.01622.
Valmeekam et al. [2022] Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). ArXiv preprint, abs/2206.10498, 2022. URL https://arxiv.org/abs/2206.10498.
Joublin et al. [2023] Frank Joublin, Antonello Ceravola, Pavel Smirnov, Felix Ocker, Joerg Deigmoeller, Anna Belardinelli, Chao Wang, Stephan Hasler, Daniel Tanneberg, and Michael Gienger. Copal: Corrective planning of robot actions with large language models. ArXiv preprint, abs/2310.07263, 2023. URL https://arxiv.org/abs/2310.07263.
Sakib and Sun [2023] Md Sadman Sakib and Yu Sun. From cooking recipes to robot task trees–improving planning correctness and task efficiency by leveraging llms with a knowledge network. ArXiv preprint, abs/2309.09181, 2023. URL https://arxiv.org/abs/2309.09181.
Feng et al. [2024] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36, 2024.
Shridhar et al. [2021] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew J. Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=0IOX0YcCdTn.
Yang et al. [2023] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). ArXiv preprint, abs/2309.17421, 2023. URL https://arxiv.org/abs/2309.17421.
Aghzal et al. [2023] Mohamed Aghzal, Erion Plaku, and Ziyu Yao. Can large language models be good path planners? a benchmark and investigation on spatial-temporal reasoning. ArXiv preprint, abs/2310.03249, 2023. URL https://arxiv.org/abs/2310.03249.
Qi et al. [2023] Zhangyang Qi, Ye Fang, Mengchen Zhang, Zeyi Sun, Tong Wu, Ziwei Liu, Dahua Lin, Jiaqi Wang, and Hengshuang Zhao. Gemini vs gpt-4v: A preliminary comparison and combination of vision-language models through qualitative cases. ArXiv preprint, abs/2312.15011, 2023. URL https://arxiv.org/abs/2312.15011.
Cha et al. [2024] Sungguk Cha, Jusung Lee, Younghyun Lee, and Cheoljong Yang. Visually dehallucinative instruction generation: Know what you don’t know. ArXiv preprint, abs/2402.09717, 2024. URL https://arxiv.org/abs/2402.09717.
Ge et al. [2023] Wentao Ge, Shunian Chen, Guiming Chen, Junying Chen, Zhihong Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xidong Wang, et al. Mllm-bench, evaluating multi-modal llms using gpt-4v. ArXiv preprint, abs/2311.13951, 2023. URL https://arxiv.org/abs/2311.13951.
Tong et al. [2024] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. ArXiv preprint, abs/2401.06209, 2024. URL https://arxiv.org/abs/2401.06209.
Lu et al. [2023] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. arXiv e-prints, pages arXiv–2310, 2023.
Liu et al. [2023] Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models. ArXiv preprint, abs/2310.14566, 2023. URL https://arxiv.org/abs/2310.14566.
Song et al. [2024] Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, and Benyou Wang. Milebench: Benchmarking mllms in long context. ArXiv preprint, abs/2404.18532, 2024. URL https://arxiv.org/abs/2404.18532.
Wang et al. [2024b] Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models. ArXiv preprint, abs/2406.11230, 2024b. URL https://arxiv.org/abs/2406.11230.
Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. ArXiv preprint, abs/1606.01540, 2016. URL https://arxiv.org/abs/1606.01540.
Hao et al. [2023] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. ArXiv preprint, abs/2305.14992, 2023. URL https://arxiv.org/abs/2305.14992.
Gokhale et al. [2019] Tejas Gokhale, Shailaja Sampat, Zhiyuan Fang, Yezhou Yang, and Chitta Baral. Blocksworld revisited: Learning and reasoning to generate event-sequences from image pairs. ArXiv preprint, abs/1905.12042, 2019. URL https://arxiv.org/abs/1905.12042.
Anthropic [2024] AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024.
[61] Gpt-4o. https://openai.com/index/hello-gpt-4o/.
Dong et al. [2024] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. ArXiv preprint, abs/2401.16420, 2024. URL https://arxiv.org/abs/2401.16420.
Dai et al. [2024] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
Lin et al. [2023] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. ArXiv preprint, abs/2311.07575, 2023. URL https://arxiv.org/abs/2311.07575.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html.
Li et al. [2023b] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023b.
Wang et al. [2023] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. ArXiv preprint, abs/2311.03079, 2023. URL https://arxiv.org/abs/2311.03079.
[68] Llama-3. https://ai.meta.com/blog/meta-llama-3/.

Appendix A Details of Benchmarks

A.1 Additional Task Implementation Process and Statistics

In Section 3.4, we described our general benchmark creation process. Our task creation process can be divided into three stages: (a) preparing the input image; (b) formulating task prompts; (c) developing scripts for auto evaluation. Additionally, some specific tasks require extra steps to implement, such as question and answer generation. In what follows, we describe these steps in detail.

Maze Navigation, Main Task In this task, agents need to find a safe path from the start grid to the goal. We adjust the map generation mechanism so that the positions of the start grid, the goal, and the holes are all randomly generated, while ensuring that there is at least one viable safe path from the start grid to the goal. Each grid contains a hole with probability 20%.
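The generation constraint above can be illustrated with a minimal sketch. It is not the benchmark's actual generator: we assume rejection sampling with a breadth-first search as the solvability check, and the function names, the cell symbols ('S', 'G', 'H', 'F'), and the choice to overwrite the start and goal cells after sampling holes are all hypothetical.

```python
import random
from collections import deque


def has_safe_path(grid, start, goal):
    """Breadth-first search over non-hole cells using 4-connected moves."""
    n = len(grid)
    queue, seen = deque([start]), {start}
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < n and 0 <= nc < n
            if inside and (nr, nc) not in seen and grid[nr][nc] != "H":
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def generate_maze(size, hole_prob=0.2, rng=None):
    """Resample until the map is solvable (assumed strategy, size >= 2)."""
    rng = rng or random.Random()
    cells = [(r, c) for r in range(size) for c in range(size)]
    while True:
        start, goal = rng.sample(cells, 2)  # two distinct random positions
        grid = [["H" if rng.random() < hole_prob else "F" for _ in range(size)]
                for _ in range(size)]
        grid[start[0]][start[1]] = "S"  # start and goal are never holes
        grid[goal[0]][goal[1]] = "G"
        if has_safe_path(grid, start, goal):
            return grid, start, goal
```

Calling `generate_maze(4, rng=random.Random(0))` returns a 4×4 map together with its start and goal positions; passing a seeded `random.Random` makes the sampled map reproducible.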

Maze Navigation, T1 In this task, agents need to identify whether a specific grid is safe (i.e., whether it contains a hole or not). We randomly sample a row number and a column number and ask the safety question for this randomly chosen grid in each problem of this task. Additionally, to prevent the model from patterned guessing and achieving falsely high ratings (e.g., answering "not safe" for all images and obtaining high accuracy scores), we regenerate the map for this task to ensure that the safe and unsafe grids each comprise around 50% of the total grids in a single map.
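One simple way to obtain such a balanced map is sketched below. This is an assumption about how the balance could be achieved, not the benchmark's code: holes are placed by shuffling a half-and-half list of cells, start and goal cells are ignored for brevity, and `balanced_map` and `sample_question` are hypothetical names.

```python
import random


def balanced_map(size, rng=None):
    """Build a size x size map in which about half of the cells are holes."""
    rng = rng or random.Random()
    n = size * size
    flat = ["H"] * (n // 2) + ["F"] * (n - n // 2)
    rng.shuffle(flat)  # random arrangement, fixed hole count
    return [flat[r * size:(r + 1) * size] for r in range(size)]


def sample_question(grid, rng=None):
    """Pick a random cell and return (row, col, is_safe)."""
    rng = rng or random.Random()
    size = len(grid)
    row, col = rng.randrange(size), rng.randrange(size)
    return row, col, grid[row][col] != "H"
```

With this construction, a model that always answers "not safe" scores close to 50% in expectation, so constant guessing no longer yields a misleadingly high accuracy.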

Maze Navigation, T3 In this task, agents need to find the correct textual description that fits the visual environment. For each problem, we prepare four textual description candidates. One candidate is the correct answer, one candidate has the correct size but an incorrect map arrangement, and the other two candidates have the wrong size. The candidates are shuffled so that the position of the correct answer gives the model no hint.
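The candidate construction described above can be sketched as follows. This is an illustrative sketch only: layouts are reduced to holes ('H') and free cells ('F'), the range of wrong sizes is our own choice, and `random_layout` and `build_candidates` are hypothetical helpers rather than functions from the released code.

```python
import random


def random_layout(size, rng, hole_prob=0.2):
    """Random size x size layout of holes ('H') and free cells ('F')."""
    return tuple(
        tuple("H" if rng.random() < hole_prob else "F" for _ in range(size))
        for _ in range(size)
    )


def build_candidates(true_layout, rng=None):
    """Return four shuffled candidates and the index of the correct one."""
    rng = rng or random.Random()
    truth = tuple(tuple(row) for row in true_layout)
    size = len(truth)
    # Distractor with the correct size but a different arrangement.
    same_size = random_layout(size, rng)
    while same_size == truth:
        same_size = random_layout(size, rng)
    # Two distractors with wrong (and mutually different) sizes.
    other_sizes = [s for s in range(max(2, size - 2), size + 3) if s != size]
    wrong = [random_layout(s, rng) for s in rng.sample(other_sizes, 2)]
    candidates = [truth, same_size] + wrong
    rng.shuffle(candidates)
    return candidates, candidates.index(truth)
```

The returned index is what an automatic evaluation script would compare against the option letter chosen by the model.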

Maze Navigation, T4 In this task, agents need to determine if the given action series is safe or not. Similarly, to prevent the models from achieving falsely high ratings through guessing, we generate the action series to ensure that around 50% of them are safe and the other 50% are not. For the unsafe paths, the particular step in which the player steps into a hole is also randomly chosen.
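Labeling an action series as safe or unsafe amounts to simulating it on the map. The sketch below shows one such check under stated assumptions: actions are encoded as the letters L, R, U, and D, a move that would leave the map keeps the player in place, and `is_safe` is a hypothetical name. None of these details is specified in the text above.

```python
# Assumed action encoding: (row delta, column delta) per letter.
MOVES = {"L": (0, -1), "R": (0, 1), "U": (-1, 0), "D": (1, 0)}


def is_safe(grid, start, actions):
    """Return (safe, failing_step); failing_step is None for a safe series."""
    n_rows, n_cols = len(grid), len(grid[0])
    r, c = start
    for step, action in enumerate(actions, start=1):
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        # Assumption: moving off the map leaves the player where it is.
        if 0 <= nr < n_rows and 0 <= nc < n_cols:
            r, c = nr, nc
        if grid[r][c] == "H":
            return False, step  # the player steps into a hole at this step
    return True, None
```

The failing step returned for unsafe series corresponds to the randomly chosen step at which the player falls into a hole.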

Blocks World, T2 In this task, agents need to determine the spatial relation between two designated blocks. In addition to the directional relation (“above” and “below”), we note that it is important for agents to recognize whether two blocks are in the same stack. Therefore, we design the following four candidates for this question: (A) The first block is directly above the second block, and they are in the same stack; (B) The first block is directly below the second block, and they are in the same stack; (C) The two blocks are in different stacks; (D) At least one of the mentioned blocks does not exist in the presented image.
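The four candidates can be mapped to a small labeling function, sketched here under our own assumptions: each stack is a list of block names ordered from bottom to top, and `relation` is a hypothetical helper. The text does not say how two non-adjacent blocks in the same stack are handled, so the sketch returns `None` for that case instead of guessing.

```python
def relation(stacks, first, second):
    """Return 'A', 'B', 'C', or 'D' as defined above, or None if uncovered.

    stacks: list of stacks, each a list of block names from bottom to top.
    """
    position = {
        block: (s, h)
        for s, stack in enumerate(stacks)
        for h, block in enumerate(stack)
    }
    if first not in position or second not in position:
        return "D"  # at least one block is not in the image
    (s1, h1), (s2, h2) = position[first], position[second]
    if s1 != s2:
        return "C"  # different stacks
    if h1 - h2 == 1:
        return "A"  # first is directly above second
    if h2 - h1 == 1:
        return "B"  # first is directly below second
    return None  # same stack but not adjacent: not covered by the options
```

For example, with `stacks = [["red", "blue", "green"], ["yellow"]]`, the pair ("blue", "red") is labeled A and the pair ("red", "yellow") is labeled C.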

Blocks World, T4 In this task, agents need to determine the consequence of a given moving plan. Specifically, some invalid moving plans contain actions that cannot be executed in the given scenario, such as trying to move a block that is covered by another block. To prevent guessing in this task, similar to the maze navigation scenario, we generate the action plans to ensure that half of the plan candidates are executable. For the plans that cannot be executed, there can be two types of errors: first, the plan may include steps that involve moving a block from or to an invalid position; second, the plan may try to move a block that does not exist. We randomly generate these errors in inputs.
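Deciding whether a plan is executable can be done by simulating it step by step. The following sketch assumes a representation that the text does not specify: stacks are lists from bottom to top, each move is a `(block, target)` pair, and `target` is either another block name or the string "table". The name `execute_plan` is hypothetical.

```python
def execute_plan(stacks, plan):
    """Simulate a plan; return (True, final_stacks) or (False, failing_step)."""
    stacks = [list(s) for s in stacks]  # work on a copy
    for step, (block, target) in enumerate(plan, start=1):
        src = next((s for s in stacks if block in s), None)
        if src is None or src[-1] != block:
            return False, step  # block is missing or covered by another block
        if target == "table":
            dst = []  # start a new stack on the table
            stacks.append(dst)
        else:
            dst = next((s for s in stacks if target in s), None)
            if dst is None or dst[-1] != target or dst is src:
                return False, step  # target is missing, covered, or the block itself
        src.pop()
        dst.append(block)
        stacks = [s for s in stacks if s]  # drop stacks that became empty
    return True, stacks
```

Both error types described above surface as a failing step: moving a covered block or targeting a covered block is an invalid position, and a block name that appears in no stack is a non-existent block.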

Statistics The VSP benchmark consists of 10 tasks in two scenarios. For each task, the problems are designed with different difficulty levels. Specifically, each difficulty level consists of 100 problems. In total, the VSP benchmark includes 4.4k questions.

A.2 Complete Prompt

In this subsection, we provide the complete prompts for each task. Generally, the prompts consist of a general task description at the beginning and a specific question at the end. The prompts interleave text and images in a pattern similar to a human-readable manual with reference figures.

Appendix B Prompt for textual input

In Section 4.4, we described the procedure of using textual representation instead of visual input. Below, we use the main task in the maze navigation scenario as an example to show the complete prompt after making the replacement.

Appendix C Prompt with in-context examples

In Section 4.4, we described the procedure of including in-context examples in the test. Below, we use the main task in the maze navigation scenario as an example to show the complete prompt after adding in-context examples.

Appendix D Training Details

In this section, we describe the training details when we fine-tune LLaVA and InternLM-XComposer for our designed tasks. We perform LoRA fine-tuning, and we stick with the default hyperparameter settings in their official repo. The detailed hyperparameter choices are shown in Table 6.

Table 6: Training details on LLaVA and InternLM-XComposer.

		Value
LLaVA	Learning rate	2e-4
	Scheduler	Cosine
	Epoch	1
	Training data	10k
	Batch size	32
	Pretrained Checkpoint	llava-v1.6-vicuna-7b
InternLM	Learning rate	5e-5
	Scheduler	Cosine
	Epoch	1
	Training data	10k
	Batch size	8
	Pretrained Checkpoint	internlm-xcomposer2-7b
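As a reading aid, the settings in Table 6 can be written out as a plain mapping, from which the number of optimizer steps follows. This is a sketch under assumptions: the key names are illustrative and do not follow either repository's configuration schema, "10k" is taken to mean exactly 10,000 examples, and the listed batch size is assumed to be the effective global batch size with a final partial batch kept.

```python
import math

# Values copied from Table 6; key names are illustrative only.
TRAINING_CONFIG = {
    "LLaVA": {
        "learning_rate": 2e-4,
        "scheduler": "cosine",
        "epochs": 1,
        "num_examples": 10_000,
        "batch_size": 32,
        "checkpoint": "llava-v1.6-vicuna-7b",
    },
    "InternLM": {
        "learning_rate": 5e-5,
        "scheduler": "cosine",
        "epochs": 1,
        "num_examples": 10_000,
        "batch_size": 8,
        "checkpoint": "internlm-xcomposer2-7b",
    },
}


def optimizer_steps(cfg):
    """Total steps, assuming batch_size is the effective global batch size."""
    return cfg["epochs"] * math.ceil(cfg["num_examples"] / cfg["batch_size"])
```

Under these assumptions, one epoch corresponds to 313 steps for LLaVA and 1,250 steps for InternLM-XComposer.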

Appendix E Complete Task Performance Results with different difficulty levels

In this section, we present the experimental results of the models across different difficulty levels. The results are shown in Table 7. As expected, model performance generally degrades as difficulty increases, with some models performing close to random guessing at higher difficulty levels (e.g., Task 1 in the Maze Navigation scenario). We also observe that GPT-4o, the most recently released model, performs best in most settings, although it still frequently makes mistakes at every difficulty level. This suggests a current bottleneck in state-of-the-art VLMs.

Table 7: Model performance on Tasks 1-4 at different difficulty levels.

Task 1	Maze Navigation						Blocks World
Difficulty	3	4	5	6	7	8	3	4	5
Gemini [7]	0.63	0.58	0.61	0.45	0.64	0.54	0.86	0.82	0.89
GPT-Vision [5]	0.60	0.56	0.56	0.47	0.62	0.54	0.77	0.70	0.72
Claude-3 [60]	0.44	0.41	0.39	0.42	0.55	0.49	0.52	0.40	0.36
GPT-4o [61]	0.72	0.65	0.57	0.44	0.56	0.53	0.98	0.94	0.92
Task 2	Maze Navigation						Blocks World
Difficulty	3	4	5	6	7	8	3	4	5
Gemini [7]	0.69	0.65	0.53	0.46	0.54	0.47	0.63	0.57	0.32
GPT-Vision [5]	0.37	0.27	0.22	0.30	0.29	0.18	0.86	0.77	0.77
Claude-3 [60]	0.65	0.65	0.70	0.65	0.67	0.70	0.59	0.57	0.43
GPT-4o [61]	0.80	0.63	0.65	0.64	0.66	0.64	0.90	0.92	0.87
Task 3	Maze Navigation						Blocks World
Difficulty	3	4	5	6	7	8	3	4	5
Gemini [7]	0.38	0.32	0.36	0.23	0.37	0.30	0.62	0.51	0.49
GPT-Vision [5]	0.79	0.41	0.43	0.37	0.33	0.40	0.68	0.76	0.67
Claude-3 [60]	0.35	0.22	0.18	0.32	0.35	0.47	0.52	0.45	0.49
GPT-4o [61]	0.89	0.72	0.49	0.40	0.39	0.59	0.85	0.95	0.90
Task 4	Maze Navigation					Blocks World
Difficulty	1	3	5	7	9	1	3	5	7
Gemini [7]	0.47	0.47	0.56	0.49	0.46	0.64	0.57	0.50	0.50
GPT-Vision [5]	0.62	0.55	0.57	0.52	0.53	0.66	0.70	0.74	0.73
Claude-3 [60]	0.57	0.60	0.60	0.59	0.67	0.72	0.65	0.60	0.68
GPT-4o [61]	0.72	0.78	0.79	0.67	0.76	0.92	0.73	0.70	0.68

Appendix F Benchmark Documentation and Intended Users

The VSP benchmark is designed to evaluate VLMs’ capabilities in visual spatial planning. Visual spatial planning refers to the ability to comprehend the spatial arrangements of objects and devise action plans to achieve desired outcomes in visual scenes. The VSP benchmark consists of two scenarios: ❶ the simulated Maze Navigation scenario, whose main task is to move a game character through a maze, and ❷ the photo-realistic Blocks World scenario, whose main task is to move blocks from a starting configuration to a goal configuration. In each scenario, in addition to the main task, VSP introduces four sub-tasks that focus on the individual capabilities needed for the main task:

• T1. Single Object Perception – Determine the characteristics of a single object;

• T2. Spatial Relation Perception – Determine the relative positions of two objects;

• T3. Environment Perception – Find textual descriptions that describe the visual environment;

• T4. Spatial Reasoning – Determine the consequence of a series of actions or moves.

The VSP benchmark consists of 10 tasks in two scenarios. For each task, the problems are designed with different difficulty levels. Specifically, each difficulty level consists of 100 problems. In total, the VSP benchmark includes 4.4k questions. To implement the two scenarios, we utilize and enhance existing resources. For the maze navigation scenario, we leverage OpenAI’s gym [57] engine to generate input images. For the blocks world scenario, we sample input images from the BIRD dataset [59]. The BIRD dataset was originally built to test an RL model’s capability in understanding visual block configurations and performing sequential actions to reach the target state. We enhance it by designing auxiliary tasks (T1-T4), corresponding textual descriptions, and the text prompts needed as input to VLMs. Additionally, we implement auto-evaluation scripts for each task, aiming to provide a convenient testbed for current VLMs. None of the images or texts in this benchmark contain personally identifiable information or offensive content.

All the content in this benchmark can be accessed, reviewed, and downloaded via https://github.com/UCSB-NLP-Chang/Visual-Spatial-Planning. As the authors of this benchmark, we assume full responsibility for any rights violations related to this benchmark. The benchmark is licensed under the MIT license. We will continuously monitor issues and pull requests to keep the benchmark maintained. We also release the test scripts needed to replicate our results.

Appendix G Limitation

It is important to note that our proposed VSP benchmark also has limitations. First, as a VLM benchmark specifically tailored for visual spatial planning capability, the VSP does not measure a VLM’s abilities in other important aspects, such as semantic understanding, factual knowledge, etc. We emphasize that VSP is not a comprehensive benchmark for VLMs, but rather a benchmark focusing on an important capability that has been mostly overlooked by existing benchmarks. Second, we also note that the appearance of objects in the image may influence models’ performance. Specifically, a model might find it easier to recognize objects against a darker background in the blocks world scenario, and vice versa. The current measure is based on a single kind of object appearance, which might favor some particular models trained on similar images. An ideal measurement would assess the average performance on images with the same content but a variety of different appearances. That said, with the detailed prompt description and sufficient information provided in the image, we believe the current version of the VSP benchmark already demonstrates the deficiencies of current state-of-the-art models in visual spatial planning. In future work, we plan to incorporate appearance/style variations in the input images for a more thorough model ability quantification.