HPE-CogVLM: New Head Pose Grounding Task Exploration on Vision Language Model

Yu Tian¹, Tianqi Shao¹, Tsukasa Demizu¹, Xuyang Wu²^∗, Hsin-Tai Wu¹^∗
¹Docomo Innovations. ²Santa Clara University.
{yu.tian, jshao, tsukasa.demizu, hwu}@docomoinnovations.com
xwu5@scu.com

Abstract

The head pose estimation (HPE) task requires a sophisticated understanding of 3D spatial relationships and precise numerical output of yaw, pitch, and roll Euler angles. Previous HPE studies are mainly based on non-large language models (Non-LLMs), which rely on close-up human heads cropped from the full image as inputs and lack robustness in real-world scenarios. In this paper, we present a novel framework to enhance the HPE prediction task by leveraging the visual grounding capability of CogVLM. CogVLM is a vision language model (VLM) with the grounding capability of predicting object bounding boxes (BBoxes), which enables HPE training and prediction using full-image input. To integrate the HPE task into the VLM, we first cope with the catastrophic forgetting problem in large language models (LLMs) by investigating the rehearsal ratio in the data rehearsal method. Then, we propose and validate a LoRA layer-based model merging method, which preserves the integrity of parameters, to enhance HPE performance in the framework. The results show that our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error for HPE prediction over the current Non-LLM-based state-of-the-art in cross-dataset evaluation. Furthermore, we compare our LoRA layer-based model merging method with LoRA fine-tuning only and other merging methods in CogVLM. The results demonstrate that our framework outperforms them in all HPE metrics.

∗ Corresponding authors.

1 Introduction

Nowadays, the head pose estimation (HPE) technique is applicable in various fields such as attention estimation fischer2018rt ; kellnhofer2019gaze360 ; cheng2018appearance , face recognition 7780892 ; 8255649 ; 5959981 ; 7299081 , customer behavior analysis 6655785 ; wu2016head , driver assistance systems 5443483 ; vora2018driver ; hu2021temporal ; hu2020robust and human-robot interaction app11125366 . This task involves predicting the Euler angles (yaw, pitch, and roll) of human heads from images or videos. Recent research efforts on some Non-Large Language Models (Non-LLMs) like 6DRepNet Hempel_2022 , HopeNet DBLP:journals/corr/abs-1710-00925 and WHENet zhou2020whenet have made significant advancements in HPE.

Despite the recent surge of interest in HPE, the application of this technique still faces several challenges in real-world scenarios. Non-LLMs typically rely on narrowly focused datasets such as 300W-LP Zhu_2019 for training, and are validated on similarly constrained datasets such as AFLW2000 Zhu_2019 and BIWI 6553713 . These datasets mainly feature close-up images of heads, mostly displaying frontal faces with yaw angles from $-99^{\circ}$ to $99^{\circ}$ , instead of covering the entire range of head poses from $-180^{\circ}$ to $180^{\circ}$ . Additionally, the frequent use of close-up images in these datasets not only leads to uniform backgrounds, but also reduces the variability in the input data. This uniformity results in a lack of robustness in diverse real-world environments. The DirectMHP zhou2023directmhp model performs HPE prediction in a one-shot manner and is trained on full-range HPE datasets such as Agora patel2021agora and CMU joo2016panoptic , but it struggles to balance head bounding box (BBox) detection and HPE task performance. Consequently, the model’s effectiveness remains uncertain in real-world settings.

Large Language Models (LLMs) significantly improve our lives by offering sophisticated assistance in various tasks. Recently, Vision Language Models (VLMs) have gained prominence due to their proficiency in interpreting and processing information from images and videos openai2024gpt4 ; geminiteam2024gemini ; liu2023visual ; wang2024cogvlm . By integrating visual capabilities into LLMs, VLMs can accomplish more complex tasks than traditional LLMs, e.g., visual question answering openai2024gpt4 ; agrawal2016vqa ; liu2023visual and visual grounding wang2024cogvlm ; yu2016modeling . The visual grounding ability of CogVLM wang2024cogvlm exhibits strong adaptability to diverse environments, providing an opportunity to enhance the robustness of tasks that traditional CNN-based methods struggle to solve. In this paper, we aim to improve HPE task functionality with the grounding CogVLM. The grounding CogVLM’s capabilities include caption grounding, referring expression generation, referring expression comprehension, and grounded visual question answering wang2024cogvlm . All of these functionalities involve describing object locations in the BBox format of [[ $x_{0}$ , $y_{0}$ , $x_{1}$ , $y_{1}$ ]] as shown in Figure 1 (a). This BBox prediction capability provides a foundational skill for learning the new HPE task in this paper. By leveraging this capability in our designed prompts, we enable CogVLM to learn HPE from entire images instead of the cropped head images used in Non-LLMs, which significantly helps the model avoid over-fitting to a limited background.

Integrating the HPE task into the grounding CogVLM not only opens up new opportunities for exploration, but also presents several challenges. First, VLM tasks such as image description, visual reasoning, and visual perception usually involve answering questions with natural language responses. In contrast, our HPE task requires the VLM to produce precise numerical Euler angles. Although the grounding CogVLM can predict BBoxes, indicating its ability to produce numerical responses, the HPE task is significantly more complicated. HPE requires predicting the human head’s orientation in terms of yaw, pitch, and roll angles, which involves interpreting 3D orientation from 2D images and introduces additional dimensions of depth and angular perspective not required in the basic BBox detection task. This raises the question of whether the grounding model can provide HPE answers with the much higher accuracy that the task demands. Second, catastrophic forgetting scialom2022finetuned ; huang2024mitigating ; luo2024empirical poses a significant challenge in fine-tuning LLMs. Catastrophic forgetting is the phenomenon in which LLMs tend to forget previously learned information when acquiring new data. Currently, there is a lack of research addressing the catastrophic forgetting problem in complex grounding tasks. Lastly, the original grounding CogVLM only outputs responses that blend natural language and BBoxes in the [[ $x_{0}$ , $y_{0}$ , $x_{1}$ , $y_{1}$ ]] format. In this paper, we introduce a new format { $yaw\_angle$ , $pitch\_angle$ , $roll\_angle$ } for answering HPE prompts as shown in Figure 1 (b). This enriches the knowledge of the original grounding CogVLM, while also increasing the complexity of output formats. Empirically, we observe that LoRA hu2021lora fine-tuning and model merging methods often generate invalid blended outputs like [[ $x_{0}$ , $y_{0}$ , $yaw\_angle$ }, which we refer to as invalid answers in this paper. More invalid answers are detailed in Appendix A.1 Table 5.

Figure 1: Figure (a) shows an example of CogVLM Grounding Capability, which demonstrates the original grounding CogVLM’s ability to identify objects based on prompts, a foundational skill useful for HPE task. Figure (b) displays a visualization of head orientation predicted by our HPE-CogVLM from the CMU Panoptic dataset, using Euler angles. The head pose labels are depicted with pitch (red axis), roll (green axis), and yaw (blue axis) angles, each indicated in their respective directions.

In this paper, to address the catastrophic forgetting problem in grounding tasks, we evaluate and improve the data rehearsal methods scialom2022finetuned ; huang2024mitigating previously used in non-grounding VLMs. Here, the rehearsal ratio represents the percentage of images randomly selected from earlier training phases that are reintegrated during the training of new tasks, relative to the total number of earlier training images scialom2022finetuned ; huang2024mitigating . The results show that the visual grounding task, which demands multiple accurate numerical outputs, requires a significantly larger rehearsal ratio than non-grounding VLMs. We also propose and validate a layer-based model merging method to enhance the HPE task’s performance. Utilizing this merging approach, our model shows remarkable robustness, achieving a 31.5% reduction in Mean Absolute Error (MAE) of Euler angles compared to the Non-LLM state-of-the-art (SOTA) in cross-dataset evaluations. Furthermore, we compare our layer-based model merging method with LoRA fine-tuned and merged models in CogVLM. Our approach consistently shows superior performance in both MAE and invalid answer ratio reduction. Our contributions can be summarized as follows:

• Our work pioneers the improvement of HPE tasks by leveraging the visual grounding capability of CogVLM, showing the VLM’s ability to manage complex 3D spatial perception while keeping its existing object localization knowledge.

• We are the first to explore the catastrophic forgetting problem and the invalid answer problem in complex VLM grounding tasks.

• We propose a novel layer-based model merging method that adopts a “winner takes all” strategy, which significantly outperforms the Non-LLM SOTA and VLM-based models in MAE and invalid answer ratio reduction. This demonstrates our method’s outstanding robustness and effectiveness in the HPE task, and it holds potential for broader application in various grounding tasks.

The paper is under review and the code will be released.

2 Related Work

Head Pose Estimation. Traditional approaches for HPE include landmark-based and landmark-free methods. Since full-range HPE often involves head orientations where facial features are not visible, we focus solely on landmark-free approaches. Among these, several models divide continuous rotation variables into discrete bins for classification DBLP:journals/corr/abs-1710-00925 ; zhou2020whenet ; 8444061 ; huang2020improving ; zhang2020fdn . In addition, FSA-Net yang2019fsa employs a stage-wise regression and feature aggregation scheme to predict Euler angles. 6DRepNet Hempel_2022 and TriNet cao2020vectorbased estimate the rotation matrix rather than Euler angles. These Non-LLM methods suffer from significant robustness issues in real-life scenarios.

Grounding Vision Language Models. Some VLMs with grounding capabilities can provide accurate BBox information in the format of [[ $x_{0}$ , $y_{0}$ , $x_{1}$ , $y_{1}$ ]] wang2024cogvlm ; bai2023qwenvl ; chen2023shikra ; you2023ferret . In this paper, we focus on a unique grounding task: HPE prediction, which requires an understanding of 3D spatial relationships to produce accurate Euler angles in the format of { $yaw\_angle$ , $pitch\_angle$ , $roll\_angle$ }. Regular VLMs are not typically designed to handle such queries, and currently there is a lack of research investigating the effectiveness of VLMs in the HPE task. CLIP-Gaze yin2024clip learns the gaze estimation task by using the CLIP radford2021learning model. However, CLIP-Gaze limits its application solely to gaze estimation and does not address catastrophic forgetting, resulting in a loss of CLIP’s multi-task capabilities.

Catastrophic Forgetting Problem. Catastrophic forgetting has been a significant issue that limits the effectiveness of LLMs, as they tend to forget previously learned knowledge when learning new knowledge. Kirkpatrick et al. Kirkpatrick_2017 and Li et al. li-etal-2022-overcoming control the extent of parameter updates to prevent the forgetting of previously learned tasks, which requires careful tuning for optimal performance. Xu et al. xu2018reinforced and Huang et al. huang2021continual separate parameters dedicated to individual tasks; however, this approach introduces extra parameters. The rehearsal method scialom2022finetuned ; huang2024mitigating ; luo2024empirical ; zhang-etal-2023-citb ; mok-etal-2023-large is the most widely used method to mitigate catastrophic forgetting; it reuses a small portion of the old task datasets in the new task fine-tuning process. The catastrophic forgetting problem has not been well investigated for complex grounding tasks in previous literature. In this paper, we evaluate and improve the rehearsal method for our grounding tasks.

Model Merging in LLMs. There has been extensive exploration of methods based on model merging to enhance the performance of LLMs. This approach merges multiple LLMs with specialized capabilities into a single LLM that can address tasks across various domains. The typical merging methods ilharco2023editing ; wortsman2022model ; jang2024model ; davari2023model ; yadav2024ties ; yu2023language ; akiba2024evolutionary usually apply rules or algorithms to trim or merge the parameters of LLMs. For example, task arithmetic ilharco2023editing defines arithmetic rules for incorporating new capabilities or deleting undesired ones. The evolutionary model merging method akiba2024evolutionary has demonstrated the effectiveness of evolutionary algorithms in LLMs. However, traditional merging methods usually produce blended invalid answers, as our task requires a higher complexity of output formats.

3 HPE-CogVLM Framework

The proposed HPE-CogVLM framework, shown in Figure 2, is structured as a multi-stage process. It is specifically designed to enhance the model’s capability in understanding and processing complex tasks associated with HPE, while retaining its original BBox prediction capabilities. The fine-tuning process at each stage follows CogVLM’s fine-tuning scripts (https://github.com/THUDM/CogVLM), which implement LoRA hu2021lora across transformer blocks, including the query, key, and value of the attention layers and the dense layers. Subsequently, the LoRA matrices of each layer are accumulated into the corresponding layer of the original model. Below is a detailed description of each stage in the framework:

Stage 1: Pre-training of the Original Grounding CogVLM on Weak Label Data

At this initial stage, the original grounding CogVLM undergoes pre-training on the CrowdHuman dataset. The function of this stage is to enhance the model’s understanding of the HPE task, as the CrowdHuman dataset offers a rich collection of human head images. Since the original CrowdHuman dataset does not provide ground truth (GT) annotations for HPE, we infer weak HPE annotations with 6DRepNet. The output model from this stage is termed the weak label CogVLM, as shown in Figure 2. For this stage, our primary goal is to warm up the model using weak label data, giving it a comprehensive understanding of various human head orientations.

Stage 2: Supervised Fine-tuning of the Weak Label CogVLM on Task-specific (HPE) Data

Following the pre-training, the model progresses to supervised fine-tuning using only the task-specific HPE dataset. The task-specific HPE dataset contains fewer images and more accurate annotations than the weak label images. This stage concentrates on refining the weak label model’s HPE ability, aiming to maximize the precision of the HPE task. The output model is referred to as the HPE-oriented CogVLM, as shown in Figure 2.

Stage 3: Layer-based Merging between Original Grounding CogVLM and HPE-oriented CogVLM

During this key stage, the original grounding CogVLM is merged with the HPE-oriented CogVLM based on cosine similarity criteria. In our framework, the cosine similarity is computed between the parameter tensors of corresponding layers along the last dimension and then averaged. Cosine similarity is used to gauge the amount of information shared between layers. A high threshold of cosine similarity is set to ensure significant overlap in content. If the similarity falls below this threshold, we opt to completely retain the original knowledge. Otherwise, if the similarity exceeds the threshold, which indicates a substantial overlap in information due to the stringent criteria, we select the entire layer from the HPE-oriented CogVLM, which minimizes the risk of losing important existing knowledge.
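The similarity measure described above can be illustrated with a minimal sketch. It computes the cosine similarity between matching rows of two 2-D weight tensors (that is, along the last dimension) and averages the result. The function name `mean_cosine_similarity`, the nested-list representation, and the `eps` guard for zero-norm rows are assumptions made for this example and are not taken from the paper’s code.

```python
import math
from typing import Sequence


def mean_cosine_similarity(a: Sequence[Sequence[float]],
                           b: Sequence[Sequence[float]],
                           eps: float = 1e-12) -> float:
    """Average cosine similarity between matching rows of two 2-D tensors.

    Each row is treated as a vector along the last dimension.
    `eps` is an assumed guard against division by zero for all-zero rows.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("tensors must be non-empty and have the same number of rows")
    sims = []
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        dot = sum(x * y for x, y in zip(row_a, row_b))
        norm_a = math.sqrt(sum(x * x for x in row_a))
        norm_b = math.sqrt(sum(y * y for y in row_b))
        sims.append(dot / max(norm_a * norm_b, eps))
    return sum(sims) / len(sims)
```

In practice the same quantity would likely be computed with a tensor library on the actual layer weights; the pure-Python form here is only meant to make the definition explicit.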

Previous methods merge models at the parameter level by setting hyperparameters or developing algorithms to discard and merge specific parameters ilharco2023editing ; yadav2024ties ; yu2023language ; akiba2024evolutionary . However, parameter-based merging often blends output structures in our task, resulting in invalid answers. For example, when we query with HPE prompts, a parameter-based merged model may return an NLP response like “a person of head” or provide nonsensical responses like “[[999,231,123,389}”. More examples are detailed in Appendix A.1 Table 5. To overcome this issue, we adopt the “winner takes all” method to choose each layer from either the original grounding CogVLM or the HPE-oriented CogVLM. With this approach, the layer-based merging CogVLM is able to preserve the integrity of the parameters specific to the expertise of each model. The merging criteria are as follows:

• We calculate and rank the cosine similarities across all layers from both models, and always select the layer from the original grounding CogVLM for layers within the smallest 1% of cosine similarities.

• When the cosine similarity between two corresponding layers is less than the threshold (set at 0.95 in our experiments), we also select the layer from the original grounding CogVLM.

• Otherwise, we choose the layer from the HPE-oriented CogVLM.
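The three criteria above can be sketched as a small selection routine. This is an illustrative sketch under stated assumptions rather than the released implementation: the function name `select_layer_sources`, the dictionary of per-layer similarities, the labels "original" and "hpe", and the use of `math.ceil` to size the bottom 1% are all hypothetical choices made for the example.

```python
import math
from typing import Dict


def select_layer_sources(similarities: Dict[str, float],
                         threshold: float = 0.95,
                         bottom_fraction: float = 0.01) -> Dict[str, str]:
    """Pick, for every layer, the model that the whole layer is copied from.

    similarities maps a layer name to the averaged cosine similarity between
    the original grounding model and the HPE-oriented model for that layer.
    """
    # Rank layers by similarity (ascending) and mark the lowest fraction.
    # Rounding up with ceil is an assumption; the paper does not specify it.
    ranked = sorted(similarities, key=similarities.get)
    n_bottom = math.ceil(len(ranked) * bottom_fraction)
    always_original = set(ranked[:n_bottom])

    sources = {}
    for name, sim in similarities.items():
        if name in always_original or sim < threshold:
            sources[name] = "original"  # keep the original grounding layer
        else:
            sources[name] = "hpe"       # take the HPE-oriented layer whole
    return sources
```

Because every layer is copied whole from exactly one model, no individual weight is ever interpolated, which is the property the “winner takes all” strategy relies on.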

Stage 4: Continual Fine-tuning of Layer-based Merging CogVLM on Mixture Data

After merging, the layer-based merging CogVLM undergoes an additional round of fine-tuning with both the task-specific HPE dataset and the rehearsal images. We determine the optimal rehearsal ratio in advance, in stage 1, by tuning the original grounding CogVLM with weak label images combined with different proportions of rehearsal images. The optimal rehearsal ratio is then used in fine-tuning the merged model. Unlike the fine-tuning in stage 2, this phase involves only a brief period of fine-tuning, less than one epoch. The rationale for this additional brief fine-tuning is that, while layer merging maintains parameter integrity, the merged model lacks the fine-tuned parameters necessary to enhance prediction accuracy. We regard continual fine-tuning as a more effective way to update the weights than other merging algorithms. In our approach, the merged model can be quickly fine-tuned to deliver accurate numerical predictions. The final output model of this stage is HPE-CogVLM, as shown in Figure 2.
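To make the role of the rehearsal ratio concrete, the following sketch builds a stage 4 training mixture by sampling a fraction of the earlier-task images and combining them with the task-specific samples. The function name `build_mixture`, the fixed seed, and the rounding of the sample count are assumptions made for illustration; the ratio is taken relative to the size of the earlier-task pool, following the definition used in this paper.

```python
import random
from typing import List, Sequence


def build_mixture(task_samples: Sequence,
                  rehearsal_pool: Sequence,
                  rehearsal_ratio: float,
                  seed: int = 0) -> List:
    """Combine new-task samples with a random subset of earlier-task samples.

    rehearsal_ratio is the fraction of the earlier-task pool that is replayed,
    e.g. 0.10 for the 10% setting.
    """
    if not 0.0 <= rehearsal_ratio <= 1.0:
        raise ValueError("rehearsal_ratio must lie in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducibility (an assumption)
    n_rehearsal = round(len(rehearsal_pool) * rehearsal_ratio)
    rehearsal = rng.sample(list(rehearsal_pool), n_rehearsal)
    mixture = list(task_samples) + rehearsal
    rng.shuffle(mixture)
    return mixture
```

A ratio of 0 reduces this to plain fine-tuning on the new task, which corresponds to the setting in which forgetting is most severe.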

Stage 5: Evaluation of HPE-CogVLM on Test Data

To demonstrate the robustness of our model, we utilize real-world CMU Panoptic images to evaluate the model’s performance on the HPE task. Meanwhile, we use rehearsal test datasets to assess the model’s performance on the BBox prediction task.

4 Experiments Setup

4.1 HPE Task Prompt Design

In some Non-LLMs, such as 6DRepNet and HopeNet, cropping the human head region is required as the initial step. In this paper, we propose a new prompt method that allows us to train and predict HPE using the information in full images. In our prompts, BBox coordinates are used to specify the human head of interest when multiple people are present. The system can therefore focus on specific heads, which reduces the need for labor-intensive manual annotations and helps automate the inference process.

Meanwhile, the global features from self-attention and the head-of-interest features from cross-attention are both learned, which improves the robustness of the HPE task. Figure 1 (b) shows an example of our designed prompts and responses for the HPE task. More prompt details are in Appendix A.1 Table 5.

4.2 Datasets

Table 1: A detailed overview of various datasets used in our framework.

Task	Dataset	Images (Train)	Images (Test)	Heads (Train)	Heads (Test)	Usage
Weak Label Images	CrowdHuman shao2018crowdhuman	11,731	-	94,795	-	Stage 1
Task-specific Images	Agora patel2021agora	9,654	-	64,187	-	Stage 2, 4
Rehearsal Images	Refcoco yu2016modeling	42,404	3,785	-	-	Stage 4, 5
Rehearsal Images	Refcoco+ yu2016modeling	42,278	3,773	-	-	Stage 4, 5
Rehearsal Images	Refcocog mao2016generation	42,226	5,023	-	-	Stage 4, 5
Evaluation Images	CMU Panoptic joo2016panoptic	-	16,216	-	32,738	Stage 5

Table 1 outlines the datasets used in the various stages of our framework. The CrowdHuman dataset shao2018crowdhuman serves as the pre-training dataset due to its extensive collection of human images. Its head pose annotations are pseudo-labels inferred by the pre-trained 6DRepNet Hempel_2022 model, and the resulting data are referred to as the weak label images. The synthetic Agora dataset patel2021agora serves as the fine-tuning HPE dataset; it covers the full range of human head yaw angles and provides the GT of SMPL-X parameters pavlakos2019expressive . Its head pose annotations are generated using the method of DirectMHP zhou2023directmhp . The Refcoco yu2016modeling , Refcoco+ yu2016modeling , and Refcocog mao2016generation training datasets, which were originally used by CogVLM to learn BBox prediction, are chosen as rehearsal images to help mitigate the catastrophic forgetting of existing knowledge. In our experiments, various portions of the rehearsal images are applied to determine the optimal rehearsal ratio scialom2022finetuned ; huang2024mitigating for addressing the catastrophic forgetting problem. A subset of the CMU Panoptic dataset (CMU dataset) is selected as the test dataset for evaluating the HPE task, as its panoptic images of real people closely mirror real-life scenarios. The selection of images and annotations follows DirectMHP (https://github.com/hnuzhy/DirectMHP). To evaluate object BBox localization, the testA and testB splits of Refcoco and Refcoco+, as well as the test split of Refcocog, are selected as the BBox evaluation datasets.

4.3 Implementation Details

The original grounding CogVLM is the foundational model for all experiments due to its strong BBox prediction capability. It also serves as the baseline for BBox evaluation and provides the preliminary capabilities necessary for learning the HPE task. In the fine-tuning process, we choose 10 as our LoRA rank. In the pre-training process, $1\times 10^{-4}$ is selected as the learning rate. All other training parameters follow the default settings of CogVLM. The experiments are performed on two NVIDIA A100 80GB GPUs with a training batch size of 8. The training processes in stages 1, 2, and 4 of our framework take 20, 50, and 10 hours, respectively.

4.4 Evaluation Metrics

We define four evaluation metrics for assessing HPE and BBox prediction tasks as follows:

Angle Error Ratio ( $E_{angle}$ ): $E_{angle}$ = $\frac{e_{angle}}{t_{angle}}$ , where $e_{angle}$ denotes the number of invalid HPE answers and $t_{angle}$ denotes the total number of HPE answers. This new metric is defined to assess the capability of models to provide relevant numerical outputs for the HPE task. When prompted with an HPE query, CogVLM may produce irrelevant responses such as a natural language processing (NLP) task response like “a person head”, a bounding box task response like “[[111,222,333,444]]”, or even a completely nonsensical response like “111,999,999,999…”.

BBox Error Ratio ( $E_{bbox}$ ): $E_{bbox}$ = $\frac{e_{bbox}}{t_{bbox}}$ , where $e_{bbox}$ denotes the number of invalid BBox answers and $t_{bbox}$ denotes the number of total BBox answers. This new metric is defined to assess the capability of models to provide relevant numerical outputs for BBox prediction task.
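As a rough illustration of how the two error ratios could be computed, the sketch below checks each response against a regular expression for the expected answer format and reports the fraction of responses that do not match. The exact textual format of valid answers is an assumption here (three numbers in braces for HPE, four integers in double brackets for BBoxes), as are the names `HPE_PATTERN`, `BBOX_PATTERN`, and `error_ratio`.

```python
import re
from typing import Iterable, Pattern

# Assumed answer formats: {yaw, pitch, roll} and [[x0, y0, x1, y1]].
_SEP = r"\s*,\s*"
_NUM = r"-?\d+(?:\.\d+)?"
HPE_PATTERN = re.compile(r"\{\s*" + _SEP.join([_NUM] * 3) + r"\s*\}")
BBOX_PATTERN = re.compile(r"\[\[\s*" + _SEP.join([r"\d+"] * 4) + r"\s*\]\]")


def error_ratio(responses: Iterable[str], pattern: Pattern[str]) -> float:
    """Fraction of responses that contain no well-formed answer."""
    responses = list(responses)
    if not responses:
        raise ValueError("at least one response is required")
    invalid = sum(1 for r in responses if pattern.search(r) is None)
    return invalid / len(responses)
```

For instance, applying `error_ratio` with `HPE_PATTERN` to a list of responses to HPE prompts gives an estimate of $E_{angle}$, and applying it with `BBOX_PATTERN` to responses to BBox prompts gives an estimate of $E_{bbox}$.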

BBox Accuracy (Acc.): Acc. = $\frac{m}{\hat{m}}$ , where $m$ denotes the number of valid BBox answers with IoU > 0.5 and $\hat{m}$ denotes the total number of valid BBox answers. A BBox prediction is considered accurate if the intersection over union (IoU) between the GT and the prediction exceeds 0.5 yu2016modeling . Invalid answers are excluded from both the accuracy and the MAE calculations.
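The accuracy definition above can be sketched as follows. Boxes are assumed to be `(x0, y0, x1, y1)` tuples with `x0 <= x1` and `y0 <= y1`, and an invalid answer is represented by `None` so that it is excluded from the denominator; the function names `iou` and `bbox_accuracy` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bbox_accuracy(predictions: Sequence[Optional[Box]],
                  ground_truths: Sequence[Box],
                  threshold: float = 0.5) -> float:
    """Share of valid predictions whose IoU with the GT exceeds the threshold.

    A prediction of None marks an invalid answer and is skipped.
    """
    valid = [(p, g) for p, g in zip(predictions, ground_truths) if p is not None]
    if not valid:
        return 0.0
    hits = sum(1 for p, g in valid if iou(p, g) > threshold)
    return hits / len(valid)
```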

MAE of Euler angles (MAE): For HPE task, the MAE of GT Euler angles and predicted Euler angles is defined as follows:

$$\text{MAE}=\frac{1}{n}\sum_{i=1}^{n}\min\left(360^{\circ}-|\hat{A}_{i}-A_{i}|,\;|\hat{A}_{i}-A_{i}|\right) \tag{1}$$

where $\hat{A}_{i}$ represents the GT Euler angles, $A_{i}$ represents the predicted Euler angles, and $n$ denotes the number of valid HPE answers. The MAE in the HPE task differs from its conventional definition in that the error is measured in a circular manner rather than linearly: the $\min$ term selects the smaller of the direct angular difference and its complement with respect to a full 360-degree rotation Hempel_2022 ; zhou2023directmhp . In this paper, the reported MAE value is the average of the MAEs for the yaw, pitch, and roll Euler angles.
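A short sketch may help make the circular error concrete. It implements Equation (1) for a single angle type and then averages over yaw, pitch, and roll. The function names are hypothetical, and angles are assumed to be given in degrees with absolute differences no larger than 360.

```python
from typing import Sequence, Tuple

Euler = Tuple[float, float, float]  # (yaw, pitch, roll) in degrees


def circular_abs_error(gt: float, pred: float) -> float:
    """Absolute angular error that wraps around a full 360-degree rotation."""
    diff = abs(gt - pred)
    return min(360.0 - diff, diff)


def circular_mae(gt: Sequence[float], pred: Sequence[float]) -> float:
    """Equation (1) for one angle type over n valid answers."""
    if len(gt) != len(pred) or len(gt) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(circular_abs_error(g, p) for g, p in zip(gt, pred)) / len(gt)


def euler_mae(gt: Sequence[Euler], pred: Sequence[Euler]) -> float:
    """Average of the yaw, pitch, and roll MAEs."""
    per_axis = [
        circular_mae([g[k] for g in gt], [p[k] for p in pred])
        for k in range(3)
    ]
    return sum(per_axis) / 3.0
```

For example, a GT yaw of 179 degrees and a predicted yaw of -179 degrees differ by 358 degrees linearly, but `circular_abs_error(179, -179)` evaluates to 2 degrees, which reflects the true angular distance between the two orientations.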

4.5 Baseline Methods

In this paper, we compare our HPE-CogVLM with three types of baseline methods.

Non-LLMs, including 6DRepNet, HopeNet, and WHENet, are selected as HPE Non-LLM baselines. The 6DRepNet model, recognized as the current SOTA, is retrained and tested on the same Agora and CMU datasets used in the LLM experiments to ensure a fair comparison. This model is trained for 100 epochs, and the best MAE is selected for baseline comparison with our HPE-CogVLM. The pre-trained models of HopeNet and WHENet are used because the HopeNet scripts are hard-coded and the WHENet training scripts are not publicly available.

Non-merging CogVLM, a directly fine-tuned model without any model merging technique, is selected to compare the effectiveness of our merging approach with the fine-tuning-only method scialom2022finetuned ; huang2024mitigating ; luo2024empirical . The difference between the Non-merging CogVLM and our HPE-CogVLM is that the Non-merging CogVLM bypasses stages 2 and 3; instead, it undergoes significantly more training iterations in stage 4, equal to the total iterations of stages 2 and 4 in the HPE-CogVLM framework. For example, our HPE-CogVLM is fine-tuned for 25k and 5k iterations in stages 2 and 4, respectively, while the Non-merging CogVLM is fine-tuned solely in stage 4 for 30k iterations. This ensures a fair comparison with respect to HPE task training iterations.

Task Arithmetic (TA) merging CogVLM, which adheres to our framework but replaces the layer-based merging with TA-based merging, provides a baseline for comparing our merging approach with another merging method. The TA merging process is chosen because it forms the foundation for many other merging algorithms yadav2024ties ; yu2023language . In this process, we set the lambda parameter of task arithmetic to 0.5 ilharco2023editing , assigning equal importance to both the BBox prediction task and the HPE task.
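For readers unfamiliar with task arithmetic, the sketch below shows one plausible reading of this baseline with a single task vector: the difference between the fine-tuned and base weights is scaled by lambda and added back to the base. Treating the original grounding CogVLM as the base and the HPE-oriented CogVLM as the fine-tuned model is an assumption made for illustration, as are the function name `task_arithmetic_merge` and the flat-list weight representation. Under this reading, lambda = 0.5 yields the midpoint of the two models for every parameter.

```python
from typing import Dict, List


def task_arithmetic_merge(base: Dict[str, List[float]],
                          finetuned: Dict[str, List[float]],
                          lam: float = 0.5) -> Dict[str, List[float]]:
    """Merge by adding a scaled task vector to the base weights.

    task_vector = finetuned - base; merged = base + lam * task_vector.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("both models must expose the same parameter names")
    merged = {}
    for name, base_w in base.items():
        ft_w = finetuned[name]
        if len(base_w) != len(ft_w):
            raise ValueError("shape mismatch for parameter " + name)
        merged[name] = [b + lam * (f - b) for b, f in zip(base_w, ft_w)]
    return merged
```

Unlike the layer-based rule, every individual weight here becomes a blend of the two models, which is consistent with the blended output formats reported for parameter-level merging.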

5 Experimental Results

5.1 Baseline Comparison

Table 2: Comparison of HPE-CogVLM performance with various baselines. The best results for each model are highlighted in bold.

Model	Refcoco $Acc_{test}$	Refcoco $E_{bbox}$	Refcoco+ $Acc_{test}$	Refcoco+ $E_{bbox}$	Refcocog $Acc_{test}$	Refcocog $E_{bbox}$	CMU Panoptic $MAE_{test}$	CMU Panoptic $E_{angle}$
WHENet	-	-	-	-	-	-	29.55	-
HopeNet	-	-	-	-	-	-	22.16	-
6DRepNet	-	-	-	-	-	-	10.74	-
Original Grounding CogVLM	91.4%	0%	86.7%	0%	90.2%	0%	-	-
Non-merging CogVLM	91.1%	0%	85.2%	0%	88.9%	0%	8.18	0.13%
TA merging CogVLM	89.5%	0%	82.3%	0%	86.1%	0%	7.72	68.9%
HPE-CogVLM	90.5%	0%	84.7%	0%	87.8%	0%	7.36	0.052%

The results in Table 2 show the performance comparison between our HPE-CogVLM model and the baselines described in Section 4.5. Compared with Non-LLMs, our HPE-CogVLM achieves a significantly lower MAE: it is 75.1%, 66.8%, and 31.5% lower than that of WHENet, HopeNet, and 6DRepNet, respectively. The Non-LLMs also perform worse than the other CogVLM-based models, which highlights the superior robustness of VLM-based models over Non-LLMs.
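The relative reductions quoted above follow directly from the MAE values in Table 2. The short check below recomputes them; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


HPE_COGVLM_MAE = 7.36  # from Table 2
baselines = {"WHENet": 29.55, "HopeNet": 22.16, "6DRepNet": 10.74}

for name, mae in baselines.items():
    print(name, round(relative_reduction(mae, HPE_COGVLM_MAE), 1))
# WHENet 75.1
# HopeNet 66.8
# 6DRepNet 31.5
```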

Compared with the Non-merging CogVLM, the HPE-CogVLM MAE is 10% lower. Meanwhile, the $E_{angle}$ of our model is 2.5 times smaller than that of the Non-merging CogVLM. This indicates that our layer-based merging method is more proficient in HPE than the method that does not utilize any model merging technique. Regarding the BBox results, the HPE-CogVLM’s BBox prediction accuracy on the three test datasets is 0.6, 0.5, and 1.1 percentage points lower than that of the Non-merging CogVLM; however, the Non-merging CogVLM requires five times more training iterations on the rehearsal dataset (30k versus 5k), as discussed in Section 4.5.

Compared with the TA merging CogVLM, our HPE-CogVLM wins in all metrics. For instance, on the test datasets, the BBox prediction accuracy of the HPE-CogVLM exceeds that of the TA merging CogVLM by 1.0, 2.4, and 1.7 percentage points, respectively. The $E_{angle}$ of the TA merging CogVLM is 68.9%, which is 1325 times greater than that of HPE-CogVLM and means that only about 31% of its responses for the HPE task are valid. Due to the high number of invalid HPE responses, the MAE metric becomes ineffective for assessing its performance. This highlights that, even with an additional round of fine-tuning, task arithmetic merging fails to produce relevant numerical responses within our research domain, ultimately proving ineffective for the HPE task.

5.2 Performance of HPE-oriented CogVLM on HPE Task Only

Table 3: The HPE-oriented CogVLM model exhibits the highest HPE performance within our framework. The best results for each model are highlighted in bold.

Model	Epochs	$\boldsymbol{Acc_{test}}$	$\boldsymbol{MAE_{test}}$	$\boldsymbol{MAE_{training}}$	$\boldsymbol{E_{angle}}$
6DRepNet	3	-	12.70	9.40	-
6DRepNet	6	-	12.76	8.80	-
6DRepNet	9	-	11.44	7.90	-
6DRepNet	50	-	11.37	2.91	-
6DRepNet	100	-	11.4	2.23	-
HPE-oriented CogVLM	3	8.8%	6.40	-	0.0092%
HPE-oriented CogVLM	6	12.6%	6.31	-	0%
HPE-oriented CogVLM	9	11.0%	6.24	-	0%

In our framework, the HPE-oriented CogVLM is our most effective model when only the HPE task is considered. Table 3 presents comparative performance results of 6DRepNet and the HPE-oriented CogVLM, neither of which accommodates BBox prediction capabilities, over similar training epochs. The low Refcoco test accuracy of the HPE-oriented CogVLM is expected, given that no data rehearsal is implemented in this stage. In terms of the MAE metric, the HPE-oriented CogVLM displays a gradual decrease in MAE from 6.40 at 3 epochs to 6.24 at 9 epochs. At the same epoch, our model shows a much lower MAE than 6DRepNet. For example, at epoch 9, the MAE of the HPE-oriented CogVLM is 6.24, which is 45.5% lower than that of 6DRepNet. After extending the 6DRepNet training to 100 epochs, its training MAE decreases from 9.40 to 2.23, but the MAE on the CMU dataset does not improve, remaining stable around 11.4. This indicates that the model is over-fitting to the Agora dataset, with no enhancement in cross-dataset inference performance. This stark difference emphasizes the superior performance of the VLM over Non-LLMs.

5.3 Selecting Optimal Rehearsal Ratios for Mitigating the Catastrophic Forgetting Problem

Table 4: Performance of weak label CogVLM under various rehearsal ratios.

Iterations	Rehearsal Ratio	$\boldsymbol{Acc_{test}}$	$\boldsymbol{E_{bbox}}$	$\boldsymbol{MAE_{test}}$	$\boldsymbol{E_{angle}}$
0k	0%	91.4%	0%	-	-
10k	0%	21.8%	0.026%	17.20	0.48%
10k	1%	77.5%	0.19%	21.51	0.85%
10k	10%	91.0%	0%	19.32	0.32%
10k	25%	91.5%	0%	19.92	0.23%

Table 4 presents the performance of the weak label CogVLM across different proportions (0%, 1%, 10%, 25%) scialom2022finetuned ; huang2024mitigating ; luo2024empirical of the rehearsal dataset. The primary aim is to determine the appropriate data rehearsal ratio for retaining old knowledge during the fine-tuning in stage 4. The Refcoco test accuracy at iteration 0 is 91.4%, indicating proficiency with the BBox prediction task. After training, the results demonstrate a clear trend: as the rehearsal ratio increases, the Refcoco test accuracy improves substantially. Starting at a low of 21.8% when no Refcoco data is used, the accuracy jumps to 77.5% with just a 1% rehearsal ratio, and reaches 91.0% and 91.5% with 10% and 25% rehearsal ratios, respectively. This clearly shows that the more original task data is used in learning a new task, the less catastrophic forgetting occurs. The consistently low $E_{bbox}$ values suggest that the availability of BBox predictions tends to stabilize after 10k iterations. MAE and $E_{angle}$ for the HPE task show a fluctuating trend. Since only head pose pseudo-labels are available in this pre-training experiment, these two metrics may not be reliable for analysis. Rehearsal ratios of 10% and 25% are selected for the stage 4 experiment due to their high Refcoco test accuracy. These ratios are significantly higher than the 1% rehearsal ratio commonly used in non-grounding tasks.

5.4 The Influence of Rehearsal Ratios on Multi-task Learning

Figure 3 presents comparative results for both the BBox prediction task and the HPE task under different rehearsal ratios. Between the two HPE-CogVLM models, the one with the lower rehearsal ratio (10%) achieves an MAE of 7.36, which is 12% lower than the 8.36 observed with the higher (25%) ratio. Conversely, the Refcoco test accuracy improves slightly with the higher rehearsal ratio, showing an increase of 0.3% compared to the lower ratio. A similar pattern appears in the Non-merging CogVLM and TA merging CogVLM results. Intuitively, a higher rehearsal ratio retains existing knowledge better, since more data from previous tasks is included in fine-tuning, but this comes at the expense of new task performance. We therefore seek a balance between retaining old knowledge and performing well on the new task. In our case, the HPE-CogVLM with a 10% rehearsal ratio stands out as the best model when both HPE and BBox prediction are considered. More experimental results are given in Table 7 in Appendix A.4.
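The relative MAE reduction quoted above can be reproduced with a short calculation. The sketch below uses the MAE values listed in Table 7; the helper name `relative_reduction` is hypothetical and is introduced only for illustration.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# (MAE at 10% rehearsal, MAE at 25% rehearsal), copied from Table 7.
mae_by_model = {
    "Non-merging CogVLM": (8.18, 9.87),
    "TA merging CogVLM": (7.72, 8.10),
    "HPE-CogVLM": (7.36, 8.36),
}

for name, (mae_10, mae_25) in mae_by_model.items():
    reduction = 100 * relative_reduction(mae_25, mae_10)
    print(f"{name}: MAE is {reduction:.1f}% lower at the 10% ratio")
```

For HPE-CogVLM this gives (8.36 - 7.36) / 8.36, which rounds to the 12% stated in the text.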

6 Conclusions

In this paper, we propose a novel framework that enhances the HPE task by using the grounding CogVLM. We design prompts that enable CogVLM to learn HPE from full images, explore the optimal rehearsal ratio to mitigate catastrophic forgetting, and introduce a layer-based merging method. The framework exhibits outstanding robustness and effectiveness compared with both Non-LLM and other VLM-based methods.

Limitations This paper is limited by GPU resources, which constrains the extent of our experiments. As a result, larger-scale experiments could not be fully explored within the scope of this study.

References

(1) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report, 2024.
(2) Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. Vqa: Visual question answering, 2016.
(3) Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes. arXiv preprint arXiv:2403.13187, 2024.
(4) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
(5) Zhiwen Cao, Zongcheng Chu, Dongfang Liu, and Yingjie Chen. A vector-based representation to enhance head pose estimation, 2020.
(6) Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023.
(7) Yihua Cheng, Feng Lu, and Xucong Zhang. Appearance-based gaze estimation via evaluation-guided asymmetric regression. In Proceedings of the European conference on computer vision (ECCV), pages 100–115, 2018.
(8) MohammadReza Davari and Eugene Belilovsky. Model breadcrumbs: Scaling multi-task model merging with sparse masks, 2023.
(9) Gabriele Fanelli, Matthias Dantone, and Luc Van Gool. Real time 3d face alignment with random forests-based active appearance models. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8, 2013.
(10) Tobias Fischer, Hyung Jin Chang, and Yiannis Demiris. Rt-gene: Real-time eye gaze estimation in natural environments. In Proceedings of the European conference on computer vision (ECCV), pages 334–352, 2018.
(11) Thorsten Hempel, Ahmed A. Abdelrahman, and Ayoub Al-Hamadi. 6d rotation representation for unconstrained head pose estimation. In 2022 IEEE International Conference on Image Processing (ICIP). IEEE, October 2022.
(12) Heng-Wei Hsu, Tung-Yu Wu, Sheng Wan, Wing Hung Wong, and Chen-Yi Lee. Quatnet: Quaternion-based head pose estimation with multiregression loss. IEEE Transactions on Multimedia, 21(4):1035–1046, 2019.
(13) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
(14) Tiancheng Hu, Sumit Jha, and Carlos Busso. Robust driver head pose estimation in naturalistic conditions from point-cloud data. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages 1176–1182. IEEE, 2020.
(15) Tiancheng Hu, Sumit Jha, and Carlos Busso. Temporal head pose estimation from point cloud in naturalistic driving conditions. IEEE Transactions on Intelligent Transportation Systems, 23(7):8063–8076, 2021.
(16) Bin Huang, Renwen Chen, Wang Xu, and Qinbang Zhou. Improving head pose estimation using two-stage ensembles with top-k regression. Image and Vision Computing, 93:103827, 2020.
(17) Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024.
(18) Yufan Huang, Yanzhe Zhang, Jiaao Chen, Xuezhi Wang, and Diyi Yang. Continual learning for text classification with information disentanglement based regularization, 2021.
(19) Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic, 2023.
(20) Dong-Hwan Jang, Sangdoo Yun, and Dongyoon Han. Model stock: All we need is just a few fine-tuned models, 2024.
(21) Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social interaction capture, 2016.
(22) Petr Kellnhofer, Adria Recasens, Simon Stent, Wojciech Matusik, and Antonio Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6912–6921, 2019.
(23) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, March 2017.
(24) Dingcheng Li, Zheng Chen, Eunah Cho, Jie Hao, Xiaohu Liu, Fan Xing, Chenlei Guo, and Yang Liu. Overcoming catastrophic forgetting during domain adaptation of seq2seq language generation. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5441–5454, Seattle, United States, July 2022. Association for Computational Linguistics.
(25) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
(26) Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2024.
(27) Márcio Cerqueira de Farias Macedo, Antônio Lopes Apolinário, and Antonio Carlos dos Santos Souza. A robust real-time face tracking using head pose estimation for a markerless ar system. In 2013 XV Symposium on Virtual and Augmented Reality, pages 224–227, 2013.
(28) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions, 2016.
(29) Iacopo Masi, Feng-Ju Chang, Jongmoo Choi, Shai Harel, Jungyeon Kim, Kanggeon Kim, Jatuporn Leksut, Stephen Rawls, Yue Wu, Tal Hassner, Wael AbdAlmageed, Gerard Medioni, Louis-Philippe Morency, Prem Natarajan, and Ram Nevatia. Learning pose-aware models for pose-invariant face recognition in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):379–393, 2019.
(30) Iacopo Masi, Stephen Rawls, Gérard Medioni, and Prem Natarajan. Pose-aware face recognition in the wild. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4838–4846, 2016.
(31) Jisoo Mok, Jaeyoung Do, Sungjin Lee, Tara Taghavi, Seunghak Yu, and Sungroh Yoon. Large-scale lifelong learning of in-context instructions and how to tackle it. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12573–12589, Toronto, Canada, July 2023. Association for Computational Linguistics.
(32) Erik Murphy-Chutorian and Mohan Manubhai Trivedi. Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness. IEEE Transactions on Intelligent Transportation Systems, 11(2):300–311, 2010.
(33) Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, and Michael J. Black. Agora: Avatars in geography optimized for regression analysis, 2021.
(34) Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image, 2019.
(35) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
(36) Nataniel Ruiz, Eunji Chong, and James M. Rehg. Fine-grained head pose estimation without keypoints. CoRR, abs/1710.00925, 2017.
(37) Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners, 2022.
(38) Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
(39) Dominykas Strazdas, Jan Hintz, and Ayoub Al-Hamadi. Robo-hud: Interaction concept for contactless operation of industrial cobotic systems. Applied Sciences, 11(12), 2021.
(40) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models, 2024.
(41) Roberto Valenti, Nicu Sebe, and Theo Gevers. Combining head pose and eye location information for gaze estimation. IEEE Transactions on Image Processing, 21(2):802–815, 2012.
(42) Sourabh Vora, Akshay Rangesh, and Mohan Manubhai Trivedi. Driver gaze zone estimation using convolutional neural networks: A general framework and ablative analysis. IEEE Transactions on Intelligent Vehicles, 3(3):254–265, 2018.
(43) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2024.
(44) Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022.
(45) Siyu Wu, Jie Liang, and Jason Ho. Head pose estimation and its application in tv viewers’ behavior analysis. In 2016 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), pages 1–6. IEEE, 2016.
(46) Ju Xu and Zhanxing Zhu. Reinforced continual learning, 2018.
(47) Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36, 2024.
(48) Tsun-Yi Yang, Yi-Ting Chen, Yen-Yu Lin, and Yung-Yu Chuang. Fsa-net: Learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1087–1096, 2019.
(49) Pengwei Yin, Guanzhong Zeng, Jingjing Wang, and Di Xie. Clip-gaze: Towards general gaze estimation via visual-linguistic model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6729–6737, 2024.
(50) Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity, 2023.
(51) Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. arXiv preprint arXiv:2311.03099, 2023.
(52) Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions, 2016.
(53) Hao Zhang, Mengmeng Wang, Yong Liu, and Yi Yuan. Fdn: Feature decoupling network for head pose estimation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 12789–12796, 2020.
(54) Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Appearance-based gaze estimation in the wild. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4511–4520, 2015.
(55) Zihan Zhang, Meng Fang, Ling Chen, and Mohammad-Reza Namazi-Rad. CITB: A benchmark for continual instruction tuning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9443–9455, Singapore, December 2023. Association for Computational Linguistics.
(56) Huayi Zhou, Fei Jiang, and Hongtao Lu. Directmhp: Direct 2d multi-person head pose estimation with full-range angles, 2023.
(57) Yijun Zhou and James Gregson. Whenet: Real-time fine-grained estimation for wide range head pose, 2020.
(58) Xiangyu Zhu, Xiaoming Liu, Zhen Lei, and Stan Z. Li. Face alignment in full pose range: A 3d total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):78–92, January 2019.

Appendix A Supplemental Material

A.1 Details of the Designed Prompts

Table 5: Prompts and Responses Design for HPE Task.

BBox Prediction Task
Designed Prompt	How many human heads are in this image and what are the head bounding boxes?
Correct Answer	Their head bounding boxes are [[106,168,148,242;245,168,270,230]].
Invalid Answer (Reason)	[[000,111,222,333… (Recycled output error)
	{112,432,211} (Angle format output error)
	A man in Red (NLP output error)
	[[212,123,212} (Mixed output error)
	[[234,134,100,111]] (Logical error)
Head Pose Estimation Task
Designed Prompt	What is the head yaw pitch roll inside the bounding box [[106,168,148,242]]?
Correct Answer	The head orientation angles are {072,354,002}.
Invalid Answer (Reason)	{112,432,211,201} (Wrong number error)
	[[234,134,100,111]] (BBox format output error)
	A person head (NLP output error)
	[[212,123,212} (Mixed output error)
	{999,389,001} (Logical error)

Table 5 provides examples of our designed prompts for the HPE and BBox prediction tasks. The BBox format adheres to the specifications set by CogVLM wang2024cogvlm . Euler angles in the response are first converted to positive floats, then rounded to the nearest integer and formatted as fixed-length three-character strings, zero-padded where necessary. The table also includes typical examples of invalid answers, which show how a malformed response can make the output completely unusable when multiple grounding tasks require accurate numerical outputs that vary in range and quantity. To address this issue, we define a new metric, described in Section 4.4, to assess the model’s availability.
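The angle formatting and the notion of an invalid answer can be sketched as follows. This is an illustration under stated assumptions rather than our exact implementation: negative angles are assumed to be made positive by wrapping into [0, 360), decoded angles are assumed to map back to [-180, 180), and any field of 360 or above is treated as a logical error. The names `encode_angles`, `decode_angles`, and `invalid_ratio` are hypothetical, and `invalid_ratio` only approximates the spirit of the metric in Section 4.4.

```python
import re

# Exactly three zero-padded, three-digit fields inside curly braces.
ANGLE_RE = re.compile(r"\{(\d{3}),(\d{3}),(\d{3})\}")


def encode_angles(yaw, pitch, roll):
    """Format Euler angles (degrees) as a '{ddd,ddd,ddd}' string."""
    def to_field(angle):
        # Assumption: wrap into [0, 360) to obtain a positive value,
        # then round; the outer modulo maps a rounded 360 back to 0.
        return f"{round(angle % 360.0) % 360:03d}"
    return "{" + ",".join(to_field(a) for a in (yaw, pitch, roll)) + "}"


def decode_angles(response):
    """Return (yaw, pitch, roll) in [-180, 180), or None if invalid."""
    matches = ANGLE_RE.findall(response)
    if len(matches) != 1:
        return None  # wrong format, wrong field count, or plain text
    values = [int(v) for v in matches[0]]
    if any(v >= 360 for v in values):
        return None  # logical error, e.g. {999,389,001}
    return tuple(v - 360 if v >= 180 else v for v in values)


def invalid_ratio(responses):
    """Fraction of responses that cannot be decoded into angles."""
    return sum(decode_angles(r) is None for r in responses) / len(responses)
```

Under these assumptions, `encode_angles(72, -6, 2)` yields the correct answer string shown in Table 5, `{072,354,002}`, and every invalid HPE answer listed in the table is rejected by `decode_angles`.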

A.2 Pilot Study 1: Visualization of Cross Attention Maps Supervised by Designed Prompts

Figure 4 visualizes the cross attention maps produced in response to HPE prompts that specify a particular BBox. The left image shows the model’s response to the HPE prompt with the bounding box [[335,179,445,332]], where attention is focused on the head of the person on the left. The right image similarly shows attention on the head of the person on the right for the bounding box [[775,105,893,261]]. The visualizations confirm that the model focuses its attention accurately on the specified areas, demonstrating its localization capability. They also show that CogVLM not only generates bounding box outputs in response to queries but also interprets bounding boxes specified within prompts.
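As a rough illustration of how a box can be passed to the model inside a prompt, the sketch below builds the HPE prompt from Table 5 for a pixel-space box. The helper names (`to_grounding_box`, `hpe_prompt`) are hypothetical, and the normalization used here (each coordinate scaled by the image size into a three-digit integer between 000 and 999) is an assumption about the grounding format rather than something specified in this paper.

```python
def to_grounding_box(box_px, width, height, scale=1000):
    """Convert a pixel box (x0, y0, x1, y1) to a '[[x0,y0,x1,y1]]' string.

    Assumption: coordinates are normalized by the image size and
    written as zero-padded three-digit integers in [000, 999].
    """
    x0, y0, x1, y1 = box_px

    def norm(value, size):
        # Scale to [0, scale) and clamp so the result fits three digits.
        return min(scale - 1, max(0, int(value * scale / size)))

    fields = (norm(x0, width), norm(y0, height),
              norm(x1, width), norm(y1, height))
    return "[[" + ",".join(f"{f:03d}" for f in fields) + "]]"


def hpe_prompt(box_px, width, height):
    """Build the HPE prompt from Table 5 for one head bounding box."""
    return ("What is the head yaw pitch roll inside the bounding box "
            + to_grounding_box(box_px, width, height) + "?")
```

Keeping the box in the same textual format that the model emits for the BBox prediction task means the output of one query can be fed directly into the next.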

A.3 Pilot Study 2: Catastrophic Forgetting Pattern in HPE Task

Table 6: The impact of catastrophic forgetting when no data rehearsal is applied.

Iterations	$\boldsymbol{Acc_{test}}$	$\boldsymbol{E_{bbox}}$	$\boldsymbol{MAE_{test}}$	$\boldsymbol{E_{angle}}$
0	91.4%	0%	-	-
100	91.3%	0%	-	-
500	28.1%	36.2%	41.16	2.5%
1000	10.8%	10.2%	42.16	0.1%

Table 6 illustrates the profound impact of catastrophic forgetting on a model trained for the HPE task using the Agora dataset. The Refcoco test accuracy starts at a high of 91.4% at iteration 0, indicating initial proficiency in BBox prediction. As the number of training iterations increases and the model is increasingly exposed to the HPE task, the Refcoco test accuracy drops drastically to 10.8% at iteration 1000. This sharp decline illustrates significant forgetting of the original BBox knowledge. $E_{bbox}$ rises significantly from 0% to 36.2% at iteration 500 and then decreases to 10.2% at iteration 1000. This trend suggests that the model initially adapts to the new task at the expense of previously learned behaviors, causing a temporary increase in errors before stabilizing. The MAE goes from not available ("-", the model is not yet capable of the task) at iteration 100 to 42.16 at iteration 1000, indicating that the model begins to gain proficiency in the new task. The decline in $E_{angle}$ from 2.5% at iteration 500 to 0.1% at iteration 1000 implies that the model’s HPE output format becomes more consistent over time.

What is particularly noteworthy in this scenario is the order of forgetting and learning displayed by the model: old knowledge is significantly diminished before new knowledge is solidified. This is in stark contrast to human learning, where new and old knowledge often coexist and can even support the acquisition of each other. In human cognition, learning new tasks frequently involves integrating new information with existing knowledge, without the catastrophic forgetting seen in this model.

A.4 Detailed Results of the Influence of Rehearsal Ratio on Multi-task Learning

Table 7: The influence of rehearsal ratio on HPE task.

Model	Rehearsal Ratio	Refcoco		Refcoco+		Refcocog		CMU Panoptic
Model	Rehearsal Ratio	$\boldsymbol{Acc_{test}}$	$\boldsymbol{E_{bbox}}$	$\boldsymbol{Acc_{test}}$	$\boldsymbol{E_{bbox}}$	$\boldsymbol{Acc_{test}}$	$\boldsymbol{E_{bbox}}$	$\boldsymbol{MAE_{test}}$	$\boldsymbol{E_{angle}}$
Non-merging CogVLM	10%	91.1%	0%	85.2%	0%	88.9%	0%	8.18	0.13%
Non-merging CogVLM	25%	91.2%	0%	86.4%	0%	89.1%	0%	9.87	0.43%
TA merging CogVLM	10%	89.5%	0%	82.3%	0%	86.1%	0%	7.72	68.9%
TA merging CogVLM	25%	90.8%	0%	84.6%	0%	87.7%	0%	8.10	67.7%
HPE-CogVLM	10%	90.5%	0%	84.7%	0%	87.8%	0%	7.36	0.052%
HPE-CogVLM	25%	90.9%	0%	85.6%	0%	88.1%	0%	8.36	0.26%

Table 7 presents the detailed comparative results for the BBox prediction task and the HPE task under different rehearsal ratios. The Refcoco accuracy and HPE MAE results have been analyzed in Section 5.4. Refcoco+ and Refcocog follow the same trend as Refcoco: a higher rehearsal ratio leads to higher BBox prediction accuracy. At the same time, the error ratio $E_{angle}$ also increases for Non-merging CogVLM and HPE-CogVLM. This aligns with expectations: a higher rehearsal ratio helps improve BBox predictions, yet it may reduce the emphasis on the new task.