Robust Domain Generalization for Multi-modal Object Recognition

1st Yuxin Qiao*
Department of Computer Information Technology
Northern Arizona University
Flagstaff, AZ, USA
yq83@nau.edu

1st Keqin Li
Department of Computer Science
AMA University
Quezon, Philippines
keqin157@gmail.com

2nd Junhong Lin
Electrical Engineering & Computer Science Department
Massachusetts Institute of Technology
Cambridge, MA, USA
junhong@mit.edu

3rd Rong Wei
Academy for Advanced Interdisciplinary Studies
Peking University
Beijing, China
wei_rong@pku.edu.cn

4th Chufeng Jiang
Department of Computer Science
The University of Texas at Austin
Austin, TX, USA
chufeng.jiang@utexas.edu

5th Yang Luo
Department of Computer Science
University of Southern California
Los Angeles, CA, USA
luoyangdxx@gmail.com

6th Haoyu Yang
College of Computing
Georgia Institute of Technology
Atlanta, GA, USA
hyang645@gatech.edu

Abstract

In multi-label classification, machine learning encounters the challenge of domain generalization when handling tasks with distributions differing from the training data. Existing approaches primarily focus on vision object recognition and neglect the integration of natural language. Recent advancements in vision-language pre-training leverage supervision from extensive visual-language pairs. This allows learning across diverse domains and enhances recognition in multi-modal scenarios, and methods like CLIPood showcase superior transfer learning capabilities. However, CLIPood has several limitations: a discrepancy between the loss described in its paper and the loss used in its implementation, a loss of generality from evaluating only a single backbone, and neglect of class-aware visual fusion.

To address these, this paper infers the actual loss from the official implementation, broadens the evaluation to two additional vision-language backbones, and introduces Mixup-CLIPood with a novel mix-up loss for enhanced class-aware visual fusion.

Index Terms:

Multi-modal Learning, Domain Generalization, Vision-Language Pre-training, Class-aware Feature Fusion, Mix-up Loss

I Introduction

In multi-label classification, machine learning applications inevitably encounter the challenge of domain generalization [22, 2]. This challenge arises when confronting new tasks with distributions that differ from those encountered during training. While large-scale pretrained models and carefully crafted transfer learning algorithms are readily accessible, these current approaches are predominantly tailored for tasks focused on pure vision object recognition, neglecting the incorporation of natural language [20, 21].

Diverging from conventional methods that rely on learning from images and encoded labels, contemporary progress in vision-language pre-training aims to harness naturally occurring supervision derived from extensive visual-language pairs [4, 19, 18, 2]. This approach allows for learning across diverse domains and enhances the recognition of concepts within a multi-modal scenario. As a result, vision-language pretrained models demonstrate impressive transfer learning capabilities, outperforming models trained exclusively on images and encoded labels. This highlights a promising pathway for tackling the challenges associated with domain generalization, as exemplified by CLIPood [3].

However, CLIPood [3] has several limitations. First, we observed a difference between the actual loss used in the official implementation and the one described in the CLIPood paper [3]. Second, CLIPood [3] evaluates only a single type of backbone, so its generalization ability is not assessed comprehensively. Third, CLIPood [3] neglects the fusion of class-aware visual information, focusing solely on cross-modal fusions and pure text fusions. This oversight may impede the model’s overall generalization capability.

Figure 1: Overview of Our Proposed Mix-up Loss.

To rectify these limitations, we make the following changes. Regarding the disparity between the actual loss and the one detailed in the paper, we infer the actual loss from the official implementation and compare it with the stated loss in Section III-B. To address the limited evaluation scope, we broaden our experiments to include two additional vision-language backbones. Regarding the oversight of class-aware visual knowledge fusion, we introduce a novel mix-up loss as shown in Fig. 1. Our contributions are threefold:

• We address the incongruity between the actual loss and the one documented in the paper. Through a careful analysis of the official implementation, we deduce the actual loss and compare it with the documented loss.
• We expand our experiments to encompass two additional vision-language backbones. This broader evaluation provides a more robust assessment of the method’s performance.
• We propose Mixup-CLIPood with a novel mix-up loss to enhance the previous model’s generalization ability by incorporating class-aware visual information during training.

II Related Work

Multi-modal Learning. Multi-modal learning has garnered significant attention in recent years, propelled by the rise of powerful neural network architectures such as transformers [16] and vision transformers (ViT) [17, 24]. Noteworthy examples include CLIP [4], UNITER [18], and ALIGN [19], which harness transformer-based architectures to concurrently process both text and image modalities. For our foundational vision-language model, we employ CLIP. The field continues to progress with ongoing research, particularly in domain generalization. Representative multi-modal domain generalization methods encompass CoOp [21], CoCoOp [20], and CLIPood [3]. In this paper, we adopt CLIPood [3] as the cornerstone of our proposed method.

Domain Generalization. Current DG methods aim to learn domain-invariant representations and are categorized into three types: domain alignment [10, 11], meta-learning [12, 13], and augmentation strategies [14, 15]. For domain alignment, [10] enhances the conditional invariance of learned features by incorporating an entropy regularization term, leading to improved classifier generalization. [11] iteratively segregates samples into latent domains through clustering. Concerning meta-learning, [12] proposes a model-agnostic training procedure that simulates domain shift during training, whereas [13] applies meta-learning to single-domain generalization. Regarding augmentation strategies, [14] introduces a novel regularization term for adversarial data augmentation derived from the information bottleneck principle, while [15] presents a unique style hallucination module to generate style-diversified samples crucial for generalization. In this paper, we use a novel mix-up loss for better generalization, which is an augmentation-based technique.

III Method

III-A Preliminaries

In this study, our focus centers on addressing the zero-shot generalization challenge inherent in the vision-language pretrained model CLIP [4]. Commencing with a pretrained CLIP model, we initially adapt it using labeled (source) data denoted as $\mathcal{S}=\{(x^{s},y^{s})\}$. Inspired by [2], our objective is to achieve a robust generalization of this model to previously unseen (target) data represented by $\mathcal{T}=\{(x^{t},y^{t})\}$ through carefully tailored finetuning operations. Although the source and target data occupy the same label space, they originate from distinct distributions, encapsulated by the inequality $P(x^{s},y^{s})\neq P(x^{t},y^{t})$.

Our approach adheres to the established protocols outlined in CLIPood [3]: we finetune the visual model of CLIP [4] while using its text encoder to generate text embeddings. To elaborate, for each class $c$ within the label space, we construct a text prompt in the format “a photo of a [CLASS]”, with the “[CLASS]” token replaced by the corresponding class name $c$. The constructed text prompt is then passed through the text encoder to yield the text embedding specific to class $c$.
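The prompt construction described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the helper name `build_prompts` and the example class names are assumptions, and the step that feeds the prompts to the CLIP text encoder is omitted.

```python
from typing import List

# Prompt template quoted in the text above.
PROMPT_TEMPLATE = "a photo of a {}"


def build_prompts(class_names: List[str]) -> List[str]:
    """Return one text prompt per class name, preserving the input order."""
    return [PROMPT_TEMPLATE.format(name) for name in class_names]


# Illustrative class names only (hypothetical label space).
prompts = build_prompts(["bird", "car", "chair"])
for prompt in prompts:
    print(prompt)
```

Keeping the prompts in the same order as the class list means that the index of a text embedding can be used directly as the class index $c$.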

III-B Calibration on CLIPood

For each training sample $(x,y)$, the Margin Metric Softmax (MMS) loss in CLIPood [3] is used to finetune the vision model of CLIP [4], and this loss is represented as Eq. 1 in the paper of CLIPood:

		$\displaystyle\mathcal{L}_{paper}=-\log\frac{\exp(Sim(I_{x},T_{y})/\tau)}{\sum_{c=1}^{C}\exp((Sim(I_{x},T_{c})+0.3(1-Sim(T_{y},T_{c})))/\tau)}.$		(1)

In this equation, $I_{x}$ represents the embedding of image $x$, whereas $T_{y}$ and $T_{c}$ denote the embeddings of label $y$ and class $c$ respectively. $C$ stands for the total number of categories. The function $Sim(\cdot,\cdot)$ measures the similarity between two embeddings, and $\tau$ serves as the temperature parameter.

However, after carefully reviewing the official implementation of CLIPood [3], we found the actual loss used in the finetuning is not consistent with the formula in the paper. Based on the official implementation, we deduce the actual loss as:

\mathcal{L}_{actual}=-\log\frac{\exp(-(I_{x}T_{y}-0.3)/\tau)}{\sum_{c=1}^{C}\exp(-(I_{x}T_{c}-0.3T_{y}T_{c})/\tau)}.

(2)

Here $I_{x}$, $T_{y}$, $T_{c}$, $C$, and $\tau$ have the same meaning as in Eq. 1, and products such as $I_{x}T_{c}$ and $T_{y}T_{c}$ are written without the $Sim(\cdot,\cdot)$ notation.
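To make the difference between Eq. 1 and Eq. 2 concrete, the following sketch evaluates both on a toy example. It is not the official CLIPood code. It assumes that all embeddings are L2-normalized, that $Sim$ and the bare products are dot products, and that $\tau=0.5$; the function names and the two-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence

MARGIN = 0.3  # constant appearing in Eq. 1 and Eq. 2


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (assumed for all embeddings here)."""
    norm = math.sqrt(dot(v, v))
    return [p / norm for p in v]


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def loss_paper(image, texts, y, tau):
    """Eq. 1, with Sim taken to be the dot product of unit vectors."""
    logits = [
        (dot(image, t) + MARGIN * (1.0 - dot(texts[y], t))) / tau for t in texts
    ]
    return log_sum_exp(logits) - dot(image, texts[y]) / tau


def loss_actual(image, texts, y, tau):
    """Eq. 2 exactly as written in the text, including the negated logits."""
    logits = [-(dot(image, t) - MARGIN * dot(texts[y], t)) / tau for t in texts]
    numerator_logit = -(dot(image, texts[y]) - MARGIN) / tau
    return log_sum_exp(logits) - numerator_logit


# Toy example: two classes, label y = 0, hypothetical temperature.
TAU = 0.5
image = normalize([1.0, 0.2])
texts = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
paper_value = loss_paper(image, texts, 0, TAU)
actual_value = loss_actual(image, texts, 0, TAU)
print(paper_value, actual_value)
```

Under the unit-norm assumption $T_{y}T_{y}=1$, so the $c=y$ term of each denominator equals the corresponding numerator and both losses are non-negative. The two expressions still differ in the sign of the logits and in how the margin term enters.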

III-C Class-aware Feature Fusion

The MMS loss within CLIPood [3] places a heightened focus on cross-modal feature fusion to enhance generalization. It achieves classification by comparing visual features with text embeddings generated through the text prompts introduced in Section III-A, thereby leveraging knowledge from the text modality to improve image-text alignment.

However, certain limitations persist within the MMS loss framework. First, the cross-modal adaptation overlooks the fusion of class-aware visual information. Specifically, while cross-modal fusions like $I_{x}T_{c}$ and pure text fusions such as $T_{y}T_{c}$ exist, there is a lack of class-aware image feature fusions. This limitation restricts the model’s generalization ability in tasks that depend on visual information. Second, practical implementation involving mini-batch stochastic gradient descent (SGD) [6] raises concerns, especially when dealing with a small batch size. In such scenarios, the limited number of samples may fail to ensure domain invariance between the source and target domains in the latent space. Peng et al. [2] proposed a framework that addresses this issue by using discrimination from one side to update its weak augmentor, while employing discrimination from the other side to optimize its strong augmentor. Inspired by [2, 5], we propose integrating a cross-modal mix-up loss to address the latent space issue.

TABLE I: Accuracy (%) on PACS Dataset

Backbones	Method	ArtPaint	Cartoon	Photo	Sketch	Avg
ViT-B/16	CLIP	94.27	98.93	92.96	88.15	93.58
	CLIPood	98.78	99.47	100.0	89.87	97.03
	Ours	99.57	99.47	100.0	93.02	98.02
ViT-B/32	CLIP	95.11	98.08	100.0	86.62	94.95
	CLIPood	96.82	98.50	100.0	88.54	95.96
	Ours	95.84	98.08	100.0	86.62	95.14
ViT-L/14	CLIP	98.78	99.79	100.0	95.29	98.46
	CLIPood	99.27	99.79	100.0	94.90	98.49
	Ours	99.51	100.0	100.0	95.54	98.79

TABLE II: Accuracy (%) on VLCS Dataset

Backbones	Method	CalTech	LabelMe	SUN	VOC	Avg
ViT-B/16	CLIP	100.0	67.42	73.48	86.22	81.78
	CLIPood	98.94	68.36	80.87	89.85	84.51
	Ours	99.65	75.89	84.30	89.41	87.31
ViT-B/32	CLIP	75.13	71.94	67.53	86.37	75.24
	CLIPood	98.23	65.16	78.96	85.78	82.03
	Ours	97.88	90.83	78.05	86.07	88.21
ViT-L/14	CLIP	77.87	71.75	71.65	86.96	77.05
	CLIPood	97.88	68.36	79.57	89.19	83.75
	Ours	98.94	68.93	81.40	89.78	84.76

TABLE III: Accuracy (%) on Office-Home Dataset

Backbones	Method	Ar	Cl	Pr	Rw	Avg
ViT-B/16	CLIP	84.12	65.98	87.94	90.36	82.10
	CLIPood	88.35	72.22	92.39	92.65	86.40
	Ours	89.28	79.32	93.80	93.14	88.89
ViT-B/32	CLIP	80.21	63.46	85.46	87.37	79.12
	CLIPood	84.12	67.81	87.60	89.21	82.19
	Ours	80.41	68.04	88.05	88.05	81.14
ViT-L/14	CLIP	88.66	74.80	92.67	93.92	87.51
	CLIPood	91.55	77.89	94.36	94.60	89.60
	Ours	92.17	81.56	94.48	94.95	90.79

We randomly draw $\eta$ from $\mathbf{Beta}(0.2,0.2)$ and another training sample $(x^{\prime},y^{\prime})$ from the dataset, and build the mixed sample $(x_{m},y_{m})$ as:

x_{m}=\eta x+(1-\eta)x^{\prime},

(3)

y_{m}=\eta y+(1-\eta)y^{\prime}.

(4)

Correspondingly, the mixed image embedding and text embedding can be represented as:

I_{x_{m}}=\eta I_{x}+(1-\eta)I_{x^{\prime}},

(5)

T_{y_{m}}=\eta T_{y}+(1-\eta)T_{y^{\prime}}.

(6)
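The interpolation in Eqs. 3 to 6 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: images and labels are represented as plain lists of floats, labels are assumed to be one-hot vectors, and the names `interpolate` and `mixup_pair` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

BETA_ALPHA = 0.2  # both Beta parameters are 0.2 in the text


def interpolate(a: Sequence[float], b: Sequence[float], eta: float) -> List[float]:
    """Convex combination eta * a + (1 - eta) * b, as in Eqs. 3 to 6."""
    return [eta * p + (1.0 - eta) * q for p, q in zip(a, b)]


def mixup_pair(
    x: Sequence[float],
    y: Sequence[float],
    x_prime: Sequence[float],
    y_prime: Sequence[float],
    rng: random.Random,
) -> Tuple[List[float], List[float], float]:
    """Draw eta ~ Beta(0.2, 0.2) and mix two samples with the same eta."""
    eta = rng.betavariate(BETA_ALPHA, BETA_ALPHA)
    return interpolate(x, x_prime, eta), interpolate(y, y_prime, eta), eta


# Toy inputs: 3-dimensional "images" and one-hot labels over 2 classes.
rng = random.Random(0)
x, y = [0.0, 1.0, 2.0], [1.0, 0.0]
x_prime, y_prime = [2.0, 1.0, 0.0], [0.0, 1.0]
x_m, y_m, eta = mixup_pair(x, y, x_prime, y_prime, rng)
print(eta, x_m, y_m)
```

The same `interpolate` helper applies unchanged to the embeddings in Eqs. 5 and 6, since they use the same $\eta$. A Beta(0.2, 0.2) distribution concentrates its mass near 0 and 1, so most mixed samples stay close to one of the two originals.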

Diverging from conventional mix-up approaches found in prior works [5, 23, 25], which focus on achieving convex linear combinations between input and output, our method enhances class consistency by comparing the outputs of the original and mixed samples. Furthermore, our proposed mix-up loss places a greater emphasis on modality fusion and interactions, specifically tailored for addressing the zero-shot multi-modal generalization problem of this paper. The novel mix-up loss is then computed as follows:

		$\displaystyle\mathcal{L}_{mix}=\sum_{c=1}^{C}\left\|\frac{\exp(-(I_{x}T_{c}-0.3T_{y}T_{c})/\tau)}{\sum_{i=1}^{C}\exp(-(I_{x}T_{i}-0.3T_{y}T_{i})/\tau)}-\frac{\exp(-(I_{x_{m}}T_{c}-0.3T_{y_{m}}T_{c})/\tau)}{\sum_{j=1}^{C}\exp(-(I_{x_{m}}T_{j}-0.3T_{y_{m}}T_{j})/\tau)}\right\|_{\ell_{1}}.$		(7)

Now we introduce the overall objective of finetuning the Mixup-CLIPood model, which is a combination of MMS loss and mix-up loss as follows:

\mathcal{L}_{total}=\mathcal{L}_{actual}+\lambda\mathcal{L}_{mix},

(8)

where $\lambda$ is the trade-off parameter in this objective and is set to $0.1$ in the implementation.
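As a rough illustration of Eqs. 7 and 8, the sketch below computes the class distributions of an original and a mixed sample and sums their absolute differences. It is a sketch under stated assumptions, not the authors' code: products are taken to be dot products, the value $\tau=0.5$, the fixed $\eta=0.7$, the toy vectors, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

MARGIN = 0.3      # constant from Eq. 2 and Eq. 7
LAMBDA_MIX = 0.1  # trade-off weight stated for Eq. 8


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def interpolate(a: Sequence[float], b: Sequence[float], eta: float) -> List[float]:
    return [eta * p + (1.0 - eta) * q for p, q in zip(a, b)]


def class_probabilities(image, label_text, texts, tau) -> List[float]:
    """Softmax over the logits -(I T_c - 0.3 T_y T_c) / tau used in Eq. 7."""
    logits = [-(dot(image, t) - MARGIN * dot(label_text, t)) / tau for t in texts]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mixup_loss(image, label_text, image_mixed, label_text_mixed, texts, tau) -> float:
    """Eq. 7: L1 distance between original and mixed class distributions."""
    original = class_probabilities(image, label_text, texts, tau)
    mixed = class_probabilities(image_mixed, label_text_mixed, texts, tau)
    return sum(abs(p - q) for p, q in zip(original, mixed))


def total_loss(loss_actual: float, loss_mix: float, lam: float = LAMBDA_MIX) -> float:
    """Eq. 8: weighted sum of the MMS loss and the mix-up loss."""
    return loss_actual + lam * loss_mix


# Toy example with unit-norm embeddings; eta is fixed instead of sampled.
ETA, TAU = 0.7, 0.5
texts = [[1.0, 0.0], [0.0, 1.0]]
image, image_prime = [0.8, 0.6], [0.6, 0.8]
image_mixed = interpolate(image, image_prime, ETA)  # Eq. 5
label_mixed = interpolate(texts[0], texts[1], ETA)  # Eq. 6
mix_value = mixup_loss(image, texts[0], image_mixed, label_mixed, texts, TAU)
print(mix_value)
```

Because both arguments of the norm are probability distributions over the $C$ classes, the mix-up loss is bounded between 0 and 2, and it vanishes when the mixed sample coincides with the original one.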

IV Experiments

IV-A Dataset

Three datasets are used in the experiments. Photo-Art-Cartoon-Sketch (PACS) [7] is a widely used dataset for domain generalization, which consists of four domains, namely Photo (1,670 images), Art Painting (2,048 images), Cartoon (2,344 images), and Sketch (3,929 images), and each domain contains seven categories. VLCS [8] is a dataset of large images that is popular in the field of out-of-distribution classification, with 5 classes (bird, car, chair, dog, and person) distributed equally across 4 domains (Caltech101, LabelMe, SUN09, and VOC2007). Office-Home [9] is a medium-size dataset containing four domains: Art, Clipart, Product, and Real World. Each domain includes 65 classes and the total number of images is 15,500.

IV-B Evaluation metrics and protocols

Adhering to the protocols outlined in CLIPood [3], we formulate four tasks for each dataset. In each task, one domain serves as the target data for final inference and takes no part in the training/adaptation processes. Meanwhile, the remaining three domains function as the source data involved in the finetuning processes of CLIP [4]. Besides the default backbone ViT-B/16 [17] applied by CLIPood [3], we also adopt two further backbones, ViT-B/32 [17] and ViT-L/14 [17]; ViT-L/14 is larger than ViT-B/16, whereas ViT-B/32 shares the Base architecture and uses a larger patch size. The evaluation metric employed is the accuracy percentage on each respective target domain.
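The leave-one-domain-out protocol described above can be enumerated as below. This is a small sketch for illustration: the helper name `leave_one_domain_out` is an assumption, and the domain labels are the column headers used in Tables I to III.

```python
from typing import Dict, List, Tuple

# Domain labels as they appear in the column headers of Tables I to III.
DOMAINS: Dict[str, List[str]] = {
    "PACS": ["ArtPaint", "Cartoon", "Photo", "Sketch"],
    "VLCS": ["CalTech", "LabelMe", "SUN", "VOC"],
    "Office-Home": ["Ar", "Cl", "Pr", "Rw"],
}


def leave_one_domain_out(domains: List[str]) -> List[Tuple[List[str], str]]:
    """Return one (source_domains, target_domain) task per held-out domain."""
    return [([d for d in domains if d != target], target) for target in domains]


for dataset, domains in DOMAINS.items():
    for sources, target in leave_one_domain_out(domains):
        print(f"{dataset}: finetune on {sources}, evaluate on {target}")
```

Each dataset yields four tasks, so the three datasets and three backbones together give 36 target-domain accuracies, matching the number of per-domain entries per method in the tables.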

IV-C Results

Our findings, detailed in Table I for PACS [7], Table II for VLCS [8], and Table III for Office-Home [9], yield valuable insights and conclusions:

• Robust Generalization Across Backbones: Mixup-CLIPood exhibits robust generalization capabilities across various backbones. In terms of average accuracy, CLIPood outperforms CLIP for every dataset and backbone, underscoring the effectiveness of the MMS loss in enhancing domain generalization.
• Effectiveness of Mix-up Loss: Our introduced mix-up loss is effective in most settings. Our method improves the average accuracy over CLIPood in seven of the nine dataset-backbone settings, by up to 6.18 points (VLCS with ViT-B/32); the two exceptions are PACS and Office-Home with ViT-B/32. This suggests the efficacy of our proposed Mixup-CLIPood in addressing the limitations of existing methods and enhancing generalization in multi-modal scenarios.
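The comparison between our method and CLIPood can be checked directly from the Avg columns of Tables I to III. The sketch below only copies those reported averages and computes their differences; the variable names are hypothetical.

```python
# (CLIPood Avg, Ours Avg) copied from the Avg columns of Tables I to III.
AVG = {
    ("PACS", "ViT-B/16"): (97.03, 98.02),
    ("PACS", "ViT-B/32"): (95.96, 95.14),
    ("PACS", "ViT-L/14"): (98.49, 98.79),
    ("VLCS", "ViT-B/16"): (84.51, 87.31),
    ("VLCS", "ViT-B/32"): (82.03, 88.21),
    ("VLCS", "ViT-L/14"): (83.75, 84.76),
    ("Office-Home", "ViT-B/16"): (86.40, 88.89),
    ("Office-Home", "ViT-B/32"): (82.19, 81.14),
    ("Office-Home", "ViT-L/14"): (89.60, 90.79),
}

# Gain of our method over CLIPood in percentage points, per setting.
gains = {key: round(ours - clipood, 2) for key, (clipood, ours) in AVG.items()}
wins = sum(1 for gain in gains.values() if gain > 0)

for (dataset, backbone), gain in gains.items():
    print(f"{dataset:12s} {backbone:9s} {gain:+.2f}")
print(f"settings where Ours > CLIPood: {wins} of {len(gains)}")
```

Looking at averages per setting, rather than only at the best cases, makes it visible that the gains are largest on VLCS and that ViT-B/32 is the backbone where the mix-up loss helps least consistently.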

IV-D Analysis and discussions

Fig. 2 shows two accuracy curves for our adopted backbones, evaluated on the VLCS dataset and the Office-Home dataset, respectively. Accuracy increases with the number of epochs and then stabilizes. Notably, the performance varies across backbone models. For instance, the ViT-L/14 model starts at an accuracy of approximately $89\%$ and surpasses $90\%$ with further training. In contrast, ViT-B/16 follows a more modest trajectory, reaching $81\%$ accuracy on the test data after 10 epochs of training. This comparison highlights the different capabilities and learning efficiencies of the respective models.

V Conclusion

This study advances robust domain generalization for multi-modal object recognition. Our approach, which integrates a novel mix-up loss and extends the evaluation to additional vision-language backbones, improves on CLIPood in most of the evaluated settings. The experiments detailed in Tables I, II, and III establish the efficacy of our proposed method. They validate the robustness of our approach across different backbones and underscore the utility of the mix-up loss in enhancing domain generalization capabilities in multi-modal scenarios.

The results obtained from these comprehensive experiments suggest that our method can effectively bridge the gap in domain generalization tasks, addressing the limitations of existing models like CLIPood. By incorporating class-aware visual information and extending the evaluation framework, our study sets a new benchmark in the field and opens avenues for future research in multi-modal learning and domain generalization.

Contribution

Yuxin Qiao and Keqin Li initiated addressing the limitations of CLIPood by proposing a novel mix-up loss function. Yuxin identified inconsistencies between actual loss results and those documented in a previous paper, which Keqin confirmed through thorough research. Together, they developed the theoretical framework and tested various loss functions to resolve the issues.

Junhong Lin, Yuxin Qiao, and Rong Wei evaluated the mix-up loss function’s reasonableness, with Junhong providing valuable suggestions on complex dataset results. Rong Wei and Chufeng Jiang continuously revised the loss function, ensuring robustness and accuracy. Chufeng and Keqin handled data collection and analysis, conducting pre-test experiments. Yang Luo and Yuxin designed iterative versions of the proposed functions. Haoyu Yang and Chufeng replicated experimental results and implemented error analysis.

All authors contributed to interpreting the results and drawing meaningful insights, which were instrumental in fine-tuning the proposed approach.

References

[1] George D. Greenwade. The Comprehensive Tex Archive Network (CTAN). TUGBoat, 14(3):342–351, 1993.
[2] Qucheng Peng, Ce Zheng, and Chen Chen. A dual-augmentor framework for domain generalization in 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2240–2249, 2024.
[3] Yang Shu, Xingzhuo Guo, Jialong Wu, Ximei Wang, Jianmin Wang, and Mingsheng Long. CLIPood: Generalizing CLIP to out-of-distributions. In Proceedings of the 40th International Conference on Machine Learning, pages 31716–31731, 2023.
[4] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[5] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
[6] Shun-ichi Amari. Backpropagation and stochastic gradient descent method. Neurocomputing, 5(4-5):185–196, 1993.
[7] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pages 5542–5550, 2017.
[8] Chen Fang, Ye Xu, and Daniel N Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pages 1657–1664, 2013.
[9] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
[10] Shanshan Zhao, Mingming Gong, Tongliang Liu, Huan Fu, and Dacheng Tao. Domain generalization via entropy regularization. Advances in Neural Information Processing Systems, 33:16096–16107, 2020.
[11] Toshihiko Matsuura and Tatsuya Harada. Domain generalization using a mixture of multiple latent domains. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11749–11756, 2020.
[12] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[13] Fengchun Qiao, Long Zhao, and Xi Peng. Learning to learn single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12556–12565, 2020.
[14] Long Zhao, Ting Liu, Xi Peng, and Dimitris Metaxas. Maximum-entropy adversarial data augmentation for improved generalization and robustness. Advances in Neural Information Processing Systems, 33:14435–14447, 2020.
[15] Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, and Gim Hee Lee. Style-hallucinated dual consistency learning for domain generalized semantic segmentation. In European Conference on Computer Vision, pages 535–552. Springer, 2022.
[16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
[18] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
[19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
[20] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
[21] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
[22] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 2022.
[23] Yiyi Tao. SQBA: sequential query-based blackbox attack. In Fifth International Conference on Artificial Intelligence and Computer Science (AICS 2023), volume 12803, pages 721–729. SPIE, 2023.
[24] Tianchen Deng, Guole Shen, Tong Qin, Jianyu Wang, Wentao Zhao, Jingchuan Wang, Danwei Wang, and Weidong Chen. Plgslam: Progressive neural scene represenation with local to global bundle adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19657–19666, 2024.
[25] Junbo Chen, Xupeng Chen, Ran Wang, Chenqian Le, Amirhossein Khalilian-Gourtani, Erika Jensen, Patricia Dugan, Werner Doyle, Orrin Devinsky, Daniel Friedman, et al. Subject-agnostic transformer-based neural speech decoding from surface and depth electrode signals. bioRxiv, 2024. Cold Spring Harbor Laboratory Preprints.