Abstract
In this paper, we introduce \(\textrm{D}^4\)-VTON, a solution for image-based virtual try-on (VTON). We address limitations of previous studies, namely semantic inconsistencies before and after garment warping and a reliance on static, annotation-driven clothing parsers. We also tackle the difficulty that diffusion-based VTON models face when they must perform inpainting and denoising simultaneously. Our approach rests on two key components. First, Dynamic Semantics Disentangling Modules (DSDMs) extract abstract semantic information from garments to create distinct local flows, making garment warping more precise in a self-discovered manner. Second, by integrating a Differential Information Tracking Path (DITP), we establish a novel diffusion-based VTON paradigm. This path captures the differential information between incomplete try-on inputs and their complete versions, enabling the network to handle multiple degradations independently, which minimizes learning ambiguities and achieves realistic results with minimal overhead. Extensive experiments show that \(\textrm{D}^4\)-VTON significantly outperforms existing methods in both quantitative metrics and qualitative evaluations, confirming its ability to generate realistic images while preserving semantic consistency. Code is available at https://github.com/Jerome-Young/D4-VTON.
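To make the idea of warping a garment with several distinct local flows more concrete, the following is a minimal sketch and not the DSDM itself. It assumes that each local flow is a dense field of per-pixel (dx, dy) offsets, that warping is done by backward bilinear sampling, and that the per-flow results are blended with per-pixel weights that sum to one. The function names `bilinear_sample`, `warp`, and `merge_local_warps`, the weighting scheme, and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]                # H x W grayscale image
Flow = List[List[Tuple[float, float]]]   # H x W grid of (dx, dy) offsets


def bilinear_sample(img: Image, x: float, y: float) -> float:
    """Sample img at a continuous (x, y), clamping coordinates to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(img: Image, flow: Flow) -> Image:
    """Backward warp: output pixel (x, y) reads the source at (x + dx, y + dy)."""
    h, w = len(img), len(img[0])
    return [
        [bilinear_sample(img, x + flow[y][x][0], y + flow[y][x][1]) for x in range(w)]
        for y in range(h)
    ]


def merge_local_warps(img: Image, flows: List[Flow], weights: List[Image]) -> Image:
    """Warp img once per local flow and blend the results.

    weights[k][y][x] is the (assumed) weight of flow k at pixel (x, y);
    the weights are expected to sum to 1 over k at every pixel.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for flow, wk in zip(flows, weights):
        warped = warp(img, flow)
        for y in range(h):
            for x in range(w):
                out[y][x] += wk[y][x] * warped[y][x]
    return out


# Toy example: a 1 x 4 "garment" and two local flows.
garment = [[0.0, 10.0, 20.0, 30.0]]
shift_right = [[(1.0, 0.0)] * 4]   # every pixel reads from one pixel to its right
identity = [[(0.0, 0.0)] * 4]      # every pixel reads from itself
# The left half follows the first flow, the right half follows the second.
weights = [[[1.0, 1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0, 1.0]]]

result = merge_local_warps(garment, [shift_right, identity], weights)
print(result)
```

In this sketch each region of the garment is moved by its own flow, which is the general intuition behind local flows; how the regions are discovered and how the flows are estimated is the subject of the paper and is not modeled here.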
Notes
1. For simplicity, we illustrate the case with \(N=3\) in Fig. 2.
Acknowledgement
This project is supported by the National Natural Science Foundation of China (62102381, 41927805); Shandong Natural Science Foundation (ZR2021QF035); the National Key R&D Program of China (2022ZD0117201); and the China Postdoctoral Science Foundation (2020M682240, 2021T140631).
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, Z. et al. (2025). \(\textrm{D}^4\)-VTON: Dynamic Semantics Disentangling for Differential Diffusion Based Virtual Try-On. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15104. Springer, Cham. https://doi.org/10.1007/978-3-031-72952-2_3
DOI: https://doi.org/10.1007/978-3-031-72952-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72951-5
Online ISBN: 978-3-031-72952-2
eBook Packages: Computer Science (R0)