
High-fidelity instructional fashion image editing

Published: 01 October 2024

Abstract

Instructional image editing has recently received a surge of attention. In this work, we address the challenging problem of instructional image editing in the fashion domain, an area with significant potential demand in both commercial and personal contexts. This domain poses heightened challenges owing to its stringent quality requirements: a method must not only create vivid details that align with the instruction, but also precisely preserve attributes unrelated to the text guidance. Naive extensions of existing image editing methods produce noticeable artifacts. To achieve high-fidelity fashion editing, we propose a novel framework that leverages the generative prior of a pre-trained human generator and performs edits in its latent space. In addition, we introduce a novel CLIP-based loss to better align the generated result with the instruction. Extensive experiments demonstrate that our approach outperforms prior GAN-based and diffusion-based editing methods by a large margin, achieving impressive visual quality.
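Only the abstract is available on this page, so the authors' exact objective cannot be reproduced from it. As a rough illustration of the general recipe the abstract describes (optimizing in the latent space of a pre-trained human generator under a CLIP-based loss), the sketch below combines a directional CLIP objective in the spirit of StyleGAN-NADA with a simple latent regularizer standing in for attribute preservation. The `generator` interface, loss weights, and step counts are all illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: assumes a frozen StyleGAN-style `generator`
# mapping a latent w to an image in [-1, 1]. Not the authors' released code.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # avoid fp16 issues when backpropagating

# CLIP's input normalization constants.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_image_features(img):
    """Embed a batch of images in [-1, 1] with CLIP's image encoder."""
    img = F.interpolate((img + 1) / 2, size=(224, 224), mode="bilinear", align_corners=False)
    img = (img - CLIP_MEAN) / CLIP_STD
    return F.normalize(clip_model.encode_image(img), dim=-1)

def directional_clip_loss(src_img, edit_img, src_text, edit_text):
    """Align the image-space edit direction with the text-space direction
    (the directional objective popularized by StyleGAN-NADA)."""
    with torch.no_grad():
        tokens = clip.tokenize([src_text, edit_text]).to(device)
        t = F.normalize(clip_model.encode_text(tokens), dim=-1)
        text_dir = F.normalize(t[1:] - t[:1], dim=-1)
    img_dir = F.normalize(clip_image_features(edit_img) - clip_image_features(src_img), dim=-1)
    return (1.0 - (img_dir * text_dir).sum(dim=-1)).mean()

def edit_in_latent_space(generator, w_src, src_text, edit_text,
                         steps=200, lr=0.05, reg_weight=0.8):
    """Optimize a latent offset so that generator(w + delta) follows the
    instruction. The L2 penalty on delta is a simple stand-in for attribute
    preservation: it keeps the edit close to the source in latent space."""
    delta = torch.zeros_like(w_src, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    with torch.no_grad():
        src_img = generator(w_src)
    for _ in range(steps):
        edit_img = generator(w_src + delta)
        loss = (directional_clip_loss(src_img, edit_img, src_text, edit_text)
                + reg_weight * delta.pow(2).mean())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w_src + delta
```

The two terms mirror the two requirements the abstract highlights: the directional CLIP term pulls the edit toward the instruction, while the penalty on `delta` discourages drift in attributes the instruction does not mention.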



Published In

Graphical Models, Volume 135, Issue C, October 2024, 63 pages

Publisher

Academic Press Professional, Inc., United States


Author Tags

  1. Fashion editing
  2. Image generation
  3. Image editing
  4. Text-driven image editing

Qualifiers

  • Research-article
