Abstract
Text-to-image diffusion models have advanced significantly in conditional image generation. However, these models usually struggle to render images featuring humans accurately, producing distorted limbs and other anomalies. This issue primarily stems from insufficient recognition and evaluation of limb quality within diffusion models. To address it, we introduce AbHuman, the first large-scale synthesized-human benchmark focusing on anatomical anomalies. The benchmark consists of 56K synthesized human images, each annotated with detailed bounding-box-level labels identifying 147K human anomalies across 18 categories. On this basis, recognition of human anomalies can be established, which in turn improves image generation through traditional techniques such as negative prompting and guidance. To push the improvement further, we propose HumanRefiner, a novel plug-and-play approach for the coarse-to-fine refinement of human anomalies in text-to-image generation. Specifically, HumanRefiner employs a self-diagnostic procedure that detects and corrects both coarse-grained abnormal human poses and fine-grained anomaly levels, enabling pose-reversible diffusion generation. Experimental results on the AbHuman benchmark show that HumanRefiner significantly reduces generative discrepancies, achieving a 2.9x improvement in limb quality over the state-of-the-art open-source generator SDXL and a 1.4x improvement over DALL-E 3 in human evaluations. Our AbHuman dataset is available at https://github.com/Enderfga/HumanRefiner.
G. Fang, W. Yan, and Y. Guo contributed equally.
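The negative-prompting baseline and the detect-then-regenerate loop described in the abstract can be illustrated in code. Below is a minimal sketch using the diffusers library with SDXL as the base generator; the count_anomalies stub and the prompt strings are assumptions for illustration, not the paper's HumanRefiner implementation or its actual AbHuman-trained detector.

```python
# Minimal sketch of anomaly-aware negative prompting with SDXL via the
# `diffusers` library. `count_anomalies` is a hypothetical stand-in for a
# detector trained on AbHuman-style bounding-box labels.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Negative prompt naming typical limb anomalies (illustrative wording).
NEGATIVE = "distorted limbs, extra fingers, fused hands, malformed legs"

def count_anomalies(image) -> int:
    # Stub: a real detector would return the number of abnormal-limb
    # bounding boxes found in `image`. Returning 0 keeps the sketch
    # runnable without a trained model.
    return 0

prompt = "a person jogging in a park, photorealistic"
image = pipe(prompt, negative_prompt=NEGATIVE).images[0]

# Self-diagnosis loop in its simplest form: resample with a fresh seed
# while the detector still reports anomalies.
for seed in range(3):
    if count_anomalies(image) == 0:
        break
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, negative_prompt=NEGATIVE,
                 generator=generator).images[0]
```

HumanRefiner itself goes beyond this simple resampling: per the abstract, it corrects coarse-grained abnormal poses and fine-grained anomaly levels with pose-reversible guidance rather than drawing fresh seeds.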
References
Andriluka, M., et al.: PoseTrack: a benchmark for human pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5167–5176 (2018)
Balaji, Y., et al.: eDiffI: text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, pp. 1877–1901 (2020)
Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: COYO-700M: image-text pair dataset. https://github.com/kakaobrain/coyo-dataset (2022)
Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., Sheikh, Y.A.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43, 172–186 (2019)
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)
Chang, H., et al.: Muse: text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704 (2023)
Charles, J., Pfister, T., Magee, D., Hogg, D., Zisserman, A.: Personalizing human video pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3063–3072 (2016)
Chen, J., et al.: PixArt-α: fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 34, 8780–8794 (2021)
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., Taigman, Y.: Make-a-scene: scene-based text-to-image generation with human priors. In: European Conference on Computer Vision, pp. 89–106 (2022)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020)
Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23, 2249–2281 (2022)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Huang, L., Chen, D., Liu, Y., Shen, Y., Zhao, D., Zhou, J.: Composer: creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778 (2023)
Iqbal, U., Milan, A., Gall, J.: PoseTrack: joint multi-person pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2011–2020 (2017)
Ju, X., Zeng, A., Wang, J., Xu, Q., Zhang, L.: Human-Art: a versatile human-centric dataset bridging natural and artificial scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
Ju, X., Zeng, A., Zhao, C., Wang, J., Zhang, L., Xu, Q.: HumanSD: a native skeleton-guided diffusion model for human image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15988–15998 (2023)
Kang, M., et al.: Scaling up GANs for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10124–10134 (2023)
Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: CrowdPose: efficient crowded scenes pose estimation and a new benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10863–10872 (2019)
Li, Y., et al.: GLIGEN: open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521 (2023)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
Lv, W., et al.: DETRs beat YOLOs on real-time object detection. arXiv preprint arXiv:2304.08069 (2023)
Madhu, P., et al.: Enhancing human pose estimation in ancient vase paintings via perceptually-grounded style transfer learning. ACM J. Comput. Cult. Heritage 16(1), 1–17 (2022)
Mou, C., et al.: T2I-Adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 (2023)
Narasimhaswamy, S., Nguyen, T., Huang, M., Hoai, M.: Whose hands are these? Hand detection and hand-body association in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4889–4899 (2022)
Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
OpenAI: Improving image generation with better captions (2023). https://cdn.openai.com/papers/dall-e-3.pdf
Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR (2021)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022)
Sauer, A., Karras, T., Laine, S., Geiger, A., Aila, T.: StyleGAN-T: unlocking the power of GANs for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515 (2023)
Schuhmann, C.: LAION-aesthetics predictor V2 (2023). https://laion.ai/blog/laion-aesthetics/
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 35, 25278–25294 (2022)
Segmind: Announcing SSD-1B: a leap in efficient T2I generation (2023). https://blog.segmind.com/introducing-segmind-ssd-1b/
Shonenkov, A., Konstantinov, M., Bakshandaeva, D., Schuhmann, C., Ivanova, K., Klokova, N.: DeepFloyd IF: a powerful text-to-image model that can smartly integrate text into images (2023). https://www.deepfloyd.ai/deepfloyd-if
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Yu, J., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)
Zhang, P., Yang, L., Lai, J.H., Xie, X.: Exploring dual-task correlation for pose guided person image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7713–7722 (2022)
Zhang, S.H., et al.: Pose2Seg: detection free human instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 889–898 (2019)
Acknowledgments
This work was supported in part by the National Science and Technology Major Project (2020AAA0109704), the National Science and Technology Ministry Youth Talent Funding (No. 2022WRQB002), the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), a Mobility Grant Award (Grant No. M-0461), the Shenzhen Science and Technology Program (Grant No. GJHZ20220913142600001), and the Nansha Key R&D Program (Grant No. 2022ZD014).
Electronic supplementary material
Below is the link to the electronic supplementary material.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fang, G. et al. (2025). HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-Fine Pose-Reversible Guidance. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15090. Springer, Cham. https://doi.org/10.1007/978-3-031-73411-3_12
DOI: https://doi.org/10.1007/978-3-031-73411-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73410-6
Online ISBN: 978-3-031-73411-3
eBook Packages: Computer Science, Computer Science (R0)