Auditing Gender Presentation Differences in Text-to-Image Models

Authors:

Diyi Yang

EAAMO '24: Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization

Article No.: 12, Pages 1 - 10

https://doi.org/10.1145/3689904.3694710

Published: 29 October 2024

Abstract

Text-to-image models, which can generate high-quality images based on textual input, have recently enabled various content-creation tools. Despite significantly affecting a wide range of downstream applications, the distributions of these generated images are still not fully understood, especially when it comes to the potential stereotypical attributes of different genders. In this work, we propose a paradigm (Gender Presentation Differences) that utilizes fine-grained self-presentation attributes to study how gender is presented differently in text-to-image models. By probing gender indicators in the input text (e.g., “a woman” or “a man”), we quantify the frequency differences of presentation-centric attributes (e.g., “a shirt” and “a dress”) through human annotation and introduce a novel metric: GEP. Furthermore, we propose an automatic method to estimate such differences. The automatic GEP metric based on our approach yields a higher correlation with human annotations than that based on existing CLIP scores, consistently across three state-of-the-art text-to-image models. Finally, we demonstrate the generalization ability of our metrics in the context of gender stereotypes. We will publicly release our code/data.
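To make the idea of "frequency differences of presentation-centric attributes" concrete, the following is a minimal sketch and not the authors' implementation. It assumes each generated image has a binary annotation per attribute, and it computes, for every attribute, how often it appears under "a woman" prompts minus how often it appears under "a man" prompts. The function names, the toy annotations, and the scalar summary (mean absolute difference) are all illustrative assumptions; the abstract does not give the exact definition of GEP.

```python
from typing import Dict, List

# Hypothetical annotations: for each attribute, one 0/1 flag per generated image
# (1 = the attribute is visible in the image). The values are made up for illustration.
Annotations = Dict[str, List[int]]


def attribute_frequencies(annotations: Annotations) -> Dict[str, float]:
    """Fraction of images in which each attribute appears."""
    return {attr: sum(flags) / len(flags) for attr, flags in annotations.items()}


def frequency_differences(woman: Annotations, man: Annotations) -> Dict[str, float]:
    """Per-attribute frequency under "a woman" prompts minus that under "a man" prompts."""
    freq_w = attribute_frequencies(woman)
    freq_m = attribute_frequencies(man)
    return {attr: freq_w[attr] - freq_m[attr] for attr in freq_w}


def mean_absolute_difference(diffs: Dict[str, float]) -> float:
    """One possible scalar summary (an assumption, not the paper's definition)."""
    return sum(abs(d) for d in diffs.values()) / len(diffs)


# Toy data: four images per prompt group, two attributes.
woman_annotations = {"a shirt": [1, 0, 0, 0], "a dress": [1, 1, 1, 0]}
man_annotations = {"a shirt": [1, 1, 1, 0], "a dress": [0, 0, 0, 0]}

diffs = frequency_differences(woman_annotations, man_annotations)
summary = mean_absolute_difference(diffs)
print(diffs)    # per-attribute differences
print(summary)  # scalar summary over attributes
```

A positive difference means the attribute shows up more often for "a woman" prompts, a negative one more often for "a man" prompts. In an automatic variant, the 0/1 flags would come from a classifier or similarity model rather than from human annotators.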

Cited By

Zhang Y, Tzeng E, Du Y, Kislyuk D (2024). Large-Scale Reinforcement Learning for Diffusion Models. Computer Vision – ECCV 2024, 1–17. Online publication date: 21-Nov-2024.
https://doi.org/10.1007/978-3-031-73036-8_1

Index Terms

Auditing Gender Presentation Differences in Text-to-Image Models
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
    2. Natural language processing
2. Security and privacy
  1. Human and societal aspects of security and privacy
    1. Social aspects of security and privacy

Information & Contributors

Information

Published In

EAAMO '24: Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization

October 2024

221 pages

ISBN:9798400712227

DOI:10.1145/3689904

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 October 2024

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Google Research Collabs program

Conference

EAAMO '24

EAAMO '24: Equity and Access in Algorithms, Mechanisms, and Optimization

October 29 - 31, 2024

San Luis Potosi, Mexico

Bibliometrics

Total Citations: 1
Total Downloads: 126
Downloads (Last 12 months): 126
Downloads (Last 6 weeks): 104

Reflects downloads up to 25 Dec 2024
