DOI: 10.1145/3531146.3533231
Research article · Open access

Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

Published: 20 June 2022
Abstract

    As research and industry move toward large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset's origins, development, intent, ethical considerations, and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation. It requires consistency and comparability across the documentation of all datasets involved, and as such, documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful, and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models, such as upstream sources; data collection and annotation methods; training and evaluation methods; intended use; and decisions affecting model performance. We also present frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over 20 Data Cards.
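
    To make the idea of a structured summary concrete, the sketch below shows one hypothetical way the kinds of facts named in the abstract (upstream sources, collection and annotation methods, intended use, and so on) could be captured programmatically. This is an illustrative Python sketch only; the field names, the DataCard class, and the rendering helper are assumptions made for illustration, not the template or schema proposed in the paper.

        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class DataCard:
            # Illustrative container for the kinds of facts the abstract says a
            # Data Card summarizes; all field names here are hypothetical.
            dataset_name: str
            upstream_sources: List[str]        # where the raw data originated
            collection_methods: str            # how the data was gathered
            annotation_methods: str            # how labels were produced
            intended_use: str                  # purposes the dataset was designed for
            ethical_considerations: List[str] = field(default_factory=list)
            known_limitations: List[str] = field(default_factory=list)

            def to_summary(self) -> str:
                # Render a short plain-text summary that a stakeholder could read.
                lines = [
                    f"Data Card: {self.dataset_name}",
                    f"Upstream sources: {', '.join(self.upstream_sources)}",
                    f"Collection: {self.collection_methods}",
                    f"Annotation: {self.annotation_methods}",
                    f"Intended use: {self.intended_use}",
                ]
                if self.ethical_considerations:
                    lines.append("Ethical considerations: " + "; ".join(self.ethical_considerations))
                if self.known_limitations:
                    lines.append("Known limitations: " + "; ".join(self.known_limitations))
                return "\n".join(lines)

        # Example usage with made-up values:
        card = DataCard(
            dataset_name="ExampleSpeechCorpus",
            upstream_sources=["crowdsourced recordings"],
            collection_methods="opt-in mobile app submissions",
            annotation_methods="manual transcription by paid raters",
            intended_use="evaluating speech recognition models",
            ethical_considerations=["speaker consent obtained"],
        )
        print(card.to_summary())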


          Published In

          FAccT '22: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency
          June 2022, 2351 pages
          ISBN: 9781450393522
          DOI: 10.1145/3531146
          This work is licensed under a Creative Commons Attribution 4.0 International License.

          Publisher

          Association for Computing Machinery, New York, NY, United States

          Author Tags

          1. data cards
          2. dataset documentation
          3. datasheets
          4. model cards
          5. responsible AI
          6. transparency

          Qualifiers

          • Research-article
          • Research
          • Refereed limited
