review-article

Datasheets for datasets

Authors: Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford

Communications of the ACM, Volume 64, Issue 12

Pages 86 - 92

https://doi.org/10.1145/3458723

Published: 19 November 2021

Abstract

Documentation to facilitate communication between dataset creators and consumers.

Index Terms

Datasheets for datasets
1. Human-centered computing
  1. Human computer interaction (HCI)
    1. HCI design and evaluation methods
      1. User studies
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory

Published In

Communications of the ACM Volume 64, Issue 12

December 2021

101 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/3502158

Editor:
Andrew A. Chien
Association for Computing Machinery, New York, NY

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Review-article
Popular
Refereed

Bibliometrics

Total Citations: 518
Total Downloads: 38,305
Downloads (Last 12 months): 8,645
Downloads (Last 6 weeks): 1,249

Reflects downloads up to 02 Sep 2024

