
Datasheets for Datasets

TIMNIT GEBRU, Black in AI
JAMIE MORGENSTERN, University of Washington
BRIANA VECCHIONE, Cornell University
JENNIFER WORTMAN VAUGHAN, Microsoft Research
HANNA WALLACH, Microsoft Research
HAL DAUMÉ III, Microsoft Research; University of Maryland
KATE CRAWFORD, Microsoft Research

1 Introduction
Data plays a critical role in machine learning. Every machine learning model is
trained and evaluated using data, quite often in the form of static datasets. The
characteristics of these datasets fundamentally influence a model’s behavior: a
model is unlikely to perform well in the wild if its deployment context does not
match its training or evaluation datasets, or if these datasets reflect unwanted
societal biases. Mismatches like these can have especially severe consequences
when machine learning models are used in high-stakes domains, such as
criminal justice [1, 13, 24], hiring [19], critical infrastructure [11, 21], and
finance [18]. Even in other domains, mismatches may lead to loss of revenue or
public relations setbacks. Of particular concern are recent examples showing
that machine learning models can reproduce or amplify unwanted societal
biases reflected in training datasets [4, 5, 12]. For these and other reasons,
the World Economic Forum suggests that all entities should document the
provenance, creation, and use of machine learning datasets in order to avoid
discriminatory outcomes [25].
Although data provenance has been studied extensively in the databases
community [3, 8], it is rarely discussed in the machine learning community.
Documenting the creation and use of datasets has received even less attention.
Despite the importance of data to machine learning, there is currently no
standardized process for documenting machine learning datasets.
To address this gap, we propose datasheets for datasets. In the electronics
industry, every component, no matter how simple or complex, is accompanied
with a datasheet describing its operating characteristics, test results, recom-
mended usage, and other information. By analogy, we propose that every
dataset be accompanied with a datasheet that documents its motivation, com-
position, collection process, recommended uses, and so on. Datasheets for
datasets have the potential to increase transparency and accountability within
the machine learning community, mitigate unwanted societal biases in machine
learning models, facilitate greater reproducibility of machine learning results,
and help researchers and practitioners to select more appropriate datasets for
their chosen tasks.
After outlining our objectives below, we describe the process by which we
developed datasheets for datasets. We then provide a set of questions designed
to elicit the information that a datasheet for a dataset might contain, as well as
a workflow for dataset creators to use when answering these questions. We
conclude with a summary of the impact to date of datasheets for datasets and
a discussion of implementation challenges and avenues for future work.

1.1 Objectives
Datasheets for datasets are intended to address the needs of two key stake-
holder groups: dataset creators and dataset consumers. For dataset creators, the
primary objective is to encourage careful reflection on the process of creating,
distributing, and maintaining a dataset, including any underlying assumptions,
potential risks or harms, and implications of use. For dataset consumers, the
primary objective is to ensure they have the information they need to make
informed decisions about using a dataset. Transparency on the part of dataset
creators is necessary for dataset consumers to be sufficiently well informed
that they can select appropriate datasets for their chosen tasks and avoid
unintentional misuse.1

1 We note that in some cases, the people creating a datasheet for a dataset may not be the
dataset creators, as was the case with the example datasheets that we created as part of our
development process.

Beyond these two key stakeholder groups, datasheets for datasets may be
valuable to policy makers, consumer advocates, investigative journalists, in-
dividuals whose data is included in datasets, and individuals who may be
impacted by models trained or evaluated using datasets. They also serve a
secondary objective of facilitating greater reproducibility of machine learning
results: researchers and practitioners without access to a dataset may be able to
use the information in its datasheet to create alternative datasets with similar
characteristics.
Although we provide a set of questions designed to elicit the information that
a datasheet for a dataset might contain, these questions are not intended to be
prescriptive. Indeed, we expect that datasheets will necessarily vary depending
on factors such as the domain or existing organizational infrastructure and
workflows. For example, some of the questions are appropriate for academic
researchers publicly releasing datasets for the purpose of enabling future
research, but less relevant for product teams creating internal datasets for
training proprietary models. As another example, Bender and Friedman [2]
outline a proposal similar to datasheets for datasets specifically intended for
language-based datasets. Their questions may be naturally integrated into a
datasheet for a language-based dataset as appropriate.
We emphasize that the process of creating a datasheet is not intended to
be automated. Although automated documentation processes are convenient,
they run counter to our objective of encouraging dataset creators to carefully
reflect on the process of creating, distributing, and maintaining a dataset.

2 Development Process
We refined the questions and workflow provided in the next section over a
period of roughly two years, incorporating many rounds of feedback.
First, leveraging our own experiences as researchers with diverse back-
grounds working in different domains and institutions, we drew on our knowl-
edge of dataset characteristics, unintentional misuse, unwanted societal biases,
and other issues to produce an initial set of questions designed to elicit informa-
tion about these topics. We then “tested” these questions by creating example
datasheets for two widely used datasets: Labeled Faces in the Wild [16] and
Pang and Lee’s polarity dataset [22]. We chose these datasets in large part
because their creators provided exemplary documentation, allowing us to eas-
ily find the answers to many of the questions. While creating these example
datasheets, we found gaps in the questions, as well as redundancies and lack
of clarity. We therefore refined the questions and distributed them to product
teams in two major US-based technology companies, in some cases helping
teams to create datasheets for their datasets and observing where the questions
did not achieve their intended objectives. Contemporaneously, we circulated an
initial draft of this paper to colleagues through social media and on arXiv (draft
posted 23 March 2018). Via these channels we received extensive comments
from dozens of researchers, practitioners, and policy makers. We also worked
with a team of lawyers to review the questions from a legal perspective.
We incorporated this feedback to yield the questions and workflow provided
in the next section: We added and removed questions, refined the content
of the questions, and reordered the questions to better match the key stages
of the dataset lifecycle. Based on our experiences with product teams, we
reworded the questions to discourage yes/no answers, added a section on
“Uses,” and deleted a section on “Legal and Ethical Considerations.” We found
that product teams were more likely to answer questions about legal and ethical
considerations if they were integrated into sections about the relevant stages of
the dataset lifecycle rather than grouped together. Finally, following feedback
from the team of lawyers, we removed questions that explicitly asked about
compliance with regulations, and introduced factual questions intended to
elicit relevant information about compliance without requiring dataset creators
to make legal judgments.

3 Questions and Workflow


In this section, we provide a set of questions designed to elicit the information
that a datasheet for a dataset might contain, as well as a workflow for dataset
creators to use when answering these questions. The questions are grouped
into sections that roughly match the key stages of the dataset lifecycle: moti-
vation, composition, collection process, preprocessing/cleaning/labeling, uses,
distribution, and maintenance. This grouping encourages dataset creators to
reflect on the process of creating, distributing, and maintaining a dataset, and
even alter this process in response to their reflection. We note that not all
questions will be applicable to all datasets; those that do not apply should be
skipped.
To illustrate how these questions might be answered in practice, we provide
in the appendix an example datasheet for Pang and Lee’s polarity dataset [22].
We answered some of the questions with “Unknown to the authors of the
datasheet.” This is because we did not create the dataset ourselves and could
not find the answers to these questions in the available documentation. For an
example of a datasheet that was created by the creators of the corresponding
dataset, please see that of Cao and Daumé [6].2 We note that even dataset
creators may be unable to answer all of the questions provided in this section.
We recommend answering as many questions as possible rather than skipping
the datasheet creation process entirely.
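To make this grouping concrete, the sketch below shows one purely illustrative
way in which manually written answers could be recorded in a machine-readable
form that travels with the dataset; the section keys, field names, example
values, and the datasheet.json filename are assumptions made for the sake of
the example, not part of the proposal.

    # Illustrative sketch only: storing hand-written datasheet answers as JSON
    # distributed alongside the dataset. Keys, fields, and values are assumptions.
    import json

    SECTIONS = [
        "motivation", "composition", "collection_process",
        "preprocessing_cleaning_labeling", "uses", "distribution", "maintenance",
    ]
    datasheet = {section: {} for section in SECTIONS}

    # Example answers for Pang and Lee's polarity dataset (illustrative).
    datasheet["motivation"]["purpose"] = (
        "Enable research on predicting the sentiment polarity of movie reviews.")
    datasheet["motivation"]["creators"] = "Bo Pang and Lillian Lee, Cornell University"
    datasheet["composition"]["instances"] = 2000  # v2.0 of the dataset

    with open("datasheet.json", "w") as f:
        json.dump(datasheet, f, indent=2)

A dataset consumer could load the same file to check, for example, the stated
purpose of the dataset before reusing it.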

3.1 Motivation
The questions in this section are primarily intended to encourage dataset
creators to clearly articulate their reasons for creating the dataset and to
promote transparency about funding interests. The latter may be particularly
relevant for datasets created for research purposes.

• For what purpose was the dataset created? Was there a specific task
in mind? Was there a specific gap that needed to be filled? Please provide
a description.
• Who created the dataset (e.g., which team, research group) and on
behalf of which entity (e.g., company, institution, organization)?
• Who funded the creation of the dataset? If there is an associated
grant, please provide the name of the grantor and the grant name and
number.
• Any other comments?
2 See https://github.com/TristaCao/into_inclusivecoref/blob/master/GICoref/datasheet-gicoref.md.
3.2 Composition
Dataset creators should read through these questions prior to any data collec-
tion and then provide answers once data collection is complete. Most of the
questions in this section are intended to provide dataset consumers with the
information they need to make informed decisions about using the dataset for
their chosen tasks. Some of the questions are designed to elicit information
about compliance with the EU’s General Data Protection Regulation (GDPR)
or comparable regulations in other jurisdictions.
Questions that apply only to datasets that relate to people are grouped
together at the end of the section. We recommend taking a broad interpretation
of whether a dataset relates to people. For example, any dataset containing
text that was written by people relates to people.

• What do the instances that comprise the dataset represent (e.g.,
documents, photos, people, countries)? Are there multiple types of
instances (e.g., movies, users, and ratings; people and interactions be-
tween them; nodes and edges)? Please provide a description.
• How many instances are there in total (of each type, if appropri-
ate)?
• Does the dataset contain all possible instances or is it a sample
(not necessarily random) of instances from a larger set? If the
dataset is a sample, then what is the larger set? Is the sample representa-
tive of the larger set (e.g., geographic coverage)? If so, please describe how
this representativeness was validated/verified. If it is not representative
of the larger set, please describe why not (e.g., to cover a more diverse
range of instances, because instances were withheld or unavailable).
• What data does each instance consist of? “Raw” data (e.g., unpro-
cessed text or images) or features? In either case, please provide a de-
scription.
• Is there a label or target associated with each instance? If so, please
provide a description.
• Is any information missing from individual instances? If so, please
provide a description, explaining why this information is missing (e.g.,
because it was unavailable). This does not include intentionally removed
information, but might include, e.g., redacted text.
• Are relationships between individual instances made explicit
(e.g., users’ movie ratings, social network links)? If so, please de-
scribe how these relationships are made explicit.
• Are there recommended data splits (e.g., training, development/validation,
testing)? If so, please provide a description of these splits, explaining
the rationale behind them.
• Are there any errors, sources of noise, or redundancies in the
dataset? If so, please provide a description.
• Is the dataset self-contained, or does it link to or otherwise rely on
external resources (e.g., websites, tweets, other datasets)? If it links
to or relies on external resources, a) are there guarantees that they will
exist, and remain constant, over time; b) are there official archival versions
of the complete dataset (i.e., including the external resources as they
existed at the time the dataset was created); c) are there any restrictions
(e.g., licenses, fees) associated with any of the external resources that
might apply to a dataset consumer? Please provide descriptions of all
external resources and any restrictions associated with them, as well as
links or other access points, as appropriate.
• Does the dataset contain data that might be considered confiden-
tial (e.g., data that is protected by legal privilege or by doctor–
patient confidentiality, data that includes the content of individ-
uals’ non-public communications)? If so, please provide a description.
• Does the dataset contain data that, if viewed directly, might be of-
fensive, insulting, threatening, or might otherwise cause anxiety?
If so, please describe why.
If the dataset does not relate to people, you may skip the remaining questions
in this section.
• Does the dataset identify any subpopulations (e.g., by age, gen-
der)? If so, please describe how these subpopulations are identified and
provide a description of their respective distributions within the dataset.
• Is it possible to identify individuals (i.e., one or more natural per-
sons), either directly or indirectly (i.e., in combination with other
data) from the dataset? If so, please describe how.
• Does the dataset contain data that might be considered sensitive
in any way (e.g., data that reveals race or ethnic origins, sexual ori-
entations, religious beliefs, political opinions or union member-
ships, or locations; financial or health data; biometric or genetic
data; forms of government identification, such as social security
numbers; criminal history)? If so, please provide a description.
• Any other comments?
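Some of the answers above can be checked directly against the data itself. The
following is a minimal sketch, assuming a hypothetical tabular dataset stored
in instances.csv with a label column, of how instance counts, missing values,
duplicates, and label balance might be summarized:

    # Hedged sketch: summarizing a hypothetical tabular dataset to help answer
    # the composition questions; the file and column names are assumptions.
    import pandas as pd

    df = pd.read_csv("instances.csv")              # one row per instance
    print(f"Total instances: {len(df)}")           # "How many instances are there?"
    print(f"Exact duplicate rows: {df.duplicated().sum()}")  # redundancies
    print("Missing values per column:")            # "Is any information missing?"
    print(df.isna().sum())
    print("Label distribution:")                   # label/target balance
    print(df["label"].value_counts(normalize=True))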

3.3 Collection Process


As with the questions in the previous section, dataset creators should read
through these questions prior to any data collection to flag potential issues
and then provide answers once collection is complete. In addition to the goals
outlined in the previous section, the questions in this section are designed
to elicit information that may help researchers and practitioners to create
alternative datasets with similar characteristics. Again, questions that apply
only to datasets that relate to people are grouped together at the end of the
section.

• How was the data associated with each instance acquired? Was
the data directly observable (e.g., raw text, movie ratings), reported by
subjects (e.g., survey responses), or indirectly inferred/derived from other
data (e.g., part-of-speech tags, model-based guesses for age or language)?
If the data was reported by subjects or indirectly inferred/derived from
other data, was the data validated/verified? If so, please describe how.
• What mechanisms or procedures were used to collect the data
(e.g., hardware apparatuses or sensors, manual human curation,
software programs, software APIs)? How were these mechanisms or
procedures validated?
• If the dataset is a sample from a larger set, what was the sampling
strategy (e.g., deterministic, probabilistic with specific sampling
probabilities)?
• Who was involved in the data collection process (e.g., students,
crowdworkers, contractors) and how were they compensated (e.g.,
how much were crowdworkers paid)?
• Over what timeframe was the data collected? Does this timeframe
match the creation timeframe of the data associated with the instances
(e.g., recent crawl of old news articles)? If not, please describe the time-
frame in which the data associated with the instances was created.
• Were any ethical review processes conducted (e.g., by an institu-
tional review board)? If so, please provide a description of these review
processes, including the outcomes, as well as a link or other access point
to any supporting documentation.
If the dataset does not relate to people, you may skip the remaining questions
in this section.
• Did you collect the data from the individuals in question directly,
or obtain it via third parties or other sources (e.g., websites)?
• Were the individuals in question notified about the data collec-
tion? If so, please describe (or show with screenshots or other informa-
tion) how notice was provided, and provide a link or other access point
to, or otherwise reproduce, the exact language of the notification itself.
• Did the individuals in question consent to the collection and use
of their data? If so, please describe (or show with screenshots or other
information) how consent was requested and provided, and provide a
link or other access point to, or otherwise reproduce, the exact language
to which the individuals consented.
• If consent was obtained, were the consenting individuals pro-
vided with a mechanism to revoke their consent in the future or
for certain uses? If so, please provide a description, as well as a link or
other access point to the mechanism (if appropriate).
• Has an analysis of the potential impact of the dataset and its use
on data subjects (e.g., a data protection impact analysis) been con-
ducted? If so, please provide a description of this analysis, including
the outcomes, as well as a link or other access point to any supporting
documentation.
• Any other comments?
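These questions are easiest to answer when collection metadata is recorded at
collection time. The sketch below is a minimal, hypothetical example (the file
name, field names, and values are assumptions) of logging when, from where,
and by what mechanism each instance was acquired:

    # Hedged sketch: logging per-instance collection provenance so that questions
    # about mechanisms and timeframes can be answered later. All names are
    # illustrative assumptions.
    import csv
    from datetime import datetime, timezone

    FIELDS = ["instance_id", "source", "mechanism", "collected_at"]

    with open("collection_log.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerow({
            "instance_id": "instance-0001",
            "source": "https://example.org/reviews/123",  # hypothetical source
            "mechanism": "web crawl",                      # e.g., crawl, survey, sensor
            "collected_at": datetime.now(timezone.utc).isoformat(),
        })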

3.4 Preprocessing/cleaning/labeling
Dataset creators should read through these questions prior to any prepro-
cessing, cleaning, or labeling and then provide answers once these tasks
are complete. The questions in this section are intended to provide dataset
consumers with the information they need to determine whether the “raw”
data has been processed in ways that are compatible with their chosen tasks.
For example, text that has been converted into a “bag-of-words” is not suitable
for tasks involving word order.

• Was any preprocessing/cleaning/labeling of the data done (e.g.,
discretization or bucketing, tokenization, part-of-speech tagging,
SIFT feature extraction, removal of instances, processing of miss-
ing values)? If so, please provide a description. If not, you may skip the
remaining questions in this section.
• Was the “raw” data saved in addition to the preprocessed/cleaned/labeled
data (e.g., to support unanticipated future uses)? If so, please pro-
vide a link or other access point to the “raw” data.
• Is the software that was used to preprocess/clean/label the data
available? If so, please provide a link or other access point.
• Any other comments?
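As a small illustration of the questions above, the sketch below applies two
hypothetical cleaning steps, saves the raw text alongside the cleaned text,
and records the steps that were applied; the specific steps and file names are
assumptions, not recommendations.

    # Hedged sketch: keep the raw data and an explicit record of preprocessing
    # steps. The steps shown (HTML stripping, lowercasing) are assumptions.
    import json
    import re

    STEPS = ["strip_html_tags", "lowercase"]

    def preprocess(raw_text):
        text = re.sub(r"<[^>]+>", " ", raw_text)   # strip_html_tags
        return text.lower()                        # lowercase

    raw = "An <b>example</b> movie review."
    with open("instance_raw.txt", "w") as f:       # raw data kept for future uses
        f.write(raw)
    with open("instance_clean.txt", "w") as f:
        f.write(preprocess(raw))
    with open("preprocessing_record.json", "w") as f:
        json.dump({"steps": STEPS}, f, indent=2)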

3.5 Uses
The questions in this section are intended to encourage dataset creators to
reflect on the tasks for which the dataset should and should not be used. By
explicitly highlighting these tasks, dataset creators can help dataset consumers
to make informed decisions, thereby avoiding potential risks or harms.

• Has the dataset been used for any tasks already? If so, please provide
a description.
• Is there a repository that links to any or all papers or systems that
use the dataset? If so, please provide a link or other access point.
• What (other) tasks could the dataset be used for?
• Is there anything about the composition of the dataset or the way
it was collected and preprocessed/cleaned/labeled that might im-
pact future uses? For example, is there anything that a dataset consumer
might need to know to avoid uses that could result in unfair treatment of
individuals or groups (e.g., stereotyping, quality of service issues) or other
risks or harms (e.g., legal risks, financial harms)? If so, please provide a
description. Is there anything a dataset consumer could do to mitigate
these risks or harms?
• Are there tasks for which the dataset should not be used? If so,
please provide a description.
• Any other comments?

3.6 Distribution
Dataset creators should provide answers to these questions prior to distributing
the dataset either internally within the entity on behalf of which the dataset
was created or externally to third parties.

• Will the dataset be distributed to third parties outside of the en-
tity (e.g., company, institution, organization) on behalf of which
the dataset was created? If so, please provide a description.
• How will the dataset be distributed (e.g., tarball on website,
API, GitHub)? Does the dataset have a digital object identifier (DOI)?
• When will the dataset be distributed?
• Will the dataset be distributed under a copyright or other intel-
lectual property (IP) license, and/or under applicable terms of use
(ToU)? If so, please describe this license and/or ToU, and provide a link
or other access point to, or otherwise reproduce, any relevant licensing
terms or ToU, as well as any fees associated with these restrictions.
• Have any third parties imposed IP-based or other restrictions on
the data associated with the instances? If so, please describe these
restrictions, and provide a link or other access point to, or otherwise
reproduce, any relevant licensing terms, as well as any fees associated
with these restrictions.
• Do any export controls or other regulatory restrictions apply to
the dataset or to individual instances? If so, please describe these
restrictions, and provide a link or other access point to, or otherwise
reproduce, any supporting documentation.
• Any other comments?

3.7 Maintenance
As with the questions in the previous section, dataset creators should provide
answers to these questions prior to distributing the dataset. The questions
in this section are intended to encourage dataset creators to plan for dataset
maintenance and communicate this plan to dataset consumers.

• Who will be supporting/hosting/maintaining the dataset?
• How can the owner/curator/manager of the dataset be contacted
(e.g., email address)?
• Is there an erratum? If so, please provide a link or other access point.
• Will the dataset be updated (e.g., to correct labeling errors, add
new instances, delete instances)? If so, please describe how often, by
whom, and how updates will be communicated to dataset consumers
(e.g., mailing list, GitHub)?
• If the dataset relates to people, are there applicable limits on the
retention of the data associated with the instances (e.g., were the
individuals in question told that their data would be retained for
a fixed period of time and then deleted)? If so, please describe these
limits and explain how they will be enforced.
• Will older versions of the dataset continue to be supported/hosted/maintained?
If so, please describe how. If not, please describe how its obsolescence
will be communicated to dataset consumers.
• If others want to extend/augment/build on/contribute to the
dataset, is there a mechanism for them to do so? If so, please
provide a description. Will these contributions be validated/verified? If
so, please describe how. If not, why not? Is there a process for commu-
nicating/distributing these contributions to dataset consumers? If so,
please provide a description.
• Any other comments?
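As a hypothetical illustration of how update information might be communicated,
the sketch below appends a versioned erratum entry to the datasheet.json file
from the earlier sketch; the file and field names remain assumptions, not part
of the proposal.

    # Hedged sketch: recording versions and errata in a machine-readable datasheet
    # so that updates reach dataset consumers. Names are illustrative assumptions.
    import json
    from datetime import date

    with open("datasheet.json") as f:
        datasheet = json.load(f)

    maintenance = datasheet.setdefault("maintenance", {})
    maintenance.setdefault("errata", []).append({
        "version": "v1.1",
        "date": date.today().isoformat(),
        "description": "Corrected labeling errors in a small number of instances.",
    })

    with open("datasheet.json", "w") as f:
        json.dump(datasheet, f, indent=2)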

4 Impact and Challenges


Since circulating an initial draft of this paper in March 2018, datasheets for
datasets have already gained traction in a number of settings. Academic re-
searchers have adopted our proposal and released datasets with accompanying
datasheets [e.g., 7, 10, 23, 26]. Microsoft, Google, and IBM have begun to pilot
datasheets for datasets internally within product teams. Researchers at Google
published follow-up work on model cards that document machine learning
models [20] and released a data card (a lightweight version of a datasheet) along
with the Open Images dataset [17]. Researchers at IBM proposed factsheets [14]
that document various characteristics of AI services, including whether the
datasets used to develop the services are accompanied with datasheets. The
Data Nutrition Project incorporated some of the questions provided in the
previous section into the latest release of their Dataset Nutrition Label [9].
Finally, the Partnership on AI, a multi-stakeholder organization focused on
studying and formulating best practices for developing and deploying AI tech-
nologies, is working on industry-wide documentation guidance that builds on
datasheets for datasets, model cards, and factsheets.3

3 https://www.partnershiponai.org/about-ml/

These initial successes have also revealed implementation challenges that
may need to be addressed to support wider adoption. Chief among them is the
need for dataset creators to modify the questions and workflow provided in
the previous section based on their existing organizational infrastructure and
workflows. We also note that the questions and workflow may pose problems
for dynamic datasets. If a dataset changes only infrequently, we recommend
accompanying updated versions with updated datasheets.
Datasheets for datasets do not provide a complete solution to mitigating
unwanted societal biases or potential risks or harms. Dataset creators cannot
anticipate every possible use of a dataset, and identifying unwanted societal
biases often requires additional labels indicating demographic information
about individuals, which may not be available to dataset creators for reasons
including those individuals’ data protection and privacy [15].
When creating datasets that relate to people, and hence their accompanying
datasheets, it may be necessary for dataset creators to work with experts in
other domains such as anthropology, sociology, and science and technology
studies. There are complex and contextual social, historical, and geographical
factors that influence how best to collect data from individuals in a manner
that is respectful.
Finally, creating datasheets for datasets will necessarily impose overhead
on dataset creators. Although datasheets may reduce the amount of time that
dataset creators spend answering one-off questions about datasets, the process
of creating a datasheet will always take time, and organizational infrastruc-
ture and workflows—not to mention incentives—will need to be modified to
accommodate this investment.
Despite these implementation challenges, there are many benefits to creat-
ing datasheets for datasets. In addition to facilitating better communication
between dataset creators and dataset consumers, datasheets provide an opportu-
nity for dataset creators to distinguish themselves as prioritizing transparency
and accountability. Ultimately, we believe that the benefits to the machine
learning community outweigh the costs.

Acknowledgments
We thank Peter Bailey, Emily Bender, Yoshua Bengio, Sarah Bird, Sarah Brown,
Steven Bowles, Joy Buolamwini, Amanda Casari, Eric Charran, Alain Couillault,
Lukas Dauterman, Leigh Dodds, Miroslav Dudík, Michael Ekstrand, Noémie
Elhadad, Michael Golebiewski, Nick Gonsalves, Martin Hansen, Andy Hickl,
Michael Hoffman, Scott Hoogerwerf, Eric Horvitz, Mingjing Huang, Surya
Kallumadi, Ece Kamar, Krishnaram Kenthapadi, Emre Kiciman, Jacquelyn Kro-
nes, Erik Learned-Miller, Lillian Lee, Jochen Leidner, Rob Mauceri, Brian Mcfee,
Emily McReynolds, Bogdan Micu, Margaret Mitchell, Sangeeta Mudnal, Bren-
dan O’Connor, Thomas Padilla, Bo Pang, Anjali Parikh, Lisa Peets, Alessandro
Perina, Michael Philips, Barton Place, Sudha Rao, Jen Ren, David Van Riper,
Anna Roth, Cynthia Rudin, Ben Shneiderman, Biplav Srivastava, Ankur Terede-
sai, Rachel Thomas, Martin Tomko, Panagiotis Tziachris, Meredith Whittaker,
Hans Wolters, Ashly Yeo, Lu Zhang, and the attendees of the Partnership on
AI’s April 2019 ABOUT ML workshop for valuable feedback.

References
[1] Don A Andrews, James Bonta, and J Stephen Wormith. 2006. The recent past and near
future of risk and/or need assessment. Crime & Delinquency 52, 1 (2006), 7–27.
[2] Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Language
Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of
the Association for Computational Linguistics 6 (2018), 587–604.
[3] Anant P. Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J.
Elmore, Samuel Madden, and Aditya G. Parameswaran. 2014. DataHub: Collaborative
Data Science & Dataset Version Management at Scale. CoRR abs/1409.0798 (2014).
[4] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai.
2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word
Embeddings. In Advances in Neural Information Processing Systems (NeurIPS).
[5] Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy
Disparities in Commercial Gender Classification. In Proceedings of the Conference on
Fairness, Accountability, and Transparency (FAT*). 77–91.
[6] Yang Trista Cao and Hal Daumé. 2020. Toward Gender-Inclusive Coreference Resolution.
In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
abs/1910.13913.
[7] Yang Trista Cao and Hal Daumé, III. 2020. Toward Gender-Inclusive Coreference
Resolution. In Proceedings of the Conference of the Association for Computational
Linguistics (ACL).
[8] James Cheney, Laura Chiticariu, and Wang-Chiew Tan. 2009. Provenance in databases:
Why, how, and where. Foundations and Trends in Databases 1, 4 (2009), 379–474.
[9] Kasia Chmielinski, Sarah Newman, Matt Taylor, Josh Joseph, Kemi Thomas, Jessica
Yurkofsky, and Yue Chelsea Qiu. 2020. The Dataset Nutrition Label (2nd Gen): Leveraging
Context to Mitigate Harms in Artificial Intelligence. In NeurIPS Workshop on Dataset
Curation and Security.
[10] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and
Luke Zettlemoyer. 2018. QuAC : Question Answering in Context. In Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing.
[11] Glennda Chui. 2017. Project will use AI to prevent or minimize electric grid failures.
[Online; accessed 14-March-2018].
[12] Jeffrey Dastin. 2018. Amazon scraps secret AI recruiting tool that showed bias against
women.
https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G
[13] Clare Garvie, Alvaro Bedoya, and Jonathan Frankle. 2016. The Perpetual Line-Up:
Unregulated Police Face Recognition in America. Georgetown Law, Center on Privacy &
Technology, New Jersey Ave NW, Washington, DC.
[14] Michael Hind, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, Karthikeyan Natesan
Ramamurthy, Alexandra Olteanu, and Kush R. Varshney. 2018. Increasing Trust in AI
Services through Supplier’s Declarations of Conformity. CoRR abs/1808.07261 (2018).
[15] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miroslav Dudík, and
Hanna M. Wallach. 2019. Improving Fairness in Machine Learning Systems: What Do
Industry Practitioners Need?. In 2019 ACM CHI Conference on Human Factors in
Computing Systems.
[16] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. 2007. Labeled faces
in the wild: A database for studying face recognition in unconstrained environments.
Technical Report 07-49. University of Massachusetts Amherst.
[17] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina
Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci,
Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun,
Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. 2017.
OpenImages: A public dataset for large-scale multi-label and multi-class image
classification.
[18] Tom CW Lin. 2012. The new investor. UCLA Law Review 60 (2012), 678.
[19] G Mann and C O’Neil. 2016. Hiring Algorithms Are Not Neutral.
https://hbr.org/2016/12/hiring-algorithms-are-not-neutral.
[20] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben
Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards
for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and
Transparency (FAT*). 220–229.
[21] Mary Catherine O’Connor. 2017. How AI Could Smarten Up Our Water System. [Online;
accessed 14-March-2018].
[22] Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual
Meeting of the Association for Computational Linguistics. 271.
[23] Ismaïla Seck, Khouloud Dahmane, Pierre Duthon, and Gaëlle Loosli. 2018. Baselines and a
datasheet for the Cerema AWP dataset. CoRR abs/1806.04016 (2018).
http://arxiv.org/abs/1806.04016
[24] Doha Supply Systems. 2017. Facial Recognition. [Online; accessed 14-March-2018].
[25] World Economic Forum Global Future Council on Human Rights 2016–2018. 2018. How
to Prevent Discriminatory Outcomes in Machine Learning.
https://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning.
[26] Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. 2018. RecipeQA: A
Challenge Dataset for Multimodal Comprehension of Cooking Recipes. In Proceedings of
the 2018 Conference on Empirical Methods in Natural Language Processing.

A Appendix
In this appendix, we provide an example datasheet for Pang and Lee’s polarity
dataset [22] (figure 1 to figure 4).

Fig. 1. Example datasheet for Pang and Lee’s polarity dataset [22], page 1.
Fig. 2. Example datasheet for Pang and Lee’s polarity dataset [22], page 2.
Fig. 3. Example datasheet for Pang and Lee’s polarity dataset [22], page 3.
Fig. 4. Example datasheet for Pang and Lee’s polarity dataset [22], page 4.
