DOI: 10.1145/3531146.3533231
Research article · Open access

Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

Published: 20 June 2022
Abstract

    As research and industry move toward large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset's origins, development, intent, ethical considerations, and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation. It requires consistency and comparability across the documentation of all datasets involved, and as such, documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful, and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models, such as upstream sources; data collection and annotation methods; training and evaluation methods; intended use; and decisions affecting model performance. We also present frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over 20 Data Cards.
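
    To make the idea of a structured summary concrete, the sketch below shows one hypothetical way the kinds of facts named in the abstract (upstream sources, collection and annotation methods, intended use, and so on) could be captured programmatically. This is an illustrative Python sketch only; the field names, the DataCard class, and the rendering helper are assumptions made for illustration, not the template or schema proposed in the paper.

        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class DataCard:
            # Illustrative container for the kinds of facts the abstract says a
            # Data Card summarizes; all field names here are hypothetical.
            dataset_name: str
            upstream_sources: List[str]        # where the raw data originated
            collection_methods: str            # how the data was gathered
            annotation_methods: str            # how labels were produced
            intended_use: str                  # purposes the dataset was designed for
            ethical_considerations: List[str] = field(default_factory=list)
            known_limitations: List[str] = field(default_factory=list)

            def to_summary(self) -> str:
                # Render a short plain-text summary that a stakeholder could read.
                lines = [
                    f"Data Card: {self.dataset_name}",
                    f"Upstream sources: {', '.join(self.upstream_sources)}",
                    f"Collection: {self.collection_methods}",
                    f"Annotation: {self.annotation_methods}",
                    f"Intended use: {self.intended_use}",
                ]
                if self.ethical_considerations:
                    lines.append("Ethical considerations: " + "; ".join(self.ethical_considerations))
                if self.known_limitations:
                    lines.append("Known limitations: " + "; ".join(self.known_limitations))
                return "\n".join(lines)

        # Example usage with made-up values:
        card = DataCard(
            dataset_name="ExampleSpeechCorpus",
            upstream_sources=["crowdsourced recordings"],
            collection_methods="opt-in mobile app submissions",
            annotation_methods="manual transcription by paid raters",
            intended_use="evaluating speech recognition models",
            ethical_considerations=["speaker consent obtained"],
        )
        print(card.to_summary())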


          Published In

          FAccT '22: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency
          June 2022, 2351 pages
          ISBN: 9781450393522
          DOI: 10.1145/3531146
          This work is licensed under a Creative Commons Attribution 4.0 International License.

          Publisher

          Association for Computing Machinery, New York, NY, United States

          Author Tags

          1. data cards
          2. dataset documentation
          3. datasheets
          4. model cards
          5. responsible AI
          6. transparency

          Qualifiers

          • Research-article
          • Research
          • Refereed limited
