1 Introduction
Today’s societies produce immense volumes of data that are used by
Artificial Intelligence (AI) systems. At the core of AI applications is the field of
Machine Learning (ML), which relies on data to classify or detect patterns in existing information (unsupervised ML), as well as on past data to “train” algorithms to solve new tasks (supervised ML) [
37]. Our article focuses on the latter subset of ML, which is growing in popularity in systems created to predict probable outcomes based on certain inputs, or to make recommendations about which decisions would be optimal in a given scenario.
These systems work with structured as well as unstructured data (e.g., text, images, audio) to address practical use cases in fields such as clinical diagnosis, criminal justice, financial lending, manufacture and autonomous vehicles, among others [
52]. In the remainder of this article, we will refer to ML systems as software systems in which ML models or algorithms are deployed, typically for the purposes of solving a problem in the real world.
Poor-quality datasets and data science pipelines can compromise ML systems in several ways. This includes historical signals and inappropriately proxied measures that make ML systems vulnerable to reproducing past discrimination against under-represented groups (e.g., in contexts like job hiring and criminal justice) or propagating abusive content [
46,
67]. Messy or inaccurate data can also disturb the operational efficiency of businesses, with estimates of 10% to 30% of revenue being spent on resolving data quality issues [
30]. The importance of data quality is therefore increasingly being recognised by private and public stakeholders who want to mitigate social risks, reduce costs and support the effective assimilation of ML technologies in society.
The growing use of ML across industries, and the high-stakes nature of some of the preceding uses [
17,
44], is being accompanied by greater scrutiny of the processes that determine the output of ML-based decision-support systems [
51]. Routines for ensuring transparency in ML datasets and ML development pipelines are being encouraged by national and international organisations such as the OECD
1 and the Open Government Partnership.
2 The UK government has recently published an Algorithmic Transparency Standard
3 alongside templates designed to help public sector organisations to document the datasets that underlie their ML tools. Similar trends are happening in industry, where new standards are currently being developed to guide businesses on how to define, implement and measure data quality throughout the ML development lifecycle [
18]. Standards of this kind crystallise an ever-growing corpus of academic literature that has explored ML data quality challenges and ways to mitigate them [
25,
57,
64,
70].
The growing range of ML data management guidelines, frameworks and standards presents practitioners with a vast range of possible criteria to aspire to, on top of the traditional data management practices that were established in previous decades. This raises a twofold challenge: (1) how to navigate the ML literature and select only those data quality requirements that are meaningful to the practitioner’s use case, and (2) how to address the new requirements using frameworks and practices that are already familiar to the data management community.
Our article aims to help data practitioners to navigate these challenges by distilling some of the key concepts from recent literature in the fields of ML, data management and
Human-Computer Interaction (HCI). Our contributions include the following:
•
An overview of some of the key data quality requirements that matter in ML systems.
•
An illustration of how these requirements map onto traditional data quality criteria.
•
A structure for identifying the most salient data quality requirements depending on the stage of the ML lifecycle where the data use case occurs.
The rest of the article is structured as follows. In Section
2, we present the background literature that motivates our work. We then present our methodology for conducting a literature review in Section
3, followed by a summary of results in Section
4 and discussion of the findings in Section
5.
2 Background
Training data for ML algorithms can be collected in a variety of ways. In their comprehensive survey of data collection methods for ML, Roh et al. [
61] group these into three categories: (1) data acquisition (including discovery, augmentation and generation), (2) data labelling (using manual or semi-supervised approaches) and (3) improvement (cleaning the data itself or improving the model built upon it). The extent to which these data collection methods are used varies depending on the use case and the type of data upon which an ML system relies.
In larger organisations and complex innovation ecosystems, the data may pass through multiple stakeholders and be transformed in various ways before they reach an ML practitioner or their resultant product. Because of this, the topic of data quality is beginning to extend beyond the field of data management and into the realm of HCI, which accommodates holistic considerations such as how people search for relevant datasets [
42], how developers perceive data work [
64] and the best ways of using crowdsourcing to generate, evaluate or label data [
71]. Although the role of these dynamic processes and multi-stakeholder configurations is increasingly being recognised by data practitioners, it is less clear how traditional data quality frameworks and notions of data accountability are adapting to ML development pipelines [
35].
2.1 Data Management Practices Differ Between Academia and Industry
Long-standing definitions of data quality have viewed good-quality data as “data that are fit for use by data consumers” [
74]. This has been accompanied by granular specifications of what makes a good-quality dataset, with essential dimensions such as accuracy, completeness, consistency and validity being just some of the 60 dimensions identified in the wider data management literature [
13]. Practical applications of these dimensions typically focus on smaller subsets of the most relevant qualities, and can be found in the UK government’s data strategy,
4 professional associations for information management such as AHIMA,
5 and the requirements of open data and open science initiatives, where datasets should ideally be linked to the peer-reviewed code or publication that uses them.
It is worth noting that ML data tend to be managed differently depending on whether the system is within an academic or industry setting [
52]. In academia, data management is typically contained within the projects of individuals or small teams, who are able to design and amend the data collection, storage and sharing systems at their discretion. Industry researchers, however, often rely on separate data collection, processing and storage systems that sit across multiple company functions, requiring formal data management guidelines to ensure consistency and coordination across teams.
Formal practices of this kind are sometimes established with the help of industry standards. For example, the International Organization for Standardization (ISO) standard ISO/IEC 25012 provides guidance on how to define the data quality characteristics that matter to an organisation. Defining these characteristics is a prerequisite to deciding how data quality can be evaluated in a practical sense. This latter task is addressed by the standard ISO/IEC 25024, which guides organisations in defining the data quality assurance criteria and ways of measuring them quantitatively. These data quality models are complemented by standards which recognise that organisations differ in their preparedness to define and execute data quality assurance. The ISO 8000-61 standard specifies the processes required for managing and improving data quality, whereas ISO 8000-62 defines ways to assess the maturity, or readiness, of organisations to implement these data quality tasks.
More recently, ISO has begun to develop the ISO/IEC 5259 standard which focuses on data quality for the fields of analytics and ML, as well as ISO/IEC DIS 8183 that addresses the AI data lifecycle framework. These newer standards address processes that can be employed by various stakeholders at different stages of the AI lifecycle, which differs from earlier data quality guidelines that tended to view quality as a uniform outcome that fulfils a pre-defined list of desired criteria. These standards are still under development, so there is value in publications that inform practitioners of how data quality applies to ML tasks.
2.2 Data Quality Means Meeting the Needs of Different Users
Traditionally, data quality compliance has meant meeting the needs of the immediate data users (e.g., analysts or engineers who value clean machine-readable data). However, this singular focus can flatten the variety of uses and data quality requirements that are encountered at various stages of the ML development pipeline over the longer term [
57,
70]. For instance, data quality aspects that are important to ML developers are likely to be different from what was important to upstream data subjects, who may have valued mechanisms for expressing consent and data usage preferences. Similarly, downstream users of trained ML algorithms, such as software developers and organisations, may have their own preferences for specific data qualities when procuring the system, including aspects such as security, provenance, legal compliance and the capacity to meet business goals in real-world contexts. It is therefore useful to consider data quality processes in ML as being less about obtaining a finished outcome and more about creating a dynamic artefact that is imbued with the potential to be improved and shaped by different stakeholders to meet their own requirements.
Many of the data quality issues that could reasonably concern the stakeholders mentioned previously can already be accommodated by the granular data quality specifications produced in the field of data management. For example, the list of 60 dimensions created by Black and van Nederpelt [
13] includes qualities related to data accuracy, lineage, currency, coverage, legal compliance and usability. These dimensions are subsumed by higher-order characterisations that capture the intrinsic, contextual, accessibility and representational aspects of datasets [
74].
Although the advent of data-centric technologies has been accompanied by a proliferation of updated data quality definitions and metrics tailored to fields such as big data [
68] and linked data [
77], contemporary authors continue to find value in existing data quality characterisations and conceptual structures. For example, in their “Data Quality in Use” model for big data, Merino et al. [
47] draw on the canonical distinction of Wang and Strong [
74] between the intrinsic, contextual, accessibility and representational aspects of datasets when using the preceding industry standards ISO/IEC 25012 and ISO/IEC 25024. Other efforts have been made to adapt traditional data quality management practices to specific fields. This includes the work of Kim et al. [
41], who developed new frameworks for assessing and improving the maturity of IoT data quality processes based on the standards ISO 8000-61 and ISO 8000-62.
2.3 The Four Dimensions of Data Quality
In the following, we will draw on the categorisation by Wang and Strong [
74] of the intrinsic, contextual, accessibility and representational aspects of datasets to illustrate some of the ways in which previously established data quality categories already apply to ML problems.
Intrinsic data quality has traditionally been understood to reflect the extent to which data values conform to the actual or true values [
74]; this includes specific requirements such as accuracy, provenance and cleanliness, the latter of which covers practices such as addressing missing values and redundant cases. Besides the usual data qualities needed for statistical analysis (e.g., addressing missing data, anomalies), an intrinsic quality that is increasingly valued by ML practitioners and regulators relates to data lineage and traceability. For data that require multiple pre-processing steps or transactions between organisations, the origins of their features become important. Traceability makes it possible to interpret and audit the history that precedes the output of ML algorithms [
33], but despite recent regulations on
Explainable AI (XAI),
6 traceability is not yet shortlisted in the data quality framework used by the UK government,
7 suggesting that this data quality characteristic may need to be promoted in the context of ML.
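As a minimal illustration of what such intrinsic checks can look like in practice, the following sketch (using pandas, with hypothetical column names) computes missing-value and duplicate counts and appends a simple lineage record each time the data are transformed; it is one possible approach rather than a prescribed method.

```python
import json
from datetime import datetime, timezone

import pandas as pd

# Hypothetical raw records; in practice these would be loaded from a source system.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "age": [34, None, 29, 51],
    "diagnosis": ["A", "B", "B", None],
})

# Intrinsic checks: completeness (missing values) and redundancy (duplicates).
report = {
    "n_rows": len(df),
    "missing_per_column": {c: int(n) for c, n in df.isna().sum().items()},
    "duplicate_rows": int(df.duplicated().sum()),
}

# A minimal lineage record, appended for every transformation applied to the data.
lineage = [{
    "step": "deduplicate",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "rows_before": len(df),
}]
df = df.drop_duplicates()
lineage[-1]["rows_after"] = len(df)

print(json.dumps({"quality_report": report, "lineage": lineage}, indent=2))
```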
Contextual data quality relates to the extent to which data are pertinent to the task of the data user [
74]; this includes dimensions such as relevance, timeliness, completeness and appropriateness. An essential question that is considered here is the extent to which the sample of cases contained in the dataset diverges from the true distribution of cases that are likely to be encountered when the ML model is deployed. Possible sources of divergence may include historical time or geographic representation. For example, temporality has been flagged as a potential source of difficulty in textual data, where models trained on historical text corpora, such as Google News articles, have been found to reproduce past social stereotypes (e.g., the word “man” being associated with “computer programmer” and “woman” with “homemaker”) [
14]. If left untreated, the use of such data in downstream applications (e.g., web search rankings, question retrieval) can perpetuate or amplify the biases that were and continue to be present in broader society. Other contextual biases have been detected in image data, with publicly available image corpora such as ImageNet and Open Images coming predominantly from amerocentric and eurocentric contexts [
66]. Insufficient representation of some geographic regions, such as Asia or Africa, has meant that ML algorithms have less information to learn about these contexts. This results in solutions that perform poorly for under-represented groups (e.g., passport photo software that does not recognise the facial expressions of ethnic minorities, or electronic soap dispensers that do not respond to darker skin tones). These cases urge ML data practitioners to think critically about the context captured by their dataset and the degree to which it reflects the use case and lived experience of the end users.
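To make the notion of contextual coverage more concrete, the sketch below (with invented region labels and an arbitrary 10% threshold) tallies the share of records per region and flags those that fall below the threshold; the appropriate reference distribution and threshold would depend on the use case.

```python
from collections import Counter

# Hypothetical metadata: the region of origin recorded for each image in a corpus.
regions = ["North America"] * 60 + ["Europe"] * 30 + ["Asia"] * 8 + ["Africa"] * 2

counts = Counter(regions)
total = sum(counts.values())
min_share = 0.10  # illustrative threshold for "adequate" representation

for region, count in sorted(counts.items(), key=lambda kv: kv[1]):
    share = count / total
    status = "UNDER-REPRESENTED" if share < min_share else "ok"
    print(f"{region:15s} {share:6.1%}  {status}")
```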
Representational data quality refers to the extent to which data are presented in an intelligible and clear manner, including requirements such as being interpretable, easy to understand, and represented concisely and consistently [
74]. In practical terms, these qualities can be implemented through practices such as standardisation and documentation. Standardisation refers to conventions for capturing information in a consistent manner, including machine-readable data structures and formats for capturing specific attributes (e.g., date, location, measurement error). This helps engineers to ingest datasets from multiple sources and build interoperable solutions. Documentation about the dataset provides an additional layer of descriptive information to support the creation of ML applications. For example, it can help engineers to understand where the dataset sits in relation to the physical world (e.g., the calibration of equipment, seasonality of data collection, contextual limitations) [
64] so that the training data or model output can be transformed accordingly. It is worth highlighting that when the limitations of a dataset are made explicit in the documentation, this helps subsequent users to take the steps needed to improve the quality of the dataset for their specific use case. Some solutions even allow for the dataset to remain unchanged while the ML algorithm is tuned to produce more robust or socially equitable outcomes [
14,
29].
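As a simple illustration of representational standardisation, the sketch below (with hypothetical sources and field names) converts records that use different date conventions and temperature units into one agreed schema before they are merged.

```python
from datetime import datetime

import pandas as pd

# Hypothetical records from two sources that use different conventions
# for dates (ISO vs. day/month/year) and temperature units (Celsius vs. Fahrenheit).
source_a = {"reading_date": "2023-01-05", "temperature": "21.5 C"}
source_b = {"reading_date": "05/01/2023", "temperature": "70.7 F"}

def standardise(record: dict, date_format: str) -> dict:
    """Convert a raw record to the agreed schema: ISO dates and degrees Celsius."""
    number, unit = record["temperature"].split()
    celsius = (float(number) - 32) * 5 / 9 if unit.upper() == "F" else float(number)
    return {
        "reading_date": datetime.strptime(record["reading_date"], date_format).date().isoformat(),
        "temperature_c": round(celsius, 1),
    }

clean = pd.DataFrame([
    standardise(source_a, "%Y-%m-%d"),
    standardise(source_b, "%d/%m/%Y"),
])
print(clean)  # both rows now share the same machine-readable representation
```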
The
accessibility category refers to the extent to which data are available, obtainable and secure. The rise of big data and ML applications in recent decades has been accompanied by calls for publishing datasets in an open manner, as well as secure access mechanisms for restricted datasets, so that their value can be realised [
75]. For ML stakeholders who work with personal or commercially sensitive data, advances in the accessibility of data have been tempered by security and legal precautions (e.g., compliance with GDPR and intellectual property rights).
The data quality concerns exemplified previously are already within the scope of the concepts and frameworks that have been established in data management literature, suggesting that this field already has a good grounding for defining the data quality dimensions that will continue to remain important to ML. What is new, however, is that ML development is characterised by complex configurations of datasets, data services and data handlers, which makes individuals more prone to abstaining from action in the belief that data quality is somebody else’s problem [
34]. This diffusion of responsibility can be addressed by providing clearer indicators of which data quality aspects fall within, and which fall outside, the scope of particular ML roles.
2.4 Why Knowledge of Desirable Data Quality Practices Is Important
The struggle of clarifying which data quality requirements are important is not exclusive to ML. Even where detailed data quality standards and practices exist, organisations and/or practitioners have to specify which data quality characteristics are relevant to their use case and how to define them.
In a study of organisations that applied the ISO/IEC 25012 data quality standard, Gualo et al. [
24] found that practitioners struggled to identify and describe the data quality rules that applied to their use case. The authors found that providing examples of what the requirements can look like helps to guide practitioners in clarifying their own rules.
Another challenge relates to information overload. Long lists of requirements have been found to deter practitioners from applying traditional standards, with Kim et al. [
41] showing that there is value in simplified frameworks that are tailored to a specific use case or technology.
Both of the preceding challenges are encountered during the initial stage of planning and defining which data qualities to evaluate. In other words, they occur at the beginning of the data quality management process defined by the standard ISO 8000-61, as illustrated in Figure
1. Without the planning stage, it becomes harder for a practitioner to develop the right data quality rules and select the tools to enforce them.
2.5 Data Quality Planning Precedes Implementation
Our work aims to support ML practitioners and data managers at the planning stage of their data quality journey. We identify a series of considerations that can help them to define their own requirements and data quality strategy. By understanding the requirements that exist, practitioners can be better positioned to select the most meaningful data quality control, assurance and improvement steps for their use case. Although the tasks of implementing specific data quality measures, evaluation criteria and tools for checking data quality are outside the scope of this review, we will mention examples where relevant.
Our goal in this article is twofold. Firstly, we want to inform practitioners of the data quality requirements and practices that exist and are meaningful in the field of ML. This will be done by synthesising recent academic literature and grouping the recommendations according to the dimensions of data quality that are already familiar to the field of data management. Secondly, to assist readers in selecting a smaller set of data quality practices that may apply to their use case, we map the recommendations onto specific stages and stakeholders in the ML development pipeline. In doing so, it is our hope to make it easier for organisations and individuals to prepare their data management routines for ML and to anticipate some of the scenarios that may arise at each stage of the ML development pipeline.
3 Methodology
Our literature review was conducted using a systematic mapping protocol [
54] to select a small set of relevant articles from the much larger collection of literature emerging at the intersection of data quality and ML. In the following, we present the research questions, inclusion criteria and search strategy that were used to select articles for review. We analysed the selected articles using thematic coding, which revealed additional themes related to the development stages of ML and the scope occupied by data quality management in the wider ML literature.
3.1 Research Questions
Our review aimed to identify and discuss the data quality requirements that are important to ML development, and how they differ from more established data management practices. For this purpose, we defined the following research questions:
•
Where do the data quality requirements of ML sit in relation to traditional data quality frameworks from data and information management?
•
Does ML present any new challenges that are not yet accommodated by traditional data quality frameworks?
The preceding questions deal with data quality management planning as opposed to implementation. This is a distinction that has previously been recognised in industry standards such as ISO 8000-61, as depicted in Figure
1. The planning stage (1) deals with the identification of data quality requirements and strategies for implementing them, whereas the implementation stages (2–4) are about translating these plans into practical rules and techniques for data quality control, assurance and improvement. This distinction between data quality planning and implementation informed the selection criteria of our review.
3.2 Selection Criteria
Our interest in data quality planning (as distinct from implementation) helped to limit the scope of our literature review and make the topic small enough to be discussed in a single paper. Specifically, our targeted articles dealt with philosophical or experiential perspectives on data quality frameworks, as opposed to articles that evaluated specific data management techniques or proposed new solutions for managing data quality. Our choice of research categories is highlighted in Table
1 alongside the other possible types of research as defined in the systematic mapping protocol of Petersen et al. [
54].
Our inclusion criteria were as follows:
•
The abstract of the paper must discuss conceptual frameworks for defining data quality requirements in relation to ML, or experiences of how these requirements have been defined in practice.
•
The paper was published between 2015 and 2022, to provide a contemporary overview.
•
The paper is peer reviewed and published in a journal, conference or workshop.
•
The paper may come in the form of a full-length article, extended abstract or workshop description.
Our exclusion criteria were as follows:
•
The abstract of the paper focuses only on techniques for data quality processing, assurance or improvement rather than conceptual frameworks for defining the data quality requirements.
•
The abstract of the paper only considers the data quality requirements of a specific industry that uses ML (e.g., healthcare, finance, materials science).
•
The paper does not contain information about the publisher.
•
The paper is an early iteration of a later work (e.g., if a similar workshop was delivered by the same authors multiple times, we selected only the latest version).
There was some overlap between our inclusion and exclusion criteria. For example, many abstracts discussed conceptual frameworks in addition to validating specific techniques, developing new prototypes or sector-specific solutions. We included these articles as long as the main part of the abstract was generalisable (i.e., discussing data quality concepts that apply to general ML applications and not focusing only on a specific industry or solution).
3.3 Search Strategy
Our literature search strategy consisted of three stages: (1) pre-selected articles that were already known to us, (2) automatic search on Google Scholar and selected conference proceedings, and (3) forward and backward snowballing to identify further articles.
Pre-Selected Articles. We began with a list of six articles [
3,
23,
32,
34,
35,
58] related to data quality planning, and in particular documentation, that were already known to us based on our previous work with ML models.
Automatic Search. We used Google Scholar to search for articles whose title included keywords related to our research questions. Limiting the search only to titles helped to eliminate marginally relevant articles from the results. The results were then filtered by examining the titles and abstracts of the articles. Only those that met the selection criteria were retained.
We began by searching the entire Google Scholar corpus using the query “
allintitle: “data quality” (“machine learning” OR “AI”).” This returned 185 results. We truncated our analysis after examining the first 30 results, as many of them did not meet our inclusion criteria. After examining the abstracts, seven articles were retained [
12,
19,
21,
25,
27,
28,
63].
We then conducted searches inside the proceedings of two leading academic conferences in ML and HCI: ICML (International Conference on Machine Learning) and CHI (Conference on Human Factors in Computing Systems). This was done using Advanced searches in Google Scholar, where the “published in” box was filled with the name of each conference. We adapted the search query to each venue’s area of specialisation. For example, when searching through CHI proceedings, we used a slightly more lenient query due to the smaller size of the search space: “
allintitle: data (quality OR “machine learning” OR AI).” This returned 19 results, 9 of which met our inclusion criteria [
2,
26,
31,
49,
56,
62,
64,
70,
73]. We also adapted the query for ICML, as the conference already specialises in ML. A search for “
allintitle: “data quality” OR “data management”” returned 16 results, 1 of which was identified as relevant [
38]. Table
2 summarises each of our search queries, the number of results returned by them, and the number of articles that were subsequently selected for our discussion.
We are aware that there may be other venues with relevant contributions that were not included in our selection.
Snowballing. After we started reading and reviewing the articles selected using the preceding techniques, we came across references to other articles that were relevant to our research questions. Eight articles were identified in this way [
5,
9,
11,
36,
48,
53,
55,
57]. These articles were initially chosen based on the descriptions provided by authors who cited them, then assessed using our inclusion criteria. One further article [
53] was identified using a forward search of articles that cited [
64], as we were curious about the work of other authors who cited this paper. Our general approach to snowballing was informal. Due to time constraints, we did not conduct a systematic review of all possible forward and backward citations.
Although we sought to gather a representative sample of articles, it is important to acknowledge that the 32 articles reviewed here represent only a small fraction of the growing body of literature on data quality in ML.
3.4 Thematic Coding
After selecting the articles, we read them and extracted information that helped to answer our research questions. Relevant information was recorded for each paper using a spreadsheet with the following groups of columns:
•
Basic information about the paper—five columns: title, authors, publication venue, year, how the paper was found (e.g., automated search, snowballing or existing knowledge).
•
Comments raised by the paper in relation to each of the four traditional data quality dimensions—four columns: intrinsic, contextual, representational, accessibility (as described by Wang and Strong [
74]).
•
One column to highlight any unusual data quality issues or requirements presented by ML.
Once we started reading the articles, we found that some of the authors’ comments and data quality requirements were targeted to specific stages in the ML development pipeline. For this reason, we added the following set of columns to organise our notes:
•
Stages of the ML development lifecycle—eight columns: dataset use case and design, data collection, data cleaning and pre-processing, data maintenance, ML building, ML verification and testing, ML deployment, ML monitoring (as described in Section
4).
Information about each paper was coded using the 18 columns described previously. We reviewed this spreadsheet to synthesise common themes at the intersection of two dimensions: each stage of ML development vs. the four traditional categories of data quality. This is the structure we use to present our results.
3.5 Scope of the Findings
Before presenting our results, we want to clarify their scope. Although the initial goal of our work was concerned with theoretical frameworks that can help to define and plan data quality requirements in ML, we also noted down any practical techniques mentioned by the authors. Many of our reviewed articles went beyond data quality “planning” to make recommendations on how data practitioners and managers should prepare their datasets for ML. We did not review these techniques in a systematic manner, as this would merit a separate review of its own. However, we included some of the techniques in our findings to illustrate how data quality plans can be translated into practical specifications, assurance techniques and solutions that apply during stages 2 through 4 of the process depicted in Figure
1.
Besides extending into practical techniques, many of our selected articles discussed topics that went beyond our original focus on data quality in the technical sense. Specifically, they overlapped with other related communities of research and practice in data management, such as data ethics, data justice and data feminism. These fields have historically been addressed by different communities, so the relations between them are not neatly delineated. Nonetheless, there is significant overlap which we attempt to illustrate in Figure
2. Rising [
59] presented an understanding where justice is about situations and consequences, whereas ethics is about the actions that lead to consequences. In line with this, data ethics deals with the way that practitioners manage data to ensure privacy, fairness, accountability, security and environmental sustainability [
6]. However, data justice addresses inequalities in the way people are represented and treated as a result of the data that they emit [
69]. Data feminism traces the cause of such inequalities to the power relations present in society, and advocates for actions that support the political, social and economic equality of the sexes, including intersections across other social dimensions such as race and class, sexuality, ability, age, religion and geography [
20].
As illustrated in Figure
2, each of these works highlights how systemic challenges in the lived experience of ordinary people are embedded in data, and their potential to be reinforced or mitigated through data-centric technologies. Although we did not explicitly search for these perspectives, and time and space constraints prevent us from covering them in the detail that they deserve, we encourage interested readers to investigate these topics separately.
Another scoping challenge which emerged during our review was related to the definition of data. Our initial intention was to focus on observational data (for training, testing and serving models), but this was later expanded due to the substantial attention that our reviewed articles dedicated to the quality of software systems, ML models and their accompanying documentation. Although there is some ambiguity among academics as to whether it is constructive to view software as data [
39], we have included the aspects of ML model and documentation quality that emerged during our review. For instance, we found that training data quality can be mediated by software systems (e.g., for data maintenance, or for checking input or output data). Moreover, the inclusion of model and documentation quality helped to highlight the areas where ML model quality is dependent on good-quality training, testing or serving data, as well as metadata in the form of documentation. For these reasons, our discussion of data quality grew to include model, software and documentation quality.
4 Results
We structure our findings according to the main stages of ML development. Because this is an iterative process that involves numerous decision pathways, there is no single agreed-upon workflow that is universally applicable to every scenario. Nonetheless, several commonalities have been identified by researchers.
As early as 1996, Fayyad et al. [
22] proposed a sequence of nine stages that constitute the task of knowledge discovery in datasets.
8 The authors suggested that the process typically begins with developing an understanding of the application domain and use case, followed by data collection, pre-processing and reduction, before moving on to identifying and applying relevant data mining methods, as well as interpreting and acting on their insights. Although the authors recognised that knowledge discovery workflows also include challenges related to data accessibility, HCI and model scaling, their pipeline focused on the granular steps contained within data mining. A similar focus on data is adopted by the upcoming industry standard ISO/IEC 5259, whose provisional data processing framework is illustrated in the upper part of Figure
3 [
18].
Recent academic discussions of the ML pipeline have been more detailed in separating out the different stages undergone by ML data. Specifically, they explore model development, verification, deployment and monitoring, which pose different requirements in terms of organisational and operational considerations [
5,
43].
For the purposes of this article, we organise our findings into a series of stages listed in the first column of Table
3 and illustrated in the lower part of Figure
3. Our first five stages (from dataset design to ML building) are adapted from the foundational work of Fayyad et al. [
22], and the last three stages (ML verification to deployment and monitoring) are additions derived from more recent literature. We use Figure
3 to anticipate how our terminology maps onto the framework of the forthcoming ISO/IEC 5259 standard.
Earlier publications and standards acknowledge that ML development rarely follows a pre-defined sequence, meaning that data pipelines are difficult to consolidate across different operational contexts. Our stages must therefore not be assumed to occur in a linear sequence. There are several ways in which reality may diverge from the stylised view presented in our diagram. The first of these relates to data iteration, where the steps of model building and testing are frequently followed by the need to collect new data or to enrich the existing dataset [
5,
11,
31,
34]. Other scenarios that are becoming increasingly common are multi-dataset–multi-model pipelines, where existing ML models are used for pre-processing data or training new ML models [
5]. We will flag these scenarios when we discuss our findings in the following subsections.
Although time and space constraints prevent us from anticipating every possible workflow that may occur in reality, we illustrate a simple example of an ML data quality pipeline in Figure
4. Additionally, we use Figure
5 to illustrate a more specific scenario of an ML application trained on text data that might involve multiple data sources and multiple models. The purpose of these diagrams is to show how different aspects of data quality assurance can map onto different stages of the ML development process. This is not an exhaustive view, and we encourage readers to be critical in evaluating how the data quality requirements discussed in the following would apply to their own non-linear cycles of dataset development.
We would also like to highlight that the data quality requirements described here should be viewed as desirable rather than essential. It is unrealistic to expect them to be achieved in their entirety, especially where practitioners have competing priorities such as time and cost. It is also common for data management capabilities to change and mature throughout the progression of a project [
7]. Thus, readers should treat the information reported here as aspirational rather than prescriptive.
4.1 Dataset Use Case and Design
The initial steps to ensuring data quality begin before data are collected. These steps include clarifying the use case for which the data are sought and investigating the operational and/or infrastructural requirements of gathering the data. These preparatory steps must be recorded in the dataset’s documentation, to inform current and/or future colleagues about the requirements of the use case.
It is common that the precise data quality requirements will not be known upfront, and new use cases may emerge as the model matures. This means that practitioners will likely need to return iteratively to the dataset design and data collection process [
31]. In cases where additional data are required but cannot be collected iteratively, other methods are available to enhance the dataset, as we will discuss later. With this in mind, the preparatory steps described in the following should be viewed as a desirable rather than essential part of the data quality pipeline.
4.1.1 Use Case Documentation.
The definition of ML data requirements must begin by consulting with relevant stakeholders [
35]. Those who are commissioning the system should be consulted to understand not only the problem that the ML needs to address but also the anticipated characteristics of the end users (e.g., demographics, cultural and environmental context). This information can support the acquisition of training data that are representative of the population of interest, thus increasing the likelihood that the output of the ML system will match their needs [
9,
36,
70]. Some questions that ML researchers may want to consider include how much supervision, domain expertise and specialisation would be needed to collect and label data for the scoped project [
36].
The careful analysis of requirements prior to data collection, as recommended previously, is different from the data collection practices that are typical of contemporary ML implementations [
36]. Our recommendation reflects an “interventionist” approach, which contrasts with minimally supervised data collection methods such as web crawling and crowdwork that have traditionally been used to generate large volumes of data. The problem is that these approaches do not typically evaluate the origin, motivation, platform or potential impact of the gathered data. This has been flagged as one of the causes of historical and representational bias in ML systems that use those data, with numerous authors urging for slower and more methodical approaches to data collection [
36,
53]. This includes the recruitment and training of data workers, as they are an integral part of how ML data come into being [
73].
Another issue that can get overlooked with big data is the interrogation of assumptions about which questions are answerable with certain data attributes in the first place. For example, Paullada et al. [
53] draw attention to studies that attempted to predict personal attributes from photos of human faces, under the false assumption that these predictions are possible and worthwhile to make. Careful documentation of the use case and underlying assumptions about relevant data attributes can help practitioners and organisations to avoid collecting data signals that may subsequently get discarded.
4.1.2 Data Availability and Coherence.
Once the use case and requirements for a dataset are known, it is important to conduct further checks into the availability of the required data. Whereas discussions in traditional data management tended to focus on static datasets that were already accessible to practitioners, common ML use cases include big data and real-time analytics where data reside in multiple storage systems characterised by streaming, heterogeneous and cloud-based data [
52,
57]. Data that are dispersed across multiple sources tend to have different schemas and approaches to storage and access [
33,
52]. This can lead to difficulties in discovering what data are available, their structure and how to parse, query or store them, which complicates the task of integrating information into a single dataset suitable for ML. Several authors have therefore noted that traditional data quality approaches designed for relational and static datasets may not be sufficient when dealing with the kinds of large-scale decentralised ML pipelines that are increasingly being used for operational and organisation-wide decision making [
25,
57].
As a result of the preceding, managing data quality in industrial use cases may require new infrastructure that ingests data and converts them into a form that is more compatible with the ML trainer [
57]. This may involve the creation of data warehouses to extract, clean, transform and integrate data. For instance, Paleyes et al. [
52] discuss how data-oriented architectures can help to make data flowing between elements of business logic more explicit and accessible, simplifying the tasks of data discovery, collection and labelling.
For real-time applications, runtime verification techniques can help to deal with data that arrive continuously and where models are trained continuously. This form of “online learning” requires continuous monitoring to correct data quality issues on-the-fly and ensure that they are within acceptable bounds to match the assumptions of the respective ML model [
21]. This may include checking that the operational input distribution is similar to that represented by the original model, to avoid issues of distribution shift [
50].
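One way to operationalise such a runtime check is a two-sample statistical test comparing an incoming batch against a reference sample drawn from the training data. The sketch below uses a Kolmogorov–Smirnov test on a single numeric feature with simulated data; this is only one of many possible drift detection techniques, and the threshold is illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Reference sample: the feature values the model was trained on (simulated here).
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)

# Incoming batch from the live system, simulated with a shifted mean.
incoming = rng.normal(loc=0.6, scale=1.0, size=500)

# A small p-value suggests the incoming data no longer match the distribution
# that the deployed model assumes.
statistic, p_value = ks_2samp(reference, incoming)
if p_value < 0.01:
    print(f"Possible distribution shift (KS statistic={statistic:.3f}, p={p_value:.2g})")
else:
    print("Incoming batch is consistent with the reference distribution")
```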
Besides technical infrastructure and tools for data quality assurance in online learning, some use cases may also require additional human resources for data labelling. Access to human annotators and field experts may be a particularly significant bottleneck in data labelling tasks, such as those in medical fields [
52,
70]. Here, the framing of tasks, labour conditions, and legal issues pertaining to data collection and distribution will need to be investigated as part of the technical and institutional infrastructure that precedes data collection [
53]. For example, Mitra et al. [
49] and Thakkar et al. [
70] discussed the importance of preparatory measures in the form of screening and training of data workers, with Mitra et al. [
49] finding that this preemptive approach produced better-quality data than what would typically be achieved through automated post-processing of noisy data.
4.2 Data Collection
Once the data use case and operational requirements are in place, the process of data collection can start. The design decisions made in the previous step may be implemented in several ways, such as through software systems, annotator guidelines and labelling platforms. In the following, we discuss the ways in which documentation, standards and interfaces can support the acquisition of high-quality data.
4.2.1 Data Collection Documentation.
The data collection process should be documented as early as possible during task design [
53]. Numerous authors have shared templates on how to structure the documentation. This includes datasheets [
23,
35], data statements [
9] and checklists [
58]. These documents are intended to help dataset creators to become more intentional and reflective about their data collection objectives, underlying assumptions, implications of use and stakeholder values as they work. Benefits to this include an improved understanding of the dataset’s contextual validity, by asking questions about how the dataset instances or sampling approach can be made more reflective of the larger population (e.g., in terms of geographic or demographic coverage) [
23,
70], or application context [
48].
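As an indication of what such documentation can look like in machine-readable form, the sketch below records a datasheet-style summary as JSON; the questions are paraphrased from the documentation frameworks cited earlier, and the field names and values are purely illustrative rather than a prescribed schema.

```python
import json

# A minimal, illustrative datasheet-style record for a hypothetical dataset.
datasheet = {
    "motivation": {
        "purpose": "Train a model to triage customer support tickets",
        "funded_by": "Internal product team",
    },
    "composition": {
        "instances": "120,000 support tickets (text) with priority labels",
        "sampling": "All tickets logged between 2019 and 2021, English only",
        "known_gaps": "No tickets from the mobile app channel before 2020",
    },
    "collection_process": {
        "method": "Export from the ticketing system; labels assigned by support staff",
        "consent": "Covered by the customer terms of service; personal data removed",
    },
    "recommended_uses": ["priority prediction", "topic analysis"],
    "discouraged_uses": ["inferring customer demographics"],
    "maintainer": "data-team@example.org",
}

# Publish the documentation alongside the dataset itself.
with open("datasheet.json", "w", encoding="utf-8") as f:
    json.dump(datasheet, f, indent=2)
```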
For consumers, documentation about data collection methods provides the information needed to make informed decisions about using a dataset and to avoid unintentional misuse [
23,
25]. It supports users in deciding whether the data are comprehensive enough for their use case [
19]. In some cases, the documentation may reveal assumptions that would not be readily apparent from basic metadata or dataset content [
35]—for instance, a recent crawl of old news articles would benefit from a statement to explain that the time of data collection is different from the original time of creation of the data values.
Besides understanding the dataset, some documentation frameworks are designed to equip downstream practitioners with the transparency needed to repeat the data collection process (e.g., for the purposes of gathering alternative datasets with similar characteristics, auditing or repeating an experiment in different contexts) [
23,
53,
58]. Documentation methods of this kind have been particularly encouraged in sociocultural data collection mechanisms, such as crowdsourcing, where data workers are hired worldwide to read texts, view images and video, and label the data that are used to develop ML models. This means recording operations related to sampling, mapping experimental conditions to micro-tasks and ensuring quality contributions from participants [
58].
In this vein, data users are beginning to assess quality not only in terms of the characteristics of the data (e.g., accuracy) but also the working conditions, skills and aspirations of the individuals who annotated those data [
73]. Authors in the field of HCI envision that as ML practitioners respond to the push for better documentation, this creates an opportunity for data labour practices to also be documented and reviewed. To this end, Rothschild et al. [
62] propose that crowdsourced ML datasets can be accompanied by a cover sheet that describes the precise hiring and employment practices. The intention is to encourage requesters to create institutional norms around just and respectful employment for data workers.
4.2.2 Data Collection Standards.
As noted in our earlier discussion of data use cases, a major challenge in ML data collection in industrial applications relates to data heterogeneity, which can be manifested as unstructured, semi-structured and structured data of disparate types [
25]. During the data collection process, the user requirements established in the previous step (dataset use case and design) need to be translated into common standards that allow datasets to be linked and that capture the necessary information. For example, streaming data from the web may need to be filtered and converted to a more structured format, whereas data from IoT sensors may require standardised semantics to capture the types of equipment used and to accommodate uncertainty around measurements.
4.2.3 Data Collection Interfaces.
One of the novel aspects of production ML is that data collection is automated rather than manual (e.g., data arrives continuously from sensors or web applications). In cases like this, part of the responsibility for ensuring good data quality lies with software engineers, who can design systems that generate actionable alerts to inform users of potential data quality issues (e.g., if a feature is missing or has an unexpected value) [
57]. Other examples of data collection interfaces can take a more creative format, such as data collection games. However, Gundry and Deterding [
26] found that such interfaces can present a tradeoff between participant enjoyment and data quality, where games elicited more enjoyment but led to less accurate data compared to an equivalent control.
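A minimal version of such an ingest-time check might look like the sketch below, in which an expected schema and plausible value ranges (both invented for illustration) are used to raise alerts for missing features and out-of-range values.

```python
# A minimal sketch of an ingest-time check; the schema and ranges are illustrative.
expected_schema = {
    "sensor_id": str,
    "temperature_c": float,
    "humidity_pct": float,
}
valid_ranges = {"temperature_c": (-40.0, 60.0), "humidity_pct": (0.0, 100.0)}

def validate_record(record: dict) -> list:
    """Return a list of human-readable alerts for a single incoming record."""
    alerts = []
    for field, expected_type in expected_schema.items():
        if field not in record:
            alerts.append(f"missing feature: {field}")
        elif not isinstance(record[field], expected_type):
            alerts.append(f"unexpected type for {field}: {type(record[field]).__name__}")
    for field, (low, high) in valid_ranges.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not (low <= value <= high):
            alerts.append(f"value out of range for {field}: {value}")
    return alerts

incoming = {"sensor_id": "A17", "temperature_c": 131.2}  # humidity missing, temperature implausible
for alert in validate_record(incoming):
    print("ALERT:", alert)  # in production, alerts would be routed to a monitoring system
```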
4.3 Data Validation and Maintenance
Once the data have been collected, they typically undergo a process of checking and cleaning before being usable for an ML system. This stage of the ML development pipeline accounts for a large share of the activities related to data quality assurance. In the following, we discuss these tasks, which include pre-processing, validating the contextual coverage of the data, data quality metrics, user interfaces for inspecting data, dataset accessibility and maintenance over the longer term.
4.3.1 Pre-Processing.
Data collection is often followed by pre-processing tasks such as feature selection, deduplication, removal of outliers, consistency checking, anonymisation and imputation of missing values [
52,
70]. As was done during the data collection step, information about the pre-processing steps should be recorded in the dataset documentation. This helps subsequent data consumers to determine whether the data are readily compatible with their chosen task or if they need to undertake further transformations (e.g., dimensionality reduction, bucketing, tokenization, removal of instances, normalisation, etc.) [
23,
57]. Another aspect of data composition that can be useful to inspect and report in some ML use cases relates to potential dependencies that may exist between features, where information leakages between variables could later cause the trained ML models to produce unrealistically accurate predictions during testing [
57].
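A crude but useful screen for such dependencies is to inspect the correlation of every candidate feature with the target: features that are almost perfectly predictive often encode information that would not be available at prediction time. The sketch below uses synthetic data and an arbitrary 0.9 threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n = 1_000

# Synthetic training table; 'days_to_repayment' is only observable after the
# outcome, so its near-perfect correlation with the target is a leakage signal.
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "income": rng.normal(40_000, 10_000, size=n),
    "age": rng.integers(18, 70, size=n),
    "days_to_repayment": target * 30 + rng.normal(0, 1, size=n),
    "defaulted": target,
})

correlations = df.drop(columns="defaulted").corrwith(df["defaulted"]).abs()
suspicious = correlations[correlations > 0.9]
print("Features suspiciously correlated with the target:")
print(suspicious)
```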
Although the nature of the preceding work is not unusual in relation to long-standing data management literature that has dealt with validity, consistency and integrity concerns, literature from the field of ML has highlighted constraints in the order in which the data pre-processing tasks should be executed. Differences in the sequence of data pre-processing steps have been found to produce radically different ML results (e.g., correcting the data for missing values using imputation can affect outliers in the dataset) [
27]. Given that the search space of all possible sequences of data pre-processing tasks is combinatorially large, some authors have proposed algorithmic solutions for establishing the optimal pre-processing pipeline [
11]. Others have drawn attention to formal ways of establishing and treating the reasons behind problematic data. For example, Bertossi and Geerts [
12] suggest that XAI techniques can be applied to identify the features that cause inconsistencies in data and use this information to predict the best repair actions.
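The sensitivity to ordering noted earlier can be illustrated with a small synthetic example: imputing missing values with the mean before removing outliers produces a very different dataset from removing outliers first, because the imputed value is contaminated by the outlier. The sketch below uses Tukey's interquartile rule purely for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 2.5, np.nan, 3.0, 250.0])  # 250.0 is an implausible outlier

def remove_outliers(s: pd.Series) -> pd.Series:
    """Keep points within 1.5 interquartile ranges of the quartiles (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

# Order A: impute with the mean first, then remove outliers.
order_a = remove_outliers(values.fillna(values.mean()))  # the mean is inflated to 51.7

# Order B: remove outliers first, then impute with the mean of the remaining values.
trimmed = remove_outliers(values.dropna())
order_b = values[values.isna() | values.index.isin(trimmed.index)]
order_b = order_b.fillna(trimmed.mean())                 # the clean mean is 2.125

print("Impute then clean:", order_a.tolist())
print("Clean then impute:", order_b.tolist())
```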
But even where formal data cleaning techniques have not been used, data practitioners can still take care to document their actions where possible (e.g., using pre-defined protocols or ex-ante publication of reproducible code that was used to prepare the data). One possible way of doing this is through the use of interactive notebooks to weave together code and documentation [
72]. Data validation routines and publication of pre-processing code is particularly valuable in contexts where data preparation is decoupled from the ML pipeline, providing more transparency and opportunities to detect bugs, feedback loops or changes in data dependencies [
52].
4.3.2 Data Context and Coverage.
The period after data collection is a good time to re-evaluate contextual characteristics of the dataset and the degree to which they align with the intended use case. In sociocultural data, important factors to explore could include cultural biases related to gender, race, ethnicity or religion [
9,
25]. Guidance on which protected characteristics to look out for can be found within practical toolkits such as “AI Fairness 360” [
8], and checklists can be used to document such information to ensure legal and ethical compliance [
60].
Additionally, practitioners should consider the possibility that some variables captured in a dataset may not explicitly refer to demographic groups but still contain stereotype-aligned correlations [
32,
53]. For example, variables such as wages or location may be strongly correlated with specific populations in a given region. To surface these kinds of relationships, practitioners may need to compute comparisons to variables from other datasets considered to be “ground truth,” such as Census data [
32]. In use cases that do not capture human data, it may also be useful to evaluate the variance of data in capturing different environmental contexts, such as the environment in which autonomous vehicles are trained in the lab and how it may differ from situations in the real world [
52].
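For illustration, the sketch below simulates how a variable that never names a demographic group (here a hypothetical postcode area) can nonetheless act as a strong proxy for one, quantifying the association with a chi-squared test and Cramér's V on a linked reference table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(seed=2)
n = 2_000

# Synthetic linked records: 'postcode_area' never names a demographic group,
# but in this example it is constructed to be strongly associated with one.
group = rng.choice(["group_1", "group_2"], size=n)
postcode = np.where(
    group == "group_1",
    rng.choice(["N1", "N2", "S1"], size=n, p=[0.7, 0.2, 0.1]),
    rng.choice(["N1", "N2", "S1"], size=n, p=[0.1, 0.2, 0.7]),
)

table = pd.crosstab(pd.Series(postcode, name="postcode_area"),
                    pd.Series(group, name="demographic_group"))
chi2, p_value, dof, _ = chi2_contingency(table)
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(table)
print(f"\nCramer's V = {cramers_v:.2f} (values near 1 indicate a strong proxy relationship)")
```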
Although some of the contextual biases described earlier may be detectable in the existing data with some effort, others may become clear only once the dataset is deployed through ML in production. This is especially true of unstructured data (e.g., text, images) where the features are opaque and difficult to inspect. In cases like this, it is important to document the populations from whom the data originate. Numerous authors have observed that ML systems perform better for users whose demographic characteristics match those represented in the training data [
9,
36]. The contextual origins of datasets must therefore be recorded in the documentation as a means to preempt scientific and ethical issues that may result from the use of data from certain populations to develop ML technology for other populations. Bender and Friedman [
9] provide examples of data statements for NLP datasets, which can be used to provide the context needed by developers and users to better understand how the subsequent ML results might generalise, how best to deploy the software and what biases might be embedded in it. For datasets that originate from crowdworkers, it is important to additionally report any potential sampling and selection biases, as well as response bias, design bias and ethical integrity aspects (e.g., informed consent, minimum wage), that will allow the experimental setup to be traced or reproduced where necessary [
48,
58].
4.3.3 Data Quality Metrics.
In addition to the qualitative descriptions of dataset use cases, collection and pre-processing steps discussed earlier, during the data maintenance step it is beneficial to include quantitative metrics about the dataset. Several generalised and context-specific frameworks have been proposed for this in the literature.
Holland et al. [
32] developed a web-based “dataset nutrition label” comprising seven modules to display general aspects such as metadata, provenance, variables, statistics, pair plots, probabilistic models and ground truth correlations. In contrast to this standardised approach, Gudivada et al. [
25] recommend metrics that are more task specific. For example, the data quality metrics that matter most in classification tasks are proposed to include class overlap, outliers, boundary complexity, label noise and class imbalance. Regression tasks, however, benefit from data quality metrics regarding outliers and missing values. This suggests that data practitioners who are responsible for maintaining the dataset may need to refer back to the anticipated ML use case to decide which metrics would be most meaningful to consider and report.
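By way of example, the sketch below computes a handful of such metrics (class balance, per-column missing rates and a simple outlier count) on synthetic data; which of these matter, and what thresholds apply, would follow from the anticipated ML task as discussed previously.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
df = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "feature_b": np.append(rng.normal(size=495), [9.5, 11.0, np.nan, np.nan, np.nan]),
    "label": rng.choice(["negative", "positive"], size=500, p=[0.92, 0.08]),
})

metrics = {
    # Relevant to classification use cases: how skewed are the classes?
    "class_balance": df["label"].value_counts(normalize=True).round(3).to_dict(),
    # Relevant to most use cases: completeness per column.
    "missing_rate": df.isna().mean().round(3).to_dict(),
    # A simple outlier count: points beyond 3 standard deviations from the mean.
    "outliers_feature_b": int((np.abs(
        (df["feature_b"] - df["feature_b"].mean()) / df["feature_b"].std()
    ) > 3).sum()),
}
print(metrics)
```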
4.3.4 User Interfaces.
Besides quantitative metrics, the preceding proposals for data quality metrics have also advocated for the use of dashboards and visual aids for data inspection and sanity checks (e.g., min/max values in continuous data, distribution of categorical values) [
25,
32,
52,
57]. Holzinger [
33] highlights that “at the end of the pipeline there is a human, who is limited to perceive information in dimensions. It is a hard task to map the results, gained in arbitrarily high dimensional spaces, down to the lower dimensions.” To this end, interactive software tools can help users to explore the data through pair plots, distributions, correlations, histograms or heatmaps, and evaluate their suitability for certain demographics or other criteria.
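As one possible realisation of such an inspection aid, the sketch below draws a histogram and a correlation heatmap for a small synthetic table using matplotlib; dedicated profiling tools offer richer dashboards, but even a few plots of this kind can reveal implausible ranges or unexpected relationships.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=4)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=300),
    "income": rng.lognormal(mean=10, sigma=0.4, size=300),
    "tenure_years": rng.integers(0, 40, size=300),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram for a quick sanity check of ranges and skew.
df["income"].hist(ax=axes[0], bins=30)
axes[0].set_title("Income distribution")

# Correlation heatmap to spot unexpected relationships between features.
corr = df.corr()
im = axes[1].imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
axes[1].set_xticks(range(len(corr.columns)))
axes[1].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[1].set_yticks(range(len(corr.columns)))
axes[1].set_yticklabels(corr.columns)
axes[1].set_title("Feature correlations")
fig.colorbar(im, ax=axes[1])

fig.tight_layout()
fig.savefig("data_inspection.png")  # or plt.show() in an interactive session
```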
4.3.5 Accessibility.
Maintaining a dataset after its creation can present several accessibility questions, especially for personal or commercially sensitive datasets whose disclosure could pose risks to privacy, security or intellectual property [
32]. Before publishing, data managers will need to determine the usage affordances of the dataset, its policies and designated owners [
35]. Specific mechanisms may need to be identified for achieving good data availability while simultaneously protecting the data from unauthorised access (e.g., by defining user entitlements to data access and including metadata containing the licence type and DOI) [
25,
32,
35]. One possibility here is the use of specialised infrastructure (e.g., data trusts) that allow for secure data storage, retrieval and purging mechanisms between trusted parties. In cases where direct access to data is not possible, proxy metrics such as the data “nutrition label” described earlier may provide sufficient information for auditing and accountability purposes [
32].
4.3.6 Maintenance.
Datasets will require governance standards and specifications to support their maintenance, especially in larger organisations that handle multiple datasets [
25]. This documentation should include information about the conventions used for naming and organising the data, their meaning, source and version history [
32,
35,
57], as well as specifying the complex relationships that may exist between multiple data sources.
For datasets that deal with contextually significant data (e.g., from specific geographic regions, populations or industries), data managers may have an interest in maintaining them in ways that help to address data coverage issues over the longer term. This can involve the establishment of open repositories and data trusts with the goal of gathering more representative data [
36]. As part of this, data managers can develop “mission statements” to communicate their curation goals and encourage external contributions that can make the collection more contextually representative in future.
4.4 ML Building
In many contexts, the previous data collection and preparation steps are likely to have been carried out by a person different from the one who builds the ML model. For this reason, the ML practitioner would ideally revisit the dataset’s documentation to make sure that it meets their use case requirements. This can help them to avoid using the data for a purpose that may be morally or ethically objectionable to the original curators [
53].
Once the dataset is confirmed to be suitable, the process of building the ML model can begin. Some of the initial data work may resemble the data pre-processing stage mentioned earlier, but here the requirements depend to a greater extent on the selected ML techniques and use case. Examples of possible tasks include feature selection, enrichment and sampling. We summarise these requirements next, followed by a discussion of the data accessibility issues that accompany ML models.
4.4.1 Feature Selection.
During the initial development of a model, an important part of data preparation involves selecting or engineering a set of features that are most predictive of the outcome [
57]. This includes removing redundancies (e.g., correlated variables) or using dimensionality reduction methods (e.g., PCA) before using the data as model input. However, preparations of this kind are not always feasible with unstructured data such as images, language and video, where high dimensionality and large sizes make it hard to identify relevant features from the outset [
34]. Some feature selection work may therefore be deferred until the ML models are more mature, at which point the focus shifts from preparatory steps on the incoming dataset towards ex-post feature selection as a way to optimise resources and reduce latency while retaining model accuracy.
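A minimal sketch of the preparatory steps mentioned above, dropping highly correlated features and then applying PCA, is shown below using pandas and scikit-learn; the feature table is synthetic and the 0.95 thresholds are illustrative choices rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Hypothetical numeric feature table with one redundant, highly correlated column.
X = pd.DataFrame(np.random.rand(200, 10), columns=[f"f{i}" for i in range(10)])
X["f10"] = X["f0"] * 0.99 + 0.01

X_reduced = drop_correlated(X)

# Optional dimensionality reduction: keep enough components for 95% of the variance.
X_scaled = StandardScaler().fit_transform(X_reduced)
X_pca = PCA(n_components=0.95).fit_transform(X_scaled)
print(X.shape, X_reduced.shape, X_pca.shape)
```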
4.4.2 ML-Informed Data Pre-Processing and Enrichment.
Once relevant features are selected, ML developers may need to re-examine data quality challenges related to contextual coverage and cleanliness, and mitigate any remaining limitations of the dataset through enrichment and/or sampling before the data are fed to the model. In the following, we discuss each of these processes.
Exploration into data coverage that was initiated at the data collection and pre-processing stages should continue during the ML building process. In particular, ML practitioners should be mindful that it is not always possible for the preceding data handlers to obtain
a priori knowledge of potentially sensitive features (e.g., gender, race), especially in high-dimensional data such as images, language and video. In cases like this, ML itself can become a tool for detecting the smaller subsets of data that would most benefit from enrichment, or for informing modelling choices that mitigate bias [
33,
34]. In the case of enrichment, the first step is to contextualise the available data, then augment the existing features with new signals from other datasets or acquire new labels [
57]. Solutions of this kind have been applicable in contexts such as gender-biased text data, where authors have proposed the use of further data collection and improvement steps, such as crowdwork and debiasing algorithms, to identify and remove discriminative word mappings from training data [
14].
Besides enriching the available data, another solution for creating contextually relevant datasets involves sampling. Such practices target the dataset’s representativeness, rather than size, as the quality that will influence the performance of an ML model. Several authors have noted that a small number of representative observations can be more effective than using an extremely large but biased dataset [
25]. Indeed, using all available data to train models can sometimes have a detrimental effect [
28].
Examples of this have been especially prominent in research that deals with imbalanced datasets, where the outcome of interest is under-represented in the observation space (e.g., fraud detection, clinical diagnosis). Here, techniques such as under-sampling and synthetic data have been found to enhance model performance [
16]. Others have proposed that training datasets should be filtered in other contexts that deal with human behaviour. For example, Hagendorff [
28] proposed to single out data from certain subpopulations that are deemed more competent, eligible or morally versed for a specific task.
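As a simple illustration of the under-sampling technique mentioned above, the following Python sketch balances a hypothetical imbalanced dataset by randomly down-sampling every class to the size of the rarest one; in practice, the sampling strategy should be chosen with the use case in mind.

```python
import pandas as pd

def undersample_majority(df: pd.DataFrame, label_col: str, random_state: int = 0) -> pd.DataFrame:
    """Randomly under-sample every class down to the size of the rarest class."""
    minority_size = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=minority_size, random_state=random_state))
          .reset_index(drop=True)
    )

# Hypothetical imbalanced fraud dataset: roughly 1% positive class.
df = pd.DataFrame({
    "amount": range(1000),
    "is_fraud": [1 if i % 100 == 0 else 0 for i in range(1000)],
})
balanced = undersample_majority(df, "is_fraud")
print(balanced["is_fraud"].value_counts())
```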
One of the downsides of re-sampling approaches is that they can be costly to implement and require the practitioner to know in advance which features are responsible for the undesirable bias [
34]. To this end, some authors have proposed algorithmic approaches for identifying subsamples of training data that are most effective at meeting the desired model metrics (e.g., log loss, AUC and calibration) [
57].
In addition to mitigating bias, ML tools can also be used to enhance the cleanliness of datasets for specific models. As mentioned during the pre-processing stage, automated techniques can be used to select the optimal sequence of data pre-processing tasks that maximise the performance of the ML model [
11].
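The sketch below gives a simplified flavour of this idea using scikit-learn: candidate pre-processing steps are treated as hyperparameters, and the combination that maximises cross-validated model performance is selected. It is a toy stand-in for the automated approaches cited above, with a public example dataset and an illustrative search space.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A pipeline whose pre-processing steps are treated as searchable hyperparameters.
pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])

param_grid = {
    "impute__strategy": ["mean", "median"],
    "scale": [StandardScaler(), MinMaxScaler(), "passthrough"],
}

# Select the pre-processing sequence that maximises cross-validated performance.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```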
4.4.3 Multi-Dataset–Multi-Model Scenarios.
Another common scenario involves practitioners reusing existing ML models as part of their data pre-processing steps, or relying on an existing ML model as a starting point to train a second model for a new domain. These scenarios have implications for data quality because they determine part of the context to which the data quality needs to be tailored.
For example, in the NLP domain, it is common to reuse tools such as part-of-speech (POS) taggers, dependency parsers and pre-defined stop word lists to prepare the data for subsequent use in a model. To do this, the practitioner will typically need to prepare their text data by removing special characters and tokenising the string into a list of words that can be read by the pre-processing tool.
In other cases, model reuse forms a more substantial part of the ML development process. This is common with complex models that could take weeks of computation on multiple machines, where using existing models as a starting point can save valuable time and resources when training a second model. For example, reusing a convolutional neural network trained on human faces, which already has the capacity to extract the main features (e.g., eyes, noses), can prove more efficient than training a new convolutional neural network from scratch [
5]. This is termed
transfer learning in the literature, and it typically involves using one of a few “foundation models” created by large organisations with access to vast amounts of data and computational power [
15].
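A minimal sketch of this transfer-learning pattern is shown below in PyTorch (assuming torchvision ≥ 0.13); a network pre-trained on a large generic dataset is frozen and only a new classification head is trained for a hypothetical two-class downstream task.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on a large generic dataset (ImageNet weights).
backbone = models.resnet18(weights="DEFAULT")

# Freeze the pre-trained feature extractor so only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final classification layer for the new, hypothetical two-class task.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch (replace with real data loaders).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```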
An important data quality challenge here relates to knowledge about the data on which the model was trained, as well as the data used to evaluate the model. For example, duplicate entries in a dataset can produce an overlap between the datasets used to train and evaluate a model, which can cause the performance metrics to be exaggerated [
19].
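A simple way to screen for this problem is to check for exact-duplicate rows shared between the training and evaluation splits, for example by hashing rows with pandas as in the sketch below (the toy splits are hypothetical); detecting near-duplicates requires more sophisticated similarity checks.

```python
import pandas as pd
from pandas.util import hash_pandas_object

# Hypothetical training and evaluation splits (in practice loaded from storage).
train = pd.DataFrame({"text": ["a", "b", "c", "c"], "label": [0, 1, 0, 0]})
test = pd.DataFrame({"text": ["c", "d"], "label": [0, 1]})

# Hash whole rows so exact duplicates can be compared cheaply across the two splits.
train_hashes = set(hash_pandas_object(train, index=False))
test_hashes = set(hash_pandas_object(test, index=False))

overlap = train_hashes & test_hashes
print(f"{len(overlap)} evaluation rows also appear verbatim in the training data")

# Within-split duplicates are also worth reporting, as they can inflate evaluation metrics.
print("duplicate training rows:", int(train.duplicated().sum()))
```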
Another issue to consider is the extent to which the original model’s intended usage matches that of the new application. Foundation models that are built to be generalisable can achieve this at the expense of specificity. For example, their training data may not sufficiently capture an operational context that is characterised by specific demographic or cultural traits. In cases like this, reusing and tuning a trained model improves performance only if the tuning is done with a dataset that contains task-specific data entries [
19]. Some authors have called for smaller reusable models that are trained on contextually relevant, rather than large, datasets [
10].
4.4.4 Documentation.
Where possible, the ML building process should be accompanied by documentation that has all the necessary information to reproduce or verify the model [
35,
55].
This includes defining the metrics and statistics used to evaluate the model, as well as reporting the measures of central tendency (e.g., mode, median and mean) and uncertainty around observed effects (e.g., range, quartiles, absolute deviation, variance and standard deviation) [
48]. Documentation practices at this stage can also provide an opportunity to examine and reflect on the data properties that significantly affect the model accuracy, and whether there are any dependencies to other data and infrastructure that may affect the outcome [
57]. Besides the model results, the documentation should also report the provenance of the model (e.g., who developed it, potential conflicts of interest, when it was developed, versioning, etc.) [
48].
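As an illustration of the statistical reporting suggested above, the following sketch (Python with scikit-learn, using a public example dataset as a stand-in) evaluates a model across cross-validation folds and records the central tendency and dispersion of the chosen metric in a form that can be copied into the model documentation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Public example dataset used as a stand-in for the model's real training data.
X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10, scoring="accuracy")

# Central tendency and dispersion of the evaluation metric across folds.
report = {
    "metric": "accuracy (10-fold cross-validation)",
    "mean": float(np.mean(scores)),
    "median": float(np.median(scores)),
    "std": float(np.std(scores, ddof=1)),
    "range": (float(scores.min()), float(scores.max())),
    "quartiles": [float(q) for q in np.percentile(scores, [25, 50, 75])],
}
print(report)
```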
The preceding information can come in the form of separate documents, or as comments and variable identifiers embedded in the code. Pinhanez et al. [
56] present an example from the field of conversational systems, where practitioners have tended to structure their documentation in a manner that is readable by machines. The authors discuss how documentation of this kind can have its own computational value when building new tools to assist the developers.
Besides assisting collaboration between ML developers, the documentation also provides an opportunity to disclose decisions and facts that can be used by the broader community to better understand what the model does [
48]. As with dataset maintenance, the model documentation should also be accompanied by versioning information and DOIs, which could be done through institutional repositories or other open platforms where the model itself or its metadata are housed [
35]. In commercially sensitive settings, the level of disclosure may be tempered by the requirement to protect intellectual property rights.
4.4.5 Accessibility.
In contexts where openness is possible, a growing number of research venues are encouraging ML practitioners to publish their models for the purposes of review and verification (e.g., checking experimental conditions, hyperparameters, proper use of statistics, robustness), as well as supporting the replication of existing models in subsequent innovation and research. Structured guidelines for sharing ML models can be found in reproducibility checklists, such as the one proposed by Pineau et al. [
55]. Such checklists cover the accessibility of both the model code and the training data.
In the publication of code, practitioners in industry may first need to ensure that their applications do not contain software that is protected by intellectual property, or is built on top of proprietary libraries. Although this is an important consideration, prior research has observed that many authors from industry were indeed able to submit code [
55]. In cases where the model cannot be shared at all and practitioners still want to provide access for model verification and review, they can share minimal information on model performance across various factors [
48]. One way of doing this is to use “model cards,” which are short documents that describe model evaluation procedures and results across different settings that are relevant to the intended application domain [
48]. We will elaborate on these procedures in the next section. Additionally, models that use decision thresholds can include a threshold slider in the digital documentation that accompanies a model [
48], allowing users to view performance parameters across different decision thresholds.
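The information exposed by such a threshold slider can be approximated offline by tabulating performance at several candidate thresholds, as in the sketch below using scikit-learn (the dataset and thresholds are illustrative stand-ins).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Public example dataset used as a stand-in for the intended application domain.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Tabulate performance at several decision thresholds, mimicking a threshold slider.
for threshold in np.arange(0.1, 1.0, 0.2):
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, preds, zero_division=0):.3f}  "
          f"recall={recall_score(y_test, preds, zero_division=0):.3f}")
```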
With regard to the publication of data, ML practitioners are typically encouraged to share the training and test data that underpin their model. However, this presents a challenge to ML models that rely on commercially sensitive or personal data (e.g., in healthcare or finance). For cases like this, synthetic training and test data can be generated using distribution hypotheses from the original data [
32], or complementary empirical results can be provided using open source benchmark datasets in addition to results based on the confidential data [
55]. ML practitioners should also be mindful of using and distributing training data that come from unknown sources; this includes benchmark datasets scraped from the web, whose licensing and copyright restrictions are unclear, or datasets that may have become deprecated [
53].
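Returning to the synthetic-data option mentioned above, one simple (though not formally privacy-preserving) approach is to fit a parametric distribution to the confidential data and release samples drawn from it instead, as in the following NumPy sketch with stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a confidential numeric dataset (rows = records, columns = features).
private_data = rng.normal(loc=[50, 3.2], scale=[10, 0.5], size=(500, 2))

# Distribution hypothesis: model the data as a multivariate Gaussian.
mean = private_data.mean(axis=0)
cov = np.cov(private_data, rowvar=False)

# Share synthetic samples drawn from the fitted distribution instead of the raw records.
synthetic = rng.multivariate_normal(mean, cov, size=500)
print(mean, synthetic.mean(axis=0))
```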
4.5 ML Testing
In many documented cases of adverse ML outcomes, the issues with training data became apparent only after the solution was deployed in real-world contexts. To avoid this, ML practitioners and auditors can test the system for contextual bias and security issues before release. We discuss these considerations next.
4.5.1 Performance Metrics and Explainability.
The evaluation metrics for ML models have traditionally focused on generic cues such as information loss, and false-positive and false-negative rates. However, more recently researchers have started encouraging practitioners to develop context-specific criteria that rely on specific types of test data. For example, to assess the contextual coverage of an ML model, its performance can be tested in different demographic and intersectional groups (e.g., by age, race, gender, geography) [
48]. This is particularly important in cases where protected attributes may be under-represented in the training dataset, prompting fairness concerns [
34]. When deciding which factors to present in the intersectional analyses, practitioners must be cautious to preserve the privacy of individuals; this can be done through collaboration with policy, privacy and legal experts to decide which groups may be responsibly inferred, and how this information can be stored and accessed [
48]. For practitioners who are struggling to find test data for populations outside of the initial domain used in training, possible solutions include using synthetic datasets to represent use cases that may otherwise go unevaluated [
48].
Besides testing performance on different demographic groups, different business contexts may also be relevant to consider (e.g., plant recognition worldwide or in the Pacific Northwest, vehicular crash tests with one or another phenotype in dummies) [
48]. This allows stakeholders (policymakers, developers and individuals) to compare models not only based on generic evaluation metrics but also on social and economic dimensions such as ethics, inclusivity and fairness, making it possible to take remedial action where necessary.
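A minimal sketch of such disaggregated evaluation is shown below: per-group and intersectional accuracy are computed with pandas over a hypothetical table of predictions and demographic attributes (the column names and groups are purely illustrative).

```python
import pandas as pd

# Hypothetical evaluation results: one row per test example, with model predictions
# and the demographic attributes along which performance should be disaggregated.
results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "gender": ["f", "f", "f", "m", "m", "m", "f", "m"],
    "age_band": ["18-34", "35-54", "55+", "18-34", "35-54", "55+", "18-34", "55+"],
})
results["correct"] = results["y_true"] == results["y_pred"]

# Per-group and intersectional accuracy, with group sizes for context.
print(results.groupby("gender")["correct"].agg(["mean", "size"]))
print(results.groupby(["gender", "age_band"])["correct"].agg(["mean", "size"]))
```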
In addition to representativeness, other meaningful metrics might reflect model performance in real business settings, for instance by estimating customer conversion rates [
52], as well as the model size and the energy consumption incurred by the model [
53]. Additionally, sensitivity studies of dataset parameters can give insight into the features that have an impact on the model’s prediction [
35]. This not only helps to support transparency and explainability for data users but also can help practitioners to understand the effect that errors in specific features can have on a model’s output and performance. This understanding is vital to applying data quality assurance and correction tools [
63].
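One accessible way to run such a sensitivity study is permutation importance, which measures how much a model's held-out performance degrades when each feature is shuffled in turn; the scikit-learn sketch below uses a public example dataset as a stand-in for the real data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Public example dataset used as a stand-in for the model's real training data.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much held-out performance degrades:
# the features whose permutation hurts the score most are those the model is most sensitive to.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.4f}")
```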
One of the themes emerging from the preceding authors is that performance metrics need to be tailored to the specific use case of the model. Often this involves tradeoffs, both between traditional evaluation metrics such as precision and recall [
25] and between contextually sensitive properties such as test-set accuracy, robustness, fairness, compactness and privacy, where maximising one metric may come at the expense of another [
34]. Because of the subjective nature of the model evaluation process, and the various different metrics that practitioners can choose to prioritise, these decisions can be communicated to users using “model cards” that contextualise the results according to different benchmarks that matter in the intended application [
48]. As was the case with dataset documentation, the use of visualisations can help to demonstrate cross-sectional analyses of model performance according to different metrics.
Besides performance, model inspection and visualisation methods can also support the interpretability of the model, which can in turn influence its perceived quality [
4]. This falls within the field of XAI, which aims to help practitioners and operators to analyse the output of ML models and the reasons behind automated decisions. Possible ways of doing this include providing natural language explanations based on decision trees, using model visualisations to support understanding and explaining the outcome by example [
1]. Whereas many XAI approaches have focused on model-based explanations, Anik and Bunt [
2] proposed that data-centric explanations can be equally meaningful to evaluating the trustworthiness of ML models, both by engineers and end users.
4.5.2 Access Security.
The steps taken to test the security of ML models will depend on whether they are open or closed, and whether their data are subject to privacy restrictions. There may be ambiguous cases where the training and evaluation data may need different levels of disclosure. For example, the training data may be proprietary or require a non-disclosure agreement, whereas the evaluation datasets are shared publicly for third-party use [
48]. Open datasets that have been anonymised will require a careful review to mitigate the risk of de-anonymisation; ideally, this would be done by someone with good background knowledge of the likely adversary [
33].
The testing stage is also the point at which to weigh the benefits of the detailed reporting practices outlined earlier against the potential risks of exposing confidential data. Adversarial testing should be conducted to make sure that the public-facing model output cannot be used to recreate the original data [
48], especially in cases that provide confidence intervals and interactive interfaces (e.g., sliders) in digitised model documentation. Besides test-based approaches, practitioners can opt for using theoretical models for proving that their models are safe against adversarial attacks [
40,
45,
76].
4.6 ML Deployment
Once ML models are trained and ready for deployment, the focus of data quality work shifts from internal operations on training and test data to assuring the quality of the serving data that enter the system from outside.
Mechanisms are needed to ensure that the serving data undergo the same preparation steps as the steps that were applied to the raw training data [
57]. This can be especially challenging in settings where new data arrive continuously, and where they are used to retrain and deploy updated models. The latter case will require additional measures for preventing adversarial attacks such as data poisoning [
52] or spam [
57].
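One pragmatic way to keep the two paths consistent is to serialise the pre-processing pipeline that was fitted on the training data and reuse the very same object on serving data, as in the sketch below using scikit-learn and joblib (the file name and steps are illustrative).

```python
import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit the pre-processing steps once, on the raw training data.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train = np.random.rand(100, 5)
preprocess.fit(X_train)

# Persist the fitted pipeline alongside the model artefacts...
joblib.dump(preprocess, "preprocess.joblib")

# ...and reuse exactly the same transformation on serving data, avoiding
# training-serving skew caused by re-implementing the preparation steps.
serving_batch = np.random.rand(10, 5)
prepared = joblib.load("preprocess.joblib").transform(serving_batch)
print(prepared.shape)
```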
Other precautions also apply to models that do not ingest new training data. For example, proprietary models can be stolen by repeatedly querying the system (e.g., via a public prediction API) and monitoring the outputs to reverse engineer a substitute model [
52]. A related risk is model inversion, where querying can be used to recover parts of a private training dataset, thereby breaking its confidentiality [
52]. These risks are especially pronounced for models that report confidence values alongside their predictions.
To mitigate the preceding risks, ML developers should work closely with software engineers to ensure that public-facing systems built on top of the ML are robust against malicious attacks. Recent trends in ML have discussed the development of new engineering approaches such as federated learning to foster privacy, data protection and security [
33]. Federated learning allows devices to learn a shared prediction model collaboratively while keeping the training data securely on each user’s own device.
4.7 ML Monitoring
Once a model is deployed, the focus on serving data should continue. At this stage, the work shifts to monitoring the properties of incoming data and ensuring that they are contextually similar to the data that the model was trained on. Polyzotis et al. [
57] and Schelter et al. [
65] propose analyses that can be used to detect training-serving skew in pre-defined variables. However, others note the difficulty in trying to establish which columns must be inspected, and what the required thresholds should be [
63]. Chen et al. [
19] suggest that the thresholds can be based on the expected distribution of a targeted population for relevant features (e.g., the usage frequency of a phrase, or the number of individuals with a particular skin tone).
Some monitoring activities can be automated and communicated to the users of ML systems via alerts. This may include data integrity checks, anomaly detection and performance metrics [
52]. The system can also be designed to gather further data about misuse or outliers while the model operates in the real world, providing DevOps engineers and ML developers with more information for mitigating security and performance issues in subsequent versions of the model.
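A sketch of such an automated check is shown below: each numeric serving-data column is compared against its training counterpart with a two-sample Kolmogorov–Smirnov test (using SciPy), and an alert is printed when the difference is significant; the features, threshold and simulated drift are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def detect_skew(train: pd.DataFrame, serving: pd.DataFrame, p_threshold: float = 0.01) -> list:
    """Flag numeric columns whose serving distribution differs significantly from training."""
    alerts = []
    for col in train.select_dtypes(include="number").columns:
        stat, p_value = ks_2samp(train[col].dropna(), serving[col].dropna())
        if p_value < p_threshold:
            alerts.append((col, round(stat, 3), p_value))
    return alerts

# Hypothetical example: the "age" feature drifts upwards in production.
rng = np.random.default_rng(0)
train = pd.DataFrame({"age": rng.normal(35, 8, 5000), "income": rng.normal(40e3, 9e3, 5000)})
serving = pd.DataFrame({"age": rng.normal(45, 8, 1000), "income": rng.normal(40e3, 9e3, 1000)})

for column, statistic, p in detect_skew(train, serving):
    print(f"ALERT: distribution shift detected in '{column}' (KS={statistic}, p={p:.2g})")
```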
4.8 Challenges for Stakeholders
In this final section of our results, we summarise the data quality requirements that matter within specific stakeholder roles. Knowledge of relevant responsibilities can help practitioners to understand and resolve the data quality issues that are within their capacity, and to articulate their own requirements to relevant colleagues:
•
Subject matter experts are typically consulted during the early stages of defining the dataset use case and design. These experts can advise on which data features are relevant in their domain of expertise, and the anticipated characteristics of the end users (e.g., in terms of demographic, cultural or environmental traits). Discussions of this kind should help the data collector to assess how much supervision and domain expertise would be needed to collect, label and document the dataset.
•
Data engineers and software engineers may be involved in different stages of the ML development pipeline. During the initial stages of dataset design and collection, they may be asked to select or build systems for data storage, access, transformation and linking. During the stage of ML deployment, their role may shift to building a user-facing ML system that is secure against attacks or unauthorised access while simultaneously being transparent and user-friendly. Other responsibilities may include building systems for monitoring incoming data and generating alerts if they do not meet a pre-defined set of criteria.
•
Data managers work with the data validation and maintenance stage of the ML pipeline. Their role is to collaborate with other stakeholders to ensure that a dataset is clean, contextually relevant, well documented and accessible in the right way. Important responsibilities include determining the policies and designated owners of the dataset and ensuring that it is protected from unauthorised access where necessary. Data managers must also take responsibility for putting together relevant documentation about how the dataset was collected, as well as its naming conventions, purpose and version history.
•
Data analysts and data scientists are involved during the stages of data validation and ML building. After data are collected, they are likely to carry out pre-processing tasks such as feature selection, deduplication, removal of outliers, consistency checking, anonymisation and imputation of missing values. Data analysts may also be required to inspect the dataset to identify potential biases, protected characteristics or stereotype-aligned correlations. When it comes to building ML models, data scientists may be tasked with selecting or creating new features, enriching and/or re-sampling the dataset. In every task, it is important that the practitioner records the sequence of actions they perform on the data, and the properties and limitations they may discover about the dataset.
•
ML engineers are mostly involved with the ML building and testing stages, and they are likely to collaborate closely with data scientists whose role is to validate and prepare the dataset. ML engineers will make decisions about which data features to use in the model, how to split the training and evaluation data, and whether to build a new model or re-use an existing one. They may need to consult with subject matter experts to establish which performance criteria should be prioritised and the different contexts in which the model needs to be tested.
•
DevOps engineers work with ML engineers and software developers to oversee the ML system once it has been deployed. Their responsibilities include monitoring the properties of incoming and outgoing data to make sure that the system is operating reliably. These responsibilities may be subsumed by ML engineers in the absence of DevOps staff.
An important caveat we would like to restate is that our findings are not exhaustive, and they capture only a small selection of recurring themes that came up during our review. We therefore encourage readers to remain open to other data quality requirements that may matter to them and their colleagues, bearing in mind that these may not have been covered here.
5 Discussion
Our work provides a literature review of data quality requirements that matter during ML development. We find that these requirements can be broadly accommodated within the data quality frameworks traditionally endorsed by data management research, including routines for data collection, processing and documentation. What is unique about the experience of ML practitioners is that their data quality requirements and corresponding tasks are disaggregated across different stages of the ML development pipeline.
Each stage of ML development embodies a new purpose with its own data uses and quality requirements, meaning that the traditionally accepted definition of data quality as “fitness for use” should not be viewed as a singular outcome. Instead, data quality must be defined using stage-specific approaches that are sensitive to where in the ML lifecycle the data are encountered and who encounters them [
25,
57]. Because of this, the four traditionally used categories of data quality (intrinsic, contextual, representational and accessibility) must be addressed differently at different stages of the ML development pipeline, as we will discuss in the following.
Requirements around intrinsic data quality may initially be targeted to the data collection stage, where careful monitoring and human-in-the-loop methods can support the acquisition of data that are accurate, reliably sourced and clean from the outset. Once the data have been collected, the requirements may shift to removing any remaining inconsistencies and redundancies as part of general data maintenance. When it is time for the dataset to be used to train an ML model, the intrinsic requirements will include determining an appropriate level of dimensionality and ensuring the completeness of relevant features. During the later stages of ML deployment, the tasks of intrinsic data quality shift from working with training data to the preparation of serving data received from the outside world.
With contextual data quality, the authors in our review highlighted the importance of understanding the anticipated ML use case and characteristics of the end users before data collection. This understanding is needed to design the data collection process to gather data that adequately reflect their purpose and the environment in which the trained ML will be deployed. Other contextual requirements during data collection, especially in sociocultural contexts, relate to compliance with ethical and inclusivity guidelines. After the data are collected, their contextual integrity must be evaluated and, where necessary, improved through the curation of additional data. When approaching the early stages of ML development, the contextual fit of the dataset may be improved through steps such as feature selection and re-sampling of the training distribution. Once the ML model is built, requirements around performance can be assessed using benchmarked evaluation in different contexts. After the model is deployed, the data requirements shift to monitoring the quality of serving data in terms of their distribution and features, to ensure that they align with data characteristics upon which the model was trained.
A large part of the representational aspect of data quality involves documenting how the preceding requirements were met. In the earlier stages of dataset development, documentation should focus on the user requirements and dataset design, followed by summaries of the dataset collection process, cleaning, maintenance and evaluation steps. Other representational requirements that may arise during the data collection stage relate to the standards used to capture data, as well as the quality of user interfaces for data collection and exploration.
Lastly, requirements around accessibility include the quality and security of infrastructure used for data storage, access and maintenance, which must be in place before the data are available to develop ML models. This can be supported by institutional frameworks and guidelines on consent, transparency and privacy of datasets. When it comes to data security, the later stages of model development require thorough testing and monitoring processes to mitigate against adversarial attacks that could poison training data or expose private datasets.
From the preceding summary of the intrinsic, contextual, representational and accessibility requirements of ML datasets, we see that the responsibility for managing data quality is distributed across various stakeholders. This includes subject matter experts, data analysts, software engineers, ML engineers and DevOps specialists (or site reliability engineers). Distinguishing between these different classes of users is necessary if we are to understand the radically different backgrounds and tasks that are needed to keep ML data quality pipelines running smoothly [
57]. In Figures
4 and
5 of Section
4, we presented illustrated examples of this complex web of relationships and the nature of their interactions with datasets. At the intersection of dataset development and ML pipelines, we came across several synergies and tensions that have implications for data quality but have been less explicit in previous data quality frameworks. These are as follows:
•
Ethical and legal requirements: Numerous articles in our review commented on ethical issues such as working with sensitive data, the impact of data-driven decisions on human life and potential security risks. Rather than being a distinct and temporally constrained task, we observed that these requirements transcend different stages of the ML lifecycle. This is in line with the observation made by Gebru et al. [
23] that the best way to elicit information about ethical and legal compliance is by requiring practitioners to document specific stages of the dataset development process.
•
Amount of data: Early advances in ML were motivated and, in some cases, enabled by the availability of big datasets, and big data remain necessary in many ML applications such as autonomous vehicles and clinical diagnosis [
33,
75]. However, numerous researchers in our review highlighted that bigger datasets are not always better. Earlier trends of opportunistic data collection and post hoc justifications of large datasets are gradually moving towards a requirement for more deliberative data collection methods [
35,
36], sampling techniques [
16] and minimal data architectures [
57] to deliver better performance without reducing model accuracy.
•
Representational standards: Adherence to common standards and metadata already has a long history in traditional data management literature. However, ML applications that are built on social and cultural data require practitioners to reconcile different vocabularies and unique ways of perceiving the world with the need for standardised and homogeneous datasets to be fed into ML systems [
36]. This requirement for contextual sensitivity is being met by the growing use of semantic standards that use ontologies and annotate data with graph-like properties [
33].
•
Software requirements: Software quality can impact data quality in several ways. Software infrastructures may determine how data are structured and collected, how access to datasets is granted, and how the dataset is presented for exploration by prospective users (e.g., via visualisations or dashboards). When ML models are integrated into client-side applications, software developers need to ensure that model training and serving data are protected against adversarial attacks, and that they do not inadvertently expose any personal or commercially sensitive data [
33].
•
Documentation: Rather than being a post hoc activity that accompanies completed datasets, the authors in our review viewed documentation as a pre-emptive activity that should span the entire ML development lifecycle. The stages of dataset design, collection, ML training and testing should each yield documents that can support communication and decision making between successive stakeholders [
35,
70]. This is especially valuable in larger organisations where the data and ML activities are separated across teams, or where they are vulnerable to information loss due to staff handover.
To show how the preceding implications map onto the four traditional data quality dimensions that were discussed earlier, we summarise them in Table
4.
Many of the processes described previously span across multiple stakeholders, whose ability to self-organise into a robust data quality workflow will require the support of higher-level institutional structures. Part of that is about providing incentives to individuals and organisations [
23,
53]. At present, the field of ML suffers from the devaluation of data work, with model development tasks being held in higher esteem than data quality processes [
35,
64]. In response to this, authors have advocated for the professionalisation of data work as a means to promote best practices in data management and accountability. Practical approaches to this include establishing membership organisations and review panels with standardised codes of conduct [
36]. Participation in these schemes will impose greater costs, which may be felt particularly strongly by smaller stakeholders such as startups and SMEs. For this reason, policymakers could explore solutions for achieving economies of scale through consortia and trusts
9 that pool the resources needed by practitioners to produce good quality data.
Besides institutional change as a long-term strategy for improving data quality, it is equally important to consider actionable steps that can be taken in the shorter term by individuals and teams wishing to improve their practices. Reviewing the complete range of data quality enhancement tools and protocols goes beyond the scope of this article, but several examples of such tools were encountered during our review. In the sphere of documentation, there exist various checklists, such as those for reporting crowdsourcing experiments [
58] and model reproducibility [
55], as well as datasheets [
23], cover sheets on employment practices [
62], data nutrition labels [
32], model cards [
48], notebooks [
72] and explainability toolkits.
10 When it comes to mitigating the risks of ML models through data, readers may be interested in ethics assurance tools such as AI Fairness 360,
11 legal and ethical checklists for NLP [
60], and verification tools for streaming and serving data [
21,
65]. Lastly, for readers who are interested in sharing ML datasets and the models built upon them, repositories hosted by CodeOcean, GitHub, Zenodo and HuggingFace
12 can serve as good candidates. We encourage interested readers to investigate the relevance, advantages and drawbacks of these tools in their specific use case.
5.1 Relevance to Use Cases
Earlier in this article, we noted that data quality frameworks and standards present practitioners with dozens of possible criteria to comply with. These are accompanied by a growing range of tools for data pre-processing, documentation and assurance. It is neither feasible nor necessary to meet all of these requirements. Previous studies that explored the application of data quality standards found that practitioners benefit from seeing examples of data quality requirements, as this helps to clarify their own needs [
24]. It was also found that there is value in simplified data quality frameworks that are tailored to specific use cases or technologies [
41].
Our review sought to assist ML practitioners who are trying to define their data quality requirements. Firstly, we synthesised previous literature to illustrate the common data quality requirements that can exist in ML. Secondly, by mapping these requirements to different stages of the ML pipeline, we provide a way for readers to see the requirements that are likely to precede and follow their specific task, and to discern which data quality outcomes to focus on in their role. This type of clarity is needed to prevent the diffusion of responsibility and to ensure that every stakeholder is proactive in mitigating the data quality issues that are within their capacity.
Besides individuals who work directly with data, we anticipate that our review will be useful to coordinators of data innovation projects that involve multiple stakeholders. Our own experience of this includes a series of projects that emerged from a Public Private Partnership (PPP) between the European Commission and the Big Data Value Association (BDVA). These projects included the
European Data Incubator (EDI),
EuRopEAn incubator for trusted and secure data value CHains (REACH) and
EUHubs4Data. Their goal was to facilitate data-driven innovation in startups and SMEs through collaboration between data providers, data users, business coaches and legal experts assembled from different geographic regions. The review provided in this article can help managers of similar initiatives to understand the data quality requirements of colleagues who are responsible for different parts of the data value chain, and to signpost participants to resources that will support their data quality practice and documentation.