This survey looked at case studies from a variety of industries: computer networks, manufacturing, space exploration, law enforcement, banking, and more. However, further growth of ML adoption can be severely hindered by a poor deployment experience. To make ML deployment scalable and accessible to every business that may benefit from it, it is important to understand the most critical pain points and to provide tools, services, and best practices that address them. We see this survey as an initial step in this direction: by recognizing the most common challenges currently being reported, we hope to foster an active discussion within the academic community about what possible solutions might be. We classify possible research avenues for solutions into two categories, which we discuss below. We also give some concrete examples, but, since the purpose of this section is illustrative, we do not aim to provide a complete survey of ML tools and development approaches.
8.1 Tools and Services
The market for machine learning tools and services is experiencing rapid growth [124]. As a result, tools for individual deployment problems are continuously developed and released, and some of the problems we have highlighted can therefore be solved with the right tool.
For example, this is most likely the case for the operational maintenance of ML models, discussed in Sections 6.2 and 6.3. Many platforms on the market offer an end-to-end experience for the user, taking care of such things as data storage, retraining, and deployment. Examples include AWS SageMaker [125], Microsoft ML [126], Uber Michelangelo [127], TensorFlow TFX [128], MLflow [129], and more. A typical ML platform would include, among other features, a data storage facility, model hosting with APIs for training and inference operations, a set of common metrics to monitor model health, and an interface to accept custom changes from the user. By offering managed infrastructure and a range of out-of-the-box implementations for common tasks, such platforms greatly reduce the operational burden associated with maintaining ML models in production.
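As a brief illustration, the sketch below logs a trained model together with its hyper-parameters and a health metric using MLflow [129]; the dataset, metric names, and model are hypothetical, and exact API details vary across MLflow versions and across platforms.

```python
# Minimal sketch: logging a trained model and its metrics with MLflow.
# Dataset, metric names, and model choice are illustrative only.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Track hyper-parameters and a monitoring metric alongside the model artifact.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("test_accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")  # stored for later serving or retraining
```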
Quality assurance, the focus of Section 5, also looks to be an area where better tools can be of much assistance. Models can greatly benefit from the development of a test suite to verify their behavior, and the community actively develops tools for that purpose. Jenga [130] evaluates a model's robustness against errors in the data, which occur very commonly in practice, as mentioned in Section 3. The CheckList methodology [131] provides a formal approach to assessing the quality of NLP models. The Data Linter [132] inspects ML datasets to identify potential issues in the data.
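To give a flavour of the checks such tools automate, the following sketch runs a few generic sanity checks over a training dataset; the column names and thresholds are hypothetical, and the code is not tied to any of the libraries listed above.

```python
# Minimal sketch of pre-training data sanity checks of the kind automated by
# data-validation tools. Column names and thresholds are hypothetical.
import pandas as pd

def check_dataset(df: pd.DataFrame, label_column: str) -> list[str]:
    issues = []
    # Missing values often break or silently bias downstream training.
    for column, fraction in df.isnull().mean().items():
        if fraction > 0.05:
            issues.append(f"{column}: {fraction:.1%} missing values")
    # Duplicate rows inflate apparent dataset size and can leak across splits.
    if df.duplicated().any():
        issues.append(f"{df.duplicated().sum()} duplicated rows")
    # A heavily skewed label distribution may require rebalancing.
    label_share = df[label_column].value_counts(normalize=True)
    if label_share.max() > 0.95:
        issues.append(f"label '{label_share.idxmax()}' covers {label_share.max():.1%} of rows")
    return issues

df = pd.DataFrame({"amount": [10.0, None, 12.5, 12.5], "label": [0, 0, 1, 0]})
print(check_dataset(df, label_column="label"))  # -> ['amount: 25.0% missing values']
```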
As discussed in Section 3.3, obtaining labels is often a problem with real-world data. Weak supervision has emerged as a separate field of ML that looks for ways to address this challenge. Consequently, a number of weak supervision libraries are now actively used within the community and show promising results in industrial applications. Some of the most popular tools include Snorkel [133], Snuba [134], and cleanlab [135].
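As a brief illustration of the weak supervision workflow popularized by Snorkel [133], the sketch below defines two heuristic labeling functions and combines their noisy votes with a label model; the data and heuristics are invented, and real pipelines use many more labeling functions.

```python
# Minimal sketch of weak supervision in the style of Snorkel: heuristic labeling
# functions vote on unlabelled examples and a label model combines the noisy,
# conflicting votes into training labels. Data and heuristics are invented.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SPAM, HAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_long_message(x):
    return HAM if len(x.text.split()) > 20 else ABSTAIN

df_unlabelled = pd.DataFrame({"text": [
    "Win a prize now http://spam.example",
    "Cheap meds available http://pills.example order today",
    "Hi team, attached are the meeting notes from today covering the roadmap, "
    "budget review and the hiring plan for the next two quarters in some detail.",
    "Thanks for the update, I will review the proposal and send my comments on "
    "the shared document before our call tomorrow afternoon as agreed with everyone.",
    "Limited offer click http://deals.example",
    "Reminder: lunch at noon",
]})

# Apply the labeling functions and fit the generative label model over their votes.
applier = PandasLFApplier(lfs=[lf_contains_link, lf_long_message])
L_train = applier.apply(df_unlabelled)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100)
print(label_model.predict(L_train))  # noisy labels for downstream model training
```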
The growing field of AutoML [136] aims to address challenges around model selection and hyper-parameter tuning, discussed in Sections 4.1 and 4.3. There is a large variety of tools that provide general-purpose implementations of AutoML algorithms, such as Auto-keras [137], Auto-sklearn [138], or TPOT [139]. However, practical reports of applying AutoML to real-world problems indicate that practitioners need to exercise extreme caution, as AutoML methods might not yet be ready for decision-making in high-stakes areas [140, 141].
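For illustration, the sketch below runs a small AutoML search following the classic TPOT interface [139]; the dataset and search budget are purely illustrative, and exact arguments differ across library versions.

```python
# Minimal sketch of AutoML with TPOT: an evolutionary search over scikit-learn
# pipelines and hyper-parameters. The dataset and search budget are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small search budget for illustration; real runs use far larger budgets.
automl = TPOTClassifier(generations=5, population_size=20, random_state=0, verbosity=2)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))

# Export the best pipeline as plain scikit-learn code for review before deployment.
automl.export("best_pipeline.py")
```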
Given the potential damage an unnoticed dataset shift can cause to the quality of predictions of deployed ML models (see Section 6.3), many techniques have been developed to detect and mitigate its effects. A large variety of methods based on dimensionality reduction and statistical hypothesis testing is reviewed by Rabanser et al. [142], and many are now implemented in software libraries (e.g., Alibi Detect [143]) and services (e.g., Azure ML and AWS SageMaker), available for use in a deployment's monitoring suite. The community has also made strides in developing strategies for dealing with the shift once it is detected. We refer interested readers to the works on using domain adaptation [144], meta-learning [145], and transfer learning [146, 147] as ways of addressing dataset shift.
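For intuition, the sketch below implements the simple univariate recipe discussed by Rabanser et al. [142]: each feature of a reference window is compared against a recent production window using a two-sample Kolmogorov-Smirnov test. The data, window sizes, and significance threshold are synthetic assumptions.

```python
# Minimal sketch of univariate dataset-shift detection: a two-sample
# Kolmogorov-Smirnov test per feature with a Bonferroni-corrected threshold.
# The reference and production windows here are synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=(5_000, 3))   # training-time data
production = rng.normal(loc=0.3, scale=1.0, size=(1_000, 3))  # recent live data (shifted)

alpha = 0.05 / reference.shape[1]  # Bonferroni correction over features
for feature in range(reference.shape[1]):
    statistic, p_value = ks_2samp(reference[:, feature], production[:, feature])
    print(f"feature {feature}: KS={statistic:.3f}, p={p_value:.4f}, drift={p_value < alpha}")
```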
Using specific tools to solve individual problems is a straightforward approach. However, practitioners need to be aware that by using a particular tool they introduce an additional dependency into their solution. While a single additional dependency seems manageable, the number of dependencies can quickly grow and become a maintenance burden. Moreover, as mentioned above, new tools for ML are released constantly, presenting practitioners with the challenge of choosing the right tool, which requires learning each candidate's strengths and shortcomings.
8.2 Holistic Approaches
Even though ML deployments require software development, ML projects are fundamentally different from traditional software engineering projects. The main differences arise from unique activities such as data discovery, dataset preparation, model training, deployment success measurement, and so on. Some of these activities cannot be defined precisely enough to allow a reliable time estimate (as discussed in Section 3.1 in regard to data collection), some require a different style of project management (Section 6.1 discusses the challenges of managing a mixed team of engineers and scientists), and some make it difficult to measure the overall added value of the project (see Section 5.1 for a discussion of translating ML model performance into business value). For these reasons, ML deployment projects often do not lend themselves well to widespread software engineering management paradigms, nor to common software architectural patterns [148].
Compared to classical software engineering (SE), ML introduces unique artifacts with unique characteristics: datasets and models. Unlike program source and configuration files, these artifacts are not distributed as program code and exist in the form of tabular or binary files. Regular SE tools for common operations such as source control, branching and merging, and review cannot be applied to these new artifacts "as is" [149]. Consequently, it is essential to develop documentation approaches that are best suited to these artifacts. There is a growing body of literature that aims to adapt existing practices, or develop new ones, for handling datasets and models in a coherent and reliable way. Lawrence [150] proposes an approach to classifying the readiness of data for ML tasks, which the Royal Society DELVE initiative applied in one of its reports in the context of the COVID-19 pandemic [20]. Gebru et al. suggested "datasheets for datasets" [151] to document a dataset's motivation, composition, collection process, and intended purposes.
Data Version Control (DVC) is an open-source project that aims to provide a Git-like source control experience for datasets [152]. For models, Mitchell et al. [153] proposed model cards: short documents that accompany trained models and detail their performance, intended use context, and other information relevant to the model's application.
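To make the idea concrete, a model card can be maintained as a small structured record versioned alongside the model itself; the sketch below paraphrases the fields proposed by Mitchell et al. [153], and all values are hypothetical.

```python
# Minimal sketch of a model card kept as a structured, versioned record.
# Field names paraphrase Mitchell et al.; all values are hypothetical.
import json

model_card = {
    "model_details": {"name": "churn-classifier", "version": "1.3.0", "owners": ["ml-team"]},
    "intended_use": {
        "primary_use": "rank accounts by churn risk for retention outreach",
        "out_of_scope": "automated account termination decisions",
    },
    "factors": ["customer region", "account age"],
    "metrics": {"auc_overall": 0.87, "auc_by_region": {"EU": 0.88, "US": 0.85}},
    "training_data": "internal CRM snapshot, 2023-01 to 2023-12",
    "ethical_considerations": "risk of uneven error rates across regions",
    "caveats_and_recommendations": "re-evaluate quarterly; do not reuse outside retention workflows",
}

# Store next to the model artifact so documentation is versioned with the model.
with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```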
Data-Oriented Architecture (DOA) [154, 155, 156] is an example of an idea that suggests rethinking how things are normally approached in software development, and by doing so promises to solve many of the issues we have discussed in this survey. Specifically, the idea behind DOA is to consider replacing the micro-service architecture, widespread in current enterprise systems, with dataflow-based architectures, thus making the data flowing between elements of business logic more explicit and accessible. Micro-service architectures have been successful in supporting high scalability and embracing the single responsibility principle. However, they also make dataflows hard to trace, and it is up to the owners of every individual service to make sure inputs and outputs are stored in a consistent form (these issues are also discussed in Section 3.1). DOA addresses this problem by moving data to streams flowing between stateless execution nodes, making data available and traceable by design and thereby simplifying the tasks of data discovery, collection, and labeling. In essence, DOA proposes to acknowledge that modern systems are often data-driven and therefore need to prioritize data in their architectural principles.
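As a toy illustration of this dataflow style, the sketch below wires two stateless processing nodes together through explicit streams; in a real data-oriented system these would be durable, shared streams (e.g., a log-based message broker), and all names and values are invented.

```python
# Toy sketch of a dataflow-style pipeline: stateless nodes read from and write
# to explicit streams, so every intermediate dataset is observable by design.
# In a real data-oriented architecture the streams would be durable and shared
# (e.g., a log-based message broker); all names and values here are invented.
from queue import Queue

raw_events = Queue()       # stream of raw input records
enriched_events = Queue()  # stream of records after feature enrichment

def enrich(record: dict) -> dict:
    # Stateless transformation: the output depends only on the input record.
    return {**record, "amount_usd": record["amount"] * record["fx_rate"]}

def score(record: dict) -> dict:
    # Stand-in for model inference; also stateless.
    return {**record, "fraud_score": min(1.0, record["amount_usd"] / 10_000)}

# Producers append to streams; downstream consumers (monitoring, labeling,
# retraining jobs) can all tap the same streams without bespoke integrations.
raw_events.put({"amount": 250.0, "fx_rate": 1.1})
while not raw_events.empty():
    enriched = enrich(raw_events.get())
    enriched_events.put(enriched)
    print(score(enriched))
```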
As noted above, ML projects normally do not fit well with commonly used management processes such as Scrum or Waterfall. Therefore, it makes sense to consider processes tailored specifically for ML. One such attempt is made by Lavin et al. [157], who propose the Machine Learning Technology Readiness Levels (MLTRL) framework. MLTRL describes a process for producing robust ML systems that takes into account the key differences between ML and traditional software engineering, with a specific focus on the quality of the intermediate outcome of each stage of the project. As we discussed in Section 5, verification is an area of ML deployment that suffers from a lack of standard practices, and in that context MLTRL suggests a possible way to define such standards.
A very widespread practice in software engineering is to define a set of guidelines and best practices to help developers make decisions at various stages of the development process. These guidelines can cover a wide range of questions, from variable names to execution environment setup. For example, Zinkevich [158] compiled a collection of best practices for machine learning used at Google. While this cannot be viewed as a coherent paradigm for doing ML deployment, the document gives practical advice on a variety of important aspects, drawn from the real-life experiences of engineers and researchers in the company. Among others, it describes rules and suggestions for such important deployment topics as monitoring (discussed in Section 6.2), end-user experience (Section 7.3), and infrastructure (Section 6.1).
Besides serving as a collection of advice for common problems, guidelines can also be used as a way to unify approaches to deploying ML within a single area. The Association of German Engineers (VDI) has released a series of guidelines on various aspects of big data applications in the manufacturing industry [159]. These documents cover a wide range of subjects, including data quality, modeling, user interfaces, and more. The series aims to harmonize the technologies available in the industry and to facilitate cooperation and implementation. Such initiatives can help bridge the gap between ML solutions and the regulations of a particular applied area, discussed in Section 7.2 of this survey.
Holistic approaches are created with ML applications in mind, and therefore they have the potential to make deploying ML significantly easier. However, it should be noted that all such approaches require a substantial time investment, because they represent significant changes to current norms in project management and development. Therefore, a careful assessment of risks versus benefits should be carried out before adopting any of them.