Sasu Mäkinen
Faculty of Science
University of Helsinki
Supervisors: Prof. T. Mikkonen, Prof. J. K. Nurminen
Examiners: Prof. T. Mikkonen, Prof. J. K. Nurminen
Contact information
Sasu Mäkinen
Abstract
Deploying machine learning models is found to be a massive issue in the field. DevOps and Continuous Integration and Continuous Delivery (CI/CD) have proven to streamline and accelerate deployments in the field of software development. Creating CI/CD pipelines for software that includes elements of Machine Learning (MLOps) has unique problems, and trail-blazers in the field solve them with the use of proprietary tooling, often offered by cloud providers.

In this thesis, we describe the elements of MLOps. We study what the requirements are for automating the CI/CD of Machine Learning systems in the MLOps methodology. We study whether it is feasible to create a state-of-the-art MLOps pipeline with existing open-source and cloud-native tooling in a cloud provider agnostic way.

We designed an extendable and cloud-native pipeline covering most of the CI/CD needs of a Machine Learning system. We motivated why Machine Learning systems should be included in the DevOps methodology. We studied what unique challenges machine learning brings to CI/CD pipelines, production environments and monitoring. We analyzed the pipeline's design, architecture, and implementation details and its applicability and value to Machine Learning projects.

We evaluate our solution as a promising MLOps pipeline that manages to solve many issues of automating a reproducible Machine Learning project and its delivery to production. We designed it as a fully open-source solution that is relatively cloud provider agnostic. Configuring the pipeline to fit client needs is done with easy-to-use declarative configuration languages (YAML, JSON) that require minimal learning overhead.
Contents

1 Introduction
2 Research approach
   2.1 Problem identification and motivation
   2.2 Objectives for a solution
   2.3 Design and development
   2.4 Demonstration and evaluation
3 DevOps
   3.1 DevOps life-cycle
   3.2 Pipelines
   3.3 Benefits of the DevOps methodology
   3.4 Production environment
       3.4.1 Virtualization and Containers
       3.4.2 Container orchestration
       3.4.3 Kubernetes
4 MLOps
   4.1 Machine Learning systems process
       4.1.1 Extract-Transform-Load (ETL)
       4.1.2 Exploration
       4.1.3 Training pipeline
       4.1.4 Serving pipeline
   4.2 Elements of MLOps
       4.2.1 Artefacts
       4.2.2 Deployment and production monitoring
   4.3 MLOps platforms
5 Design & implementation
   5.1 GitOps
   5.2 Model training and serving integration
   5.3 Cluster architecture
       5.3.1 Artefact storage and version tracking
       5.3.2 Inference Server
       5.3.3 Workflow engine
       5.3.4 Networking and messaging
       5.3.5 Deployment Strategies
       5.3.6 Real-time monitoring and alerting
       5.3.7 Offline monitoring systems
6 Demonstration
   6.1 ETL
   6.2 Model training and evaluation
   6.3 Model serving and monitoring
7 Discussion
   7.1 Results
   7.2 Limitations
   7.3 Related work
8 Conclusions
Bibliography
1 Introduction

The process of model deployment is a difficult part of the machine learning project's life-cycle [65, 106]. There is proven success in achieving rapid development cycles by incorporating the DevOps methodology [97, 11, 32] and tooling for Continuous Integration and Continuous Delivery (CI/CD) into software development [34, 29].

Machine learning is often a small part of a software system, but software that incorporates machine learning has fallen behind the CI/CD trend. Machine learning systems introduce added complexity and unique problems to CI/CD pipelines [91]. Extending the DevOps methodology to include machine learning system features is widely referred to as MLOps [69].
Current popular products for MLOps are often proprietary and offered by cloud providers such as Amazon, Google, or Microsoft [2, 1, 10]. Proprietary solutions reduce the extendibility and transparency of a pipeline while enforcing heavy vendor lock-in. Cloud providers' services are often not a feasible solution for companies that work on regulated software devices or software with user-privacy concerns. They require an MLOps solution that can be run cloud-agnostically and on on-premises machines. Current end-to-end open-source and cloud-agnostic solutions for MLOps pipelines are often incomplete, immature, non-generalized, hard to learn and use, or not fit for production.
This thesis aims to design an MLOps pipeline for cloud-native applications that does not rely on proprietary software or a cloud provider's terms and conditions. We propose an open-source, easy-to-use MLOps pipeline that provides a Kubernetes-based, cloud-native and production-ready environment for CI/CD and monitoring of machine learning systems. The inferences of models generated by the pipeline are traceable to their data and parameters for governance purposes. The pipeline also provides an interface for model exploration with automatic tracking of data, parameters and metrics.
In Chapter 2, we introduce our research approach, research questions and design objectives for the MLOps pipeline. In Chapter 3, we define the DevOps methodology, describe CI/CD pipelines and their requirements, describe why DevOps and CI/CD are beneficial in software development, and define their implementation phases. We describe cloud-native technologies to understand and motivate their usage. In Chapter 4, we extend and compare the DevOps methodologies and CI/CD pipelines to machine learning systems to find the problems and requirements of an MLOps pipeline. Chapter 5 goes over the design and implementation details of our proposal, including an overview of the tools, methods and architecture of the pipeline. We describe how it implements the elements and phases of DevOps defined in earlier sections and satisfies the requirements of an end-to-end machine learning system's CI/CD pipeline. Chapter 6 goes through a demonstration of implementing a simple machine learning project's MLOps requirements on this MLOps pipeline. Chapter 7 contains discussion on limitations, future work, and related work. Chapter 8 ends the thesis with conclusions.
2 Research approach
The background information is gathered as a literature review. The design and implementation of an MLOps pipeline as the artefact is conducted as Design Science [80]. We apply Design Science because of the empirical need for, and applicability of, creating the artefact for evaluation as a solution to our identified problem in the domain of MLOps. The key phases of the applied Design Science research process are detailed in the following sections.
RQ2: How feasible is designing and implementing an MLOps pipeline with existing open-source cloud-native tooling?
To find answers to RQ1, we conduct a literature review of the background in the fields of machine learning and software engineering. We conduct a Design Science research process to build an artefact to answer RQ2 and RQ3. Further problem identification and motivation for the artefact design is conducted in Chapters 3 and 4.
2.2 Objectives for a solution

As objectives for the solution, we have three distinct sets of features that guide us towards a better artefact: must-have objectives, features that correlate with positive development performance, and features that solve problem-definition-specific issues. These objectives are reflected in every selected tool and in the whole artefact.
As must-have objectives for the MLOps pipeline we list the following features:
Obj1: Must provide the features of the MLOps methodology's CI/CD pipelines, on the basis of RQ1.
These features ensure that the solution is usable and extendable for its job in modern software infrastructure.
The Accelerate report [34] lists the features of a toolchain that correlate with positive development performance: how easy the toolchain is to use, how useful it is in accomplishing job-related goals, and whether it is open-source and customizable. From the report, we specify our set of objectives and considerations for the MLOps pipeline design and implementation as:
Obj7: Maturity of the tool – in this rapidly changing landscape, we want something
that is industry-ready.
Obj8: Level of integration with the runtime – we consider this for the interoperability of tools, monitoring, debugging and portability.
Obj9: Generality – we want every data scientist to be able to integrate into the
MLOps pipeline without significant added skill or knowledge, enabling focus
on business issues.
To summarize the above objectives, we want to leverage mature tools that integrate well into a cloud environment and that use general-purpose configuration languages like JSON and YAML, or general Python classes, functions and scripts, as their user-facing building blocks. We want a tool that requires minimal source code changes to any services or processes to integrate all the steps and pieces of the pipeline. We want the pipeline to be portable to any cloud platform, to on-premises machines, and to simulated cloud environments on local machines. This way, the pipeline can be used on many projects regardless of their cloud environment requirements.
2.3 Design and development

The design and development of the artefact are presented to show how the artefact works and how it is built (Chapter 5). The design follows the identified problems and objectives of the artefact. The design choices are rationalized in Appendix A.
3 DevOps

3.1 DevOps life-cycle

In practice, DevOps aims to use tooling and workflows to automate one or more of the phases of the DevOps life-cycle: coding, building, testing, releasing, configuring, and monitoring (Figure 3.1).
Coding phase includes development, code review and version control tools. For example, a team decides to use Git as a version control tool and GitHub as a remote repository. The team defines a set of automatically enforced code style guidelines and a test coverage percentage, decides on using the branching strategy of Trunk-Based Development [79], and enforces that all changes to the main branch are reviewed and accepted by a senior developer.
Building phase consists of the automatic creation and storing of artefacts. For example,
a team decides to create a runnable container image of their product.
Testing phase includes continuous testing tools. The team sets up an environment where an ever-increasing set of tests is automatically run against every code revision, giving crucial information on its quality. The tests can range from unit, integration, configuration and end-to-end testing to performance testing.
Releasing phase consists of the release strategy and its automation. The team has to decide how to get their product released. For example, they decide if the product is released internally as a staging release first, and if they are using A/B testing in production. They also have to decide what to do when something is wrong with the deployment, i.e. create their rollback strategy.
Configuring phase consists of automatic infrastructure configuration and management.
Best practices include declaring the production infrastructure as code [71, 47, 7], which
means a set of scripts to reproduce the software’s running environment and infrastructure
from operating systems to databases and specific services and their networking configura-
tion. The infrastructure can also be version controlled, tested, and deployed with the rest
of the code.
Monitoring phase ranges from product performance to end-user experience monitoring. For example, it can cover how long database queries or website loading take, how many users are using specific features of the product, how many visits to a website end up in a registration, or how many new users churn in a specific period of time. The monitoring phase also covers automatic alerting of crashes or exceeded system metric thresholds, e.g. a CPU utilization threshold. Production monitoring is essential to make sure the team is building the right product [23, 14].
3.2 Pipelines
To achieve CI/CD, developers create "pipelines", which are written manifestations of how to automatically build, test, release and configure a software release [52]. The pipeline has an event that triggers its steps in a sequential fashion. If any step fails, it stops the pipeline from advancing further and gives feedback to the developers. In the context of software development, the triggering event is often a new code revision. When each revision is continuously tested and proven to be releasable, the codebase stays in a working state, which means that every single code revision made can lead up to a new release of the software. CI/CD supports the practice of regular, small commits, and even dozens of releases per day.
Without CI/CD there is a big risk of long-lived branches of broken code that require integration into the software, blocking the entire development team from delivering other features and resulting in so-called "integration hell", a situation where delivery is halted for a long period of time. With robust CI/CD there is no added cost of integration and delivery when implementing features, so the software development team can stay agile and make high-level decisions in short-term cycles, e.g. change their requirements, re-scope their product, and do experimental feature research.
Table 3.1: Examples of tools for specific phases of CI/CD pipeline automation.
Phase Tools
Build XCode [114], Docker [24], Gradle [39].
Test pytest [85], Cypress [50], Mockito [70].
Release Jenkins [51], Spinnaker [98], Flux [31].
Configure Terraform [101], Ansible [45], CloudFormation [9].
Monitor New Relic [73], Sentry [94], Prometheus [83].
In practice, creating CI/CD pipelines involves various tools that are designed to solve a specific task or subset of tasks in a specific context of the DevOps life-cycle, e.g. the building phase of a website. Examples of tools can be seen in Table 3.1. These tools are often interchangeable, and a resulting set of tools is often called a toolchain. The CI/CD pipeline is a written artefact running the selected toolchain on machines or in the cloud. The pipelines can be written to run on hosted services, e.g. GitHub Actions, CircleCI, or Travis [27, 20, 104]. There are also tools to run CI/CD pipelines on your own machines, e.g. Jenkins or ArgoCD [51, 5]. These tools provide a configuration language to run execution steps and tools as a pipeline. It is not uncommon to see the pipeline split across multiple instruction-running machines that each handle a subset of the process, e.g. building and testing happening in GitHub Actions and the release on a local machine.
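To illustrate the shape of such a configuration, the following is a minimal sketch of a hosted pipeline written as a hypothetical GitHub Actions workflow (the job name, test command and image name are assumptions, not part of our artefact); each step runs in sequence, and a failing step stops the run and reports back to the developers:

name: ci
on:
  push:
    branches: [main]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run unit tests
        run: make test                                # assumed test command
      - name: Build container image
        run: docker build -t example/app:${{ github.sha }} .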
Key elements to consider in a CI/CD implementation are [52, 36]:
• Rollback capability,
• Security,
• Time to production.
For every single release, there needs to be a strategy to roll back the release to a previous version if something goes wrong. An easy solution for rollback could be running the older version through the same CI/CD pipeline. Rollbacks are not always straightforward; even if the running service can be rolled back, there are also possible data migrations and the downtime of deploying a new version to consider.
Each release should be transparent regarding what has changed and who authorized the change, and it should be alerted or observable to the development team. The team needs to know if and when the deployment was successful. There should be clear alerts of broken code in the
pipeline and possible failures in the release. If something goes wrong with the deployment,
and there is no clear audit trail of changes, it can be tough to know where to rollback
the system and what is causing issues in the deployment. Imagine a situation where a
developer manually connects to a production machine and accidentally deletes a key file
from the filesystem, which causes system failures after a few days. There is no auditable
trace of changes, no one knows where to rollback, and even a rollback in the code might
not help if it does not reproduce the missing file – a recipe for complete mayhem.
Security considerations for CI/CD pipelines mostly concern managing who can authorize deployments and how to manage secrets in the pipelines. Limiting developers' access is critical for both security and observability – all production changes go through the CI/CD pipelines. Giving third-party CI/CD tools access to a system and its secrets can leave the system vulnerable in case of third-party security breaches.
Time to production is essential for a company to stay agile. Long-lasting tests and builds
can block other releases resulting in bigger commits, merge conflicts, and piling issues to
be cleared [52].
In a machine learning context, the pipeline's steps are extended to handle the life-cycle of the model and data, but the elements, benefits, and objectives stay the same.

3.3 Benefits of the DevOps methodology

Organizational performance in this context is measured with factors such as:

• IT performance – how long it takes for a system to be restored from a defective version or from bugs, how long it takes for a written feature to end up in production, and how often the organization deploys code.

• Burnout – how many developers, and how often, feel burned out, exhausted or indifferent to their work.
There are several reasons why DevOps has been observed to be useful for developers and organizations in software projects.

While learning DevOps practices is hard and time-consuming [56], DevOps improves organizational performance factors and saves money and resources in the long run through trust in and governance of software quality. Characteristics of quality software include understandability, completeness, conciseness, portability, consistency, maintainability, testability, usability, reliability, structuredness, and efficiency [13]. A software deployment pipeline can automatically test the software development team's code revisions in version control against these software quality characteristics and have them released to the production environment. Eliminating a series of operations traditionally done manually saves minutes, hours or even days of manual labour for each software release [34].
DevOps mitigates change failures and reduces deployment pain. A software deployment pipeline with automatic tests forces all code changes to run through the same set of tests, with possibly new tests introduced for the current and future revisions. Infrastructure as code, the practice of writing the infrastructure either as scripts or in a declarative fashion, increases the truck factor of the production environment, as everything running on the machine or server is transparently defined and reproducible. The truck factor is a commonly used measurement of how many key personnel can be hit by a truck or otherwise disappear from a software development team before the team becomes unable to deliver updates to the system because of lack of knowledge [8].
DevOps brings happiness. The top listed reasons for unhappiness in developers' work are time pressure, low-quality code, mundane or repetitive tasks, broken code, and under-performing colleagues [43]. These are all issues DevOps aims to solve. Developer happiness is a crucial factor in developer productivity [42].

DevOps practices create a faster feedback loop. With robust monitoring systems, development teams can respond to defects, alerts or other metrics rapidly and thus improve code quality or better target business goals.
Incorporating DevOps into the machine learning context, we hope to see similar benefits
as in traditional software development.
3.4 Production environment

Managing a production environment is challenging, and even more so with software that has special hardware requirements, such as machine learning systems. The production environment is the set of hardware and software of a service that the end-users are directly or indirectly in contact with. The environment includes physical computer components, operating systems, all installed software on the machine, network configuration, and even the state of the software and hardware, e.g. used disk space or cached data.

The environment needs to be fail-safe and reproducible. To reduce the risk of new features of a system not working in the production environment, even though they might work in a developer's own environment, it is common practice to run a staging environment, which is supposed to mimic the production environment as closely as possible. Some teams even use development environments that mimic the production environment to have hands-on tests and exploration of new features before committing the features to the staging, and eventually the production, environment. Creating a close replication of an environment is hard, but much success has been found by abstracting layers of the environment with virtualization and with a declarative infrastructure-as-code definition of infrastructure and installed software.
3.4.3 Kubernetes
Kubernetes is the de facto industry standard for container orchestration in the cloud-native space [18]. Kubernetes is an open-source platform to run containerized services as a cluster, providing several benefits and features for running reliable and portable applications. Kubernetes provides service discovery and load balancing, exposing a service's domain name system (DNS) name and automatically distributing traffic to multiple replicas of the same service to keep the traffic stable. Kubernetes can automatically scale the replication of services based on CPU load [112].
Kubernetes can automatically roll out or roll back new versions of services with zero downtime. It launches the new version of the service and waits for it to instantiate or respond positively to a user-defined readiness check. When the new version is ready, it automatically switches all traffic to the new version and takes down the old one.
Kubernetes does bin packing. The developer defines for Kubernetes how much CPU and memory each container requires, and Kubernetes can automatically organize containers onto cluster nodes to make the best use of the cluster's resources. In the context of machine learning, we can use node selectors and node labels to make specific workloads be scheduled on labeled machines, e.g. a resource-heavy training workload to run on a machine with a powerful graphics card.
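As a sketch of this scheduling mechanism (the pod name, node label and image are assumed for illustration), a training pod can request GPU resources and be pinned to GPU-labeled nodes with a node selector:

apiVersion: v1
kind: Pod
metadata:
  name: train-gpu                      # hypothetical training pod
spec:
  nodeSelector:
    accelerator: nvidia-gpu            # assumed node label set by the cluster admin
  containers:
    - name: trainer
      image: example/trainer:latest    # assumed training image
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          nvidia.com/gpu: 1            # requires the NVIDIA device plugin on the node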
Kubernetes is self-healing, meaning that if any container goes down unexpectedly or fails
to respond to user-defined health checks, Kubernetes automatically replaces it.
All resources in a Kubernetes cluster are declarative containerized workloads running in pods. Kubernetes relies on an OCI-compliant container runtime to run the containers. A pod is a set of running containers on the cluster and the smallest deployable unit a Kubernetes cluster can have. The standard workload resources in a Kubernetes cluster are Deployment, ReplicaSet, StatefulSet, DaemonSet and Job. The Kubernetes API can be extended by writing custom controllers, which control these workload resources or pods directly. An example of the controller pattern is the CronJob resource, which controls a set of Jobs to run at defined intervals, while a Job controls that its defined pods successfully run their workload and terminate.
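As a sketch of this controller pattern (names and schedule assumed), the CronJob below creates a Job on a schedule, and the Job in turn ensures its pod runs to completion:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-task                   # hypothetical example
spec:
  schedule: "0 3 * * *"                # every night at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: task
              image: example/task:latest   # assumed image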
4 MLOps
MLOps is the practice of applying DevOps to a software project that includes machine learning systems [69]. Machine learning brings new roles and elements to the traditional software development process, such as data, data scientists, data engineers, machine learning engineers, models and their regulation, model training pipelines, and model monitoring. These additions can increase the complexity of a Continuous Delivery system significantly.
In machine learning systems, automated CI/CD pipelines are used to accelerate delivery and improve reproducibility [63, 44]. Valohai, a Finnish startup, surveyed 330 professionals in the machine learning domain about what they had been working on in the previous three months and what obstacles they deemed relevant in their work. Most of the respondents said that they were working on topics that MLOps aims to automate, such as deploying to production, automating model re-training and model monitoring [106].

When categorizing the answers by the maturity of the respondents' organizations, we see that the perceived challenges shift towards issues MLOps is trying to solve as organizations move past the early stages of machine learning development, i.e. past gathering data or figuring out how to use it, learning machine learning technologies, or proving their worth (Figure 4.1). With companies in later stages of development, we see respondents recognize the challenges of deploying models to production and of tracking and comparing experiments [65].
4.1 Machine Learning systems process

Delivering a machine learning system to production from the ground up consists of several steps, which can be performed as manual labour or by an automatic delivery pipeline. A machine learning process consists of model requirements, data collection, data cleaning, data labeling, feature engineering, model training, model evaluation, model deployment, and model monitoring [3].
Figure 4.1: Perceived challenges with early-stage (left) machine learning organizations and mature-stage
(right) organizations. Data from State of ML 2020, Valohai Survey. Legend: a) Lack of data; b) Messiness
of data; c) Accessibility of data; d) Not enough data scientists; e) Not enough engineers/DevOps personnel;
f) Not enough budget for purchases (solutions, computing); g) Difficulty of developing an effective model;
h) Difficulty of using cloud resources for training; i) Difficulty of building training pipelines; j) Difficulty
of deploying models; k) Difficulty of tracking and comparing experiments; l) Difficulty of collaborating on
projects; m) Lack of version control; n) Lack of executive buy-in; o) Difficult regulatory environment; p)
Unclear or unrealistic expectations. 0-4 scale of how relevant the challenge is.
Model requirements step consists of gathering understanding about what data or model algorithms to use to solve a problem. The value is in the initial findings, and as such, it is considered something that is not to be automated or reproduced.
A machine learning Continuous Delivery pipeline consists of Extract-Transform-Load (ETL), training and serving pipelines [88] (Figure 4.2). ETL handles the process steps of data cleaning and labeling and makes sure the processed data is accessible for further steps. The training pipeline ensures that a model is created, tested and evaluated; it handles the machine learning process steps of feature engineering, model training and model evaluation. The serving pipeline handles model deployment and monitoring: it transfers the trained model to the end-users and sets up monitoring systems for it.
4.1.1 Extract-Transform-Load (ETL)

The first thing building a machine learning system requires is an ETL procedure [108, 112] (Figure 4.3). The extract phase extracts data from multiple data sources into a production database, data store, or data lake. The original data can be of any size, in any format and in any geographical location.
In the transform phase, the data is cleaned and labeled: misspellings, inconsistencies, and duplicate, conflicting or missing information are removed, the data is transformed into a uniform format, and it is aggregated with other sources into its final business-ready format. For some models, e.g. supervised learning models, the data requires labeled true values. Depending on the problem and models used, the labeling step might not be needed, and it might not be possible to automate. If labeling cannot be automated, it requires a manual step, where an engineer receives the cleaned data, labels it and then triggers the pipeline onwards. The transformed data is then loaded into production databases or data stores for training pipelines or other clients to use for analysis or other business requirements.
The ETL procedure can be as shallow as extracting rows from a single CSV file into a database table, or it can consist of hundreds of transformation operations on globally extracted data and loading terabytes of data into a data lake.
4.1.2 Exploration
Exploration steps are not considered part of the deployment or model training pipelines, but they give the team invaluable information on what to build and how. As such, this step does not concern MLOps, which is why it is not shown in the figures.
Data exploration is done to understand the data better and provide insight into its form and business applicability. Exploration is done by simply reading the data, by visualizing it, or with statistical analysis. A better understanding of the data provides information on which models or algorithms would be best for achieving the business goals.

In model exploration, several different models are trained with the data to see which one seems most applicable.
4.1.3 Training pipeline

Model training and hyper-parameter optimization start after choosing a model. The model is evaluated against set metrics and trained several times over different hyper-parameters, e.g. learning rates; the best performing model is elevated to the testing stage and the others are discarded. There are multiple strategies for hyper-parameter optimization, and it is up to the implementation which one to use.
In the evaluation stage, trained models are tested to verify that they achieve good results in the selected metrics on previously unseen data. This data is completely split from any evaluation or training data used earlier in the model training. It is recommended to test the model performance with known edge cases, adversarial inputs and other cherry-picked inputs that might surface unwanted bias or discrimination in the model. These tests should work as a benchmark for all possible iterations of trained models, so that they are comparable with each other. At this stage the model performance should be compared to the current production model. A model with an acceptable or better test score is picked as the final production-ready candidate.

The final model is stored in a location that the services using it can access. The stored model can be on a server disk, in remote data storage or inside client applications.
Computationally expensive model training processes require access to GPU hardware resources. Figure 4.4 presents the performance differences of different models' training processes with varying hardware, showing that using GPUs dramatically increases performance, regardless of model and framework [38].
Figure 4.4: Processing speed increases with GPU-units, with all machine learning libraries and tasks
[38].
4.1.4 Serving pipeline

The served model provides an interface for end-users or services to input data into the model. The model output is redirected to an appropriate place, usually back to the entity giving the input, or to another service or model for further processing. The model interface is called an inference server or service.

The inference server can be served as a web server, inside a client application such as a mobile app, or otherwise. Inference serving requires written application code and a robust infrastructure to succeed. The infrastructure has to be scalable and self-healing for the inference service to stay online. The model itself needs to be monitored to detect bias, drifting or adversarial attacks. When a faulty model is detected, the actions to fix it must be swift to avoid further damage. Re-training is a common fix for biases and drifting; it requires training the model with new data and starting the whole process all over again, which is hard to deliver fast without automation and version control.
4.2 Elements of MLOps

4.2.1 Artefacts
In traditional software systems, the product artefact is built from a specific snapshot of the code. However, in machine learning systems, the artefact is a product of the source code and the data used to train the involved model(s), as visualized in Figure 4.5. Often the training source code also takes model hyperparameters as parameters. In a machine learning system, to reproduce an artefact, its source code, hyperparameters and training data need to be version controlled.
While traditional software requires unit, integration, security and end-to-end system testing, a machine learning system requires all of this plus data and model testing. Testing phases are defined in the pipeline as declarative steps. If any tests fail the defined criteria, the pipeline execution is terminated, and thus the model is never released. The pipeline can also consist of steps where model selection is done based on testing and evaluation metrics.

As the training can take hours to several days to complete, the training pipeline must be capable of restoring progress from a checkpoint in case of a fault or suspension of the system. Because of this same time constraint, it is also essential that the pipeline can run multiple training sessions in parallel, independently from each other.

After a specific version of the model is created, it requires evaluation and testing in the pipeline before it can be deployed.
Figure 4.5: Factors in building a server artefact with machine learning are often more complex than in
traditional software
4.2.2 Deployment and production monitoring

The serving pipeline takes care of the deployment strategy of the model. When a new model is trained, or in traditional software a new version of the code is created, it is common to deploy it into a staging environment before going into production. After staging is green-lighted, the new artefacts might be fully deployed with various deployment strategies, for example as a rolling A/B release, where a steadily increasing portion of the production traffic is split to the new version, ultimately replacing the older one. This strategy is used to monitor and benchmark the new version's performance with a smaller portion of end-users, so that faults can be detected and rolled back before any more damage is done.
The serving pipeline automatically integrates the served models into various monitoring systems to track performance and detect faults like model drifting, where the model's domain has changed radically since its training. A common way to fix a model's concept drift is to re-train the model with an updated dataset. A monitoring system should evaluate the model performance with live data, detect the drift and trigger a re-training workflow, automatically training a new model and serving it once it is ready [64].
Various monitoring measures are encouraged to ensure a fast feedback loop on system warnings, bugs, and downtime. A monitoring system is software that actively reads or receives data from the production service. For example, a service may send all software errors to a monitoring system, or a monitoring system could read metrics like the number of concurrent users. A monitoring system often reacts to user-defined thresholds either by alarming developers or by automatically changing the state of the production environment, e.g. by releasing an older version of the production service.

Choosing an action policy for faulty models is not trivial, and the policy could be different for different alerts. For example, when drift is detected, it is not clear whether the current model should be allowed to stay online; this is domain-specific. In some cases, the harm of wrong inferences can be such that the model should be taken down immediately on detection.
4.3 MLOps platforms

We introduce a few mature MLOps platforms, selected to represent different angles on open-source licensing and level of vendor lock-in, and discuss their shortcomings.
AWS Sagemaker [2] is an MLOps platform offered by AWS, which provides a rich, industry-proven, feature-complete set of tools for the whole MLOps lifecycle, from data processing to deployments. However, AWS Sagemaker is completely proprietary tooling, with heavy integration into AWS's other products and offerings. Using Sagemaker makes much sense if a company uses or intends to use AWS tooling in other aspects of their software development and has no interest in migrating elsewhere; effectively, using Sagemaker can lead to significant vendor lock-in. There is also no transparency into the inner workings of Sagemaker as it is proprietary software, which can be an issue for some developers.
Kubeflow [57] is an open-source cloud-native machine learning toolkit, designed to be run in Kubernetes. It has great integrations with many major cloud providers. It is also a feature-rich, industry-proven tool with big-company users [44]. One caveat of Kubeflow is that it requires quite a bit of work to integrate with other systems, as it does not implement Continuous Delivery. Another caveat is that it relies on writing pipelines in a Python library or DSL, which is an overhead for developing automated training pipelines.
Valohai [107] is a feature-rich end-to-end MLOps platform. It is proprietary but cloud-agnostic, running on most major cloud providers or on-premises. Valohai configuration is done with general-purpose YAML files, and the need for configuration is minimal. The main caveat of Valohai is that it is proprietary tooling. When using it, you rely on its features and quality: if some feature is needed for your solution or something is broken, there is nothing you can do to extend or fix the platform.
5 Design & implementation
5.1 GitOps
When using a pull-based model for Continuous Delivery, the system administrators do not need to provide any developer or tool with credentials to access and modify their system. Restricting access credentials is a significant security improvement, eliminating the risk of credentials leaking to malicious actors via third-party system breaches or human errors. Access to make changes to the system is handled by managing write access to the tracked repositories.
Another improvement GitOps provides is risk reduction as rolling back the system is
simple and requires no additional scripting or manual work from the system admins. As
the whole system is declaratively defined and in version control, reverting the repository
to a working revision is all that is needed to roll back the system entirely.
Using Git repositories as the single source of truth provides transparency on what is running on the cluster. New developers do not need to spend time figuring out or documenting what is running, which increases the truck factor of the company.
With proper Git practices (small commits, descriptive commit messages, and pull request reviews), the changes made to the system are transparent, described and reviewed. A descriptive Git log is an excellent way to document information and background on why something is configured as it is, and by whom, further increasing the truck factor.
GitOps provides fast portability of the system. To move or clone the system, all that is needed is to spin up machines elsewhere and point them to track the same Git repository. GitOps also makes development easier, as developers can have their development environments easily reproducible and matching each other's environments perfectly, reducing "works on my environment" issues.
In our Continuous Delivery implementation, we use Flux version 2 [31] to track and reconcile our production environment to the desired state, following the GitOps workflow. GitOps is considered to be a part of the serving pipeline of the MLOps delivery pipeline.
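As a sketch of this setup (the repository URL, branch and path are assumptions), Flux v2 is pointed at the cluster repository with a GitRepository source and a Kustomization that continuously reconciles a path in it:

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example/cluster-config   # assumed infrastructure repository
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/production                      # assumed path inside the repository
  prune: true                                      # remove resources deleted from Git
  sourceRef:
    kind: GitRepository
    name: cluster-config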
5.2 Model training and serving integration

Integrating the model into the training and serving pipelines requires a bit of work.
For the model training process, we have written a Python [111] wrapper that enables passing parameters from a configuration file to a user-written training function. The wrapper does automatic logging of parameters and outputs to a metadata file. The wrapper saves all JSON-formatted [22] logging in the user's training function as runtime logs in the metadata file. Using the wrapper enables a data scientist to focus on writing a general training function. It makes integrating the training workload with hyperparameter search and automatic result ranking into a pipeline easy. The automatic runtime logs enable providing transparent summaries of the training processes.

For now, with only a Python wrapper available, the training process must be written in Python to utilize these features.
Traditionally, building a model that is able to receive inference requests includes much coding outside the scope of model development. For example, implementing an HTTP server [60] that runs inputs through the model requires the developer to write code for fetching the model, starting up and configuring a web server, configuring routing, and writing adequate logging and metrics. This work is rarely domain-specific.

To enable focus on model-specific tasks, we use a model wrapping technique provided by Seldon-core [92], an open-source library for machine learning deployments. The model wrapper takes as parameters a class describing the model, including functions for initialization and prediction, and a file of required third-party dependencies, and injects this model code into a functional web server. The server includes all necessary configuration for start-up, routing, logging, metrics, and eventing. The result is an OCI-compliant image running an inference server for the model. Seldon-core Kubernetes controllers handle the lifecycle of the model deployment inside the Kubernetes cluster (SeldonDeployment).
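A wrapped model image is then deployed by declaring a SeldonDeployment; a minimal sketch with assumed names and image could look like the following:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mnist-model                            # hypothetical deployment name
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: example/mnist-classifier:0.1   # assumed wrapped model image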
Seldon-core also provides pre-packaged inference servers, which we will cover later in
Section 5.3.2.
5.3 Cluster architecture

5.3.1 Artefact storage and version tracking

All code and infrastructure-as-code declarations are in a Git repository. Large model binaries and datasets are in cloud storage, as shipping the data with code images would be unfeasible and redundant. A production workload can automatically fetch the model or data to local disk before using it, leveraging initialization containers in Kubernetes. A pod with an initialization container is kept in an unready state until the initialization container has exited its process successfully.
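A minimal sketch of this pattern (bucket, file and image names assumed) uses an init container that downloads the model into a shared volume before the serving container starts:

apiVersion: v1
kind: Pod
metadata:
  name: model-server                        # hypothetical example
spec:
  volumes:
    - name: model-store
      emptyDir: {}
  initContainers:
    - name: fetch-model
      image: amazon/aws-cli                 # assumed fetcher image
      args: ["s3", "cp", "s3://example-models/model.onnx", "/models/model.onnx"]
      volumeMounts:
        - name: model-store
          mountPath: /models
  containers:
    - name: server
      image: example/inference-server:latest   # assumed serving image
      volumeMounts:
        - name: model-store
          mountPath: /models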
Training pipelines launch with a unique tag, which is placed on all artefacts generated in the pipeline. Models are tagged in the metadata file, and best practice is to also tag the corresponding objects in the cloud storage bucket with the same tag. This way it is easy to trace and verify which artefacts are the result of which training run. All deployed models are present as declarations in the cluster Git repository and its history. Data version control is assumed to be handled by third-party storage providers.
5.3.2 Inference Server

When the model is deployed as an inference server, we can leverage multiple pre-packaged inference servers provided by Seldon-core, or write our own as a runnable image. Using pre-packaged servers eliminates a lot of the work required to get a model deployed. The pre-packaged servers offered include the SKLearn Server, XGBoost Server, Tensorflow Serving, MLflow Server, and the Triton Inference Server.
Of these servers, the XGBoost and SKLearn servers are offered through Seldon-core MLServer [93]. These pre-packaged servers provide a gRPC or HTTP/REST server for specific model execution backends; e.g. the SKLearn Server can be used to deploy a model created and serialized with the Scikit-learn machine learning library [90]. The servers implement request scheduling and batching to fit use-cases where real-time inference is not feasible.

Figure 5.2: Cluster architecture – everything for the ETL and training pipelines runs in the workflow engine, the rest is for the serving pipeline (Figure 4.2).
The Triton Inference Server can run multiple models inside the server and schedule and batch each inference request on a per-model basis (Figure 5.3). The Triton Inference Server is designed to be used in a Kubernetes environment; as such, it is shipped as a Docker container and exposes status, health and metrics in a Prometheus-usable [84, 83] format for Kubernetes liveness probes and metric monitors [74]. Prometheus is a common product for aggregating and querying real-time metrics in Kubernetes ecosystems. The provided servers cover the deployment of many of the machine learning libraries and use-cases used in the industry.
Most of the pre-packaged servers offer three protocols: the Seldon protocol, the Tensorflow protocol, and the KFServing Protocol V2.
The protocols define the API endpoints and payload format used to communicate with the server and how the data is chained between multiple models. The KFServing Protocol V2 is a joint project between KFServing and the Nvidia Triton Inference Server to design and offer a standardized machine learning inference protocol.

The KFServing Protocol V2 REST/HTTP API defines server endpoints such as:
• Health:
GET v2/health/live
GET v2/health/ready
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
• Server Metadata:
GET v2
• Model Metadata:
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
• Inference:
POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
The inference POST endpoint defines a specific JSON schema for the request and response, and a valid example request could look like Listing 5.1.
5.3.3 Workflow engine

The order of work is often relevant; for example, a model deployment process requires the model training to have finished before execution. For handling the sequencing of processes automatically, we need workload orchestration.

In our implementation we use Argo Workflows, a container-native workflow engine for orchestrating parallel jobs on Kubernetes [113]. In the context of our design, we use the workflow engine to run the ETL and training pipeline workloads.
Argo Workflows provides a declarative interface to define workloads and their order of execution, either as a sequence of workloads or as a directed acyclic graph. Argo Workflows allows parallel workloads, which enables multiple instances of the training pipeline to be launched on demand, as utilized in model exploration and hyperparameter tuning. The workflow configuration also enables an easy way to implement artefact saving and passing between workloads.

A YAML Kubernetes manifest of a model training workload declares the image and command to run the training, with input data fetched from a cloud storage bucket and the output model artefact saved into another bucket (Listing 5.2). The training image does not need to implement downloading or uploading artefacts, other than having a reference to them on the local disk. Argo Workflows launches an initialization container to download the files and another container to upload them after workload execution. The output parameters are used as parameter inputs in following workloads, such as deployment processes. Figure 5.4 demonstrates an example of an Argo workflow model training process.
Listing 5.2: Argo workflow template of a model training workload with S3 input data and an S3 output model artefact.

- name: training
  container:
    image: sasumaki/lol-train
    args: [aiga_train, Train, --parameters, "{{inputs.parameters}}"]
  inputs:
    parameters:
      - name: learning_rate
    artifacts:
      - name: data
        path: /app/data
        s3:
          endpoint: s3.amazonaws.com
          bucket: aiga-data
          key: MNIST_data
  outputs:
    parameters:
      - name: S3_MODEL_URI
        value: s3://aiga-models/{{pod.name}}
    artifacts:
      - name: model
        path: /app/outputs/model.onnx
        s3:
          endpoint: s3.amazonaws.com
          bucket: aiga-models
          key: "{{pod.name}}/model.onnx"
...
steps:
  - - name: train
      template: training
      arguments:
        parameters:
          - name: learning_rate
            value: "{{item.rate}}"
      withItems:
        - { rate: 0.0002 }
        - { rate: 0.0003 }
With Argo Events [6] we create a webhook that allows triggering workflows from remote services, such as monitoring systems. With these Argo tools, developers can configure a complete end-to-end model training, evaluation and deployment process in an automated, reproducible and scalable way without any human interaction needed.
5.3.4 Networking and messaging

Networking defines how the different services in the cluster share data with each other. We use mainly two ways of networking in our implementation: a service mesh and cloud messaging. A service mesh is a way to network a series of services on the application layer, typically enabling automatic service discovery, traffic control, and complex load balancing. As the service mesh in our cluster, we use Istio [48] because of its maturity, robust integration with Seldon-core, and rich feature set:
• Routing rules
• Network resiliency features by retries, failovers, circuit breakers, and fault injection.
Listing 5.3: Minimal Istio gateway configuration to publish HTTP model endpoints
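A sketch of such a gateway, with an assumed name and a wildcard host, accepting HTTP traffic on port 80 through Istio's default ingress gateway:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: model-gateway                  # assumed gateway name
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway              # Istio's default ingress gateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"                          # accept any host; restrict in production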
A messaging system is used to pass and filter inference inputs and outputs from the models to the monitoring systems. The system is based on Knative Eventing [55] brokers and triggers and NATS Streaming [72] channels. A Seldon-core deployment automatically publishes all model inputs and outputs as CloudEvents [17] to a Knative Eventing broker. CloudEvents is a standardized specification for message data. The messages are filtered and routed to specific channels by user-defined triggers (Listing 5.4, Figure 5.5). Filtering is done based on CloudEvents attribute headers. The referenced monitoring services receive these messages as HTTP POST requests, without an explicit subscription.
Figure 5.5: Model to monitoring service messaging. Different colored balls represent messages with a
different type attribute.
Listing 5.4: Knative eventing trigger configuration to filter all model requests in nats-broker to a specific
service
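A sketch of such a trigger (the event type attribute and subscriber service name are assumptions) filters on a CloudEvents type attribute on the nats-broker and forwards matching events to a monitoring service:

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: drift-detector-trigger                       # assumed trigger name
spec:
  broker: nats-broker
  filter:
    attributes:
      type: io.seldon.serving.inference.request      # assumed Seldon request event type
  subscriber:
    ref:
      apiVersion: v1
      kind: Service
      name: drift-detector                           # assumed monitoring service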
5.3.5 Deployment Strategies

For deployments, the system provides several stages of verification and validation. First, for every code commit we can provide an automatic test that the new infrastructure declaration can be built inside a Docker environment, together with end-to-end tests that requests go through, that monitoring is working, and other system-wide tests. After successful tests, the infrastructure is accepted into the staging or production branch.

Another cluster with GitOps agents can be configured to watch and reconcile a specific staging folder or branch in the infrastructure repository. With the staging cluster, developers can validate their system before a production release.
The Istio service mesh and Seldon-core enable ways to test a new model in the production environment with little or no risk of catastrophes. Shadow deployment, also known as shadow testing [4], is a deployment strategy where a model service is created inside the cluster as a "shadow model", and all production traffic coming into the production model is mirrored into the shadow model. However, the shadow does not propagate its output to the end-user. Using a shadow model we can monitor, validate and compare how a new model would behave in a production setting without it interfering with the rest of the system, as long as the shadow is a stateless application.

A shadow can be gradually promoted to the production version as a canary deployment. In a canary deployment, the traffic is not mirrored as in a shadow deployment but split between the new and old versions. The percentage of split traffic can be increased gradually, eventually reaching 100% of the traffic. These deployment strategies decrease the risk of releasing a faulty version of the model, as the model runs in the real production environment, with no test cases or simulation involved.
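Both strategies reduce to predictor declarations inside a SeldonDeployment. The following fragment is a sketch (predictor names and images assumed) where the main predictor keeps serving all traffic while a shadow predictor receives a mirrored copy; turning the shadow into a canary would mean dropping the shadow flag and splitting the traffic percentages instead:

spec:
  predictors:
    - name: main
      replicas: 1
      traffic: 100                     # all user-facing traffic
      graph:
        name: classifier
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: example/mnist-classifier:0.1   # current production model
    - name: shadow
      replicas: 1
      shadow: true                     # receives mirrored traffic, responses are discarded
      graph:
        name: classifier
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: example/mnist-classifier:0.2   # candidate model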
5.3.6 Real-time monitoring and alerting

We use Prometheus for real-time metrics of inference server requests and inference time; each metric is also tagged with its service name, deployment name, predictor name, predictor version, model name, model image, model version, and used training data. The tags can be used as filters to draw more granular data from the metrics. Seldon-core enables users to create additional custom metrics for Prometheus in their model image. Prometheus also enables creating system alerts on user-defined metric thresholds, e.g. if inference time gets too long.
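For example, an alert on long inference times could be declared as a Prometheus alerting rule roughly as follows; the metric name and threshold are assumptions and depend on the metrics the deployed inference servers actually expose:

groups:
  - name: inference-alerts
    rules:
      - alert: SlowInference
        # assumed latency histogram exposed by the inference server
        expr: >
          histogram_quantile(0.99,
          rate(seldon_api_executor_client_requests_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "99th percentile inference latency above 500 ms"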
We use Grafana [40, 41] to create dashboard views of the metrics in Prometheus. Grafana provides observability through filterable time-series visualizations of the Prometheus metrics. The dashboards can be exposed as a public web service with authentication.
5.3.7 Offline monitoring systems

In this context, offline monitoring systems refer to services that receive the inputs and outputs of the inference servers deployed in the cluster. They monitor that there are no issues with them, e.g. concept drifting of models. Offline monitors do not respond in real-time, and they do not interfere with the responses of the models. Instead, all inputs and outputs are aggregated and distributed in an asynchronous fashion using NATS Streaming and a message broker. Concept drift [110], adversarial attack [28], outlier detection [46] and other typical monitoring applications in machine learning are often machine learning models themselves. These models can have long inference times or do batch processing, making them unfeasible as real-time inference services. These systems are used to alert developers or trigger different actions in the cluster, e.g. re-training of a model.
6 Demonstration
6.1 ETL
We use MNIST [61], a dataset of handwritten digits. The dataset consists of 70 000 labeled digits (Figure 6.1). The digits are 28 x 28 pixels in size, and each pixel is encoded as a grayscale integer value between 0 and 255. We have saved the digits' source data in S3, an object datastore offered by Amazon [16]. The S3 files themselves are version controlled by Amazon. The data could be saved in any cloud provider's data storage or inside the hosted cluster's Minio server. From the data, we have set up an Argo workflow pipeline that first does an arbitrary check on the data and saves it into another file in S3. The check is a data function which has a chance of failure, demonstrating a validation step in a production use-case. The data is clean, so we do not need to do any data processing or labeling for the model training, but if we did, it would be done in a similar fashion as the validation.
The demonstration using the pipeline can be found at https://github.com/sasumaki/asd.
6.2 Model training and evaluation

Using the data, we run a training step with declarative parameters. The training implementation can be found at https://github.com/sasumaki/mnist. The model is a convolutional neural network built with Tensorflow [99] and Keras [54] as the programming interface (Listing 6.1).
Listing 6.1: Model definition and compilation of the convolutional neural network used in the demonstration.

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Conv2D, Dense, Flatten

class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10)

    def call(self, x):
        # Forward pass through the layers defined above
        x = self.conv1(x)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

# learning_rate is passed in as a workflow parameter (see Listing 5.2)
model = MyModel()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')

model.compile(optimizer=optimizer,
              loss=loss_object,
              metrics=[train_accuracy])
The model training image creates a file for the model, which is defined as an output artefact in the Argo workflow declaration (Listing 5.2) and is therefore automatically saved into Amazon S3. The S3 address is pushed to the staging branch of the model repository's deployment declaration, where it is automatically tested by the GitHub Actions CI tool [27], which checks that the model can be run in Kubernetes, either as our own wrapped implementation or as a Triton Inference Server, and that it has good enough accuracy. The staging model is also deployed into the production cluster as a shadow deployment. When the model is approved by the repository's CI tests
and merged to the main branch, the GitOps agents in the cluster trigger a rolling update to a new version of the inference server serving the new model. We have one implementation where we use a Seldon model wrapper to create the inference server and another where we use Triton as the inference server (Listing 6.2). These require the model to have specific folder structures and file naming conventions in S3, which means the same S3 address cannot be used in both implementations. A deployed model can then be accessed as demonstrated in Listing 6.3. The model's metrics are automatically connected to the cluster's metrics aggregation system and are immediately visible in the dashboard in Figure 6.2. Inferences are also sent to the message broker, where they are propagated
to a service acting as a real-time monitoring system such as a drift detector. This service
will call a webhook triggering the training pipeline all over again.
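The monitoring service in the demonstration is intentionally simple. A minimal sketch of such a service is given below, assuming that Knative Eventing delivers each inference as an HTTP POST with a JSON body shaped like the V2 protocol payload in Listing 6.3, and that the retraining webhook URL is known; the drift heuristic is a placeholder, not the detector used in the demonstration.

import json
from collections import deque
from http.server import BaseHTTPRequestHandler, HTTPServer

import numpy as np
import requests

WEBHOOK_URL = "http://retrain-webhook.default:12000/retrain"  # hypothetical
EXPECTED_MEAN = 33.0  # approximate mean pixel value of MNIST digits (assumption)
recent_means = deque(maxlen=500)

class MonitorHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Each delivered event carries the inference request/response as JSON.
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        pixels = np.array(event["inputs"][0]["data"], dtype=np.float32)
        recent_means.append(float(pixels.mean()))

        # Placeholder drift check: trigger retraining when the average input
        # intensity wanders far from what the training data looked like.
        if len(recent_means) == recent_means.maxlen and \
                abs(np.mean(recent_means) - EXPECTED_MEAN) > 20.0:
            requests.post(WEBHOOK_URL, json={"reason": "input drift suspected"})
            recent_means.clear()

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), MonitorHandler).serve_forever()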
We have run this demonstration on a local machine in K3d, a lightweight Kubernetes distribution running in a Docker environment [86], and in the cloud on the Civo managed Kubernetes service [66].
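The accuracy check run by the model repository's CI could, for instance, be a small pytest-style test along the following lines; the model path, the threshold, and the use of the Keras MNIST test split are illustrative assumptions, not the exact test used in the demonstration.

import numpy as np
import tensorflow as tf

MODEL_DIR = "model"      # hypothetical path to the downloaded SavedModel
MIN_ACCURACY = 0.95      # hypothetical quality gate

def test_model_accuracy():
    _, (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_test = (x_test / 255.0).reshape(-1, 28, 28, 1).astype("float32")

    model = tf.keras.models.load_model(MODEL_DIR)
    logits = model.predict(x_test, verbose=0)
    accuracy = float(np.mean(np.argmax(logits, axis=1) == y_test))

    assert accuracy >= MIN_ACCURACY, f"accuracy {accuracy:.3f} below the gate"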
Listing 6.2: An example of an MNIST deployment with the Triton Inference Server and the KFServing V2 protocol.
Listing 6.3: Python script for getting a prediction out of the model deployed in Listing 6.2.
import json

import numpy as np
import requests

# x is assumed to be a single 28 x 28 x 1 grayscale digit as a NumPy array.
url = "http://localhost:8081/seldon/seldon-system/mnist-model-triton/v2/models/mnist/infer"
data = {"inputs": [{"name": "input_1", "data": x.tolist(),
                    "datatype": "FP32", "shape": [1, 28, 28, 1]}]}
headers = {"Content-Type": "application/json"}
r = requests.post(url, data=json.dumps(data), headers=headers)
res = r.json()["outputs"][0]["data"]
prediction = int(np.argmax(np.array(res).squeeze(), axis=0))
7 Discussion
We discuss the results of the research questions, evaluation and limitations of the MLOps
pipeline as the Design Science artefact, and related work that has been done during this
thesis.
7.1 Results
We determine the requirements of the MLOps pipeline as a system that is able to handle all
processes of a machine learning process in an automated and reproducible fashion. That
is, data cleaning, data labeling, feature engineering, model training, model evaluation,
model deployment, and model monitoring RQ1.
Our MLOps pipeline is able to handle all these steps with open-source and cloud-native tooling (RQ2). Data labeling is not handled in any special fashion. If labeling is an automatic process done with, for example, unsupervised methods, it can be plugged into the workflow. If labeling is done manually by domain experts, it cannot be automatically plugged into the pipeline. A way to incorporate labeling is to have automatic cleaning and an alert when a cleaned dataset is ready, so that a domain expert can label it and then manually trigger the pipeline to continue onwards. Deployment and monitoring are automatic with the GitOps agents, and a deployed model will automatically attach to the monitoring systems in place.
All components are open-source, and the pipeline is cloud provider agnostic to a degree (RQ3). We do not use any tooling that is, or depends on, tooling that is proprietary to any cloud provider. Managed Kubernetes platforms can have unique elements that affect migrating the pipeline between different providers. We can also manage our own Kubernetes cluster on different cloud providers, which would eliminate this issue, move the pipeline to on-premises Kubernetes clusters, or run the pipeline in simulated Kubernetes environments.
The naive demonstration (Chapter 6) shows that most of the MLOps pipeline requirements can be met by the solution, either by design or by the underlying technologies (Table 7.1). Using the pipeline should increase development performance by automating the CI/CD of machine learning projects. It is relatively straightforward to use but requires some knowledge of working with Kubernetes (Table A.1). The pipeline relies heavily on the Kubernetes cluster, with chosen tooling that integrates well with it. The chosen tools are mature, at least in the context of a field as new as MLOps. As shown in the demonstration, few changes are made to the business-relevant MNIST source code, and there is no additional code for the pipeline; every piece is integrated together using YAML configuration (Table A.2).
Tool-specific objectives are discussed in Appendix A to provide a rationale for the tool decisions and a granular view of the design objectives.
Obj1 The MLOps pipeline provides the pipeline features discussed in this thesis, as demonstrated in Chapter 6.
Obj2 Everything running in the pipeline is a containerized workload, and it runs seamlessly in Civo Cloud, a managed Kubernetes environment, so the pipeline is cloud-native.
Obj3 All components are open-source.
Table 7.2: MLOps pipeline objectives that correlate with development performance
Obj4 The whole pipeline can be installed in any Kubernetes cluster by creating a GitHub repository with the flux bootstrap command and a simple configuration YAML (Listing 7.1). Serving a model with the pipeline requires writing deployment YAML declarations (Listing 6.2) for the containerized workloads, and training requires workflow declarations; both are relatively simple and client-friendly. The ease of this is yet to be quantified, but we expect it to be easy enough for a developer or a data scientist.
Obj5 The pipeline manages to handle a use-case as demonstrated in Chapter 6; more complex and business-critical experiments remain to be quantified.
Obj6 The pipeline is fully open-source and as such customizable. There is customization lined up as future work.
7.2 Limitations
The design’s main limitations are running in a Kubernetes cluster and the need for build-
ing containerized workloads. Containerization requires knowledge and effort out of the
developers using the pipeline. The pipeline itself is complex and requires in-depth knowl-
edge of Kubernetes and its tooling to manage and develop. In production, this could
mean a full-time employee who is in charge of managing the pipeline. The pipeline aims
to save the time of data scientists and developers working with business issues in a ma-
ture – production-ready and scaling teams. It is not recommended to incorporate such an
environment in the early stages of machine learning services or business.
The pipeline solution does not consider data or model version control but relies on third-party data storage providers' version control, such as Amazon S3. We do not think it is necessary for the pipeline to take care of data version control: the pipeline is given the addresses of data and models in its configuration, and any processes modifying them generate new versions of the data that they pass onwards to other processes or save into the cloud under a new address.
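As an illustration of this convention, a processing step might write its output under a new, content-derived key and pass that address onwards; the bucket name and the hashing scheme below are hypothetical.

import hashlib

import boto3

def save_new_version(data: bytes, bucket: str = "mnist-data") -> str:
    """Store processed data under a new address derived from its content."""
    digest = hashlib.sha256(data).hexdigest()[:12]
    key = f"processed/mnist-{digest}.npz"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=data)
    # The returned address is what gets passed onwards in the pipeline
    # configuration instead of overwriting the previous version.
    return f"s3://{bucket}/{key}"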
There are differences between Kubernetes distributions. Because of this, we cannot promise a seamless transition between various cloud platforms. The differences and required changes are usually small; for example, no changes were needed between local K3d and the Civo cloud. In Google's managed Kubernetes service GKE [59], on the other hand, you may need to grant your account the ability to create new cluster roles for Argo Workflows to install correctly. These distribution-specific issues would need to be handled manually in the pipeline configuration and infrastructure, which should be possible for at least the major Kubernetes cloud providers.
7.3 Related work
There was related work released while we were working on this thesis, answering some of the research questions and problems. K3Ai [53] is an "infrastructure in a box", a command-line tool that offers an integrated installation of many of the same tools in a Kubernetes cluster. It aims to solve "combining a multitude of tools and frameworks that help Data Scientist and Data Engineers to solve the problem of building end-to-end pipelines" and states a set of goals and requirements for the tool.
These goals and requirements match many of our own. The components K3Ai shares with our pipeline are Argo and the Triton Inference Server. As for differences, it offers Kubeflow and MLflow, commonly used model training and exploration tools that we specifically wanted to avoid because they rely on DSL libraries.
Also released during our work is "ZenML [37], an extensible MLOps framework to create reproducible pipelines". ZenML lists the following as its key features:
• Ability to quickly switch between local and cloud environments (e.g. Kubernetes,
Apache Beam).
• Built-in and extensible abstractions for all MLOps needs – from distributed processing on large datasets to Cloud-integrations and model serving backends.
• Pre-built helpers to compare and visualize input parameters as well as pipeline results
(e.g. Tensorboard, TFMA, TFDV).
The ZenML feature set is promising, and it offers most of our requirements and objectives. It even offers the ability to switch outside the Kubernetes environment. However, it mostly leverages the proprietary solutions of different cloud platforms for training, such as AWS SageMaker [2], Google AI Platform [1], and Azure Machine Learning [10], which is something our solution wants to avoid. ZenML is also configured with a custom DSL library, which is likewise something we wanted to avoid.
8 Conclusions
We reviewed the requirements of a modern machine learning pipeline that delivers automation and reproducibility in most steps of the machine learning process. We designed an open-source, cloud-native MLOps pipeline that should fit most machine learning projects and teams aiming to automate and scale their machine learning process. The pipeline can be run on multiple cloud providers' Kubernetes environments as well as on on-premises Kubernetes and simulated Kubernetes on local machines. We evaluated the pipeline based on multiple features we recognized as must-have requirements, features that correlate with positive development performance, and features that solve problem-definition-specific issues.
As future work, we would want to create a custom operator to orchestrate re-training life-cycles; as of now, the solution is a simple webhook trigger. The operator could, for example, avoid re-triggering the training process if one is already in progress, or handle more complex scenarios for deciding when to trigger. The pipeline should also be made more lightweight. Currently, a user needs a high-end computer to run the pipeline locally; if the pipeline were more lightweight, it would be easier to develop and experiment with on local machines. Cluster costs also increase as better hardware is required. Istio could be changed to another, lighter service mesh implementation, but is not, because other tooling integrates with it. Knative Serving installs many controllers and resources that should not be needed; these should be trimmed down to only what is required. We should also consider the cybersecurity of the pipeline. We would want to conduct a case study to quantify the results, performance, and feasibility of our MLOps pipeline, which would require implementing a real, complex, and resource-heavy machine learning system in it. Implementing a managed phase for data labeling could be useful. User interface forms for adding deployments, workflows, and triggers could be an interesting addition to make the pipeline easier to use, as would a UI for tracking training experiments.
We believe that, with the further validation and development discussed as future work, this solution can be recommended as the MLOps pipeline of choice for most production-scale machine learning projects.
Bibliography
[11] L. Bass, I. Weber, and L. Zhu. DevOps: A Software Architect’s Perspective. Addison-
Wesley Professional, 2015.
[12] K. Beck, M. Beedle, A. Van Bennekum, A. Cockburn, W. Cunningham, M. Fowler,
J. Grenning, J. Highsmith, A. Hunt, R. Jeffries, et al. “Manifesto for agile software
development”. In: Available at https://agilemanifesto.org/ accessed Dec 21, 2020
(2001).
[13] B. W. Boehm, J. R. Brown, and M. Lipow. “QUANTITATIVE EVALUATION OF
SOFTWARE QUALITY”. In: (), p. 14.
[14] B. W. Boehm. “Verifying and Validating Software Requirements and Design Spec-
ifications”. In: IEEE Software 1.1 (February 1984), pp. 75–88. issn: 07407459. doi:
http://dx.doi.org/10.1109/MS.1984.233702. url: https://search.proquest.com/docview/215842250/abstract/5ED70725FEB84F17PQ/1 (Accessed: 20 January 2021).
[15] E. Casalicchio and S. Iannucci. “The state of the art in container technologies: Ap-
plication, orchestration and security”. In: Concurrency and Computation: Practice
and Experience 32.17 (2020), e5668. issn: 1532-0634. doi: https://doi.org/10.1002/cpe.5668. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5668 (Accessed: 5 February 2021).
[16] Cloud Object Storage | Store & Retrieve Data Anywhere | Amazon Simple Storage
Service (S3). url: https://aws.amazon.com/s3/ (Accessed: 3 February 2021).
[17] cloudevents/spec. original-date: 2017-12-09T21:18:13Z. January 2021. url: https://github.com/cloudevents/spec (Accessed: 21 January 2021).
[18] CNCF Kubernetes Project Journey Report. url: https://www.cncf.io/cncf-kubernetes-project-journey/ (Accessed: 18 November 2020).
[19] cncf/toc. url: https://github.com/cncf/toc (Accessed: 4 February 2021).
[20] Continuous Integration and Delivery. url: https://circleci.com/ (Accessed: 15
February 2021).
[21] cortexproject/cortex. original-date: 2016-09-09T11:23:12Z. February 2021. url: https://github.com/cortexproject/cortex (Accessed: 18 February 2021).
[22] D. Crockford. The application/json Media Type for JavaScript Object Notation (JSON). url: https://tools.ietf.org/html/rfc4627 (Accessed: 7 January 2021).
[62] T. A. Limoncelli. “GitOps: a path to more self-service IT”. In: Commun. ACM 61.9
(2018), pp. 38–42. doi: 10.1145/3233241.
[63] L. E. Lwakatare, I. Crnkovic, E. Rånge, and J. Bosch. “From a Data Science Driven
Process to a Continuous Delivery Process for Machine Learning Systems”. In:
Product-Focused Software Process Improvement. Ed. by M. Morisio, M. Torchiano,
and A. Jedlitschka. Lecture Notes in Computer Science. Cham: Springer Interna-
tional Publishing, 2020, pp. 185–201. isbn: 978-3-030-64148-1. doi: 10.1007/978-
3-030-64148-1_12.
[64] L. E. Lwakatare, A. Raj, I. Crnkovic, J. Bosch, and H. Olsson. “Large-Scale Machine
Learning Systems in Real-World Industrial Settings A Review of Challenges and
Solutions”. In: Information and Software Technology 127 (July 2020), p. 106368.
doi: 10.1016/j.infsof.2020.106368.
[65] S. Mäkinen, H. Skogström, E. Laaksonen, and T. Mikkonen. “Who Needs MLOps:
What Data Scientists Seek to Accomplish and How Can MLOps Help?” In: May
2021.
[66] Managed Kubernetes service, powered by K3s. url: https://www.civo.com (Ac-
cessed: 3 February 2021).
[67] Microservices. url: https://martinfowler.com/articles/microservices.html (Accessed: 5 February 2021).
[68] MLflow Documentation — MLflow 1.13.1 documentation. url: https://www.mlflow.org/docs/latest/index.html (Accessed: 19 January 2021).
[69] MLOps: Continuous delivery and automation pipelines in machine learning. url: https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning (Accessed: 19 November 2020).
[70] Mockito framework site. url: https://site.mockito.org/ (Accessed: 15 February 2021).
[71] K. Morris. Infrastructure as Code. Google-Books-ID: UW4NEAAAQBAJ. "O’Reilly
Media, Inc.", December 2020. isbn: 978-1-09-811464-0.
[72] nats-io/stan.go. original-date: 2016-01-20T15:49:02Z. January 2021. url: https://github.com/nats-io/stan.go (Accessed: 7 January 2021).
[73] New Relic | Deliver more perfect software. url: https://newrelic.com/ (Accessed: 15 February 2021).
Tool objectives: Obj1: Must provide features of the MLOps methodology's CI/CD pipelines. Obj2: Must be cloud-native. Obj3: Must be open-source.

Argo Workflows
  Obj1: Used for orchestrating the order of tasks in the ETL and training pipelines.
  Obj2: Designed to run in Kubernetes and leverage multiple parallel machines and jobs.
  Obj3: Open-source.

Seldon Core
  Obj1: Handles fetching created models, creating an inference server and APIs for inference and metrics.
  Obj2: Designed to be run in Kubernetes. Enables complex deployment strategies, e.g. A/B, out of the box.
  Obj3: Open-source.

Knative Eventing
  Obj1: Used as a tool to specify messaging routes between different services, e.g. posting a specific model's inference requests and responses to a specific set of monitoring services.
  Obj2: Designed for messaging between Kubernetes services.
  Obj3: Open-source.

NATS Streaming
  Obj1: Acts as a back-channel for Knative Eventing messages.
  Obj2: Designed for cloud-native applications and microservices. A CNCF project.
  Obj3: Open-source.

Prometheus
  Obj1: A real-time metrics and alerting system.
  Obj2: A CNCF Graduated project.
  Obj3: Open-source.

Grafana
  Obj1: Analytics platform for Prometheus metrics.
  Obj2: Designed for cloud-native applications; can import data from mixed data sources.
  Obj3: Open-source.

Istio
  Obj1: Offers easier networking and service discovery for all the models and monitoring services; enables a lot of Seldon-core features.
  Obj2: Built for Kubernetes.
  Obj3: Open-source.

Flux v2
  Obj1: Offers Continuous Delivery and declarative Kubernetes cluster configurations following the GitOps methodology.
  Obj2: Designed for Kubernetes.
  Obj3: Open-source.

MinIO
  Obj1: Works as a temporary in-cluster storage for running workloads' artefacts. Can be configured to be a persistent main storage.
  Obj2: Comes with container images and a custom Kubernetes Operator.
  Obj3: Open-source.

SealedSecrets
  Obj1: Handles secret management in our Continuous Delivery scheme.
  Obj2: Designed for Kubernetes.
  Obj3: Open-source.
Table A.2: Tool specific objectives that correlate with development performance.
Istio
  Obj4: Configuring routing, TLS connections, domain names, and larger networks of services can be extremely difficult.
  Obj5: It is used mostly because of its high integration with other tooling. For example, Istio is the driving technology behind eventing configuration and complex deployment strategies.
  Obj6: There is a plugin system [87], and various tools integrate with Istio capabilities, e.g. Seldon-core.

Flux v2
  Obj4: A mostly automatic system that does not concern developers. If something goes wrong with deployments, it might need manual debugging.
  Obj5: It works great for automatic deployment of microservices in the cluster, following the GitOps methodology.
  Obj6: The tool is new and there are no popular tools extending it. The tool itself uses the GitOps Toolkit as its runtime, which is designed to be extendable by other tooling [82].

MinIO
  Obj4: Unseen by developers by default. Can be configured to be the main data storage. It has an API that implements the AWS S3 API, so changing between the two does not require any changes to code.
  Obj5: It works great for passing artefacts between workloads in Argo. It also works as a substitute for data storages, e.g. S3. Using MinIO as the main storage requires operators to handle disk space and node storage management in the Kubernetes cluster, which can introduce a lot of work and complexity.
  Obj6: MinIO is being integrated in some cloud projects. No known customized or extended projects.

SealedSecrets
  Obj4: Requires fetching the public key from the SealedSecrets operator upon creating a new cluster. In the example demonstration, we have an easy-to-use bash script to handle the heavy lifting of using SealedSecrets.
  Obj5: Provides a way to store secrets in a public repository without anyone having the private key to decrypt them.
  Obj6: It is open-source, but there are no known examples of extension or customization.
Tool objectives: Obj7: Maturity of the tool. Obj8: Level of integration with the runtime. Obj9: Generality.

Argo Workflows
  Obj7: CNCF Incubating project.
  Obj8: Designed to run containerized workloads in Kubernetes. Provides a Prometheus metrics endpoint.
  Obj9: All workflows are configured in YAML configuration files. Everyone is able to configure even complex workflows.

Seldon Core
  Obj7: Seldon-core is not featured in the CNCF space, but its several years of development, over a hundred contributors, over fifty releases, rich feature-set, and over 2 million downloads suggest that it is mature enough for our MLOps pipeline.
  Obj8: Designed for Kubernetes deployments. Provides Prometheus metrics endpoints, API health and readiness endpoints. Sends CloudEvents by default and integrates with Istio networking for complex deployment schemes and API routing.
  Obj9: Developers do not need to concern themselves with anything else than writing a SeldonDeployment YAML manifest.

Knative Eventing
  Obj7: Considered to be in the CNCF sandbox stage but is backed and used by Google, which gives it credibility in our evaluation.
  Obj8: Designed for Kubernetes messaging.
  Obj9: YAML configurations for attaching services. Monitoring systems have to be programmed to read and send CloudEvents, which requires learning an SDK.

NATS Streaming
  Obj7: CNCF Incubating project.
  Obj8: It runs in containers and is designed for cloud-native microservice architectures.
  Obj9: Developers and Data Scientists do not interact with or see it.

Prometheus
  Obj7: CNCF Graduated project.
  Obj8: The de-facto monitoring solution in modern Kubernetes projects. Almost every tool in this pipeline exports its own Prometheus metrics.
  Obj9: Creating new metric queries or exporting custom metrics requires learning the query syntax.

Grafana
  Obj7: CNCF Sandbox project. Started 8 years ago, with hundreds of commits per month and over a thousand contributors. Over 75 million dollars of funding according to CNCF. Commonly used together with Prometheus.
  Obj8: Heavy integration with Prometheus.
  Obj9: Building new graphs requires added knowledge of Prometheus syntax and Grafana JSON configuration.

Istio
  Obj7: Considered to be in the CNCF sandbox stage but is backed and used by Google, which gives it credibility in our evaluation.
  Obj8: Designed for Kubernetes networking. Other tools like Seldon-core and Knative depend partly on Istio.
  Obj9: Configuring networking requires added knowledge in any production environment. Istio is not the easiest or the most general in any way.

Flux v2
  Obj7: In the CNCF sandbox stage, and we consider it the least mature tool in our pipeline, as it was released just weeks before we started working on this, following the success of Flux v1.
  Obj8: A great tool for GitOps continuous delivery in Kubernetes clusters.
  Obj9: A Data Scientist needs to run a bootstrap command when starting a project to start the GitOps syncing; otherwise it does not concern developers.

MinIO
  Obj7: CNCF sandbox project. Started 6 years ago, with dozens of commits per month and 297 contributors. 23.3 million dollars of funding according to CNCF.
  Obj8: Runs as a container. No other special integration.
  Obj9: A general purpose data storage. Works like AWS S3.

SealedSecrets
  Obj7: Not a very mature project.
  Obj8: Works great for managing Kubernetes environment secrets in Git. A key feature to have with the GitOps methodology.
  Obj9: A Data Scientist needs to know to run a CLI program to encrypt all secrets they introduce to the project.