Vrije Universiteit Amsterdam, The Netherlands
Email: {f.amou.najafabadi,j.bogner,i.g.gerostathopoulos,p.lago}@vu.nl

An Analysis of MLOps Architectures:
A Systematic Mapping Study

Faezeh Amou Najafabadi (ORCID 0009-0006-6413-9118), Justus Bogner (ORCID 0000-0001-5788-0991), Ilias Gerostathopoulos (ORCID 0000-0001-9333-7101), Patricia Lago (ORCID 0000-0002-2234-0845)

Abstract

Context. Despite the increasing adoption of Machine Learning Operations (MLOps), teams still encounter challenges in effectively applying this paradigm to their specific projects. While there is a large variety of available tools usable for MLOps, there is simultaneously a lack of consolidated architecture knowledge that can inform the architecture design. Objective. Our primary objective is to provide a comprehensive overview of (i) how MLOps architectures are defined across the literature and (ii) which tools are mentioned to support the implementation of each architecture component. Method. We apply the Systematic Mapping Study method and select 43 primary studies via automatic, manual, and snowballing-based search and selection procedures. Subsequently, we use card sorting to synthesize the results. Results. We contribute (i) a categorization of 35 MLOps architecture components, (ii) a description of several MLOps architecture variants, and (iii) a systematic map between the identified components and the existing MLOps tools. Conclusion. This study provides an overview of the state of the art in MLOps from an architectural perspective. Researchers and practitioners can use our findings to inform the architecture design of their MLOps systems.

Keywords: Machine Learning Operations, MLOps, Architecture, Components, Tools, Systematic Mapping Study.

1 Introduction

The use of Machine Learning (ML) continues to grow in industry, and developing high-quality ML models is important to sustain it. However, when creating ML-based systems, another major concern of ML engineers and operations teams is the effective deployment and maintenance of ML models in production [11]. To address this, the Machine Learning Operations (MLOps) paradigm has formed in industry [7]. Similar to DevOps [3], MLOps comprises a set of best practices and related methods, technologies, and tools that aim to bridge the gap between the development of ML models and their deployment, maintenance, and evolution. Despite the increasing adoption of MLOps [14], it is still challenging for practitioners to effectively apply the paradigm to their projects [9]. First, there is a large variety of available tools usable for MLOps, which makes it hard for practitioners to analyze and compare all the options at their disposal [6]. Second, while reusable design decisions for certain parts of ML-enabled systems are starting to emerge [22], there is still a lack of consolidated MLOps architecture knowledge that could guide architectural decisions. Third, MLOps evolves at a fast pace and simultaneously in different domains, which makes it difficult to discern the generalizable concepts and technologies from the domain-specific ones.

In this paper, we therefore aim to provide an overview of the state of the art in MLOps from an architectural perspective. We extract and analyze the components that comprise typical architectures of MLOps systems and identify several variants of such architectures based on the existing variability points, e.g., the optional presence of an online training pipeline. We analyze the dependencies between components and synthesize them in the form of a UML component diagram. To align terminology in a rather scattered domain, we provide several known aliases for each component we identify. We also extract and analyze the tools that are mentioned in the MLOps literature and map them to the architecture components. This provides insight into the implementation options for each component and sheds light on untapped R&D opportunities in the form of less tool-supported components.

We accomplish this by performing a Systematic Mapping Study (SMS) of the scientific literature of MLOps, which grounds our analysis in the scientific state of the art. As such, we complement other attempts that analyzed gray literature for MLOps best practices and architecture design decisions [6, 22]. As our long-term goal, we aim to provide a comprehensive reference architecture [17] for MLOps that covers both the structural perspective (the focus of this paper), and the process and stakeholder perspectives.

In summary, the contributions of this study are (i) a synthesis and categorization of 35 MLOps architecture components, (ii) a description of several MLOps architecture variants, and (iii) a systematic map between the identified components and the existing MLOps tools.

The target audience of our study is (i) researchers in software architecture for ML, who can build on our results to derive and consolidate architecture knowledge in the form of MLOps best practices and patterns, and (ii) ML practitioners, who can use our findings to inform the architecture design of their MLOps systems.

2 Related work

MLOps definitions, practices, and guidelines have been the subject of numerous secondary studies including scoping reviews, systematic literature reviews, and multivocal literature reviews. Many secondary studies on this topic aim at clarifying the definition of MLOps. Mboweni et al. [16] state that there is still no official standard definition for MLOps. Based on their systematic review to disambiguate the definition of MLOps in the literature, they claim that they did not find evidence of a common understanding among scholars and experts on how MLOps should be implemented and institutionalized across the industry to create a common vision. Lima et al. [13] systematically reviewed 30 papers aiming at deriving practices, standards, roles, maturity models, challenges, and tools for MLOps. Based on the addressed challenges and assessment of models, they draw the conclusion that “research on MLOps is still in its initial stages.”

Some literature reviews provide methodologies for effectively approaching MLOps projects. Testi et al. [21] provide a taxonomy of current approaches to MLOps projects and propose a methodology for addressing them. Kolltveit and Li [10] specifically focus on the operationalization of ML models with regard to tools and infrastructure that are deployed in different stages of MLOps workflows. Recupito et al. [19] take a different perspective and provide an overview of the most common tools that support the creation of MLOps pipelines and their characteristics, without a clear mapping to components.

Several papers on the architecture of ML-based systems are also closely related to our study. For example, Warnett and Zdun conducted two studies in which they used practitioner gray literature to synthesize architectural design decisions (ADDs) for ML workflows [23] and ML deployment [22], with several of their sources being blog posts about MLOps. As a result, several of their ADDs are related to architecture components that we synthesize in our study. However, they do not combine this knowledge into a holistic architecture and do not cover several parts of MLOps, such as inference and monitoring. In a controlled experiment, Warnett and Zdun [24] also compared the understandability of informal textual and graphical MLOps architecture representations with that of semiformal MLOps architecture diagrams. They conclude that the understandability of MLOps architecture descriptions is significantly higher when they are supplemented with semiformal architecture diagrams. Lastly, Kumara et al. [12] strive towards a reference architecture for MLOps by eliciting requirements and components from the gray literature. In their preprint, they provide a layered architecture that focuses on the requirements that the MLOps environment needs to fulfill.

Our own study complements the above studies by using scientific literature to synthesize MLOps architecture components, their relationships, and supporting tools to implement them. The results of this synthesis address gaps identified by previous studies, namely “no common understanding of MLOps definition” and “no clear mapping between the tools and the related components”. Moreover, unlike existing works, we also synthesize and discuss several architectural variants. Researchers and practitioners can use the proposed architectures to identify a suitable variant for their requirements.

3 Methodology

In this section, we describe the research goals and process of our study. We design and follow a rigorous protocol based on established guidelines for systematic secondary studies [8, 18].

3.1 Goal and research questions

The main goal of this SMS is to identify, classify, and analyze the architectures of existing MLOps systems described in scientific literature. In particular, we focus on their structural view, namely the individual components, their dependencies, and their responsibilities. The target audience of this study is (i) practitioners, e.g., ML engineers and software architects, who need to obtain an overview of the common MLOps landscape to make informed decisions, and (ii) researchers aiming to improve the state of the art in architectures of MLOps systems. To approach this goal, we have phrased our overall Research Question (RQ) as “How are MLOps architectures described in the scientific literature?” and broken it down into the following sub-RQs:

RQ1: Which are the different components and their dependencies?
This RQ helps us categorize the various components within an MLOps architecture and comprehend how different researchers and practitioners interconnect these components.

RQ2: Which tools are used to support or implement the identified components?
This RQ helps us identify tools that can be used for implementing an architecture component, as well as identify the most and the least tool-supported components.

3.2 Research process

We visualize our research process in Fig. 1 and describe the different steps in this section. Essentially, we first obtain an initial set of papers via an automated title-based search on Google Scholar, then filter this set to obtain a starting set of primary studies. Using Google Scholar as a meta-engine allows us to avoid bias towards specific publishers [2]. After extracting data from the starting set, we augment it via bidirectional one-step snowballing [25] and extract data from the newly added primary studies. Finally, the extracted data is used in a rigorous synthesis process.

Figure 1: Overview of the research process

3.2.1 Step 1: Automated initial search

By extracting relevant search terms from MLOps literature known to us and adding synonyms, we iteratively construct and test several search strings. As a result, we finally arrive at the following query to search in titles of the available literature:

allintitle: (MLOps OR “machine learning operations”) AND (model OR models OR pipeline OR pipelines OR architecture OR architectures OR architecting OR workflow OR workflows OR process OR processes)

The advantages of this final title-focused variant are its manageable number of results and the low number of false positives. Additionally, its potential limitation regarding the extensiveness of its results is compensated by snowballing. The automated query was executed via Google Scholar in September 2023 and yielded 39 potentially relevant papers.
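The structure of the query (two OR-groups joined by AND, restricted to titles) can be made explicit with a small sketch. The term lists are copied from the query above; the helper name `build_title_query` is hypothetical and only illustrates how such a string could be assembled when iterating over candidate search strings.

```python
# Term groups copied from the search query above.
MLOPS_TERMS = ["MLOps", '"machine learning operations"']
SCOPE_TERMS = [
    "model", "models", "pipeline", "pipelines", "architecture",
    "architectures", "architecting", "workflow", "workflows",
    "process", "processes",
]


def build_title_query(topic_terms, scope_terms):
    """Join each group with OR and combine the two groups with AND.

    Hypothetical helper for illustration; not part of the study's tooling.
    """
    topic = " OR ".join(topic_terms)
    scope = " OR ".join(scope_terms)
    return f"allintitle: ({topic}) AND ({scope})"


query = build_title_query(MLOPS_TERMS, SCOPE_TERMS)
print(query)
```

Keeping the two groups as separate lists makes it easy to add or drop a synonym and regenerate the string while testing alternative queries.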

3.2.2 Step 2: Paper selection

The following inclusion (I) and exclusion (E) criteria are used during the selection of primary studies.

I1: The paper describes the architecture of an MLOps system or part thereof.
I2: The described MLOps architecture is an original contribution of the paper, not simply cited related work.
E1: The paper is not written in English.
E2: The paper is a shorter or earlier version of a paper that is already included.
E3: The paper is a secondary or tertiary study.
E4: The full text of the paper is not available.
E5: The paper is not peer-reviewed and published.

To be included in the list of primary studies, a paper has to fulfill all inclusion and no exclusion criteria. To arrive at the above selection criteria, we conduct a selection pilot in which two reviewers independently review five papers and discuss their selection strategy in a consensus meeting. After fine-tuning the selection process, it is applied to all potentially relevant papers. Each paper is independently evaluated by two researchers for inclusion, with a consensus being necessary for final inclusion. Applying the selection criteria results in 18 papers that form our starting set used in the first round of data extraction and in snowballing.
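The decision rule stated above (all inclusion criteria, no exclusion criterion, and consensus between two reviewers) can be sketched as follows. The class and function names are hypothetical, and returning "discuss" on disagreement is an assumption that stands in for the consensus meeting; the text only states that consensus is necessary for final inclusion.

```python
from dataclasses import dataclass

INCLUSION = ("I1", "I2")
EXCLUSION = ("E1", "E2", "E3", "E4", "E5")


@dataclass(frozen=True)
class Review:
    """One reviewer's judgement: the criteria they consider to apply."""
    applies: frozenset


def reviewer_includes(review):
    """A paper passes if all inclusion and no exclusion criteria apply."""
    meets_all_inclusion = all(c in review.applies for c in INCLUSION)
    hits_no_exclusion = not any(c in review.applies for c in EXCLUSION)
    return meets_all_inclusion and hits_no_exclusion


def final_decision(review_a, review_b):
    """Combine two independent reviews.

    Assumption: a disagreement is resolved in a discussion, modelled
    here by returning "discuss" instead of a final verdict.
    """
    a, b = reviewer_includes(review_a), reviewer_includes(review_b)
    if a != b:
        return "discuss"
    return "include" if a else "exclude"


ok = Review(frozenset({"I1", "I2"}))
secondary = Review(frozenset({"I1", "I2", "E3"}))
print(final_decision(ok, ok))         # both reviewers include the paper
print(final_decision(ok, secondary))  # reviewers disagree
```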

3.2.3 Step 3: Data extraction

In this step, we systematically analyze the primary studies and extract data related to the RQs. To refine our data extraction strategy, we first conduct a data extraction pilot on five randomly chosen papers. Data from each paper is extracted by two authors independently and discussed in a consensus meeting leading to the final data extraction framework depicted in Table 1. For RQ1, we extract the items “Architecture or process figures” and “Architecture components”, while we extract “Tools” and “Tool-component mapping” for RQ2. The “Application domain” and the “Author affiliations” are extracted as generic information to use for further analysis. From this point on, each of the remaining papers is assigned to two authors for data extraction. The extracted data is validated through bilateral discussions and consensus between the authors.

Table 1: Data Extraction Framework

RQ	Data Item	Notes
Generic	Application domain	The respective domain the architecture is proposed for
Generic	Author affiliations	Origin of the architecture: academia, industry, or collaboration
RQ1	Architecture or process figures	The list of relevant figures and their types (values can be architecture, process, or combined)
RQ1	Architecture components	The list of components and their relationship
RQ2	Tools	The list of tools that are used or suggested in the paper
RQ2	Tool-component mapping	The list of tools mapped to the components
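As an illustration of how the data items in Table 1 fit together, one extraction record per paper could be represented as below. This is a minimal sketch: the field names, the validation rules, and the example record are assumptions made for illustration and are not taken from the replication package or from any primary study.

```python
from dataclasses import dataclass, field
from typing import Dict, List

AFFILIATIONS = {"academia", "industry", "collaboration"}
FIGURE_TYPES = {"architecture", "process", "combined"}


@dataclass
class ExtractionRecord:
    """One row of extracted data per paper, following Table 1."""
    paper_id: str
    application_domain: str                                   # Generic
    author_affiliations: str                                  # Generic
    figures: Dict[str, str] = field(default_factory=dict)     # RQ1: figure label -> type
    components: List[str] = field(default_factory=list)       # RQ1
    tools: List[str] = field(default_factory=list)            # RQ2
    tool_component_mapping: Dict[str, List[str]] = field(default_factory=dict)  # RQ2

    def __post_init__(self):
        if self.author_affiliations not in AFFILIATIONS:
            raise ValueError(f"unknown affiliation: {self.author_affiliations}")
        for label, kind in self.figures.items():
            if kind not in FIGURE_TYPES:
                raise ValueError(f"unknown figure type for {label}: {kind}")
        unmapped = set(self.tool_component_mapping) - set(self.tools)
        if unmapped:
            raise ValueError(f"mapped tools missing from tool list: {unmapped}")


# Placeholder record for illustration only; it does not describe a real paper.
record = ExtractionRecord(
    paper_id="S-example",
    application_domain="domain-agnostic",
    author_affiliations="academia",
    figures={"Fig. 1": "architecture"},
    components=["Model Repository"],
    tools=["MLflow"],
    tool_component_mapping={"MLflow": ["Model Repository"]},
)
```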

3.2.4 Step 4: Snowballing

We apply backward and forward snowballing to enrich the results obtained via automated search, as suggested by Wohlin et al. [25]. During this step, all papers that either cite or are cited by a paper from the starting set are examined for inclusion in the final set of primary studies. We apply the same selection criteria and data extraction (steps 2 and 3) and the same process as for the initial set of papers: each paper is examined independently by two researchers and consensus needs to be reached for including it; then, the data from each paper is extracted by two researchers and the results are discussed. After conducting a first round of bidirectional snowballing, we add 25 more papers to our set of primary studies. We limit the snowballing to a single round because data extraction from the snowballed papers yielded few additional components and tools compared to the starting set.
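One round of bidirectional snowballing can be sketched as a set operation on a citation graph. The graph below is a hypothetical toy example with placeholder paper labels, and `snowball_once` is an illustrative helper rather than the tooling used in the study; the returned candidates would still have to pass the selection criteria of step 2.

```python
def snowball_once(seed, cites):
    """Return candidate papers found by one round of snowballing.

    seed:  set of papers in the starting set
    cites: dict mapping each paper to the set of papers it cites
    """
    # Backward snowballing: everything cited by a seed paper.
    backward = set()
    for paper in seed:
        backward |= cites.get(paper, set())
    # Forward snowballing: every paper that cites at least one seed paper.
    forward = {paper for paper, refs in cites.items() if refs & seed}
    # Papers already in the starting set are not new candidates.
    return (backward | forward) - seed


# Hypothetical toy citation graph (placeholder labels).
toy_citations = {
    "A": {"X", "B"},
    "B": {"Y"},
    "C": {"A"},
    "D": {"Z"},
}
seed_set = {"A", "B"}
candidates = snowball_once(seed_set, toy_citations)
print(sorted(candidates))
```

In this toy graph, "D" is not a candidate because it neither cites nor is cited by a seed paper, which is exactly the boundary a single round imposes.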

3.2.5 Step 5: Data synthesis

In this phase, we harmonize and classify the extracted data per parameter (architecture components, tools). To achieve this synthesis, we use card-sorting, a lightweight, collaborative, qualitative analysis technique [27].

In particular, we use hybrid card-sorting [5], a combination of open card-sorting (where categories emerge from the data) and closed card-sorting (categories are defined beforehand based on existing taxonomies). We define the initial set of categories based on our background knowledge of MLOps and software architecture and iteratively refine and enrich this set. We conduct the card-sorting in three phases: in the preparation phase, we print all the extracted components on cards; in the execution phase, we sort the cards into meaningful categories and groups and name them; in the analysis phase, we identify the relationships between the identified components and categories. We also disambiguate and group together the extracted tools and derive their mapping to the synthesized architecture components. The final output is a general architecture of the MLOps workflows from a structural perspective (represented in a UML component diagram), along with a map between tools and components. Lastly, we synthesize the information about the mentioned domains for which the MLOps models are proposed.

4 Results

The results of our study are based on extracted and synthesized data from 43 papers published between 2020 and 2024. This timeframe aligns with the emergence and maturation of the MLOps discipline, whose beginnings are dated to early 2019 [14]. We observe a peak in the publications on this topic in 2022 (19/43 papers).

Generally, we observe that the largest group of papers (20/43) proposes domain-agnostic architectures. At the same time, six papers focus on the domain of edge computing, three on manufacturing, and each of the remaining 14 papers focuses on a different domain, including healthcare and psychomotor learning (the complete list of domains is available in the replication package [1]). Our data also shows that 23/43 papers are authored by academic researchers, 10/43 by industrial practitioners, and 10/43 as a collaboration between the two communities. This is a testament to the strong interest of both academia and industry in MLOps.

In the remainder of this section, we provide the results of the study regarding the first and second RQs.

4.1 MLOps architecture components and their dependencies (RQ1)

In total, we synthesized 35 unique architecture components by systematically going through all the architecture figures and descriptions contained in our 43 primary studies. These components are domain-agnostic, i.e., they are not tied to a specific application domain. We also identified domain-specific components such as IoT Sensors (IoT domain) or User Feedback Collector (psychomotor learning domain), but excluded them for the sake of general applicability. We group the 35 identified architecture components into 6 categories:

• Data Curation entails components responsible for gathering and processing data for the MLOps system.
• Storage and Versioning comprises components responsible for storing, versioning, and managing the data and models in the system.
• ML Training includes components responsible for training and evaluating the ML models, both in the experimentation and production phases.
• CI/CD refers to the category of components responsible for continuously building and deploying ML pipelines, models, and components.
• Inference entails the components responsible for providing predictions, making subsequent decisions, and monitoring the system.
• Infrastructure and Supporting Services comprises infrastructure components that provide system support, e.g., Container Manager, Orchestrator, etc.

Table 2 displays the 35 components by category, their responsibilities, important aliases that we identified for each component name, and the number of occurrences of each.

Table 2: MLOps architecture components, their aliases, responsibilities, and number of occurrences in the primary studies

	Component Name	Aliases	Responsibilities	#
Data Curation	Data Source	external data sources	Produces and exposes data from a real-world environment, e.g., domain events, IoT sensors, human inputs, etc.	4
	Data Collector	data acquisition, data loading	Collects raw data like events from various data sources.	10
	Data Preprocessor	data processing, data cleaning, data validation, data curation pipeline	Validates, cleans, and prepares collected data for storing as ML training datasets.	9
Storage and Versioning	Dataset Catalogue	–	Stores metadata of datasets in an organized inventory, allows users to browse datasets.	1
	Dataset Repository	data store, data repository, data versioning	Stores and versions the datasets used for ML workflows.	10
	Raw Data Store	data store	Stores the raw data that are collected from sources.	2
	Feature Store	–	Computes and stores reusable features, serves the computed features with low latency.	8
	Code Repository	source code management, source code repository	Stores and versions the training, deployment, and application source code.	9
	Model Repository	model registry, model store	Stores and versions the trained ML models along with basic metadata, e.g., their versions, etc.	21
	Artifact Repository	image repository, container registry	Stores a packaged or containerized ML component that incorporates an ML model for inference.	3
	ML Metadata Repository	experiment tracking DB, ML metadata store	Stores metadata related to model training for experiment tracking purposes, e.g., model performance metrics, etc.	15
	Feedback Database	feedback store	Stores stakeholder feedback and experiences, e.g., from domain experts or engineers, which are manually considered during iterative model development.	2
ML Training	Data Labelling Component	data annotation, ground truth annotation	Adds the ground truth labels for supervised learning models to dataset instances.	3
	Feature Engineering Pipeline	feature selection	Selects and transforms the features of the used dataset for model training.	4
	ML Experiment Pipeline (Offline)	ML pipeline (offline), manual ML pipeline, data science experiments	Develops and trains ML models at design time (more experimental and manual).	14
	ML Training Pipeline (Online)	ML pipeline (online), continuous training pipeline, incremental online learning, MLOps pipeline	Continuously trains ML models at runtime in a production environment (completely automated).	12
	Model Evaluator	–	Evaluates the prediction performance of the models during training.	4
CI/CD	ML Pipeline Builder	build and test pipeline, CI tool	Builds, tests, and packages, e.g., in containers, the code of the ML pipeline.	2
	ML Pipeline Deployer	pipeline deployment, ML training pipeline deployment, CD tool	Deploys the built and packaged code of the ML pipeline to staging or production environments.	3
	ML Model Deployer	model deployment, deployment pipeline	Deploys the trained model packaged with the dependencies, e.g., required libraries, preprocessing code, etc. to the production environment.	4
	ML Component Builder	build automation pipeline, continuous integration (CI)	Builds and tests ML components, i.e., deployment-ready containerized ML models wrapped in an API.	3
	ML Component Deployer	ML deployment, continuous delivery, continuous deployment, CI/CD pipeline	Deploys the ML components to staging or production environments.	7

Inference	Inference Service	model inference, production ML service, model server	Serves the trained models to provide predictions on new data (ML component).	10
	Inference Engine	pool inference, local inference engine, model serving component	Includes an ML runtime into which trained models can be continuously deployed to serve predictions.	9
	Runtime Model Monitor	performance monitor, monitoring component, model runtime monitor	Continuously observes the model-serving performance and infrastructure in real-time.	12
	Trigger	retraining triggering webhook, retraining trigger	Triggers retraining of the ML models based on predefined events and intervals or a predefined performance threshold observed via Runtime Model Monitor.	3
	Model Comparison Runner	model comparison runner, model metrics evaluator	Compares the newly trained model to the old model and deploys the better performing one.	2
	Decision Processor	decision processing	Derives decisions based on the predictions of the model. The decisions are then acted upon by the actors inside or outside the system.	2
Infrastructure and Supporting Services	Resource Manager	resource leasing, model engine	Provides foundational hardware and software computational resources. The provided computational resources can be distributed or non-distributed and scalable or non-scalable.	5
	Communication Middleware	message queue, event streaming bus	Distributes the received requests and model predictions to resources.	4
	Container Manager	container service	Manages, e.g., starts and stops, the containerized ML components that are built with the ML Component Builder.	2
	Orchestrator	adaptive scheduler, workflow orchestration, job tracking module	Provides system-wide orchestration, and decides the execution schedule of multiple models balancing throughput and latency.	5
	Log Master	logging, info collector, object store, predictions store	Records and saves information regarding all the actions in the system, e.g., running the services, training, user requests, predictions, etc.	3
	API	API gateway	Provides interaction between the components within the platform and also between the platform and external entities.	5
	MLOps User Interaction Manager	Ops dashboard, front-end	Provides interaction between the MLOps team and the MLOps platform.	2

We combined the synthesized components and their dependencies into a holistic UML component diagram depicted in Fig. 2. In this diagram, we distinguish five types of architecture components. Baseline components are mandatory: they are present in all variants of MLOps architectures and need to be complemented with the components of either the inference service group or the inference engine group to form a complete MLOps architecture. Optional components are non-essential components that can be situationally useful, e.g., the Model Comparison Runner. As a special type of optional components, the components of the online training group can be added to an MLOps architecture, but always together as a group. Note that the Infrastructure and Supporting Services components are not included in the diagram, since they provide support to the whole system.

The baseline components include the Data Preprocessor, which reads its input from the Raw Data Store and stores its results in the Dataset Repository. The latter provides data to both the Data Labelling Component and the ML Experiment Pipeline. The ML Experiment Pipeline uses the datasets from the Dataset Repository and the ML algorithms from the Code Repository to train ML models, and then stores the trained ML models in the Model Repository. After deployment, the Runtime Model Monitor provides real-time model performance data.

Overall, the blueprint for assembling a complete MLOps system involves the above-mentioned baseline components, as well as incorporating either the inference service or the inference engine group, and potentially the online training group and/or other optional components. In the following, we describe four characteristic architecture variants (V1 to V4) depicted in Fig. 2.
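The assembly rule described here can be sketched as set composition. The component sets below follow only the textual descriptions in this section (the baseline paragraph and the descriptions of V1 to V4); Fig. 2 is not reproduced here, so the exact group membership, in particular the completeness of `BASELINE`, should be read as an assumption. The function name `assemble` is hypothetical.

```python
# Baseline components named in the text (possibly incomplete; see Fig. 2).
BASELINE = frozenset({
    "Raw Data Store", "Data Preprocessor", "Dataset Repository",
    "Data Labelling Component", "ML Experiment Pipeline (Offline)",
    "Code Repository", "Model Repository", "Runtime Model Monitor",
})
# Serving groups as described for V1 and V2.
INFERENCE_SERVICE_GROUP = frozenset({
    "Inference Service", "ML Component Builder",
    "ML Component Deployer", "Artifact Repository",
})
INFERENCE_ENGINE_GROUP = frozenset({"Inference Engine", "ML Model Deployer"})
# Online training group as described for V3; it is always added as a whole.
ONLINE_TRAINING_GROUP = frozenset({
    "Trigger", "ML Training Pipeline (Online)",
    "ML Pipeline Builder", "ML Pipeline Deployer",
})


def assemble(serving, online_training=False, optional=()):
    """Compose a variant: baseline + exactly one serving group (+ extras)."""
    if serving == "service":
        group = INFERENCE_SERVICE_GROUP
    elif serving == "engine":
        group = INFERENCE_ENGINE_GROUP
    else:
        raise ValueError("serving must be 'service' or 'engine'")
    components = set(BASELINE) | group
    if online_training:
        components |= ONLINE_TRAINING_GROUP
    return components | set(optional)


V1 = assemble("service")
V2 = assemble("engine")
V3 = assemble("service", online_training=True)
V4 = assemble("engine", online_training=True)
# An additional variant: V4 plus one optional component.
V4_with_runner = assemble("engine", True, optional={"Model Comparison Runner"})
```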

V1: This architecture variant describes an architecture containing an Inference Service to serve ML models in a production environment. The Inference Service represents an ML component [15], i.e., a containerized, deployment-ready ML model wrapped into an API that is usable for predictions. Thus, in this variant, these ML components are built, tested, and packaged by the ML Component Builder, deployed through the ML Component Deployer, and stored in the Artifact Repository.

V2: Contrary to the Inference Service of V1, this architecture variant involves an Inference Engine to serve the trained ML models. The Inference Engine contains a runtime for ML models that allows the deployment of new models through the ML Model Deployer. Hence, this variant efficiently updates only the ML model instead of always replacing the complete ML component. Alternatively, the Inference Engine can also check the Model Repository periodically or in an event-based fashion to fetch a new model version if certain criteria are met.

V3: The third variant combines the components in the online training group with V1. The presence of Trigger entails the presence of ML Training Pipeline (Online), ML Pipeline Builder, ML Pipeline Deployer, and the optional presence of Model Evaluator. The Trigger, using data provided by the Runtime Model Monitor, submits a periodic or event-based retraining request to the ML Training Pipeline. This automatically retrains and deploys a new model in production.

V4: The fourth variant, similar to the third, combines the components in the online training group with V2, which results in the automatic retraining and deploying of ML models in the Inference Engine.

Vx: In addition to these four described variants, the selection of any combination of the optional components can result in several additional variants. Table 2 can be consulted for a detailed description of all the other components. As an example, we describe the addition of the Model Comparison Runner here. Adding this component to either V3 or V4 allows more informed model update decisions. It compares the performance of a newly trained model in the production environment to that of the currently deployed model and keeps the one that performs better. In the absence of this component, the newly trained model is always deployed, even if its performance is inferior to that of the current one.
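The effect of the Model Comparison Runner on update decisions can be sketched as a small decision function. This is an illustrative simplification under stated assumptions: performance is reduced to a single score where higher is better, and ties keep the currently deployed model. The function name and parameters are hypothetical.

```python
def choose_model_to_serve(current_score, candidate_score, use_comparison_runner):
    """Decide which model is served after a retraining run.

    Assumptions: one scalar score per model, higher is better,
    and a tie keeps the currently deployed model.
    """
    if not use_comparison_runner:
        # Without the component, the newly trained model is always deployed.
        return "candidate"
    return "candidate" if candidate_score > current_score else "current"


# A retrained model that performs worse than the deployed one:
print(choose_model_to_serve(0.90, 0.85, use_comparison_runner=False))
print(choose_model_to_serve(0.90, 0.85, use_comparison_runner=True))
```

The first call deploys the weaker candidate, which is the situation the component is meant to prevent; the second call keeps the deployed model.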

Considering the eight optional components and four described major variants, the selection of any combination between them results in a large number of different architecture variants ($4 \times (2^{8}-1) = 1020$). Selecting the most suitable variant may depend on factors like specific design decisions, resource availability, scalability considerations, required update frequency, or technological expertise.
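The count can be checked by brute-force enumeration. The optional components are represented by placeholder labels here, since only their number (eight) matters for the count; the labels are hypothetical and not the component names from Fig. 2.

```python
from itertools import combinations

BASE_VARIANTS = ("V1", "V2", "V3", "V4")
# Placeholder labels for the eight optional components.
OPTIONAL = tuple(f"optional_{i}" for i in range(1, 9))

# All non-empty subsets of the optional components: 2^8 - 1 of them.
non_empty_subsets = [
    subset
    for size in range(1, len(OPTIONAL) + 1)
    for subset in combinations(OPTIONAL, size)
]

# Each major variant can be extended with each non-empty subset.
extended_variants = [
    (variant, subset)
    for variant in BASE_VARIANTS
    for subset in non_empty_subsets
]

print(len(non_empty_subsets), len(extended_variants))
```

The empty subset is excluded because it corresponds to the four major variants themselves; counting those as well would give $4 \times 2^{8} = 1024$ variants in total.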

4.2 Tools used to support or implement the components (RQ2)

Among the 43 reviewed papers, we identified 76 tools in total. Fig. 3 depicts a heatmap of the tools and platforms that are mentioned at least 3 times among the papers, mapped to the components that they support. In the figure, it is evident that Jenkins is the most frequently mentioned tool. This tool is used to support the components within the CI/CD category. AWS SageMaker is also one of the most popular tools. This tool is observed to support the highest number of components in the architecture of MLOps workflows (10 distinct components among 5 categories of our architecture). MLflow follows by supporting six different components mentioned in several papers.

Within the six categories of architecture components, the greatest variety of tools is mentioned for Storage and Versioning, followed by CI/CD. The tools supporting the entire CI/CD category encompass Jenkins, GitHub Actions, TeamCity, CircleCI, and Bamboo. At the level of individual components, however, the greatest variety of tools is mentioned for Model Repository, ML Model Deployer, and ML Pipeline Deployer. There are also several components for which no tool is mentioned. These include Dataset Catalogue, Raw Data Store, Feedback Database, Data Collector, Data Source, Model Evaluator, Log Master, MLOps User Interaction Manager, Trigger, and Decision Processor. The complete mapping between the tools and the components can be accessed through the replication package [1].
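A tool-component map of this kind can be queried in both directions, e.g., to list the tools per component or to find components without tool support. The sketch below uses only a small excerpt stated in this section: the five CI/CD tools, read here as each supporting all five CI/CD components (an interpretation of "supporting the entire CI/CD category"), plus three of the components for which no tool is mentioned. The helper names are hypothetical, and the full mapping is in the replication package [1].

```python
CI_CD_COMPONENTS = {
    "ML Pipeline Builder", "ML Pipeline Deployer", "ML Model Deployer",
    "ML Component Builder", "ML Component Deployer",
}
CI_CD_TOOLS = ["Jenkins", "GitHub Actions", "TeamCity", "CircleCI", "Bamboo"]

# Excerpt of the map: tool -> components it supports (interpretation, see above).
tool_to_components = {tool: set(CI_CD_COMPONENTS) for tool in CI_CD_TOOLS}

# Excerpt of the component list, including three components without tools.
components = CI_CD_COMPONENTS | {"Trigger", "Log Master", "Dataset Catalogue"}


def invert(mapping):
    """Turn tool -> components into component -> tools."""
    inverted = {}
    for tool, supported in mapping.items():
        for component in supported:
            inverted.setdefault(component, set()).add(tool)
    return inverted


def unsupported(all_components, mapping):
    """Components that no tool in the mapping supports."""
    covered = set().union(*mapping.values()) if mapping else set()
    return set(all_components) - covered


component_to_tools = invert(tool_to_components)
print(sorted(component_to_tools["ML Model Deployer"]))
print(sorted(unsupported(components, tool_to_components)))
```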

Lastly, some papers mention tools and services that support end-to-end MLOps workflows in general, without mapping them to any specific component. These tools are AWS SageMaker, MLflow, Kubeflow, Weights & Biases, ClearML, MLReef, Iguazio, Polyaxon, Vertex AI, Azure ML, and Snorkel.

5 Discussion

In this section, we discuss the implications of the derived results of this SMS. The UML component diagram and component-tool mapping provided in this study serve as valuable references for practitioners and researchers who aim to design or enhance MLOps systems. Nonetheless, the findings derived from this study suggest several noteworthy points, which we will discuss in the following.

Complexity and mixed views in architecture diagrams. A majority of the analyzed figures and descriptions of MLOps workflows were a combination of architecture, process, and stakeholder roles, thereby combining several different architectural views and concerns in a single diagram. Even though this approach is a common practice and makes the provided information more compact, the increased complexity reduces the clarity of the provided architectures.

Non-standard notations to represent architectures. Almost all of the analyzed figures use informal box-and-line diagrams as a notation. Among the 43 reviewed papers, only one paper (S23) uses a standard notation, the Fundamental Modelling Concepts (FMC) [4]. Combined with the complexity of the mixed views, and in line with the results of Warnett and Zdun [24], this substantially impacts the understandability of these MLOps architecture representations.

Level of abstraction in architectures. Among the extracted figures and descriptions of the papers, the level of detail and abstraction varied over a large spectrum. During the study, we extracted and synthesized data from figures ranging from as few as six high-level architecture components (S10, S43) to complex “combined figures” with over 40 concrete entities, including components, actions, and stakeholder roles (S7). The different levels of abstraction and the multitude of dependencies between components also allow many fine-grained variations in component dependencies and interface directions. For simplicity, we modeled only the most common dependencies in the diagram.

Tools in place of architectural entities. In addition to the mixed architectural views and different levels of abstraction, the extracted figures in some papers also included a mixture of architectural entities and the employed tools as stand-alone entities (see, e.g., S31). This complexity in the views makes it difficult to understand and compare the architectures.

Inconsistently named components and activities. The terminology for the same individual components or activities could vary substantially among the papers. We identified two types of inconsistent naming: (a) some papers use common component names that imply different responsibilities, e.g., using “Feature Store” for a component that provides the training data for the ML training pipeline instead of “Data(set) Store” or using “Artifact Store” instead of “ML Model Store”, and (b) other papers use unique names to represent common ML-based software components, e.g., “knowledge base manager” for a component that stores and versions the trained ML models (Model Repository in our synthesis). These inconsistencies highlight why communication and collaboration in MLOps projects are often difficult.
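One way to handle such naming inconsistencies during data extraction is a per-paper alias table that maps the name used in a paper to the name used in the synthesis. The sketch below is only an illustration of that idea, not the tooling used in this study: the paper identifiers P1 to P3 are hypothetical placeholders, while the name pairs follow the examples given above. The table is keyed per paper because the same name (e.g., “Feature Store”) may be used appropriately in one paper and misleadingly in another.

```python
# Hypothetical alias table: (paper id, name used in the paper) -> synthesized name.
# Paper ids "P1".."P3" are placeholders; the name pairs follow the examples in the text.
ALIASES = {
    ("P1", "feature store"): "Data(set) Store",
    ("P2", "artifact store"): "ML Model Store",
    ("P3", "knowledge base manager"): "Model Repository",
}


def canonical_name(paper, raw_name, aliases=ALIASES):
    """Return the synthesized component name, or the trimmed raw name if no alias applies."""
    key = (paper, raw_name.strip().lower())
    return aliases.get(key, raw_name.strip())
```

For instance, `canonical_name("P1", "Feature Store")` resolves to the synthesized name, whereas the same label in a paper without an alias entry is left unchanged.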

Domain-specific architecture components. In almost 50% of the papers, the provided architectures are domain-specific and therefore contain components that are only situationally applicable or focused on a specific application scenario. Examples are the “additive manufacturing” components in S36 and the “Blockchain”-related components in S40. In the analysis, we generalized these components to a relevant category or discarded the components that represented a very specialized entity.

End-to-end tools only mentioned for a certain set of components. A tool-related observation is that most of the end-to-end MLOps tools like MLflow, Kubeflow, or AWS SageMaker are recommended only for a subset of components, rather than for the entire workflow. For example, in the reviewed papers, MLflow is mentioned only for components in the Storage and Versioning, Inference, and Infrastructure and Supporting Services categories. Interestingly, this tool is not mentioned for any component in the ML Training category.

6 Threats to validity

Following the categorization by Wohlin et al. [26] and the checklist by Ampatzoglou et al. [2], we outline the threats that may have affected the validity of our research and the actions that we take to mitigate them.

Threats to internal validity undermine the conclusion about a possible causal relationship between the study and the outcome [26]. To mitigate this threat, we assign two researchers for the study selection and data extraction phases, who perform the selection and extraction independently, and then three researchers discuss the results in consensus meetings. Another possible threat to internal validity in this study stems from the selection of papers from various domains. Varying author expertise across fields, particularly outside ML-based systems, may influence the accuracy of the architecture and process descriptions in papers, which influences our data extraction. To mitigate this threat, we have mapped every component and activity name and description to the authors’ intentions, e.g., in some papers authors refer to a dataset repository as a feature store.

External validity is concerned with the extent to which the findings of the study can be generalized [26]. A possible threat to the external validity of our study is the restriction of the selection to peer-reviewed sources, which may limit our initial set of papers. This is a research design decision to ensure the reliability of the selected papers. We mitigate this threat by applying our search query on Google Scholar as a meta-engine, which results in papers from different venues, as well as by applying one round of bidirectional snowballing. However, since we focus on peer-reviewed scientific papers, we might miss works from the many practitioners who are active in this very popular topic. This may introduce a threat to the generalizability of the results.

Reliability is concerned with the extent to which the data and the analysis are dependent on the specific researchers [2, 20]. A possible threat in our study concerning reliability can be the authors’ bias in synthesizing data due to the nature of the extracted data and the different levels of abstraction of the architecture in the papers. To mitigate this threat, three authors participate in analyzing and synthesizing the data.

7 Conclusions

Based on an SMS with 43 scientific papers, we synthesized architectural MLOps components, their dependencies, and tools to implement them. Furthermore, we combined these components into a holistic MLOps architecture and discussed several architectural variants that emerge from the literature. Our results contribute to understanding the architecture aspects of MLOps systems and potentially support communication in this complex and still maturing domain.

Regarding future work, a structural perspective is not the only architectural view that is important in MLOps. Therefore, we plan to synthesize a process view of MLOps with a similar research design and to map activities to MLOps roles. Moreover, synthesizing architectural decisions, best practices, and antipatterns in this domain may also support practitioners, which we plan to provide via our long-term goal, a reference architecture for MLOps. In the end, practitioners could be best supported if we could clearly link architectural MLOps variants to functional and quality requirements, so that practitioners can easily choose the variant that best suits their needs. To allow such endeavors and to increase transparency, we share our research artifacts on Zenodo [1].

Data Availability

The data and artifacts of this study are available as a replication package [1].

Acknowledgements

This research is supported by ExtremeXP, a project co-funded by the European Union Horizon Programme under Grant Agreement No. 101093164.

References

[1] Amou Najafabadi, F., Bogner, J., Gerostathopoulos, I., Lago, P.: An Analysis of MLOps Architectures: A Systematic Mapping Study. [Data set], Zenodo (2024), https://doi.org/10.5281/zenodo.11067770
[2] Ampatzoglou, A., Bibi, S., Avgeriou, P., Verbeek, M., Chatzigeorgiou, A.: Identifying, categorizing and mitigating threats to validity in software engineering secondary studies. Information and Software Technology 106, 201–230 (2019)
[3] Bass, L., Weber, I., Zhu, L.: DevOps: A Software Architect’s Perspective. Addison-Wesley Professional, 1st edn. (2015)
[4] The fundamental modeling concepts, http://www.fmc-modeling.org, online.
[5] Hudson, W.: Card sorting. The Encyclopedia of Human-Computer Interaction (2014), https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/card-sorting, accessed Mar. 31, 2024
[6] Idowu, S., Strüber, D., Berger, T.: Asset management in machine learning: a survey. In: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice. pp. 51–60. ICSE-SEIP ’21, IEEE Press (2021)
[7] John, M.M., Olsson, H.H., Bosch, J.: Towards MLOps: A Framework and Maturity Model. In: 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). pp. 1–8. IEEE (2021)
[8] Keele, S., Kitchenham, B., et al.: Guidelines for performing systematic literature reviews in software engineering (2007)
[9] Kolar Narayanappa, A., Amrit, C.: An analysis of the barriers preventing the implementation of MLOps. In: International Working Conference on Transfer and Diffusion of IT. pp. 101–114. Springer (2023)
[10] Kolltveit, A.B., Li, J.: Operationalizing machine learning models: A systematic literature review. In: Proceedings of the 1st Workshop on Software Engineering for Responsible AI. pp. 1–8 (2022)
[11] Kreuzberger, D., Kühl, N., Hirschl, S.: Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 11, 31866–31879 (2023)
[12] Kumara, I., Arts, R., Di Nucci, D., Van Den Heuvel, W.J., Tamburri, D.A.: Requirements and reference architecture for MLOps: Insights from industry. Authorea Preprints (2023)
[13] Lima, A., Monteiro, L., Furtado, A.P.: MLOps: Practices, maturity models, roles, tools, and challenges - a systematic literature review. ICEIS (1) pp. 308–320 (2022)
[14] Treveil, M., the Dataiku Team: Introducing MLOps. O’Reilly Media, Inc. (2021)
[15] Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A., Vollmer, A.M., Wagner, S.: Software Engineering for AI-Based Systems: A Survey. ACM Transactions on Software Engineering and Methodology 31(2) (2022)
[16] Mboweni, T., Masombuka, T., Dongmo, C.: A systematic review of machine learning DevOps. In: 2022 International Conference on Electrical, Computer and Energy Technologies (ICECET). pp. 1–6. IEEE (2022)
[17] Nakagawa, E.Y., Antonino, P.O.: Reference Architectures for Critical Domains: Industrial Uses and Impacts. Springer Nature (2023)
[18] Petersen, K., Vakkalanka, S., Kuzniarz, L.: Guidelines for conducting systematic mapping studies in software engineering: An update. Information and Software Technology 64, 1–18 (2015)
[19] Recupito, G., Pecorelli, F., Catolino, G., Moreschini, S., Di Nucci, D., Palomba, F., Tamburri, D.A.: A multivocal literature review of MLOps tools and features. In: 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). pp. 84–91. IEEE (2022)
[20] Runeson, P., Höst, M.: Guidelines for conducting and reporting case study research in software engineering. Empirical software engineering 14, 131–164 (2009)
[21] Testi, M., Ballabio, M., Frontoni, E., Iannello, G., Moccia, S., Soda, P., Vessio, G.: MLOps: A taxonomy and a methodology. IEEE Access 10, 63606–63618 (2022)
[22] Warnett, S.J., Zdun, U.: Architectural Design Decisions for Machine Learning Deployment. In: 2022 IEEE 19th International Conference on Software Architecture (ICSA). pp. 90–100 (2022)
[23] Warnett, S.J., Zdun, U.: Architectural Design Decisions for the Machine Learning Workflow. Computer 55(3), 40–51 (2022), publisher: IEEE
[24] Warnett, S.J., Zdun, U.: On the Understandability of MLOps System Architectures. IEEE Transactions on Software Engineering pp. 1–25 (2024)
[25] Wohlin, C.: Guidelines for snowballing in systematic literature studies and a replication in software engineering. In: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering. EASE ’14, ACM (2014)
[26] Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in software engineering. Springer Science & Business Media (2012)
[27] Zimmermann, T.: Card-sorting: From text to themes. In: Perspectives on Data Science for Software Engineering, pp. 137–141. Elsevier (2016)
