AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
Yinfang Chen 1, Manish Shetty 2, Gagan Somashekar 3, Minghua Ma 3, Yogesh Simmhan 4, Jonathan Mace 3, Chetan Bansal 3, Rujia Wang 3, Saravan Rajmohan 3
ABSTRACT

AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOpsLab, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOpsLab can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOpsLab, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.
‘Dev’ side of DevOps by accelerating software development. However, progress in AI for ‘Ops’, particularly AgentOps, remains limited, due to the lack of high-quality benchmarks for diverse, realistic scenarios. Addressing this gap requires a framework that aids the design, development, and evaluation of AIOps agents within an interactive environment, a key contribution of this paper.

Challenges and contributions. Building a holistic benchmark framework that allows agents to interact dynamically with the cloud poses several challenges. The first challenge is to manage an evaluation flow that is generally applicable to diverse agents and clouds, powerful enough to evaluate agents on complex and realistic operational tasks, and valuable enough to provide different forms of feedback and observability, together with the extensibility that makes it possible for users to accommodate new tasks and agents. While existing tools address individual components of AIOps evaluation, such as observability (He et al., 2023; Simonsson et al., 2021), application suites (Gan et al., 2019; Zhou et al., 2021; Sriraman and Wenisch, 2018), and chaos engineering (Netflix, 2011; ChaosBlade Team, 2019; ChaosMesh Authors, 2022), they lack the integration necessary to support a unified AIOps evaluation.

We present AIOpsLab, a holistic framework that can automatically manage the entire end-to-end evaluation process for AIOps solutions. This involves deploying services, fault injection, workload generation, orchestrating the agent-cloud interaction, and analyzing results. Specifically, AIOpsLab features the Agent-Cloud Interface (ACI), a unified interface that enables agents to interact with the cloud. The ACI allows agents to communicate, take action, and receive feedback, orchestrating these interactions to detect and resolve issues in dynamic and interactive environments.

Moreover, a common challenge in operations benchmarks is the lack of realistic evaluation scenarios, as existing approaches often rely on static datasets, such as system metrics (Han et al., 2022; Jacob et al., 2020) that are typically time series data, or on a fixed question-answer format (Liu et al., 2023). Such setups do not capture the dynamic, unpredictable, and evolving nature of real-world cloud environments, where workloads and incidents fluctuate over time. To make matters worse, recent efforts on AgentOps (Wang et al., 2023; Zhang et al., 2024a) use proprietary services and datasets. Furthermore, existing AIOps approaches and their benchmarks often focus only on isolated aspects of the incident lifecycle, such as anomaly detection (Yu et al., 2024b) or fault localization (Sun et al., 2024). This lacks a cohesive framework to evaluate AIOps agents comprehensively. Moreover, it limits support for decision-making that could assist in chaining algorithms or selecting the most suitable agent for a given operation scenario.

To address these limitations, we designed a set of evaluation scenarios, referred to as problems, which replicate realistic incidents within the microservice system. AIOpsLab's problem pool is structured around a task-level taxonomy that categorizes the tasks of different problems across the incident management lifecycle. Our approach ensures that evaluation scenarios go beyond simple performance or crash failures (which cannot be further analyzed or mitigated by the agents), incorporating fine-grained root causes to fully assess the diagnostic and mitigation abilities of AIOps agents.

Implementation. We developed AIOpsLab, an innovative framework for building AgentOps benchmarks to evaluate LLM-based AIOps agents. AIOpsLab utilizes two microservice applications from DeathStarBench (Gan et al., 2019) as testbeds, along with their workload generators. An extensible fault library, integrated with ChaosMesh (ChaosMesh Authors, 2022), enables diverse fault injections into the system. A telemetry observer, incorporating Prometheus (Prometheus Authors, 2024) for metrics, Jaeger (Jaeger Authors, 2024) for tracing, and Filebeat (Elasticsearch, 2024b) and Logstash (Elasticsearch, 2024a) for logging, supports on-disk storage of telemetry data, facilitating evaluations of both traditional AIOps algorithms and agentic solutions. We also integrate Helm and Kubernetes APIs into AIOpsLab's orchestrator implementation.

To demonstrate the application of our framework in evaluating LLM-based agents as the benchmark, we use AIOpsLab to create 48 problems as evaluation scenarios covering different types of AIOps tasks, and register four agents of different types on those problems. The agent registration is lightweight, requiring less than a hundred lines of code to implement. Our evaluation process reveals distinct challenges agents face across tasks.

Summary. This paper makes the following contributions:

• We unravel the requirements and challenges of achieving a holistic framework that supports the design, development, and evaluation of autonomous AIOps agents;
• We develop a framework, AIOpsLab, which can not only deploy microservice cloud environments, inject faults, generate workloads, and export telemetry data but also orchestrate these components and provide agent-cloud interfaces for interacting with and evaluating agents;
• We leverage the AIOpsLab framework to construct a benchmark suite with 48 problems across different AIOps tasks in an interactive environment and evaluate four LLM-based agents;
• We provide a detailed analysis of the agents' performance and limitations by evaluating them on AIOpsLab;
• We will make AIOpsLab publicly available.[1]

[1] The link will be provided.
[Figure 2: architecture diagram. The Orchestrator (§2.2) registers Agents (§3.1) and coordinates the Fault Generator (§2.4) with its extensible Fault Library, the Workload Generator (§2.2.3) with its Workload Policy, the Services Under Test (§2.3) such as SocialNetwork and HotelReservation, the Telemetry Collector (§2.5), the Problem Pool (§3.3), and the Evaluator (§2.2.3). A Problem Definition (§2.1) specifies a Task T, a Workload W, a specified Fault F, and an Expected Solution S'; the tasks (§2.4.1) are Detect, Localize, RCA, and Mitigate. Agents take actions and submit solutions, which are evaluated into a result.]
Figure 2. Overview of AIOpsLab. The Orchestrator coordinates interactions between various system components and serves as the Agent-Cloud Interface (ACI). Agents engage with the Orchestrator to solve tasks, receiving a problem description, instructions, and relevant APIs. The Orchestrator generates diverse problems using the Workload and Fault Generators, injecting these into applications it can deploy. The deployed service has observability, providing telemetry such as metrics, traces, and logs. Agents act via the Orchestrator, which executes their actions and updates the service's state. The Orchestrator evaluates the final solution using predefined metrics for the task.
that agents can make meaningful progress towards their objectives. Some APIs that AIOpsLab provides by default include get_logs (fetch logs), get_metrics (fetch metrics), get_traces (fetch traces), and exec_shell (execute shell commands after applying security policy filters).

Example 2.2. This example illustrates how the ACI is defined in AIOpsLab as APIs that agents can use.

    class TaskActions:
        def get_traces(ns: str, duration: int = 5) -> str:
            """
            Collects trace data of the services from Jaeger.
            Args:
                ns (str): The K8S namespace.
                duration (int): Duration to collect traces.
            Returns:
                str: Path to the directory where traces are saved.
            """
            trace_api = TraceAPI(ns)
            end_t = datetime.now()
            start_t = end_t - timedelta(duration)
            traces = trace_api.extract_traces(start_t, end_t)
            return trace_api.save_traces(traces)

As shown, the ACI encapsulates complex operations behind simple APIs like get_traces. On initializing a problem, the Orchestrator automatically extracts documentation from these APIs to provide as context C to the agent. At runtime, agents can specify a wide range of actions on the service (e.g., scaling, redeploying, patching) by way of the Orchestrator's privileged access. Finally, the Orchestrator conveys the service's state after each action with high-quality feedback to the agent, including outputs, error messages, and tracebacks.

2.2.2 Session Interface

Another key responsibility of the Orchestrator is to manage the lifecycle of the agent and the service. We implement the Orchestrator as a session-based system, where a Session is created for each instance of an agent solving a problem. Agents are registered with the Orchestrator, and a session starts with simple API calls passing a unique problem identifier (1). AIOpsLab's design is highly flexible and integrates with the growing LLM and agent framework space. Our only requirement is that the agent must implement a get_action method with the following signature: async def get_action(state: str) -> str. It takes the service's state as input from the Orchestrator and returns the next action the agent wants to take. Note that this could be a simple wrapper function around any existing agent framework.

Example 2.3. In this simplified example, we illustrate how an Agent can be onboarded to AIOpsLab.

    import asyncio

    from aiopslab import Orchestrator

    class Agent:
        def __init__(self, prob, instructs, apis):
            self.prompt = self.set_prompt(prob, instructs, apis)
            self.llm = GPT4()

        async def get_action(self, state: str) -> str:
            return self.llm.generate(self.prompt + state)

    # initialize the orchestrator
    orch = Orchestrator()
    pid = "misconfig_app_hotel_res-mitigation-1"
    prob_desc, instructs, apis = orch.init_problem(pid)
    # register and evaluate the agent
    agent = Agent(prob_desc, instructs, apis)
    orch.register_agent(agent, name="myAgent")
    asyncio.run(orch.start_problem(max_steps=10))

As shown, on initializing a problem, the Orchestrator shares the context necessary for the agent to solve the problem. It then polls (via get_action) for the agent's next action.
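The paper does not show the Orchestrator's internal loop; the sketch below illustrates one plausible way a session could poll get_action until the agent submits or the step budget runs out. Only get_action is part of the stated contract; the other method names are assumptions.

    # Hypothetical session loop; observe_state, execute, is_submission, and
    # evaluate are assumed names, not AIOpsLab's actual Orchestrator methods.
    async def run_session(agent, env, max_steps: int = 10):
        state = env.observe_state()                  # initial problem context / service state
        for _ in range(max_steps):
            action = await agent.get_action(state)   # poll the registered agent
            if env.is_submission(action):            # agent submits its final answer
                return env.evaluate(action)
            state = env.execute(action)              # run the action and return feedback
        return env.evaluate(None)                    # step budget exhausted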
2.2.3 Other Interfaces

Problem Initializers. As described in Section 2.1, each [...]

[Figure: faults provided by AIOpsLab (§2.4) are divided into Symptomatic Faults (§2.4.2) and Functional Faults (§2.4.3).]
Table 2. Selected faults used to instantiate the problems for evaluation in AIOpsLab. Ext. stands for extensibility: ● denotes that the fault can be easily used to construct other problems; ◐ denotes that some manual effort is needed to create new problems; ○ means the fault is specific to some problems and cannot be applied to create other problems.

No. | Name                  | Application                     | Task Level | Category                    | Ext. | #  | Problem Description
1   | AuthenticationMissing | HotelReservation                | 1, 2, 3, 4 | Functional / Virtualization | ◐    | 4  | Missing authentication credentials cause access denial to MongoDB.
2   | TargetPortMisconfig   | SocialNetwork                   | 1, 2, 3, 4 | Functional / Virtualization | ●    | 12 | The service cannot connect to the specified port due to misconfiguration.
3   | RevokeAuth            | HotelReservation                | 1, 2, 3, 4 | Functional / Application    | ◐    | 8  | Revoked authentication causes database connection failure.
4   | UserUnregistered      | HotelReservation                | 1, 2, 3, 4 | Functional / Application    | ◐    | 8  | The database service has access failures after the user was unregistered.
5   | BuggyAppImage         | HotelReservation                | 1, 2, 3, 4 | Functional / Application    | ○    | 4  | Connection code bug in the application image causes access issues.
6   | ScalePod              | SocialNetwork                   | 1, 2, 3, 4 | Functional / Virtualization | ●    | 4  | Incorrect scaling operation makes the number of pods zero for a service.
7   | AssignNonExistentNode | SocialNetwork                   | 1, 2, 3, 4 | Functional / Virtualization | ●    | 4  | Pod stuck in a pending/failure status due to wrong assignment to a non-existent node.
8   | NetworkLoss           | HotelReservation                | 1, 2       | Symptomatic                 | ●    | 2  | Network loss causes communication failures for a specific service.
9   | PodFailure            | HotelReservation                | 1, 2       | Symptomatic                 | ●    | 2  | Service interruption due to a pod failure.
10  | Noop                  | HotelReservation, SocialNetwork | 1          | -                           | ●    | 2  | No faults injected into the system.
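As a rough illustration of the extensibility column in Table 2, the sketch below parameterizes one fault by its injection target and task level; the helper and field names are assumptions, not the actual fault-library interface.

    # Hypothetical helper showing how an extensible fault (Ext. = ● in Table 2)
    # could be instantiated against different injection targets.
    def make_port_misconfig_problems(services, levels=(1, 2, 3, 4)):
        problems = []
        for svc in services:
            for level in levels:
                problems.append({
                    "fault": "TargetPortMisconfig",
                    "target": svc,        # e.g., "user-service"
                    "task_level": level,  # 1=Detect, 2=Localize, 3=RCA, 4=Mitigate
                })
        return problems

    # 3 services x 4 task levels -> 12 problems, consistent with Fault 2's count in Table 2.
    problems = make_port_misconfig_problems(
        ["user-service", "text-service", "post-storage-service"]
    )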
use GPT-3.5-TURBO and GPT-4-TURBO (Achiam et al., 2023) that have access to only a secure shell as baselines (GPT-W-SHELL). In addition, we evaluate the performance of ReAct (Yao et al., 2023), which extends chain-of-thought reasoning (Wei et al., 2022b) by integrating reasoning and acting in an interleaved manner.

As for cloud operation-specific agents, we choose FLASH (Zhang et al., 2024b). FLASH employs a workflow automation system that monitors execution status and decomposes complex instructions into manageable, conditional segments. It incorporates hindsight generation to learn from past interactions. As FLASH was not publicly available at the time of writing, we develop a simplified version that retrospectively generates insights after each step.

To compare with other AIOps approaches specific to a certain type of task, we evaluate three state-of-the-art, non-LLM-based AIOps algorithms on AIOpsLab, using (multi-modal) telemetry data as input. They are: MKSMC (Çetin and Tasgin, 2020) for detection, and RMLAD (Wang et al., 2020) and PDiagnose (Hou et al., 2021) for localization.

3.2 Metrics

Correctness. This metric measures the accuracy of the agent's response to problems. It evaluates whether the agent successfully detects, localizes, analyzes, and resolves the problems as expected.

Time/Steps. These metrics evaluate the efficiency of the AIOps agent for each type of task. For example, Time-to-Detect (TTD) is the time elapsed from the occurrence of a fault to its detection, and Time-to-Mitigate (TTM) is the time taken from detection to complete mitigation of the fault. The number of steps or actions taken to solve the problem is also recorded. Note that this is the number of times the agent interacts with AIOpsLab, not the number of requests sent to the backend LLM.

Cost. We use the number of tokens, including both input and output tokens, generated by the agents/environment as an indicator of the cost.

3.3 Problem Pool of AIOpsLab Benchmark

Currently, the AIOpsLab benchmark consists of 48 problems in its problem pool. With six agents, we evaluate a total of 288 cases. Table 2 lists the faults used to instantiate the problems. As shown in Table 2, all functional faults (Faults 1–7) are used to create problems at all four task levels, while the symptomatic faults (Faults 8–9) can only be used to create problems at the detection and localization levels (Level 1 and Level 2). In the detection-level task, the agents must identify the presence of faults in real time. This task is a binary classification, where the agents have to respond "yes" if a fault is present or "no" otherwise. The detection task (Level 1) can be made more complex, e.g., by asking the agents to label the abnormal telemetry data; however, we keep it simple here and leave the complex tasks to other levels. The localization (Level 2) task asks the agents to specify the exact location of the fault, usually a service or pod name in Kubernetes. The RCA task (Level 3) requires the agents to identify (1) the system layer the fault affects and (2) the type of the fault, e.g., misconfiguration or operation error. The mitigation task (Level 4) requires the agents to interact with the environment to fix the fault with a series of actions, such as updating the configuration or rolling back to a previous version.
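To make the expected answer formats concrete, here is a hedged sketch of how a detection answer and a ranked localization answer could be scored; the function names and exact grading rules are assumptions and may differ from AIOpsLab's Evaluator.

    # Hypothetical scoring helpers for the detection and localization answer
    # formats described above; the actual Evaluator logic may differ.
    def score_detection(answer: str, fault_present: bool) -> bool:
        # Binary classification: the agent answers "yes" or "no".
        return (answer.strip().lower() == "yes") == fault_present

    def score_localization(ranked_answer, faulty_services, k=3) -> bool:
        # Accuracy@k: count the problem as solved if any ground-truth faulty
        # service appears within the agent's top-k candidates.
        return any(svc in ranked_answer[:k] for svc in faulty_services)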
Most faults enable users to extend and create new problems easily by injecting the fault into other targets, such as services. For example, Fault 2 in AIOpsLab can be injected into 10 services by simply configuring the injection target. We select the "user-service", "text-service", and "post-storage-service" from SocialNetwork as injection targets. Injecting faults into different targets is crucial because each service may have distinct dependencies, resulting in varied fault "blast radius" or failure propagation topologies. Consequently, faults can manifest at different locations within the microservice architecture to help evaluate the ability of the AIOps agents, since different locations may indicate distinct difficulties. Applying some faults to construct problems may require additional effort. For example, Fault 3 and Fault 4 require the users to not only prepare the scripts to trigger the admin privilege revocation or user unregistration during testing, but also update the config map of the application in Kubernetes; and Fault 1 needs to enforce its TLS requirements through a Helm configuration update. Furthermore, some faults are designed for specific problems and are not readily adaptable, such as Fault 5, which involves an application-level code bug in the microservice's image.

3.4 Performance Results

Table 3. Overall performance of the agents.

Agent           | LoC | Time (s) | # Steps | Tokens    | Acc.
GPT-4-W-SHELL   | 41  | 28.61    | 6.44    | 6,394.5   | 49.15%
GPT-3.5-W-SHELL | 41  | 12.44    | 14.70   | 2,557.95  | 15.25%
ReAct           | 49  | 43.79    | 11.50   | 16,941.46 | 55.93%
FLASH           | 60  | 99.64    | 8.48    | 6,484.25  | 59.32%

The overall performance of the agents is summarized in Table 3, with task-specific results in Table 4. As illustrated in Table 3, FLASH achieves the highest accuracy among all agents. Although GPT-3.5-TURBO completes the tasks the fastest, it has the lowest accuracy at 15.25%.

The detection task, being a binary-choice question, should be the simplest task and the first step an AIOps agent performs. However, as shown in Table 4(a), only FLASH answers all the detection problems correctly. For the localization task, agents are allowed to come up with a list of potential faulty services as their answer (since there could be multiple faults happening in the system at the same time). To evaluate their accuracy, we consider both the top-1 and top-3 answers. In Table 4(b), ReAct performs best when evaluated using the top-3 answers, but its accuracy drops when considering the top 1. The RCA and mitigation tasks prove to be the most challenging for the agents. GPT-3.5-W-SHELL fails to recover any failure in its mitigation attempts.

Problem difficulty differs across task levels. Despite showing promise in addressing realistic operational tasks, none of the agents consistently achieve high problem-solving accuracy across the four task categories in the AIOpsLab benchmark. Even the top-performing agents, such as FLASH, exhibit limitations, particularly when tackling more complex tasks like mitigation. In Section 3.6, we explore in detail the failure modes and challenges contributing to these performance limitations of agents, and opportunities for improvement.

3.5 Influence of the Step Limit

[Figure 5: accuracy (0.0 to 0.4) as a function of the step limit K, for K in {3, 5, 10, 15, 20}.]
Figure 5. Agent performance vs. number of steps taken.

We examine the impact of the maximum number of allowed steps on the agents' performance, with the results shown in Figure 5. The step limit significantly affects the performance of certain agents. For instance, ReAct and FLASH show improved accuracy with more steps, with FLASH reaching the highest accuracy of 59.32% when the step limit is set to 20. However, for GPT-3.5-TURBO, increasing the step limit beyond 5 does not yield better performance but merely increases the token consumption. Notably, the plateauing of accuracy after a certain number of steps indicates that self-repair with environment feedback can saturate quickly for AIOps problems. In contrast, in development tasks (Dev), such as code generation, feedback from various compositional tools such as linters, type checkers, and test cases helps agents continuously improve. This suggests the need for (1) better task decomposition for AIOps problems using planning, (2) improved feedback mechanisms for intermediate steps, and (3) solutions that go beyond environment feedback and self-repair.

3.6 Agent Behavior: The Good, the Bad and the Gaps

We now delve into the behaviors of the agents and analyze the good, the challenges, and the opportunities for improvement. In Table 4, we see that all agents perform better than the traditional non-LLM AIOps methods on the detection and localization problems. Figure 6 shows the telemetry API usage patterns among agents. The get_logs API is the most frequently used API across all agents, followed by the get_metrics and the get_traces APIs. However, agents also diverge in their patterns of API usage. For example, FLASH does not use the get_traces API at all. We present the occurrences of other system commands for each agent in Table 5. We next discuss the underlying reasons and patterns contributing to the agents' poor performance.
Table 4. Agent performance by task. This table summarizes the performance of different agents across various tasks, including detection, localization, RCA, and mitigation. Acc. stands for accuracy. Input/Output represents the number of tokens given to and produced by the agent, respectively.

(a) Detection Task

Agent           | Accuracy | Time (s) | # Steps | Input     | Output
GPT-4-W-SHELL   | 69.23%   | 7.08     | 3.85    | 5,492     | 132
GPT-3.5-W-SHELL | 23.07%   | 11.05    | 13.60   | 1,940.44  | 385.56
ReAct           | 76.92%   | 39.00    | 11.46   | 15,608.08 | 933.15
FLASH           | 100%     | 78.27    | 6.77    | 12,869.08 | 125.69
MKSMC           | 15.38%   | 1.00     | N/A     | N/A       | N/A

(b) Localization Task

Agent           | Acc.@3 | Acc.@1  | Time (s) | # Steps | Input    | Output
GPT-4-W-SHELL   | 61.54% | 61.54%  | 7.04     | 4.23    | 4,588.07 | 133.23
GPT-3.5-W-SHELL | 30.77% | 30.77%  | 6.26     | 11.92   | 1,784.23 | 217.08
ReAct           | 69.23% | 53.85%↓ | 38.65    | 11.08   | 4,760.77 | 880.92
FLASH           | 61.54% | 46.15%↓ | 56.60    | 5.77    | 1,875.08 | 123.31
PDiagnose       | 15.38% | 15.38%  | 1.02     | N/A     | N/A      | N/A
RMLAD           | 7.69%  | 7.69%   | 1.98     | N/A     | N/A      | N/A
[Figure 6: bar chart of the total percentage of actions taken per agent (x-axis: Agents (ReAct, FLASH); y-axis: percentage of actions).]
Figure 6. Total percentage of actions taken by different agents.

Figure 7. Action distribution by success and failure cases.

3.6.1 Wasting steps on unnecessary actions

We observe that agents often waste steps on unnecessary actions, such as repeatedly calling the same API, generating non-existent APIs, or spending excessive steps in multi-agent communication. Specifically, the GPT-3.5-W-SHELL agent often generates incorrect API commands in loops, leading to repeated errors in execution. For instance, setting speaker_selection_method to round_robin allows every agent to speak in every step, but this often prevents decisive, efficient decisions, as agents repeatedly resort to telemetry APIs for more information. Even with the speaker_selection_method set to auto, where the next speaker is automatically chosen, a selected agent always speaks ten times in a step without communication (with a maximum of ten communication rounds per step).

3.6.2 Overloaded information when consuming data

To dig deeper into the agent failure modes, we analyze the correlation between the agents' actions and the success or failure of problem-solving, as well as the distribution of actions across steps. In Figure 7, we present the distribution of actions for both successful and failed cases. Agents tend to use the get_metrics and get_traces APIs sparingly in successfully resolved problems, typically only when necessary. This is understandable, as metrics data, e.g., CPU and memory usage, have numerous values that are hard to interpret directly, and trace data are descriptive records of the system's dependencies, which are more comprehensible when visualized. However, agents may subsequently consume these data with a cat command directly, which can overwhelm the model's input context window and cause distraction and more tokens to be consumed. Consequently, using these telemetry APIs without careful consideration or analysis can add more noise to the agents' reasoning, possibly leading to token exhaustion. We expect more refined telemetry data processing and filtering mechanisms to be implemented in the agents to avoid this issue in the future.
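As one example of the filtering called for above, the sketch below clips raw telemetry before it reaches the model's context window; the truncation policy and helper name are assumptions, not part of AIOpsLab.

    # Hypothetical guard against dumping full telemetry files (e.g., via `cat`)
    # into the prompt; limits are arbitrary and would need tuning in practice.
    def clip_telemetry(text: str, max_lines: int = 200, max_chars: int = 8000) -> str:
        lines = text.splitlines()
        total = len(lines)
        if total > max_lines:
            lines = lines[:max_lines] + [f"... [{total - max_lines} lines truncated]"]
        return "\n".join(lines)[:max_chars]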
3.6.3 Invalid API usage

We notice that agents can struggle with improper formatting of API calls. For instance, GPT-3.5-W-SHELL consistently generates incorrect command formats (though the API name is correct), such as malformed parameters, and repeats the error in subsequent steps. In many cases, GPT-3.5-W-SHELL repeatedly responds with: "I apologize for the error. Here is the API call again:" followed by the same or a similar mistake. This issue is observed as many as 14 times within a 20-step problem-solving case.

Similar issues with invalid API usage are also observed in other agents. For instance, the ReAct agent occasionally generates incorrect API commands, but typically recovers by reasoning through the errors and self-correcting its commands. In the following example, ReAct uses an incorrect parameter for the get_logs API, but subsequently corrects it by checking the existing services in its next step.

    ReAct: get_logs("test-social-network", "Social Network")
    AIOpsLab: Error: Your service/namespace does not exist.
    ReAct:
    Thought: I should check the services deployed within the test-social-network
    namespace to ensure I use the correct service name.
    Action: I will list the services in the test-social-network namespace to confirm
    the correct name of the service.
    exec_shell("kubectl get services -n test-social-network")
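A natural guard against such malformed calls is to validate an action against the documented ACI before executing it; the sketch below assumes actions arrive as strings like get_logs("ns", "service"), and both the allow-list and the parser are illustrative assumptions rather than AIOpsLab's implementation.

    import re

    # Hypothetical allow-list of ACI calls and their expected argument counts.
    VALID_APIS = {"get_logs": 2, "get_metrics": 2, "get_traces": 2, "exec_shell": 1}

    def validate_action(action: str):
        """Return ((api, args), None) if the call looks well formed, else (None, error)."""
        match = re.match(r"^(\w+)\((.*)\)$", action.strip())
        if not match:
            return None, "Action is not a recognizable API call."
        name, arg_str = match.group(1), match.group(2)
        if name not in VALID_APIS:
            return None, f"Unknown API '{name}'; valid APIs: {sorted(VALID_APIS)}."
        args = [a.strip() for a in arg_str.split(",") if a.strip()]  # naive split; sketch only
        if len(args) != VALID_APIS[name]:
            return None, f"{name} expects {VALID_APIS[name]} argument(s), got {len(args)}."
        return (name, args), None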
3.6.4 False positive detection issues

To further evaluate the agents' performance, we set up two detection problems for the two microservice applications where no faults exist, referred to as no-operation (Fault 10, Noop in Table 2) problems. Only GPT-4-W-SHELL correctly identifies these cases as normal system execution, while the others report false positives, misinterpreting normal activities (e.g., standard workload generation) as faults.

4 DISCUSSION

AIOpsLab helps engineers to easily create customized incident scenarios for evaluating agents. By providing Agent-Cloud Interfaces (ACIs) as guard-rails, AIOpsLab ensures that agents are tested within a controlled environment, allowing users to focus on designing scenarios that accurately represent incidents in their systems and defining the specific problems their agents should solve.

AIOpsLab is also adaptable to other fault types. For example, an anomaly detection workload scenario can be introduced for detection tasks. Further, users can create problems where agents are required to label the workload or telemetry data to identify anomalies.

When implementing problem evaluators, fine-grained evaluation oracles, or AIOpsLab's optional LLM-as-Judge, may be necessary. For instance, in the binary-choice detection task, agents may answer correctly but provide incorrect interpretations or reasoning. In one case, an agent claimed to detect an abnormal system behavior, but its explanation referenced a workload that was, in fact, normal and unrelated to the injected fault. Leveraging AIOpsLab's LLM-as-Judge can help address this issue by comparing the LLM reasoning chains with the problem description (including the fault, workload, and environment setup).

5 RELATED WORK

AgentOps. Recent advancements in cloud management have increasingly incorporated LLMs to enhance operational tasks. Approaches such as fine-tuned GPT models (Ahmed et al., 2023), RCACopilot (Chen et al., 2024), RCAgent (Wang et al., 2023), MonitorAssistant (Yu et al., 2024a), and Xpert (Jiang et al., 2024) illustrate the effectiveness of LLMs in monitoring and analyzing complex system behaviors. However, beyond the lack of publicly available implementations and associated private datasets, there is a notable gap: the absence of a unified benchmark capable of providing realistic evaluation scenarios to assess agents' performance across operational tasks.

AIOps benchmarks. Existing AIOps benchmarks primarily rely on static or text-based datasets, such as system metrics (Han et al., 2022; Jacob et al., 2020), typically time series data, or a fixed question-answer format (Liu et al., 2023). These benchmarks, together with general language model benchmarks (Hendrycks et al., 2021b;a; Liang et al., 2023; Lee et al., 2024; BIG-bench authors, 2023; Huang et al., 2023), do not simulate the dynamic and complex cloud environments, not to mention allowing agents to interact with them to solve operational tasks.

6 CONCLUSION

In this paper, we unravel the requirements and challenges for a comprehensive framework that supports the design, development, and evaluation of autonomous AIOps agents. We develop a framework, AIOpsLab, which combines a fault injector, workload generator, cloud-agent orchestrator, and telemetry observer to simulate cloud incidents and provide an agent-cloud interface for orchestrating and evaluating AIOps agents. We leverage AIOpsLab to construct a benchmark suite with 48 problems and evaluate four agents to demonstrate the application of AIOpsLab in evaluating LLM-based agents across different types of AIOps tasks.

REFERENCES

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE'23).
Ramnatthan Alagappan, Aishwarya Ganesan, Yuvraj Patel, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. Correlated Crash Vulnerabilities. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16).

Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, and Samer Al-Kiswany. 2018. An Analysis of Network-Partitioning Failures in Cloud Systems. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18).

Radu Banabic and George Candea. 2012. Fast Black-Box Testing of System Recovery Code. In Proceedings of the 7th European Conference on Computer Systems (EuroSys'12).

BIG-bench authors. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023).

Marco Canini, Daniele Venzano, Peter Perešíni, Dejan Kostić, and Jennifer Rexford. 2012. A NICE Way to Test OpenFlow Applications. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12).

Uzay Çetin and Mursel Tasgin. 2020. Anomaly detection with multivariate K-sigma score using Monte Carlo. In 2020 5th International Conference on Computer Science and Engineering.

ChaosBlade Team. 2019. ChaosBlade. https://github.com/chaosblade-io/chaosblade. Accessed: 2024-07-08.

ChaosMesh Authors. 2022. ChaosMesh. https://chaos-mesh.org/. Accessed: 2024-07-08.

Haicheng Chen, Wensheng Dou, Dong Wang, and Feng Qin. 2020. CoFI: Consistency-Guided Fault Injection for Cloud Systems. In Proceedings of the 35th ACM/IEEE International Conference on Automated Software Engineering (ASE'20).

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).

Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, Jun Zeng, Supriyo Ghosh, Xuchao Zhang, Chaoyun Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Tianyin Xu. 2024. Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. In Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys'24).

Maria Christakis, Patrick Emmisberger, Patrice Godefroid, and Peter Müller. 2017. A General Framework for Dynamic Stub Injection. In Proceedings of the 39th International Conference on Software Engineering (ICSE'17).

Yuanshun Dai, Yanping Xiang, and Gewei Zhang. 2009. Self-healing and Hybrid Diagnosis in Cloud Computing. In Cloud Computing, Martin Gilje Jaatun, Gansen Zhao, and Chunming Rong (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg.

Elasticsearch. 2024a. Centralize, transform & stash your data. https://www.elastic.co/logstash.

Elasticsearch. 2024b. Lightweight shipper for logs. https://www.elastic.co/beats/filebeat.

Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, et al. 2019. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'19).

Vaibhav Ganatra, Anjaly Parayil, Supriyo Ghosh, Yu Kang, Minghua Ma, Chetan Bansal, Suman Nath, and Jonathan Mace. 2023. Detection Is Better Than Cure: A Cloud Incidents Perspective. In Proceedings of the 31st Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE'23).

Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. 2011. Fate and Destini: A Framework for Cloud Recovery Testing. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI'11).

Songqiao Han, Xiyang Hu, Hailiang Huang, Minqi Jiang, and Yue Zhao. 2022. ADBench: Anomaly Detection Benchmark. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Shilin He, Botao Feng, Liqun Li, Xu Zhang, Yu Kang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. 2023. STEAM: Observability-Preserving Trace Sampling. Association for Computing Machinery.
Shilin He, Xu Zhang, Pinjia He, Yong Xu, Liqun Li, Yu Kang, Minghua Ma, Yining Wei, Yingnong Dang, Saravanakumar Rajmohan, et al. 2022. An empirical study of log analysis at Microsoft. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE).

Helm. 2024. Helm: The package manager for Kubernetes.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021a. Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (ICLR'21) (2021).

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR'21) (2021).

Victor Heorhiadi, Shriram Rajagopalan, Hani Jamjoom, Michael K Reiter, and Vyas Sekar. 2016. Gremlin: Systematic Resilience Testing of Microservices. In Proceedings of the IEEE 36th International Conference on Distributed Computing Systems (ICDCS'16).

Chuanjia Hou, Tong Jia, Yifan Wu, Ying Li, and Jing Han. 2021. Diagnosing performance issues in microservices with heterogeneous data source. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom).

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. 2023. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. In Advances in Neural Information Processing Systems (NeurIPS'23).

Vincent Jacob, Fei Song, Arnaud Stiegler, Yanlei Diao, and Nesime Tatbul. 2020. AnomalyBench: An Open Benchmark for Explainable Anomaly Detection. CoRR (2020).

Jaeger Authors. 2024. Jaeger: Open source, end-to-end distributed tracing. https://www.jaegertracing.io.

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024a. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974 (2024).

Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. 2024b. R2E: Turning any Github Repository into a Programming Agent Environment. In Forty-first International Conference on Machine Learning (ICML'24).

Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. 2024. Xpert: Empowering Incident Management with Query Recommendations via Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE'24).

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations (ICLR'24).

Xiaoen Ju, Livio Soares, Kang G. Shin, Kyung Dong Ryu, and Dilma Da Silva. 2013. On Fault Resilience of OpenStack. In Proceedings of the 12th ACM Symposium on Cloud Computing (SOCC'13).

Kyle Kingsbury. 2022. Jepsen. https://jepsen.io/.

Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Teufel, Marco Bellagente, et al. 2024. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems (NeurIPS'24) (2024).

Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. 2014. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14).

Wenrui Li, Pengcheng Zhang, and Zhongxue Yang. 2012. A Framework for Self-Healing Service Compositions in Cloud Computing Environments. In 2012 IEEE 19th International Conference on Web Services (ICWS'12).

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi,
Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. Holistic Evaluation of Language Models.

Yuhe Liu, Changhua Pei, Longlong Xu, Bohan Chen, Mingze Sun, Zhirui Zhang, Yongqian Sun, Shenglin Zhang, Kun Wang, Haiming Zhang, et al. 2023. OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models. arXiv preprint arXiv:2310.07637 (2023).

Jie Lu, Chen Liu, Lian Li, Xiaobing Feng, Feng Tan, Jun Yang, and Liang You. 2019. CrashTuner: Detecting Crash-Recovery Bugs in Cloud Systems via Meta-Info Analysis. In Proceedings of the 26th ACM Symposium on Operating System Principles (SOSP'19).

Minghua Ma, Shenglin Zhang, Dan Pei, Xin Huang, and Hongwei Dai. 2018. Robust and rapid adaption for concept drift in software system anomaly detection. In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 13–24.

Rupak Majumdar and Filip Niksic. 2018. Why is Random Testing Effective for Partition Tolerance Bugs? In Proceedings of the 45th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL'18).

Paul D Marinescu and George Candea. 2009. LFI: A Practical and General Library-Level Fault Injector. In Proceedings of the 39th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'09).

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842 (2023).

Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, and Vijay Chidambaram. 2018. Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18).

Netflix. 2011. ChaosMonkey. https://github.com/Netflix/chaosmonkey. Accessed: 2024-07-08.

Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Samer Al Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14).

Prometheus Authors. 2024. Prometheus. https://prometheus.io.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems (NeurIPS'24) (2024).

Jesper Simonsson, Long Zhang, Brice Morin, Benoit Baudry, and Martin Monperrus. 2021. Observability and chaos engineering on system calls for containerized applications in Docker. Future Generation Computer Systems (2021).

Gagan Somashekar, Anurag Dutt, Mainak Adak, Tania Lorido Botran, and Anshul Gandhi. 2024. GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications. In Proceedings of the ACM Web Conference 2024.

Akshitha Sriraman and Thomas F Wenisch. 2018. µSuite: a benchmark suite for microservices. In 2018 IEEE International Symposium on Workload Characterization.

Xudong Sun, Wenqing Luo, Jiawei Tyler Gu, Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, Lalith Suresh, and Tianyin Xu. 2022. Automatic Reliability Testing for Cluster Management Controllers. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22).

Yongqian Sun, Binpeng Shi, Mingyu Mao, Minghua Ma, Sibo Xia, Shenglin Zhang, and Dan Pei. 2024. ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering.

Lingzhi Wang, Nengwen Zhao, Junjie Chen, Pinnong Li, Wenchi Zhang, and Kaixin Sui. 2020. Root-cause metric location for microservice systems via log anomaly detection. In 2020 IEEE International Conference on Web Services (ICWS'20).

Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Lunting Fan, Lingfei Wu, and Qingsong Wen. 2023. RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models. arXiv preprint arXiv:2310.16340 (2023).

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022a. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022).
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems (NeurIPS'22) (2022).

Sean Wolfe. 2018. Amazon's one hour of downtime on Prime Day may have cost it up to $100 million in lost sales. (2018). https://www.businessinsider.com/amazon-prime-day-website-issues-cost-it-millions-in-lost-sales-2018-7

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR'23).

Zhaoyang Yu, Minghua Ma, Chaoyun Zhang, Si Qin, Yu Kang, Chetan Bansal, Saravan Rajmohan, Yingnong Dang, Changhua Pei, Dan Pei, Qingwei Lin, and Dongmei Zhang. 2024a. MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE'24).

Zhaoyang Yu, Changhua Pei, Xin Wang, Minghua Ma, Chetan Bansal, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang, Xidao Wen, Jianhui Li, et al. 2024b. Pre-trained KPI Anomaly Detection Model Through Disentangled Transformer. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. 2024. The Shift from Models to Compound AI Systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/.

Zhang et al. 2024a. Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering.

Xuchao Zhang, Tanish Mittal, Chetan Bansal, Rujia Wang, Minghua Ma, Zhixin Ren, Hao Huang, and Saravan Rajmohan. 2024b. FLASH: A Workflow Automation Agent for Diagnosing Recurring Incidents. (2024).

Chenyu Zhao, Minghua Ma, Zhenyu Zhong, Shenglin Zhang, Zhiyuan Tan, Xiao Xiong, LuLu Yu, Jiayi Feng, Yongqian Sun, Yuzhi Zhang, et al. 2023. Robust Multimodal Failure Detection for Microservice Systems. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS'24) (2024).

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854 (2023).

Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2021. Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. IEEE Transactions on Software Engineering (2021).