
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

Yinfang Chen¹ Manish Shetty² Gagan Somashekar³ Minghua Ma³ Yogesh Simmhan⁴ Jonathan Mace³ Chetan Bansal³ Rujia Wang³ Saravan Rajmohan³

¹UIUC, Champaign, USA  ²UC Berkeley, Berkeley, USA  ³Microsoft, Redmond, USA  ⁴IISc, Bengaluru, India. Correspondence to: Minghua Ma <minghuama@microsoft.com>.

Abstract
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOpsLab, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOpsLab can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOpsLab, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.

1 Introduction

The rapid evolution of IT applications and services has led enterprises to increasingly depend on hyper-scale, cloud-based systems. These systems are often distributed, employing architectures such as microservices and serverless computing, enabling scalability but also adding complexity and introducing new operational challenges. In such cloud environments, issues can cascade into large-scale outages. For instance, an Amazon outage can result in losses of $100 million in just one hour (Wolfe, 2018).

To address the challenges of managing incidents in such complex infrastructures, there is a movement towards the adoption of AIOps (Artificial Intelligence for IT Operations) within the context of DevOps (Development and Operations). The ultimate goal of AIOps is to create autonomous self-healing clouds, where AI-driven approaches can detect, localize, and mitigate faults with minimal human intervention. Although such a concept has existed for over a decade (Li et al., 2012; Dai et al., 2009), recent advancements in AIOps and Large Language Model (LLM) agents have brought this vision closer to reality (Zhao et al., 2023; He et al., 2022; Ma et al., 2018; Zhang et al., 2018; Ganatra et al., 2023; Somashekar et al., 2024; Zhang et al., 2024a; Chen et al., 2024). LLM agents (Mialon et al., 2023; Schick et al., 2024) integrate external tools to dynamically interact with their environment (Wei et al., 2022a), enabling them to autonomously manage the entire incident lifecycle, as shown in Figure 1.

To realize this autonomous self-healing cloud vision, we propose a new paradigm called AgentOps (Agent for Operations). In this paradigm, agentic approaches are not limited to isolated operational tasks but are capable of seamlessly managing multiple, cross-layer tasks across the entire operational stack. AgentOps represents an evolution where autonomous agents can make real-time decisions and take end-to-end actions to ensure system reliability. This aligns with recent advancements in AI, as highlighted by a post:

"State-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models ... compound AI systems will likely be the best way to maximize AI results in the future" – The Shift from Models to Compound AI Systems (Zaharia et al., 2024)

AI-driven tools and benchmarks like WebArena (Zhou et al., 2023), R2E (Jain et al., 2024b), HumanEval (Chen et al., 2021), LiveCodeBench (Jain et al., 2024a), and SWE-bench (Jimenez et al., 2024) have significantly advanced the "Dev" side of DevOps by accelerating software development.

However, progress in AI for "Ops", particularly AgentOps, remains limited, due to the lack of high-quality benchmarks for diverse, realistic scenarios. Addressing this gap requires a framework that aids the design, development, and evaluation of AIOps agents within an interactive environment, a key contribution of this paper.

Challenges and contributions. Building a holistic benchmark framework that allows agents to interact dynamically with the cloud poses several challenges. The first challenge is to manage an evaluation flow that is generally applicable to diverse agents and clouds, powerful enough to evaluate agents on complex and realistic operational tasks, and valuable enough to provide rich feedback and observability, together with the extensibility to accommodate new tasks and agents from users. While existing tools address individual components of AIOps evaluation, such as observability (He et al., 2023; Simonsson et al., 2021), application suites (Gan et al., 2019; Zhou et al., 2021; Sriraman and Wenisch, 2018), and chaos engineering (Netflix, 2011; ChaosBlade Team, 2019; ChaosMesh Authors, 2022), they lack the integration necessary to support a unified AIOps evaluation.

We present AIOpsLab, a holistic framework that can automatically manage the entire end-to-end evaluation process for AIOps solutions. This involves deploying services, injecting faults, generating workloads, orchestrating the agent-cloud interaction, and analyzing results. Specifically, AIOpsLab features the Agent-Cloud Interface (ACI), a unified interface that enables agents to interact with the cloud. The ACI allows agents to communicate, take actions, and receive feedback, orchestrating these interactions to detect and resolve issues in dynamic and interactive environments.

Moreover, a common challenge in operations benchmarks is the lack of realistic evaluation scenarios: existing approaches often rely on static datasets, such as system metrics (Han et al., 2022; Jacob et al., 2020) that are typically time-series data, or on a fixed question-answer format (Liu et al., 2023). Such setups do not capture the dynamic, unpredictable, and evolving nature of real-world cloud environments, where workloads and incidents fluctuate over time. To make matters worse, recent efforts on AgentOps (Wang et al., 2023; Zhang et al., 2024a) use proprietary services and datasets. Furthermore, existing AIOps approaches and their benchmarks often focus only on isolated aspects of the incident lifecycle, such as anomaly detection (Yu et al., 2024b) or fault localization (Sun et al., 2024). This leaves no cohesive framework to evaluate AIOps agents comprehensively, and it limits support for decision-making that could assist in chaining algorithms or selecting the most suitable agent for a given operational scenario.

To address these limitations, we designed a set of evaluation scenarios, referred to as problems, which replicate realistic incidents within the microservice system. AIOpsLab's problem pool is structured around a task-level taxonomy that categorizes the tasks of different problems across the incident management lifecycle. Our approach ensures that evaluation scenarios go beyond simple performance or crash failures (which cannot be further analyzed or mitigated by the agents), incorporating fine-grained root causes to fully assess the diagnostic and mitigation abilities of AIOps agents.

Implementation. We developed AIOpsLab, an innovative framework for building AgentOps benchmarks to evaluate LLM-based AIOps agents. AIOpsLab utilizes two microservice applications from DeathStarBench (Gan et al., 2019) as testbeds, along with their workload generators. An extensible fault library, integrated with ChaosMesh (ChaosMesh Authors, 2022), enables diverse fault injections into the system. A telemetry observer, incorporating Prometheus (Prometheus Authors, 2024) for metrics, Jaeger (Jaeger Authors, 2024) for tracing, and Filebeat (Elasticsearch, 2024b) and Logstash (Elasticsearch, 2024a) for logging, supports on-disk storage of telemetry data, facilitating evaluations of both traditional AIOps algorithms and agentic solutions. We also integrate Helm and Kubernetes APIs into AIOpsLab's orchestrator implementation.

To demonstrate the application of our framework in evaluating LLM-based agents, we use AIOpsLab to create 48 problems as evaluation scenarios covering different types of AIOps tasks, and register four agents of different types on those problems. Agent registration is lightweight, requiring less than a hundred lines of code. Our evaluation process reveals distinct challenges agents face across tasks.

Summary. This paper makes the following contributions:

• We unravel the requirements and challenges of achieving a holistic framework that supports the design, development, and evaluation of autonomous AIOps agents;
• We develop a framework, AIOpsLab, which can not only deploy microservice cloud environments, inject faults, generate workloads, and export telemetry data but also orchestrate these components and provide agent-cloud interfaces for interacting with and evaluating agents;
• We leverage the AIOpsLab framework to construct a benchmark suite with 48 problems across different AIOps tasks in an interactive environment and evaluate four LLM-based agents;
• We provide a detailed analysis of the agents' performance and limitations by evaluating them on AIOpsLab;
• We will make AIOpsLab publicly available.¹

¹The link will be provided.

Figure 1. Microservice incident and its management lifecycle.

2 AIOpsLab

In this section, we discuss the design and implementation of AIOpsLab and its components, as illustrated in Figure 2.

2.1 Problem Definition

To support a wide range of evaluation scenarios (referred to as problems), which replicate realistic incidents within the microservice system, we first formalize an AIOps problem P as a tuple: P = ⟨T, C, S⟩, where T represents a task, C represents a context, and S represents the expected solution (oracle). The task T defines the specific AIOps operation to be performed, categorized into four types: detection, localization, (root cause) analysis, and mitigation. We define these tasks in Table 1. Each task type is associated with success criteria and evaluation metrics. For instance, the detection task employs Time-to-Detect (TTD) to measure the time taken to detect a fault.

The context C can be further formalized as a tuple: C = ⟨E, I⟩, where E is the operational environment in which the problem occurs, and I is the problem information used to describe the problem to the agent. The operational environment includes the cloud service, the fault model, and the workload model used to generate the problem, which is not shared with the agent. The problem information comprises information such as service descriptions, task descriptions, and documentation about available APIs, which is directly shared with the agent. It also subsumes indirect information (including logs, metrics, and traces observed in the operational environment) that is queryable by the agent at runtime. Finally, S is the expected outcome of the task, which is used to evaluate the agent's performance. The solution is typically problem- and task-specific and is carefully designed for evaluation. Note that some problems, e.g., mitigation tasks, can be solved in multiple ways. In such cases, AIOpsLab evaluates the general state of the entire system after the problem is resolved, e.g., checking whether all of the services are up and running, rather than solely the targeted resource where the fault was injected, because other services or resources may have been inadvertently affected during the mitigation process.

Example 2.1. Consider the problem of localizing a Kubernetes target port misconfiguration in a social network application. AIOpsLab makes it easy to define this problem in just a few lines by extending the LocalizationTask interface.

    from aiopslab import LocalizationTask, SocialNetwork
    from aiopslab import Wrk, VirtFaultInjector

    class K8STargetPortMisconf(LocalizationTask):
        def __init__(self):
            self.app = SocialNetwork()
            self.ans = "user-service"

        def start_workload(self):
            wrk = Wrk(rate=100, duration=10)
            wrk.start_workload(url=self.app.frontend_url)

        def inject_fault(self):
            inj = VirtFaultInjector(self.app.ns)
            inj.inject([self.ans], "misconfig_k8s")

        def eval(self, soln, trace, duration):
            res = {}
            res["TTL"] = duration
            res["success"] = is_exact_match(soln, self.ans)
            return res

Here, the task T is fault localization, and the solution S is the microservice named "user-service", which is also the fault injection target. The context C includes the social network application, a misconfiguration fault from AIOpsLab's fault library, and a standard workload using the wrk tool. AIOpsLab provides several such interfaces for all AIOps tasks (Section 2.4.1) and allows users to add new problems by extending them. Once problems are defined, AIOpsLab can instantiate them and allow agents to interact with them using an Orchestrator, which we describe next.
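To relate Example 2.1 back to the formalization, T is localization, E is the SocialNetwork service with the misconfig_k8s fault and the wrk workload, I is the task description and API documentation shared with the agent, and S is "user-service". The small encoding below is our own illustration, not an AIOpsLab type.

    # Illustrative encoding of P = <T, C, S>; this dataclass is our own
    # sketch, not part of AIOpsLab's API.
    from dataclasses import dataclass

    @dataclass
    class Problem:
        task: str          # T: detection | localization | analysis | mitigation
        environment: dict  # E: service, fault model, workload model (hidden from agent)
        information: str   # I: descriptions and API docs shared with the agent
        solution: object   # S: expected outcome (oracle)

    p = Problem(
        task="localization",
        environment={"service": "SocialNetwork", "fault": "misconfig_k8s",
                     "workload": "wrk(rate=100, duration=10)"},
        information="Localize the faulty microservice; APIs: get_logs, ...",
        solution="user-service",
    )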

Figure 2. Overview of AIOpsLab. The Orchestrator coordinates interactions between various system components and serves as the Agent-Cloud Interface (ACI). Agents engage with the Orchestrator to solve tasks, receiving a problem description, instructions, and relevant APIs. The Orchestrator generates diverse problems using the Workload and Fault Generators, injecting these into applications it can deploy. The deployed service has observability, providing telemetry such as metrics, traces, and logs. Agents act via the Orchestrator, which executes their actions and updates the service's state. The Orchestrator evaluates the final solution using predefined metrics for the task.

2.2 Orchestrator

The Orchestrator is a well-defined central component that strictly enforces the separation of concerns between the agent and the service. It provides a robust set of interfaces that allow seamless integration and extension of various system components.

2.2.1 Agent-Cloud Interface

A key responsibility of the Orchestrator is to provide a well-defined interface for the agent to interact with the cloud environment. Typically, developers operate clouds and services through various programmatic interfaces (e.g., APIs, CLIs) and user interfaces (incident portals, dashboards, etc.). However, existing interfaces to the cloud are not well-designed for LLMs and agents. For instance, humans can reliably ignore irrelevant information, which can prove distracting for agents and hamper performance.

The ACI specifies (1) the set of valid actions available to the agent (steps 7 and 9 in Figure 2), and (2) how the service's state is conveyed back to the agent as the observation of its actions (step 8). In doing so, the ACI abstracts the cloud environment's complexity, simplifying the agent's decision-making process. The ACI is designed to be intuitive and easy to use, with a concise list of APIs, each documented to ensure that agents can make meaningful progress towards their objectives. Some APIs that AIOpsLab provides by default include get_logs (fetch logs), get_metrics (fetch metrics), get_traces (fetch traces), and exec_shell (execute shell commands after applying security policy filters).

Example 2.2. This example illustrates how the ACI is defined in AIOpsLab as APIs that agents can use.

    from datetime import datetime, timedelta

    class TaskActions:
        def get_traces(ns: str, duration: int = 5) -> str:
            """
            Collects trace data of the services from Jaeger.
            Args:
                ns (str): The K8S namespace.
                duration (int): Duration (in minutes) to collect traces.
            Returns:
                str: Path to the directory where traces are saved.
            """
            trace_api = TraceAPI(ns)
            end_t = datetime.now()
            start_t = end_t - timedelta(minutes=duration)
            traces = trace_api.extract_traces(start_t, end_t)
            return trace_api.save_traces(traces)

As shown, the ACI encapsulates complex operations behind simple APIs like get_traces. On initializing a problem, the Orchestrator automatically extracts documentation from these APIs to provide as context C to the agent. At runtime, agents can specify a wide range of actions on the service (e.g., scaling, redeploying, patching) by way of the Orchestrator's privileged access. Finally, the Orchestrator conveys the service's state after each action with high-quality feedback to the agent, including outputs, error messages, and tracebacks.

2.2.2 Session Interface

Another key responsibility of the Orchestrator is to manage the lifecycle of the agent and the service. We implement the Orchestrator as a session-based system, where a Session is created for each instance of an agent solving a problem. Agents are registered with the Orchestrator, and a session starts with simple API calls passing a unique problem identifier (step 1 in Figure 2). AIOpsLab's design is highly flexible and integrates with the growing LLM and agent framework space. Our only requirement is that the agent must implement a get_action method with the following signature: async def get_action(state: str) -> str. It takes the service's state as input from the Orchestrator and returns the next action the agent wants to take. Note that this could be a simple wrapper function around any existing agent framework.

Example 2.3. In this simplified example, we illustrate how an Agent can be onboarded to AIOpsLab.

    import asyncio
    from aiopslab import Orchestrator

    class Agent:
        def __init__(self, prob, instructs, apis):
            self.prompt = self.set_prompt(prob, instructs, apis)
            self.llm = GPT4()

        async def get_action(self, state: str) -> str:
            return self.llm.generate(self.prompt + state)

    # initialize the orchestrator
    orch = Orchestrator()
    pid = "misconfig_app_hotel_res-mitigation-1"
    prob_desc, instructs, apis = orch.init_problem(pid)
    # register and evaluate the agent
    agent = Agent(prob_desc, instructs, apis)
    orch.register_agent(agent, name="myAgent")
    asyncio.run(orch.start_problem(max_steps=10))

As shown, on initializing a problem, the Orchestrator shares the context necessary for the agent to solve the problem. It then polls (via get_action) for the agent's next action.
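To make this contract concrete, the following is a minimal sketch of the polling loop a Session conceptually runs. It is illustrative only: apart from get_action, whose signature is given above, the Orchestrator method names used here (problem_description, is_submission, execute, evaluate) are our own assumptions, not AIOpsLab's actual API.

    # Illustrative sketch of a session loop; method names other than
    # get_action are hypothetical.
    async def run_session(orch, agent, max_steps: int):
        state = orch.problem_description            # initial observation shared with the agent
        for _ in range(max_steps):
            action = await agent.get_action(state)  # poll the agent for its next action
            if orch.is_submission(action):           # agent submits a final solution
                return orch.evaluate(action)
            state = orch.execute(action)             # execute the action, observe new state
        return orch.evaluate(None)                   # step budget exhausted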

2.2.3 Other Interfaces

Problem Initializers. As described in Section 2.1, each problem is defined with a context C, which includes its operational environment. This environment comprises the service, fault, and workload conditions under which the problem occurs. Here, the Orchestrator deploys services using infrastructure-as-code tools like Helm (Helm, 2024) to deploy the required cloud service for each problem. We describe the services already integrated into AIOpsLab in Section 2.3.

As shown in Figure 2, to create realistic benchmark scenarios, the Orchestrator then interfaces with two entities: (1) a workload generator (step 5) and (2) a fault generator (step 3). These generators introduce controlled service disruptions that simulate live benchmark problems. As the workload generator, AIOpsLab currently uses the wrk2 tool (Gan et al., 2019), which supports several workload policies and can also replay industry workloads (step 6). However, AIOpsLab is extensible to other workload generators. For fault generation, AIOpsLab uses a custom fault library that instantiates faults across different levels of the system stack (step 4), such as application and virtualization. The library contains, and extends to, several fine-grained and parametric faults that go beyond surface-level symptoms and engage deeper, more complex resolution strategies. We describe the fault library in detail in Section 2.4.

Problem Evaluators. Finally, the Orchestrator plays a critical role in evaluating the agent's performance on a problem. It compares the agent's solutions against predefined success criteria and evaluation metrics specific to each task (step 10). AIOpsLab supports several default and common metrics for each task (e.g., Time-to-Detect for detection, the number of steps taken, and the tokens produced by an LLM-powered agent and sent to AIOpsLab). Additionally, AIOpsLab provides an optional qualitative evaluation of agent trajectories using LLMs-as-Judges (Zheng et al., 2024). Beyond that, all user-defined evaluation metrics specific to the problem are run. For instance, for the localization problem in Example 2.1, the metric success is defined by the agent's submission matching the faulty microservice's name. Lastly, the Orchestrator maintains comprehensive logs of all agent trajectories, including actions taken and resulting system states, facilitating detailed analysis and debugging. All of the evaluation results are automatically collected.
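To illustrate the evaluator contract, the sketch below adds a user-defined metric to the eval method from Example 2.1. The res keys and the is_exact_match helper follow Example 2.1; the assumption that trace is a list of the agent's steps is our own.

    # A minimal sketch of a customized evaluator, reusing the eval() shape
    # from Example 2.1. The structure of `trace` is an assumption.
    def eval(self, soln, trace, duration):
        res = {}
        res["TTL"] = duration                            # time-to-localize, in seconds
        res["success"] = is_exact_match(soln, self.ans)  # default success criterion
        res["num_actions"] = len(trace)                  # user-defined: actions before submission
        return res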
2.3 Cloud Services

AIOpsLab deploys live microservice applications as cloud environments (step 2 in Figure 2). AIOpsLab is currently integrated with the HotelReservation and SocialNetwork applications from DeathStarBench (Gan et al., 2019). The SocialNetwork application has 28 microservices, including Memcached, MongoDB, and Redis, that together implement several features of real-world social networking applications. The HotelReservation application, implemented with Go and gRPC, supports services like recommending and reserving hotels.

2.4 Task-oriented Fault Library

2.4.1 Task Taxonomy

We present a task-level taxonomy (Table 1) that categorizes the tasks that AIOps agents should accomplish according to the different stages of the incident management lifecycle, with progressively increasing complexity. In Table 1, a higher level indicates a harder and more impactful task for evaluating agents.

Table 1. Task taxonomy for AIOps agent evaluation. The lower the level, the easier the task. AIOpsLab aims to evaluate agents across all task levels with its problems.

    Level  Task (# sub-tasks)             Evaluation Focus
    1      Detection (1)                  Can the approach accurately detect anomalies or deviations?
    2      Localization (1)               Can the approach pinpoint a fault's exact source (e.g., microservice)?
    3      Root Cause Analysis (RCA) (2)  Can the approach determine the underlying cause of the fault?
    4      Mitigation (1)                 Can the approach give effective solutions to recover the environment?

Level 1 focuses on the preliminary identification of unusual behavior within the system, for example, detecting a malfunctioning Kubernetes pod of a microservice. Users can also define more complex tasks or create sub-tasks; the root cause analysis task, for instance, has both system-level and fault-type prediction sub-tasks to be solved.

To instantiate problems across different task levels, we use fault injection to inject faults into the system and construct a problem pool for AIOpsLab. We classify these faults into two main types, symptomatic faults and functional faults, as shown in Figure 3.

Figure 3. Fault categories to instantiate problems in AIOpsLab: symptomatic faults (Section 2.4.2), e.g., network loss and pod failure, and functional faults (Section 2.4.3), e.g., application-level and virtualization-level faults.

2.4.2 Symptomatic Faults

Symptomatic faults, such as performance degradation and crash failures, manifest as observable symptoms, such as increased latency, resource exhaustion, or service outages. These faults typically help construct Level 1 and Level 2 tasks in the taxonomy, which can create problems that evaluate AIOps approaches' detection and localization abilities. These faults provide an overview of potential problems but do not necessarily reveal the deeper, underlying root causes of issues (since they do not have one). AIOpsLab integrates the fault injection tool ChaosMesh (ChaosMesh Authors, 2022) to inject symptomatic faults into microservice applications.

2.4.3 Functional Faults

Though there are many fault injection tools for testing the resilience of cloud systems (Marinescu and Candea, 2009; Banabic and Candea, 2012; Christakis et al., 2017; Zhang and Elbaum, 2012; Kingsbury, 2022; Pillai et al., 2014; Alquraan et al., 2018; Lu et al., 2019; Chen et al., 2020; Leesatapornwongsa et al., 2014; Gunawi et al., 2011; Majumdar and Niksic, 2018; Ju et al., 2013; Heorhiadi et al., 2016; Alagappan et al., 2016; Mohan et al., 2018; Sun et al., 2022; Canini et al., 2012), most of them focus solely on injecting system symptoms. These coarse-grained faults can only disrupt the system without modeling the underlying, fine-grained root causes, e.g., misconfigurations or software bugs, and hence are unable to evaluate the capabilities of AIOps agents to diagnose and mitigate root causes.

The failure scenarios used to evaluate AIOps agents across tasks must go beyond simple performance or crash failures and reflect realistic cases that challenge agents; this is where functional faults come into play. Functional faults require approaches to not only detect (Level 1) and localize (Level 2) the failure but also diagnose the root cause (Level 3), such as an incorrect deployment or operation, and apply the correct mitigation strategies (Level 4). For instance, the fault in Figure 4 revokes the admin authentication for the MongoDB database of the geographic microservice (Mongodb-geo). Since the Geo service relies on its backend database, errors will appear during its invocation.

Figure 4. Revoke authentication fault example. The admin's privilege is revoked during execution: injection happens at the Mongodb-geo service, while the Geo service becomes abnormal and generates error logs ("Error: Not authorized on geo-db to execute command").

Example 2.4. In the following example, we illustrate the structure of the application-level fault injector for a revoke-authentication fault and its usage in AIOpsLab.

    from aiopslab.generators.fault.base import FaultInjector
    from aiopslab.service.apps.hotelres import HotelReservation

    class ApplicationFaultInjector(FaultInjector):
        def inject_revoke_auth(self, microservices: list[str]):
            """Revoke MongoDB admin privileges."""
            ...

        def recover_revoke_auth(self, microservices: list[str]):
            """Recover the revoke admin privileges fault."""
            ...

    # Usage Example
    class MongoDBRevokeAuth:
        def __init__(self):
            self.app = HotelReservation()

        def inject_fault(self):
            injector = ApplicationFaultInjector(self.app.ns)
            injector._inject(["mongodb-geo"], "revoke_auth")

Users can define problems using the existing fault library. For instance, users can specify different faulty services or even construct a task that injects multiple faults into multiple services concurrently. Users can also customize their own faults to generate various problems. AIOpsLab provides the injection function for each of its supported failure scenarios and offers the corresponding mitigation mechanism to recover the system from the erroneous state. In Section 3.3, we discuss the problem pool we currently implement.
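For instance, a multi-fault variant of the problem from Example 2.1 could be sketched as follows. This is illustrative only: the assumption that inject accepts several targets at once, and that the expected solution becomes a list, is our own.

    # Illustrative sketch: the same fault injected into several services
    # concurrently, reusing the interfaces from Examples 2.1 and 2.4.
    class MultiServicePortMisconf(LocalizationTask):
        def __init__(self):
            self.app = SocialNetwork()
            self.ans = ["user-service", "text-service"]  # multiple faulty services

        def inject_fault(self):
            inj = VirtFaultInjector(self.app.ns)
            inj.inject(self.ans, "misconfig_k8s")        # assumed multi-target call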
2.5 Observability

AIOpsLab is equipped with an extensible observability layer to provide comprehensive monitoring capabilities. AIOpsLab collects a wide array of telemetry data via its telemetry collector, including (1) traces from Jaeger (Jaeger Authors, 2024) detailing the end-to-end paths of requests through distributed systems, (2) application logs retrieved by kubectl, or formatted and recorded by Filebeat (Elasticsearch, 2024b) and Logstash (Elasticsearch, 2024a), and (3) system metrics monitored by Prometheus (Prometheus Authors, 2024). AIOpsLab not only supports data collection during the interaction with the LLM agent but can also export the data offline to facilitate evaluating other, traditional AIOps approaches. In addition, AIOpsLab is designed to capture information from other dimensions, e.g., the codebase, configuration, and cluster information. Developers can also design and expose low-level system information (such as syscall logs) to agents using AIOpsLab's interface.

3 Evaluation

This section begins by outlining the evaluation setup and metrics employed within AIOpsLab. We then delve into the selected faults listed in Table 2, which serve as diverse evaluation scenarios within AIOpsLab. Following this, we evaluate the performance of the AIOps agents solving these problems and analyze their cost. We also dig into the reasons behind the performance differences to understand the challenges and potential agent improvements. Note that all of the results are automatically collected and recorded by the problem evaluators (Section 2.2.3).

3.1 Evaluation Setup

We evaluate four LLM-based agents with AIOpsLab. Note that, for a fair comparison, we register the agents in AIOpsLab as-is, without any fine-tuning or modifications.

Table 2. Selected faults used to instantiate the problems for evaluation in AIOpsLab. Ext. stands for extensibility: ● denotes the fault can easily be used to construct other problems; ◐ denotes some manual effort is needed to create new problems; ○ means the fault is specific to some problems and cannot be applied to create other problems. # is the number of problems instantiated.

    No.  Name                   Application        Task Level  Category                    Ext.  #   Problem Description
    1    AuthenticationMissing  HotelReservation   1, 2, 3, 4  Functional, Virtualization  ◐     4   Missing authentication credentials cause access denial to MongoDB.
    2    TargetPortMisconfig    SocialNetwork      1, 2, 3, 4  Functional, Virtualization  ●     12  The service cannot connect to the specified port due to misconfiguration.
    3    RevokeAuth             HotelReservation   1, 2, 3, 4  Functional, Application     ◐     8   Revoked authentication causes database connection failure.
    4    UserUnregistered       HotelReservation   1, 2, 3, 4  Functional, Application     ◐     8   The database service has access failures after the user was unregistered.
    5    BuggyAppImage          HotelReservation   1, 2, 3, 4  Functional, Application     ○     4   Connection code bug in the application image causes access issues.
    6    ScalePod               SocialNetwork      1, 2, 3, 4  Functional, Virtualization  ●     4   Incorrect scaling operation makes the number of pods zero for a service.
    7    AssignNonExistentNode  SocialNetwork      1, 2, 3, 4  Functional, Virtualization  ●     4   Pod in a pending or failure status due to wrong assignment to a non-existent node.
    8    NetworkLoss            HotelReservation   1, 2        Symptomatic                 ●     2   Network loss causes communication failures for a specific service.
    9    PodFailure             HotelReservation   1, 2        Symptomatic                 ●     2   Service interruption due to a pod failure.
    10   Noop                   HotelReservation,  1           -                           ●     2   No faults injected into the system.
                                SocialNetwork

We use GPT-3.5-Turbo and GPT-4-Turbo (Achiam et al., 2023), given access to only a secure shell, as baselines (GPT-w-Shell). In addition, we evaluate the performance of ReAct (Yao et al., 2023), which extends chain-of-thought reasoning (Wei et al., 2022b) by integrating reasoning and acting in an interleaved manner.

As for cloud operation-specific agents, we choose Flash (Zhang et al., 2024b). Flash employs a workflow automation system that monitors execution status and decomposes complex instructions into manageable, conditional segments. It incorporates hindsight generation to learn from past interactions. As Flash was not publicly available at the time of writing, we developed a simplified version that retrospectively generates insights after each step.

To compare with other AIOps approaches specific to a certain type of task, we also evaluate three state-of-the-art, non-LLM-based AIOps algorithms on AIOpsLab, using (multi-modal) telemetry data as input: MKSMC (Çetin and Tasgin, 2020) for detection, and RMLAD (Wang et al., 2020) and PDiagnose (Hou et al., 2021) for localization.

3.2 Metrics

Correctness. This metric measures the accuracy of the agent's response to problems. It evaluates whether the agent successfully detects, localizes, analyzes, and resolves the problems as expected.

Time/Steps. These metrics evaluate the efficiency of the AIOps agent for each type of task. For example, Time-to-Detect (TTD) is the time elapsed from the occurrence of a fault to its detection, and Time-to-Mitigate (TTM) is the time taken from detection to complete mitigation of the fault. The number of steps or actions taken to solve the problem is also recorded. Note that this is the number of times the agent interacts with AIOpsLab, not the number of requests sent to the backend LLM.

Cost. We use the number of tokens, including both input and output tokens, generated by the agents/environment as an indicator of cost.

3.3 Problem Pool of the AIOpsLab Benchmark

Currently, the AIOpsLab benchmark consists of 48 problems in its problem pool. With six agents, we evaluate a total of 288 cases. Table 2 lists the faults used to instantiate the problems. As shown in Table 2, all functional faults (Faults 1-7) are used to create problems at all four task levels, while the symptomatic faults (Faults 8-9) can only be used to create problems at the detection and localization levels (Levels 1 and 2). In the detection-level task, the agents must identify the presence of faults in real time. This task is a binary classification, where the agents have to respond "yes" if a fault is present or "no" otherwise. The detection task (Level 1) can be made more complex, e.g., by asking the agents to label the abnormal telemetry data; however, we keep it simple here and leave the complexity to the other levels. The localization task (Level 2) asks the agents to specify the exact location of the fault, usually a service or pod name in Kubernetes. The RCA task (Level 3) requires the agents to identify (1) the system layer the fault affects and (2) the type of the fault, e.g., a misconfiguration or an operation error. The mitigation task (Level 4) requires the agents to interact with the environment to fix the fault with a series of actions, such as updating the configuration or rolling back to a previous version.

Most faults enable users to extend them and create new problems easily by injecting the fault into other targets, such as other services. For example, Fault 2 in AIOpsLab can be injected into 10 services by simply configuring the injection target, as sketched below.
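Retargeting a fault in this way can be as small as overriding the injection target. The subclass below is a hypothetical sketch built on the K8STargetPortMisconf class from Example 2.1; it is not part of AIOpsLab's shipped problem pool.

    # Hypothetical retargeting of Fault 2: only the injection target (and
    # hence the expected answer) changes.
    class K8STargetPortMisconfTextService(K8STargetPortMisconf):
        def __init__(self):
            super().__init__()
            self.ans = "text-service"   # new injection target / oracle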

We select "user-service", "text-service", and "post-storage-service" from SocialNetwork as injection targets. Injecting faults into different targets is crucial because each service may have distinct dependencies, resulting in a varied fault "blast radius" or failure propagation topology. Consequently, faults can manifest at different locations within the microservice architecture, which helps evaluate the ability of the AIOps agents, since different locations may present distinct difficulties. Applying some faults to construct problems may require additional effort. For example, Faults 3 and 4 require users to not only prepare the scripts that trigger the admin privilege revocation or user unregistration during testing, but also to update the application's config map in Kubernetes; and Fault 1 needs to enforce its TLS requirements through a Helm configuration update. Furthermore, some faults are designed for specific problems and are not readily adaptable, such as Fault 5, which involves an application-level code bug in the microservice's image.

3.4 Performance Results

The overall performance of the agents is summarized in Table 3, with task-specific results in Table 4. As illustrated in Table 3, Flash achieves the highest accuracy among all agents. Although GPT-3.5-Turbo completes the tasks the fastest, it has the lowest accuracy at 15.25%.

Table 3. Overall performance of different agents. We show the lines of code (LoC) to register the agent in AIOpsLab, average running time in seconds, average number of steps taken, average tokens used, and accuracy across all problems.

    Agent            LoC  Time (s)  # Steps  Tokens     Acc.
    GPT-4-w-Shell    41   28.61     6.44     6,394.50   49.15%
    GPT-3.5-w-Shell  41   12.44     14.70    2,557.95   15.25%
    ReAct            49   43.79     11.50    16,941.46  55.93%
    Flash            60   99.64     8.48     6,484.25   59.32%

The detection task, being a binary-choice question, should be the simplest task and the first step an AIOps agent performs. However, as shown in Table 4(a), only Flash answers all the detection problems correctly. For the localization task, agents are allowed to give a list of potential faulty services as their answer (since there could be multiple faults happening in the system at the same time). To evaluate their accuracy, we consider both the top-1 and top-3 answers. In Table 4(b), ReAct performs best when evaluated using the top-3 answers, but its accuracy drops when considering the top 1. The RCA and mitigation tasks prove to be the most challenging for the agents. GPT-3.5-w-Shell fails to recover any failure in its mitigation attempts.

Problem difficulty differs across task levels. Despite showing promise in addressing realistic operational tasks, none of the agents consistently achieve high problem-solving accuracy across the four task categories in the AIOpsLab benchmark. Even the top-performing agents, such as Flash, exhibit limitations, particularly when tackling more complex tasks like mitigation. In Section 3.6, we explore in detail the failure modes and challenges contributing to these performance limitations, and the opportunities for improvement.

3.5 Influence of the Step Limit

We examine the impact of the maximum number of allowed steps on the agents' performance, with the results shown in Figure 5 (accuracy as a function of the step limit K, for K = 3, 5, 10, 15, 20).

Figure 5. Agent performance vs. number of steps taken.

The step limit significantly affects the performance of certain agents. For instance, ReAct and Flash show improved accuracy with more steps, with Flash reaching the highest accuracy of 59.32% when the step limit is set to 20. However, for GPT-3.5-Turbo, increasing the step limit beyond 5 does not yield better performance but merely increases token consumption. Notably, the plateauing of accuracy after a certain number of steps indicates that self-repair with environment feedback can saturate quickly for AIOps problems. In contrast, in development (Dev) tasks such as code generation, feedback from various compositional tools, such as linters, type checkers, and test cases, helps agents continuously improve. This suggests the need for (1) better task decomposition for AIOps problems using planning, (2) improved feedback mechanisms for intermediate steps, and (3) solutions that go beyond environment feedback and self-repair.

3.6 Agent Behavior: The Good, the Bad and the Gaps

We now delve into the behaviors of the agents and analyze the good, the challenges, and the opportunities for improvement. In Table 4, we see that all agents perform better than the traditional non-LLM AIOps methods on the detection and localization problems. Figure 6 shows the telemetry API usage patterns among agents: get_logs is the most frequently used API across all agents, followed by get_metrics and then get_traces. However, agents also diverge in their patterns of API usage. For example, Flash does not use the get_traces API at all. We present the occurrences of other system commands for each agent in Table 5. We next discuss the underlying reasons and patterns contributing to the agents' poor performance.

Table 4. Agent performance by task. This table summarizes the performance of different agents across various tasks, including detection, localization, RCA, and mitigation. Acc. stands for accuracy. Input/Output represents the number of tokens given to and produced by the agent, respectively.

(a) Detection Task

    Agent            Accuracy  Time (s)  # Steps  Input      Output
    GPT-4-w-Shell    69.23%    7.08      3.85     5,492.00   132.00
    GPT-3.5-w-Shell  23.07%    11.05     13.60    1,940.44   385.56
    ReAct            76.92%    39.00     11.46    15,608.08  933.15
    Flash            100%      78.27     6.77     12,869.08  125.69
    MKSMC            15.38%    1.00      N/A      N/A        N/A

(b) Localization Task

    Agent            Acc.@3  Acc.@1   Time (s)  # Steps  Input     Output
    GPT-4-w-Shell    61.54%  61.54%   7.04      4.23     4,588.07  133.23
    GPT-3.5-w-Shell  30.77%  30.77%   6.26      11.92    1,784.23  217.08
    ReAct            69.23%  53.85%↓  38.65     11.08    4,760.77  880.92
    Flash            61.54%  46.15%↓  56.60     5.77     1,875.08  123.31
    PDiagnose        15.38%  15.38%   1.02      N/A      N/A       N/A
    RMLAD            7.69%   7.69%    1.98      N/A      N/A       N/A

(c) Root Cause Analysis (RCA) Task

    Agent            Accuracy  Time (s)  # Steps  Input      Output
    GPT-4-w-Shell    40.90%    8.68      4.81     4,297.91   176.18
    GPT-3.5-w-Shell  9.09%     10.06     14.00    1,495.55   406.27
    ReAct            45.45%    32.16     8.00     16,276.09  757.27
    Flash            36.36%    59.00     6.09     1,193.90   152.45

(d) Mitigation Task

    Agent            Accuracy  Time (s)  # Steps  Input      Output
    GPT-4-w-Shell    27.27%    99.47     13.72    10,142.55  1,060.00
    GPT-3.5-w-Shell  0%        23.78     20.00    3,178.33   967.71
    ReAct            36.36%    67.18     15.54    29,211.90  1,464.90
    Flash            54.55%    216.41    16.09    8,469.00   760.36

Table 5. Occurrences of system commands.

    Agent  find  echo  py  awk  mongo  grep  ls  cat  ip
    ReAct  0     0     0   3    0      1     26  30   0
    Flash  0     3     0   0    0      0     8   10   0

Figure 6. Total percentage of actions taken by different agents (get_logs, get_metrics, get_traces, K8S, and other actions, for ReAct and Flash).

3.6.1 Wasting steps on unnecessary actions

We observe that agents often waste steps on unnecessary actions, such as repeatedly calling the same API, generating non-existent APIs, or spending excessive steps in multi-agent communication. Specifically, the GPT-3.5-w-Shell agent often generates incorrect API commands in loops, leading to repeated errors in execution. For instance, setting speaker_selection_method to round_robin allows every agent to speak in every step, but this often prevents decisive, efficient decisions, as agents repeatedly resort to telemetry APIs for more information. Even with speaker_selection_method set to auto, where the next speaker is chosen automatically, a selected agent may speak ten times in a step without communication (with a maximum of ten communication rounds per step).

3.6.2 Overloaded information when consuming data

To dig deeper into the agent failure modes, we analyze the correlation between the agents' actions and the success or failure of problem-solving, as well as the distribution of actions across steps. In Figure 7, we present the distribution of actions for both successful and failed cases.

Figure 7. Action distribution by success and failure cases.

Agents tend to use the get_metrics and get_traces APIs sparingly in successfully resolved problems, typically only when necessary. This is understandable: metrics data, e.g., CPU and memory usage, contain numerous values that are hard to interpret directly, and trace data are descriptive records of the system's dependencies that are more comprehensible when visualized. However, agents may subsequently consume these data directly with a cat command, which can overwhelm the model's input context window, causing distraction and consuming more tokens. Consequently, using these telemetry APIs without careful consideration or analysis can add noise to the agents' reasoning, possibly leading to token exhaustion. We expect more refined telemetry data processing and filtering mechanisms to be implemented in agents to avoid this issue in the future.
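One such mechanism, sketched below under our own assumptions (the keyword list and line budget are arbitrary), is to filter and truncate raw telemetry before it enters the model's context window.

    # Illustrative telemetry filter; keyword list and budget are assumptions.
    def summarize_telemetry(raw: str, max_lines: int = 50) -> str:
        keywords = ("error", "fail", "exception", "timeout", "denied")
        lines = raw.splitlines()
        hits = [l for l in lines if any(k in l.lower() for k in keywords)]
        kept = (hits or lines)[:max_lines]   # fall back to a head sample
        return "\n".join(kept)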

3.6.3 Invalid API usage

We notice that agents can struggle with improper formatting of API calls. For instance, GPT-3.5-w-Shell consistently generates incorrect command formats (though the API name is correct), such as malformed parameters, and repeats the error in subsequent steps. In many cases, GPT-3.5-w-Shell repeatedly responds with "I apologize for the error. Here is the API call again:" followed by the same or a similar mistake. This issue is observed as many as 14 times within a 20-step problem-solving case.

Similar issues with invalid API usage are also observed in other agents. For instance, the ReAct agent occasionally generates incorrect API commands, but typically recovers by reasoning through the errors and self-correcting its commands. In the following example, ReAct uses an incorrect parameter for the get_logs API, but subsequently corrects it by checking the existing services in its next step.

    ReAct: get_logs("test-social-network", "Social Network")
    AIOpsLab: Error: Your service/namespace does not exist.
    ReAct:
    Thought: I should check the services deployed within the test-social-network
    namespace to ensure I use the correct service name.
    Action: I will list the services in the test-social-network namespace to confirm
    the correct name of the service.
    exec_shell("kubectl get services -n test-social-network")

3.6.4 False positive detection issues

To further evaluate the agents' performance, we set up two detection problems, one for each of the two microservice applications, in which no faults exist, referred to as no-operation (Fault 10, Noop in Table 2) problems. Only GPT-4-w-Shell correctly identifies these cases as normal system execution; the others report false positives, misinterpreting normal activities (e.g., standard workload generation) as faults.

4 Discussion

AIOpsLab helps engineers easily create customized incident scenarios for evaluating agents. By providing the Agent-Cloud Interface (ACI) as a guard-rail, AIOpsLab ensures that agents are tested within a controlled environment, allowing users to focus on designing scenarios that accurately represent incidents in their systems and defining the specific problems their agents should solve.

AIOpsLab is also adaptable to other fault types. For example, an anomaly detection workload scenario can be introduced for detection tasks. Further, users can create problems where agents are required to label the workload or telemetry data to identify anomalies.

When implementing problem evaluators, fine-grained evaluation oracles, or AIOpsLab's optional LLM-as-Judge, may be necessary. For instance, in the binary-choice detection task, agents may answer correctly but provide incorrect interpretations or reasoning. In one case, an agent claimed to detect an abnormal system behavior, but its explanation referenced a workload that was, in fact, normal and unrelated to the injected fault. Leveraging AIOpsLab's LLM-as-Judges can help address this issue by comparing the LLM's reasoning chains with the problem description (including the fault, workload, and environment setup).
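A minimal sketch of such a judge is shown below; the prompt wording is our own, and the llm.generate interface is assumed from Example 2.3, so this is an illustration rather than AIOpsLab's actual judge.

    # Illustrative LLM-as-Judge check; prompt and interface are assumptions.
    JUDGE_PROMPT = """You are grading an AIOps agent. Problem (fault, workload,
    environment): {problem}. Agent reasoning and final answer: {trajectory}.
    Does the reasoning correctly identify the injected fault, rather than blaming
    normal activity such as workload generation? Answer yes or no."""

    def judge_trajectory(llm, problem: str, trajectory: str) -> bool:
        verdict = llm.generate(JUDGE_PROMPT.format(problem=problem,
                                                   trajectory=trajectory))
        return verdict.strip().lower().startswith("yes")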
5 Related Work

AgentOps. Recent advancements in cloud management have increasingly incorporated LLMs to enhance operational tasks. Approaches such as fine-tuned GPT models (Ahmed et al., 2023), RCACopilot (Chen et al., 2024), RCAgent (Wang et al., 2023), MonitorAssistant (Yu et al., 2024a), and Xpert (Jiang et al., 2024) illustrate the effectiveness of LLMs in monitoring and analyzing complex system behaviors. However, beyond the lack of publicly available implementations and the use of private datasets, there is a notable gap: the absence of a unified benchmark capable of providing realistic evaluation scenarios to assess agents' performance across operational tasks.

AIOps benchmarks. Existing AIOps benchmarks primarily rely on static or text-based datasets, such as system metrics (Han et al., 2022; Jacob et al., 2020), typically time-series data, or a fixed question-answer format (Liu et al., 2023). These benchmarks, together with general language model benchmarks (Hendrycks et al., 2021b;a; Liang et al., 2023; Lee et al., 2024; BIG-bench authors, 2023; Huang et al., 2023), do not simulate dynamic and complex cloud environments, let alone allow agents to interact with them to solve operational tasks.

6 Conclusion

In this paper, we unravel the requirements and challenges for a comprehensive framework that supports the design, development, and evaluation of autonomous AIOps agents. We develop such a framework, AIOpsLab, which combines a fault injector, workload generator, cloud-agent orchestrator, and telemetry observer to simulate cloud incidents and provide an agent-cloud interface for orchestrating and evaluating AIOps agents. We leverage AIOpsLab to construct a benchmark suite with 48 problems and evaluate four agents, demonstrating the application of AIOpsLab in evaluating LLM-based agents across different types of AIOps tasks.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE'23).

Ramnatthan Alagappan, Aishwarya Ganesan, Yuvraj Pa- Analysis via Large Language Models for Cloud Incidents.
tel, Thanumalayan Sankaranarayana Pillai, Andrea C. In Proceedings of the Nineteenth European Conference
Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. on Computer Systems (EuroSys’24).
Correlated Crash Vulnerabilities. In Proceedings of the
12th USENIX Conference on Operating Systems Design Maria Christakis, Patrick Emmisberger, Patrice Godefroid,
and Implementation (OSDI’16). and Peter Müller. 2017. A General Framework for Dy-
namic Stub Injection. In Proceedings of the 39th Interna-
Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, tional Conference on Software Engineering (ICSE’17).
and Samer Al-Kiswany. 2018. An Analysis of Network-
Partitioning Failures in Cloud Systems. In Proceedings Yuanshun Dai, Yanping Xiang, and Gewei Zhang. 2009.
of the 13th USENIX Conference on Operating Systems Self-healing and Hybrid Diagnosis in Cloud Computing.
Design and Implementation (OSDI’18). In Cloud Computing, Martin Gilje Jaatun, Gansen Zhao,
and Chunming Rong (Eds.). Springer Berlin Heidelberg,
Radu Banabic and George Candea. 2012. Fast Black-Box Berlin, Heidelberg.
Testing of System Recovery Code. In Proceedings of
the 7th European Conference on Computer Systems (Eu- Elasticsearch. 2024a. Centralize, transform & stash your
roSys’12). data. https://www.elastic.co/logstash.
BIG-bench authors. 2023. Beyond the Imitation Game: Elasticsearch. 2024b. Lightweight shipper for logs. https:
Quantifying and extrapolating the capabilities of language //www.elastic.co/beats/filebeat.
models. Transactions on Machine Learning Research
(2023). Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal
Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian
Marco Canini, Daniele Venzano, Peter Perešíni, Dejan
Ritchken, Brendon Jackson, et al. 2019. An open-source
Kostić, and Jennifer Rexford. 2012. A NICE Way to
benchmark suite for microservices and their hardware-
Test OpenFlow Applications. In Proceedings of the 9th
software implications for cloud & edge systems. In Pro-
USENIX Conference on Networked Systems Design and
ceedings of the 24th International Conference on Archi-
Implementation (NSDI’12).
tectural Support for Programming Languages and Oper-
Uzay Çetin and Mursel Tasgin. 2020. Anomaly detection ating Systems (ASPLOS’19).
with multivariate K-sigma score using Monte Carlo. In
2020 5th International Conference on Computer Science Vaibhav Ganatra, Anjaly Parayil, Supriyo Ghosh, Yu Kang,
and Engineering. Minghua Ma, Chetan Bansal, Suman Nath, and Jonathan
Mace. 2023. Detection Is Better Than Cure: A Cloud
ChaosBlade Team. 2019. ChaosBlade. https:// Incidents Perspective. In Proceedings of the 31st Joint
github.com/chaosblade-io/chaosblade. Ac- European Software Engineering Conference and Sympo-
cessed: 2024-07-08. sium on the Foundations of Software Engineering (ES-
EC/FSE’23).
ChaosMesh Authors. 2022. ChaosMesh. https://chaos-
mesh.org/. Accessed: 2024-07-08. Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Al-
Haicheng Chen, Wensheng Dou, Dong Wang, and Feng varo, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau,
Qin. 2020. CoFI: Consistency-Guided Fault Injection for Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba
Cloud Systems. In Proceedings of the 35th ACM/IEEE Borthakur. 2011. Fate and Destini: A Framework
International Conference on Automated Software Engi- for Cloud Recovery Testing. In Proceedings of the 8th
neering (ASE’20). USENIX Symposium on Networked Systems Design and
Implementation (NSDI’11).
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen-
rique Ponde De Oliveira Pinto, Jared Kaplan, Harri Ed- Songqiao Han, Xiyang Hu, Hailiang Huang, Minqi Jiang,
wards, Yuri Burda, Nicholas Joseph, Greg Brockman, and Yue Zhao. 2022. ADBench: Anomaly Detection
et al. 2021. Evaluating large language models trained on Benchmark. In Thirty-sixth Conference on Neural Infor-
code. arXiv preprint arXiv:2107.03374 (2021). mation Processing Systems Datasets and Benchmarks
Track.
Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin
Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Shilin He, Botao Feng, Liqun Li, Xu Zhang, Yu Kang,
Ming Wen, Jun Zeng, Supriyo Ghosh, Xuchao Zhang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang.
Chaoyun Zhang, Qingwei Lin, Saravan Rajmohan, Dong- 2023. STEAM: Observability-Preserving Trace Sampling.
mei Zhang, and Tianyin Xu. 2024. Automatic Root Cause Association for Computing Machinery.
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

Shilin He, Xu Zhang, Pinjia He, Yong Xu, Liqun Li, Yu Kang, Minghua Ma, Yining Wei, Yingnong Dang, Saravanakumar Rajmohan, et al. 2022. An empirical study of log analysis at Microsoft. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE'22).

Helm. 2024. Helm: The package manager for Kubernetes.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021a. Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (ICLR'21) (2021).

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR'21) (2021).

Victor Heorhiadi, Shriram Rajagopalan, Hani Jamjoom, Michael K Reiter, and Vyas Sekar. 2016. Gremlin: Systematic Resilience Testing of Microservices. In Proceedings of the IEEE 36th International Conference on Distributed Computing Systems (ICDCS'16).

Chuanjia Hou, Tong Jia, Yifan Wu, Ying Li, and Jing Han. 2021. Diagnosing performance issues in microservices with heterogeneous data source. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom).

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. 2023. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. In Advances in Neural Information Processing Systems (NeurIPS'23).

Vincent Jacob, Fei Song, Arnaud Stiegler, Yanlei Diao, and Nesime Tatbul. 2020. AnomalyBench: An Open Benchmark for Explainable Anomaly Detection. CoRR (2020).

Jaeger Authors. 2024. Jaeger: Open source, end-to-end distributed tracing. https://www.jaegertracing.io.

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024a. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974 (2024).

Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. 2024b. R2E: Turning any Github Repository into a Programming Agent Environment. In Forty-first International Conference on Machine Learning (ICML'24).

Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. 2024. Xpert: Empowering Incident Management with Query Recommendations via Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE'24).

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations (ICLR'24).

Xiaoen Ju, Livio Soares, Kang G. Shin, Kyung Dong Ryu, and Dilma Da Silva. 2013. On Fault Resilience of OpenStack. In Proceedings of the 12th ACM Symposium on Cloud Computing (SOCC'13).

Kyle Kingsbury. 2022. Jepsen. https://jepsen.io/.

Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Teufel, Marco Bellagente, et al. 2024. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems (NeurIPS'24) (2024).

Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. 2014. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14).

Wenrui Li, Pengcheng Zhang, and Zhongxue Yang. 2012. A Framework for Self-Healing Service Compositions in Cloud Computing Environments. In 2012 IEEE 19th International Conference on Web Services (ICWS'12).

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. Holistic Evaluation of Language Models.
Yuhe Liu, Changhua Pei, Longlong Xu, Bohan Chen, Mingze Sun, Zhirui Zhang, Yongqian Sun, Shenglin Zhang, Kun Wang, Haiming Zhang, et al. 2023. OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models. arXiv preprint arXiv:2310.07637 (2023).

Jie Lu, Chen Liu, Lian Li, Xiaobing Feng, Feng Tan, Jun Yang, and Liang You. 2019. CrashTuner: Detecting Crash-Recovery Bugs in Cloud Systems via Meta-Info Analysis. In Proceedings of the 26th ACM Symposium on Operating System Principles (SOSP'19).

Minghua Ma, Shenglin Zhang, Dan Pei, Xin Huang, and Hongwei Dai. 2018. Robust and rapid adaption for concept drift in software system anomaly detection. In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 13–24.

Rupak Majumdar and Filip Niksic. 2018. Why is Random Testing Effective for Partition Tolerance Bugs? In Proceedings of the 45th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL'18).

Paul D Marinescu and George Candea. 2009. LFI: A Practical and General Library-Level Fault Injector. In Proceedings of the 39th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'09).

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842 (2023).

Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, and Vijay Chidambaram. 2018. Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18).

Netflix. 2011. ChaosMonkey. https://github.com/Netflix/chaosmonkey. Accessed: 2024-07-08.

Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Samer Al Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14).

Prometheus Authors. 2024. Prometheus. https://prometheus.io.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems (NeurIPS'24) (2024).

Jesper Simonsson, Long Zhang, Brice Morin, Benoit Baudry, and Martin Monperrus. 2021. Observability and chaos engineering on system calls for containerized applications in Docker. Future Generation Computer Systems (2021).

Gagan Somashekar, Anurag Dutt, Mainak Adak, Tania Lorido Botran, and Anshul Gandhi. 2024. GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications. In Proceedings of the ACM Web Conference 2024.

Akshitha Sriraman and Thomas F Wenisch. 2018. µSuite: a benchmark suite for microservices. In 2018 IEEE International Symposium on Workload Characterization.

Xudong Sun, Wenqing Luo, Jiawei Tyler Gu, Aishwarya Ganesan, Ramnatthan Alagappan, Michael Gasch, Lalith Suresh, and Tianyin Xu. 2022. Automatic Reliability Testing for Cluster Management Controllers. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22).

Yongqian Sun, Binpeng Shi, Mingyu Mao, Minghua Ma, Sibo Xia, Shenglin Zhang, and Dan Pei. 2024. ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering.

Lingzhi Wang, Nengwen Zhao, Junjie Chen, Pinnong Li, Wenchi Zhang, and Kaixin Sui. 2020. Root-cause metric location for microservice systems via log anomaly detection. In 2020 IEEE International Conference on Web Services (ICWS'20).

Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Lunting Fan, Lingfei Wu, and Qingsong Wen. 2023. RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models. arXiv preprint arXiv:2310.16340 (2023).

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022a. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022).
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems (NeurIPS'22) (2022).

Sean Wolfe. 2018. Amazon's one hour of downtime on Prime Day may have cost it up to $100 million in lost sales. (2018). https://www.businessinsider.com/amazon-prime-day-website-issues-cost-it-millions-in-lost-sales-2018-7

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR'23).

Zhaoyang Yu, Minghua Ma, Chaoyun Zhang, Si Qin, Yu Kang, Chetan Bansal, Saravan Rajmohan, Yingnong Dang, Changhua Pei, Dan Pei, Qingwei Lin, and Dongmei Zhang. 2024a. MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE'24).

Zhaoyang Yu, Changhua Pei, Xin Wang, Minghua Ma, Chetan Bansal, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang, Xidao Wen, Jianhui Li, et al. 2024b. Pre-trained KPI Anomaly Detection Model Through Disentangled Transformer. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. 2024. The Shift from Models to Compound AI Systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/.

Pingyu Zhang and Sebastian Elbaum. 2012. Amplifying Tests to Validate Exception Handling Code. In Proceedings of the 34th International Conference on Software Engineering (ICSE'12).

Qiao Zhang, Guo Yu, Chuanxiong Guo, Yingnong Dang, Nick Swanson, Xinsheng Yang, Randolph Yao, Murali Chintalapati, Arvind Krishnamurthy, and Thomas Anderson. 2018. Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI'18).

Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, and Saravan Rajmohan. 2024a. Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE'24).

Xuchao Zhang, Tanish Mittal, Chetan Bansal, Rujia Wang, Minghua Ma, Zhixin Ren, Hao Huang, and Saravan Rajmohan. 2024b. FLASH: A Workflow Automation Agent for Diagnosing Recurring Incidents. (2024).

Chenyu Zhao, Minghua Ma, Zhenyu Zhong, Shenglin Zhang, Zhiyuan Tan, Xiao Xiong, LuLu Yu, Jiayi Feng, Yongqian Sun, Yuzhi Zhang, et al. 2023. Robust Multimodal Failure Detection for Microservice Systems. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS'24) (2024).

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854 (2023).

Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2021. Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. IEEE Transactions on Software Engineering (2021).