ENHANCING CLUSTER RESILIENCE: LLM-AGENT BASED AUTONOMOUS INTELLIGENT CLUSTER DIAGNOSIS SYSTEM AND EVALUATION FRAMEWORK

Honghao Shi 1 Longkai Cheng 1 Wenli Wu 1 Yuhang Wang 1 Xuan Liu 1 Shaokai Nie 1 Weixv Wang 1
Xuebin Min 1 Chunlei Men 1 Yonghua Lin 1

ABSTRACT

Recent advancements in Large Language Models (LLMs) and related technologies such as Retrieval-Augmented
Generation (RAG) and Diagram of Thought (DoT) have enabled the creation of autonomous intelligent systems
capable of performing cluster diagnostics and troubleshooting. By integrating these technologies with self-play
methodologies, we have developed an LLM-agent system designed to autonomously diagnose and resolve issues
within AI clusters. Our innovations include a knowledge base tailored for cluster diagnostics, enhanced LLM
algorithms, practical deployment strategies for agents, and a benchmark specifically designed for evaluating LLM
capabilities in this domain. Through extensive experimentation across multiple dimensions, we have demonstrated
the superiority of our system in addressing the challenges faced in cluster diagnostics, particularly in detecting
and rectifying performance issues more efficiently and accurately than traditional methods.

1 INTRODUCTION

Recent advancements in Large Language Models (LLMs) and complementary technologies such as Retrieval-Augmented Generation (RAG) and Diagram of Thought (DoT) have paved the way for the development of autonomous intelligent systems capable of performing cluster diagnostics and troubleshooting. By integrating these technologies with self-play methodologies, we have created an LLM-agent system designed to autonomously diagnose and resolve issues within AI clusters. Our innovative approach includes the establishment of a specialized knowledge base for cluster diagnostics, the enhancement of LLM algorithms to better suit the demands of the domain, practical deployment strategies for agents within real-world environments, and the development of a benchmark specifically tailored to evaluate LLM capabilities in the context of cluster diagnostics. These components collectively contribute to a robust framework that addresses the complexities inherent in managing AI clusters, particularly in scenarios involving performance degradation or other operational anomalies.

Through rigorous experimentation, we have validated the effectiveness of our LLM-agent system across multiple dimensions. Our benchmark, which consists of 150 manually crafted advanced questions, serves as a comprehensive evaluation tool that highlights the performance differences between our enhanced LLM-agent and baseline open-source models. In practical applications, the LLM-agent demonstrates its superior capability to identify and resolve performance issues more efficiently than traditional methods, reducing troubleshooting time significantly. For instance, in a simulated scenario where one GPU was throttled to a much lower frequency, our system identified and resolved the issue within a matter of minutes, whereas conventional approaches would have taken a senior operations engineer nearly an hour to diagnose and rectify using pre-written automated detection software.

Moreover, the LLM-agent's ability to detect and initiate corrective actions even before the performance degradation is noticed by human operators marks a significant advancement in proactive system maintenance. This capability not only mitigates immediate issues but also enhances the overall availability and reliability of the AI cluster by preemptively addressing potential faults. By leveraging the strengths of RAG and DoT, the LLM-agent can autonomously execute remediation measures, thereby freeing up engineering resources to focus on more complex and value-driven tasks. Our research underscores the transformative potential of combining AI-driven diagnostics with practical deployment strategies, setting the stage for a new era of intelligent cluster management solutions.

*Equal contribution. 1 Beijing Academy of Artificial Intelligence, Beijing, China. Correspondence to: Yonghua Lin <yhlin@baai.ac.cn>.

2 RELATED WORKS

2.1 LLM's Alignment and Enhancement

In recent years, generative artificial intelligence centered around large language models (LLMs) has seen rapid development, with powerful natural language generation capabilities demonstrated by proprietary models such as the GPT series (Achiam et al., 2023) and Gemini series (Team et al., 2023), as well as open-source models like Llama (Dubey et al., 2024) and Qwen (Yang et al., 2024).

There are multiple approaches to enhancing the capabilities of LLMs across different stages such as training, inference, and deployment, as well as in areas like data, algorithms, and computational resources. In light of the achievements of autoregressive models like GPT-2 (decoder-only transformers) (Radford et al., 2019) and LLaMA (transformer++) (Touvron et al., 2023), enhancing data quality has become a critical method for improving the efficacy of models during the pre-training process (Adler et al., 2024; Liu et al., 2024).

For modern LLMs, several training or fine-tuning stages exist between pre-training and deployment. ChatGPT (Ouyang et al., 2022) describes this process as Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Reinforcement Learning with Human Feedback (RLHF), while LLaMA 3.1 (Dubey et al., 2024) integrates these into a continuous process known as "Continue Training." Besides training, LLMs can leverage Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) to utilize knowledge from data distributions that were not part of the training set. We refer to the above as the alignment and enhancement of LLMs.

2.2 AI-agent Based Applications

After the model parameters have been frozen, it is possible to enhance the inherent capabilities of the model through mechanisms such as chain-of-thought (CoT) reasoning (Wei et al., 2022), scaling test-time compute (Snell et al., 2024), and combining CoT LLMs with AI agents (Castelfranchi, 1998) as LLM-agents (Park et al., 2023).

CoT is a prompting technique used to guide LLMs to generate intermediate reasoning steps before arriving at a final conclusion. There are extensions to classic CoT, such as Tree of Thought (ToT) (Yao et al., 2024) for tree-like backtracking, Graph of Thought (GoT) (Besta et al., 2024) for graph-based reasoning, and Diagram of Thought (DoT) (Zhang et al., 2024) for a propose-critique-summarize approach based on topos theory.

The development of CoT and the scaling of test-time compute are unified: CoT applications aim either to maintain optimal results within a limited test-time budget or to scale test-time compute to achieve extraordinary results (Snell et al., 2024). The CoT series of techniques is also one of the foundations for building LLM-agents. LLM-agents can leverage LLMs as the processing core while integrating traditional AI-agent capabilities such as memory, planning, and execution, creating semi-autonomous software entities that are highly adaptive and capable (Xi et al., 2023).

2.3 Diagnosis and Repair for AI Clusters

Constructing and utilizing LLM applications typically requires hardware infrastructure costing millions of dollars or more. Meta constructed the LLM application core LLaMA 3.1 within 54 days, leveraging a cluster that included 16,000 GPUs (Dubey et al., 2024), with the GPU costs alone amounting to over a billion dollars. However, such complex and expensive systems face significant challenges in terms of reliability and availability. During the 54-day training run, the Meta cluster experienced 419 unexpected interruptions, averaging one disruption every three hours. At such a frequency of interruptions, the cluster, from the operating system to the AI framework and distributed scheduling software, requires the ability to capture, identify, attribute, and repair exceptions to ensure successful and efficient model training. Microsoft's SuperBench (Xiong et al., 2024) has systematically built a suite of standard test cases to comprehensively assess the availability of clusters.

In terms of capture and repair, the Torch (Paszke et al., 2019) Elastic solution aims to enable automatic restarts of model training, while works such as Flash Checkpointing in DLRover (Wang et al., 2023) focus on reducing the cost of checkpoint saving and loading during the automatic restart process. Building upon automatic restart capabilities, many works at the AI framework level have conducted research and practical implementations to enhance reliability and availability, particularly those featuring highly customized solutions based on Megatron (Shoeybi et al., 2019). ByteDance's MegaScale (Jiang et al., 2024) and Alibaba's Pai-Megatron (Qian et al., 2024) both provide toolkits for cluster diagnostics, which are used to check the health of servers and networks, as well as to perform manual or automated error identification and repair.

With the advancement of AI technologies, researchers are beginning to explore the use of AI techniques to address cluster diagnostic issues. Using big data techniques to analyze log files was a typical approach to automating cluster diagnostics (Jung & Chung, 2021). However, such methods primarily involve static or real-time analysis of files produced by the training process, which limits their attribution capabilities and means they lack intelligent autonomy, relying instead on pre-written execution and planning procedures.

3 SPECIAL TERMINOLOGIES
AI computing tasks: programs or processes designed to achieve intelligence, such as training large language models, inference with large language models, world model inference, and LLM-agent inference.

AI chips: processors suitable for or dedicated to performing AI computing tasks, such as NVIDIA GPUs, Intel Gaudi AI accelerators, and Google TPUs (Jouppi et al., 2017).

AI servers: computers equipped with AI chips that are suitable for or specifically designed to perform AI computing tasks, such as the NVIDIA DGX H100. AI servers often have requirements beyond those of classic servers in terms of stability, availability, cooling, and power consumption.

AI cluster: a distributed server cluster composed of two or more AI servers set up to accomplish a single target task, such as Meta's cluster containing 16 thousand GPUs. Additionally, AI servers typically require RDMA or higher-bandwidth interconnect protocols, such as InfiniBand RDMA (Shanley, 2003) and RDMA over Converged Ethernet (RoCE) (Guo et al., 2016), and do not usually adopt classic Ethernet protocols.

Cluster diagnosis: ensuring that AI computing tasks can run with normal performance on the AI cluster, promptly detecting task failures, identifying the points of failure, clarifying the reasons for failure, repairing the corresponding faults, and ensuring the overall availability of the AI cluster.

4 METHODS

4.1 Overview

We incorporate advanced techniques from the field of LLM alignment and enhancement to creatively develop a solution for building a cluster intelligent maintenance system based on LLM-agents. Figure 1 illustrates the overall process of this solution.

Figure 1. Overview of the Intelligent Maintenance System Based on LLM-Agents

The upper part of the figure represents the core component of the solution: the LLM-agent. The LLM-agent consists of an agent program and an LLM. The LLM interprets the input information provided by the agent as external stimuli and task instructions, and responds appropriately. The agent then directly writes code or calls specific software interfaces based on the feedback from the LLM, thereby operating the cluster. For the LLM itself, there are two main challenges. First, how does the LLM acquire domain-specific knowledge of cluster diagnostics, and furthermore, where does this knowledge come from? Second, how can the LLM reason and plan? For the entire LLM-agent, ensuring that the LLM's inputs and outputs match the actual operations performed by the agent controlling the cluster is another crucial aspect that needs to be addressed.
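Before describing our three innovations, the following is a minimal sketch of the agent loop just described. The helper functions query_llm and run_on_cluster and the JSON action format are our own illustrative assumptions, not the exact implementation.

# Minimal sketch of the LLM-agent loop (hypothetical helpers, not the production code).
import json

def query_llm(messages):
    # Placeholder: call the backing LLM (e.g., a local LLaMA 3.1 endpoint) and return its reply text.
    raise NotImplementedError

def run_on_cluster(command):
    # Placeholder: execute a shell command or tool call on the cluster and return its output.
    raise NotImplementedError

def agent_step(cluster_feedback, history):
    """One round: the LLM interprets cluster feedback, the agent executes the proposed action."""
    history.append({"role": "user", "content": cluster_feedback})
    reply = query_llm(history)                      # LLM treats agent input as an external stimulus
    history.append({"role": "assistant", "content": reply})
    action = json.loads(reply)                      # assumed format: {"code": "..."} or {"tool": "...", "args": "..."}
    if "code" in action:
        output = run_on_cluster(action["code"])     # agent runs code written by the LLM
    else:
        output = run_on_cluster(action["tool"] + " " + action["args"])  # or calls an existing interface
    return output                                   # fed back to the LLM in the next round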
In order to solve the above problems, we have introduced three innovations. First, we use 250 cluster failure records collected from GitHub as a starting point, and treat the cluster operation failure logs actually managed by the LLM-agent as a continuous source of data. We utilize RAG (Lewis et al., 2020) to enable the LLM to capture detailed knowledge corresponding to specific terms within the context. Figure 1 depicts the "alert", "compute cluster", and "storage" sections, along with their communication with the LLM-agent, which outlines this process. Second, we use DoT (Zhang et al., 2024) to enable the model to effectively handle non-natural-language information such as symbols, formulas, and code. Similar to vision-text multimodal models, we effectively leverage textual elements that go beyond the inherent meaning of natural language based on DoT. The "planning algorithm" section at the top of Figure 1 illustrates this innovation. Third, we use self-play technology (Snell et al., 2024) to enable the LLM to autonomously and intelligently divide long tasks or challenging reasoning objectives into multiple steps, self-assess the output of each step, and ultimately achieve the goal.

The lower part of Figure 1 forms the basis of our work. It includes a mature operations alarm troubleshooting and repair process, as well as several mature or advanced software tools. Based on related works, we have developed a unified, multi-level, multi-dimensional cluster diagnostic toolkit, shown in Figure 2.

Figure 2. Tools for LLM-agent to Diagnose AI Cluster

This toolkit diagnoses the health status of the cluster from both the supply side and the demand side simultaneously. The bottom part of Figure 2 lists the various components required to build an AI cluster, including the computing component, storage component, network component, and others. AI clusters following different technical routes provide similar capabilities, as shown in the middle part of Figure 2. We inspect all resource supply items affecting AI computing tasks to determine whether their content is correct, whether their performance is appropriate, and whether they are stable. For example, for the feature of RDMA read/write between two GPUs across servers, our tool checks whether the read/write content is correct, whether the IOPS, bandwidth, latency, and other performance metrics are appropriate, and whether the feature remains stable under complex scenarios such as long-duration or multi-process read/writes. Most of these tools are improved versions of packages provided by chip, server, or operating system vendors. The top part of Figure 2 takes the demand side into consideration, evaluating the metrics of concern for AI computing tasks with various characteristics.

In summary, we have built an LLM-agent capable of retrieving and utilizing vast amounts of external information, with autonomous planning, learning, reasoning, and execution capabilities. This LLM-agent works alongside either custom-written tools or existing mature tools to perform early warning, troubleshooting, and repair tasks for the cluster.

4.2 Cluster Diagnosis Domain-specific Knowledge Base

Our knowledge base consists of two sources. The first part is logs, monitoring information, and program output content. It comes from pre-collected, cleaned, and organized GitHub data, carefully selected to address pain points in the cluster diagnostics and troubleshooting domain and incorporating knowledge from issues in the GitHub community, and also from operational data acquired after the initial deployment and operation of the LLM-agent. We call it the Diagnosis Dataset. The second part is composed of symbolic reasoning structures. These reasoning structures use AI computation tasks and hardware specification information as input, and through a bottom-up modeling approach, predict the theoretical performance of the given AI computation tasks, thereby determining the correctness of the observed performance.

4.2.1 Diagnosis Dataset

We drew on effective practices from Alibaba's experience in managing cluster startup operations (Xu et al., 2024) to build a database. We cleaned, organized, and structured the unstructured data obtained from GitHub, ultimately forming an effective dataset. We collected over a thousand questions and feedback items from GitHub issue sections. Through automated processes and manual review, we filtered out over 200 entries with substantive knowledge content and well-structured Q&A formats. Each piece of organized data contains four fields: problemkey, rawtext, function, and result.

The problemkey is a domain keyword identified either manually or with OpenAI o1. The rawtext is the original content of a web page after simple formatting, stored as a long string containing the questions asked on the page and the developers' responses. The function is based on our cluster diagnosis toolkits and is manually correlated by cluster troubleshooting personnel; it is used as annotation in the portion of the dataset that the model can perceive, it is not perceived by the model for the answers used in the benchmark evaluation, and it serves as the starting point for knowledge acquisition after the LLM-agent is deployed. The result is the cause of the fault extracted from the rawtext based on the developers' answers. For an LLM capable of driving an agent to perform cluster diagnostics, we expect it to be able to determine the causes of faults based on real-time operational information from the cluster and to call existing tools or write tool code on the fly for cluster repairs, without relying on rawtext containing developer replies. We demonstrate this capability in subsequent experiments.
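As an illustration, one organized record could look like the following; only the four field names come from the description above, while the concrete values and the function name are invented for exposition.

# Illustrative Diagnosis Dataset record with the four fields described above; values are invented.
example_record = {
    "problemkey": "NCCL timeout",                      # domain keyword, labeled manually or with OpenAI o1
    "rawtext": "<issue title, question body, and developer replies kept as one long string>",
    "function": "check_rdma_bandwidth",                # hypothetical toolkit entry point, manually correlated
    "result": "A firmware mismatch on one NIC caused RDMA retransmissions and the training hang.",
}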

4.2.2 Performance Modeling

We use a series of progressive methods to model the correct performance of given AI computation tasks, and through DoT, we convert this special modal data into tokens to feed into the model. In addition to cluster health checks, we have included modules in the toolkits to determine whether different AI computing tasks exhibit correct performance. These modules can, on one hand, be invoked by the agent to provide results to the LLM for analysis, and on the other hand, they can be called by the LLM to have the agent check the cluster status.
We start modeling with the simplest task types. Considering that existing AI clusters are composed of computing devices with the von Neumann architecture, AI computing tasks require the use of computing cores, memory, and I/O ports. It is worth noting that what AI computing tasks occupy are not narrowly defined CPU computing cores, main memory, or input/output ports, but rather resources in a broader sense, such as computing cores dedicated to matrix multiplication, HBM memory composed of multi-level caches, and high-speed I/O ports formed by PCIe or RDMA protocols. To build a unified model, we use the concepts of equivalent computing power, equivalent memory bandwidth, and equivalent I/O bandwidth.

We refer to computational tasks that occupy or primarily occupy one type of resource as single-resource computational tasks. We construct a single-variable computational task performance model and use experiments based on Khinchin's law of large numbers to obtain the results. We assume that for a certain computational task T, the total amount of resource R_i required is M_i. The hardware running this task can provide N_i units of resource R_i per second. Assume that the single-variable task T_x depends only on resource R_0. We determine M_0 based on the mathematical formula used for the task's computation. For N_0, we consider it a random variable. Through a large number of repeated experiments after warm-up, we ensure that the difference between the measured result and the expected value of the random variable approaches zero. We define performance as the number of times a specific task can be executed per unit time. For the aforementioned task T_x, we predict its performance to be N_0 / M_0.
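A minimal sketch of this single-variable estimate is shown below; the timing loop and the tolerance are illustrative assumptions rather than the exact tooling.

# Sketch of the single-variable performance model: predicted rate = N0 / M0,
# compared against a measured rate averaged over many repetitions (law of large numbers).
import time

def predicted_rate(n0_units_per_second, m0_units_per_task):
    # N0: units of resource R0 the hardware supplies per second; M0: units one task execution consumes.
    return n0_units_per_second / m0_units_per_task

def measured_rate(run_task_once, warmup=10, repeats=1000):
    for _ in range(warmup):            # warm up caches, clocks, allocator, etc.
        run_task_once()
    start = time.perf_counter()
    for _ in range(repeats):
        run_task_once()
    elapsed = time.perf_counter() - start
    return repeats / elapsed           # tasks executed per second

def performance_is_normal(run_task_once, n0, m0, tolerance=0.15):
    # Flag the task as abnormal if measured throughput falls well below the prediction.
    return measured_rate(run_task_once) >= (1.0 - tolerance) * predicted_rate(n0, m0)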
For non-single-variable tasks, we focus on modeling whether the different resources they depend on can operate in parallel. A widely used method in multivariate task modeling is the roofline model (Ofenbeck et al., 2014). The roofline model introduces a new variable: the task characteristic C_T. Consider a task T_x that depends on two resources R_0 and R_1; the effective utilization of resource R_0 is plotted on the Y-axis, and the ratio of the effective utilization of resource R_0 to that of resource R_1 is plotted on the X-axis. By changing C_T, a scatter plot can be drawn, forming a shape like a roofline. The roofline model is equivalent to modeling the performance of multivariable tasks under fully parallel scenarios, which does not align with real-world conditions. Additionally, in the context of existing LLM performance modeling, changes in C_T are not about variations in the input size of a single task but about the changing proportions of two different primary resource-consuming tasks within the total task.
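For reference, a minimal sketch of the textbook roofline bound is given below; it encodes exactly the full-overlap assumption criticized above, and the hardware numbers are rounded assumptions used only for illustration.

# Textbook roofline bound (assumes compute and memory traffic overlap perfectly).
def roofline_flops(peak_flops, peak_mem_bandwidth, arithmetic_intensity):
    # arithmetic_intensity: FLOPs performed per byte moved (the task characteristic C_T plays this role).
    return min(peak_flops, arithmetic_intensity * peak_mem_bandwidth)

# Example with rounded A800-class numbers (assumed): ~312 TFLOP/s FP16 tensor compute, ~2 TB/s HBM bandwidth.
attainable = roofline_flops(312e12, 2.0e12, arithmetic_intensity=50.0)  # memory-bound at 100 TFLOP/s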
Therefore, we use the proportion of different subtasks as variables to model multivariable tasks for the three main resources provided by AI clusters: equivalent floating-point computing power for matrix multiplication, memory read/write bandwidth, and I/O port bandwidth. The results in Figure 3 show that computing and memory fall in domains that are completely non-parallelizable, whereas computing, memory, and I/O ports can approach full parallelization. This conclusion and the related figures have been compiled and placed in the RAG documentation.

Figure 3. Multi-variable Task Performance Modeling. A shows compute-memory, B shows interconnect-memory, C shows interconnect-compute.

4.3 Create LLM-agent with RAG-DoT-Selfplay Techniques

4.3.1 Using RAG to Build an LLM That Can Utilize External Knowledge

RAG integrates two core components: retrieval and generation. The retrieval module is responsible for finding context-relevant information in an external knowledge base, a process that typically involves indexing large volumes of documents to quickly locate the most pertinent segments. The retrieved information is then passed to the generation module as additional input. The generation module builds upon a pre-trained language model, leveraging the retrieved context to enhance its generation capabilities, thereby producing responses that are more accurate and better aligned with real-world situations.

Considering other similar technologies: SFT requires substantial computing resources and may diminish the model's inherent generalization capabilities, while in-context learning consumes context length and inference time, making it unsuitable for importing datasets with millions of entries. RAG can acquire relevant knowledge during inference with minimal resources and inference time, without altering the weights of the model itself.
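A minimal sketch of this retrieve-then-generate flow is shown below; the embed and generate calls are placeholders rather than the production pipeline.

# Minimal retrieve-then-generate sketch; embed() and generate() are placeholders for the real models.
import numpy as np

def embed(text):
    # Placeholder: return a vector for the text (e.g., from a sentence-embedding model).
    raise NotImplementedError

def generate(prompt):
    # Placeholder: call the backing LLM with the augmented prompt.
    raise NotImplementedError

def retrieve(query, documents, doc_vectors, k=3):
    q = embed(query)
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(scores)[::-1][:k]              # cosine similarity, top-k segments
    return [documents[i] for i in top]

def rag_answer(query, documents, doc_vectors):
    context = "\n\n".join(retrieve(query, documents, doc_vectors))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context."
    return generate(prompt)                          # model weights stay unchanged; knowledge enters via the context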
4.3.2 Using DoT to Build an Agent That Can Reason and Plan

DoT (Diagram of Thought) (Zhang et al., 2024) models iterative reasoning in LLMs as constructing a Directed Acyclic Graph (DAG) within a single model. The DAG consists of nodes representing propositions, critiques, refinements, and verifications, with edges indicating the logical relationships or dependencies between them. We use XML to handle multimodal special-symbol data and perform reasoning based on DoT.

Based on the principles of DoT, we use XML tags to separate different types of text, including plain text, special symbols, code, formulas, and inference rules. Thanks to the RoPE positional encoding adopted by LLaMA 3.1, the model can accurately capture the content within XML tag pairs. Based on the reasoning graph, our experiments confirmed that this application allows the LLM to correctly reason according to specific rules, achieving the capability to support the agent in completing cluster fault attribution and repair tasks. This significantly exceeds the capabilities of pre-trained or aligned LLMs.
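As an illustration of the tagging scheme, a DoT-style reasoning fragment passed to the model might look like the following; the tag names and the content are our own invented example, not the exact schema used in our system.

# Example DoT-style prompt fragment using XML tags to separate text types (illustrative only).
dot_fragment = """
<proposition id="p1">GPU 3 SM clock reads 210 MHz; expected about 1410 MHz given the performance model.</proposition>
<critique of="p1">The clock reading could be a transient sample; verify with a second query.</critique>
<refinement of="p1">Re-read clocks on all GPUs via <code>nvidia-smi --query-gpu=clocks.sm --format=csv</code>.</refinement>
<verification of="p1">All GPUs report about 1410 MHz except GPU 3; the proposition is confirmed.</verification>
"""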
4.3.3 Using Selfplay Techniques to Construct a Domain-specific Multimodal Agent

With the help of RAG and DoT, the LLM can utilize information from outside the training set as well as abstract symbolic reasoning information. However, this still has limitations for an agent designed for intelligent cluster diagnostics. We permit the LLM to generate content over a longer duration. The quality of solutions to challenging problems can be enhanced through multiple rounds of planned selfplay or spontaneous self-questioning and answering by the agent.

Spontaneous self-questioning and answering is applied in DoT reasoning. For the planned selfplay process, we transform the complex problem of cluster fault attribution into a three-round process. In the first round, the agent, based on error logs passed from the cluster, prompts the LLM to identify potential keywords from the error items and corresponding solutions from the knowledge base, performing information extraction and RAG. In the second round, the LLM evaluates its own answers, making corrections or accepting them directly, then proceeds to write or call appropriate tools for the agent to execute. In the final round, the LLM makes an accurate attribution judgment based on the results of the agent's interaction with the actual cluster. Compared to existing selfplay work focused on the text side, we integrate selfplay with the agent, granting it the permissions to operate machines and interact with the environment, fully simulating the capabilities of a human engineer solving problems.
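A minimal sketch of this three-round loop is given below; the helper functions (query_llm, run_on_cluster, knowledge_base_search) and the prompt wording are illustrative assumptions, not the exact prompts used in our system.

# Sketch of the three-round selfplay process for fault attribution (helpers are hypothetical).
def selfplay_attribution(error_log, knowledge_base_search, query_llm, run_on_cluster):
    # Round 1: extract keywords from the error log and retrieve candidate solutions (information extraction + RAG).
    keywords = query_llm(f"List the key error terms in this log:\n{error_log}")
    candidates = knowledge_base_search(keywords)

    # Round 2: the LLM critiques its own answer, then writes or selects a tool for the agent to run.
    plan = query_llm(
        f"Error log:\n{error_log}\nCandidate causes:\n{candidates}\n"
        "Review these candidates, correct them if needed, and output one shell command or script "
        "that tests the most likely cause."
    )
    observation = run_on_cluster(plan)

    # Round 3: final attribution judgment based on real interaction with the cluster.
    return query_llm(
        f"Command:\n{plan}\nCluster output:\n{observation}\n"
        "Give the final fault attribution and the recommended repair."
    )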
5 EXPERIMENTS

We conducted a three-phase experiment to demonstrate the advanced nature of the proposed LLM-agent in the field of cluster intelligent diagnostics. The first phase involves creating a dataset and benchmark for the field of cluster intelligent diagnostics. First, we define the statistical characteristics of the external data knowledge base and introduce the process of generating an evaluation benchmark from this knowledge base. Next, we describe the features of this benchmark and explain its advanced nature in the field of cluster intelligent diagnostics. Throughout this process, we emphasize fairness and impartiality, strictly distinguishing between the parts the model can perceive and the scoring portions of the evaluation. We further elaborate on the benchmark using the results of the mainstream open-source model LLaMA3.1-70B.

The second phase involves evaluating the three innovations we proposed—RAG, DoT, and selfplay—using the aforementioned benchmark for comparative assessment. The experiments in the second phase are aimed at demonstrating the advanced nature of our proposed methods in the field of cluster intelligent diagnostics.

In the third phase, we expose the LLM-agent to both the training and testing sets in the benchmark, allowing it to operate in its most complete form to address real-world problems encountered in production environments. We demonstrate the accuracy, efficiency, and autonomous intelligence of this solution through two typical cases. Specifically, we found that this solution can provide early warnings for AI clusters, further enhancing the availability of the clusters.

Finally, we conduct a qualitative analysis and discussion on the topics of correctness, safety, and reliability, which are at the forefront of the LLM and LLM-agent fields and have yet to be conclusively resolved, to demonstrate the series of work we have undertaken in these areas.

5.1 Statistics and Evaluation for Dataset and Benchmark

5.1.1 Data's Source

The materials provided to the LLM come from three sources. The first source is automatically collected Q&A data from relevant GitHub communities involved in AI cluster troubleshooting, such as the issue sections of repositories like Megatron, PAI, DeepSpeed, and NCCL. This serves as our initial dataset. The data has undergone two rounds of filtering, both automatic and manual, retaining parts with clear solutions and logical dialogues. The second source is the program output obtained by the LLM-agent using RAG+DoT technology on several AI clusters running tasks. These tasks are executed on clusters ranging from 4 to 100 A800 AI servers. The third part consists of special modal data such as symbolic representations and formulas processed using XML according to DoT logic, all of which are unified into the text modality.

The total amount of pure text material is just over 200 items, distilled from 1.2 GB of original files. This also confirms that if the full pure-text content of these 200+ items were pre-tokenized to serve as the context for LLM inference, it would not only pose a significant challenge to the LLM's capability to handle long texts but also increase the consumption of inference resources, thereby slowing down the execution speed of the LLM-agent.

5.1.2 Benchmark's Source and Statistics

We divided the original dataset into two parts, approximately in a 20%-80% ratio. From the 80% portion, we manually compiled 150 questions to assess the LLM's capabilities in the field of cluster diagnostics. During comparative experiments, unless otherwise specified, we provide only the 20% portion of the original data to all models. During case studies and practical applications, we provide the entire original dataset to the deployed LLM-agent.

We designed three evaluation metrics. Metric A evaluates the large model's information extraction capabilities, including extracting the cluster IP addresses and SSH port numbers from conversations, as well as the ability to determine whether further execution is needed, evaluated through string matching. The challenge here is to assess the model's ability to follow instructions and extract information, since logs are derived from user conversations and may contain unnecessary commands that need to be ignored during the determination process. Metric B evaluates the large model's code generation capabilities in the diagnostic domain, including the ability to generate prescribed code based on descriptions given in conversations, control the input and output of the code, and create unseen test cases, implemented in a manner similar to HumanEval (Chen et al., 2021) but transferred to a real distributed cluster. Metric C evaluates the large model's information attribution capabilities in the diagnostic domain, including the ability to provide attribution based on users' error logs and information. This is currently implemented through multiple-choice questions.
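As a rough illustration of how Metric A can be scored through string matching, a simplified scorer is sketched below; the field names and the per-field scoring rule are our simplification, not the exact benchmark harness.

# Simplified Metric A scorer: exact string matching of extracted fields against references.
def score_metric_a(predictions, references):
    # Each item is a dict such as {"ip": "10.0.0.12", "ssh_port": "2222", "needs_execution": "yes"};
    # the keys and values here are illustrative.
    hits, total = 0, 0
    for pred, ref in zip(predictions, references):
        for field, expected in ref.items():
            total += 1
            hits += int(str(pred.get(field, "")).strip() == str(expected).strip())
    return hits / total if total else 0.0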
5.1.3 Evaluation of the Benchmark on Standard LLaMA3.1-70B

We applied this benchmark to several of the most widely used open-source LLMs, namely LLaMA3.1-70B, Nemotron-70B (Adler et al., 2024), Mistral-120B (Jiang et al., 2023), and Llama3.2-3B. The results are in Table 1. Due to the lack of relevant data and information, as well as reasoning logic such as DoT, all models were only able to complete the first task, scoring zero on the second and third tasks. Since the results of Llama3.2-3B did not meet the minimum requirements for building the LLM-agent, and the 120B model is difficult to run for inference on a single AI server, we opted for the better-performing and more widely used LLaMA3.1-70B out of the two 70B models as the basis for the subsequent SFT (Supervised Fine-Tuning) and the application of RAG, DoT, and selfplay.

Table 1. Benchmark's Results on Open-source LLMs

Model          Inference on 1 A800 GPU   Inference in one A800*8 server   Score on Metric A   Score on Metric B   Score on Metric C
Llama3.1-70B   no                        yes                              0.8658              0.0                 0.0
Nemotron-70B   no                        yes                              0.7315              0.0                 0.0
Mistral-120B   no                        no                               0.7383              0.0                 0.0
Llama3.2-3B    yes                       yes                              0.047               0.0                 0.0

Table 2. MMLU Benchmark's Results on LLaMA3.1 and Nemotron 70B

Model          SFT or not   MMLU score
Llama3.1-70B   no           0.8230
Llama3.1-70B   yes          0.8007
Nemotron-70B   no           0.8234
Nemotron-70B   yes          0.7917

5.2 LLMs' Evaluation

5.2.1 Experimental Setup

We conduct two parts of experiments to comprehensively evaluate and compare the innovative effects of our work. In the first part, we use the mature and universal MMLU (Hendrycks et al., 2020) benchmark to evaluate the comprehensive ability of the model in basic text understanding after it has been enhanced by RAG, DoT, and self-play. In the second part, through ablation and comparison experiments, combined with the focus areas of the sub-items in our proposed benchmark, we quantitatively demonstrate the advantages of our three innovations.

5.2.2 General Capability Evaluation Based on MMLU

First, we aim to substantiate why SFT is not advisable in this domain. Although the LLM that supports the agent needs to possess extensive knowledge in cluster diagnostics, performance modeling, and code writing, we discovered that when the LLM reaches a level where this knowledge can be effectively applied, it often lacks the fundamental interaction capabilities required to engage with the agent. We illustrate this point using the MMLU benchmark. We converted the knowledge repository into tokens compatible with the model and constructed an instruction dataset. We iterated through multiple training rounds until the model could respond correctly to instructions.

We then evaluated the SFT model that reached this state against the original open-source model using the MMLU (Massive Multitask Language Understanding) benchmark. The results are presented in Table 2.

From the above results, it can be seen that Supervised Fine-Tuning (SFT) leads to a decline in performance when evaluated using general assessment methods such as MMLU. Subsequently, in our proposed cluster diagnostics benchmark, we further observed adverse consequences of this performance decline in Metric C. As a result, we ultimately decided not to use the SFT approach to construct the LLM-agent.

To avoid the potential risks associated with relying solely on MMLU, we further selected three additional LLM benchmarks that are closely related to the problems we aim to solve in our domain or are entirely generalizable: the Abstraction and Reasoning Challenge (ARC) (Peter, 2022), BoolQ (Clark et al., 2019), and OpenbookQA (Mihaylov et al., 2018). The results are presented in Table 3. The results of this set of experiments support the conclusions we drew from the MMLU benchmark.

Table 3. Results of Multiple Comprehensive Benchmarks on LLMs

Model          SFT or not   ARC      ARC easy   BoolQ    OpenbookQA   MMLU
Llama3.1-70B   no           0.6246   0.8691     0.8786   0.3720       0.8230
Llama3.1-70B   yes          0.6032   0.8649     0.8862   0.3680       0.8007
Nemotron-70B   no           0.6280   0.8620     0.8780   0.3680       0.8234
Nemotron-70B   yes          0.6126   0.8653     0.8859   0.3580       0.7917
Mistral-120B   no           0.6544   0.8788     0.9012   0.3980       0.8229
Llama3.2-3B    no           0.4352   0.7428     0.7835   0.2800       0.6040

5.2.3 Results of Our Benchmark

Table 4 presents all of our experimental results. The second column of the table indicates whether there was "cheating." We define experiments that do not participate fairly in the benchmark as cheating. While this is unfair for the benchmark portion, it is clearly meaningful for our core research objective: to build an LLM-agent system that can autonomously and intelligently perform cluster diagnostics and troubleshooting. When evaluating the benchmark section, the cheating items can be considered as ground truth.

Table 4. Benchmark's Results on Open-source LLMs (baselines) and our LLM-agent. "pre-plan" denotes pre-written complete agent planning steps; "whole dataset" means the model was given the entire original dataset.

Model          "Cheating"                 Method                        Metric A   Metric B   Metric C
Llama3.1-70B   None                       None                          0.8658     0.0        0.0
Llama3.1-70B   pre-plan                   None                          0.8658     0.4615     0.6470
Llama3.1-70B   None                       SFT                           0.0        0.0        0.0
Llama3.1-70B   pre-plan                   SFT                           0.0        0.9230     0.0
Llama3.1-70B   None                       RAG                           0.8658     0.0        0.0
Llama3.1-70B   pre-plan                   RAG                           0.8658     0.4615     0.7059
Llama3.1-70B   None                       RAG + DoT + self-play         0.8466     0.6153     0.6470
Llama3.1-70B   None                       RAG + DoT + self-play + SFT   0.0        0.9230     0.0
Llama3.1-70B   whole dataset              RAG + DoT + self-play + SFT   1.0        1.0        1.0
Llama3.1-70B   pre-plan + whole dataset   RAG + DoT + self-play + SFT   1.0        1.0        1.0
Nemotron-70B   None                       None                          0.7315     0.0        0.0
Nemotron-70B   pre-plan                   None                          0.7315     0.4615     0.7059
Mistral-120B   None                       None                          0.7383     0.0        0.0
Mistral-120B   pre-plan                   None                          0.7383     0.7692     0.8235
Llama3.2-3B    None                       None                          0.047      0.0        0.0
Llama3.2-3B    pre-plan                   None                          0.047      0.2307     0.1176

These experimental results illustrate several conclusions.

First, we found that a pre-defined plan can help a naive LLM control the agent. However, this plan was specifically written based on the benchmark questions and cannot be used in a production environment. Correspondingly, all experiments utilizing DoT technology and not cheating scored well on Metrics B and C for evaluating the agent, although the scores were slightly lower than those achieved with preplanning. This indicates that our proposed knowledge processing approach based on DoT and self-play can be used to control cluster troubleshooting agents. Second, we found that SFT significantly improved the scores on Metric B, which focuses on evaluating code writing or the invocation of diagnostic tools. However, as a trade-off, all models that underwent SFT, even with preplanning, were unable to control the agent properly, resulting in poor performance on Metric C. Third, we found that the results based on LLaMA3.1-70B were not significantly different from those of Mistral-120B, which has nearly twice the number of parameters. Twice the number of parameters implies double or more the inference cost (considering multi-GPU linearity), making it impractical. On the other hand, the smaller 3B model, even with preplanning in a cheating scenario, is still unable to handle the task of controlling the agent.

We proceeded with subsequent experiments and actual deployment using the LLM-agent enhanced with the whole dataset and all of our innovative methods.

5.3 Intelligent Early Warning and Troubleshooting: A Case Study

To demonstrate the superiority of the LLM-agent system we have built in the context of intelligent cluster diagnostics, we present a concrete example to illustrate how the system operates and how it is more efficient and accurate compared to traditional methods. In the production environment of AI clusters, abnormal events or interruptions are not the most challenging problems to resolve. Clear information about anomalies or interruptions can effectively guide senior engineers in diagnosing the causes of issues. Current research is also progressively integrating technologies such as automatic restarts and automatic scheduling into the procedures for handling anomalies or interruptions in AI computing tasks. However, once an AI computing task exhibits slow performance, it becomes difficult to quickly identify the problem, and it is even harder to pinpoint the cause of the slowdown.

Assume there is an AI training cluster composed of dozens of servers, where one of the servers suddenly experiences a performance drop. This could be due to various reasons, such as increased network latency, memory leaks, high CPU load, or insufficient storage space. Traditionally, administrators or engineers would check the log files of the cluster to manually identify possible issues. This would involve reviewing logs from different nodes, monitoring system metrics, attempting to reproduce the problem, and so on. This method is time-consuming and labor-intensive and may require multiple attempts to pinpoint the root cause.

In our system, the LLM-agent automatically gathers relevant log information, performance metrics, and other necessary data from the nodes of the cluster. Leveraging the LLM-agent's capabilities assessed through the benchmark, the system extracts useful information from the collected data, such as cluster IP addresses, SSH ports, and other critical diagnostic details. Using its diagnostic capabilities in code generation and information attribution, the LLM-agent identifies the root cause of the issue based on the collected data and information. This may include generating new test cases to validate hypotheses. Once the problem is identified, the LLM-agent generates corresponding remediation scripts and requests human review. After approval, the LLM-agent executes the remediation measures in the cluster. Following the execution of remediation measures, the system collects data again to assess the outcome, forming a closed loop of data, algorithm, and hardware to optimize future diagnostic processes.

We manually constructed a scenario. This scenario would lead to slow performance in AI model training tasks and has repeatedly occurred in the development environment. We simulated an extreme heat situation with HVAC failure, throttling the frequency of one of the dozens of GPUs to approximately 200 MHz, rather than the 1410 MHz at which the A800 GPUs should operate. Observing the actual logs shows that the speed of this AI computing task decreased to approximately one-third of its normal performance. Our LLM-agent system initially flagged the slow AI task through power consumption monitoring and performance modeling results, triggering an automatic alert. Following this, through three rounds of self-play, it recommended checking the GPU core frequencies, a suggestion that the agent then dispatched for execution across all GPUs. Based on the execution results, the LLM accurately pinpointed the GPU with the low core frequency that we had specifically altered. The entire troubleshooting process took less than 10 minutes. In contrast, a senior operations engineer would typically need about one hour to correctly identify the problem and then use a pre-written automated detection software tool created by engineers to determine the specific GPU with the low-frequency fault. More importantly, our LLM-agent can identify the fault before algorithm engineers or operations engineers detect the slow-down phenomenon and automatically complete the repair. This resolves the issue before the fault is even noticed, thereby enhancing the overall availability of the cluster.
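A sketch of the kind of cluster-wide check the agent dispatched in this case is shown below; the host list, SSH invocation, and threshold are illustrative assumptions, while the nvidia-smi query fields are standard ones rather than our exact tool.

# Sketch of a cluster-wide GPU clock check like the one dispatched in this case study.
# 'clocks.sm' reports the current SM clock in MHz; hosts and thresholds are illustrative.
import subprocess

def sm_clocks_on_host(host):
    out = subprocess.run(
        ["ssh", host, "nvidia-smi", "--query-gpu=index,clocks.sm", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [(int(i), int(mhz)) for i, mhz in (line.split(",") for line in out.strip().splitlines())]

def find_throttled_gpus(hosts, expected_mhz=1410, tolerance=0.2):
    throttled = []
    for host in hosts:
        for index, mhz in sm_clocks_on_host(host):
            if mhz < expected_mhz * (1.0 - tolerance):   # e.g., the ~200 MHz GPU in the constructed scenario
                throttled.append((host, index, mhz))
    return throttled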

5.4 Qualitative Analysis of Correctness, Safety, and Reliability

Based on the existing research, which is not yet fully mature, and in the context of this specific field of study, we provide reasonable definitions for correctness, safety, and reliability. In this study, we define correctness as whether the process and results of the LLM-agent executing tasks are correct. Compared to evaluating the output of the LLM, assessing the correctness of the LLM-agent's actions is more challenging. An apparently incorrect operation process may produce the correct result, whereas seemingly perfect output at the textual level might lead to an erroneous result when executed. Since we focus on the field of cluster diagnostics, with the actual output being the execution of procedures by the agent, we do not investigate the potential harmfulness or bias in the textual content generated by the LLM. Instead, we examine the ability of our LLM-agent to avoid performing harmful operations on the cluster when the information fed back to the agent changes, or even when malicious content is inserted by an attacker, such as deleting files, shutting down, overclocking, or modifying critical system configurations. Regarding reliability, we define it as the overall quality of fault handling by the LLM-agent compared to that of human engineers or expert human engineers. In addition to whether the attribution is correct, we also consider factors such as the time taken to complete typical fault handling, the resources consumed, and the ability to communicate with non-experts.

We incorporate the assessment of correctness into the benchmark evaluation. For the potential risks associated with the LLM-agent, we implement a whitelist plus human review approach. Initially, we ensure the safety of the existing toolkit, followed by creating a whitelist for the program interfaces included in the toolkit and conducting human reviews for the LLM-agent's requests to execute self-authored code. Finally, we observed that the LLM-agent can attribute faults with an average of fewer than three test cases across multiple rounds of self-play, which is more efficient than the twelve cases typically required by human experts. However, regarding communication abilities, the LLM-agent does not yet possess such capabilities. The qualitative analysis described above is mainly aimed at reducing the probability of harmful incidents. Quantitative analysis or a comprehensive model still necessitates further advancements in the field of AI safety.

6 CONCLUSION AND DISCUSSION

6.1 Work Summary and Further Plan

Based on our experience and research in the fields of cluster diagnostics, LLM enhancement, and LLM-agent construction, we innovatively proposed a system solution utilizing LLM-agents to autonomously and intelligently perform cluster troubleshooting. In terms of LLM algorithms, we introduced a benchmark consisting of 150 manually crafted advanced problems, demonstrating the performance differences between our constructed LLM-agent and the original open-source LLMs under fair data conditions. In the realm of LLM-agent construction, we innovatively proposed integrating DoT reasoning and the ability to handle mathematics, special symbols, and formulas into the agent, enabling the LLM to operate machines at the software level and receive feedback. Ultimately, we applied our innovative achievements to cluster diagnostics, exploring the potential in this field, and were pleasantly surprised to find that LLM-agent systems, despite being in their extremely early stages, are already capable of handling repetitive and low-end tasks, thus freeing industry practitioners to tackle more challenging and valuable problems.

In the future, we will continue our work in four aspects. In terms of LLM algorithms, we will expand and upgrade the existing benchmark and build a more comprehensive and valuable metrics system. In the agent field, we will further unlock the potential of DoT and make self-written code by the LLM gradually become the main execution body, reducing reliance on preset tools. At the system application level, we will form a closed loop of data, algorithm, and hardware, enriching the database with results from actual deployments. Finally, in terms of safety and reliability, we will continue to work with researchers in related fields to ensure and evaluate the safety and reliability of the agents.

6.2 Shortcomings and Limitations

Our research still has shortcomings and limitations. In terms of shortcomings, our agent currently relies on a mechanism of human review to ensure safety, depends on pre-written tools for code, and relies on data sourced from GitHub as a starting point. An ideal LLM-agent system should form a self-sustained relationship with the AI cluster, maintaining and evolving itself.

In terms of limitations, our work depends on the LLM within the LLM-agent, but smaller models like Llama3.2-3B currently cannot support the capabilities of the agent. Therefore, our work can only be applied to data centers or large-scale distributed clusters and cannot be deployed in edge computing or personal computer scenarios. We need to continuously monitor the development of smaller models and explore the possibility of teaching the capabilities of the LLM-agent to smaller models in the form of DoT when appropriate.

REFERENCES

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Adler, B., Agarwal, N., Aithal, A., Anh, D. H., Bhattacharya, P., Brundyn, A., Casper, J., Catanzaro, B., Clay, S., Cohen, J., et al. Nemotron-4 340B technical report. arXiv preprint arXiv:2406.11704, 2024.

Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 17682–17690, 2024.

Castelfranchi, C. Modelling social action for AI agents. Artificial Intelligence, 103(1-2):157–182, 1998.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Guo, C., Wu, H., Deng, Z., Soni, G., Ye, J., Padhye, J., and Lipshteyn, M. RDMA over commodity Ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference, pp. 202–215, 2016.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Jiang, Z., Lin, H., Zhong, Y., Huang, Q., Chen, Y., Zhang, Z., Peng, Y., Li, X., Xie, C., Nong, S., et al. MegaScale: Scaling large language model training to more than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pp. 745–760, 2024.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12, 2017.

Jung, H. and Chung, K. Social mining-based clustering process for big-data integration. Journal of Ambient Intelligence and Humanized Computing, 12(1):589–600, 2021.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

Liu, Y., Tao, S., Zhao, X., Zhu, M., Ma, W., Zhu, J., Su, C., Hou, Y., Zhang, M., Zhang, M., et al. CoachLM: Automatic instruction revisions improve the data quality in LLM instruction tuning. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 5184–5197. IEEE, 2024.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Ofenbeck, G., Steinmann, R., Caparros, V., Spampinato, D. G., and Püschel, M. Applying the roofline model. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 76–85. IEEE, 2014.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Peter, E. Abstraction and reasoning challenge. 2022.

Qian, K., Xi, Y., Cao, J., Gao, J., Xu, Y., Guan, Y., Fu, B., Shi, X., Zhu, F., Miao, R., et al. Alibaba HPN: A data center network for large language model training. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 691–706, 2024.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Shanley, T. InfiniBand Network Architecture. Addison-Wesley Professional, 2003.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wang, Q., Sang, B., Zhang, H., Tang, M., and Zhang, K. DLRover: An elastic deep training extension with auto job resource recommendation. arXiv preprint arXiv:2304.01468, 2023.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.

Xiong, Y., Jiang, Y., Yang, Z., Qu, L., Zhao, G., Liu, S., Zhong, D., Pinzur, B., Zhang, J., Wang, Y., et al. SuperBench: Improving cloud AI infrastructure reliability with proactive validation. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pp. 835–850, 2024.

Xu, Y., Chen, Y., Zhang, X., Lin, X., Hu, P., Ma, Y., Lu, S., Du, W., Mao, Z., Zhai, E., et al. CloudEval-YAML: A practical benchmark for cloud configuration generation. Proceedings of Machine Learning and Systems, 6:173–195, 2024.

Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.

Zhang, Y., Yuan, Y., and Yao, A. C.-C. On the diagram of thought. arXiv preprint arXiv:2409.10038, 2024.
