Enhancing Cluster Resilience: LLM-agent Based Autonomous Intelligent Cluster Diagnosis System and Evaluation Framework
Honghao Shi 1 Longkai Cheng 1 Wenli Wu 1 Yuhang Wang 1 Xuan Liu 1 Shaokai Nie 1 Weixv Wang 1
Xuebin Min 1 Chunlei Men 1 Yonghua Lin 1
ABSTRACT
Recent advancements in Large Language Models (LLMs) and related technologies such as Retrieval-Augmented
Generation (RAG) and Diagram of Thought (DoT) have enabled the creation of autonomous intelligent systems
capable of performing cluster diagnostics and troubleshooting. By integrating these technologies with self-play
methodologies, we have developed an LLM-agent system designed to autonomously diagnose and resolve issues
within AI clusters. Our innovations include a knowledge base tailored for cluster diagnostics, enhanced LLM
algorithms, practical deployment strategies for agents, and a benchmark specifically designed for evaluating LLM
capabilities in this domain. Through extensive experimentation across multiple dimensions, we have demonstrated
the superiority of our system in addressing the challenges faced in cluster diagnostics, particularly in detecting
and rectifying performance issues more efficiently and accurately than traditional methods.
3 SPECIAL TERMINOLOGIES
AI computing tasks: programs or processes designed to achieve intelligence, such as training large language models, inference with large language models, world model inference, and LLM-agent inference.

AI chips: processors suitable for or dedicated to performing AI computing tasks, such as NVIDIA GPUs, Intel Gaudi AI accelerators, and Google TPUs (Jouppi et al., 2017).
AI servers: computers equipped with AI chips that
are suitable for or specifically designed to perform AI
computing tasks, such as the NVIDIA DGX H100. AI
servers often have requirements beyond those of classic
servers in terms of stability, availability, cooling, and power
consumption.
AI cluster: a distributed server cluster composed of two or more AI servers set up to accomplish a single target task, such as Meta's cluster containing 16 thousand GPUs. Additionally, AI servers typically require RDMA or higher-bandwidth interconnect protocols, such as InfiniBand RDMA (Shanley, 2003) and RDMA over Converged Ethernet (RoCE) (Guo et al., 2016), and do not usually adopt classic Ethernet protocols.

Figure 1. Overview of the Intelligent Maintenance System Based on LLM-Agents
Cluster diagnosis: ensuring that AI computing tasks can run with normal performance on the AI cluster, promptly detecting task failures, identifying the points of failure, clarifying the reasons for failure, repairing the corresponding faults, and ensuring the overall availability of the AI cluster.

4 METHODS

4.1 Overview

We incorporate advanced techniques from the field of LLM alignment and enhancement to develop a solution for building a cluster intelligent maintenance system based on LLM-agents. Figure 1 illustrates the overall process of this solution.

The upper part of the figure represents the core component of the solution: the LLM-agent. The LLM-agent consists of an agent program and an LLM. The LLM interprets the input information provided by the agent as external stimuli and task instructions, and responds appropriately. The agent then directly writes code or calls specific software interfaces based on the feedback from the LLM, thereby operating the cluster. For the LLM itself, there are two main challenges: first, how does the LLM acquire domain-specific knowledge of cluster diagnostics, and where does this knowledge come from; second, how can the LLM reason and plan? For the entire LLM-agent, ensuring that the LLM's inputs and outputs match the actual operations performed by the agent controlling the cluster is another crucial aspect that needs to be addressed.

To solve the above problems, we introduce three innovations. First, we use 250 cluster failure records collected from GitHub as a starting point, and treat the cluster operation failure logs actually managed by the LLM-agent as a continuous source of data. We utilize RAG (Lewis et al., 2020) to enable the LLM to capture detailed knowledge corresponding to specific terms within the context; a sketch of this retrieval step is given below. Figure 1 depicts the "alert", "compute cluster", and "storage" sections, along with their communication with the LLM-agent, which outlines this process. Second, we use DoT (Zhang et al., 2024) to enable the model to effectively handle non-natural-language information such as symbols, formulas, and code. Similar to vision-text multimodal models, we leverage textual elements that go beyond the inherent meaning of natural language based on DoT. The "planning algorithm" section at the top of Figure 1 illustrates this innovation. Third, we use self-play technology (Snell et al., 2024) to enable the LLM to autonomously and intelligently divide long tasks or challenging reasoning objectives into multiple steps, self-assess the output of each step, and ultimately achieve the goal.
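As an illustration of the first innovation, the following is a minimal sketch of retrieval-augmented prompting over the failure-record knowledge base. The embedding model name, the record file, and the prompt wording are hypothetical placeholders, not the exact components of our system.

```python
# Minimal RAG sketch: retrieve the most similar failure records and prepend
# them to the prompt (model name and file layout are illustrative only).
import json
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder

# Each record: {"symptom": "...", "solution": "..."} collected from GitHub issues.
with open("failure_records.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

corpus = [r["symptom"] + "\n" + r["solution"] for r in records]
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(error_log: str, k: int = 3) -> list[str]:
    """Return the k failure records most similar to the incoming error log."""
    query_emb = embedder.encode([error_log], normalize_embeddings=True)
    scores = corpus_emb @ query_emb[0]      # cosine similarity (vectors normalized)
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(error_log: str) -> str:
    """Prepend retrieved domain knowledge so the LLM sees cluster-specific terms."""
    context = "\n---\n".join(retrieve(error_log))
    return (f"Known failure records:\n{context}\n\n"
            f"New error log:\n{error_log}\n"
            "Identify the likely fault and propose a repair step.")
```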
The lower part of Figure 1 forms the basis of our work. It includes a mature operations alarm troubleshooting and repair process, as well as several mature or advanced software tools. Based on related works, we have developed a unified, multi-level, multi-dimensional cluster diagnostic toolkit, shown in Figure 2. This tool diagnoses the health status of the cluster not merely at the level of single components or input/output ports, but in a broader sense.
XML tag pairs are used to separate different types of text, including plain text, special symbols, code, formulas, and inference rules; a sketch of this tagging scheme appears below. Thanks to the RoPE positional encoding adopted by Llama3.1, the model can accurately capture the content within XML pairs. Based on the reasoning graph, our experiments confirmed that this application allows the LLM to correctly reason according to specific rules, achieving the capability to support the agent in completing cluster fault attribution and repair tasks. This significantly exceeds the capabilities of pre-trained or aligned LLMs.
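To make the tagging concrete, here is a minimal sketch of how heterogeneous content might be wrapped in XML pairs before being handed to the LLM. The tag names and the record contents are illustrative assumptions, not the exact schema of our system.

```python
# Sketch: wrap mixed-modality content in XML pairs so the LLM can distinguish
# plain text, code, formulas, and inference rules (tag names are illustrative).
from xml.sax.saxutils import escape

def wrap(tag: str, content: str) -> str:
    """Enclose one content segment in an XML pair, escaping special characters."""
    return f"<{tag}>{escape(content)}</{tag}>"

segments = [
    ("text",    "GPU 3 on node a800-07 reports reduced throughput."),
    ("code",    "nvidia-smi --query-gpu=clocks.sm --format=csv"),
    ("formula", "speedup = t_baseline / t_observed"),
    ("rule",    "IF clocks.sm < rated_clock THEN suspect thermal throttling"),
]

prompt = "\n".join(wrap(tag, body) for tag, body in segments)
print(prompt)
```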
4.3.3 Using Self-play Techniques to Construct a Domain-specific Multimodal Agent

With the help of RAG and DoT, the LLM can utilize information from outside the training set as well as abstract symbolic reasoning information. However, this still has limitations for an agent designed for intelligent cluster diagnostics. We therefore permit the LLM to generate content over a longer duration: the quality of solutions to challenging problems can be enhanced through multiple rounds of planned self-play or spontaneous self-questioning and answering by the agent.

Spontaneous self-questioning and answering is applied in DoT reasoning. For the planned self-play process, we transform the complex problem of cluster fault attribution into a three-round process (sketched below). In the first round, the agent, based on error logs passed from the cluster, prompts the LLM to identify potential keywords from the error items and corresponding solutions from the knowledge base, performing information extraction and RAG. In the second round, the LLM evaluates its own answers, making corrections or accepting them directly, then proceeds to write or call appropriate tools for the agent to execute. In the final round, the LLM makes an accurate attribution judgment based on the results of the agent's interaction with the actual cluster. Compared to existing self-play work focused on the text side, we integrate it with the agent, granting it the permissions to operate machines and interact with the environment, fully simulating the capabilities of a human engineer to solve problems.
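The three-round process can be summarized in code. The `llm`, `cluster`, and `kb` interfaces are hypothetical stand-ins for the model endpoint, the agent's tool layer, and the knowledge base; the control flow, not the API, is the point.

```python
# Sketch of the three-round planned self-play loop (all interfaces are
# hypothetical stand-ins, not our actual agent API).

def diagnose(error_log: str, llm, cluster, kb) -> str:
    """Three-round planned self-play for cluster fault attribution (sketch)."""
    # Round 1: extract keywords, then retrieve candidate causes via RAG.
    keywords = llm.ask(f"Extract fault-related keywords from:\n{error_log}")
    candidates = kb.retrieve(keywords)

    # Round 2: self-evaluate, then write/choose one diagnostic command.
    command = llm.ask(
        f"Keywords: {keywords}\nCandidate causes: {candidates}\n"
        "Correct or accept the above, then output exactly one diagnostic "
        "command for the agent to run."
    )
    observation = cluster.execute(command)  # the agent operates real machines

    # Round 3: final attribution grounded in the observed cluster state.
    return llm.ask(
        f"Error log:\n{error_log}\nObservation:\n{observation}\n"
        "State the root cause and propose a repair."
    )
```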
5 EXPERIMENTS

We conducted a three-phase experiment to demonstrate the advanced nature of the proposed LLM-agent in the field of cluster intelligent diagnostics. The first phase involves creating a dataset and benchmark for the field of cluster intelligent diagnostics. First, we define the statistical characteristics of the external data knowledge base and introduce the process of generating an evaluation benchmark from this knowledge base. Next, we describe the features of this benchmark and explain its advanced nature in the field of cluster intelligent diagnostics. Throughout this process, we emphasize fairness and impartiality, strictly distinguishing between the parts the model can perceive and the scoring portions of the evaluation. We further elaborate on the benchmark using the results of the mainstream open-source model Llama3.1-70B.

The second phase involves evaluating the three innovations we proposed (RAG, DoT, and self-play) using the aforementioned benchmark for comparative assessment. The experiments in the second phase are aimed at demonstrating the advanced nature of our proposed methods in the field of cluster intelligent diagnostics.

In the third phase, we expose the LLM-agent to both the training and testing sets in the benchmark, allowing it to operate in its most complete form to address real-world problems encountered in production environments. We demonstrate the accuracy, efficiency, and autonomous intelligence of this solution through two typical cases. Specifically, we found that this solution can provide early warnings for AI clusters, further enhancing the availability of the clusters.

Finally, we conduct a qualitative analysis and discussion on the topics of correctness, safety, and reliability, which are at the forefront of the LLM and LLM-agent fields and have yet to be conclusively resolved, to demonstrate the series of work we have undertaken in these areas.

5.1 Statistics and Evaluation for Dataset and Benchmark

5.1.1 Data's Source

The materials provided to the LLM come from three sources. The first source is automatically collected Q&A data from relevant GitHub communities involved in AI cluster troubleshooting, such as the issue sections of repositories like Megatron, PAI, DeepSpeed, and NCCL. This serves as our initial dataset. The data has undergone two rounds of filtering, both automatic and manual, retaining parts with clear solutions and logical dialogues. The second source is the program output obtained by the LLM-agent using RAG+DoT technology on several AI clusters running tasks. These tasks are executed on clusters ranging from 4 to 100 A800 AI servers. The third part consists of special modal data such as symbolic representations and formulas processed using XML according to DoT logic, all of which are unified into the text modality.

The total amount of pure-text material is 200+ items, distilled from 1.2 GB of original files. This also confirms that if the 200+ items of pure-text content were fully pre-tokenized to serve as the context for LLM inference, it would not only pose a significant challenge to the LLM's capability to handle long texts but also increase the inference cost.
5.1.2 Benchmark's Source and Statistics

We divided the original dataset into two parts, approximately in a 20%-80% ratio. From the 80% portion, we manually compiled 150 questions to assess the LLM's capabilities in the field of cluster diagnostics. During comparative experiments, unless otherwise specified, we provide only the 20% portion of the original data to all models. During case studies and practical applications, we provide the entire original dataset to the deployed LLM-agent.

We designed three evaluation metrics. Metric A evaluates the large model's information extraction capabilities, including extracting the cluster IP addresses and SSH port numbers from conversations, as well as the ability to determine whether further execution is needed, evaluated through string matching; a scoring sketch follows this paragraph. The challenge here is to assess the model's ability to follow instructions and extract information, since logs are derived from user conversations and may contain unnecessary commands that need to be ignored during the determination process. Metric B evaluates the large model's code generation capabilities in the diagnostic domain, including the ability to generate prescribed code based on descriptions given in conversations, control the input and output of the code, and create unseen test cases, implemented in a manner similar to HumanEval (Chen et al., 2021) but transferred to a real distributed cluster. Metric C evaluates the large model's information attribution capabilities in the diagnostic domain, including the ability to provide attribution based on users' error logs and information; this is currently implemented through multiple-choice questions.
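As an illustration of Metric A's string-matching evaluation, here is a minimal scoring sketch. The field names and the exact-match rule are assumptions for illustration; the real benchmark's format may differ.

```python
# Sketch of Metric A scoring: exact string match on extracted fields
# (field names and match rule are illustrative assumptions).

def score_metric_a(predictions: list[dict], references: list[dict]) -> float:
    """Fraction of samples where every required field matches exactly."""
    fields = ("cluster_ip", "ssh_port", "needs_execution")
    hits = 0
    for pred, ref in zip(predictions, references):
        if all(str(pred.get(f, "")).strip() == str(ref[f]).strip() for f in fields):
            hits += 1
    return hits / len(references)

# Example: one correct and one incorrect extraction -> 0.5
refs  = [{"cluster_ip": "10.0.0.7", "ssh_port": "22",   "needs_execution": "yes"},
         {"cluster_ip": "10.0.1.9", "ssh_port": "2222", "needs_execution": "no"}]
preds = [{"cluster_ip": "10.0.0.7", "ssh_port": "22",   "needs_execution": "yes"},
         {"cluster_ip": "10.0.1.9", "ssh_port": "22",   "needs_execution": "no"}]
print(score_metric_a(preds, refs))  # 0.5
```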
5.1.3 Evaluation of Benchmark on Standard Llama3.1-70B

We applied this benchmark to several of the most widely used open-source LLMs, namely Llama3.1-70B, Nemotron-70B (Adler et al., 2024), Mistral-120B (Jiang et al., 2023), and Llama3.2-3B.

The results are shown in Table 1. Due to the lack of relevant data and information, as well as reasoning logic such as DoT, all models were only able to complete the first task, scoring zero on the second and third tasks. Since the results of Llama3.2-3B did not meet the minimum requirements for building the LLM-agent, and the 120B model is difficult to infer on a single AI server, we opted for the better-performing and more widely used Llama3.1-70B of the two 70B models as the basis for subsequent SFT (Supervised Fine-Tuning) and the application of RAG, DoT, and self-play.

Table 1. Benchmark results on standard open-source LLMs

Model | Inference on 1 A800 GPU | Inference in 1 A800*8 server | Score on Metric A | Score on Metric B | Score on Metric C
Llama3.1-70B | no | yes | 0.8658 | 0.0 | 0.0
Nemotron-70B | no | yes | 0.7315 | 0.0 | 0.0
Mistral-120B | no | no | 0.7383 | 0.0 | 0.0
Llama3.2-3B | yes | yes | 0.047 | 0.0 | 0.0

Table 2. MMLU benchmark results on Llama3.1-70B and Nemotron-70B

Model | SFT or not | MMLU score
Llama3.1-70B | no | 0.8230
Llama3.1-70B | yes | 0.8007
Nemotron-70B | no | 0.8234
Nemotron-70B | yes | 0.7917

5.2 LLMs' Evaluation

5.2.1 Experimental Setup

We conduct two parts of experiments to comprehensively evaluate and compare the innovative effects of our work. In the first part, we use the mature and universal MMLU (Hendrycks et al., 2020) benchmark to evaluate the comprehensive ability of the model in basic text understanding after it has been enhanced by RAG, DoT, and self-play. In the second part, through ablation and comparison experiments, combined with the focus areas of the sub-items in our proposed benchmark, we quantitatively demonstrate the advantages of our three innovations.

5.2.2 General Capability Evaluation Based on MMLU

Firstly, we aim to substantiate why SFT is not advisable in this domain. Although the LLM that supports the agent needs to possess extensive knowledge in cluster diagnostics, performance modeling, and code writing, we discovered that when the LLM reaches a level where this knowledge can be effectively applied, it often lacks the fundamental interaction capabilities required to engage with the agent. We illustrate this point using the MMLU benchmark.

We converted the knowledge repository into tokens compatible with the model and constructed an instruction dataset. We iterated through multiple training rounds until the model could respond correctly to instructions. We then
evaluated the SFT model that reached this state against the original open-source model using the Massive Multitask Language Understanding (MMLU) benchmark. The results are presented in Table 2.

From the above results, it can be seen that Supervised Fine-Tuning (SFT) leads to a decline in performance when evaluated using general assessment methods such as MMLU. Subsequently, in our proposed cluster diagnostics benchmark, we further observed adverse consequences of this performance decline in Metric C. As a result, we ultimately decided not to use the SFT approach to construct the LLM-agent.

To avoid the potential risks associated with relying solely on MMLU, we further selected three additional LLM benchmarks that are closely related to the problems we aim to solve in our domain or are entirely generalizable: the Abstraction and Reasoning Challenge (ARC) (Peter, 2022), BoolQ (Clark et al., 2019), and OpenbookQA (Mihaylov et al., 2018). The results are presented in Table 3. The results of this set of experiments support the conclusions we drew from the MMLU benchmark.

Table 3. Results on multiple comprehensive benchmarks for LLMs

Model | SFT or not | ARC | ARC easy | BoolQ | OpenbookQA | MMLU
Llama3.1-70B | no | 0.6246 | 0.8691 | 0.8786 | 0.3720 | 0.8230
Llama3.1-70B | yes | 0.6032 | 0.8649 | 0.8862 | 0.3680 | 0.8007
Nemotron-70B | no | 0.6280 | 0.8620 | 0.8780 | 0.3680 | 0.8234
Nemotron-70B | yes | 0.6126 | 0.8653 | 0.8859 | 0.3580 | 0.7917
Mistral-120B | no | 0.6544 | 0.8788 | 0.9012 | 0.3980 | 0.8229
Llama3.2-3B | no | 0.4352 | 0.7428 | 0.7835 | 0.2800 | 0.6040

5.2.3 Results of Our Benchmark

Table 4 presents all of our experimental results. The second column of the table indicates whether there was "cheating". We define experiments that do not participate fairly in the benchmark as cheating. While this is unfair for the benchmark portion, it is clearly meaningful for our core research objective: to build an LLM-agent system that can autonomously and intelligently perform cluster diagnostics and troubleshooting. When evaluating the benchmark section, the cheating items can be considered as ground truth.

Table 4. Benchmark results on open-source LLMs (baselines) and our LLM-agent

Model | "Cheating" | Method | Score on Metric A | Score on Metric B | Score on Metric C
Llama3.1-70B | none | none | 0.8658 | 0.0 | 0.0
Llama3.1-70B | pre-written complete agent planning steps (pre-plan) | none | 0.8658 | 0.4615 | 0.6470
Llama3.1-70B | none | SFT | 0.0 | 0.0 | 0.0
Llama3.1-70B | pre-plan | SFT | 0.0 | 0.9230 | 0.0
Llama3.1-70B | none | RAG | 0.8658 | 0.0 | 0.0
Llama3.1-70B | pre-plan | RAG | 0.8658 | 0.4615 | 0.7059
Llama3.1-70B | none | RAG + DoT + self-play | 0.8466 | 0.6153 | 0.6470
Llama3.1-70B | none | RAG + DoT + self-play + SFT | 0.0 | 0.9230 | 0.0
Llama3.1-70B | whole dataset | RAG + DoT + self-play + SFT | 1.0 | 1.0 | 1.0
Llama3.1-70B | pre-plan + whole dataset | RAG + DoT + self-play + SFT | 1.0 | 1.0 | 1.0
Nemotron-70B | none | none | 0.7315 | 0.0 | 0.0
Nemotron-70B | pre-plan | none | 0.7315 | 0.4615 | 0.7059
Mistral-120B | none | none | 0.7383 | 0.0 | 0.0
Mistral-120B | pre-plan | none | 0.7383 | 0.7692 | 0.8235
Llama3.2-3B | none | none | 0.047 | 0.0 | 0.0
Llama3.2-3B | pre-plan | none | 0.047 | 0.2307 | 0.1176

These experimental results illustrate several conclusions. First, we found that a pre-defined plan can help a naive LLM control the agent. However, this plan was specifically written based on the benchmark questions and cannot be
used in a production environment. Correspondingly, all experiments utilizing DoT technology without cheating scored well on Metrics B and C for evaluating the agent, although the scores were slightly lower than those achieved with pre-planning. This indicates that our proposed knowledge processing approach based on DoT and self-play can be used to control cluster troubleshooting agents. Second, we found that SFT significantly improved the scores on Metric B, which focuses on evaluating code writing and the invocation of diagnostic tools. However, as a trade-off, all models that underwent SFT, even with pre-planning, were unable to control the agent properly, resulting in poor performance on Metric C. Third, we found that the results based on Llama3.1-70B were not significantly different from those of Mistral-120B, which has nearly twice the number of parameters. Twice the number of parameters implies double or more inference cost (considering multi-GPU linearity), making it impractical. On the other hand, the smaller 3B model, even with pre-planning in a cheating scenario, is still unable to handle the task of controlling the agent.

We proceeded with subsequent experiments and actual deployment using the LLM-agent enhanced with the whole dataset and all of our innovative methods.

5.3 Intelligent Early Warning and Troubleshooting: A Case Study

To demonstrate the superiority of the LLM-agent system we have built in the context of intelligent cluster diagnostics, we present a concrete example that illustrates how the system operates and how it is more efficient and accurate than traditional methods. In the production environment of AI clusters, abnormal events or interruptions are not the most challenging problems to resolve. Clear information about anomalies or interruptions can effectively guide senior engineers in diagnosing the causes of issues, and current research is also progressively integrating technologies such as automatic restarts and automatic scheduling into the procedures for handling anomalies or interruptions in AI computing tasks. However, once an AI computing task exhibits slow performance, it becomes difficult to quickly identify the problem, and it is even harder to pinpoint the cause of the slowdown.

Assume there is an AI training cluster composed of dozens of servers, where one of the servers suddenly experiences a performance drop. This could be due to various reasons, such as increased network latency, memory leaks, high CPU load, or insufficient storage space. Traditionally, administrators or engineers would check the log files of the cluster to manually identify possible issues. This would involve reviewing logs from different nodes, monitoring system metrics, attempting to reproduce the problem, and so on. This method is time-consuming and labor-intensive and may require multiple attempts to pinpoint the root cause. In our system, the LLM-agent automatically gathers relevant log information, performance metrics, and other necessary data from the nodes of the cluster. Leveraging the LLM-agent's capabilities assessed through the benchmark, the system extracts useful information from the collected data, such as cluster IP addresses, SSH ports, and other critical diagnostic details. Using its diagnostic capabilities in code generation and information attribution, the LLM-agent identifies the root cause of the issue based on the collected data and information; this may include generating new test cases to validate hypotheses. Once the problem is identified, the LLM-agent generates corresponding remediation scripts and requests human review. After approval, the LLM-agent executes the remediation measures in the cluster. Following the execution of remediation measures, the system collects data again to assess the outcome, forming a closed loop of data, algorithm, and hardware to optimize future diagnostic processes. A sketch of this loop follows.
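In the sketch below, the `cluster`, `llm`, and `reviewer` interfaces are hypothetical placeholders standing in for the monitoring layer, the model endpoint, and the human-review step; only the control flow mirrors the loop described above.

```python
# Sketch of the diagnose-remediate-verify closed loop (all interfaces are
# hypothetical placeholders; only the control flow is the point).

def closed_loop(cluster, llm, reviewer, max_rounds: int = 3) -> None:
    for _ in range(max_rounds):
        # 1. Gather logs and performance metrics from every node.
        evidence = cluster.gather_logs_and_metrics()

        # 2. Extract key facts (IPs, SSH ports, error items) and attribute.
        diagnosis = llm.ask(f"Extract key facts and attribute the fault:\n{evidence}")

        # 3. Generate a remediation script; a human must approve it first.
        script = llm.ask(f"Write a remediation script for:\n{diagnosis}")
        if not reviewer.approve(script):
            continue  # rejected: try another hypothesis in the next round
        cluster.execute(script)

        # 4. Re-collect data to verify the fix; stop once the cluster is healthy.
        if cluster.healthy_after_recheck():
            break
```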
We manually constructed a scenario that leads to slow performance in AI model training tasks and that has repeatedly occurred in the development environment. We simulated an extreme-heat situation with HVAC failure, throttling the frequency of one of the dozens of GPUs to approximately 200 MHz, rather than the 1410 MHz at which the A800 GPUs should operate. Observing the actual logs shows that the speed of this AI computing task decreased to approximately one-third of its normal performance. Our LLM-agent system initially flagged the slow AI task through power consumption monitoring and performance modeling results, triggering an automatic alert. Following this, through three rounds of self-play, it recommended checking the GPU core frequencies, a suggestion that the agent then dispatched for execution across all GPUs; a sketch of such a check is given below. Based on the execution results, the LLM accurately pinpointed the GPU with the low core frequency that we had specifically altered. The entire troubleshooting process took less than 10 minutes. In contrast, a senior operations engineer would typically need about one hour to correctly identify the problem and would then use a pre-written automated detection tool created by engineers to determine the specific GPU with the low-frequency fault. More importantly, our LLM-agent can identify the fault before algorithm engineers or operations engineers notice the slowdown, and it automatically completes the repair. This resolves the issue before an outright failure occurs, thereby enhancing the overall availability of the cluster.
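As an illustration of the dispatched frequency check, the following sketch queries the SM clock of every GPU on the local node via nvidia-smi and flags cards far below the A800's rated 1410 MHz. The 50% threshold and the single-node scope are illustrative choices, not our production tool; the agent dispatches an equivalent check across all nodes.

```python
# Sketch: flag GPUs whose SM clock is far below the A800 rated 1410 MHz
# (threshold and single-node scope are illustrative assumptions).
import subprocess

RATED_MHZ = 1410  # A800 SM clock under normal operation

def throttled_gpus(threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return (gpu_index, clock_mhz) for GPUs below threshold * rated clock."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,clocks.sm",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    suspects = []
    for line in out.strip().splitlines():
        idx, mhz = (int(x) for x in line.split(","))
        if mhz < threshold * RATED_MHZ:
            suspects.append((idx, mhz))
    return suspects

if __name__ == "__main__":
    for idx, mhz in throttled_gpus():
        print(f"GPU {idx}: SM clock {mhz} MHz, suspected throttling")
```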
5.4 Qualitative Analysis of Correctness, Safety, and Reliability

Based on the existing research, which is not yet fully mature, and in the context of this specific field of study, we provide reasonable definitions for correctness, safety, and reliability. In this study, we define correctness as whether the process and results of the LLM-agent executing tasks are correct. Compared to evaluating the output of the LLM, assessing the correctness of the LLM-agent's actions is more challenging: an apparently incorrect operation process may produce the correct result, whereas seemingly perfect output at the textual level might lead to an erroneous result when executed. Since we focus on the field of cluster diagnostics, where the actual output is the execution of procedures by the agent, we do not investigate the potential harmfulness or bias in the textual content generated by the LLM. Instead, we examine the ability of our LLM-agent to avoid performing harmful operations on the cluster when the information fed back to the agent changes, or even when malicious content is inserted by an attacker, such as deleting files, shutting down, overclocking, or modifying critical system configurations. Regarding reliability, we define it as the overall quality of fault handling by the LLM-agent compared to human engineers or expert human engineers. In addition to whether the attribution is correct, we also consider factors such as the time taken to complete typical fault handling, the resources consumed, and the ability to communicate with non-experts.

We incorporate the assessment of correctness into the benchmark evaluation. For the potential risks associated with the LLM-agent, we implement a whitelist-plus-human-review approach: initially, we ensure the safety of the existing toolkit; we then create a whitelist for the program interfaces included in the toolkit and conduct human reviews of the LLM-agent's requests to execute self-authored code (a sketch is given at the end of this subsection). Finally, we observed that the LLM-agent can attribute faults with an average of fewer than three test cases across multiple rounds of self-play, which is more efficient than the twelve cases typically required by human experts. However, the LLM-agent currently does not possess the ability to communicate with non-experts. The qualitative analysis described above is mainly aimed at reducing the probability of harmful incidents; quantitative analysis or a comprehensive model still necessitates further advancements in the field of AI safety.
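To make the whitelist-plus-review gate concrete, here is a minimal sketch. The whitelisted interface names and the review hook are illustrative assumptions, not our actual toolkit.

```python
# Sketch of the whitelist + human-review gate for agent actions
# (interface names and the review hook are illustrative assumptions).

WHITELISTED_TOOLS = {
    "collect_logs", "query_gpu_clocks", "run_nccl_bandwidth_test",
}

def gated_execute(action: str, payload: str, cluster, reviewer) -> str:
    """Run whitelisted toolkit calls directly; self-authored code needs review."""
    if action in WHITELISTED_TOOLS:
        return cluster.call_tool(action, payload)
    # Anything else (e.g. code the LLM wrote itself) is held for human review.
    if reviewer.approve(action, payload):
        return cluster.execute_code(payload)
    raise PermissionError(f"Action '{action}' rejected by human review")
```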
6 CONCLUSION AND DISCUSSION

6.1 Conclusion

We innovatively proposed a system solution utilizing LLM-agents to autonomously and intelligently perform cluster troubleshooting. In terms of LLM algorithms, we introduced a benchmark consisting of 150 manually crafted advanced problems, demonstrating the performance differences between our constructed LLM-agent and the original open-source LLMs under fair data conditions. In the realm of LLM-agent construction, we innovatively proposed integrating DoT reasoning mathematics and the ability to handle special symbols and formulas into the agent, enabling the LLM to operate machines at the software level and receive feedback. Ultimately, we applied our innovative achievements to cluster diagnostics, exploring the potential in this field, and were pleasantly surprised to find that LLM-agent systems, despite being in their extremely early stages, are already capable of handling repetitive and low-end tasks, thus freeing industry practitioners to tackle more challenging and valuable problems.

In the future, we will continue our work in four aspects. In terms of LLM algorithms, we will expand and upgrade the existing benchmark and build a more comprehensive and valuable metrics system. In the agent field, we will further unlock the potential of DoT and make self-written code by the LLM gradually become the main execution body, reducing reliance on preset tools. At the system application level, we will form a closed loop of data, algorithm, and hardware, enriching the database with results from actual deployments. Finally, in terms of safety and reliability, we will continue to work with researchers in related fields to ensure and evaluate the safety and reliability of the agents.

6.2 Shortcomings and Limitations

Our research still has shortcomings and limitations. In terms of shortcomings, our agent currently relies on a mechanism of human review to ensure safety, depends on pre-written tools for code, and relies on data sourced from GitHub as a starting point. An ideal LLM-agent system should form a self-sustained relationship with the AI cluster, maintaining and evolving itself.

In terms of limitations, our work depends on the LLM within the LLM-agent, but smaller models like Llama3.2-3B currently cannot support the capabilities of the agent. Therefore, our work can only be applied to data centers or large-scale distributed clusters and cannot be deployed in edge computing or personal computer scenarios. We need to continuously monitor the development of smaller models and explore the possibility of teaching the capabilities of the LLM-agent to smaller models in the form of DoT when appropriate.
REFERENCES

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Adler, B., Agarwal, N., Aithal, A., Anh, D. H., Bhattacharya, P., Brundyn, A., Casper, J., Catanzaro, B., Clay, S., Cohen, J., et al. Nemotron-4 340B technical report. arXiv preprint arXiv:2406.11704, 2024.

Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 17682–17690, 2024.

Castelfranchi, C. Modelling social action for AI agents. Artificial Intelligence, 103(1-2):157–182, 1998.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Guo, C., Wu, H., Deng, Z., Soni, G., Ye, J., Padhye, J., and Lipshteyn, M. RDMA over commodity Ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference, pp. 202–215, 2016.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Jiang, Z., Lin, H., Zhong, Y., Huang, Q., Chen, Y., Zhang, Z., Peng, Y., Li, X., Xie, C., Nong, S., et al. MegaScale: Scaling large language model training to more than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pp. 745–760, 2024.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12, 2017.

Jung, H. and Chung, K. Social mining-based clustering process for big-data integration. Journal of Ambient Intelligence and Humanized Computing, 12(1):589–600, 2021.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

Liu, Y., Tao, S., Zhao, X., Zhu, M., Ma, W., Zhu, J., Su, C., Hou, Y., Zhang, M., Zhang, M., et al. CoachLM: Automatic instruction revisions improve the data quality in LLM instruction tuning. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 5184–5197. IEEE, 2024.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Ofenbeck, G., Steinmann, R., Caparros, V., Spampinato, D. G., and Püschel, M. Applying the roofline model. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 76–85. IEEE, 2014.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Peter, E. Abstraction and Reasoning Challenge. 2022.

Qian, K., Xi, Y., Cao, J., Gao, J., Xu, Y., Guan, Y., Fu, B., Shi, X., Zhu, F., Miao, R., et al. Alibaba HPN: A data center network for large language model training. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 691–706, 2024.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Shanley, T. InfiniBand Network Architecture. Addison-Wesley Professional, 2003.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Wang, Q., Sang, B., Zhang, H., Tang, M., and Zhang, K. DLRover: An elastic deep training extension with auto job resource recommendation. arXiv preprint arXiv:2304.01468, 2023.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.

Xiong, Y., Jiang, Y., Yang, Z., Qu, L., Zhao, G., Liu, S., Zhong, D., Pinzur, B., Zhang, J., Wang, Y., et al. SuperBench: Improving cloud AI infrastructure reliability with proactive validation. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pp. 835–850, 2024.

Xu, Y., Chen, Y., Zhang, X., Lin, X., Hu, P., Ma, Y., Lu, S., Du, W., Mao, Z., Zhai, E., et al. CloudEval-YAML: A practical benchmark for cloud configuration generation. Proceedings of Machine Learning and Systems, 6:173–195, 2024.

Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.

Zhang, Y., Yuan, Y., and Yao, A. C.-C. On the diagram of thought. arXiv preprint arXiv:2409.10038, 2024.