2411.02265v2
Abstract
1 Introduction
In recent years, large language models (LLMs) have significantly advanced the field of artificial intelligence, proving their effectiveness across numerous fields such as NLP, CV, speech, and AI4Science. Since the emergence of ChatGPT (OpenAI, 2022), numerous powerful LLMs have appeared (Achiam et al., 2023; Gemini et al., 2023; Touvron et al., 2023; OpenAI, 2024; Dubey et al., 2024; Qwen, 2024a), bringing new ways for people to collect and process information and broadly impacting our daily lives. As the demand for more sophisticated AI systems continues to grow, researchers are exploring new techniques and paradigms to push the boundaries of model size and performance. One approach that stands out is the Mixture of Experts (MoE) model, which synergizes multiple specialized sub-models and dynamically activates experts to deliver superior performance across diverse tasks (Lepikhin et al., 2020; Fedus et al., 2022; Wang et al., 2024a), enabling more efficient training and inference. A growing number of MoE-structured LLMs are being constructed and open-sourced to benefit the LLM community (Mistral, 2024; DeepSeek-AI, 2024; Yang et al., 2024; Jamba et al., 2024).
Tencent's AI chatbot, Yuanbao (yuanbao.tencent.com), has also adopted MoE as the neural architecture of its trillion-parameter flagship LLM since February 2024. Thanks to their exceptional capabilities in reading, writing, and searching, the MoE-based Hunyuan model and the Yuanbao chatbot help users work effortlessly and enjoy a more vibrant life. The MoE-powered Hunyuan models have also enhanced thousands of scenarios within Tencent's applications, enabling Tencent to better serve its billions of users.
In addition to serving users with premium models, another way to contribute to the community is open-sourcing. Open-source models can greatly promote the spread of technology and the flourishing of applications, as exemplified by LLama, Mistral, Qwen, and Deepseek, among others. However, most open-source models are based on dense architectures, with only a few adopting the MoE architecture, and those at relatively small parameter scales. In this work, we introduce
Hunyuan-Large, a large Transformer-based MoE model, featuring an unprecedented 389 billion
total parameters and 52 billion activated parameters, capable of handling up to 256K tokens. This
model adopts the classical Transformer architecture (Vaswani et al., 2017) with MoE, containing a
pre-training stage for acquiring fundamental capabilities and a post-training stage for task-specific
instruction following, capability enhancement, and human preference alignment. Hunyuan-Large
supports conventional NLP abilities such as question answering, reasoning, reading comprehension,
and specific LLM capabilities such as mathematics, coding, multi-turn interaction, and multilinguality.
We delve into the key technical innovations that have contributed to Hunyuan-Large’s exceptional
performance as follows.
• High-Quality Synthetic Data. The broad usage of synthetic data improves the quality and diversity
of training data, which enables the model to learn richer representations effectively and generalize
better to unseen data. In total, Hunyuan-Large is pre-trained on 7T tokens, which contains nearly
1.5T tokens of high-quality and diverse synthetic data.
• Enhanced Model Structure. We propose key-value (KV) cache compression, recycle routing,
and expert-specific learning rate scaling strategies to enhance Hunyuan-Large. The reduction
of KV cache overhead allows for more seamless deployment and scaling. Moreover, we adopt
different learning rates for different shared/specialized experts with our recycle routing strategy,
ensuring that each token can be utilized effectively during training and contributing to the overall
performance.
• Explorations on MoE Scaling Laws. Additionally, we explore the scaling laws of MoE models as our guidelines, highlighting the relationship between model size, training data, and performance. This analysis not only sheds light on the foundational elements behind Hunyuan-Large's strong performance, but also provides valuable guidance for the future development and optimization of more powerful and larger MoE-structured LLMs.
To demonstrate the power of Hunyuan-Large, we conduct extensive experiments on diverse types of benchmarks in both English and Chinese, comparing against the best-performing dense and MoE models with similar parameter sizes. We find that Hunyuan-Large is capable of handling various tasks including commonsense understanding, question answering, mathematical reasoning, coding, and aggregated tasks, achieving the overall best performance among existing open-source similar-scale
LLMs. The pre-trained and post-trained Hunyuan-Large models are publicly released to facilitate the
LLM community.
In the rest of this technical report, we will first give a detailed introduction to the pre-training stage of
Hunyuan-Large, including its data and tokenizer, model structure, and pre-training recipes in Section
2. Next, we will describe our post-training in Section 3, with details of our SFT and RLHF techniques.
The comprehensive experimental results and in-depth analyses of Hunyuan-Large’s pre-trained and
post-trained models will be given in Section 4. Finally, the conclusion and future direction will be
stated in Section 5.
2 Pre-Training
In this section, we will describe the details of pre-training Hunyuan-Large, including (a) data and
tokenizer, where high-quality data largely contributes to the model performance, (b) model structure,
consisting of our proposed KV cache compression, expert routing, and expert-specific learning rate
scaling strategies, and (c) pre-training recipes, introducing the detailed pre-training schedule as well
as our guidebook of explorations on MoE scaling laws. These techniques build the foundation of
Hunyuan-Large’s remarkable capability in pre-training.
We first give the overview of our data, which is viewed as the fuel of our powerful model, with its
preprocessing steps and data synthesis strategies essential for the quantity and quality of data. We
also introduce the tokenizer employed for converting text data into an appropriate format suitable for
Hunyuan-Large.
Figure 1: The four-step process of data synthesis in Hunyuan-Large’s pre-training: (1) Instruction
generation, (2) Instruction evolution, (3) Response generation, and (4) Response filtering.
To start with, we provide a brief overview of the pre-training data we use, and then delve deeper into the specifics of our synthetic data generation process, which is essential for acquiring capabilities, as has also been verified in various LLMs (Dubey et al., 2024; Abdin et al., 2024; Liu et al., 2024).
Data Overview and Processing. We aim to create a high-quality, safe, and diverse training dataset
for pre-training, primarily consisting of Chinese and English languages for practical demands. We
filter the data based on criteria such as writing quality, educational value, and toxicity to ensure
its high quality. Additionally, we anonymize all privacy-sensitive data and other harmful data. We
have also implemented an elaborate system of category labels, which allows us to flexibly adjust the
proportions of various types of data in the training dataset.
Data Synthesis. Besides the existing natural text corpus, we construct large amounts of synthetic data to specifically boost knowledge acquisition in areas where capabilities learned merely from natural data are relatively deficient. To make full use of synthetic data to enhance model performance, we mainly focus on the mathematics, coding, low-resource, and high-educational-value fields, which are good supplements to the naturally distributed corpus, while meeting three key requirements: quality, diversity, and quantity.
As shown in Figure 1, we synthesize high-quality instruction data through a four-step process,
including instruction generation, instruction evolution, response generation, and response filtering.
• Step 4: Response Filtering. To filter the synthetic instruction-response pairs, we employ a critique model and conduct self-consistency checks, generating multiple answers to perform self-consistency filtering for objective tasks such as question answering, ensuring reliability and accuracy. This process allows us to effectively remove low-quality or inconsistent data, ensuring that only high-quality text is used in pre-training (a minimal sketch of this filtering step follows).
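The self-consistency part of this step can be illustrated with a short sketch. This is a minimal illustration rather than the exact pipeline used for Hunyuan-Large: the generate and extract_answer helpers are hypothetical stand-ins for a sampling-based generator and an answer parser.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

def self_consistency_filter(
    instruction: str,
    generate: Callable[[str], str],        # hypothetical sampling-based generator
    extract_answer: Callable[[str], str],  # hypothetical parser for the final answer
    num_samples: int = 8,
    min_agreement: float = 0.6,
) -> Optional[Tuple[str, str]]:
    """Keep a synthetic instruction-response pair only if sampled answers agree."""
    responses = [generate(instruction) for _ in range(num_samples)]
    answers = [extract_answer(r) for r in responses]
    top_answer, votes = Counter(answers).most_common(1)[0]
    if votes / num_samples < min_agreement:
        return None  # inconsistent answers: drop this sample
    # keep one response that carries the majority answer
    best_response = next(r for r, a in zip(responses, answers) if a == top_answer)
    return instruction, best_response
```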
2.1.2 Tokenizer
The tokenizer is a vital component for effectiveness and efficiency in pre-training, which should
balance two critical factors: (a) achieving a high compression rate for efficient training and inference,
and (b) maintaining an appropriately large vocabulary to ensure adequate learning of each word embedding. In Hunyuan-Large, we carefully consider both aspects and employ a vocabulary consisting
of 128K tokens. This token vocabulary is a combination of 100K tokens from the tiktoken tokenizer
(OpenAI, 2023) and an additional 28K tokens specifically designed to enhance Chinese language
support. Notably, when compared to the LLama3.1 tokenizer, our new tokenizer exhibits improved
compression rates, increasing from 2.78 to 3.13 characters per token.
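The compression rate quoted above is the average number of characters represented by one token. The sketch below shows one way to measure it for any tokenizer exposing an encode function; the commented tiktoken usage is illustrative only and does not reproduce the Hunyuan vocabulary.

```python
def chars_per_token(corpus: list[str], encode) -> float:
    """Average number of characters represented by one token over a corpus."""
    total_chars = sum(len(text) for text in corpus)
    total_tokens = sum(len(encode(text)) for text in corpus)
    return total_chars / total_tokens

# Example with an open-source tiktoken encoding (not the Hunyuan-Large vocabulary):
# import tiktoken
# enc = tiktoken.get_encoding("cl100k_base")
# print(chars_per_token(["Hunyuan-Large is a Transformer-based MoE model."], enc.encode))
```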
Hunyuan-Large is equipped with superior model structure and training strategies to achieve impressive
LLM capabilities. We first show the overview of model architecture and hyper-parameters, and then
delve into the KV cache compression, expert routing strategy, and expert-specific learning rate scaling
used in our model with details.
Table 1: Overview of the architecture and key hyper-parameters of Hunyuan-Large. This model has
389B total parameters and 52B activated parameters. There are 1 shared expert and 1 specialized
expert activated for each token.
Configuration Hunyuan-Large
# Layers 64
# Attention Heads 80
# Key/Value Heads 8
# Shared Experts 1
# Specialized Experts 16
# Activated Specialized Experts 1
# Trained Tokens 7T
Activation Function SwiGLU
Vocabulary Size 128K
Hidden Size 6,400
Table 2 compares the KV cache memory usage across different attention mechanisms. The GQA+CLA technique adopted in Hunyuan-Large saves nearly 95% of the KV cache in total compared to the original MHA mechanism, significantly improving inference efficiency with little impact on model performance.
Table 2: Comparisons of KV cache memory (in bytes, bf16) for different attention mechanisms. The attention mechanisms include Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), Cross-Layer Attention (CLA), and GQA+CLA (the final setting in Hunyuan-Large). n_h, d_h, l, and n_g denote the number of attention heads, the dimension per head, the number of layers, and the number of groups in GQA (n_g < n_h), respectively. Our CLA shares the KV cache every 2 layers.
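The roughly 95% saving follows directly from the head counts and the cross-layer sharing period. The sketch below reproduces the arithmetic, assuming bf16 storage (2 bytes per element) and the head counts from Table 1; the head dimension is an assumed value for illustration only.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, share_every: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every cached layer.
    share_every > 1 models cross-layer attention (CLA), where consecutive
    layers reuse the same KV cache."""
    cached_layers = n_layers // share_every
    return 2 * seq_len * cached_layers * n_kv_heads * head_dim * bytes_per_elem

# Head counts from Table 1 (80 attention heads, 8 KV heads, 64 layers);
# head_dim = 128 is an assumption for this example.
head_dim, n_layers, seq = 128, 64, 1
mha = kv_cache_bytes(seq, n_layers, n_kv_heads=80, head_dim=head_dim)
gqa_cla = kv_cache_bytes(seq, n_layers, n_kv_heads=8, head_dim=head_dim, share_every=2)
print(f"GQA+CLA keeps {gqa_cla / mha:.1%} of the MHA cache")  # 8/80 * 1/2 = 5.0%
```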
(a) Traditional Top-k Routing. (b) Recycle Routing.
Figure 2: An illustration of the recycle routing strategy in Hunyuan-Large, where each expert’s
maximum capacity is set to 2. Token D, which was initially allocated to the overloaded Expert 1,
is reassigned to a randomly selected Expert 4. This approach helps alleviate the potential loss of
valuable information. In traditional routing strategies, tokens from overloaded experts would be
dropped as shown in (a). However, our strategy involves randomly reassigning these tokens to other
experts, as demonstrated in (b), where Token D is routed to Expert 4.
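A minimal sketch of recycle routing follows. It only illustrates the idea in the caption, namely that tokens overflowing an expert's capacity are randomly reassigned instead of dropped; it is not the production router, and capacity handling in the real system may differ.

```python
import random
from collections import defaultdict

def recycle_route(token_ids, top1_expert, capacity, num_experts, seed=0):
    """Top-1 routing with recycling: tokens that overflow an expert's capacity
    are reassigned to a randomly chosen expert instead of being dropped."""
    rng = random.Random(seed)
    assignment, load = {}, defaultdict(int)
    for tok, exp in zip(token_ids, top1_expert):
        if load[exp] >= capacity:             # expert overloaded
            exp = rng.randrange(num_experts)  # recycle the token to a random expert
            # note: the randomly chosen expert may itself be full; a fuller
            # implementation could re-sample or track remaining capacity
        assignment[tok] = exp
        load[exp] += 1
    return assignment

# Mirrors Figure 2: capacity 2, Token D overflows Expert 1 and gets recycled.
print(recycle_route(["A", "B", "C", "D"], [1, 1, 2, 1], capacity=2, num_experts=4))
```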
each expert during a single iteration will vary, which means that each expert experiences a different effective batch size within one training iteration. Hence, it is essential to adopt expert-specific learning rates to optimize training efficiency. Considering the load-balance losses, we can safely assume that different specialized experts receive approximately similar numbers of effectively trained tokens. Specifically, for the specialized experts, the effective batch size is roughly divided by the number of specialized experts, so their optimal learning rate is ϵ_opt(B/n) (we activate 1 of 16 specialized experts and thus n = 16). The learning-rate scaling ratio between the specialized and shared experts is ϵ_opt(B/n)/ϵ_opt(B), which is approximately 0.31 in our setting. Consequently, when configuring the learning rate for Hunyuan-Large, we assign the optimal ϵ_opt(B) to the shared expert, and deliberately scale down the learning rate of the specialized experts by this ratio.
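In practice, such expert-specific learning rates can be realized with optimizer parameter groups, as in the sketch below. The 0.31 default mirrors the ratio derived above; the module naming convention and the use of AdamW are assumptions of this sketch, not confirmed implementation details.

```python
import torch

def build_optimizer(model: torch.nn.Module, base_lr: float,
                    expert_lr_scale: float = 0.31) -> torch.optim.Optimizer:
    """Give the shared expert (and all non-expert weights) the base learning rate,
    and scale down the specialized experts' learning rate to reflect their
    smaller effective batch size."""
    specialized, rest = [], []
    for name, param in model.named_parameters():
        # "specialized_experts" is an assumed naming convention for this sketch
        (specialized if "specialized_experts" in name else rest).append(param)
    return torch.optim.AdamW([
        {"params": rest, "lr": base_lr},
        {"params": specialized, "lr": base_lr * expert_lr_scale},
    ])
```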
The effectiveness of LLM pre-training is not solely determined by the dataset and model structure, but also owes much to the pre-training recipes obtained from empirical experiments. We first explore the scaling laws of MoE models, which serve as a guidebook for our model design. We then introduce the detailed process of annealing and long-context pre-training, which further enhances the LLM's capabilities.
Initially, we investigate the scaling laws of MoE models to identify optimal settings and gain insights before pre-training. Typically, the training compute budget for dense models is estimated as C = 6ND, where N represents the number of parameters and D denotes the number of training tokens. However, for MoE models with longer sequences (e.g., 8K, 32K, and 256K), the compute budget formula differs due to attention complexity and sparse activation. Upon meticulous computation, we ascertain the precise compute budget C for MoE models, where N in our formula represents the number of activated parameters, as follows:
Drawing on the insights of Kaplan et al. (2020) and Li et al. (2024a), we acknowledge that batch size
B has a significant impact on compute budget C during training. To isolate this effect and derive
precise estimates, we employ the critical batch size Bcrit (L), which optimizes the trade-off between
time and computational efficiency, ultimately resulting in minimal compute budget Cmin :
Figure 3: Using quadratic polynomial fitting, we obtain the scaling law of the optimal number of activated parameters under different minimum compute budgets.
C_min = C / (1 + B / B_crit(L)).    (3)
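Equation (3) discounts the raw compute C by the overhead of training at batch size B relative to the critical batch size. A small sketch with toy numbers (not values from the report):

```python
def min_compute_budget(total_compute: float, batch_size: float,
                       critical_batch_size: float) -> float:
    """Eq. (3): C_min = C / (1 + B / B_crit(L)), the minimal compute associated
    with compute C actually spent at batch size B."""
    return total_compute / (1.0 + batch_size / critical_batch_size)

# Toy example: training exactly at the critical batch size halves the effective budget.
print(min_compute_budget(total_compute=1e24, batch_size=4e6, critical_batch_size=4e6))
```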
Figure 4: Employing the same fitting strategy as Figure 3, we derive the scaling law of the optimal
amount of training data under different minimum compute budgets.
consequently, enhancing its overall performance. Furthermore, during this phase, we prioritize the
use of the highest-quality dataset available, which plays a pivotal role in augmenting the model’s
performance in the annealing phase.
After the annealing phase, Hunyuan-Large is trained on longer sequences (up to 256K tokens) to enable its long-context capability. Specifically, the long-context pre-training phase contains two stages, gradually increasing the token length from 32K to 256K. We adopt RoPE (Su et al., 2024) for building position embeddings, and scale the RoPE base frequency to 1 billion during the 256K pre-training stage, inspired by Xiong et al. (2023).
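Raising the RoPE base frequency slows the per-dimension rotations so that relative positions remain distinguishable over a 256K context. A minimal sketch of the standard RoPE frequency schedule follows; the head dimension and the default base of 10,000 are common conventions assumed for illustration, not values reported here.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10_000.0) -> torch.Tensor:
    """Per-dimension rotation frequencies for RoPE: theta_i = base^(-2i/d)."""
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (base ** exponents)

short_ctx = rope_frequencies(head_dim=128, base=10_000.0)
long_ctx = rope_frequencies(head_dim=128, base=1e9)  # 256K-stage base frequency
print(short_ctx[-1], long_ctx[-1])  # the slowest rotation becomes much slower with the larger base
```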
For the data, we rely solely on natural long-context data obtained from books and code (comprising nearly 25% of the corpus) and mix it with normal-length pre-training data (nearly 75%) to form our long-context pre-training corpus, consistent with the observations in Gao et al. (2024). We also find that LLMs do not require much training to acquire long-context capabilities. In each of the 32K and 256K stages, we employ a long-context pre-training corpus of approximately 10 billion tokens. The long-context pre-training at each stage achieves satisfactory long-context abilities while maintaining good LLM capabilities on tasks with normal lengths.
3 Post-Training
The performance of SFT strongly depends on the quality of instruction data across the various types of LLM capabilities. For SFT, we describe the detailed data collection and processing procedures that ensure the effectiveness of Hunyuan-Large's post-training, along with the SFT training settings.
3.1.1 Overview of SFT Data
The central goal of SFT is to further enhance the model's performance across multiple key capabilities based on the corresponding well-selected data. These capabilities primarily encompass mathematics, coding, logical reasoning, knowledge-based question answering, agent behavior, text generation, NLP comprehension, industrial applications, role-playing, long-text capabilities, etc. We recognize that improving these abilities not only makes the model more adept in practical applications, but also better satisfies users' diverse needs across multiple scenarios. Simultaneously, we place great emphasis on data security, striving to ensure that the model aligns with human values under most circumstances. The overall SFT data volume exceeds 1 million samples.
The key techniques of SFT data collection and processing mainly include instruction extraction,
instruction generalization, instruction balancing, and data quality controlling.
Instruction Extraction. To enhance the breadth and diversity of the instruction set, we develop an
instruction extraction model specifically for domains such as mathematics, logical reasoning, and
knowledge-based question answering, whose primary goal is to effectively extract data suitable for
instruction tuning from publicly available data sources (e.g., web pages, encyclopedias, etc.). The
extracted data includes both instructions and corresponding reference answers. We develop many
specialized models as instruction extractors. With the help of these models, we successfully extract a large set of natural instructions from public data. These instructions serve as crucial seeds for enhancing the final model's generalization performance and diversity.
Instruction Generalization. We propose an instruction generalization method to obtain more
diverse and complex instructions in large quantities. Specifically, we design and train an instruction
generalization system capable of generalizing targeted instructions while gradually increasing their
difficulty and complexity levels. The central recipe of this system lies in training the model by
synthesizing numerous mappings between simple and complex instructions. In addition, we construct
a well-structured instruction taxonomy with its corresponding classification models, which aims
to analyze and balance the distribution of various instruction types in SFT data. Armed with this
instruction taxonomy, our instruction generalization system can supplement the original data with targeted instruction types whose coverage is weak.
Instruction Balancing. Through the instruction extraction and generalization processes, we accu-
mulate more than 10 million instructions. Instruction balance is essential for enhancing the model’s
performance across various scenarios. However, many generated instructions have very similar
semantic meanings and the instruction type distribution is naturally unbalanced. To enhance the
instruction complexity while maintaining balanced instruction distributions, we attach labels to each instruction across multiple dimensions. By meticulously tagging these
labels, we can more accurately understand and analyze the characteristics of our instruction sets. By
ensuring adequate amounts and balanced distribution of different types of instructions during the SFT
process, we can effectively alleviate overfitting or underfitting problems on specific instruction types,
thereby improving the model’s generalization capabilities and adaptability across diverse application
scenarios.
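One simple way to enforce such balance over a labeled instruction pool is to cap each label's share, as sketched below; this only illustrates the idea and is not the report's exact balancing procedure.

```python
import random
from collections import defaultdict

def balanced_sample(instructions, labels, per_label_budget, seed=0):
    """Cap the number of SFT samples kept per instruction label so that
    over-represented instruction types do not dominate the mixture."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for inst, label in zip(instructions, labels):
        by_label[label].append(inst)
    sampled = []
    for label, bucket in by_label.items():
        rng.shuffle(bucket)
        sampled.extend(bucket[:per_label_budget.get(label, 0)])
    return sampled
```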
Data Quality Controlling. The quality of SFT data is the foundation of superior performance. We
mainly conduct the following three methods to ensure the high quality of our SFT data.
• Rule-based Filtering. We observe common issues in SFT data such as truncation errors, duplication, garbled characters, and format errors. Consequently, we develop a set of rule-based data filtering strategies to screen out such undesirable outputs from the above instruction extraction and generation models.
• Model-based Filtering. To automatically extract high-quality SFT data from a substantial volume
of synthesized instruction data, we train a critique model (McAleese et al., 2024) based on a 70B
dense model of our Hunyuan series. This model assigns a four-tier quality score to each instruction
sample, assessing aspects such as the accuracy, relevance, completeness, usefulness, and clarity of the generated responses, as well as other possible data quality issues.
• Human-based Filtering. Prior to model training, the SFT data filtered via rule-based and model-
based methods further undergo human annotation, ensuring that answers adhere to the desired
task-specific response patterns and avoid introducing additional low-quality issues.
To align Hunyuan-Large with human preferences, we further train our SFT model using DPO (Rafailov
et al., 2024). We adopt a single-stage training strategy that integrates both offline and online training,
which demonstrates superior controllability and overall performance. In this integrated approach, we
utilize a pre-compiled preference dataset to enhance controllability, while simultaneously employing
the current policy model to generate multiple responses for each prompt and our reward model to
select the most and least preferred responses.
To enhance training stability, we incorporate an SFT loss term on the chosen response, similar to the
approaches in (Dubey et al., 2024; Adler et al., 2024). This addition helps stabilize DPO training
by preventing a decrease in the log probability of chosen responses. Furthermore, we implement an
exponential moving average strategy to mitigate reward hacking and reduce alignment tax (Ouyang
et al., 2022), ensuring a more stable training process across a larger dataset.
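The combination of the DPO objective with an auxiliary SFT (negative log-likelihood) term on the chosen response can be sketched as follows. This follows the standard DPO formulation (Rafailov et al., 2024) with an added chosen-response NLL term; beta and the SFT coefficient are placeholder values, not the settings used for Hunyuan-Large.

```python
import torch
import torch.nn.functional as F

def dpo_with_sft_loss(policy_chosen_logps: torch.Tensor,
                      policy_rejected_logps: torch.Tensor,
                      ref_chosen_logps: torch.Tensor,
                      ref_rejected_logps: torch.Tensor,
                      beta: float = 0.1, sft_coef: float = 0.1) -> torch.Tensor:
    """DPO loss plus an auxiliary NLL (SFT) term on the chosen responses, which
    counteracts the tendency of the chosen log-probability to decrease.
    All inputs are summed log-probabilities of whole responses, shape (batch,)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    sft_loss = -policy_chosen_logps.mean()  # NLL of the chosen responses
    return dpo_loss + sft_coef * sft_loss
```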
4 Model Evaluations
We conduct extensive evaluations of Hunyuan-Large to demonstrate its effectiveness. The following experiments cover our pre-trained language model (Sec. 4.1) and post-trained language model (Sec. 4.2) on various tasks in Chinese and English, including math and reasoning, code, reading comprehension, commonsense, long context, and aggregated tasks, where Hunyuan-Large achieves excellent performance in both pre-training and post-training.
In this section, we report the performance of Hunyuan-Large’s pre-trained model on various types of
widely-used benchmarks, verifying the power of the fundamental capability of our model.
Table 3: Performance of Hunyuan-Large’s pre-trained model and its competitors.
Model LLama3.1-405B LLama3.1-70B Mixtral-8x22B DeepSeek-V2 Hunyuan-Large
Architecture Dense Dense MoE MoE MoE
# Activated Params 405B 70B 39B 21B 52B
# Total Params 405B 70B 141B 236B 389B
Context Length 128k 128k 64k 128k 256k
English
MMLU 85.2 79.3 77.8 78.5 88.4
MMLU-Pro 61.6 53.8 49.5 - 60.2
BBH 85.9 81.6 78.9 78.9 86.3
HellaSwag - - 88.7 87.8 86.8
CommonsenseQA 85.8 84.1 82.4 - 92.9
WinoGrande 86.7 85.3 85.0 84.9 88.7
PIQA - - 83.6 83.7 88.3
NaturalQuestions - - 39.6 38.7 52.8
DROP 84.8 79.6 80.4 80.1 88.9
ARC-C 96.1 92.9 91.2 92.4 95.0
TriviaQA - - 82.1 79.9 89.2
Chinese
CMMLU - - 60.0 84.0 90.2
C-Eval - - 59.6 81.7 91.9
C3 - - 71.4 77.4 82.3
Math
GSM8K 89.0 83.7 83.7 79.2 92.8
MATH 53.8 41.4 42.5 43.6 69.8
CMATH - - 72.3 78.7 91.3
Code
HumanEval 61.0 58.5 53.1 48.8 71.4
MBPP 73.4 68.6 64.2 66.6 72.6
More precisely, we adopt zero-shot for TriviaQA, PIQA, C3, and HumanEval; 3-shot for BBH, MBPP, DROP, and CMATH; 4-shot for GSM8K and MATH; 5-shot for MMLU, MMLU-Pro, C-Eval, CMMLU, WinoGrande, and NaturalQuestions; 7-shot for CommonsenseQA; 10-shot for HellaSwag; and 25-shot for ARC-C. We compare Hunyuan-Large with state-of-the-art dense and MoE pre-trained models of comparable or larger (activated) parameter sizes. Specifically, these competitors include LLama3.1-70B (Dubey et al., 2024), Mixtral-8x22B (Mistral, 2024), DeepSeek-V2 (DeepSeek-AI, 2024), and LLama3.1-405B. For fair comparisons, we report the best performance among publicly reported results and those we reproduce ourselves for the baselines.
Table 3 illustrates the performance of Hunyuan-Large and other competitive pre-trained models. In
general, Hunyuan-Large achieves the best overall performance compared to both Dense and MoE
based competitors having similar activated parameter sizes. For aggregated benchmarks such as
MMLU, Hunyuan-Large not only surpasses the LLama3.1-405B model but does so with a significantly lower count of activated parameters, achieving an impressive improvement of 3.2 points.
Hunyuan-Large also shows superior performance in commonsense understanding and reasoning,
and classical NLP tasks such as QA and reading comprehension tasks (e.g., CommonsenseQA,
PIQA, and TriviaQA). For the mathematics capability, Hunyuan-Large outperforms all baselines
in math datasets of GSM8K and MATH, and also gains the best results on CMATH in Chinese. It
also achieves the first-tier results in code datasets like HumanEval and MBPP. We also observe that
Hunyuan-Large achieves the overall best performance in all Chinese tasks (e.g., CMMLU, C-Eval).
In-depth analyses throughout Hunyuan-Large's development process indicate that this across-the-board improvement mainly derives from: (a) the high-quality pre-training data armed with synthesis techniques, functioning as the fundamental fuel for acquiring capabilities, (b) a better model structure with recycle routing and expert-specific learning rate scaling on shared/specialized experts, and (c) the pre-training recipe, inspired by various pioneering explorations of more effective and efficient MoE pre-training schedules, which enables more intelligent and stable training. Furthermore, Hunyuan-Large is capable of handling longer sequences of up to 256K tokens thanks to our long-context pre-training.
Table 4: Performance of our Hunyuan-Large-Instruct and its competitors.
instructions within a given context. (4) Arena-Hard robustly differentiates model capabilities, aligns closely with human preferences in real-world scenarios, and is frequently updated with new prompts to prevent over-fitting and ensure ongoing relevance. (5) AlpacaEval-2.0 is also a commonly-used
benchmark to automatically evaluate LLMs' instruction-following abilities. Hunyuan-Large-Instruct shows the best overall performance on these five benchmarks compared to all the strong baseline models. This impressive performance can mainly be attributed to its powerful pre-trained model, the high-quality SFT and DPO data produced by the well-designed four-step data collection and processing pipeline, and the superior SFT and DPO training strategies.
Table 6: The performance of Hunyuan-Large-Instruct on PenguinScrolls.
Model                     Information Extraction   Information Localization   Qualitative Analysis   Numerical Reasoning   Overall
LLama3.1-70B-Instruct     82.51                    69.70                      75.77                  49.52                 69.37
Hunyuan-Large-Instruct    91.14                    89.56                      92.78                  67.46                 85.23
Internal user studies corroborate that the improvements on PenguinScrolls strongly correlate with
enhancements in actual user experiences. We will release PenguinScrolls to advance long-context
research and development in the future.
References
Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A.,
Bakhtiari, A., Behl, H., et al. Phi-3 technical report: A highly capable language model locally on
your phone. arXiv preprint arXiv:2404.14219, 2024.
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt,
J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Adler, B., Agarwal, N., Aithal, A., Anh, D. H., Bhattacharya, P., Brundyn, A., Casper, J., Catanzaro,
B., Clay, S., Cohen, J., et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704,
2024.
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training
generalized multi-query transformer models from multi-head checkpoints. In Proceedings of
EMNLP, 2023.
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M.,
Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,
2021.
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D.,
Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from
human feedback. arXiv preprint arXiv:2204.05862, 2022.
Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. PIQA: Reasoning about physical commonsense in natural
language. In Proceedings of AAAI, 2020.
Brandon, W., Mishra, M., Nrusimha, A., Panda, R., and Kelly, J. R. Reducing transformer key-value
cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y.,
Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374, 2021.
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think
you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint
arXiv:1803.05457, 2018.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J.,
Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168, 2021.
DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.
arXiv preprint arXiv:2405.04434, 2024.
Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading compre-
hension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161,
2019.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A.,
Yang, A., Fan, A., et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.
Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. Length-controlled alpacaeval: A simple
way to debias automatic evaluators, 2024. URL https://arxiv.org/abs/2404.04475.
Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with
simple and efficient sparsity. Journal of Machine Learning Research, 2022.
Gao, T., Wettig, A., Yen, H., and Chen, D. How to train long-context language models (effectively).
arXiv preprint arXiv:2410.02660, 2024.
Gemini, T., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai,
A. M., Hauth, A., et al. Gemini: a family of highly capable multimodal models. arXiv preprint
arXiv:2312.11805, 2023.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia,
Y., and He, K. Accurate, large minibatch SGD: Training Imagenet in 1 hour. arXiv preprint
arXiv:1706.02677, 2017.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J.
Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,
2021.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L.,
Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models.
arXiv preprint arXiv:2203.15556, 2022.
Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What’s
the real context size of your long-context language models? arXiv preprint arXiv:2404.06654,
2024.
Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, J., Lv, C., Zhang, Y., Fu, Y., et al. C-Eval:
A multi-level multi-discipline chinese evaluation suite for foundation models. In Proceedings of
NeurIPS, 2024.
Jamba, T., Lenz, B., Arazi, A., Bergman, A., Manevich, A., Peleg, B., Aviram, B., Almagor, C.,
Fridman, C., Padnos, D., et al. Jamba-1.5: Hybrid transformer-mamba models at scale. arXiv
preprint arXiv:2408.12570, 2024.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas,
D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088,
2024.
Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised
challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A.,
Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,
2020.
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv preprint
arXiv:1404.5997, 2014.
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D.,
Polosukhin, I., Devlin, J., Lee, K., et al. Natural questions: a benchmark for question answering
research. TACL, 2019.
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen,
Z. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv
preprint arXiv:2006.16668, 2020.
Li, H., Zhang, Y., Koto, F., Yang, Y., Zhao, H., Gong, Y., Duan, N., and Baldwin, T. CMMLU:
Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212,
2023.
Li, S., Zhao, P., Zhang, H., Sun, X., Wu, H., Jiao, D., Wang, W., Liu, C., Fang, Z., Xue, J., et al. Surge
phenomenon in optimal learning rate and batch size scaling. arXiv preprint arXiv:2405.14578,
2024a.
Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J. E., and Stoica, I. From
crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline, 2024b.
URL https://arxiv.org/abs/2406.11939.
Liu, R., Wei, J., Liu, F., Si, C., Zhang, Y., Rao, J., Zheng, S., Peng, D., Yang, D., Zhou, D., et al. Best
practices and lessons learned on synthetic data. In Proceedings of COLM, 2024.
Liu, X., Lei, X., Wang, S., Huang, Y., Feng, Z., Wen, B., Cheng, J., Ke, P., Xu, Y., Tam, W. L.,
et al. AlignBench: Benchmarking chinese alignment of large language models. arXiv preprint
arXiv:2311.18743, 2023.
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In Proceedings of ICLR, 2019.
McAleese, N., Pokorny, R. M., Uribe, J. F. C., Nitishinskaya, E., Trebacz, M., and Leike, J. LLM
Critics Help Catch LLM Bugs. arXiv preprint arXiv:2407.00215, 2024.
Mistral. Cheaper, better, faster, stronger. continuing to push the frontier of AI and making it accessible
to all. 2024. URL https://mistral.ai/news/mixtral-8x22b.
OpenAI. Introducing ChatGPT. 2022. URL https://openai.com/index/chatgpt/.
OpenAI. Tiktoken. 2023. URL https://github.com/openai/tiktoken.
OpenAI. Hello GPT-4o. 2024. URL https://openai.com/index/hello-gpt-4o/.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S.,
Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback.
In Proceedings of NeurIPS, 2022.
Qwen. Qwen2.5. 2024a. URL https://github.com/QwenLM/Qwen2.5.
Qwen. Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters, 2024b.
URL https://qwenlm.github.io/blog/qwen-moe/.
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference
optimization: Your language model is secretly a reward model. In Proceedings of NeurIPS, 2024.
Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial winograd
schema challenge at scale. Communications of the ACM, 2021.
Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with rotary
position embedding. Neurocomputing, 2024.
Sun, K., Yu, D., Yu, D., and Cardie, C. Investigating prior knowledge for challenging chinese machine
reading comprehension. TACL, 2020.
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V.,
Chi, E. H., Zhou, D., et al. Challenging big-bench tasks and whether chain-of-thought can solve
them. arXiv preprint arXiv:2210.09261, 2022.
Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge
targeting commonsense knowledge. In Proceedings of NAACL, 2019.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal,
N., Hambro, E., Azhar, F., et al. LLaMA: Open and Efficient Foundation Language Models. arXiv
preprint arXiv:2302.13971, 2023.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and
Polosukhin, I. Attention is all you need. In Proceedings of NIPS, 2017.
Wang, A., Sun, X., Xie, R., Li, S., Zhu, J., Yang, Z., Zhao, P., Han, J., Kang, Z., Wang, D., et al.
HMoE: Heterogeneous mixture of experts for language modeling. arXiv preprint arXiv:2408.10681,
2024a.
Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z.,
et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark.
In Proceedings of NeurIPS, 2024b.
Wei, T., Luan, J., Liu, W., Dong, S., and Wang, B. CMATH: Can your language model pass chinese
elementary school math test? arXiv preprint arXiv:2306.16636, 2023.
Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., Hou, R., Martin, L., Rungta, R., Sankarara-
man, K. A., Oguz, B., et al. Effective long-context scaling of foundation models. arXiv preprint
arXiv:2309.16039, 2023.
Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al.
QWen2 Technical Report. arXiv preprint arXiv:2407.10671, 2024.
Yuan, T., Ning, X., Zhou, D., Yang, Z., Li, S., Zhuang, M., Tan, Z., Yao, Z., Lin, D., Li, B., et al.
LV-Eval: A balanced long-context benchmark with 5 length levels up to 256k. arXiv preprint
arXiv:2402.05136, 2024.
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish
your sentence? In Proceedings of ACL, 2019.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.,
et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of NeurIPS,
2023.
Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. Instruction-
following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.