LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

L Zhao, T Wei, L Zeng, C Cheng, L Yang, P Cheng, L Wang, C Li, X Wu, B Zhu, Y Gan, R Hu…
arXiv preprint arXiv:2406.00605, 2024
We introduce LongSkywork, a long-context Large Language Model (LLM) capable of processing up to 200,000 tokens. We provide a training recipe for efficiently extending the context length of LLMs. We identify that the critical element in enhancing long-context processing capability is to incorporate a long-context SFT stage following the standard SFT stage. A mere 200 iterations can convert the standard SFT model into a long-context model. To reduce the effort of collecting and annotating data for long-context language modeling, we develop two novel methods for creating synthetic data. These methods are applied during the continual pretraining phase as well as the Supervised Fine-Tuning (SFT) phase, greatly enhancing the training efficiency of our long-context LLMs. Our findings suggest that synthetic long-context SFT data can, to some extent, surpass the performance of data curated by humans. LongSkywork achieves outstanding performance on a variety of long-context benchmarks. In the Needle test, a benchmark for long-context information retrieval, our models achieve perfect accuracy across multiple context spans. Moreover, in realistic application scenarios, LongSkywork-13B demonstrates performance on par with Claude2.1, the leading long-context model, underscoring the effectiveness of our proposed methods.
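The abstract does not detail the two synthetic-data methods, but it references the Needle test as the retrieval benchmark. As a rough, hypothetical illustration of what a needle-in-a-haystack style synthetic long-context example looks like (not the paper's actual pipeline), the following minimal Python sketch inserts a single retrievable fact at a random depth inside long filler text; all names (make_needle_example, haystack_paragraphs, target_tokens) are assumptions introduced here for illustration.

```python
import random

def make_needle_example(haystack_paragraphs, needle_fact, question,
                        target_tokens=200_000, tokens_per_paragraph=100):
    """Hypothetical sketch: build one needle-in-a-haystack style example.

    The filler "haystack" stands in for long distractor text; the "needle"
    is a single fact the model must retrieve, placed at a random depth so
    that examples cover many positions within the context window.
    """
    n_paragraphs = max(1, target_tokens // tokens_per_paragraph)
    filler = [random.choice(haystack_paragraphs) for _ in range(n_paragraphs)]
    insert_at = random.randint(0, len(filler))  # vary needle depth across examples
    filler.insert(insert_at, needle_fact)
    context = "\n\n".join(filler)
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "response": needle_fact}

# Toy usage; the paper targets contexts up to 200K tokens, shrunk here for illustration.
example = make_needle_example(
    haystack_paragraphs=["Filler paragraph about an unrelated topic."],
    needle_fact="The access code for the vault is 7-4-1-9.",
    question="What is the access code for the vault?",
    target_tokens=2_000,
    tokens_per_paragraph=10,
)
```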