Contrastive post-training large language models on data curriculum

C Xu, C Rosset, L Del Corro, S Mahajan, et al. - arXiv preprint arXiv:2310.02263, 2023 - arxiv.org
Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we explore contrastive post-training techniques for alignment by automatically constructing preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We carefully compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continued SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitions to "harder" ones, further improving alignment. Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to exceed that of ChatGPT.
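The DPO objective mentioned in the abstract trains the policy directly on preference pairs, rewarding the log-probability margin of the chosen response over the rejected one relative to a frozen reference model. A minimal sketch of the per-pair loss (the function name, argument names, and the default `beta` value are illustrative assumptions, not taken from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_* are summed token log-probabilities of the chosen/rejected
    responses under the policy being trained; ref_logp_* are the same
    quantities under the frozen reference (e.g. the SFT checkpoint).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), computed stably as log1p(exp(-margin))
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; widening the chosen-vs-rejected gap drives it toward zero. In a curriculum setting, "easier" pairs (e.g. GPT-4 vs. InstructGPT outputs) would simply feed larger expected margins early in training.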