Contrastive post-training large language models on data curriculum

C Xu, C Rosset, L Del Corro, S Mahajan, et al. - arXiv preprint arXiv:2310.02263, 2023 - arxiv.org
Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we explore contrastive post-training techniques for alignment by automatically constructing preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We carefully compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continued SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitions to "harder" ones, further improving alignment. Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to exceed that of ChatGPT.
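The DPO objective mentioned in the abstract trains the policy directly on preference pairs, rewarding the log-probability margin of the chosen response over the rejected one relative to a frozen reference model. A minimal sketch of the per-pair loss (the function name, argument names, and the default `beta` value are illustrative assumptions, not taken from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_* are summed token log-probabilities of the chosen/rejected
    responses under the policy being trained; ref_logp_* are the same
    quantities under the frozen reference (e.g. the SFT checkpoint).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), computed stably as log1p(exp(-margin))
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; widening the chosen-vs-rejected gap drives it toward zero. In a curriculum setting, "easier" pairs (e.g. GPT-4 vs. InstructGPT outputs) would simply feed larger expected margins early in training.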