2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

Li, Shilong; He, Yancheng; Huang, Hui; Bu, Xingyuan; Liu, Jiaheng; Guo, Hangyu; Wang, Weixun; Gu, Jihao; Su, Wenbo; Zheng, Bo

arXiv:2410.19720 (cs)

[Submitted on 25 Oct 2024]

Abstract: Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
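
To make the segment-by-aspect idea concrete, here is a minimal Python sketch of what 2-dimensional supervision could look like. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the aspect names, the score scale, or the exact objective, so `ASPECTS`, the uniform aspect weights, and the score-weighted loss below are all hypothetical. The only standard piece is the DPO form, -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)), which the sketch recovers when every segment score equals 1.

```python
import math
import re

# Hypothetical aspect names; the abstract does not list the actual criteria.
ASPECTS = ("helpfulness", "correctness", "clarity")


def split_into_segments(response):
    """Naively split a response into sentence-level segments."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def segment_scores(score_matrix, aspect_weights):
    """Collapse a segments x aspects score matrix to one score per segment.

    Uses a weighted mean across aspects (an assumed aggregation rule).
    """
    total = sum(aspect_weights)
    return [
        sum(w * s for w, s in zip(aspect_weights, row)) / total
        for row in score_matrix
    ]


def _softplus(z):
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def weighted_dpo_loss(chosen_log_ratios, chosen_scores,
                      rejected_log_ratios, rejected_scores, beta=0.1):
    """Illustrative segment-weighted preference loss (an assumption).

    Each *_log_ratios entry is log pi(segment) - log pi_ref(segment).
    With all scores equal to 1, the weighted sums are the sequence-level
    log-ratios and this is the standard DPO loss.
    """
    r_chosen = sum(s * l for s, l in zip(chosen_scores, chosen_log_ratios))
    r_rejected = sum(s * l for s, l in zip(rejected_scores, rejected_log_ratios))
    # -log(sigmoid(x)) == softplus(-x)
    return _softplus(-beta * (r_chosen - r_rejected))


# Usage: two segments scored on three aspects with uniform weights.
segments = split_into_segments("Paris is the capital. It is in France!")
per_segment = segment_scores([[4, 2, 3], [1, 1, 1]], [1, 1, 1])  # [3.0, 1.0]
```

In this sketch, segments with higher aggregated scores contribute more to the implicit reward, which is one simple way to let fine-grained annotations shape a pairwise objective.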

Comments: The first four authors contributed equally, 25 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2410.19720 [cs.CL] (or arXiv:2410.19720v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2410.19720

Submission history

From: Yancheng He
[v1] Fri, 25 Oct 2024 17:47:35 UTC (9,225 KB)
