The paper “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” introduces Direct Preference Optimization (DPO), an algorithm for fine-tuning language models to align with human preferences without complex reinforcement learning procedures. It simplifies Reinforcement Learning from Human Feedback (RLHF) by removing the separate reward-model training and reinforcement-learning loop, while still learning from the same human preference data.
Directly Modified Reward Function: DPO uses human preferences to modify the reward function directly, employing a classification loss to align the model's outputs with those preferences. Rather than relying solely on reward signals from the environment, it leverages comparisons between different trajectories to guide learning. The model is given pairs of trajectories together with a label indicating which one is preferred, and this preference data is used to train the policy directly. Predicting preferences can then be framed as a binary classification problem: for a given pair of trajectories, the model must predict which one is preferred, and the classification loss measures the discrepancy between the predicted and actual preferences. A common choice for this kind of binary classification is the binary cross-entropy loss. The overall training objective minimizes this classification loss across all pairs in the dataset, which encourages the policy to produce trajectories that align with the observed preferences.
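
As a concrete illustration, here is a minimal PyTorch-style sketch of that pairwise classification loss, following the DPO objective from the paper. The function and variable names are illustrative, and the per-sequence log-probabilities under the trained policy and a frozen reference model are assumed to be computed elsewhere.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise classification loss on preference pairs (DPO-style).

    Each argument is a tensor of per-sequence log-probabilities
    (log pi(y | x) summed over tokens) for the preferred ("chosen") and
    dispreferred ("rejected") completions.
    """
    # Implicit rewards: scaled log-ratio between policy and reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Negative log-likelihood that the chosen completion is preferred,
    # i.e. -log sigmoid(reward_chosen - reward_rejected).
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```

Because the loss depends only on log-probability ratios, no explicit reward model is trained or queried during optimization.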

RLHF and Proximal Policy Optimization: RLHF first trains a reward model on preference data labeled by humans, and then uses PPO to optimize the policy against that reward model. These RLHF steps are shown in the diagram below, from the RLHF paper. PPO learns indirectly, through interactions with the environment, updating the policy to maximize the learned reward within a reinforcement learning framework. The policy here is a mapping from states to a probability distribution over actions.
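
For comparison, below is a minimal sketch of PPO's clipped surrogate objective used in that policy-optimization step; the advantage estimates are assumed to come from the learned reward model, and the function and variable names are illustrative.

```python
import torch

def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO.

    new_logps / old_logps: log-probabilities of the sampled actions under the
    current and behavior policies; advantages: advantage estimates derived
    from the learned reward model.
    """
    ratio = torch.exp(new_logps - old_logps)  # probability ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two terms; returned here as a loss.
    return -torch.min(unclipped, clipped).mean()
```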

Direct Preference Optimization (DPO), by contrast, shapes the reward directly from human preference data. Here is a high-level overview of the equations used:
- Preference Model:
- Let θ be the parameters of the model.
- Let τ1 and τ2 be two trajectories (or outputs) being compared.
- The preference model P(τ1≻τ2∣θ) indicates the probability that humans prefer τ1 over τ2.
- Logistic Function for Preferences:
- The preference probability is modeled using a logistic function: P(τ1≻τ2∣θ) = exp(R(τ1∣θ)) / (exp(R(τ1∣θ)) + exp(R(τ2∣θ)))
- R(τ∣θ) is the reward function for trajectory τ.
- Loss Function:
- The loss function L(θ) is defined as the negative log-likelihood of the human preferences: L(θ) = −∑(τ1,τ2)∈D log P(τ1≻τ2∣θ)
- D is the dataset of human preference comparisons.
- Optimization:
- The model parameters θ are optimized by minimizing the loss function L(θ), as in the sketch below.
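
Putting these pieces together, here is a small sketch of the preference probability and its negative log-likelihood; since exp(R1) / (exp(R1) + exp(R2)) = sigmoid(R1 − R2), the loss reduces to a log-sigmoid of the reward difference. The function name and tensor arguments are illustrative.

```python
import torch.nn.functional as F

def preference_nll(rewards_tau1, rewards_tau2):
    """Negative log-likelihood of the observed preferences.

    rewards_tau1 / rewards_tau2: R(τ∣θ) for the preferred and dispreferred
    trajectory in each comparison pair, as tensors over the dataset D.
    """
    # P(τ1 ≻ τ2 ∣ θ) = exp(R1) / (exp(R1) + exp(R2)) = sigmoid(R1 - R2)
    # L(θ) = -Σ log P(τ1 ≻ τ2 ∣ θ) over all pairs in D
    return -F.logsigmoid(rewards_tau1 - rewards_tau2).sum()
```

In practice θ is updated by gradient descent on this quantity; averaging over the batch with .mean() rather than summing is also common.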