RLHF and DPO: Aligning Models to Be Helpful

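A pretrained causal LM completes text. A base model is capable but not particularly cooperative: it does not reliably follow instructions, refuse harmful requests, or behave like a helpful assistant. Alignment is the post-training stage that turns a base model into something people actually want to talk to.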

RLHF (Reinforcement Learning from Human Feedback)

The classic recipe (Ouyang et al., 2022, the approach behind ChatGPT) has three stages:

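Supervised fine-tuning (SFT). Fine-tune the base model on curated demonstrations, where humans write out what a good answer looks like.
Reward model. Show humans pairs of completions for the same prompt and have them pick the better one. A smaller model is then trained to predict those human preferences from the completion pairs.
PPO (Proximal Policy Optimization). Use the reward model as the reward signal for reinforcement learning. The policy (the LLM) generates completions, the reward model scores them, and PPO updates the policy to increase reward. A KL-divergence penalty against the SFT model keeps the policy close to its starting point, which limits reward hacking.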
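The reward-model and PPO stages above can be sketched numerically. This is a minimal illustration in plain Python, not the implementation from the paper: it assumes the common pairwise (Bradley-Terry style) form for the reward-model loss and a sequence-level KL estimate for the penalty. The function names and the `kl_coef` parameter are hypothetical, chosen only for this example.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pairwise_reward_loss(r_chosen, r_rejected):
    # Stage 2 (assumed Bradley-Terry form): the loss is small when the
    # reward model scores the human-preferred completion higher.
    return -log_sigmoid(r_chosen - r_rejected)

def kl_penalized_reward(reward, policy_logp, sft_logp, kl_coef=0.1):
    # Stage 3 (sequence-level sketch): subtract a penalty proportional to
    # how much more likely the policy makes this completion than the SFT
    # model does. kl_coef is a hypothetical name for the penalty weight.
    return reward - kl_coef * (policy_logp - sft_logp)

# The reward model agrees with the human label: low loss.
print(pairwise_reward_loss(2.0, 0.0))
# The reward model disagrees with the human label: high loss.
print(pairwise_reward_loss(0.0, 2.0))
# A completion the policy likes far more than the SFT model keeps less reward.
print(kl_penalized_reward(1.0, policy_logp=-10.0, sft_logp=-12.0))
```

The penalty term is what ties the policy to the SFT model: a completion can only earn a high shaped reward if the reward model likes it and the policy has not drifted far to produce it.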

DPO (Direct Preference Optimization)

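DPO (Rafailov et al., 2023) skips the explicit reward model and the PPO loop. The insight: the KL-regularized RLHF objective has a closed-form optimal policy, so the reward can be rewritten in terms of the policy itself, which yields an objective that optimizes the policy directly on preference pairs. Loss: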

L = -log σ(β × (log P(y_w|x) - log P_ref(y_w|x) - log P(y_l|x) + log P_ref(y_l|x)))

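Here y_w is the preferred completion, y_l is the dispreferred one, and P_ref is the probability under the frozen SFT-only model. DPO is simpler, more stable, and uses less compute. It is reported to match or exceed PPO quality on most benchmarks. It is used by Llama 3 fine-tunes, Mistral Instruct variants, and a large share of community chat models.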
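To make the loss concrete, here is a small worked example in plain Python for a single preference pair. It is only a sketch: the log-probability values are made up for illustration, and `dpo_loss_single` is a hypothetical helper name. It evaluates the formula above term by term.

```python
import math

def dpo_loss_single(policy_logp_w, policy_logp_l,
                    ref_logp_w, ref_logp_l, beta=0.1):
    # Margin inside the sigmoid: how much more the policy has shifted
    # toward y_w than toward y_l, relative to the reference model.
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, loss is log 2 (about 0.693).
start = dpo_loss_single(-12.0, -10.0, -12.0, -10.0)

# Policy raised y_w by 2 nats and lowered y_l by 1 nat: the loss drops.
improved = dpo_loss_single(-10.0, -11.0, -12.0, -10.0)

# Policy moved the wrong way (prefers y_l more than the reference does).
worse = dpo_loss_single(-13.0, -9.0, -12.0, -10.0)

print(start, improved, worse)
```

Only differences relative to the reference matter: the absolute log-probabilities of y_w and y_l cancel out whenever the policy and reference agree, which is why training starts at a loss of log 2.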

Code

DPO loss in PyTorch

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l,
             ref_logp_w, ref_logp_l,
             beta=0.1):
    # Each argument is a (B,) tensor of sequence log-probs, obtained by
    # summing the log-probs of the actual tokens from (B, L, vocab) logits:
    #   policy_logp_w  = log P(y_w | x, policy)
    #   policy_logp_l  = log P(y_l | x, policy)
    #   ref_logp_w     = log P(y_w | x, reference)
    #   ref_logp_l     = log P(y_l | x, reference)
    # The reference model is frozen, so compute its log-probs under torch.no_grad().
    diff = beta * ((policy_logp_w - ref_logp_w)
                   - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(diff).mean()
# That's it. No reward model, no value head, no PPO bookkeeping.
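The loss above works on per-sequence log-probs that are assumed to be gathered already. One way to get them from raw logits is sketched below. The helper name `sequence_logp` and the `mask` argument are assumptions for this example, and the sketch assumes the logits have already been shifted so that position t scores the token at position t.

```python
import torch

def sequence_logp(logits, labels, mask):
    """Sum the log-probs of the tokens in `labels` over masked positions.

    logits: (B, L, vocab) float tensor
    labels: (B, L) long tensor of the actual token ids
    mask:   (B, L) float tensor, 1.0 on completion tokens, 0.0 on prompt/padding
    """
    logp = torch.log_softmax(logits, dim=-1)                         # (B, L, vocab)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # (B, L)
    return (token_logp * mask).sum(dim=-1)                           # (B,)

# Toy inputs standing in for a model's output on a batch of 2 sequences.
torch.manual_seed(0)
B, L, V = 2, 5, 11
logits = torch.randn(B, L, V)
labels = torch.randint(0, V, (B, L))
mask = torch.tensor([[0., 0., 1., 1., 1.],
                     [0., 1., 1., 1., 0.]])

logp = sequence_logp(logits, labels, mask)
print(logp.shape)  # one summed log-prob per sequence
```

Masking out the prompt tokens matters: the loss is defined on log P(y | x), so only the completion tokens should contribute to the sum.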
