TRL: RLHF, DPO, Preference Optimization

~24 min · ops, trl, rlhf

TRL is the alignment toolkit

trl (Transformer Reinforcement Learning) is the canonical Python library for alignment training. It covers SFT (covered earlier), DPO (Direct Preference Optimization), PPO (Proximal Policy Optimization, used for RLHF), KTO, IPO, ORPO, GRPO, and other preference and RL methods. As of 2026, DPO is the dominant default for alignment: it skips the reward-model + PPO loop and trains directly on preference pairs.

DPO in 30 seconds

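DPO takes (prompt, chosen, rejected) triples. Its objective uses a closed-form result from the RLHF objective to raise the probability of the chosen response relative to the rejected one, measured against a frozen reference model, so it mirrors RLHF without running RL. Training looks like SFT with a different loss: it converges quickly and needs no separate reward model.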
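To make the objective concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log-probabilities of each response under the trained policy and under the reference model; the function and argument names are illustrative and are not part of trl.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss (illustrative helper, not a trl function)."""
    # How far the policy has moved from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favors chosen over rejected
    # more strongly than the reference does.
    margin = beta * (chosen_shift - rejected_shift)
    return -log_sigmoid(margin)


# Policy identical to the reference: the margin is 0 and the loss is ln 2.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy now prefers the chosen response more than the reference does.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start, improved)
```

At the start of training the policy equals the reference, so every pair contributes ln 2 (about 0.693). The loss only drops as the policy separates chosen from rejected, and beta scales how quickly that separation is rewarded.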

Code

trl DPO on a preference dataset · python

# pip install trl
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch

base = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Reference model (frozen): DPO compares the trained model's log-probs against this baseline
ref = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Preference dataset: each row pairs a chosen and a rejected response to the same prompt
ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

cfg = DPOConfig(
    output_dir="./dpo-out",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-7,           # DPO uses a low lr (often 5e-7 to 1e-6)
    bf16=True,
    beta=0.1,                     # KL strength: higher keeps the policy closer to the reference
)

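# Recent TRL releases take the tokenizer as processing_class; older releases used tokenizer=
trainer = DPOTrainer(model=model, ref_model=ref, args=cfg, processing_class=tok, train_dataset=ds)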
trainer.train()
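Before training on your own data, it can help to check that each row is a usable preference pair. The sketch below assumes the explicit-prompt layout, with plain strings under the keys prompt, chosen, and rejected. The helper is_valid_preference_row and the sample row are hypothetical, written only for illustration; conversational datasets that store lists of messages would need a different check.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def is_valid_preference_row(row: dict) -> bool:
    """Hypothetical sanity check for one explicit-prompt preference row."""
    for key in REQUIRED_KEYS:
        value = row.get(key)
        # Every field must be a non-empty string.
        if not isinstance(value, str) or not value.strip():
            return False
    # A pair with identical responses carries no preference signal.
    return row["chosen"] != row["rejected"]


# Made-up sample row, not taken from any real dataset.
example_row = {
    "prompt": "Explain what a mutex is in one sentence.",
    "chosen": "A mutex is a lock that lets only one thread enter a critical section at a time.",
    "rejected": "It is a thing in programming.",
}
print(is_valid_preference_row(example_row))
```

Filtering out invalid rows before constructing the trainer tends to surface data problems earlier than a failure in the middle of tokenization.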

Exercise

Start from the SFT-tuned model from the training track. DPO-tune it on a small preference dataset (for example, 200 examples from trl-lib/ultrafeedback_binarized). Compare outputs on 10 held-out prompts: SFT-only versus SFT+DPO. Note which axes change (helpfulness, conciseness, refusal style).
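For the comparison step, a rough first pass can be automated before you read the outputs by hand. The sketch below assumes you have already generated two lists of strings, one per model, for the same held-out prompts. The refusal markers and the helper names are hypothetical heuristics, not a standard metric, and helpfulness still needs human judgment.

```python
from statistics import mean

# Hypothetical phrases used to flag refusals; adjust to your model's style.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def summarize(outputs: list[str]) -> dict:
    """Crude proxies: word count for conciseness, marker hits for refusals."""
    return {
        "mean_words": mean(len(text.split()) for text in outputs),
        "refusals": sum(
            any(marker in text.lower() for marker in REFUSAL_MARKERS)
            for text in outputs
        ),
    }


def compare(sft_outputs: list[str], dpo_outputs: list[str]) -> dict:
    """Return the SFT+DPO value minus the SFT-only value for each proxy."""
    if len(sft_outputs) != len(dpo_outputs):
        raise ValueError("both models must answer the same prompts")
    before, after = summarize(sft_outputs), summarize(dpo_outputs)
    return {key: after[key] - before[key] for key in before}


# Made-up outputs standing in for real generations.
sft = ["Here is a long answer with many words", "I'm sorry, I cannot help"]
dpo = ["Short answer", "Try this instead"]
print(compare(sft, dpo))
```

A negative mean_words delta suggests the DPO model became more concise, and a negative refusals delta suggests fewer flagged refusals. Treat both as hints for where to look, not as conclusions.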
