How RL produces reasoning behavior: SFT, RLHF, GRPO

The training pipeline behind reasoning models

Modern reasoning models are produced by a multi-stage pipeline. The exact stages vary, but the canonical 2025–2026 sequence looks roughly like this:

Stage 1 — Pretraining (base model)

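As with any other LLM, this stage is autoregressive next-token prediction over a massive corpus. It produces a raw base model that "knows a lot but is not yet a useful assistant." Reasoning models do not need a special pretraining recipe at this stage; they generally start from the same base model used for their non-reasoning siblings.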
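
As a minimal sketch of the next-token objective, the loss for one sequence is the average negative log-likelihood of the tokens that actually appeared. The function name and input format here are hypothetical, chosen only for illustration; a real training loop computes this over logits in a tensor library.

```python
import math

def next_token_loss(token_probs):
    """Average negative log-likelihood over one sequence.

    token_probs[t] is the probability the model assigned to the token
    that actually appeared at position t (hypothetical input format).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model that puts more probability on the observed tokens gets a lower loss.
confident = next_token_loss([0.9, 0.8, 0.95])
unsure = next_token_loss([0.2, 0.1, 0.3])
print(confident < unsure)
```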

Stage 2 — Supervised fine-tuning (SFT)

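The base model is fine-tuned on a curated dataset of high-quality, step-by-step reasoning examples. These can be written by humans, distilled from a stronger model, or generated programmatically. SFT teaches the model what good reasoning looks like: when to break a problem into steps, how to verify, and when to backtrack.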
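
To make the shape of such a dataset concrete, here is one possible way to turn a worked solution into a prompt/completion record. The field names, the "Step i:" prefix, and the "Answer:" line are assumptions for illustration, not a format taken from any particular lab's pipeline.

```python
def format_sft_example(problem, steps, answer):
    """Build one SFT record from a problem, its reasoning steps, and the final answer.

    The record layout (prompt/completion keys, step numbering) is hypothetical.
    """
    reasoning = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, start=1))
    return {"prompt": problem, "completion": f"{reasoning}\nAnswer: {answer}"}

example = format_sft_example(
    "What is 12 * 15?",
    ["Split 15 into 10 + 5, so 12 * 15 = 12 * 10 + 12 * 5.",
     "12 * 10 = 120 and 12 * 5 = 60.",
     "Check: 120 + 60 = 180."],
    "180",
)
print(example["completion"])
```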

Stage 3 — Reinforcement learning

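This is where the magic happens. The model generates candidate solutions to problems with verifiable answers (math, coding, logic) and is rewarded for reaching the correct answer. Several RL and preference-optimization methods are used: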

PPO (Proximal Policy Optimization): the standard algorithm, used in early RLHF.
GRPO (Group Relative Policy Optimization): DeepSeek's RL recipe; it removes the separate critic by computing each response's advantage relative to a group of sampled responses.
DPO (Direct Preference Optimization): simpler than PPO; it learns directly from preference pairs without an explicit reward model.
RLVR (Reinforcement Learning with Verifiable Rewards): uses a programmatic verifier (math solver, code interpreter) instead of a learned reward model.
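
To illustrate how DPO can learn from preference pairs without a reward model, here is a sketch of its per-pair loss. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; the argument names and the default `beta` are illustrative choices, not values from the text.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    margin measures how much more the policy prefers the chosen response
    over the rejected one, compared with the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # log(1 + exp(-x)) is the same quantity as -log(sigmoid(x))
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet, loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(improved < baseline)
```

The only learned quantity is the policy's own log-probability, which is why no separate reward model appears anywhere in the computation.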

Stage 4 — Optional rejection sampling and refinement

The RL-trained model generates many candidate responses, only the highest-quality ones are kept, and the model is fine-tuned again on them. This polishing step often improves reasoning quality further, at far less compute than continuing RL.
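
The filtering step can be sketched as follows. The `generate` and `score` callables, the parameter names, and the toy stand-ins below are all hypothetical; in practice `generate` would sample from the RL-trained model and `score` would be a verifier or quality rater.

```python
def rejection_sample(prompts, generate, score, n_samples=8, keep_top=1):
    """Sample n_samples candidates per prompt and keep the keep_top best by score."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
        dataset.extend((prompt, c) for c in ranked[:keep_top])
    return dataset  # (prompt, response) pairs to fine-tune on

# Toy stand-ins so the sketch runs without a model.
_answers = iter(["3", "4", "5", "4"])

def toy_generate(prompt):
    return next(_answers)

def toy_score(prompt, candidate):
    return 1.0 if candidate == "4" else 0.0

data = rejection_sample(["2 + 2 = ?"], toy_generate, toy_score, n_samples=4, keep_top=1)
print(data)
```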

The "aha moment": emergent reasoning from pure RL

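DeepSeek-R1-Zero demonstrated that the SFT stage can be skipped entirely. Pure RL on a base model, using only outcome rewards, can make reasoning behaviors emerge spontaneously, including self-verification, "wait, let me think again" moments, and step-by-step decomposition. The model discovers reasoning strategies on its own when correctness is the only reward signal. It was one of the most surprising results of 2024–2025.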

Code

GRPO advantage computation (sketch)·python

import torch

def grpo_advantage(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: tensor of shape (group_size,), N samples for the same prompt
    mean = rewards.mean()
    std  = rewards.std() + 1e-8     # epsilon avoids division by zero when all rewards are equal
    return (rewards - mean) / std   # group-relative advantage; no critic needed
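
For a worked example with concrete numbers, the same arithmetic as the sketch above can be written in pure Python. This version uses the sample standard deviation from the `statistics` module, which matches the default behavior of `torch.std`; the function name is hypothetical.

```python
import statistics

def group_advantages(rewards):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards) + 1e-8  # sample std plus epsilon
    return [(r - mean) / std for r in rewards]

# Four samples for one prompt: two correct (reward 1.0), two wrong (reward 0.0).
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in adv])

# If every sample gets the same reward, every advantage is zero.
flat = group_advantages([1.0, 1.0, 1.0])
print(flat)
```

Correct samples get a positive advantage and wrong ones a negative advantage of equal size. A group where all samples score the same contributes no learning signal, since every advantage is zero.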

RLVR loop with a math verifier·python

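def rlvr_step(model, prompts, verifier, update_policy):
    # update_policy is passed in because this sketch does not define it;
    # any policy-gradient update (e.g. PPO or GRPO) could fill that role.
    rollouts = [model.generate(p) for p in prompts]  # one sampled response per prompt
    # Binary outcome reward: 1.0 if the verifier accepts the response, else 0.0
    rewards  = [1.0 if verifier(p, r) else 0.0 for p, r in zip(prompts, rollouts)]
    update_policy(model, rollouts, rewards)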
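
The loop above calls `verifier(prompt, response)` without showing what a verifier might look like. One simple possibility, sketched here under the assumption that each prompt has a known numeric answer, is to take the last number in the response and compare it with an answer key. The helper names and the answer-key format are hypothetical; real verifiers usually need stricter answer extraction.

```python
import re

def extract_final_answer(response):
    """Return the last number that appears in the response, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None

def make_verifier(answer_key):
    """Build a verifier(prompt, response) -> bool from a prompt-to-answer mapping."""
    def verifier(prompt, response):
        predicted = extract_final_answer(response)
        return predicted is not None and float(predicted) == float(answer_key[prompt])
    return verifier

verifier = make_verifier({"What is 12 * 15?": "180"})
print(verifier("What is 12 * 15?", "12 * 10 = 120, 12 * 5 = 60, so the answer is 180."))
print(verifier("What is 12 * 15?", "The answer is 170."))
```

Because the check is a plain program rather than a learned model, the reward cannot be gamed by text that merely sounds convincing, though it can still be fooled by a response that ends on the right number for the wrong reasons.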