Evaluation Methods

~20 min · evaluation, perplexity, llm-judge, metrics


Four evaluation approaches

1. Perplexity

How surprised the model is by the test data. Lower is better. Useful as a sanity check during training, but it says little about task quality.
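To make "how surprised" concrete, here is a minimal sketch of the standard definition: perplexity is the exponential of the mean negative log-likelihood per token. The `perplexity` helper and the example probabilities are illustrative assumptions, not part of any library; in practice the per-token log-probabilities would come from your model.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical natural-log probabilities assigned to four test tokens
confident = [math.log(0.5)] * 4  # model gives each token probability 0.5
unsure = [math.log(0.1)] * 4     # model gives each token probability 0.1

print(perplexity(confident))  # about 2: like choosing among 2 equally likely tokens
print(perplexity(unsure))     # about 10: like choosing among 10
```

A perplexity of N can be read as the model being as uncertain as if it were picking uniformly among N tokens at each step.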

2. Task-specific metrics

Accuracy, F1, BLEU, ROUGE, depending on the task. Classification → accuracy. Summarization → ROUGE. Generation quality → human preference rate.
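As a rough illustration of the classification case, the sketch below computes accuracy and binary F1 in plain Python. The helper names `accuracy` and `binary_f1` and the toy labels are assumptions made for this example; a real project would more likely rely on an established metrics library.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_f1(y_true: list[int], y_pred: list[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"F1       = {binary_f1(y_true, y_pred):.3f}")
```

Accuracy treats every example equally, while F1 focuses on the positive class, which matters when classes are imbalanced.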

3. LLM-as-judge

Use a strong model (GPT-4o, Claude) to grade outputs. On many tasks it is surprisingly consistent and far cheaper than human evaluation.

4. Human evaluation

The gold standard. Domain experts rate outputs on the relevant dimensions (accuracy, helpfulness, safety, style). Expensive, but irreplaceable for high-stakes use cases.
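One simple way to summarize expert ratings is a per-dimension average. The sketch below assumes each expert scores an output from 1 to 5 on the four dimensions named above; the scale, the `mean_by_dimension` helper, and the sample ratings are all hypothetical.

```python
from statistics import fmean

DIMENSIONS = ("accuracy", "helpfulness", "safety", "style")


def mean_by_dimension(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average the expert scores separately for each dimension."""
    if not ratings:
        raise ValueError("need at least one rating")
    return {dim: fmean(r[dim] for r in ratings) for dim in DIMENSIONS}


# Hypothetical scores from two experts for the same model output (1-5 scale)
ratings = [
    {"accuracy": 5, "helpfulness": 4, "safety": 5, "style": 3},
    {"accuracy": 4, "helpfulness": 4, "safety": 5, "style": 4},
]

summary = mean_by_dimension(ratings)
for dim, score in summary.items():
    print(f"{dim:12s} {score:.2f}")
```

Keeping the dimensions separate, rather than collapsing them into one score, makes it easier to see where a model is weak.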

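Combined evaluation strategy

Use all four together:

Perplexity for monitoring during training.
Automatic metrics for fast iteration.
LLM-as-judge for comparing model versions.
Human evaluation for final validation before deployment.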

Code

LLM-as-judge for pairwise comparison·python

import random

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Use GPT-4o to judge which of two responses is better."""
    judgment = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Compare these two responses to the prompt below.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response is better and why? Answer with "A" or "B" first, then a one-paragraph rationale focused on accuracy, helpfulness, and clarity.""",
        }],
        temperature=0,  # reduce run-to-run variation in verdicts
    )
    # content can be None, so normalize to an empty string
    return judgment.choices[0].message.content or ""


def fair_judge(prompt: str, base_resp: str, ft_resp: str) -> bool:
    """Return True if the judge prefers the fine-tuned response.

    Tip: the A/B order is randomized on every call to reduce position bias.
    """
    ft_is_b = random.random() < 0.5
    if ft_is_b:
        verdict = llm_judge(prompt, base_resp, ft_resp)  # B = ft
    else:
        verdict = llm_judge(prompt, ft_resp, base_resp)  # A = ft
    # Drop leading whitespace, quotes, and markdown emphasis before reading the letter;
    # a verdict that cannot be parsed counts as a loss for the fine-tuned model
    letter = verdict.strip().lstrip("\"'*").upper()[:1]
    return letter == ("B" if ft_is_b else "A")

Exercise

Take a fine-tuned model and its base model. Build an eval set of 30 prompts from your target use case. Run pairwise LLM-as-judge with randomized A/B positions. Report the fine-tuned vs. base win rate. Then sanity-check 5 random pairs by hand: do you and the judge agree?
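For the reporting step, a minimal sketch of the bookkeeping might look like this. It assumes you have already collected one record per prompt with a boolean saying whether the judge preferred the fine-tuned response; the record layout, the helper names, and the placeholder outcomes are hypothetical.

```python
import random
from typing import Optional


def win_rate(ft_wins: list[bool]) -> float:
    """Fraction of comparisons the fine-tuned model won."""
    if not ft_wins:
        raise ValueError("need at least one comparison")
    return sum(ft_wins) / len(ft_wins)


def sample_for_review(records: list[dict], k: int = 5,
                      seed: Optional[int] = None) -> list[dict]:
    """Pick k random records to check by hand against the judge's verdict."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Placeholder outcomes for 30 prompts; replace with your real judge results
records = [{"prompt_id": i, "ft_won": i % 3 != 0} for i in range(30)]

rate = win_rate([r["ft_won"] for r in records])
print(f"fine-tuned win rate: {rate:.1%} over {len(records)} prompts")

for record in sample_for_review(records, k=5, seed=0):
    print("review by hand:", record["prompt_id"])
```

With only 30 prompts the win rate is a noisy estimate, so treat small differences from 50% with caution.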
