Open Challenges in LLM Evaluation

Frontier problems

LLM evaluation is a young, fast-moving field, and its hardest problems are still open. Knowing what they are keeps you honest about the limits of what you can promise.

Eight unsolved evaluation problems

Creativity evaluation: How do you score a poem, a novel idea, or a joke? Many outputs are valid, taste is subjective, and there is no reference answer.
Reasoning evaluation: A correct answer is not the same as a correct reasoning process. A model can pattern-match its way to the right answer without understanding the problem.
Safety evaluation at scale: Adversarial attacks evolve faster than defenses. Today's safety eval is tomorrow's stale snapshot.
Long-context evaluation: How do you systematically test a 1M-token input? Most benchmarks are short, and long-context behavior is qualitatively different.
Multi-modal evaluation: Scoring outputs that mix text, image, video, and audio. Each modality needs its own framework.
Agent evaluation: Open-ended tasks with no single correct trajectory. How do you score "did this 47-step session do the right thing?"
Meta-evaluation: Ensuring that a metric actually predicts quality. This is the hardest meta-question.
Benchmark contamination: Public benchmarks eventually all leak into training data. The half-life of a benchmark's usefulness is shrinking.
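
As an illustration of the benchmark contamination problem, one common heuristic is to look for long word n-grams that a benchmark item shares with a document from a training corpus. The sketch below is a minimal version under stated assumptions: whitespace tokenization, lowercase matching, and a default of `n=8` are illustrative choices, and the function names `ngrams` and `overlap_ratio` are hypothetical. It also assumes you can read the training corpus at all, which is often not the case.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in corpus_doc.

    Assumption: a high ratio only suggests possible leakage; it is not proof.
    Items shorter than n tokens have no n-grams and return 0.0.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


if __name__ == "__main__":
    item = "what is the capital of france and when was it founded"
    doc = "trivia dump: what is the capital of france and when was it founded answer paris"
    print(overlap_ratio(item, doc, n=5))
```

A check like this misses paraphrased or translated leaks, so a low ratio does not show that a benchmark is clean.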

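Principle: you won't solve all of these. But understanding that evaluation is hard, and doing it anyway, puts you ahead of 90% of the teams shipping LLM features. You can't improve what you can't measure. Start measuring.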

Where the field is going

More capability-eliciting evals: designed to surface latent abilities (and risks), not just to measure performance.
More process evaluation: scoring the reasoning trace, not only the final output.
More continuous eval: production traffic feeds back into the suite in real time.
More secret holdouts: held-back datasets that appear in no training corpus.
More collaborative grading: humans, LLM judges, and classifiers ensembled.
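
The collaborative grading idea above can be sketched as a weighted ensemble that also routes disagreements to a human. This is a minimal sketch, not a standard method: the `Grades` class, the `ensemble_grade` function, the weights, and the 0.25 disagreement threshold are all hypothetical, and every score is assumed to be normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights; tune them against your own human-labelled data.
DEFAULT_WEIGHTS = {"human": 0.5, "llm_judge": 0.3, "classifier": 0.2}


@dataclass
class Grades:
    llm_judge: float                # score from an LLM judge, assumed in [0, 1]
    classifier: float               # score from a trained classifier, assumed in [0, 1]
    human: Optional[float] = None   # human rating, if one exists


def ensemble_grade(grades: Grades, weights=DEFAULT_WEIGHTS,
                   disagreement_threshold: float = 0.25) -> dict:
    """Combine the available graders and flag cases that need a human look."""
    available = {"llm_judge": grades.llm_judge, "classifier": grades.classifier}
    if grades.human is not None:
        available["human"] = grades.human

    # Renormalize over the graders that actually produced a score.
    total_weight = sum(weights[name] for name in available)
    score = sum(weights[name] * value for name, value in available.items()) / total_weight

    # When the automated graders disagree and no human has rated the case yet,
    # send it to a human instead of trusting the average.
    spread = abs(grades.llm_judge - grades.classifier)
    needs_human = grades.human is None and spread > disagreement_threshold
    return {"score": score, "spread": spread, "needs_human": needs_human}


if __name__ == "__main__":
    print(ensemble_grade(Grades(llm_judge=0.9, classifier=0.2)))
```

Routing only the high-disagreement cases to people is one way to spend a limited human-rating budget where the automated graders are least trustworthy.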

Eval mindset

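The discipline is not "do you have a perfect metric?" It is "do you have an honest measurement, and are you improving it as the system improves?" Evaluation is a practice, not an artifact. Teams that internalize this ship faster, regress less, and trust their own systems.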

Code

What a quarterly eval review meeting covers (markdown)

# Quarterly eval review — agenda

## 1. Suite health
- Pass-rate trend across releases
- Per-tag breakdown — any tag bleeding?
- Cost per eval run (trending up?)

## 2. Suite drift
- Did we add new cases? How many?
- Are old cases still relevant or should we retire them?
- Production-distribution drift since last quarter?

## 3. Meta-eval
- Correlation of automated metrics with human ratings (sample 50)
- Correlation with user-effort signals
- Are any metrics now theatre?

## 4. Red-team status
- Last red-team run: when, who, findings
- Next red-team run: scheduled?

## 5. Tooling debt
- Anything broken / fragile in the eval pipeline
- Cost optimizations to land

## 6. People
- Is the team contributing cases? If not, why not?
- Onboarding for new joiners — who runs it?
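
For the meta-eval item in the agenda (correlation of automated metrics with human ratings), a rank correlation over the sampled cases is one reasonable starting point. The sketch below computes Spearman's rank correlation with the standard library only. The function names are hypothetical, and the choice of Spearman over another statistic is an assumption, not something the agenda prescribes.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))


if __name__ == "__main__":
    metric = [0.10, 0.40, 0.35, 0.80]   # automated metric per sampled case
    human = [1, 3, 2, 5]                # human rating for the same cases
    print(spearman(metric, human))
```

A metric whose correlation with human ratings drifts toward zero from one quarter to the next is a candidate for the "theatre" question in the same agenda section.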
