A/B testing prompts in production

~14 min · evaluation, ab-testing

The eval set is not your users

A golden-set eval is a proxy. The real judge of prompt quality is real users on real traffic. Production A/B testing is how you confirm what the eval suggested.

A/B wiring

Assign cohorts deterministically by hashing user_id (or request_id).
Track outcomes per cohort: success rate, downstream conversion, complaint rate, cost.
Run long enough to reach statistical significance, typically 1–2 weeks on a low-volume system.
Pre-register the metric and the decision rule ("ship if v_new's success rate is ≥ v_old's and p < 0.05").
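
The pre-registered decision rule can be sketched as a pooled two-proportion z-test. This is a minimal illustration using only the standard library, not a prescribed method: the function names `two_proportion_z_test` and `should_ship` and the counts in the usage note are hypothetical, and the normal approximation assumes each cohort has seen a reasonably large number of requests.

```python
import math

def two_proportion_z_test(success_old: int, n_old: int,
                          success_new: int, n_new: int) -> tuple[float, float]:
    """Return (rate_new - rate_old, two-sided p-value) from a pooled z-test."""
    rate_old = success_old / n_old
    rate_new = success_new / n_new
    # Pooled success rate under the null hypothesis of no difference.
    pooled = (success_old + success_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        # All successes or all failures in both cohorts: no evidence either way.
        return rate_new - rate_old, 1.0
    z = (rate_new - rate_old) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_new - rate_old, p_value

def should_ship(success_old: int, n_old: int,
                success_new: int, n_new: int, alpha: float = 0.05) -> bool:
    """Apply the rule: v_new's success rate >= v_old's and p < alpha."""
    diff, p_value = two_proportion_z_test(success_old, n_old, success_new, n_new)
    return diff >= 0 and p_value < alpha
```

With made-up counts, `should_ship(800, 1000, 850, 1000)` passes the rule, while `should_ship(800, 1000, 805, 1000)` does not because the difference is within noise. The rule should be evaluated once, at the end of the pre-registered window, rather than re-checked daily until it passes.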

Pitfalls

Cohort imbalance from hashing on a non-uniform feature.
Novelty effect: a new prompt can look better for the first 3 days, then revert to the mean.
Spillover: the same user lands in both cohorts across sessions.
Delayed outcome metrics: wait for the full window.
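
One way to catch the cohort-imbalance pitfall early is a sample ratio mismatch check: compare the observed cohort sizes against the intended split with a chi-square goodness-of-fit test. The sketch below is an assumption-laden illustration, not part of the original lesson: the function name `srm_p_value` is hypothetical, and the 0.001 alert threshold mentioned afterwards is a common convention rather than a rule.

```python
import math

def srm_p_value(n_old: int, n_new: int, expected_new_share: float = 0.5) -> float:
    """P-value of a 1-degree-of-freedom chi-square test on cohort sizes."""
    total = n_old + n_new
    expected_new = total * expected_new_share
    expected_old = total - expected_new
    chi2 = ((n_new - expected_new) ** 2 / expected_new
            + (n_old - expected_old) ** 2 / expected_old)
    # For 1 degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

A very small p-value (for example below 0.001) suggests the split itself is broken, in which case the outcome metrics should not be trusted until the assignment is fixed. With illustrative counts, 5050 vs 4950 is consistent with a 50/50 split, while 5300 vs 4700 is not.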

Code

Hash-based cohort assignment · python

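import hashlib

def cohort(user_id: str, experiment: str) -> str:
    # Salt the hash with the experiment name so splits are independent across experiments.
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Deterministic 50/50 split: the same user always lands in the same cohort.
    return "v_new" if int(h, 16) % 100 < 50 else "v_old"

# PROMPTS (experiment name -> cohort -> prompt text) and user_id come from the application.
experiment = "support_v8"
prompt = PROMPTS[experiment][cohort(user_id, experiment)]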

Exercise

Set up an A/B test for one prompt change. Pre-register the metric, threshold, and duration. Run it. Honor the result even if it disappoints.
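
As a starting point for the exercise, a pre-registration can be written down as an immutable record before the test starts. This is only a sketch under assumptions: the `Preregistration` class, its field names, and the example values are hypothetical, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)  # frozen: the plan cannot be edited after it is created
class Preregistration:
    experiment: str
    primary_metric: str
    alpha: float          # significance threshold for the decision rule
    start: date
    duration_days: int

    @property
    def end(self) -> date:
        return self.start + timedelta(days=self.duration_days)

    def can_decide(self, today: date) -> bool:
        """Only allow a ship/no-ship decision once the full window has elapsed."""
        return today >= self.end

# Illustrative values only.
plan = Preregistration("support_v8", "success_rate", 0.05, date(2024, 1, 1), 14)
```

Freezing the record makes it awkward to quietly move the threshold or shorten the window after seeing early results, which is the point of pre-registering.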
