Online Evaluation: Production A/B Testing

가장 honest 한 eval 은 실제 user

Offline eval 은 test set 이 말하는 거 알려줘. Online A/B test 는 user 가 하는 거 알려줘. 둘 다 중요, 하지만 online 이 ground truth 에 가장 가까운 — offline 으로 label 못 하는 거 (engagement, retention, 수익화) 측정하는 유일한 방법.

Online A/B test 가 측정하는, offline 이 못 하는 것

Engagement — user 가 새 variant 와 더 오래 대화 유지?
Task completion — 어느 fraction 이 실제 goal 완료?
Satisfaction — 명시적 thumbs-up/down 또는 implicit (follow-up 안 필요).
Latency tolerance — user 가 더 느리지만 더 나은 응답 기다렸어, churn 했어?
장기 retention — user 가 다시 와?

방법론 fundamental

User (session 이 아니라) level 의 random assignment.
Pre-registered metric — test 돌기 전에 primary metric 선언. 안 그러면 chance 로 "이긴" metric 찾게 돼.
Sample size 계산 — 미리 power analysis; 결과 peek X.
Novelty 효과 캡처할 만큼 길게 — 첫 주 효과 자주 generalize X.
Guardrail metric monitor — primary metric 이겨도 latency, error rate, cost 가 폭발 안 했는지 체크.

원칙: Offline eval 이 propose, online A/B test 가 dispose. Offline 숫자만으로 major change 절대 ship X.

Direction 만이 아니라 통계적 significance

"Variant B 가 2% 더 나았어" 는 confidence interval 없으면 무의미. Arm 당 1,000 user 면 대부분 metric 의 noise floor 가 ±5%. Winner 선언 전 p-value (또는 confidence interval) 계산. statsmodels, ABBA, 또는 platform 의 built-in stats 가 처리.

Code

User level 의 random assignment·python

import hashlib

def ab_assign(user_id, variant_count=2, salt="experiment-2026-q2"):
    """Stable, deterministic, uniform assignment."""
    h = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) % variant_count
    return bucket

# Same user → same bucket every call. Different experiment → different salt.
# Avoids 'session-level' assignment (different responses on different visits).

Completion rate 위 two-proportion z-test·python

from statistics import NormalDist

def two_prop_z_test(c_a, n_a, c_b, n_b):
    p_a, p_b = c_a / n_a, c_b / n_b
    p_pool = (c_a + c_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1/n_a + 1/n_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "p_a": p_a, "p_b": p_b,
        "diff": p_a - p_b,
        "z": z,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }

print(two_prop_z_test(c_a=420, n_a=1000, c_b=465, n_b=1000))
# diff=0.045 (4.5pp), p_value~0.04 → just barely significant

Online Evaluation: Production A/B Testing

가장 honest 한 eval 은 실제 user

Online A/B test 가 측정하는, offline 이 못 하는 것

방법론 fundamental

Direction 만이 아니라 통계적 significance

Code

External links

Exercise

Progress

댓글 0