Custom and Composite Scores

When a single metric cannot tell the whole truth

Sometimes the right grade is a weighted combination: accuracy plus safety plus brevity, all multiplied by format compliance. That is a composite score. Writing one is straightforward, but the math has rules.

Composite design rules

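Every component must be a score between 0 and 1. Mix percentages with raw counts and the result is garbage.
For critical dimensions, a hard gate beats a weighted average. A safety failure multiplies the composite by 0; it does not just subtract a few points.
Always log the components, not just the composite. The composite is the headline; the components are the diagnosis.
Choose weights deliberately and write down why. "Correctness 0.5, format 0.3, brevity 0.2, because users complain about wrong answers roughly twice as often as about long ones."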
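
The first rule is easier to follow with a small helper that maps every raw measurement onto the same 0-1 scale before anything is combined. This is a minimal sketch: the helper names `from_percent` and `from_count`, the linear falloff, and the 200-word budget are assumptions made for illustration, not part of the lesson.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def from_percent(pct: float) -> float:
    """Convert a 0-100 percentage into a 0-1 score."""
    return clamp01(pct / 100.0)


def from_count(count: float, budget: float) -> float:
    """Score a raw count against a budget.

    1.0 at or under budget, falling linearly to 0.0 at twice the budget
    (an assumed shape; pick whatever curve matches your use case).
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    return clamp01(2.0 - count / budget)


scores = {
    "correctness": from_percent(90),           # 90% accuracy
    "brevity": from_count(260, budget=200),    # 260 words against a 200-word budget, about 0.7
}
print(scores)
```

Once every component passes through a normalizer like this, a weight of 0.5 means the same thing for each of them.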

Principle: A composite should be one summary of many measurements, not a single score that replaces them. Always show the components.

Weighted-average vs multiplicative

Weighted average: Good when components are independent and can compensate for one another (more brevity can offset slightly lower coherence).

Multiplicative: Good when a failure is catastrophic (safety = 0 → composite = 0, even if the answer is brilliant).
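
To see the difference concretely, the sketch below scores the same unsafe but otherwise excellent answer both ways. The function names, the sample scores, and the equal weights are illustrative assumptions.

```python
import math


def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Compensatory: a high score on one component can offset a low one."""
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total


def multiplicative(scores: dict[str, float]) -> float:
    """Non-compensatory: any component at 0 drives the result to 0."""
    return math.prod(scores.values())


# A brilliant answer that fails safety (illustrative numbers).
unsafe = {"correctness": 1.0, "brevity": 1.0, "safety": 0.0}
equal = {"correctness": 1.0, "brevity": 1.0, "safety": 1.0}

avg = weighted_average(unsafe, equal)  # 2/3: the failure is diluted
prod = multiplicative(unsafe)          # 0.0: the failure dominates
print(f"weighted average: {avg:.2f}, multiplicative: {prod:.2f}")
```

Under the weighted average the unsafe answer still scores about 0.67, which is exactly the outcome a hard gate is meant to rule out.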

Most real composites are multiplicative on the safety and format gates and weighted-average on the quality dimensions.
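
Written as a formula, that hybrid might look like the following. The symbols are notation chosen here for illustration: g_j in {0, 1} are the pass/fail gates, s_i in [0, 1] are the quality scores, and w_i are their weights.

```latex
\[
C \;=\; \Bigl(\prod_{j \in \text{gates}} g_j\Bigr)
\cdot
\frac{\sum_{i \in \text{quality}} w_i \, s_i}{\sum_{i \in \text{quality}} w_i}
\]
```

For example, with both gates passing, correctness 0.9 at weight 0.6, and brevity 0.7 at weight 0.4, this gives C = 1 · (0.54 + 0.28) / 1.0 = 0.82. If either gate is 0, C = 0 regardless of the quality scores.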

The tax on composites: calibration drift

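If the composite is "correctness * 0.5 + brevity * 0.5" and the team optimizes for it, answers that are concise but wrong can still earn a high composite. The composite did not measure what you wanted; it measured what you told the team to maximize. Recheck the composite against human ratings every quarter.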
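
Both the failure mode and the periodic check can be sketched in a few lines. The example first shows a terse wrong answer tying with a correct but wordy one under the 0.5/0.5 composite, then computes a Spearman rank correlation between composite scores and human ratings. The `spearman` helper, the sample data, and the 0.7 alert threshold are all assumptions for illustration; a real check would use the team's own ratings and threshold.

```python
def composite(correctness: float, brevity: float) -> float:
    """The 0.5/0.5 composite described in the text."""
    return correctness * 0.5 + brevity * 0.5


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


# A terse wrong answer ties with a correct but wordy one.
terse_wrong = composite(correctness=0.0, brevity=1.0)
wordy_right = composite(correctness=1.0, brevity=0.0)
print(terse_wrong, wordy_right)  # 0.5 0.5

# Hypothetical data: composite scores and human ratings (1-5) for five cases.
composites = [0.90, 0.50, 0.50, 0.75, 0.20]
human = [5, 1, 4, 4, 2]
rho = spearman(composites, human)

ALERT_BELOW = 0.7  # assumed threshold, not from the lesson
status = "recheck the weights" if rho < ALERT_BELOW else "still tracking human judgment"
print(f"rho={rho:.2f}: {status}")
```

A rank correlation is used here because human ratings and composite scores sit on different scales; only their ordering needs to agree.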

Code

Weighted composite with a hard safety gate·python

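def composite_score(
    scores: dict[str, float],
    weights: dict[str, float],
    gates: list[str],
) -> tuple[float, str]:
    """Return (composite, status) for one case; all scores are expected in [0, 1]."""
    # Hard gates: any gate below 1.0 (or missing) zeroes the composite.
    for gate in gates:
        if scores.get(gate, 0.0) < 1.0:
            return 0.0, f"hard-gate fail: {gate}"
    # Weighted average over the quality dimensions, normalized by total weight.
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(scores[k] * weights[k] for k in weights)
    return weighted / total, "ok"

score, status = composite_score(
    scores={"correctness": 0.9, "brevity": 0.7, "safety": 1.0, "format": 1.0},
    weights={"correctness": 0.6, "brevity": 0.4},
    gates=["safety", "format"],
)
print(f"{score:.2f} {status}")  # 0.82 ok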

Multi-criteria report: components first, composite last·python

# Same weights and gates as the composite_score example above.
WEIGHTS = {"correctness": 0.6, "brevity": 0.4}
GATES = ["safety", "format"]

def report_per_case(case_results):
    """Print one row per case: every component first, the composite last."""
    print(f"{'id':<12} {'corr':>5} {'brev':>5} {'safe':>5} {'fmt':>5} {'COMP':>6}")
    for r in case_results:
        comp, _ = composite_score(r['scores'], WEIGHTS, GATES)
        print(f"{r['id']:<12} {r['scores']['correctness']:>5.2f} {r['scores']['brevity']:>5.2f} "
              f"{r['scores']['safety']:>5.2f} {r['scores']['format']:>5.2f} {comp:>6.2f}")

# Sample output:
# id            corr  brev  safe   fmt   COMP
# qa.001        0.95  0.80  1.00  1.00   0.89
# qa.002        0.70  0.95  1.00  1.00   0.80
# qa.003        0.95  0.50  0.00  1.00   0.00  ← safety failed → composite 0
