Code Generation Evaluation

Code is the easiest output to evaluate, and also the hardest

Easy, because you can run it. Hard, because "the code runs" is necessary but not sufficient. Passing tests does not mean the code is good. Three quality layers matter.

Layer 1: execution correctness

Run the generated code and check it against unit tests. pass@k is the fraction of problems for which at least one of k samples passes every test. HumanEval and SWE-bench measure this layer.
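In practice pass@k is usually estimated from n ≥ k samples per problem rather than from exactly k. A sketch of the standard unbiased estimator, where n is the number of samples drawn for a problem and c is how many of them pass (the same quantities the `pass_at_k` function in the Code section takes), followed by a worked example with assumed counts n = 20 and c = 3:

```latex
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]

% Worked example for a single problem, assuming n = 20 samples and c = 3 correct:
\text{pass@}1 = 1 - \frac{\binom{17}{1}}{\binom{20}{1}} = 1 - \frac{17}{20} = 0.15
\qquad
\text{pass@}10 = 1 - \frac{\binom{17}{10}}{\binom{20}{10}} = 1 - \frac{19448}{184756} = \frac{17}{19} \approx 0.895
```

The ratio of binomials is the probability that a random size-k subset of the n samples contains no correct sample, so one minus it is the chance that at least one passes.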

Layer 2: code quality

Static analysis: linters, type checkers, security scanners (bandit, semgrep).
Style: ruff, prettier, black. These do not catch bugs, but they signal "would a human accept this PR?"
Complexity: cyclomatic complexity, nesting depth.
Documentation: docstrings, comments, type annotations.
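The complexity and documentation checks can be approximated with the standard-library `ast` module. The following is a rough sketch, not a replacement for a dedicated tool: the decision-node list is a simplification of cyclomatic complexity, an `elif` counts as one extra nesting level, and the names `approx_cyclomatic`, `max_nesting`, and `quality_metrics` are illustrative.

```python
import ast

# Node types treated as decision points (a simplification of cyclomatic complexity).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)
# Node types that open a nested block.
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def approx_cyclomatic(tree: ast.AST) -> int:
    """1 + number of decision points; each extra and/or operand adds one."""
    score = 1
    for node in ast.walk(tree):
        if isinstance(node, DECISION_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            score += len(node.values) - 1
    return score


def max_nesting(node: ast.AST, depth: int = 0) -> int:
    """Deepest chain of nested block statements below `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest


def quality_metrics(source: str) -> dict:
    """Collect a few Layer 2 signals for one generated file."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return {
        "cyclomatic": approx_cyclomatic(tree),
        "max_nesting": max_nesting(tree),
        "functions": len(funcs),
        "documented": sum(1 for f in funcs if ast.get_docstring(f)),
        "return_annotated": sum(1 for f in funcs if f.returns is not None),
    }


# Made-up sample standing in for model output.
SAMPLE = '''
def clamp(x: int, lo: int, hi: int) -> int:
    """Clamp x into [lo, hi]."""
    if x < lo:
        return lo
    if x > hi and hi >= lo:
        return hi
    return x

def first_even(rows):
    for row in rows:
        for v in row:
            if v % 2 == 0:
                return v
'''

metrics = quality_metrics(SAMPLE)
```

These numbers are best used as relative signals (for example, flagging samples far above the benchmark median) rather than as hard pass/fail thresholds.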

Layer 3: behavioral correctness beyond the tests

The generated code passes the tests, but does it match the intent? An LLM judge can compare the implementation against the natural-language spec. This is where most "looks correct but isn't" bugs hide.
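One way to wire up such a judge is sketched below. Everything here is an assumption for illustration: the prompt wording, the JSON verdict shape, and `call_llm`, a hypothetical callable that sends a prompt to whatever model you use and returns its text reply. Unparseable judge output is treated as a failure so that a broken judge never silently approves code.

```python
import json
from typing import Callable

# Illustrative prompt; doubled braces are literal braces under str.format.
JUDGE_PROMPT = """You are reviewing generated code against its specification.

Specification:
{spec}

Implementation:
{code}

Does the implementation do what the specification asks, including edge cases
the tests may not cover? Respond with JSON only:
{{"matches_intent": true or false, "reason": "<one sentence>"}}
"""


def judge_intent(spec: str, code: str, call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model whether `code` matches `spec`; fail closed on bad output."""
    raw = call_llm(JUDGE_PROMPT.format(spec=spec, code=code))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"matches_intent": False, "reason": "unparseable judge output"}
    if not isinstance(verdict, dict):
        return {"matches_intent": False, "reason": "unexpected judge output shape"}
    return {
        "matches_intent": verdict.get("matches_intent") is True,
        "reason": str(verdict.get("reason", "")),
    }


# Stub judge so the sketch runs without a model; replace with a real client.
def fake_llm(prompt: str) -> str:
    return '{"matches_intent": false, "reason": "ignores empty input"}'


verdict = judge_intent(
    spec="Return the largest element, or None for an empty list.",
    code="def largest(xs):\n    return max(xs)",
    call_llm=fake_llm,
)
```

Passing the model call in as a parameter keeps the judge testable offline and makes it easy to swap judge models.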

Principle: always run generated code in a sandbox. Always run the linters. Always use a judge that compares the implementation against the intent. Skip a layer and you ship a bug that layer would have caught.

Code-specific safety check

Generated code can contain command injection, SQL injection, hardcoded secrets, unsafe deserialization, and insecure dependencies. Running security scanners (semgrep, bandit, npm audit) over the generated code is part of the eval, not an afterthought.
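To make a few of these categories concrete, here is a minimal `ast`-based pre-check. It is a sketch under stated assumptions, not a substitute for bandit or semgrep: the risky-call list and the secret-name pattern are illustrative and deliberately small, and `quick_safety_check` is a hypothetical helper name.

```python
import ast
import re

# Illustrative, deliberately short lists; real scanners cover far more.
RISKY_CALLS = {"eval", "exec", "os.system", "pickle.load", "pickle.loads"}
SECRET_NAME = re.compile(r"(password|passwd|secret|token|api_?key)", re.IGNORECASE)


def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def quick_safety_check(source: str) -> list:
    """Return sorted (line, message) findings for a few obvious antipatterns."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in RISKY_CALLS:
                findings.append((node.lineno, f"risky call: {name}"))
            for kw in node.keywords:
                # shell=True is the usual route to command injection.
                if (kw.arg == "shell" and isinstance(kw.value, ast.Constant)
                        and kw.value.value is True):
                    findings.append((node.lineno, "call with shell=True"))
        elif isinstance(node, ast.Assign):
            value = node.value
            if isinstance(value, ast.Constant) and isinstance(value.value, str) and value.value:
                for target in node.targets:
                    if isinstance(target, ast.Name) and SECRET_NAME.search(target.id):
                        findings.append((node.lineno, f"possible hardcoded secret: {target.id}"))
    return sorted(findings)


# Made-up sample standing in for model output; the key is a dummy value.
SAMPLE = '''import os, pickle, subprocess
API_KEY = "sk-test-123"
def load(blob, cmd):
    os.system(cmd)
    subprocess.run(cmd, shell=True)
    return pickle.loads(blob)
'''

findings = quick_safety_check(SAMPLE)
```

A check like this is cheap enough to run on every sample before the heavier scanners, but it only sees literal patterns and will miss anything indirect.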

Code

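Computing pass@k·python

import math

def pass_at_k(n, c, k):
    """Standard HumanEval-style pass@k. n=samples, c=correct, k=evaluation k."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Generate n samples for each problem; run tests; count correct.
# Then compute pass@1 and pass@10 from the same data (any k <= n works;
# pass@100 would need at least 100 samples per problem).
# `benchmark`, `model`, and `run_tests` are placeholders for your own harness;
# `run_tests` is expected to return True/False per sample.
N_SAMPLES = 20
results = []
for problem in benchmark:
    samples = [model.generate(problem) for _ in range(N_SAMPLES)]
    n_correct = sum(run_tests(s, problem.tests) for s in samples)
    results.append({
        "problem": problem.id,
        "pass@1": pass_at_k(N_SAMPLES, n_correct, 1),
        "pass@10": pass_at_k(N_SAMPLES, n_correct, 10),
    })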
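The per-problem scores still need to be reduced to one benchmark number, which is conventionally the mean over problems. A self-contained sketch follows; the correct-sample counts are made up for illustration and `benchmark_pass_at_k` is a hypothetical helper name.

```python
import math
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem pass@k from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def benchmark_pass_at_k(correct_counts, n, ks=(1, 10)):
    """Average per-problem pass@k over all problems, for each k."""
    return {
        f"pass@{k}": mean(pass_at_k(n, c, k) for c in correct_counts)
        for k in ks
    }


# Made-up counts: correct samples out of 20 for four problems.
correct_counts = [0, 3, 20, 10]
scores = benchmark_pass_at_k(correct_counts, n=20)
print(scores)
```

Note how a problem with zero correct samples drags down every k equally, while a problem with a few correct samples contributes far more to pass@10 than to pass@1.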

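Security scan as part of code eval·python

import os
import subprocess
import tempfile

def security_scan(code: str) -> str:
    """Write the code to a temp file, run bandit on it, return the JSON report."""
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
        f.write(code.encode())
        path = f.name
    try:
        # bandit for Python security antipatterns.
        # bandit exits non-zero when it finds issues, so do not pass check=True.
        result = subprocess.run(
            ["bandit", "-q", "-f", "json", path],
            capture_output=True, text=True,
        )
        # also: semgrep, ruff, mypy --strict
    finally:
        # Always remove the temp file, even if bandit is missing or crashes.
        os.unlink(path)
    return result.stdout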
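The raw JSON report is more useful once it is reduced to a pass/fail signal for the eval. The sketch below assumes the report has a top-level `results` list whose entries carry an `issue_severity` field, which matches bandit's JSON format as far as I know; verify against your bandit version. The sample report is hand-written for illustration, not real scanner output, and `summarize_bandit` is a hypothetical helper name.

```python
import json
from collections import Counter


def summarize_bandit(report_json: str, fail_on=("HIGH", "MEDIUM")) -> dict:
    """Count findings by severity and decide whether the sample passes."""
    if not report_json.strip():
        # Empty output usually means the scanner did not run; fail closed.
        return {"ok": False, "counts": {}, "error": "empty report"}
    report = json.loads(report_json)
    counts = Counter(
        item.get("issue_severity", "UNDEFINED")
        for item in report.get("results", [])
    )
    ok = not any(counts[severity] for severity in fail_on)
    return {"ok": ok, "counts": dict(counts), "error": None}


# Hand-written report in the assumed shape (not real bandit output).
SAMPLE_REPORT = json.dumps({
    "results": [
        {"test_id": "B301", "issue_severity": "MEDIUM", "line_number": 6},
        {"test_id": "B404", "issue_severity": "LOW", "line_number": 1},
    ]
})

summary = summarize_bandit(SAMPLE_REPORT)
```

Treating an empty report as a failure avoids scoring a sample as clean when the scanner was not installed or crashed.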

Sandboxed test execution·python

import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, test_code: str, timeout=10):
    """Run with subprocess timeout. For real safety use Docker/firejail."""
    with tempfile.TemporaryDirectory() as d:
        with open(f"{d}/sol.py", "w") as f:
            f.write(code)
        with open(f"{d}/test.py", "w") as f:
            f.write(test_code)
        try:
            # test.py can `import sol` because both files sit in the same directory.
            # sys.executable reuses the current interpreter instead of whatever
            # "python" resolves to on PATH.
            r = subprocess.run(
                [sys.executable, f"{d}/test.py"],
                capture_output=True, text=True, timeout=timeout,
                cwd=d,
            )
            # Returns (passed, combined output); a non-zero exit counts as failure.
            return r.returncode == 0, r.stdout + r.stderr
        except subprocess.TimeoutExpired:
            return False, "timeout"
