Code Benchmarks: HumanEval, MBPP, SWE-bench

Three benchmarks that measure different coding skills

Code generation has its own family of benchmarks. The three most frequently cited each tell you something different.

HumanEval (Chen et al. 2021, OpenAI)

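164 hand-written Python programming problems, each with unit tests. The model generates code, a harness runs the tests, and the score is pass@k (usually pass@1 or pass@10): the fraction of problems for which at least one of k sampled solutions passes every test. Top models saturated it by 2024, with frontier scores above 90%. It remains useful as a basic capability check but no longer discriminates between strong models.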
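To make the pass@k score concrete, here is a minimal sketch of the unbiased estimator described in the HumanEval paper: generate n samples per problem, count the c that pass every test, and compute 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative choices, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated
    c: number of those samples that passed every test
    k: sampling budget being scored
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the plain fraction of samples that pass, which is why pass@1 is often read as "accuracy on the first try".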

MBPP (Mostly Basic Python Problems, Google 2021)

About 1,000 simpler Python problems. Its difficulty ceiling is lower than HumanEval's, so it is mostly used alongside HumanEval as a cross-check.

SWE-bench (Princeton 2023, with extensions)

This is the interesting one. SWE-bench gives the model a real GitHub issue from a real open-source repository and asks it to produce a patch that passes the project's own test suite. It is a multi-file, repo-context, agent-style task. SWE-bench Verified (a human-validated subset) has become the de facto modern benchmark for code agents. Top scores in 2026 are around 60-70%, still far from saturation, and the benchmark is a much better predictor of whether a model will help with real software work.

Rule of thumb: HumanEval tells you a model can write Python; SWE-bench tells you it can do software engineering. The gap between the two is enormous.

What none of them test

Code review: reading and critiquing code is out of scope.
Tooling: using a shell, debugger, or IDE features.
Long-running iteration: most tasks are one-shot or a small loop.
Languages other than Python and TypeScript (for the most part).

Code

HumanEval problem (sample)·python

# HumanEval/0

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # The model generates the body; the harness then runs hidden test cases.
    # pass@k = fraction of problems where at least one of k samples passes ALL tests.
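The generate-then-test loop can be sketched as follows. This is a toy illustration, not the official harness: the `COMPLETION` string stands in for model output, the `TEST` string is a made-up set of assertions in the `check(candidate)` style, and `passes` is a hypothetical helper. It runs untrusted code with plain `exec`, which a real harness would replace with a sandboxed subprocess and a timeout.

```python
# Prompt as the model sees it: signature and docstring, no body.
PROMPT = '''from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check whether any two numbers are closer to each other than threshold."""
'''

# Hypothetical model completion (the function body only).
COMPLETION = '''    ordered = sorted(numbers)
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))
'''

# Hypothetical test code; the model never sees it.
TEST = '''
def check(candidate):
    assert candidate([1.0, 2.0, 3.0], 0.5) is False
    assert candidate([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
    assert candidate([], 1.0) is False
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True only if the completed function passes every assertion."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test, namespace)                 # define check()
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True
```

A sample counts as correct only when every assertion holds, so a completion that handles the docstring examples but crashes on an edge case still scores zero for that attempt.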

SWE-bench task structure·json

{
  "repo": "django/django",
  "instance_id": "django__django-12345",
  "problem_statement": "... (real GitHub issue text)",
  "base_commit": "abc123...",
  "patch": "... (the maintainer's actual fix)",
  "test_patch": "... (tests that verify the fix)",
  "FAIL_TO_PASS": ["tests.test_foo.test_bar"],
  "PASS_TO_PASS": ["tests.test_foo.test_baz"]
}

# Model gets the repo at base_commit + the issue.
# Must produce a patch that flips FAIL_TO_PASS to passing
# without breaking any PASS_TO_PASS test.
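The pass criterion described above can be sketched as a small check. This assumes the test run has already been reduced to a mapping from test id to pass/fail; `is_resolved` and `resolve_rate` are illustrative names, not functions from the SWE-bench tooling.

```python
def is_resolved(instance: dict, test_status: dict[str, bool]) -> bool:
    """Decide whether a patch resolves one task instance.

    test_status maps a test id to True if that test passed after the
    model's patch was applied on top of base_commit.
    """
    # A test that is missing from the results is treated as a failure.
    fixed = all(test_status.get(t, False) for t in instance["FAIL_TO_PASS"])
    kept = all(test_status.get(t, False) for t in instance["PASS_TO_PASS"])
    return fixed and kept


def resolve_rate(instances: list[dict], statuses: list[dict[str, bool]]) -> float:
    """Fraction of instances resolved; this is the headline benchmark score."""
    resolved = sum(is_resolved(i, s) for i, s in zip(instances, statuses))
    return resolved / len(instances)


example = {
    "FAIL_TO_PASS": ["tests.test_foo.test_bar"],
    "PASS_TO_PASS": ["tests.test_foo.test_baz"],
}
good_run = {"tests.test_foo.test_bar": True, "tests.test_foo.test_baz": True}
regressed_run = {"tests.test_foo.test_bar": True, "tests.test_foo.test_baz": False}
```

Here `good_run` resolves the instance, while `regressed_run` does not: it fixes the target test but breaks one that was passing before.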
