Agent Evaluation

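Trajectory, not just the final answer

An LLM agent makes decisions in a loop: it picks a tool, executes it, observes the result, and decides on the next step. The final answer matters, but so does how the agent got there. Did it use the right tools? Did it loop unnecessarily? Did it succeed at an acceptable cost?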
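The loop described above can be sketched in a few lines. This is a minimal illustration, not a real agent framework: the `TOOLS` registry, the `add_one` tool, and `scripted_policy` (a stand-in for the LLM) are all hypothetical names invented for the example.

```python
from typing import Callable, Optional

# Hypothetical tool registry: the tool name and behavior are made up for illustration.
TOOLS: dict[str, Callable[[str], str]] = {
    "add_one": lambda x: str(int(x) + 1),
}

def scripted_policy(observation: str) -> Optional[tuple[str, str]]:
    """Stand-in for the LLM: keep calling add_one until the value reaches 3."""
    if int(observation) >= 3:
        return None  # no further tool call; the agent is done
    return ("add_one", observation)

def run_agent(start: str, max_steps: int = 10) -> tuple[str, list[dict]]:
    """Run the pick/execute/observe loop and record every step taken."""
    observation = start
    trace: list[dict] = []
    for _ in range(max_steps):
        decision = scripted_policy(observation)  # decide the next step
        if decision is None:
            break
        tool, arg = decision
        result = TOOLS[tool](arg)                # execute the chosen tool
        trace.append({"tool": tool, "arg": arg, "result": result})
        observation = result                     # observe the result
    return observation, trace

final, trace = run_agent("0")
```

The `trace` list is the part an evaluator cares about: it holds one record per step, so both the final answer and the path to it can be scored.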

Agent-specific metrics

Task success: did the agent achieve the goal? (binary or graded)
Tool-call accuracy: did it pick the right tool with the right arguments?
Trajectory efficiency: number of steps to completion vs. the optimal number.
Cost: total tokens, total tool calls, total wall-clock time.
Recovery: if an early step fails, does the agent recover?
Goal alignment: does the agent pursue the user's actual goal, or does it drift?
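Two of the metrics above, tool-call accuracy and recovery, can be approximated with simple functions. The definitions below are one possible reading under stated assumptions (order-insensitive matching, a per-step success flag), not a standard formula; `tool_call_accuracy` and `recovered` are hypothetical helper names.

```python
def tool_call_accuracy(called: list, expected: list) -> float:
    """Share of expected (tool, args) calls that were made, penalizing extra calls.

    Assumption: matching is order-insensitive and duplicates are not handled.
    """
    if not called and not expected:
        return 1.0
    hits = sum(1 for e in expected if e in called)
    return hits / max(len(called), len(expected))

def recovered(step_ok: list) -> bool:
    """True if at least one step failed and the last step still succeeded."""
    return (False in step_ok) and bool(step_ok[-1])

# One expected call, but the agent made an extra, unneeded search first.
called = [
    ("search", {"q": "tokyo weather"}),
    ("get_weather", {"city": "Tokyo"}),
]
expected = [("get_weather", {"city": "Tokyo"})]

accuracy = tool_call_accuracy(called, expected)
did_recover = recovered([True, False, True])
```

Dividing by the longer of the two lists means an agent is penalized both for missing an expected call and for making calls nobody asked for.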

Trajectory eval pattern

Capture every step the agent takes: thought, tool call, tool result, next thought. Score both (a) the final outcome and (b) the quality of the trajectory. "A perfect answer after 47 unnecessary tool calls" should fail the eval, not pass it.
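One way to enforce that rule is to gate the pass/fail verdict on both the outcome and a step budget. The sketch below assumes the optimal step count is known for each task; `passes_eval` and the `max_overhead` threshold of 2.0 are illustrative choices, not values from any framework.

```python
def passes_eval(success: bool, n_steps: int, optimal_steps: int,
                max_overhead: float = 2.0) -> bool:
    """Pass only if the task succeeded AND the step count stayed within budget."""
    budget = optimal_steps * max_overhead  # e.g. up to 2x the optimal step count
    return success and n_steps <= budget

# A correct answer reached after 47 wasted calls plus the 1 needed call.
wasteful = passes_eval(success=True, n_steps=48, optimal_steps=1)
# The same correct answer reached within budget.
efficient = passes_eval(success=True, n_steps=2, optimal_steps=1)
```

A hard gate like this is blunt; a graded trajectory score may suit tasks where the optimal path is not well defined.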

Principle: agent evaluation is trace-aware by necessity. If you score only the final output, you miss half of the ways an agent fails invisibly: being slow, being expensive, or getting confused mid-trajectory.

Sandboxes for safe execution

Agents that run code, browse, or modify state need a sandbox. Inspect AI's Docker sandbox is one option; SWE-bench-style ephemeral repo clones are another. Never run an unsandboxed eval against a system that touches real infrastructure.
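The ephemeral-clone idea can be illustrated with the standard library alone. The sketch below copies a working directory into a temporary location, lets an action modify the copy, and then discards it. It only isolates file state and is not a security boundary, so it does not replace a container sandbox. `run_in_ephemeral_copy` and `destructive_action` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

def run_in_ephemeral_copy(repo: Path, action: Callable[[Path], T]) -> T:
    """Copy repo into a temp dir, run action on the copy, then discard the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "repo"
        shutil.copytree(repo, work)  # the agent only ever sees the copy
        return action(work)

def destructive_action(work: Path) -> str:
    """Stand-in for an agent step that overwrites a file."""
    target = work / "main.py"
    target.write_text("overwritten by agent\n")
    return target.read_text()

# Demo: build a tiny throwaway repo, then show the original survives the run.
with tempfile.TemporaryDirectory() as src:
    repo = Path(src)
    (repo / "main.py").write_text("print('hello')\n")
    seen_by_agent = run_in_ephemeral_copy(repo, destructive_action)
    original_after = (repo / "main.py").read_text()
```

Because each run starts from a fresh copy, one task's edits cannot leak into the next, which keeps results reproducible across samples.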

Code

Trajectory capture and step-level scoring·python

from dataclasses import dataclass, field

@dataclass
class AgentStep:
    thought: str
    tool: str
    args: dict
    result: str
    cost: float       # tokens or wall-time

@dataclass
class AgentTrajectory:
    task: str
    steps: list[AgentStep] = field(default_factory=list)
    final_output: str = ""
    success: bool = False

def trajectory_metrics(traj: AgentTrajectory, optimal_steps: int):
    return {
        "success": traj.success,
        "n_steps": len(traj.steps),
        "efficiency": optimal_steps / max(len(traj.steps), 1),
        "total_cost": sum(s.cost for s in traj.steps),
        "unique_tools": len({s.tool for s in traj.steps}),
    }

Tool-call accuracy with DeepEval·python

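from deepeval import assert_test
from deepeval.metrics import ToolCorrectnessMetric, TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Metric and field names follow recent DeepEval releases;
# verify them against the version you have installed.
case = LLMTestCase(
    input="What is the weather in Tokyo right now?",
    actual_output="It is 18°C and partly cloudy in Tokyo.",
    tools_called=[
        ToolCall(name="get_weather", input_parameters={"city": "Tokyo"}),
    ],
    expected_tools=[
        ToolCall(name="get_weather", input_parameters={"city": "Tokyo"}),
    ],
)
# ToolCorrectnessMetric compares tools_called against expected_tools.
# TaskCompletionMetric relies on an LLM judge, so it needs a configured model.
assert_test(case, [ToolCorrectnessMetric(), TaskCompletionMetric()])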

Agent eval with Inspect AI and a sandbox·python

from inspect_ai import Task, task
from inspect_ai.dataset import Sample  # needed for the Sample used below
from inspect_ai.solver import use_tools, generate
from inspect_ai.tool import bash, python
from inspect_ai.scorer import includes

@task
def coding_agent_task():
    return Task(
        dataset=[Sample(
            input="Find the line count of all .py files under src/ and report the total.",
            target="42",
        )],
        solver=[
            use_tools([bash(), python()]),
            generate(),
        ],
        scorer=includes(),
        sandbox="docker",  # tools execute in isolated container
        message_limit=20,  # caps trajectory length
    )
