DeepEval: Pytest-Native Python Framework

Evals that feel like Python tests

DeepEval (v3.x as of mid-2026) is a Python-native eval framework. It integrates with pytest, ships 30+ named metrics across RAG, agentic, safety, and custom categories, and offers an optional cloud dashboard through Confident AI. It is the best fit when your eval logic lives in Python and you already use pytest.

What you get out of the box

Pytest integration: write evals as ordinary pytest tests using assert_test.
Built-in metrics: Faithfulness, AnswerRelevancy, Hallucination, Toxicity, Bias, ContextualPrecision/Recall/Relevancy, ToolCorrectness, TaskCompletion, and GEval (custom LLM-as-Judge).
End-to-end and tracing modes: black-box evaluation of the final output, or white-box evaluation over a pipeline trace.
Cloud dashboard for shared experiment history, enabled with deepeval login.
Synthesizer for generating test data.
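
To make the end-to-end versus tracing distinction in the list above concrete, here is a minimal stdlib-only sketch. It does not use DeepEval's tracing API: `Span`, `TRACE`, `traced`, `retrieve`, `generate`, and the toy corpus are hypothetical names invented for illustration. The idea is that a black-box check scores only the final answer, while a white-box check scores each recorded step.

```python
from dataclasses import dataclass
from functools import wraps


@dataclass
class Span:
    """One recorded step of the pipeline (hypothetical, not DeepEval's span type)."""
    name: str
    inputs: tuple
    output: object


TRACE: list[Span] = []  # hypothetical in-memory trace


def traced(fn):
    """Record each call's positional inputs and output as a Span."""
    @wraps(fn)
    def wrapper(*args):
        out = fn(*args)
        TRACE.append(Span(fn.__name__, args, out))
        return out
    return wrapper


# Toy corpus keyed by a keyword that must appear in the query.
CORPUS = {
    "france": "Paris is the capital of France.",
    "germany": "Berlin is the capital of Germany.",
}


@traced
def retrieve(query: str) -> list[str]:
    # Stand-in retriever: return documents whose keyword appears in the query.
    return [doc for key, doc in CORPUS.items() if key in query.lower()]


@traced
def generate(query: str, context: list[str]) -> str:
    # Stand-in generator: echo the first retrieved document.
    return context[0] if context else "I don't know."


def pipeline(query: str) -> str:
    return generate(query, retrieve(query))


answer = pipeline("What is the capital of France?")

# Black-box (end-to-end): only the final output is checked.
end_to_end_ok = "Paris" in answer

# White-box (tracing): every recorded step gets its own check.
span_checks = {
    "retrieve": lambda s: len(s.output) > 0,
    "generate": lambda s: s.output in s.inputs[1],  # answer must come from the context
}
component_ok = {s.name: span_checks[s.name](s) for s in TRACE}
```

A failing `retrieve` check with a passing end-to-end check would point at a lucky generator rather than a healthy retriever, which is the kind of signal a black-box score alone cannot give.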

Where it shines

DeepEval fits naturally into a Python codebase. The eval suite is just another tests/eval/ directory: pytest collects it and CI runs it. Custom metrics built with GEval let you define LLM-as-Judge logic through declarative criteria instead of hand-written prompts.
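
As a rough illustration of what "declarative criteria instead of hand-written prompts" can mean, the sketch below turns a criteria string into a judge prompt and a thresholded pass/fail result. It is an assumption-laden toy, not DeepEval's GEval implementation: `CriteriaMetric`, `build_prompt`, and `fake_judge` are hypothetical, and the judge is a stub where a real setup would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CriteriaMetric:
    """Hypothetical criteria-driven metric (not DeepEval's GEval)."""
    name: str
    criteria: str
    judge: Callable[[str], float]  # takes a prompt, returns a score in [0, 1]
    threshold: float = 0.5

    def build_prompt(self, actual: str, expected: str) -> str:
        # The prompt is derived from the criteria; the user never writes it by hand.
        return (
            f"Criteria: {self.criteria}\n"
            f"Expected output: {expected}\n"
            f"Actual output: {actual}\n"
            "Reply with a score between 0 and 1."
        )

    def measure(self, actual: str, expected: str) -> tuple[float, bool]:
        score = self.judge(self.build_prompt(actual, expected))
        return score, score >= self.threshold


def fake_judge(prompt: str) -> float:
    # Stub replacing the LLM call: full marks if the expected text
    # appears inside the actual output, zero otherwise.
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    expected = fields["Expected output"].lower()
    actual = fields["Actual output"].lower()
    return 1.0 if expected in actual else 0.0


correctness = CriteriaMetric(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct given the expected output.",
    judge=fake_judge,
)
score, passed = correctness.measure("Paris is the capital of France.", "Paris")
```

Swapping `fake_judge` for a function that calls a model is the only change needed to make this a real LLM-as-Judge loop, which is why the criteria string, not the prompt, is the part worth reviewing in code review.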

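Where it falls short

If your team does not use Python, the friction is real. If you want a no-code GUI, look at Braintrust or Confident AI's hosted UI.

Rule of thumb: DeepEval is the right pick when you want evals to live next to your Python code and feel like every other test.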

Code

Install and login·bash

pip install -U deepeval

# Optional: login to Confident AI for cloud dashboard
deepeval login

# Run a single test file
deepeval test run test_eval.py

# Or via pytest
pytest test_eval.py -v

test_eval.py — GEval custom metric·python

from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

def test_correctness():
    correctness = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually correct given the expected output.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.5,
    )
    case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris",
    )
    assert_test(case, [correctness])

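RAG-specific built-in metrics·python

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    # The three below are imported for reference only; this example does not use them.
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    HallucinationMetric,
)

def test_rag_answer():
    # Faithfulness: is the answer grounded in retrieval_context?
    faithfulness = FaithfulnessMetric(threshold=0.7)
    # Answer relevancy: does the answer address the input?
    relevancy = AnswerRelevancyMetric(threshold=0.7)

    case = LLMTestCase(
        input="What is the latest iPhone?",
        actual_output="The iPhone 17 Pro has a 48MP main camera.",
        retrieval_context=["iPhone 17 Pro features a 48MP main camera, A19 chip, ..."],
    )
    # Fails the test if either metric scores below its threshold.
    assert_test(case, [faithfulness, relevancy])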
