RAGAS: RAG-Specific Evaluation

A framework built specifically for RAG pipelines

RAGAS (Retrieval-Augmented Generation Assessment) provides metrics designed for the four-part failure space of a RAG system: retrieval quality, context relevance, answer faithfulness, and answer relevance. If you ship RAG, you want it.

The 4 core RAGAS metrics

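Faithfulness: Does the answer make only claims that are supported by the retrieved context? (Catches hallucination.)
Answer Relevancy: Does the answer address the question? (Catches off-topic responses.)
Context Precision: How many of the retrieved chunks are actually relevant? (Catches a retriever that brings back junk.)
Context Recall: Did the retriever find every chunk needed to answer? (Catches a retriever that misses critical information.)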

Together, they decompose "Is this RAG output good?" into four sub-questions that tell you which part failed.
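
To make the definitions concrete, here is a minimal sketch of the ratios behind three of the metrics, assuming the per-claim and per-chunk verdicts are already available as booleans. In RAGAS those verdicts come from an LLM judge, and the library's own formulas are more involved (its context precision, for example, takes chunk ranking into account), so the function names and plain fractions below are illustrative assumptions, not the library's implementation. Answer relevancy is left out because it is not a simple ratio of verdicts.

```python
from typing import Sequence


def fraction_true(verdicts: Sequence[bool]) -> float:
    """Share of True verdicts; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    # One verdict per claim in the answer: is it supported by the retrieved context?
    return fraction_true(claim_supported)


def context_precision_score(chunk_relevant: Sequence[bool]) -> float:
    # One verdict per retrieved chunk: is it relevant to the question?
    return fraction_true(chunk_relevant)


def context_recall_score(fact_retrieved: Sequence[bool]) -> float:
    # One verdict per fact in the reference answer: was it found in the retrieved chunks?
    return fraction_true(fact_retrieved)


# Hypothetical verdicts for a single question (made-up values)
faith = faithfulness_score([True, True, False])                   # 2 of 3 claims supported
precision = context_precision_score([True, False, False, True])   # 2 of 4 chunks relevant
recall = context_recall_score([True, False])                      # 1 of 2 needed facts retrieved
print(round(faith, 2), precision, recall)
```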

Why this decomposition matters

A vague "the answer is bad" tells the team nothing. RAGAS scores do: high faithfulness + low recall = the retriever missed documents. Low faithfulness + high recall = the generator hallucinates even with the right context. Each diagnosis points to a different fix.

Principle: if you ship RAG, measure faithfulness and recall as independent metrics. A composite score hides which half of the pipeline is broken.
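
A small worked example of how a composite can hide the failing half. The two score profiles are made-up numbers, and the unweighted mean is just one possible composite, chosen here for illustration.

```python
from statistics import mean

# Hypothetical scores for two pipelines (illustrative numbers only)
retriever_broken = {"faithfulness": 0.9, "context_recall": 0.5}
generator_broken = {"faithfulness": 0.5, "context_recall": 0.9}

composite_a = mean(list(retriever_broken.values()))
composite_b = mean(list(generator_broken.values()))

# Both composites come out identical, although the required fixes differ.
print(composite_a, composite_b)

# Reporting the metrics separately exposes which half failed.
for name, scores in [("A", retriever_broken), ("B", generator_broken)]:
    weakest = min(scores, key=scores.get)
    print(f"pipeline {name}: weakest metric = {weakest} ({scores[weakest]})")
```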

Implementation reality

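RAGAS uses LLM-as-Judge underneath for most metrics, so it inherits the judge's cost and biases. Trust the absolute numbers only after calibrating them against human judgments on your specific RAG corpus. Use the metrics more as a relative signal (this version vs. that version) than as a measure of absolute quality.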
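
One way to act on this, sketched under assumptions: check how often the judge agrees with human labels on a small sample from your own corpus, and compare pipeline versions on the same questions to get the relative signal. The data, the 0.5 pass threshold, and the helper names below are all hypothetical.

```python
from statistics import mean


def agreement_rate(judge_scores, human_labels, threshold=0.5):
    """Fraction of samples where the thresholded judge score matches the human pass/fail label."""
    matches = [(s >= threshold) == bool(h) for s, h in zip(judge_scores, human_labels)]
    return sum(matches) / len(matches)


def mean_paired_delta(scores_old, scores_new):
    """Average per-question change in a metric between two pipeline versions."""
    return mean([new - old for old, new in zip(scores_old, scores_new)])


# Hypothetical judge faithfulness scores and human labels (1 = faithful, 0 = not)
judge = [0.9, 0.8, 0.3, 0.7, 0.2]
human = [1, 1, 0, 0, 0]
print("judge/human agreement:", agreement_rate(judge, human))

# The same questions scored for two versions of the pipeline
v1 = [0.6, 0.7, 0.5, 0.8]
v2 = [0.7, 0.9, 0.5, 0.9]
print("mean change v1 -> v2:", round(mean_paired_delta(v1, v2), 3))
```

Low agreement suggests the absolute scores should not be read at face value on this corpus, while the paired delta can still be useful as long as both versions are scored by the same judge on the same questions.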

Code

Install and basic eval·python

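# pip install ragas datasets
# evaluate() calls a judge LLM; unless you pass your own, RAGAS defaults to
# OpenAI models, so OPENAI_API_KEY must be set.
# Column names below follow the older ragas schema; check the docs for your
# installed version, since newer releases may expect different names.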
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France's capital is Paris. It has a population of 2.1M."]],
    "ground_truth": ["Paris"],
})

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Illustrative output; actual scores vary with the judge model and ragas version:
# {'faithfulness': 1.0, 'answer_relevancy': 0.95, 'context_precision': 1.0, 'context_recall': 1.0}

Decomposing failing cases·python

# Three diagnostic profiles, three different fixes:
#
# 1) low faithfulness, high recall
#    → generator is hallucinating despite having the right context
#    → fix: tighter prompt, citation enforcement, lower sampling temperature
#
# 2) high faithfulness, low recall
#    → retriever is not finding the right chunks
#    → fix: better embeddings, hybrid search, query rewriting
#
# 3) high faithfulness, high recall, low answer relevancy
#    → retrieved correct docs, generated faithful summary, but didn't
#      answer the question the user asked
#    → fix: prompt engineering — explicitly route the answer to the question
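
The three profiles above can be turned into a small triage helper. This is a sketch: the 0.7 cutoff for "high" is an arbitrary assumption to tune on your own data, and `diagnose` is a hypothetical helper, not part of RAGAS.

```python
def diagnose(scores: dict, high: float = 0.7) -> str:
    """Map a RAGAS-style score dict to one of the diagnostic profiles."""
    faithful = scores["faithfulness"] >= high
    recalled = scores["context_recall"] >= high
    relevant = scores["answer_relevancy"] >= high

    if not faithful and recalled:
        return "generator hallucinating: tighten prompt, enforce citations, lower temperature"
    if faithful and not recalled:
        return "retriever missing chunks: better embeddings, hybrid search, query rewriting"
    if faithful and recalled and not relevant:
        return "answer off-target: steer the prompt toward the user's question"
    if faithful and recalled and relevant:
        return "no failure flagged by these three metrics"
    # Neither faithful nor well-retrieved: outside the three profiles above.
    return "multiple failures: inspect individual samples"


# Hypothetical scores for one evaluation run
example = {"faithfulness": 0.4, "context_recall": 0.9, "answer_relevancy": 0.8}
print(diagnose(example))
```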
