BLEU and ROUGE

Two old metrics that still matter when used with care

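BLEU and ROUGE were invented in the early 2000s for machine translation and summarization. They are imperfect and frequently criticized, but they remain useful for the narrow tasks they were built for: you have reference outputs and you want a fast, free, largely language-independent similarity score.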

BLEU: for translation

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between a candidate output and one or more references, combined with a brevity penalty that discourages suspiciously short outputs. Scores range from 0 to 1 (or 0 to 100 in some libraries). As a rough rule of thumb, 0.3 is decent, 0.5 is good, and 0.7+ is rare on real translation tasks.
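To make the two ingredients concrete, here is a minimal sketch of sentence-level BLEU against a single reference. It is not sacrebleu's implementation: `toy_bleu` and `ngrams` are hypothetical helper names, the input is assumed to be already tokenized and separated by spaces, and no smoothing is applied.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference, on a 0-1 scale."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as often
        # as it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: 1 if the candidate is at least as long as the
    # reference, otherwise exp(1 - ref_len / cand_len).
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


print(toy_bleu("The cat is on the mat .", "The cat sat on the mat ."))  # about 0.49
print(toy_bleu("a b c d", "a b c d e f"))  # perfect precision, but shortened: about 0.61
```

The second call shows why the brevity penalty exists: every n-gram in the candidate is correct, yet the score drops because the candidate is shorter than the reference.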

ROUGE: for summarization

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is BLEU's sibling for summarization. ROUGE-N measures n-gram overlap, ROUGE-L uses the longest common subsequence, and ROUGE-W is a weighted variant that favors consecutive matches. Higher recall means the candidate covers more of the reference.
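The difference between ROUGE-N and ROUGE-L is easiest to see on a reordered sentence. The sketch below is a simplified illustration, not the rouge-score package: `rouge_n`, `lcs_length`, and `rouge_l` are hypothetical helpers, tokenization is assumed to be lowercase whitespace splitting, and there is no stemming.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall) of n-gram overlap between two strings."""
    def grams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = grams(candidate), grams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection = clipped matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall) based on the longest common subsequence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    return lcs / max(len(cand), 1), lcs / max(len(ref), 1)


reference = "the cat sat on the mat"
candidate = "on the mat the cat sat"  # same words, different order

print(rouge_n(candidate, reference, n=1))  # (1.0, 1.0): unigrams ignore order
print(rouge_n(candidate, reference, n=2))  # (0.8, 0.8): 4 of 5 bigrams survive
print(rouge_l(candidate, reference))       # (0.5, 0.5): LCS is only 3 tokens long
```

ROUGE-1 cannot tell the two sentences apart, while ROUGE-L drops to 0.5 because only three tokens can be matched in the same order.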

Principle: BLEU and ROUGE measure surface overlap, not meaning. They reward outputs that share words with the reference and penalize creative paraphrases that say the same thing differently.

Where not to use them

Open-ended generation (creative writing, brainstorming): there are many valid outputs, and each has low overlap with any single reference.
Code generation: two programs can be functionally identical while being lexically far apart.
Dialogue: a good reply can share zero words with the reference.
Anything where meaning matters more than surface form. Use BERTScore or an LLM judge instead.
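The code-generation case can be checked directly. The sketch below uses a plain unigram F1 over Python tokens as a stand-in for an overlap metric (it is not BLEU or ROUGE); the two snippets and the helpers `code_tokens`, `unigram_f1`, and `run` are made up for illustration.

```python
import io
import tokenize
from collections import Counter

LOOP_VERSION = """
def total(values):
    result = 0
    for v in values:
        result += v * v
    return result
"""

COMPREHENSION_VERSION = """
def total(xs):
    return sum(x ** 2 for x in xs)
"""


def code_tokens(source):
    """Lexical tokens of a Python snippet, skipping whitespace-only tokens."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader) if tok.string.strip()]


def unigram_f1(a, b):
    """F1 of clipped unigram overlap between two token lists."""
    overlap = sum((Counter(a) & Counter(b)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def run(source, arg):
    """Execute a snippet and call the `total` function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](arg)


# Both versions compute the same sum of squares...
print(run(LOOP_VERSION, [1, 2, 3]), run(COMPREHENSION_VERSION, [1, 2, 3]))
# ...but fewer than half of their tokens overlap.
print(unigram_f1(code_tokens(LOOP_VERSION), code_tokens(COMPREHENSION_VERSION)))
```

For code, running tests against the output is a more faithful signal than any token-overlap score.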

Code

sacrebleu: the standard reference implementation·python

# pip install sacrebleu
from sacrebleu import corpus_bleu, sentence_bleu

refs   = [["The cat sat on the mat."]]
hyps   = ["The cat is on the mat."]
bleu = sentence_bleu(hyps[0], refs[0])   # one hypothesis, a list of its references
print(bleu.score)   # ~49 (out of 100)

# Corpus-level — better statistics, what papers report.
refs_corpus = [["The cat sat on the mat.", "It is raining."]]
hyps_corpus = ["The cat is on the mat.", "It rains today."]
corpus = corpus_bleu(hyps_corpus, refs_corpus)
print(corpus.score, corpus.bp)   # bp is the brevity penalty

ROUGE — ROUGE-1, ROUGE-2, ROUGE-L·python

# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The cat sat on the mat and watched the rain."
candidate = "A cat watched the rain from the mat."

result = scorer.score(reference, candidate)
for metric, score in result.items():
    print(f"{metric}: P={score.precision:.2f} R={score.recall:.2f} F1={score.fmeasure:.2f}")

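# Approximate output, worked by hand from the token overlaps:
# rouge1: P=0.75 R=0.60 F1=0.67   (6 shared unigrams; 8 in candidate, 10 in reference)
# rouge2: P=0.43 R=0.33 F1~0.38   (3 shared bigrams; 7 in candidate, 9 in reference)
# rougeL: P=0.50 R=0.40 F1=0.44   (LCS "cat watch the rain", 4 tokens after stemming)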

Why BLEU can mislead in real systems·python

# Two outputs that mean the same thing — BLEU disagrees.
ref  = "The doctor prescribed antibiotics for the infection."
a    = "The doctor prescribed antibiotics for the infection."   # identical
b    = "For the infection, the physician prescribed antibiotics."  # paraphrase

from sacrebleu import sentence_bleu
print(sentence_bleu(a, [ref]).score)  # ~100
print(sentence_bleu(b, [ref]).score)  # ~14: penalized for the reordering and for "physician" vs "doctor"
