BERTScore: Semantic Similarity

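When meaning matters and the exact words don't

BERTScore replaces n-gram overlap with embedding similarity. Each token in the candidate is matched to its most similar token in the reference, using BERT (or any contextual embedding model), and the per-token similarities are averaged. The result: paraphrases that say the same thing in different words score high, while lexical overlap that carries a different meaning scores lower.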

How it works

Encode the reference and the candidate with a contextual model (BERT, RoBERTa, or a modern equivalent).
For each candidate token, find the highest-similarity token in the reference; do the same in the other direction for each reference token.
Aggregate into precision (candidate side), recall (reference side), and F1.
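The steps above can be sketched in plain NumPy. This is a minimal illustration, not the bert-score library's implementation: the token embeddings are hand-made toy vectors standing in for contextual model output, the function name `greedy_bertscore` is hypothetical, and the optional IDF weighting and baseline rescaling are left out.

```python
import numpy as np

def greedy_bertscore(cand_emb, ref_emb):
    """Greedy-matching precision, recall, and F1 from token embeddings.

    cand_emb: (num_candidate_tokens, dim) array
    ref_emb:  (num_reference_tokens, dim) array
    """
    # L2-normalize rows so a dot product equals cosine similarity.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)

    # sim[i, j] = cosine(candidate token i, reference token j)
    sim = cand @ ref.T

    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# Toy 2-d "embeddings" (assumption: a real run would use model outputs).
ref_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
cand_tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

p, r, f1 = greedy_bertscore(cand_tokens, ref_tokens)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

The extra candidate token has no exact counterpart in the reference, so precision drops below 1 while recall stays at 1: every reference token still finds a perfect match.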

Why it beats BLEU/ROUGE on most tasks

BLEU treats 'physician' and 'doctor' as zero overlap. BERTScore treats them as nearly identical. On paraphrase, summarization, and translation tasks it correlates much more strongly with human judgment.
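To make the exact-match problem concrete, here is a deliberately simplified stand-in for a surface metric: plain unigram precision. It is not full BLEU (no clipping, no higher-order n-grams, no brevity penalty), and the function name and example sentences are made up for illustration.

```python
def unigram_precision(candidate, reference):
    """Fraction of candidate words that appear verbatim in the reference."""
    cand_words = candidate.lower().split()
    ref_words = set(reference.lower().split())
    matched = sum(1 for w in cand_words if w in ref_words)
    return matched / len(cand_words)

reference = "the doctor arrived"
paraphrase = "the physician arrived"   # same meaning, one synonym swapped
changed_meaning = "the doctor left"    # different meaning, one word swapped

paraphrase_score = unigram_precision(paraphrase, reference)
changed_score = unigram_precision(changed_meaning, reference)
print(paraphrase_score, changed_score)
```

Both candidates get the same score, because exact matching cannot tell a synonym from a change in meaning. An embedding-based match is what lets 'physician' earn credit against 'doctor'.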

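Rule of thumb: when you need automatic semantic comparison without calling an LLM, BERTScore is the workhorse. It captures meaning that surface metrics miss, at a nearly deterministic, predictable cost.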

What it still misses

Logical errors: "X causes Y" and "Y causes X" can score as highly similar.
Negation: "is safe" and "is not safe" share most of their embedding mass.
Numerical mistakes: "$1M" vs "$10M" look semantically close in embedding space.
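The first failure mode follows directly from the matching step, which pairs tokens without regard to position. The sketch below makes that visible with hypothetical static word vectors. That is an assumption for clarity: real contextual embeddings shift somewhat with word order, so the real score would usually be high rather than exactly 1.

```python
import numpy as np

# Hypothetical static word vectors (assumption: real BERTScore uses
# contextual embeddings, which vary with the surrounding words).
VECS = {
    "smoking": np.array([1.0, 0.0, 0.0]),
    "causes":  np.array([0.0, 1.0, 0.0]),
    "cancer":  np.array([0.0, 0.0, 1.0]),
}

def embed(sentence):
    """Stack one vector per whitespace-separated word."""
    return np.stack([VECS[w] for w in sentence.split()])

def greedy_f1(cand, ref):
    """F1 from greedy best-match cosine similarity in both directions."""
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T
    p, r = sim.max(axis=1).mean(), sim.max(axis=0).mean()
    return float(2 * p * r / (p + r))

swapped_f1 = greedy_f1(embed("cancer causes smoking"), embed("smoking causes cancer"))
print(swapped_f1)  # 1.0: the matching step ignores token order
```

If causality, negation, or exact figures matter for the task, it is worth pairing the similarity score with a separate targeted check.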

Practical considerations

Choose the model carefully. microsoft/deberta-large-mnli does better at evaluation than vanilla BERT. If you re-run on the same dataset, cache the embeddings: the first run is expensive, later runs are nearly free.
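One way to get the "first run expensive, later runs nearly free" behavior is a small cache keyed by model name and text. The sketch below shows the idea and is independent of any particular library: `EmbeddingCache` and `fake_encode` are hypothetical names, and the stand-in encoder would be replaced by a real model call.

```python
import hashlib

class EmbeddingCache:
    """In-memory cache keyed by (model name, text). Hypothetical helper."""

    def __init__(self, model_name, encode_fn):
        self.model_name = model_name
        self.encode_fn = encode_fn  # any callable: str -> embedding
        self._store = {}
        self.misses = 0             # how many times the encoder actually ran

    def _key(self, text):
        # Include the model name so a model switch never reuses stale vectors.
        raw = f"{self.model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.encode_fn(text)
        return self._store[key]

# Stand-in encoder (assumption): a real one would call the embedding model.
def fake_encode(text):
    return [float(len(text))]

cache = EmbeddingCache("microsoft/deberta-large-mnli", fake_encode)
refs = ["The doctor prescribed antibiotics.", "The battery was dead."]

first_run = [cache.get(t) for t in refs]
second_run = [cache.get(t) for t in refs]
print(cache.misses)  # 2: the second pass is served from the cache
```

Reference embeddings benefit most from this, since they stay fixed while candidates change between model versions. To persist across processes, the dictionary could be swapped for an on-disk store.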

Code

BERTScore with the official library · python

# pip install bert-score
from bert_score import score

refs = [
    "The doctor prescribed antibiotics for the infection.",
    "The car would not start because the battery was dead.",
]
hyps = [
    "For the infection, the physician prescribed antibiotics.",
    "The vehicle failed to start due to a dead battery.",
]

# F1 is what most papers report
P, R, F1 = score(hyps, refs, lang="en", model_type="microsoft/deberta-large-mnli")
for h, f in zip(hyps, F1.tolist()):
    print(f"{f:.3f}  {h}")
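# Illustrative output (exact values vary with the model and library version):
# ~0.94  For the infection, the physician prescribed antibiotics.
# ~0.91  The vehicle failed to start due to a dead battery.
# Both far above what BLEU would give for these pairs (roughly 0.30).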

Cosine similarity of embeddings: DIY version · python

# When you want fast embeddings without the BERTScore matching machinery.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def cosine_sim(a, b):
    va, vb = model.encode([a, b])
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(cosine_sim("safe to drink", "potable"))     # ~0.78
print(cosine_sim("safe to drink", "not safe"))    # ~0.55  ← negation trap
