Chatbot and Conversation Evaluation

Multi-turn changes everything

Single-turn eval asks, "Is this answer good for this question?" Conversation eval asks, "Is this conversation going well?" The second is much harder, because quality depends on context, memory, consistency, and trajectory.

Conversation-specific metrics

Turn-level quality: score each response independently (still useful).
Context retention: does the bot remember what was said earlier?
Topic coherence: do later turns follow logically from earlier ones?
Repetition: does the bot repeat itself?
Conversation completion rate: what fraction of multi-turn flows reach a successful end state?
User-effort signals: turns to resolution, escalation rate, abandonment rate.
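
Two of these metrics can be approximated without a model in the loop. The sketch below is one possible take, not a standard implementation: `repetition_rate` flags assistant turns whose word set nearly duplicates an earlier assistant turn, and `retains_fact` checks whether a later reply mentions a detail the user gave earlier. The function names, the Jaccard measure, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re


def _tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def repetition_rate(conversation, threshold=0.8):
    """Fraction of assistant turns that nearly duplicate an earlier assistant turn."""
    replies = [t["content"] for t in conversation if t["role"] == "assistant"]
    if len(replies) < 2:
        return 0.0
    repeated = sum(
        1
        for i, reply in enumerate(replies[1:], start=1)
        if any(jaccard(reply, earlier) >= threshold for earlier in replies[:i])
    )
    # The first reply has nothing to repeat, so it is excluded from the denominator.
    return repeated / (len(replies) - 1)


def retains_fact(conversation, fact_keyword, probe_index):
    """True if the turn at probe_index mentions a fact stated earlier (keyword match)."""
    return fact_keyword.lower() in conversation[probe_index]["content"].lower()


# Made-up example conversation, for illustration only.
conv = [
    {"role": "user", "content": "My order number is 48213 and it arrived damaged."},
    {"role": "assistant", "content": "Sorry to hear that. Can you describe the damage?"},
    {"role": "user", "content": "The screen is cracked."},
    {"role": "assistant", "content": "Sorry to hear that. Can you describe the damage?"},
    {"role": "user", "content": "I just told you. Which order will you replace?"},
    {"role": "assistant", "content": "I will send a replacement for order 48213."},
]

print(repetition_rate(conv))          # share of repeated assistant turns
print(retains_fact(conv, "48213", 5))  # did the last reply recall the order number?
```

Keyword matching and word overlap are weak proxies: they miss paraphrased repetition and paraphrased recall, so they are probably best treated as cheap first-pass filters ahead of a stronger judge.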

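Two evaluation patterns

Frozen-conversation eval: replay pre-recorded conversations against the new system. Fast, cheap, and deterministic, but it misses behavior that depends on the bot's actual responses, because the recorded user turns were written in reply to the old bot.
Live-simulation eval: a "user simulator" (another LLM) holds a conversation with the bot. Captures emergent behavior. Slower and more expensive, but more realistic.

Principle: run both frozen and simulated conversation evals. Frozen evals catch regressions cheaply; simulated evals catch what frozen ones cannot see.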

Production conversation telemetry

Track turns to resolution, abandonment rate, and escalation rate as live metrics. These are user-effort proxies: if they degrade, the conversation experience has gotten worse even when turn-level quality looks unchanged.
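
As a rough illustration of how these live metrics might be aggregated from logs, the sketch below assumes each finished conversation is recorded with a user-turn count and a final outcome. The `ConversationLog` record, its field names, and the three outcome labels are hypothetical; a real logging schema will differ.

```python
from dataclasses import dataclass


@dataclass
class ConversationLog:
    """Hypothetical per-conversation record; field names are assumptions."""
    user_turns: int
    outcome: str  # assumed labels: "resolved", "escalated", or "abandoned"


def effort_metrics(logs):
    """Summarize user-effort proxies over a batch of finished conversations."""
    total = len(logs)
    if total == 0:
        raise ValueError("no conversations to summarize")
    resolved = [log for log in logs if log.outcome == "resolved"]
    return {
        "completion_rate": len(resolved) / total,
        "escalation_rate": sum(log.outcome == "escalated" for log in logs) / total,
        "abandonment_rate": sum(log.outcome == "abandoned" for log in logs) / total,
        # Averaged over resolved conversations only; None when nothing resolved.
        "mean_turns_to_resolution": (
            sum(log.user_turns for log in resolved) / len(resolved) if resolved else None
        ),
    }


# Made-up sample data, for illustration only.
logs = [
    ConversationLog(3, "resolved"),
    ConversationLog(5, "resolved"),
    ConversationLog(8, "escalated"),
    ConversationLog(2, "abandoned"),
]
metrics = effort_metrics(logs)
print(metrics)
```

Computing the same summary per release or per day makes a drift in any of the four numbers visible even when turn-level scores stay flat.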

Code

Frozen-conversation replay eval·python

def replay_conversation(conversation, bot):
    """Replay a recorded conversation, scoring each new bot response.

    `judge` is assumed to be defined elsewhere and to return a score.
    """
    history = []
    scores = []
    for turn in conversation:
        if turn["role"] == "user":
            history.append(turn)
        else:  # recorded assistant turn: regenerate it with the new bot
            new_response = bot.complete(history)
            # Compare the new response to the recorded one.
            score = judge(turn, new_response, history=history)
            scores.append(score)
            # Later turns see the new response, not the recorded one.
            history.append({"role": "assistant", "content": new_response})
    return scores
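
The replay loop calls a `judge` function that the lesson does not define. One way to fill that gap for a quick dry run is a lexical stand-in such as the sketch below, which scores token-level F1 between the recorded reply and the new one. This is an assumption for illustration only: it matches the call signature used above, ignores `history`, and is not a substitute for an LLM or rubric-based judge.

```python
import re
from collections import Counter


def judge(recorded_turn, new_response, history=None):
    """Toy judge: token-level F1 between the recorded and new response, in [0, 1].

    `history` is accepted to match the replay loop's call but is ignored here.
    """
    ref = re.findall(r"\w+", recorded_turn["content"].lower())
    hyp = re.findall(r"\w+", new_response.lower())
    if not ref or not hyp:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


recorded = {"role": "assistant", "content": "Your order ships Monday."}
print(judge(recorded, "Your order ships Monday."))  # identical wording
print(judge(recorded, "No idea."))                  # no shared tokens
```

Lexical overlap penalizes correct paraphrases and rewards fluent wrong answers that reuse the same words, so scores from a judge like this are only useful for spotting large changes between runs.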

User simulator for live multi-turn eval·python

USER_SIMULATOR_PROMPT = """
You are role-playing a user trying to accomplish this goal: {goal}
Be realistic — sometimes ask follow-up questions, sometimes get confused.
End the conversation when you are satisfied or give up.

Keep messages short (1-2 sentences). Reply ONLY with what the user would say.
"""

def simulate_conversation(goal, bot, simulator, max_turns=10):
    """Let the simulator and the bot talk until the user ends it or max_turns is hit."""
    end_phrases = {"thanks", "goodbye", "that's enough"}
    history = []
    for _ in range(max_turns):
        sim_msg = simulator.complete(USER_SIMULATOR_PROMPT.format(goal=goal), history=history)
        # Ignore case and trailing punctuation so "Thanks!" also ends the conversation.
        if sim_msg.strip().lower().rstrip(".!") in end_phrases:
            break
        history.append({"role": "user", "content": sim_msg})
        bot_msg = bot.complete(history)
        history.append({"role": "assistant", "content": bot_msg})
    return history
