Chat templates: when the model is picky about how you package your words

The single most under-appreciated detail

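Instruct-tuned models were fine-tuned on conversations wrapped in a specific format: special tokens that mark turn boundaries, special tokens that mark the system / user / assistant roles, and special tokens that mark end-of-turn. Send a raw prompt without that wrapping and the model has no idea you want a chat response. It might continue your text. It might print the wrapping itself. It might never stop generating. Most of the "the model is broken" complaints I see are really "the chat template is missing".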

The fix is one helper: tokenizer.apply_chat_template. The tokenizer returned by load() already knows the right template for its model. Give it a list of role-keyed messages and it returns a correctly wrapped prompt string.

The shape

The input is a list of dicts, each with a role and a content:

messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user",   "content": "Capital of France?"},
]

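The output (when called with tokenize=False, add_generation_prompt=True) is a single string ready to feed to generate. For a Llama 3 style template it looks roughly like this (abbreviated: the Llama 3.2 template also injects knowledge-cutoff and date lines into the system block, as the verified output in the Code section shows):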

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a terse assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Send that to generate and the model knows it is the assistant's turn, knows the conversation context, and knows where to stop.
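To make the wrapping less magical, here is a minimal pure-Python sketch that reproduces the abbreviated layout shown above. The function name render_llama3_style is hypothetical and exists only for illustration; the real rendering is done by the Jinja template shipped with the tokenizer, which may add extra lines or trim whitespace differently.

```python
# Illustration only: a hand-rolled imitation of a Llama-3-style chat layout.
# `render_llama3_style` is a hypothetical helper, not part of mlx-lm.
# In real code, call tokenizer.apply_chat_template instead.

def render_llama3_style(messages, add_generation_prompt=True):
    """Wrap role/content dicts in Llama-3-style special tokens."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn: a role header, a blank line, the content, an end-of-turn token.
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(f"{msg['content']}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant header so the model knows it is its turn to speak.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user",   "content": "Capital of France?"},
]

prompt = render_llama3_style(messages)
print(prompt)
```

The point of the sketch is the final open assistant header: that is what add_generation_prompt=True contributes, and it is why the model answers rather than continuing the user's text.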

Multi-turn: just keep appending

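For a follow-up turn, append the assistant's previous reply to the message list and render the template again. mlx-lm does not track conversation state for you: you keep the message list and re-render it on every turn. That sounds like work. It is actually a feature, because it gives you full control over how the context window is managed.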
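Because you own the message list, you also decide what gets dropped when it grows too long. The following is one possible sketch, not an mlx-lm feature: a hypothetical trim_history helper that keeps system messages plus the most recent few messages. The name and the max_messages parameter are assumptions made for this example.

```python
# Hypothetical helper for illustration; mlx-lm does not ship this.

def trim_history(messages, max_messages=4):
    """Keep all system messages plus the last `max_messages` other messages."""
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:]
    # Drop a leading assistant message so the kept window starts on a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


history = [
    {"role": "system",    "content": "You are a terse assistant."},
    {"role": "user",      "content": "Capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user",      "content": "And of Germany?"},
    {"role": "assistant", "content": "Berlin."},
    {"role": "user",      "content": "And of Italy?"},
]

trimmed = trim_history(history, max_messages=3)
for m in trimmed:
    print(m["role"], "|", m["content"])
```

The trimmed list is what you would pass to apply_chat_template on the next turn. Counting messages is the crudest possible budget; counting tokens of the rendered prompt would be more precise.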

What goes wrong if you skip this

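The model answers your prompt and then keeps going, generating fake user / assistant turns.
The model prints a literal <|eot_id|> as text instead of stopping.
The model gives a confused response that ignores instructions (because it is in completion mode, not chat mode).
The model never stops until it hits max_tokens.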

All four are the same bug: a missing or wrong chat template. With an instruct model, always render the template before calling generate.
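Some of these symptoms can be spotted mechanically in generated text. The sketch below is a rough heuristic, assuming Llama-style `<|...|>` special tokens and "User:" / "Assistant:" style fake turns; template_symptoms and both regular expressions are hypothetical and will miss other formats.

```python
import re

# Heuristic patterns (assumptions for this sketch, not an exhaustive list).
SPECIAL_TOKEN = re.compile(r"<\|[a-z_]+\|>")
FAKE_TURN = re.compile(r"^(user|assistant|system)\s*:", re.IGNORECASE | re.MULTILINE)


def template_symptoms(text, hit_max_tokens=False):
    """Return a list of signs that a chat template was missing or wrong."""
    symptoms = []
    if SPECIAL_TOKEN.search(text):
        symptoms.append("literal special token in output")
    if FAKE_TURN.search(text):
        symptoms.append("model invented extra turns")
    if hit_max_tokens:
        symptoms.append("ran until max_tokens")
    return symptoms


print(template_symptoms("Paris."))
# → []
print(template_symptoms("Paris.<|eot_id|>"))
# → ['literal special token in output']
print(template_symptoms("Paris.\nUser: And of Germany?\nAssistant: Berlin."))
# → ['model invented extra turns']
```

A non-empty result is a hint to print the rendered prompt and check the template, not proof of a bug.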

Code

apply_chat_template: canonical pattern·python

from mlx_lm import load, generate

model, tok = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user",   "content": "Capital of France?"},
]

prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

print("--- rendered prompt ---")
print(prompt)
print("--- generation ---")
print(generate(model, tok, prompt=prompt, max_tokens=20, verbose=False))

# Verified output (2026-05-03):
#   --- rendered prompt ---
#   <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
#   Cutting Knowledge Date: December 2023
#   Today Date: 03 May 2026
#
#   You are a terse assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
#
#   Capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
#
#   --- generation ---
#   Paris.

Multi-turn: append the assistant reply, render again·python

from mlx_lm import load, generate

model, tok = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user",   "content": "Capital of France?"},
]

# Turn 1
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
reply  = generate(model, tok, prompt=prompt, max_tokens=20, verbose=False)
print("Turn 1:", reply)
messages.append({"role": "assistant", "content": reply})

# Turn 2 — follow-up question, with prior turns in context
messages.append({"role": "user", "content": "And of Germany?"})
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
reply  = generate(model, tok, prompt=prompt, max_tokens=20, verbose=False)
print("Turn 2:", reply)

Inspecting the raw template (debugging tool)·python

from mlx_lm import load
model, tok = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

# The Jinja template the tokenizer will use to render messages.
# Useful when you're debugging "why does my prompt look weird".
print("Chat template (first 600 chars):")
print(tok.chat_template[:600])

Exercise

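Run the multi-turn block. Confirm that both turns are answered correctly. Then run the same multi-turn conversation while skipping apply_chat_template entirely: pass the raw user message string straight to generate. Watch what happens (the model continues your text, prints the system prompt, or runs away). Write two sentences on what you noticed; this is a bug pattern you will recognize for the rest of your career.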