Special tokens: scaffolding for the model

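Special tokens are reserved IDs in the vocabulary that carry structural meaning rather than lexical meaning. They are not part of the text the user typed; they are signals that tell the model how to interpret the rest of the sequence.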
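To make "reserved IDs" concrete, here is a minimal toy sketch. The vocabulary, the ID values, and the function names `encode_text` and `encode_input` are all made up for illustration; real tokenizers use subword vocabularies and differ in how they treat special-token strings that appear in raw text.

```python
# Toy vocabulary: a few IDs are reserved for structure, the rest for words.
# All IDs here are hypothetical and chosen only for this example.
SPECIAL = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
WORDS = {"what": 3, "is": 4, "a": 5, "transformer": 6, "[UNK]": 7}


def encode_text(text: str) -> list[int]:
    """Encode user text. In this toy, user text never maps to a reserved ID,
    even if the user literally types "[CLS]"; unknown words become [UNK]."""
    return [WORDS.get(word, WORDS["[UNK]"]) for word in text.lower().split()]


def encode_input(text: str) -> list[int]:
    """Wrap the encoded text in structural markers, as the framework would."""
    return [SPECIAL["[CLS]"], *encode_text(text), SPECIAL["[SEP]"]]


print(encode_input("What is a Transformer"))
print(encode_text("[CLS] what is"))
```

The point of the split is that only `encode_input` can produce the reserved IDs: they come from the framework wrapping the text, not from the text itself.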

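Token	Role	Used by
[CLS]	Classification: the last-layer hidden state at this position represents the whole input	BERT
[SEP]	Boundary between two segments (e.g. question / context)	BERT
[PAD]	Padding that aligns variable-length inputs within a batch	Almost every encoder
[MASK]	Placeholder for masked-LM training	BERT
<|endoftext|>	Document boundary	GPT-2/3/4
<s> / </s>	Sequence start / end	Llama 1/2 (both); T5 (</s> only)
<|im_start|> / <|im_end|>	Chat message boundaries, followed by the role (system, user, assistant)	ChatML-style templates (e.g. Qwen)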
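As a sketch of how the first three rows combine in practice, the function below assembles a BERT-style question/context pair by hand. The name `build_pair` is hypothetical, the special-token IDs are assumed to match the `bert-base-uncased` vocabulary, and the word IDs in the usage line are arbitrary placeholders. A real tokenizer does all of this for you.

```python
# Assumed to match bert-base-uncased; other checkpoints may use other IDs.
PAD, CLS, SEP = 0, 101, 102


def build_pair(question: list[int], context: list[int], max_len: int) -> dict[str, list[int]]:
    """Lay out [CLS] question [SEP] context [SEP], then pad up to max_len."""
    ids = [CLS, *question, SEP, *context, SEP]
    if len(ids) > max_len:
        raise ValueError("pair does not fit in max_len")
    # Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers the rest.
    types = [0] * (len(question) + 2) + [1] * (len(context) + 1)
    n_pad = max_len - len(ids)
    return {
        "input_ids": ids + [PAD] * n_pad,
        "token_type_ids": types + [0] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


batch_row = build_pair(question=[11, 12], context=[21], max_len=8)
print(batch_row)
```

Position 0 always holds [CLS], which is why a classifier head can read the hidden state at that fixed index regardless of input length.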

Why this matters

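Modern chat models use a chat template built from special tokens. When you call tokenizer.apply_chat_template(messages), the library inserts the correct special tokens around each role. Bypassing the template and concatenating "User: ..." / "Assistant: ..." by hand is one of the most common causes of "the model behaves strangely" bugs. The model was trained on the role markers its template defines (for example <|im_start|>user in ChatML-style models), not on the string "User:".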
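The gap between the two approaches is easy to see on plain strings. The sketch below renders the same messages once in a ChatML-style layout and once as naive "Role:" text. Both function names are hypothetical, and the layout is only an approximation for illustration: real templates vary per model, so in practice the string should come from apply_chat_template.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Approximate a ChatML-style layout: each turn is wrapped in role markers."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def render_naive(messages: list[dict]) -> str:
    """The hand-concatenated version the lesson warns against."""
    lines = [f"{m['role'].capitalize()}: {m['content']}" for m in messages]
    return "\n".join(lines) + "\nAssistant:"


messages = [{"role": "user", "content": "Hi"}]
print(repr(render_chatml(messages)))
print(repr(render_naive(messages)))
```

The naive string contains none of the role markers, so a model trained on the templated layout sees an input shape it never encountered during chat fine-tuning.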

Code

Always use the official chat template (Python)

from transformers import AutoTokenizer

# Gated repo: requires accepting the Llama 3 license on the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is a Transformer?"},
]
prompt_ids = tok.apply_chat_template(
    messages,
    tokenize=True,               # return token IDs rather than a string
    add_generation_prompt=True,  # append the assistant header so the model answers next
    return_tensors="pt",
)
# The right way: the library inserts <|begin_of_text|>, the role headers
# (<|start_header_id|>system<|end_header_id|>), each message's content,
# <|eot_id|>, and so on.
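For reading purposes only, here is a hand-written approximation of the layout that comment describes. It assumes the publicly documented Llama 3 Instruct format, and `render_llama3` is a hypothetical helper, not a library function. The tokenizer's own template is the source of truth, so treat this as something to compare against the decoded output of apply_chat_template, never as a replacement for it.

```python
def render_llama3(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Approximate the Llama 3 Instruct layout as a plain string (assumed format)."""
    out = "<|begin_of_text|>"
    for m in messages:
        # Each turn: a role header, a blank line, the content, then end-of-turn.
        header = f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
        out += header + m["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a Transformer?"},
]
print(render_llama3(messages))
```

Setting this next to a ChatML-style prompt shows why templates are not interchangeable: the role markers, the turn terminator, and even the whitespace differ between model families.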

Exercise

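Tokenize a 3-message conversation (system + user + assistant) three ways: (1) Llama 3 apply_chat_template, (2) Mistral 7B Instruct apply_chat_template, (3) a hand-built 'system: ... user: ... assistant: ...' concatenation. Decode each one and compare the token counts and where the special tokens land. The hand-concatenated version should look visibly off. Note that some Mistral Instruct template versions reject a standalone system message; if yours raises an error, fold the system text into the first user message.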
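A small helper can make the comparison less manual. The sketch below is only a starting point: `summarize` is a hypothetical name, and it works on token strings, which you would obtain from the tokenizer (for example via convert_ids_to_tokens). Which strings count as special depends on the tokenizer, so the set is passed in explicitly.

```python
def summarize(tokens: list[str], specials: set[str]) -> dict:
    """Report the total token count and where the special tokens sit."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in specials]
    return {
        "n_tokens": len(tokens),
        "n_special": len(positions),
        "special_positions": positions,
    }


# Made-up token lists standing in for a templated and a hand-concatenated prompt.
templated = ["<s>", "[INST]", "Hi", "[/INST]", "Hello", "</s>"]
naive = ["user", ":", "Hi", "assistant", ":", "Hello"]
specials = {"<s>", "</s>", "[INST]", "[/INST]"}

print(summarize(templated, specials))
print(summarize(naive, specials))
```

Running it on all three tokenizations from the exercise gives one row per method; the hand-concatenated prompt should report few or no special tokens between turns.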
