Tokenizers in Depth: BPE, WordPiece, SentencePiece

Three families, one job

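A tokenizer splits text into pieces and maps each piece to the integer id that the model embeds. Three families are in use today:

Byte-Pair Encoding (BPE): repeatedly merges the most frequent adjacent pair of bytes or characters into a new token. Used by GPT-2, GPT-3, GPT-4, Llama, Qwen, Mistral, and most modern LLMs. The byte-level variant works on UTF-8 bytes, so it needs no UNK token and can encode any Unicode string.
WordPiece: encodes each word by greedy longest-match against the vocabulary. Used by BERT and DistilBERT. A ## prefix marks a subword that continues a word rather than starting one.
SentencePiece (Unigram or BPE): treats the input as a raw character stream with whitespace included, rather than pre-splitting on spaces (▁ marks a word boundary). Used by T5, mBART, ALBERT, and XLM-R.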
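
To make the first two families concrete, here is a minimal pure-Python sketch of the two core ideas: learning BPE merges from pair counts, and WordPiece-style greedy longest-match encoding. It is a toy under stated assumptions, not how any library implements it: the function names learn_bpe_merges and wordpiece_encode, the tiny corpus, and the vocabulary are all made up for illustration, and the BPE part works on characters rather than bytes.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy, character-level)."""
    # Each distinct word becomes a tuple of symbols, weighted by how often it occurs.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # The most frequent pair wins; ties go to the pair that was counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into a single symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges


def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Split one word by greedy longest-match-first, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is found in the vocab.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes UNK
        pieces.append(match)
        start = end
    return pieces


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]

toy_vocab = {"token", "##izer", "##s"}
print(wordpiece_encode("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_encode("xyz", toy_vocab))         # ['[UNK]']

# Byte-level BPE starts from UTF-8 bytes, so every string has a representation.
print(list("é".encode("utf-8")))  # [195, 169]
```

The last line shows why the byte-level variant needs no UNK: any character decomposes into bytes in the range 0 to 255, and those 256 values are always in the base vocabulary.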

What you actually need to know

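Most tokenizers have a fast, Rust-backed implementation, and AutoTokenizer loads it by default. Use the fast variant unless the model ships only a slow Python tokenizer. The speed difference is often around 10x.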

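Four methods cover 99% of what you will touch:

tokenizer(text): returns a dict-like object with input_ids and attention_mask.
tokenizer.encode(text) / tokenizer.decode(ids): round-trip between text and ids.
tokenizer.apply_chat_template(messages): renders a chat history using the model's chat template (a Jinja template stored in the tokenizer config).
tokenizer(...) with a list of strings, or the older tokenizer.batch_encode_plus(...): batch encoding.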
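
Batching is where attention_mask starts to matter: sequences of different lengths are padded to a common length, and the mask marks which positions are real tokens. The sketch below reproduces that idea in plain Python. It is a conceptual illustration, not the library's code: pad_batch is a hypothetical helper, the ids are made up, and it assumes right padding with a pad id of 0.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up ids for two sequences of different length.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

With a real tokenizer, passing a list together with padding=True yields the same two fields. Which side gets padded and which id is used depend on the tokenizer's configuration.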

Code

Inspect tokenizer behavior·python

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

text = "Hugging Face is the GitHub of AI."
ids = tok.encode(text, add_special_tokens=False)
toks = tok.convert_ids_to_tokens(ids)

print("ids:  ", ids)
print("tokens:", toks)
print("vocab size:", tok.vocab_size)  # base vocab only; len(tok) also counts added tokens
print("model_max_length:", tok.model_max_length)
print("special tokens:", tok.special_tokens_map)

Apply the chat template (the proper way to build a prompt)·python

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain Hugging Face in one sentence."},
]

prompt = tok.apply_chat_template(
    messages,
    tokenize=False,             # return a string so it can be inspected
    add_generation_prompt=True, # append the assistant header
)
print(prompt)

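# Same call, tokenized for the model (return_dict=True gives input_ids and attention_mask):
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt")
print("shape:", inputs["input_ids"].shape)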
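
To see what a chat template does conceptually, the sketch below renders the same kind of messages list by hand. It assumes a ChatML-style layout (the <|im_start|> and <|im_end|> markers), which is not the format Llama 3.1 uses; render_chatml is a hypothetical helper, and a real template is Jinja stored in the tokenizer config. This is one reason to call apply_chat_template instead of hand-building strings: each model has its own markers.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render chat messages in a ChatML-style layout (an assumed format for illustration)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers, with the role on the first line.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to write.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain Hugging Face in one sentence."},
]
print(render_chatml(messages, add_generation_prompt=True))
```

Without add_generation_prompt=True the rendered string ends after the user turn, and the model may continue the user's message instead of answering it.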

Exercise

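Pick three different model families (for example from Llama, Mistral, Qwen, T5, BERT). Load each tokenizer and compare them on the same input string: vocab size, token count for a 100-word English paragraph, special tokens, and the apply_chat_template output where the model defines a chat template. Note which models share a tokenizer family and which are unique.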