Why Tokenization? The Bridge from Text to Tensors

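Neural networks run on numbers: specifically, dense matrices of floating-point values. They can't eat the letter "A" or the word "cat" as-is. Tokenization is the bridge: a deterministic procedure that turns raw text into a sequence of integers (token IDs) from a fixed vocab, ready to be used directly in an embedding lookup.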

The pipeline has three steps:

Split: cut the text into tokens (characters, words, or subwords).
Map: turn each token into its unique integer ID from the vocab table.
Embed: index the embedding matrix with the IDs to get dense vectors (Track 3).
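Here's a minimal sketch of the three steps end to end. Everything in it is a toy assumption for illustration: the four-entry `vocab`, the whitespace `split`, the `<unk>` fallback, and the random `embedding` table (a real model learns it, and real tokenizers split into subwords).

```python
import random

# Hypothetical toy vocab; real vocabs hold tens of thousands of subword entries.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def split(text):
    # Split: naive lowercase + whitespace split (real tokenizers use subwords).
    return text.lower().split()

def to_ids(tokens):
    # Map: look up each token's ID, falling back to <unk> for unknown tokens.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

# Embed: a (vocab_size, d_model) table; random here, learned in a real model.
d_model = 4
random.seed(0)
embedding = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]

tokens = split("The cat sat down")
ids = to_ids(tokens)
vectors = [embedding[i] for i in ids]  # one d_model-sized row per token

print(tokens)                         # ['the', 'cat', 'sat', 'down']
print(ids)                            # [1, 2, 3, 0]
print(len(vectors), len(vectors[0]))  # 4 4
```

Note how "down" collapses to `<unk>`: a word-level vocab throws away anything it hasn't seen, which is one reason subword splitting exists.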

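The split step is where all the interesting stuff happens. Choose badly and the model wastes capacity learning that "the" and " the" mean the same thing, and it has to learn on its own that "preprocessing" is related to "process". Choose well and the vocab itself becomes a useful inductive bias.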
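To make that concrete, here's a rough sketch of a subword split. It uses greedy longest-match against a tiny hand-written vocab, which is a simplification and not how BPE or any particular production tokenizer works; `greedy_subword_split` and `subword_vocab` are made-up names for illustration.

```python
def greedy_subword_split(word, vocab):
    """Greedy longest-match-first split (a simplification, not BPE itself)."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocab matches: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical vocab. Note "the" and " the" are two separate entries.
subword_vocab = {"pre", "process", "ing", "the", " the"}
token_ids = {piece: i for i, piece in enumerate(sorted(subword_vocab))}

print(greedy_subword_split("preprocessing", subword_vocab))  # ['pre', 'process', 'ing']
print(greedy_subword_split("process", subword_vocab))        # ['process']
print(token_ids["the"] == token_ids[" the"])                 # False
```

With this split, "preprocessing" and "process" share the "process" piece, so they share an embedding row for free. Meanwhile "the" and " the" get different IDs, and the model has to learn that they're related.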

Why treat it as a separate stage

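Tokenization is decided once, frozen into a tokenizer file, and used identically for training and inference. You can't swap out the tokenizer of a trained model, because the embedding matrix indices would no longer mean the same thing. That's why every model card publishes the tokenizer alongside the weights, and why "which tokenizer does this model use?" is a core question, not trivia.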
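A toy sketch of why the indices stop meaning the same thing. The two vocabs and the hand-written `embedding` rows below are hypothetical stand-ins; the only point is that row i was trained for whatever token had ID i at training time.

```python
# Two hypothetical vocabs that assign different IDs to the same tokens.
vocab_a = {"hello": 0, "world": 1, "cat": 2}  # used during training
vocab_b = {"cat": 0, "hello": 1, "world": 2}  # swapped in afterwards

# Stand-in for a trained embedding matrix: row i belongs to vocab_a's ID i.
embedding = [
    [1.0, 0.0],  # learned for "hello"
    [0.0, 1.0],  # learned for "world"
    [0.5, 0.5],  # learned for "cat"
]

def embed(tokens, vocab):
    # Map tokens to IDs with the given vocab, then look up the rows.
    return [embedding[vocab[t]] for t in tokens]

tokens = ["hello", "world"]
right = embed(tokens, vocab_a)  # rows trained for "hello", "world"
wrong = embed(tokens, vocab_b)  # rows trained for "world", "cat"

print(right)  # [[1.0, 0.0], [0.0, 1.0]]
print(wrong)  # [[0.0, 1.0], [0.5, 0.5]]
```

Nothing crashes in the second case. The model just silently sees different words than the ones you typed.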

Code

The full pipeline at a glance·python

from transformers import AutoTokenizer

# Note: this repo is gated on the Hugging Face Hub; accept the license and log in first.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Tokenization decides what the model sees."
ids = tok.encode(text, add_special_tokens=False)
print(ids[:8])         # e.g. [3404, 2065, 6276, 374, ...]
print(tok.decode(ids)) # round-trips to original text

# Embedding lookup happens inside the model:
# embedding[ids] -> (seq_len, d_model) tensor

The frozen-tokenizer invariant·python

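from transformers import AutoTokenizer

# This is why you cannot mix-and-match tokenizers.
# Assumption: "gpt2" stands in for "a GPT model"; the Llama repo is gated.
tokenizer_llama = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")
ids_llama = tokenizer_llama.encode("hello world")
ids_gpt   = tokenizer_gpt.encode("hello world")
# Same string, different integers, different vocab sizes.
# Feeding ids_llama into a GPT model gives garbage, or an index error for IDs past its vocab.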

Exercise

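Pick three open-weight models from different tokenizer families (e.g. Llama 3, Mistral 7B, Qwen 2.5). Tokenize the same English paragraph with all three and record the token counts. Do the same with a Korean paragraph. Which tokenizer is most efficient for English? For Korean? What does this imply for per-token API pricing?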
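If you want a starting scaffold, here's one possible shape for the comparison. The `count_tokens` helper and the three toy encoders are assumptions made up for this sketch so it runs offline; they are not real model tokenizers, and the sample sentences are placeholders for your own paragraphs.

```python
def count_tokens(encoders, text):
    # encoders: dict of name -> function that maps a string to a token list.
    return {name: len(encode(text)) for name, encode in encoders.items()}

# Toy stand-ins so the scaffold runs without downloading anything.
encoders = {
    "whitespace": lambda s: s.split(),
    "char": lambda s: list(s),
    "utf8-bytes": lambda s: list(s.encode("utf-8")),
}

english = "the cat sat"
korean = "고양이가 앉았다"

print(count_tokens(encoders, english))  # {'whitespace': 3, 'char': 11, 'utf8-bytes': 11}
print(count_tokens(encoders, korean))   # {'whitespace': 2, 'char': 8, 'utf8-bytes': 22}
```

For the actual exercise, replace each toy entry with something like `lambda s: tok.encode(s, add_special_tokens=False)`, where `tok` is an `AutoTokenizer` loaded for that model. The byte-level row already hints at the answer: each Hangul syllable takes 3 bytes in UTF-8, so a tokenizer that leans on bytes pays more per Korean character.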
