Character vs. word vs. subword: three strategies, one winner

There are three fundamental ways to split text. Each has been tried at scale, and there's a clear winner.

Character-level

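One character (or one UTF-8 byte) = one token. The vocab is tiny (about 256 for bytes), and OOV never happens. But sequences get very long: "tokenization" alone is 12 tokens. Your context window evaporates, and the model has to learn from scratch that 't', 'h', 'e' together mean "the". ByT5 and some research models use it, but it's rare in production.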
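
To make the length cost concrete, here's a minimal sketch of byte-level tokenization. The helper names `byte_tokenize` and `byte_detokenize` are made up for this example; real byte-level models such as ByT5 also reserve a few special-token IDs on top of the 256 byte values, which this sketch leaves out.

```python
def byte_tokenize(text: str) -> list[int]:
    """One token per UTF-8 byte; token IDs are just the byte values 0-255."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Inverse of byte_tokenize: turn byte values back into a string."""
    return bytes(ids).decode("utf-8")


ids = byte_tokenize("tokenization")
print(len(ids))                  # 12: one token per ASCII letter

hangul = byte_tokenize("토큰")
print(len(hangul))               # 6: each Hangul syllable is 3 bytes in UTF-8

print(byte_detokenize(hangul))   # round-trips losslessly, no [UNK] needed
```

Nothing is ever out of vocabulary, but non-ASCII text pays even more: two visible characters already cost six tokens.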

Word-level

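One whitespace-separated word = one token. Sequences are short, but the vocab explodes: English alone runs past 100K once you count inflected forms, and anything the tokenizer didn't see in training has to be mapped to [UNK]. Typos, named entities, code identifiers all get flattened into [UNK]. This was the standard before roughly 2018; now it's effectively dead.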
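
The [UNK] problem is easy to reproduce. The sketch below builds a toy word-level vocab from a one-sentence "training corpus" (both the corpus and the `word_encode` helper are made up for illustration) and shows what happens to a typo and to an unseen word.

```python
# Toy training corpus; a real vocab would be built from millions of sentences.
train_corpus = "the cat sat on the mat"

# ID 0 is reserved for the unknown token.
vocab = {"[UNK]": 0}
for word in train_corpus.split():
    vocab.setdefault(word, len(vocab))


def word_encode(text: str) -> list[int]:
    """Whitespace split, then look up each word; unseen words become [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]


print(word_encode("the cat sat"))          # [1, 2, 3]
print(word_encode("the catt sat on GPU"))  # [1, 0, 3, 4, 0]
```

The typo "catt" and the unseen "GPU" both collapse to ID 0, so the model can't tell them apart, and the original text can't be recovered from the IDs.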

Subword

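One subword piece = one token. Common words stay whole, rare words get broken into pieces. Moderate vocab (30K-200K), moderate sequence length, and no OOV: anything can be encoded as a sequence of pieces (strictly guaranteed when the base vocabulary covers every byte or character). All modern LLMs use subword tokenization. The only debates left are which algorithm (BPE, WordPiece, Unigram, byte-level BPE) and how large the vocab should be.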
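
Here's a rough sketch of how BPE, one of the algorithms named above, gets the "common words whole, rare words in pieces" behavior: start from single characters and repeatedly merge the most frequent adjacent pair. Everything in it is illustrative: the function names (`learn_bpe`, `apply_merge`, `bpe_encode`) and the toy word list are assumptions for this example, and it works on characters within single words. Production tokenizers add pre-tokenization, byte-level fallback, and much faster data structures.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` into one symbol, left to right."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # word as symbol tuple -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(apply_merge(symbols, best))] += freq
        corpus = new_corpus
    return merges


def bpe_encode(word, merges):
    """Split into characters, then replay the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy corpus: word frequencies stand in for a real training set.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, num_merges=10)

print(bpe_encode("newest", merges))   # frequent word: few, large pieces
print(bpe_encode("slowest", merges))  # unseen word: built from known pieces
print(bpe_encode("xyz", merges))      # nothing learned: falls back to characters
```

The number of merges is the knob that sets vocab size: more merges means a bigger vocab and shorter sequences, which is exactly the trade-off the size debate is about.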

Strategy	Vocab	Sequence length	OOV?	Current use
Character / byte	~256	Very long	No	Niche (ByT5, Charformer)
Word	100K+	Short	Yes	Effectively retired
Subword	30K-200K	Medium	No	All modern LLMs

Code

Same string, three tokenizations (Python)

text = "transformer-based tokenization"

# 1. Character-level (one token per UTF-8 byte)
byte_tokens = list(text.encode('utf-8'))
print(len(byte_tokens))   # 30 tokens, one per byte

# 2. Word-level (whitespace + punctuation)
import re
words = re.findall(r"\w+|\S", text)
print(words)        # ['transformer', '-', 'based', 'tokenization']

# 3. Subword (BPE via tiktoken)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
print(enc.encode(text))  # ~5-7 tokens; common subwords kept whole
