SentencePiece: Language-Independent Tokenization

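SentencePiece is Google's tokenization library. It drops a structural assumption baked into typical BPE/WordPiece pipelines: that whitespace is a special boundary character. It treats the input as a raw Unicode stream, needs no language-specific pre-tokenization, and encodes the space as a visible character, ▁ (U+2581), which it then learns like any other symbol.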
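To make the ▁ trick concrete, here is a minimal pure-Python sketch of the whitespace escaping and why it makes decoding lossless. It does not call the library. The helper names `escape_whitespace` and `unescape_whitespace` are made up for illustration, the piece list is a hypothetical segmentation, and normalization steps that the real library applies by default are left out.

```python
WS = "\u2581"  # '▁' (U+2581), the visible stand-in for a space

def escape_whitespace(text: str) -> str:
    # Prepend a dummy space, then make every space visible.
    return (" " + text).replace(" ", WS)

def unescape_whitespace(pieces):
    # Concatenate the pieces, map the marker back to spaces,
    # and drop the dummy prefix added during escaping.
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text

escaped = escape_whitespace("Hello world")
print(escaped)   # ▁Hello▁world

# Hypothetical segmentation: pieces may split anywhere, including mid-word.
pieces = [WS + "Hello", WS + "wor", "ld"]
restored = unescape_whitespace(pieces)
print(restored)  # Hello world
```

Because the space is stored inside the pieces themselves, detokenization is plain string concatenation, with no per-language rule about where spaces go.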

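Sounds like a minor detail? It isn't. It means a single tokenizer trained on multilingual data applies the same procedure to English (space-delimited), Chinese/Japanese (no spaces), Korean (mixed), and code (where whitespace carries syntactic meaning).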

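SentencePiece supports two subword algorithms: BPE (the bottom-up one you've already seen) and Unigram. Unigram works top-down: start from a huge candidate vocab and repeatedly prune the tokens that contribute least to corpus likelihood until you hit the target size. T5 uses SentencePiece + Unigram, Llama 1/2 use SentencePiece with BPE, Llama 3 switched to byte-level BPE, and Gemma 3 uses a SentencePiece tokenizer with a 262K vocab covering 140+ languages.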
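The top-down pruning idea can be sketched on a toy vocab. This is a heavily simplified illustration, not the library's trainer. It assumes fixed, hand-picked piece probabilities, scores each word by its single best (Viterbi) segmentation, and removes one piece per step. An actual Unigram trainer re-estimates probabilities between rounds and prunes in batches. All names and numbers below are made up.

```python
import math

def viterbi_segment(text, logp):
    """Best-scoring segmentation of text under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return best[n], pieces[::-1]

def corpus_log_likelihood(corpus, logp):
    return sum(viterbi_segment(word, logp)[0] for word in corpus)

def prune_once(corpus, logp):
    """Remove the multi-character piece whose loss hurts likelihood least."""
    base = corpus_log_likelihood(corpus, logp)
    losses = {}
    for piece in logp:
        if len(piece) == 1:
            continue  # keep single characters so every word stays segmentable
        reduced = {q: lp for q, lp in logp.items() if q != piece}
        losses[piece] = base - corpus_log_likelihood(corpus, reduced)
    victim = min(losses, key=losses.get)
    return victim, {q: lp for q, lp in logp.items() if q != victim}

# Toy probabilities (they sum to 1): eight single characters plus four longer pieces.
probs = {c: 0.05 for c in "▁lowerst"}
probs.update({"▁low": 0.3, "er": 0.1, "est": 0.1, "we": 0.1})
logp = {piece: math.log(p) for piece, p in probs.items()}

corpus = ["▁low", "▁lower", "▁lowest"]
segmentation = viterbi_segment("▁lower", logp)[1]
victim, pruned = prune_once(corpus, logp)
print(segmentation)  # ['▁low', 'er']
print(victim)        # we
```

The piece "we" never appears in any best segmentation of the toy corpus, so removing it costs nothing and it is pruned first. Repeating `prune_once` until the vocab reaches the target size gives the top-down loop.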

Code

SentencePiece preserves whitespace as ▁ (Python)

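import sentencepiece as spm

# Train (one call). Needs a text corpus on disk first, one sentence per line.
# vocab_size must be attainable for the corpus; training fails if it is too large for the data.
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='m', vocab_size=1000,
    model_type='unigram'
)

# Training writes m.model and m.vocab to the working directory.
sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.encode_as_pieces("Hello world"))
# e.g. ['▁Hello', '▁world']  (exact pieces depend on corpus.txt)

print(sp.encode_as_pieces("안녕 세상"))
# e.g. ['▁', '안녕', '▁세상']  (same procedure, no language-specific rules)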

Exercise

Train two SentencePiece tokenizers on the same multilingual corpus (a mix of English, Korean, and Chinese): one with model_type='bpe', one with 'unigram', both with vocab_size=8000. Compare how the two segment a Korean sentence. Which one feels closer to morpheme boundaries?
