Multilingual tokenization and the efficiency gap

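A tokenizer learns its merges from a training corpus. If 80% of that corpus is English, the resulting vocabulary is heavily biased toward English subwords. The cost shows up at inference time: non-English text typically splits into 2-4x more tokens than English text with the same content.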
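To see where the bias comes from, here is a minimal toy sketch of byte-level BPE training. It is not how any production tokenizer is implemented; the helper names (`to_bytes`, `merge_pair`, `learn_merges`, `encode`) and the ten-word corpus are made up for illustration. With an 80/20 English/Korean split, every merge goes to the English word and the Korean word stays at raw bytes.

```python
from collections import Counter

def to_bytes(word):
    # Start from single UTF-8 bytes, as byte-level BPE does.
    return tuple(bytes([b]) for b in word.encode("utf-8"))

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(words, num_merges):
    # Repeatedly merge the most frequent adjacent pair in the corpus.
    corpus = Counter(to_bytes(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    # Apply the learned merges in the order they were learned.
    symbols = to_bytes(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical toy corpus: 80% English, 20% Korean.
corpus = ["token"] * 8 + ["토큰화"] * 2
merges = learn_merges(corpus, num_merges=4)
english_tokens = encode("token", merges)
korean_tokens = encode("토큰화", merges)
print(len(english_tokens), len(korean_tokens))
```

With a budget of four merges, "token" collapses to a single token while "토큰화" is still nine separate bytes, because every English byte pair outnumbers every Korean one.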

What this looks like concretely

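With cl100k_base (GPT-4), an English word averages about 1.3 tokens. Korean fares much worse under the same tokenizer: a Hangul syllable is 3 bytes in UTF-8, and apart from the most common syllables, which have their own vocabulary entries, it splits into 2 or 3 byte-level tokens. So a 100-word English paragraph costs about 130 tokens, while a Korean paragraph of just 100 syllables, which carries far less content, can cost more than that, up to a ceiling of 300. Under per-token API pricing, that is a real cost asymmetry.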
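A quick sanity bound for numbers like these: a byte-level BPE never emits more tokens than the text has UTF-8 bytes, so the byte length is a hard ceiling on the token count. The sketch below uses only the standard library; the function names and the price of 10 USD per million tokens are placeholders chosen for illustration, not real pricing.

```python
def utf8_token_ceiling(text):
    # Worst case for a byte-level BPE: one token per UTF-8 byte.
    return len(text.encode("utf-8"))

def estimated_cost(num_tokens, usd_per_million_tokens):
    # Plain per-token pricing arithmetic.
    return num_tokens * usd_per_million_tokens / 1_000_000

PRICE = 10.0  # hypothetical USD per million tokens

# 100 English words at roughly 1.3 tokens per word.
english_tokens = round(100 * 1.3)

# 100 Hangul syllables: 3 bytes each, so at most 300 tokens.
korean_ceiling = utf8_token_ceiling("한" * 100)

print("English estimate:", english_tokens, estimated_cost(english_tokens, PRICE))
print("Korean ceiling:  ", korean_ceiling, estimated_cost(korean_ceiling, PRICE))
```

The ceiling is not an estimate of the real count. It only tells you how bad things can get before you measure with the actual tokenizer.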

Mitigations

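Bigger vocabularies. GPT-4o's o200k_base (200K entries) and Gemma 3's 262K vocabulary try to narrow the gap by giving multilingual subwords first-class slots.
Multilingual training data. Over 5% of Llama 3's pretraining data is non-English, and Llama 4 pushes further, with pretraining data that covers 200 languages.
Language-targeted models. Models such as Solar (English/Korean), Qwen (English/Chinese), and Aya (multilingual) explicitly retrain or extend the tokenizer for their target languages.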
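Why extra vocabulary slots help can be shown with a toy tokenizer. The sketch below is a greedy longest-match tokenizer with single-byte fallback, a deliberately simplified stand-in for BPE rather than what o200k_base or any real model does; `greedy_tokenize` and both vocabularies are hypothetical.

```python
def greedy_tokenize(text, vocab):
    # Greedy longest-match over UTF-8 bytes; unknown bytes fall back to
    # one token per byte.
    data = text.encode("utf-8")
    max_len = max((len(v) for v in vocab), default=1)
    tokens, i = [], 0
    while i < len(data):
        for size in range(min(max_len, len(data) - i), 0, -1):
            piece = data[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical English-only vocabulary.
base_vocab = {"the".encode("utf-8"), " token".encode("utf-8")}

# Same vocabulary, extended with two Korean entries.
extended_vocab = base_vocab | {"토큰".encode("utf-8"), " 수".encode("utf-8")}

text = "토큰 수"
before = greedy_tokenize(text, base_vocab)
after = greedy_tokenize(text, extended_vocab)
print(len(before), len(after))
```

With the English-only vocabulary every byte of the Korean phrase becomes its own token; two added entries are enough to cover the whole phrase. Real tokenizers make the same trade at scale: more slots for non-English subwords, fewer byte-level fallbacks.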

The gap is real and narrowing, but for now: if you serve non-English users, don't assume English-equivalent costs. Measure the actual token count per message in production.
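One way to do that measurement is to aggregate token counts per language over logged messages. This is a minimal sketch, assuming you already know each message's language; `tokens_per_message_by_language` is a hypothetical helper, and the byte-length counter is only a stand-in so the example runs without extra dependencies. In production you would pass your model's real tokenizer, for example `lambda t: len(enc.encode(t))` with a tiktoken encoding.

```python
from collections import defaultdict
from statistics import mean

def tokens_per_message_by_language(messages, count_tokens):
    # messages: iterable of (language, text) pairs.
    # count_tokens: callable mapping text to a token count.
    counts = defaultdict(list)
    for language, text in messages:
        counts[language].append(count_tokens(text))
    return {language: mean(values) for language, values in counts.items()}

def byte_count(text):
    # Stand-in counter (UTF-8 byte length), NOT a real tokenizer.
    return len(text.encode("utf-8"))

# Hypothetical logged messages.
messages = [
    ("en", "hi"),
    ("en", "hello"),
    ("ko", "안녕"),
]
averages = tokens_per_message_by_language(messages, byte_count)
print(averages)
```

Tracking the per-language average over time also shows whether switching models or tokenizers actually narrows the gap for your own traffic.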

Code

Measure your own token-per-language ratio (Python)

import tiktoken

# For "gpt-4o", tiktoken resolves to the o200k_base encoding.
enc = tiktoken.encoding_for_model("gpt-4o")

# The same sentence in four languages, so the token counts are comparable.
samples = {
    "English": "The Transformer architecture uses self-attention to model long-range dependencies in sequences.",
    "Korean": "트랜스포머 아키텍처는 시퀀스에서 장기 의존성을 모델링하기 위해 self-attention을 사용한다.",
    "Chinese": "Transformer 架构使用自注意力机制来建模序列中的长距离依赖关系。",
    "Japanese": "Transformerアーキテクチャは、シーケンス内の長距離依存関係をモデル化するために自己注意機構を使用する。",
}
for lang, text in samples.items():
    n = len(enc.encode(text))  # number of tokens
    chars = len(text)  # Unicode code points, not bytes
    # Compare n across languages first; tok/char is a per-script density.
    print(f"{lang:>9}  {n:>4} tokens  {chars:>3} chars  {n/chars:.2f} tok/char")
