tiktoken and Hugging Face Tokenizers: Practical Tools

~10 min · tooling, tiktoken, huggingface

In production you almost never implement BPE yourself. Two libraries cover most cases.

tiktoken (OpenAI)

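A Rust-based BPE tokenizer for OpenAI's encodings: p50k_base (GPT-3-era models such as text-davinci-003), cl100k_base (GPT-4), o200k_base (GPT-4o), and o200k_harmony (the gpt-oss open-weight models; tiktoken's model table maps GPT-5 to o200k_base). It's fast, roughly 1 million tokens per second on a single core. The Python interface is one import plus one method call.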

Hugging Face `tokenizers`

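A Rust-based library that supports BPE, WordPiece, and Unigram (the algorithm SentencePiece uses by default). It implements the whole pipeline: normalization (lowercasing, NFKC), pre-tokenization (splitting on whitespace, for example), the model itself (BPE/WordPiece/Unigram), and post-processing (adding special tokens). Chat templates are applied one layer up, in `transformers`. The fast tokenizers that AutoTokenizer returns for nearly every Hugging Face model use this library internally.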
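To make the four stages concrete, here is a rough sketch that wires each one up on a tiny in-memory corpus. The corpus, the vocab size, and the `[CLS] ... [SEP]` template are assumptions chosen for illustration, not settings from any particular model.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFKC, Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

# Toy corpus, made up for this sketch.
corpus = ["hello world", "hello tokenizers", "the world of tokens"] * 20

tok = Tokenizer(BPE(unk_token="[UNK]"))                       # stage 3: model
tok.normalizer = normalizers.Sequence([NFKC(), Lowercase()])  # stage 1: normalization
tok.pre_tokenizer = Whitespace()                              # stage 2: pre-tokenization

trainer = BpeTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[CLS]", "[SEP]"],
    show_progress=False,
)
tok.train_from_iterator(corpus, trainer=trainer)

# Stage 4: post-processing wraps each sequence in [CLS] ... [SEP].
tok.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tok.token_to_id("[CLS]")),
        ("[SEP]", tok.token_to_id("[SEP]")),
    ],
)

# Each stage can be inspected on its own.
normalized = tok.normalizer.normalize_str("Hello World")
print(normalized)
print(tok.pre_tokenizer.pre_tokenize_str(normalized))  # words with offsets
encoding = tok.encode("Hello World")
print(encoding.tokens)
```

The model is created first only because `Tokenizer` needs it in its constructor; at encode time the stages still run in the order normalization, pre-tokenization, model, post-processing.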

Speed: roughly 10-100x faster than a pure-Python tokenizer. Building a vocab from a corpus, which used to take hours in Python, finishes in minutes.

Code

tiktoken: counting tokens before an API call (Python)

import tiktoken

# Pick the encoding for the model you'll send to
enc = tiktoken.encoding_for_model("gpt-4o")

prompt = "Explain byte-pair encoding in two sentences."  # placeholder; build your real prompt here
tokens = enc.encode(prompt)
print(f"Prompt length: {len(tokens)} tokens")

# Use this BEFORE sending to estimate cost and to truncate
# safely when the prompt exceeds the model's context window.
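The truncation step mentioned in that comment could look roughly like the sketch below. `truncate_to_budget` and the `Encoder` protocol are hypothetical names, and `ByteEncoder` is a toy stand-in (one token per UTF-8 byte) so the example runs without downloading a vocabulary.

```python
from typing import Protocol


class Encoder(Protocol):
    """The two methods this sketch needs; tiktoken's Encoding provides both."""

    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


def truncate_to_budget(text: str, max_tokens: int, enc: Encoder) -> str:
    """Return text unchanged if it fits, else its first max_tokens tokens decoded."""
    ids = enc.encode(text)
    if len(ids) <= max_tokens:
        return text
    # Cutting on a token boundary can split a multi-byte character;
    # the decoded result may then end in a replacement character.
    return enc.decode(ids[:max_tokens])


class ByteEncoder:
    """Toy stand-in: one token per UTF-8 byte. Not a real BPE tokenizer."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        return bytes(tokens).decode("utf-8", errors="replace")


print(truncate_to_budget("hello world", 5, ByteEncoder()))  # prints "hello"
```

With tiktoken installed, pass `tiktoken.encoding_for_model("gpt-4o")` as `enc` instead of the toy encoder. Keeping the head of the prompt is only one policy; dropping the oldest chat turns or the middle of a long document are equally reasonable, depending on what the prompt contains.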

Hugging Face tokenizers: train and load (Python)

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=8000,
                     special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
tok.train(files=["corpus.txt"], trainer=trainer)
tok.save("my-bpe.json")

# Reload anywhere:
tok = Tokenizer.from_file("my-bpe.json")
print(tok.encode("hello world").tokens)

Exercise

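Plug tiktoken into your own OpenAI/Anthropic client wrapper and log, for every request, (a) the prompt token count, (b) the max_tokens you allowed, and (c) the response token count. Then look at the distribution over a day of traffic: are you over- or under-budgeting context? Where do the long-tail outliers come from? (tiktoken implements OpenAI's encodings only, so counts for Anthropic models are approximations.)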
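One possible starting point for the logging part is sketched below. Everything here is an assumption for illustration: the `TokenUsage` record, the `log_usage` helper, and the whitespace counter that stands in for a real tokenizer. Wiring it into an actual client wrapper and analysing the day's distribution is left as the exercise.

```python
import logging
from dataclasses import asdict, dataclass
from typing import Callable

logger = logging.getLogger("token_usage")


@dataclass
class TokenUsage:
    prompt_tokens: int
    max_tokens: int
    response_tokens: int

    @property
    def budget_used(self) -> float:
        """Fraction of the allowed max_tokens that the response consumed."""
        return self.response_tokens / self.max_tokens if self.max_tokens else 0.0


def log_usage(count_tokens: Callable[[str], int], prompt: str,
              max_tokens: int, response: str) -> TokenUsage:
    """Count tokens for one request/response pair and emit one log record."""
    usage = TokenUsage(count_tokens(prompt), max_tokens, count_tokens(response))
    logger.info("token_usage %s budget_used=%.2f", asdict(usage), usage.budget_used)
    return usage


# Hypothetical stand-in counter so the sketch runs offline. In practice:
#   enc = tiktoken.encoding_for_model("gpt-4o")
#   count = lambda s: len(enc.encode(s))
def whitespace_count(text: str) -> int:
    return len(text.split())


usage = log_usage(whitespace_count, "summarize this short text", 8, "a short summary")
print(usage, usage.budget_used)
```

Passing the counter in as a function keeps the logging code independent of any one tokenizer, which helps when the same wrapper talks to more than one provider.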
