WordPiece: BERT's tokenizer

~12 min · wordpiece, bert

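WordPiece is Google's subword tokenization algorithm. It was developed in 2012 for Japanese and Korean voice search, and BERT is what made it famous. Like BPE, it builds its vocabulary bottom-up through merges, but the merge criterion is different.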

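Instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data under a unigram language model. Concretely, it picks the pair (a, b) that maximizes count(ab) / (count(a) × count(b)). Dividing by the individual counts penalizes pairs whose parts are already common on their own, so the merges tend to follow morpheme boundaries rather than raw character co-occurrence frequency.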
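To make the difference from BPE concrete, here is a minimal sketch of the scoring rule on a made-up toy corpus. The names `pair_scores` and `word_freqs` are illustrative, not part of any library, and the sketch assumes words are already split into characters with ## on non-initial symbols. It only picks the first merge; a real trainer would repeat this until the vocabulary reaches its target size.

```python
from collections import Counter

def pair_scores(word_freqs):
    """Return (pair_counts, scores), where score = count(ab) / (count(a) * count(b))."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in word_freqs.items():
        for s in symbols:
            symbol_counts[s] += freq
        # Count every adjacent pair inside the word, weighted by word frequency.
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    scores = {
        (a, b): n / (symbol_counts[a] * symbol_counts[b])
        for (a, b), n in pair_counts.items()
    }
    return pair_counts, scores

# Hypothetical toy corpus: word (as a tuple of symbols) -> frequency.
word_freqs = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

pair_counts, scores = pair_scores(word_freqs)
bpe_choice = max(pair_counts, key=pair_counts.get)  # most frequent pair
wp_choice = max(scores, key=scores.get)             # highest normalized score

print("BPE would merge:      ", bpe_choice)  # ('##u', '##g'), count 20
print("WordPiece would merge:", wp_choice)   # ('##g', '##s'), score 0.05
```

The pair (##u, ##g) is the most frequent, but ##u appears in every word, so its score is diluted. The pair (##g, ##s) is rarer, yet ##s never appears without ##g before it, so the normalized score ranks it first.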

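In the output, continuation pieces carry a ## prefix. So "unbelievable" might come out as ['un', '##believ', '##able']; the exact split depends on the learned vocabulary. Detokenization within a word is exact: strip the ## from each continuation piece, concatenate, and you get the word back (in its normalized form, so lowercased for an uncased model).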
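The ## rule is simple enough to sketch by hand. The helper below, `join_wordpieces`, is a hypothetical name used only for illustration and not a transformers API; it assumes the input is a plain list of WordPiece strings with no special tokens such as [CLS] or [UNK].

```python
def join_wordpieces(pieces):
    """Glue each ## continuation piece onto the previous piece, then join words with spaces."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # drop the ## marker and append to the current word
        else:
            words.append(piece)     # a piece without ## starts a new word
    return " ".join(words)

print(join_wordpieces(["un", "##believ", "##able"]))         # unbelievable
print(join_wordpieces(["token", "##ization", "is", "fun"]))  # tokenization is fun
```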

Code

WordPiece in action (BERT)·python

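from transformers import BertTokenizer

# Loads the pretrained WordPiece vocabulary (downloaded on first use).
tok = BertTokenizer.from_pretrained("bert-base-uncased")

words = ["unbelievable", "tokenization", "transformers", "embeddings"]
for w in words:
    # tokenize() returns the list of WordPiece strings for the input text.
    print(w, "->", tok.tokenize(w))

# Example output (exact pieces depend on the checkpoint's vocabulary):
# unbelievable  -> ['un', '##believable']
# tokenization  -> ['token', '##ization']
# transformers  -> ['transformers']     (kept whole; common)
# embeddings    -> ['em', '##bed', '##ding', '##s']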

Exercise

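Take the paragraph you used in the BPE exercise and run it through BertTokenizer too. Compare the token boundaries: where does it cut differently from BPE? Find at least three cases where the difference is morphologically meaningful (prefixes, suffixes, inflection, and so on).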
