Word embedding: where token IDs become dense vectors

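Token IDs are arbitrary integers. A number like 47458 means only whatever the BPE merges happened to assign to it; the integer itself carries no semantic structure. So before any attention happens, the model's first job is to turn each ID into a dense vector that does have semantic structure.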

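This is a learned lookup. The embedding matrix has shape (vocab_size × d_model). Index it with a token ID and you get back one row of size d_model; that row is the token's initial representation. During training, gradients flow back into those rows and gently nudge them so that tokens used in similar contexts end up with similar vectors.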
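
To make "learned lookup" concrete, here's a minimal sketch with toy sizes (the 10 × 4 matrix and the IDs 2 and 7 are made up for illustration, not taken from any real model). It checks that the lookup is just row indexing, that it matches multiplying a one-hot vector by the matrix, and that backprop only reaches the rows that were actually looked up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 10, 4            # toy sizes, chosen only for illustration
emb = nn.Embedding(vocab_size, d_model)
ids = torch.tensor([2, 7, 2])          # arbitrary toy IDs; ID 2 appears twice

# 1) The lookup is plain row indexing into the weight matrix.
vectors = emb(ids)                                          # (3, 4)
same_as_indexing = torch.equal(vectors, emb.weight[ids])

# 2) Equivalently, a one-hot row times the matrix selects that row.
one_hot = F.one_hot(ids, num_classes=vocab_size).float()    # (3, 10)
same_as_matmul = torch.allclose(one_hot @ emb.weight, vectors)

# 3) Backprop only touches the rows that were looked up.
vectors.sum().backward()
rows_with_grad = emb.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()

print(same_as_indexing, same_as_matmul, rows_with_grad)
```

Rows for tokens that never appear in a batch get zero gradient from that batch, which is one reason rare tokens tend to move less during training.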

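The result is a high-dimensional space where meaning is encoded as geometry. "cat" and "dog" tend to end up close together, as do "king" and "queen", and "run" and "running". This is the substrate that every later layer refines: the layers don't operate on words, they operate on points in this space.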
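
"Close" here usually means high cosine similarity, i.e. the vectors point in nearly the same direction. A rough sketch with the formula written out by hand; the 3-D vectors below are invented for illustration only, whereas real embeddings are learned and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (hypothetical values, not from any trained model).
toy = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "run": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine(toy["cat"], toy["dog"])
sim_cat_run = cosine(toy["cat"], toy["run"])
print(f"cat~dog: {sim_cat_dog:.3f}, cat~run: {sim_cat_run:.3f}")
```

With these toy values, "cat" scores much higher against "dog" than against "run". Cosine similarity ignores vector length, so it compares direction only.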

Code

Embedding matrix in PyTorch·python

import torch
import torch.nn as nn

vocab_size, d_model = 50_257, 768   # GPT-2 base
emb = nn.Embedding(vocab_size, d_model)

# emb.weight has shape (50257, 768) and is learned
ids = torch.tensor([464, 2368, 3290])     # three example token IDs; in practice a tokenizer supplies these
vectors = emb(ids)                        # (3, 768)

# Vectors start random. Training reshapes the matrix so that
# semantically similar tokens have similar rows.

What 'similar' means geometrically·python

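import torch
import torch.nn.functional as F

# After training, you can probe what's near what.
# token_id is a placeholder for a tokenizer lookup (string -> vocabulary index);
# emb is the nn.Embedding from the previous block.
def neighbors(token, k=5):
    idx = token_id(token)
    with torch.no_grad():
        target = emb.weight[idx]                 # (d_model,)
        sims = F.cosine_similarity(emb.weight, target.unsqueeze(0), dim=1)  # (vocab_size,)
        sims[idx] = -1.0                         # exclude the query token itself
    return sims.topk(k).indices.tolist()         # token IDs of the k nearest rows

# In a trained Transformer, the IDs returned by neighbors('cat') often decode
# to tokens like 'dog', 'kitten', 'cats', 'feline', 'pet', in some order.
# In a randomly-initialized model, you'd get random neighbors.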

Exercise

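Load the Hugging Face gpt2 model and extract its input embedding matrix (model.transformer.wte.weight). Pick a target word such as 'computer' and find its 10 nearest neighbors in embedding space by cosine similarity. Are the neighbors mostly synonyms, related concepts, morphological variants, or noise? What does that tell you about what the embedding has captured?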
