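During training, each element is zeroed at random with probability p, and the surviving elements are scaled by 1/(1-p) so that the expected output is preserved. In eval mode it is a no-op. It is a classic regularizer: it keeps the model from relying too heavily on any single neuron.
Common values: 0.1 for Transformer FFN/attention layers, 0.5 for older fully-connected layers. Modern models use dropout more sparingly than 2015-era papers did.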
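The scaling rule above can be checked with a hand-rolled version. This is a minimal sketch for illustration, not how nn.Dropout is implemented internally; the function name inverted_dropout is hypothetical.

```python
import torch

def inverted_dropout(x: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Hypothetical hand-rolled dropout, for illustration only."""
    if not training or p == 0.0:
        return x  # eval mode: identity
    # Keep each element with probability 1 - p, zero it with probability p.
    keep_mask = (torch.rand_like(x) >= p).to(x.dtype)
    # Rescale survivors by 1/(1-p) so the expected value matches the input.
    return x * keep_mask / (1.0 - p)

torch.manual_seed(0)
x = torch.ones(100_000)
y = inverted_dropout(x, p=0.3)

print(y.mean().item())                 # close to 1.0: expectation preserved
print((y == 0).float().mean().item())  # close to 0.3: fraction zeroed
```

With enough elements the mean of the output stays near the mean of the input, which is the reason no rescaling is needed at eval time.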
nn.Embedding
A lookup table from integer indices to dense vectors. It appears wherever the input is categorical:
Token embeddings in NLP (vocab → vector).
Positional embeddings.
User / item embeddings in recommendation systems.
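Conceptually it is just a (num_embeddings, embedding_dim) weight matrix, and embed(idx) returns weight[idx]. Reasons to use nn.Embedding rather than indexing a weight tensor directly: PyTorch can handle sparse gradients efficiently (with sparse=True), it supports a padding index whose vector is never updated, and it integrates cleanly with optimizers.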
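The equivalence with plain indexing, and the effect of the padding index on gradients, can be seen on a tiny table. This is a sketch with arbitrary sizes (10 rows, 4 dimensions) chosen only for readability.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
idx = torch.tensor([0, 3, 3, 7])  # index 0 is padding, index 3 appears twice

out = embed(idx)
same = torch.equal(out, embed.weight[idx])
print(same)  # True: the forward pass is a row lookup

# Backpropagate a simple sum to inspect per-row gradients.
out.sum().backward()
grad = embed.weight.grad
print(grad[0])  # padding row: all zeros, so it is never updated
print(grad[3])  # looked up twice: gradient accumulates to all twos
print(grad[7])  # looked up once: all ones
```

Rows that were never looked up also receive zero gradient; the padding row differs in that it stays at zero gradient even when its index appears in the batch.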
nn.ModuleList — dynamic-depth pattern
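Already covered in the Sequential lesson, but worth repeating: if forward needs to iterate over layers (an N-layer Transformer, a ResNet stack), use nn.ModuleList. A plain Python list is invisible to PyTorch: modules stored in it are not registered, so their parameters do not show up in parameters() or state_dict() and are not moved by .to(device).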
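The difference is easy to observe by counting parameters. The two classes below are hypothetical toy modules, identical except for the container that holds the layers.

```python
import torch
import torch.nn as nn

class PlainListStack(nn.Module):
    def __init__(self, depth=3, dim=8):
        super().__init__()
        # Plain Python list: the Linear layers are NOT registered as submodules.
        self.layers = [nn.Linear(dim, dim) for _ in range(depth)]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

class ModuleListStack(nn.Module):
    def __init__(self, depth=3, dim=8):
        super().__init__()
        # nn.ModuleList: every layer is registered with the parent module.
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

plain_count = sum(p.numel() for p in PlainListStack().parameters())
registered_count = sum(p.numel() for p in ModuleListStack().parameters())
print(plain_count)       # 0
print(registered_count)  # 216 = 3 * (8 * 8 + 8)
```

Both versions run a forward pass without error, which is what makes the plain-list bug easy to miss: the optimizer simply receives no parameters to train.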
Code
Dropout — train vs eval behavior·python
import torch
import torch.nn as nn
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)
drop.train()
print(drop(x)) # ~half the entries zeroed, rest scaled by 2
# tensor([[2., 0., 0., 2., 0., 2., 2., 0.]]) (varies)
drop.eval()
print(drop(x)) # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]]) — no-op
Embedding — lookup table·python
import torch
import torch.nn as nn
# Vocab of 10,000 tokens, each represented by a 256-dim vector
embed = nn.Embedding(num_embeddings=10000, embedding_dim=256)
token_ids = torch.tensor([42, 100, 7, 2023])
vectors = embed(token_ids)
print(vectors.shape) # torch.Size([4, 256])
# Batched
batch = torch.randint(0, 10000, (32, 50)) # batch=32, seq=50
print(embed(batch).shape) # torch.Size([32, 50, 256])
# padding_idx — vector for index 0 stays zero and isn't trained
embed_pad = nn.Embedding(10000, 256, padding_idx=0)
print(embed_pad.weight[0].sum().item()) # 0.0: the padding row is initialized to zeros and receives no gradient
Putting it together — a minimal Transformer block·python
import torch
import torch.nn as nn
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_head=8, d_ff=2048, drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=drop, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(drop),
            nn.Linear(d_ff, d_model),
            nn.Dropout(drop),
        )

    def forward(self, x, attn_mask=None):
        # Pre-LN style: modern Transformer convention
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + h # residual connection
        h = self.ln2(x)
        h = self.ff(h)
        x = x + h # residual connection
        return x

block = TransformerBlock()
x = torch.randn(2, 16, 512) # batch=2, seq=16, dim=512
print(block(x).shape) # torch.Size([2, 16, 512])
Stack six copies of the TransformerBlock from the third code block with nn.ModuleList to build a TransformerEncoder. Drive the depth from a config dict. Verify the parameter count: each block at d_model=512 has ≈ 3.15M parameters, so the stack has about 18.9M.
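The target numbers can be reproduced by hand before building anything. The sketch below only does the arithmetic; it assumes the PyTorch defaults used by the block above (biases enabled in nn.Linear and nn.MultiheadAttention, a weight and a bias per nn.LayerNorm), and the helper names are hypothetical.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

def block_params(d_model: int = 512, d_ff: int = 2048) -> int:
    # Attention: a fused Q/K/V input projection plus the output projection.
    attn = linear_params(d_model, 3 * d_model) + linear_params(d_model, d_model)
    # Two LayerNorms, each with a weight and a bias of size d_model.
    layer_norms = 2 * (2 * d_model)
    # Feed-forward: d_model -> d_ff -> d_model.
    ff = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    return attn + layer_norms + ff

per_block = block_params()
print(per_block)      # 3152384  (about 3.15M)
print(6 * per_block)  # 18914304 (about 18.9M)
```

The number of heads does not appear in the formula: splitting d_model across heads changes how the projections are reshaped, not how many weights they hold. Comparing these figures against sum(p.numel() for p in model.parameters()) on the finished encoder is a quick check that every block was registered.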