Batch, Logit, Output Head

The batch is the unit of computation

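A network operates on a batch of examples, not on a single example. A batch of 32 images is a tensor of shape [32, 3, 224, 224]; a batch of sentences is [32, seq_len, hidden_dim] once the tokens are embedded. The first dimension is almost always the batch dimension, and indexing into it should become automatic: x[0] is the first example in the batch.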
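A minimal sketch of the shape convention described above. The tensor is filled with random values as a stand-in for real images, and the variable names are illustrative only.

```python
import torch

# Hypothetical batch of 32 RGB images at 224x224 (random values, not real data).
images = torch.randn(32, 3, 224, 224)

batch_size = images.shape[0]   # the leading dimension is the batch dimension
first = images[0]              # first example in the batch: [3, 224, 224]

# A lone example becomes a batch of one by adding a leading dimension.
single = first.unsqueeze(0)    # [1, 3, 224, 224]

print(batch_size, tuple(first.shape), tuple(single.shape))
```

Indexing with images[0] drops the batch dimension, while unsqueeze(0) puts it back, which is the usual way to feed one example to a model that expects batches.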

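Batching is what makes a GPU useful. A single forward pass on one example wastes most of the chip's parallelism. While the GPU is underutilized, doubling the batch size usually comes close to halving the time per example, until compute saturates, memory bandwidth becomes the bottleneck, or the batch no longer fits in VRAM.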

Tip: When a tutorial uses a single-example forward pass, mentally rewrite it as batched. Real training loops almost always operate on batches; single-example code is a teaching shortcut.
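As a rough illustration of that rewrite, the sketch below runs the same layer once per example and once on the whole batch. The nn.Linear model is only a stand-in for a real network, and the sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(512, 10)   # stand-in for any network
x = torch.randn(32, 512)     # batch of 32 examples

with torch.no_grad():
    # Single-example style: one forward pass per example, then stack the results.
    looped = torch.stack([model(x[i]) for i in range(x.shape[0])])
    # Batched style: one forward pass for the whole batch.
    batched = model(x)

print(tuple(looped.shape), tuple(batched.shape))
# The two results should agree up to floating-point rounding.
print(torch.allclose(looped, batched, atol=1e-4))
```

Both paths compute the same values; the batched call simply hands the hardware all 32 examples at once instead of one at a time.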

Logit vs probability

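A logit is the raw, un-normalized output of the final linear layer. It can be any real number, positive or negative. A probability comes from passing logits through softmax (multi-class) or sigmoid (binary). PyTorch's standard classification losses, nn.CrossEntropyLoss and nn.BCEWithLogitsLoss, take logits, not probabilities: they apply log-softmax or sigmoid internally in a numerically stable way. Returning logits from the model is the right default.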
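The relationship can be checked directly. This is a small sketch with made-up logits and targets, using the functional forms F.cross_entropy, F.log_softmax, and F.nll_loss from torch.nn.functional.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)             # [batch, num_classes], any real values
targets = torch.tensor([0, 2, 1, 2])   # made-up class indices

# Softmax turns each row of logits into probabilities that sum to 1.
probs = F.softmax(logits, dim=-1)

# cross_entropy takes logits and applies log-softmax internally.
loss = F.cross_entropy(logits, targets)
# The same loss written out by hand: log-softmax, then negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# Binary case: sigmoid maps a single logit to a probability.
binary_logits = torch.tensor([-2.0, 0.0, 3.0])
binary_probs = torch.sigmoid(binary_logits)

print(probs.sum(dim=-1), loss.item(), manual.item(), binary_probs)
```

Passing probs into F.cross_entropy instead of logits would apply softmax twice and silently give a wrong loss, which is the main reason to keep the model's output as logits.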

Output head

The 'head' is the final layer (or small stack of layers) that maps the network's internal representation to the task output. Classification uses nn.Linear(hidden_dim, num_classes). Regression uses nn.Linear(hidden_dim, 1). Multi-label uses nn.Linear(hidden_dim, num_labels) followed by a sigmoid and binary cross-entropy, which nn.BCEWithLogitsLoss fuses into one step. Embedding learning uses an L2-normalized projection.
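The multi-label case is the one most often wired up incorrectly, so here is a hedged sketch of it. The feature tensor z stands in for backbone output, and the sizes and random targets are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, num_labels = 256, 5
z = torch.randn(32, hidden_dim)          # stand-in for backbone features

head_multi = nn.Linear(hidden_dim, num_labels)
logits = head_multi(z)                   # [32, 5], one logit per label

# Each label is an independent 0/1 target (random here, for illustration).
targets = torch.randint(0, 2, (32, num_labels)).float()

# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits.
loss = nn.BCEWithLogitsLoss()(logits, targets)

# Probabilities and hard predictions are computed only when they are needed.
probs = torch.sigmoid(logits)
preds = probs > 0.5                      # 0.5 threshold is an arbitrary choice

print(tuple(logits.shape), loss.item(), tuple(preds.shape))
```

Unlike softmax, the sigmoid treats every label independently, so several labels can be active for the same example.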

This is the layer you usually replace when fine-tuning a pretrained model on a new task: keep the backbone as is and re-initialize only the head for the new label set.
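A head swap might look roughly like the following. The Classifier class is hypothetical and its weights are random rather than actually pretrained; freezing the backbone is an optional choice that the text above does not prescribe.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Hypothetical model: a backbone plus a swappable head."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

# Pretend this model was pretrained on a 10-class task.
model = Classifier(in_dim=512, hidden_dim=256, num_classes=10)

# Optional (an assumption, not required): freeze the backbone weights.
for p in model.backbone.parameters():
    p.requires_grad = False

# Replace the head with a freshly initialized layer for a 3-class task.
model.head = nn.Linear(256, 3)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
out = model(torch.randn(32, 512))
print(trainable, tuple(out.shape))
```

Leaving the backbone trainable, often with a smaller learning rate, is an equally common choice; the only part that must change is the head, because its output size is tied to the label set.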

Principle: Return logits, not probabilities. Pair them with the matching loss function. Save softmax for visualization or for downstream code that needs actual probabilities.

Code

Batched classifier with task-shaped heads·python

import torch, torch.nn as nn

class Backbone(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
    def forward(self, x):
        return self.body(x)

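# One shared backbone, three task-specific heads.
backbone = Backbone(in_dim=512, hidden_dim=256)
head_clf = nn.Linear(256, 10)   # classification: 10 classes
head_reg = nn.Linear(256, 1)    # regression: one scalar per example
head_emb = nn.Linear(256, 64)   # embedding: 64-dim projection

x = torch.randn(32, 512)        # batch of 32 examples
z = backbone(x)                 # [32, 256] shared features
print(head_clf(z).shape)                         # [32, 10] logits
print(head_reg(z).shape)                         # [32, 1]  predictions
print(nn.functional.normalize(head_emb(z), dim=-1).shape)  # [32, 64] L2-normalized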
