DataLoader and Batching

What DataLoader does

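DataLoader wraps a Dataset and adds four things: batching (collating N examples into a tensor of shape [N, ...]), shuffling (randomizing the sampling order), parallel loading (multiple worker processes), and pinned memory (faster CPU→GPU transfers). It is a single line at the top of the training loop; the rest of the code only has to iterate over batches.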
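To make the batching behaviour concrete, here is a minimal sketch. The toy dataset (10 examples of 3 features each) and the names `toy_ds` and `batch_shapes` are assumptions made up for illustration, not part of the lesson.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (hypothetical): 10 examples, each a feature vector of length 3.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
toy_ds = TensorDataset(features, labels)

# batch_size=4 collates 4 examples into tensors of shape [4, 3] and [4].
loader = DataLoader(toy_ds, batch_size=4, shuffle=False)

batch_shapes = []
for x, y in loader:
    batch_shapes.append((tuple(x.shape), tuple(y.shape)))
    print(x.shape, y.shape)
```

Because 10 is not divisible by 4, the last batch holds only 2 examples; passing drop_last=True would discard it.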

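The two most important arguments are batch_size (tuned for VRAM and throughput) and num_workers (tuned for CPU and I/O bandwidth). The default, num_workers=0, loads data in the main process: fine for small datasets and the easiest setting to debug, but a data-loading bottleneck for almost everything else.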

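Tip: set num_workers somewhere up to the number of physical CPU cores available. On a 16-core box, num_workers=8 is a sensible starting point. Test with htop open: the workers should be busy, the GPU should be busy, and neither side should sit idle waiting for the other.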
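Alongside watching htop, a rough timing loop can help compare settings. The sketch below is one possible way to do it; the helper `seconds_per_batch` and the toy dataset are hypothetical, and it only measures data loading, not the full training step.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def seconds_per_batch(loader, max_batches=50):
    """Average wall-clock seconds needed to fetch one batch from `loader`."""
    start = time.perf_counter()
    n = 0
    for n, _ in enumerate(loader, start=1):
        if n >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return elapsed / max(n, 1)

# Toy in-memory dataset (hypothetical); a real one would read from disk.
toy_ds = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

# Baseline: single-process loading.
baseline = seconds_per_batch(DataLoader(toy_ds, batch_size=64, num_workers=0))
print(f"num_workers=0: {baseline:.6f} s/batch")
```

To compare, rerun with other num_workers values on the real dataset. On platforms that start workers with spawn (Windows, macOS), the loop has to run under an `if __name__ == "__main__":` guard. For a tiny in-memory dataset like this one, worker startup overhead will likely dominate, so the numbers are only meaningful on realistic data.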

Common pitfalls

Forgetting shuffle=True for training. If the data is sorted by class, the model overfits to the first class it sees and may never recover. Always shuffle the training loader.

Shuffling the validation loader. It wastes time and breaks the reproducibility of per-batch metrics. Use shuffle=False for validation.
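The effect of the shuffle flag on a class-sorted dataset can be seen in a small sketch. The dataset, the helper `label_order`, and the seed are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical class-sorted dataset: 50 examples of class 0, then 50 of class 1.
labels = torch.cat([torch.zeros(50, dtype=torch.long),
                    torch.ones(50, dtype=torch.long)])
sorted_ds = TensorDataset(torch.randn(100, 4), labels)

def label_order(shuffle, seed=0):
    """Return the labels in the order the loader yields them."""
    gen = torch.Generator().manual_seed(seed)  # seeded for a repeatable order
    loader = DataLoader(sorted_ds, batch_size=10, shuffle=shuffle, generator=gen)
    return torch.cat([y for _, y in loader]).tolist()

unshuffled = label_order(shuffle=False)
shuffled_a = label_order(shuffle=True)
shuffled_b = label_order(shuffle=True)

print(unshuffled[:10])  # first batch without shuffling: a single class
print(shuffled_a[:10])  # first batch with shuffling: drawn from the whole dataset
```

Without shuffling, the first five batches contain only class 0. With shuffling, each batch is drawn from the full dataset, and passing a seeded generator keeps the order repeatable across runs when that matters.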

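Using num_workers>0 on CUDA without pin_memory=True. This wastes CPU→GPU bandwidth, because copies from pageable host memory are slower than copies from pinned memory. Always pair the two settings.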
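A minimal sketch of how pinned memory is typically paired with non-blocking transfers in the loop. The toy dataset and variable names are assumptions; the code falls back to CPU when no CUDA device is present.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy dataset (hypothetical): 64 examples of 8 features each.
toy_ds = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
loader = DataLoader(
    toy_ds,
    batch_size=16,
    pin_memory=use_cuda,  # pinning only helps when the target is a CUDA device
)

seen = 0
for x, y in loader:
    # non_blocking=True allows the copy from pinned memory to overlap with
    # GPU work; on a CPU-only machine the call simply returns the tensor.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    seen += x.shape[0]
```

The non_blocking flag only gives an asynchronous copy when the source tensor is in pinned memory, which is why the two settings are usually used together.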

Mutating shared state in __getitem__. Workers are subprocesses, so every worker holds its own copy of global state. Logging counters, RNG state and the like behave in subtly different ways than expected.
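The sketch below illustrates the per-process state issue with a hypothetical `CountingDataset` that increments a counter in __getitem__ and reports which worker served each item via torch.utils.data.get_worker_info. It runs with num_workers=0 so it is safe to execute anywhere; the comments describe what changes once workers are enabled.

```python
import torch
from torch.utils.data import Dataset, DataLoader, get_worker_info

class CountingDataset(Dataset):
    """Hypothetical dataset that counts __getitem__ calls on itself."""

    def __init__(self, n):
        self.n = n
        self.calls = 0  # per-process state: each worker gets its own copy

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        self.calls += 1
        info = get_worker_info()  # None when running in the main process
        worker_id = -1 if info is None else info.id
        return torch.tensor(idx), torch.tensor(worker_id)

ds = CountingDataset(8)
worker_ids = []
for _, wid in DataLoader(ds, batch_size=4, num_workers=0):
    worker_ids.extend(wid.tolist())

# With num_workers=0 everything runs in the main process, so the counter
# on `ds` reflects every call.
calls_seen_by_main = ds.calls
print(worker_ids, calls_seen_by_main)
```

With num_workers>0, each worker operates on its own copy of the dataset object, so the increments happen in the workers and the counter on the main-process `ds` no longer reflects them. State that must be aggregated is better returned as part of the batch or set up per worker through worker_init_fn.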

Custom collate

The default collate_fn stacks tensors that share the same shape. Variable-length sequences (text, audio, sets) need a custom collate function that pads or packs them. Hugging Face's DataCollatorWithPadding handles the common case.

Rule of thumb: set three things on every training DataLoader: shuffle=True, num_workers>0, and, on CUDA, pin_memory=True. Set three things on every validation DataLoader: shuffle=False, num_workers>0, and, on CUDA, pin_memory=True.

Code

Train and val DataLoaders, the right way·python

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_ds,
    batch_size=128,
    shuffle=True,
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,   # avoid worker spawn cost each epoch
    drop_last=True,            # drop the last partial batch (cleaner stats)
)

val_loader = DataLoader(
    val_ds,
    batch_size=256,            # bigger batch for inference
    shuffle=False,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
)

Custom collate for variable-length sequences·python

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    seqs, labels = zip(*batch)        # tuples of variable-length tensors and scalars
    seqs = pad_sequence(seqs, batch_first=True, padding_value=0)
    labels = torch.tensor(labels)
    return seqs, labels

loader = DataLoader(my_dataset, batch_size=32, collate_fn=pad_collate)
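As a usage sketch for this kind of collate function, the variant below also returns a boolean mask so that padding can be told apart from real data. The function name `pad_collate_with_lengths` and the three-example toy dataset are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate_with_lengths(batch):
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    # mask[i, t] is True where position t holds real data for sequence i
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]
    return padded, mask, torch.tensor(labels)

# Toy dataset (hypothetical): a plain list of (sequence, label) pairs
# with sequence lengths 2, 5 and 3.
toy_data = [
    (torch.tensor([1, 2]), 0),
    (torch.tensor([3, 4, 5, 6, 7]), 1),
    (torch.tensor([8, 9, 10]), 0),
]

loader = DataLoader(toy_data, batch_size=3, collate_fn=pad_collate_with_lengths)
padded, mask, labels = next(iter(loader))
print(padded.shape)     # every sequence is padded to the longest one in the batch
print(mask.sum(dim=1))  # number of real (non-padding) positions per sequence
```

Returning a mask matters whenever the padding value (0 here) could also be a legitimate token or sample value.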
