DistributedDataParallel — Multi-GPU Training

The standard for multi-GPU training

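DistributedDataParallel (DDP) is PyTorch's standard approach to multi-GPU training. The mental model: one process per GPU, where each process holds its own copy of the model and runs its own forward and backward pass on its own slice of the data. During the backward pass, gradients are synchronized across processes with an all-reduce; each optimizer step then applies the same averaged gradients to its own copy of the weights, so all replicas stay in lockstep.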
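The lockstep argument can be checked with a toy simulation. The sketch below uses no GPUs and no torch: it models two replicas of a one-parameter linear model, and `all_reduce_mean` is a hypothetical stand-in for the real all-reduce, not a PyTorch API. It assumes equal-sized shards, in which case the averaged gradient matches the full-batch gradient.

```python
# Pure-Python simulation of DDP's gradient averaging (illustrative only).
# Model: y_hat = w * x, loss = mean((w * x - y)^2) over the local shard.

def local_gradient(w, shard):
    """d/dw of the mean squared error over one replica's data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for all-reduce: every replica receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def ddp_step(weights, shards, lr):
    """One lockstep update: local backward, average gradients, local step."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]   # y = 2x
shards = [data[0::2], data[1::2]]     # two "GPUs", equal-sized shards
weights = [0.0, 0.0]                  # identical initial copies

for _ in range(50):
    weights = ddp_step(weights, shards, lr=0.01)

# Reference: a single process training on the full dataset.
single = 0.0
for _ in range(50):
    single -= 0.01 * local_gradient(single, data)
```

Both entries of `weights` end up identical because every replica starts from the same value and applies the same averaged gradient, and they track the single-process run up to floating-point rounding.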

What to add to a single-GPU script

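Initialize the process group (dist.init_process_group("nccl")).
Set the per-process device (torch.cuda.set_device(local_rank), where local_rank is the process's GPU index on its node, read from LOCAL_RANK).
Wrap the model in DDP (model = DDP(model, device_ids=[local_rank])).
Use a DistributedSampler so that each process sees a different slice of the data.
Launch with torchrun (the modern launcher, not the older torch.distributed.launch).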

The DistributedSampler trap

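Without a DistributedSampler, every process iterates over the full dataset: the work is duplicated and the extra compute is wasted. Calling sampler.set_epoch(epoch) at the top of every epoch is mandatory; without it, the shuffle order is identical in every epoch and training can plateau for no obvious reason.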
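To see why set_epoch matters, the sketch below mimics the sharding strategy in plain Python: shuffle with a seed derived from the epoch, pad so every rank gets the same count, then take a strided slice per rank. It is a simplified illustration, not the actual DistributedSampler implementation; `shard_indices` and its parameters are hypothetical names, and Python's `random` module stands in for the torch generator.

```python
import math
import random

def shard_indices(dataset_len, world_size, rank, epoch, seed=0):
    """Indices one rank would see in one epoch (simplified sketch)."""
    indices = list(range(dataset_len))
    # Every rank seeds with (seed + epoch), so all ranks build the SAME
    # permutation; it changes only when the epoch number changes.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by wrapping around so every rank gets the same number of samples.
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices += indices[: total - dataset_len]
    # Strided split: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]

# 10 samples over 4 ranks: 3 samples each, 2 of them padding duplicates.
epoch0 = [shard_indices(10, 4, r, epoch=0) for r in range(4)]
epoch1 = [shard_indices(10, 4, r, epoch=1) for r in range(4)]
# Forgetting set_epoch is equivalent to the epoch staying at 0 forever.
forgot = [shard_indices(10, 4, r, epoch=0) for r in range(4)]
```

Here `forgot` equals `epoch0` exactly: when the epoch number never advances, every rank replays the same order in every epoch.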

torchrun: the modern launcher

torchrun --nproc_per_node=4 train.py spawns 4 processes and sets environment variables (LOCAL_RANK, WORLD_SIZE, RANK) in each one so that your script can identify itself. The older torch.distributed.launch still works but is deprecated; prefer torchrun.
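One way to consume these variables is a small helper that also falls back to a single-process layout when the script is started with plain `python`. This is a sketch: `DistEnv` and `read_dist_env` are hypothetical names, and the defaults (rank 0, world size 1) are an assumption about what a non-distributed run should look like.

```python
import os
from typing import Mapping, NamedTuple

class DistEnv(NamedTuple):
    local_rank: int   # GPU index on this node
    rank: int         # global process index across all nodes
    world_size: int   # total number of processes

def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read torchrun's variables; fall back to a single-process layout."""
    return DistEnv(
        local_rank=int(env.get("LOCAL_RANK", 0)),
        rank=int(env.get("RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )

# Values as torchrun would set them for the third of 4 processes on one node.
example = read_dist_env({"LOCAL_RANK": "2", "RANK": "2", "WORLD_SIZE": "4"})
# Plain `python train.py`: nothing is set, so the defaults apply.
fallback = read_dist_env({})
```

Passing the environment in as a mapping keeps the helper easy to test without launching real processes.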

Code

DDP training script — canonical shape·python

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup():
    # torchrun sets LOCAL_RANK (the GPU index on this node) for every process.
    dist.init_process_group("nccl")
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank

def train(model, dataset, num_epochs):
    local_rank = setup()
    model = model.to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Each process gets its own slice; batch_size is per process.
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)               # MANDATORY for shuffle
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()                    # gradients sync here (all-reduce)
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # MyModel and MyDataset are placeholders for your own nn.Module and Dataset.
    model = MyModel()
    dataset = MyDataset()
    train(model, dataset, num_epochs=10)

Launching with torchrun·python

# Single-node, 4 GPUs
# torchrun --nproc_per_node=4 train.py

# Single-node, 8 GPUs, with extra args to your script
# torchrun --nproc_per_node=8 train.py --lr 1e-4 --batch_size 32

# Multi-node — each node runs its own torchrun
# Node 0:
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
#            --master_addr=NODE0_IP --master_port=29500 train.py
# Node 1:
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
#            --master_addr=NODE0_IP --master_port=29500 train.py

# Inside train.py, each process can read its own identity:
import os
print(os.environ['LOCAL_RANK'], os.environ['RANK'], os.environ['WORLD_SIZE'])
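For the two-node launch above, it helps to see how LOCAL_RANK and RANK relate. The sketch below assumes the common static layout in which ranks are numbered node by node, so that RANK = node_rank * nproc_per_node + LOCAL_RANK; other rendezvous setups may number nodes differently. `global_rank` and `layout` are hypothetical helper names.

```python
def global_rank(node_rank, local_rank, nproc_per_node):
    """Global RANK under the assumed node-by-node numbering."""
    return node_rank * nproc_per_node + local_rank

def layout(nnodes, nproc_per_node):
    """(node_rank, LOCAL_RANK, RANK) for every process in the job."""
    return [(n, l, global_rank(n, l, nproc_per_node))
            for n in range(nnodes) for l in range(nproc_per_node)]

# Two nodes with 8 GPUs each, as in the multi-node commands above.
table = layout(nnodes=2, nproc_per_node=8)
world_size = len(table)
```

Under this layout, the first process on node 1 has LOCAL_RANK 0 but RANK 8, which is why the GPU should be chosen from LOCAL_RANK while rank-0-only work should test RANK.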

rank-0-only logging — universal pattern·python

import os
import torch
import torch.distributed as dist

def is_main():
    return not dist.is_initialized() or dist.get_rank() == 0

def log(*args, **kwargs):
    if is_main():
        print(*args, **kwargs)

# Use throughout your training
log(f"epoch {epoch}: train_loss={train_loss:.4f}")

# Same trick for checkpoint saving — only rank 0 writes
if is_main():
    torch.save(model.module.state_dict(), 'best.pt')   # .module to unwrap DDP

# And for any setup that should happen ONCE, not N-times
if is_main():
    os.makedirs('outputs', exist_ok=True)
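The `.module` unwrap matters because a DDP-wrapped model prefixes every state_dict key with "module.", so a checkpoint saved from the wrapper does not load directly into a plain model. If such a checkpoint already exists, the keys can be repaired afterwards. A minimal sketch, with `strip_module_prefix` as a hypothetical helper name and strings standing in for tensors:

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy whose keys no longer carry the DDP wrapper prefix."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Stand-in for a checkpoint saved from the wrapped model by mistake;
# real values would be tensors, strings are used here for brevity.
wrapped = {"module.fc.weight": "W", "module.fc.bias": "b"}
clean = strip_module_prefix(wrapped)
```

Keys without the prefix pass through unchanged, so the helper is safe to apply to checkpoints that were saved correctly.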

Exercise

Convert a single-GPU training script to DDP. Run it with torchrun --nproc_per_node=N (N = number of GPUs). Verify: (a) the loss prints once per epoch, not N times (the is_main() rank-0 trick). (b) Training is actually faster with N>1 GPUs (not merely N times the cost). (c) The shuffle differs from epoch to epoch, which means sampler.set_epoch is placed correctly.
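For check (b), a quick sanity calculation tells you what to expect before timing anything. The sketch below assumes drop_last=False in both the sampler and the DataLoader, and uses a dataset size of 50,000 purely as an illustrative number; the per-GPU batch size of 32 matches the script above. The helper names are hypothetical.

```python
import math

def steps_per_epoch(dataset_len, per_gpu_batch, world_size):
    """Optimizer steps per epoch, assuming drop_last=False everywhere."""
    samples_per_rank = math.ceil(dataset_len / world_size)
    return math.ceil(samples_per_rank / per_gpu_batch)

def effective_batch(per_gpu_batch, world_size):
    """Samples contributing to each averaged gradient."""
    return per_gpu_batch * world_size

one_gpu = steps_per_epoch(50_000, 32, world_size=1)    # 1563 steps
four_gpu = steps_per_epoch(50_000, 32, world_size=4)   # 391 steps
global_batch = effective_batch(32, world_size=4)       # 128 samples
```

If the step count per epoch does not drop by roughly a factor of N, each process is probably still iterating over the full dataset. Note also that the effective batch size grows with N, which may call for retuning the learning rate.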