Batch size and gradient accumulation

~10 min · batch, gradient-accumulation

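LLM pretraining uses very large effective batch sizes: millions of tokens per gradient update. A large batch stabilizes the gradient estimate (averaging over more samples lowers the noise in the mean gradient) and improves hardware utilization. The challenge is GPU memory.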
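
To make the noise claim concrete, here is a small simulation. It assumes each per-sample gradient is an independent draw around a fixed true value, which is a simplification: real gradients are neither Gaussian nor perfectly independent. The function name `std_of_batch_mean` and all constants are illustrative.

```python
import random
import statistics


def std_of_batch_mean(batch_size, trials=2000, seed=0):
    """Empirical standard deviation of the mean of `batch_size` noisy samples.

    Each sample stands in for one per-example gradient component:
    true value 1.0 plus unit Gaussian noise (an assumption for illustration).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [1.0 + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.stdev(means)


if __name__ == "__main__":
    for b in (1, 16, 256):
        # For independent samples, theory predicts std = 1 / sqrt(b)
        print(f"batch={b:4d}  std of mean gradient ~ {std_of_batch_mean(b):.3f}")
```

Under these assumptions the spread of the averaged gradient shrinks roughly as 1/sqrt(batch size), so a 16× larger batch gives about 4× less noise.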

Gradient accumulation

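When the desired batch does not fit in memory, run several smaller "micro-batches", accumulate their gradients, and apply a single update. The effective batch size is micro-batch size × number of accumulation steps. Memory cost is determined by the micro-batch size (activations are held for only one micro-batch at a time), while total compute is the same as for a single large batch.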
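
The equivalence can be checked by hand on a toy problem. The sketch below uses a one-parameter linear model with a mean-squared-error loss in plain Python; the helper names `mean_grad` and `accumulated_grad` and the data are made up for illustration. It assumes equal-sized micro-batches, which is what makes the match exact.

```python
def mean_grad(w, xs, ys):
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, scaling each by 1 / accumulation_steps."""
    assert len(xs) % micro_batch_size == 0, "equal-sized micro-batches assumed"
    accumulation_steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(0, len(xs), micro_batch_size):
        g = mean_grad(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        total += g / accumulation_steps
    return total


xs = [float(i) for i in range(1, 9)]       # 8 toy inputs
ys = [3.0 * x + 0.5 for x in xs]           # toy targets
full = mean_grad(0.0, xs, ys)              # one batch of 8
accum = accumulated_grad(0.0, xs, ys, 2)   # 4 micro-batches of 2
print(full, accum)                         # equal up to floating-point rounding
```

If micro-batches contribute different numbers of samples (for example, different counts of unpadded tokens), averaging the per-micro-batch means is no longer exactly the full-batch mean, and the scaling would need to account for that.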

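This pattern is so common that every modern training framework, including PyTorch, DeepSpeed, and Megatron-LM, treats it as standard. Production runs often use a per-GPU micro-batch of about 1-4 sequences and about 16-128 accumulation steps, distributed across hundreds to thousands of GPUs.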
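
As a rough sanity check on how these numbers combine, the helper below multiplies them out. The function `tokens_per_update` and the configuration are hypothetical, chosen only to land in the "millions of tokens" range. It counts data-parallel replicas rather than raw GPUs, since with tensor or pipeline parallelism several GPUs share one replica.

```python
def tokens_per_update(micro_batch_size, accumulation_steps,
                      data_parallel_replicas, seq_len):
    """Tokens contributing to a single optimizer update."""
    sequences = micro_batch_size * accumulation_steps * data_parallel_replicas
    return sequences * seq_len


# Hypothetical configuration, not taken from any specific training run.
print(tokens_per_update(micro_batch_size=2, accumulation_steps=16,
                        data_parallel_replicas=64, seq_len=2048))  # 4194304
```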

Code

Gradient accumulation in PyTorch (Python)

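import torch

# Assumes model, loader, optimizer, scheduler, causal_lm_loss, and
# accumulation_steps are defined elsewhere.
optimizer.zero_grad()
for step, micro_batch in enumerate(loader):
    logits = model(micro_batch)
    loss = causal_lm_loss(logits, micro_batch)
    # Divide by accumulation_steps so the summed gradients equal the mean
    # over the effective batch, not accumulation_steps times that mean
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        # Gradient clipping is standard for LLM training; clip the
        # accumulated gradient once per update, not per micro-batch
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()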
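
One detail the loop above leaves open: when the number of micro-batches is not a multiple of `accumulation_steps`, the last few gradients are accumulated but never applied. One possible way to reason about it, sketched in plain Python with a hypothetical helper `update_steps`, is to list where updates should happen and how many micro-batches each one covers.

```python
def update_steps(num_micro_batches, accumulation_steps):
    """Return (step_index, group_size) for each optimizer update.

    group_size is the number of micro-batches accumulated for that update.
    The final group may be smaller than accumulation_steps.
    """
    updates = []
    for step in range(num_micro_batches):
        is_boundary = (step + 1) % accumulation_steps == 0
        is_last = step + 1 == num_micro_batches
        if is_boundary or is_last:
            group_size = (step % accumulation_steps) + 1
            updates.append((step, group_size))
    return updates


print(update_steps(num_micro_batches=10, accumulation_steps=4))
# [(3, 4), (7, 4), (9, 2)]
```

Because the loss is scaled before `backward()`, using the smaller `group_size` for the final group requires knowing the loader length in advance. Simply dropping the incomplete tail is a common alternative.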

Exercise

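Train a small model at an effective batch size of 256, reached with three configurations: (a) micro-batch 256, accumulation 1 (assuming it fits in memory), (b) micro-batch 32, accumulation 8, (c) micro-batch 8, accumulation 32. Compare wall-clock time and final loss. The loss curves should be identical, modulo differences in GPU-level optimizations. Are they?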
