Mixed Precision, Gradient Accumulation, and Memory

Mixed precision: bf16 as the 2026 default

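Weights are stored in fp32. During the forward and backward passes, activations and their gradients run in bf16, while the optimizer step updates fp32 master weights. Setting bf16=True in TrainingArguments handles all of this. On supported hardware (Ampere or newer) the quality cost is close to zero and training runs roughly 2x faster.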
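To see why the master weights stay in fp32, the sketch below emulates bf16 rounding in plain Python and applies the same small update 100 times in each precision. The helper names `to_bf16` and `to_fp32` and the update size of 1e-3 are illustrative assumptions, not part of any library; the emulation ignores NaN and infinity handling.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bf16 is the top 16 bits of a float32, so it keeps the fp32 exponent
    range but only 8 bits of mantissa precision.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

update = 1e-3          # hypothetical lr * grad, small relative to the weight
bf16_weight = to_bf16(1.0)
fp32_weight = to_fp32(1.0)

for _ in range(100):
    # Pure bf16 weights: the update is below half a bf16 step near 1.0,
    # so every addition rounds back to the old value.
    bf16_weight = to_bf16(bf16_weight + to_bf16(update))
    # fp32 master weights: the same update accumulates as expected.
    fp32_weight = to_fp32(fp32_weight + update)

print(f"bf16 weight after 100 updates: {bf16_weight}")
print(f"fp32 weight after 100 updates: {fp32_weight:.4f}")
```

The bf16 copy never moves away from 1.0, while the fp32 copy ends near 1.1. This is the failure mode that fp32 master weights avoid.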

Gradient accumulation: a free virtual batch size

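If the GPU fits 4 examples but the recipe calls for batch=64, set per_device_train_batch_size=4 and gradient_accumulation_steps=16. The optimizer accumulates gradients over 16 micro-batches and sees a 64-example gradient. Cost: each optimizer step now takes 16 sequential forward/backward passes instead of one. Benefit: any recipe can be trained on any GPU that fits a single micro-batch; it is just slower.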
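The equivalence can be checked on a toy problem. The sketch below uses a one-parameter linear model with a mean-squared-error loss (a made-up setup, chosen only so the gradient fits in one line) and compares the gradient of a full batch of 64 with the gradient accumulated over 16 micro-batches of 4.

```python
import random

def mse_grad(w, xs, ys):
    """Gradient w.r.t. w of mean((w * x - y) ** 2) over the given examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(64)]
ys = [3.0 * x + random.gauss(0, 0.1) for x in xs]
w = 0.5

micro_batch_size = 4
accumulation_steps = 16   # 4 * 16 = 64 examples per optimizer step

# Reference: gradient of one true batch of 64.
full_grad = mse_grad(w, xs, ys)

# Accumulation: sum the micro-batch gradients, each scaled by
# 1 / accumulation_steps so that the total is a mean over all 64 examples.
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated:.6f}")
```

The two values agree up to floating-point rounding. The scaling by 1 / accumulation_steps matters: without it the accumulated gradient would be 16x too large.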

Three more memory tricks

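gradient_checkpointing=True: most activations are not stored during the forward pass and are recomputed during the backward pass. Roughly 50% less memory at roughly 30% slower training.
optim="adamw_8bit" (bitsandbytes): keeps the Adam optimizer state in 8-bit. The savings are large when training big models.
torch.compile(model): the PyTorch 2.x graph compiler. With stable input shapes it gives about 1.2-1.5x speedup; with dynamic shapes it sometimes hurts.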
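A back-of-envelope estimate shows where the 8-bit optimizer saving comes from. Adam keeps two state tensors per parameter (first and second moment), so the sketch below multiplies the parameter count by the bytes per state value. The 7-billion-parameter model is a hypothetical example, the helper name `adam_state_gb` is made up, and the 8-bit figure ignores the small extra storage for quantization metadata.

```python
def adam_state_gb(num_params: int, bytes_per_value: int) -> float:
    """Approximate memory for Adam's two moment tensors, in GB (1e9 bytes)."""
    states_per_param = 2  # first moment (m) and second moment (v)
    return num_params * states_per_param * bytes_per_value / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model

fp32_state = adam_state_gb(num_params, bytes_per_value=4)
int8_state = adam_state_gb(num_params, bytes_per_value=1)

print(f"Adam state in fp32:  {fp32_state:.1f} GB")
print(f"Adam state in 8-bit: {int8_state:.1f} GB")
print(f"saved:               {fp32_state - int8_state:.1f} GB")
```

Under these assumptions the optimizer state shrinks from about 56 GB to about 14 GB. Weights, gradients, and activations are not affected by this setting.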

Code

Memory-efficient TrainingArguments·python

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 16 (single GPU)
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_8bit",              # requires bitsandbytes
    num_train_epochs=3,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    save_steps=500,
    eval_steps=500,                  # only used when step-based evaluation is enabled
    logging_steps=50,
    torch_compile=False,             # try True once the baseline works
)
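To relate these settings to an actual run, the sketch below works out the effective batch size and the approximate number of optimizer steps. The dataset size of 10,000 examples, the single GPU, and the helper name `training_schedule` are assumptions for illustration; the exact rounding of steps per epoch and warmup steps can differ between library versions.

```python
import math

def training_schedule(num_examples, per_device_batch, accumulation_steps,
                      num_gpus, epochs, warmup_ratio):
    """Approximate schedule figures derived from the batch settings."""
    effective_batch = per_device_batch * accumulation_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps

# Values mirror the TrainingArguments above; num_examples is hypothetical.
effective_batch, steps_per_epoch, total_steps, warmup_steps = training_schedule(
    num_examples=10_000,
    per_device_batch=2,
    accumulation_steps=8,
    num_gpus=1,
    epochs=3,
    warmup_ratio=0.03,
)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps/epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
print(f"checkpoints at save_steps=500: {total_steps // 500}")
```

With these numbers a run has well under 2,000 optimizer steps, so save_steps=500 yields only a few intermediate checkpoints. For a much smaller dataset, the save, eval, and logging intervals would need to shrink accordingly.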

Inspecting peak memory during training·python

import torch

# Reset the peak-memory counter
torch.cuda.reset_peak_memory_stats()

# ... run trainer.train() for a few steps ...

print(f"peak: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
