Accelerators and Devices

What you are likely to run on in 2026

NVIDIA H100 / H200 / B100: the workhorses of cloud training. BF16, FP8, large VRAM (80 GB+).
NVIDIA A100: still common. A somewhat older generation, but plentiful in the cloud.
NVIDIA RTX 4090 / 5090: single-GPU desktop training, 24-32 GB VRAM.
Apple Silicon (M3 Ultra, M4 Max): unified memory architecture, with 64-512 GB shared between CPU and GPU. Fast for inference and small-to-medium training via MPS or MLX.
Google TPU (v5e, v5p): Google Cloud only. Great for training very large models; requires JAX or PyTorch/XLA.
AMD MI300X: an emerging alternative to the H100. PyTorch ROCm support is improving.

Tip: Most learning and production inference happens on a single accelerator. Multi-GPU and multi-node training are steps to put off until you truly need them; the engineering tax is real.

Memory is usually the binding constraint

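Modern accelerators have plenty of FLOPs. The constraint is fitting the model, activations, gradients, and optimizer state into VRAM. Mixed precision roughly halves the memory for model weights and activations. Gradient checkpointing trades compute for memory: activations are recomputed during the backward pass instead of being stored. ZeRO and FSDP shard optimizer state (and, depending on configuration, gradients and parameters) across GPUs.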
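To make the memory budget above concrete, here is a rough back-of-the-envelope sketch. It assumes an Adam-style optimizer that keeps two FP32 moments per parameter (8 bytes), and it ignores activations, framework overhead, and any FP32 master copy of the weights that a real mixed-precision setup may keep. The function name `training_memory_gb` and the 7-billion-parameter model are hypothetical, chosen only for illustration.

```python
def training_memory_gb(n_params, weight_bytes=4, grad_bytes=4, optim_bytes=8):
    """Rough memory for weights + gradients + optimizer state, in GB (1e9 bytes).

    Assumption: optim_bytes=8 models two FP32 Adam moments per parameter.
    Activations and framework overhead are NOT included.
    """
    total_bytes = n_params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / 1e9


n_params = 7e9  # hypothetical 7B-parameter model

# Everything in FP32: 4 + 4 + 8 = 16 bytes per parameter.
fp32_gb = training_memory_gb(n_params)

# 16-bit weights and gradients, FP32 optimizer state: 2 + 2 + 8 = 12 bytes.
mixed_gb = training_memory_gb(n_params, weight_bytes=2, grad_bytes=2)

print(f"FP32:  {fp32_gb:.0f} GB")   # 112 GB
print(f"Mixed: {mixed_gb:.0f} GB")  # 84 GB
```

Even under these simplified assumptions, optimizer state dominates the total, which is why sharding it across GPUs is often the first thing that helps once a single device no longer fits the job.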

Apple Silicon is real

For local inference and modest training, Apple Silicon is competitive. Unified memory means a 192 GB Mac Studio can hold model weights that would need multiple GPUs on an x86 machine. MLX is Apple's native framework, and the PyTorch MPS backend is the cross-platform path. Both are first-class in 2026.

Principle: Choose an accelerator based on memory, not FLOPS. A 24 GB consumer GPU often beats a 16 GB datacenter GPU at LLM fine-tuning simply because the model fits.
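The principle can be sketched as a two-step selection: filter candidates by whether the job fits in memory, and only then rank by speed. Everything here is hypothetical: the `Accelerator` class, the `pick_accelerator` helper, the device names, the `relative_flops` values (illustrative, not measured specs), the 20 GB requirement, and the 10% headroom reserved for overhead.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    memory_gb: float
    relative_flops: float  # illustrative number, not a real spec


def pick_accelerator(candidates, required_gb, headroom=0.9):
    """Return the fastest candidate whose usable memory covers required_gb.

    Assumption: only `headroom` (90%) of the memory is usable for the job.
    Returns None when nothing fits.
    """
    fitting = [a for a in candidates if a.memory_gb * headroom >= required_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda a: a.relative_flops)


candidates = [
    Accelerator("datacenter-16gb", memory_gb=16, relative_flops=1.5),
    Accelerator("consumer-24gb", memory_gb=24, relative_flops=1.0),
]

# Hypothetical fine-tuning job that needs about 20 GB.
choice = pick_accelerator(candidates, required_gb=20)
print(choice.name)  # consumer-24gb
```

The nominally faster 16 GB card never enters the ranking, because a job that does not fit cannot run at any speed.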

Code

Universal device picker (Python)

import torch

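def best_device():
    """Return the best available torch device: CUDA, then Apple MPS, then CPU."""
    # Also true on ROCm builds of PyTorch, which expose AMD GPUs as "cuda".
    if torch.cuda.is_available():
        return torch.device("cuda")
    # MPS is the Metal backend used on Apple Silicon GPUs.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = best_device()
print("device:", device)

# Device name and total memory are only queried here for CUDA devices.
if device.type == "cuda":
    print("model:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)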
