Devices and Tensor Placement

The CPU/GPU boundary is real

Every PyTorch tensor lives on a specific device, and x.device tells you which one. An operation between tensors requires both to be on the same device, and you have to move them explicitly with .to(device). Forgetting this produces the error new PyTorch users see most often: Expected all tensors to be on the same device.
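A minimal sketch of the rule above, assuming only a stock PyTorch install. The tensor names a and b are illustrative. On a CPU-only machine both tensors end up on the CPU and the addition simply succeeds; on a machine with a CUDA GPU the first addition raises the mismatch error and the except branch applies the fix.

```python
import torch

# Pick an accelerator if one is visible; otherwise everything stays on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.ones(3)                 # created on the CPU by default
b = torch.ones(3, device=device)  # created directly on the chosen device
print(a.device, b.device)

try:
    total = a + b                 # raises RuntimeError if the devices differ
except RuntimeError as err:
    print("device mismatch:", err)
    total = a.to(device) + b      # the fix: move `a` explicitly, then add

print(total.device)
```

Note that .to(device) returns a new tensor rather than moving a in place, so the result has to be used or assigned.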

For training to be fast, both the model and the data need to be on the GPU. Move the model once at startup, and move each batch inside the training loop. Output tensors stay on the GPU until you explicitly bring them back, for printing or for downstream non-PyTorch code.

Tip: there are three places to put .to(device): once on the model at startup, once per input batch in the loop, and once on the loss tensor when you want to print it (for a scalar, .item() brings it back to Python). Almost everything else stays on the GPU.
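The three placement points from the tip can be sketched in a few lines. The tiny nn.Linear model and the random batch are stand-ins chosen for illustration, not part of the lesson's code; the point is only where each tensor lives.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)         # 1) model: moved once at startup
xb = torch.randn(8, 4).to(device)          # 2) input batch: moved every step
yb = torch.randint(0, 2, (8,)).to(device)  #    labels travel with the inputs

logits = model(xb)                         # output lives where the model lives
loss = nn.functional.cross_entropy(logits, yb)

loss_value = loss.item()                   # 3) scalar copied into a Python float
# Small results can be copied out for plain Python code when needed.
preds = logits.argmax(dim=1).cpu().tolist()
print(logits.device, loss_value, preds)
```

For NumPy consumers, calling .detach().cpu().numpy() on the output is the usual route; a tensor that still lives on the GPU cannot be converted directly.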

Picking the right device

A one-liner handles the three common cases (NVIDIA, Apple Silicon, CPU fallback); see the Code section below. Apple's MPS backend is fast for many workloads but occasionally lacks an op: if you hit a NotImplementedError, report it upstream.
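One way to keep working while a missing op is being reported is to retry that single op on the CPU. The sketch below is an assumption-laden illustration: run_with_cpu_fallback is a hypothetical helper, not a PyTorch API, and torch.cumsum is used only as an arbitrary example op.

```python
import torch

# Same priority order as the lesson: CUDA, then Apple MPS, then CPU.
device = (
    torch.device("cuda") if torch.cuda.is_available()
    else torch.device("mps") if torch.backends.mps.is_available()
    else torch.device("cpu")
)


def run_with_cpu_fallback(op, *tensors):
    """Hypothetical helper: run `op`, retrying on the CPU if the backend lacks it."""
    try:
        return op(*tensors)
    except NotImplementedError as err:
        print(f"falling back to CPU: {err}")
        result = op(*(t.cpu() for t in tensors))
        # Move the result back so the rest of the pipeline stays on one device.
        return result.to(tensors[0].device)


x = torch.arange(6, dtype=torch.float32, device=device).reshape(2, 3)
y = run_with_cpu_fallback(lambda t: torch.cumsum(t, dim=1), x)
print(y.device, y.cpu().tolist())
```

PyTorch also honors the PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable, which routes unsupported MPS ops to the CPU globally. The explicit helper trades that convenience for visibility into which ops are falling back.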

Multi-GPU and distributed

One machine, multiple GPUs: use torch.nn.parallel.DistributedDataParallel (DDP), launched with torchrun. Multiple machines: the same DDP API with a different launch script. Higher level: Hugging Face's accelerate launch hides the boilerplate.
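As a rough sketch of the single-machine DDP case, the script below runs one training step. The toy nn.Linear model, the random data, the function name train_step, and the default port 29500 are all assumptions made for illustration. torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker; the defaults here let the same file also run as one plain python process on a CPU.

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_step() -> float:
    # torchrun provides these; the defaults cover a single plain process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"   # gloo is the CPU-capable backend
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    try:
        device = torch.device(f"cuda:{local_rank}") if use_cuda else torch.device("cpu")
        if use_cuda:
            torch.cuda.set_device(device)      # one GPU per worker process

        model = nn.Linear(10, 1).to(device)
        ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        xb = torch.randn(16, 10, device=device)
        yb = torch.randn(16, 1, device=device)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(xb), yb)
        loss.backward()                        # DDP averages gradients across workers here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


loss_value = train_step()
print("loss:", loss_value)
```

Saved under a hypothetical name such as train.py, this would be launched on a four-GPU machine with torchrun --nproc_per_node=4 train.py. A real job would also shard the dataset per worker, typically with torch.utils.data.distributed.DistributedSampler.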

Rule of thumb: in 2026, default to single-GPU training until the model truly doesn't fit. The complexity tax of multi-GPU is real, and modern accelerators (H100, MI300X, M3 Ultra) hold surprisingly large models.
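Whether a model "fits" can be roughly estimated before reaching for multi-GPU. The sketch below assumes fp32 training with Adam, which costs about 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for Adam's two moment buffers) and ignores activations entirely, so treat the result as a lower bound. The function name estimate_training_gib and the toy model are illustrative.

```python
import torch
from torch import nn


def estimate_training_gib(model: nn.Module, bytes_per_param: int = 16) -> float:
    """Rough weights + gradients + Adam-state footprint in GiB (activations excluded)."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * bytes_per_param / 2**30


model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
needed = estimate_training_gib(model)
print(f"~{needed:.3f} GiB for parameters, gradients, and optimizer state")

if torch.cuda.is_available():
    # Compare against the physical memory of the first visible GPU.
    total = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"GPU 0 has {total:.1f} GiB; fits before activations: {needed < total}")
```

Mixed precision, a different optimizer, or large batch sizes change the per-parameter figure substantially, so the default of 16 is only a starting point.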

Code

Pick the device once, use it everywhere·python

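import torch

# Prefer CUDA, then Apple MPS, then fall back to the CPU.
device = (
    torch.device("cuda") if torch.cuda.is_available()
    else torch.device("mps") if torch.backends.mps.is_available()
    else torch.device("cpu")
)
print("device:", device)

# MyModel, train_loader, loss_fn, and optimizer are assumed to be defined elsewhere.
model = MyModel().to(device)   # move the model once, at startup

for xb, yb in train_loader:
    # Move each batch to the model's device. non_blocking=True only helps when
    # the DataLoader uses pin_memory=True and the target device is CUDA.
    xb, yb = xb.to(device, non_blocking=True), yb.to(device, non_blocking=True)
    optimizer.zero_grad()      # clear gradients left over from the previous step
    logits = model(xb)
    loss = loss_fn(logits, yb)
    loss.backward()
    optimizer.step()
    print(loss.item())   # .item() brings the scalar back to CPU/Python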
