CPU, CUDA, MPS: Devices and Moving Data

One framework, three (or four) backends

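PyTorch supports CPU, NVIDIA CUDA, and Apple Silicon MPS as first-class devices. ROCm (AMD) and XPU (Intel) exist too, but they are less common in indie work. The mental model has three parts: a tensor lives on exactly one device; an op requires all of its inputs to be on the same device; and .to(device) moves data between devices.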

The device-string idiom

Nearly every PyTorch project starts with the same expression: pick the best available device, then reuse that string everywhere:

device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

That string is passed to .to(device) for the model and for every batch. A device-mismatch error such as 'expected cuda:0 got cpu' almost always traces back to a single tensor that was never moved.
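Batches are often dicts or tuples rather than bare tensors, and that is where a tensor tends to get left behind. Below is a minimal sketch of a recursive mover. The helper name `move_to` is hypothetical, not a PyTorch API, and it assumes batches are built from plain dicts, lists, and tuples (namedtuples and custom classes would need extra handling).

```python
import torch

def move_to(obj, device):
    """Recursively move every tensor inside nested dicts/lists/tuples.

    Hypothetical helper for illustration; non-tensor leaves are returned as-is.
    """
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: move_to(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type (plain list or tuple only)
        return type(obj)(move_to(v, device) for v in obj)
    return obj

device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

batch = {
    "inputs": torch.randn(4, 8),
    "extras": [torch.ones(4), "sample-id"],  # mixed tensor / non-tensor
}
batch = move_to(batch, device)

# Every tensor in the batch now reports the chosen device type
print(batch["inputs"].device.type, batch["extras"][0].device.type)
```

When a mismatch still shows up, printing `{p.device for p in model.parameters()}` next to the batch devices usually points at the culprit.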

MPS: the Apple Silicon backend

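MPS works for most standard ops. A few operators are still not implemented on MPS: by default calling one raises NotImplementedError, and setting PYTORCH_ENABLE_MPS_FALLBACK=1 makes those ops fall back to the CPU (with a warning) instead. torch.compile support on MPS is improving but is not as mature as on CUDA. For the best Apple Silicon performance, MLX is often a better fit, since it was designed around the unified memory architecture from day one.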
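A minimal sketch of opting in to CPU fallback from inside a script. It assumes the variable is read when torch initializes, so it is set before `import torch`; exporting it in the shell before launching Python is the more conservative route.

```python
import os

# Assumption: must be set before torch is imported to take effect
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

# Use MPS when present, otherwise stay on CPU so the sketch runs anywhere
device = "mps" if torch.backends.mps.is_available() else "cpu"

x = torch.randn(4, 4, device=device)
y = (x @ x).sum()
print(device, y.device.type)
```

Fallback keeps a script running, but each fallen-back op copies data to the CPU and back, so it is worth treating the warning as a performance hint rather than noise.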

Code

Device detection and the canonical idiom·python

import torch

print(torch.cuda.is_available())          # True if NVIDIA GPU
print(torch.backends.mps.is_available())  # True on Apple Silicon
print(torch.cuda.device_count())          # number of CUDA GPUs

device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device}")

Moving tensors and models·python

import torch
import torch.nn as nn

device = "mps"  # or whatever you picked

# Tensors
x = torch.randn(32, 100)            # CPU
x = x.to(device)                     # moves
print(x.device)                      # mps:0

# Create directly on device — avoids the CPU→device copy
y = torch.randn(32, 100, device=device)

# Models — to() moves all parameters AND buffers
model = nn.Linear(100, 10).to(device)

# Inputs and weights MUST be on the same device
out = model(y)   # works
# out = model(torch.randn(32, 100))  # RuntimeError: device mismatch
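One detail behind the snippet above that is easy to miss: `Tensor.to()` returns a tensor and does not change the original in place, while `nn.Module.to()` moves the module's parameters in place and returns the same module. A small sketch of the difference follows; the variable names are illustrative only.

```python
import torch
import torch.nn as nn

device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

x = torch.randn(8, 100)   # created on CPU
x.to(device)              # result discarded: x still refers to the CPU tensor
x = x.to(device)          # rebind the name to the tensor returned by .to()

model = nn.Linear(100, 10)
returned = model.to(device)   # parameters moved in place; same module returned
print(returned is model)

out = model(x)                # inputs and weights now share a device
print(out.shape, out.device.type)
```

Forgetting the reassignment on a tensor is a common way to end up with a device-mismatch error even though `.to(device)` appears in the code.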

Pinned memory + non_blocking — GPU training speedup·python

import torch
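from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset so the snippet runs; swap in your own
dataset = TensorDataset(torch.randn(1024, 100), torch.randint(0, 10, (1024,)))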

# In your DataLoader, set pin_memory=True
# (only meaningful when you'll move data to a CUDA GPU)
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

device = "cuda"
for x, y in loader:
    # non_blocking=True lets the CPU keep going while the copy queues
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ...training step...
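The loop above hardcodes "cuda". A hedged variant that enables pinning and non-blocking copies only when CUDA is actually selected, so the same script still runs on a CPU-only machine, might look like this. The random `TensorDataset` is placeholder data, and `num_workers=0` is chosen here only to keep the sketch simple.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
use_cuda = device == "cuda"

# Placeholder data: 256 samples, 100 features, 10 classes
dataset = TensorDataset(torch.randn(256, 100), torch.randint(0, 10, (256,)))

# Pin host memory only when batches will be copied to a CUDA GPU
loader = DataLoader(dataset, batch_size=64, num_workers=0, pin_memory=use_cuda)

n_batches = 0
for x, y in loader:
    # non_blocking only helps when the source tensor is in pinned memory
    x = x.to(device, non_blocking=use_cuda)
    y = y.to(device, non_blocking=use_cuda)
    n_batches += 1
    # ...training step...

print(n_batches, x.device.type)
```

Gating both flags on one boolean keeps them consistent: `non_blocking=True` without pinned source memory generally gives no overlap.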
