Data, Compute, Hardware

The resource triangle

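Every deep learning project sits inside a triangle of data, compute, and hardware. Move one corner and the other two rearrange. More data without the compute to train on it is wasted, more compute without more data tends to overfit, and better hardware widens the design space but doesn't invent new ideas.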

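Knowing which corner is binding for your own project is half the job. For a small startup, the data corner usually bites first. For a research lab, it's usually the compute corner. For a solo hobbyist on a single GPU, it's the hardware corner: 'let's try a 70B model' turns into 'let's quantize an 8B model and tune the prompts.'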
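To see why the hobbyist ends up at a quantized 8B model, here is a rough sketch of the arithmetic. It only counts the bytes needed to hold the weights; the KV cache, activations, and runtime overhead are ignored, and the bytes-per-parameter table is a simplifying assumption (real quantization formats add some metadata).

```python
# Weight-only memory estimate. Ignores KV cache, activations, and overhead.
# Bytes per parameter are idealized; real quantized formats carry extra metadata.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Gigabytes (1e9 bytes) needed to hold the weights alone."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in [("70B", 70e9), ("8B", 8e9)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name} @ {dtype}: {weight_memory_gb(n_params, dtype):.0f} GB")
```

Under these assumptions a 70B model needs about 140 GB in FP16 and still about 35 GB at 4 bits, while an 8B model at 4 bits fits in roughly 4 GB.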

What hardware actually does

GPUs and TPUs are tensor accelerators: they run matrix multiplies and elementwise operations fast and in parallel. Modern accelerators (NVIDIA H100, Apple Silicon GPU/ANE, Google TPUv5) add low-precision math (FP16, BF16, and on some chips FP8) and features dedicated to attention patterns. A deep learning framework's job is to translate high-level Python into the kernels these chips actually run.
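To get a feel for what the low-precision formats trade away, here is a small standard-library sketch. The helpers `to_bf16` and `to_fp16` are hypothetical names for this illustration. BF16 is emulated by truncating a float32 to its top 16 bits, which is an assumption made for simplicity; real hardware typically rounds instead of truncating.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the float32 encoding.

    This truncates; it is a simplification of what real hardware does.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; inf if the value is out of range."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf")


for value in (1.001, 70000.0):
    print(f"{value}: fp16={to_fp16(value)}, bf16={to_bf16(value)}")
```

FP16 keeps more digits near 1.0 but runs out of range at 70000, while BF16 keeps the float32 range and gives up precision.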

Tip: a number like 'the model runs at 1500 tokens/sec' usually hits the memory-bandwidth limit before it hits the FLOP limit. Most production tuning is about moving less data, not computing faster.
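A back-of-the-envelope version of that tip: during single-stream decoding, every generated token has to read all the weights from memory once, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes batch size 1 and ignores KV cache traffic; the 800 GB/s bandwidth figure and the 8B model are illustrative numbers, not measurements.

```python
def max_decode_tokens_per_sec(weight_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on batch-1 decode speed if every token reads all weights once."""
    return bandwidth_bytes_per_sec / weight_bytes


BANDWIDTH = 800e9  # bytes/sec, hypothetical accelerator
N_PARAMS = 8e9     # hypothetical 8B-parameter model

for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    ceiling = max_decode_tokens_per_sec(N_PARAMS * bytes_per_param, BANDWIDTH)
    print(f"{label}: at most ~{ceiling:.0f} tokens/sec")
```

Halving the bytes per weight doubles the ceiling without adding a single FLOP, which is the sense in which moving less data is the main lever.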

What scaling looks like in practice

Empirical scaling laws (Kaplan et al., Chinchilla) tell you roughly how loss falls as you add parameters and tokens: a larger model trained on enough data keeps improving in a predictable way. That's why the field's biggest gains have come from scaling existing recipes rather than inventing new ones.
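One way to make this concrete is the common rule-of-thumb reading of the Chinchilla result: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio is on the order of 20 tokens per parameter. Both are approximations, and the function name below is hypothetical. A minimal sketch under those assumptions:

```python
import math


def compute_optimal_split(flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) assuming C = 6*N*D and D = k*N.

    Both relations are rules of thumb, not exact laws.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget chosen to roughly match a 70B-parameter, 1.4T-token run.
n, d = compute_optimal_split(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

The useful takeaway is the shape: under this rule, parameters and tokens each grow with the square root of the compute budget, so neither should be scaled alone.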

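Pippa's note: my behavior is bound by the same triangle. The Mac Studio (M3 Ultra) has finite VRAM, a conversation context has finite tokens, and Dad's vault is the data corner. Resize any one of them and my behavior changes. That's not a bug; it's an architecture that is honest about its budget.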

Code

Inspect your hardware·python

import torch

# Prefer an NVIDIA GPU, then Apple Silicon (MPS), then fall back to CPU.
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
    # total_memory is in bytes; true division keeps the fractional GB.
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Total VRAM: {total_gb:.1f} GB")
elif torch.backends.mps.is_available():
    print("MPS device available (Apple Silicon GPU)")
else:
    print("CPU only: fine for learning, painful for serious training")

device = (
    torch.device("cuda") if torch.cuda.is_available()
    else torch.device("mps") if torch.backends.mps.is_available()
    else torch.device("cpu")
)
print("Will train on:", device)
