C.W.K.
Stream
Lesson 06 of 06 · published

Inference Optimization: torch.compile, Quantization, Batching

~12 min · inference, compile, quantization, batch

Training is iterative; inference is performance-critical

Once a model is trained, optimize it for serving. Three techniques are routinely combined:

  • torch.compile(model): JIT-compiles the model graph with TorchInductor. One line, typically a 1.5-3x speedup depending on model and hardware. (Full coverage in the next track.)
  • Quantization: trades numerical precision (fp32 → int8 or int4) for a smaller model and faster matmuls. The modern path is torchao.
  • Batching: run one 32-sample batch instead of 32 single-sample inferences. Under-utilized hardware is a real cost.
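
To make "trading numerical precision" concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization. It is an illustration of the idea only, not how torch or torchao implement it; the helper names `quantize_int8` and `dequantize` and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [qi * scale for qi in q]


# Illustrative weights; a real layer would hold millions of these.
weights = [0.42, -1.27, 0.03, 0.88, -0.51]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}  max abs error={max_err:.6f}")
```

Each weight now needs one byte instead of four, plus a single shared scale. The rounding error per weight is bounded by half the scale, which is why accuracy has to be re-verified after quantizing.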

Order of application

  1. Get correctness right in eager mode with fp32. Verify the outputs.
  2. Switch to bf16 if the hardware supports it. Verify that accuracy holds.
  3. Add torch.compile(model). Verify both speed and accuracy.
  4. If you need more, quantize (int8 dynamic for transformers, int8/int4 weight-only for LLMs).
  5. If you are serving requests, batch users together with a small queueing window.
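
Step 5 can be sketched with the standard library alone. The snippet below is one possible shape for a queueing window, under assumptions: `collect_batch`, `max_batch`, and `window_s` are hypothetical names, and a real server would run this loop in a worker thread or async task and hand each batch to the model.

```python
import queue
import time


def collect_batch(requests, max_batch=32, window_s=0.005):
    """Block for the first request, then gather more until the batch is full
    or the queueing window has elapsed."""
    batch = [requests.get()]              # wait for at least one request
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                         # window expired with no new request
    return batch


# Simulate five requests that arrived almost simultaneously.
pending = queue.Queue()
for text in ["a", "b", "c", "d", "e"]:
    pending.put(text)

first = collect_batch(pending, max_batch=4, window_s=0.05)
second = collect_batch(pending, max_batch=4, window_s=0.05)
print(first, second)
```

The window is the latency you are willing to add to the first request in exchange for a fuller batch, so it is usually kept to a few milliseconds.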

Don't forget Python overhead

Tokenization, post-processing, and serialization can dominate inference time for small models. Profile end to end (HTTP request → response), not just the forward pass. The fix is often "cache the tokenizer" or "use a faster JSON library" rather than anything on the model side.
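
One lightweight way to see where the time goes is to wrap each stage of the request path in a timer. The sketch below uses only the standard library; `stage`, `fake_tokenize`, `fake_forward`, and `handle_request` are hypothetical stand-ins for a real tokenizer, model, and handler.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)   # stage name -> accumulated seconds


@contextmanager
def stage(name):
    """Accumulate wall-clock time spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start


# Hypothetical stand-ins for a real tokenizer and model.
def fake_tokenize(text):
    return [len(word) for word in text.split()]


def fake_forward(token_ids):
    return {"score": sum(token_ids) / max(len(token_ids), 1)}


def handle_request(text):
    with stage("tokenize"):
        ids = fake_tokenize(text)
    with stage("forward"):
        result = fake_forward(ids)
    with stage("serialize"):
        body = json.dumps(result)
    return body


for _ in range(1000):
    response = handle_request("a great movie with a terrible ending")

# Print stages from slowest to fastest with their share of total time.
total = sum(timings.values())
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:<10} {seconds * 1e3:8.2f} ms  {seconds / total:6.1%}")
```

With a real model in place of the stand-ins, the same breakdown shows whether the forward pass or the surrounding Python is the stage worth optimizing.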

Code

torch.compile: a one-line speedup·python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
)
model.eval()

# JIT-compile. First call is slow (compilation), subsequent calls are fast.
model = torch.compile(model)

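# Load the tokenizer from the same checkpoint as the model so the two cannot drift apart.
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")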
inputs = tok(["A great movie!", "Terrible."], padding=True, return_tensors="pt")

with torch.inference_mode():
    out = model(**inputs)

print(out.logits.argmax(-1))   # tensor([1, 0])
Dynamic int8 quantization: a one-liner for transformers·python
import os

import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

# Quantize Linear layers: weights are stored as int8 and activations are
# quantized on the fly at runtime. Eager-mode dynamic quantization targets CPU.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8,
)

# Compare disk size
torch.save(model.state_dict(), '/tmp/fp32.pt')
torch.save(quantized.state_dict(), '/tmp/int8.pt')

print(f"fp32 : {os.path.getsize('/tmp/fp32.pt')/1e6:.1f} MB")
print(f"int8 : {os.path.getsize('/tmp/int8.pt')/1e6:.1f} MB")
# fp32: 268.4 MB
# int8: smaller, but less than 4x overall. Only the nn.Linear weights shrink
#       ~4x; the embeddings and LayerNorm parameters stay in fp32.
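torchao: modern int4 / weight-only·python
# pip install torchao
import torchao
from torchao.quantization import int4_weight_only, int8_weight_only
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# int8 weight-only: weights are stored as int8 and dequantized on the fly,
# so the matmul itself still runs in the model's floating-point dtype.
# Caveat: quantize_ targets nn.Linear modules by default, and Hugging Face's
# GPT-2 implements most of its projections as Conv1D, so expect little change
# on this particular checkpoint. Models built from nn.Linear benefit fully.
torchao.quantize_(model, int8_weight_only())

# int4 weight-only: even smaller, designed for LLMs
# torchao.quantize_(model, int4_weight_only(group_size=128))

# Use the model normally: the int4/int8 layers handle on-the-fly dequant
# in their forward.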
End-to-end inference benchmark·python
import time
import torch

@torch.inference_mode()
def benchmark(model, inputs, n=100, warmup=10):
    # CUDA kernels launch asynchronously, so synchronize before reading the clock.
    on_cuda = next(model.parameters()).is_cuda
    # Warmup runs are excluded from timing; with torch.compile they also
    # absorb the one-time compilation cost.
    for _ in range(warmup):
        model(**inputs)
    if on_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        model(**inputs)
    if on_cuda:
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - t0) / n * 1000
    print(f"{elapsed_ms:.2f} ms / call")
    return elapsed_ms

Exercise

Take any small model. Benchmark inference latency in three configurations: eager fp32, eager bf16, and compiled bf16. Then add int8 dynamic quantization. Track both latency and a quick accuracy sanity check on a held-out batch. Save the numbers: every production deployment needs a table like this.
