Inference Optimization: torch.compile, Quantization, Batching

Training is iterative; inference is performance-critical

Once a model is trained, optimize it for serving. Three techniques are routinely combined:

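torch.compile(model): JIT-compiles the model graph with TorchInductor. One line, often a 1.5-3x speedup depending on model and hardware. (Full coverage in the next track.)
Quantization: reduce numerical precision (fp32 → int8 or int4) to get a smaller model and faster matmuls. The modern path is torchao.
Batching: run one 32-sample batch instead of 32 single-sample inferences. Processing samples one at a time leaves the hardware underutilized, and that is a real cost.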
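As a rough illustration of the batching point, the sketch below runs the same 32 samples through a model one at a time and then as a single batch, and checks that the results agree. The tiny nn.Sequential model and the random inputs are stand-ins chosen for this example, not anything from a real deployment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model; any nn.Module that accepts a leading batch dimension works.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).eval()
samples = [torch.randn(16) for _ in range(32)]

with torch.inference_mode():
    # 32 separate forward passes, each with batch size 1.
    one_by_one = torch.stack([model(x.unsqueeze(0)).squeeze(0) for x in samples])
    # One forward pass over a (32, 16) batch.
    batched = model(torch.stack(samples))
    # Largest element-wise difference between the two results.
    max_diff = (one_by_one - batched).abs().max().item()

print(batched.shape, max_diff)
```

The two paths should agree up to floating-point noise; the batched path issues one set of kernel launches instead of 32, which is where the utilization gain comes from.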

Order of application

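Get correctness right in eager mode with fp32. Verify the outputs.
Switch to bf16 if the hardware supports it. Verify that accuracy holds.
Add torch.compile(model). Verify both speed and accuracy.
If you need more, quantize (int8 dynamic for transformers, int8/int4 weight-only for LLMs).
If you are serving requests, batch across users with a small queueing window.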
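The last step, batching across users with a small queueing window, can be sketched with the standard library alone. The function name collect_batch and the parameters max_batch and window_s are hypothetical names chosen for this example; a real server would run this loop in a worker thread and hand each batch to the model.

```python
import queue
import time


def collect_batch(requests, max_batch=32, window_s=0.01):
    """Wait for the first request, then gather more until the batch is
    full or the queueing window closes."""
    batch = [requests.get()]                  # block until at least one request arrives
    deadline = time.monotonic() + window_s    # the window starts at the first request
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                             # window closed with a partial batch
    return batch


# Usage: five requests already waiting, batches capped at three.
pending = queue.Queue()
for text in ["a", "b", "c", "d", "e"]:
    pending.put(text)

first = collect_batch(pending, max_batch=3, window_s=0.01)
second = collect_batch(pending, max_batch=3, window_s=0.01)
print(first, second)
```

The window bounds the extra latency any single request pays: a request never waits longer than window_s for companions, so the window is a direct trade between throughput and tail latency.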

Don't forget Python overhead

Tokenization, post-processing, and serialization can dominate inference time for small models. Profile end to end (HTTP request → response), not just the forward pass. The fix is often "cache the tokenizer" or "use a faster JSON library" rather than anything on the model side.
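One lightweight way to see where request time goes is to wrap each stage in a timer and compare the shares. The sketch below assumes a three-stage handler; tokenize, forward, and handle_request are hypothetical stand-ins for a real tokenizer, model call, and HTTP handler.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)   # stage name -> accumulated seconds


@contextmanager
def stage(name):
    """Accumulate wall-clock time spent inside the with-block under `name`."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - t0


def tokenize(text):            # stand-in for a real tokenizer
    return text.lower().split()


def forward(tokens):           # stand-in for the model's forward pass
    return [float(len(t)) for t in tokens]


def handle_request(text):
    with stage("tokenize"):
        tokens = tokenize(text)
    with stage("forward"):
        scores = forward(tokens)
    with stage("serialize"):
        body = json.dumps({"scores": scores})
    return body


for _ in range(1000):
    response = handle_request("A great movie!")

total = sum(timings.values())
report = {name: seconds / total for name, seconds in timings.items()}
for name, share in report.items():
    print(f"{name:10s} {share:6.1%}")
```

If the "forward" share turns out to be small, model-side work such as compilation or quantization will not move end-to-end latency much, and the surrounding Python is the place to look.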

Code

torch.compile: one-line speedup·python

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
)
model.eval()

# JIT-compile. First call is slow (compilation), subsequent calls are fast.
model = torch.compile(model)

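# Load the tokenizer from the same checkpoint as the model so the two stay in sync.
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")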
inputs = tok(["A great movie!", "Terrible."], padding=True, return_tensors="pt")

with torch.inference_mode():
    out = model(**inputs)

print(out.logits.argmax(-1))   # tensor([1, 0])

Dynamic int8 quantization: one line for transformers·python

import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

# Quantize Linear layers: weights are stored as int8, activations are
# quantized on the fly at runtime. This path targets CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8,
)

# Compare disk size
torch.save(model.state_dict(), '/tmp/fp32.pt')
torch.save(quantized.state_dict(), '/tmp/int8.pt')

import os
print(f"fp32 : {os.path.getsize('/tmp/fp32.pt')/1e6:.1f} MB")
print(f"int8 : {os.path.getsize('/tmp/int8.pt')/1e6:.1f} MB")
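# fp32 comes out at roughly 268 MB. The int8 file is smaller, but by less
# than 4x: only the nn.Linear weights become int8, the embeddings stay fp32.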

torchao: modern int4 / weight-only·python

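# pip install torchao
import torch
from torchao.quantization import quantize_, int4_weight_only, int8_weight_only
from transformers import AutoModelForCausalLM

# quantize_ converts nn.Linear modules. GPT-2 in transformers implements most
# projections as Conv1D, so it would be left largely untouched; this example
# uses a checkpoint built from nn.Linear layers (OPT) instead.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(torch.bfloat16)

# int8 weight-only: weights stored as int8, compute stays in bf16.
# Recent torchao releases rename these helpers to Int8WeightOnlyConfig /
# Int4WeightOnlyConfig; use whichever your installed version provides.
quantize_(model, int8_weight_only())

# int4 weight-only: even smaller, designed for LLMs. Typically needs a CUDA
# device and bf16 weights.
# quantize_(model, int4_weight_only(group_size=128))

# Use the model normally: the int4/int8 layers handle on-the-fly dequant
# in their forward.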

End-to-end inference benchmark·python

import time
import torch

@torch.inference_mode()
def benchmark(model, inputs, n=100, warmup=10):
    # CUDA kernels run asynchronously, so synchronize before reading the clock.
    on_cuda = next(model.parameters()).is_cuda
    # Warmup absorbs one-time costs such as torch.compile's first-call compilation.
    for _ in range(warmup):
        model(**inputs)
    if on_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        model(**inputs)
    if on_cuda:
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - t0) / n * 1000
    print(f"{elapsed_ms:.2f} ms / call")
    return elapsed_ms

Exercise

Pick any small model. Benchmark inference latency in three configurations: eager fp32, eager bf16, and compiled bf16. Then add int8 dynamic quantization. Track both latency and a quick accuracy sanity check on a held-out batch. Save the numbers: every production deployment needs a table like this.
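One possible shape for that table is a small CSV with one row per configuration. The sketch below only sets up the structure: the column names and the write_results helper are assumptions made for this example, and the None placeholders are meant to be replaced with your own measurements.

```python
import csv
import io

FIELDS = ["config", "latency_ms", "accuracy"]   # hypothetical column names


def write_results(rows, fileobj):
    """Write one CSV row per configuration; None is written as an empty cell."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholders only: fill latency_ms and accuracy from your own runs.
rows = [
    {"config": "eager fp32", "latency_ms": None, "accuracy": None},
    {"config": "eager bf16", "latency_ms": None, "accuracy": None},
    {"config": "compiled bf16", "latency_ms": None, "accuracy": None},
    {"config": "int8 dynamic", "latency_ms": None, "accuracy": None},
]

buffer = io.StringIO()          # swap for open("results.csv", "w", newline="") to save to disk
write_results(rows, buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Keeping accuracy in the same row as latency makes it harder to ship a configuration that is fast but silently wrong.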