Quantization: from fp32 to int8 (and int4)

Trade a small accuracy budget for large size and speed wins

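Modern hardware runs int8 matmuls much faster than fp32, sometimes by as much as 4x. Memory consumption drops in proportion to the bit width. The cost is a small accuracy hit, typically under 1% on standard benchmarks for a well-quantized model. int4 weight-only quantization for LLMs (one of the most active areas in 2025-2026) fits a 7B-parameter model in about 4 GB.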
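The size figures above follow from simple arithmetic on bytes per weight. The sketch below is only that arithmetic: the helper name `weight_footprint_gb` is made up for illustration, and it ignores quantization scales, activations, and the KV cache, which is why real int4 checkpoints land closer to 4 GB than 3.5 GB.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(n_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Hypothetical helper: counts raw weights only, not scales or activations.
    """
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"7B params at {dtype}: {weight_footprint_gb(7e9, dtype):.1f} GB")
```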

Three flavors of quantization

Dynamic quantization: weights are stored as int8, and activations are quantized on the fly during inference. One-line setup. Best for transformer-shaped models dominated by Linear layers.
Static (post-training) quantization (PTQ): both weights and activations are quantized, with activation ranges calibrated on a small dataset. Faster than dynamic quantization, but it takes more setup.
Quantization-Aware Training (QAT): the model is trained with simulated quantization in the forward pass. The most accurate option, but the slowest to set up.
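The difference between the three flavors comes down to when the quantization scale is chosen and whether training sees it. The following is a minimal pure-Python sketch of symmetric int8 quantization, not what PyTorch does internally: all function names and the sample values are made up for illustration. It shows a dynamic scale (computed from the tensor at hand), a static scale (fixed from calibration batches), and the quantize-then-dequantize round trip that QAT simulates in its forward pass.

```python
def quantize_int8(values, scale):
    """Symmetric int8: round(x / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map int8 codes back to floats."""
    return [q * scale for q in q_values]


def scale_for(values):
    """Scale that maps the largest magnitude onto 127."""
    return max(abs(v) for v in values) / 127


def fake_quantize(values, scale):
    """Quantize then dequantize: the simulated quantization idea behind QAT."""
    return dequantize(quantize_int8(values, scale), scale)


# Hypothetical activations seen at inference time.
activations = [0.02, -1.5, 0.75, 3.0]

# Dynamic: the scale is computed from the tensor being quantized.
dynamic_scale = scale_for(activations)

# Static: the scale is fixed ahead of time from (hypothetical) calibration batches.
calibration_batches = [[0.1, -2.0, 1.0], [0.5, 2.5, -0.3]]
static_scale = max(scale_for(batch) for batch in calibration_batches)

print("dynamic:", fake_quantize(activations, dynamic_scale))
print("static: ", fake_quantize(activations, static_scale))
```

With the static scale, the activation 3.0 falls outside the calibrated range and is clipped to 2.5, which is why the calibration set needs to be representative of real inputs.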

torchao — modern API

The legacy torch.quantization / torch.ao.quantization modules are moving to the standalone torchao package. torchao is where the modern int8 / int4 / weight-only / GPTQ / AWQ techniques live. Start there for new projects.

Where quantization helps and where it hurts

Helps: large transformer FFNs, large embedding tables, LLM serving.
Hurts: small models where per-op overhead dominates, and models with many non-Linear ops (the unquantized parts become the bottleneck).
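The bottleneck effect can be estimated with Amdahl's law: only the share of runtime spent in quantized kernels gets faster. The sketch below is a back-of-the-envelope model, not a benchmark. The function name and the runtime shares are assumptions chosen for illustration, and the model ignores the extra quantize/dequantize overhead that hurts small models.

```python
def overall_speedup(quantized_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: only the quantized share of runtime gets faster."""
    return 1.0 / ((1.0 - quantized_fraction) + quantized_fraction / kernel_speedup)


# Hypothetical runtime shares; measure your own model with a profiler.
scenarios = [("Linear-heavy transformer", 0.9), ("many non-Linear ops", 0.4)]

for name, fraction in scenarios:
    print(f"{name}: {overall_speedup(fraction, 4.0):.2f}x end to end")
```

Under these assumed shares, a 4x faster int8 kernel gives roughly 3.08x end to end when 90% of runtime is quantized, but only about 1.43x when 40% is.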

Code

Dynamic int8 quantization: one line for a transformer·python

import os
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8,
)

# Compare disk size
torch.save(model.state_dict(), '/tmp/fp32.pt')
torch.save(quantized.state_dict(), '/tmp/int8.pt')
print(f"fp32: {os.path.getsize('/tmp/fp32.pt')/1e6:.1f} MB")
print(f"int8: {os.path.getsize('/tmp/int8.pt')/1e6:.1f} MB")

torchao — int8 weight-only quantization·python

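# pip install torchao
import torch
from torchao.quantization import quantize_, int8_weight_only
from transformers import AutoModelForCausalLM

# quantize_ targets nn.Linear by default. GPT-2 in transformers builds its
# attention and MLP blocks from Conv1D, so a Linear-based model is used here.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

# Apply int8 weight-only quant in place.
# Newer torchao releases expose this config as Int8WeightOnlyConfig;
# check which name your installed version provides.
quantize_(model, int8_weight_only())

# Use the model normally: the quantized layers handle dequant in their forward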

torchao: int4 weight-only for LLMs·python

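import torch
from torchao.quantization import quantize_, int4_weight_only
from transformers import AutoModelForCausalLM

# int4: even smaller, designed for LLMs
# group_size controls the granularity: smaller groups = better accuracy, more overhead
# Assumes a CUDA GPU and bfloat16 weights, which torchao's int4 kernels typically expect.
# A Linear-based model is used because quantize_ skips GPT-2's Conv1D layers.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m", torch_dtype=torch.bfloat16
).to("cuda")
quantize_(model, int4_weight_only(group_size=128))

# Weights of a 7B-param model: ~14 GB at fp16, ~3.5 GB at int4 (plus per-group scales)
# That can be the difference between needing a data-center GPU and fitting on a consumer card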
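The group_size trade-off mentioned in the comments above can be seen in a toy example. This is a simplified pure-Python sketch of symmetric group-wise int4 quantization, not torchao's actual kernel or storage layout. All names are hypothetical, the weight row is synthetic, and the 16-bit scale per group is an assumption (real formats may also store zero points).

```python
def quantize_group_int4(group):
    """Symmetric int4 for one group: integers in [-7, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in group]
    return q, scale


def groupwise_roundtrip(weights, group_size):
    """Quantize each group separately and return the dequantized weights."""
    out = []
    for start in range(0, len(weights), group_size):
        q, scale = quantize_group_int4(weights[start:start + group_size])
        out.extend(v * scale for v in q)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def bits_per_weight(group_size, scale_bits=16):
    """4 bits per weight plus one scale shared by each group (assumed 16-bit)."""
    return 4 + scale_bits / group_size


# Synthetic weight row: mostly small values with a single outlier.
weights = [0.01 * ((i % 13) - 6) for i in range(256)]
weights[5] = 2.0

for g in (256, 128, 32):
    err = mean_abs_error(weights, groupwise_roundtrip(weights, g))
    print(f"group_size={g}: mean abs error {err:.4f}, {bits_per_weight(g):.3f} bits/weight")
```

Smaller groups confine the outlier's large scale to fewer weights, so the error drops, while the per-group scale raises the effective bits per weight.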

Verify accuracy after quantization, always·python

import torch

# Run both the fp32 and the quantized model on a calibration / validation set
# and compare their logits batch by batch.

def compare_models(fp32_model, quant_model, val_loader):
    fp32_model.eval()
    quant_model.eval()
    abs_diff_total = 0.0
    n = 0
    with torch.inference_mode():
        # Assumes each batch is (input tensor, labels); labels are unused here.
        for x, _ in val_loader:
            out_fp = fp32_model(x).logits
            out_q = quant_model(x).logits
            abs_diff_total += (out_fp - out_q).abs().mean().item()
            n += 1
    # max(n, 1) avoids a division by zero on an empty loader
    print(f"mean |fp32 - quantized| logit diff: {abs_diff_total / max(n, 1):.4f}")
    # Also recompute task accuracy on both: that's the number that matters
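The task-accuracy check that the last comment calls for could look like the sketch below. It is framework-agnostic and works on plain lists (for example logits collected with `.tolist()`). The function names and the sample logits are made up for illustration and are not results from any real model.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of samples whose top-scoring class matches the label."""
    correct = sum(argmax(row) == y for row, y in zip(logits, labels))
    return correct / len(labels)


def agreement(logits_a, logits_b):
    """Fraction of samples where both models pick the same class."""
    same = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return same / len(logits_a)


# Hypothetical logits for four samples from each model.
fp32_logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 1.0]]
quant_logits = [[2.0, -0.9], [0.4, 0.8], [-0.4, 1.4], [0.9, 1.1]]
labels = [0, 1, 1, 0]

print(f"fp32 accuracy:      {accuracy(fp32_logits, labels):.2f}")
print(f"quantized accuracy: {accuracy(quant_logits, labels):.2f}")
print(f"top-1 agreement:    {agreement(fp32_logits, quant_logits):.2f}")
```

In this made-up data the logit differences are small everywhere, yet the fourth sample flips class because its two logits were close to begin with. That is why a small mean logit difference alone does not guarantee unchanged accuracy.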
