Quantization

Smaller weights, faster inference

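A trained model typically stores its weights as 32-bit floats. Quantization re-expresses those weights in far fewer bits: with int8 instead of float32, the model shrinks roughly 4x and inference gets faster, because integer arithmetic is cheaper and only a quarter of the bytes move through memory. That last point is the real win on edge, mobile, and serverless targets, where memory bandwidth and cold-start size hurt more than raw FLOPs.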
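To make the 4x figure concrete, here is a minimal sketch of symmetric abs-max quantization in plain Python. It illustrates the arithmetic only and is not a description of what Keras does internally; the helper names quantize_int8 and dequantize and the sample weights are made up for this example.

```python
from array import array

def quantize_int8(weights):
    """Symmetric abs-max quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]

# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.05, 0.9, -0.33, 1.0, -0.76, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 value.
fp32_bytes = len(weights) * array("f").itemsize
int8_bytes = len(q) * array("b").itemsize

# Rounding moves each weight by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("bytes:", fp32_bytes, "->", int8_bytes)
print("max rounding error:", max_error)
```

The one extra float (scale) is the only overhead here, which is why the ratio is "roughly" 4x rather than exactly 4x once real formats add their own metadata.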

One call in Keras 3

Keras 3 exposes model.quantize('int8') directly as an in-place conversion, with no separate toolkit and no graph export. 'int4' (Keras 3.11+) pushes compression to about 8x where the model can tolerate it, and type_filter=['Dense'] quantizes only the layer types that are safe to squeeze, leaving sensitive ones in full precision.
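The ~8x figure for int4 follows from fitting two 4-bit values into each byte. The sketch below shows one common packing scheme in plain Python; it is an assumption for illustration, not a statement about how Keras lays out int4 weights, and pack_int4 / unpack_int4 are hypothetical helper names.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) two per byte, low nibble first."""
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("int4 values must lie in [-8, 7]")
    padded = list(values) + [0] * (len(values) % 2)  # pad odd lengths with 0
    out = bytearray()
    for lo, hi in zip(padded[0::2], padded[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from packed bytes."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]

vals = [-8, 7, 3, -1, 0, 5]
packed = pack_int4(vals)

print("float32 bytes:", len(vals) * 4)
print("packed int4 bytes:", len(packed))
print("round trip ok:", unpack_int4(packed, len(vals)) == vals)
```

With only 16 representable levels per weight, the rounding is much coarser than int8, which is why int4 is worth it only where the model can tolerate the loss.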

PTQ vs QAT

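What quantize() performs is post-training quantization (PTQ): it converts weights that are already trained, with no retraining. It is nearly free, and int8 usually costs less than about 0.5% accuracy. When the loss is too large (often the case with int4), move to quantization-aware training (QAT), which simulates low-precision rounding *during* training so the model learns to absorb it. Start with PTQ and switch to QAT only when the numbers force you to. Either way, benchmark on your own task: the accuracy loss is not a constant, it depends on the data.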
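The "simulated rounding" in QAT is often implemented as a quantize-then-dequantize round trip, sometimes called fake quantization, applied to the weights in the forward pass. The following is a minimal plain-Python sketch of that round trip only, with no training loop; fake_quant and mean_abs_error are hypothetical names and the weights are made-up values.

```python
def fake_quant(weights, bits):
    """Round to signed `bits`-bit integer levels, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical weights, chosen only for illustration.
weights = [0.4217, -1.2731, 0.0512, 0.9038, -0.3349, 1.0, -0.7654, 0.0]

err8 = mean_abs_error(weights, fake_quant(weights, 8))
err4 = mean_abs_error(weights, fake_quant(weights, 4))

print(f"mean rounding error at 8 bits: {err8:.5f}")
print(f"mean rounding error at 4 bits: {err4:.5f}")
```

The 4-bit error comes out noticeably larger than the 8-bit error on these values, which matches the point above that int4 is where PTQ most often falls short. How much that rounding error costs in accuracy still depends on the model and the data.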

Code

In-place PTQ: int8, int4, selective quantization·python

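# The calls below are alternatives: apply one to a freshly loaded model.
# A model that is already quantized should not be quantized again.

# Int8 quantization (roughly 4x smaller, faster inference)
model.quantize("int8")

# Int4 quantization (up to roughly 8x smaller, Keras 3.11+)
model.quantize("int4")

# Selective quantization: only Dense layers, the rest stay full precision
model.quantize("int8", type_filter=["Dense"])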

Exercise

Train a small MNIST classifier and record its accuracy and size on disk. Call model.quantize("int8"), then call model.quantize("int4") on a fresh copy. Re-measure accuracy and size for each and note the accuracy-vs-compression curve. At what point does int4 stop being worth it?

Hint

Compare file sizes by calling model.save() before and after. Evaluate on the same test set every time so the accuracy numbers are comparable.
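For the size comparison, a small helper along these lines may be handy. It is only a sketch that reads file sizes from disk; size_report is a hypothetical name, and the file names in the usage comment are placeholders for whatever paths you pass to model.save().

```python
import os

def size_report(baseline_path, quantized_paths):
    """Return {path: (size_in_bytes, compression_ratio_vs_baseline)}."""
    base = os.path.getsize(baseline_path)
    report = {baseline_path: (base, 1.0)}
    for path in quantized_paths:
        size = os.path.getsize(path)
        report[path] = (size, base / size)
    return report

# Example usage, assuming these three files were written by model.save():
#   report = size_report("mnist_fp32.keras",
#                        ["mnist_int8.keras", "mnist_int4.keras"])
#   for path, (size, ratio) in report.items():
#       print(f"{path}: {size} bytes, {ratio:.2f}x smaller than baseline")
```

Pair each ratio with the accuracy measured on the same test set to get the points of the accuracy-vs-compression curve.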