Compilation and Export Paths

From PyTorch model to deployed inference

You've trained a model. Now it has to run somewhere: a server, a phone, an embedded device, an edge GPU. There are three main paths:

torch.compile: the fastest path for PyTorch-native targets. It JIT-compiles the forward pass into fused kernels and targets the same devices PyTorch already supports.
ONNX: export to a vendor-neutral graph format that runs on ONNX Runtime, TensorRT, OpenVINO, Core ML, and others. Portable.
TorchScript: the older PyTorch-native serialization format. For new code it is superseded by torch.compile, but it is still common in legacy systems.
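As a rough illustration of the torch.compile path, the sketch below wraps a small module and compares its output with eager mode. `TinyNet` and its layer sizes are made up for the example, and the fallback to eager execution is an assumption for machines without a working compiler backend, not something this lesson prescribes.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a trained model."""

    def __init__(self) -> None:
        super().__init__()
        self.body = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)


torch.manual_seed(0)
model = TinyNet().eval()
x = torch.randn(4, 16)

with torch.no_grad():
    eager_out = model(x)
    try:
        # torch.compile returns a wrapper; the actual compilation happens
        # lazily on the first call with real inputs.
        compiled = torch.compile(model)
        compiled_out = compiled(x)
    except Exception as exc:
        # Assumed fallback: e.g. no C++ compiler or an unsupported platform.
        print(f"torch.compile unavailable, using eager output: {exc}")
        compiled_out = eager_out

# Compiled and eager outputs should agree up to floating-point noise.
print(torch.allclose(eager_out, compiled_out, atol=1e-4))
```

The first call pays the compilation cost, so any speedup only shows up on later calls with the same input shapes.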

Tip: A realistic toolset for modern inference is torch.compile for PyTorch servers, ONNX Runtime for cross-platform deployment, vLLM or TGI for LLM serving, Core ML for iOS, TensorFlow Lite for Android, and llama.cpp for local quantized LLM inference.

Quantization

Quantization reduces weight precision (FP32 → INT8 or INT4) for smaller models and faster inference, usually with minimal accuracy loss. It comes in three flavors: post-training quantization (cheap, no retraining), quantization-aware training (better accuracy, more complex), and weight-only quantization (used for LLM deployment; AWQ, GPTQ, and similar).
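To make the FP32 → INT8 step concrete, here is a minimal sketch of symmetric per-tensor post-training quantization applied to a single random matrix. The matrix, the per-tensor scale, and the [-127, 127] range are assumptions chosen for illustration; methods such as AWQ and GPTQ are more elaborate than this.

```python
import torch

torch.manual_seed(0)
w = torch.randn(256, 256)  # stand-in for one FP32 weight matrix

# Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = w.abs().max() / 127.0
q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)

# Dequantize to see the values the model would effectively compute with.
w_hat = q.to(torch.float32) * scale

fp32_bytes = w.numel() * w.element_size()
int8_bytes = q.numel() * q.element_size()
max_err = (w - w_hat).abs().max().item()

print(f"size: {fp32_bytes} B -> {int8_bytes} B")
print(f"max abs error: {max_err:.5f} (scale = {scale.item():.5f})")
```

Storage drops by 4x, and the rounding error per weight is bounded by half the scale, which is why a single outlier weight (which inflates the scale) can hurt accuracy for the whole tensor.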

Distillation

Distillation trains a small student model to mimic the predictions of a large teacher. It is common in production as a way to cut serving cost. A student often reaches 90-95% of the teacher's accuracy with 5-10% of the parameter count.
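One common way to express "mimic the teacher's predictions" is a temperature-softened KL term blended with ordinary cross-entropy. The sketch below assumes that formulation; the function name `distillation_loss`, the temperature `T=2.0`, the weight `alpha=0.5`, and the random logits are all illustrative choices rather than values from this lesson.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy."""
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # Multiplying by T*T keeps the soft term's gradient scale comparable
    # across different temperatures.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


torch.manual_seed(0)
teacher_logits = torch.randn(8, 5)  # pretend outputs of a frozen teacher
student_logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 5, (8,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow only into the student side
print(loss.item())
```

In a real training loop the teacher logits would come from a forward pass under `torch.no_grad()`, and the student logits from the student model rather than a leaf tensor.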

Principle: Inference is an engineering discipline of its own. The model that wins on evals rarely goes to production unmodified: quantization, distillation, and graph optimization usually trim it before deployment.

Code

ONNX export and Runtime inference·python

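import torch
import onnxruntime as ort

# `model` is your trained torch.nn.Module; this example assumes it takes
# image batches shaped (N, 3, 224, 224).
model.eval()  # inference mode: disables dropout, uses stored batch-norm stats
dummy = torch.randn(1, 3, 224, 224)  # example input used to trace the graph
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["x"], output_names=["logits"],
    # Mark dim 0 as dynamic so the exported graph accepts any batch size.
    dynamic_axes={"x": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)

# Load the exported graph and run it on CPU with ONNX Runtime.
sess = ort.InferenceSession("model.onnx",
                            providers=["CPUExecutionProvider"])
out = sess.run(None, {"x": dummy.numpy()})[0]
print(out.shape)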

Quantize an LLM for local inference·bash

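# Convert an HF checkpoint to GGUF (llama.cpp format).
# Current llama.cpp names the script convert_hf_to_gguf.py and expects a local
# model directory; ./Llama-3.1-8B-Instruct is a placeholder for wherever you
# downloaded meta-llama/Llama-3.1-8B-Instruct.
python convert_hf_to_gguf.py ./Llama-3.1-8B-Instruct --outfile Meta-Llama-3.1-8B-Instruct.gguf

# Quantize to 4-bit (the binary is llama-quantize in current builds, quantize in older ones)
./llama-quantize Meta-Llama-3.1-8B-Instruct.gguf model-q4_K_M.gguf Q4_K_M

# Run via Ollama on Apple Silicon (Pippa's local fallback brain path).
# The Modelfile needs a line such as: FROM ./model-q4_K_M.gguf
ollama create my-llama -f Modelfile
ollama run my-llama "Explain mixed precision in one paragraph"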
