GGUF Export for Ollama / llama.cpp

~24 min · ops, gguf, ollama

Why GGUF

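GGUF, the successor to GGML, is the file format introduced by the llama.cpp project and wrapped by Ollama. It is optimized for CPU-first inference, Metal acceleration on Apple Silicon, and cross-platform binary distribution. A single .gguf file holds the weights, the tokenizer, and the metadata, so it is ready to publish and fetch with ollama pull.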
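
Because everything lives in one file, a quick sanity check on a download is to look at its header. The sketch below assumes the published GGUF layout, in which the file starts with the four ASCII bytes "GGUF" followed by a little-endian 32-bit version number. The function names gguf_magic_ok and gguf_version are illustrative and not part of llama.cpp or Ollama.

```bash
#!/usr/bin/env bash
# Sketch: inspect the first 8 bytes of a GGUF file.
# Assumes the header starts with the magic "GGUF" and a little-endian uint32 version.

gguf_magic_ok() {
  # Succeeds only when the first four bytes are the ASCII string "GGUF".
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

gguf_version() {
  # Read bytes 4..7 as four unsigned decimal values, then combine them
  # least-significant byte first so the result does not depend on host endianness.
  local b
  b=($(od -An -tu1 -j4 -N4 "$1"))
  echo $(( b[0] + (b[1] << 8) + (b[2] << 16) + (b[3] << 24) ))
}

# Usage (file name taken from the conversion steps in this lesson):
#   gguf_magic_ok qwen-1.5b-Q4_K_M.gguf && gguf_version qwen-1.5b-Q4_K_M.gguf
```

A file that fails the magic check is usually a truncated download or an HTML error page saved under a .gguf name.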

How to create a GGUF

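The canonical path goes through llama.cpp's convert_hf_to_gguf.py. Clone llama.cpp, point the script at your local HF model directory, and pick an output type. The script itself writes high-precision files such as F16 or Q8_0; K-quant tiers such as Q4_K_M and Q5_K_M are produced in a second step with llama-quantize. Output: a shippable .gguf file.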
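
When choosing a quantization it helps to estimate the resulting file size first. A rough sketch: size in GB is about parameters (in billions) times bits per weight divided by 8. F16 is 16 bits per weight, and Q8_0 works out to 8.5 (32 one-byte weights plus a 2-byte scale per block). K-quants mix block types, so their effective bits per weight is not a round number; for Q4_K_M, treat any value between 4 and 5 as an assumption. The helper name estimate_gguf_gb is hypothetical, and the estimate ignores metadata and tokenizer overhead.

```bash
#!/usr/bin/env bash
# Sketch: back-of-the-envelope GGUF size estimate.
#   $1 = parameter count in billions (e.g. 1.5)
#   $2 = bits per weight for the chosen tier (16 for F16, 8.5 for Q8_0)
estimate_gguf_gb() {
  # billions of params * bits / 8 = billions of bytes, i.e. GB
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

# Usage:
#   estimate_gguf_gb 1.5 16    # F16 estimate for a 1.5B-parameter model
#   estimate_gguf_gb 1.5 8.5   # Q8_0 estimate for the same model
```

For a model with 1.5 billion parameters this gives 3.00 GB at F16 and 1.59 GB at Q8_0.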

Naming convention

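By community convention, GGUF files on the Hub are named {model}-{params}-{quant}.gguf, for example Llama-3.1-8B-Instruct-Q4_K_M.gguf; some repos put a dot before the quant tag instead of a hyphen. The quant tier (Q4_K_M is the typical "balanced" choice) tells you the bit width and the rounding scheme.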
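
Since the quant tier is the last component of the file name, it can be pulled out with plain parameter expansion. The sketch below is a minimal illustration that accepts either a hyphen or a dot before the tag; the function name gguf_quant_tag is hypothetical, and the approach assumes the tag itself contains no hyphen or dot.

```bash
#!/usr/bin/env bash
# Sketch: extract the quant tag from a GGUF file name.
gguf_quant_tag() {
  local base="${1##*/}"            # drop any directory prefix
  base="${base%.gguf}"             # drop the .gguf extension
  printf '%s\n' "${base##*[-.]}"   # keep what follows the last "-" or "."
}

# Usage:
#   gguf_quant_tag Llama-3.1-8B-Instruct-Q4_K_M.gguf   # prints Q4_K_M
```

This is handy when a script has to pick the matching tier out of a directory of downloaded files.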

Code

Convert an HF model to GGUF · bash

# One-time setup
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt

# Download the HF model
huggingface-cli download "Qwen/Qwen2.5-1.5B-Instruct" --local-dir ./qwen-1.5b

# Convert to an F16 GGUF
python convert_hf_to_gguf.py ./qwen-1.5b --outfile qwen-1.5b-f16.gguf --outtype f16

# Build the llama-quantize binary with CMake, then quantize to Q4_K_M
cmake -B build
cmake --build build --config Release
./build/bin/llama-quantize qwen-1.5b-f16.gguf qwen-1.5b-Q4_K_M.gguf Q4_K_M

Push the GGUF to a Hub repo, pull with Ollama · bash

# Upload the GGUF to a model repo on the Hub
huggingface-cli upload "yourname/qwen-1.5b-gguf" qwen-1.5b-Q4_K_M.gguf --repo-type=model

# Then pull it through Ollama (Modelfile pattern)
cat > Modelfile <<EOF
FROM hf.co/yourname/qwen-1.5b-gguf:Q4_K_M
EOF

ollama create qwen-1.5b-cwk -f Modelfile
ollama run qwen-1.5b-cwk
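
If you publish several quant tiers, generating the Modelfile from a small function saves retyping the heredoc. The sketch below follows the same FROM hf.co/{repo}:{quant} pattern and optionally appends a PARAMETER temperature line; the function name write_modelfile and the example temperature are illustrative assumptions, not part of Ollama.

```bash
#!/usr/bin/env bash
# Sketch: print a Modelfile for a GGUF hosted on the Hub.
#   $1 = Hub repo id (e.g. yourname/qwen-1.5b-gguf)
#   $2 = quant tag   (e.g. Q4_K_M)
#   $3 = optional sampling temperature
write_modelfile() {
  local repo="$1" quant="$2" temp="${3:-}"
  printf 'FROM hf.co/%s:%s\n' "$repo" "$quant"
  if [ -n "$temp" ]; then
    printf 'PARAMETER temperature %s\n' "$temp"
  fi
}

# Usage:
#   write_modelfile yourname/qwen-1.5b-gguf Q4_K_M > Modelfile
#   ollama create qwen-1.5b-cwk -f Modelfile
```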

Exercise

Convert a 1-3B model from the Hub to GGUF (Q4_K_M). Push the GGUF to a private Hub repo. Pull it through Ollama with the Modelfile pattern. Run inference and compare output quality against the unquantized HF model on 5 prompts.
