Local Deployment: Ollama & llama.cpp

~24 min · ollama, llama-cpp, gguf, local, apple-silicon


The local serving path

For local inference (laptops, internal servers, air-gapped boxes), the Ollama + llama.cpp stack is the right choice. Both speak GGUF natively.
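A quick way to confirm that a file really is GGUF before handing it to either tool is to look at its first four bytes, which spell the ASCII magic `GGUF`. A minimal sketch; the helper name `is_gguf` is hypothetical and not part of llama.cpp or Ollama, and it checks only the magic, not the rest of the header:

```shell
# is_gguf FILE: succeed if FILE starts with the 4-byte ASCII magic "GGUF".
# Hypothetical helper for illustration; it does not validate the full header.
is_gguf() {
    # tr strips NUL bytes so binary junk cannot confuse the string comparison.
    [ "$(head -c 4 "$1" 2>/dev/null | tr -d '\0')" = "GGUF" ]
}

# Usage: is_gguf model-f16.gguf && echo "looks like GGUF"
```

A failed check usually means the file is still in HF format (safetensors) or the download was truncated.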

Pipeline

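Merge the LoRA adapter into the base model (previous lesson).
Convert the HF-format model to GGUF with the llama.cpp converter.
Quantize the GGUF to fit the target memory budget.
Wrap it in a Modelfile and serve it with Ollama.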
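To make the "memory budget" step concrete, file size can be roughly estimated as parameters × bits per weight ÷ 8. The sketch below assumes approximate bits-per-weight figures for each quantization type (the Q4_K_M value in particular is a mixed-precision average and varies by architecture); `estimate_gguf_gb` is a hypothetical helper name:

```shell
# estimate_gguf_gb PARAMS_BILLIONS QUANT
# Prints an approximate GGUF file size in GB (10^9 bytes).
# The bits-per-weight values are rough assumptions; real files differ somewhat.
estimate_gguf_gb() {
    case "$2" in
        F16)    bpw=16.0 ;;
        Q8_0)   bpw=8.5 ;;
        Q6_K)   bpw=6.56 ;;
        Q4_K_M) bpw=4.85 ;;
        *) echo "unknown quant type: $2" >&2; return 1 ;;
    esac
    # billions of params * bits per weight / 8 bits per byte = size in GB
    awk -v p="$1" -v b="$bpw" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gguf_gb 7 Q4_K_M   # prints 4.2
estimate_gguf_gb 7 Q6_K     # prints 5.7
```

Runtime memory will be higher than the file size, since the KV cache and runtime buffers come on top and grow with context length.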

Code

HF → GGUF + quantize·bash

# Convert merged HF model to GGUF
cd llama.cpp
python convert_hf_to_gguf.py ../merged-model \
    --outtype f16 --outfile model-f16.gguf

# Quantize (pick based on hardware)
# With a CMake build the binary is usually at ./build/bin/llama-quantize
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
# For Apple Silicon with plenty of RAM:
./llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K
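Once the quantized files exist, their real sizes can be checked against the memory budget. A minimal sketch, assuming a 20% headroom allowance for KV cache and runtime overhead (that figure is an illustrative guess, not a llama.cpp recommendation); `fits_budget` is a hypothetical helper:

```shell
# fits_budget FILE BUDGET_BYTES [HEADROOM_PERCENT]
# Succeeds if FILE's size plus a headroom margin fits within BUDGET_BYTES.
# The default 20% headroom is an assumption; tune it for your context length.
fits_budget() {
    size=$(wc -c < "$1") || return 1
    headroom=${3:-20}
    needed=$(( size + size * headroom / 100 ))
    [ "$needed" -le "$2" ]
}

# Usage (16 GiB budget):
#   fits_budget model-Q6_K.gguf $((16 * 1024 * 1024 * 1024)) && echo "Q6_K fits"
```

If the larger quantization fails the check, fall back to Q4_K_M rather than reducing headroom.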

Ollama Modelfile + run·bash

# Create Modelfile
# The TEMPLATE below assumes the Llama 3 chat format; match it to your base model
cat > Modelfile << 'EOF'
FROM ./model-Q4_K_M.gguf

TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "<|eot_id|>"
SYSTEM "You are my fine-tuned assistant."
EOF

# Build and run
ollama create my-model -f Modelfile
ollama run my-model "Hello, how can you help me?"
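Besides the CLI, a running Ollama server accepts HTTP requests, by default on port 11434, which is how other local programs would typically call the model. The sketch below builds a request body for the `/api/generate` endpoint; `build_generate_payload` is a hypothetical helper, and its escaping handles only backslashes and double quotes (prompts containing newlines or control characters would need a real JSON encoder):

```shell
# build_generate_payload MODEL PROMPT
# Prints a JSON body for Ollama's /api/generate endpoint with streaming off.
# Hypothetical helper: escapes only backslashes and double quotes in PROMPT.
build_generate_payload() {
    escaped=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$escaped"
}

# Usage, assuming the server is running locally on the default port:
#   curl -s http://localhost:11434/api/generate \
#        -d "$(build_generate_payload my-model 'Hello, how can you help me?')"
```

Setting `stream` to false returns one JSON object instead of a stream of partial responses, which is easier to handle in scripts.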

Exercise

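Take a merged 7B model and run it locally through Ollama at Q4_K_M. Compare the outputs on 5 prompts against the unquantized HF version and document any quality degradation. Then try Q6_K and re-evaluate: is the quality lift worth the extra ~1.4 GB of RAM?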
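One way to organize the comparison is to save each model's answers to files and diff them afterwards. A minimal sketch, assuming a `prompts.txt` file with one prompt per line; `collect_outputs` and the `OLLAMA_BIN` override are hypothetical names introduced here, and the HF baseline outputs would have to be collected separately:

```shell
# collect_outputs MODEL PROMPTS_FILE OUT_DIR
# Runs each non-empty line of PROMPTS_FILE through `ollama run MODEL`
# and writes the answers to OUT_DIR/prompt-1.txt, prompt-2.txt, ...
# OLLAMA_BIN is a hypothetical override, useful for dry runs.
collect_outputs() {
    mkdir -p "$3"
    i=0
    while IFS= read -r prompt || [ -n "$prompt" ]; do
        [ -n "$prompt" ] || continue
        i=$((i + 1))
        # stdin is redirected so the command cannot swallow the prompts file.
        ${OLLAMA_BIN:-ollama} run "$1" "$prompt" > "$3/prompt-$i.txt" < /dev/null
    done < "$2"
}

# Usage, assuming two Ollama models built from the Q4_K_M and Q6_K files:
#   collect_outputs my-model-q4 prompts.txt out/q4
#   collect_outputs my-model-q6 prompts.txt out/q6
#   diff -r out/q4 out/q6
```

With `temperature 0.7` in the Modelfile, outputs vary between runs, so a lower temperature is probably a fairer setting for side-by-side comparison.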
