Open-source models: Llama, Qwen, Mistral

Why open source matters

Cost: running 24/7 on your own hardware can be cheaper at scale.
Privacy: data never leaves your environment.
Customization: fine-tuning, structured-output post-training, custom tokenizers.
Air-gapped deployment: often required in regulated industries.
Latency: local inference avoids the network round trip and can be much faster than a remote call.
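
As a rough illustration of the cost argument, the sketch below compares a fixed monthly hardware cost against per-token API pricing and finds the monthly token volume at which self-hosting breaks even. The function names and every dollar figure are hypothetical placeholders, not quotes from any provider.

```python
def breakeven_tokens_per_month(hardware_cost_per_month: float,
                               api_price_per_million_tokens: float) -> float:
    """Monthly token volume above which a fixed-cost server beats per-token pricing."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return hardware_cost_per_month / api_price_per_million_tokens * 1_000_000


def cheaper_option(tokens_per_month: float,
                   hardware_cost_per_month: float,
                   api_price_per_million_tokens: float) -> str:
    """Return which option costs less at the given monthly volume."""
    api_cost = tokens_per_month / 1_000_000 * api_price_per_million_tokens
    return "self-hosted" if hardware_cost_per_month < api_cost else "api"


if __name__ == "__main__":
    # Hypothetical: $1,500/month amortized GPU server vs. $2 per 1M tokens.
    print(breakeven_tokens_per_month(1500.0, 2.0))  # 750000000.0
```

This deliberately ignores power, operations time, and idle capacity, so a real comparison would likely push the break-even point higher.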

Quirks to know

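Capability spread is wide: a 70B Llama is not interchangeable with a 7B Mistral. Test on your actual task.
Tool calling is implemented at the application layer (usually with a templated prompt and a parser), not as a dedicated API.
Enforcing JSON output requires constrained decoding (Outlines, jsonformer) or post-validation.
Context window varies sharply across models and quantizations; do not assume.
Tokenizers are usually unique to each model family, so token counts are not comparable across families; estimate cost from FLOPs and wall-clock time rather than tokens per dollar.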
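
To make the tool-calling point concrete, here is a minimal sketch of the application-layer pattern: a templated prompt that advertises the tools, plus a parser that pulls a call out of the model's text. The `<tool_call>` tag format, the `TOOLS` registry, and `get_weather` are all hypothetical; each real model family defines its own chat template and call syntax.

```python
import json
import re


def get_weather(city: str) -> str:
    """Hypothetical tool used only for illustration."""
    return f"Sunny in {city}"


# Hypothetical registry: tool name -> Python callable.
TOOLS = {"get_weather": get_weather}

PROMPT_TEMPLATE = (
    "You can call these tools: {tool_names}.\n"
    "To call one, reply with "
    "<tool_call>{{\"name\": ..., \"arguments\": {{...}}}}</tool_call>.\n"
    "User: {question}\n"
)

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def build_prompt(question: str) -> str:
    """Fill the template with the registered tool names and the user question."""
    return PROMPT_TEMPLATE.format(tool_names=", ".join(TOOLS), question=question)


def parse_tool_call(model_text: str):
    """Return (name, arguments) for a well-formed call to a known tool, else None."""
    match = TOOL_CALL_RE.search(model_text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    name, args = call.get("name"), call.get("arguments", {})
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return None
    return name, args


def run_tool_call(model_text: str):
    """Execute the parsed call, or return None when no valid call is present."""
    parsed = parse_tool_call(model_text)
    if parsed is None:
        return None
    name, args = parsed
    return TOOLS[name](**args)
```

The parser rejects malformed JSON and unknown tool names instead of raising, since model output is untrusted text and the caller typically wants to fall back to a plain answer.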

Infrastructure pieces

Open source means you also own the stack: serving (vLLM, llama.cpp, MLX), GPU/CPU/MPS provisioning, model updates, and observability. Treat it as an additional system to operate, not a free model. cwkPippa's Ollama vessel is this codebase's local-inference example.
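
Because observability becomes your responsibility, one small starting point is to wrap whatever generation function you use and record latency and throughput per call. This sketch assumes a hypothetical `generate_fn(prompt) -> str` callable and counts whitespace-separated words as a crude stand-in for tokens; a real setup would use the model's tokenizer and a proper metrics backend.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CallRecord:
    latency_s: float
    output_units: int  # whitespace-split words, a crude proxy for tokens

    @property
    def units_per_second(self) -> float:
        return self.output_units / self.latency_s if self.latency_s > 0 else 0.0


class TimedGenerator:
    """Wraps a generate function and keeps per-call latency records."""

    def __init__(self, generate_fn: Callable[[str], str],
                 clock: Callable[[], float] = time.perf_counter):
        self._generate_fn = generate_fn
        self._clock = clock  # injectable so the wrapper can be tested deterministically
        self.records: List[CallRecord] = []

    def __call__(self, prompt: str) -> str:
        start = self._clock()
        text = self._generate_fn(prompt)
        elapsed = self._clock() - start
        self.records.append(CallRecord(elapsed, len(text.split())))
        return text

    def mean_latency_s(self) -> float:
        if not self.records:
            return 0.0
        return sum(r.latency_s for r in self.records) / len(self.records)
```

Any callable with the same shape can be wrapped, so the same records can be collected whichever serving backend sits underneath.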

Code

Local Llama via Ollama·python

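import ollama

question = "Name three open-weight model families. Reply as JSON."  # example prompt

resp = ollama.chat(
    model="llama4",  # the model must already be pulled locally
    messages=[{"role": "user", "content": question}],
    options={"temperature": 0.2, "num_predict": 1024},  # num_predict caps output tokens
    format="json",  # JSON mode: constrains output to valid JSON
)
print(resp["message"]["content"])  # the reply text (a JSON string)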
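
Even with JSON mode, the reply arrives as a string and may lack the fields you expect, so a post-validation step is a reasonable safeguard. The sketch below is library-agnostic: `validate_reply` and the `required` mapping are hypothetical names, and the raw string stands in for whatever text your client returns.

```python
import json
from typing import Any, Dict


def validate_reply(raw: str, required: Dict[str, type]) -> Dict[str, Any]:
    """Parse a JSON object and check that required keys exist with expected types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"key {key!r} should be {expected_type.__name__}")
    return data
```

A caller might catch `ValueError` and retry the request or fall back to a default, depending on how critical the structured fields are.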

MLX on Apple Silicon·python

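from mlx_lm import load, generate

question = "Summarize why teams self-host open-weight models."  # example prompt

# Repo id kept from the original text; confirm it exists on the Hugging Face Hub.
model, tokenizer = load("mlx-community/Qwen3-7B-MLX")
out = generate(model, tokenizer, prompt=question, max_tokens=1024)
print(out)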
