MLX vs Ollama: same Mac, different layer

Different layer, same Mac

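As of Ollama v0.19 (2026-03), Ollama is built on top of MLX. That turns the conversation about the two from a rivalry into a layering question. MLX provides the array primitives and inference kernels; Ollama provides the model registry, the HTTP server, a CLI for pulling models, and the user-facing simplicity that lets non-Python users run LLMs locally. Both run on the same unified-memory hardware, and both ultimately dispatch to MLX kernels for the heavy lifting.

The choice between them is not about performance: they use literally the same compute layer, so they perform similarly on the same hardware. The choice is about which surface you want to interact with.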

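What Ollama gives you that mlx-lm doesn't

Model registry: ollama pull llama3.2-style commands that abstract away the Hugging Face download path entirely.
Daemon: runs in the background and serves an HTTP API at localhost:11434, with no Python process to start.
Friendly CLI: for chat, model management, and quick experiments.
Growing ecosystem: Open WebUI, plugins, Mac menu-bar integration.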
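
Because the daemon speaks plain HTTP, a caller does not need MLX or mlx-lm installed at all. The following is a minimal sketch using only the Python standard library. It assumes the daemon is already running on the default port and that llama3.2 has been pulled; the helper names build_generate_request and ollama_generate are made up for this example. It sets "stream": false so the server answers with a single JSON object rather than a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running daemon)."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the daemon up, print(ollama_generate("llama3.2", "Hello")) should print the model's reply. Splitting request construction from sending keeps the payload easy to inspect without a live server.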

What mlx-lm gives you that Ollama doesn't

Full Python control: you can script generation, log per-token metadata, integrate with pipelines, and embed it in a fine-tuning loop.
Direct access to MLX-format models on Hugging Face, including ones not yet packaged for Ollama.
Fine-tuning workflows: Ollama is inference-only; fine-tuning is mlx-lm's territory (Track 4).
Custom sampling logic, custom architectures, custom anything: the moment you need to extend or modify, Python wins.
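
To make "custom sampling logic" concrete, here is a small framework-agnostic sketch of temperature plus top-k sampling over a plain Python list of logits. It is not mlx-lm's API: the function name sample_top_k and its parameters are hypothetical, and how such a function gets wired into mlx-lm's generation loop depends on the installed version and is not shown here.

```python
import math
import random


def sample_top_k(logits, k=40, temperature=0.8, rng=None):
    """Sample a token id from the k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    k = max(1, min(k, len(logits)))
    # Keep only the ids of the k highest-scoring tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, subtracting the max for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]
```

With k=1 this reduces to greedy decoding (always the argmax); raising k or the temperature spreads probability across more candidates. This is the kind of knob that is easy to change in a Python script and harder to reach from behind a CLI or daemon.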

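The right choice for each role

You want "a model that just runs" with no code → Ollama.
Integrating an LLM into a Python pipeline or app → mlx-lm directly, or Ollama's HTTP API if you prefer the Ollama abstraction.
Fine-tuning → mlx-lm (then, optionally, serve the fused result through Ollama's import flow).
Non-programmer Mac users → Ollama via the Mac app, no terminal needed.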

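What's worth knowing for MLX people

Ollama's existence makes MLX more useful, not less. Models converted to MLX format become available to a much larger user base: anyone running Ollama on a Mac is implicitly running MLX, even if they have never heard of it. That widens the impact of every MLX-format upload in mlx-community.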

Code

Ollama-on-MLX, zero-Python path·bash

# Install (one-time)
brew install ollama

# Pull and chat
ollama pull llama3.2
ollama run llama3.2

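# Or call the HTTP endpoint (the Ollama daemon usually starts automatically in the background)
# If it isn't running, start it in a separate terminal; `ollama serve` stays in the foreground:
#   ollama serve
# "stream": false returns one JSON object instead of a stream of chunks
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Hello", "stream": false}'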

The same model directly via mlx-lm·python

from mlx_lm import load, generate

model, tok = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=20))

# Same compute layer (MLX kernels), different surface (Python script vs CLI/daemon).

Exercise

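If you haven't already, install Ollama (under the hood it shares your MLX cache architecture). Pull llama3.2, run ollama run llama3.2 for an interactive chat, then hit the HTTP API with curl. Compare the wall-clock latency against running the same model directly through mlx-lm. For a fair comparison, check that both sides load the same model size and quantization: the mlx-lm snippet above uses a 1B 4-bit checkpoint, and the plain llama3.2 tag in Ollama may resolve to a different size. Finish with two sentences on which surface feels more natural for everyday use.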
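
For the latency comparison, a small timing harness keeps the measurement consistent across both surfaces. This is a sketch under stated assumptions: time_calls and summarize are hypothetical helper names, and the harness only measures whatever zero-argument callable it is given. The untimed warm-up call matters because the first request on either side typically includes model loading.

```python
import statistics
import time


def time_calls(fn, runs=3, warmup=1):
    """Return wall-clock durations in seconds for `runs` calls of fn.

    `warmup` extra calls run first and are not timed, so one-off costs
    such as model loading do not distort the comparison.
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to median, min, and max."""
    return {
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

To use it, wrap each surface in a lambda, for example time_calls(lambda: generate(model, tok, prompt="Hello", max_tokens=20)) for mlx-lm, and a lambda that performs the HTTP request for Ollama. Keep the prompt and the token limit identical on both sides.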