Built-in OpenAI-compatible server

Get an OpenAI-shaped server for free

mlx-lm ships a built-in HTTP server that exposes /v1/chat/completions with the same request and response shape as the OpenAI API. Anything that speaks the OpenAI client protocol (the official openai Python SDK, LangChain, LlamaIndex, your own shell alias) can point at your local mlx-lm server and just work.
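
To make "same request and response shape" concrete, here is a minimal sketch of the JSON on both sides of a /v1/chat/completions call. The field names follow OpenAI's public chat-completions format. The helper names `build_chat_request` and `extract_reply` are hypothetical, and the sample response is hand-written for illustration, not captured from a running server.

```python
import json


def build_chat_request(model, messages, max_tokens=20, temperature=0.7):
    """Return the JSON body a client would POST to /v1/chat/completions."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Hand-written sample in the OpenAI response shape (illustration only).
SAMPLE_RESPONSE = json.dumps({
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ],
})

body = build_chat_request(
    "mlx-community/Llama-3.2-1B-Instruct-4bit",
    [{"role": "user", "content": "Capital of France?"}],
)
print(json.dumps(body, indent=2))
print(extract_reply(SAMPLE_RESPONSE))
```

Because a client only depends on this shape, switching between OpenAI and a local server is a matter of changing the base URL rather than the calling code.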

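This is the laziest possible production deployment for a single MLX model. Run one command, get a stable HTTP endpoint, and talk to it from any tool that already understands OpenAI. The serious caveats (concurrency, queueing, multiple models) come in prod.lesson1. This lesson covers the simple case.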

Start the server

Setup is two lines. The model is loaded once at startup and stays in memory across requests; subsequent requests pay only the inference cost.
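
Since startup includes a one-time model load, a script that launches the server and then immediately sends a request may fail to connect. One way to handle that, sketched here with only the standard library, is to poll the port until something is listening. `wait_for_port` is a hypothetical helper, and host 127.0.0.1 with port 8080 assumes the default settings. An open port only shows that a process is listening there; it does not by itself prove the model is ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait a bit and try again.
            time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # Assumes the server was started with its default host and port.
    ready = wait_for_port("127.0.0.1", 8080, timeout_s=3.0)
    print("server reachable" if ready else "server not reachable yet")
```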

Talk to it from openai-python

Point the OpenAI client at http://localhost:8080/v1 (the default port, configurable). Pass any non-empty string as the API key; the local server does not authenticate. Then call it the way you would call OpenAI.

What works and what doesn't

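Works: the chat-completions endpoint, streaming SSE responses, model name routing (you can serve multiple models by starting multiple instances on different ports), and the basic sampling parameters (temperature, top_p, max_tokens).
Does not apply: embeddings (mlx-lm is a text-generation server, not an embeddings server; use a different package for that), function calling (it depends on the model and its template; OpenAI's exact JSON schema is not always supported), and fine-tuning endpoints (mlx-lm has its own LoRA workflow; see Track 4).
Single-process: the built-in server is single-process. For queueing and true concurrent serving across multiple workers, you'll put a thin FastAPI wrapper around mlx-lm; see prod.lesson1.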
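
If you do run several instances on different ports, the client side needs to know which port serves which model. A minimal sketch of that lookup follows. The `MODEL_PORTS` table, the second model name, and the `base_url_for` helper are all hypothetical; only the first model name and port 8080 come from this lesson.

```python
# Hypothetical layout: one mlx-lm server process per model, each on its own port.
MODEL_PORTS = {
    "mlx-community/Llama-3.2-1B-Instruct-4bit": 8080,
    "example-org/another-model-4bit": 8081,  # placeholder name, not a real repo
}


def base_url_for(model: str, host: str = "localhost") -> str:
    """Return the OpenAI-style base URL of the local server that hosts `model`."""
    try:
        port = MODEL_PORTS[model]
    except KeyError:
        raise ValueError(f"no local server configured for model {model!r}") from None
    return f"http://{host}:{port}/v1"


print(base_url_for("mlx-community/Llama-3.2-1B-Instruct-4bit"))
# → http://localhost:8080/v1
```

The returned string is what you would pass as `base_url` when constructing the OpenAI client for that model.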

When the built-in server is enough

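For local development, single-user demos, or internal tools where the load is you and maybe one teammate, the built-in server is enough. Anything user-facing, anything with concurrency requirements, anything that needs auth or rate limiting: graduate to the FastAPI wrapper. The deciding factors are traffic and operational requirements, not MLX itself.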

Code

Start the server (in one terminal)·bash

# In a terminal with the `mlx` env activated:
conda activate mlx

# Start mlx-lm's built-in OpenAI-compatible HTTP server.
# Default port is 8080. The model is loaded once at startup.
python -m mlx_lm server \
  --model mlx-community/Llama-3.2-1B-Instruct-4bit

# You'll see startup logs ending with something like:
#   Starting httpd at 127.0.0.1:8080
#
# Leave this terminal running; the next code block talks to it from a second terminal.

Talk to it from openai-python (second terminal)·python

# In a separate terminal with openai-python installed:
#   pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used-but-required",     # any non-empty string is fine
)

resp = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user",   "content": "Capital of France?"},
    ],
    max_tokens=20,
    temperature=0.7,
)
print(resp.choices[0].message.content)
# → "Paris." (or close to it)

Streaming responses with the same client·python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")

stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Count 1 to 5:"}],
    max_tokens=30,
    stream=True,
)

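for chunk in stream:
    # Some OpenAI-compatible servers emit chunks with no choices (for example a
    # trailing usage chunk); skip those instead of raising IndexError.
    if not chunk.choices:
        continue
    # Role-only and final chunks carry no text, so content may be None.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)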
print()
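
For a sense of what the SDK is doing during streaming, here is a minimal sketch of parsing OpenAI-style SSE lines by hand. It assumes the usual OpenAI streaming convention: each event is a `data: <json>` line and the stream ends with `data: [DONE]`. `iter_sse_deltas` is a hypothetical helper, and `SAMPLE_STREAM` is hand-written for illustration, not captured from a running server.

```python
import json


def iter_sse_deltas(lines):
    """Yield text deltas from an iterable of OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI convention
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Hand-written sample stream (illustration only).
SAMPLE_STREAM = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "1, 2"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": ", 3"}}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_sse_deltas(SAMPLE_STREAM)))
# → 1, 2, 3
```

In practice the openai SDK handles this parsing for you; the sketch only shows why any client that understands this wire format can consume the stream.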

Exercise

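Start the server in one terminal. In a second terminal, hit it with the openai-python client, both single-shot and streaming. Then point Cursor (or whichever OpenAI-compatible client you already use) at http://localhost:8080/v1 with the same model name. Confirm that a real chat works end to end. The point of the exercise is to feel how mlx-lm slots into the OpenAI ecosystem with no translation layer and one config change per client.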