Sending images

~20 min · vision, images


Ollama's image format is not the OpenAI format

This is the second most common Ollama footgun (after NDJSON vs. SSE). Ollama takes images as raw base64 strings in an images array attached to the message itself. No data:image/png;base64, prefix. No content array of image_url objects.

Ollama: {"role": "user", "content": "describe this", "images": ["iVBORw0KGgoA..."]}
OpenAI: {"role": "user", "content": [{"type": "text", "text": "describe"}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoA..."}}]}

Paste the OpenAI format into Ollama and you get no error: it just ignores the image and builds an answer from the text alone. Silent failure.
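If you already have OpenAI-style messages lying around, a small converter can reshape them into the Ollama form shown above. This is a minimal sketch, not an official adapter: the function name openai_to_ollama_message is hypothetical, it only handles inline data: URIs (remote http URLs would need to be downloaded and encoded first), and it joins multiple text parts with newlines as an arbitrary choice.

```python
from typing import Any

DATA_URI_MARKER = ";base64,"


def openai_to_ollama_message(message: dict[str, Any]) -> dict[str, Any]:
    """Reshape an OpenAI-style chat message into the Ollama message shape.

    Hypothetical helper: text parts go into a plain string `content`,
    image parts go into `images` as raw base64 with the data URI prefix removed.
    """
    content = message.get("content")
    if isinstance(content, str):
        # Already a plain-text message; nothing to convert.
        return dict(message)

    texts: list[str] = []
    images: list[str] = []
    for part in content or []:
        if part.get("type") == "text":
            texts.append(part.get("text", ""))
        elif part.get("type") == "image_url":
            url = part["image_url"]["url"]
            if not url.startswith("data:"):
                raise ValueError("only inline data: URIs are handled; fetch remote URLs first")
            # Keep only the payload after ";base64,".
            _, sep, payload = url.partition(DATA_URI_MARKER)
            if not sep:
                raise ValueError("data URI is not base64-encoded")
            images.append(payload)

    converted: dict[str, Any] = {"role": message["role"], "content": "\n".join(texts)}
    if images:
        converted["images"] = images
    return converted
```

Feeding it the OpenAI example above should produce a message with content "describe" and a one-element images list holding the bare base64 string.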

Resize before sending

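Local vision models usually have a maximum input dimension (roughly 896 to 1568 px, depending on the model). Resize the long edge to about 1024 px before encoding. Larger images either get downscaled internally, which wastes the encode, or hit the context limit.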
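To make the arithmetic concrete, here is a small sketch of the long-edge calculation and of how large a base64 payload gets. Both helper names (fit_long_edge, base64_length) are hypothetical and exist only for illustration; the 1024 px default is the rule of thumb from the paragraph above, not a limit taken from any particular model.

```python
def fit_long_edge(width: int, height: int, max_edge: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_edge.

    Aspect ratio is preserved and small images are never upscaled.
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


def base64_length(n_bytes: int) -> int:
    """Length in characters of padded base64 for n_bytes of input.

    Every 3 input bytes become 4 output characters, so the request body
    is about a third larger than the image file itself.
    """
    return 4 * ((n_bytes + 2) // 3)


# A 4032x3024 phone photo shrinks to 1024x768: about 1/15 of the pixels.
print(fit_long_edge(4032, 3024))
```

The point of the second helper is that the payload cost scales with file size, so shrinking the image before encoding saves on every later step.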

Code

Sending a single image to a vision model·python

import httpx, base64
from pathlib import Path

def encode_image(path: str | Path) -> str:
    """Read a file and return it as a base64 string (no data URI prefix)."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")

resp = httpx.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma3:12b",
        "messages": [{
            "role": "user",
            "content": "Describe what you see in this image. Be specific about objects and any text.",
            "images": [encode_image("screenshot.png")],
        }],
        "stream": False,
    },
    timeout=300.0,  # vision models are slow
)
print(resp.json()["message"]["content"])

Resize first to keep it fast·python

from PIL import Image
import io, base64

def encode_resized(path: str, max_edge: int = 1024) -> str:
    img = Image.open(path)
    if img.mode != "RGB":
        img = img.convert("RGB")
    img.thumbnail((max_edge, max_edge), Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=88)
    return base64.b64encode(buf.getvalue()).decode("utf-8")

# At 1024 px and JPEG quality 88, quality stays high while encode + transfer + decode time for a typical phone photo drops by roughly 4x.

Exercise

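Encode a real screenshot from your computer at a max edge of 1024 px. Send it to Gemma 3 12B with a question that can only be answered by actually seeing the image (e.g. 'What is the title of the browser tab?'). If the answer is generic, you have hit the silent-failure mode. Debug it.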
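One possible starting point for the debugging step is a pre-flight check on the message before it is sent. The sketch below is an assumption-laden helper, not part of Ollama: check_ollama_message is a hypothetical name, and the PNG/JPEG magic-byte test is only a sanity check (other formats may be accepted by a given model, so treat that finding as a hint rather than an error).

```python
import base64
import binascii

# Leading bytes ("magic numbers") of the two formats used in this lesson.
_MAGIC = (b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff")


def check_ollama_message(message: dict) -> list[str]:
    """Return a list of likely reasons the model never saw the image.

    Hypothetical helper: an empty list means the message has the shape
    described in this lesson; it does not prove the model used the image.
    """
    problems: list[str] = []
    if not isinstance(message.get("content"), str):
        problems.append("content is not a string (OpenAI-style content array?)")

    images = message.get("images")
    if not images:
        problems.append("no images array on the message")
        return problems

    for i, img in enumerate(images):
        if not isinstance(img, str):
            problems.append(f"images[{i}] is not a string")
            continue
        if img.startswith("data:"):
            problems.append(f"images[{i}] still has a data URI prefix")
            continue
        try:
            raw = base64.b64decode(img, validate=True)
        except (binascii.Error, ValueError):
            problems.append(f"images[{i}] is not valid base64")
            continue
        if not raw.startswith(_MAGIC):
            problems.append(f"images[{i}] does not look like PNG or JPEG")
    return problems
```

Running it on the user message from your request and printing the result before calling httpx.post gives a quick first filter; if the list is empty and the answer is still generic, the next suspects are the model tag (is it actually vision-capable?) and the question itself.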
