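Multiple images and use cases

Multiple images per turn

The images array can hold several base64 strings, and the model sees all of them within the same turn. This is how you handle requests like "compare these two charts" or "match this product photo to one of these reference photos" in a single round.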
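
As a minimal sketch of how such a message might be assembled: the helper below base64-encodes raw image bytes and places several of them in one user message. The names encode_image_bytes and build_multi_image_message are hypothetical, and unlike the encode_resized helper used in the Code section, this sketch does no resizing.

```python
import base64


def encode_image_bytes(raw: bytes) -> str:
    """Return the base64 text form of raw image bytes (no data: prefix)."""
    return base64.b64encode(raw).decode("ascii")


def build_multi_image_message(prompt: str, images: list[bytes]) -> dict:
    """Build one user message carrying several images for a single turn."""
    return {
        "role": "user",
        "content": prompt,
        "images": [encode_image_bytes(raw) for raw in images],
    }


# Stand-in bytes; in practice these would come from reading image files,
# e.g. Path("chart_2024.png").read_bytes().
message = build_multi_image_message(
    "Compare these two charts.", [b"first", b"second"]
)
print(len(message["images"]))
```

The resulting dict is shaped to go into the "messages" list of a chat request; the order of the images list is the order in which you would refer to them in the prompt.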

Use cases that actually work locally

Screenshot analysis. "What is wrong with this UI?" works surprisingly well on Gemma 3 12B and larger.
Document OCR. Qwen 2.5-VL is best-in-class at extracting structured data from receipts, invoices, and forms.
Chart reading. Bar and line charts and simple heat maps work well; complex multi-axis or stacked charts are hit-or-miss.
Code from code screenshots. Gemma 3 27B and Qwen 2.5-VL 32B transcribe code from screenshots well.
UI element identification. "Where is the Submit button?" is useful for accessibility tools and local UI agents.

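What is still hard locally

High-resolution detail (text smaller than 30 px in the original image).
Complex spatial reasoning ("Is the cat behind or in front of the chair?").
Multi-page documents in a single turn: split them into one call per page.
Hand-drawn diagrams with arrows or annotations.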
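
The per-page split suggested for multi-page documents could be sketched roughly as follows. The function ask_per_page and the ask callable are hypothetical names; ask stands in for whatever performs a single-image request, and a stub replaces it here so the sketch runs without a model server.

```python
from typing import Callable


def ask_per_page(
    pages: list[str],
    prompt: str,
    ask: Callable[[str, str], str],
) -> list[dict]:
    """Send one request per page image and collect answers in page order.

    pages holds base64-encoded page images; ask(prompt, image) is a
    hypothetical callable that performs one single-image request.
    """
    results = []
    for number, image in enumerate(pages, start=1):
        # Tell the model which page it is looking at.
        page_prompt = f"{prompt} (This is page {number} of {len(pages)}.)"
        results.append({"page": number, "answer": ask(page_prompt, image)})
    return results


# Stub in place of a real model call.
def fake_ask(prompt: str, image: str) -> str:
    return f"{len(image)} chars seen"


answers = ask_per_page(["AAAA", "BBBBBB"], "Summarize this page.", fake_ask)
print(answers)
```

Keeping the page number alongside each answer makes it straightforward to merge the per-page results afterwards.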

Code

Compare two charts in one turn·python

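import httpx

# encode_resized(path) is not defined in this section; it is assumed to
# return a base64 string for the (resized) image at the given path.
resp = httpx.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5vl:7b",
        "messages": [{
            "role": "user",
            "content": "Compare these two charts. Which shows higher peak revenue and in what year?",
            # Both images go in the same message, so the model sees them together.
            "images": [
                encode_resized("chart_2024.png"),
                encode_resized("chart_2025.png"),
            ],
        }],
        "stream": False,
    },
    timeout=600.0,
)
resp.raise_for_status()  # fail early on an HTTP error
print(resp.json()["message"]["content"])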

OCR a receipt into structured JSON·python

import json

import httpx

# JSON Schema describing the fields we want back from the receipt.
schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
        "line_items": {"type": "array", "items": {"type": "object",
            "properties": {"name": {"type": "string"}, "amount": {"type": "number"}},
            "required": ["name", "amount"],
        }},
    },
    "required": ["vendor", "date", "total", "currency", "line_items"],
}

# encode_resized(path) is not defined in this section; it is assumed to
# return a base64 string for the (resized) image at the given path.
resp = httpx.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5vl:7b",
        "messages": [{
            "role": "user",
            "content": "Extract all fields from this receipt as JSON.",
            "images": [encode_resized("receipt.jpg")],
        }],
        "stream": False,
        "format": schema,  # constrain the reply to this schema
    },
    timeout=300.0,
)
resp.raise_for_status()  # fail early on an HTTP error
data = json.loads(resp.json()["message"]["content"])
print(json.dumps(data, indent=2, ensure_ascii=False))
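
Even with a schema passed in "format", it may be worth checking the parsed result before using it downstream. The sketch below is one possible check, not part of the original example: missing_required is a hypothetical helper, and the reply string is made-up sample data standing in for the model's message content.

```python
import json


def missing_required(data: dict, schema: dict) -> list[str]:
    """List top-level keys that the schema requires but the data lacks."""
    return [key for key in schema.get("required", []) if key not in data]


# Made-up reply, standing in for resp.json()["message"]["content"].
reply = '{"vendor": "Cafe Uno", "date": "2025-01-03", "total": 12.5}'
receipt_schema = {
    "required": ["vendor", "date", "total", "currency", "line_items"],
}

missing = missing_required(json.loads(reply), receipt_schema)
print(missing)
```

This only inspects top-level keys; nested line items would need their own check or a full JSON Schema validator.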
