Image, Video, Audio

One model, three modalities

Gemini 2.5 Flash and Pro accept text, images, video, and audio in the same conversation. There is no model or endpoint to switch: include the right Part type in contents and you are done.

Two ways to attach media

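Inline bytes: the file's bytes travel base64-encoded inside the request itself. Best for small files (total request size ≤ 20 MB).
File API: upload first with client.files.upload, then reference the resulting file URI. Required for large files; treat it as the default for video.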
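
The size rule above can be turned into a small helper. This is a minimal sketch, not part of the SDK: the names pick_attachment_mode and pick_for_path are hypothetical, and the 1 MB headroom reserved for the prompt and request overhead is an assumed figure, not a documented one.

```python
from pathlib import Path

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # 20 MB total request size, as stated above
HEADROOM_BYTES = 1 * 1024 * 1024       # assumption: room for prompt text and overhead


def pick_attachment_mode(size_bytes: int, is_video: bool = False) -> str:
    """Return 'inline' or 'file_api' for a file of the given size."""
    if is_video:
        # Follows the advice above: send video through the File API.
        return 'file_api'
    if size_bytes + HEADROOM_BYTES <= INLINE_LIMIT_BYTES:
        return 'inline'
    return 'file_api'


def pick_for_path(path: str, is_video: bool = False) -> str:
    """Same decision, reading the size from a file on disk."""
    return pick_attachment_mode(Path(path).stat().st_size, is_video)
```

For example, pick_attachment_mode(2_000_000) returns 'inline', while any file marked is_video=True goes to 'file_api' regardless of size.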

Files uploaded through the File API are kept for 48 hours, which is plenty of time to make the actual generation call.

Token cost is fixed by size and duration, not by character count

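Image: 258 tokens when both dimensions are ≤ 384 px. Larger images cost 258 tokens per 768×768 tile.
Video: ~300 tokens per second. Up to 1 hour. YouTube URLs are supported directly.
Audio: 32 tokens per second. Up to 9.5 hours.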
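
The rates above are enough for a back-of-the-envelope budget before uploading anything. The sketch below is only an estimate: the function names are hypothetical, and the image case assumes a plain grid of 768×768 tiles over the original dimensions, whereas the service may scale or crop the image before tiling.

```python
import math

TOKENS_PER_TILE = 258
TILE_PX = 768
SMALL_IMAGE_PX = 384
VIDEO_TOKENS_PER_SEC = 300  # approximate rate quoted above
AUDIO_TOKENS_PER_SEC = 32


def image_tokens(width: int, height: int) -> int:
    """Estimate tokens for one image from its pixel dimensions."""
    if width <= SMALL_IMAGE_PX and height <= SMALL_IMAGE_PX:
        return TOKENS_PER_TILE
    # Assumption: simple grid tiling of the unscaled image.
    tiles = math.ceil(width / TILE_PX) * math.ceil(height / TILE_PX)
    return tiles * TOKENS_PER_TILE


def video_tokens(seconds: float) -> int:
    """Estimate tokens for a video of the given length."""
    return round(seconds * VIDEO_TOKENS_PER_SEC)


def audio_tokens(seconds: float) -> int:
    """Estimate tokens for an audio clip of the given length."""
    return round(seconds * AUDIO_TOKENS_PER_SEC)


print(image_tokens(300, 200))   # 258
print(image_tokens(1536, 768))  # 516
print(video_tokens(60))         # 18000
print(audio_tokens(600))        # 19200
```

When the exact number matters, asking the API with client.models.count_tokens is more reliable than this arithmetic.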

Supported formats

Image: PNG, JPEG, WEBP, HEIC.
Video: MP4, MOV, AVI, FLV, MPG, WMV, 3GPP, WEBM.
Audio: WAV, MP3, AIFF, AAC, OGG, FLAC.
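
A quick pre-flight check can reject unsupported files before any upload. This is a sketch under assumptions: modality_for is a hypothetical helper, and the file extensions are guesses at how the formats listed above usually appear on disk (for instance .3gp for 3GPP and .mpeg alongside .mpg). It looks only at the file name, not at the file's contents.

```python
from pathlib import Path
from typing import Optional

# Assumption: common file extensions for the formats listed above.
SUPPORTED = {
    'image': {'.png', '.jpg', '.jpeg', '.webp', '.heic'},
    'video': {'.mp4', '.mov', '.avi', '.flv', '.mpg', '.mpeg', '.wmv', '.3gp', '.webm'},
    'audio': {'.wav', '.mp3', '.aiff', '.aac', '.ogg', '.flac'},
}


def modality_for(path: str) -> Optional[str]:
    """Return 'image', 'video', or 'audio', or None if the extension is not listed."""
    suffix = Path(path).suffix.lower()
    for modality, suffixes in SUPPORTED.items():
        if suffix in suffixes:
            return modality
    return None
```

The returned modality is also a convenient key for choosing a prompt, such as caption for images and transcribe for audio.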

Code

Image — inline bytes·python

from google import genai
from google.genai import types
from pathlib import Path

client = genai.Client()
img_bytes = Path('photo.jpg').read_bytes()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg'),
        'Caption this image in one sentence.',
    ],
)
print(response.text)

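Image: File API (preferred for large files)·python

# Assumes `client` from the inline example above.
# The returned File object can be passed straight into contents.
uploaded = client.files.upload(file='photo.jpg')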

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[uploaded, 'Describe in 3 sentences.'],
)
print(response.text)

# When done
client.files.delete(name=uploaded.name)

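Video: waiting for processing·python

import time

# Assumes `client` and `types` from the inline example above.
uploaded = client.files.upload(
    file='lecture.mp4',
    config=types.UploadFileConfig(display_name='Lecture'),
)

# Big videos go through PROCESSING; poll until the state changes
while uploaded.state.name == 'PROCESSING':
    time.sleep(2.5)
    uploaded = client.files.get(name=uploaded.name)

# Anything other than ACTIVE (for example FAILED) cannot be used in a prompt
if uploaded.state.name != 'ACTIVE':
    raise RuntimeError(f'Upload ended in state {uploaded.state.name}')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[uploaded, 'Summarize the key points.'],
)
print(response.text)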
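
The polling loop above waits indefinitely if a file never leaves PROCESSING. One way to bound the wait and measure the delay is sketched below. wait_while_processing is a hypothetical helper, not an SDK function; the 300-second default timeout is an arbitrary assumption, and the sleep and clock parameters exist only so the function can be exercised without real waiting.

```python
import time
from typing import Callable, Tuple


def wait_while_processing(
    get_state: Callable[[], str],
    timeout_s: float = 300.0,   # assumption: arbitrary upper bound
    interval_s: float = 2.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, float]:
    """Poll get_state() until it stops returning 'PROCESSING'.

    Returns (final_state, seconds_waited). Raises TimeoutError if the
    state is still 'PROCESSING' once timeout_s seconds have elapsed.
    """
    start = clock()
    state = get_state()
    while state == 'PROCESSING':
        if clock() - start >= timeout_s:
            raise TimeoutError(f'still PROCESSING after {timeout_s} s')
        sleep(interval_s)
        state = get_state()
    return state, clock() - start
```

With the SDK calls used above, the state getter could be lambda: client.files.get(name=uploaded.name).state.name, and the returned seconds_waited gives the PROCESSING delay to report.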

Audio — transcription·python

uploaded = client.files.upload(file='podcast.mp3')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Transcribe this audio. Use proper punctuation and paragraph breaks.',
        uploaded,
    ],
)
print(response.text)

Exercise

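Pick three small files on your machine: one image, one short video (≤ 30 seconds), and one audio clip. Write a single Python script that uploads each through the File API, sends it to Flash with a suitable prompt (caption / describe / transcribe), prints the response, and deletes the file. Catch and report any PROCESSING delay.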