Batch vs Streaming: the real trade-offs

Discourse vs reality

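Conference talks make it sound as if everyone is moving to streaming. The reality on the ground in 2026 is that most data engineering is still batch; even at sophisticated companies the split is roughly 80/20 in favor of batch. Here is how to think about which one you actually need.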

What batch buys you

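Simpler mental model: "the pipeline runs every day at 3 a.m."
Cheaper compute: the cluster is sized for the daily run's peak, not provisioned around the clock for the worst case.
Easier debugging: runs are discrete, each one has logs, and each one either succeeded or did not.
Easier backfill: "rerun last week" is a meaningful request.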
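
The backfill point can be made concrete with a minimal sketch. It assumes a daily job exposed as a function of the run date that overwrites that day's partition; `backfill`, `run_for_day`, and the `warehouse/events/...` layout are hypothetical names used only for illustration.

```python
import datetime as dt
from typing import Callable, List


def backfill(run_for_day: Callable[[dt.date], None],
             start: dt.date, end: dt.date) -> List[dt.date]:
    """Re-run a daily job for every date in [start, end], oldest first."""
    done: List[dt.date] = []
    day = start
    while day <= end:
        # The job should overwrite that day's partition rather than append,
        # so running the backfill twice leaves the same result.
        run_for_day(day)
        done.append(day)
        day += dt.timedelta(days=1)
    return done


# Hypothetical daily job: it only records which partition it would rewrite.
rewritten: List[str] = []


def run_for_day(day: dt.date) -> None:
    rewritten.append(f"warehouse/events/{day:%Y/%m/%d}")


# "Rerun last week" becomes a plain loop over seven dates.
backfill(run_for_day, dt.date(2026, 1, 5), dt.date(2026, 1, 11))
```

Because each run is identified by its date, "rerun last week" needs no knowledge of offsets or consumer state.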

What streaming buys you

Lower latency: events flow continuously, and downstream consumers see them within seconds or minutes.
Always-on aggregation: useful for fraud detection, real-time dashboards, and online ML feature serving.
Decoupled producers and consumers: Kafka or Pulsar acts as the bus.
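
To illustrate what "always-on aggregation" means in practice, here is a minimal pure-Python sketch of a per-key sliding-window count of the kind a fraud rule might use. It is not tied to Kafka or Pulsar; the function name `flag_bursts`, the 60-second window, and the threshold of 3 are assumptions for illustration, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple


def flag_bursts(events: Iterable[Tuple[float, str]],
                window_s: float = 60.0,
                threshold: int = 3) -> Iterator[Tuple[float, str, int]]:
    """Yield (timestamp, key, count) each time a key reaches `threshold`
    or more events within the trailing `window_s` seconds."""
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    for ts, key in events:
        q = recent[key]
        q.append(ts)
        # Evict timestamps that have fallen out of the trailing window.
        while q and q[0] <= ts - window_s:
            q.popleft()
        if len(q) >= threshold:
            yield ts, key, len(q)


# Hypothetical event stream: (seconds since start, card id).
events = [(0.0, "card-A"), (10.0, "card-B"), (20.0, "card-A"),
          (30.0, "card-A"), (200.0, "card-A")]
alerts = list(flag_bursts(events))
```

The state (`recent`) lives for as long as the process does, which is exactly the operational burden a batch job avoids.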

The right question

"Consumer 가 필요한 freshness SLA 가 뭐야?" 답이 "daily 면 OK" 면 batch. "5분이면 OK" 면 micro-batch (5분마다 batch 작업). "sub-second" 면 streaming. 대부분의 consumer 는 "daily" 라고 해. Streaming 은 진짜 필요할 때 진짜 이점 — 그리고 안 필요할 때 진짜 비용 (운영 복잡도, 비용, 디버깅 어려움).

Code

Micro-batch: the practical middle ground for most "streaming" requirements · python

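# Schedule every 5 minutes with cron / Airflow / Dagster.
# Each run reads the most recently completed window and writes it to a
# partitioned target.
import datetime as dt
from pathlib import Path
import pandas as pd

def fetch_events(since: str, until: str) -> pd.DataFrame:
    """Placeholder: query your event source for rows in [since, until)."""
    raise NotImplementedError

def micro_batch_run(window_minutes: int = 5) -> None:
    window = dt.timedelta(minutes=window_minutes)
    now = dt.datetime.now(dt.timezone.utc)  # utcnow() is deprecated since Python 3.12
    # Snap to the window boundary so consecutive runs neither overlap nor
    # leave gaps when the scheduler fires a few seconds late.
    epoch = dt.datetime(1970, 1, 1, tzinfo=dt.timezone.utc)
    until = now - (now - epoch) % window
    since = until - window
    bucket = since.strftime('%Y/%m/%d/%H%M')

    df = fetch_events(since=since.isoformat(), until=until.isoformat())
    if df.empty:
        return

    # One directory per window; a rerun of the same window overwrites the file.
    target = Path(f'warehouse/events/{bucket}')
    target.mkdir(parents=True, exist_ok=True)
    df.to_parquet(target / 'part-0.parquet', index=False)

# Run on a 5-minute cron for practical near-real-time without Kafka.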
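
A job that always looks back a fixed number of minutes will miss data if a run is skipped or delayed. One way around that is to persist a watermark between runs. The following is a minimal sketch under stated assumptions: the watermark is kept in a local text file, events are plain dicts with an ISO-8601 `ts` field, and output is JSON Lines rather than Parquet to keep the example dependency-free. `run_from_watermark`, `fetch`, and the file names are hypothetical.

```python
import datetime as dt
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

Fetch = Callable[[Optional[str], str], List[dict]]


def run_from_watermark(state_file: Path, out_dir: Path, fetch: Fetch,
                       now: dt.datetime) -> int:
    """Process events in (watermark, now], then advance the watermark."""
    since = state_file.read_text().strip() if state_file.exists() else None
    until = now.isoformat()
    rows = fetch(since, until)
    if rows:
        out_dir.mkdir(parents=True, exist_ok=True)
        part = out_dir / f"{now:%Y%m%dT%H%M%S}.jsonl"
        part.write_text("\n".join(json.dumps(r) for r in rows) + "\n")
    # Advance only after the write succeeded, so a failed run is retried in full.
    state_file.write_text(until)
    return len(rows)


# Hypothetical in-memory event source standing in for a real query.
EVENTS = [{"ts": "2026-01-05T03:01:00+00:00", "id": 1},
          {"ts": "2026-01-05T03:07:00+00:00", "id": 2}]


def fetch(since: Optional[str], until: str) -> List[dict]:
    # ISO-8601 strings with the same UTC offset compare chronologically.
    return [e for e in EVENTS
            if (since is None or e["ts"] > since) and e["ts"] <= until]


tmp = Path(tempfile.mkdtemp())
t1 = dt.datetime(2026, 1, 5, 3, 5, tzinfo=dt.timezone.utc)
t2 = dt.datetime(2026, 1, 5, 3, 20, tzinfo=dt.timezone.utc)  # a late run
first = run_from_watermark(tmp / "watermark.txt", tmp / "events", fetch, t1)
second = run_from_watermark(tmp / "watermark.txt", tmp / "events", fetch, t2)
```

The second run fires 15 minutes after the first instead of 5, yet still picks up everything since the stored watermark. In a real deployment the watermark would more likely live in the warehouse or the orchestrator's state store than in a local file.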

Exercise

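Pick three real data consumers in your own environment (for example a dashboard, an ML model, and a finance report) and write down the freshness SLA each one actually needs. Be honest: what it needs, not what you would like. Classify each as batch, micro-batch, or streaming. Notice that the answer is almost always "batch is fine".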