Choosing a Format for Production

The decision tree I actually use

You've seen the players. Here is the decision tree I actually use, in the order I apply it:

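Will a downstream pipeline read it? → Parquet (partitioned, zstd-compressed). This is the answer about 90% of the time.
Is the consumer a database that already holds the data? → Don't write a file. Compute inside the warehouse with SQL/dbt.
Is the consumer an Excel user? → CSV with a sidecar schema, written at the boundary. Don't make it a pipeline stage.
Is the data nested or log-shaped? → Newline-delimited JSON (NDJSON / JSONL), compressed with gzip or zstd.
Short-lived scratch data between notebook steps? → Arrow IPC (Feather). Fast loads with little or no compression overhead.
Is the data small, and do you want it diffable? → CSV in version control. Switch to Parquet once it passes ~10 MB.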
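
Because the order of the questions matters, it can help to see the tree as a chain of early returns. The following is only a sketch: `Dataset`, its field names, and `choose_format` are hypothetical names invented for illustration, and the final Parquet fallback is an assumption based on the "90% of the time" remark rather than something the list states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical description of a dataset and who consumes it."""
    read_by_pipeline: bool = False
    already_in_database: bool = False
    consumer_uses_excel: bool = False
    nested_or_log_shaped: bool = False
    notebook_scratch: bool = False
    wants_diffs: bool = False
    size_mb: float = 0.0


def choose_format(d: Dataset) -> str:
    """Ask the questions in the listed order; the first match wins."""
    if d.read_by_pipeline:
        return "parquet"
    if d.already_in_database:
        return "none: compute in the warehouse"
    if d.consumer_uses_excel:
        return "csv + sidecar schema"
    if d.nested_or_log_shaped:
        return "ndjson"
    if d.notebook_scratch:
        return "feather"
    if d.wants_diffs:
        # Diffable CSV only while the file stays small (~10 MB).
        return "csv" if d.size_mb <= 10 else "parquet"
    # Assumption: nothing matched, so fall back to the usual default.
    return "parquet"
```

A dataset that is both read by a pipeline and opened in Excel resolves to Parquet here, because the pipeline question comes first; the CSV export would then be a separate step at the boundary.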

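What about HDF5, Avro, ORC, Iceberg, Delta Lake, and Hudi?

Quick summary:

HDF5: a hierarchical format for scientific data that predates the modern stack. Still common in physics and genomics. Don't pick it for new analytics work.
Avro: row-oriented and common in Kafka pipelines. Fine for streams; for analytics at rest, use Parquet.
ORC: the Hadoop-era columnar competitor to Parquet. Unless you already live in an ORC world, pick Parquet.
Iceberg / Delta Lake / Hudi: table formats layered on top of data files (typically Parquet). They add ACID transactions, schema evolution, and time travel. Critical for a warehouse-scale lakehouse; overkill for a single-machine pipeline. Learn them once you outgrow the tree-of-files approach.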

Code

A pipeline-friendly write helper that picks the right format for each purpose (Python)

from pathlib import Path
import pandas as pd

def write(df: pd.DataFrame, path: Path, *, purpose: str) -> Path:
    '''Pick the format based on what the data is for.'''
    if purpose == 'pipeline':
        # Default analytics format
        path = path.with_suffix('.parquet')
        df.to_parquet(path, index=False, compression='zstd')
    elif purpose == 'human':
        # Excel/CSV boundary: a person will read this
        path = path.with_suffix('.csv')
        df.to_csv(path, index=False, encoding='utf-8', lineterminator='\n')
    elif purpose == 'scratch':
        # Notebook scratch: fast loads, no compression
        # (without this argument pyarrow defaults to LZ4 when available)
        path = path.with_suffix('.feather')
        df.to_feather(path, compression='uncompressed')
    elif purpose == 'log':
        # Newline-delimited JSON, the format used for append-only logs
        path = path.with_suffix('.jsonl')
        df.to_json(path, orient='records', lines=True)
    else:
        raise ValueError(f'unknown purpose {purpose!r}')
    return path
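
One thing to watch in the `log` branch: as written, `to_json` replaces the file's contents on every call. If you really want append-only behavior with gzip compression, a standard-library sketch might look like the following. `append_records` and `read_records` are hypothetical helper names, not part of pandas, and the sketch assumes every record is a JSON-serializable dict.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator


def append_records(path: Path, records: Iterable[dict]) -> int:
    """Append records as gzip-compressed NDJSON; return how many were written."""
    count = 0
    # Mode "at" adds a new gzip member at the end of the file
    # instead of truncating what is already there.
    with gzip.open(path, "at", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_records(path: Path) -> Iterator[dict]:
    """Yield the records back, one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Each call adds a separate gzip member, and Python's `gzip` module reads the members back as one continuous stream, so earlier batches never need to be rewritten.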
