Latency: Batching, Async, and Where the Time Goes

~22 min · ops, latency, performance


Where the time goes in a typical RAG turn

Query embedding: 30–200 ms (hosted) or 5–20 ms (small local model)
Vector search: 5–50 ms (Chroma, single-node) or 1–10 ms (pgvector with HNSW)
(Optional) BM25: 5–30 ms
(Optional) Cross-encoder rerank of 50 candidates: 100–500 ms (CPU) or 30–150 ms (GPU)
LLM generation: 500 ms (small model) to 10 s (large model with long output). It dominates everything else.
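
To make the imbalance concrete, the sketch below adds up one illustrative turn and prints each stage's share of the total. The per-stage numbers are assumptions picked from inside the ranges above, not measurements from any real system.

```python
# Illustrative per-stage latencies in ms (assumed values chosen from the ranges above).
budget_ms = {
    "embed": 100.0,
    "vector_search": 20.0,
    "bm25": 15.0,
    "rerank": 200.0,
    "llm": 4000.0,
}

def stage_shares(budget: dict[str, float]) -> dict[str, float]:
    """Return each stage's share of total latency as a percentage."""
    total = sum(budget.values())
    return {name: round(100 * ms / total, 1) for name, ms in budget.items()}

shares = stage_shares(budget_ms)
# Print stages from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {pct:5.1f}%")
```

With these assumed numbers the LLM call accounts for over 90% of the turn, so even removing the embedding stage entirely would change the total by only a few percent.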

Optimization order

Profile first, then optimize the dominant cost. If the LLM call takes 4 seconds, cutting query embedding from 100 ms to 10 ms is wasted effort. Streaming the LLM response is usually a bigger UX win than halving retrieval latency.
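
The streaming point is easiest to see by measuring time to first token separately from total generation time. The sketch below uses a hypothetical `fake_llm_stream` generator as a stand-in for a real streaming client. The token count and delay are made up for illustration.

```python
import time
from typing import Iterator, Optional

def fake_llm_stream(n_tokens: int = 20, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client; yields one token per delay."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i} "

def measure_stream(stream: Iterator[str]) -> tuple[Optional[float], float, str]:
    """Return (time to first token, total time, full text); times are in ms."""
    start = time.perf_counter()
    first_ms = None
    parts = []
    for token in stream:
        if first_ms is None:
            # The user starts reading here, long before generation finishes.
            first_ms = (time.perf_counter() - start) * 1000
        parts.append(token)
    total_ms = (time.perf_counter() - start) * 1000
    return first_ms, total_ms, "".join(parts)

ttft_ms, total_ms, text = measure_stream(fake_llm_stream())
print(f"first token: {ttft_ms:.0f} ms, full answer: {total_ms:.0f} ms")
```

Without streaming, the user waits for the full-answer time before seeing anything. With streaming, perceived latency is closer to the first-token time.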

Async, not threading

Retrieval is I/O-bound. Use async (asyncio in Python, async/await elsewhere) to issue the embedding call, vector search, and BM25 in parallel when the architecture allows it. Threading is the wrong abstraction for this work.

Code

Run vector search + BM25 in parallel·python

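import asyncio

# vector_search, bm25_search, and reciprocal_rank_fusion are assumed to be
# defined elsewhere. The two searches are synchronous here, so asyncio.to_thread
# runs each on a worker thread; await natively async clients directly instead.
async def parallel_retrieve(question: str, k: int = 20):
    # Both searches run concurrently; wall time is roughly the slower of the two.
    vec, bm25 = await asyncio.gather(
        asyncio.to_thread(vector_search, question, k),
        asyncio.to_thread(bm25_search, question, k),
    )
    # Fuse the two ranked ID lists into a single ranking.
    return reciprocal_rank_fusion([
        [r['id'] for r in vec],
        [r['id'] for r in bm25],
    ])

results = asyncio.run(parallel_retrieve('how do I cancel?'))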
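
The snippet above calls `reciprocal_rank_fusion` without defining it. A minimal sketch of one common formulation follows, assuming each input is a list of document IDs ordered best-first and using the conventional smoothing constant `k = 60`. The lesson's own helper may differ.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each ID scores sum(1 / (k + rank)), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
print(fused)  # ['b', 'a', 'd', 'c']
```

IDs that appear near the top of both lists rise above IDs that appear in only one, which is why `b` and `a` outrank `d` and `c` here.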

Time each stage explicitly·python

import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, log: dict):
    """Record the wall-clock duration of the wrapped block in log[name], in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so failed stages are still timed.
        log[name] = round((time.perf_counter() - start) * 1000, 1)

# question, embed, collection, and rerank are assumed to come from your own pipeline.
log = {}
with stage('embed', log):
    qvec = embed(question)
with stage('search', log):
    cands = collection.query(query_embeddings=[qvec], n_results=20)
with stage('rerank', log):
    final = rerank(question, cands)
print(log)   # e.g. {'embed': 42.3, 'search': 11.8, 'rerank': 187.4}


Exercise

Add explicit per-stage timing to your RAG pipeline. Log 100 real queries. Compute the median and p95 for each stage. Identify the largest contributor: that is your next optimization target. Anything else is premature.
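
One possible starting point for the aggregation step is sketched below. It assumes each query produced a dict of stage name to milliseconds and that every query logged the same stages. The 100 logs here are synthetic, made-up numbers standing in for real measurements.

```python
import random
import statistics

def summarize(logs: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """Per-stage median and p95 (ms) over a list of per-query timing dicts."""
    summary = {}
    for name in logs[0].keys():
        values = [log[name] for log in logs]
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        p95 = statistics.quantiles(values, n=20)[-1]
        summary[name] = {"median": statistics.median(values), "p95": p95}
    return summary

# Synthetic stand-in for 100 logged queries (illustrative values only).
random.seed(0)
logs = [
    {
        "embed": random.uniform(30, 60),
        "search": random.uniform(5, 20),
        "rerank": random.uniform(100, 300),
    }
    for _ in range(100)
]
summary = summarize(logs)
worst = max(summary, key=lambda s: summary[s]["median"])
print(worst, summary[worst])
```

Ranking by median picks the stage that is slow on a typical query. Ranking by p95 instead surfaces the stage responsible for the slow tail, which can be a different one.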
