RAG vs Stuffing: Pick by Lookup Frequency

Two ways to bring in data

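You can either stuff the relevant text into the prompt or retrieve it on demand. Stuffing is simpler: paste the documents and ask. Retrieval is leaner: embed the corpus, fetch the top-k chunks, and paste only those. Both are valid; the right answer depends on how large and how variable the knowledge base is.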

Decision rule

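If the knowledge base fits comfortably in context and the same content feeds most queries, stuff it and use prompt caching; the cache makes repeated reads cheap. If the knowledge base is much larger than the context window, or queries vary widely, retrieve. cwkPippa stuffs the persona plus the vault index (stable, ~30KB cached) and retrieves specific files on demand with the Read tool.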
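The rule above can be sketched as a small function. This is only an illustration: the function name, the 50% headroom factor, and the `shared_fraction` thresholds are assumptions chosen for the example, not values from this lesson. Cases the rule does not clearly cover are returned as "measure".

```python
def choose_strategy(kb_tokens: int, context_budget_tokens: int, shared_fraction: float) -> str:
    """Return "stuff", "retrieve", or "measure".

    kb_tokens: approximate size of the whole knowledge base in tokens.
    context_budget_tokens: how many context tokens you are willing to spend on it.
    shared_fraction: rough share (0..1) of queries served by the same content.
    All thresholds below are illustrative assumptions.
    """
    if kb_tokens > context_budget_tokens:
        return "retrieve"  # it cannot be stuffed at all
    if shared_fraction < 0.2:
        return "retrieve"  # queries vary widely (assumed threshold)
    fits_comfortably = kb_tokens <= context_budget_tokens * 0.5  # assumed headroom
    mostly_shared = shared_fraction >= 0.5                       # assumed threshold
    if fits_comfortably and mostly_shared:
        return "stuff"     # pair with prompt caching
    return "measure"       # borderline: prototype both and compare


# A ~30KB knowledge base is very roughly 7,500 tokens.
print(choose_strategy(kb_tokens=7_500, context_budget_tokens=200_000, shared_fraction=0.9))
```

The borderline branch is deliberate: when neither condition is clearly met, numbers from a prototype are more trustworthy than a threshold.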

What good RAG looks like

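A quality embedding model (the SDK does not ship one; pair it with Ollama bge-m3, OpenAI text-embedding-3, or similar), a reasonable chunk size (512-1024 tokens plus overlap), a re-ranker when there are many candidates, and provenance preserved all the way to the answer so it can cite its sources.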
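As a rough sketch of chunking with overlap while keeping provenance, the snippet below splits text into fixed-size windows and records where each chunk came from. It approximates tokens with whitespace-separated words, which is an assumption; a real pipeline would count tokens with the tokenizer of its embedding model. The `Chunk` class, the function name, and the default overlap of 64 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # provenance: which document the chunk came from
    start: int   # provenance: offset of the first token in that document
    text: str


def chunk_with_overlap(source: str, text: str, size: int = 512, overlap: int = 64) -> list[Chunk]:
    """Split text into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = text.split()  # whitespace words stand in for real tokens
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(Chunk(source=source, start=start, text=" ".join(window)))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


demo = chunk_with_overlap("faq.md", " ".join(f"t{i}" for i in range(10)), size=4, overlap=1)
for c in demo:
    print(c.source, c.start, c.text)
```

Carrying `source` and `start` on every chunk is what lets the final answer cite the document it drew from.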

Principle: RAG is overkill for a small, stable corpus. Stuffing is overkill for a large, variable one. Match the pattern to the data.

Code

Stuffing with prompt caching · python

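import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("product_kb.md", encoding="utf-8") as f:
    STABLE_KB = f.read()  # ~30KB of stable docs

user_question = "How do I reset my password?"  # placeholder question

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": f"Use the knowledge base to answer.\n\n<kb>\n{STABLE_KB}\n</kb>",
            # Mark the stable prefix as cacheable so repeated queries reuse it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_question}],
)
print(resp.content[0].text)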

RAG: retrieve the top-k chunks, then ask · python

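# Pseudocode: `vector_store` stands in for whatever vector DB client you use;
# `client` and `user_question` are defined as in the stuffing example.
# Retrieve the top-3 chunks and keep each document id for provenance.
hits = vector_store.query(user_question, k=3)
context = "\n\n".join(f"[doc {h.id}] {h.text}" for h in hits)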

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="Answer using ONLY the provided context. If the answer is not in context, say so.",
    messages=[{
        "role": "user",
        "content": f"<context>\n{context}\n</context>\n\nQuestion: {user_question}",
    }],
)
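The `vector_store` in the RAG example is a placeholder. To try the flow end to end without a vector database, a toy in-memory stand-in like the one below may be enough. It is a sketch under loud assumptions: the `ToyVectorStore` and `Hit` names are hypothetical, and the "embedding" is a bag-of-words count scored by cosine similarity, not a real embedding model such as bge-m3.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    id: str
    text: str
    score: float


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in exposing the same query(question, k) shape."""

    def __init__(self) -> None:
        self._docs: list[tuple[str, str, Counter]] = []

    @staticmethod
    def _embed(text: str) -> Counter:
        # Bag-of-words counts; a real store would call an embedding model here.
        return Counter(text.lower().split())

    def add(self, doc_id: str, text: str) -> None:
        self._docs.append((doc_id, text, self._embed(text)))

    def query(self, question: str, k: int = 3) -> list[Hit]:
        q = self._embed(question)
        hits = [Hit(doc_id, text, _cosine(q, vec)) for doc_id, text, vec in self._docs]
        hits.sort(key=lambda h: h.score, reverse=True)
        return hits[:k]


vector_store = ToyVectorStore()
vector_store.add("a", "refund policy lasts thirty days")
vector_store.add("b", "shipping takes five days")
vector_store.add("c", "our office is in seoul")
hits = vector_store.query("what is the refund policy", k=2)
print([(h.id, round(h.score, 2)) for h in hits])
```

Because each `Hit` exposes `id` and `text`, the `context` line from the RAG example works unchanged against this stand-in.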

Exercise

For one knowledge-driven feature, prototype both stuffing-with-cache and RAG. Measure latency, cost, and answer quality over 30 representative queries. Decide with data, not gut feel.

Hint

Compare costs: stuffing pays for the knowledge base once per cache window, while RAG pays per query. The crossover is usually somewhere around 10-50 queries per cache TTL.
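A back-of-the-envelope model can help locate the crossover for your own numbers. The sketch below counts input tokens only, in units of the base input price. The cache multipliers (1.25x for the first write, 0.10x for each cached read) are assumptions to check against current pricing, and the model ignores embedding calls, vector store costs, question tokens, and output tokens, so real crossovers will land elsewhere. All function names are hypothetical.

```python
from typing import Optional


def stuffing_cost(n_queries: int, kb_tokens: int,
                  write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost of n queries inside one cache window (base-price token units)."""
    if n_queries <= 0:
        return 0.0
    first = kb_tokens * write_mult                    # first query writes the cache
    rest = (n_queries - 1) * kb_tokens * read_mult    # later queries read from it
    return first + rest


def rag_cost(n_queries: int, k: int, chunk_tokens: int) -> float:
    """Input cost of n queries that each paste k retrieved chunks."""
    return float(n_queries * k * chunk_tokens)


def crossover(kb_tokens: int, k: int, chunk_tokens: int,
              max_queries: int = 10_000) -> Optional[int]:
    """Smallest query count at which stuffing is no more expensive than RAG."""
    for n in range(1, max_queries + 1):
        if stuffing_cost(n, kb_tokens) <= rag_cost(n, k, chunk_tokens):
            return n
    return None  # stuffing never catches up within max_queries


# ~7,500-token knowledge base vs. 3 chunks of 768 tokens per query (assumed sizes).
print(crossover(kb_tokens=7_500, k=3, chunk_tokens=768))
```

If a cached read of the whole knowledge base costs more than the retrieved chunks do, `crossover` returns `None`: under this model stuffing never becomes cheaper, which is the "large corpus" side of the decision rule.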
