RAG is not an architecture

What RAG actually is

Retrieval-Augmented Generation is a system design pattern, not a model architecture. When a query arrives, application code retrieves relevant documents from a vector store (or another source: a database, web search, a structured query) and pastes the retrieved text into the model's context window. The model's weights are unchanged. The model sees the retrieved documents as part of the prompt and generates a response conditioned on them.

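RAG is search + paste + prompt

Strip away the name and RAG is three steps: (1) search a corpus for relevant chunks, (2) concatenate them into the model's input, (3) prompt the model to answer using that context. Every step is application code. The model is a black-box generator, identical to the one you would call without RAG.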

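Why people mistake RAG for an architecture

Phrases like "the model has RAG capability", "built-in retrieval", and "memory-enabled AI" all imply that something changes inside the model. Almost nothing does. The same base model can be wrapped in zero or many RAG systems with no change to its weights. Two different RAG systems on top of the same model can produce wildly different behavior; that variation lives in the application layer, not in the model.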
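To make the point above concrete, here is a minimal sketch in which one unchanged "model" is wrapped by two different pipelines. Everything in it is hypothetical: `frozen_model` is a stub that stands in for a real LLM call, and `CORPUS`, `keyword_retriever`, `last_doc_retriever`, and `rag` are illustrative names, not part of any library.

```python
from typing import Callable, List

# Hypothetical stand-in for a frozen LLM. It simply echoes the first line of
# context it was given. A real model would be an API or local inference call.
def frozen_model(prompt: str) -> str:
    first_context_line = prompt.split("\n")[1]
    return f"Based on the context: {first_context_line}"

# Toy corpus, invented for illustration only.
CORPUS = [
    "The refund window is 30 days.",
    "Refunds are issued to the original payment method.",
    "Shipping takes 5 business days.",
]

def keyword_retriever(query: str) -> List[str]:
    """Return documents that share at least one lowercase word with the query."""
    words = set(query.lower().split())
    return [d for d in CORPUS if words & set(d.lower().rstrip(".").split())]

def last_doc_retriever(query: str) -> List[str]:
    """A deliberately poor retriever: always returns the last document."""
    return [CORPUS[-1]]

def rag(query: str, retriever: Callable[[str], List[str]],
        model: Callable[[str], str]) -> str:
    """Search, paste, prompt. The model argument is never modified."""
    docs = retriever(query)
    prompt = "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {query}"
    return model(prompt)

question = "How long is the refund window?"
answer_a = rag(question, keyword_retriever, frozen_model)
answer_b = rag(question, last_doc_retriever, frozen_model)
print(answer_a)
print(answer_b)
```

The two answers differ even though `frozen_model` is the same function in both calls. The only thing that varied is the retriever, which is application code.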

What changes when you add RAG

Input: the model sees more context per query.
Latency: retrieval adds a step before generation.
Cost: larger context = more input tokens = a larger bill.
Behavior: the model is conditioned on the retrieved facts, so it can answer questions about content that was not in its training data.
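The latency and cost items above can be measured from the application side. The sketch below is a rough illustration, assuming a whitespace word count as a stand-in for a real tokenizer (actual token counts depend on the model's tokenizer). `approx_tokens`, `build_prompt`, `measure_overhead`, and `toy_retrieve` are hypothetical helpers.

```python
import time
from typing import Callable, Dict, List

def approx_tokens(text: str) -> int:
    # Assumption: whitespace-separated words as a crude proxy for tokens.
    return len(text.split())

def build_prompt(query: str, docs: List[str]) -> str:
    """Build the prompt with or without retrieved context."""
    if not docs:
        return f"Question: {query}"
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

def measure_overhead(query: str,
                     retrieve: Callable[[str], List[str]]) -> Dict[str, float]:
    """Time the retrieval step and compare prompt sizes with and without it."""
    start = time.perf_counter()
    docs = retrieve(query)
    retrieval_seconds = time.perf_counter() - start

    plain = approx_tokens(build_prompt(query, []))
    augmented = approx_tokens(build_prompt(query, docs))
    return {
        "plain_tokens": plain,
        "rag_tokens": augmented,
        "extra_tokens": augmented - plain,
        "retrieval_seconds": retrieval_seconds,
    }

def toy_retrieve(query: str) -> List[str]:
    # Hypothetical retriever returning two fixed documents.
    return ["Doc one has five words.", "Doc two has five words."]

report = measure_overhead("what is RAG?", toy_retrieve)
print(report)
```

Both numbers are properties of the pipeline: the extra tokens come from the pasted context, and the extra time comes from the search step that runs before the model is called.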

What does not change

The model's weights. The model's training. The model's architecture. Any model can be plugged into a RAG pipeline; the pipeline gives the system its retrieval capability, not the model.

Code

RAG in four lines (model unchanged)·python

def rag_answer(query, vector_db, llm, top_k=5):
    docs    = vector_db.search(query, top_k=top_k)
    context = "\n\n".join(docs)
    prompt  = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)
# llm.generate is the SAME call you'd make without RAG.
# RAG lives in the four lines around it, not inside it.

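Exercise

Pick a product that claims RAG (a chatbot, a code assistant, a search tool). Identify which parts of its responses would change if you swapped the underlying LLM for a different one of similar capability. The retrieval part stays the same; the generation style changes. That mental separation is the architecture-vs-product distinction in real life.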
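One way to rehearse the exercise on a small scale is to run the same pipeline with two interchangeable models and compare what stays fixed. The sketch below is a toy under stated assumptions: `ToyStore`, `TerseLLM`, `VerboseLLM`, and `rag_answer_with_trace` are hypothetical, the "models" are stubs with different phrasing, and word overlap stands in for vector search.

```python
from typing import List, Tuple

class ToyStore:
    """Hypothetical in-memory store that ranks documents by word overlap."""
    def __init__(self, docs: List[str]):
        self.docs = docs

    def search(self, query: str, top_k: int = 5) -> List[str]:
        words = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: -len(words & set(d.lower().split())),
        )
        return ranked[:top_k]

class TerseLLM:
    """Stub model: repeats the first context line and nothing else."""
    def generate(self, prompt: str) -> str:
        return prompt.split("\n")[1]

class VerboseLLM:
    """Stub model: same content, wordier style."""
    def generate(self, prompt: str) -> str:
        first_line = prompt.split("\n")[1]
        return f"Thanks for asking! According to the provided context, {first_line}"

def rag_answer_with_trace(query, store, llm, top_k=2) -> Tuple[List[str], str]:
    """Same pipeline shape as the lesson's example, but also returns the docs."""
    docs = store.search(query, top_k=top_k)
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return docs, llm.generate(prompt)

store = ToyStore([
    "RAG retrieves documents at query time.",
    "Fine-tuning changes model weights.",
    "Bananas are yellow.",
])
query = "what changes model weights"
docs_a, answer_a = rag_answer_with_trace(query, store, TerseLLM())
docs_b, answer_b = rag_answer_with_trace(query, store, VerboseLLM())
print(docs_a == docs_b)   # retrieval is independent of the model
print(answer_a)
print(answer_b)
```

The retrieved documents are identical across both runs because the model never takes part in the search; only the wording of the final answer moves when the model is swapped.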