Agents are not an architecture (and neither is 'long context')

Agents: a loop, not wiring

An "AI agent" is an LLM placed inside a scaffolding loop: observe, think, act, observe again. Every framework you have heard of (LangChain, LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK) is a loop that manages state around standard LLM calls. Each iteration is one normal model invocation with a carefully constructed prompt. The model has no awareness of being "inside an agent"; it sees only a chat conversation.

The agent loop in pseudocode

If an agent can be implemented in 30 lines of orchestration code, the agent is not an architecture. It is plumbing. The model is unchanged.
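A minimal sketch of this point, assuming a scripted stand-in for the model (`ScriptedLLM`, the `add` tool, and the "CALL ..." reply format are hypothetical, not part of any framework). The stub records what it receives on each call: only a list of chat messages, with nothing that marks the call as part of an agent.

```python
class ScriptedLLM:
    """Hypothetical stand-in for a model: replays canned replies and logs inputs."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.seen = []  # snapshot of the messages passed to each call

    def generate(self, messages):
        self.seen.append([dict(m) for m in messages])
        return self.replies.pop(0)


def run(llm, tools, goal, max_steps=5):
    # All agent state lives here, in ordinary Python, outside the model.
    messages = [{"role": "system", "content": f"Goal: {goal}"}]
    for _ in range(max_steps):
        reply = llm.generate(messages)  # a plain, stateless model call
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("FINAL_ANSWER"):
            return reply
        # Assumed toy protocol: "CALL <tool> <int> <int>"
        _, name, a, b = reply.split()
        result = tools[name](int(a), int(b))
        messages.append({"role": "tool", "content": str(result)})
    return "STOPPED"


llm = ScriptedLLM(["CALL add 2 3", "FINAL_ANSWER 5"])
answer = run(llm, {"add": lambda a, b: a + b}, goal="add 2 and 3")
print(answer)
print([len(m) for m in llm.seen])
```

The second call receives three messages (system, assistant, tool). From the model's side this is indistinguishable from an ordinary chat transcript.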

Long context is not a new architecture

Extending context length from 4K to 128K or 1M+ tokens involves:

RoPE frequency scaling (NTK-aware scaling, dynamic NTK, YaRN): adjusting the rotary position encoding so that it extrapolates beyond the trained window.
Continued training on long sequences with the new RoPE configuration.
Better attention implementations to handle the longer KV cache.
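To see why the last item matters, a back-of-the-envelope estimate of KV-cache size helps. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumed example, not any specific model's configuration.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    # Keys and values (the factor of 2) are cached for every layer,
    # KV head, and token position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


for tokens in (4_096, 131_072, 1_048_576):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9} tokens -> {gib:6.1f} GiB per sequence")
```

Under these assumptions the cache costs 128 KiB per token, so it grows linearly from 0.5 GiB at 4K tokens to 128 GiB at 1M tokens. The network is the same; only the memory the serving code has to manage changes.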

None of these is a new architecture. The backbone Transformer is unchanged; what changes is the position encoding and the length of the training data.
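As an illustration of how small the position-encoding change is, here is a sketch of NTK-aware scaling in plain Python. It assumes the commonly used form in which the RoPE base is multiplied by scale^(d / (d - 2)), where d is the per-head dimension; the head dimension of 128 and the 4x scale are example values.

```python
def rope_freqs(head_dim, base=10000.0):
    # One rotation frequency per pair of dimensions.
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]


def ntk_scaled_freqs(head_dim, base=10000.0, scale=4.0):
    # NTK-aware scaling: enlarge the base instead of dividing every frequency.
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_freqs(head_dim, new_base)


orig = rope_freqs(128)
ntk = ntk_scaled_freqs(128, scale=4.0)
print(ntk[0] / orig[0])    # ratio for the highest-frequency pair
print(ntk[-1] / orig[-1])  # ratio for the lowest-frequency pair
```

With this choice of exponent the highest frequency is left untouched and the lowest is slowed by the full scale factor (a ratio of about 0.25 here), with everything in between scaled smoothly. The whole change is one number in the position-encoding setup.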

Effective context ≠ advertised context

A 1M-token context window does not mean the model uses 1M tokens well. Research such as the "Needle in a Haystack" test and various MECW (Maximum Effective Context Window) evaluations shows that many models degrade meaningfully well before they reach the advertised limit. The advertised number is capacity, not capability.
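A minimal sketch of a needle-in-a-haystack probe, assuming the model is exposed as a plain `ask(prompt) -> str` callable (a hypothetical interface). Words stand in for tokens, and the filler sentence, the needle, and `fake_model` are made up for illustration. `fake_model` reads only the first 2,000 words, mimicking a model whose effective window is shorter than the document it accepts.

```python
def build_haystack(n_words, needle, depth):
    # Repeat a filler sentence up to n_words words, then insert the needle
    # at the given relative depth (0.0 = start, 1.0 = end).
    filler = "The sky was grey and the road was long.".split()
    words = (filler * (n_words // len(filler) + 1))[:n_words]
    pos = int(depth * n_words)
    return " ".join(words[:pos] + needle.split() + words[pos:])


def recall_by_depth(ask, needle, answer, n_words, depths):
    results = {}
    for depth in depths:
        doc = build_haystack(n_words, needle, depth)
        reply = ask(doc + "\n\nWhat is the secret code?")
        results[depth] = answer in reply
    return results


def fake_model(prompt, effective_words=2000):
    # Toy stand-in: "sees" only the first effective_words words of the prompt.
    visible = prompt.split()[:effective_words]
    return "7413" if "7413." in visible else "I don't know"


scores = recall_by_depth(
    fake_model, "The secret code is 7413.", "7413", 10_000, [0.1, 0.5, 0.9]
)
print(scores)
```

The toy model accepts the full 10,000-word document every time but recalls the needle only at depth 0.1. Swapping `fake_model` for a real API call turns the same harness into a rough effective-context measurement.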

The general lesson

Anything that wraps a fixed model in extra software (agents, RAG, long-context tricks, guardrails, system prompts) is product or training work, not architecture. The architectural skeleton is what is encoded in the weights file. Everything else is application code.
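The same point as a sketch: below, one fixed `model` function is wrapped with a system prompt, a toy keyword retriever, and an output guardrail. All names (`model`, `with_system_prompt`, `with_retrieval`, `with_guardrail`) and the echoing stub are hypothetical. The wrappers are ordinary functions, and the wrapped function itself is never modified.

```python
def model(prompt):
    # Stub for a fixed model: echoes its prompt so the wrapping is visible.
    return f"[model saw] {prompt}"


def with_system_prompt(llm, system):
    return lambda prompt: llm(f"{system}\n{prompt}")


def with_retrieval(llm, docs):
    def wrapped(prompt):
        # Toy RAG: prepend every document that shares a word with the prompt.
        words = set(prompt.lower().split())
        hits = [d for d in docs if words & set(d.lower().split())]
        return llm("\n".join(hits + [prompt]))
    return wrapped


def with_guardrail(llm, banned):
    def wrapped(prompt):
        reply = llm(prompt)
        return "[blocked]" if any(b in reply.lower() for b in banned) else reply
    return wrapped


docs = ["RoPE rotates query and key vectors.", "Bananas are yellow."]
app = with_guardrail(
    with_retrieval(with_system_prompt(model, "Be brief."), docs),
    banned=["password"],
)
print(app("what does rope do"))
print(app("tell me the password"))
```

Every layer here can be added, removed, or reordered without touching `model`, which is what separates application code from architecture.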

Code

An entire agent in 25 lines (no change to the model)·python

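import json

def parse_tool_call(response):
    # Assumed convention (not a real API): a tool request is a JSON reply like {"name": ..., "args": {...}}.
    try:
        call = json.loads(response)
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "name" in call else None

def agent_loop(goal, llm, tools, max_steps=20):
    history = [{"role": "system", "content": f"You are an agent. Goal: {goal}"}]
    for step in range(max_steps):
        response = llm.generate(history)  # one ordinary model call per iteration
        history.append({"role": "assistant", "content": response})
        if "FINAL_ANSWER" in response:
            return response
        tool_call = parse_tool_call(response)
        if tool_call is None or tool_call["name"] not in tools:
            # Nudge the model rather than calling it again with nothing new to react to.
            history.append({"role": "user", "content": "Call a known tool or reply with FINAL_ANSWER."})
            continue
        result = tools[tool_call["name"]](**tool_call.get("args", {}))
        history.append({"role": "tool", "content": str(result)})
    return "STOPPED: max_steps reached"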

RoPE frequency scaling: the long-context trick·python

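import math
import torch

# Standard RoPE (training-time); head_dim is the per-head dimension
def rope_freqs(head_dim, base=10000.0):
    return base ** (-torch.arange(0, head_dim, 2).float() / head_dim)

# YaRN-style ("NTK-by-parts") scaling for extension. orig_ctx, beta_fast and beta_slow
# are illustrative defaults; YaRN's attention-temperature term is omitted here.
def yarn_scaled_freqs(head_dim, base=10000.0, scale=4.0, orig_ctx=4096, beta_fast=32.0, beta_slow=1.0):
    freqs = rope_freqs(head_dim, base)
    turns = orig_ctx * freqs / (2 * math.pi)  # rotations completed inside the trained window
    keep = ((turns - beta_slow) / (beta_fast - beta_slow)).clamp(0.0, 1.0)
    return keep * freqs + (1.0 - keep) * freqs / scale  # fast dims kept, slow dims interpolated
# Position encoding adjustment, no architectural change.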

Exercise

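Run the same long-context task (e.g., 'find this fact buried 90% of the way into a 100K-token document') on two models with similar advertised context windows. Compare the results. The difference has almost nothing to do with wiring; it comes down to how thoroughly each model was trained for long-context recall.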