Long-context limits: when the model forgets the middle

Lost in the middle is a real effect

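Even in models with a 1M-token window, attention is not uniform. Information at the start and end of the context is recalled with high fidelity; information buried in the middle is recalled with measurable degradation. Needle-in-a-haystack benchmark scores have improved every year, but the effect has not disappeared, and most production prompts are not as well structured as a needle test.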

Tactics that work

Order by importance: put the most relevant chunk first, rather than following document order.
Repeat the instruction at the end: a one-line reminder after a long evidence block re-anchors the task.
Trim aggressively: 200k tokens of marginally relevant context performs worse than 20k tokens of well-chosen context.
Hierarchical retrieval: retrieve at the summary level first, then drill into the relevant sections. Don't dump everything into the middle.
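The first three tactics can be combined in a single prompt-assembly step. The sketch below is one possible way to do it, not a definitive implementation: `Chunk`, `estimate_tokens`, and `assemble_prompt` are hypothetical names, the relevance `score` is assumed to come from whatever retriever you already use, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # assumed: relevance score from your retriever, higher is better


def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def assemble_prompt(instruction: str, chunks: list[Chunk], budget_tokens: int) -> str:
    """Order chunks by relevance, trim to a token budget, repeat the instruction last."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    kept: list[str] = []
    used = 0
    for chunk in ranked:
        cost = estimate_tokens(chunk.text)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit; the rest rank lower
        kept.append(chunk.text)
        used += cost
    evidence = "\n\n".join(kept)
    # The instruction appears before the evidence and again as a one-line reminder after it.
    return f"{instruction}\n\n{evidence}\n\nReminder: {instruction}"


if __name__ == "__main__":
    demo = [
        Chunk("Background on the billing system.", 0.2),
        Chunk("The refund window is 30 days.", 0.9),
        Chunk("Office locations and hours.", 0.1),
    ]
    print(assemble_prompt("What is the refund window?", demo, budget_tokens=16))
```

With the small budget in the demo, the lowest-scoring chunk is dropped and the highest-scoring one lands directly after the instruction, away from the middle of the prompt.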

Test it

Inject a known fact at three positions in a long context (start, middle, end) and ask the model to recall it. If middle-position recall is materially worse, your prompts are affected by the lost-in-the-middle effect.

Code

Position-sensitive recall test (Python)

import anthropic  # client for the model call; not used by this helper


def build_haystack(needle: str, position: str, total_tokens: int = 100_000) -> str:
    """Bury `needle` at the start, middle, or end of roughly `total_tokens` of filler."""
    # Crude sizing: assume each filler sentence costs about 6 tokens.
    parts = ["Lorem ipsum dolor sit amet"] * (total_tokens // 6)
    if position == "start":
        parts = [needle] + parts
    elif position == "end":
        parts = parts + [needle]
    elif position == "middle":
        mid = len(parts) // 2
        parts = parts[:mid] + [needle] + parts[mid:]
    else:
        raise ValueError(f"unknown position: {position!r}")
    return ". ".join(parts) + "."

# Run with the same needle at three positions; compare recall.
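The comparison step can be sketched as a small harness. This is a minimal sketch under stated assumptions: `recall_by_position` is a hypothetical helper, `build_context` is any function with the same `(needle, position)` call shape as `build_haystack`, and `ask_model` is a placeholder for your own wrapper around a real model call. The `tiny_context` and `fake_model` stand-ins exist only so the example runs offline; the fake model sees just the edges of the prompt, which imitates the effect rather than measuring it.

```python
from typing import Callable

POSITIONS = ("start", "middle", "end")


def recall_by_position(
    build_context: Callable[[str, str], str],
    ask_model: Callable[[str], str],
    needle: str,
    question: str,
    expected: str,
    trials: int = 3,
) -> dict[str, float]:
    """Return, per position, the fraction of trials whose answer contains `expected`."""
    results: dict[str, float] = {}
    for position in POSITIONS:
        hits = 0
        for _ in range(trials):
            prompt = f"{build_context(needle, position)}\n\nQuestion: {question}"
            answer = ask_model(prompt)
            hits += expected.lower() in answer.lower()
        results[position] = hits / trials
    return results


def tiny_context(needle: str, position: str) -> str:
    # Small offline stand-in for a real haystack builder.
    filler = ["Lorem ipsum dolor sit amet"] * 200
    index = {"start": 0, "middle": len(filler) // 2, "end": len(filler)}[position]
    return ". ".join(filler[:index] + [needle] + filler[index:]) + "."


def fake_model(prompt: str) -> str:
    # Stand-in for a real API call: it only "sees" the first and last 300 characters.
    visible = prompt[:300] + prompt[-300:]
    return "7421" if "7421" in visible else "I don't know"


if __name__ == "__main__":
    scores = recall_by_position(
        tiny_context,
        fake_model,
        needle="The vault code is 7421",
        question="What is the vault code?",
        expected="7421",
    )
    print(scores)
```

Against a real model, replace `fake_model` with your own call and use several different needles and more trials, since a single needle can be recalled or missed by chance.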
