O(n²) attention complexity: the long-context cliff

~10 min · complexity, long-context


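Standard self-attention computes an (n × n) score matrix and then a row-wise softmax over it. Both time and memory are O(n²) in the sequence length n. That is cheap up to a few thousand tokens. Past 32K it gets expensive, past 128K it gets brutal, and past 1M it is impractical without architectural changes.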
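To make the quadratic term concrete, here is a minimal pure-Python sketch of single-head attention that materializes the full weight matrix. The function name `naive_attention` and the list-of-lists layout are illustrative assumptions, not how a production kernel is written.

```python
import math

def naive_attention(q, k, v):
    """Single-head attention over lists of vectors (illustrative helper).

    Materializes the full (n, n) weight matrix, which is the O(n^2) cost.
    """
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)

    # (n, n) score matrix: one scaled dot product per (query, key) pair.
    scores = [[scale * sum(a * b for a, b in zip(q[i], k[j])) for j in range(n)]
              for i in range(n)]

    # Row-wise softmax, subtracting the row max for numerical stability.
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])

    # Each output row is a weighted sum of all n value vectors.
    out = [[sum(weights[i][j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
           for i in range(n)]
    return out, weights
```

Doubling n quadruples both the number of dot products and the size of `weights`, which is the growth the table below tabulates.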

Sequence length n	(n × n) entries	FP16 memory per head
1K	1M	2 MB
8K	64M	128 MB
32K	1B	2 GB
128K	16B	32 GB
1M	1T	2 TB

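For a LLaMA-style model with 32+ heads and 80+ layers, multiply by roughly 2,560 (32 × 80). At 1M tokens, single-head attention alone already needs 2 TB; storing the score matrices for every head and every layer would take petabytes of intermediate storage (about 5 PB at that multiplier), and that is before the FFN even starts. This is why models such as LLaMA 4 Scout (10M context) and Gemini 2.5 Pro (1M context) do not run standard dense attention over the full sequence length. They rely on more efficient schemes instead: Flash Attention is exact but tiles the computation in memory so the full score matrix is never stored, while sliding-window, sparse, and Mamba-hybrid alternatives are approximate.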
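The "exact but tiled" idea can be sketched for a single query row with an online softmax: keys are processed in blocks while a running max, denominator, and weighted sum are kept. This is a toy illustration in the spirit of Flash Attention, not the actual kernel; `dense_row`, `tiled_row`, and the `block` parameter are hypothetical names chosen for this example.

```python
import math

def _scores(q, keys):
    # Scaled dot products between one query and a list of keys.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def dense_row(q, keys, values):
    """Reference: softmax over all n scores at once, then a weighted sum."""
    s = _scores(q, keys)
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [sum(w * v[c] for w, v in zip(e, values)) / z
            for c in range(len(values[0]))]

def tiled_row(q, keys, values, block=2):
    """Same output, but only `block` scores exist at any moment."""
    m = -math.inf                      # running max of scores seen so far
    denom = 0.0                        # running softmax denominator
    acc = [0.0] * len(values[0])       # running unnormalized output
    for start in range(0, len(keys), block):
        s = _scores(q, keys[start:start + block])
        vb = values[start:start + block]
        m_new = max(m, max(s))
        rescale = math.exp(m - m_new)  # 0.0 on the first block
        e = [math.exp(x - m_new) for x in s]
        denom = denom * rescale + sum(e)
        acc = [a * rescale + sum(w * v[c] for w, v in zip(e, vb))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / denom for a in acc]
```

The two functions agree up to floating-point rounding. Note that tiling removes the O(n²) memory for the score matrix, but the number of dot products is unchanged, so compute stays quadratic.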

Code

Memory cost calculator·python

def attention_memory_bytes(seq_len, n_heads=1, dtype_bytes=2):
    """Bytes needed for the (n, n) score / weight matrix only (FP16 by default)."""
    # One (seq_len x seq_len) matrix per head; Q, K, V and activations are excluded.
    return n_heads * seq_len * seq_len * dtype_bytes

# 131_072 is 128K in the same binary sense as 1_024 and 8_192.
for n in [1_024, 8_192, 32_768, 131_072, 1_000_000]:
    print(f"n={n:>9,}  per-head: {attention_memory_bytes(n)/1e9:>8.2f} GB"
          f"  Llama-3 32-head: {attention_memory_bytes(n, 32)/1e9:>9.2f} GB")


Exercise

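Pick a model that supports a 128K context. Run a synthetic "find the unique 8-digit number" task 100 times at each depth: 1K, 8K, 32K, 64K, and 128K. Plot recall against depth. Where does the practical context window end for your use case?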
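One possible harness for this exercise is sketched below. It is a starting point under several assumptions: word counts stand in for token counts, the filler text and prompt wording are arbitrary, and `ask_model` is a hypothetical callable (prompt string in, response string out) that you would wire to whatever model API you use.

```python
import random

FILLER = "The quick brown fox jumps over the lazy dog.".split()

def make_trial(depth_words, total_words, rng):
    """Build one needle-in-a-haystack prompt and return (prompt, needle)."""
    needle = str(rng.randrange(10**7, 10**8))   # always exactly 8 digits
    words = [FILLER[i % len(FILLER)] for i in range(total_words)]
    sentence = f"The secret number is {needle}.".split()
    pos = min(depth_words, total_words)          # clamp depth to the haystack
    words[pos:pos] = sentence                    # insert the needle at that depth
    prompt = (" ".join(words)
              + "\n\nWhat is the secret number? Answer with digits only.")
    return prompt, needle

def recall_at_depth(ask_model, depth_words, total_words, trials=100, seed=0):
    """Fraction of trials in which the model's response contains the needle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        prompt, needle = make_trial(depth_words, total_words, rng)
        hits += needle in ask_model(prompt)
    return hits / trials
```

Calling `recall_at_depth` once per depth gives the points for the recall plot. For real measurements, swap the word counts for counts from the model's own tokenizer.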
