O(n²) Bottleneck

Quadratic 은 constant factor 가 아냐

attention score matrix 모양은 n × n. sequence length 가 1K 면 entry 가 백만 개. 100K 면 100 억 개. 이 matrix 가 최종 출력에 안 나타나도, 표준 구현에서는 softmax 전에 메모리에 materialize 돼 — 저장하든 안 하든 O(n²·d) FLOPs 가 들어. compute 와 memory 둘 다 sequence length 에 quadratic 으로 자라.

이건 엔지니어링으로 없앨 수 있는 constant-factor 불편함이 아냐. context 를 두 배 늘리면 attention cost 가 네 배가 돼. 8K 에서 128K 로 — 16× context 증가 — 가면 attention compute 가 256× 곱해져. 어느 시점에는 GPU 가 그냥 안 들어가고, 그걸 넘으면 메모리는 들어가도 wall-clock cost 가 prohibitive 해져.

FlashAttention 이 고친 거랑 못 고친 거

FlashAttention (Dao et al., 2022 → 2024 의 FlashAttention-3) 은 시대 가장 중요한 시스템 논문 중 하나야. attention matrix 의 tile 들을 GPU 의 빠른 on-chip SRAM 안에 유지하고 느린 HBM 으로 계속 swap 안 하게 computation 을 재구성해. FlashAttention-3 은 Hopper 에서 async TMA copy 와 FP8 quantization 으로 H100 의 이론적 peak 약 85% 까지 도달.

FlashAttention 이 바꾼 건: constant factor. 안 바꾼 건: asymptotic complexity. compute 는 여전히 O(n²·d) 야. FlashAttention 은 실전 wall 을 더 멀리 밀었지만 — 여전히 wall 이야. 백만 토큰 context 처리하려면 다른 algorithm 이 필요하지, 더 좋은 kernel 이 아냐.

Wall 이 실제 어디서 나타나나

학습 시: context window 를 늘리려고 할 때 OOM error 로. 추론 시: long-context serving 의 throughput collapse 로. 그리고 consumer hardware 에서 가장 잔인하게 — 2025 study 에서 Transformer 는 24GB consumer GPU 에서 ~25K 토큰 을 못 넘긴 반면, 동급 SSM 은 같은 카드에서 220K+ 토큰을 처리. 이 gap 이 이 quest 나머지의 motivation 이야.

Code

fp16 attention matrix 메모리 증가·python

def attention_bytes(seq_len, n_heads=32, dtype_bytes=2):
    # Score matrix: heads * n * n entries, 각 entry `dtype_bytes` 바이트
    return n_heads * seq_len * seq_len * dtype_bytes

for n in [1024, 4096, 32_768, 131_072, 1_000_000]:
    gb = attention_bytes(n) / (1024 ** 3)
    print(f'{n:>10} tokens -> {gb:>8.2f} GB attention matrix')
# 1024       -> 0.06 GB
# 4096       -> 1.0 GB
# 32768      -> 64.0 GB     <-- 이미 단일 H100 80GB 초과
# 131072     -> 1024.0 GB
# 1000000    -> ~59 TB

Exercise

로컬에서 돌릴 수 있는 Hugging Face Transformers 모델 골라봐 (Llama 3.2 1B 면 노트북에서도 OK). attn_implementation="eager", "sdpa", hardware 지원되면 "flash_attention_2" 를 차례로 설정하고, sequence length 512, 4K, 32K 에서 prefill latency 측정. log-log axis 로 latency vs sequence length 그려봐. 긴 sequence 에서는 세 구현 다 비슷한 slope (≈ 2 in log-log) 로 수렴해야 해 — 그 slope 가 kernel 무관한 algorithmic O(n²) 가 비치는 거야.