Gemma: Sliding Window Attention as a Dense Optimization

Gemma's defining trick

Google's Gemma family is the cleanest demonstration that a model can stay dense and still squeeze efficiency out of the attention computation. Gemma 2 (June 2024) introduced interleaved local/global sliding-window attention: layers alternate between cheap local attention (a 4K window) and full global attention. Gemma 3 sharpened this to a 5:1 ratio (one global layer for every five local layers), alongside a 262K vocabulary and a 128K context.

Why interleaving matters

Full attention costs O(n²) in the sequence length n, because every token attends to every other token. At a 128K context window that quickly becomes painful. Local sliding-window attention cuts the per-layer cost to O(n × w), where w is the window size (for example, 4K). With interleaving, the model still has full attention *somewhere*, just not in every layer.
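The O(n²) versus O(n × w) comparison above can be made concrete by counting how many (query, key) pairs a causal layer actually scores. This is a minimal sketch, not a measurement of any real Gemma implementation: the function names and the small example sizes are illustrative assumptions, and it counts attention-score pairs only, ignoring projections and the FFN.

```python
def causal_full_pairs(n):
    """Pairs scored by one causal full-attention layer.

    Token i (0-indexed) attends to i + 1 positions, so the total is n(n+1)/2.
    """
    return n * (n + 1) // 2


def causal_window_pairs(n, w):
    """Pairs scored by one causal sliding-window layer with window size w.

    Token i attends to min(i + 1, w) positions.
    """
    if n <= w:
        return causal_full_pairs(n)
    # The first w tokens see a growing prefix; the remaining n - w see exactly w.
    return w * (w + 1) // 2 + (n - w) * w


# Illustrative sizes only (hypothetical, chosen to keep the numbers small).
n, w = 8192, 1024
full = causal_full_pairs(n)
local = causal_window_pairs(n, w)
print(f"full: {full}, local: {local}, full/local: {full / local:.2f}")
```

Because the local count grows linearly in n while the full count grows quadratically, the gap between the two widens as the context gets longer.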

The intuition

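Most tokens really only need to see the last few thousand tokens to do their job. For the minority of tokens that need long-range dependencies, the global-attention layers pull in information from the whole context. You get most of the cost savings of pure local attention and most of the recall of full attention. It is a principled compromise, not a hack.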

Gemma family snapshot

Gemma 2 (2B, 9B, 27B): introduced the local/global pattern.
Gemma 3 (1B, 4B, 12B, 27B): 5:1 ratio, 128K context, 262K vocab. The 27B-IT model is a strong dense workhorse.
Gemma 4 (2025): a 31B dense model plus a 26B-A4B MoE variant, Google's first MoE Gemma. The 31B dense model is still the easy default.

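What it teaches about dense

"Dense" does not mean "uniform." Gemma's interleaving lives entirely inside the dense paradigm: every token still activates every parameter, the FFN is still dense, and there are no experts. The only thing that changes is which tokens each attention layer can see. It is a useful counterexample whenever you start to think dense is boring.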

Code

Local vs. global attention: a one-line difference (Python)

# Global attention: token at position i can attend to positions [0..i]
def global_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Local sliding window: token at i attends only to [max(0, i-w+1)..i]
def local_mask(n, w=4096):
    return [[1 if (i - w + 1) <= j <= i else 0 for j in range(n)] for i in range(n)]

# Gemma interleaves: ~5 local layers, then 1 global layer, repeat.
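The interleaving described in the last comment can be sketched as a per-layer schedule. This is an illustration under stated assumptions rather than Gemma's actual configuration code: `layer_types` and `local_per_global` are hypothetical names, and placing the global layer at the end of each group is an assumed ordering.

```python
def layer_types(num_layers, local_per_global=5):
    """Return a 'local'/'global' label for each layer index.

    Assumption: each group is `local_per_global` local layers followed by
    one global layer, repeated for the depth of the model.
    """
    period = local_per_global + 1
    return [
        "global" if (i + 1) % period == 0 else "local"
        for i in range(num_layers)
    ]


# 5:1 pattern (Gemma 3 style ratio) over 12 hypothetical layers.
print(layer_types(12))

# 1:1 alternation (Gemma 2 style) over 4 hypothetical layers.
print(layer_types(4, local_per_global=1))
```

Each label would then select which mask a layer uses: a sliding-window mask for "local" layers and a plain causal mask for "global" layers.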

Exercise

Read the sliding-window attention section of the Gemma 3 technical report. Then work out by hand: for a 128K-token sequence, a 5:1 local:global ratio, and a 4K local window, how many times more attention compute does each global layer use than each local layer? This is the intuition you need for reading efficient-attention papers.