YaRN and NTK-aware scaling: how Llama 3 got to 128K

YaRN ("Yet another RoPE extensioN", Peng et al. 2023) is arguably the most successful technique in production for extending RoPE-based models to long contexts. Llama 3 and many community fine-tunes use YaRN or a YaRN-shaped recipe.

The insight YaRN exploits

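RoPE encodes position with sinusoids at many frequencies. The high-frequency dimensions (short wavelength) complete a rotation quickly; they are the ones that tell nearby positions apart. The low-frequency dimensions rotate slowly; they tell distant positions apart. Many of them never finish a full turn inside the training window, so they are the ones that go out of distribution when the context is extended.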
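To make the fast/slow split concrete, here is a minimal sketch that lists the wavelength of each RoPE dimension pair and counts how many pairs never complete a turn inside a training window. The head size of 128, base of 10000 and window of 4096 are illustrative assumptions, not values taken from any particular model.

```python
import math

def rope_wavelengths(d_head, base=10000.0):
    """Wavelength (in tokens) of each RoPE dimension pair.

    Pair i rotates by theta_i = base ** (-2i / d_head) radians per token,
    so it completes one full turn every 2*pi / theta_i tokens.
    """
    return [2 * math.pi * base ** (2 * i / d_head) for i in range(d_head // 2)]

def count_unfinished_pairs(d_head, train_len, base=10000.0):
    """Number of pairs whose wavelength is longer than the training window."""
    return sum(1 for w in rope_wavelengths(d_head, base) if w > train_len)

# Illustrative values only (assumed, not from a specific model).
wavelengths = rope_wavelengths(d_head=128)
print(f"fastest pair: {wavelengths[0]:.2f} tokens per turn")
print(f"slowest pair: {wavelengths[-1]:.0f} tokens per turn")
print("pairs that never complete a turn in 4096 tokens:",
      count_unfinished_pairs(128, 4096))
```

The pairs counted on the last line are the ones that see rotation angles they were never trained on once positions go past the original window.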

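Naive Position Interpolation rescales every frequency uniformly. It works, but it compresses the local patterns the model has already learned. NTK-aware scaling spreads the rescaling unevenly: the lowest frequencies are slowed down by roughly the full extension factor, while the highest frequencies are left almost untouched. YaRN makes that split explicit per dimension and adds a temperature-scaled attention fix on top, which improves results further.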
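The difference between the two rescalings is easy to see numerically. The sketch below compares how much each method slows down the fastest, a middle, and the slowest dimension pair. The function names and the values `d_head = 128`, `s = 8` are assumptions chosen for illustration.

```python
def base_inv_freq(d_head, base=10000.0):
    """Standard RoPE inverse frequencies, one per dimension pair."""
    return [base ** (-2 * i / d_head) for i in range(d_head // 2)]

def pi_inv_freq(d_head, s, base=10000.0):
    """Position Interpolation: every frequency is divided by s."""
    return [f / s for f in base_inv_freq(d_head, base)]

def ntk_inv_freq(d_head, s, base=10000.0):
    """NTK-aware: enlarge the base so the slowdown grows with the pair index."""
    new_base = base * s ** (d_head / (d_head - 2))
    return base_inv_freq(d_head, new_base)

d_head, s = 128, 8.0  # illustrative values
original = base_inv_freq(d_head)
for name, scaled in [("PI", pi_inv_freq(d_head, s)),
                     ("NTK-aware", ntk_inv_freq(d_head, s))]:
    # Slowdown factor per pair: original frequency / rescaled frequency.
    slowdown = [o / n for o, n in zip(original, scaled)]
    print(f"{name:10s} fastest x{slowdown[0]:.2f}, "
          f"middle x{slowdown[d_head // 4]:.2f}, slowest x{slowdown[-1]:.2f}")
```

Position Interpolation slows every pair by the same factor, whereas the NTK-aware slowdown for pair `i` works out to `s ** (2i / (d_head - 2))`, rising smoothly from 1 at the fastest pair to `s` at the slowest.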

The receipts

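The YaRN paper extended LLaMA 2 models from 4K to 128K context and reported about 10x fewer fine-tuning tokens and 2.5x fewer training steps than previous interpolation methods. Llama 3 used a YaRN-style schedule to extend its 8K-trained base to the 128K context listed on the release model card.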

Code

NTK-aware scaling (sketch, python)

import torch

# Standard RoPE inverse frequencies:
# inv_freq[i] = 1 / (base^(2i / d))

# NTK-aware: enlarge the base as a function of the context extension
def ntk_aware_inv_freq(d_head, scaling_factor, base=10000.0):
    # To extend context by `scaling_factor`, raise the base so that the
    # lowest-frequency pair is slowed by exactly `scaling_factor` while
    # the highest-frequency pair (exponent 0) is left unchanged.
    new_base = base * (scaling_factor ** (d_head / (d_head - 2)))
    exponents = torch.arange(0, d_head, 2).float() / d_head
    inv_freq = 1.0 / (new_base ** exponents)
    return inv_freq

# YaRN adds:
#  - per-dimension interpolation (the "ramp")
#  - attention temperature: sqrt(1/t) = 0.1 ln(s) + 1, where s is the
#    scaling factor. In practice q and k are each multiplied by this
#    value, so the logits are divided by t, to maintain entropy.
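The two YaRN additions can be sketched in a few lines of plain Python. This follows the ramp as described in the YaRN paper, which blends on the number of rotations a pair completes inside the training window. The thresholds `alpha = 1` and `beta = 32` are the values the paper suggests for LLaMA-family models and should be treated as assumptions here. The function names are hypothetical, and this is not the reference implementation.

```python
import math

def yarn_inv_freq(d_head, s, train_len, base=10000.0, alpha=1.0, beta=32.0):
    """Blend interpolated and original frequencies per dimension pair.

    Pairs completing more than `beta` turns in the training window are
    left alone; pairs completing fewer than `alpha` turns are fully
    interpolated (divided by s); pairs in between are mixed linearly.
    """
    inv_freq = []
    for i in range(d_head // 2):
        theta = base ** (-2 * i / d_head)
        rotations = train_len * theta / (2 * math.pi)
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        inv_freq.append((1 - gamma) * theta / s + gamma * theta)
    return inv_freq

def yarn_attention_factor(s):
    """Factor applied to both q and k: sqrt(1/t) = 0.1 * ln(s) + 1."""
    return 1.0 if s <= 1 else 0.1 * math.log(s) + 1.0

# Illustrative values: 4K training window extended 32x.
freqs = yarn_inv_freq(d_head=128, s=32.0, train_len=4096)
print("fastest pair kept at:", freqs[0])
print("slowest pair divided by s:", freqs[-1] * 32.0)
print("q/k scale:", round(yarn_attention_factor(32.0), 4))
```

Because the factor multiplies both q and k, the attention logits end up scaled by its square, which is how the temperature reaches the softmax without touching the attention kernel.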

Exercise

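Find an open-weight model with a documented context-extension recipe (e.g. NousResearch/Yarn-Llama-2-7b-128k). Read the README and config. Work out which YaRN parameters it used and why. Then, on a long document, measure perplexity at the training-window length and at the extended length. How much does perplexity change?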
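For the measurement step, a small helper like the following may be a useful starting point. It assumes you already have one negative log-likelihood (in nats) per token from whatever evaluation loop you use; the function names are hypothetical, and how you obtain the per-token values is left to your setup.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood), NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def perplexity_inside_and_beyond(token_nlls, train_len):
    """Split per-token NLLs at the training-window length.

    Returns (perplexity over the first `train_len` tokens,
             perplexity over the remaining tokens or None if there are none).
    """
    inside = token_nlls[:train_len]
    beyond = token_nlls[train_len:]
    return perplexity(inside), (perplexity(beyond) if beyond else None)

# Toy input: 4 tokens inside the window, 2 beyond it.
toy_nlls = [math.log(2)] * 4 + [math.log(4)] * 2
print(perplexity_inside_and_beyond(toy_nlls, train_len=4))
```

Reporting the two numbers separately makes it easier to see whether quality degrades specifically past the original window rather than averaging the effect away.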
