Scaling and training stability

Transformers are predictable at scale

Transformers now scale to hundreds of billions of parameters with well-understood dynamics. The well-known failure modes are documented and have known mitigations: loss spikes, attention entropy collapse, attention sinks, learning-rate cliffs, gradient norm divergence. The recipe for stably training a 70B Transformer is mature. The recipe for stably training a 7B Transformer is extremely mature.
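Two of the failure modes listed above, loss spikes and gradient norm divergence, can be caught with simple running statistics. The following is a minimal sketch, not a production recipe: the class name `SpikeMonitor`, the thresholds, and the window size are all illustrative assumptions, and it assumes a positive loss such as cross-entropy.

```python
import math
import statistics
from collections import deque


class SpikeMonitor:
    """Flags loss spikes and gradient-norm blowups against a running median.

    Hypothetical helper for illustration; thresholds are assumptions,
    not values taken from any published recipe.
    """

    def __init__(self, window=100, loss_factor=2.0, grad_factor=10.0, min_history=10):
        self.losses = deque(maxlen=window)
        self.grad_norms = deque(maxlen=window)
        self.loss_factor = loss_factor
        self.grad_factor = grad_factor
        self.min_history = min_history

    def update(self, loss, grad_norm):
        """Record one training step and return a list of alert names."""
        if not (math.isfinite(loss) and math.isfinite(grad_norm)):
            return ["non_finite"]

        alerts = []
        # Only compare once there is enough history for a stable baseline.
        if len(self.losses) >= self.min_history:
            if loss > self.loss_factor * statistics.median(self.losses):
                alerts.append("loss_spike")
            if grad_norm > self.grad_factor * statistics.median(self.grad_norms):
                alerts.append("grad_norm_divergence")

        # Keep anomalous steps out of the baseline so one spike
        # does not mask the next.
        if not alerts:
            self.losses.append(loss)
            self.grad_norms.append(grad_norm)
        return alerts
```

In a training loop you would call `monitor.update(loss_value, grad_norm_value)` once per step and decide what to do on an alert (skip the batch, roll back to a checkpoint, lower the LR). Which response is appropriate depends on the run.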

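This predictability has real economic value. Frontier labs spend tens of millions of dollars on a single training run. The cost of a failed run is not just the wasted FLOPs; it is calendar time that cannot be recovered and team time redirected from other work. Transformers are predictable enough that a training run can be budgeted with high confidence that it will produce something.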

SSM training has a narrower window

SSMs show a narrower range of workable learning rates. A 2025 study found that on recall tasks, Mamba performance jumps from near zero to near optimal inside a tight LR window; pick wrong and the model is unusable, with no graceful degradation. The published recipes from NVIDIA (Nemotron) and AI21 (Jamba) help significantly, but "Mamba is finicky" is still true in a way that "Transformers are finicky" no longer is.

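For research labs and well-funded teams, this fragility is a tax: they can run more sweeps to find the right LR. For startups and product teams working with small training budgets, it is a real risk. A single LR mistake on a $1M Mamba run hurts more than the same mistake on a $1M Transformer run, because there is less institutional knowledge to fall back on.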

Initialization, schedule, regularization

The same fragility applies across the board: initialization choices matter more, schedule shape matters more, and the gains and losses from regularization are more variable. The grace period for hyperparameter mistakes is shorter. None of this is fatal (Mamba and other alternatives clearly do train successfully), but the Transformer's predictability is itself a competitive advantage, one that is hard to fully appreciate until you have felt the difference yourself.
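To make "schedule shape" concrete, here is a minimal sketch of one common shape: linear warmup followed by cosine decay. The text does not prescribe this schedule; the function name `lr_at_step` and its parameters (`warmup_steps`, `min_lr_ratio`) are assumptions chosen for illustration. These are the knobs that tend to need more care when the usable LR window is tight.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Illustrative only; the defaults are assumptions, not a recommended recipe.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)

    # Cosine goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine
```

Varying `warmup_steps` and `min_lr_ratio` while holding `peak_lr` fixed is one cheap way to check how sensitive a given architecture is to the schedule itself rather than to the peak value.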

Exercise

Train a small Mamba (≤30M parameters) on a small dataset, sweeping the learning rate over log-spaced values from 1e-5 to 1e-2. Then do the same with a small Transformer of similar parameter count. For both, plot final validation loss against learning rate. You should see a sharper minimum in the Mamba curve (a wider range of LRs that either diverge or converge to bad local optima) and a gentler one in the Transformer curve. That sharpness is the training-stability gap.
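The sweep grid and a rough "sharpness" number can be sketched as follows. This is a scaffold under stated assumptions: `train_fn` is a hypothetical callable you supply that trains one model at a given LR and returns its final validation loss, and `usable_width_decades` is one possible way to summarize the curve, not a standard metric.

```python
import math


def log_spaced(lo, hi, n):
    """Return n learning rates from lo to hi, evenly spaced in log10."""
    if n < 2:
        raise ValueError("need at least two points")
    lo_exp, hi_exp = math.log10(lo), math.log10(hi)
    step = (hi_exp - lo_exp) / (n - 1)
    return [10 ** (lo_exp + i * step) for i in range(n)]


def sweep(train_fn, lrs):
    """Run train_fn(lr) -> final validation loss for each LR.

    train_fn is a placeholder for your own training routine.
    """
    return [train_fn(lr) for lr in lrs]


def usable_width_decades(lrs, losses, tolerance=0.1):
    """Span, in decades of LR, between the smallest and largest LR whose
    final loss is within `tolerance` (relative) of the best finite loss.

    Diverged runs (inf/nan loss) are ignored. A smaller value means a
    sharper minimum. Simplification: the span is not checked for gaps.
    """
    finite = [(lr, loss) for lr, loss in zip(lrs, losses) if math.isfinite(loss)]
    if not finite:
        return 0.0
    best = min(loss for _, loss in finite)
    good = [lr for lr, loss in finite if loss <= best * (1 + tolerance)]
    return math.log10(max(good) / min(good))
```

Run `sweep` once per architecture on the same grid, for example `log_spaced(1e-5, 1e-2, 13)`, then compare the two `usable_width_decades` values alongside the plots. Several seeds per LR give a more trustworthy picture than a single run.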
