Feed-forward network: position-wise computation

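After attention mixes information across positions, every token passes through a position-wise feed-forward network (FFN): a small MLP applied independently at each position. The FFN expands d_model to d_ff (the "intermediate" dimension, typically 4 × d_model), applies a nonlinear activation, and projects back down to d_model.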

FFN(x) = Activation(x · W₁ + b₁) · W₂ + b₂

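The expand-then-contract pattern is the point. The intermediate representation is four times as wide as d_model, which gives the FFN room to compute nonlinear functions of the features the attention layer extracted. The down projection is itself a learned reduction that compresses the result back to d_model.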
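To make "applied independently at each position" concrete, here is a minimal dependency-free sketch of the formula above. The tiny sizes (d_model=4, d_ff=16), the random weights, the tanh approximation of GELU, and the helper names `linear`, `ffn`, and `rand_matrix` are all illustrative assumptions, not part of any library. It changes one token and checks that the outputs at the other positions do not move.

```python
import math
import random

def gelu(v):
    # tanh approximation of GELU (an assumption; exact GELU uses the Gaussian CDF)
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def linear(vec, weight, bias):
    # vec: [d_in], weight: [d_in][d_out], bias: [d_out] -> [d_out]
    return [
        sum(vec[i] * weight[i][j] for i in range(len(vec))) + bias[j]
        for j in range(len(bias))
    ]

def ffn(vec, w1, b1, w2, b2):
    hidden = [gelu(h) for h in linear(vec, w1, b1)]  # expand to d_ff, then activate
    return linear(hidden, w2, b2)                    # contract back to d_model

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff, seq_len = 4, 16, 3
w1, b1 = rand_matrix(d_model, d_ff, rng), [0.0] * d_ff
w2, b2 = rand_matrix(d_ff, d_model, rng), [0.0] * d_model
tokens = rand_matrix(seq_len, d_model, rng)  # one row per position

# The same weights are applied to every position, one token at a time.
outputs = [ffn(t, w1, b1, w2, b2) for t in tokens]

# Perturb only the token at position 1 and recompute.
edited = [list(t) for t in tokens]
edited[1] = [v + 1.0 for v in edited[1]]
edited_outputs = [ffn(t, w1, b1, w2, b2) for t in edited]

print(len(linear(tokens[0], w1, b1)), len(outputs[0]))  # hidden width, output width
print(edited_outputs[0] == outputs[0], edited_outputs[2] == outputs[2])
```

Positions 0 and 2 are unaffected because no term in the FFN reads another position; any cross-position mixing has to come from attention.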

How much of the parameter budget lives here

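A typical decoder block with d_model=4096 and d_ff=14336 (Llama 3 8B):

Attention: 4 × d_model² ≈ 67M parameters per layer with full multi-head attention. With Llama 3 8B's grouped-query attention (32 Q heads, 8 KV heads) the K and V projections shrink, and the total is about 42M.
FFN (SwiGLU, 3 matrices): 3 × d_model × d_ff ≈ 176M parameters per layer.

The FFN is roughly 70 to 80% of the per-layer parameters: about 72% against the 67M attention figure and about 81% against the 42M one. The FFN majority holds at nearly every model size; even the classic two-matrix FFN with d_ff = 4 × d_model comes to about two-thirds. When people say "MoE moves compute around the FFN," this is what they mean: the FFN is most of the model.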
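The per-layer shares can be checked with a few lines of arithmetic. This is a rough sketch that counts weight matrices only (biases and norms ignored); the helper names and the list of classic-style sizes are illustrative assumptions, not configurations taken from a specific model card.

```python
def attention_params(d_model, n_q_heads, n_kv_heads):
    d_head = d_model // n_q_heads
    qkv = (n_q_heads + 2 * n_kv_heads) * d_model * d_head  # Q, K, V projections
    out = d_model * d_model                                # output projection
    return qkv + out

def ffn_params(d_model, d_ff, n_matrices):
    # n_matrices = 2 for a plain GELU FFN, 3 for SwiGLU
    return n_matrices * d_model * d_ff

def ffn_share(d_model, d_ff, n_q_heads, n_kv_heads, n_matrices):
    attn = attention_params(d_model, n_q_heads, n_kv_heads)
    ffn = ffn_params(d_model, d_ff, n_matrices)
    return ffn / (attn + ffn)

# Classic setup: d_ff = 4 * d_model, two FFN matrices, full multi-head attention.
# The share is 8 d^2 / (4 d^2 + 8 d^2) = 2/3 regardless of d_model.
for d_model, heads in [(768, 12), (1024, 16), (4096, 32)]:
    share = ffn_share(d_model, 4 * d_model, heads, heads, n_matrices=2)
    print(f"d_model={d_model:5d}  FFN share {share:.1%}")

# Llama-3-8B-style block from the text: SwiGLU, with and without grouped-query attention.
print(f"GQA (8 KV heads):  {ffn_share(4096, 14336, 32, 8, 3):.1%}")
print(f"MHA (32 KV heads): {ffn_share(4096, 14336, 32, 32, 3):.1%}")
```

Grouped-query attention shrinks only the attention side, so it pushes the FFN share up rather than down.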

Code

Standard FFN with GELU·python

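import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand to d_ff, apply GELU, project back to d_model."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_ff)   # up projection (W₁, b₁)
        self.W2 = nn.Linear(d_ff, d_model)   # down projection (W₂, b₂)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (..., d_model). nn.Linear acts on the last dim, so each position is handled independently.
        return self.W2(self.act(self.W1(x)))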
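The parameter counts in this lesson assume SwiGLU, which uses three matrices rather than the two in the GELU module. The sketch below assumes the common gated form, down(SiLU(gate(x)) * up(x)) with no biases, and is written without dependencies so the arithmetic stays visible. The names `w_gate`, `w_up`, `w_down` and the tiny sizes are illustrative, not taken from a specific codebase.

```python
import math
import random

def silu(v):
    # SiLU (swish): v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def matvec(vec, weight):
    # vec: [d_in], weight: [d_in][d_out] -> [d_out]; no bias term
    return [
        sum(vec[i] * weight[i][j] for i in range(len(vec)))
        for j in range(len(weight[0]))
    ]

def swiglu_ffn(vec, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(vec, w_gate)]  # gating branch, width d_ff
    up = matvec(vec, w_up)                         # linear branch, width d_ff
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise gate
    return matvec(hidden, w_down)                  # back to d_model

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff = 4, 14
w_gate = rand_matrix(d_model, d_ff, rng)
w_up = rand_matrix(d_model, d_ff, rng)
w_down = rand_matrix(d_ff, d_model, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
out = swiglu_ffn(x, w_gate, w_up, w_down)

n_params = sum(len(w) * len(w[0]) for w in (w_gate, w_up, w_down))
print(len(out), n_params, 3 * d_model * d_ff)  # 4 168 168
```

The extra matrix is why SwiGLU models often pick a d_ff below 4 × d_model: three matrices at a smaller width cost about the same as two at the full width.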

Parameter math for one block·python

def block_params(d_model, d_ff, n_kv_heads=None, n_q_heads=None):
    # Weight-matrix counts only; biases are ignored.
    n_q_heads = n_q_heads or 32
    n_kv_heads = n_kv_heads or n_q_heads   # default: full multi-head attention
    d_head = d_model // n_q_heads
    # Q uses n_q_heads heads, K and V use n_kv_heads each, plus the output projection.
    attn = (n_q_heads + 2 * n_kv_heads) * d_model * d_head + d_model * d_model
    ffn  = 3 * d_model * d_ff      # SwiGLU has 3 matrices
    norm = 2 * d_model              # 2 RMSNorms per block
    return attn, ffn, norm

a, f, n = block_params(4096, 14336, n_kv_heads=8, n_q_heads=32)
print(f"attn {a/1e6:.0f}M  ffn {f/1e6:.0f}M  norm {n}  per block")
# attn 42M  ffn 176M  norm 8192  per block

Exercise

Compute the parameter breakdown (attention vs FFN vs norm) for the following models: GPT-2 (d=768, d_ff=3072, 12 heads, 12 layers); Llama 3 8B (d=4096, d_ff=14336, 32 Q heads, 8 KV heads, 32 layers); Mixtral 8x22B (d=6144, d_ff≈16384 per expert, 48 Q heads, 8 KV heads, 56 layers). What percentage of the parameters does each component account for?
