Hands-on: Custom Attention layer

Why build what Keras already ships?

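Keras ships with MultiHeadAttention, so writing your own is not about production; it's about understanding. Attention has a reputation for being mysterious, but the mechanism is a handful of matrix multiplications. Build it once by hand with nothing but keras.ops, and the formula that shows up in every Transformer paper turns into code you can read at a glance.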

The three pieces of a custom layer

Subclassing keras.layers.Layer means filling in three methods. __init__ stores the config (here, the projection width units). build(input_shape) creates the weights once the input shape is known: three projection matrices for query, key, and value, each created with add_weight. call(inputs) is the forward pass. Splitting __init__ from build is what lets Keras infer the weight shapes lazily, whatever input the layer ends up receiving.

Scaled dot-product attention, line by line

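The body of call is the famous formula itself. Project the input into Q, K, and V. matmul(q, transpose(k)) scores every query against every key; with batched input the transpose must swap only the last two axes of k, leaving the batch axis in place. Divide by sqrt(units): this scaling keeps the softmax from saturating as the dimension grows. Softmax the scores into attention weights, then use those weights to take a weighted sum of the values. Every op comes from keras.ops, so the same layer runs unchanged on TensorFlow, PyTorch, and JAX. Watching the weights of your own attention train under model.fit() is when it really feels like you built attention *with your own hands*.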
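The same steps can be sketched in plain NumPy, which makes the shapes and the effect of the scaling easy to inspect. This is an illustrative reference, not the Keras layer itself: the softmax and attention helpers, the random inputs, and the sizes (batch 2, sequence 5, 256 units) are all assumptions chosen for the demo, and the learned projections are skipped so q, k, v are fed in directly.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, scaled=True):
    # q, k, v: (batch, seq, units)
    units = q.shape[-1]
    # Swap only the last two axes of k, so scores is (batch, seq, seq).
    scores = q @ np.swapaxes(k, -1, -2)
    if scaled:
        scores = scores / np.sqrt(units)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.standard_normal((2, 5, 256))
k = rng.standard_normal((2, 5, 256))
v = rng.standard_normal((2, 5, 256))

out, w_scaled = attention(q, k, v, scaled=True)
_, w_raw = attention(q, k, v, scaled=False)

print(out.shape)       # (2, 5, 256)
print(w_scaled.shape)  # (2, 5, 5)
# Average of the largest weight per row: without scaling the softmax
# puts far more mass on a single key.
print(w_scaled.max(axis=-1).mean(), w_raw.max(axis=-1).mean())
```

With 256 units the raw dot products have a much wider spread than the scaled ones, so the unscaled softmax ends up close to one-hot. That is the saturation the division by sqrt(units) is there to prevent.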

Code

SimpleAttention: scaled dot-product attention implemented by hand · python

import keras


class SimpleAttention(keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        self.W_q = self.add_weight(
            shape=(input_shape[-1], self.units), name="query_weight"
        )
        self.W_k = self.add_weight(
            shape=(input_shape[-1], self.units), name="key_weight"
        )
        self.W_v = self.add_weight(
            shape=(input_shape[-1], self.units), name="value_weight"
        )

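    def call(self, inputs):
        # inputs: (batch, seq, features). Project into Q, K, and V.
        q = keras.ops.matmul(inputs, self.W_q)
        k = keras.ops.matmul(inputs, self.W_k)
        v = keras.ops.matmul(inputs, self.W_v)

        # Scaled dot-product attention
        scale = keras.ops.sqrt(
            keras.ops.cast(self.units, dtype="float32")
        )
        # Swap only the last two axes of k; a bare transpose() would
        # reverse all axes and move the batch axis too.
        k_t = keras.ops.transpose(k, axes=(0, 2, 1))
        scores = keras.ops.matmul(q, k_t) / scale  # (batch, seq, seq)
        weights = keras.ops.nn.softmax(scores, axis=-1)
        return keras.ops.matmul(weights, v)  # (batch, seq, units)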

Exercise

Extend SimpleAttention into multi-head self-attention using only keras.ops (as a subclassed Layer): Q/K/V projection → reshape into num_heads heads → scaled dot-product attention per head → concat, then a final output projection. Copy the same weights into both layers and check that your output matches keras.layers.MultiHeadAttention within float tolerance.

Hint

Most bugs live in two places: the reshape from (batch, seq, units) to (batch, num_heads, seq, head_dim), and the transpose back after attention. Print the shape at every step and compare against the MultiHeadAttention output one head at a time.
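A small NumPy sketch of just the shape juggling, assuming the (batch, num_heads, seq, head_dim) layout named in the hint and made-up sizes. It is not a solution to the exercise. It only shows the split and merge round trip, plus the classic mistake of reshaping straight to the target shape, which gives the right shape with scrambled contents.

```python
import numpy as np

batch, seq, num_heads, head_dim = 2, 3, 4, 5
units = num_heads * head_dim

# Distinct integers make it easy to see where each element ends up.
x = np.arange(batch * seq * units).reshape(batch, seq, units)

# Split: (batch, seq, units) -> (batch, seq, num_heads, head_dim),
# then move the head axis in front of seq.
heads = x.reshape(batch, seq, num_heads, head_dim).transpose(0, 2, 1, 3)

# Merge: undo the transpose first, then flatten the heads back into units.
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq, units)

# Common bug: reshaping directly to the target shape, with no transpose.
wrong = x.reshape(batch, num_heads, seq, head_dim)

print(heads.shape)                   # (2, 4, 3, 5)
print(np.array_equal(merged, x))     # True
print(np.array_equal(wrong, heads))  # False
```

In heads, head h of token s holds x[b, s, h * head_dim : (h + 1) * head_dim]. The direct reshape mixes features from different tokens into the same head, and a shape check alone will never catch it.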