Attention and MultiHeadAttention

The mechanism that swallowed machine learning

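Attention lets a model focus selectively on different parts of the input as it produces each output. Instead of compressing the whole sequence into a single fixed-size vector (the bottleneck of vanilla seq2seq), attention lets the decoder look back at every encoder state and decide which ones are most relevant at each step.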
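To make the "look back at every encoder state" idea concrete, here is a minimal sketch in plain Python. It assumes the scaled dot-product form of attention (the variant used in Transformers; the text above does not commit to a particular scoring function). The names `attend`, `encoder_states`, and `decoder_query` are hypothetical, and the toy vectors are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of states.

    ASSUMPTION: dot-product scoring with 1/sqrt(d) scaling.
    """
    scale = math.sqrt(len(query))
    # One relevance score per key: how well does it match the query?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The context vector is the weighted average of the values.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

# Three hypothetical encoder states (toy 2-d vectors) and one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_query = [2.0, 0.0]

# Keys and values are both the encoder states here.
context, weights = attend(decoder_query, encoder_states, encoder_states)
```

The weights form a probability distribution over the encoder states, so the states that align with the query (the first and third here) contribute more to `context` than the second one. Nothing is squeezed through a single fixed-size summary of the input.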

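Self-attention goes a step further: every position in the sequence attends to every position (itself included) at the same time. This is the core innovation of the Transformer architecture. Multi-Head Attention runs attention several times in parallel, so each head can learn a different kind of relationship (syntactic, semantic, positional).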
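The sketch below illustrates only the shape of that idea: each head runs self-attention on its own view of the features, and the results are concatenated. It is a simplification under stated assumptions. A real multi-head layer uses learned query/key/value projections per head plus a final output projection, whereas this toy version just slices the feature vector. The function names and the toy sequence are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Every position attends to every position (itself included)."""
    dim = len(seq[0])
    scale = math.sqrt(dim)
    # weights[i][j]: how much position i attends to position j.
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in seq])
        for q in seq
    ]
    outputs = [
        [sum(w * v[i] for w, v in zip(row, seq)) for i in range(dim)]
        for row in weights
    ]
    return outputs, weights

def multi_head_self_attention(seq, num_heads):
    """Run self-attention once per head, then concatenate the heads."""
    head_dim = len(seq[0]) // num_heads
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        # ASSUMPTION: plain slicing stands in for learned per-head projections.
        piece = [vec[h * head_dim:(h + 1) * head_dim] for vec in seq]
        out, w = self_attention(piece)
        head_outputs.append(out)
        head_weights.append(w)
    # Concatenate the heads back to the original feature width.
    merged = [
        [x for out in head_outputs for x in out[pos]]
        for pos in range(len(seq))
    ]
    return merged, head_weights

# Toy sequence: 3 positions, 4 features, split across 2 heads.
sequence = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
merged, head_weights = multi_head_self_attention(sequence, num_heads=2)
```

Each head produces its own 3×3 weight matrix, and the two matrices differ because each head sees different features. That is the property that lets separate heads specialize once the projections are learned.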

TensorFlow provides tf.keras.layers.MultiHeadAttention as a first-class layer. The signature is straightforward: pass in query, key, and value. For self-attention, all three are the same tensor; for cross-attention (the decoder attending to the encoder), query comes from the decoder and key/value come from the encoder.

Code

Using MultiHeadAttention·python

import tensorflow as tf
from tensorflow.keras import layers

mha = layers.MultiHeadAttention(num_heads=8, key_dim=64)

# Self-attention: query = key = value
batch_size, seq_len, d_model = 32, 100, 512
x = tf.random.normal((batch_size, seq_len, d_model))

output, attention_weights = mha(
    query=x, value=x, key=x,
    return_attention_scores=True,
)
# output:            (32, 100, 512)
# attention_weights: (32, num_heads, 100, 100)

# Cross-attention: decoder attends to encoder output
decoder_seq = tf.random.normal((32, 20, 512))
encoder_seq = tf.random.normal((32, 100, 512))
cross_out = mha(query=decoder_seq, value=encoder_seq, key=encoder_seq)
# cross_out: (32, 20, 512), one output vector per decoder (query) position

Transformer encoder block — the core unit·python

import tensorflow as tf
from tensorflow.keras import layers

class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, ff_dim, dropout_rate=0.1):
        super().__init__()
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads
        )
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation='relu'),
            layers.Dense(d_model),
        ])
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, x, training=False):
        # Self-attention with residual + layer norm
        attn = self.attention(query=x, value=x, key=x, training=training)
        attn = self.dropout1(attn, training=training)
        out1 = self.norm1(x + attn)

        # Feedforward with residual + layer norm
        ffn = self.ffn(out1)
        ffn = self.dropout2(ffn, training=training)
        return self.norm2(out1 + ffn)
