Self-attention vs cross-attention

There are two kinds of attention, used in different parts of the architecture:

Self-attention

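Q, K, and V all come from the same sequence. Each token attends to the tokens of the same input. Encoder-only models (BERT) use it in every layer, bidirectionally, and decoder-only models (Llama) use it as their only kind of attention, with a causal mask.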
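
As a rough illustration of the paragraph above, here is a minimal single-head sketch in NumPy. It is not any particular model's implementation: the names `self_attention`, `w_q`, `w_k`, `w_v`, and the `causal` flag are assumptions made for this example, and the weights are random rather than learned.

```python
import numpy as np


def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(x, w_q, w_k, w_v, causal=False):
    """Single-head self-attention over one sequence x of shape (seq_len, d_model)."""
    # Q, K and V are all projections of the same input x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (seq_len, seq_len)
    if causal:
        # Position i may only attend to positions j <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)                    # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out_bi, weights_bi = self_attention(x, w_q, w_k, w_v)                        # bidirectional
out_causal, weights_causal = self_attention(x, w_q, w_k, w_v, causal=True)   # causal
print(out_bi.shape)                 # (4, 8)
print(np.triu(weights_causal, 1))   # all zeros: no attention to future tokens
```

The only difference between the bidirectional and causal variants in this sketch is the mask applied to the scores before the softmax.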

Cross-attention

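Q comes from one sequence (the decoder), while K and V come from another (the encoder's output). The decoder's tokens "look at" the encoder's representation of the input. In encoder-decoder models (T5, Whisper), it connects the encoder's understanding of the source to the decoder's generation of the target.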

Architecture	self-attn	cross-attn
BERT (encoder-only)	Yes (bidirectional)	No
GPT / Llama (decoder-only)	Yes (causal)	No
T5 / Whisper (encoder-decoder)	Yes in the encoder, yes in the decoder (causal)	Yes (decoder Q attends to encoder K, V)

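In a translation task with a source sentence S and a target sentence T being generated, cross-attention lets the decoder, while generating the n-th target token, look back at the full encoder representation of S and work out which source tokens are relevant right now.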
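
The lookup described above can be sketched in a few lines of NumPy. Everything concrete here is an assumption for illustration: the source tokens are hypothetical, random vectors stand in for the encoder output and decoder states, and the projection matrices `w_q`, `w_k`, `w_v` are random instead of learned, so the printed weights carry no linguistic meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical source sentence S; random vectors stand in for the encoder output.
source_tokens = ["the", "cat", "sat", "down"]
encoder_out = rng.normal(size=(len(source_tokens), d_model))   # (len_S, d_model)

# Decoder states for the target positions produced so far (here n = 3).
decoder_x = rng.normal(size=(3, d_model))                      # (n, d_model)

# Projections (random here; learned in a real model).
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q = decoder_x @ w_q          # queries come from the decoder
k = encoder_out @ w_k        # keys come from the encoder
v = encoder_out @ w_v        # values come from the encoder

scores = q @ k.T / np.sqrt(d_model)                 # (n, len_S)
scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)      # softmax over source positions
context = weights @ v                               # (n, d_model)

# The last row shows how much the newest target position draws on each source token.
for token, w in zip(source_tokens, weights[-1]):
    print(f"{token:>5}: {w:.3f}")
```

Each row of `weights` is a distribution over all of S, and no causal mask is applied, because the whole source sentence is already available to the decoder.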

Code

Cross-attention in PyTorch (Python)

import torch.nn as nn


class CrossAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        # batch_first=True: inputs and outputs are (batch, seq_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_x, encoder_out):
        # Q from decoder, K/V from encoder
        # PyTorch's MultiheadAttention takes (query, key, value)
        # and returns (output, attention_weights); the weights are discarded here
        out, _ = self.attn(decoder_x, encoder_out, encoder_out)
        return out


# Inside an encoder-decoder block:
# x = self_attn(x)              # decoder self-attention (causal)
# x = cross_attn(x, enc_out)    # decoder looks at encoder output
# x = ffn(x)
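
The three commented steps above could be assembled into a runnable decoder block roughly as follows. This is a sketch under assumptions, not the layout of any specific model: the class name `DecoderBlock`, the `d_ff` parameter, the ReLU feed-forward, and the post-norm residual connections are choices made for this example.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Decoder block sketch: causal self-attention, cross-attention, feed-forward."""

    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # Assumption: post-norm residual layout; real models differ in the details.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out):
        tgt_len = x.size(1)
        # Boolean mask: True marks positions a query is NOT allowed to attend to.
        causal_mask = torch.triu(
            torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        sa, _ = self.self_attn(x, x, x, attn_mask=causal_mask)   # Q, K, V from decoder
        x = self.norm1(x + sa)
        ca, _ = self.cross_attn(x, enc_out, enc_out)             # Q decoder, K/V encoder
        x = self.norm2(x + ca)
        return self.norm3(x + self.ffn(x))


block = DecoderBlock(d_model=32, n_heads=4, d_ff=64)
enc_out = torch.randn(2, 7, 32)   # (batch, source_len, d_model)
x = torch.randn(2, 5, 32)         # (batch, target_len, d_model)
print(block(x, enc_out).shape)    # torch.Size([2, 5, 32])
```

The output keeps the decoder's sequence length (5), not the encoder's (7): cross-attention reads from the source but writes one vector per target position.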
