Vision Transformer and multimodal models

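The Vision Transformer (ViT, Dosovitskiy et al., 2020) treats an image as a sequence of patches and processes it with a standard Transformer encoder. The breakthrough was not a new architecture; it was the demonstration that the same machinery works on images.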

Patching pipeline

Reshape a 224×224 image into a 14×14 grid (196 patches) of 16×16-pixel patches.
Flatten each patch and apply a linear projection to d_model.
Prepend a learnable [CLS] token and add positional embeddings.
Pass the sequence through the Transformer encoder.
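The steps above can be traced shape by shape. The following is a minimal sketch, assuming PyTorch and 3-channel input; `patchify` is a hypothetical helper written for this illustration, and zero-initialized `cls` and `pos` tensors stand in for learned parameters.

```python
import torch
import torch.nn as nn

def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened patches (B, N, C*patch*patch)."""
    b, c, h, w = images.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    x = images.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)              # (B, H/p, W/p, C, p, p)
    return x.reshape(b, (h // patch) * (w // patch), c * patch * patch)

torch.manual_seed(0)
d_model = 768
images = torch.randn(2, 3, 224, 224)

patches = patchify(images)                       # step 1: 14*14 = 196 patches, each 3*16*16 values
proj = nn.Linear(patches.shape[-1], d_model)
tokens = proj(patches)                           # step 2: linear projection to d_model

cls = nn.Parameter(torch.zeros(1, 1, d_model))   # step 3: learnable [CLS] token
pos = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, d_model))
seq = torch.cat([cls.expand(tokens.shape[0], -1, -1), tokens], dim=1) + pos

print(patches.shape, seq.shape)                  # seq is what the encoder (step 4) consumes
```

With a 224×224 input and 16-pixel patches, the encoder sees 197 tokens per image: 196 patches plus [CLS].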

That is the entire architecture of the original paper. ViT-H/14 (632M params) reached 88.5% ImageNet top-1. DINOv2 trains ViTs with self-supervised learning, and ViT-based visual features of this kind are reused in several multimodal LLMs (Llama 3.2 vision, many open chat-vision models).
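The 632M figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the ViT-Huge configuration as I recall it from the paper (32 layers, d_model 1280, MLP width 4×d_model), a 224×224 input, and a 1000-class head; `vit_param_count` is a hypothetical helper, and the exact total depends on which pieces (head, biases) are counted.

```python
def vit_param_count(d_model, n_layers, patch, img_size=224,
                    n_classes=1000, mlp_ratio=4):
    """Rough parameter count for a plain ViT encoder with a linear head."""
    d = d_model
    attn = 4 * d * d + 4 * d                         # q, k, v, out projections + biases
    mlp = 2 * d * (mlp_ratio * d) + mlp_ratio * d + d  # two linear layers + biases
    norms = 2 * 2 * d                                # two LayerNorms (scale + shift)
    block = attn + mlp + norms

    n_patches = (img_size // patch) ** 2
    embed = 3 * patch * patch * d + d                # patch projection
    extras = d + (n_patches + 1) * d + 2 * d         # [CLS], positions, final LayerNorm
    head = d * n_classes + n_classes
    return n_layers * block + embed + extras + head

vit_h14 = vit_param_count(d_model=1280, n_layers=32, patch=14)
vit_b16 = vit_param_count(d_model=768, n_layers=12, patch=16)
print(f"ViT-H/14: {vit_h14 / 1e6:.0f}M, ViT-B/16: {vit_b16 / 1e6:.0f}M")
```

Almost all of the parameters sit in the encoder blocks (about 12·d_model² each); the patch projection and positional embeddings are a rounding error by comparison.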

Multimodal LLM

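Modern multimodal models pass image patches through a vision encoder, project the result into the LLM's embedding space, and feed those "visual tokens" into the same decoder-only stack that processes text. The architecture does not change; what changes is the token alphabet of the input.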

Llama 3.2-vision (11B / 90B): adds a vision encoder that produces visual tokens consumed by the same Llama 3 backbone.
GPT-4o, Gemini 2.5: natively multimodal; text, images, and audio share the residual stream from training onward.
Llama 4 Scout/Maverick: MetaCLIP-based vision encoder with early-fusion multimodality, pretrained on data covering 200 languages.
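The "project into the LLM's embedding space" step can be sketched as follows. This is one common pattern (a small MLP projector followed by concatenation with the text embeddings), not the method of any specific model listed above; `VisualProjector`, the dimensions, and the random stand-in for encoder output are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Hypothetical 2-layer MLP mapping vision features to the LLM embedding width."""
    def __init__(self, d_vision: int, d_llm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

torch.manual_seed(0)
d_vision, d_llm, vocab = 768, 1024, 1000         # illustrative sizes only

patch_feats = torch.randn(2, 196, d_vision)      # stand-in for vision encoder output
text_ids = torch.randint(0, vocab, (2, 12))      # stand-in for tokenized prompt

tok_emb = nn.Embedding(vocab, d_llm)             # the LLM's own token embedding table
projector = VisualProjector(d_vision, d_llm)

visual_tokens = projector(patch_feats)           # (2, 196, d_llm)
text_tokens = tok_emb(text_ids)                  # (2, 12, d_llm)

# The decoder-only stack would consume this single sequence unchanged.
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(llm_input.shape)
```

From the decoder's point of view, the 196 visual tokens are just extra positions in the sequence with the same width as text embeddings.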

Code

ViT in 30 lines·python

import torch
import torch.nn as nn
from einops import rearrange

# NOTE: TransformerBlock, MultiHeadAttention and FeedForward are not defined
# in this listing; they are assumed to be available from elsewhere.

class ViT(nn.Module):
    def __init__(self, img_size=224, patch=16, d_model=768,
                 n_layers=12, n_heads=12, n_classes=1000):
        super().__init__()
        # A conv with kernel = stride = patch cuts the image into patches
        # and linearly projects each one in a single op.
        self.patch_proj = nn.Conv2d(3, d_model, patch, patch)  # patchify
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))    # learnable [CLS]
        # One positional embedding per patch, plus one for [CLS].
        self.pos = nn.Parameter(torch.zeros(1, (img_size // patch)**2 + 1, d_model))
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, MultiHeadAttention(d_model, n_heads),
                             FeedForward(d_model, 4 * d_model))
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        x = self.patch_proj(x)              # (B, d, H/p, W/p)
        x = rearrange(x, 'b d h w -> b (h w) d')
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, 0])           # use [CLS] token's output
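The `patch_proj` convolution in the listing stands in for the "flatten each patch, then linear projection" step. A small check, assuming PyTorch, shows the two are numerically equivalent when they share weights; the reduced `d_model` of 64 is only to keep the example light.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
patch, d_model = 16, 64

conv = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
linear = nn.Linear(3 * patch * patch, d_model)

# Copy the conv kernel into the linear layer: each (3, 16, 16) filter
# becomes one row of the weight matrix.
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(d_model, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 3, 224, 224)

# Path A: strided convolution, then flatten the spatial grid.
conv_tokens = conv(x).flatten(2).transpose(1, 2)          # (B, 196, d_model)

# Path B: explicit patch extraction, then a linear projection.
unfold = nn.Unfold(kernel_size=patch, stride=patch)
flat_patches = unfold(x).transpose(1, 2)                  # (B, 196, 3*16*16)
linear_tokens = linear(flat_patches)                      # (B, 196, d_model)

max_diff = (conv_tokens - linear_tokens).abs().max().item()
print(conv_tokens.shape, max_diff)
```

The difference should be on the order of float32 rounding error, which is why many implementations use the convolution form for patch embedding.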
