Why images are different

The structure pixels actually have

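An image isn't a flat vector. It's a 2-D grid where nearby pixels are correlated, the same object can show up at different positions, and useful features are local before they are global. Naively flatten an image and feed it to an MLP, and the network gets none of that structure for free. CNNs were invented to exploit that structure directly.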
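To make the flattening point concrete, here is a minimal sketch in plain Python. It assumes row-major flattening, and the 4x6 image size and the `flat_index` helper are made up for illustration. It shows that two pixels that touch vertically end up a full row apart in the flat vector.

```python
# Hypothetical tiny 4x6 grayscale image; the sizes are for illustration only.
HEIGHT, WIDTH = 4, 6


def flat_index(row, col, width=WIDTH):
    """Position of pixel (row, col) after row-major flattening."""
    return row * width + col


center = (1, 2)
right = (1, 3)   # horizontal neighbor of center
below = (2, 2)   # vertical neighbor of center

gap_right = flat_index(*right) - flat_index(*center)
gap_below = flat_index(*below) - flat_index(*center)

print("horizontal neighbor gap:", gap_right)  # 1
print("vertical neighbor gap:", gap_below)    # 6, i.e. WIDTH
```

On a 224-wide image the vertical gap would be 224. A fully connected layer treats every index the same way, so nothing tells it that these two positions are neighbors on the grid.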

Three properties matter: locality (features live in small local windows of pixels), translation invariance (a cat is a cat wherever it appears in the frame), and compositionality (low-level features combine into mid-level features, which combine into objects).

Tip: If a model spends parameters relearning that translation doesn't change the label, fewer parameters are left for the actual task. Architectures with translation invariance baked in (CNNs) tend to outperform architectures that have to learn it (early MLPs) on image data.
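A rough way to see what "baked in" means is the plain-Python sketch below. The helpers `correlate2d`, `place_blob`, and `argmax2d` and the 2x2 blob detector are hypothetical and written only for this illustration. Sliding one small kernel over the image means that moving the object moves the response, and taking a global max afterwards makes the final score independent of position.

```python
def correlate2d(image, kernel):
    """Valid-mode 2-D cross-correlation (what deep-learning 'convolution' computes)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def place_blob(size, top, left):
    """Return a size x size zero image with a 2x2 blob of ones at (top, left)."""
    img = [[0] * size for _ in range(size)]
    for dr in range(2):
        for dc in range(2):
            img[top + dr][left + dc] = 1
    return img


def argmax2d(matrix):
    """Return (row, col) of the first maximum entry."""
    best = max(max(row) for row in matrix)
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value == best:
                return (r, c)


kernel = [[1, 1], [1, 1]]  # hypothetical 2x2 blob detector, 4 shared weights

response_a = correlate2d(place_blob(6, 1, 1), kernel)
response_b = correlate2d(place_blob(6, 3, 2), kernel)

peak_a, peak_b = argmax2d(response_a), argmax2d(response_b)
score_a = max(max(row) for row in response_a)  # global max pooling
score_b = max(max(row) for row in response_b)

print(peak_a, peak_b)    # the peak moves with the blob
print(score_a, score_b)  # the pooled score does not change
```

The same four weights detect the blob at both positions. An MLP on the flattened image would need separate weights for each position it wants to recognize.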

The cost of treating an image as a flat vector

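An MLP on a 224×224 RGB image sees 3 * 224 * 224 = 150,528 input features. Every unit in the first hidden layer needs roughly that many parameters. A 256-wide first layer alone is about 38.5M parameters, more than a small CNN has in total, and it learns a unique weight for every (input pixel, hidden unit) pair. Many of those parameters go to memorizing translation, which a single 3×3 convolution filter captures with 9 weights per input channel.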
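The arithmetic above can be checked in a few lines of plain Python. The dense-layer numbers come from the text. The 3-to-64-channel convolution is a hypothetical layer size chosen only to give a comparison point.

```python
# Fully connected first layer on a flattened 224x224 RGB image.
in_features = 3 * 224 * 224
hidden_units = 256                           # width used in the text
dense_weights = in_features * hidden_units
dense_params = dense_weights + hidden_units  # one bias per hidden unit

# A 3x3 convolution reuses the same weights at every position.
kernel_weights = 3 * 3                       # one filter, one input channel
in_channels, out_channels = 3, 64            # hypothetical layer size
conv_params = kernel_weights * in_channels * out_channels + out_channels

print(f"input features:   {in_features:,}")   # 150,528
print(f"dense parameters: {dense_params:,}")  # 38,535,424
print(f"conv parameters:  {conv_params:,}")   # 1,792
```

The convolution's parameter count does not depend on the image size at all, while the dense layer's count grows with every extra pixel.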

Where this leads

The next lessons build the toolkit: convolution (locality + translation), pooling (downsampling), the classic CNN lineage (AlexNet → VGG → ResNet → ConvNeXt), then ViT (transformers take over vision too), then sequence models, then attention, then the transformer block. The pattern: each architecture encodes assumptions about the data, so pick the architecture whose assumptions match.

Principle: An architecture is an opinion. Every layer choice is an assumption about what structure the data has. Knowing those assumptions is what lets you debug a stuck model.

Code

MLP vs CNN parameter counts on a tiny image task·python

import torch.nn as nn

# MLP on 32x32 RGB
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
print("MLP params:", sum(p.numel() for p in mlp.parameters()))

# Small CNN on 32x32 RGB
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
)
print("CNN params:", sum(p.numel() for p in cnn.parameters()))
# Here the MLP has 394,634 parameters and the CNN has 20,042.
# On image tasks the CNN also tends to generalize better.
