CNN Intuition: From Edges to Objects

What CNNs actually learn

Visualizing the filters of a trained CNN reveals striking patterns. Early layers learn oriented edge detectors and color blobs, similar to simple cells in the mammalian visual cortex. Middle layers learn textures and shape parts (eyes, wheels, fur patterns). Late layers learn object-shaped features and category-specific concepts.

None of this was designed in; it emerges from training. The architecture (locality + parameter sharing + non-linearity + depth) is enough, and the specific feature hierarchy comes from the data and the gradients.

Tip: If you doubt that deep learning is really doing anything, look at the visualized first-layer filters of AlexNet. Many of them look almost identical to hand-designed edge detectors (Gabor filters), even though the network learned them from scratch.
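For comparison with those learned filters, here is a minimal sketch of a hand-designed Gabor kernel: a Gaussian envelope multiplied by a cosine wave. The function name `gabor_kernel` and the default parameter values are illustrative assumptions, not anything taken from AlexNet.

```python
import math

def gabor_kernel(size=11, theta=0.0, wavelength=4.0, sigma=2.5, gamma=0.5):
    """Real part of a Gabor filter as a size x size list of rows.

    size is assumed to be odd so the kernel has a single center pixel.
    theta is the direction (radians) along which the cosine wave varies.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the wave runs along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope (gamma squashes it along the stripes).
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            # Cosine carrier gives the alternating light/dark stripes.
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

# theta = 0 varies along x, so the stripes are vertical.
kernel = gabor_kernel(theta=0.0)
```

Plotting `kernel` as a grayscale image, and again for a few other values of `theta`, gives the kind of oriented stripe patterns that show up in first-layer filter visualizations.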

Receptive fields grow with depth

A neuron in layer 1 sees a small patch of the input. A neuron in layer 5 sees a larger patch: the union of all the layer-1 patches that feed into it through the conv stack. A neuron in layer 30 sees most of the image. This is how a CNN builds up from local edges to global structure.
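The growth can be made concrete with the standard recurrence for the theoretical receptive field: each layer adds `(kernel_size - 1)` times the current spacing between neighboring outputs, measured in input pixels. The sketch below assumes plain convolutions or pooling without dilation, and the layer stack is a made-up example rather than any particular network.

```python
def receptive_field(layers):
    """Theoretical receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) tuples in input-to-output order.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # spacing, in input pixels, between adjacent outputs
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Hypothetical stack: five 3x3 convs, two of them with stride 2.
stack = [(3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]
sizes = receptive_field(stack)
print(sizes)  # [3, 5, 9, 13, 21]
```

Note how each stride-2 layer doubles the spacing, so every later 3x3 conv widens the field by more input pixels than the one before it.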

Pooling and downsampling

Most CNNs alternate convolution with downsampling (a stride-2 conv or MaxPool). Each downsample halves the spatial dimensions and (usually) doubles the channel count. The result is a pyramid: high-resolution shallow features at the start, low-resolution deep features at the end.
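A quick way to see the pyramid is to track the feature-map shape stage by stage. This is a sketch under the simplest assumption: every stage exactly halves height and width (floor division) and doubles the channels. The starting shape of 64 channels at 224x224 is an illustrative choice.

```python
def feature_pyramid(height, width, channels, num_stages):
    """Return the (channels, height, width) shape after each downsample stage."""
    shapes = [(channels, height, width)]
    for _ in range(num_stages):
        channels *= 2   # (usually) double the channel count
        height //= 2    # halve the spatial dimensions
        width //= 2
        shapes.append((channels, height, width))
    return shapes

shapes = feature_pyramid(224, 224, 64, num_stages=4)
for c, h, w in shapes:
    print(f"{c:5d} channels at {h:3d} x {w:3d}")
```

Under these assumptions the number of activations per stage (channels times height times width) halves at every step, since the spatial area drops by 4x while the channels grow by 2x.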

Visualizing what's inside

Tools: gradient-based saliency maps (which input pixels influence the prediction?), Grad-CAM (which spatial regions activate for the predicted class?), and feature inversion, or activation maximization (what input maximizes this filter?). These are the interpretability building blocks worth knowing.
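The first of these tools, a vanilla gradient saliency map, fits in a few lines of PyTorch: backpropagate the class score to the input and take the gradient magnitude per pixel. This is a minimal sketch; the `saliency_map` name and the toy classifier are hypothetical stand-ins for your own model, and it assumes a single image with shape `[1, C, H, W]`.

```python
import torch
import torch.nn as nn

def saliency_map(model, x, target_class):
    """Vanilla gradient saliency: |d score / d pixel|, max over color channels."""
    model.eval()
    # Work on a copy that tracks gradients with respect to the pixels.
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)                       # [1, num_classes]
    logits[0, target_class].backward()
    return x.grad.abs().amax(dim=1)[0]      # [H, W]

# Hypothetical toy classifier, used only to make the sketch runnable.
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
sal = saliency_map(toy, torch.randn(1, 3, 32, 32), target_class=3)
```

Unlike Grad-CAM, the result has the same resolution as the input, but it tends to be noisier, which is one reason the two are often looked at together.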

Principle: A CNN is not a black box if you spend 30 minutes looking inside. Filter visualizations and Grad-CAM heatmaps teach you more about what your model is doing than any accuracy plot.

Code

Grad-CAM in 20 lines·python

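import torch
import torch.nn.functional as F

def gradcam(model, x, target_class, target_layer):
    """Grad-CAM heatmap for a single image x of shape [1, C_in, H_in, W_in]."""
    activations, gradients = [], []
    # Capture the target layer's output on the forward pass and the gradient
    # of the class score with respect to that output on the backward pass.
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    try:
        model.eval()
        logits = model(x)
        model.zero_grad()
        logits[0, target_class].backward()
    finally:
        # Always remove the hooks, even if the forward or backward pass fails.
        h1.remove()
        h2.remove()

    act = activations[0][0].detach()          # [C, H, W]
    grad = gradients[0][0]                    # [C, H, W]
    weights = grad.mean(dim=(1, 2))           # [C], one importance weight per channel
    cam = (weights[:, None, None] * act).sum(dim=0)   # [H, W] at the target layer's resolution
    cam = F.relu(cam)                         # keep only positive evidence for the class
    cam = cam / cam.max().clamp(min=1e-8)     # scale to [0, 1]; guard against an all-zero map
    return cam.cpu()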
