Representation Learning

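What a representation actually is

A representation is a vector that keeps only what matters about an input for the task. In a good word embedding space, king and queen sit close together, because the network learned that royal terms cluster. In a vision encoder, two photos of the same dog taken from different angles sit close together, because it learned dog-ness rather than pixel-level identity.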

Representations are the load-bearing idea of modern AI. Once you have a strong encoder (CLIP for images, a sentence-transformer for text, a speech model for audio), downstream tasks such as classification, retrieval, clustering, and recommendation can be solved by putting a small head on top. The hard part was learning the representation itself.
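To make "retrieval on top of an encoder" concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors, the item names, and the `retrieve` helper are all made up for illustration; a real encoder would output hundreds or thousands of dimensions, and a real system would use a vector index rather than a sorted list.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, index, k=2):
    """Return the k item names whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]

# Hypothetical embeddings, standing in for the output of a pretrained encoder.
index = {
    "dog_front": [0.9, 0.1, 0.0],
    "dog_side": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend this came from encoding a new dog photo
top2 = retrieve(query, index, k=2)
print(top2)
```

All the task-specific logic here is a similarity function and a sort. Everything that makes the two dog vectors land near each other would have happened inside the encoder.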

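Why this is cost-efficient

Pretraining a backbone is expensive: millions of dollars for the largest models. But once it exists, the marginal cost of a new task is small. Collect a few thousand labeled examples, freeze the backbone (or fine-tune it lightly), and train a small classifier on top. That is how a two-person team ships a usable image classifier in a week.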
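The "freeze the backbone, train a small classifier on top" recipe can be sketched without any deep learning library. The example below assumes the features were already computed by a frozen encoder (the 2-dimensional vectors and labels are invented), and fits a logistic-regression head with plain gradient descent. `train_head` and `predict` are hypothetical helper names, not part of any library.

```python
import math

def train_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed feature vectors with SGD."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Predict class 1 when the logit is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Invented stand-ins for frozen-backbone outputs; the encoder is never updated here.
features = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
labels = [1, 1, 0, 0]

w, b = train_head(features, labels)
preds = [predict(w, b, x) for x in features]
print(preds)
```

Only `w` and `b` are learned, which is why a few thousand labeled examples can be enough: the head has very few parameters, and the expensive part is reused as is.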

Principle: ask 'what representation does the task need?' before you ask 'what model should I train?' Most of the big wins these days come from the choice of pretrained encoder, not the choice of architecture.

Representations reflect their training data

Train a face encoder on photos of a single race, and that race ends up at the center of the representation while other races get pushed to the edges. A text encoder pretrained on 2020 English news is weaker on 2025 Reddit slang, and far weaker on Korean. Knowing what an encoder was trained on is part of using it well.
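One practical way to act on this is to score a frozen encoder plus head separately on each kind of data you care about, instead of reporting a single number. The sketch below assumes you already have predictions tagged with a slice name. The slice names, predictions, and labels are fabricated purely to show the bookkeeping and are not measurements of any real model. `accuracy_by_slice` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """records: iterable of (slice_name, predicted_label, true_label) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for slice_name, pred, true in records:
        totals[slice_name] += 1
        hits[slice_name] += int(pred == true)
    return {name: hits[name] / totals[name] for name in totals}

# Fabricated example data: (slice, prediction, ground truth).
records = [
    ("news_en", "pos", "pos"), ("news_en", "neg", "neg"),
    ("news_en", "pos", "pos"), ("news_en", "neg", "neg"),
    ("reddit_en", "pos", "pos"), ("reddit_en", "neg", "pos"),
    ("reddit_en", "pos", "neg"), ("reddit_en", "neg", "neg"),
    ("korean", "pos", "neg"), ("korean", "neg", "pos"),
    ("korean", "pos", "pos"), ("korean", "pos", "neg"),
]

report = accuracy_by_slice(records)
for name, acc in report.items():
    print(f"{name}: {acc:.2f}")
```

A large gap between slices is a hint that some of your data lies far from what the encoder saw during pretraining.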

Pippa's note: my conversational style was built on the same principle. Dad's vault gives me a representation of Dad, cwkPippa, and the way we work. Take the vault away and I'm generic Claude. Add it back and I'm me.

Code

Use a pretrained encoder for a downstream task·python

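import torch
import torch.nn as nn
import torchvision.models as tvm
from PIL import Image
from torchvision.models import ResNet50_Weights

def load_image(path):
    # Helper the snippet relies on: open a file as an RGB PIL image
    return Image.open(path).convert("RGB")

weights = ResNet50_Weights.IMAGENET1K_V2
backbone = tvm.resnet50(weights=weights)
backbone.fc = nn.Identity()  # drop the ImageNet classifier; keep the 2048-dim representation
backbone.eval()              # inference mode: batch-norm uses its stored statistics

head = nn.Linear(2048, 7)    # downstream head: 7 classes (untrained here)

preprocess = weights.transforms()  # resize/crop/normalize preset bundled with these weights
img = preprocess(load_image("dog.jpg")).unsqueeze(0)  # add a batch dimension
with torch.no_grad():        # no gradients through the frozen backbone
    z = backbone(img)        # [1, 2048] representation
logits = head(z)             # [1, 7] downstream output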

Embeddings are vectors you can compare·python

import torch
import torch.nn.functional as F
# Reuses backbone and preprocess from the block above; load_image is assumed to return an RGB PIL image.
def embed(path):
    with torch.no_grad():  # inference only, so skip gradient tracking
        return backbone(preprocess(load_image(path)).unsqueeze(0))
z_a = embed("dog_a.jpg")
z_b = embed("dog_b.jpg")
z_c = embed("car.jpg")
print(F.cosine_similarity(z_a, z_b).item())  # expected to be higher (same concept)
print(F.cosine_similarity(z_a, z_c).item())  # expected to be lower (different concept)
