torchvision Datasets and the v2 Transform API

Built-in datasets and the modern transform pipeline

torchvision ships ready-to-use dataset classes (MNIST, CIFAR-10/100, ImageNet, COCO) and a transform pipeline. The current API is torchvision.transforms.v2, which replaces the older torchvision.transforms as the recommended path. Use v2 for new code.

Why v2 matters

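Native support for tv_tensors (Image, BoundingBoxes, Mask, Video), which matters for detection and segmentation.
Transforms apply correctly to multiple inputs at once: the image AND the bounding boxes AND the mask are rotated in a single call.
The most common ops are often considerably faster thanks to rewritten implementations.
A clean separation between PIL-image transforms and tensor transforms.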

The standard preprocessing chain

For ImageNet-pretrained models you will see this exact recipe constantly:

ToImage(): wraps the input as a tv_tensors.Image (the v2-native type).
ToDtype(torch.float32, scale=True): converts to float32 with values scaled to [0, 1].
Resize(256): resizes the shorter side to 256, preserving the aspect ratio.
CenterCrop(224): crops the central 224x224 region.
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]): ImageNet normalization. Memorize these numbers.
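
The arithmetic behind those steps can be sketched in plain Python. The helper names resized_shape and normalize_pixel below are hypothetical (they are not torchvision functions), and truncating the longer side to an integer is an assumption about how the resize rounds.

```python
def resized_shape(height, width, short_side=256):
    """Shape after resizing the shorter side to `short_side`, keeping aspect ratio.

    Hypothetical helper; the longer side is truncated to an int (assumed rounding).
    """
    short, long = min(height, width), max(height, width)
    new_long = int(short_side * long / short)
    return (short_side, new_long) if height <= width else (new_long, short_side)


def normalize_pixel(value_uint8, mean, std):
    """Scale a uint8 value to [0, 1], then apply (x - mean) / std."""
    return (value_uint8 / 255.0 - mean) / std


# A 480x640 (height x width) image: shorter side 480 -> 256, longer side 640 -> 341
print(resized_shape(480, 640))  # (256, 341)

# Range of the normalized red channel (mean 0.485, std 0.229)
lo = normalize_pixel(0, 0.485, 0.229)
hi = normalize_pixel(255, 0.485, 0.229)
print(round(lo, 3), round(hi, 3))  # -2.118 2.249
```

After CenterCrop(224) the spatial size is 224x224 regardless of the input, and after Normalize the values are no longer in [0, 1] but roughly in [-2.1, 2.6] depending on the channel.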

Code

Built-in dataset: CIFAR10 example·python

import torch
from torchvision import datasets
import torchvision.transforms.v2 as T

transform = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(32),  # CIFAR-10 images are already 32x32, so this leaves the size unchanged
    T.Normalize(mean=[0.4914, 0.4822, 0.4465],
                std=[0.2470, 0.2435, 0.2616]),  # CIFAR10-specific
])

train_ds = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
test_ds  = datasets.CIFAR10('./data', train=False, download=True, transform=transform)
print(len(train_ds), len(test_ds))   # 50000 10000
print(train_ds[0][0].shape, train_ds[0][1])  # torch.Size([3, 32, 32]) 6

Standard ImageNet preprocessing·python

import torch
import torchvision.transforms.v2 as T

# This generic chain matches many (not all) torchvision ImageNet classification weights
preprocess = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(256),
    T.CenterCrop(224),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# The recommended approach is to ASK the weights for their own transforms:
from torchvision.models import resnet50, ResNet50_Weights
weights = ResNet50_Weights.IMAGENET1K_V2
preprocess = weights.transforms()
# weights.transforms() returns the inference preprocessing that matches these
# weights; it can differ from the generic chain above (e.g. in resize size)

The v2 killer feature: multi-input transforms·python

import torch
import torchvision.transforms.v2 as T
from torchvision import tv_tensors

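# A scene with image + bounding boxes + segmentation mask
# Wrapping the image as tv_tensors.Image makes its role explicit
img = tv_tensors.Image(torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8))  # high bound is exclusive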
boxes = tv_tensors.BoundingBoxes(
    [[10, 20, 100, 150]], format='XYXY', canvas_size=(224, 224)
)
mask = tv_tensors.Mask(torch.zeros(224, 224, dtype=torch.uint8))

transform = T.Compose([
    T.RandomHorizontalFlip(p=1.0),
    T.RandomRotation(15),
])

# Apply to all three at once: boxes and mask are transformed consistently with the image
img_t, boxes_t, mask_t = transform(img, boxes, mask)
print(img_t.shape, boxes_t, mask_t.shape)
# With the old transforms API this required hand-written, per-target code.
