Storage, Stride, and Reading the Memory Map

A Tensor Is Layered: Storage + View Metadata

A tensor is two things: a Storage (a contiguous 1-D blob of typed memory) and a view over that storage (shape, stride, offset). When you slice or transpose, the storage usually does not change; only the view does. That is why slicing is O(1), and why .transpose() is 'free' but breaks contiguity.

What stride actually means

t.stride() returns a tuple of integers, one per dimension, telling PyTorch how many elements to step in storage when the index along that dim advances by one. A contiguous (3, 4) float tensor has stride (4, 1): moving down one row jumps 4 elements, moving across one column jumps 1 element.
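The stride rule above can be written out as arithmetic: the storage position of an element is the view's offset plus the sum of index times stride over all dims. The sketch below is only an illustration; storage_index is a hypothetical helper, not a PyTorch function, and it reads storage through t.flatten(), which mirrors storage order only because t is contiguous with offset 0.

```python
import torch

def storage_index(t, *idx):
    # Hypothetical helper (not a PyTorch API): position of t[idx] in the
    # underlying storage, computed as offset + sum(index * stride) per dim.
    return t.storage_offset() + sum(i * s for i, s in zip(idx, t.stride()))

t = torch.arange(12).reshape(3, 4)   # contiguous, stride (4, 1), offset 0
flat = t.flatten()                   # same order as storage for this t

k = storage_index(t, 2, 1)           # 0 + 2*4 + 1*1
print(k, t[2, 1].item(), flat[k].item())

# The same formula works for a sliced view, which carries a nonzero offset.
v = t[1:, 2:]                        # shape (2, 2), stride (4, 1), offset 6
kv = storage_index(v, 1, 1)          # 6 + 1*4 + 1*1
print(kv, v[1, 1].item(), flat[kv].item())
```

Because the view never moves data, the only things that differ between t and v in this calculation are the offset and the shape; the strides happen to be identical here.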

Transpose leaves the storage untouched and only swaps the strides: a (3, 4) tensor with stride (4, 1) becomes a (4, 3) tensor with stride (1, 4). That is why .is_contiguous() returns False after a transpose: walking the elements in memory order no longer matches walking them in dim order.
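One way to convince yourself that a transpose is nothing more than a stride swap is to rebuild it by hand with torch.as_strided, passing the swapped shape and strides yourself. This is a sketch for illustration only; in real code .T or .transpose() is the safer choice, since as_strided does no bounds reasoning on your behalf.

```python
import torch

t = torch.arange(12).reshape(3, 4)           # stride (4, 1)

# Same storage, but with shape and strides swapped by hand.
manual_T = torch.as_strided(t, size=(4, 3), stride=(1, 4))

same_values = torch.equal(manual_T, t.T)     # element-wise identical to t.T
same_memory = manual_T.data_ptr() == t.data_ptr()
print(same_values, same_memory)
print(manual_T.is_contiguous())
```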

Why it matters

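You can predict whether an op will allocate or only rewrite metadata.
You can reason about why some 'trivial' transforms (e.g. NHWC → NCHW on an image batch) actually cost time and memory.
You can debug 'why is training using 30% more memory than expected', which is often a single accidental copy.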
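To make the NHWC → NCHW point concrete, the sketch below separates the free part (permute, which only rewrites metadata) from the part that costs time and memory (contiguous, which allocates and copies). The batch size and image dimensions are arbitrary values chosen for illustration.

```python
import torch

nhwc = torch.randn(8, 32, 32, 3)             # batch, height, width, channels

# permute only rewrites shape and strides: no allocation, no copy.
nchw_view = nhwc.permute(0, 3, 1, 2)         # shape (8, 3, 32, 32)
shares = nchw_view.data_ptr() == nhwc.data_ptr()

# contiguous() materializes the new layout: this is where the cost is paid.
nchw_copy = nchw_view.contiguous()
copied = nchw_copy.data_ptr() != nhwc.data_ptr()
extra_bytes = nchw_copy.element_size() * nchw_copy.nelement()

print(nchw_view.stride(), nchw_view.is_contiguous())
print(shares, copied, extra_bytes)
```

While both nhwc and nchw_copy are alive, the batch occupies twice the memory, which is the kind of accidental copy the last point refers to.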

Code

Inspecting the layers under a tensor·python

import torch

t = torch.arange(12).reshape(3, 4)
print(t.shape)               # torch.Size([3, 4])
print(t.stride())            # (4, 1)  — row stride 4, col stride 1
print(t.is_contiguous())     # True
print(t.untyped_storage().nbytes())  # 96 bytes: one contiguous blob of 12 int64 elements

# Slicing — view, no copy
row = t[1]
print(row.storage_offset())  # 4 — row 1 starts at element 4 in storage
print(row.data_ptr() == t.data_ptr() + 4 * 8)  # True (8 bytes per int64)

Transpose changes the strides, not the storage·python

import torch

t = torch.arange(12).reshape(3, 4)
tt = t.T
print(tt.shape)              # torch.Size([4, 3])
print(tt.stride())           # (1, 4)  — strides swapped
print(tt.is_contiguous())    # False
print(tt.data_ptr() == t.data_ptr())  # True — SAME storage

# tt.view(-1) errors. tt.reshape(-1) works (will copy under the hood).
flat = tt.contiguous().view(-1)
print(flat.data_ptr() == t.data_ptr())  # False — new allocation

Memory accounting: how much does a tensor weigh?·python

import torch

t = torch.randn(1024, 1024)            # float32 by default

elem_bytes = t.element_size()          # 4
n_elem = t.nelement()                  # 1,048,576
total_mb = elem_bytes * n_elem / 1024 / 1024
print(f"{total_mb:.1f} MB")            # 4.0 MB

# Half precision halves it
t16 = t.half()
print(f"{t16.element_size() * t16.nelement() / 1024 / 1024:.1f} MB")  # 2.0 MB

# Two views of the same storage do NOT double-count
view = t[:512, :]
print(view.untyped_storage().nbytes())  # still 4,194,304 bytes: the full 1,048,576-element storage

Exercise

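Build two tensors with the same values but different strides: one contiguous, one non-contiguous (e.g. .transpose(0, 1).contiguous() vs .transpose(0, 1)). Compare data_ptr() to confirm they sit in separate storage. Time a sum() reduction on both; on a sufficiently large tensor the contiguous one is typically at least as fast, though the size of the gap depends on the op, the hardware, and how well PyTorch can reorder the iteration internally.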
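A possible starting harness for the exercise is sketched below. It assumes a CPU tensor and uses a simple time.perf_counter loop; time_sum is a hypothetical helper, and the tensor size and repeat count are arbitrary. On a GPU you would also need torch.cuda.synchronize() around the timed region, which this sketch omits.

```python
import time
import torch

def time_sum(x, repeats=20):
    # Hypothetical helper: average wall-clock seconds for x.sum() on CPU.
    x.sum()                                   # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        x.sum()
    return (time.perf_counter() - start) / repeats

base = torch.randn(2048, 2048)
strided = base.transpose(0, 1)                # non-contiguous view of base
packed = strided.contiguous()                 # same values, freshly allocated

print(strided.data_ptr() == packed.data_ptr())    # separate storage expected
print(f"contiguous:     {time_sum(packed) * 1e3:.3f} ms")
print(f"non-contiguous: {time_sum(strided) * 1e3:.3f} ms")
```

Try swapping sum() for an op that cannot simply reorder its iteration, or changing the tensor size, and see how the gap moves.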