Dataset vs DatasetDict, Features, and Schema

~24 min · datasets, schema


Two classes, one mental model

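A Dataset is a single Arrow-backed table of named columns and rows. A DatasetDict is just {split_name: Dataset} plus convenience methods that broadcast to every split. Most transforms you use on a single Dataset, such as map, filter, shuffle, and sort, also exist on DatasetDict and are applied to every split. Row-index operations such as select are defined on Dataset only, so apply them split by split.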

Features are the schema

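ds.features is the schema dict. Each column is a Value, ClassLabel, Sequence, Audio, Image, or a nested type. For classification, ClassLabel carries the integer-to-name label mapping, which is why ds.features['label'].int2str(1) returns a label name such as 'pos' on IMDB. Treat the features dict like a database schema: it is type-aware, queryable, and the source of truth for what the data means.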

Code

Inspect the schema · python

from datasets import load_dataset

ds = load_dataset("stanfordnlp/imdb")
print(ds["train"].features)
# {'text': Value(dtype='string'), 'label': ClassLabel(names=['neg', 'pos'])}
# (the exact repr varies by library version)

# Convert a label int to its name
print(ds["train"].features["label"].int2str(1))  # 'pos'

# Slice access
sample = ds["train"][:3]  # dict of column -> list
print(sample["label"])

Ops broadcast across a DatasetDict · python

from datasets import DatasetDict, load_dataset

ds = load_dataset("stanfordnlp/imdb")  # DatasetDict

# shuffle is broadcast to every split; select exists only on Dataset,
# so take the first 1000 rows of each shuffled split explicitly.
shuffled = ds.shuffle(seed=42)
small = DatasetDict({name: split.select(range(1000)) for name, split in shuffled.items()})

# map / filter / remove_columns are broadcast in the same way
small_filtered = small.filter(lambda ex: len(ex["text"]) > 200)
print({k: len(v) for k, v in small_filtered.items()})

Exercise

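Load any classification dataset from the Hub. Print its features, num_rows per split, and the label distribution. Build a stratified 80/20 train/dev split with train_test_split(test_size=0.2, stratify_by_column='label') and verify that class balance is preserved.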
