Repo types: Models, Datasets, Spaces

Three repo contracts, not three tabs

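The Hub treats models, datasets, and Spaces as three distinct repo contracts: the URL prefix differs, the metadata schema differs, the storage backend differs, and the rate limits differ. The web UI groups them as tabs, so they look like the same thing, but they are not.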

huggingface.co/{org}/{name}: a model repo. Weights (.safetensors / .bin), config, tokenizer, and a README.md with YAML front matter (the model card).
huggingface.co/datasets/{org}/{name}: a dataset repo. Data files (Parquet, CSV, JSONL, audio, images), an optional loading script, and a dataset card.
huggingface.co/spaces/{org}/{name}: a Space repo. This one is an app: Gradio / Streamlit / Docker / static. It carries runtime config, secrets, and a hardware tier.
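
The three URL shapes above can be captured in a few lines. This is only a sketch: repo_url and PREFIXES are hypothetical names made up for illustration, not part of huggingface_hub. The point is that models sit at the root while datasets and Spaces get a prefix.

```python
# Hypothetical helper for illustration; not part of huggingface_hub.
HUB = "https://huggingface.co"

# Models have no prefix; datasets and Spaces do.
PREFIXES = {"model": "", "dataset": "datasets/", "space": "spaces/"}


def repo_url(repo_id: str, repo_type: str = "model") -> str:
    """Build the web URL for a repo of the given type."""
    if repo_type not in PREFIXES:
        raise ValueError(f"unknown repo_type: {repo_type!r}")
    return f"{HUB}/{PREFIXES[repo_type]}{repo_id}"


print(repo_url("stanfordnlp/imdb", "dataset"))
```

The same repo_id produces a different URL for each repo_type, which is why a bare {org}/{name} always resolves as a model.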

Why the distinction matters

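Nearly every repo-level call in huggingface_hub takes a repo_type, and when you omit it the library assumes a model. If you mean a dataset but pass repo_type="model", you get a 404 (raised as RepositoryNotFoundError) even though the repo exists: same Git infrastructure, different namespace. list_models(), list_datasets(), and list_spaces() on HfApi are not aliases of one function. They hit different indexes and accept different filters (task= is models only; language= works for both models and datasets).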
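
Because the namespaces are separate, one way to find out what a repo_id actually is would be to try each repo_type in turn. Here is a rough sketch under a few assumptions: find_repo_types and _hub_lookup are hypothetical names, the lookup is injectable so you can run it offline, and the broad except is a shortcut (a real version would probably catch huggingface_hub's RepositoryNotFoundError specifically).

```python
from typing import Callable, List

REPO_TYPES = ("model", "dataset", "space")


def _hub_lookup(repo_id: str, repo_type: str) -> object:
    # Imported lazily so the helper can be exercised without the library installed.
    from huggingface_hub import HfApi

    return HfApi().repo_info(repo_id, repo_type=repo_type)


def find_repo_types(
    repo_id: str,
    lookup: Callable[[str, str], object] = _hub_lookup,
) -> List[str]:
    """Return every repo_type under which repo_id resolves (hypothetical helper)."""
    found = []
    for repo_type in REPO_TYPES:
        try:
            lookup(repo_id, repo_type)
        except Exception:
            # Shortcut: any failure counts as "not found", including network errors.
            continue
        found.append(repo_type)
    return found
```

It returns a list rather than a single value because the same {org}/{name} can exist under more than one repo type. With the default lookup, find_repo_types("stanfordnlp/imdb") makes up to three network requests.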

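The repo type also decides what the repo page shows. A model repo gets the inference widget and the Use this model button; a dataset repo gets the Data Studio viewer; a Space gets an iframe of the running app. You cannot swap one for another: which one you get is tied to the repo type.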

Code

Three types, three endpoints · python

from huggingface_hub import HfApi

api = HfApi()

# Models: the task filter exists only on this surface
llamas = api.list_models(search="llama", task="text-generation", sort="downloads", limit=5)
for m in llamas:
    print(f"MODEL  {m.id:<55} dl={m.downloads or 0:>8}")

# Datasets: language filter
korean_ds = api.list_datasets(language="ko", sort="downloads", limit=5)
for d in korean_ds:
    print(f"DATA   {d.id:<55} dl={d.downloads or 0:>8}")

# Spaces: filter by SDK tag (gradio | streamlit | docker | static)
gradio_spaces = api.list_spaces(filter="gradio", sort="likes", limit=5)
for s in gradio_spaces:
    print(f"SPACE  {s.id:<55} likes={s.likes or 0}")

repo_type affects every CRUD call · python

from huggingface_hub import HfApi

api = HfApi()
repo_id = "stanfordnlp/imdb"  # this is a DATASET, not a model

# Wrong call: 404 even though the repo exists
try:
    api.repo_info(repo_id, repo_type="model")
except Exception as e:
    print("model lookup:", type(e).__name__)

# Correct call
info = api.repo_info(repo_id, repo_type="dataset")
print("dataset siblings:", [s.rfilename for s in info.siblings[:5]])

Exercise

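Pick three repos on the public Hub that you could see yourself using in a real project: one model, one dataset, one Space. Call api.repo_info(repo_id, repo_type=...) on each and look at .sha, .last_modified, .siblings, and .tags. Note which fields are unique to each repo type. Save the results in a Python dict; you'll use it in the next exercise.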