Data formats: JSONL, OpenAI, Alpaca, ShareGPT

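Why JSONL won

Nearly every fine-tuning pipeline uses JSONL (JSON Lines): one JSON object per line. It won because (1) it can be streamed, (2) a single line can be validated without parsing the rest of the file, (3) one bad line does not ruin the whole file, and (4) modern dataset tools support it natively.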
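Points (1) through (3) can be sketched in a few lines of Python. This is a minimal illustration, not a production loader; the helper name `iter_jsonl` and the sample lines are made up for the example.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple

def iter_jsonl(lines: Iterable[str]) -> Iterator[Tuple[int, object, Optional[str]]]:
    """Yield (line_number, record, error) one line at a time.

    Works on any iterable of strings, including an open file handle,
    so the whole file never has to be held in memory.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            yield lineno, json.loads(line), None
        except json.JSONDecodeError as exc:
            # A bad line is reported and skipped; later lines still parse.
            yield lineno, None, str(exc)

# Hypothetical sample: the second line is truncated on purpose.
sample = [
    '{"messages": [{"role": "user", "content": "hi"}]}',
    '{"messages": [',
    '{"instruction": "x", "input": "", "output": "y"}',
]

good = [rec for _, rec, err in iter_jsonl(sample) if err is None]
bad = [n for n, _, err in iter_jsonl(sample) if err is not None]
print(len(good), bad)  # prints: 2 [2]
```

For a real file, pass the open handle directly: `iter_jsonl(open("train.jsonl", encoding="utf-8"))` (the filename is a placeholder).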

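OpenAI chat format (the de facto standard)

This is what OpenAI's managed fine-tuning requires, what TRL's SFTTrainer detects automatically, and what most modern open-source tools use by default. Each line holds one example with a messages array.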
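A rough shape check for one chat-format record might look like the sketch below. The function name `chat_record_problems` is hypothetical, and the allowed role set is a simplifying assumption that covers only the three roles used in this lesson; real datasets may carry additional roles or fields.

```python
# Assumption: only these three roles are expected in this lesson's data.
ALLOWED_ROLES = {"system", "user", "assistant"}

def chat_record_problems(record) -> list:
    """Return a list of human-readable problems; empty means the shape looks fine."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string content")
    return problems

ok = {"messages": [{"role": "user", "content": "hi"},
                   {"role": "assistant", "content": "hello"}]}
print(chat_record_problems(ok))  # prints: []
```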

Alpaca format

Older, but still found in many open datasets. Each line is a single object with three fields: {instruction, input, output}. Easy to convert to the chat format at intake.
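The conversion can be sketched as below. The name `alpaca_to_chat` is made up for illustration, and joining instruction and input with a blank line is one common convention assumed here, not something the format itself prescribes.

```python
def alpaca_to_chat(record: dict) -> dict:
    """Convert one Alpaca record to an OpenAI-style chat record."""
    instruction = record["instruction"]
    extra = record.get("input", "")
    # Assumption: append the optional input after a blank line.
    user_content = f"{instruction}\n\n{extra}" if extra else instruction
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }

converted = alpaca_to_chat(
    {"instruction": "Summarize the key findings.",
     "input": "Patient presents with elevated BP (160/95)...",
     "output": "Stage 2 hypertension; recommend ambulatory BP monitoring."}
)
print(converted["messages"][0]["role"])  # prints: user
```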

ShareGPT format

For multi-turn conversations. Each line holds a conversations array of {from, value} entries. Common in Axolotl-era datasets.
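Converting ShareGPT to the chat format is mostly a rename of keys and speakers. A minimal sketch, assuming only the three speaker labels that appear in this lesson (system, human, gpt); the name `sharegpt_to_chat` is hypothetical, and datasets in the wild may use other labels that would need their own mapping.

```python
# Assumption: only these three speaker labels occur in the data.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_chat(record: dict) -> dict:
    """Convert one ShareGPT record to an OpenAI-style chat record."""
    messages = []
    for turn in record["conversations"]:
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            # Fail loudly instead of guessing a role for an unknown label.
            raise ValueError(f"unknown speaker label: {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return {"messages": messages}

chat = sharegpt_to_chat({"conversations": [
    {"from": "system", "value": "You are a coding assistant."},
    {"from": "human", "value": "How do I read a CSV in Python?"},
    {"from": "gpt", "value": "Use pandas."},
]})
print([m["role"] for m in chat["messages"]])  # prints: ['system', 'user', 'assistant']
```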

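Mixing formats is a trap

Pick one format at intake and convert everything else to it. Mixing formats within a single training run is the number one cause of "the model outputs gibberish" in first-week fine-tuning projects.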
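One way to catch a mixed file before training is to tally formats at intake. The sketch below guesses the format from top-level keys, which is a heuristic assumed for this example; `detect_format` and `format_counts` are hypothetical helper names.

```python
from collections import Counter

def detect_format(record: dict) -> str:
    """Guess the format of one record from its top-level keys (heuristic)."""
    if "messages" in record:
        return "openai_chat"
    if "conversations" in record:
        return "sharegpt"
    if "instruction" in record and "output" in record:
        return "alpaca"
    return "unknown"

def format_counts(records) -> Counter:
    """Count how many records of each detected format are present."""
    return Counter(detect_format(r) for r in records)

# Hypothetical intake batch containing three different formats.
records = [
    {"messages": [{"role": "user", "content": "hi"}]},
    {"instruction": "x", "input": "", "output": "y"},
    {"conversations": [{"from": "human", "value": "hi"}]},
]

counts = format_counts(records)
if len(counts) > 1:
    print("mixed formats, convert before training:", dict(counts))
```

If more than one format shows up, convert everything to the chosen format before any record reaches the trainer.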

Code

OpenAI chat format (one example per JSONL line; wrapped here for readability)·json

{"messages": [
  {"role": "system", "content": "You are a medical assistant. Be concise."},
  {"role": "user", "content": "What are common symptoms of type 2 diabetes?"},
  {"role": "assistant", "content": "Increased thirst, frequent urination, increased hunger, fatigue, blurred vision, slow-healing sores, frequent infections."}
]}

Alpaca format·json

{"instruction": "Summarize the key findings.", "input": "Patient presents with elevated BP (160/95)...", "output": "Stage 2 hypertension; recommend ambulatory BP monitoring."}

ShareGPT format (one example per JSONL line; wrapped here for readability)·json

{"conversations": [
  {"from": "system", "value": "You are a coding assistant."},
  {"from": "human", "value": "How do I read a CSV in Python?"},
  {"from": "gpt", "value": "Use pandas:\n\nimport pandas as pd\ndf = pd.read_csv('data.csv')"}
]}
