흔한 실패 & 디버깅

Top-10 파인튜닝 실패 (와 fix)

#	증상	가능한 원인	Fix
1	모델이 헛소리 출력	잘못된 chat template 또는 토크나이저 불일치	매칭하는 토크나이저 사용; 학습/추론 chat template 동일 검증.
2	측정 가능한 개선 없음	예제 너무 적거나 데이터 품질 낮음	500+ 예제; 처음 50개 다 리뷰.
3	Validation loss 일찍 상승	Epoch 너무 많거나 데이터셋 너무 작음	Epoch 줄여, dropout 늘려, 다양한 데이터 추가.
4	일반 능력 잃음 (forgetting)	Full FT의 catastrophic forgetting	대신 LoRA 써; learning rate 낮춰.
5	출력 포맷 일관성 없음	학습 데이터 포맷 일관성 없음	학습 전 모든 예제 한 포맷으로 표준화.
6	Loss 전혀 감소 X	LR 너무 낮거나 데이터 포맷 잘못	LR 올려; 데이터가 변경 없이 모델에 도달하는지 검증.
7	OOM (메모리 부족)	모델 또는 배치가 GPU에 비해 큼	QLoRA 써; batch 줄여; gradient checkpointing 활성화.
8	학습 느림	Flash Attention 없음, bf16 없음	Flash Attention 2 + bf16 활성화.
9	FT 후 모델이 더 환각	학습 데이터에 에러 또는 환각 포함	데이터 audit; 검증 못 하는 거 다 제거.
10	Eval에선 작동, 실서비스에선 실패	Eval 데이터가 학습이랑 너무 비슷	실서비스 derived 테스트 셋 구축; eval에 학습 분포 절대 재사용 X.

Code

Three-step debug protocol when something looks wrong·python

# 1. Manually inspect 5 random training examples
import json, random
with open("train.jsonl") as f:
    examples = [json.loads(line) for line in f]
for ex in random.sample(examples, 5):
    for m in ex["messages"]:
        print(f"[{m['role']}] {m['content'][:240]}")
    print("---")

# 2. Verify the tokenized output looks correct
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
for ex in random.sample(examples, 1):
    formatted = tok.apply_chat_template(ex["messages"], tokenize=False)
    print("=== formatted ===")
    print(formatted[:1000])
    tokens = tok(formatted)["input_ids"]
    print(f"=== token count: {len(tokens)} ===")

# 3. Run a tiny training run (10 examples, 1 epoch) to verify pipeline
# If THAT crashes, the pipeline is broken, not the data scale.

Top-10 파인튜닝 실패 (와 fix)

Code

External links

Exercise

Progress

댓글 0