Data Collection & Synthetic Generation

Four practical sources

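Manual curation: write examples by hand. Highest quality, but the most expensive in human time. Essential for the first 20 to 50 gold examples.
Production logs: collect real user interactions (after consent and a privacy review), then filter for quality. Closest to the true distribution.
Synthetic generation: use a strong model (GPT-4o, Claude 3.5 Sonnet) to generate examples, seeded with the gold set. Scales the fastest.
Distillation: run real or synthetic inputs through a large teacher model and use its outputs as training data for a smaller student.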
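For the production-log source, the quality-filter step might look like the sketch below. It is a minimal illustration, not a complete privacy solution: the record fields (`consented`, `user`, `assistant`), the `min_chars` threshold, and the email-only masking are all assumptions made for this example, and it does not replace the privacy review mentioned above.

```python
import re

# Deliberately simple email pattern; real PII scrubbing needs far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def filter_logs(records, min_chars=20):
    """Keep consented, non-trivial, de-duplicated interactions with emails masked.

    `records` is a list of dicts with hypothetical keys:
    "consented" (bool), "user" (str), "assistant" (str).
    """
    seen = set()
    kept = []
    for rec in records:
        if not rec.get("consented"):
            continue  # never train on interactions without consent
        user = EMAIL_RE.sub("[EMAIL]", rec["user"].strip())
        assistant = EMAIL_RE.sub("[EMAIL]", rec["assistant"].strip())
        if len(user) < min_chars or not assistant:
            continue  # too short to be a useful training example
        key = user.lower()
        if key in seen:
            continue  # exact duplicate of an earlier user message (case-insensitive)
        seen.add(key)
        kept.append({"user": user, "assistant": assistant})
    return kept
```

Filtering on consent first means no later step ever touches text that should not have been collected.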

A synthetic generation pipeline that does not drift

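The mistake: asking a strong model to "generate 1,000 examples about X" and dumping the result. The model collapses into a handful of templates, so you end up training on 1,000 nearly identical rows.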
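One rough way to check whether a generated set has collapsed into templates is to measure how many rows have a near-duplicate partner. The sketch below uses Jaccard similarity over word trigrams. The function names and the 0.6 threshold are illustrative assumptions, not part of the original pipeline, and the pairwise loop is quadratic, so it suits a few thousand rows rather than millions.

```python
from itertools import combinations

def shingles(text, n=3):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicate_rate(texts, threshold=0.6):
    """Fraction of texts that have at least one partner at or above `threshold`."""
    sets = [shingles(t) for t in texts]
    flagged = set()
    for i, j in combinations(range(len(sets)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            flagged.update((i, j))
    return len(flagged) / len(texts) if texts else 0.0
```

A high rate on the user messages suggests the generator is reusing a few templates, which is the failure described above.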

The fix: list the axes of variation explicitly (topic, difficulty, tone, length, edge cases), generate one example per cell of the axis grid, and review a 20% sample by hand.

Code

Grid-based synthetic generator that produces real diversity (Python)

import itertools
import json

from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
teacher = OpenAI()

# Axes of variation; every combination of values is one grid cell.
TOPICS = ["refund request", "shipping delay", "product defect",
          "account locked", "billing question"]
DIFFICULTIES = ["clear request", "ambiguous", "angry", "polite-but-vague"]
LENGTHS = ["one sentence", "a short paragraph", "multiple paragraphs"]

SYSTEM = """You generate realistic customer-support training pairs.
Output JSON: {"user": "...", "assistant": "..."}.
The assistant should be empathetic, concise, and follow the company's
formal-but-warm voice."""

def generate_pair(topic: str, difficulty: str, length: str) -> dict:
    """Generate one chat-format training example for a single grid cell."""
    prompt = (f"Topic: {topic}. Customer mood: {difficulty}. "
              f"Customer message length: {length}. Generate one realistic pair.")
    r = teacher.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},  # force a parseable JSON reply
        temperature=0.8,  # some randomness on top of the grid's diversity
    )
    pair = json.loads(r.choices[0].message.content)
    # The stored example uses a plain system prompt, not the generation prompt.
    return {"messages": [
        {"role": "system", "content": "You are a customer support agent."},
        {"role": "user", "content": pair["user"]},
        {"role": "assistant", "content": pair["assistant"]},
    ]}

# One example per cell: 5 topics x 4 difficulties x 3 lengths = 60 rows.
grid = list(itertools.product(TOPICS, DIFFICULTIES, LENGTHS))
with open("synthetic.jsonl", "w", encoding="utf-8") as f:
    for topic, diff, length in grid:
        ex = generate_pair(topic, diff, length)
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
print("Generated", len(grid), "examples")
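The generator covers the grid but not the 20% hand review. A minimal sketch of that step, assuming the `synthetic.jsonl` file written above, could look like this. The helper names, the fixed seed, and the idea of pulling the sample into a separate list for a reviewer are illustrative choices, not part of the original script.

```python
import json
import random

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def sample_for_review(rows, fraction=0.2, seed=0):
    """Return a reproducible random subset of `rows` for manual review.

    Always returns at least one row when `rows` is non-empty.
    """
    if not rows:
        return []
    k = max(1, round(len(rows) * fraction))
    # A dedicated Random instance keeps the sample stable across runs.
    return random.Random(seed).sample(rows, k)
```

Calling `sample_for_review(load_jsonl("synthetic.jsonl"))` on the 60-row grid output selects 12 examples. Fixing the seed lets two reviewers look at the same subset.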
