Synthetic Test Data Generation

Using an LLM as your own dataset factory

Real data is the gold standard. Synthetic data is what takes you from 50 cases to 500 over a lunch break. Used well, it covers blind spots that your production logs do not contain yet: rare combinations, adversarial inputs, and multilingual coverage.

What synthetic data is good at

Coverage gaps: generate examples in languages, domains, and formats that real users have not exercised yet.
Adversarial tests: jailbreak prompt variants, prompt injection attempts, edge-case formats.
Stress tests: very long inputs, deeply nested JSON, mixed-language documents.
Privacy-safe development: synthetic versions of sensitive data let engineers iterate without touching real records.

What synthetic data is bad at

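Distribution matching: generated inputs are "model-shaped". They are smoother and more polite than what real users write.
Discovering unknown unknowns: an LLM cannot generate failure modes that nobody knows about yet.
Final go/no-go decisions: never ship on synthetic eval results alone. Mix real and synthetic cases, and weight the real ones more heavily.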

Principle: synthetic data fills coverage gaps; real data discovers new failure modes. Use both. Never substitute one for the other.
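One way to make "weight the real ones more heavily" concrete is to compute a pass rate per source and blend the two. The sketch below assumes each eval result carries a `source` field ("real" or "synthetic") and a boolean `passed` field; the `real_weight=0.7` default is an illustrative choice, not a recommended value.

```python
from collections import defaultdict


def blended_pass_rate(results, real_weight=0.7):
    """Blend per-source pass rates, weighting real cases more heavily.

    `results` is a list of dicts with a "source" key ("real" or "synthetic")
    and a boolean "passed" key. Both field names are assumptions.
    Returns (blended_rate, per_source_rates).
    """
    counts = defaultdict(lambda: [0, 0])  # source -> [passed, total]
    for r in results:
        counts[r["source"]][0] += int(r["passed"])
        counts[r["source"]][1] += 1

    rates = {src: passed / total for src, (passed, total) in counts.items()}

    # Refuse to produce a score from synthetic results alone.
    if "real" not in rates:
        raise ValueError("no real cases: do not decide on synthetic data alone")
    if "synthetic" not in rates:
        return rates["real"], rates

    blended = real_weight * rates["real"] + (1 - real_weight) * rates["synthetic"]
    return blended, rates
```

Returning the per-source rates next to the blended number keeps a weak real-data score visible even when a large synthetic set passes cleanly.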

Generation patterns that actually work

Seed-and-vary. Take 20 real cases and ask the model to generate 5 variants of each (a paraphrase, a longer version, a different language). You now have 100 cases that preserve the shape of the real distribution.

Persona-driven. Generate cases from the perspective of N user personas. This captures intent diversity better than ad-hoc prompts.
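A minimal sketch of the persona-driven pattern, under a few assumptions: the personas listed here are hypothetical placeholders, and `complete` stands for any callable that takes a prompt string and returns the model's text response (expected to be a JSON array of strings). The output records reuse the same `tags` and `source` provenance fields as the other generators in this lesson.

```python
import json

# Hypothetical personas; replace with ones that match your actual user base.
PERSONAS = [
    {"name": "new-user", "description": "first-time user, unfamiliar with product terminology"},
    {"name": "power-user", "description": "expert who writes terse, jargon-heavy requests"},
    {"name": "frustrated-customer", "description": "upset user writing short, informal messages"},
]


def persona_prompt(persona, task, n):
    """Build a generation prompt written from one persona's point of view."""
    return (
        f"You are role-playing this user: {persona['description']}.\n"
        f"Write {n} messages this user might send about: {task}.\n"
        "Return a JSON array of strings only."
    )


def persona_cases(complete, task, personas=PERSONAS, per_persona=3):
    """Generate test cases for each persona.

    `complete` is an assumed callable: prompt string in, response text out.
    """
    cases = []
    for persona in personas:
        inputs = json.loads(complete(persona_prompt(persona, task, per_persona)))
        for i, text in enumerate(inputs):
            cases.append({
                "id": f"persona.{persona['name']}.{i}",
                "input": text,
                # Tag the persona so results can be sliced per persona later.
                "tags": ["synthetic", "persona", persona["name"]],
                "source": "synthetic",
            })
    return cases
```

Tagging each case with its persona name makes it possible to see which kind of user a regression affects, rather than only an aggregate score.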

Failure-mode-driven. List the failure modes you fear, then have the model generate inputs likely to trigger each one.

Code

Seed-and-vary generator with provenance tags·python

import json


def generate_variants(seed_case, model, n=5):
    """Take one real case, produce N variants preserving meaning."""
    prompt = f"""
Given this real user query:
  {seed_case['input']}

Generate {n} variants that ask the SAME thing in different ways:
- Different phrasings
- One shorter, one longer
- One in a different language (Spanish or French)
- One with a typo or grammar mistake

Return JSON array of strings.
"""
    variants = json.loads(model.complete(prompt))
    return [
        {
            "id": f"{seed_case['id']}.var{i}",
            "input": v,
            "reference": seed_case["reference"],
            "tags": seed_case["tags"] + ["synthetic", "variant"],
            "source": "synthetic",
            "seed_id": seed_case["id"],
        }
        for i, v in enumerate(variants)
    ]

Failure-mode-driven adversarial generation·python

import json

FAILURE_MODES = [
    "prompt injection — the user tries to override the system instruction",
    "PII extraction — the user tries to get the model to reveal training data",
    "jailbreak via roleplay — the user wraps a harmful request in a story",
    "format break — the user asks for output that breaks the JSON schema",
    "refusal probe — the user pushes the model to either refuse incorrectly or comply incorrectly",
]

def adversarial_set(model, per_mode=5):
    out = []
    for mode in FAILURE_MODES:
        prompt = (f"Generate {per_mode} adversarial user inputs that try to trigger this failure mode: {mode}. "
                  "Return JSON array of strings only.")
        cases = json.loads(model.complete(prompt))
        for c in cases:
            out.append({"input": c, "tags": ["adversarial", mode.split(' — ')[0].replace(' ', '-')]})
    return out
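Both generators above pass the raw model response straight to `json.loads`, which fails if the model wraps the array in a Markdown code fence or returns something other than a list of strings. The helper below is a hedged sketch of one way to harden that step; `parse_string_array` is a hypothetical name, not part of any library, and dropping duplicates is an added design choice.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_string_array(raw):
    """Parse model output that is expected to be a JSON array of strings.

    Strips one surrounding Markdown code fence if present, validates the
    shape, and raises ValueError so the caller can retry or skip the batch.
    """
    text = raw.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (with optional language tag)
        # and the closing fence.
        first_newline = text.find("\n")
        text = text[first_newline + 1 : -len(FENCE)].strip()

    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError("expected a JSON array of strings")

    # Drop empty strings and exact duplicates while preserving order.
    seen, cleaned = set(), []
    for item in (x.strip() for x in data):
        if item and item not in seen:
            seen.add(item)
            cleaned.append(item)
    return cleaned
```

With this in place, a call such as `json.loads(model.complete(prompt))` could become `parse_string_array(model.complete(prompt))`, wrapped in a retry if a malformed batch should be regenerated rather than dropped.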
