Output filtering: don't trust what the model says

~14 min · security, output-filtering

Even when the prompt is right, the output can be wrong

The model may have been successfully attacked, may hit a rare failure mode, or may simply produce content that violates your policy. Output filtering is the layer that catches what the prompt did not.

Patterns to scan for

Sensitive data: email addresses, phone numbers, SSNs, credit card numbers, internal API endpoints.
Embedded resource URLs that can be used to exfiltrate data (image src, link href).
Markdown / HTML structures that the UI will render and the user will trust.
Profanity, hate speech, and topics that are policy-restricted in your context.
Prompt-injection markers ("ignore previous instructions", tool-call syntax in unexpected places).
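
As a rough illustration of the list above, a scanner can be a table of named regexes checked against the raw output. This is a minimal sketch, not a complete rule set: the pattern names, the `PATTERNS` table, and `scan_output` are all hypothetical, and the regexes are deliberately simple.

```python
import re

# Illustrative patterns only; a real deployment needs broader, tested rules.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Markdown image pointing at an http(s) URL: a common exfiltration carrier.
    "markdown_image": re.compile(r"!\[[^\]]*\]\(\s*https?://[^)\s]+"),
    "injection_marker": re.compile(
        r"ignore (?:all )?previous instructions", re.IGNORECASE
    ),
}


def scan_output(text: str) -> list[str]:
    """Return the name of every pattern that matches somewhere in the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

A scanner like this only detects; what to do with a hit (block, redact, alert) is a separate decision.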

How to enforce

Run regexes / classifiers on the raw output before it reaches the user.
Use an HTML/markdown sanitizer to strip dangerous structures.
Run a second-pass LLM judge on high-risk content.
Block, redact, or alert depending on severity.
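
The last step, choosing between block, redact, and alert, could be sketched as a lookup from flag to severity. The `Severity` levels, the `FLAG_SEVERITY` mapping, and `decide` below are assumptions made for illustration; the right thresholds depend on your own policy.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1     # deliver unchanged, but raise an alert
    MEDIUM = 2  # redact the offending spans, then deliver
    HIGH = 3    # block the whole response


# Hypothetical mapping from filter flag to severity.
FLAG_SEVERITY = {
    "profanity": Severity.LOW,
    "email": Severity.MEDIUM,
    "credential_leak": Severity.HIGH,
}

ACTIONS = {Severity.LOW: "alert", Severity.MEDIUM: "redact", Severity.HIGH: "block"}


def decide(flags: list[str]) -> str:
    """Return 'allow', 'alert', 'redact' or 'block' for the worst flag raised."""
    if not flags:
        return "allow"
    # Unknown flags are treated as HIGH so that new detectors fail closed.
    worst = max(FLAG_SEVERITY.get(flag, Severity.HIGH) for flag in flags)
    return ACTIONS[worst]
```

Treating unmapped flags as the highest severity is one possible design choice: it keeps a newly added detector from silently passing content through before someone assigns it a level.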

Code

Output sanitizer chain·python

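# strip_credentials, strip_exfil_urls, sanitize_markdown and alert_security
# are assumed to be defined elsewhere. Each stripper takes the text and
# returns (cleaned_text, list_of_flags).

def sanitize(text: str) -> tuple[str, list[str]]:
    """Run each sanitizer in order, collecting the flags they raise."""
    flags: list[str] = []
    for step in (strip_credentials, strip_exfil_urls, sanitize_markdown):
        text, step_flags = step(text)
        flags += step_flags
    return text, flags


def filter_output(model_text: str) -> str:
    """Sanitize model output and alert security on a credential leak."""
    out, flags = sanitize(model_text)
    if "credential_leak" in flags:
        alert_security(out)
    return out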
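
The chain above leaves its helpers undefined. One of them, `strip_credentials`, might look roughly like the sketch below. The two credential shapes (AWS-style access key IDs and `Bearer` tokens) and the `[REDACTED]` placeholder are assumptions chosen for illustration, not a complete list of what can leak.

```python
import re

# Assumed credential shapes, for illustration only. Extend this list with
# the key and token formats your own systems actually issue.
CREDENTIAL_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                  # AWS-style access key ID
    re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{20,}=*"),   # bearer token in a header
]


def strip_credentials(text: str) -> tuple[str, list[str]]:
    """Redact credential-like strings; flag 'credential_leak' if any were found."""
    found = False
    for pattern in CREDENTIAL_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        found = found or count > 0
    return text, (["credential_leak"] if found else [])
```

Because it returns the same `(text, flags)` pair as the other steps, it slots into the chain without special handling.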

Exercise

Add an output filter that scans for one specific exfiltration vector (for example, an image src pointing to an external host). Test it on known-good and synthetic-bad outputs.
