AI system 용 incident response

~14 min · production, incidents

Level 0수련생

0 XP0/100 lessons0/14 achievements

0/120 XP to next level120 XP to go0% complete

새벽 2시 page

뭔가 wrong: output 더 나빠짐, cost spike, injection succeed, tool이 wrong action fire. AI system용 incident response가 classical SRE 빌리고 3개 추가: prompt provenance, model attribution, trace forensic.

처음 10분

Affected prompt version과 model을 metric tag에서 식별.
Version flag rollback (flag 존재하지, 맞지?).
Metric tag count로 rollback이 traffic serve하는지 confirm.
Forensic용 log와 trace snapshot.
Communicate: status page나 internal channel.

다음 1시간

Bad trace pull; failure mode 식별.
Provider issue면 status page 체크하고 fallback chain activation 고려.
Prompt regression이면 v_new vs v_old 변화 식별.
Injection / abuse면 entry point 식별, defense layer 추가.

Post-mortem

써. Finding을 eval case로 convert. 이걸 일찍 catch했을 새 dashboard나 alert 추가. Model 비난 X; guardrail / eval / canary 부재 비난.

Code

Pre-built rollback runbook (sketch)·markdown

# AI prompt rollback runbook

1. Open feature flag dashboard.
2. Set `prompt_v9` flag to 0%.
3. Verify in metrics: `prompt:v8` count > `prompt:v9` count within 5 minutes.
4. Snapshot traces for `prompt:v9` from last hour.
5. Create incident channel; invite owner + responder.
6. Within 2h: post initial findings.
7. Within 24h: post-mortem doc with eval-set additions.

External links

Google SRE book — Postmortems

Exercise

Non-critical prompt에 rollback drill 돌려. Step time. 가장 느린 step 식별, improve.

Progress

Progress is local-only — sign in to sync across devices.

← PreviousCost engineering — 돈이 어디 가나 Next →Production maturity ladder

이 페이지에서 버그를 발견하셨거나 피드백이 있으세요?문제 신고

🔔 답글 알림 (로그인 필요)

로그인 — 댓글을 남기려면 로그인해 주세요.

아직 댓글이 없어요. 첫 댓글을 남겨보세요.