Observability for LLM workloads

What to measure

Capture on every Claude call: latency (TTFT and TTLT when streaming), input/output/cache token counts, model id, feature label, stop_reason, error class (if any), and tool_use name (if applicable). These are the dimensions you slice by when cost spikes or quality drops.
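The streaming latency numbers are the easiest to get wrong, so here is a minimal sketch of how TTFT (time to first token) and TTLT (time to last token) could be measured. `consume_with_timings` is a hypothetical helper, not part of any SDK; it accepts any iterable of text chunks, and wiring it to a real streaming call is left out.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def consume_with_timings(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, dict]:
    """Drain a chunk stream and return (full_text, timings in milliseconds)."""
    t0 = clock()  # start before the first chunk is requested
    ttft_ms: Optional[float] = None
    parts = []
    for chunk in chunks:
        if ttft_ms is None:
            # First chunk arrived: this is time to first token.
            ttft_ms = (clock() - t0) * 1000
        parts.append(chunk)
    # Stream exhausted: this is time to last token.
    ttlt_ms = (clock() - t0) * 1000
    return "".join(parts), {"ttft_ms": ttft_ms, "ttlt_ms": ttlt_ms}
```

The `clock` parameter exists only so the timing logic can be tested with a fake clock. An empty stream leaves `ttft_ms` as `None`, which is worth logging as its own case.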

Trace the loop, not the call

A tool loop can take 5-10 round-trips. A separate trace per round is noise; one span per round inside a single parent trace is the signal. OpenTelemetry's parent/child span model fits naturally: wrap the loop in a span and make each round a child.

Sample prompts and responses

Cost spikes and quality regressions are usually caused by specific prompts or input patterns. Periodically sample full prompt+response pairs (with PII redaction) into cold storage so you can replay them later. cwkPippa's JSONL is exactly this: every event durable, every regression replayable.
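A minimal sketch of the sampling step, under stated assumptions: every name here (`redact`, `should_sample`, `maybe_sample`, the `request_id` field) is hypothetical, the redaction masks email addresses only and is nowhere near a complete PII filter, and a local JSONL file stands in for cold storage.

```python
import hashlib
import json
import re

# Illustrative pattern only: catches common email shapes, nothing else.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Mask email addresses; real PII redaction needs far more than this."""
    return EMAIL_RE.sub("[EMAIL]", text)


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministic sampling: the same request_id always gets the same decision."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def maybe_sample(request_id: str, prompt: str, response: str,
                 path: str, rate: float = 0.01) -> bool:
    """Append a redacted prompt/response pair to a JSONL file if sampled."""
    if not should_sample(request_id, rate):
        return False
    record = {
        "request_id": request_id,
        "prompt": redact(prompt),
        "response": redact(response),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True
```

Hashing the request id instead of calling a random number generator keeps the decision reproducible, so a retry of the same request is either always sampled or never sampled.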

Principle: if you can't answer 'which feature spent the most yesterday, and why', you haven't built observability yet.
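To make the principle concrete, here is a rough sketch of the "which feature spent the most" half of that question, answered from a JSONL log. It assumes each record carries `feature`, `input`, and `output` fields plus a `ts` field in epoch seconds; `tokens_by_feature` is a hypothetical helper, and token count is used as a stand-in for cost.

```python
import json
from collections import defaultdict
from typing import Iterable, List, Tuple


def tokens_by_feature(lines: Iterable[str], start_ts: float,
                      end_ts: float) -> List[Tuple[str, int]]:
    """Sum input+output tokens per feature for records with start_ts <= ts < end_ts."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        ts = rec.get("ts")
        if ts is None or not (start_ts <= ts < end_ts):
            continue  # outside the window, or no timestamp recorded
        # Error records carry no token counts; count them as zero.
        tokens = rec.get("input", 0) + rec.get("output", 0)
        totals[rec.get("feature", "unknown")] += tokens
    # Heaviest spender first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Passing yesterday's start and end timestamps gives a ranked list per feature. The "why" half usually means slicing the same records further, for example by `stop_reason` or model id.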

Code

Per-call structured logging · python

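import json
import time

import anthropic

# Assumes ANTHROPIC_API_KEY is set in the environment.
client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-6"


def call_with_telemetry(messages, *, feature: str, log_path: str):
    """Call Claude once and append one structured JSON line for the call."""
    t0 = time.perf_counter()
    try:
        resp = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=messages,
        )
    except Exception as e:
        # Log the error class so failures can be sliced by feature, then re-raise.
        with open(log_path, "a") as f:
            f.write(json.dumps({"ts": time.time(), "feature": feature, "model": MODEL,
                                "error": type(e).__name__, "msg": str(e)}) + "\n")
        raise
    latency_ms = int((time.perf_counter() - t0) * 1000)
    # Written outside the try block so a log I/O failure is not recorded as an API error.
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),  # wall-clock timestamp, needed for "yesterday" queries
            "feature": feature,
            "model": MODEL,
            "latency_ms": latency_ms,
            "input": resp.usage.input_tokens,
            "output": resp.usage.output_tokens,
            "cache_read": resp.usage.cache_read_input_tokens,
            "stop_reason": resp.stop_reason,
        }) + "\n")
    return resp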

OpenTelemetry trace around the tool loop · python

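import anthropic
from opentelemetry import trace

tracer = trace.get_tracer("claude")
client = anthropic.AsyncAnthropic()  # async client, since each call is awaited
MAX_ITERS = 10  # assumed cap on round-trips; tune for your workload


async def traced_loop(prompt: str):
    # One parent span for the whole loop, one child span per round.
    with tracer.start_as_current_span("claude.tool_loop") as parent:
        parent.set_attribute("prompt.length", len(prompt))
        for round_i in range(MAX_ITERS):
            with tracer.start_as_current_span(f"claude.round.{round_i}") as span:
                # Request arguments are elided here.
                resp = await client.messages.create(...)
                span.set_attribute("stop_reason", resp.stop_reason)
                span.set_attribute("output_tokens", resp.usage.output_tokens)
                if resp.stop_reason != "tool_use":
                    return resp
                # On "tool_use": run the tools and append their results to the
                # conversation before the next round (not shown).
        # Loop cap reached without a final answer: mark the trace and fail loudly.
        parent.set_attribute("loop.exhausted", True)
        raise RuntimeError(f"tool loop did not finish within {MAX_ITERS} rounds")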

Exercise

Add telemetry to one critical Claude code path. Capture latency, tokens, stop_reason, and the feature label. Build one dashboard query that answers 'which feature spent the most yesterday'.

Hint

If your logging library isn't structured (key/value), now is the time to switch: text logs can't answer slice-and-dice questions cheaply.
