Cost engineering: where the money goes

The bill has a predictable shape

In 2026, for most prompt-using systems, cost is dominated by input tokens (contexts are large), then by reasoning tokens (when extended thinking is on), then by output tokens, then by everything else. Caching, model-tier choice, and reasoning routing are the three biggest levers.
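To make that shape concrete, here is a minimal sketch of how one request's cost splits between input and output, and what a cached prefix does to it. All prices and token counts are made-up placeholders rather than any provider's real rates, `Prices` and `request_cost` are hypothetical names, and cache-write surcharges and reasoning tokens are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical USD prices per million tokens (illustrative only)."""
    input: float = 3.00
    cache_read: float = 0.30  # assumed 10x cheaper than uncached input
    output: float = 15.00


def request_cost(prefix_toks: int, fresh_toks: int, output_toks: int,
                 prices: Prices, cached: bool) -> dict[str, float]:
    """Split one request's cost into input and output parts, in USD."""
    # The stable prefix is billed at the cache-read rate only when cached.
    prefix_rate = prices.cache_read if cached else prices.input
    input_usd = (prefix_toks * prefix_rate + fresh_toks * prices.input) / 1e6
    output_usd = output_toks * prices.output / 1e6
    return {"input": input_usd, "output": output_usd,
            "total": input_usd + output_usd}


prices = Prices()
# Made-up request: 40k-token stable prefix, 2k fresh tokens, 500 output tokens.
cold = request_cost(40_000, 2_000, 500, prices, cached=False)
warm = request_cost(40_000, 2_000, 500, prices, cached=True)
print(f"uncached: ${cold['total']:.4f} ({cold['input'] / cold['total']:.0%} input)")
print(f"cached:   ${warm['total']:.4f}")
```

Under these assumed numbers, input accounts for well over 90% of the uncached request, and caching the prefix cuts the total roughly fivefold. The real ratio depends on the provider's cache pricing and on how much of the context is actually stable.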

Levers, in order of impact

Prompt caching: a 5–10× cost reduction on stable prefixes.
Model-tier routing: send routine requests to a mid-tier model; reserve the top tier for hard ones.
Reasoning budget routing: enable reasoning only when it is needed.
Context trimming: a tighter RAG top-K, document compression, dropping turns from history.
Output discipline: schema-bound output is smaller than free-form output.
Provider competition: compare periodically; switch where appropriate.
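The model-tier and reasoning-budget levers above can be combined into a single routing decision. The following is a rough sketch, not a prescribed design: the tier names, the `difficulty` score, the 0.7 threshold, and the 8,000-token budget are all hypothetical placeholders that would need tuning against measured quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str            # placeholder tier name, not a real model ID
    thinking_budget: int  # max reasoning tokens; 0 means reasoning is off


MID = Route(model="mid-tier", thinking_budget=0)
TOP = Route(model="top-tier", thinking_budget=0)
TOP_THINKING = Route(model="top-tier", thinking_budget=8_000)


def choose_route(difficulty: float, needs_multistep: bool) -> Route:
    """Pick the cheapest route expected to handle the request.

    `difficulty` is a hypothetical score in [0, 1] from some upstream
    classifier or heuristic; the 0.7 cutoff is an arbitrary placeholder.
    """
    if difficulty < 0.7:
        # Routine requests stay on the cheaper tier with reasoning disabled.
        return MID
    # Hard requests get the top tier; reasoning is enabled only if needed.
    return TOP_THINKING if needs_multistep else TOP
```

Keeping the decision in one small function makes it easy to log which route each request took, which in turn makes it possible to check whether the routing actually saves money without hurting quality.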

What not to optimize

Don't trade quality for cost without measuring the quality drop.
Don't optimize average cost when tail cost is the issue (one slow long-context call can drown out 1,000 cheap ones).
Don't optimize cost without also accounting for the operational complexity it adds.
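The point about tail cost is easy to check on real logs. As a minimal sketch, assuming per-request costs are already available as a list of USD floats (the `cost_profile` helper and the data below are synthetic, for illustration only), one can compare the mean and median against the share of spend coming from the most expensive 1% of requests:

```python
import statistics


def cost_profile(costs_usd: list[float]) -> dict[str, float]:
    """Summarize a batch of per-request costs, including tail concentration."""
    if not costs_usd:
        raise ValueError("costs_usd must not be empty")
    ordered = sorted(costs_usd)
    total = sum(ordered)
    top_n = max(1, len(ordered) // 100)  # the most expensive 1% of requests
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "top_1pct_share": sum(ordered[-top_n:]) / total,
    }


# Synthetic data: 1,000 cheap calls plus one expensive long-context call.
costs = [0.001] * 1000 + [2.00]
profile = cost_profile(costs)
print(profile)
```

In this synthetic batch the single expensive call costs more than the other 1,000 combined, so the top 1% of requests carries roughly two thirds of the spend while the median stays at the cheap-call price. Shaving the average would miss where the money goes.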

Code

Per-request cost attribution (Python)

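from dataclasses import dataclass


@dataclass
class CostBreakdown:
    """Token counts for one request, split by billing category."""
    input_tokens: int        # input billed at the full rate (assumed to exclude cache traffic)
    cache_read_tokens: int   # input tokens served from the prompt cache
    cache_write_tokens: int  # input tokens written into the prompt cache
    output_tokens: int       # visible completion tokens
    thinking_tokens: int     # reasoning tokens, priced separately here

    def total_usd(self, rates) -> float:
        """Return the request cost in USD.

        `rates` is any object exposing the per-token USD prices used below.
        """
        return (
            self.input_tokens * rates.input_per_tok
            + self.cache_read_tokens * rates.cache_read_per_tok
            + self.cache_write_tokens * rates.cache_write_per_tok
            + self.output_tokens * rates.output_per_tok
            + self.thinking_tokens * rates.thinking_per_tok
        )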
