LLM Cost Audit
llm-cost-audit.md · updated 2026-06-12
Audits every LLM call site in a codebase for cost efficiency: model-tier mismatches, missing prompt/response caching, token waste, and absent usage controls. Returns a ranked list of savings with estimated impact. Built from real work cutting AI infrastructure costs on a production generative AI platform.
Use this when
- AI infrastructure spend is growing faster than usage
- Before scaling an AI feature to more users
- Someone says "make the AI cheaper" and you need a plan, not a guess
SKILL.md
---
name: llm-cost-audit
description: Audit a codebase's LLM usage for cost. Use when AI infrastructure spend is growing, before scaling an AI feature, or when asked to "make the AI cheaper" — finds caching wins, model-tier mismatches, token waste, and missing usage controls.
---
# LLM Cost Audit
You are auditing every LLM call in this codebase for cost efficiency. The goal is a ranked list of savings with estimated impact — not a rewrite. Real platforms have cut 30–60% of AI infra cost with the patterns below without hurting quality.
## Step 1 — Inventory every call site
Search for LLM SDK usage (anthropic, openai, generative AI clients, raw fetch calls to inference endpoints, internal gateway wrappers). For each call site record:
- model used, max_tokens, where the prompt comes from
- call frequency (per request? per item in a loop? cron?)
- whether the output is user-facing or internal/intermediate
Flag immediately: LLM calls inside loops over collections, calls on every keystroke/page load, and retries without backoff.
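For Python sources, the inventory can be bootstrapped with a small AST scan. The sketch below is a starting point under stated assumptions, not a complete detector: it only recognizes the method chains `messages.create` (Anthropic SDK) and `chat.completions.create` (OpenAI SDK), and `CallSite` and `find_call_sites` are illustrative names. It will miss gateway wrappers and any LLM call hidden inside a helper function that is itself invoked from a loop.

```python
import ast
from dataclasses import dataclass
from typing import List

# Method chains that usually indicate an LLM call.
# Assumed starting list; extend it with your own gateway wrappers.
LLM_CALL_SUFFIXES = ("messages.create", "chat.completions.create")

LOOP_NODES = (ast.For, ast.AsyncFor, ast.While,
              ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)


@dataclass
class CallSite:
    file: str
    line: int
    call: str
    in_loop: bool


def _dotted_name(node: ast.AST) -> str:
    """Turn an attribute chain such as client.messages.create into a dotted string."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return ".".join(reversed(parts))


def find_call_sites(source: str, filename: str = "<memory>") -> List[CallSite]:
    """Return every recognized LLM call and whether it sits lexically inside a loop."""
    sites: List[CallSite] = []

    def visit(node: ast.AST, in_loop: bool) -> None:
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name.endswith(LLM_CALL_SUFFIXES):
                sites.append(CallSite(filename, node.lineno, name, in_loop))
        # Everything beneath a loop or comprehension runs once per item.
        child_in_loop = in_loop or isinstance(node, LOOP_NODES)
        for child in ast.iter_child_nodes(node):
            visit(child, child_in_loop)

    visit(ast.parse(source), False)
    return sites
```

Sites reported with `in_loop=True` are the first candidates for batching or caching. Call frequency and prompt origin still have to be recorded by hand.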
## Step 2 — Model-tier mismatches
For each call site, ask: does this task need this model?
- Classification, extraction, routing, yes/no checks, title generation → smallest model tier.
- Multi-step reasoning, code generation, user-facing long-form → larger tier, but check if a mid tier was ever evaluated.
- Flag any place where one "default model" constant serves every task in the app — per-task model selection is usually the single biggest lever.
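One way to replace a single default-model constant is a per-task lookup, sketched below. The model names (`example-small`, `example-mid`, `example-large`) are placeholders and the task-to-tier mapping is an assumption that mirrors the bullets above; both should be adjusted after evaluating quality on your own tasks.

```python
from enum import Enum
from typing import Optional


class Task(Enum):
    CLASSIFY = "classify"
    EXTRACT = "extract"
    ROUTE = "route"
    TITLE = "title"
    CODE_GEN = "code_gen"
    LONG_FORM = "long_form"


# Placeholder model identifiers; substitute your provider's actual tiers.
MODEL_BY_TIER = {
    "small": "example-small",
    "mid": "example-mid",
    "large": "example-large",
}

# Assumed defaults: cheap tasks go to the small tier, reasoning-heavy ones to the large tier.
TIER_BY_TASK = {
    Task.CLASSIFY: "small",
    Task.EXTRACT: "small",
    Task.ROUTE: "small",
    Task.TITLE: "small",
    Task.CODE_GEN: "large",
    Task.LONG_FORM: "large",
}


def model_for(task: Task, override_tier: Optional[str] = None) -> str:
    """Pick the model for a task; override_tier lets an experiment try another tier."""
    tier = override_tier or TIER_BY_TASK[task]
    return MODEL_BY_TIER[tier]
```

For example, `model_for(Task.CODE_GEN, override_tier="mid")` lets a shadow test check whether a mid tier is good enough before the default mapping changes.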
## Step 3 — Caching
- **Prompt caching**: system prompts, few-shot examples, and document context that repeat across calls should use the provider's prompt cache. Estimate the hit: repeated prefix tokens × call volume.
- **Response caching**: identical or near-identical requests (same input doc, same question) should hit an application-level cache (Redis keyed on a hash of normalized input). Look for deterministic tasks (temperature 0 or extraction tasks) — those are safe to cache aggressively.
- **Negative caching**: failed or refused generations that will fail again identically should be cached too, so the same doomed request is not paid for repeatedly.
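The response-caching idea can be sketched as follows. This is a minimal illustration, not a production cache: an in-memory dict stands in for Redis, `call_model` is a hypothetical callable wrapping whatever client the codebase uses, and whitespace collapsing is the only normalization applied. It is only appropriate for deterministic tasks.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt: str) -> str:
    """Hash the model plus a whitespace-normalized prompt into a stable key."""
    normalized = " ".join(prompt.split())
    payload = json.dumps({"model": model, "prompt": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Application-level cache in front of an LLM call (dict used in place of Redis)."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model  # hypothetical: (model, prompt) -> completion text
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str) -> str:
        key = cache_key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, prompt)
        self._store[key] = result
        return result
```

The `hits` and `misses` counters give a measured hit rate, which is a better basis for a savings estimate than a guess. A real deployment would also need expiry and invalidation when the prompt template changes.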
## Step 4 — Token waste
- Prompts that ship the whole document when a section would do; history that grows unbounded in multi-turn flows (no summarization or windowing).
- max_tokens set far above what's consumed (wastes nothing directly but hides runaway outputs); missing stop sequences.
- Verbose output formats: asking for JSON with long keys, prose wrappers around structured data, or chain-of-thought that is generated and billed but never shown to users.
- Retries that resend full context on parse failures instead of repairing locally.
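Unbounded history is often the easiest of these to fix. The sketch below shows simple windowing: keep the system messages and as many of the most recent turns as fit a token budget. It assumes messages are dicts with `role` and `content` keys, and `estimate_tokens` is a crude four-characters-per-token heuristic, not a real tokenizer; use the provider's token counter where accuracy matters.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def window_history(messages: List[Message], budget: int) -> List[Message]:
    """Keep system messages plus the newest turns that fit within the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: List[Message] = []
    # Walk backwards so the most recent turns are retained first.
    for message in reversed(turns):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost

    return system + list(reversed(kept))
```

Windowing drops older context entirely. Where earlier turns matter, summarizing them into a single short message is the usual alternative, at the cost of one extra cheap call.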
## Step 5 — Controls and observability
- Per-feature/per-tool usage metering exists? If not, recommend tagging every call with a feature label and recording tokens in/out — you cannot optimize what you don't attribute.
- Spend alerts and per-user/per-tenant rate limits for abuse-prone surfaces.
- A/B or shadow-test path to validate model downgrades safely before committing.
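Per-feature attribution can start very small. The sketch below is an in-memory illustration with hypothetical names (`UsageMeter`, `record`, `report`); it assumes the caller reads input and output token counts from the provider's response and passes them in. A real system would write these records to a metrics or logging backend.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Usage:
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


class UsageMeter:
    """Accumulates token usage per feature label."""

    def __init__(self) -> None:
        self._by_feature: Dict[str, Usage] = defaultdict(Usage)

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> None:
        usage = self._by_feature[feature]
        usage.calls += 1
        usage.input_tokens += input_tokens
        usage.output_tokens += output_tokens

    def report(self) -> List[Tuple[str, Usage]]:
        """Features sorted by total tokens, heaviest first."""
        return sorted(self._by_feature.items(),
                      key=lambda item: item[1].total_tokens,
                      reverse=True)
```

Calling `meter.record("search", input_tokens, output_tokens)` after every LLM response is enough to see which feature dominates spend, which is what the rest of the audit needs in order to rank findings.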
## Output format
1. **Top savings, ranked** — each with: call site (file:line), current pattern, proposed change, estimated % of that call's cost saved, risk level, and how to validate quality is unchanged.
2. **Quick wins** — changes safe to ship this week.
3. **Instrumentation gaps** — what must be measured before further optimization.
Estimates may be rough — state assumptions (volume, token counts). A directionally-correct ranked list beats false precision.
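Ranking by percentage alone can mislead, since a 90% saving on a rarely used call matters less than 50% on a hot path. The sketch below turns the stated assumptions into a rough monthly dollar figure and sorts by it. All field names are illustrative, and the per-million-token prices are inputs you supply from your provider's price list, not values this skill knows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Finding:
    call_site: str               # file:line
    monthly_calls: int           # assumed volume
    input_tokens: int            # assumed average per call
    output_tokens: int           # assumed average per call
    input_price_per_mtok: float  # dollars per million input tokens (you supply this)
    output_price_per_mtok: float # dollars per million output tokens (you supply this)
    pct_saved: float             # estimated fraction of this call's cost saved, 0.0 to 1.0


def monthly_cost(f: Finding) -> float:
    """Estimated current monthly spend for one call site."""
    per_call = (f.input_tokens * f.input_price_per_mtok
                + f.output_tokens * f.output_price_per_mtok)
    return f.monthly_calls * per_call / 1_000_000


def rank_findings(findings: List[Finding]) -> List[Tuple[float, Finding]]:
    """Pair each finding with its estimated monthly saving, largest first."""
    scored = [(monthly_cost(f) * f.pct_saved, f) for f in findings]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

The output is only as good as the volume and token assumptions, so report those alongside each figure.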
Install in Claude Code
mkdir -p ~/.claude/skills/llm-cost-audit && curl -fsSL https://harshrastogi.tech/skills/llm-cost-audit.md -o ~/.claude/skills/llm-cost-audit/SKILL.md
Then ask Claude Code for the task; the skill is picked up automatically. For a project-scoped install, use .claude/skills/ inside your repo instead.
Using a different agent?
Skills are plain markdown. Paste the file into any capable AI assistant alongside your task, or wire it into any agent framework that supports system instructions.
Tags
LLM, Cost Optimization, Caching, Observability