How much can prompt caching actually save on Claude Code?

On reusable system prompts and large file contexts, cache hits cost 10% of the standard input rate, a 90% discount on cached tokens. Real-world teams report 60–90% total input reduction once caching, context compaction, and Sonnet-default routing are combined. The savings are largest on long sessions where the same project context is read dozens of times.

Should I use the 5-minute or 1-hour cache TTL?

Default to 5-minute. The 1-hour TTL writes cost 2x the standard input rate versus 1.25x for 5-minute, so you only break even after two cache reads inside the hour. Switch to 1-hour only when telemetry shows your re-read pattern justifies it. Long async refactors and overnight agent runs are the obvious candidates.

When should I use Opus 4.7 versus Sonnet 4.6 inside Claude Code?

Default to Sonnet 4.6 for almost everything: edits, refactors, test writing, code review. Escalate to Opus 4.7 for hard architectural decisions, multi-file debugging where Sonnet has already failed once, and schema or API design where errors are expensive to unwind. On ofox.ai pricing, Sonnet input is $3/M versus Opus $5/M, and the output gap is wider at $15/M versus $25/M.

Do subagents actually save tokens?

Sometimes. A subagent isolates verbose work (codebase greps, long file reads) in its own context window and returns only a summary to the parent. That helps when the verbose output is genuinely throwaway. But a subagent-heavy workflow can consume around 7x the tokens of a single-thread session because of spawn overhead, so reserve them for tasks where the parent context would otherwise balloon.

How do I tell which MCP server is eating my context?

Run `/context` in Claude Code to see the token breakdown. In the traditional preload model, heavy offenders like Jira can consume around 17K tokens alone and a five-server, 58-tool setup often sits at ~55K tokens before you type a single character. Current Claude Code defers MCP tool definitions by default, so the overhead is mostly tool names until Claude actually invokes a tool — but each connected server still adds surface area, so disable any you don't need this session.

May 13, 2026

claude-codeclaudetutorialcost-optimization

Claude Code Token Optimization 2026: 5 Strategies That Cut Your API Bill by 60-90%

TL;DR: Claude Code can burn through a four-figure API bill over a weekend if you let it stream raw file contents and reload the same system prompt every turn. Five techniques actually shift the cost curve in production: aggressive prompt caching with the right TTL, disciplined use of /compact and /clear, Sonnet-default routing with selective Opus escalation, isolated subagents for high-context work, and trimming MCP tool bloat. Teams running all five report 60–90% reductions on real workloads.

The bill isn’t an Anthropic problem. It’s a workflow problem. Claude Code happily re-reads your entire repo on every turn unless you tell it not to.

Why your Claude Code bill is bigger than your Cursor bill

Cursor charges a flat $20. Claude Code charges per token, and a hot session with three MCP servers loaded, no prompt caching, and Opus on default routing will incinerate $100 of credits before lunch. The fix isn’t a different agent. The fix is understanding what Claude Code actually sends on the wire.

Every turn ships:

The system prompt (Claude Code’s internal instructions, large and stable, perfect cache candidate)
The full conversation history so far
Every MCP tool’s schema (whether or not you’ll use it this turn)
File contents you read with the Read tool (re-sent on every subsequent turn)
Your actual message

Points 1, 3, and 4 are the ones bleeding you. The five strategies below kill each leak in turn.

Strategy 1: Prompt caching, pay 10% for everything you re-read

Cache hits cost 0.1× standard input on Anthropic’s published pricing, a flat 90% discount on cached tokens. The catch is that cache writes cost more: 1.25× standard input for the 5-minute TTL, or 2× for the 1-hour TTL. That math has a hard implication: 5-minute caching pays off after a single hit; 1-hour caching needs at least two hits to break even.

For Claude Sonnet 4.6 (input $3/M), a cache read is $0.30/M instead of $3/M. For Claude Opus 4.7 (input $5/M on ofox.ai), a cache read is $0.50/M instead of $5/M. On a 50,000-token system prompt that you hit 40 times in a session, you’ve saved roughly $5.94 on Sonnet, and that’s just one cached block.

What to cache:

System prompts and CLAUDE.md content: large, stable, hit every turn
Tool definitions: same payload turn after turn
Reference documents you’ll re-query: schemas, design docs, the codebase tree
Few-shot examples: perfect static payload

How to enable it in raw SDK calls:

client.messages.create(
    model="anthropic/claude-sonnet-4.6",
    system=[
        {
            "type": "text",
            "text": large_system_prompt,
            "cache_control": {"type": "ephemeral"}  # default 5m TTL
        }
    ],
    messages=conversation
)

For the 1-hour variant, only switch when your telemetry justifies it:

"cache_control": {"type": "ephemeral", "ttl": "1h"}

The March 2026 cache regression: check your bills

If your Claude Code bill jumped 20–30% in early March without anything in your workflow changing, you weren’t imagining it. Anthropic silently rolled back the default 1-hour TTL that Claude Code had been using since February, reverting to 5-minute. Long async sessions that previously reused a single cache write started writing the cache every five minutes instead, inflating cache creation costs by a measured 20–32%.

The fix on the user side: explicitly request 1h TTL in your config if your workload pattern justifies it, and monitor cache hit ratio through the API response’s cache_creation_input_tokens and cache_read_input_tokens fields. If reads outnumber writes by more than 2:1, you’re winning. If not, your cache strategy is upside down.

Strategy 2: `/compact` early, `/clear` often

Claude Sonnet 4.6 and Opus 4.7 now ship with a 1M token context window at standard pricing (generally available since March 13, 2026 — no beta header, no premium tier required). That sounds like room to breathe, but the naive failure mode is the same as before: letting a session drift toward the auto-compact trigger and getting a bad summary because the context is already noisy.

/compact summarizes the conversation and replaces the raw history with that summary. The earlier you trigger it, the better the summary, because Claude has more headroom to retain what matters and drop what doesn’t. A reasonable trigger threshold is ~60% context utilization, not 90%. Many engineers also pass custom hints: /compact Focus on code samples and API usage.

/clear wipes everything. No summary, no carryover. Use it when you switch to genuinely unrelated work. Finishing a frontend bug and starting on a migration script is a /clear moment, not a /compact moment. Carrying stale context into unrelated work means you’re paying input tokens on irrelevant history forever.

A working discipline:

Same feature, ongoing thread → keep going
New ticket, same area of the codebase → /compact first, narrate a one-line preface
New domain entirely → /clear

KDnuggets’ 7 practical ways to reduce Claude Code token usage places proactive compaction in the top three levers — the qualitative point is that compacting early, while the conversation is still clean, produces a usable summary; compacting at 90%+ produces noise. Anthropic’s own cost guide recommends the same pattern.

Strategy 3: Default Sonnet 4.6, escalate to Opus 4.7 only when stuck

The single highest-leverage configuration change in Claude Code is /model claude-sonnet-4.6. Sonnet 4.6 is genuinely strong on coding tasks; Anthropic’s own benchmarks have it close to Opus on SWE-bench while running at 60% of the input price and 60% of the output price.

On ofox.ai pricing:

Model	Input ($/M)	Output ($/M)
Claude Sonnet 4.6	$3	$15
Claude Opus 4.6	$5	$25
Claude Opus 4.7	$5	$25

When to actually reach for Opus 4.7:

Cross-file architectural decisions where you need consistency across many edits
Hard debugging where Sonnet has already failed once on the same problem
Schema or API contract design where mistakes propagate to downstream consumers
Migration logic where edge cases matter and Sonnet has been hallucinating column names

When Sonnet is the correct default:

Edits to a known file
Test writing
Refactors with clear scope
Code review and explanation
Most agentic loops (file reads, greps, simple edits)

A practical pattern is the “Sonnet does the legwork, Opus does the thinking” split: Sonnet handles the actual file operations, and you only /model opus for design discussions and stuck moments. See our Claude Code hybrid routing pattern for the configuration details. Our Claude Opus 4.7 API review covers when 4.7’s improvements over 4.6 are worth paying for and when they aren’t.

Strategy 4: Isolate verbose work in subagents

A subagent is a Claude instance that runs in its own context window. The parent invokes it with a task, the subagent burns through its own context doing the work, and only a summary returns to the parent. Net result: 50K tokens of grep output never poisons your main thread.

Where subagents pay off:

Codebase exploration: “find all callers of processPayment and summarize their patterns”
Large file reads: reading a 10K-line vendored library to extract one API surface
Research tasks: web search results, doc dumps, long log analysis
Test failure investigation: parsing 50KB of pytest output

Where subagents waste tokens:

Trivial shell commands (the spawn overhead is more than the task)
Single-file edits the parent could do directly
Anything where the parent will need the full output anyway

The cost trap: a subagent-heavy workflow can hit roughly 7× the token cost of a single-thread session because each subagent reloads system prompts and tool definitions. Reserve subagents for tasks where the verbose intermediate output is genuinely throwaway. If the parent is going to need the raw output, doing the work inline is cheaper.

Our Claude Code hooks, subagents, and skills guide covers the configuration in depth, including custom subagent definitions for repeated tasks.

Strategy 5: Audit your MCP servers, most aren’t worth their token tax

This is the most under-discussed lever. Modern Claude Code defers MCP tool definitions by default — only tool names enter context until Claude actually calls a specific tool — but the savings only materialise if you actually keep your server list lean. Real-world numbers from Anthropic’s own advanced tool use writeup:

A five-server setup with 58 tools: ~55K tokens of overhead in the traditional preload model
Jira MCP alone: ~17K tokens
Anthropic’s own internal testing pre-optimization: 134K tokens of tool definitions
With Tool Search enabled: 58 tools collapse to ~8.7K total context consumption, an ~84% reduction

That’s a third of your context window paid as tax before you’ve done anything if you’re on an older client without lazy loading. Even with deferred definitions, every connected MCP server still adds tool names to the system prompt and increases the surface area Claude has to reason about.

What to do, ranked by impact:

Run /context to see your actual tool-definition spend. If you’re surprised, you’re losing money.
Disable MCP servers you don’t need this session. Most workflows touch 1–2 servers at a time. Run /mcp to see configured servers and disable the rest.
Confirm Tool Search is active. In current Claude Code, MCP tool definitions are deferred by default; on older clients or direct SDK calls, enable Anthropic’s Tool Search explicitly.
Prefer CLI tools over MCP servers where they exist. gh, aws, gcloud, and sentry-cli add zero per-tool listing overhead because Claude invokes them through Bash instead of via tool schemas.
Cache the remaining tool definitions. They’re stable across turns, so they belong in the cached system prefix. This compounds with Strategy 1.
Don’t connect MCP servers globally if you only need them in one repo. Use project-scoped settings.

A reasonable rule of thumb: if a tool’s definition is bigger than the tool actually does for you, it shouldn’t be loaded. The convenience of having every MCP server available “just in case” is paid for in every input token on every turn for the rest of the session.

Putting it together: a 60–90% target is realistic

The savings stack multiplicatively. A team that does all five (caching, compaction discipline, Sonnet-default routing, careful subagent use, and a slimmed MCP loadout) is comparing a $1,600 weekend bill to roughly $200–400 for the same work. The largest individual lever is usually #1 (caching) on long sessions or #3 (Sonnet-default) on short ones. The biggest hidden lever is #5 (MCP audit), because the tokens involved never appear in your conversation.

If you’re routing through ofox.ai instead of direct Anthropic, you get the cache-aware pricing automatically plus Opus and Sonnet at a discount to direct list price. Our complete guide to reducing AI API costs covers cross-provider patterns, and the Claude Code ofox configuration guide walks through the actual setup. For a cost-aware end-to-end stack, the $30/month AI coding stack guide shows the full pattern.

A subagent doesn’t make Claude Code cheaper. A subagent makes one specific kind of expensive work cheaper. An MCP audit makes everything cheaper. Start there if you want the biggest win in the shortest time.

Sources

Anthropic Prompt Caching Documentation: cache pricing, TTL options, cache_control syntax
Anthropic Claude API Pricing: Sonnet, Opus, Haiku base rates
Claude Code GitHub Issue #46829: March 2026 TTL regression
Anthropic Engineering: Advanced Tool Use: Tool Search and on-demand tool loading
ofox.ai Models Catalog: current Claude pricing on the ofox gateway
Claude Code Cost Management Docs: official optimization guidance

Claude Code Token Optimization 2026: 5 Strategies That Cut Your API Bill by 60-90%

Why your Claude Code bill is bigger than your Cursor bill

Strategy 1: Prompt caching, pay 10% for everything you re-read

The March 2026 cache regression: check your bills

Strategy 2: /compact early, /clear often

Strategy 3: Default Sonnet 4.6, escalate to Opus 4.7 only when stuck

Strategy 4: Isolate verbose work in subagents

Strategy 5: Audit your MCP servers, most aren’t worth their token tax

Putting it together: a 60–90% target is realistic

Sources

Related Articles

Claude Haiku 4 API: The Budget Developer's Guide to Production-Grade AI

Claude Code Safety Guide: Prevent Accidental File Deletion with Hooks, Permissions & Git Worktrees

The $30/Month AI Coding Stack That Replaces $200 Subscriptions: A 2026 Setup Guide

Strategy 2: `/compact` early, `/clear` often