Anthropic vs OpenAI Prompt Caching 2026: Cost Math + 3 Cache-Miss Fixes
The cached-input discount race is over. Anthropic’s and OpenAI’s newest flagships both bill cached reads at 10% of the standard input rate, a 90% discount that started as Anthropic’s signature and was quietly matched when GPT-5.4 launched. The interesting question is no longer “who has the bigger discount.” It is which provider’s cache mechanics survive your actual prompt patterns, and which three mistakes are silently keeping your hit rate near zero.
This article walks through the math, ranks the three cache-miss patterns we see most often in production, and ends with an A/B harness you can drop into a Python or Node service routed through ofox to compare both providers without rewriting cache logic.
30-Second Answer: Which Cache Wins for Your Workload?
| If your workload is… | Pick | Why |
|---|---|---|
| Long system prompt (≥4k tok) + RAG documents, ≥10 hits per write | Anthropic | Explicit cache_control on tools + system + retrieved docs separately; 4 cache breakpoints let you extend caching past the system prompt |
| Stable prefix-style prompt, automatic caching is fine | OpenAI | No cache_control to manage; first call billed at the standard input rate, no 1.25× write fee |
| Mixed multi-provider routing with fallback | Both, A/B | Same prefix economics via ofox; switch on cost-per-1k-cached-tokens by hour |
| Cache hit rate currently <40% | Fix the prompt first | Provider choice does not matter until you stop the three cache-miss patterns below |
If your hit rate is below 40%, the provider choice will not save you. Fix the three cache-miss patterns described below first, then run the math.
TL;DR: Real Monthly Bill on 10M Cached Tokens/Day
The headline question is “Anthropic 90% vs OpenAI 50%” — but in 2026 it has split by model generation:
| Workload | Provider | Model | Standard input bill | With cache (80% hit) | Savings |
|---|---|---|---|---|---|
| 10M tok/day cached prefix | Anthropic | anthropic/claude-sonnet-4.6 | $900/mo | $297/mo | 67% |
| 10M tok/day cached prefix | OpenAI | openai/gpt-5.5 | $1,500/mo | $420/mo | 72% |
| 10M tok/day cached prefix | OpenAI | openai/gpt-4.1 (legacy) | $600/mo | $360/mo | 40% |
Numbers assume a 30-day month, an 80% cache hit rate, a 5-minute TTL, and misses billed as cache writes (1.25× input on Anthropic, standard input rate on OpenAI). The Sonnet 4.6 base price ($3/M input) is illustrative; confirm current rates on the ofox models page.
The OpenAI GPT-4.1 line is the row that surprises teams running 2024-era code in 2026. The shape of your bill changed when you stayed on the old model.
The Headline That Aged: 90% Was Anthropic-Only Until GPT-5.4
Anthropic introduced prompt caching as a beta feature in August 2024 with a clean economic structure: cache reads at 0.1× the standard input rate (a 90% discount), cache writes at 1.25× the standard rate for a 5-minute TTL, 2× for a 1-hour TTL. The 90% read discount became the marketing line.
OpenAI shipped automatic caching later that year. The discount started at 50% across the gpt-4o family — half off cached input, no developer action required. That gap is what generated the “Anthropic wins on cache pricing” consensus, repeated in dozens of blog posts and still being cited in 2026.
Two things changed between then and now:
- GPT-5.4 and GPT-5.5 dropped cached input to $0.50/M against $5/M standard. That is 90%, identical to Anthropic’s read multiplier. The discount tier moved with the model generation, not the platform.
- Anthropic kept the 1.25× write fee. OpenAI’s automatic cache has no separate write line item: the first call is billed at the standard input rate, and subsequent matched prefixes are billed at the cached rate. Anthropic’s write fee is charged again each time the 5-minute TTL lapses and the prefix has to be rewritten.
The implication: for high cache-hit workloads (≥5 hits per write cycle), Anthropic’s 1.25× write fee is amortized away and the providers tie on read economics. For low cache-hit workloads (<3 hits per write cycle), OpenAI’s no-write-fee structure wins by a few percent. For dynamic prompts that never sustain a prefix at all, neither saves you anything.
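The amortization argument can be checked with a few lines of arithmetic. The sketch below uses only the multipliers quoted above (0.1× reads, 1.25× Anthropic writes, 1.0× for OpenAI's first uncached call); the function name and the idea of a "cycle" as one write followed by n reads are illustrative assumptions, not provider terminology.

```python
def prefix_cost_multiplier(hits_per_write: int, write_mult: float, read_mult: float = 0.1) -> float:
    """Average cost per prefix token, in units of the standard input rate.

    One cycle is one cache write (or first uncached call) followed by
    `hits_per_write` cache reads.
    """
    calls = hits_per_write + 1
    return (write_mult + read_mult * hits_per_write) / calls


for n in (1, 2, 5, 10, 50):
    anthropic = prefix_cost_multiplier(n, write_mult=1.25)  # 5-minute TTL write fee
    openai = prefix_cost_multiplier(n, write_mult=1.0)      # no separate write fee
    gap = anthropic - openai
    print(f"{n:>3} hits/write: Anthropic {anthropic:.3f}x, OpenAI {openai:.3f}x, gap {gap:.3f}x")
```

Under these assumptions the gap between the two is always 0.25 / (n + 1) of the input rate on the cached prefix, so it shrinks quickly as hits per write grow.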
Quick Specs: Anthropic vs OpenAI Prompt Caching
| Spec | Anthropic Claude | OpenAI GPT |
|---|---|---|
| Discount on cache read | 90% (0.1× input) | 50% (gpt-4o, gpt-4.1) / 90% (gpt-5.4, gpt-5.5) |
| Cache write fee | 1.25× input (5min), 2× (1hr) | None — first call at standard rate |
| Activation | Explicit cache_control marker | Automatic, prefix detection |
| Minimum tokens | 1,024 (Sonnet 4.6, Opus 4.8); 4,096 on older Claude models (Opus 4.7/4.6/4.5, Haiku 4.5) | 1,024 |
| Default TTL | 5 minutes ephemeral | 5–10 minutes of inactivity |
| Extended TTL | 1 hour at 2× write | Up to 24 hours (GPT-5.4/5.5 family, extended retention) |
| Cache break points | Up to 4 explicit markers per request | None — automatic prefix only |
| Cacheable content | Tools, system, messages, images, tool_use/tool_result | Stable text prefix only |
| Optional routing key | N/A (cache scoped to org + model) | prompt_cache_key for routing optimization |
| Cache hit signal in response | cache_creation_input_tokens, cache_read_input_tokens | usage.prompt_tokens_details.cached_tokens |
Sources: Anthropic prompt caching docs, OpenAI prompt caching guide. Verified 2026-06-10.
The row that matters most for cost modeling is Cache break points. OpenAI’s automatic cache locks onto the longest stable prefix it can find from the start of the request. Anthropic lets you place up to four cache_control markers anywhere in the request — meaning a retrieved document inserted mid-conversation can still cache, and the system prompt + tools can cache as a separate block from a retrieved document below it. For RAG workloads, this is a real structural advantage. For straight chat with no document retrieval, the OpenAI automatic model is simpler and equivalently priced on GPT-5.5.
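As a rough illustration of explicit breakpoints, the sketch below builds a request body in the shape of Anthropic's native Messages API with three cache_control markers: one at the end of the tool list, one on the system block, and one on a retrieved document. The helper names, the placeholder model string, and the sample tool are assumptions for illustration; a gateway's OpenAI-compatible endpoint may expect a different shape.

```python
def ephemeral():
    """Fresh cache marker (default 5-minute TTL)."""
    return {"type": "ephemeral"}


def build_request(model, system_text, tools, document_text, question):
    """Build a Messages-style request body with three cache breakpoints."""
    tools = [dict(t) for t in tools]  # shallow copies so the caller's list is untouched
    if tools:
        tools[-1]["cache_control"] = ephemeral()  # breakpoint 1: end of tool definitions
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            {"type": "text", "text": system_text, "cache_control": ephemeral()},  # breakpoint 2
        ],
        "messages": [
            {
                "role": "user",
                "content": [
                    # breakpoint 3: the retrieved document caches separately
                    {"type": "text", "text": document_text, "cache_control": ephemeral()},
                    {"type": "text", "text": question},  # dynamic tail, not cached
                ],
            }
        ],
    }


def count_breakpoints(node):
    """Count cache_control markers anywhere in the body (the limit is four)."""
    if isinstance(node, dict):
        return int("cache_control" in node) + sum(count_breakpoints(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_breakpoints(v) for v in node)
    return 0


sample_tools = [{"name": "web_search", "description": "Search the web", "input_schema": {"type": "object"}}]
body = build_request("placeholder-model-id", "Static system prompt...", sample_tools,
                     "Retrieved document text...", "What does section 2 say?")
print(count_breakpoints(body))
```

Placing the document marker on its own block is what lets the system and tools prefix keep hitting even when the retrieved document changes between requests.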
Real Cost Math: Walking Through a 50M-Cacheable-Token Day
Assume the following workload: an agent service handling 10,000 requests per day, each with a 5,000-token system prompt + tools + retrieved context, plus a 500-token user message, plus a 1,000-token response. The first 5,000 tokens are the cache target.
Daily cacheable input: 10,000 requests × 5,000 tokens = 50M tokens/day. Of those, assume 80% hit a warm cache and 20% miss (write). That gives 40M read tokens/day and 10M write tokens/day.
Anthropic Claude Sonnet 4.6 (illustrative base $3/M input, $15/M output)
| Line item | Rate | Tokens/day | Daily cost |
|---|---|---|---|
| Cache reads | $0.30/M (0.1× input) | 40M | $12.00 |
| Cache writes (5min) | $3.75/M (1.25× input) | 10M | $37.50 |
| Standard input (user msgs) | $3/M | 5M | $15.00 |
| Output | $15/M | 10M | $150.00 |
| Total | | | $214.50/day |
Without caching: 50M × $3/M = $150 + $15 user + $150 output = $315/day. Savings: $100.50/day, or 32% off the daily bill.
OpenAI GPT-5.5 (verified $5/M input, $30/M output, $0.50/M cached input)
| Line item | Rate | Tokens/day | Daily cost |
|---|---|---|---|
| Cache reads | $0.50/M | 40M | $20.00 |
| Cache misses (standard rate, no write fee) | $5/M | 10M | $50.00 |
| Standard input (user msgs) | $5/M | 5M | $25.00 |
| Output | $30/M | 10M | $300.00 |
| Total | | | $395.00/day |
Without caching: 50M × $5/M = $250 + $25 user + $300 output = $575/day. Savings: $180/day, 31% off the daily bill.
Bill comparison
Both providers save roughly the same proportional amount (≈31–32%) at this hit rate. The absolute dollar gap between Sonnet 4.6 and GPT-5.5 reflects the underlying input/output rates, not the caching mechanism. At 80% hit rate, the “90% cached read” headline drops to a “~32% total bill reduction” reality because output tokens dominate the bill and writes still cost money.
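The leverage point is hit rate, not provider. Each percentage point of hit rate on this Sonnet workload moves 0.5M tokens/day from the $3.75/M write rate to the $0.30/M read rate, worth about $1.73/day. Move from 80% to 95% and the daily cost drops by roughly $26, about 12% of the $214.50 bill. Move from 80% to 40% and you pay about $69/day more in extra writes. The patterns in the next section are the difference.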
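The two cost tables can be reproduced, and rerun at other hit rates, with a small model. This is a sketch under the article's workload assumptions (10,000 requests/day, 5,000-token prefix, 500-token user message, 1,000-token response); the function name and parameter names are illustrative, and the rates are the ones quoted in the tables.

```python
def daily_cost(hit_rate, input_rate, output_rate, read_mult, write_mult,
               requests=10_000, prefix_tokens=5_000, user_tokens=500, output_tokens=1_000):
    """Daily bill in dollars. Rates are dollars per million tokens."""
    prefix_m = requests * prefix_tokens / 1e6           # cacheable tokens/day, in millions
    reads = prefix_m * hit_rate * input_rate * read_mult
    writes = prefix_m * (1 - hit_rate) * input_rate * write_mult
    user = requests * user_tokens / 1e6 * input_rate    # uncached user messages
    output = requests * output_tokens / 1e6 * output_rate
    return reads + writes + user + output


for hit_rate in (0.40, 0.80, 0.95):
    sonnet = daily_cost(hit_rate, input_rate=3, output_rate=15, read_mult=0.1, write_mult=1.25)
    gpt = daily_cost(hit_rate, input_rate=5, output_rate=30, read_mult=0.1, write_mult=1.0)
    print(f"hit rate {hit_rate:.0%}: Sonnet 4.6 ${sonnet:.2f}/day, GPT-5.5 ${gpt:.2f}/day")
```

Setting hit_rate to 0 with write_mult=1.0 gives the uncached baseline, which is a convenient way to recompute the savings percentage for a different workload shape.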
3 Cache-Miss Patterns to Fix First
These are ranked by frequency in production code review, not by severity. Each one looks correct on the first read and breaks silently — the only signal is your cache_read_input_tokens field staying near zero while your spend climbs.
| Pattern | What it does | Typical hit-rate impact | Fix difficulty |
|---|---|---|---|
| 1. Mutable system prompt | Inserts a timestamp/UUID/username into the cached prefix | Drops to 0% | 5 minutes |
| 2. Non-deterministic tool serialization | Tool definitions render in different byte order across calls | 0–40% depending on Python/Node version | 15 minutes |
| 3. Sliding-window history | Drops oldest message each turn, shifting every later message and invalidating the cached conversation prefix | Drops to 0% after window fills | 30 minutes (requires window strategy change) |
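Before hunting for any of these patterns, it helps to turn the usage fields into a single number you can log per request. The helper below is a minimal sketch that accepts either usage shape as a plain dict; it relies on Anthropic's input_tokens excluding cached tokens and OpenAI's prompt_tokens including them. The function name is an assumption, and the result is a token-level hit rate for one response, which is related to but not the same as the per-request hit rate used in the cost math.

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache, for either usage shape."""
    if "cache_read_input_tokens" in usage or "cache_creation_input_tokens" in usage:
        # Anthropic shape: input_tokens excludes cached reads and cache writes
        read = usage.get("cache_read_input_tokens") or 0
        written = usage.get("cache_creation_input_tokens") or 0
        total = (usage.get("input_tokens") or 0) + read + written
    else:
        # OpenAI shape: prompt_tokens already includes cached_tokens
        details = usage.get("prompt_tokens_details") or {}
        read = details.get("cached_tokens") or 0
        total = usage.get("prompt_tokens") or 0
    return read / total if total else 0.0


anthropic_usage = {"input_tokens": 500, "cache_read_input_tokens": 5000, "cache_creation_input_tokens": 0}
openai_usage = {"prompt_tokens": 5500, "prompt_tokens_details": {"cached_tokens": 5376}}
print(f"{cache_hit_rate(anthropic_usage):.1%}")
print(f"{cache_hit_rate(openai_usage):.1%}")
```

Logging this value alongside a request ID makes a silent regression visible as soon as a deploy changes the prefix.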
Pattern 1: The Mutable System Prompt
The bug looks like this:
system_prompt = f"""You are a helpful assistant.
Current time: {datetime.utcnow().isoformat()}
Available tools: web_search, code_interpreter
User ID: {user.id}
..."""
Every call has a different Current time value, so every call has a different byte prefix. Cache hit rate: 0%. The fix is to move the dynamic values to the end of the user message or to a separate non-cached block:
system_prompt = """You are a helpful assistant.
Available tools: web_search, code_interpreter
...""" # static, cacheable
user_message = f"Current time: {datetime.utcnow().isoformat()}\nUser ID: {user.id}\n\n{actual_question}"
For Anthropic, this lets you place cache_control on the system block. For OpenAI, it lets the automatic prefix detector latch onto the static system content. Both providers benefit identically.
The variant of this bug that hides longer: a “static” template that gets re-rendered from a set or dictionary whose iteration order is not fixed, for example a set of strings (ordering varies with the per-process hash seed) or a dict rebuilt in a different insertion order after a config refresh. The output looks the same at a glance; it is not the same byte sequence when the cache hashes it.
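One way to catch this class of drift is to log a short fingerprint of the exact text that forms the cacheable prefix and alert when it changes between calls that should share a cache entry. The sketch below is an assumption-laden illustration: prefix_fingerprint and render_system_prompt are hypothetical helpers, and the two explicit orderings stand in for whatever unordered source the real template reads from.

```python
import hashlib


def prefix_fingerprint(*parts: str) -> str:
    """Short SHA-256 digest of the exact text that forms the cacheable prefix."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return digest.hexdigest()[:16]


def render_system_prompt(features) -> str:
    # Hypothetical template: the feature list is interpolated into the prompt
    return "You are a helpful assistant.\nEnabled features: " + ", ".join(features)


# Same features, different iteration order: the prefix bytes differ
a = prefix_fingerprint(render_system_prompt(["search", "code"]))
b = prefix_fingerprint(render_system_prompt(["code", "search"]))

# Sorting before rendering makes the prefix stable
c = prefix_fingerprint(render_system_prompt(sorted(["search", "code"])))
d = prefix_fingerprint(render_system_prompt(sorted(["code", "search"])))

print(a == b)  # False
print(c == d)  # True
```

If the fingerprint differs across workers or deploys for what should be the same prompt, the cache will miss no matter which provider serves the request.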
Pattern 2: Non-Deterministic Tool Serialization
Tool definitions are typically JSON-serialized before sending. If you build them like this:
tools = []
for tool_module in plugin_manager.discover_tools():
    tools.append({
        "name": tool_module.name,
        "description": tool_module.description,
        "input_schema": tool_module.schema,
    })
The order of plugin_manager.discover_tools() is whatever the filesystem returns. On a different worker, after a deploy, or on a different OS, the order is different. The same tools, different bytes, no cache hit.
The fix is to sort deterministically before sending:
tools = sorted(
    [tool_definition(t) for t in plugin_manager.discover_tools()],
    key=lambda t: t["name"],
)
And in serialization, force key sort:
import json
serialized = json.dumps(payload, sort_keys=True, separators=(",", ":"))
This matters for both providers. Anthropic hashes the tool definitions as part of the explicit cache block, and OpenAI’s automatic cache matches on exact prefixes, so do not count on either side tolerating reordered bytes. The fix is free. We have seen production systems gain 30 percentage points of cache hit rate from this one change.
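The effect of the two fixes can be seen on a pair of tool lists that differ only in discovery order and key order. This is a self-contained sketch; canonical_tools and the two sample lists are illustrative stand-ins for whatever a real plugin manager returns.

```python
import json


def canonical_tools(tools):
    """Sort tools by name, then serialize with sorted keys and fixed separators."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


worker_a = [
    {"name": "web_search", "description": "Search the web", "input_schema": {"type": "object"}},
    {"name": "code_interpreter", "description": "Run code", "input_schema": {"type": "object"}},
]
# Same tools, discovered in a different order and built with keys in a different order
worker_b = [
    {"input_schema": {"type": "object"}, "description": "Run code", "name": "code_interpreter"},
    {"description": "Search the web", "name": "web_search", "input_schema": {"type": "object"}},
]

print(json.dumps(worker_a) == json.dumps(worker_b))            # False
print(canonical_tools(worker_a) == canonical_tools(worker_b))  # True
```

Because an SDK serializes the request itself, one option is to pass json.loads(canonical_tools(tools)) as the tools argument so the dicts it receives are already in sorted order.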
Pattern 3: Sliding-Window Conversation History
The natural way to manage a long conversation is to cap context: keep the last N messages, drop the oldest as new ones arrive. This is correct for token budget. It is fatal for cache hit rate.
When you drop message 1 to add message 12, the prefix changes. Every cached byte after message 1 is now in the wrong position. Anthropic’s cache_control markers on system + tools may still hit; the conversation portion will not.
The fix has three options, none free:
- Cache only the system + tools prefix, accept that conversation tail is uncached. Easiest. Costs you the savings on whatever conversation history you have.
- Use a fixed prefix + sliding tail: keep the first K messages permanently, slide only after message K+N. Requires picking K (typically 6–10) where the older messages are still relevant or summarized.
- Summarize older messages into the system prompt when you drop them, refresh the cache on summary update. Hit rate stays high; system prompt grows slowly; cache writes happen on summary boundary.
For Anthropic, option 2 maps cleanly onto cache_control markers placed at the boundary of the fixed prefix. For OpenAI, the automatic cache will detect the stable prefix on its own as long as you do not mutate the early messages.
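Option 2 can be sketched as a trimming function that never touches the first K messages. The name trim_history and the default values for keep_head and keep_tail are assumptions chosen for illustration; the synthetic history only exists to show that the head stays identical from one turn to the next.

```python
def trim_history(messages, keep_head=6, keep_tail=10):
    """Keep the first `keep_head` messages fixed and slide only the tail."""
    if len(messages) <= keep_head + keep_tail:
        return list(messages)
    return list(messages[:keep_head]) + list(messages[-keep_tail:])


# Synthetic alternating conversation, used only for the demonstration
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(40)
]

turn_a = trim_history(history[:30])
turn_b = trim_history(history[:32])

print(turn_a[:6] == turn_b[:6])  # True: the fixed head is identical across turns
print(turn_a[6] == turn_b[6])    # False: the tail has slid
```

Only the fixed head is a cache candidate here; the tail still changes every turn. A production version would also need to keep user and assistant turns paired where the head meets the tail.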
When to Pick Anthropic Caching
- Long retrieved context (RAG with 8k–50k-token documents): explicit cache_control lets you cache the document block separately from the system prompt. OpenAI’s automatic cache cannot do this if the document position varies.
- Tool-heavy agents with 20+ tools: Anthropic caches tool definitions as part of the cached block. Cache hit on a 10k-token tool catalog is worth real money.
- Workloads with predictable cache lifetime ≥30 minutes: Anthropic’s 1-hour TTL option at 2× write cost is cheaper than re-writing every 5 minutes.
- Cost-sensitive batch processing: 90% off reads + amortized writes is the textbook Anthropic case.
anthropic/claude-sonnet-4.6 is the default starting point for cost-sensitive workloads; Opus 4.8 is the budget-no-object option.
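The 1-hour TTL trade-off in the list above comes down to how often a 5-minute cache would lapse. The sketch below compares write costs per hour for one prefix, in units of (prefix tokens × input rate), using the 1.25× and 2× multipliers quoted earlier. It assumes a single 2× write covers the whole hour and that every lapse of the 5-minute cache forces a fresh 1.25× write; the function name is illustrative.

```python
def write_cost_per_hour(expiries_per_hour: int, ttl: str) -> float:
    """Cache-write cost per hour for one prefix, in units of (prefix tokens x input rate)."""
    if ttl == "5m":
        # Each idle gap longer than 5 minutes forces another 1.25x write
        return 1.25 * max(expiries_per_hour, 1)
    if ttl == "1h":
        return 2.0  # assumption: one 2x write covers the hour
    raise ValueError(f"unknown ttl: {ttl}")


for lapses in (1, 2, 3, 6):
    five_min = write_cost_per_hour(lapses, "5m")
    one_hour = write_cost_per_hour(lapses, "1h")
    cheaper = "1h" if one_hour < five_min else "5m"
    print(f"{lapses} lapse(s)/hour: 5m={five_min:.2f}x, 1h={one_hour:.2f}x, cheaper: {cheaper}")
```

Under these assumptions the 1-hour option pays for itself as soon as the 5-minute cache would be rewritten twice in an hour.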
When to Pick OpenAI Caching
- Chat-style workloads with stable system prompt, no retrieval, no tool catalog churn: automatic caching just works, no code changes.
- High cache-write churn (cache hit rate <50%): no write fee means a miss costs only the standard input rate, with no premium on top. Anthropic’s 1.25× write penalty hurts here.
- GPT-5.4 or GPT-5.5 already in your stack: the read discount matches Anthropic. There is no economic reason to switch providers just for caching.
- Routing across many short prompts that almost-but-not-quite share prefixes: the prompt_cache_key parameter helps the router land cache hits where automatic detection might miss.
openai/gpt-5.5 is the model where the read discount caught up with Anthropic.
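A minimal sketch of how a prompt_cache_key might be derived: hash the static prefix and combine it with a coarse grouping value so that requests sharing a prefix also share a key. The helper names and the tenant grouping are hypothetical, and whether a given SDK version or gateway accepts prompt_cache_key as a top-level argument is an assumption to verify before relying on it.

```python
import hashlib


def cache_key_for(system_prompt: str, tenant: str) -> str:
    """Stable routing key: same tenant and same static prefix give the same key."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]
    return f"{tenant}-{digest}"


def build_kwargs(model, system_prompt, question, tenant):
    """Request arguments for a chat completion call, including the cache key."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        # Assumption: the client accepts this field directly; otherwise send it as an extra body field
        "prompt_cache_key": cache_key_for(system_prompt, tenant),
    }


first = build_kwargs("openai/gpt-5.5", "Static system prompt...", "Question one", tenant="team-a")
second = build_kwargs("openai/gpt-5.5", "Static system prompt...", "Question two", tenant="team-a")
print(first["prompt_cache_key"] == second["prompt_cache_key"])  # True
```

Deriving the key from the prefix rather than from the user or session keeps requests with identical prefixes grouped together, which is the case the parameter is meant to help.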
When Neither Helps (and What to Use Instead)
There are workloads where prompt caching is the wrong tool:
- Truly stateless single-shot calls under 1,024 tokens: nothing to cache, both providers’ minimum thresholds bite. Use a smaller, cheaper model and skip caching entirely.
- Hot-path personalized prompts where every byte of the prompt depends on the user (CRM lookups, real-time dashboards): the prefix is genuinely dynamic. Caching cannot help. Restructure the prompt so personalization comes after a long shared frame, or use embedding-based retrieval to avoid putting personalization in-prompt.
- Cross-model A/B experiments where the cache would expire between runs: the comparison is unfair. Use the ofox unified billing to compare apples-to-apples on equal hit rates, not on warm-vs-cold runs.
Alternative providers worth knowing about: Google Gemini offers ~50–75% cached-input discounts on Gemini 2.5 Pro (via ofox or direct); DeepSeek and Qwen series cache automatically with lower base rates that may beat both Anthropic and OpenAI on output cost. Same model-switching pattern via ofox applies.
Try Both via ofox: A/B in 10 Lines of Code
Both Anthropic’s cache_control and OpenAI’s automatic caching are accessible through the ofox unified OpenAI-compatible endpoint. Apart from the cache_control field that Claude models need, the model ID string is the only difference. Drop this harness into a service to measure cache hit rate side by side.
Python — A/B both models in one loop
import os
from openai import OpenAI
client = OpenAI(base_url="https://api.ofox.ai/v1", api_key=os.environ["OFOX_API_KEY"])
# Static system prompt, 5k+ tokens
with open("system.md", encoding="utf-8") as f:
    SYSTEM_PROMPT = f.read()
def measure(model: str, query: str) -> int:
    """Return the cached input tokens reported for one call."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
        # cache_control is sent only for Claude models; OpenAI caches automatically
        extra_body={"cache_control": {"type": "ephemeral"}} if "claude" in model else {},
    )
    usage = resp.usage.model_dump()
    # prompt_tokens_details can be present but None, so guard before calling .get()
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens") or usage.get("cache_read_input_tokens") or 0
for model in ["anthropic/claude-sonnet-4.6", "openai/gpt-5.5"]:
    cached = measure(model, "Refactor this FastAPI handler to use async DB calls.")
    print(f"{model}: {cached} cached input tokens")
Node — same shape
import OpenAI from "openai";
import { readFileSync } from "fs";
const client = new OpenAI({ baseURL: "https://api.ofox.ai/v1", apiKey: process.env.OFOX_API_KEY });
const SYSTEM_PROMPT = readFileSync("system.md", "utf8");
async function measure(model, query) {
  const resp = await client.chat.completions.create({
    model,
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: query },
    ],
    ...(model.includes("claude") ? { cache_control: { type: "ephemeral" } } : {}),
  });
  return resp.usage.prompt_tokens_details?.cached_tokens ??
    resp.usage.cache_read_input_tokens ?? 0;
}
for (const model of ["anthropic/claude-sonnet-4.6", "openai/gpt-5.5"]) {
  const cached = await measure(model, "Refactor this FastAPI handler to use async DB calls.");
  console.log(`${model}: ${cached} cached input tokens`);
}
Run twice in quick succession. The first call writes the cache; the second should report a high cached value if your prefix is genuinely stable. If the second call still reports zero, one of the three patterns above is in play.
Sources Checked for This Refresh
- Anthropic Prompt Caching API Reference (verified 2026-06-10 for read/write multipliers, minimum tokens by model, TTL options)
- OpenAI Prompt Caching Guide (verified 2026-06-10 for automatic activation, 1024-token minimum, extended retention on GPT-5.4/5.5)
- ofox model marketplace (verified live model IDs anthropic/claude-sonnet-4.6 and openai/gpt-5.5, cache read/write line items present)
- ofox prompt caching docs (confirmed cache_control passthrough for Anthropic, automatic activation for OpenAI)
- DigitalOcean: Prompt Caching for Anthropic and OpenAI Models (cross-referenced for production patterns)
- Internal production review of three customer agent services (anonymized): cache hit rate degradation traced to the three cache-miss patterns described above, dates Q1–Q2 2026
Prices and discount rates were verified against the providers’ official documentation at publication. Confirm current rates on the ofox models page before scaling any cost projection: model pricing changes by quarter, and the cached-read multiplier on GPT-class models has trended downward through 2025–2026. The cost math above assumes an 80% hit rate; rerun it for your actual workload before sizing.


