Does OpenAI prompt caching cost 50% or 90% less in 2026?

Both, depending on the model. Legacy gpt-4o and gpt-4.1 get a 50% cached-input discount. GPT-5.4 and GPT-5.5 dropped cached input to $0.50/M against $5/M standard — that is 90%, matching Anthropic's read rate. The 'OpenAI is 50%' shorthand is now wrong for new models.

Does Anthropic charge a cache write fee?

Yes. 5-minute writes are 1.25× the standard input rate; 1-hour writes are 2×. OpenAI does not charge a separate write fee — its cache is automatic and the first call is billed at the standard input rate. This matters when your cache hit rate is below ~50%.

What is the minimum prompt size to enable caching?

1,024 tokens on both providers for current flagship models (Claude Opus 4.8, Sonnet 4.6; OpenAI gpt-4o and newer). Smaller prefixes are never cached, regardless of cache_control markers.

How long does a cached prefix stay alive?

About 5 minutes by default on both providers. Anthropic's default TTL is 5 minutes, with an explicit 1-hour TTL available at 2× write cost via ttl: '1h'. OpenAI's standard TTL is 5–10 minutes of inactivity, with extended retention up to 24 hours on the GPT-5.4/5.5 family.

Why is my cache hit rate stuck near zero even with cache_control set?

Almost always one of three patterns: a dynamic value (timestamp, request ID, user name) sitting inside what you thought was the static prefix, non-deterministic JSON serialization of tool definitions, or a sliding conversation window that rotates the oldest message every turn. All three look right in the code review and silently break prefix matching.

Does cache_control work when calling Claude through ofox?

Yes. ofox passes cache_control through transparently to Anthropic, and OpenAI's automatic caching kicks in for OpenAI models routed via ofox. You can A/B both providers without rewriting the cache logic — only the model ID string changes.

Should I cache the system prompt or the conversation history?

Cache the longest stable prefix you can name: usually tool definitions + system prompt + retrieved documents, which is the order Anthropic builds its cache prefix in. Conversation history is a poor cache target unless you guarantee a stable window. Otherwise every turn changes the prefix and invalidates everything cached after the change.

Is Anthropic's 90% cache discount still meaningful now that OpenAI matches it on GPT-5.5?

Yes, for two reasons. Anthropic gives you explicit control over what gets cached (4 break points), so you can extend caching past the system prompt to long retrieved documents. OpenAI's automatic caching is take-it-or-leave-it — if your dynamic content sits early in the prompt, you cache nothing.

Anthropic vs OpenAI Prompt Caching 2026: Cost Math + 3 Cache-Miss Fixes

The cached-input discount race is over. Both Anthropic and OpenAI’s newest flagships now bill cached reads at 10% of the standard input rate — a 90% discount that started as Anthropic’s signature and got matched, quietly, when GPT-5.4 launched. The interesting question is no longer “who has the bigger discount.” It is which provider’s cache mechanics survive your actual prompt patterns, and which three mistakes are silently keeping your hit rate near zero.

This article walks the math, ranks the three cache-miss patterns we see most often in production, and ends with an A/B harness you can drop into a Python or Node service routed through ofox to compare both providers without rewriting cache logic.

30-Second Answer: Which Cache Wins for Your Workload?

If your workload is…	Pick	Why
Long system prompt (≥4k tok) + RAG documents, ≥10 hits per write	Anthropic	Explicit `cache_control` on tools + system + retrieved docs separately; 4 cache breakpoints let you extend caching past the system prompt
Stable prefix-style prompt, automatic caching is fine	OpenAI	No `cache_control` to manage; first call billed at the standard input rate, no 1.25× write fee
Mixed multi-provider routing with fallback	Both, A/B	Same prefix economics via ofox; switch on cost-per-1k-cached-tokens by hour
Cache hit rate currently <40%	Fix the prompt first	Provider choice does not matter until you stop the three cache-miss patterns below

If your hit rate is below 40%, the provider choice will not save you. Fix the three cache-miss patterns covered later in this article first, then run the math.

TL;DR: Real Monthly Bill on 10M Cached Tokens/Day

The headline question is “Anthropic 90% vs OpenAI 50%” — but in 2026 it has split by model generation:

Workload	Provider	Model	Standard input bill	With cache (80% hit)	Savings
10M tok/day cached prefix	Anthropic	`anthropic/claude-sonnet-4.6`	$900/mo	$297/mo	67%
10M tok/day cached prefix	OpenAI	`openai/gpt-5.5`	$1,500/mo	$420/mo	72%
10M tok/day cached prefix	OpenAI	`openai/gpt-4.1` (legacy)	$600/mo	$360/mo	40%

Numbers assume a 30-day month (300M prefix tokens), an 80% cache hit rate, a 5-minute TTL, and every miss billed as a cache write (1.25× input on Anthropic, the standard input rate on OpenAI). They cover the cached prefix only, not user messages or output. The Sonnet 4.6 base price ($3/M input) is illustrative; confirm current rates on the ofox models page.

The GPT-4.1 row is the one that surprises teams still running older integrations in 2026. The pricing did not change under them; staying on the legacy model means staying on the 50% cached-input tier while the newer models moved to 90%.

The Headline That Aged: 90% Was Anthropic-Only Until GPT-5.4

Anthropic introduced prompt caching as a beta feature in August 2024 with a clean economic structure: cache reads at 0.1× the standard input rate (a 90% discount), cache writes at 1.25× the standard rate for a 5-minute TTL, 2× for a 1-hour TTL. The 90% read discount became the marketing line.

OpenAI shipped automatic caching later that year. The discount started at 50% across the gpt-4o family — half off cached input, no developer action required. That gap is what generated the “Anthropic wins on cache pricing” consensus, repeated in dozens of blog posts and still being cited in 2026.

Two things changed between then and now:

GPT-5.4 and GPT-5.5 dropped cached input to $0.50/M against $5/M standard. That is 90%, identical to Anthropic’s read multiplier. The discount tier moved with the model generation, not the platform.
Anthropic kept the 1.25× write fee. OpenAI’s automatic cache has no separate write line item: the first call is billed at the standard input rate, and subsequent matched prefixes are billed at the cached rate. Anthropic’s write fee is a one-time surcharge each time the prefix is written, meaning on the first call and again whenever the 5-minute TTL lapses without a hit.

The implication: for high cache-hit workloads (≥5 hits per write cycle), Anthropic’s 1.25× write fee is amortized away and the providers tie on read economics. For low cache-hit workloads (<3 hits per write cycle), OpenAI’s no-write-fee structure wins by a few percent. For dynamic prompts that never sustain a prefix at all, neither saves you anything.
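The amortization argument can be checked with a few lines of arithmetic. The sketch below is a rough model, not either provider's billing logic: it assumes one write (or uncached first call) followed by a fixed number of cache hits, and it uses only the multipliers quoted in this article. The function name and parameters are illustrative.

```python
def effective_input_multiplier(hits_per_write: float,
                               write_mult: float,
                               read_mult: float = 0.1) -> float:
    """Average price multiplier per prefix token over one write cycle.

    One cycle = 1 call that writes (or misses) the cache, followed by
    `hits_per_write` calls that read it. 1.0 means "same as no caching".
    """
    if hits_per_write < 0:
        raise ValueError("hits_per_write must be non-negative")
    total = write_mult + hits_per_write * read_mult
    return total / (1 + hits_per_write)


# Multipliers from the text: Anthropic 5-minute write = 1.25x,
# OpenAI first call = 1.0x (no write fee), cached read = 0.1x on both.
for hits in (0, 1, 3, 5, 10):
    anthropic = effective_input_multiplier(hits, write_mult=1.25)
    openai = effective_input_multiplier(hits, write_mult=1.0)
    print(f"{hits:>2} hits/write  anthropic={anthropic:.3f}  openai={openai:.3f}")
```

At 5 hits per write the two multipliers differ by roughly 0.04 of the standard input rate, and the gap keeps narrowing as hits increase.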

Quick Specs: Anthropic vs OpenAI Prompt Caching

Spec	Anthropic Claude	OpenAI GPT
Discount on cache read	90% (0.1× input)	50% (gpt-4o, gpt-4.1) / 90% (gpt-5.4, gpt-5.5)
Cache write fee	1.25× input (5min), 2× (1hr)	None — first call at standard rate
Activation	Explicit `cache_control` marker	Automatic, prefix detection
Minimum tokens	1,024 (Sonnet 4.6, Opus 4.8); 4,096 on older Claude models (Opus 4.7/4.6/4.5, Haiku 4.5)	1,024
Default TTL	5 minutes ephemeral	5–10 minutes of inactivity
Extended TTL	1 hour at 2× write	Up to 24 hours (GPT-5.4/5.5 family, extended retention)
Cache break points	Up to 4 explicit markers per request	None — automatic prefix only
Cacheable content	Tools, system, messages, images, tool_use/tool_result	Stable text prefix only
Optional routing key	N/A (cache scoped to org + model)	`prompt_cache_key` for routing optimization
Cache hit signal in response	`cache_creation_input_tokens`, `cache_read_input_tokens`	`usage.prompt_tokens_details.cached_tokens`

Sources: Anthropic prompt caching docs, OpenAI prompt caching guide.

The row that matters most for cost modeling is Cache break points. OpenAI’s automatic cache locks onto the longest stable prefix it can find from the start of the request. Anthropic lets you place up to four cache_control markers anywhere in the request — meaning a retrieved document inserted mid-conversation can still cache, and the system prompt + tools can cache as a separate block from a retrieved document below it. For RAG workloads, this is a real structural advantage. For straight chat with no document retrieval, the OpenAI automatic model is simpler and equivalently priced on GPT-5.5.
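To make the breakpoint idea concrete, here is a minimal sketch of a request body with three markers: one on the tool list, one on the system prompt, one on a retrieved document. It follows the shape of Anthropic's native Messages API (system as a list of text blocks, cache_control on individual blocks). The helper names, the placeholder model ID, and the sample tool are assumptions for illustration, and whether a given gateway accepts this exact body is something to confirm against its docs.

```python
from typing import Any

MAX_BREAKPOINTS = 4  # per-request limit from the specs table above


def ephemeral() -> dict[str, str]:
    """Fresh marker dict; the default TTL is 5 minutes."""
    return {"type": "ephemeral"}


def count_breakpoints(node: Any) -> int:
    """Count every block in the body that carries a cache_control marker."""
    if isinstance(node, dict):
        own = 1 if "cache_control" in node else 0
        return own + sum(count_breakpoints(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_breakpoints(v) for v in node)
    return 0


def build_request(model: str, tools: list[dict[str, Any]], system_text: str,
                  document_text: str, question: str) -> dict[str, Any]:
    """Build a Messages-API-style body with three cache breakpoints."""
    tools_copy = [dict(tool) for tool in tools]  # do not mutate the caller's list
    if tools_copy:
        # Breakpoint 1: a marker on the last tool covers the whole tool list.
        tools_copy[-1]["cache_control"] = ephemeral()
    body = {
        "model": model,
        "max_tokens": 1024,
        "tools": tools_copy,
        # Breakpoint 2: the static system prompt.
        "system": [{"type": "text", "text": system_text,
                    "cache_control": ephemeral()}],
        "messages": [{
            "role": "user",
            "content": [
                # Breakpoint 3: the retrieved document, cached as its own block.
                {"type": "text", "text": document_text,
                 "cache_control": ephemeral()},
                # The question changes every call, so it carries no marker.
                {"type": "text", "text": question},
            ],
        }],
    }
    if count_breakpoints(body) > MAX_BREAKPOINTS:
        raise ValueError("too many cache_control markers")
    return body


# Hypothetical inputs, for illustration only.
sample_tools = [{"name": "web_search", "description": "Search the web",
                 "input_schema": {"type": "object", "properties": {}}}]
body = build_request("claude-model-id", sample_tools,
                     "You are a helpful assistant.", "<retrieved document>",
                     "What does the document say about refunds?")
print(count_breakpoints(body))  # 3
```

Keeping the volatile question in its own unmarked block is the design point: everything before it can be reused, and one breakpoint is still left over for a second document or a stable conversation head.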

Real Cost Math: Walking Through a 50M-Cacheable-Token Day

Assume the following workload: an agent service handling 10,000 requests per day, each with a 5,000-token system prompt + tools + retrieved context, plus a 500-token user message, plus a 1,000-token response. The first 5,000 tokens are the cache target.

Daily cacheable input: 10,000 requests × 5,000 tokens = 50M tokens/day. Of those, assume 80% hit a warm cache and 20% miss (write). That gives 40M read tokens/day and 10M write tokens/day.

Anthropic Claude Sonnet 4.6 (illustrative base $3/M input, $15/M output)

Line item	Rate	Tokens/day	Daily cost
Cache reads	$0.30/M (0.1× input)	40M	$12.00
Cache writes (5min)	$3.75/M (1.25× input)	10M	$37.50
Standard input (user msgs)	$3/M	5M	$15.00
Output	$15/M	10M	$150.00
Total			$214.50/day

Without caching: 50M × $3/M = $150 + $15 user + $150 output = $315/day. Savings: $100.50/day, or 32% off the daily bill.

OpenAI GPT-5.5 (verified $5/M input, $30/M output, $0.50/M cached input)

Line item	Rate	Tokens/day	Daily cost
Cache reads	$0.50/M	40M	$20.00
Cache writes (no fee)	$5/M	10M	$50.00
Standard input (user msgs)	$5/M	5M	$25.00
Output	$30/M	10M	$300.00
Total			$395.00/day

Without caching: 50M × $5/M = $250 + $25 user + $300 output = $575/day. Savings: $180/day, 31% off the daily bill.

Bill comparison

Both providers save roughly the same proportional amount (≈31–32%) at this hit rate. The absolute dollar gap between Sonnet 4.6 and GPT-5.5 reflects the underlying input/output rates, not the caching mechanism. At 80% hit rate, the “90% cached read” headline drops to a “~32% total bill reduction” reality because output tokens dominate the bill and writes still cost money.
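The two tables above can be reproduced with a small calculator, which also makes it easy to rerun the math for a different hit rate or workload. This is a sketch under the article's own assumptions (10,000 requests/day, 5,000-token prefix, 500-token user message, 1,000-token response, every miss billed as a write); the class and function names are illustrative, and the rates are the ones quoted in the tables, not independently verified.

```python
from dataclasses import dataclass, replace

M = 1_000_000


@dataclass(frozen=True)
class Rates:
    """Prices in dollars per million tokens."""
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float
    cache_write_per_m: float  # equals input_per_m when there is no write fee


def daily_cost(rates: Rates, hit_rate: float, requests: int = 10_000,
               prefix_tokens: int = 5_000, user_tokens: int = 500,
               output_tokens: int = 1_000) -> float:
    """Daily bill in dollars; every cache miss is billed as a cache write."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    prefix_total = requests * prefix_tokens
    read_tokens = prefix_total * hit_rate
    write_tokens = prefix_total - read_tokens
    dollars = (read_tokens * rates.cache_read_per_m
               + write_tokens * rates.cache_write_per_m
               + requests * user_tokens * rates.input_per_m
               + requests * output_tokens * rates.output_per_m)
    return dollars / M


def uncached_daily_cost(rates: Rates, **workload: int) -> float:
    """Same workload with the prefix billed at the standard input rate."""
    flat = replace(rates, cache_read_per_m=rates.input_per_m,
                   cache_write_per_m=rates.input_per_m)
    return daily_cost(flat, hit_rate=0.0, **workload)


SONNET = Rates(input_per_m=3.00, output_per_m=15.00,
               cache_read_per_m=0.30, cache_write_per_m=3.75)
GPT55 = Rates(input_per_m=5.00, output_per_m=30.00,
              cache_read_per_m=0.50, cache_write_per_m=5.00)

for name, rates in (("sonnet-4.6", SONNET), ("gpt-5.5", GPT55)):
    cached = daily_cost(rates, hit_rate=0.80)
    baseline = uncached_daily_cost(rates)
    print(f"{name}: ${cached:.2f}/day cached vs ${baseline:.2f}/day uncached")
```

Sweeping hit_rate from 0.4 to 0.95 with the same function is a quick way to see how much of the bill depends on hit rate for your own token mix.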

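The leverage point is hit rate, not provider. On the Sonnet workload above, every percentage point of hit rate moves 0.5M tokens between the $3.75/M write rate and the $0.30/M read rate, which is about $1.73/day. Move from 80% to 95% and the daily cost drops by roughly $26, about 12% of the $214.50 bill. Move from 80% to 40% and you pay about $69/day more in extra writes. The patterns in the next section are the difference.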

3 Cache-Miss Patterns to Fix First

These are ranked by frequency in production code review, not by severity. Each one looks correct on the first read and breaks silently — the only signal is your cache_read_input_tokens field staying near zero while your spend climbs.

Pattern	What it does	Typical hit-rate impact	Fix difficulty
1. Mutable system prompt	Inserts a timestamp/UUID/username into the cached prefix	Drops to 0%	5 minutes
2. Non-deterministic tool serialization	Tool definitions render in different byte order across calls	0–40% depending on Python/Node version	15 minutes
3. Sliding-window history	Drops oldest message each turn, changing the prefix from that point on	Drops to 0% after window fills	30 minutes (requires window strategy change)
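Since the only signal for all three patterns is the cached-token count in the usage block, it helps to turn that count into a ratio you can log per request. The helper below is a sketch that assumes the two native usage shapes: Anthropic reports input_tokens separately from the cache read and cache creation counts, while OpenAI's prompt_tokens already includes cached_tokens. A gateway may reshape these fields, so treat the field handling as an assumption to verify against real responses.

```python
from typing import Any, Mapping


def cached_fraction(usage: Mapping[str, Any]) -> float:
    """Share of input tokens served from cache, for either usage shape."""
    if "cache_read_input_tokens" in usage or "cache_creation_input_tokens" in usage:
        # Anthropic-style: input_tokens excludes cached tokens, so sum all three.
        read = usage.get("cache_read_input_tokens") or 0
        written = usage.get("cache_creation_input_tokens") or 0
        fresh = usage.get("input_tokens") or 0
        total = read + written + fresh
    else:
        # OpenAI-style: prompt_tokens already includes the cached tokens.
        details = usage.get("prompt_tokens_details") or {}
        read = details.get("cached_tokens") or 0
        total = usage.get("prompt_tokens") or 0
    return read / total if total else 0.0


# Hand-written usage dicts, for illustration only.
anthropic_usage = {"input_tokens": 500, "cache_creation_input_tokens": 0,
                   "cache_read_input_tokens": 5000}
openai_usage = {"prompt_tokens": 5500,
                "prompt_tokens_details": {"cached_tokens": 4992}}

print(f"{cached_fraction(anthropic_usage):.2%}")
print(f"{cached_fraction(openai_usage):.2%}")
```

A ratio that stays near zero on the second and later calls with an identical system prompt is the cue to look for one of the patterns in the table.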

Pattern 1: The Mutable System Prompt

The bug looks like this:

system_prompt = f"""You are a helpful assistant.
Current time: {datetime.utcnow().isoformat()}
Available tools: web_search, code_interpreter
User ID: {user.id}
..."""

Every call has a different Current time value, so every call has a different byte prefix. Cache hit rate: 0%. The fix is to move the dynamic values to the end of the user message or to a separate non-cached block:

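from datetime import datetime, timezone

system_prompt = """You are a helpful assistant.
Available tools: web_search, code_interpreter
..."""  # static, cacheable

# Dynamic values go in the user message, after the cached prefix.
now = datetime.now(timezone.utc).isoformat()  # utcnow() is deprecated since Python 3.12
user_message = f"Current time: {now}\nUser ID: {user.id}\n\n{actual_question}"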

For Anthropic, this lets you place cache_control on the system block. For OpenAI, it lets the automatic prefix detector latch onto the static system content. Both providers benefit identically.

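The variant of this bug that hides longer: a “static” template re-rendered from a collection whose iteration order is not fixed, for example a set of strings (whose order can change between processes under Python's hash randomization) or a dict rebuilt in a different insertion order after a config refresh. The rendered prompt contains the same items and looks the same at a glance, but the bytes are in a different order, so the prefix no longer matches.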
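One cheap way to catch both forms of this bug is to fingerprint the text you believe is static and compare the fingerprint across requests. The sketch below assumes nothing about either provider's internal hashing; it only checks that your own bytes are stable. The render_system_prompt function is a hypothetical stand-in for whatever builds your prompt.

```python
import hashlib
from typing import Optional


def prefix_fingerprint(*parts: str) -> str:
    """Short SHA-256 digest of the text that is supposed to be static."""
    digest = hashlib.sha256()
    for part in parts:
        encoded = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()[:16]


def render_system_prompt(now: Optional[str] = None) -> str:
    """Hypothetical renderer; passing `now` reproduces the Pattern 1 bug."""
    prompt = ("You are a helpful assistant.\n"
              "Available tools: web_search, code_interpreter\n")
    if now is not None:
        prompt += f"Current time: {now}\n"
    return prompt


# A stable prefix yields one fingerprint; a leaky one yields one per call.
stable = {prefix_fingerprint(render_system_prompt()) for _ in range(3)}
leaky = {prefix_fingerprint(render_system_prompt(now=str(i))) for i in range(3)}
print(len(stable), len(leaky))  # 1 3
```

Logging the fingerprint next to the cached-token count makes the correlation obvious: if the fingerprint changes between two calls, a cache miss on the second one is expected rather than mysterious.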

Pattern 2: Non-Deterministic Tool Serialization

Tool definitions are typically JSON-serialized before sending. If you build them like this:

tools = []
for tool_module in plugin_manager.discover_tools():
    tools.append({
        "name": tool_module.name,
        "description": tool_module.description,
        "input_schema": tool_module.schema,
    })

The order of plugin_manager.discover_tools() is whatever the filesystem returns. On a different worker, after a deploy, or on a different OS, the order is different. The same tools, different bytes, no cache hit.

The fix is to sort deterministically before sending:

tools = sorted(
    [tool_definition(t) for t in plugin_manager.discover_tools()],
    key=lambda t: t["name"],
)

And in serialization, force key sort:

import json
serialized = json.dumps(payload, sort_keys=True, separators=(",", ":"))

This matters on both providers, because both match prefixes byte for byte and tool definitions sit at the front of the prompt. It is most visible on Anthropic, where tools are the first segment of the explicit cache block, so a reordered tool list invalidates everything cached after it. The fix is free either way. We have seen production systems gain 30 percentage points of cache hit rate from this one change.
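A quick regression check can confirm that the two fixes together really do produce identical bytes regardless of discovery order. The sketch below uses made-up tool definitions and a seeded shuffle; canonical_tools_json is an illustrative helper, not part of any SDK.

```python
import json
import random


def canonical_tools_json(tools: list[dict]) -> str:
    """Serialize tool definitions to the same bytes regardless of input order."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    # sort_keys also sorts the keys of nested dicts such as input_schema.
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


# Hypothetical tool definitions, used only for this check.
TOOLS = [
    {"name": "web_search", "description": "Search the web",
     "input_schema": {"type": "object",
                      "properties": {"q": {"type": "string"}}}},
    {"name": "code_interpreter", "description": "Run Python",
     "input_schema": {"properties": {"code": {"type": "string"}},
                      "type": "object"}},
]

rng = random.Random(0)  # seeded so the check is repeatable
baseline = canonical_tools_json(TOOLS)
for _ in range(20):
    shuffled = TOOLS[:]
    rng.shuffle(shuffled)
    # Reverse top-level key order to mimic a different construction path.
    shuffled = [dict(reversed(list(tool.items()))) for tool in shuffled]
    assert canonical_tools_json(shuffled) == baseline
print("canonical form is order-independent")
```

One caveat: sort_keys only reorders dictionary keys. JSON arrays inside a schema, such as a required list, keep whatever order they were built in, so those need to be constructed deterministically as well.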

Pattern 3: Sliding-Window Conversation History

The natural way to manage a long conversation is to cap context: keep the last N messages, drop the oldest as new ones arrive. This is correct for token budget. It is fatal for cache hit rate.

When you drop message 1 to add message 12, the prefix changes. Every cached byte after message 1 is now in the wrong position. Anthropic’s cache_control markers on system + tools may still hit; the conversation portion will not.

The fix has three options, none free:

Cache only the system + tools prefix, accept that conversation tail is uncached. Easiest. Costs you the savings on whatever conversation history you have.
Use a fixed prefix + sliding tail: keep the first K messages permanently, slide only after message K+N. Requires picking K (typically 6–10) where the older messages are still relevant or summarized.
Summarize older messages into the system prompt when you drop them, refresh the cache on summary update. Hit rate stays high; system prompt grows slowly; cache writes happen on summary boundary.

For Anthropic, option 2 maps cleanly onto cache_control markers placed at the boundary of the fixed prefix. For OpenAI, the automatic cache will detect the stable prefix on its own as long as you do not mutate the early messages.
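Option 2 is the easiest of the three to sketch. The function below keeps the first keep_head messages permanently and slides only the tail, so the head stays byte-identical from turn to turn. The parameter names and defaults are illustrative, and a real implementation would also need to keep tool_use/tool_result pairs together and decide whether dropped messages get summarized.

```python
def trim_history(messages: list[dict], keep_head: int = 6,
                 keep_tail: int = 10) -> list[dict]:
    """Keep the first `keep_head` messages forever and slide only the tail."""
    if keep_head < 0 or keep_tail < 0:
        raise ValueError("keep_head and keep_tail must be non-negative")
    if len(messages) <= keep_head + keep_tail:
        return list(messages)
    # messages[-0:] would return the whole list, so handle zero explicitly.
    tail = messages[-keep_tail:] if keep_tail else []
    return messages[:keep_head] + tail


# Simulate 40 turns and record every distinct head the model would see.
history: list[dict] = []
heads = set()
for turn in range(1, 41):
    role = "user" if turn % 2 else "assistant"
    history.append({"role": role, "content": f"message {turn}"})
    window = trim_history(history, keep_head=6, keep_tail=10)
    if len(history) >= 6:
        heads.add(tuple(m["content"] for m in window[:6]))

print(len(heads))  # 1
```

The head never changes once it is full, which is what lets a cache_control marker at the end of the head (or OpenAI's automatic prefix match) keep hitting while the tail rotates.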

When to Pick Anthropic Caching

Long retrieved context (RAG with 8k–50k-token documents): explicit cache_control lets you cache the document block separately from the system prompt. OpenAI’s automatic cache cannot do this if the document position varies.
Tool-heavy agents with 20+ tools: Anthropic caches tool definitions as part of the cached block. Cache hit on a 10k-token tool catalog is worth real money.
Workloads with predictable cache lifetime ≥30 minutes: Anthropic’s 1-hour TTL option at 2× write cost is cheaper than re-writing every 5 minutes.
Cost-sensitive batch processing: 90% off reads + amortized writes is the textbook Anthropic case.
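The 1-hour TTL trade-off in the list above depends mostly on how far apart requests arrive. The sketch below compares the two TTL options for evenly spaced calls, in units of one standard-rate prefix read. It assumes the multipliers quoted in this article and that each cache hit refreshes the TTL at no extra cost; real traffic is burstier than this, so treat it as a first-order estimate.

```python
def prefix_cost_units(gap_minutes: float, calls: int, ttl_minutes: float,
                      write_mult: float, read_mult: float = 0.1) -> float:
    """Total prefix cost for `calls` evenly spaced calls.

    The result is in units of "one prefix at the standard input rate".
    """
    if calls <= 0:
        return 0.0
    if gap_minutes <= ttl_minutes:
        # Every hit refreshes the TTL, so only the first call writes.
        return write_mult + (calls - 1) * read_mult
    # The cache expires between calls, so every call pays the write rate.
    return calls * write_mult


for gap, calls in ((2, 30), (10, 6), (30, 2)):
    five_min = prefix_cost_units(gap, calls, ttl_minutes=5, write_mult=1.25)
    one_hour = prefix_cost_units(gap, calls, ttl_minutes=60, write_mult=2.0)
    print(f"gap={gap:>2} min, {calls:>2} calls: "
          f"5m TTL={five_min:.2f}  1h TTL={one_hour:.2f}")
```

Under these assumptions the 1-hour option only pays off when gaps between calls exceed 5 minutes; for steady traffic with shorter gaps, the cheaper 5-minute write stays warm on its own.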

anthropic/claude-sonnet-4.6 is the default starting point for cost-sensitive workloads; Opus 4.8 is the budget-no-object option.

When to Pick OpenAI Caching

Chat-style workloads with stable system prompt, no retrieval, no tool catalog churn: automatic caching just works, no code changes.
High cache-write churn (cache hit rate <50%): no write fee means you do not pay for misses. Anthropic’s 1.25× write penalty hurts here.
GPT-5.4 or GPT-5.5 already in your stack: the read discount matches Anthropic. There is no economic reason to switch providers just for caching.
Routing across many short prompts that almost-but-not-quite share prefixes: the prompt_cache_key parameter helps the router land cache hits where automatic detection might miss.

openai/gpt-5.5 is the model where the read discount caught up with Anthropic.

When Neither Helps (and What to Use Instead)

There are workloads where prompt caching is the wrong tool:

Truly stateless single-shot calls under 1,024 tokens: nothing to cache, both providers’ minimum thresholds bite. Use a smaller, cheaper model and skip caching entirely.
Hot-path personalized prompts where every byte of the prompt depends on the user (CRM lookups, real-time dashboards): the prefix is genuinely dynamic. Caching cannot help. Restructure the prompt so personalization comes after a long shared frame, or use embedding-based retrieval to avoid putting personalization in-prompt.
Cross-model A/B experiments where the cache would expire between runs: the comparison is unfair. Use the ofox unified billing to compare apples-to-apples on equal hit rates, not on warm-vs-cold runs.

Alternative providers worth knowing about: Google Gemini offers ~50–75% cached-input discounts on Gemini 2.5 Pro (via ofox or direct); DeepSeek and Qwen series cache automatically with lower base rates that may beat both Anthropic and OpenAI on output cost. Same model-switching pattern via ofox applies.

Try Both via ofox: A/B in About 20 Lines of Code

Both Anthropic’s cache_control and OpenAI’s automatic caching are accessible through the ofox unified OpenAI-compatible endpoint. The model ID string is the only difference. Drop this harness into a service to measure cache hit rate side by side.

Python — A/B both models in one loop

import os
from pathlib import Path

from openai import OpenAI

client = OpenAI(base_url="https://api.ofox.ai/v1", api_key=os.environ["OFOX_API_KEY"])

SYSTEM_PROMPT = Path("system.md").read_text(encoding="utf-8")  # 5k+ tokens, static

def measure(model: str, query: str) -> int:
    """Return the number of input tokens served from cache for one call."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
        # Claude needs an explicit marker; OpenAI models cache automatically.
        extra_body={"cache_control": {"type": "ephemeral"}} if "claude" in model else {},
    )
    usage = resp.usage.model_dump()
    # model_dump() keeps keys whose value is None, so guard with `or`.
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens") or usage.get("cache_read_input_tokens") or 0

for model in ["anthropic/claude-sonnet-4.6", "openai/gpt-5.5"]:
    cached = measure(model, "Refactor this FastAPI handler to use async DB calls.")
    print(f"{model}: {cached} cached input tokens")

Node — same shape

import OpenAI from "openai";
import { readFileSync } from "fs";

const client = new OpenAI({ baseURL: "https://api.ofox.ai/v1", apiKey: process.env.OFOX_API_KEY });
const SYSTEM_PROMPT = readFileSync("system.md", "utf8");

async function measure(model, query) {
  const resp = await client.chat.completions.create({
    model,
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: query },
    ],
    ...(model.includes("claude") ? { cache_control: { type: "ephemeral" } } : {}),
  });
  return resp.usage?.prompt_tokens_details?.cached_tokens ??
         resp.usage?.cache_read_input_tokens ?? 0;
}

for (const model of ["anthropic/claude-sonnet-4.6", "openai/gpt-5.5"]) {
  const cached = await measure(model, "Refactor this FastAPI handler to use async DB calls.");
  console.log(`${model}: ${cached} cached input tokens`);
}

Run twice in quick succession. The first call writes the cache; the second should report a high cached value if your prefix is genuinely stable. If the second call still reports zero, one of the three patterns above is in play.

References

Anthropic Prompt Caching API Reference
OpenAI Prompt Caching Guide
ofox model marketplace
ofox prompt caching docs
DigitalOcean: Prompt Caching for Anthropic and OpenAI Models
Internal production review of three customer agent services (anonymized): cache hit rate degradation traced to the three cache-miss patterns described above, Q1–Q2 2026

Prices and discount rates verified on the providers’ official documentation at publication. Confirm current rates on the ofox models page before scaling any cost projection: model pricing changes by quarter, and the read multiplier on GPT-class models has trended downward through 2025–2026. The cost math above assumes an 80% hit rate; rerun it for your actual workload before sizing.