Skip to Content
DocsAdvancedGemini Explicit Caching

Gemini Explicit Caching (cachedContents)

Gemini offers two kinds of caching:

  • Implicit caching (automatic) — the model automatically detects and reuses repeated prompt prefixes, with no configuration needed. This is the default behavior covered in Prompt Caching; hits are best-effort, and any small change to the prefix may cause a miss.
  • Explicit caching (this page) — you actively create a large piece of context as a cache object, get back a cachedContents/{id} handle, and reference it in subsequent generateContent requests. Hits are deterministic: as long as the cache hasn’t expired, a reference always hits.

Use explicit caching when you have a large, stable context that is reused repeatedly (long documents, codebases, knowledge bases, fixed system instructions) and want controllable hit rates and predictable billing.

OfoxAI supports Gemini’s native cachedContents create / reference / get / delete, compatible with the Google GenAI SDK. For the endpoint-level reference, see the cachedContents API.

When to Use Explicit Caching

ScenarioRecommendation
Repeated Q&A over a large fixed context (doc QA, code assistant)✅ Explicit caching
You need guaranteed hits and cannot tolerate occasional misses✅ Explicit caching
Short, infrequent requests with varying prefixes❌ Implicit caching is enough; no cache object to manage
The context changes every time❌ Caching has no value

Explicit caching requires the cached content to meet a minimum token threshold (roughly 2,048–4,096 tokens, model-dependent). Content that is too small will fail to create or yield no benefit.

Endpoints

OfoxAI provides the full cachedContents endpoints under the Gemini native protocol:

POST https://api.ofox.ai/gemini/v1beta/cachedContents # Create GET https://api.ofox.ai/gemini/v1beta/cachedContents/{id} # Get DELETE https://api.ofox.ai/gemini/v1beta/cachedContents/{id} # Delete

To reference a cache for content generation, use the standard generateContent endpoint and add cachedContent to the request body:

POST https://api.ofox.ai/gemini/v1beta/models/{model}:generateContent

Authentication

Like other Gemini endpoints, use the x-goog-api-key header:

x-goog-api-key: <YOUR_OFOXAI_API_KEY>

Full Workflow

The following demonstrates a complete flow: create cache → reference cache to generate → delete cache.

1. Create a Cache

Put the large context you want to reuse into contents, and specify model and ttl (time to live):

create_cache.py
from google import genai from google.genai import types client = genai.Client( api_key="<YOUR_OFOXAI_API_KEY>", http_options={"api_version": "v1beta", "base_url": "https://api.ofox.ai/gemini"}, ) # A large document reused repeatedly (must meet the minimum token threshold) LONG_DOCUMENT = open("knowledge_base.txt").read() cache = client.caches.create( model="google/gemini-3.1-pro-preview", config=types.CreateCachedContentConfig( contents=[LONG_DOCUMENT], ttl="600s", # cache lives for 10 minutes ), ) print(cache.name) # cachedContents/xxxxxxxx — the handle for later references

The returned name (e.g. cachedContents/xxxxxxxx) is the cache handle. Save it for later references and management.

2. Reference the Cache to Generate

Reference the cache handle via cachedContent in a generateContent request; contents only needs the new question for this turn:

use_cache.py
response = client.models.generate_content( model="google/gemini-3.1-pro-preview", contents="Based on the document above, summarize three key points", config=types.GenerateContentConfig( cached_content=cache.name, # reference the cache ), ) print(response.text) # Cache hit: the cached portion is billed at the lower "read" rate print(response.usage_metadata.cached_content_token_count)

On a hit, the response usageMetadata.cachedContentTokenCount shows how many tokens came from the cache. The same cache can be referenced any number of times, each a deterministic hit, until it expires.

3. Get and Delete

Query cache metadata, or delete it proactively when no longer needed (no need to wait for the TTL to expire):

manage_cache.py
# Get info = client.caches.get(name=cache.name) print(info.expire_time) # Delete client.caches.delete(name=cache.name)

Get / delete do not require passing model again. OfoxAI locates the cache from the handle alone.

Lifecycle and TTL

Specify a time to live via ttl at creation, as a string of seconds ending in s (e.g. "600s").

ParameterValue
Minimum / default TTL600s (10 minutes)
Maximum TTL3600s (1 hour)
  • Within the TTL the cache can be referenced repeatedly with deterministic hits.
  • After the TTL expires the cache is invalidated automatically, and referencing it will error (the cache no longer exists).
  • You can delete it at any time to release it early.

Referencing an expired or deleted cache fails. Handle the “cache invalidated” case on the client side (catch the error and recreate the cache, or fall back to a normal request).

Billing

Explicit caching is billed transparently per token, in three parts:

StageFormulaNotes
Create cachetotalTokenCount × cache_write rateBilled at the “cache write” rate; totalTokenCount comes from the usageMetadata of the create response — i.e. the number of cached tokens
Reference hitcachedContentTokenCount × cache_read rateThe cached portion is billed at the “cache read” rate, roughly 0.10x of the standard input price, far cheaper than resending everything
New content per referenceThe new prompt / output for each turn is billed at standard ratesI.e. each new question you ask and the model’s new answer

The economics: pay one “write” fee on creation, then every reference bills the cached content at the lower “read” rate, saving the cost of “resending the whole context at full price.” The more references, the more you save. Each model’s cache_write / cache_read unit prices are authoritative in the model catalog  and your console usage stats.

OfoxAI Enhancement: Deterministic Routing

OfoxAI load-balances across multiple GCP regions / Vertex projects. Explicit caches are region-scoped: a cache created in one GCP project must be referenced in that same project, otherwise it 404s.

OfoxAI handles this automatically: at creation time it records the upstream instance the cache is bound to, then looks up that instance by the cache handle and hard-locks references back to the same upstream. Regardless of whether your reference request carries a user identifier or how load is scheduled, references are routed precisely to the project that created the cache, with zero drift. You don’t need to think about the underlying region distribution.

Cache ownership is protected: a cache handle can only be referenced / queried / deleted by the API Key (and same owner) that created it; cross-account access is rejected (403).

Explicit vs Implicit Caching

DimensionImplicit (automatic)Explicit (cachedContents)
TriggerAuto-detects repeated prefixesActively create a cache object and reference it
Hit determinismBest-effort; a small prefix change may missDeterministic; always hits while not expired
ManagementNoneCreate / reference / delete a handle
Best forShort content with occasionally repeated prefixesLarge, stable context reused repeatedly
BillingRead-rate discount on hitWrite fee on create + read-rate discount on reference

The two can be combined. For everyday requests, let implicit caching kick in automatically; for context you definitely reuse repeatedly, use explicit caching to lock in hits. See Prompt Caching for implicit-caching details.

Last updated on