Gemini Explicit Caching (cachedContents)
Gemini offers two kinds of caching:
- Implicit caching (automatic) — the model automatically detects and reuses repeated prompt prefixes, with no configuration needed. This is the default behavior covered in Prompt Caching; hits are best-effort, and any small change to the prefix may cause a miss.
- Explicit caching (this page) — you actively create a large piece of context as a cache object, get back a
cachedContents/{id}handle, and reference it in subsequentgenerateContentrequests. Hits are deterministic: as long as the cache hasn’t expired, a reference always hits.
Use explicit caching when you have a large, stable context that is reused repeatedly (long documents, codebases, knowledge bases, fixed system instructions) and want controllable hit rates and predictable billing.
OfoxAI supports Gemini’s native cachedContents create / reference / get / delete, compatible with the Google GenAI SDK. For the endpoint-level reference, see the cachedContents API.
When to Use Explicit Caching
| Scenario | Recommendation |
|---|---|
| Repeated Q&A over a large fixed context (doc QA, code assistant) | ✅ Explicit caching |
| You need guaranteed hits and cannot tolerate occasional misses | ✅ Explicit caching |
| Short, infrequent requests with varying prefixes | ❌ Implicit caching is enough; no cache object to manage |
| The context changes every time | ❌ Caching has no value |
Explicit caching requires the cached content to meet a minimum token threshold (roughly 2,048–4,096 tokens, model-dependent). Content that is too small will fail to create or yield no benefit.
Endpoints
OfoxAI provides the full cachedContents endpoints under the Gemini native protocol:
POST https://api.ofox.ai/gemini/v1beta/cachedContents # Create
GET https://api.ofox.ai/gemini/v1beta/cachedContents/{id} # Get
DELETE https://api.ofox.ai/gemini/v1beta/cachedContents/{id} # DeleteTo reference a cache for content generation, use the standard generateContent endpoint and add cachedContent to the request body:
POST https://api.ofox.ai/gemini/v1beta/models/{model}:generateContentAuthentication
Like other Gemini endpoints, use the x-goog-api-key header:
x-goog-api-key: <YOUR_OFOXAI_API_KEY>Full Workflow
The following demonstrates a complete flow: create cache → reference cache to generate → delete cache.
1. Create a Cache
Put the large context you want to reuse into contents, and specify model and ttl (time to live):
Python
from google import genai
from google.genai import types
client = genai.Client(
api_key="<YOUR_OFOXAI_API_KEY>",
http_options={"api_version": "v1beta", "base_url": "https://api.ofox.ai/gemini"},
)
# A large document reused repeatedly (must meet the minimum token threshold)
LONG_DOCUMENT = open("knowledge_base.txt").read()
cache = client.caches.create(
model="google/gemini-3.1-pro-preview",
config=types.CreateCachedContentConfig(
contents=[LONG_DOCUMENT],
ttl="600s", # cache lives for 10 minutes
),
)
print(cache.name) # cachedContents/xxxxxxxx — the handle for later referencesThe returned name (e.g. cachedContents/xxxxxxxx) is the cache handle. Save it for later references and management.
2. Reference the Cache to Generate
Reference the cache handle via cachedContent in a generateContent request; contents only needs the new question for this turn:
Python
response = client.models.generate_content(
model="google/gemini-3.1-pro-preview",
contents="Based on the document above, summarize three key points",
config=types.GenerateContentConfig(
cached_content=cache.name, # reference the cache
),
)
print(response.text)
# Cache hit: the cached portion is billed at the lower "read" rate
print(response.usage_metadata.cached_content_token_count)On a hit, the response usageMetadata.cachedContentTokenCount shows how many tokens came from the cache. The same cache can be referenced any number of times, each a deterministic hit, until it expires.
3. Get and Delete
Query cache metadata, or delete it proactively when no longer needed (no need to wait for the TTL to expire):
Python
# Get
info = client.caches.get(name=cache.name)
print(info.expire_time)
# Delete
client.caches.delete(name=cache.name)Get / delete do not require passing model again. OfoxAI locates the cache from the handle alone.
Lifecycle and TTL
Specify a time to live via ttl at creation, as a string of seconds ending in s (e.g. "600s").
| Parameter | Value |
|---|---|
| Minimum / default TTL | 600s (10 minutes) |
| Maximum TTL | 3600s (1 hour) |
- Within the TTL the cache can be referenced repeatedly with deterministic hits.
- After the TTL expires the cache is invalidated automatically, and referencing it will error (the cache no longer exists).
- You can
deleteit at any time to release it early.
Referencing an expired or deleted cache fails. Handle the “cache invalidated” case on the client side (catch the error and recreate the cache, or fall back to a normal request).
Billing
Explicit caching is billed transparently per token, in three parts:
| Stage | Formula | Notes |
|---|---|---|
| Create cache | totalTokenCount × cache_write rate | Billed at the “cache write” rate; totalTokenCount comes from the usageMetadata of the create response — i.e. the number of cached tokens |
| Reference hit | cachedContentTokenCount × cache_read rate | The cached portion is billed at the “cache read” rate, roughly 0.10x of the standard input price, far cheaper than resending everything |
| New content per reference | The new prompt / output for each turn is billed at standard rates | I.e. each new question you ask and the model’s new answer |
The economics: pay one “write” fee on creation, then every reference bills the cached content at the lower “read” rate, saving the cost of “resending the whole context at full price.” The more references, the more you save. Each model’s cache_write / cache_read unit prices are authoritative in the model catalog and your console usage stats.
OfoxAI Enhancement: Deterministic Routing
OfoxAI load-balances across multiple GCP regions / Vertex projects. Explicit caches are region-scoped: a cache created in one GCP project must be referenced in that same project, otherwise it 404s.
OfoxAI handles this automatically: at creation time it records the upstream instance the cache is bound to, then looks up that instance by the cache handle and hard-locks references back to the same upstream. Regardless of whether your reference request carries a user identifier or how load is scheduled, references are routed precisely to the project that created the cache, with zero drift. You don’t need to think about the underlying region distribution.
Cache ownership is protected: a cache handle can only be referenced / queried / deleted by the API Key (and same owner) that created it; cross-account access is rejected (403).
Explicit vs Implicit Caching
| Dimension | Implicit (automatic) | Explicit (cachedContents) |
|---|---|---|
| Trigger | Auto-detects repeated prefixes | Actively create a cache object and reference it |
| Hit determinism | Best-effort; a small prefix change may miss | Deterministic; always hits while not expired |
| Management | None | Create / reference / delete a handle |
| Best for | Short content with occasionally repeated prefixes | Large, stable context reused repeatedly |
| Billing | Read-rate discount on hit | Write fee on create + read-rate discount on reference |
The two can be combined. For everyday requests, let implicit caching kick in automatically; for context you definitely reuse repeatedly, use explicit caching to lock in hits. See Prompt Caching for implicit-caching details.