Gemini Explicit Caching (cachedContents)

Gemini offers two kinds of caching:

Implicit caching (automatic) — the model automatically detects and reuses repeated prompt prefixes, with no configuration needed. This is the default behavior covered in Prompt Caching; hits are best-effort, and any small change to the prefix may cause a miss.
Explicit caching (this page) — you actively create a large piece of context as a cache object, get back a cachedContents/{id} handle, and reference it in subsequent generateContent requests. Hits are deterministic: as long as the cache hasn’t expired, a reference always hits.

Use explicit caching when you have a large, stable context that is reused repeatedly (long documents, codebases, knowledge bases, fixed system instructions) and want controllable hit rates and predictable billing.

OfoxAI supports Gemini’s native cachedContents operations (create, reference, get, delete) and is compatible with the Google GenAI SDK. For the endpoint-level reference, see the cachedContents API.

When to Use Explicit Caching

Scenario	Recommendation
Repeated Q&A over a large fixed context (doc QA, code assistant)	✅ Explicit caching
You need guaranteed hits and cannot tolerate occasional misses	✅ Explicit caching
Short, infrequent requests with varying prefixes	❌ Implicit caching is enough; no cache object to manage
The context changes every time	❌ Caching has no value

Explicit caching requires the cached content to meet a minimum token threshold (roughly 2,048–4,096 tokens, depending on the model). Content below the threshold will either fail at creation or yield no benefit.
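A pre-flight check can help avoid a failed create call. The sketch below is only a rough heuristic and not part of any SDK: it assumes about four characters per token for English text, and `MIN_CACHE_TOKENS` is a hypothetical placeholder taken from the upper end of the range above. The real threshold is model-dependent, so treat the result as a hint rather than a guarantee.

```python
# Rough pre-flight check before creating an explicit cache.
# ASSUMPTIONS: ~4 characters per token is a coarse English-text heuristic,
# and MIN_CACHE_TOKENS is a placeholder; the real threshold is model-dependent.
CHARS_PER_TOKEN = 4
MIN_CACHE_TOKENS = 4096


def estimate_tokens(text: str) -> int:
    """Return a coarse token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def worth_caching(text: str, min_tokens: int = MIN_CACHE_TOKENS) -> bool:
    """True when the estimate meets the (assumed) minimum threshold."""
    return estimate_tokens(text) >= min_tokens
```

Content that fails this check is probably better sent as a normal request and left to implicit caching.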

Endpoints

OfoxAI provides the full cachedContents endpoints under the Gemini native protocol:


POST   https://api.ofox.ai/gemini/v1beta/cachedContents              # Create
GET    https://api.ofox.ai/gemini/v1beta/cachedContents/{id}         # Get
DELETE https://api.ofox.ai/gemini/v1beta/cachedContents/{id}         # Delete

To reference a cache for content generation, use the standard generateContent endpoint and add cachedContent to the request body:


POST   https://api.ofox.ai/gemini/v1beta/models/{model}:generateContent

Authentication

Like other Gemini endpoints, use the x-goog-api-key header:


x-goog-api-key: <YOUR_OFOXAI_API_KEY>

Full Workflow

The following demonstrates a complete flow: create cache → reference cache to generate → delete cache.

1. Create a Cache

Put the large context you want to reuse into contents, and specify model and ttl (time to live):

Python

create_cache.py


from google import genai
from google.genai import types
 
client = genai.Client(
    api_key="<YOUR_OFOXAI_API_KEY>",
    http_options={"api_version": "v1beta", "base_url": "https://api.ofox.ai/gemini"},
)
 
# A large document reused repeatedly (must meet the minimum token threshold)
with open("knowledge_base.txt", encoding="utf-8") as f:
    LONG_DOCUMENT = f.read()
 
cache = client.caches.create(
    model="google/gemini-3.1-pro-preview",
    config=types.CreateCachedContentConfig(
        contents=[LONG_DOCUMENT],
        ttl="600s",  # cache lives for 10 minutes
    ),
)
 
print(cache.name)  # cachedContents/xxxxxxxx — the handle for later references

TypeScript

create_cache.ts


import { GoogleGenAI } from '@google/genai'
import fs from 'node:fs'
 
const ai = new GoogleGenAI({
  apiKey: '<YOUR_OFOXAI_API_KEY>',
  httpOptions: { apiVersion: 'v1beta', baseUrl: 'https://api.ofox.ai/gemini' },
})
 
const longDocument = fs.readFileSync('knowledge_base.txt', 'utf-8')
 
const cache = await ai.caches.create({
  model: 'google/gemini-3.1-pro-preview',
  config: {
    contents: [longDocument],
    ttl: '600s', // cache lives for 10 minutes
  },
})
 
console.log(cache.name) // cachedContents/xxxxxxxx — the handle for later references

cURL

Terminal


curl "https://api.ofox.ai/gemini/v1beta/cachedContents" \
  -H "x-goog-api-key: $OFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3.1-pro-preview",
    "contents": [
      { "role": "user", "parts": [{ "text": "<a large block of context to reuse>" }] }
    ],
    "ttl": "600s"
  }'

The returned name (e.g. cachedContents/xxxxxxxx) is the cache handle. Save it for later references and management.

2. Reference the Cache to Generate

Reference the cache handle via cachedContent in a generateContent request; contents only needs the new question for this turn:

Python

use_cache.py


# Continues from create_cache.py: reuses `client`, `types`, and `cache`
response = client.models.generate_content(
    model="google/gemini-3.1-pro-preview",
    contents="Based on the document above, summarize three key points",
    config=types.GenerateContentConfig(
        cached_content=cache.name,  # reference the cache
    ),
)
 
print(response.text)
# Cache hit: the cached portion is billed at the lower "read" rate
print(response.usage_metadata.cached_content_token_count)

TypeScript

use_cache.ts


// Continues from create_cache.ts: reuses `ai` and `cache`
const response = await ai.models.generateContent({
  model: 'google/gemini-3.1-pro-preview',
  contents: 'Based on the document above, summarize three key points',
  config: {
    cachedContent: cache.name, // reference the cache
  },
})
 
console.log(response.text)
// Cache hit: the cached portion is billed at the lower "read" rate
console.log(response.usageMetadata?.cachedContentTokenCount)

cURL

Terminal


curl "https://api.ofox.ai/gemini/v1beta/models/google/gemini-3.1-pro-preview:generateContent" \
  -H "x-goog-api-key: $OFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cachedContent": "cachedContents/xxxxxxxx",
    "contents": [
      { "role": "user", "parts": [{ "text": "Based on the document above, summarize three key points" }] }
    ]
  }'

On a hit, the response usageMetadata.cachedContentTokenCount shows how many tokens came from the cache. The same cache can be referenced any number of times, each a deterministic hit, until it expires.
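To monitor how much of each request is actually served from the cache, the usage metadata can be reduced to a single ratio. The helper below is a minimal sketch operating on the raw JSON `usageMetadata` object; `cached_share` is a hypothetical name, and it assumes that `promptTokenCount` is present and already includes the cached tokens.

```python
def cached_share(usage_metadata: dict) -> float:
    """Fraction of prompt tokens served from the cache (0.0 when unknown).

    ASSUMPTION: promptTokenCount is the total prompt size, cached tokens included.
    """
    cached = usage_metadata.get("cachedContentTokenCount") or 0
    prompt = usage_metadata.get("promptTokenCount") or 0
    if prompt <= 0:
        return 0.0
    return cached / prompt


# Illustrative numbers only, not real API output.
usage = {"promptTokenCount": 10_000, "cachedContentTokenCount": 9_000}
share = cached_share(usage)
```

A share that drops to zero for a request that referenced a cache is a signal worth logging.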

3. Get and Delete

Query cache metadata, or delete it proactively when no longer needed (no need to wait for the TTL to expire):

Python

manage_cache.py


# Get
info = client.caches.get(name=cache.name)
print(info.expire_time)
 
# Delete
client.caches.delete(name=cache.name)

TypeScript

manage_cache.ts


// Get
const info = await ai.caches.get({ name: cache.name })
console.log(info.expireTime)
 
// Delete
await ai.caches.delete({ name: cache.name })

cURL

Terminal


# Get
curl "https://api.ofox.ai/gemini/v1beta/cachedContents/xxxxxxxx" \
  -H "x-goog-api-key: $OFOX_API_KEY"
 
# Delete
curl -X DELETE "https://api.ofox.ai/gemini/v1beta/cachedContents/xxxxxxxx" \
  -H "x-goog-api-key: $OFOX_API_KEY"

Get / delete do not require passing model again. OfoxAI locates the cache from the handle alone.

Lifecycle and TTL

Specify a time to live via ttl at creation, as a string of seconds ending in s (e.g. "600s").

Parameter	Value
Minimum / default TTL	`600s` (10 minutes)
Maximum TTL	`3600s` (1 hour)
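The format and bounds above can be checked on the client before a create request is sent. This is a minimal sketch: `format_ttl` is a hypothetical helper, and raising on out-of-range values (rather than clamping) is a design choice made here, since this page does not state how the server treats such values.

```python
MIN_TTL_SECONDS = 600    # minimum / default from the table above
MAX_TTL_SECONDS = 3600   # maximum from the table above


def format_ttl(seconds: int = MIN_TTL_SECONDS) -> str:
    """Validate a TTL in seconds and render it in the "<n>s" string form."""
    if not MIN_TTL_SECONDS <= seconds <= MAX_TTL_SECONDS:
        raise ValueError(
            f"ttl must be between {MIN_TTL_SECONDS}s and {MAX_TTL_SECONDS}s, "
            f"got {seconds}s"
        )
    return f"{seconds}s"
```

The returned string can be passed as the `ttl` value in the create call, for example `ttl=format_ttl(1800)`.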

Within the TTL the cache can be referenced repeatedly with deterministic hits.
After the TTL expires the cache is invalidated automatically, and any request that references it returns an error (the cache no longer exists).
You can delete it at any time to release it early.

Referencing an expired or deleted cache fails. Handle the “cache invalidated” case on the client side (catch the error and recreate the cache, or fall back to a normal request).
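The "catch and recreate" strategy might be structured as below. This is a sketch under stated assumptions, not SDK code: `CacheUnavailable` is a hypothetical exception that your own wrapper would raise after recognizing the error for an expired or deleted cache (this page does not specify its exact type or status code), and `generate` and `create_cache` are caller-supplied functions wrapping the real API calls.

```python
from typing import Callable, Tuple


class CacheUnavailable(Exception):
    """Hypothetical error: the referenced cache has expired or was deleted."""


def generate_with_cache(
    generate: Callable[[str], str],
    create_cache: Callable[[], str],
    handle: str,
) -> Tuple[str, str]:
    """Try the cached request; if the cache is gone, recreate it once and retry.

    Returns (response_text, handle_in_use) so the caller can store a new handle.
    """
    try:
        return generate(handle), handle
    except CacheUnavailable:
        new_handle = create_cache()  # pays the cache write fee again
        return generate(new_handle), new_handle
```

Retrying only once keeps a persistent failure from looping; a second failure propagates to the caller, which can then fall back to a normal uncached request.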

Billing

Explicit caching is billed transparently per token, in three parts:

Stage	Formula	Notes
Create cache	`totalTokenCount × cache_write rate`	Billed at the “cache write” rate; `totalTokenCount` comes from the `usageMetadata` of the create response — i.e. the number of cached tokens
Reference hit	`cachedContentTokenCount × cache_read rate`	The cached portion is billed at the “cache read” rate, roughly 0.10x the standard input price, far cheaper than resending everything
New content per reference	The new prompt / output for each turn is billed at standard rates	I.e. each new question you ask and the model’s new answer

The economics: you pay one “write” fee on creation, then every reference bills the cached content at the lower “read” rate instead of resending the whole context at full price. The more references, the more you save. The authoritative cache_write / cache_read unit prices for each model are listed in the model catalog and in your console usage stats.
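The formulas in the table can be turned into a quick comparison. The sketch below covers input-side cost only (output tokens cost the same either way), and every rate in it is a made-up placeholder in arbitrary units per token; only the 0.10x read ratio comes from this page. Look up real prices in the model catalog.

```python
def explicit_cost(cached_tokens, new_tokens, references,
                  input_rate, write_rate, read_rate):
    """One write fee, then read-rate cached tokens plus new prompt per reference."""
    create = cached_tokens * write_rate
    per_reference = cached_tokens * read_rate + new_tokens * input_rate
    return create + references * per_reference


def resend_cost(cached_tokens, new_tokens, references, input_rate):
    """Baseline: send the whole context at the standard input rate every time."""
    return references * (cached_tokens + new_tokens) * input_rate


# HYPOTHETICAL rates (arbitrary units per token), for illustration only.
INPUT_RATE, WRITE_RATE, READ_RATE = 1.0, 1.0, 0.1

with_cache = explicit_cost(100_000, 100, 10, INPUT_RATE, WRITE_RATE, READ_RATE)
without_cache = resend_cost(100_000, 100, 10, INPUT_RATE)
print(f"with cache: {with_cache:.0f}, without: {without_cache:.0f}")
```

Under these placeholder rates the cache pays for itself from the second reference onward; with a single reference the write fee makes it more expensive than a plain request.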

OfoxAI Enhancement: Deterministic Routing

OfoxAI load-balances across multiple GCP regions / Vertex projects. Explicit caches are bound to the upstream where they were created: a cache created in one GCP project must be referenced in that same project; otherwise the request fails with a 404.

OfoxAI handles this automatically: at creation time it records the upstream instance the cache is bound to, then looks up that instance by the cache handle and hard-locks references back to the same upstream. Regardless of whether your reference request carries a user identifier or how load is scheduled, references are routed precisely to the project that created the cache, with zero drift. You don’t need to think about the underlying region distribution.

Cache ownership is protected: a cache handle can only be referenced, queried, or deleted by the API key (and owner) that created it; cross-account access is rejected with a 403.
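Conceptually, the routing and ownership rules amount to a lookup keyed by the cache handle. The toy registry below is purely illustrative and is not OfoxAI's implementation: `CacheRouter`, `CacheBinding`, and the in-memory dict are hypothetical, and returning 404 for an unknown handle is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class CacheBinding:
    upstream: str  # upstream instance (project) that created the cache
    owner: str     # account that created the cache


class CacheRouter:
    """Toy in-memory registry; illustrative only."""

    def __init__(self) -> None:
        self._bindings: Dict[str, CacheBinding] = {}

    def record(self, handle: str, upstream: str, owner: str) -> None:
        """Remember where a cache was created and who owns it."""
        self._bindings[handle] = CacheBinding(upstream, owner)

    def resolve(self, handle: str, caller: str) -> Tuple[int, Optional[str]]:
        """Return (status, upstream); upstream is set only on success."""
        binding = self._bindings.get(handle)
        if binding is None:
            return 404, None   # unknown handle (assumed status)
        if binding.owner != caller:
            return 403, None   # cross-account access is rejected
        return 200, binding.upstream
```

Because the upstream is looked up from the handle alone, the result does not depend on load-balancing state or on any user identifier in the request.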

Explicit vs Implicit Caching

Dimension	Implicit (automatic)	Explicit (cachedContents)
Trigger	Auto-detects repeated prefixes	Actively create a cache object and reference it
Hit determinism	Best-effort; a small prefix change may miss	Deterministic; always hits while not expired
Management	None	Create / reference / delete a handle
Best for	Short content with occasionally repeated prefixes	Large, stable context reused repeatedly
Billing	Read-rate discount on hit	Write fee on create + read-rate discount on reference

The two can be combined. For everyday requests, let implicit caching kick in automatically; for context you definitely reuse repeatedly, use explicit caching to lock in hits. See Prompt Caching for implicit-caching details.
