Skip to Content
APIGemini NativeCached Contents

Cached Contents

Manage explicit context caches (cachedContents) via the Gemini native protocol: actively cache a large context as an object, reference it across requests for deterministic hits and lower cost. OfoxAI is compatible with the Google GenAI SDK.

For use cases, the difference from implicit caching, and best practices, see the Gemini Explicit Caching guide. This page is the endpoint-level API reference.

Endpoints

POST https://api.ofox.ai/gemini/v1beta/cachedContents # Create GET https://api.ofox.ai/gemini/v1beta/cachedContents/{id} # Get DELETE https://api.ofox.ai/gemini/v1beta/cachedContents/{id} # Delete

To reference a cache for content generation, use the standard generateContent endpoint with a cachedContent field in the body:

POST https://api.ofox.ai/gemini/v1beta/models/{model}:generateContent

Authentication

Use the x-goog-api-key header:

x-goog-api-key: <YOUR_OFOXAI_API_KEY>

Resource Fields

Key fields of the CachedContent resource:

FieldTypeDescription
namestring (output only)Cache handle, e.g. cachedContents/{id}, returned on create
modelstring (required, immutable)Model the cache is bound to, e.g. models/gemini-3.1-pro-preview
contentsarrayContent to cache (same structure as generateContent’s contents)
systemInstructionobjectSystem instruction to cache (optional)
toolsarrayTool definitions to cache (optional)
ttlstringTime to live, a seconds string (e.g. "600s"); mutually exclusive with expireTime
expireTimestringExpiration timestamp (RFC 3339); mutually exclusive with ttl
displayNamestring (immutable)Custom name (optional)
usageMetadata.totalTokenCountintegerNumber of cached tokens (used for billing)

Supported TTL range: minimum / default 600s (10 minutes), maximum 3600s (1 hour).

Create a Cache

create.py
from google import genai from google.genai import types client = genai.Client( api_key="<YOUR_OFOXAI_API_KEY>", http_options={"api_version": "v1beta", "base_url": "https://api.ofox.ai/gemini"}, ) cache = client.caches.create( model="google/gemini-3.1-pro-preview", config=types.CreateCachedContentConfig( contents=[open("knowledge_base.txt").read()], system_instruction="You answer strictly based on the provided document.", ttl="600s", display_name="kb-v1", ), ) print(cache.name) # cachedContents/xxxxxxxx print(cache.usage_metadata.total_token_count)

Response

{ "name": "cachedContents/xxxxxxxx", "model": "google/gemini-3.1-pro-preview", "createTime": "2026-06-26T08:00:00Z", "updateTime": "2026-06-26T08:00:00Z", "expireTime": "2026-06-26T08:10:00Z", "displayName": "kb-v1", "usageMetadata": { "totalTokenCount": 14407 } }

Get / Delete

Get and delete do not require model; OfoxAI locates the upstream from the cache handle.

manage.py
# Get one info = client.caches.get(name=cache.name) print(info.expire_time) # Delete client.caches.delete(name=cache.name)

Reference a Cache to Generate

Add a cachedContent field to the generateContent body to reference the cache; contents only carries the new question for this turn:

use.py
response = client.models.generate_content( model="google/gemini-3.1-pro-preview", contents="Based on the document above, summarize three key points", config=types.GenerateContentConfig(cached_content=cache.name), ) print(response.text) print(response.usage_metadata.cached_content_token_count) # cached tokens hit

On a hit, the response usageMetadata.cachedContentTokenCount shows how many tokens came from the cache.

Billing

StageFormula
Create cachetotalTokenCount × cache_write rate
Reference hitcachedContentTokenCount × cache_read rate (~0.10x of standard input)
New content per referenceNew prompt / output for the turn billed at standard rates

Each model’s cache_write / cache_read unit prices are in the model catalog .

OfoxAI load-balances across multiple GCP projects, and explicit caches are region-scoped. OfoxAI automatically hard-locks references back to the upstream that created the cache, with zero drift; a cache handle can only be referenced / queried / deleted by the API Key that created it (cross-account access returns 403). See Explicit Caching guide · Deterministic Routing.

Last updated on