# Rate Limits

OfoxAI's rate limits ensure platform stability. Understanding these limits helps you optimize your API usage.
## Default Limits
OfoxAI uses pay-as-you-go pricing with a unified rate policy for all users:
| Limit | Quota |
|---|---|
| RPM (requests/minute) | 100 (team-aggregated) |
| TPM (tokens/minute) | Unlimited |
**Team-level aggregation:** RPM is counted cumulatively across the entire organization (team). Multiple API keys under the same team share a single quota, so stacking keys cannot be used to break through upstream providers' rate limits. If you need a higher RPM quota, contact [email protected] to request an adjustment.
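Because the quota is shared across the team, it can help to throttle requests client-side before they reach the API. Below is a minimal sketch of a sliding-window limiter; the class name is illustrative, and the defaults match the 100 requests/minute quota above:

```python
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side limiter: allow at most `limit` calls per `window` seconds."""

    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._calls: deque[float] = deque()  # timestamps of recent calls
        self._lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a request slot is available, then claim it."""
        while True:
            with self._lock:
                now = time.monotonic()
                # Drop timestamps that have aged out of the window.
                while self._calls and now - self._calls[0] >= self.window:
                    self._calls.popleft()
                if len(self._calls) < self.limit:
                    self._calls.append(now)
                    return
                wait = self.window - (now - self._calls[0])
            time.sleep(wait)
```

Calling `limiter.acquire()` before each `client.chat.completions.create(...)` keeps all threads in a process under the shared quota; separate processes or machines would each need their own margin of the team limit.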
## Rate Limit Headers
Every API response includes rate limit information:
```
x-ratelimit-limit-requests: 100
x-ratelimit-remaining-requests: 95
x-ratelimit-reset-requests: 12s
```

| Header | Description |
|---|---|
| `x-ratelimit-limit-requests` | RPM limit value |
| `x-ratelimit-remaining-requests` | Remaining request count |
| `x-ratelimit-reset-requests` | Time until the request limit resets |
## Handling 429 Errors
When rate-limited, the API returns `429 Too Many Requests`:

```python
import time

from openai import RateLimitError

try:
    response = client.chat.completions.create(...)
except RateLimitError as e:
    # Honor the server's Retry-After header, defaulting to 1 second.
    retry_after = float(e.response.headers.get("retry-after", 1))
    print(f"Rate limited, waiting {retry_after}s...")
    time.sleep(retry_after)
```

## Optimization Strategies
### 1. Use Prompt Caching
For repeated system prompts, enabling caching reduces token consumption:

```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        # Long system prompts are automatically cached
        {"role": "system", "content": "You are a professional... (long text omitted)"},
        {"role": "user", "content": "User question"}
    ]
)
```

See Prompt Caching for details.
### 2. Batch Processing
Consolidate multiple short requests into a single request:
```python
# ❌ Not recommended: a separate request for each question
for question in questions:
    client.chat.completions.create(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": question}]
    )

# ✅ Recommended: combine into one request
combined = "\n".join(f"{i+1}. {q}" for i, q in enumerate(questions))
client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": f"Please answer the following questions:\n{combined}"}]
)
```

### 3. Choose the Right Model
For recommended models, see the Model Marketplace.
### 4. Control `max_tokens`
Set a reasonable `max_tokens` limit to avoid unnecessary token consumption:

```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Summarize in one sentence"}],
    max_tokens=100  # Limit output length
)
```

### 5. Use Model Fallback
Automatically switch to alternative models when the primary model hits its limit:

```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[...],
    extra_body={
        "provider": {
            "fallback": ["anthropic/claude-sonnet-4.6", "google/gemini-3.1-flash-lite-preview"]
        }
    }
)
```

See Fallback for details.
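If you prefer not to rely on the provider-side `fallback` field, the same idea can be sketched client-side. The helper name below is illustrative, and `retry_on` would typically be `(openai.RateLimitError,)` when using the OpenAI SDK:

```python
def create_with_fallback(client, messages, models, retry_on):
    """Try each model in order, moving to the next on a retryable error.

    `client` is any OpenAI-compatible client; `retry_on` is a tuple of
    exception types that should trigger falling back to the next model.
    """
    last_error = None
    for model in models:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except retry_on as e:
            last_error = e
    raise last_error


# Usage sketch:
#   create_with_fallback(
#       client, messages,
#       models=["openai/gpt-4o", "anthropic/claude-sonnet-4.6"],
#       retry_on=(RateLimitError,),
#   )
```

A client-side loop gives you control over the retry order and error types, at the cost of paying one failed round-trip per exhausted model.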