Rate Limit Reached in Claude Code: 429 Causes and 6 Fixes (2026)

If Claude Code says “Rate limit reached,” the API is telling you that you spent your per-minute token or request budget, not that Anthropic is down. The fix is reading one header and slowing down the right dimension.

You ran a few parallel subagents, or a long agent loop, and Claude Code stops with a red line about a rate limit or “Request rejected (429).” Nothing is broken. You hit a per-minute ceiling that your account tier sets, and the API is asking you to wait a specific number of seconds before trying again.

This article is about the API-side 429 you get as a Console / API-key user billed per token. If you are on a Pro or Max subscription and you hit a weekly or session cap instead of a 429, that is a different mechanism with a different fix, covered in Claude Code usage limit hit too fast.

The 30-Second Diagnosis

First, settle which error you are actually looking at, because the fixes diverge sharply.

What you see in the terminal	HTTP status	What it means	First move
`API Error: Request rejected (429)`	429	Your account exceeded its own per-minute limit (RPM/ITPM/OTPM)	Read `retry-after`, then cut concurrency or cache
`API Error: Server is temporarily limiting requests (not your usage limit)`	429 (acceleration limit)	Short-lived throttle after a sharp traffic spike	Wait briefly, ramp traffic gradually
`API Error: Repeated 529 Overloaded errors`	529	Anthropic’s servers are at capacity for everyone	Wait or `/model` to a less-loaded model. See the 529 guide
`You've hit your weekly limit · resets Mon`	n/a (subscription)	Pro/Max plan allowance, not an API 429	See the usage-limit guide

Run this checklist in order:

Confirm it is a 429, not a 529. Claude Code uses different wording for each. “Request rejected (429)” or “Rate limited” is your account. “529 Overloaded” is the server. Treat them differently.
Check which credential is active. Run /status. A stray ANTHROPIC_API_KEY in your shell can route you through a low-tier key instead of the account you think you are on.
Look at your tier. Open the Limits page in the Console. Tier 1 is tight: 50 RPM and 30,000 ITPM for Sonnet 4.x. A single large-context turn plus one subagent can exceed that.

When to Fix the 429 (and When to Just Wait)

Decide before you start refactoring, because some 429s clear themselves and some need real changes.

When to just retry and wait. If you see an occasional 429 during a burst and retry-after is small (a few seconds), do nothing structural. Claude Code already retries transient 429 and 529 capacity errors up to 10 times with exponential backoff before it surfaces the error, showing a Retrying in Ns · attempt x/y countdown. If the error rate is under roughly 5% of requests, retry is the whole fix.

When to actually change something. If 429s are constant, if they kill subagents mid-task, or if retry-after keeps climbing, you are structurally over your tier’s per-minute budget. Lower concurrency, add caching, or raise your tier. Throwing more retries at a saturated limit just queues more failures.

Stop rule. If you are on Tier 1 running parallel subagents over a large repo, no amount of backoff will hold. Either move to a higher tier or pool capacity through a gateway. Backoff smooths spikes; it does not raise a ceiling.

What “Rate Limit Reached” and HTTP 429 Actually Mean

A 429 is the HTTP status the Claude API returns when your organization exceeds one of its per-minute rate limits. It is an account-level signal, scoped to your organization, and it is enforced separately for each model. The error body is a rate_limit_error and the message names which limit you hit, for example “This request would exceed your organization’s rate limit of x0,000 input tokens per minute.” The API also attaches a retry-after header.

There are three per-minute dimensions the Messages API enforces, per Anthropic’s rate-limits docs:

RPM (requests per minute): how many API calls you can make.
ITPM (input tokens per minute): how many input tokens you can send. This is the one that bites agent workloads, because every turn ships the full conversation context as input.
OTPM (output tokens per minute): how many tokens the models can generate for you per minute. Note: max_tokens does not factor into OTPM; only the tokens actually produced count, so setting a high max_tokens has no rate-limit downside.

You get a 429 the instant you cross any one of the three. The limit uses a token-bucket algorithm, so your capacity refills continuously rather than resetting at the top of each minute. That also means short bursts can trip the limit even when your minute-averaged usage looks fine: 60 RPM is enforced closer to 1 request per second, not 60 requests fired at once.
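
To see why a burst trips the limit while the same volume spread out does not, here is a minimal token-bucket sketch. It illustrates the general algorithm, not Anthropic's actual implementation: the `TokenBucket` class, the `count_accepted` helper, and the burst capacity of 5 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Illustrative token bucket; not Anthropic's actual implementation."""
    capacity: float        # maximum burst size (assumed value, for illustration)
    refill_per_sec: float  # continuous refill rate
    tokens: float = 0.0
    last: float = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # a real API would answer 429 here


def count_accepted(timestamps, capacity=5.0, rpm=60):
    """Count how many requests at the given times (in seconds) get through."""
    bucket = TokenBucket(capacity=capacity, refill_per_sec=rpm / 60, tokens=capacity)
    return sum(bucket.allow(t) for t in timestamps)


burst = [0.0] * 60                      # 60 requests in the same instant
paced = [float(i) for i in range(60)]   # one request per second
print(count_accepted(burst), count_accepted(paced))  # 5 60
```

Both runs send 60 requests against a 60 RPM limit. Only the paced run gets all of them accepted, because the bucket refills one request's worth per second.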

Three numbers govern every 429: requests, input tokens, and output tokens per minute. Find which one you are blowing, and the fix picks itself.

The 429 response and the headers that tell you everything

When you hit the limit, the response carries headers that tell you the limit, your remaining budget, and when it refills. These are worth surfacing in any agent loop you build:

HTTP/1.1 429 Too Many Requests
retry-after: 12
anthropic-ratelimit-input-tokens-limit: 30000
anthropic-ratelimit-input-tokens-remaining: 0
anthropic-ratelimit-input-tokens-reset: 2026-06-24T18:42:30Z

retry-after is the number of seconds to wait before retrying; earlier retries will fail. The anthropic-ratelimit-*-limit, -remaining, and -reset headers exist for requests, input tokens, and output tokens separately, so you can watch headroom on each dimension and back off before you trip rather than after.
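
As a rough sketch of surfacing these headers in your own loop, the helper below turns the input-token headers into a headroom fraction and a wait time. The function name `input_token_headroom` is hypothetical, and it assumes your HTTP client hands you the headers as a plain dict with lowercase keys.

```python
from datetime import datetime, timezone


def input_token_headroom(headers, now=None):
    """Return (fraction of input-token budget left, seconds until the reset time)."""
    now = now or datetime.now(timezone.utc)
    limit = int(headers["anthropic-ratelimit-input-tokens-limit"])
    remaining = int(headers["anthropic-ratelimit-input-tokens-remaining"])
    # The reset value is a timestamp; map a trailing "Z" to an explicit UTC
    # offset so datetime.fromisoformat accepts it on older Python versions.
    reset = datetime.fromisoformat(
        headers["anthropic-ratelimit-input-tokens-reset"].replace("Z", "+00:00")
    )
    seconds_to_reset = max(0.0, (reset - now).total_seconds())
    return remaining / limit, seconds_to_reset


# The sample 429 response from above, observed 12 seconds before its reset time.
sample = {
    "anthropic-ratelimit-input-tokens-limit": "30000",
    "anthropic-ratelimit-input-tokens-remaining": "0",
    "anthropic-ratelimit-input-tokens-reset": "2026-06-24T18:42:30Z",
}
observed_at = datetime(2026, 6, 24, 18, 42, 18, tzinfo=timezone.utc)
print(input_token_headroom(sample, now=observed_at))  # (0.0, 12.0)
```

The same pattern applies to the requests and output-token headers; only the header names change.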

One caveat the field has learned the hard way: retry-after is a server hint, and it can occasionally read a touch short. Honor it as a floor, then add positive jitter on top so a fleet of workers does not all wake up and retry in the same millisecond.

Error and Symptom to Cause to Fix

Symptom	Likely cause	Fix
429 within seconds of spawning subagents	Concurrency fans out RPM and ITPM at once	Lower `CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY`; spawn fewer subagents
429 on a single large turn, no parallelism	One turn’s context exceeds ITPM (Tier 1 Sonnet = 30k)	Trim context with `/compact` or `/clear`; enable prompt caching
429 right after a quiet period, then a spike	Acceleration limit on sharp usage increase	Ramp traffic gradually; keep a steady request rate
429s appear, dashboard shows quota left	Looking at spend limit, not the per-minute rate limit	Per-minute caps and monthly spend caps are different; check the rate-limit chart
`retry-after` keeps growing across retries	You are structurally over the limit, retries pile up	Stop retrying harder; raise tier or pool capacity
429 in CI that should be on a high tier	Stray `ANTHROPIC_API_KEY` routing to a low-tier key	Run `/status`; unset the wrong key

Why It Trips: Concurrency, Token Bursts, and No Backoff

Three patterns cause almost every Claude Code 429.

Parallel subagents and parallel sessions. This is the big one. When Claude Code fans out subagents through the Task tool, each subagent is a separate stream of requests, and each one carries its own copy of the context as input tokens. Run six subagents over a large repository and your ITPM and RPM both spike at once. A known Claude Code issue documents 5-10 parallel sessions launching subagent fan-outs and hard-failing under capacity errors. Reduce the fan-out and the 429s stop.

Context bursts on a tight tier. On Tier 1, Sonnet 4.x gets 30,000 ITPM. If your conversation context has grown to 25,000 tokens, one new message consumes most of your per-minute input budget, and a second message in the same minute trips the 429. This is “prompt bloat”: the session got larger than the task needs. Run /context to see what is filling the window, then /compact or /clear.

No backoff in your own scripts. Claude Code itself retries with exponential backoff. But if you call the API directly, outside Claude Code (a CI wrapper, a custom agent loop), and you retry on a fixed interval with no jitter, every worker retries in lockstep and re-trips the limit together. Synchronized retries are how a brief throttle becomes a sustained outage.
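
To put numbers on the context-burst pattern described above, the sketch below estimates how many full-context turns fit into one minute of input-token budget. It is a deliberate simplification: it ignores continuous refill and assumes nothing is cached. The function name and the 6,000-token post-compaction figure are made up for illustration.

```python
def full_context_turns_per_minute(itpm_limit: int, context_tokens: int) -> int:
    """How many turns of this size fit in one minute of input-token budget.

    Simplification: ignores continuous refill and assumes nothing is cached.
    """
    return itpm_limit // context_tokens


# Tier 1 Sonnet 4.x (30,000 ITPM) with a 25,000-token conversation.
print(full_context_turns_per_minute(30_000, 25_000))  # 1
# The same tier after trimming context to a hypothetical 6,000 tokens.
print(full_context_turns_per_minute(30_000, 6_000))   # 5
```

On these assumptions, trimming context is what moves a Tier 1 session from one turn per minute to several.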

How to Fix It: Solutions for Every Tier

Free / Tier 1 solution: read the header, cut the burst

If you are on the lowest tier, you have the least headroom, so behave accordingly.

Honor retry-after. Wait the number of seconds the header says, then add jitter. Do not retry early.
Shrink context. Run /compact at natural breakpoints or /clear to start fresh. Less context per turn means less ITPM per request.
Drop concurrency to near-serial. Set the concurrency cap low so you are not fanning out requests you cannot afford.

# Lower tool-use concurrency before a heavy session
export CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY=2
# Surface failures faster in scripts instead of retrying 10x
export CLAUDE_CODE_MAX_RETRIES=3
claude

Paid / Tier 2-3 solution: caching and exponential backoff

Once you have some headroom, the leverage is caching plus disciplined retries.

Prompt caching cuts ITPM, not just cost. On most current Claude models, cache_read_input_tokens do not count toward your ITPM limit. Anthropic’s own example: with a 2,000,000 ITPM limit and an 80% cache hit rate, you can effectively process about 10,000,000 total input tokens per minute, because the cached 8M is free against the rate limit. Cache your system prompt, tool definitions, and large context documents. For the mechanics of trimming token usage end to end, see Claude Code token optimization.
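
The arithmetic behind that example can be sketched in a few lines. It assumes, as stated above, that cache reads are fully exempt from ITPM on the model you use; the function name is hypothetical.

```python
def effective_input_tokens_per_minute(itpm_limit: int, cache_hit_rate: float) -> float:
    """Total input tokens per minute when cache reads are exempt from ITPM."""
    if not 0 <= cache_hit_rate < 1:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    # Only the uncached share (1 - hit rate) draws down the ITPM budget.
    return itpm_limit / (1 - cache_hit_rate)


total = effective_input_tokens_per_minute(2_000_000, 0.80)
print(f"{total:,.0f} total input tokens/min")           # 10,000,000 total input tokens/min
print(f"{total - 2_000_000:,.0f} of them served from cache")  # 8,000,000 of them served from cache
```

The uncached 20% is the part that counts, so 2,000,000 uncached tokens correspond to 10,000,000 tokens in total.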

Exponential backoff with jitter is the correct retry shape when you control the loop:

import random, time

def backoff_sleep(attempt, retry_after=None):
    if retry_after is not None:        # honor the server hint as a floor
        base = float(retry_after)
    else:
        base = min(2 ** attempt, 60)   # 1, 2, 4, 8 ... capped at 60s
    time.sleep(base + random.uniform(0, base * 0.25))  # add jitter
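
One way to wire that backoff shape into a retry loop is sketched below. `RateLimited` is a hypothetical exception standing in for whatever your HTTP client or SDK raises on a 429, and `call_with_backoff` is an illustrative name; the `sleep` parameter exists only so the loop can be tested without real waiting.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for whatever your client raises on a 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from the retry-after header, if any


def call_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the 429 to the caller
            if err.retry_after is not None:
                base = float(err.retry_after)   # server hint is the floor
            else:
                base = min(2 ** attempt, 60)    # 1, 2, 4 ... capped at 60s
            # Jitter is positive only, so we never retry before the floor.
            sleep(base + random.uniform(0, base * 0.25))
```

Capping the attempts matters as much as the delay: a loop that retries forever against a saturated limit only queues more failures.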

For unattended runs like CI, you can also set CLAUDE_CODE_RETRY_WATCHDOG=1 so Claude Code retries 429 and 529 capacity errors indefinitely instead of giving up after CLAUDE_CODE_MAX_RETRIES attempts.

Enterprise / Tier 4 solution: raise the ceiling or pool it

When backoff and caching are not enough, you need more limit.

Advance your tier. Tiers raise automatically as cumulative credit purchases hit each threshold. Tier 4 gives Sonnet 4.x 4,000 RPM and 2,000,000 ITPM, versus Tier 1’s 50 RPM / 30,000 ITPM. For limits above Tier 4, contact Anthropic sales for custom or Priority Tier limits.
Set per-workspace caps so one project cannot starve the rest of the organization. Workspace limits carve up the org’s per-minute budget per limiter type.
Spread load across providers or pool capacity. If your workload is bursty and a single organization’s per-minute caps are the bottleneck, a routing layer in front of multiple providers widens the effective ceiling. More on the pattern in the Claude Code hybrid routing guide.

Anthropic rate limits by tier (Sonnet 4.x and Opus 4.x)

The per-minute caps that decide how fast you trip a 429, from the official rate-limits docs:

Tier	Credit to reach	Sonnet 4.x RPM / ITPM / OTPM	Opus 4.x RPM / ITPM / OTPM
Tier 1	$5	50 / 30,000 / 8,000	50 / 500,000 / 80,000
Tier 2	$40	1,000 / 450,000 / 90,000	1,000 / 2,000,000 / 200,000
Tier 3	$200	2,000 / 800,000 / 160,000	2,000 / 5,000,000 / 400,000
Tier 4	$400	4,000 / 2,000,000 / 400,000	4,000 / 10,000,000 / 800,000

The Opus 4.x limit is a single total shared across Opus 4.8, 4.7, 4.6, 4.5, and 4.1. The Sonnet 4.x limit is shared across Sonnet 4.6 and 4.5. Switching between versions in the same family does not give you a fresh bucket. One detail that helps: limits are applied separately per model, so an Opus job and a Sonnet job each draw from their own pool and run up to their own caps at the same time. Routing your cheap, high-volume work to Haiku 4.5 keeps it off the Opus and Sonnet buckets entirely.

The acceleration limit: the 429 that has nothing to do with your tier

There is a second flavor of 429 worth knowing about, because it surprises people who are nowhere near their tier ceiling. If your organization’s usage jumps sharply, the API can apply an acceleration limit and return a 429 even though your per-minute caps have headroom. The intent is to stop runaway spikes, not to punish steady use. The fix is not “raise my tier,” it is “ramp up gradually.” Warm a new workload in over a minute or two instead of firing peak traffic from a cold start, and keep your request rate roughly consistent. In Claude Code this is the throttle behind the Server is temporarily limiting requests (not your usage limit) message, which the CLI already retries for you before surfacing it.
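
A gradual ramp can be as simple as a stepped schedule of request-rate caps that your own scheduler enforces. The sketch below is one way to build it. The linear shape, the 120-second default, and the function name `ramp_schedule` are assumptions for illustration, not documented thresholds.

```python
def ramp_schedule(target_rpm: int, ramp_seconds: int = 120, steps: int = 6):
    """Return (start_second, allowed_rpm) pairs that climb linearly to target_rpm."""
    step_len = ramp_seconds / steps
    return [
        (round(i * step_len), round(target_rpm * (i + 1) / steps))
        for i in range(steps)
    ]


for start, rpm in ramp_schedule(600):
    print(f"from t={start:>3}s: up to {rpm} requests/min")
```

For a 600 RPM target this yields caps of 100, 200, 300, 400, 500, and 600 requests per minute, each starting 20 seconds after the previous one.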

Spread the load instead of stacking it

Most agent 429s come from doing everything in one tight window. A few cheap scheduling habits make the same total work fit under the same caps:

Serialize the expensive turns. A single large-context turn can be most of your ITPM minute on Tier 1. Run those one at a time rather than letting subagents fire them in parallel.
Batch the non-urgent work. If a job does not need a live answer, the Message Batches API has its own separate limits, so moving bulk work there frees your interactive RPM and token budgets.
Stagger CI jobs. Do not schedule ten pipelines to start on the same cron minute. Offset their start times so their first bursts do not collide on the same bucket.
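
For the staggering habit in particular, a small helper can hand each pipeline its own start offset. This is a sketch under assumptions: the five-minute window, the job names, and `staggered_offsets` itself are hypothetical, and how you apply the offset (a sleep at job start, a shifted cron minute) depends on your CI system.

```python
import random


def staggered_offsets(job_names, window_seconds=300, seed=None):
    """Assign each job an evenly spaced start offset, plus a little jitter."""
    rng = random.Random(seed)
    slot = window_seconds / max(len(job_names), 1)
    offsets = {}
    for i, name in enumerate(job_names):
        # Jitter stays inside the first half of the job's own slot,
        # so two jobs never land on the same start time.
        offsets[name] = round(i * slot + rng.uniform(0, slot * 0.5), 1)
    return offsets


jobs = [f"pipeline-{i}" for i in range(10)]
for name, delay in staggered_offsets(jobs, seed=42).items():
    print(f"{name}: start after {delay}s")
```

Ten pipelines spread over five minutes start roughly 30 seconds apart, so their first bursts draw on the bucket one after another instead of all at once.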

Common Failure Patterns We’ve Observed in 2026

Recurring patterns users keep reporting, drawn from the Claude Code issue tracker and support threads:

Pattern	Trigger	What fixes it
Subagent fan-out hard-fail	6+ parallel subagents on large context	Lower concurrency cap; reduce fan-out
529 mislabeled as “Rate limited”	Server overload during parallel sessions	Recognize it is 529, not your quota; wait or `/model` switch
429 despite “available quota” on dashboard	Confusing monthly spend limit with per-minute rate limit	Read the rate-limit chart, not the spend chart
`/batch`-style burst rejected (429)	Many requests fired in one tight window	Stagger requests; respect the token bucket
CI loops re-tripping the same limit	Fixed-interval retries with no jitter	Exponential backoff plus random jitter

These are mechanism patterns, not a published incident log. The throughline is the same: find the dimension you are saturating, then either slow it down or widen it.

When You Need More Headroom: Best Alternatives That Work Now

If your bottleneck is one organization’s per-minute caps, here are the realistic options, ofox first.

ofox pooled capacity (one key, pay-as-you-go). ofox exposes an OpenAI-compatible endpoint at https://api.ofox.ai/v1 that routes one API key across many models on pooled, pay-as-you-go capacity. Because requests draw from shared capacity rather than your single low-tier organization bucket, a bursty agent workload is less likely to saturate one provider’s per-minute ceiling. You keep the same OpenAI SDK shape and swap a model string to change models. It does not repeal token-per-minute physics, but it gives a bursty workload more room before it trips.

from openai import OpenAI
client = OpenAI(base_url="https://api.ofox.ai/v1", api_key="YOUR_OFOX_KEY")
resp = client.chat.completions.create(
    model="anthropic/claude-opus-4.8",
    messages=[{"role": "user", "content": "Refactor this function for clarity."}],
)

Raise your Anthropic tier. The cleanest fix if you want to stay single-provider: buy enough credit to reach the next tier and your RPM/ITPM/OTPM caps jump immediately. Best when your traffic is predictable and you can commit spend.

Multi-provider routing on your side. Run your own router that spreads requests across providers and respects each one’s retry-after. Most control, most code to maintain. The hybrid routing pattern covers the design.
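
The core of a do-it-yourself router is remembering which provider is cooling down after a 429. A minimal sketch, assuming you track each provider's retry-after yourself: the class `CooldownRouter` and the provider names are hypothetical, and a production router would also need health checks, per-model limits, and cost rules.

```python
import time


class CooldownRouter:
    """Pick the first provider that is not cooling down after a 429."""

    def __init__(self, providers, clock=time.monotonic):
        self.clock = clock
        # Dict insertion order doubles as priority order.
        self.ready_at = {name: 0.0 for name in providers}

    def pick(self):
        now = self.clock()
        for name, ready in self.ready_at.items():
            if ready <= now:
                return name
        return None  # every provider is cooling down; the caller should wait

    def record_429(self, name, retry_after):
        # Respect this provider's own retry-after before sending it more traffic.
        self.ready_at[name] = self.clock() + float(retry_after)

    def seconds_until_ready(self):
        return max(0.0, min(self.ready_at.values()) - self.clock())


router = CooldownRouter(["primary", "fallback"])  # hypothetical provider names
print(router.pick())             # primary
router.record_429("primary", 12)
print(router.pick())             # fallback
```

When `pick()` returns `None`, sleeping for `seconds_until_ready()` plus jitter keeps the router from hammering providers that have already asked it to wait.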

How to Monitor Rate Limits and Get Ahead of 429s

You can see a 429 coming before it lands. Three places to watch:

The Console Usage page has two rate-limit charts: hourly maximum uncached input tokens per minute against your current ITPM cap, and the same for output tokens. It also shows your cache hit rate, which tells you how much ITPM headroom caching is buying you.
The response headers (anthropic-ratelimit-*-remaining and -reset) let your own code throttle proactively. When remaining input tokens approach zero, slow down before the bucket empties.
Anthropic status at status.claude.com tells you whether a wave of “rate limited” messages is actually a 529 capacity event affecting everyone, in which case the fix is patience, not config.
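
For the header-based option, proactive throttling can be a single function that converts remaining headroom into a pause. This is a sketch, not a recommended policy: the 20% low-water mark and the name `pacing_delay` are assumptions, and the inputs are the limit, remaining, and seconds-to-reset values you have already parsed from the anthropic-ratelimit-* headers.

```python
def pacing_delay(remaining: int, limit: int, seconds_to_reset: float,
                 low_water: float = 0.2) -> float:
    """Seconds to pause before the next request, based on remaining headroom."""
    if limit <= 0:
        return 0.0
    fraction_left = remaining / limit
    if fraction_left >= low_water:
        return 0.0  # plenty of headroom: send immediately
    # Below the low-water mark, wait a share of the time until reset that
    # grows as the bucket empties (0 left -> wait the full interval).
    return seconds_to_reset * (1 - fraction_left / low_water)


print(pacing_delay(15_000, 30_000, 12))  # 0.0
print(pacing_delay(0, 30_000, 12))       # 12.0
```

Calling this before each request slows the loop down smoothly as the bucket drains, instead of running at full speed until the first 429.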

FAQ

What does API Error: Rate Limit Reached mean in Claude Code? It means the Claude API returned HTTP 429: your organization sent more requests or tokens per minute than your usage tier allows. It is an account-side throttle on RPM, ITPM, or OTPM, not a server outage. The response carries a retry-after header telling you how many seconds to wait.

Is a 429 the same as a 529 in Claude Code? No. A 429 (rate_limit_error) means your account exceeded its own per-minute limits. A 529 (overloaded_error) means Anthropic’s servers are temporarily at capacity for everyone, independent of your usage. Claude Code shows them as different messages: “Request rejected (429)” versus “Repeated 529 Overloaded errors.”

How long should I wait after a 429 rate limit error? Read the retry-after header. It gives the number of seconds to wait before retrying; treat it as a minimum, because retrying earlier will fail again. If the header is missing, fall back to exponential backoff starting around 1-2 seconds and doubling, with random jitter so parallel workers do not all retry at the same instant.

Why does Claude Code hit the rate limit so fast with subagents? Parallel subagents fan out many concurrent requests, and each one carries the full conversation context as input tokens. A handful of subagents on a large context can blow past your ITPM cap in seconds. Lower CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY or reduce how many subagents you spawn at once.

What are RPM, ITPM, and OTPM in the Anthropic rate limits? RPM is requests per minute, ITPM is input tokens per minute, and OTPM is output tokens per minute. The Messages API enforces all three per model class. You get a 429 the moment you exceed any one of them, and the error names which limit you hit.

How do I raise my Claude API rate limit? Usage tiers advance automatically as your cumulative credit purchases reach each threshold ($5 for Tier 1, $40 for Tier 2, $200 for Tier 3, $400 for Tier 4). Higher tiers raise RPM, ITPM, and OTPM. For limits above Tier 4 you contact Anthropic sales for custom or Priority Tier limits.

Does prompt caching help with 429 rate limit errors? Yes. On most current Claude models, cache_read_input_tokens do not count toward your ITPM limit. With an 80% cache hit rate against a 2,000,000 ITPM cap you can effectively process about 10,000,000 total input tokens per minute, because the cached portion is free against the rate limit.

Can I avoid Claude Code rate limits by switching providers? You can spread load across providers or use a pooled gateway. An OpenAI-compatible gateway like ofox at https://api.ofox.ai/v1 routes one key across many models on pay-as-you-go pooled capacity, so a single project is less likely to saturate one organization’s per-minute caps. It does not remove per-minute limits, but it widens the headroom.

A 429 is the API setting a pace, not slamming a door. Read retry-after, slow the dimension you are saturating, and the work keeps moving.

Sources Checked for This Refresh

Anthropic API Rate Limits reference for the tier table, RPM/ITPM/OTPM, retry-after and anthropic-ratelimit-* headers, and cache-aware ITPM (verified 2026-06-24)
Claude Code Error reference for the exact “Request rejected (429)” and “Server is temporarily limiting requests” strings, plus CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY, CLAUDE_CODE_MAX_RETRIES, CLAUDE_CODE_RETRY_WATCHDOG, and automatic-retry behavior (verified 2026-06-24)
Claude Code issue #68502 on parallel subagents, 429 vs 529 mislabeling, and hard-fail without backoff (verified 2026-06-24)
ofox models page for the current Claude flagship (Opus 4.8) and the OpenAI-compatible endpoint (verified 2026-06-24)