GPT-5.5 Instant Lands: New ChatGPT Default Model, Hallucinations Down 52.5% in High-Stakes Domains

TL;DR — OpenAI swapped ChatGPT’s default model from GPT-5.3 Instant to GPT-5.5 Instant on May 5. API alias is chat-latest. The headline upgrade is not new capabilities — it is “fewer hallucinations, fewer words, better personalization.” Hallucinated claims dropped 52.5% in medicine/law/finance prompts, average answer length dropped 30.2%. Don’t confuse this with April 23’s GPT-5.5 flagship — they are different product lines. ofox support is rolling out.

What OpenAI shipped is a new default, not a new flagship

The GPT-5.5 family now has two distinct lines:

| Product | Released | Positioning | API model | ofox status |
|---|---|---|---|---|
| GPT-5.5 / GPT-5.5 Pro | Apr 23 | Thinking/agent flagship, 1M context | openai/gpt-5.5, openai/gpt-5.5-pro | Live |
| GPT-5.5 Instant | May 5 | ChatGPT default conversational model | chat-latest (alias) | Coming soon |

The flagship line we covered last month: fully retrained base, 82.7% on Terminal-Bench 2.0, pricing doubled to $5/$30. Today’s release is the other line: Instant. OpenAI’s own framing — “the daily driver for hundreds of millions of people” — tells you the audience. This generation isn’t built to run agents; it’s built to handle billions of casual chats.

The product logic is clean: flagship competes on benchmarks, Instant competes on retention. The marginal returns on Instant are enormous — when hundreds of millions of users see 30% shorter answers, that compounds to staggering aggregate attention savings. Halving hallucinations and trimming filler matters more than another two points on SWE-Bench for the median user.

The numbers, all in one place

| Metric | Improvement | Baseline |
|---|---|---|
| High-stakes hallucinated claims (medicine/law/finance) | ↓ 52.5% | GPT-5.3 Instant |
| Inaccurate claims on user-flagged conversations | ↓ 37.3% | GPT-5.3 Instant |
| Words per response | ↓ 30.2% | GPT-5.3 Instant |
| Lines per response | ↓ 29.2% | GPT-5.3 Instant |

Source: OpenAI internal evaluation (official announcement).

A few things to note about how to read these numbers:

  1. The baseline is GPT-5.3 Instant — not other vendors. This is intra-product-line generational improvement. It does not say anything about Claude Haiku 4 or Gemini 3.1 Flash.
  2. 52.5% is specifically on medicine, law, and finance. Those three domains share a key property: questions tend to have ground-truth answers that users can’t immediately fact-check. Halving hallucination rate in that regime is a real win for anyone shipping chat into customer support, telehealth triage, or compliance Q&A surfaces.
  3. 37.3% is on “user-flagged” conversations. OpenAI built an eval set from chats users actually complained about as factually wrong, then re-ran them through 5.5 Instant. Cutting another third off that distribution is closer to real-world relevant than synthetic benchmark deltas.

The shorter-answer numbers deserve their own beat. OpenAI’s own example — “how do I tell my coworker to stop yapping” — has 5.3 Instant returning five tactics with a “what not to do” appendix, and 5.5 Instant returning five tighter scripts you can paste into a Slack DM and ship. Same problem, 30.2% fewer words, no “what not to do” filler. The shorter answer was the better answer.

What 5.5 Instant actually cares about

OpenAI’s pitch is built around three concrete capabilities, not “smarter”:

1. Factuality with self-correction

The math example is the most interesting one. The user asks the model to check their work on √(x+7) = x−1. GPT-5.3 Instant initially endorses the user’s (wrong) answer, then notices on substitution that x=3 fails, and concludes “no real solution.”

GPT-5.5 Instant also catches that x=3 fails — but instead of stopping there, it traces back to the original algebra and finds the user’s earlier arithmetic mistake: they had x²-x+1 instead of x²-2x+1 after squaring. It re-derives the corrected quadratic and lands on x = (3+√33)/2.
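The equation itself comes from OpenAI’s example; the worked steps below are reconstructed here to show why (3+√33)/2 is the answer the model should land on:

```latex
% Squaring both sides of \sqrt{x+7} = x - 1 correctly:
x + 7 = (x-1)^2 = x^2 - 2x + 1
\implies x^2 - 3x - 6 = 0
\implies x = \frac{3 \pm \sqrt{9 + 24}}{2} = \frac{3 \pm \sqrt{33}}{2}
```

The minus root (≈ −1.37) makes the right-hand side x−1 negative while the square root is nonnegative, so it is extraneous; the plus root (≈ 4.37) checks out on substitution. The user’s slip — x²−x+1 instead of x²−2x+1 — changes the quadratic and loses this solution entirely.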

The capability isn’t “got the right answer.” It’s the model’s willingness to revisit its own prior endorsement when downstream evidence contradicts it. For workflows like “have AI double-check my reasoning” or “have AI review this PR,” that recovery loop is qualitatively different from “model that gives a confident wrong answer faster.”

2. Concision without losing substance

OpenAI describes 5.5 Instant as “removing redundancy, asking fewer follow-up questions, avoiding gratuitous emoji and overformatting.” The 30.2% / 29.2% drops in words and lines play out as fewer bullet points, fewer “what not to do” appendices, less recapping of what the user just said.

Short ≠ shallow. In the coworker example, what 5.5 Instant cut was the “what not to do” section — and that section in 5.3 didn’t add new information; it just inverted the same advice. Removing it made the answer better.

For developers: the same prompt now likely consumes fewer output tokens. If you’re paying per-token for a chat assistant or support bot, this is one of those rare upgrades where capability goes up and unit cost goes down in the same release.
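The cost claim is easy to sanity-check with back-of-envelope arithmetic. Every number in this sketch except the 30.2% reduction is an illustrative placeholder (words aren’t tokens one-to-one, and Instant’s price isn’t published yet), so treat it as a template, not a forecast:

```python
# Back-of-envelope: output-token savings from ~30.2% shorter answers.
# All constants below are hypothetical placeholders, not real prices.
PRICE_PER_M_OUTPUT_TOKENS = 10.00   # hypothetical $/1M output tokens
AVG_OUTPUT_TOKENS_BEFORE = 450      # hypothetical mean answer length
REQUESTS_PER_MONTH = 2_000_000

REDUCTION = 0.302  # OpenAI's reported drop in words per response

tokens_before = AVG_OUTPUT_TOKENS_BEFORE * REQUESTS_PER_MONTH
tokens_after = tokens_before * (1 - REDUCTION)

cost = lambda toks: toks / 1_000_000 * PRICE_PER_M_OUTPUT_TOKENS
savings = cost(tokens_before) - cost(tokens_after)
print(f"monthly output-token savings: ${savings:,.2f}")
```

Swap in your own traffic and the real per-token price once it’s published; the structure of the calculation is the point.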

3. Personalization with memory sources

5.5 Instant is more aggressive about using past chats, files, and (if connected) Gmail as conversational context. OpenAI is shipping it alongside memory sources — a new UI affordance that surfaces what memories or past chats were used for a given response, with one-click deletion.

It reads like privacy plumbing but it’s really a UX trade: to justify pulling more context into more responses, OpenAI needs the user to be able to see and unwind it. Plus and Pro web users get it first, mobile follows, then the Free / Go / Business / Enterprise rollout.

API surface: what is chat-latest?

From the announcement:

rolling out… in the API as chat-latest

chat-latest is OpenAI’s alias for “whatever model currently powers ChatGPT’s default response.” It is a moving pointer, not a fixed ID:

  • A week ago: → gpt-5.3-instant
  • Today: → gpt-5.5-instant
  • In six months: → whatever ships next

The engineering trade-off is straightforward:

| Use case | Recommendation |
|---|---|
| Want to “track ChatGPT default” without manual upgrades | Use chat-latest |
| Need reproducibility, regression tests, behavioral pinning | Use explicit gpt-5.5-instant |
| Have SLA, compliance, or auditability requirements | Pin the version, log the model ID per request |

One real risk before you wire chat-latest into production: model behavior will change with no deprecation window. If your prompt engineering, parsing logic, or downstream UI assumes a specific output shape, version-pin and upgrade deliberately.
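One way to implement pin-and-log is a thin wrapper that records the model that actually served each request. This is a sketch against an OpenAI-style chat client; the `PinnedChat` wrapper and the stub client are hypothetical, and the model ID is the one named in the announcement:

```python
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chat")

@dataclass
class PinnedChat:
    """Wraps any OpenAI-style client, pins the model, logs it per request."""
    client: object
    model: str = "gpt-5.5-instant"   # explicit pin, not the chat-latest alias
    calls: list = field(default_factory=list)

    def ask(self, messages):
        resp = self.client.chat.completions.create(
            model=self.model, messages=messages
        )
        # Responses typically echo the concrete model that served the request;
        # logging it gives an audit trail even if you later move to an alias.
        served = getattr(resp, "model", self.model)
        log.info("request served by model=%s", served)
        self.calls.append(served)
        return resp

# Stub client so the sketch runs without network access or an API key.
class _StubResp:
    model = "gpt-5.5-instant"
class _StubCompletions:
    def create(self, model, messages):
        return _StubResp()
class _StubChat:
    completions = _StubCompletions()
class _StubClient:
    chat = _StubChat()

chat = PinnedChat(client=_StubClient())
chat.ask([{"role": "user", "content": "hello"}])
```

Swapping the stub for a real client leaves the logging and pinning logic unchanged, which is the property you want for an SLA or compliance trail.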

Where ofox stands today

The current ofox catalog (per ofox.ai/models) for the OpenAI lineup:

  • GPT-5.5, GPT-5.5 Pro (April 23 flagship — live)
  • GPT-5.4, GPT-5.4 Pro / Mini / Nano
  • GPT-5.3 Chat, GPT-5.3 Codex

GPT-5.5 Instant (chat-latest) is rolling out soon. While you wait, the closest stand-ins:

  • GPT-5.4 Mini — low latency, cheap, closest profile to “casual conversational” workloads
  • GPT-5.5 Thinking — much stronger but ~6× the cost and slower; not a drop-in for the Instant slot
  • GPT-5.3 Chat — same generation as 5.3 Instant; useful baseline if you want to A/B against 5.5 once it lands

We’ll update this post when Instant goes live on ofox — at that point you swap the model ID, the base URL stays the same.

Should you migrate now?

Unlike flagship upgrades, the calculus on Instant is mostly one-directional. The decision usually fits one of three buckets:

Migrate now

  • Customer support, FAQ, assistant products with mainstream user-facing chat — shorter answers + lower hallucinations is a strict improvement
  • Education / tutoring apps where the self-correction behavior maps directly to student experience
  • Anything currently on 5.3 Instant where you’ve heard “it’s too verbose”

A/B first, then migrate

  • Production chat in medicine / law / finance — the 52.5% number is OpenAI’s, with no third-party replication yet; run your own 100-prompt A/B
  • Workflows where downstream parsers depend on output shape (strict JSON, fixed bullet counts) — short-by-default may break assumptions
  • Compliance-controlled deployments where model upgrades require change management — three months of dual availability buys you the runway
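For the own-A/B recommendation above, the comparison itself is simple arithmetic once each answer is graded for hallucinated claims. Grading is the hard part and is stubbed here as pre-labeled booleans; the function and sample counts are illustrative:

```python
def relative_reduction(baseline_flags, candidate_flags):
    """Relative drop in hallucination rate between two graded prompt sets.

    Each list holds one bool per prompt: True = the answer contained at
    least one hallucinated claim. Returns e.g. 0.525 for a 52.5% drop.
    """
    base_rate = sum(baseline_flags) / len(baseline_flags)
    cand_rate = sum(candidate_flags) / len(candidate_flags)
    if base_rate == 0:
        return 0.0
    return (base_rate - cand_rate) / base_rate

# Illustrative: 100 prompts, 40 flagged on 5.3 Instant, 19 on 5.5 Instant.
baseline = [True] * 40 + [False] * 60
candidate = [True] * 19 + [False] * 81
print(f"relative reduction: {relative_reduction(baseline, candidate):.1%}")
```

With 100 prompts per arm the confidence interval on this number is wide, so treat the result as directional rather than a precise replication of OpenAI’s figure.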

Hold for now

  • Production regression suites pinned to 5.3 Instant with no time to re-run — you have three months
  • Long-context heavy workloads — Instant has never been the long-context line; if you need 1M context, you want flagship

An underrated detail: smarter web search routing

Buried in the announcement is “better at deciding when to use web search.” That matters more than it looks for RAG-style applications.

Earlier ChatGPT generations leaned on “search-everything” defaults — which retrieved results of mixed quality and often lowered answer accuracy by anchoring on noisy sources. 5.5 Instant is choosier about when to fire a search vs answer from internal knowledge. If you’ve wired ChatGPT into your own retrieval pipeline, expect fewer redundant searches, lower latency, and lower retrieval cost on the same workload.

If your product is “AI + search,” this upgrade may quietly be worth more than the headline numbers.

Quick clarifications worth filing away

  • Is the hallucination evaluation English-only or multilingual? OpenAI didn’t say. Internal evals are typically English-leaning; expect smaller gains in non-English locales.
  • Instant or Thinking? Default to Instant. Switch to Thinking (gpt-5.5) for agent workflows, long reasoning chains, and high-stakes math. They are not substitutes — they’re a division of labor.
  • What’s the price of chat-latest? OpenAI didn’t publish it in the announcement. Instant has historically been priced well below Thinking; ofox will sync the number on launch.
  • Are memory sources available via the API? No. Memory sources is a ChatGPT product feature; raw API calls don’t see ChatGPT’s memory layer. If you need “remembers user preferences,” store the context yourself.
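The last point — store the context yourself — can be as simple as prepending saved preferences as a system message on each API call. A minimal in-memory sketch (a real app would persist this per user; all names here are hypothetical):

```python
class PreferenceStore:
    """Minimal per-user preference memory for API-side 'personalization'."""

    def __init__(self):
        self._prefs = {}  # user_id -> {key: value}

    def remember(self, user_id, key, value):
        self._prefs.setdefault(user_id, {})[key] = value

    def system_message(self, user_id):
        prefs = self._prefs.get(user_id, {})
        if not prefs:
            return None
        facts = "; ".join(f"{k}: {v}" for k, v in prefs.items())
        return {"role": "system", "content": f"Known user preferences: {facts}"}


def build_messages(store, user_id, user_text):
    """Assemble the messages list, injecting stored preferences if any."""
    messages = []
    sys_msg = store.system_message(user_id)
    if sys_msg:
        messages.append(sys_msg)
    messages.append({"role": "user", "content": user_text})
    return messages


store = PreferenceStore()
store.remember("u1", "tone", "concise")
msgs = build_messages(store, "u1", "Summarize this doc")
```

The resulting `msgs` list drops straight into any OpenAI-style chat call, which is the whole trick: the “memory” lives in your store, not in the model.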

OpenAI just took the model that hundreds of millions of people interact with daily and cut its hallucination rate in half (in the domains where it matters most) while also making responses 30% shorter. That’s a lot more user-impact than another three points at the top of a leaderboard. GPT-5.5 Instant isn’t a new flagship — but it’s the first 5.x release where the optimization target is explicitly the user, not the benchmark. Worth half a day to wire in and test.