MiniMax M3 vs Claude Opus 4.8: 59% vs 69% SWE-Bench, 10× Pricing, Pick (2026)
MiniMax M3 just landed at 59% on SWE-Bench Pro for one-tenth the price of Claude Opus 4.8 — but the headline that says “M3 beats GPT-5.5” quietly compares it to Anthropic’s old flagship.
30-Second Verdict
| Question | Answer |
|---|---|
| Higher SWE-Bench Pro score? | Claude Opus 4.8 (69.2% vs M3’s 59.0%) |
| Cheaper per token? | MiniMax M3 (about 8× less on input, about 10× less on output) |
| Bigger context window? | Tied — both 1M tokens |
| Open-weight available today? | Neither, in practice (M3 weights delayed past the promised 10-day window) |
| Best for routine coding agents? | M3: the quality gap matters less once you account for $/task |
| Best for hard multi-file diffs and audit work? | Opus 4.8 — the ~10-point benchmark gap is real |
Verdict: if your workload is price-sensitive agent runs, pick MiniMax M3 via minimax/minimax-m3 on ofox. If your workload is hard reasoning over multi-file PRs, pick anthropic/claude-opus-4.8. The clean way to find out is to swap a string and run both on the same prompt — code at the end of this post.
TL;DR: Which One Should You Pick?
A one-line decision table for the four scenarios that cover ~90% of real coding work:
| Scenario | Pick | Why |
|---|---|---|
| Lint-fix loops, formatter agents, low-stakes refactors | MiniMax M3 | 10× cheaper per run; quality difference invisible on simple diffs |
| Agentic IDE plugins (Cursor, Windsurf, Cline) | MiniMax M3 by default, Opus 4.8 for “explain this bug” | M3 handles tool-loop volume; Opus handles the few prompts that need reasoning |
| Multi-file refactor where a wrong patch costs a debugging hour | Claude Opus 4.8 | 10-point SWE-Bench gap = noticeably fewer broken diffs on hard repos |
| 1M-context whole-repo grep+patch | Test both | MSA is faster at long context; Opus is more accurate. A/B on your actual repo |
The trap is treating this as one decision. Most teams want both models available, routed by task — and that’s exactly what ofox’s same-base_url swap is built for. We’ll show the routing pattern in the Try Both via ofox section.
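As a rough illustration of what routing by task can look like, here is a minimal Python sketch. The task labels and the `pick_model` helper are hypothetical names made up for this example, not part of ofox or either vendor’s API; only the two model IDs come from this article.

```python
M3 = "minimax/minimax-m3"
OPUS = "anthropic/claude-opus-4.8"

# Hypothetical task labels, loosely following the decision table above.
ROUTES = {
    "lint_fix": M3,
    "format": M3,
    "low_stakes_refactor": M3,
    "ide_tool_loop": M3,
    "explain_bug": OPUS,
    "multi_file_refactor": OPUS,
}


def pick_model(task_type: str, default: str = M3) -> str:
    """Return the model ID for a task label, falling back to the cheap default."""
    return ROUTES.get(task_type, default)


print(pick_model("lint_fix"))
print(pick_model("multi_file_refactor"))
```

Keeping the mapping in one dictionary means a new challenger model is a one-line change, which is the property the rest of this post relies on.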
Quick Specs Comparison
All prices verified from the ofox catalog on 2026-06-13. Context and output limits from vendor docs.
| Spec | MiniMax M3 | Claude Opus 4.8 |
|---|---|---|
| Model ID on ofox | minimax/minimax-m3 | anthropic/claude-opus-4.8 |
| Input price | $0.60/M tokens | $5.00/M tokens |
| Output price | $2.40/M tokens | $25.00/M tokens |
| Cached input price | $0.12/M tokens | $0.50/M tokens |
| Context window | 1M tokens | 1M tokens |
| Max output | 131K tokens | 128K tokens (Simon Willison’s review, 2026-05-28) |
| Modalities (input) | Text + image + video | Text + image |
| Vendor SWE-Bench Pro | 59.0% | 69.2% |
| Released | 2026-06-01 | 2026-05-28 |
| Open weight? | Promised, weights delayed | No (closed) |
| Architecture | MiniMax Sparse Attention (MSA) | Dense transformer (Anthropic) |
Two specs worth pausing on:
Input price ratio is 8.3×; output ratio is 10.4×. A typical coding agent emits 0.2–0.5 output tokens per input token, so the effective ratio sits between 9× and 10× depending on workload. Round to 10× for back-of-envelope.
Max output is effectively a tie. M3 ships 131K, Opus 4.8 ships 128K — the 3K gap doesn’t change the operational shape. Both can emit a small file or a dozen unit tests in one call, and both will need chained calls past roughly 130K. If you’re picking on output-cap headroom, this is a wash; pick on price or quality instead.
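To make the blended price ratio concrete, here is a small sketch of the arithmetic in Python. It only uses the per-million-token prices from the spec table; `blended_ratio` is a helper name invented for this example.

```python
# USD per million tokens, from the spec table above.
M3 = {"input": 0.60, "output": 2.40}
OPUS = {"input": 5.00, "output": 25.00}


def blended_ratio(output_per_input: float) -> float:
    """Opus-to-M3 cost ratio for a workload that emits `output_per_input`
    output tokens for every input token."""
    opus = OPUS["input"] + output_per_input * OPUS["output"]
    m3 = M3["input"] + output_per_input * M3["output"]
    return opus / m3


for r in (0.0, 0.2, 0.5, 1.0):
    print(f"output/input = {r:.1f} -> {blended_ratio(r):.2f}x")
```

The ratio starts at the pure input ratio and climbs toward the output ratio as the workload becomes more output-heavy, which is why the 0.2 to 0.5 band lands between 9× and 10×.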
SWE-Bench Pro: The Number That Started the Story
SWE-Bench Pro is the hardest variant of the SWE-bench family — problems from actively-maintained repositories, multi-file diffs, no public ground-truth leakage. It’s the closest thing the field has to a coding benchmark that resists memorization.
Here’s where the frontier models sat in early June 2026:
| Model | SWE-Bench Pro | Released | Notes |
|---|---|---|---|
| Claude Opus 4.8 | 69.2% | 2026-05-28 | Anthropic-run, official |
| Claude Opus 4.7 | 64.3% | 2026-04 | What MiniMax compared M3 against |
| MiniMax M3 | 59.0% | 2026-06-01 | Vendor-run on own infra, Claude Code scaffolding |
| GPT-5.5 | 58.6% | 2026-04-23 | OpenAI-run |
| Gemini 3.1 Pro | < 58.6% | 2026 | Below GPT-5.5 per public leaderboards |
The first sentence of MiniMax’s June 1 launch announcement reads, essentially: “M3 beats GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro at one-tenth the cost.” That’s correct as printed. What’s left out: Anthropic had shipped Opus 4.8 four days earlier with a 69.2% score, and the MiniMax deck compared M3 against the older Opus 4.7 at 64.3%.
Independent verification status is the other footnote. MiniMax ran the eval on its own infrastructure, using Claude Code as the agentic scaffolding, with evaluation logic aligned to the official methodology. The official SWE-Bench Pro leaderboard had not added M3 as of this writing. Treat the 59.0% as a directional signal — it might land at 56% or 61% on a clean third-party run, and either still leaves the same shape: M3 is in the same league as GPT-5.5, one tier below Opus 4.8.
The honest one-line read: the M3 number is real, the marketing framing is selective.
Terminal-Bench 2.1 and Multimodal: Where M3 Closes the Gap
SWE-Bench Pro is one signal. On Terminal-Bench 2.1, which measures long-horizon terminal execution (the kind of thing a coding agent does when you ask it to “set up the dev environment and run the failing test”), MiniMax reports M3 at 66.0%. That’s in the same range as Opus 4.8 per Anthropic’s release notes, and notably ahead of GPT-5.5. A plausible reason: MSA’s decoding speed at long context makes long tool-use loops cheaper to retry, so the agent can recover from more failures within a budget.
Native multimodality is the other pitch. M3 accepts image and video input. Opus 4.8 accepts image input but not video. In practical coding terms this matters for two things: pasting a screenshot of a stack trace, and feeding a short screencast of a UI bug. Both models handle the screenshot case; only M3 handles the screencast.
For 95% of coding work neither of these tips the decision — you’re staring at text. They become decisive once you start building agents that actually look at the browser.
Pricing Math: What 1M Tokens Actually Cost
Vendor benchmarks are run on perfect infrastructure. Your bill is run on production traffic. Here are four realistic shapes:
| Workload shape | Tokens | MiniMax M3 cost | Claude Opus 4.8 cost | Multiplier |
|---|---|---|---|---|
| Routine refactor agent (1M in + 200K out) | 1.2M total | $1.08 | $10.00 | 9.3× |
| Heavy code generation (500K in + 500K out) | 1M total | $1.50 | $15.00 | 10.0× |
| Whole-repo grep + patch (1M in + 50K out) | 1.05M total | $0.72 | $6.25 | 8.7× |
| Long-context audit with cache hit (1M cached + 50K out) | 1.05M total | $0.24 | $1.75 | 7.3× |
Numbers use ofox’s published rates verified on 2026-06-13: M3 $0.60/M input / $2.40/M output / $0.12/M cached; Opus 4.8 $5/M input / $25/M output / $0.50/M cached. Each cost is unit price × token count with no rounding; multipliers are rounded to one decimal place.
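The table rows can be reproduced with a few lines of Python. This is a sketch under the rates quoted above; `run_cost` and the workload labels are names chosen for this example, and the cached row assumes every cached token is billed at the cached-read rate.

```python
# USD per million tokens, as quoted in this article (ofox rates, 2026-06-13).
PRICES = {
    "minimax/minimax-m3": {"input": 0.60, "output": 2.40, "cached": 0.12},
    "anthropic/claude-opus-4.8": {"input": 5.00, "output": 25.00, "cached": 0.50},
}


def run_cost(model, input_tokens=0, output_tokens=0, cached_tokens=0):
    """Cost in USD of one run: unit price times token count, per token class."""
    p = PRICES[model]
    total = (
        input_tokens * p["input"]
        + output_tokens * p["output"]
        + cached_tokens * p["cached"]
    )
    return total / 1_000_000


WORKLOADS = {
    "Routine refactor agent": dict(input_tokens=1_000_000, output_tokens=200_000),
    "Heavy code generation": dict(input_tokens=500_000, output_tokens=500_000),
    "Whole-repo grep + patch": dict(input_tokens=1_000_000, output_tokens=50_000),
    "Long-context audit, cache hit": dict(cached_tokens=1_000_000, output_tokens=50_000),
}

for name, tokens in WORKLOADS.items():
    m3 = run_cost("minimax/minimax-m3", **tokens)
    opus = run_cost("anthropic/claude-opus-4.8", **tokens)
    print(f"{name}: M3 ${m3:.2f}, Opus ${opus:.2f}, {opus / m3:.1f}x")
```

Swap in your own token counts per run to see where your workload sits between the 7× and 10× ends of the range.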
The picture changes when you scale to a team. Pick a representative profile — five developers, 100 coding-agent runs per day each, 500K input and 100K output per run, 22 working days per month:
- M3 per run: $0.30 + $0.24 = $0.54. Monthly: 5 × 100 × 22 × $0.54 = $5,940.
- Opus 4.8 per run: $2.50 + $2.50 = $5.00. Monthly: 5 × 100 × 22 × $5.00 = $55,000.
A five-person engineering org running Opus on default routing burns through a small mortgage every month. The same team on M3-default routing with Opus called only for hard problems (say 10% of runs) pays roughly $11K instead. The price-performance argument for M3 isn’t “cheap is fine”; it’s that you can spend the saved $44K on running Opus more on the prompts that actually need it.
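The team numbers follow from one formula, sketched here in Python. The per-run costs are the $0.54 and $5.00 figures derived above; `monthly_cost` and its parameters are illustrative names, and the 10% Opus share is the same assumption used in the paragraph.

```python
M3_PER_RUN = 0.54    # USD, 500K input + 100K output on M3
OPUS_PER_RUN = 5.00  # USD, same run shape on Opus 4.8


def monthly_cost(devs, runs_per_dev_per_day, working_days, opus_share):
    """Monthly bill when `opus_share` of runs go to Opus and the rest to M3."""
    runs = devs * runs_per_dev_per_day * working_days
    per_run = (1 - opus_share) * M3_PER_RUN + opus_share * OPUS_PER_RUN
    return runs * per_run


for share in (0.0, 0.1, 1.0):
    print(f"Opus share {share:.0%}: ${monthly_cost(5, 100, 22, share):,.0f}")
```

Sliding `opus_share` between 0 and 1 shows how quickly the bill moves, which makes it easier to set a routing budget before arguing about model quality.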
The “Open-Weight” Caveat: Where Are the Weights?
MiniMax’s June 1 announcement positioned M3 as “the first and only open-weight model” combining frontier coding, 1M context, and native multimodality. The weights and technical report were scheduled for Hugging Face and GitHub “within roughly 10 days” of launch — call it the June 10–11 window.
As of June 13, 2026, the MiniMax-M3 GitHub repo still notes: “this model is not yet released — this repository exists so the community can share what they need next.” The API is live and you can call M3 via providers including ofox, but you cannot self-host it today. The repo has been frozen on a placeholder for almost two weeks.
This is not a fatal point — vendors slip weight releases all the time, and “10 days” was a soft window, not a contract. But it changes the practical comparison. If you picked M3 specifically because the weights would land in your private cluster within two weeks, that bet has not paid off yet. For now, both MiniMax M3 and Claude Opus 4.8 are API-only from a deployment perspective; the open-weight axis isn’t decisive in June 2026.
When the weights do ship, the math changes again. A self-hosted M3 cluster amortizes against your GPU lease, not per-token pricing — for sustained 24/7 workloads that’s a fundamentally different cost curve from per-token Opus. But that’s a story for the article we’ll write the day the weights actually appear on Hugging Face.
When to Pick MiniMax M3
Pick minimax/minimax-m3 if any of the following is true:
- You’re running coding agents at volume. Lint-fixer bots, formatter loops, codemod agents, “write the docstring” pipelines. These are dominated by token cost, not per-prompt quality, and M3’s 10× pricing edge dwarfs the ~10-point quality gap.
- You’re paying for long-context input. Whole-repo prompts (1M tokens of code in, small diff out) are where MSA’s decoding speed and M3’s input pricing compound. A million cached tokens on M3 costs $0.12 versus $0.50 on Opus.
- Video input is a hard requirement. Opus 4.8 accepts images but not video. If your agent needs to look at a 30-second screen recording of a UI bug, you have one option in this comparison.
- You’re hedging against the Opus 4.8 price tier. Even teams that prefer Opus 4.8 for primary work route routine prompts to a cheaper model. M3 is currently the strongest coding option under $1/M input that also offers a 1M context window.
- You’ll switch if and when independent SWE-Bench Pro reruns come in lower. Treat the 59% as provisional. Build your stack so swapping minimax/minimax-m3 for the next cheap challenger is one config change away.
When to Pick Claude Opus 4.8
Pick anthropic/claude-opus-4.8 if any of the following is true:
- A wrong patch costs more than a token bill. Production hotfixes, security-sensitive refactors, anything where you’d review the diff yourself before merging anyway. The ~10-point SWE-Bench Pro gap is concentrated on the hardest problems, not the median ones.
- You’re building reasoning-heavy agents. “Read this incident postmortem and propose three fixes.” “Audit this OAuth flow and find the bug.” Opus 4.8’s reasoning gains over 4.7 are tangible per Anthropic’s release notes and per independent reviews like Simon Willison’s.
- You’re already in the Anthropic ecosystem. Claude Code, Anthropic’s MCP tooling, dynamic workflows: all of these assume Anthropic-style tool semantics. M3 works with Claude Code (MiniMax themselves used it as scaffolding) but you’ll hit edge cases on tool format expectations.
- The “Fast mode” cost tier suits your shape. Opus 4.8 introduced a $10/M input / $50/M output Fast mode tier for latency-sensitive use cases. It’s more expensive than the regular tier but less than calling Opus 4.7 and waiting longer. The relevant comparison there isn’t M3; it’s Opus 4.8 standard vs Fast mode within Anthropic’s own lineup, covered in our Claude Opus 4.8 release review.
- Your eval harness already calibrates against Opus. If your team has a “would the senior reviewer accept this PR” eval suite that’s been tuned against Opus outputs, switching models invalidates your eval until you re-baseline. That’s real engineering cost, not vibes.
When NOT to Pick Either (and What to Use Instead)
A few scenarios where this whole comparison is the wrong question:
- Sub-$0.10/M-token budget, simple refactors. Look at smaller models like Claude Haiku 4 or GPT-5.4 Mini covered in our budget coding model comparison. Spending $0.60/M on M3 when GPT-5.4 Mini at $0.10/M would do the same lint-fix is theater.
- You need on-prem deployment today. Both M3 (weights not shipped) and Opus 4.8 (closed) are API-only. Self-host options for frontier coding today are Qwen 3.7 Max and the open Chinese model lineup; see our Qwen vs DeepSeek coding comparison.
- You’re optimizing for a strict latency SLA, not cost. Both M3 and Opus 4.8 are designed for quality, not p50 latency. Smaller faster models will beat both on TTFT.
- You need to evaluate multiple frontier models at once. Build a comparison harness instead of picking one. The agentic coding model showdown walks through the harness pattern.
Try Both via ofox: A/B in 10 Lines of Code
The whole comparison reduces to a one-string change if you call both models through ofox’s OpenAI-compatible endpoint. Same base_url, same SDK, just swap the model argument.
Python — A/B both models in one loop
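```python
import os

from openai import OpenAI

# ofox exposes an OpenAI-compatible endpoint; the key is read from the environment.
client = OpenAI(api_key=os.environ["OFOX_API_KEY"], base_url="https://api.ofox.ai/v1")

PROMPT = "Refactor this function to remove duplication: ..."

# Send the same prompt to both models; only the model string changes.
for model in ["minimax/minimax-m3", "anthropic/claude-opus-4.8"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Token usage plus the first 120 characters of the reply.
    print(model, resp.usage.total_tokens, resp.choices[0].message.content[:120])
```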
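Run this and you get per-model token usage and the first 120 characters of each output for eyeball comparison. For a per-run cost on a real prompt rather than a vendor benchmark, read resp.usage.prompt_tokens and resp.usage.completion_tokens and price them separately with the rates in the pricing math table above; total_tokens alone is not enough, because input and output tokens are billed at different rates.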
Node — same shape
```javascript
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OFOX_API_KEY, baseURL: "https://api.ofox.ai/v1" });
const prompt = "Refactor this function to remove duplication: ...";

// Top-level await requires an ES module (for example, a .mjs file).
for (const model of ["minimax/minimax-m3", "anthropic/claude-opus-4.8"]) {
  const r = await client.chat.completions.create({ model, messages: [{ role: "user", content: prompt }] });
  console.log(model, r.usage.total_tokens, r.choices[0].message.content.slice(0, 120));
}
```
Identical shape, identical endpoint, identical SDK call. The migration cost between models is one string. That’s the only reason this comparison is answerable in 10 lines instead of a week of vendor onboarding.
For a multi-turn agent loop that includes tool calls, the same swap works — both models accept OpenAI-style tools arrays via ofox. You’ll want to test the tool-call format on your specific tools because each provider’s strict mode handling diverges at the edges, but the contract is the same.
Compatibility Quirks: What Differs Between the Two APIs
Same endpoint, same SDK call — but a few sharp edges worth knowing before you wire either model into production.
System prompt handling. Claude Opus 4.8 treats the system role as a strict system prompt with elevated trust. MiniMax M3 (via the OpenAI-compatible path) folds system into the conversation more loosely. If your agent depends on system-prompt-only constraints — “never call this tool unless asked,” “always respond in JSON” — M3 follows them most of the time but is statistically more likely to drift on long tool loops. Workaround: repeat critical constraints in the first user message.
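The workaround can be as small as a message builder. The sketch below is one way to do it, assuming OpenAI-style chat messages; `build_messages` and the wording of the reminder are made up for this example.

```python
def build_messages(system_prompt, critical_constraints, user_prompt):
    """Keep the system prompt, and restate the critical constraints at the
    top of the first user message so they also appear in the conversation."""
    reminder = "\n".join(f"- {c}" for c in critical_constraints)
    first_user = (
        "Constraints that apply to this whole session:\n"
        f"{reminder}\n\n"
        f"{user_prompt}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": first_user},
    ]


messages = build_messages(
    system_prompt="You are a refactoring agent. Always respond in JSON.",
    critical_constraints=["Always respond in JSON.", "Never call a tool unless asked."],
    user_prompt="Refactor this function to remove duplication: ...",
)
print(messages[1]["content"])
```

The same message list can be sent to either model, so the duplication costs a few input tokens on Opus and buys some insurance on M3.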
Tool-call format strictness. Opus 4.8 enforces tool argument schemas hard — it will refuse to call a tool if your parameters JSON Schema marks a field required and the model can’t fill it. M3 is more lenient and will sometimes emit a tool call with a placeholder string. If your tool layer treats placeholders as valid, you’ll silently execute wrong actions; if it validates strictly, you’ll see more retry loops. The fix is the same either way: validate tool arguments on the server side, not just at the model.
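Here is a minimal sketch of server-side argument validation in Python, using only the standard library. It assumes OpenAI-style tool calls, where arguments arrive as a JSON string and the tool’s `parameters` field is a JSON Schema object. It checks only required fields, basic types, and a hypothetical list of placeholder strings, so it is not a full JSON Schema validator.

```python
import json

# Hypothetical placeholder strings; extend with whatever your own logs show.
PLACEHOLDERS = {"", "...", "todo", "tbd", "<placeholder>"}

# Minimal mapping from JSON Schema type names to Python types.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_tool_args(raw_arguments: str, parameters: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may run."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = [
        f"missing required field: {name}"
        for name in parameters.get("required", [])
        if name not in args
    ]
    properties = parameters.get("properties", {})
    for name, value in args.items():
        type_name = properties.get(name, {}).get("type")
        expected = JSON_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if expected and (not isinstance(value, expected) or bool_as_number):
            problems.append(f"wrong type for field: {name}")
        elif isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
            problems.append(f"placeholder value for field: {name}")
    return problems
```

Run it on every tool call before execution and feed the returned problems back to the model as the tool result; that turns a silent wrong action into an explicit retry for both models.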
Caching semantics. Both models offer cached input pricing, but Anthropic splits the bill into write and read. On Opus 4.8 you pay a one-time cache write at $6.25/M (5-minute TTL) or $10/M (1-hour TTL), then every subsequent cache read lands at $0.50/M, which is the number in the spec table above. M3’s cache on ofox is a single $0.12/M read rate with implicit TTL and no separate write surcharge. For workloads that hit the same long-context prompt many times per minute (like a code review agent with a static repo prompt), M3 is dramatically cheaper at the cache read layer. For workloads where the cached portion stays warm for hours and write costs amortize across many reads, Opus 4.8’s 1-hour tier narrows the gap: cached reads cost roughly 4× M3’s rate ($0.50/M vs $0.12/M), compared with 8–10× on uncached tokens.
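A quick way to see how write costs amortize is to total the input-side bill for a prompt that is reused several times. The Python sketch below uses the rates quoted in this section and makes two assumptions the article does not confirm: every request after the first lands inside the cache TTL, and M3’s first request is billed at its standard $0.60/M input rate. The function names are made up for this example.

```python
# USD per million tokens, from the rates quoted above.
OPUS_WRITE = {"5m": 6.25, "1h": 10.00}
OPUS_READ = 0.50
M3_INPUT = 0.60  # assumed price of the first, uncached M3 request
M3_READ = 0.12


def opus_prompt_cost(prompt_mtok, requests, ttl="5m"):
    """One cache write, then (requests - 1) cache reads, all within the TTL."""
    return prompt_mtok * (OPUS_WRITE[ttl] + OPUS_READ * (requests - 1))


def m3_prompt_cost(prompt_mtok, requests):
    """One standard-rate request, then (requests - 1) cached reads."""
    return prompt_mtok * (M3_INPUT + M3_READ * (requests - 1))


for n in (1, 10, 100):
    opus = opus_prompt_cost(1.0, n, ttl="1h")
    m3 = m3_prompt_cost(1.0, n)
    print(f"{n} requests: Opus ${opus:.2f}, M3 ${m3:.2f}, {opus / m3:.1f}x")
```

Output tokens are left out on purpose, since caching only affects the input side of the bill.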
Streaming chunk shape. Both models stream OpenAI-compatible chunks, but Opus 4.8 emits more granular delta.thinking events when extended thinking is enabled (covered in our Opus 4.8 release review). If your client parses thinking deltas separately from content deltas, that code works against Opus but no-ops against M3, which doesn’t currently expose thinking deltas through the OpenAI-compatible route. Not a bug — just an unused field.
Rate limits at the provider edge. When you call both models through ofox, you share one rate limit envelope keyed to your API key — not two separate per-vendor quotas. That’s the point of the gateway shape: M3 fallback when Opus is rate-limited, Opus fallback when M3 is, all without juggling two sets of credentials.
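The fallback idea can be sketched without tying it to a specific SDK. In the Python below, `call` is any function you supply that sends a prompt to a model, and `RateLimited` is a stand-in exception defined for this example; with the OpenAI Python SDK you would pass `openai.RateLimitError` as `retry_on` instead. How ofox reports rate limits per model is not covered in this article, so treat the error handling as an assumption.

```python
class RateLimited(Exception):
    """Stand-in for the SDK's rate-limit error in this sketch."""


def complete_with_fallback(call, prompt, models, retry_on=(RateLimited,)):
    """Try each model in order; move on only when a rate-limit error is raised.

    Returns (model_used, result). Any other exception propagates unchanged.
    """
    if not models:
        raise ValueError("models must not be empty")
    last_error = None
    for model in models:
        try:
            return model, call(model, prompt)
        except retry_on as exc:
            last_error = exc
    raise last_error


# Usage with a fake backend where Opus is currently rate-limited.
def fake_call(model, prompt):
    if model == "anthropic/claude-opus-4.8":
        raise RateLimited("429")
    return f"[{model}] patched"


used, result = complete_with_fallback(
    fake_call,
    "Refactor this function to remove duplication: ...",
    ["anthropic/claude-opus-4.8", "minimax/minimax-m3"],
)
print(used, result)
```

Ordering the list by preference keeps the policy readable: put Opus first for hard prompts and M3 first for routine ones.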
The whole MiniMax M3 vs Claude Opus 4.8 question collapses to one string swap on the same endpoint — which is the only sane way to pick a coding model in 2026.
Sources Checked for This Refresh
- Anthropic — Introducing Claude Opus 4.8 (verified 2026-06-13)
- MiniMax-M3 GitHub repository (weights status verified 2026-06-13)
- TestingCatalog on X — Opus 4.8 SWE-Bench Pro 69.2% vs 4.7 64.3%
- The Decoder — MiniMax M3: open-weight model with a million-token context challenges proprietary leaders
- Simon Willison — Claude Opus 4.8: a modest but tangible improvement
- ofox catalog snapshot for minimax/minimax-m3 and anthropic/claude-opus-4.8 (pricing verified 2026-06-13)


