Is MiniMax M3 actually open-weight in June 2026?

Not yet. MiniMax said weights would land on Hugging Face within ~10 days of the June 1 launch, but as of mid-June the GitHub repo still says the model is 'not yet released.' The API is live via providers like ofox, but you can't self-host until weights ship.

Why does MiniMax M3 score lower than Claude Opus 4.8 on SWE-Bench Pro?

M3's 59.0% trails Opus 4.8's 69.2% by about 10 points. The gap is widest on multi-file diffs from active repos, where Opus 4.8's reasoning gains over 4.7 (64.3%) push it ahead. M3's win is price-performance: roughly 10× less per token while sitting on par with GPT-5.5 (58.6%).

Did MiniMax cherry-pick Opus 4.7 instead of 4.8?

Effectively yes. MiniMax's launch deck pinned M3 against Opus 4.7. Anthropic had shipped Opus 4.8 four days earlier on May 28. Benchmarks freeze weeks before launch, so the version offset isn't dishonest, but it does make M3 look closer to the frontier than the 4.8 numbers actually permit.

Can MiniMax M3 handle 1M tokens of code in one prompt?

Yes — 1M context via MiniMax Sparse Attention (MSA), with up to 131K max output tokens. MSA gives roughly 15.6× faster decoding and 9.7× faster prefill at 1M vs the older M2 line, but throughput at full context still varies by workload. A/B against Opus 4.8 (also 1M ctx) on your real prompts before committing.

Which model is cheaper for production coding agents?

MiniMax M3, by a wide margin. On ofox: M3 is $0.6/M input and $2.4/M output; Opus 4.8 is $5/M and $25/M. A 1M-input + 200K-output agent run costs about $1.08 on M3 versus $10 on Opus 4.8 — a 9× per-run gap that compounds fast across a team.

Should I use MiniMax M3 for agentic workflows with Claude Code?

You can. MiniMax themselves ran SWE-Bench Pro with Claude Code as the scaffolding. Point Claude Code at ofox with model ID `minimax/minimax-m3` and it runs. Watch for minor tool-format quirks on long agent loops — most Claude Code tooling assumes Anthropic-style semantics, and not every provider maps cleanly.

What benchmarks favor MiniMax M3 over Claude Opus 4.8?

Cost-adjusted benchmarks and long-horizon tasks. MSA is specifically tuned for 1M-context decoding speed, so 35+ hour autonomous runs and whole-repo refactors should favor M3 on $/task. Opus 4.8 still wins raw quality per call on hard reasoning.

Is the SWE-Bench Pro score independently verified?

Not yet. MiniMax published 59.0% on its own infrastructure with Claude Code as scaffolding, and the official SWE-Bench Pro leaderboard hadn't added M3 as of mid-June 2026. Treat the number as a directional signal until third-party reruns land.

MiniMax M3 vs Claude Opus 4.8: 59% vs 69% SWE-Bench, 10× Pricing, Pick (2026)

MiniMax M3 just landed at 59% on SWE-Bench Pro for one-tenth the price of Claude Opus 4.8, but the launch deck behind the “M3 beats GPT-5.5” headline quietly benchmarks M3 against Anthropic’s previous flagship, Opus 4.7.

30-Second Verdict

Question	Answer
Higher SWE-Bench Pro score?	Claude Opus 4.8 (69.2% vs M3’s 59.0%)
Cheaper per token?	MiniMax M3 (~10× less on both input and output)
Bigger context window?	Tied — both 1M tokens
Open-weight available today?	Neither, in practice (M3 weights delayed past the promised 10-day window)
Best for routine coding agents?	M3: the quality gap closes once you account for $/task
Best for hard multi-file diffs and audit work?	Opus 4.8 — the ~10-point benchmark gap is real

Verdict: if your workload is price-sensitive agent runs, pick MiniMax M3 via minimax/minimax-m3 on ofox. If your workload is hard reasoning over multi-file PRs, pick anthropic/claude-opus-4.8. The clean way to find out is to swap a string and run both on the same prompt — code at the end of this post.

TL;DR: Which One Should You Pick?

A one-line decision table for the four scenarios that cover ~90% of real coding work:

Scenario	Pick	Why
Lint-fix loops, formatter agents, low-stakes refactors	MiniMax M3	10× cheaper per run; quality difference invisible on simple diffs
Agentic IDE plugins (Cursor, Windsurf, Cline)	MiniMax M3 by default, Opus 4.8 for “explain this bug”	M3 handles tool-loop volume; Opus handles the few prompts that need reasoning
Multi-file refactor where a wrong patch costs a debugging hour	Claude Opus 4.8	10-point SWE-Bench gap = noticeably fewer broken diffs on hard repos
1M-context whole-repo grep+patch	Test both	MSA is faster at long ctx; Opus is more accurate. A/B on your actual repo

The trap is treating this as one decision. Most teams want both models available, routed by task — and that’s exactly what ofox’s same-base_url swap is built for. We’ll show the routing pattern in the Try Both via ofox section.
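As a preview, the task-to-model split in the table above can be sketched as a plain lookup. This is a minimal sketch, not ofox's routing feature: the task category names and the `pick_model` helper are hypothetical, and only the two model IDs come from this post. The 1M-context case is left out on purpose because the table says to test both.

```python
# Model IDs as quoted in this post; everything else here is illustrative.
M3 = "minimax/minimax-m3"
OPUS = "anthropic/claude-opus-4.8"

# Hypothetical task categories mapped to the pick from the decision table.
ROUTES = {
    "lint_fix": M3,
    "format": M3,
    "low_stakes_refactor": M3,
    "ide_tool_loop": M3,
    "explain_bug": OPUS,
    "multi_file_refactor": OPUS,
}


def pick_model(task_kind: str, default: str = M3) -> str:
    """Return the model ID for a task category, falling back to the cheaper default."""
    return ROUTES.get(task_kind, default)


print(pick_model("lint_fix"))
print(pick_model("multi_file_refactor"))
```

Defaulting unknown categories to the cheaper model is a design choice; a team where wrong patches are expensive may prefer to default to Opus 4.8 instead.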

Quick Specs Comparison

All prices verified from the ofox catalog on 2026-06-13. Context and output limits from vendor docs.

Spec	MiniMax M3	Claude Opus 4.8
Model ID on ofox	`minimax/minimax-m3`	`anthropic/claude-opus-4.8`
Input price	$0.60/M tokens	$5.00/M tokens
Output price	$2.40/M tokens	$25.00/M tokens
Cached input price	$0.12/M tokens	$0.50/M tokens
Context window	1M tokens	1M tokens
Max output	131K tokens	128K tokens (Simon Willison’s review, 2026-05-28)
Modalities (input)	Text + image + video	Text + image
Vendor SWE-Bench Pro	59.0%	69.2%
Released	2026-06-01	2026-05-28
Open weight?	Promised, weights delayed	No (closed)
Architecture	MiniMax Sparse Attention (MSA)	Dense transformer (Anthropic)

Two specs worth pausing on:

Input price ratio is 8.3×; output ratio is 10.4×. A typical coding agent emits 0.2–0.5 output tokens per input token, so the effective ratio sits between 9× and 10× depending on workload. Round to 10× for back-of-envelope.
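The blended ratio is easy to check for your own output-to-input mix. The sketch below uses only the four list prices from the spec table; the function name `effective_ratio` is made up for illustration.

```python
# $/M tokens, from the spec table above.
M3_IN, M3_OUT = 0.60, 2.40
OPUS_IN, OPUS_OUT = 5.00, 25.00


def effective_ratio(out_per_in: float) -> float:
    """Opus 4.8 cost divided by M3 cost for a given output/input token ratio."""
    opus = OPUS_IN + OPUS_OUT * out_per_in
    m3 = M3_IN + M3_OUT * out_per_in
    return opus / m3


for r in (0.0, 0.2, 0.5, 1.0):
    print(f"{r:.1f} output tokens per input token -> {effective_ratio(r):.2f}x")
```

The ratio runs from 8.33× with no output at all to 10× at one output token per input token, and the 0.2 to 0.5 band lands at 9.26× to 9.72×.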

Max output is effectively a tie. M3 ships 131K, Opus 4.8 ships 128K — the 3K gap doesn’t change the operational shape. Both can emit a small file or a dozen unit tests in one call, and both will need chained calls past roughly 130K. If you’re picking on output-cap headroom, this is a wash; pick on price or quality instead.

SWE-Bench Pro: The Number That Started the Story

SWE-Bench Pro is the hardest variant of the SWE-bench family — problems from actively-maintained repositories, multi-file diffs, no public ground-truth leakage. It’s the closest thing the field has to a coding benchmark that resists memorization.

Here’s where the frontier models sat in early June 2026:

Model	SWE-Bench Pro	Released	Notes
Claude Opus 4.8	69.2%	2026-05-28	Anthropic-run, official
Claude Opus 4.7	64.3%	2026-04	What MiniMax compared M3 against
MiniMax M3	59.0%	2026-06-01	Vendor-run on own infra, Claude Code scaffolding
GPT-5.5	58.6%	2026-04-23	OpenAI-run
Gemini 3.1 Pro	< 58.6%	2026	Below GPT-5.5 per public leaderboards

The first sentence of MiniMax’s June 1 launch announcement reads, essentially: “M3 beats GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro at one-tenth the cost.” That’s correct as printed. What’s left out: Anthropic had shipped Opus 4.8 four days earlier with a 69.2% score, and the MiniMax deck compared M3 against the older Opus 4.7 at 64.3%.

Independent verification status is the other footnote. MiniMax ran the eval on its own infrastructure, using Claude Code as the agentic scaffolding, with evaluation logic aligned to the official methodology. The official SWE-Bench Pro leaderboard had not added M3 as of this writing. Treat the 59.0% as a directional signal — it might land at 56% or 61% on a clean third-party run, and either still leaves the same shape: M3 is in the same league as GPT-5.5, one tier below Opus 4.8.

The honest one-line read: the M3 number is real, the marketing framing is selective.

Terminal-Bench 2.1 and Multimodal: Where M3 Closes the Gap

SWE-Bench Pro is one signal. On Terminal-Bench 2.1 (long-horizon terminal execution, the kind of thing a coding agent does when you ask it to “set up the dev environment and run the failing test”), MiniMax reports M3 at 66.0%. Anthropic’s release notes put Opus 4.8 in a similar range, and MiniMax’s figure is notably ahead of GPT-5.5. A plausible reason: MSA’s decoding speed at long context makes long tool-use loops cheaper to retry, so the agent can recover from more failures within a budget.

Native multimodality is the other pitch. M3 accepts image and video input. Opus 4.8 accepts image input but not video. In practical coding terms this matters for two things: pasting a screenshot of a stack trace, and feeding a short screencast of a UI bug. Both models handle the screenshot case; only M3 handles the screencast.

For 95% of coding work neither of these tips the decision — you’re staring at text. They become decisive once you start building agents that actually look at the browser.

Pricing Math: What 1M Tokens Actually Cost

Vendor benchmarks are run on perfect infrastructure. Your bill is run on production traffic. Here are four realistic shapes:

Workload shape	Tokens	MiniMax M3 cost	Claude Opus 4.8 cost	Multiplier
Routine refactor agent (1M in + 200K out)	1.2M total	$1.08	$10.00	9.3×
Heavy code generation (500K in + 500K out)	1M total	$1.50	$15.00	10.0×
Whole-repo grep + patch (1M in + 50K out)	1.05M total	$0.72	$6.25	8.7×
Long-context audit with cache hit (1M cached + 50K out)	1.05M total	$0.24	$1.75	7.3×

Numbers use ofox’s published rates verified on 2026-06-13: M3 $0.60/M input / $2.40/M output / $0.12/M cached; Opus 4.8 $5/M input / $25/M output / $0.50/M cached. Math is unit price × token count, no rounding.
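To rerun the table on your own token counts, here is a minimal sketch of the same arithmetic. The rates are the ones quoted above; the `run_cost` helper and the workload labels are illustrative. It treats cached tokens as cache reads only, so it ignores any separate cache-write charge.

```python
# $/M tokens as (input, output, cached input), from the rates quoted above.
RATES = {
    "minimax/minimax-m3": (0.60, 2.40, 0.12),
    "anthropic/claude-opus-4.8": (5.00, 25.00, 0.50),
}


def run_cost(model, input_tokens=0, output_tokens=0, cached_tokens=0):
    """Dollar cost of one run: unit price times token count, summed per token class."""
    rate_in, rate_out, rate_cached = RATES[model]
    total = (input_tokens * rate_in
             + output_tokens * rate_out
             + cached_tokens * rate_cached)
    return total / 1_000_000


workloads = {
    "routine refactor": dict(input_tokens=1_000_000, output_tokens=200_000),
    "heavy generation": dict(input_tokens=500_000, output_tokens=500_000),
    "grep + patch": dict(input_tokens=1_000_000, output_tokens=50_000),
    "cached audit": dict(cached_tokens=1_000_000, output_tokens=50_000),
}

for name, shape in workloads.items():
    m3 = run_cost("minimax/minimax-m3", **shape)
    opus = run_cost("anthropic/claude-opus-4.8", **shape)
    print(f"{name}: M3 ${m3:.2f}, Opus 4.8 ${opus:.2f}, {opus / m3:.1f}x")
```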

The picture changes when you scale to a team. Pick a representative profile — five developers, 100 coding-agent runs per day each, 500K input and 100K output per run, 22 working days per month:

M3 per run: $0.30 + $0.24 = $0.54. Monthly: 5 × 100 × 22 × $0.54 = $5,940.
Opus 4.8 per run: $2.50 + $2.50 = $5.00. Monthly: 5 × 100 × 22 × $5.00 = $55,000.

A five-person engineering org running Opus on default routing burns through a small mortgage every month. The same team on M3-default routing with Opus called only for hard problems (say 10% of runs) pays roughly $11K instead. The price-performance argument for M3 isn’t “cheap is fine”; it’s that you can spend the saved $44K on running Opus more on the prompts that actually need it.
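The mixed-routing figure comes from splitting the 11,000 monthly runs between the two per-run prices. A minimal sketch, using only the team profile and per-run costs stated above (the `monthly_cost` name is illustrative):

```python
RUNS_PER_MONTH = 5 * 100 * 22   # developers x runs per day x working days
M3_PER_RUN = 0.54               # 500K in + 100K out at M3 rates
OPUS_PER_RUN = 5.00             # same shape at Opus 4.8 rates


def monthly_cost(opus_share: float) -> float:
    """Monthly bill when `opus_share` of runs go to Opus 4.8 and the rest to M3."""
    opus_runs = RUNS_PER_MONTH * opus_share
    m3_runs = RUNS_PER_MONTH - opus_runs
    return m3_runs * M3_PER_RUN + opus_runs * OPUS_PER_RUN


for share in (0.0, 0.1, 1.0):
    print(f"{share:.0%} of runs on Opus -> ${monthly_cost(share):,.0f}/month")
```

At a 10% Opus share the bill is $10,846, which is where the "roughly $11K" and the roughly $44K saving against all-Opus routing come from.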

The “Open-Weight” Caveat: Where Are the Weights?

MiniMax’s June 1 announcement positioned M3 as “the first and only open-weight model” combining frontier coding, 1M context, and native multimodality. The weights and technical report were scheduled for Hugging Face and GitHub “within roughly 10 days” of launch — call it the June 10–11 window.

As of June 13, 2026, the MiniMax-M3 GitHub repo still notes: “this model is not yet released — this repository exists so the community can share what they need next.” The API is live and you can call M3 via providers including ofox, but you cannot self-host it today. The repo has been frozen on a placeholder for almost two weeks.

This is not a fatal point — vendors slip weight releases all the time, and “10 days” was a soft window, not a contract. But it changes the practical comparison. If you picked M3 specifically because the weights would land in your private cluster within two weeks, that bet has not paid off yet. For now, both MiniMax M3 and Claude Opus 4.8 are API-only from a deployment perspective; the open-weight axis isn’t decisive in June 2026.

When the weights do ship, the math changes again. A self-hosted M3 cluster amortizes against your GPU lease, not per-token pricing — for sustained 24/7 workloads that’s a fundamentally different cost curve from per-token Opus. But that’s a story for the article we’ll write the day the weights actually appear on Hugging Face.

When to Pick MiniMax M3

Pick minimax/minimax-m3 if any of the following is true:

You’re running coding agents at volume. Lint-fixer bots, formatter loops, codemod agents, “write the docstring” pipelines. These are dominated by token cost, not per-prompt quality, and M3’s 10× pricing edge dwarfs the ~10-point quality gap.
You’re paying for long-context input. Whole-repo prompts (1M tokens of code in, small diff out) are where MSA’s decoding speed and M3’s input pricing compound. A million cached tokens on M3 costs $0.12 versus $0.50 on Opus.
Video input is a hard requirement. Opus 4.8 accepts images but not video. If your agent needs to look at a 30-second screen recording of a UI bug, you have one option in this comparison.
You’re hedging against the Opus 4.8 price tier. Even teams that prefer Opus 4.8 for primary work route routine prompts to a cheaper model. M3 is currently the strongest coding option under $1/M input that also offers a 1M-token context window.
You’ll switch if and when independent SWE-Bench Pro reruns come in lower. Treat the 59% as provisional. Build your stack so swapping minimax/minimax-m3 for the next cheap challenger is one config change away.

When to Pick Claude Opus 4.8

Pick anthropic/claude-opus-4.8 if any of the following is true:

A wrong patch costs more than a token bill. Production hotfixes, security-sensitive refactors, anything where you’d review the diff yourself before merging anyway. The ~10-point SWE-Bench Pro gap is concentrated on the hardest problems — not the median ones.
You’re building reasoning-heavy agents. “Read this incident postmortem and propose three fixes.” “Audit this OAuth flow and find the bug.” Opus 4.8’s reasoning gains over 4.7 are tangible per Anthropic’s release notes and per independent reviews like Simon Willison’s.
You’re already in the Anthropic ecosystem. Claude Code, Anthropic’s MCP tooling, dynamic workflows — all of these assume Anthropic-style tool semantics. M3 works with Claude Code (MiniMax themselves used it as scaffolding) but you’ll hit edge cases on tool format expectations.
The “Fast mode” cost tier suits your shape. Opus 4.8 introduced a $10/M input / $50/M output Fast mode tier for latency-sensitive use cases. It’s more expensive than the regular tier but less than calling Opus 4.7 and waiting longer. The relevant comparison there isn’t M3 — it’s Opus 4.8 standard vs Fast mode within Anthropic’s own lineup, covered in our Claude Opus 4.8 release review.
Your eval harness already calibrates against Opus. If your team has a “would the senior reviewer accept this PR” eval suite that’s been tuned against Opus outputs, switching models invalidates your eval until you re-baseline. That’s real engineering cost, not vibes.

When NOT to Pick Either (and What to Use Instead)

A few scenarios where this whole comparison is the wrong question:

Budget near $0.10/M tokens, simple refactors. Look at smaller models like Claude Haiku 4 or GPT-5.4 Mini covered in our budget coding model comparison. Spending $0.60/M on M3 when GPT-5.4 Mini at $0.10/M would do the same lint-fix is theater.
You need on-prem deployment today. Both M3 (weights not shipped) and Opus 4.8 (closed) are API-only. Self-host options for frontier coding today are Qwen 3.7 Max and the open Chinese model lineup — see our Qwen vs DeepSeek coding comparison.
You’re optimizing for a strict latency SLA, not cost. Both M3 and Opus 4.8 are designed for quality, not p50 latency. Smaller faster models will beat both on TTFT.
You need to evaluate multiple frontier models at once. Build a comparison harness instead of picking one. The agentic coding model showdown walks through the harness pattern.

Try Both via ofox: A/B in 10 Lines of Code

The whole comparison reduces to a one-string change if you call both models through ofox’s OpenAI-compatible endpoint. Same base_url, same SDK, just swap the model argument.

Python — A/B both models in one loop

import os

from openai import OpenAI

# The API key is read from the environment; base_url is ofox's OpenAI-compatible endpoint.
client = OpenAI(api_key=os.environ["OFOX_API_KEY"], base_url="https://api.ofox.ai/v1")
PROMPT = "Refactor this function to remove duplication: ..."

for model in ["minimax/minimax-m3", "anthropic/claude-opus-4.8"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Input and output tokens are priced differently, so print them separately.
    usage = resp.usage
    print(model, usage.prompt_tokens, usage.completion_tokens, resp.choices[0].message.content[:120])

Run this and you get per-model input and output token counts plus the first 120 characters of each output for eyeball comparison. Multiply the prompt_tokens and completion_tokens numbers by the input and output rates from the pricing math section above and you have a per-run cost on a real prompt rather than a vendor benchmark.

Node — same shape

import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OFOX_API_KEY, baseURL: "https://api.ofox.ai/v1" });
const prompt = "Refactor this function to remove duplication: ...";

for (const model of ["minimax/minimax-m3", "anthropic/claude-opus-4.8"]) {
  const r = await client.chat.completions.create({ model, messages: [{ role: "user", content: prompt }] });
  console.log(model, r.usage?.prompt_tokens, r.usage?.completion_tokens, r.choices[0].message.content.slice(0, 120));
}

Identical shape, identical endpoint, identical SDK call. The migration cost between models is one string. That’s the only reason this comparison is answerable in 10 lines instead of a week of vendor onboarding.

For a multi-turn agent loop that includes tool calls, the same swap works — both models accept OpenAI-style tools arrays via ofox. You’ll want to test the tool-call format on your specific tools because each provider’s strict mode handling diverges at the edges, but the contract is the same.
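For reference, an OpenAI-style tools array looks like the sketch below. The `run_tests` tool and the `build_request` helper are hypothetical and exist only to show the shape; the point is that the request is identical for both models apart from the model string.

```python
# Hypothetical tool definition in the OpenAI function-calling format.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite for one path and return the failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(model: str, prompt: str) -> dict:
    """Keyword arguments for client.chat.completions.create(); only `model` varies."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_TESTS_TOOL],
    }


a = build_request("minimax/minimax-m3", "Run the tests under src/ and fix failures.")
b = build_request("anthropic/claude-opus-4.8", "Run the tests under src/ and fix failures.")
print({k for k in a if a[k] != b[k]})
```

Passing the result as `client.chat.completions.create(**build_request(...))` keeps the tool schema in one place, so any difference you see between the two models comes from the model and not from the request.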

Compatibility Quirks: What Differs Between the Two APIs

Same endpoint, same SDK call — but a few sharp edges worth knowing before you wire either model into production.

System prompt handling. Claude Opus 4.8 treats the system role as a strict system prompt with elevated trust. MiniMax M3 (via the OpenAI-compatible path) folds system into the conversation more loosely. If your agent depends on system-prompt-only constraints — “never call this tool unless asked,” “always respond in JSON” — M3 follows them most of the time but is statistically more likely to drift on long tool loops. Workaround: repeat critical constraints in the first user message.
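The workaround can be sketched as a small message builder that states each constraint in the system role and again in the first user turn. The helper name and the "Rules" wording are illustrative choices, not something either vendor prescribes.

```python
def with_repeated_constraints(system_prompt, constraints, user_prompt):
    """Build a messages list that states the constraints twice:
    once in the system role and once at the top of the first user turn."""
    rules = "\n".join(f"- {c}" for c in constraints)
    return [
        {"role": "system", "content": f"{system_prompt}\n\nRules:\n{rules}"},
        {"role": "user", "content": f"Rules (repeated):\n{rules}\n\n{user_prompt}"},
    ]


messages = with_repeated_constraints(
    "You are a coding agent.",
    ["Always respond in JSON.", "Never call a tool unless asked."],
    "Fix the failing test in utils.py.",
)
for m in messages:
    print(m["role"], "->", m["content"].splitlines()[0])
```

The same list works unchanged against Opus 4.8, so the repetition does not need to be model-specific.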

Tool-call format strictness. Opus 4.8 enforces tool argument schemas hard — it will refuse to call a tool if your parameters JSON Schema marks a field required and the model can’t fill it. M3 is more lenient and will sometimes emit a tool call with a placeholder string. If your tool layer treats placeholders as valid, you’ll silently execute wrong actions; if it validates strictly, you’ll see more retry loops. The fix is the same either way: validate tool arguments on the server side, not just at the model.
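A minimal sketch of that server-side check, assuming tool-call arguments arrive as a JSON string (as in OpenAI-style tool calls). The placeholder list is a guess you would tune to what you actually observe, and the check covers only required fields; a full JSON Schema validator is the more thorough option in production.

```python
import json

# Hypothetical placeholder values; extend with whatever your logs show.
PLACEHOLDERS = {"", "placeholder", "todo", "<path>", "..."}


def validate_tool_args(arguments_json: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may be executed."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    for field in schema.get("required", []):
        if field not in args:
            problems.append(f"missing required field: {field}")
        elif isinstance(args[field], str) and args[field].strip().lower() in PLACEHOLDERS:
            problems.append(f"placeholder value for required field: {field}")
    return problems


schema = {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}
print(validate_tool_args('{"path": "src/app.py"}', schema))
print(validate_tool_args('{"path": "TODO"}', schema))
```

Returning the problems to the model as the tool result, instead of executing the call, turns a silent wrong action into a visible retry.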

Caching semantics. Both models offer cached input pricing, but Anthropic splits the bill into write and read. On Opus 4.8 you pay a one-time cache write at $6.25/M (5-minute TTL) or $10/M (1-hour TTL), then every subsequent cache read lands at $0.50/M, which is the number in the spec table above. M3’s cache on ofox is a single $0.12/M read rate with implicit TTL and no separate write surcharge. For workloads that hit the same long-context prompt many times per minute (like a code review agent with a static repo prompt), M3 is about 4× cheaper at the cache read layer and skips the write surcharge entirely. For workloads where the cached portion stays warm for hours, the write cost amortizes across many reads and Opus 4.8’s effective rate on the 1-hour tier falls toward $0.50/M; that narrows the gap but still leaves Opus at roughly 4× M3’s cached rate, so the case for Opus there rests on quality rather than price.
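To see how the write surcharge amortizes, the sketch below averages one cache write plus N reads using the rates quoted above. It assumes every read lands inside a single TTL window, so only one write is paid; the function name is illustrative.

```python
OPUS_WRITE = {"5m": 6.25, "1h": 10.00}  # $/M tokens, one-time cache write by TTL
OPUS_READ = 0.50                        # $/M tokens per cache read
M3_READ = 0.12                          # $/M tokens, single cached rate on ofox


def opus_effective_rate(reads: int, ttl: str = "1h") -> float:
    """Average $/M tokens over one cache write followed by `reads` cache reads."""
    return (OPUS_WRITE[ttl] + OPUS_READ * reads) / (reads + 1)


for reads in (1, 9, 99):
    rate = opus_effective_rate(reads)
    print(f"{reads} reads: Opus ${rate:.3f}/M vs M3 ${M3_READ:.2f}/M ({rate / M3_READ:.1f}x)")
```

With few reads the write dominates; with many reads the average approaches the $0.50/M read rate from above and never drops below it.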

Streaming chunk shape. Both models stream OpenAI-compatible chunks, but Opus 4.8 emits more granular delta.thinking events when extended thinking is enabled (covered in our Opus 4.8 release review). If your client parses thinking deltas separately from content deltas, that code works against Opus but no-ops against M3, which doesn’t currently expose thinking deltas through the OpenAI-compatible route. Not a bug — just an unused field.

Rate limits at the provider edge. When you call both models through ofox, you share one rate limit envelope keyed to your API key — not two separate per-vendor quotas. That’s the point of the gateway shape: M3 fallback when Opus is rate-limited, Opus fallback when M3 is, all without juggling two sets of credentials.
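The fallback idea can be sketched as a loop over model IDs that moves on only when a call is rate-limited. `RateLimited` here is a stand-in exception and `call` is any function you supply; with the openai Python SDK the analogous exception to catch would be `openai.RateLimitError`. Whether ofox reports a per-model limit this way is an assumption to verify against its docs.

```python
class RateLimited(Exception):
    """Stand-in for whatever rate-limit error your client raises (hypothetical)."""


def call_with_fallback(call, models):
    """Try each model in order; move to the next one only on a rate-limit error.

    Returns (model_id, result) for the first call that succeeds.
    """
    last_error = None
    for model in models:
        try:
            return model, call(model)
        except RateLimited as exc:
            last_error = exc
    if last_error is None:
        raise ValueError("no models given")
    raise last_error


def fake_call(model):
    # Simulated client: pretend Opus is currently rate-limited.
    if model == "anthropic/claude-opus-4.8":
        raise RateLimited("429")
    return f"response from {model}"


print(call_with_fallback(fake_call, ["anthropic/claude-opus-4.8", "minimax/minimax-m3"]))
```

Other errors propagate immediately on purpose: falling back on a bad request or an auth failure would only hide the real problem.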

The whole MiniMax M3 vs Claude Opus 4.8 question collapses to one string swap on the same endpoint — which is the only sane way to pick a coding model in 2026.