Qwen 3.7 Max Developer Guide: 1M Context, $2.50/MTok, and the Anthropic-Protocol Drop-In (2026)

TL;DR — Qwen 3.7 Max-Preview launched May 19, 2026 with a 1M-token context, native Anthropic Messages protocol, $2.50/$7.50 per million tokens, and a 90% cached-input discount ($0.25/M). It scores 56.6 on the Artificial Analysis Intelligence Index (top 10 globally and the highest-ranked Chinese model on the board) and 97.1 on HMMT 2026 February. Access it on ofox.ai as bailian/qwen3.7-max with the same key that gives you Claude, GPT, and Gemini. The catch: default extended thinking makes it verbose enough that effective costs run 3–4× the headline rate on long agent sessions unless you cap max_tokens.

A model that natively speaks the Anthropic Messages protocol at half the input cost of Claude Opus 4.7, with a real 1M-token window and a 90% cache discount, is the first credible drop-in replacement for Opus in a Claude Code harness. The pricing war for serious coding models just got uncomfortable.

What is Qwen 3.7 Max?

Qwen 3.7 Max-Preview is Alibaba’s May 2026 flagship reasoning model, announced at the Alibaba Cloud Summit on May 20 and live on the API one day earlier. It replaces Qwen 3.6 Max Preview as the company’s most capable hosted model. A few properties make it more than a routine version bump:

  1. Native Anthropic Messages support. Most non-Anthropic models advertise “Anthropic compatibility” via a translation shim. Qwen 3.7 Max accepts the Anthropic Messages format directly at the endpoint level, so you can point Claude Code, OpenClaw, or any Anthropic SDK call at Qwen with only the base URL and model id changed.
  2. 1,000,000-token native context with 65,536-token max output. This is not a sliding-window approximation: the model is trained for end-to-end long context and posts 90.4 on the MRCR-v2 128k retrieval benchmark, a test that most “1M context” competitors quietly fail.
  3. Extended thinking on by default. Every response runs through deliberation before output. Quality goes up; verbosity goes up much further. Plan for it or watch your bill.

For background on why “1M context” claims are worth verifying, see Long-Context LLM Benchmarks. Most models marketed at 1M lose accuracy hard past 200K.

Qwen 3.7 Max pricing — what you actually pay

You pay $2.50 per million input tokens, $7.50 per million output tokens, and $0.25 per million cached input tokens on ofox.ai. The cached-input rate is the number that changes the planning, not the headline rate.

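| Model | Input (per M tokens) | Output (per M tokens) | Cached input (per M tokens) | Context |
| --- | --- | --- | --- | --- |
| Qwen 3.7 Max-Preview | $2.50 | $7.50 | $0.25 | 1M |
| Qwen 3.6 Plus | $0.50 | $3.00 | not listed | 1M |
| Qwen 3.6 Max Preview | $2.00 | $12.00 | not listed | 256K |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | 200K |
| GPT-5.5 | $3.00 | $12.00 | $0.30 | 400K |
| Gemini 3.1 Pro | $2.50 | $10.00 | $0.31 | 1M |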

Qwen 3.7 Max is the cheapest model in its tier on output by a meaningful margin: $7.50 versus $10–25 for the comparable flagships. Output tokens dominate agent-loop bills because thinking and tool calls all accumulate on the output side, so this is the line item that shapes the monthly bill.

The cached-input rate is the other half of the story. At $0.25 per million tokens, repeated reads of the same context cost the same as uncached input on Gemini 3.1 Flash. If you are doing RAG over a stable codebase, document QA over a fixed PDF set, or agent loops that carry a long system prompt across hundreds of calls, the cached rate is the one that decides the bill. The headline rate is misleading for those workloads.
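To make the cached-input effect concrete, here is a minimal cost sketch using the three rates from the pricing table above. The workload (200 calls over a 100K-token context with 1K output tokens each) is hypothetical, and the sketch assumes every call after the first is billed entirely at the cached rate with no separate cache-write charge, which the pricing above does not confirm either way.

```python
# Rates in USD per million tokens, taken from the pricing table above.
INPUT_RATE = 2.50
CACHED_INPUT_RATE = 0.25
OUTPUT_RATE = 7.50


def estimate_cost(fresh_input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for the given token volumes."""
    return (
        fresh_input_tokens * INPUT_RATE
        + cached_input_tokens * CACHED_INPUT_RATE
        + output_tokens * OUTPUT_RATE
    ) / 1_000_000


# Hypothetical workload: 200 calls, each reading the same 100K-token
# context and producing 1K output tokens.
calls, context_tokens, output_tokens = 200, 100_000, 1_000

# Every call pays the full input rate.
uncached = estimate_cost(calls * context_tokens, 0, calls * output_tokens)

# Assumption: the first call pays the full rate, the other 199 hit the cache.
cached = estimate_cost(
    context_tokens, (calls - 1) * context_tokens, calls * output_tokens
)

print(f"uncached: ${uncached:.2f}")
print(f"cached:   ${cached:.2f}")
```

Under these assumptions the same workload drops from about $51.50 to under $7, and the output side ($1.50 in both cases) becomes a visible share of what remains.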

The output rate carries an asterisk we will get to in the verbosity section.

Quickstart: OpenAI SDK route

This is the standard chat completions path. It works with any OpenAI SDK without modification and uses the same call shape as GPT or DeepSeek.

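from openai import OpenAI

# Point the standard OpenAI client at the ofox.ai gateway.
client = OpenAI(
    api_key="sk-ofox-xxx",  # placeholder; substitute your own key
    base_url="https://api.ofox.ai/v1",
)

# Same call shape as any other chat completions model; only the model id changes.
response = client.chat.completions.create(
    model="bailian/qwen3.7-max",
    messages=[{"role": "user", "content": "Explain MoE routing in three sentences."}],
    max_tokens=1024,  # cap output to keep the bill predictable
)
print(response.choices[0].message.content)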

If you are already on the OpenAI SDK and switching from another model, swapping the model string is the whole migration. See the OpenAI SDK Migration Guide for the rest of the catalog.

Quickstart: Anthropic Messages route (drop-in for Claude Code)

This is the unusual part. Qwen 3.7 Max accepts the Anthropic Messages format at the protocol level, so anything that already targets Claude works by switching the base URL and model id.

export ANTHROPIC_BASE_URL=https://api.ofox.ai/anthropic
export ANTHROPIC_API_KEY=sk-ofox-xxx
export ANTHROPIC_MODEL=bailian/qwen3.7-max

# Claude Code, OpenClaw, or any Anthropic-SDK harness now routes to Qwen
claude  # runs against Qwen 3.7 Max with the same Claude Code UI

If you have been running Claude Code on Opus 4.7 and want to test Qwen on a real task without rewriting your harness, that is the whole setup. The protocol-level support is what makes this different from every other “Anthropic-compatible” model on the market: those are shims, with the usual edge cases around tool-use schemas and streaming chunk boundaries.
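For code that calls the Anthropic SDK directly rather than through Claude Code, the same switch might look like the sketch below. It assumes the `anthropic` Python package is installed and that the gateway serves the Messages API under the base URL from the export lines above; `build_request` and `ask_qwen` are illustrative helper names, not part of any SDK.

```python
import os

# Base URL and model id as used in the export lines above.
OFOX_ANTHROPIC_BASE_URL = "https://api.ofox.ai/anthropic"
QWEN_MODEL = "bailian/qwen3.7-max"


def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build keyword arguments for a Messages API call (hypothetical helper)."""
    return {
        "model": QWEN_MODEL,
        "max_tokens": max_tokens,  # the Messages API requires an explicit cap
        "messages": [{"role": "user", "content": prompt}],
    }


def ask_qwen(prompt: str) -> str:
    """Send one prompt through the Anthropic SDK, pointed at the gateway."""
    # Imported lazily so build_request can be used without the package.
    import anthropic

    client = anthropic.Anthropic(
        api_key=os.environ["ANTHROPIC_API_KEY"],
        base_url=OFOX_ANTHROPIC_BASE_URL,
    )
    message = client.messages.create(**build_request(prompt))
    # Keep only text blocks; any other block types are skipped.
    return "".join(block.text for block in message.content if block.type == "text")


# Example call (needs a valid key and network access):
# print(ask_qwen("Explain MoE routing in three sentences."))
```

Keeping the request construction separate from the network call makes it easy to check the payload, or to swap the model id back to an Anthropic model, without touching the rest of the harness.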

For a comparison of the agent harnesses themselves (Claude Code vs Codex CLI vs Cursor vs DeepSeek TUI) on the same task, see AI Coding Agents Compared 2026.

Benchmarks — and which ones actually matter

Headline numbers from Artificial Analysis and the Qwen team’s own evaluation:

| Benchmark | Qwen 3.7 Max | What it measures |
| --- | --- | --- |
| AA Intelligence Index | 56.6 | Composite across 10 evaluations (MMLU-Pro, GPQA, HumanEval+, SWE-bench Verified, etc.) |
| HMMT 2026 Feb | 97.1 | Competition mathematics; top result in the AA leaderboard at launch |
| GPQA Diamond | 92.4 | Graduate-level science questions |
| MRCR-v2 128k | 90.4 | Long-context multi-hop retrieval at 128K tokens |
| LM Arena (Elo) | ~1,475 | Crowd-sourced pairwise preference |

The math score matters more than it looks. Nobody is shipping production code that solves HMMT, but the benchmark measures whether the model can hold a long multi-step chain without losing the thread. That correlates closely with whether the model can hold its own during a debugging session where you need it to track three files and four constraints at once.

The MRCR-v2 score is what makes the 1M context window believable. Most 1M-context models drop below 70% retrieval accuracy past 200K. Qwen 3.7 Max's 90.4 is measured at 128K rather than at the full 1M, but it shows retrieval holding up at long context, which is what the window is being sold for.

The composite Intelligence Index puts it at #1 among Chinese models and inside the global top 10, a tier above Qwen 3.6 Max Preview, comparable to the GPT-5.5 / Claude Opus 4.6 cluster, and just below the absolute frontier covered in GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro.

The verbosity tax most reviews skip

The launch coverage glosses over this part. Artificial Analysis’s published evaluation of Qwen 3.7 Max generated ~97 million output tokens, against a ~24 million median for the comparable models on the same task. Roughly 4× the verbosity. Multiply by the $7.50/M output rate and the “cheap” model lands in Claude Opus 4.7 cost territory on workloads where thinking compounds.
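The arithmetic behind that comparison can be sketched in a few lines. The token counts and rates come from the figures above; treating an Opus run as producing the median token volume is an assumption made here purely for illustration.

```python
# USD per million output tokens, from the pricing table.
QWEN_OUTPUT_RATE = 7.50
OPUS_OUTPUT_RATE = 25.00

# Output volume on the Artificial Analysis evaluation, in millions of tokens.
QWEN_EVAL_TOKENS_M = 97     # ~97M reported for Qwen 3.7 Max
MEDIAN_EVAL_TOKENS_M = 24   # ~24M median for comparable models

# How many times more output Qwen produced than the median model.
verbosity = QWEN_EVAL_TOKENS_M / MEDIAN_EVAL_TOKENS_M

# Output-side cost of the whole evaluation run.
qwen_run_cost = QWEN_EVAL_TOKENS_M * QWEN_OUTPUT_RATE
# Assumption: an Opus run emits the median token volume.
opus_run_cost = MEDIAN_EVAL_TOKENS_M * OPUS_OUTPUT_RATE

# Qwen's output rate expressed per "median-sized" million tokens of work.
effective_rate = QWEN_OUTPUT_RATE * verbosity

print(f"verbosity multiplier: {verbosity:.2f}x")
print(f"Qwen run: ${qwen_run_cost:.2f}, Opus-at-median run: ${opus_run_cost:.2f}")
print(f"effective Qwen output rate: ${effective_rate:.2f}/M")
```

On these numbers the output side of the Qwen run costs more than an Opus run at median verbosity, which is the sense in which the headline rate understates the bill.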

This is a structural property, not a benchmark artifact. Extended thinking runs by default and there is no first-class flag to switch it off. The practical mitigation set:

  • Cap max_tokens aggressively. Most agent loops do not need 65K of output per turn. max_tokens=2048 or 4096 per turn cuts the worst-case bill without hurting quality on typical work.
  • Use the cached-input rate religiously. Anything that carries a stable system prompt or repeated context across many turns should hit the $0.25/M cached rate. A 10× input saving often outweighs the verbosity tax on the output side.
  • Route by task length. Send the hard long-context turns to Qwen 3.7 Max where reasoning earns its keep; route short turns to a cheaper model. The Claude Code hybrid routing pattern generalizes to any agent harness.
  • Watch the reasoning_content field. It is billed as output even when your application discards it from the final message, so dropping it client-side does not reduce what you pay.
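The first and last mitigations above come down to two small calculations: what a given max_tokens cap can cost in the worst case, and what a turn actually cost according to the response's usage counters. The sketch below assumes the OpenAI-style `usage.completion_tokens` count includes reasoning tokens; that matches the billing behavior described above but is not confirmed here, and both helper names are illustrative.

```python
from types import SimpleNamespace

OUTPUT_RATE = 7.50  # USD per million output tokens, from the pricing table


def worst_case_output_cost(max_tokens: int, turns: int = 1) -> float:
    """Upper bound on output spend if every turn hits the cap."""
    return max_tokens * turns * OUTPUT_RATE / 1_000_000


def output_cost_from_usage(usage) -> float:
    """Output spend for one response, given its `usage` object.

    Assumption: completion_tokens counts reasoning tokens as well as the
    visible answer, so discarded reasoning still shows up here.
    """
    return usage.completion_tokens * OUTPUT_RATE / 1_000_000


# Worst case for a 100-turn agent session at different caps.
for cap in (2_048, 4_096, 65_536):
    print(f"max_tokens={cap}: ${worst_case_output_cost(cap, turns=100):.3f}")

# Stand-in for response.usage from a real chat completions call.
fake_usage = SimpleNamespace(prompt_tokens=1_200, completion_tokens=3_500)
print(f"this turn: ${output_cost_from_usage(fake_usage):.5f}")
```

Logging the second number per turn is a cheap way to notice when thinking output, rather than the visible answer, is driving the bill.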

The economic shape of Qwen 3.7 Max is “very cheap headline, verbose middle, watch your output cap.” That is different enough from how you plan a Claude or GPT budget that ignoring it will make the model cost more than it needs to.

Where Qwen 3.7 Max loses

Three places I would still reach for Claude Opus 4.7 or GPT-5.5:

  • Multi-step agent workflows with brittle tool schemas. Opus 4.7 is measurably more reliable at long-horizon tool use where a single malformed call breaks the loop. Qwen 3.7 Max is good here, better than 3.6 Max, but not best.
  • End-user-facing copy where short responses matter. If you need concise output for a chat UI and have no clean way to trim post-hoc, the cheaper-on-paper model becomes annoying to control.
  • Latency-sensitive interactive UIs. Extended thinking adds latency by design. At the same price point, a non-thinking model will feel faster to a user typing in a chat box.

None of these are showstoppers; they are tradeoffs to plan around. For the leaderboard view of when each model wins, see Best LLM for Coding 2026 and the LLM Leaderboard.

Practical recommendation

Use Qwen 3.7 Max as the default for two workload shapes: long-context document or codebase QA where cached input dominates the bill, and Claude Code / OpenClaw sessions where you want to cut Opus list prices by roughly 2× on input and 3× on output without touching the harness. The Anthropic protocol support is the deciding factor. Every other cheap-Opus-alternative requires a shim that breaks tool-use schemas in subtle ways. Qwen is the first model where that is no longer the case.

Keep Opus 4.7 for agent loops where you cannot afford a single failed tool call, and keep a Flash-tier model for short turns where reasoning is overkill. The operational argument for running all three behind one key sits in Why an LLM API Gateway and the AI API Aggregation Guide. For the routing patterns that benefit most from Qwen’s pricing shape, How to Reduce AI API Costs covers the mechanics.
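One way to express that three-way split is a small routing function in front of the gateway. This is a sketch only: the Opus and Flash-tier model ids are hypothetical placeholders rather than real catalog entries, and the 8,000-token threshold is an arbitrary starting point to tune against your own traffic.

```python
QWEN_MAX = "bailian/qwen3.7-max"       # model id from the quickstart
OPUS = "example/claude-opus-4.7"       # hypothetical id; check your gateway's catalog
FLASH_TIER = "example/flash-tier"      # hypothetical id for a cheap short-turn model


def pick_model(
    prompt_tokens: int,
    tool_critical: bool = False,
    long_threshold: int = 8_000,  # arbitrary cutoff, chosen for illustration
) -> str:
    """Choose a model id for one turn.

    - tool_critical turns go to Opus, where a failed tool call is costly.
    - long prompts go to Qwen 3.7 Max, where the large window and cached
      input pricing pay off.
    - everything else goes to the cheap short-turn model.
    """
    if tool_critical:
        return OPUS
    if prompt_tokens >= long_threshold:
        return QWEN_MAX
    return FLASH_TIER


print(pick_model(120_000))                    # long-context QA turn
print(pick_model(300))                        # short chat turn
print(pick_model(5_000, tool_critical=True))  # brittle tool-use step
```

Because all three ids sit behind one key, the return value can be passed straight into the `model` argument of an existing call.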

The question after the Qwen 3.7 Max launch is no longer whether a Chinese model can compete at the frontier. It is which workloads you stop sending to Opus this month.
