Qwen3 API Guide: Access Qwen3-Max, Qwen3-Coder, and Qwen3.5 Models (2026)

TL;DR: Qwen3-Max costs $0.36/M input and $1.43/M output via ofox — roughly 8× cheaper than Claude Sonnet 4.6 for input tokens. The Coder variant targets agentic multi-step tasks and costs only $0.20/M input. The fastest way to try either without an Alibaba Cloud account is a single ofox API key.

What Is the Qwen3 Family?

Alibaba’s Qwen3 generation is the largest open-weight model family they’ve shipped. The flagship open-weight model — Qwen3-235B-A22B — is a Mixture-of-Experts architecture with 235B total parameters but only 22B activated per forward pass. That MoE design is what keeps inference costs low despite the model’s scale.

The family spans eight open-weight sizes (0.6B through 235B), two MoE variants, and a set of API-only commercial models. The ones relevant to developers calling an API are the commercial models available via cloud providers — specifically the ones ofox carries:

Model              | ofox Model ID              | Context | Input   | Output
Qwen3-Max          | bailian/qwen3-max          | 256K    | $0.36/M | $1.43/M
Qwen3 Coder Next   | bailian/qwen3-coder-next   | 256K    | $0.20/M | $1.50/M
Qwen3.5 Flash      | bailian/qwen3.5-flash      | 1M      | $0.10/M | $0.40/M
Qwen3.5 27B        | bailian/qwen3.5-27b        | 256K    | $0.29/M | $2.05/M
Qwen3.5 35B A3B    | bailian/qwen3.5-35b-a3b    | 256K    | $0.29/M | $1.83/M
Qwen3.5 122B A10B  | bailian/qwen3.5-122b-a10b  | 256K    | $0.29/M | $2.29/M
Qwen3.5 397B A17B  | bailian/qwen3.5-397b-a17b  | 256K    | $0.55/M | $3.50/M
Qwen3.5 Plus       | bailian/qwen3.5-plus       | 1M      | $0.40/M | $2.40/M
Qwen3.6 Plus       | bailian/qwen3.6-plus       | 1M      | $0.50/M | $3.00/M

Prices are per million tokens as listed on ofox.ai at the time of writing. All models are OpenAI-compatible — same endpoint, same request format.

Thinking Mode: What It Is and How to Use It

Every Qwen3 model ships with two operating modes in the same checkpoint:

  • Thinking mode — slow, explicit chain-of-thought. The model reasons through steps before answering. Thinking tokens are returned in a separate block before the final answer.
  • Non-thinking mode — fast response, no explicit reasoning trace.

You switch between them at the API level:

# Enable thinking mode
# ('client' is the OpenAI-compatible client configured for ofox; see Quick Start)
response = client.chat.completions.create(
    model="bailian/qwen3-max",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational"}],
    extra_body={"enable_thinking": True}
)

Or use soft switches inside the system or user prompt: /think to activate reasoning, /no_think to suppress it. This is useful when you want different behavior on different turns in a multi-turn conversation without changing API parameters.

For most API use cases — especially coding tasks — non-thinking mode is faster and cheaper. Use thinking mode when the task genuinely benefits from extended reasoning: math proofs, complex planning, multi-step logic.
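The per-turn soft switches can be handled with a small helper that appends the directive to the latest user message. The /think and /no_think strings are Qwen's documented switches; the helper itself is an illustrative sketch, not part of any SDK:

```python
def with_thinking(messages, enabled):
    """Append Qwen's /think or /no_think soft switch to the last user message.

    Illustrative helper: the switch strings are Qwen's, the function is not.
    """
    directive = "/think" if enabled else "/no_think"
    patched = [dict(m) for m in messages]  # shallow copy; don't mutate caller's list
    for m in reversed(patched):
        if m["role"] == "user":
            m["content"] = f'{m["content"]} {directive}'
            break
    return patched

msgs = [{"role": "user", "content": "Summarize this diff"}]
print(with_thinking(msgs, enabled=False)[0]["content"])
# -> Summarize this diff /no_think
```

Pass the patched list as the `messages` argument of a normal chat completion; the original list is left untouched, so you can flip the switch per turn.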

Qwen3-Max: General Purpose Flagship

bailian/qwen3-max is the commercial flagship for general tasks — reasoning, math, multilingual work, and instruction following. It was trained on roughly 36 trillion tokens (nearly double Qwen2.5’s training data), with a final stage extending context to 32K for the base models, scaled further for the commercial API variant.

The 256K context window is practical for most document processing and long-context retrieval tasks. At $0.36/M input and $1.43/M output, it’s meaningfully cheaper than Claude Sonnet 4.6 ($3/M input, $15/M output) for high-throughput pipelines where you’re sending large contexts repeatedly.

Where Qwen3-Max tends to do well: multilingual tasks across the 119 languages it was trained on, structured output generation, math reasoning, and agent-style reasoning when thinking mode is enabled.

Qwen3 Coder Next: Built for Agentic Coding

bailian/qwen3-coder-next is a dedicated coding model. The public Qwen3-Coder series was trained on 7.5 trillion tokens with roughly 70% code content, then fine-tuned specifically for execution-driven tasks — running code, using tools, managing multi-turn agent loops.

Performance claims from Alibaba put it near Claude Sonnet 4 on agentic coding benchmarks and at state-of-the-art for open models on SWE-Bench Verified without test-time scaling. The 256K native context (extendable to 1M via YaRN) means you can feed it large codebases without chunking.

At $0.20/M input and $1.50/M output, it’s the cheapest option in the Qwen3 lineup per token for input-heavy coding workloads. For a typical SWE-agent loop with 100K tokens of context per turn, that’s $0.02 of input cost per request, versus about $0.30 for Claude Sonnet 4.6 at the same context length (15× cheaper on input; the output-price gap is closer to 10×).
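The input-cost arithmetic is easy to check directly. Prices come from the table above; the 100K-token context per turn is a hypothetical workload figure:

```python
# $/1M input tokens, from the pricing tables in this guide
PRICE_PER_M = {"qwen3-coder-next": 0.20, "claude-sonnet-4.6": 3.00}

def input_cost(model, tokens):
    """Dollar cost of sending `tokens` input tokens to `model`."""
    return PRICE_PER_M[model] * tokens / 1_000_000

qwen = input_cost("qwen3-coder-next", 100_000)     # $0.02
claude = input_cost("claude-sonnet-4.6", 100_000)  # $0.30
print(f"${qwen:.2f} vs ${claude:.2f} per turn")
```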

The model integrates with standard coding tool ecosystems: Cursor, Cline, Claude Code-style CLIs, and any tool that accepts an OpenAI-compatible endpoint. See the AI tools configuration guide for setup details across different editors.
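Most of those tools only need three pieces of information: a base URL, an API key, and a model ID. The sketch below shows the generic shape; the field names are illustrative, since each tool (Cursor, Cline, and so on) uses its own config format and you should check its docs for the exact keys:

```python
# Generic OpenAI-compatible endpoint settings. Field names are illustrative;
# only the values (URL, key, model ID) carry over between tools.
openai_compatible = {
    "base_url": "https://api.ofox.ai/v1",
    "api_key": "sk-your-ofox-key",
    "model": "bailian/qwen3-coder-next",
}
```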

Qwen3.5 Flash: 1M Context at $0.10/M

bailian/qwen3.5-flash is the cheapest Qwen model on ofox and the one to reach for when context length and cost are the primary constraints. At $0.10/M input and $0.40/M output with a 1M token context window, it’s in the same tier as Gemini Flash variants for document-heavy pipelines.

Use it for: summarization of long documents, RAG reranking, classification tasks over large datasets, and any workload where you’re primarily paying for input tokens and need to process a lot of text cheaply.
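To put "input-heavy" in dollar terms, here is a back-of-the-envelope for a long-document summarization at Flash pricing ($0.10/M input, $0.40/M output from the table above; the token counts are hypothetical):

```python
FLASH_INPUT, FLASH_OUTPUT = 0.10, 0.40  # $/1M tokens, from the pricing table

def summarize_cost(input_tokens, output_tokens):
    """Total dollar cost of one request at Qwen3.5 Flash pricing."""
    return (input_tokens * FLASH_INPUT + output_tokens * FLASH_OUTPUT) / 1_000_000

# An 800K-token document summarized into a 2K-token answer:
print(f"${summarize_cost(800_000, 2_000):.4f}")  # $0.0808
```

Almost the entire cost is the input side, which is exactly the workload shape this model is priced for.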

Quick Start: Call Qwen3 via ofox

All Qwen3 models on ofox use the OpenAI-compatible endpoint. Drop this into any existing OpenAI SDK project:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-ofox-key",
    base_url="https://api.ofox.ai/v1"
)

response = client.chat.completions.create(
    model="bailian/qwen3-max",
    messages=[{"role": "user", "content": "Your prompt here"}]
)
print(response.choices[0].message.content)

To switch to Qwen3-Coder for a coding task, change the model string to bailian/qwen3-coder-next. Everything else stays the same.

For thinking mode, add extra_body={"enable_thinking": True} to the request. The thinking content arrives in the response as a separate field before the final message.
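Because the exact name of that field depends on the provider's response schema (reasoning_content is an assumption here, not confirmed by this guide), it is safest to read it defensively:

```python
from types import SimpleNamespace

def split_reply(message):
    """Return (thinking, answer); tolerate responses without a thinking field.

    The field name 'reasoning_content' is an assumption about the schema.
    """
    thinking = getattr(message, "reasoning_content", None)
    return thinking, message.content

# SimpleNamespace stands in for response.choices[0].message:
msg = SimpleNamespace(reasoning_content="Assume sqrt(2)=p/q ...",
                      content="Therefore sqrt(2) is irrational.")
thinking, answer = split_reply(msg)
print(thinking is not None, answer)
```

With non-thinking mode (or a schema that omits the field), `thinking` simply comes back as None and the answer is unaffected.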

Get your API key at ofox.ai. The same key works for Claude, GPT, Gemini, and all other models on the platform — no separate Alibaba Cloud or DashScope account needed.

Qwen3 vs Claude Sonnet 4.6: When to Pick Which

This isn’t a rigorous head-to-head benchmark — results vary by task. But here’s the practical cost comparison that matters for choosing:

                     | Qwen3-Max | Qwen3 Coder Next | Claude Sonnet 4.6
Input ($/1M tokens)  | $0.36     | $0.20            | $3.00
Output ($/1M tokens) | $1.43     | $1.50            | $15.00
Context              | 256K      | 256K             | 200K
Thinking mode        | Yes       | Yes              | Yes (extended thinking)
Coding focus         | General   | Specialized      | General

For input-heavy workloads — long-context retrieval, document analysis, RAG pipelines — Qwen3-Max is roughly 8× cheaper on input tokens than Claude Sonnet 4.6. The output cost gap is even wider.
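Those ratios fall straight out of the per-million prices in the table above:

```python
qwen_in, qwen_out = 0.36, 1.43       # Qwen3-Max, $/1M tokens
claude_in, claude_out = 3.00, 15.00  # Claude Sonnet 4.6, $/1M tokens

print(f"input:  {claude_in / qwen_in:.1f}x cheaper")    # ~8.3x
print(f"output: {claude_out / qwen_out:.1f}x cheaper")  # ~10.5x
```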

For complex agentic coding tasks where you want the most reliable multi-step execution, Claude Sonnet 4.6 still has an edge in real-world integrations and broader tool ecosystem support. Qwen3 Coder Next is competitive on benchmarks, but “benchmark-competitive” and “production-reliable” aren’t always the same thing — worth testing your specific use case.

If your task is well-defined — structured output, classification, multilingual NLP, code generation with a clear spec — Qwen3-Max at $0.36/M input gets you very far for very little.

Which Model Should You Start With?

For most developers: start with bailian/qwen3.5-flash ($0.10/M) to prototype. Once you’ve validated the task, move to bailian/qwen3-max for quality or bailian/qwen3-coder-next for coding agents. The difference in cost is real enough that it’s worth the quick A/B test.
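Because every model shares the same request format, that A/B test reduces to swapping the model string. A minimal sketch that builds identical requests for each candidate (send each payload through the client from the Quick Start; the helper itself is illustrative):

```python
CANDIDATES = ["bailian/qwen3.5-flash", "bailian/qwen3-max"]

def ab_requests(prompt):
    """Build one identical chat request per candidate model ID."""
    return [
        {"model": m, "messages": [{"role": "user", "content": prompt}]}
        for m in CANDIDATES
    ]

for req in ab_requests("Classify this support ticket: ..."):
    print(req["model"])
    # response = client.chat.completions.create(**req)
```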

For large-scale or high-quality workloads that need 1M context, bailian/qwen3.6-plus or bailian/qwen3.5-plus are the options. For raw throughput across the full Qwen3.5 MoE lineup, bailian/qwen3.5-397b-a17b at $0.55/M input is the heaviest hitter.

If you’re building a coding agent and want to cut costs by 10× compared to Claude Sonnet 4.6, bailian/qwen3-coder-next at $0.20/M is worth a proper eval before you commit to anything else.

For context on how ofox compares to routing through other API gateways, see the AI API aggregation guide and the OpenRouter alternatives comparison.