Qwen3 API Guide: Access Qwen3-Max, Qwen3-Coder, and Qwen3.5 Models (2026)
TL;DR: Qwen3-Max costs $0.36/M input and $1.43/M output via ofox, roughly 8× cheaper than Claude Sonnet 4.6 on input tokens. The Coder variant targets agentic multi-step tasks and costs only $0.20/M input. The fastest way to try either without an Alibaba Cloud account is a single ofox API key.
What Is the Qwen3 Family?
Alibaba’s Qwen3 generation is the largest open-weight model family they’ve shipped. The flagship open-weight model — Qwen3-235B-A22B — is a Mixture-of-Experts architecture with 235B total parameters but only 22B activated per forward pass. That MoE design is what keeps inference costs low despite the model’s scale.
The family spans eight open-weight sizes (0.6B through 235B), two MoE variants, and a set of API-only commercial models. The ones relevant to developers calling an API are the commercial models available via cloud providers — specifically the ones ofox carries:
| Model | ofox Model ID | Context | Input | Output |
|---|---|---|---|---|
| Qwen3-Max | bailian/qwen3-max | 256K | $0.36/M | $1.43/M |
| Qwen3 Coder Next | bailian/qwen3-coder-next | 256K | $0.20/M | $1.50/M |
| Qwen3.5 Flash | bailian/qwen3.5-flash | 1M | $0.10/M | $0.40/M |
| Qwen3.5 27B | bailian/qwen3.5-27b | 256K | $0.29/M | $2.05/M |
| Qwen3.5 35B A3B | bailian/qwen3.5-35b-a3b | 256K | $0.29/M | $1.83/M |
| Qwen3.5 122B A10B | bailian/qwen3.5-122b-a10b | 256K | $0.29/M | $2.29/M |
| Qwen3.5 397B A17B | bailian/qwen3.5-397b-a17b | 256K | $0.55/M | $3.50/M |
| Qwen3.5 Plus | bailian/qwen3.5-plus | 1M | $0.40/M | $2.40/M |
| Qwen3.6 Plus | bailian/qwen3.6-plus | 1M | $0.50/M | $3.00/M |
Prices are per million tokens as listed on ofox.ai at the time of writing. All models are OpenAI-compatible — same endpoint, same request format.
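As a sanity check on the table above, a small helper can turn these rates into per-request costs. This is a sketch: the prices are copied from the table and will drift as ofox updates its pricing, so verify against ofox.ai before relying on them.

```python
# Per-million-token prices (USD), copied from the table above.
# These change over time; check ofox.ai for current rates.
PRICES = {
    "bailian/qwen3-max": (0.36, 1.43),
    "bailian/qwen3-coder-next": (0.20, 1.50),
    "bailian/qwen3.5-flash": (0.10, 0.40),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-million rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: 100K tokens in, 2K tokens out on Qwen3-Max
print(f"${request_cost('bailian/qwen3-max', 100_000, 2_000):.4f}")  # → $0.0389
```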
Thinking Mode: What It Is and How to Use It
Every Qwen3 model ships with two operating modes in the same checkpoint:
- Thinking mode — slow, explicit chain-of-thought. The model reasons through steps before answering. Thinking tokens are returned in a separate block before the final answer.
- Non-thinking mode — fast response, no explicit reasoning trace.
You switch between them at the API level:
```python
# Enable thinking mode
response = client.chat.completions.create(
    model="bailian/qwen3-max",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational"}],
    extra_body={"enable_thinking": True},
)
```
Or use soft switches inside the system or user prompt: /think to activate reasoning, /no_think to suppress it. This is useful when you want different behavior on different turns in a multi-turn conversation without changing API parameters.
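One way to apply the soft switches per turn is to append the directive to each user message. A minimal sketch, assuming the documented /think and /no_think tokens (the helper name is ours, not part of any API):

```python
def with_mode(content: str, thinking: bool) -> str:
    """Append the Qwen3 soft switch controlling this turn's reasoning mode."""
    return f"{content} {'/think' if thinking else '/no_think'}"

messages = [
    # Turn 1: quick factual lookup, no reasoning trace needed
    {"role": "user", "content": with_mode("What year was Unicode 1.0 released?", thinking=False)},
    # Turn 2: harder question, let the model reason step by step
    {"role": "user", "content": with_mode("Now prove that sqrt(2) is irrational.", thinking=True)},
]
```

The messages list then goes into `client.chat.completions.create` unchanged, with no `extra_body` parameter required.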
For most API use cases — especially coding tasks — non-thinking mode is faster and cheaper. Use thinking mode when the task genuinely benefits from extended reasoning: math proofs, complex planning, multi-step logic.
Qwen3-Max: General Purpose Flagship
bailian/qwen3-max is the commercial flagship for general tasks — reasoning, math, multilingual work, and instruction following. It was trained on roughly 36 trillion tokens (nearly double Qwen2.5’s training data), with a final stage extending context to 32K for the base models, scaled further for the commercial API variant.
The 256K context window is practical for most document processing and long-context retrieval tasks. At $0.36/M input and $1.43/M output, it’s meaningfully cheaper than Claude Sonnet 4.6 ($3/M input, $15/M output) for high-throughput pipelines where you’re sending large contexts repeatedly.
Where Qwen3-Max tends to do well: multilingual tasks across the 119 languages it was trained on, structured output generation, math reasoning, and agent-style reasoning when thinking mode is enabled.
Qwen3 Coder Next: Built for Agentic Coding
bailian/qwen3-coder-next is a dedicated coding model. The public Qwen3-Coder series was trained on 7.5 trillion tokens with roughly 70% code content, then fine-tuned specifically for execution-driven tasks — running code, using tools, managing multi-turn agent loops.
Performance claims from Alibaba put it near Claude Sonnet 4 on agentic coding benchmarks and at state-of-the-art for open models on SWE-Bench Verified without test-time scaling. The 256K native context (extendable to 1M via YaRN) means you can feed it large codebases without chunking.
At $0.20/M input and $1.50/M output, it’s the cheapest option in the Qwen3 lineup per token for input-heavy coding workloads. For a typical SWE-agent loop with 100K tokens of context per turn, that’s $0.02 of input per request, roughly 15× cheaper than Claude Sonnet 4.6’s $0.30 at the same context length.
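The arithmetic behind that input-cost comparison, for a single 100K-token turn at the rates listed above:

```python
CONTEXT_TOKENS = 100_000

coder_cost = CONTEXT_TOKENS * 0.20 / 1_000_000   # Qwen3 Coder Next input rate
sonnet_cost = CONTEXT_TOKENS * 3.00 / 1_000_000  # Claude Sonnet 4.6 input rate

print(f"${coder_cost:.2f} vs ${sonnet_cost:.2f}, {sonnet_cost / coder_cost:.0f}x")
# → $0.02 vs $0.30, 15x
```

Output tokens narrow the gap slightly ($1.50/M vs $15.00/M is 10×), but agent loops are dominated by input tokens, so the input rate is the one that matters.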
The model integrates with standard coding tool ecosystems: Cursor, Cline, Claude Code-style CLIs, and any tool that accepts an OpenAI-compatible endpoint. See the AI tools configuration guide for setup details across different editors.
Qwen3.5 Flash: 1M Context at $0.10/M
bailian/qwen3.5-flash is the cheapest Qwen model on ofox and the one to reach for when context length and cost are the primary constraints. At $0.10/M input and $0.40/M output with a 1M token context window, it’s in the same tier as Gemini Flash variants for document-heavy pipelines.
Use it for: summarization of long documents, RAG reranking, classification tasks over large datasets, and any workload where you’re primarily paying for input tokens and need to process a lot of text cheaply.
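A sketch of the long-document pattern Flash suits: send the whole document in one request instead of chunking, after a rough size check against the 1M window. The 4-characters-per-token heuristic is an approximation, not Qwen's actual tokenizer, and the system prompt is just an example.

```python
def rough_token_count(text: str) -> int:
    """Crude estimate (~4 chars/token); use a real tokenizer for exact counts."""
    return len(text) // 4

def summarize(client, document: str, model: str = "bailian/qwen3.5-flash") -> str:
    """Summarize a long document in a single request, no chunking."""
    if rough_token_count(document) > 1_000_000:
        raise ValueError("document exceeds the 1M-token context window")
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the document in five bullet points."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content
```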
Quick Start: Call Qwen3 via ofox
All Qwen3 models on ofox use the OpenAI-compatible endpoint. Drop this into any existing OpenAI SDK project:
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-ofox-key",
    base_url="https://api.ofox.ai/v1",
)

response = client.chat.completions.create(
    model="bailian/qwen3-max",
    messages=[{"role": "user", "content": "Your prompt here"}],
)

print(response.choices[0].message.content)
```
To switch to Qwen3-Coder for a coding task, change the model string to bailian/qwen3-coder-next. Everything else stays the same.
For thinking mode, add extra_body={"enable_thinking": True} to the request. The thinking content arrives in the response as a separate field before the final message.
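Extracting that trace might look like the sketch below. The field name is an assumption: DashScope's OpenAI-compatible API exposes it as `reasoning_content`, but check the response object your gateway actually returns.

```python
def split_reply(message):
    """Return (reasoning, answer) from a chat completion message.

    Reasoning is None in non-thinking mode. `reasoning_content` is the field
    name DashScope's OpenAI-compatible API uses; confirm it against the
    responses ofox returns before depending on it.
    """
    return getattr(message, "reasoning_content", None), message.content
```

Usage: `reasoning, answer = split_reply(response.choices[0].message)`. Logging the reasoning separately keeps the trace out of anything you feed back into the conversation history.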
Get your API key at ofox.ai. The same key works for Claude, GPT, Gemini, and all other models on the platform — no separate Alibaba Cloud or DashScope account needed.
Qwen3 vs Claude Sonnet 4.6: When to Pick Which
This isn’t a rigorous head-to-head benchmark — results vary by task. But here’s the practical cost comparison that matters for choosing:
| | Qwen3-Max | Qwen3 Coder Next | Claude Sonnet 4.6 |
|---|---|---|---|
| Input ($/1M tokens) | $0.36 | $0.20 | $3.00 |
| Output ($/1M tokens) | $1.43 | $1.50 | $15.00 |
| Context | 256K | 256K | 200K |
| Thinking mode | Yes | Yes | Yes (extended thinking) |
| Coding focus | General | Specialized | General |
For input-heavy workloads — long-context retrieval, document analysis, RAG pipelines — Qwen3-Max is roughly 8× cheaper on input tokens than Claude Sonnet 4.6. The output cost gap is even wider.
For complex agentic coding tasks where you want the most reliable multi-step execution, Claude Sonnet 4.6 still has an edge in real-world integrations and broader tool ecosystem support. Qwen3 Coder Next is competitive on benchmarks, but “benchmark-competitive” and “production-reliable” aren’t always the same thing — worth testing your specific use case.
If your task is well-defined — structured output, classification, multilingual NLP, code generation with a clear spec — Qwen3-Max at $0.36/M input gets you very far for very little.
Which Model Should You Start With?
For most developers: start with bailian/qwen3.5-flash ($0.10/M) to prototype. Once you’ve validated the task, move to bailian/qwen3-max for quality or bailian/qwen3-coder-next for coding agents. The difference in cost is real enough that it’s worth the quick A/B test.
For large-scale or high-quality workloads that need 1M context, bailian/qwen3.6-plus or bailian/qwen3.5-plus are the options. For the most capable model in the full Qwen3.5 MoE lineup, bailian/qwen3.5-397b-a17b at $0.55/M input is the heaviest hitter.
If you’re building a coding agent and want to cut costs by an order of magnitude compared to Claude Sonnet 4.6, bailian/qwen3-coder-next at $0.20/M is worth a proper eval before you commit to anything else.
For context on how ofox compares to routing through other API gateways, see the AI API aggregation guide and the OpenRouter alternatives comparison.