DeepSeek V4 Released: Open-Source 1.6T MoE, 1M Context, Apache 2.0 — and It's Already on the API

TL;DR — DeepSeek picked the same day as OpenAI’s GPT-5.5 launch to ship the V4 preview: a 1.6T-parameter Pro, a 284B Flash, 1M context on both, Apache 2.0 weights on Hugging Face, and API pricing of $1.74 / $3.48 per million input/output tokens for Pro — less than Opus 4.7, less than GPT-5.5, less than Kimi K2.6. ofox will support it at first opportunity.

What DeepSeek shipped

From the official announcement on April 24 2026:

  • Two variants: deepseek-v4-pro (1.6T total parameters, 49B activated) and deepseek-v4-flash (284B total, 13B activated). Both are MoE.
  • 1M-token context on both, max output 384K.
  • Dual modes: Thinking / Non-Thinking, with two thinking effort levels (high, max) on top of non-think. See the thinking mode docs.
  • Open source, Apache 2.0 — weights on Hugging Face.
  • API live today. Same base_url, change model ID. Both OpenAI ChatCompletions and Anthropic protocols supported.
  • Deprecation: deepseek-chat and deepseek-reasoner retire July 24 2026. They currently route to deepseek-v4-flash.
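The “same base_url, change model ID” claim means migrating off the deprecated IDs is a one-string change. A minimal sketch using only the standard library, assuming the documented ChatCompletions endpoint path (/chat/completions); verify against DeepSeek’s current docs before relying on it:

```python
import json
import urllib.request

BASE_URL = "https://api.deepseek.com"  # unchanged from V3.x

def chat_payload(model: str, prompt: str) -> dict:
    # deepseek-chat retires July 24 2026 and already routes to V4-Flash;
    # migrating explicitly is just the model ID string.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str, api_key: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping `"deepseek-v4-flash"` for `"deepseek-v4-pro"` in the call is the entire upgrade path.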

The timing is not accidental. OpenAI shipped GPT-5.5 the same day. DeepSeek needed a launch window where “open-source 1M-context MoE at a fraction of the cost” would not be buried under a closed-source price hike. Ship on the same day and you split the news cycle.

Architecture — the part that actually matters

V4 introduces a hybrid attention mechanism: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA). Combined with Manifold-Constrained Hyper-Connections (mHC) for residual signal propagation and the Muon optimizer for training stability, the net effect at 1M context is:

  • 27% of V3.2’s single-token inference FLOPs
  • 10% of V3.2’s KV cache

That is the headline efficiency story. Long-context inference was the main cost barrier for open models serving 1M windows; V4 cuts it by an order of magnitude on KV cache. The model was pre-trained on 32T+ tokens using FP4 + FP8 mixed precision — MoE experts at FP4, most other parameters at FP8.
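To make those ratios concrete, a back-of-the-envelope sizing sketch. The 70 KB-per-token baseline is an illustrative assumption, not a published DeepSeek number; only the 10% multiplier comes from the announcement:

```python
def kv_cache_gib(tokens: int, bytes_per_token: int) -> float:
    """KV-cache footprint in GiB for a given context length."""
    return tokens * bytes_per_token / 2**30

# Hypothetical dense baseline caching 70 KB of K/V per token (assumed).
baseline = kv_cache_gib(1_000_000, 70_000)   # ~65.2 GiB at 1M context

# V4 claims 10% of V3.2's KV cache at the same context length.
v4 = 0.10 * baseline                          # ~6.5 GiB

print(f"baseline: {baseline:.1f} GiB, V4: {v4:.1f} GiB")
```

An order-of-magnitude KV-cache cut is what moves a 1M-token window from multi-GPU territory into a single accelerator’s spare HBM.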

The Flash variant is not a trimmed Pro — it is a separately trained MoE at 284B / 13B activated. Flash-Max (max thinking effort) approaches Pro-level reasoning on most benchmarks with a much lower serving cost.

The Arena Code numbers

Arena AI’s live code leaderboard put V4-Pro Thinking straight at #3 among open models, ahead of the rest of DeepSeek’s prior releases by a large margin:

Code Arena leaderboard with DeepSeek V4 Pro Thinking at #3 among open models

Source: Arena AI Code Leaderboard, April 24 2026

| Rank | Model | Elo |
|------|-------|-----|
| 1 | GLM-5.1 | 1,534 |
| 2 | Kimi-K2.6 | 1,529 |
| 3 | DeepSeek-V4 Pro (Thinking) | 1,456 |
| 4 | GLM-4.7 | 1,440 |
| 12 | DeepSeek-V3.2 (Thinking) | 1,368 |

The V3.2 → V4-Pro jump is 88 Elo — roughly the same as the delta between #3 and #13 on the current board. It is a genuine generational step, not a refresh.
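An 88-Elo gap has a direct probabilistic reading via the standard Elo expected-score formula (nothing DeepSeek-specific, just the usual logistic curve):

```python
def expected_score(delta_elo: float) -> float:
    """Probability the higher-rated model wins a pairwise preference vote."""
    return 1.0 / (1.0 + 10 ** (-delta_elo / 400.0))

# Arena Code: V4-Pro (1456) vs V3.2 (1368) is an 88-Elo gap.
print(f"{expected_score(88):.1%}")  # ~62% of head-to-head votes go to V4-Pro
```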

Full benchmark grid — vs K2.6, GLM-5.1, Opus 4.6, GPT-5.4, Gemini 3.1 Pro

DeepSeek published the complete head-to-head against the top open and closed models:

DeepSeek V4-Pro Max and V4-Flash Max benchmark comparison vs K2.6 Thinking, GLM-5.1 Thinking, Opus 4.6 Max, GPT-5.4 xHigh, Gemini 3.1 Pro High

Source: DeepSeek V4 technical report, April 24 2026

The honest reading, benchmark by benchmark:

Where V4-Pro wins outright:

| Benchmark | V4-Pro Max | K2.6 Thinking | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| Chinese-SimpleQA | 84.4 | 75.9 | 76.2 | 76.8 | 85.9 |
| LiveCodeBench | 93.5 | 89.6 | 88.8 | 91.7 | — |
| Codeforces (rating) | 3206 | 3052 | — | 3168 | — |
| HMMT 2026 Feb | 95.2 | 92.7 | 96.2 | 97.7 | 94.7 |
| IMOAnswerBench | 89.8 | 86.0 | 75.3 | 91.4 | 81.0 |
| MCPAtlas Public | 73.6 | 66.6 | 73.8 | 67.2 | 69.2 |

Codeforces 3206 is the line that matters. That is better than GPT-5.4 (xHigh) at 3168 — competitive-programming territory where closed frontier models have historically held the lead.

Where V4-Pro loses to K2.6:

| Benchmark | V4-Pro | K2.6 Thinking |
|---|---|---|
| SWE Pro (resolved) | 55.4 | 58.6 |
| SWE Multilingual | 76.2 | 76.7 |
| HLE w/ tools | 48.2 | 54.0 |
| GPQA Diamond | 90.1 | 90.5 |

SWE-Bench Pro is the benchmark that matters most for “fix a real GitHub issue.” K2.6’s 58.6 vs V4-Pro’s 55.4 is a 3.2-point gap — small, but consistent with the Arena Code leaderboard, where K2.6 sits 73 Elo ahead.

Where V4-Pro trails the closed frontier:

  • MRCR 1M (long-context retrieval): 83.5 vs Opus 4.6’s 92.9. Opus is still the long-context leader.
  • CorpusQA 1M: 62.0 vs Opus 71.7. Same story.
  • GDPval-AA (Elo): 1554 vs GPT-5.4’s 1674 and Opus 4.6’s 1619. Knowledge-work economic value still favors the closed models.
  • HLE (no tools): 37.7 vs Gemini 3.1 Pro’s 44.4.

Flash-Max holds up:

V4-Flash-Max hits 86.2 on MMLU-Pro (Pro at 87.5), 91.6 on LiveCodeBench (Pro at 93.5), and 52.6 on SWE-Pro (Pro at 55.4). For most tasks the quality gap between Flash and Pro is narrow — and Flash is dramatically cheaper.

Pricing — where V4 really changes the calculus

From the DeepSeek pricing docs:

| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| deepseek-v4-flash | $0.14 / M | $0.028 / M | $0.28 / M |
| deepseek-v4-pro | $1.74 / M | $0.145 / M | $3.48 / M |
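Because cache hits are priced at under a tenth of misses, the effective input price for agent loops that resend a long prefix is far below the headline number. A sketch: prices are from the table above; the 80% hit rate is a hypothetical workload, not a DeepSeek figure.

```python
def blended_input_price(miss: float, hit: float, hit_rate: float) -> float:
    """Effective $/M input tokens at a given prefix-cache hit rate."""
    return hit_rate * hit + (1.0 - hit_rate) * miss

# deepseek-v4-pro at an assumed 80% cache-hit rate:
print(f"${blended_input_price(1.74, 0.145, 0.80):.3f} / M")  # $0.464 / M
```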

Compare to the frontier:

| Model | Input | Output |
|---|---|---|
| DeepSeek V4-Pro | $1.74 | $3.48 |
| Kimi K2.6 (non-think) | $1.40 | $5.60 |
| GPT-5.5 | $5.00 | $30.00 |
| Claude Opus 4.7 | $15.00 | $75.00 |

V4-Pro output is $3.48 vs GPT-5.5’s $30. That is 8.6× cheaper. Against Opus 4.7 it is 21× cheaper. Flash at $0.28 output is essentially free.
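The multiples follow directly from the output prices in the table; a sketch that also prices a hypothetical monthly workload (prices from the table, token volumes invented for illustration):

```python
PRICES = {  # $ per million tokens (input, output), from the comparison table
    "deepseek-v4-pro": (1.74, 3.48),
    "gpt-5.5": (5.00, 30.00),
    "claude-opus-4.7": (15.00, 75.00),
}

def workload_cost(model: str, m_in: float, m_out: float) -> float:
    """Cost in $ for m_in / m_out millions of input / output tokens."""
    inp, out = PRICES[model]
    return m_in * inp + m_out * out

print(round(30.00 / 3.48, 1))  # output-price multiple vs GPT-5.5: 8.6
print(round(75.00 / 3.48, 1))  # output-price multiple vs Opus 4.7: 21.6

# Hypothetical workload: 500M input + 100M output tokens per month.
for m in PRICES:
    print(m, f"${workload_cost(m, 500, 100):,.0f}")
```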

This is the single biggest story of the release. You can now run a 1M-context, Codeforces-3200-tier reasoning model in production for the same budget that used to cover a mid-tier chat endpoint.

Community takes

First-day reactions from the open-source and research community:

  • “Apache 2.0 matters.” V3 was MIT; V4 moves to Apache 2.0, giving enterprises clearer patent protection. For commercial deployments this is the material change.
  • “Chinese SimpleQA is a wake-up call.” 84.4 on Chinese-SimpleQA beats every closed model except Gemini 3.1 Pro. For Chinese-first products this is the first open-weight option that is genuinely at parity with the best.
  • “SWE-Pro is closer than the Arena board suggests.” K2.6 leads by 3 points on SWE-Pro, but V4-Pro leads on LiveCodeBench and Codeforces. Short-form code generation vs long-horizon codebase resolution — they are different skills, and the picks split cleanly.
  • “The 1M context is real, but not Opus-level.” MRCR and CorpusQA show Opus 4.6 still holds the long-context crown. V4’s win is efficiency (10% KV cache), not absolute retrieval quality.

Access via ofox (coming soon)

ofox serves deepseek/deepseek-v3.2 today. V4-Pro and V4-Flash are being added at first opportunity — expect them on the model list shortly.

For now, if you want V4 immediately, you can hit DeepSeek’s own API directly:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Port this Rust service to Go, preserving concurrency semantics"}],
    extra_body={"thinking": {"type": "enabled"}}  # Thinking mode; omit for non-think
)
print(response.choices[0].message.content)
```

Once ofox rolls V4 into the aggregator, the switch is one line — same ofox key, same https://api.ofox.ai/v1 base URL, just deepseek/deepseek-v4-pro or deepseek/deepseek-v4-flash. Sign up at ofox.ai and one key will cover V4 the moment it lands alongside GPT-5.5, Claude, Gemini, Kimi K2.6, and the rest.
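Concretely, the switch described above touches exactly two strings: the base URL and the model ID. A sketch; the deepseek/ namespace is assumed from the existing deepseek/deepseek-v3.2 listing:

```python
# Direct DeepSeek vs the ofox aggregator: same OpenAI-compatible client,
# different base_url and model namespace.
DEEPSEEK_DIRECT = {
    "base_url": "https://api.deepseek.com",
    "model": "deepseek-v4-pro",
}
OFOX = {
    "base_url": "https://api.ofox.ai/v1",
    "model": "deepseek/deepseek-v4-pro",  # assumed namespacing, per deepseek/deepseek-v3.2
}
```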

Should you switch?

Switch to V4-Pro if you are running Kimi K2.6 for Chinese-heavy workloads, competitive-programming-style code generation, or Codeforces-grade reasoning. The Chinese SimpleQA and Codeforces numbers are the reason.

Switch to V4-Flash if you are running anything in the $1-2 per million output token range. Flash-Max’s reasoning is within 1-3 points of Pro on most knowledge benchmarks, and 12× cheaper than V4-Pro on output.

Stay on K2.6 if your workload is SWE-Bench-style codebase resolution, agent tool calls under high concurrency, or anything where the Arena Code delta (K2.6 +73 Elo) maps to your task.

Stay on closed frontier (GPT-5.5 / Opus 4.7) if your tasks are long-context retrieval over millions of tokens (Opus MRCR still wins), GDPval-grade knowledge work (GPT-5.4 still wins), or agentic terminal workflows (GPT-5.5 Terminal-Bench 82.7% is in its own tier).