April 23, 2026
Million tokens, million takes
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
Million-token DeepSeek V4 drops; fans split: 'Bargain!' vs 'Late!'
TLDR: DeepSeek V4 launches with a “million-token” memory and two models, pitching open-source power at striking prices. Commenters duke it out over value versus timing: some hail Flash at $0.28 per million tokens as a steal; others say it’s still late, with side debates on local use and product-lineup shifts.
DeepSeek just dropped a million-token memory monster, and the comments instantly split into camps: the “value hunters” yelling deal of the year and the skeptics sighing it’s “two months behind the leaders.” The headline features: two models—Pro (huge, but a mixture-of-experts design where only a fraction of the parameters wakes per request) and Flash (lighter)—that swallow massive inputs like whole codebases. It’s released under the MIT License, offers three “thinking” modes, and swaps the usual chat template for a quirky encoding script.
Prices poured gasoline on the thread: one tally puts Pro at $3.48 per million tokens and Flash at $0.28 (pricing). That had folks crowning Flash “quite competent” and calling the whole thing an “insane value deal.” The pushback: a cool take that DeepSeek is still “just about 2 months behind,” triggering a “better value” vs “bleeding edge” squabble, complete with scoreboard comparisons.
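The scoreboard argument is easier to follow with the arithmetic written out. A quick sketch, treating the quoted $3.48 and $0.28 figures as flat per-token rates (real billing may split input/output tokens or discount cache hits):

```python
# Rough cost of filling the full 1M-token context at the quoted rates.
# Assumes a flat per-token price; actual pricing tiers may differ.

PRO_RATE = 3.48    # USD per million tokens (quoted for V4-Pro)
FLASH_RATE = 0.28  # USD per million tokens (quoted for V4-Flash)

def context_cost(tokens: int, rate_per_million: float) -> float:
    """Cost in USD for `tokens` tokens at a flat per-million rate."""
    return tokens / 1_000_000 * rate_per_million

full_context = 1_000_000
print(f"Pro,   1M tokens: ${context_cost(full_context, PRO_RATE):.2f}")
print(f"Flash, 1M tokens: ${context_cost(full_context, FLASH_RATE):.2f}")
print(f"Flash is {PRO_RATE / FLASH_RATE:.1f}x cheaper per token")
```

So a single maxed-out context costs about $3.48 on Pro versus $0.28 on Flash, which is the gap fueling the “insane value deal” takes.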
Side dramas: Is the old “R” line quietly folded into V4? And can the 1.6‑trillion Pro run at home because only 49B parameters are active at once? Hopeful tinkerers float “maybe (very slowly),” while pragmatists clutch their GPUs. Meme of the day: “It remembers your diary and your Slack—and still asks for dessert.”
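The “can it run at home” question is really just arithmetic: a mixture-of-experts model computes with only the activated parameters per token, but all of the weights still have to live somewhere (VRAM, RAM, or streamed from disk). A back-of-the-envelope sketch, assuming roughly 0.5 bytes/parameter for the FP4 expert weights the release mentions, and ignoring KV cache and activation overhead:

```python
# Rough weight-storage estimate for DeepSeek-V4-Pro (1.6T total params,
# 49B activated per token). Assumes ~0.5 bytes/param (FP4) throughout;
# the real layout mixes FP4 experts with FP8 elsewhere, so this is a
# lower bound, not a spec.

BYTES_PER_PARAM = 0.5  # FP4

def weight_gb(params: float, bytes_per_param: float = BYTES_PER_PARAM) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

total = weight_gb(1.6e12)   # every expert must be resident or paged in
active = weight_gb(49e9)    # per-token compute only touches these

print(f"Total weights:  ~{total:.0f} GB")
print(f"Active per tok: ~{active:.1f} GB")
```

The ~24.5 GB of active weights fits on a single big consumer GPU, but the ~800 GB total does not, which is why the honest answer is “maybe (very slowly)”: you can page experts in from system RAM or SSD, at a steep throughput cost.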
Key Points
- DeepSeek-AI released a preview of DeepSeek-V4-Pro (1.6T parameters, 49B activated) and DeepSeek-V4-Flash (284B parameters, 13B activated), both supporting a one-million-token context.
- A hybrid attention architecture combining CSA and HCA reduces V4-Pro’s 1M-context single-token inference FLOPs to 27% and its KV cache to 10% of DeepSeek-V3.2’s.
- Training includes pre-training on 32T+ tokens and a two-stage post-training pipeline: SFT and RL with GRPO for experts, followed by on-policy distillation for consolidation.
- Precision uses FP4 for MoE expert parameters and FP8 for most others; three reasoning-effort modes are supported, including Pro-Max and Flash-Max configurations.
- The release provides encoding/inference tooling (OpenAI-compatible Python scripts), local deployment guidance (temperature=1.0, top_p=1.0; Think Max: ≥384K context), and an MIT License.
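Since the tooling is described as OpenAI-compatible, a local deployment would presumably accept standard chat-completion requests. A minimal sketch of such a request body using the recommended sampling settings; note the model name `deepseek-v4-flash` is an illustrative assumption, not a confirmed identifier:

```python
import json

# Hypothetical request body for an OpenAI-compatible local server.
# The model name is an assumption; temperature/top_p follow the
# release's local deployment guidance (1.0 / 1.0).
request_body = {
    "model": "deepseek-v4-flash",  # assumed name, not confirmed
    "temperature": 1.0,            # recommended setting
    "top_p": 1.0,                  # recommended setting
    "messages": [
        {"role": "user", "content": "Summarize this repository."},
    ],
}

# Any OpenAI-compatible server should accept this shape at its
# /v1/chat/completions route.
payload = json.dumps(request_body)
print(payload[:60] + "...")
```

The point of OpenAI compatibility is exactly this: existing clients and scripts keep working with only the model name and endpoint swapped.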