xiji2646-netizen

DeepSeek V4 dropped today — $0.28/M output on 1M context, running on Huawei Ascend. Are you routing workloads to it?

DeepSeek just released V4 and the pricing is hard to ignore.

V4-Flash: $0.28/M output tokens. V4-Pro: $2.19/M. Both ship with a 1M-token context window by default.

For reference: GPT-4 Turbo is $30/M output. Claude Opus 4.6 is $75/M. That’s not a marginal difference — it’s a structural one. I’ve been digging into the technical report and wanted to share what I found, because I think this release has implications beyond “another cheap Chinese model.”


The infrastructure story is the real headline

V4 is the first Tier-1 LLM to run on **Huawei Ascend chips at 85%+ utilization**. DeepSeek co-optimized inference kernels directly with Huawei’s teams for Ascend 910B/950. They report inference quality matching Nvidia A100 deployments at roughly 40% lower hardware cost.

This matters because the GPU export ban was supposed to slow Chinese AI development. DeepSeek V4 running on Huawei Ascend at 85% utilization while costing 100x less than Western alternatives is a pretty direct answer to how that played out.


Three architecture innovations that make the pricing possible

Engram Architecture — separates static knowledge (CPU RAM, hash-based lookup) from dynamic reasoning (GPU). CPU RAM is 10-20x cheaper per GB than GPU HBM. This is why 1M context doesn’t require proportional GPU memory growth, and why 1M context is the default even on the cheapest tier. The model offloads long-context storage to CPU memory rather than keeping everything in HBM.
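To make the two-tier idea concrete, here is a toy sketch in Python: a plain dict stands in for cheap CPU RAM (hash-keyed lookup), and a small LRU cache stands in for scarce GPU HBM. The class name, capacities, and mechanics are my illustration of the general pattern, not DeepSeek's actual implementation.

```python
from collections import OrderedDict

class EngramStore:
    """Toy two-tier KV store: a big dict plays the role of cheap CPU RAM,
    a small LRU cache plays the role of scarce GPU HBM.
    Illustration only; not DeepSeek's code."""

    def __init__(self, hot_capacity=4):
        self.cold = {}            # "CPU RAM": hash(span) -> cached state
        self.hot = OrderedDict()  # "GPU HBM": small active working set
        self.hot_capacity = hot_capacity

    def put(self, span_tokens, state):
        # Everything lands in the cheap tier; nothing forces HBM growth.
        self.cold[hash(tuple(span_tokens))] = state

    def get(self, span_tokens):
        key = hash(tuple(span_tokens))
        if key in self.hot:                # fast-tier hit
            self.hot.move_to_end(key)
            return self.hot[key]
        state = self.cold[key]             # hash-based fetch from slow tier
        self.hot[key] = state
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)   # evict LRU; it stays in cold
        return state
```

The point of the sketch: total storable context grows with the cheap tier, while the expensive tier stays fixed-size, which is the claimed reason 1M context does not require proportional HBM.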

mHC (Manifold-Constrained Hyper-Connections) — training stability mechanism for the 1.6T parameter MoE via bi-stochastic matrix projection (Sinkhorn-Knopp). Prevents gradient explosion, reduces failed training runs, lowers amortized training cost. This is part of why they can offer these prices — fewer wasted training runs means lower cost basis.
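Sinkhorn-Knopp itself is a standard, well-known procedure: alternately normalize the rows and columns of a nonnegative matrix until it is approximately doubly stochastic (every row and column sums to 1). A minimal pure-Python sketch of that normalization step, independent of DeepSeek's actual mHC code:

```python
def sinkhorn_knopp(m, iters=100):
    """Alternately normalize rows and columns of a square matrix with
    positive entries; the iterates converge to a doubly stochastic matrix."""
    n = len(m)
    a = [row[:] for row in m]  # work on a copy
    for _ in range(iters):
        for i in range(n):                       # row normalization
            s = sum(a[i])
            a[i] = [x / s for x in a[i]]
        for j in range(n):                       # column normalization
            s = sum(a[i][j] for i in range(n))
            for i in range(n):
                a[i][j] /= s
    return a
```

In the mHC setting, projecting connection weights onto the doubly stochastic manifold keeps signal magnitudes bounded across layers, which is the stated mechanism for avoiding gradient explosion.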

DSA (DeepSeek Sparse Attention) — token-dimension compression that takes attention from O(n²) to near-linear scaling, with 60-70% memory bandwidth reduction per attention layer. Combined with the MoE architecture (1.6T total parameters, ~37B active per forward pass), this is what makes the Flash tier viable at $0.28/M output.
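DSA's internals are not public, but the generic idea behind sparse attention is easy to sketch: score every key, keep only the top-k, and softmax over the survivors, so per-query compute and memory traffic scale with k rather than with sequence length. A toy single-query version (my illustration, not DSA itself):

```python
import math

def sparse_attention(q, keys, values, k=2):
    """Single-query top-k sparse attention over lists of vectors.
    Only the k highest-scoring keys participate in the softmax,
    so cost per query depends on k, not on len(keys)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
              for key in keys]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    exps = {i: math.exp(scores[i]) for i in top}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}   # softmax over survivors
    dim = len(values[0])
    return [sum(weights[i] * values[i][c] for i in top) for c in range(dim)]
```

Real implementations select blocks of keys with a cheap scoring pass rather than exact per-token top-k, but the scaling argument is the same.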

You’re not getting a smaller model. You’re getting selective activation of a very large model with near-linear attention scaling and cheap long-context storage.


Pricing table

| Model | Input | Output | Context |
|---|---|---|---|
| V4-Pro | $0.55/M | $2.19/M | 1M tokens |
| V4-Flash | $0.014/M | $0.28/M | 1M tokens |
| GPT-4 Turbo | $10/M | $30/M | 128K tokens |
| Claude Opus 4.6 | $15/M | $75/M | 200K tokens |


What this looks like on real workloads

  • Production chatbot (1M queries/month, assuming ~1,000 input + 500 output tokens per query): $25,000/month on GPT-4 Turbo → $154/month on V4-Flash

  • Agent coding assistant (1.5M output tokens/month): $112.50 on Opus 4.6 → $3.29 on V4-Pro

  • Enterprise doc processing (200K input + 10K output per doc): $2.30 on GPT-4 Turbo → $0.13 on V4-Pro

Even if V4 is meaningfully worse on some tasks, the cost gap is large enough that you can run multiple passes, add verification steps, or accept some quality tradeoff and still come out ahead economically.
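The bullets above can be reproduced with a small cost calculator; the per-workload token counts are assumptions chosen to match the quoted figures, not published usage data:

```python
PRICES = {  # $ per million tokens, from the pricing table above
    "gpt-4-turbo":     {"in": 10.0,  "out": 30.0},
    "claude-opus-4.6": {"in": 15.0,  "out": 75.0},
    "v4-pro":          {"in": 0.55,  "out": 2.19},
    "v4-flash":        {"in": 0.014, "out": 0.28},
}

def monthly_cost(model, input_tokens, output_tokens):
    """Dollar cost for a month's traffic at the listed per-million rates."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1e6

# Chatbot bullet: 1M queries at ~1,000 input + 500 output tokens each
print(round(monthly_cost("gpt-4-turbo", 1e9, 5e8)))   # 25000
print(round(monthly_cost("v4-flash", 1e9, 5e8), 2))   # 154.0

# Doc-processing bullet: one 200K-input / 10K-output document
print(round(monthly_cost("gpt-4-turbo", 2e5, 1e4), 2))  # 2.3
```

Swapping in your own token counts is the fastest way to see whether the gap survives contact with your actual traffic mix.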


The two-stack future

One day before release, Reuters reported DeepSeek refused early API access to U.S. chip manufacturers including Nvidia — a deliberate mirror of the U.S. GPU export ban. The AI supply chain is splitting:

  • Western stack: Nvidia GPUs → CUDA → AWS/Azure/GCP → OpenAI/Anthropic/Google

  • Chinese stack: Huawei Ascend → CANN → Huawei Cloud/Alibaba Cloud → DeepSeek/Alibaba/Baidu

For developers and enterprises, this creates a strategic dimension that goes beyond benchmark comparisons. If you build on V4 and the geopolitical situation escalates, what’s your fallback? If you stay on Western APIs and the 100x cost gap persists, what’s the competitive pressure from teams that don’t?
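One pragmatic answer to the fallback question is to keep routing logic provider-agnostic from day one. A minimal sketch of try-in-order failover (the provider names are placeholders; since both stacks expose OpenAI-compatible endpoints, each callable can wrap a client pointed at a different `base_url`):

```python
def call_with_fallback(prompt, providers):
    """Try each provider callable in order; return (name, result) from the
    first one that succeeds. Each callable takes a prompt string and
    returns text, or raises on failure."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:      # provider down, rate-limited, etc.
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

In practice you would register something like a V4 client as the cheap primary and a Western API client as the expensive fallback, and track which path actually served each request so the strategic dependency stays measurable.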


API compatibility and quick start

The endpoint is OpenAI-compatible, so migration from existing OpenAI SDK integrations is minimal:


```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # V4-Pro
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=1024,
)

print(response.choices[0].message.content)
```

For V4-Flash, use `model="deepseek-chat-flash"`. Both are live now.


The post-scaling paradigm shift

What’s interesting about V4 architecturally is that it represents a different thesis than “train bigger on more compute.” The Engram + DSA + mHC combination is about extracting more capability per dollar of inference cost, not just per dollar of training compute. If this approach generalizes, it suggests the next few years of model competition will be as much about inference efficiency as raw benchmark scores.

The open-source weights are on Hugging Face. The API is live at api.deepseek.com. Both tiers available now.



Questions

1. For teams currently spending significant budget on GPT-4 Turbo or Claude Opus for non-reasoning workloads — are you evaluating V4, or do compliance/data residency concerns make it a non-starter regardless of price? Curious what the actual blockers look like in practice.

2. Has anyone tested V4-Pro on agent/coding tasks specifically? The claim that it benchmarks near Opus 4.6 on non-reasoning tasks is interesting if it holds up in practice. Would love to hear real results rather than benchmark numbers.

3. For those thinking about the two-stack future — are you building with explicit fallback strategies in mind, or treating this as a “wait and see” situation? At what point does the cost gap become large enough that you’d accept the strategic dependency?

Also curious: for anyone self-hosting, what hardware are you running it on and what utilization are you seeing? And has anyone tested the full 1M context window in production — curious about latency at that scale.


mercyf


Nice, rock-bottom prices :slight_smile:
Is there a catch?
