xiji2646-netizen

DeepSeek V4 dropped today — $0.28/M output on 1M context, running on Huawei Ascend. Are you routing workloads to it?

DeepSeek just released V4 and the pricing is hard to ignore.

V4-Flash: $0.28/M output tokens. V4-Pro: $2.19/M. Both with 1M token context as default.

For reference: GPT-4 Turbo is $30/M output. Claude Opus 4.6 is $75/M. That’s not a marginal difference — it’s a structural one. I’ve been digging into the technical report and wanted to share what I found, because I think this release has implications beyond “another cheap Chinese model.”


The infrastructure story is the real headline

V4 is the first Tier-1 LLM to run on **Huawei Ascend chips at 85%+ utilization**. DeepSeek co-optimized inference kernels directly with Huawei’s teams for Ascend 910B/950. They report inference quality matching Nvidia A100 deployments at roughly 40% lower hardware cost.

This matters because the GPU export ban was supposed to slow Chinese AI development. DeepSeek V4 running on Huawei Ascend at 85% utilization while costing 100x less than Western alternatives is a pretty direct answer to how that played out.


Three architecture innovations that make the pricing possible

Engram Architecture — separates static knowledge (CPU RAM, hash-based lookup) from dynamic reasoning (GPU). CPU RAM is 10-20x cheaper per GB than GPU HBM. This is why 1M context doesn’t require proportional GPU memory growth, and why 1M context is the default even on the cheapest tier. The model offloads long-context storage to CPU memory rather than keeping everything in HBM.
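The report's implementation details aren't public, but the core idea (static knowledge hashed into cheap host RAM and fetched on demand, rather than held resident in HBM) can be sketched in a few lines. Everything here, including the `EngramStore` name, is illustrative, not DeepSeek's actual code:

```python
import hashlib

class EngramStore:
    """Toy hash-addressed store for static context chunks, kept in CPU RAM."""

    def __init__(self):
        self._store = {}  # hash -> chunk; a plain dict stands in for host RAM

    def _key(self, chunk: str) -> str:
        return hashlib.sha256(chunk.encode()).hexdigest()

    def put(self, chunk: str) -> str:
        key = self._key(chunk)
        self._store[key] = chunk
        return key

    def get(self, key: str) -> str:
        # O(1) lookup: only the chunk currently being reasoned over needs
        # to move into accelerator memory, not the whole context.
        return self._store[key]

store = EngramStore()
handle = store.put("Section 4.2: retry with exponential backoff ...")
# Later, fetch just the needed chunk into "GPU" working memory:
chunk = store.get(handle)
```

The point of the sketch is the asymmetry: the store grows in cheap CPU RAM while the working set on the accelerator stays bounded, which is consistent with 1M-token context not requiring proportional HBM.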

mHC (Manifold-Constrained Hyper-Connections) — training stability mechanism for the 1.6T parameter MoE via bi-stochastic matrix projection (Sinkhorn-Knopp). Prevents gradient explosion, reduces failed training runs, lowers amortized training cost. This is part of why they can offer these prices — fewer wasted training runs means lower cost basis.
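Sinkhorn-Knopp itself is a classic algorithm: alternately normalize rows and columns of a positive matrix until it is approximately doubly stochastic (all row and column sums equal 1). How mHC applies the projection isn't public; this pure-Python sketch shows only that core step:

```python
def sinkhorn_knopp(mat, iters=100):
    """Project a positive matrix toward the doubly stochastic manifold
    by alternately normalizing rows and columns."""
    m = [row[:] for row in mat]
    for _ in range(iters):
        # Normalize each row to sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Normalize each column to sum to 1.
        cols = [sum(m[i][j] for i in range(len(m))) for j in range(len(m[0]))]
        m = [[m[i][j] / cols[j] for j in range(len(m[0]))] for i in range(len(m))]
    return m

m = sinkhorn_knopp([[3.0, 1.0], [1.0, 2.0]])
row_sums = [sum(row) for row in m]
col_sums = [sum(m[i][j] for i in range(2)) for j in range(2)]
# Both row and column sums converge toward 1.0.
```

Constraining a mixing matrix this way keeps its "mass" bounded in every direction, which is the intuition behind using it to prevent gradient explosion.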

DSA (DeepSeek Sparse Attention) — token-dimension compression that takes attention from O(n²) to near-linear scaling, with 60-70% memory bandwidth reduction per attention layer. Combined with the MoE architecture (1.6T total parameters, ~37B active per forward pass), this is what makes the Flash tier viable at $0.28/M output.
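DSA's actual compression scheme isn't published, but the scaling argument behind any top-k style sparse attention can be shown in toy form: each query attends to its k highest-scoring keys instead of all n, so compute and bandwidth scale with n·k rather than n²:

```python
import math

def sparse_attention(q, keys, values, k=2):
    """Toy top-k sparse attention over plain Python lists.
    Illustrates the n*k vs n^2 scaling only; not DSA's real scheme."""
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    # Keep only the k highest-scoring keys.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over just those k scores.
    exp = [math.exp(scores[i]) for i in top]
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]

out = sparse_attention(
    q=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    values=[[1.0], [2.0], [3.0], [4.0]],
    k=2,
)
# The output mixes only the values of the two closest keys.
```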

You’re not getting a smaller model. You’re getting selective activation of a very large model with near-linear attention scaling and cheap long-context storage.


Pricing table

| Model | Input | Output | Context |
|---|---|---|---|
| V4-Pro | $0.55/M | $2.19/M | 1M tokens |
| V4-Flash | $0.014/M | $0.28/M | 1M tokens |
| GPT-4 Turbo | $10/M | $30/M | 128K tokens |
| Claude Opus 4.6 | $15/M | $75/M | 200K tokens |


What this looks like on real workloads

  • Production chatbot (1M queries/month): $25,000/month on GPT-4 Turbo → $154/month on V4-Flash

  • Agent coding assistant (1.5M output tokens/month): $112.50 on Opus 4.6 → $3.29 on V4-Pro

  • Enterprise doc processing (200K input + 10K output per doc): $2.30 on GPT-4 Turbo → $0.13 on V4-Pro
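These figures check out against the pricing table. A small calculator reproduces the chatbot line item; the per-query token mix (1,000 input plus 500 output) is my assumption, chosen because it matches the quoted numbers:

```python
# $ per million tokens (input, output), from the pricing table above.
PRICES = {
    "gpt-4-turbo": (10.00, 30.00),
    "v4-flash": (0.014, 0.28),
}

def monthly_cost(model, queries, in_tokens, out_tokens):
    """Monthly API spend for a fixed per-query token mix."""
    price_in, price_out = PRICES[model]
    return queries * (in_tokens * price_in + out_tokens * price_out) / 1_000_000

# Assumed workload: 1M queries/month, 1,000 input + 500 output tokens each.
gpt = monthly_cost("gpt-4-turbo", 1_000_000, 1000, 500)   # ≈ $25,000/month
flash = monthly_cost("v4-flash", 1_000_000, 1000, 500)    # ≈ $154/month
```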

Even if V4 is meaningfully worse on some tasks, the cost gap is large enough that you can run multiple passes, add verification steps, or accept some quality tradeoff and still come out ahead economically.


The two-stack future

One day before release, Reuters reported DeepSeek refused early API access to U.S. chip manufacturers including Nvidia — a deliberate mirror of the U.S. GPU export ban. The AI supply chain is splitting:

  • Western stack: Nvidia GPUs → CUDA → AWS/Azure/GCP → OpenAI/Anthropic/Google

  • Chinese stack: Huawei Ascend → CANN → Huawei Cloud/Alibaba Cloud → DeepSeek/Alibaba/Baidu

For developers and enterprises, this creates a strategic dimension that goes beyond benchmark comparisons. If you build on V4 and the geopolitical situation escalates, what’s your fallback? If you stay on Western APIs and the 100x cost gap persists, what’s the competitive pressure from teams that don’t?
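Because both stacks expose OpenAI-compatible endpoints, one pragmatic hedge is a thin routing layer with an explicit failover order. A minimal sketch follows; the provider list is illustrative, and only api.deepseek.com actually appears in this post:

```python
# Failover order: cheap primary first, Western fallback second.
# Names and the fallback endpoint are illustrative assumptions.
PROVIDERS = [
    {"name": "deepseek", "base_url": "https://api.deepseek.com/v1", "model": "deepseek-chat"},
    {"name": "openai", "base_url": "https://api.openai.com/v1", "model": "gpt-4-turbo"},
]

def pick_provider(available):
    """Return the first provider (in priority order) that is reachable."""
    for p in PROVIDERS:
        if p["name"] in available:
            return p
    raise RuntimeError("no provider available")

# Normal operation routes to the cheap primary...
primary = pick_provider({"deepseek", "openai"})
# ...and fails over if the primary becomes unreachable.
fallback = pick_provider({"openai"})
```

In practice the set of reachable providers would come from health checks, and the returned `base_url`/`model` pair would be passed straight to the OpenAI SDK client shown in the quick start below.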


API compatibility and quick start

The endpoint is OpenAI-compatible, so migration from existing OpenAI SDK integrations is minimal:


```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-chat",   # V4-Pro
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=1024
)

print(response.choices[0].message.content)
```

For V4-Flash, use `model="deepseek-chat-flash"`. Both are live now.


The post-scaling paradigm shift

What’s interesting about V4 architecturally is that it represents a different thesis than “train bigger on more compute.” The Engram + DSA + mHC combination is about extracting more capability per dollar of inference cost, not just per dollar of training compute. If this approach generalizes, it suggests the next few years of model competition will be as much about inference efficiency as raw benchmark scores.

The open-source weights are on Hugging Face. The API is live at api.deepseek.com. Both tiers available now.



Questions

1. For teams currently spending significant budget on GPT-4 Turbo or Claude Opus for non-reasoning workloads — are you evaluating V4, or do compliance/data residency concerns make it a non-starter regardless of price? Curious what the actual blockers look like in practice.

2. Has anyone tested V4-Pro on agent/coding tasks specifically? The claim that it benchmarks near Opus 4.6 on non-reasoning tasks is interesting if it holds up in practice. Would love to hear real results rather than benchmark numbers.

3. For those thinking about the two-stack future — are you building with explicit fallback strategies in mind, or treating this as a “wait and see” situation? At what point does the cost gap become large enough that you’d accept the strategic dependency?

Also curious: for anyone self-hosting, what hardware are you running it on and what utilization are you seeing? And has anyone tested the full 1M context window in production — curious about latency at that scale.
