xiji2646-netizen

DeepSeek V4 dropped today — $0.28/M output on 1M context, running on Huawei Ascend. Are you routing workloads to it?

DeepSeek just released V4 and the pricing is hard to ignore.

V4-Flash: $0.28/M output tokens. V4-Pro: $2.19/M. Both with 1M token context as default.

For reference: GPT-4 Turbo is $30/M output. Claude Opus 4.6 is $75/M. That’s not a marginal difference — it’s a structural one. I’ve been digging into the technical report and wanted to share what I found, because I think this release has implications beyond “another cheap Chinese model.”


The infrastructure story is the real headline

V4 is the first Tier-1 LLM to run on **Huawei Ascend chips at 85%+ utilization**. DeepSeek co-optimized inference kernels directly with Huawei’s teams for Ascend 910B/950. They report inference quality matching Nvidia A100 deployments at roughly 40% lower hardware cost.

This matters because the GPU export ban was supposed to slow Chinese AI development. DeepSeek V4 running on Huawei Ascend at 85% utilization, with the Flash tier priced roughly 100x below GPT-4 Turbo on output tokens, is a pretty direct answer to how that played out.


Three architecture innovations that make the pricing possible

Engram Architecture — separates static knowledge (CPU RAM, hash-based lookup) from dynamic reasoning (GPU). CPU RAM is 10-20x cheaper per GB than GPU HBM. This is why 1M context doesn’t require proportional GPU memory growth, and why 1M context is the default even on the cheapest tier. The model offloads long-context storage to CPU memory rather than keeping everything in HBM.

mHC (Manifold-Constrained Hyper-Connections) — training stability mechanism for the 1.6T parameter MoE via bi-stochastic matrix projection (Sinkhorn-Knopp). Prevents gradient explosion, reduces failed training runs, lowers amortized training cost. This is part of why they can offer these prices — fewer wasted training runs means lower cost basis.

DSA (DeepSeek Sparse Attention) — token-dimension compression that takes attention from O(n²) to near-linear scaling, with 60-70% memory bandwidth reduction per attention layer. Combined with the MoE architecture (1.6T total parameters, ~37B active per forward pass), this is what makes the Flash tier viable at $0.28/M output.

You’re not getting a smaller model. You’re getting selective activation of a very large model with near-linear attention scaling and cheap long-context storage.
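
To make the mHC description a bit more concrete: Sinkhorn-Knopp is a standard iteration that turns a positive square matrix into an (approximately) bi-stochastic one by alternately normalizing rows and columns. The sketch below is the textbook version in plain Python, not DeepSeek's implementation; the function name, the iteration count, and the example matrix are my own assumptions for illustration.

```python
def sinkhorn_knopp(matrix, iterations=50):
    """Alternately normalize rows and columns of a positive square matrix.

    Generic textbook iteration; NOT taken from the V4 technical report.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row step: scale every row so it sums to 1.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [value / row_sum for value in m[i]]
        # Column step: scale every column so it sums to 1.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m


# Hypothetical 3x3 mixing weights, all strictly positive.
weights = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 10.0]]

projected = sinkhorn_knopp(weights)
row_sums = [sum(row) for row in projected]
col_sums = [sum(projected[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)  # both should be very close to 1.0 everywhere
```

A bi-stochastic matrix has spectral norm 1, so mixing signals through it cannot amplify their norm. That is one plausible reading of how a projection like this would help against gradient explosion.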


Pricing table

| Model | Input | Output | Context |
|---|---|---|---|
| V4-Pro | $0.55/M | $2.19/M | 1M tokens |
| V4-Flash | $0.014/M | $0.28/M | 1M tokens |
| GPT-4 Turbo | $10/M | $30/M | 128K tokens |
| Claude Opus 4.6 | $15/M | $75/M | 200K tokens |
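
As a quick sanity check on the size of the gap, here is a small sketch that computes output-price ratios straight from the table. The prices are the ones listed above; the dictionary layout and the `output_ratio` helper are just names I made up for the example.

```python
# USD per 1M tokens as (input, output), copied from the pricing table.
PRICES = {
    "V4-Pro": (0.55, 2.19),
    "V4-Flash": (0.014, 0.28),
    "GPT-4 Turbo": (10.00, 30.00),
    "Claude Opus 4.6": (15.00, 75.00),
}


def output_ratio(expensive, cheap):
    """How many times more the first model charges per output token."""
    return PRICES[expensive][1] / PRICES[cheap][1]


for western in ("GPT-4 Turbo", "Claude Opus 4.6"):
    for v4 in ("V4-Flash", "V4-Pro"):
        print(f"{western} vs {v4}: {output_ratio(western, v4):.0f}x on output")
```

On these numbers the "100x" figure is the Flash tier against GPT-4 Turbo (about 107x). Against Opus 4.6, Flash comes out around 268x, while V4-Pro sits at roughly 14x and 34x respectively.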


What this looks like on real workloads

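  • Production chatbot (1M queries/month): $25,000/month on GPT-4 Turbo → $154/month on V4-Flash (these figures are consistent with roughly 1K input + 500 output tokens per query at the prices above)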

  • Agent coding assistant (1.5M output tokens/month): $112.50 on Opus 4.6 → $3.29 on V4-Pro

  • Enterprise doc processing (200K input + 10K output per doc): $2.30 on GPT-4 Turbo → $0.13 on V4-Pro
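
The three workload figures above can be reproduced with a few lines of arithmetic. This is a rough sketch using the table prices; `cost_usd` is a helper name I made up, and the chatbot's per-query token counts (1,000 in, 500 out) are an assumption on my part that happens to match both chatbot numbers.

```python
def cost_usd(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD, given token counts and USD-per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Chatbot: 1M queries/month.
# ASSUMPTION: 1,000 input + 500 output tokens per query (not stated in the post).
queries = 1_000_000
chatbot_gpt4_turbo = queries * cost_usd(1_000, 500, 10.00, 30.00)
chatbot_v4_flash = queries * cost_usd(1_000, 500, 0.014, 0.28)

# Agent coding assistant: 1.5M output tokens/month, input ignored as in the post.
agent_opus = cost_usd(0, 1_500_000, 15.00, 75.00)
agent_v4_pro = cost_usd(0, 1_500_000, 0.55, 2.19)

# Enterprise doc processing: 200K input + 10K output tokens per document.
doc_gpt4_turbo = cost_usd(200_000, 10_000, 10.00, 30.00)
doc_v4_pro = cost_usd(200_000, 10_000, 0.55, 2.19)

for label, value in [
    ("chatbot, GPT-4 Turbo", chatbot_gpt4_turbo),
    ("chatbot, V4-Flash", chatbot_v4_flash),
    ("agent, Opus 4.6", agent_opus),
    ("agent, V4-Pro", agent_v4_pro),
    ("per doc, GPT-4 Turbo", doc_gpt4_turbo),
    ("per doc, V4-Pro", doc_v4_pro),
]:
    print(f"{label}: ${value:,.3f}")
```

Swap in your own token counts per request to see where your workload lands; the ratio shifts a lot depending on how input-heavy it is.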

Even if V4 is meaningfully worse on some tasks, the cost gap is large enough that you can run multiple passes, add verification steps, or accept some quality tradeoff and still come out ahead economically.


The two-stack future

One day before release, Reuters reported DeepSeek refused early API access to U.S. chip manufacturers including Nvidia — a deliberate mirror of the U.S. GPU export ban. The AI supply chain is splitting:

  • Western stack: Nvidia GPUs → CUDA → AWS/Azure/GCP → OpenAI/Anthropic/Google

  • Chinese stack: Huawei Ascend → CANN → Huawei Cloud/Alibaba Cloud → DeepSeek/Alibaba/Baidu

For developers and enterprises, this creates a strategic dimension that goes beyond benchmark comparisons. If you build on V4 and the geopolitical situation escalates, what’s your fallback? If you stay on Western APIs and the 100x cost gap persists, what’s the competitive pressure from teams that don’t?
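
On the fallback question, one minimal way to keep the dependency swappable is to put every provider behind the same callable signature and try them in order. The sketch below is provider-agnostic and uses stand-in functions instead of real API calls; all names in it (`complete_with_fallback`, `AllProvidersFailed`, `primary`, `secondary`) are hypothetical.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an exception."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (provider_name, answer)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-ins for real SDK calls, so the sketch runs without network access.
def primary(prompt: str) -> str:
    raise ConnectionError("endpoint unreachable")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = complete_with_fallback(
    "ping", [("primary", primary), ("secondary", secondary)]
)
print(used, answer)
```

In a real setup each callable would wrap one vendor's client, and you would probably want per-provider timeouts and a record of which provider answered, since output quality may differ between them.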


API compatibility and quick start

The endpoint is OpenAI-compatible, so migration from existing OpenAI SDK integrations is minimal:


```python
from openai import OpenAI

# Point the standard OpenAI client at DeepSeek's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-chat",   # V4-Pro
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=1024
)

print(response.choices[0].message.content)
```

For V4-Flash, use `model="deepseek-chat-flash"`. Both are live now.


The post-scaling paradigm shift

What’s interesting about V4 architecturally is that it represents a different thesis than “train bigger on more compute.” The Engram + DSA + mHC combination is about extracting more capability per dollar of inference cost, not just per dollar of training compute. If this approach generalizes, it suggests the next few years of model competition will be as much about inference efficiency as raw benchmark scores.

The open-source weights are on Hugging Face. The API is live at api.deepseek.com. Both tiers available now.


Questions

1. For teams currently spending significant budget on GPT-4 Turbo or Claude Opus for non-reasoning workloads — are you evaluating V4, or do compliance/data residency concerns make it a non-starter regardless of price? Curious what the actual blockers look like in practice.

2. Has anyone tested V4-Pro on agent/coding tasks specifically? The claim that it benchmarks near Opus 4.6 on non-reasoning tasks is interesting if it holds up in practice. Would love to hear real results rather than benchmark numbers.

3. For those thinking about the two-stack future — are you building with explicit fallback strategies in mind, or treating this as a “wait and see” situation? At what point does the cost gap become large enough that you’d accept the strategic dependency?

Also curious: for anyone self-hosting, what hardware are you running it on and what utilization are you seeing? And has anyone tested the full 1M context window in production? I'd like to know what latency looks like at that scale.

mercyf

Nice, rock-bottom prices :)
Is there a catch?
