xiji2646-netizen

DeepSeek V4 dropped today — $0.28/M output on 1M context, running on Huawei Ascend. Are you routing workloads to it?

DeepSeek just released V4 and the pricing is hard to ignore.

V4-Flash: $0.28/M output tokens. V4-Pro: $2.19/M. Both with 1M token context as default.

For reference: GPT-4 Turbo is $30/M output. Claude Opus 4.6 is $75/M. That’s not a marginal difference — it’s a structural one. I’ve been digging into the technical report and wanted to share what I found, because I think this release has implications beyond “another cheap Chinese model.”


The infrastructure story is the real headline

V4 is the first Tier-1 LLM to run on **Huawei Ascend chips at 85%+ utilization**. DeepSeek co-optimized inference kernels directly with Huawei’s teams for Ascend 910B/950. They report inference quality matching Nvidia A100 deployments at roughly 40% lower hardware cost.

This matters because the GPU export ban was supposed to slow Chinese AI development. V4 running on Huawei Ascend at 85% utilization, with Flash output priced roughly 100x below GPT-4 Turbo, is a pretty direct answer to how that played out.


Three architecture innovations that make the pricing possible

Engram Architecture — separates static knowledge (CPU RAM, hash-based lookup) from dynamic reasoning (GPU). CPU RAM is 10-20x cheaper per GB than GPU HBM. This is why 1M context doesn’t require proportional GPU memory growth, and why 1M context is the default even on the cheapest tier. The model offloads long-context storage to CPU memory rather than keeping everything in HBM.

mHC (Manifold-Constrained Hyper-Connections) — training stability mechanism for the 1.6T parameter MoE via bi-stochastic matrix projection (Sinkhorn-Knopp). Prevents gradient explosion, reduces failed training runs, lowers amortized training cost. This is part of why they can offer these prices — fewer wasted training runs means lower cost basis.
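For anyone who hasn't met Sinkhorn-Knopp before, here is a minimal sketch of the normalization itself: alternately rescale rows and columns of a positive matrix until both sum to 1. This only illustrates the textbook algorithm. How mHC actually applies it inside the network is not shown here, and the function name and sample matrix are my own placeholders.

```python
def sinkhorn_knopp(matrix, iterations=50):
    """Push a strictly positive square matrix toward a bi-stochastic one.

    Textbook Sinkhorn-Knopp only; this is not DeepSeek's mHC code.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row step: scale every row so it sums to 1.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [value / row_sum for value in m[i]]
        # Column step: scale every column so it sums to 1.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m


# Arbitrary positive example matrix (placeholder values).
weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
projected = sinkhorn_knopp(weights)
row_sums = [sum(row) for row in projected]
col_sums = [sum(projected[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

Because every row and column of the result sums to 1, repeatedly mixing signals through such a matrix cannot blow up their overall scale, which is the intuition behind using it as a stability constraint.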

DSA (DeepSeek Sparse Attention) — token-dimension compression that takes attention from O(n²) to near-linear scaling, with 60-70% memory bandwidth reduction per attention layer. Combined with the MoE architecture (1.6T total parameters, ~37B active per forward pass), this is what makes the Flash tier viable at $0.28/M output.

You’re not getting a smaller model. You’re getting selective activation of a very large model with near-linear attention scaling and cheap long-context storage.
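To put rough numbers on "selective activation" and "near-linear attention", here is a back-of-the-envelope sketch. The parameter counts come from the post. The per-token attention budget `k` is purely a hypothetical value I picked to show the scaling, since the post does not say how DSA selects tokens.

```python
TOTAL_PARAMS = 1.6e12   # total MoE parameters, from the post
ACTIVE_PARAMS = 37e9    # active parameters per forward pass, from the post


def active_fraction(active, total):
    """Share of the model's parameters used for a single forward pass."""
    return active / total


def dense_attention_pairs(n):
    """Query-key pairs scored by full attention: grows as n squared."""
    return n * n


def sparse_attention_pairs(n, k):
    """Pairs scored if each token attends to at most k tokens: grows as n * k."""
    return n * min(n, k)


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
context = 1_000_000
k = 2048  # hypothetical per-token budget, not a published DSA setting
dense = dense_attention_pairs(context)
sparse = sparse_attention_pairs(context, k)

print(f"active share of parameters: {fraction:.1%}")
print(f"dense pairs at 1M context:  {dense:.3e}")
print(f"sparse pairs at 1M context: {sparse:.3e} ({dense / sparse:.0f}x fewer)")
```

Under these assumptions only about 2.3% of the parameters are active per pass, and the attention work at 1M tokens drops by a few hundred times compared with dense attention.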


Pricing table

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context |
|---|---|---|---|
| V4-Pro | $0.55 | $2.19 | 1M tokens |
| V4-Flash | $0.014 | $0.28 | 1M tokens |
| GPT-4 Turbo | $10 | $30 | 128K tokens |
| Claude Opus 4.6 | $15 | $75 | 200K tokens |
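A quick way to see how big the gap is for each pairing is to divide the output prices in the table. This is just arithmetic on the numbers above; the dictionary and function names are my own.

```python
# Output prices in USD per 1M tokens, copied from the table above.
OUTPUT_PRICE_PER_M = {
    "V4-Pro": 2.19,
    "V4-Flash": 0.28,
    "GPT-4 Turbo": 30.00,
    "Claude Opus 4.6": 75.00,
}


def output_price_ratio(expensive, cheap):
    """How many times more the first model charges per output token."""
    return OUTPUT_PRICE_PER_M[expensive] / OUTPUT_PRICE_PER_M[cheap]


for western in ("GPT-4 Turbo", "Claude Opus 4.6"):
    for deepseek in ("V4-Flash", "V4-Pro"):
        ratio = output_price_ratio(western, deepseek)
        print(f"{western} vs {deepseek}: {ratio:.0f}x")
```

On these numbers the roughly 100x figure is the GPT-4 Turbo vs V4-Flash pairing. Against V4-Pro the gap is closer to 14x (GPT-4 Turbo) or 34x (Opus 4.6), which is still large but a different order of magnitude.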


What this looks like on real workloads

  • Production chatbot (1M queries/month): $25,000/month on GPT-4 Turbo → $154/month on V4-Flash (these figures work out to roughly 1,000 input and 500 output tokens per query at the table prices)

  • Agent coding assistant (1.5M output tokens/month): $112.50 on Opus 4.6 → $3.29 on V4-Pro

  • Enterprise doc processing (200K input + 10K output per doc): $2.30 on GPT-4 Turbo (token cost only; 200K input exceeds its 128K window, so the document would have to be chunked) → $0.13 on V4-Pro
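If you want to check these figures or plug in your own traffic, here is a small calculator using the table prices. The chatbot token counts (1,000 input and 500 output per query) are an assumption on my part, chosen because they reproduce the $25,000 and $154 figures; the other two workloads use the token counts stated in the bullets.

```python
# (input, output) prices in USD per 1M tokens, from the pricing table.
PRICES = {
    "V4-Pro": (0.55, 2.19),
    "V4-Flash": (0.014, 0.28),
    "GPT-4 Turbo": (10.00, 30.00),
    "Claude Opus 4.6": (15.00, 75.00),
}


def cost(model, input_tokens, output_tokens):
    """Token cost in USD for the given volume on the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


QUERIES = 1_000_000
# Assumed per-query shape: 1,000 input + 500 output tokens.
chatbot_gpt4 = cost("GPT-4 Turbo", 1_000 * QUERIES, 500 * QUERIES)
chatbot_flash = cost("V4-Flash", 1_000 * QUERIES, 500 * QUERIES)

# Output-only comparison, as in the bullet above.
agent_opus = cost("Claude Opus 4.6", 0, 1_500_000)
agent_pro = cost("V4-Pro", 0, 1_500_000)

# Per-document cost for 200K input + 10K output.
doc_gpt4 = cost("GPT-4 Turbo", 200_000, 10_000)
doc_pro = cost("V4-Pro", 200_000, 10_000)

print(f"chatbot: ${chatbot_gpt4:,.2f} vs ${chatbot_flash:,.2f} per month")
print(f"agent:   ${agent_opus:,.2f} vs ${agent_pro:,.2f} per month")
print(f"docs:    ${doc_gpt4:,.2f} vs ${doc_pro:,.2f} per document")
```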

Even if V4 is meaningfully worse on some tasks, the cost gap is large enough that you can run multiple passes, add verification steps, or accept some quality tradeoff and still come out ahead economically.
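To make the "multiple passes" point concrete, this sketch counts how many V4-Pro calls fit inside the token cost of one GPT-4 Turbo call of the same size. It uses the table prices and the 200K input + 10K output document shape from above; the function names are mine, and it ignores latency and any quality difference.

```python
import math

# (input, output) prices in USD per 1M tokens, from the pricing table.
PRICES = {
    "GPT-4 Turbo": (10.00, 30.00),
    "V4-Pro": (0.55, 2.19),
}


def call_cost(model, input_tokens, output_tokens):
    """Token cost in USD of a single call."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def passes_within_budget(baseline_model, candidate_model, input_tokens, output_tokens):
    """Whole candidate calls affordable for the price of one baseline call."""
    baseline = call_cost(baseline_model, input_tokens, output_tokens)
    candidate = call_cost(candidate_model, input_tokens, output_tokens)
    return math.floor(baseline / candidate)


doc_passes = passes_within_budget("GPT-4 Turbo", "V4-Pro", 200_000, 10_000)
print(f"V4-Pro passes per GPT-4 Turbo call: {doc_passes}")
```

So on the doc-processing example you could run the document through V4-Pro more than a dozen times, for example one extraction pass plus several verification passes, before matching the cost of a single GPT-4 Turbo pass.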


The two-stack future

One day before release, Reuters reported DeepSeek refused early API access to U.S. chip manufacturers including Nvidia — a deliberate mirror of the U.S. GPU export ban. The AI supply chain is splitting:

  • Western stack: Nvidia GPUs → CUDA → AWS/Azure/GCP → OpenAI/Anthropic/Google

  • Chinese stack: Huawei Ascend → CANN → Huawei Cloud/Alibaba Cloud → DeepSeek/Alibaba/Baidu

For developers and enterprises, this creates a strategic dimension that goes beyond benchmark comparisons. If you build on V4 and the geopolitical situation escalates, what’s your fallback? If you stay on Western APIs and the 100x cost gap persists, what’s the competitive pressure from teams that don’t?
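One way to keep a fallback open is to hide the provider behind a thin routing layer from day one. Below is a minimal sketch of that idea, assuming each provider is wrapped in a plain function that takes a prompt and returns text. The stub providers and all names are hypothetical; in a real setup each function would wrap an actual SDK call.

```python
class AllProvidersFailed(Exception):
    """Raised when every configured provider raised an error."""


def complete_with_fallback(prompt, providers):
    """Try providers in order and return (provider_name, answer) from the first success.

    `providers` is a list of (name, callable) pairs; each callable takes a
    prompt string and returns a completion string or raises on failure.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a sketch; real code would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def primary_stub(prompt):
    raise ConnectionError("primary unavailable")


def fallback_stub(prompt):
    return f"echo: {prompt}"


used, answer = complete_with_fallback(
    "ping",
    [("primary", primary_stub), ("fallback", fallback_stub)],
)
print(used, answer)
```

The routing itself is the easy part. The real work in a fallback plan is keeping prompts, evals, and output parsing portable enough that the second provider is actually usable when you need it.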


API compatibility and quick start

The endpoint is OpenAI-compatible, so migration from existing OpenAI SDK integrations is minimal:


    from openai import OpenAI

    client = OpenAI(
        api_key="your-deepseek-api-key",
        base_url="https://api.deepseek.com/v1"
    )

    response = client.chat.completions.create(
        model="deepseek-chat",   # V4-Pro
        messages=[{"role": "user", "content": "Your prompt here"}],
        max_tokens=1024
    )

    print(response.choices[0].message.content)

For V4-Flash, use `model="deepseek-chat-flash"`. Both are live now.
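If you are wiring this into a real service, you probably don't want the key hardcoded. Here is a small sketch that reads the settings from the environment and maps the two tiers to the model names given above. The environment variable names `DEEPSEEK_API_KEY` and `DEEPSEEK_BASE_URL` are my own choice, not something the API requires.

```python
import os

DEFAULT_BASE_URL = "https://api.deepseek.com/v1"  # endpoint from the snippet above

# Model names as given in this post.
MODEL_BY_TIER = {
    "pro": "deepseek-chat",
    "flash": "deepseek-chat-flash",
}


def client_kwargs(env=None):
    """Build keyword arguments for OpenAI(...) from environment variables.

    DEEPSEEK_API_KEY and DEEPSEEK_BASE_URL are hypothetical variable names.
    """
    env = os.environ if env is None else env
    api_key = env.get("DEEPSEEK_API_KEY")
    if not api_key:
        raise RuntimeError("Set DEEPSEEK_API_KEY before creating the client")
    return {
        "api_key": api_key,
        "base_url": env.get("DEEPSEEK_BASE_URL", DEFAULT_BASE_URL),
    }


# Example with an explicit mapping instead of the real environment.
kwargs = client_kwargs({"DEEPSEEK_API_KEY": "sk-example"})
model = MODEL_BY_TIER["flash"]
print(kwargs["base_url"], model)
# With the SDK installed: client = OpenAI(**kwargs)
```

Keeping the base URL overridable also makes it easier to point the same code at a different OpenAI-compatible endpoint later.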


The post-scaling paradigm shift

What’s interesting about V4 architecturally is that it represents a different thesis than “train bigger on more compute.” The Engram + DSA + mHC combination is about extracting more capability per dollar of inference cost, not just per dollar of training compute. If this approach generalizes, it suggests the next few years of model competition will be as much about inference efficiency as raw benchmark scores.

The open-source weights are on Hugging Face. The API is live at api.deepseek.com. Both tiers available now.


Questions

1. For teams currently spending significant budget on GPT-4 Turbo or Claude Opus for non-reasoning workloads — are you evaluating V4, or do compliance/data residency concerns make it a non-starter regardless of price? Curious what the actual blockers look like in practice.

2. Has anyone tested V4-Pro on agent/coding tasks specifically? The claim that it benchmarks near Opus 4.6 on non-reasoning tasks is interesting if it holds up in practice. Would love to hear real results rather than benchmark numbers.

3. For those thinking about the two-stack future — are you building with explicit fallback strategies in mind, or treating this as a “wait and see” situation? At what point does the cost gap become large enough that you’d accept the strategic dependency?

Also curious: for anyone self-hosting, what hardware are you running it on and what utilization are you seeing? And has anyone tested the full 1M context window in production — curious about latency at that scale.

mercyf

Nice, rock-bottom prices :slight_smile:
Is there a catch?
