ManningBooks

ManningBooks

Devtalk Sponsor

Quantization and Fast Inference (Manning)

Today’s AI models demand a lot of memory, compute, and server horsepower–which quickly translates into cost. Quantization and Fast Inference show you how you can optimize AI models without architectural redesigns or task-specific compression.

Vivek Kalyanarangan

If you’ve worked with modern AI models in production, you’ve probably run into the same wall: great results in development, followed by uncomfortable conversations about memory, latency, and cost. Quantization and Fast Inference is built for that exact point in the workflow.

This book walks through how to shrink and speed up models without redesigning them from scratch. It starts with the fundamentals—what quantization actually does to numbers and why it works—then moves into techniques you can apply right away. You’ll get hands-on with post-training quantization (PTQ), quantization-aware training (QAT), and the details that tend to cause trouble in practice, like activation outliers in LLMs or pressure on the KV cache.

What stands out is the full pipeline view. It doesn’t stop at “here’s how to quantize a model.” It covers how those choices affect deployment, runtime behavior, and tradeoffs you have to make along the way. There’s also coverage of newer low-precision formats like NF4 and FP4, which are increasingly common in real systems.

If you’re trying to run larger models on tighter budgets—or want to understand what’s happening under the hood when you compress them—this is a solid place to dig in while the book is still in Early Access.


Don’t forget you can get 45% off with your Devtalk discount! Just use the coupon code “devtalk.com” at checkout :+1:

Where Next?

Popular Ai topics Top

ManningBooks
Based on Ilya Sutskever’s famous “must-read” list of ~30 AI papers, this book walks you through the research that shaped today’s deep lea...
New
New
ManningBooks
Build an AI Agent (From Scratch) is a step-by-step guide to creating a working AI agent, starting with the bare essentials and growing yo...
New
ManningBooks
The bestselling book on Python deep learning, now covering generative AI, Keras 3, PyTorch, and JAX! François Chollet and Matthew ...
New
pragdave
Build robust LLM-powered apps, chatbots, and agents while mastering AI engineering principles that will help you outlast the tools and th...
New
ManningBooks
Erlang and OTP in Action teaches you the concepts of concurrent programming and the use of Erlang’s message-passing model. It walks you t...
New
pragdave
Build a prototype in a weekend or a full product in a month or two. Untangle legacy systems, improve tests and documentation, and tackle ...
New
ManningBooks
Rearchitecting LLMs: Structural techniques for efficient models turns research from the latest AI papers into production-ready practices ...
New
ManningBooks
CUDA for Deep Learning shows you how to work within the CUDA ecosystem, from your first kernel to implementing advanced LLM features like...
New
New

Other popular topics Top

ohm
Which, if any, games do you play? On what platform? I just bought (and completed) Minecraft Dungeons for my Nintendo Switch. Other than ...
New
DevotionGeo
I know that -t flag is used along with -i flag for getting an interactive shell. But I cannot digest what the man page for docker run com...
New
AstonJ
You might be thinking we should just ask who’s not using VSCode :joy: however there are some new additions in the space that might give V...
New
AstonJ
This looks like a stunning keycap set :orange_heart: A LEGENDARY KEYBOARD LIVES ON When you bought an Apple Macintosh computer in the e...
New
AstonJ
Continuing the discussion from Thinking about learning Crystal, let’s discuss - I was wondering which languages don’t GC - maybe we can c...
New
husaindevelop
Inside our android webview app, we are trying to paste the copied content from another app eg (notes) using navigator.clipboard.readtext ...
New
PragmaticBookshelf
Author Spotlight: Peter Ullrich @PJUllrich Data is at the core of every business, but it is useless if nobody can access and analyze ...
New
PragmaticBookshelf
Leverage Elixir and the Nx ecosystem to build intelligent applications that solve real-world problems in computer vision, natural languag...
New
AstonJ
Curious what kind of results others are getting, I think actually prefer the 7B model to the 32B model, not only is it faster but the qua...
New
xiji2646-netizen
Woke up to this today: Claude Code’s complete source code exposed via npm source map. Not a snippet. All 512,000 lines. 1,900 TypeScript ...
New