CommunityNews

CommunityNews

We Bought the Whole GPU, So We're Damn Well Going to Use the Whole GPU

TLDR: We’re releasing a throughput-optimized megakernel for tensor-parallel inference with Llama-70B on H100s. Our kernel can aggressively overlap compute, memory, and communication ops in order to simultaneously use the different hardware resources available on a GPU. When integrated into the Tokasaurus inference engine, our megakernel can outperform SGLang by >22% on end-to-end throughput (measured as time to finish 65,536 prompts from the ShareGPT benchmark). We’re releasing the code here; please be warned that this really is research code; it is sensitive to compiler versions, GPU setup, and sometimes even being looked at the wrong way, and we have no intention whatsoever of supporting it. We hope you’ll find the ideas and results interesting nonetheless!

Read in full here:

Where Next?

Popular Ai topics Top

First poster: CommunityNews
Now that DeepMind has taught AI to master the game of Go—and furthered its advantage in chess—they’ve turned their attention to another b...
New
First poster: bot
Kicking off another busy Spring GPU Technology Conference for NVIDIA, this morning the graphics and accelerator designer is announcing th...
New
CommunityNews
Artificial intelligence is now smart enough to write tracks that earn streaming service royalties.
New
First poster: bot
Language technology powered by AI can perpetuate bias if we are not careful. We need to be sure that language AI is trained to be ethical...
New
First poster: CommunityNews
Getting a glimpse into Nvidia’s R&D has become a regular feature of the spring GTC conference with Bill Dally, chief scientist and se...
New
First poster: Eiji
AI web crawling bots are the cockroaches of the internet, many developers believe. FOSS devs are fighting back in ingenuous, humorous wa...
New
First poster: brennan
It’s Not a Hypothetical, I’ve Already Lost My Job to AI For The Last Year
New
New
First poster: TimButterfield
A new agentic IDE that works alongside you from prototype to production
New
CommunityNews
Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language...
New

Other popular topics Top

AstonJ
A thread that every forum needs! Simply post a link to a track on YouTube (or SoundCloud or Vimeo amongst others!) on a separate line an...
New
PragmaticBookshelf
Ruby, Io, Prolog, Scala, Erlang, Clojure, Haskell. With Seven Languages in Seven Weeks, by Bruce A. Tate, you’ll go beyond the syntax—and...
New
New
PragmaticBookshelf
Learn different ways of writing concurrent code in Elixir and increase your application's performance, without sacrificing scalability or...
New
New
rustkas
Intensively researching Erlang books and additional resources on it, I have found that the topic of using Regular Expressions is either c...
New
PragmaticBookshelf
Programming Ruby is the most complete book on Ruby, covering both the language itself and the standard library as well as commonly used t...
New
First poster: AstonJ
Jan | Rethink the Computer. Jan turns your computer into an AI machine by running LLMs locally on your computer. It’s a privacy-focus, l...
New
PragmaticBookshelf
Explore the power of Ash Framework by modeling and building the domain for a real-world web application. Rebecca Le @sevenseacat and ...
New
AstonJ
Curious what kind of results others are getting, I think actually prefer the 7B model to the 32B model, not only is it faster but the qua...
New