
CommunityNews
We Bought the Whole GPU, So We're Damn Well Going to Use the Whole GPU
TLDR: We’re releasing a throughput-optimized megakernel for tensor-parallel inference with Llama-70B on H100s. Our kernel aggressively overlaps compute, memory, and communication ops to simultaneously use the different hardware resources available on a GPU. When integrated into the Tokasaurus inference engine, our megakernel can outperform SGLang by >22% on end-to-end throughput (measured as time to finish 65,536 prompts from the ShareGPT benchmark). We’re releasing the code here; be warned that this really is research code: it is sensitive to compiler versions, GPU setup, and sometimes even being looked at the wrong way, and we have no intention whatsoever of supporting it. We hope you’ll find the ideas and results interesting nonetheless!
Read in full here:
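
For readers curious what "one kernel that interleaves everything" can look like in practice, here is a minimal, hypothetical CUDA sketch of the interpreter pattern the TLDR alludes to: a single persistent kernel pulls tagged instructions off a queue, so memory-bound and compute-bound work can run concurrently on different SMs instead of being serialized across separate kernel launches. Everything here (the `Instr` struct, the opcodes, `megakernel`) is an illustrative assumption, not the API of the released code.

```cuda
// Hypothetical sketch of a megakernel-style instruction interpreter.
// Not the released code: names and opcodes are invented for illustration.
#include <cstdio>
#include <cuda_runtime.h>

enum Opcode { OP_COMPUTE, OP_MEMCPY, OP_HALT };

struct Instr {
    Opcode       op;
    float*       dst;
    const float* src;
    int          n;
};

// Persistent "interpreter" kernel: each block claims the next instruction
// from a global counter and executes it, so a memory-bound instruction on
// one SM can overlap with a compute-bound instruction on another.
__global__ void megakernel(const Instr* program, int n_instr, int* next) {
    __shared__ int my_instr;
    for (;;) {
        if (threadIdx.x == 0)
            my_instr = atomicAdd(next, 1);  // claim the next instruction
        __syncthreads();
        if (my_instr >= n_instr) return;    // uniform across the block
        Instr ins = program[my_instr];
        switch (ins.op) {
        case OP_COMPUTE:  // stand-in for a GEMM / attention tile
            for (int i = threadIdx.x; i < ins.n; i += blockDim.x)
                ins.dst[i] = ins.src[i] * ins.src[i];
            break;
        case OP_MEMCPY:   // stand-in for a weight load / communication op
            for (int i = threadIdx.x; i < ins.n; i += blockDim.x)
                ins.dst[i] = ins.src[i];
            break;
        case OP_HALT:
            return;
        }
        __syncthreads();  // finish before thread 0 claims the next one
    }
}

int main() {
    const int N = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));
    cudaMalloc(&c, N * sizeof(float));
    cudaMemset(a, 0, N * sizeof(float));

    // Two instructions that distinct blocks can execute concurrently.
    Instr host_prog[] = {
        { OP_MEMCPY,  b, a, N },  // "communication" stand-in
        { OP_COMPUTE, c, a, N },  // "compute" stand-in
    };
    Instr* prog;
    int*   next;
    cudaMalloc(&prog, sizeof(host_prog));
    cudaMalloc(&next, sizeof(int));
    cudaMemcpy(prog, host_prog, sizeof(host_prog), cudaMemcpyHostToDevice);
    cudaMemset(next, 0, sizeof(int));

    megakernel<<<2, 256>>>(prog, 2, next);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

The real kernel does far more (fine-grained pipelining within ops, tensor-parallel communication, careful SM scheduling), but the core idea sketched above, one resident kernel dispatching heterogeneous work, is what lets compute, memory, and communication overlap instead of waiting on kernel-launch boundaries.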