CommunityNews

CommunityNews

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention’s quadratic complexity with input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.

Read in full here:

Most Liked

chris.johan

chris.johan

Any examples on how to use this?

Where Next?

Popular Ai topics Top

First poster: bot
NVIDIA Uses AI to Slash Bandwidth on Video Calls. NVIDIA Research has invented a way to use AI to dramatically reduce video call bandwid...
New
First poster: bot
Within the decade, Google aims to build a useful, error-corrected quantum computer. This will accelerate solutions for some of the world’...
New
First poster: bot
Use AI to turn simple brushstrokes into realistic landscape images. Create backgrounds quickly, or speed up your concept exploration so y...
New
First poster: CommunityNews
Imagine you’re sitting at a casino’s poker table. Someone has explained the basic rules to you, but you’ve never played before and don’t ...
New
First poster: CommunityNews
A new computer program fashioned after artificial intelligence systems like AlphaGo has solved several open problems in combinatorics and...
New
First poster: CommunityNews
A simple algorithm that revolutionizes how neural networks approach language is now taking on image classification as well. It may not st...
New
First poster: bot
DeepMind AI learns simple physics like a baby. Neural network could be a step towards programs for studying how human infants learn.
New
First poster: bot
You can’t solve AI security problems with more AI. One of the most common proposed solutions to prompt injection attacks (where an AI la...
New
CommunityNews
AI supercomputer will use “tens of thousands” of Nvidia A100 and H100 GPUs.
New
First poster: Eiji
AI web crawling bots are the cockroaches of the internet, many developers believe. FOSS devs are fighting back in ingenuous, humorous wa...
New

Other popular topics Top

Devtalk
Hello Devtalk World! Please let us know a little about who you are and where you’re from :nerd_face:
New
AstonJ
What chair do you have while working… and why? Is there a ‘best’ type of chair or working position for developers?
New
Exadra37
I am asking for any distro that only has the bare-bones to be able to get a shell in the server and then just install the packages as we ...
New
PragmaticBookshelf
Build highly interactive applications without ever leaving Elixir, the way the experts do. Let LiveView take care of performance, scalabi...
New
PragmaticBookshelf
Create efficient, elegant software tests in pytest, Python's most powerful testing framework. Brian Okken @brianokken Edited by Kat...
New
PragmaticBookshelf
Use WebRTC to build web applications that stream media and data in real time directly from one user to another, all in the browser. ...
New
PragmaticBookshelf
Rails 7 completely redefines what it means to produce fantastic user experiences and provides a way to achieve all the benefits of single...
New
New
husaindevelop
Inside our android webview app, we are trying to paste the copied content from another app eg (notes) using navigator.clipboard.readtext ...
New
PragmaticBookshelf
Explore the power of Ash Framework by modeling and building the domain for a real-world web application. Rebecca Le @sevenseacat and ...
New