5x Faster AI? “DMS” Breakthrough Crushes Memory Bottlenecks

Scaling Artificial Intelligence usually demands a heavy price: more time and more memory. However, researchers from the University of Warsaw, NVIDIA, and the University of Edinburgh have flipped the script. They introduced a breakthrough concept called “Inference-Time Hyper-Scaling”, which uses Dynamic Memory Sparsification (DMS). This new approach lets Large Language Models (LLMs) reason better and faster without exhausting hardware memory.

The Memory Bottleneck

Modern LLMs, like OpenAI’s o1 or DeepSeek’s R1, improve their reasoning by generating longer chains of thought. However, this comes at a massive cost: as a model generates more text, its Key-Value (KV) cache grows linearly with the length of the sequence.

Consequently, the generation process hits a bottleneck. The constraint isn’t just the number of tokens; it is the accelerator’s memory. Reading this ever-growing cache back from memory dominates the cost of every decoding step and slows down generation. Essentially, the smarter the model tries to be, the slower and more memory-hungry it becomes.
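To get a feel for why this matters, here is a rough back-of-the-envelope sketch of KV cache growth. The layer count, head count, head size, and precision below are illustrative placeholders, not the actual configuration of any of the models mentioned above.

```python
# Rough KV-cache size estimate: it grows linearly with the number of generated tokens.
# All model dimensions below are illustrative placeholders, not real o1/R1 configs.

def kv_cache_bytes(num_tokens, num_layers=64, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to store keys and values for `num_tokens` tokens."""
    per_token = num_layers * num_kv_heads * head_dim * bytes_per_value * 2  # keys + values
    return num_tokens * per_token

for tokens in (1_000, 10_000, 100_000):
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / 1e9:.2f} GB")
```

Even with these made-up numbers, a 100x longer chain of thought means a 100x larger cache that must be read back at every step.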

Enter Dynamic Memory Sparsification (DMS)

To solve this, the researchers proposed Dynamic Memory Sparsification (DMS). Unlike previous methods that blindly evict tokens or require complex merging, DMS trains the model to learn its own eviction policy, deciding which cached tokens it can afford to drop.

Here is the game-changer: Delayed Eviction. Instead of immediately deleting a token, DMS waits. It keeps the token in a “sliding window” for a short period, allowing the model to extract critical information before removal.
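Here is a minimal sketch of the delayed-eviction idea. The fixed threshold, scores, and window size are made up for illustration; in DMS the eviction decisions are learned during training, not hard-coded.

```python
from collections import deque

# Minimal sketch of delayed eviction: tokens flagged for eviction stay in a
# small sliding window before they are actually dropped from the KV cache.
# Threshold, scores, and window size are illustrative assumptions.

WINDOW = 4        # grace period (in decoding steps) before a flagged token is dropped
THRESHOLD = 0.5   # illustrative "keep" threshold standing in for a learned decision

retained = []      # KV entries kept indefinitely
pending = deque()  # [token, steps_left] entries awaiting eviction

def add_token(token, keep_score):
    """Decide the fate of a new token's KV entry."""
    if keep_score >= THRESHOLD:
        retained.append(token)                # kept long-term
    else:
        pending.append([token, WINDOW])       # flagged, but not deleted yet

def advance_step():
    """Advance one decoding step: age flagged tokens, drop the expired ones."""
    for entry in list(pending):
        entry[1] -= 1
        if entry[1] == 0:
            pending.remove(entry)             # grace period over: evict for real

def visible_kv():
    """Attention still sees retained tokens plus those inside the sliding window."""
    return retained + [tok for tok, _ in pending]
```

The point of the grace period is that a token flagged as unimportant remains visible to attention for a few more steps, so later tokens can still read information out of it before it disappears.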

Remarkably, this method is highly efficient. While other compression methods require expensive retraining, DMS retrofits existing pre-trained models using logit distillation and needs only 1,000 training steps to reach an impressive 8x compression ratio.
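The retrofitting recipe in the article is logit distillation. As a hedged sketch, this is typically a KL-divergence loss between the original (teacher) model's next-token distribution and the compressed (student) model's distribution; the temperature and tensor shapes below are illustrative, not the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

# Generic logit-distillation loss: the compressed (student) model is trained to
# match the next-token distribution of the original (teacher) model.
# Temperature and shapes are illustrative assumptions.

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # kl_div expects log-probabilities for the input and probabilities for the target
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Example: a batch of 2 positions over a 32k-token vocabulary
student = torch.randn(2, 32_000)
teacher = torch.randn(2, 32_000)
print(distillation_loss(student, teacher))
```

Because the student only has to imitate the teacher's outputs rather than relearn the task from data, a small number of training steps can be enough.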

Hyper-Scaling: Speed Meets Accuracy

The results are stunning. By compressing the KV cache, DMS lets models generate longer chains of thought or explore more reasoning paths in parallel within the same memory and compute budget. This is “Inference-Time Hyper-Scaling”.
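A rough illustration of the trade-off: if each cached token costs 8x less memory, the same KV budget holds roughly 8x more generated tokens, which can be spent on longer or more numerous reasoning chains. The memory budget and per-token cost below are placeholders; only the 8x ratio comes from the article.

```python
# Back-of-the-envelope view of hyper-scaling: a fixed KV-memory budget buys
# more generated tokens once the cache is compressed. Budget and per-token
# cost are illustrative placeholders; the 8x ratio is from the article.

MEMORY_BUDGET_GB = 24
BYTES_PER_TOKEN_VANILLA = 256 * 1024   # ~256 KB per token (placeholder)
COMPRESSION_RATIO = 8                  # DMS compression ratio

vanilla_tokens = MEMORY_BUDGET_GB * 1e9 / BYTES_PER_TOKEN_VANILLA
dms_tokens = vanilla_tokens * COMPRESSION_RATIO

print(f"vanilla: ~{vanilla_tokens:,.0f} tokens fit in the budget")
print(f"DMS:     ~{dms_tokens:,.0f} tokens -> longer chains or more parallel paths")
```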

On the AIME 24 benchmark, a DMS-equipped Qwen-R1 32B model improved its score by 12.0 points. Furthermore, it saw gains of 8.6 points on GPQA and 9.7 points on LiveCodeBench.

Crucially, DMS consistently dominates other efficiency baselines like Quest and TOVA. It delivers better accuracy for the same memory reads and peak memory usage. For the Qwen3-8B model, DMS matched the accuracy of the vanilla model while enabling up to 5x higher throughput.

This research proves that we don’t always need bigger GPUs to get smarter AI. Sometimes, we just need to manage our memory better.