Google's TurboQuant Algorithm Cuts AI Memory Needs by 8x

A clever mathematical trick from Google researchers is shaking up the AI hardware market, and it has already wiped billions off the market value of some of the world's largest hardware manufacturers.

-

At the end of March, Google researchers published a new compression algorithm they developed, called TurboQuant.
The algorithm can compress a value that normally takes up to 32 bits down to just 3 or 4 bits. In practice, the memory needed to hold that data shrinks by a factor of roughly 8 (32 bits divided by 4 bits).
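
The arithmetic behind that ratio is easy to check. The sketch below is only an illustration: the function name `compression_ratio` is made up for this example and is not part of TurboQuant.

```python
def compression_ratio(original_bits: int, compressed_bits: int) -> float:
    """Return how many times smaller a value becomes after compression."""
    if original_bits <= 0 or compressed_bits <= 0:
        raise ValueError("bit widths must be positive")
    return original_bits / compressed_bits


# Bit widths taken from the text: 32-bit values stored in 4 or 3 bits.
for compressed in (4, 3):
    ratio = compression_ratio(32, compressed)
    print(f"32 bits -> {compressed} bits: {ratio:.1f}x smaller")
```

Going from 32 bits to 4 bits gives exactly 8x, which is where the headline figure comes from. Going to 3 bits gives a little more.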

Given the record-breaking hardware costs of the AI industry, a saving of that size on memory is a genuine game-changer, and shares in major memory manufacturers fell sharply in response.

Google went even further and released the algorithm as free, open-source code available to everyone, meaning that even its biggest competitors can become dramatically more efficient. Implementing the algorithm requires neither retraining models nor any hardware changes. Squeezing 32 bits into 3 or 4 necessarily discards some detail, but according to the reported tests the compression causes no measurable loss in the quality of the model's output.

The algorithm is designed primarily for the inference processes of AI models.
With older methods, the model had to reserve an ever-growing block of GPU memory as a conversation with the user continued, because it keeps a running record of the conversation (commonly called the key-value cache). The new algorithm makes it possible to compress that record while preserving the information the model needs.
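
To get a feel for how this memory grows with the length of a conversation, here is a rough back-of-the-envelope sketch. The model dimensions (`layers`, `kv_heads`, `head_dim`) are hypothetical values chosen for illustration, not figures from Google, and the function name is made up for this example.

```python
def conversation_memory_bytes(tokens: int, bits_per_value: float,
                              layers: int = 32, kv_heads: int = 8,
                              head_dim: int = 128) -> float:
    """Estimate memory for the stored conversation (key-value cache).

    Each token stores one key vector and one value vector per layer,
    hence the factor of 2. The default model dimensions are hypothetical.
    """
    values_per_token = 2 * layers * kv_heads * head_dim
    return tokens * values_per_token * bits_per_value / 8


for tokens in (1_000, 10_000, 100_000):
    full = conversation_memory_bytes(tokens, bits_per_value=16)
    small = conversation_memory_bytes(tokens, bits_per_value=4)
    print(f"{tokens:>7} tokens: {full / 2**20:8.1f} MiB at 16 bits, "
          f"{small / 2**20:8.1f} MiB at 4 bits")
```

The memory grows in direct proportion to the number of tokens, so any reduction in bits per value carries over to the whole conversation.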

-

How does it work?
The answer involves a lot of mathematics, but the core idea is a change in the way numbers are represented.
In the conventional method, data is stored on GPUs as floating-point numbers. These can represent very large values, such as distances between galaxies, as well as very small ones, such as subatomic scales, and they take 16 or 32 bits each depending on the level of precision needed.
Such numbers are used to describe the position of a point by its coordinates along the x-axis and the y-axis, for example when calculating the color of a pixel on screen during a video game.
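
As a small illustration of what those bit widths mean, the sketch below stores the same number as a 32-bit and as a 16-bit float using Python's standard `struct` module and reads it back. The helper name `round_trip` is made up for this example.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Pack a number into the given float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# "<f" is a 32-bit float and "<e" is a 16-bit float in the struct module.
FORMATS = {"32-bit": "<f", "16-bit": "<e"}

value = 0.1
for name, fmt in FORMATS.items():
    stored = round_trip(value, fmt)
    size_bits = struct.calcsize(fmt) * 8
    print(f"{name}: {size_bits} bits, stored as {stored!r}, "
          f"error {abs(stored - value):.2e}")
```

Fewer bits means a coarser approximation of the original number, which is the trade-off any compression scheme of this kind has to manage.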

In the new method, the same position is described by a different pair of values. The first is the distance of the point from the origin, and the second is the angle of the line from the origin to the point. Together, these two values pin down the position of the point without the lengthy numbers that express its x- and y-coordinates. Because an angle always falls within a fixed, known range, it can be stored using only a few bits.
One additional bit is stored alongside the compressed data and is used to correct, mathematically, the error that the compression introduces, which matters for any compression scheme that has to stay precise.
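
The polar representation described above can be sketched in two dimensions as follows. This is a toy illustration under simplifying assumptions, not Google's implementation: the real algorithm works on vectors with many dimensions, the function names are made up for this example, the radius is left uncompressed here, and the extra error-correction bit is not modeled.

```python
import math


def to_polar(x: float, y: float) -> tuple[float, float]:
    """Convert (x, y) into (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)


def quantize_angle(theta: float, bits: int) -> int:
    """Map an angle in [-pi, pi] to one of 2**bits equal-width bins."""
    levels = 2 ** bits
    step = 2 * math.pi / levels
    return int((theta + math.pi) / step) % levels


def dequantize_angle(index: int, bits: int) -> float:
    """Return the angle at the centre of the given bin."""
    step = 2 * math.pi / (2 ** bits)
    return -math.pi + (index + 0.5) * step


def reconstruct(radius: float, index: int, bits: int) -> tuple[float, float]:
    """Rebuild approximate (x, y) from the radius and the angle code."""
    theta = dequantize_angle(index, bits)
    return radius * math.cos(theta), radius * math.sin(theta)


x, y = 3.0, 4.0
radius, theta = to_polar(x, y)
for bits in (3, 4, 8):
    code = quantize_angle(theta, bits)
    x_hat, y_hat = reconstruct(radius, code, bits)
    error = math.hypot(x - x_hat, y - y_hat)
    print(f"{bits} bits: code={code}, "
          f"point=({x_hat:.3f}, {y_hat:.3f}), error={error:.3f}")
```

Each extra bit spent on the angle halves the width of a bin, so the reconstructed point lands closer to the original.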

-

Giving away such a valuable technique for free is far from an obvious move for a commercial company of Google's size, and Google's own revenue from the AI industry makes the advantage it is sharing all the more significant.
Even so, once the existence of the algorithm became public, other researchers would likely have worked it out independently, and Google prefers to earn, in the process, the prestige and reputation of a talented market leader.
Google took a similar step in the past when it released the Transformer architecture, which is the foundation of the entire AI revolution today — a revolution from which Google is a primary beneficiary.

--
👋 Hi, I'm Shlomo Strauss — follow me for more interesting content on science and technology.
