Ars Technica AI · 25 Mar

Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x

model · infrastructure · research

Google Research unveiled TurboQuant, a compression algorithm that reduces large language model (LLM) memory usage by 6x without sacrificing output quality. The technique targets a critical bottleneck in AI efficiency: ballooning memory demands that make components such as high-bandwidth RAM a major cost of serving models.

TurboQuant specifically targets the key-value cache, which Google describes as a 'digital cheat sheet' storing important information to avoid recomputation. Because this cache holds large numbers of high-dimensional vectors, it occupies substantial memory and bottlenecks overall model performance.
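To see why the cache grows so large, here is a minimal sketch of key-value caching in attention; all names and dimensions are illustrative, not Google's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

# Growing key-value cache: one (key, value) pair per generated token.
k_cache, v_cache = [], []

def attend(query, new_key, new_value):
    """Append the new token's K/V to the cache, then attend over all of it."""
    k_cache.append(new_key)
    v_cache.append(new_value)
    K = np.stack(k_cache)            # (seq_len, d)
    V = np.stack(v_cache)            # (seq_len, d)
    scores = K @ query / np.sqrt(d)  # attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax
    return weights @ V               # attention output, shape (d,)

# Each decoding step reuses every previously cached key and value,
# so the cache — and its memory footprint — grows with sequence length.
for _ in range(5):
    out = attend(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d))
```

Every layer and head keeps such a cache, which is why long contexts inflate memory use so quickly.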

The compression algorithm applies a two-step process beginning with PolarQuant. This technique converts vectors from standard Cartesian coordinates into polar coordinates on a circular grid, representing pairs of values as a radius and a direction.
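The polar idea can be sketched roughly as follows; this is an assumption-laden illustration (coordinate pairing, a 16-level angle grid), not Google's actual PolarQuant code:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=64)  # one key vector (illustrative dimension)

# Group coordinates into 2-D pairs and convert each pair to polar form.
pairs = v.reshape(-1, 2)
radii = np.hypot(pairs[:, 0], pairs[:, 1])    # radius of each pair
angles = np.arctan2(pairs[:, 1], pairs[:, 0]) # direction of each pair

# Snap each angle onto a coarse circular grid (16 levels ~ 4 bits/angle).
levels = 16
step = 2 * np.pi / levels
angle_idx = np.round(angles / step).astype(int) % levels

# Reconstruct: original radius plus the snapped direction.
approx = np.stack([radii * np.cos(angle_idx * step),
                   radii * np.sin(angle_idx * step)], axis=1).reshape(-1)

err = np.linalg.norm(v - approx) / np.linalg.norm(v)
print(f"relative reconstruction error: {err:.3f}")
```

Snapping directions onto a small grid lets each pair be stored as one index plus a radius, which is where the compression comes from.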

A second step called Quantized Johnson-Lindenstrauss (QJL) addresses residual errors from PolarQuant. QJL applies a 1-bit quantization layer that reduces each coordinate of a randomly projected vector to a single bit, while preserving the inner-product relationships that determine attention scores.
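A sign-of-random-projection sketch shows why 1-bit codes can still preserve the geometry attention depends on. This is the classic SimHash-style construction used here as an analogy, not QJL's exact formulation; dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 1024              # original and projected dimensions (illustrative)
P = rng.normal(size=(m, d))  # random Johnson-Lindenstrauss-style projection

def one_bit(v):
    """Keep only the sign of each projected coordinate: 1 bit per coordinate."""
    return np.sign(P @ v)

a = rng.normal(size=d)
b = rng.normal(size=d)

# The fraction of agreeing sign bits estimates the angle between a and b —
# and angles (via dot products) are what attention scores depend on.
agree = np.mean(one_bit(a) == one_bit(b))
est_angle = np.pi * (1 - agree)
true_angle = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"estimated angle: {est_angle:.2f}, true angle: {true_angle:.2f}")
```

With enough projected coordinates, the 1-bit codes recover inter-vector angles closely, even though each stored value is just a sign.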

Google tested TurboQuant across a suite of long-context benchmarks using both Gemma and Mistral open models. The algorithm delivered what Google describes as 'perfect downstream results' while achieving an 8x performance increase measured against a highly optimized JAX baseline.

On Nvidia H100 accelerators, computing attention scores with 4-bit TurboQuant proved 8x faster than 32-bit unquantized keys. The algorithm can quantize the key-value cache to just 3 bits with no additional training, making it immediately applicable to existing models.
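The memory impact of 3-bit quantization is easy to estimate with back-of-envelope arithmetic; the model dimensions below are illustrative assumptions, not figures from Google:

```python
# Back-of-envelope KV-cache sizing (illustrative dimensions, not from Google).
layers, heads, head_dim = 32, 32, 128
seq_len = 32_768  # a long-context sequence

# Keys + values: 2 cached elements per (layer, head, head_dim) slot per token.
elems_per_token = 2 * layers * heads * head_dim

def cache_gib(bits_per_element):
    total_bits = seq_len * elems_per_token * bits_per_element
    return total_bits / 8 / 2**30  # bits -> GiB

fp16 = cache_gib(16)  # unquantized half-precision baseline
q3 = cache_gib(3)     # 3-bit quantized cache
print(f"fp16 cache: {fp16:.1f} GiB, 3-bit cache: {q3:.1f} GiB, "
      f"ratio: {fp16 / q3:.1f}x")
```

Going from 16-bit to 3-bit storage shrinks the cache by 16/3 ≈ 5.3x at these settings, in the same ballpark as the roughly 6x reduction the article cites.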

TurboQuant could reduce AI operational costs or enable more complex models within the same memory constraints. Mobile AI applications may benefit in particular, since compression of this kind could let higher-quality models run directly on smartphones without relying on cloud processing.