Top AI Product



Google TurboQuant Squeezes LLM Cache to 3 Bits — 6x Less Memory, 8x Faster, Zero Accuracy Loss

Every large language model running today has the same dirty secret: the longer the conversation goes, the more memory the Key-Value cache eats. For models like Gemini handling 100k+ token contexts, the KV cache can balloon to consume more memory than the model weights themselves. Google Research just published a direct answer to this problem. TurboQuant, a new compression algorithm accepted at ICLR 2026, compresses KV caches down to 3 bits per value — no retraining, no fine-tuning, and according to Google’s benchmarks, no accuracy loss whatsoever.

The paper hit Hacker News on March 25, 2026, racking up 155 points. Google Research’s official tweet drew immediate attention from the AI inference community, and within hours, someone had already opened a discussion thread on the llama.cpp repository asking about integration. The interest is real, but so are the questions.

The KV Cache Problem Nobody Talks About Enough

When an LLM generates text, it stores intermediate computations — keys and values from every attention layer — so it doesn’t have to recompute them for each new token. This is the KV cache. It’s essential for fast inference, but it scales linearly with sequence length. Run a model with a 128k context window and you’re looking at gigabytes of KV cache sitting in GPU memory.
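To get a feel for the scale, here is a back-of-the-envelope calculation. The model dimensions below are illustrative (a hypothetical 7B-class model with grouped-query attention), not any specific model's config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Total KV cache size: keys + values, across all layers and KV heads."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed dimensions: 32 layers, 8 KV heads, head_dim 128, FP16 values,
# at a 128k-token context:
size = kv_cache_bytes(32, 8, 128, 128_000)
print(f"{size / 2**30:.1f} GiB")  # roughly 15.6 GiB of cache in FP16
```

That is cache alone, before the model weights, and it grows linearly with every token of context.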

This creates a cascade of practical problems. Fewer concurrent users per GPU. Slower batch processing. And for anyone trying to run models locally on consumer hardware or edge devices, it’s often the KV cache — not the model weights — that makes longer contexts impossible.

Existing solutions have chipped away at this. KIVI, presented at ICML 2024, demonstrated tuning-free 2-bit quantization of the KV cache, achieving 2.6x memory reduction with Llama, Falcon, and Mistral models. KVQuant, from NeurIPS 2024, pushed toward supporting 10-million-token context lengths. But most of these methods either require careful calibration, sacrifice some accuracy at extreme compression ratios, or need model-specific tuning.

TurboQuant takes a fundamentally different approach.

How TurboQuant Actually Works

TurboQuant is built on two complementary sub-algorithms: PolarQuant and QJL (Quantized Johnson-Lindenstrauss). Together they form a two-stage compression pipeline that is — and this is the critical detail — completely data-oblivious.

PolarQuant handles the heavy lifting. Instead of quantizing vectors in their original Cartesian coordinate space, it converts pairs of coordinates into polar form — a radius and an angle. This transformation has a specific mathematical advantage: it eliminates the normalization step that traditional quantization methods require, removing both the computational overhead and the error it introduces. PolarQuant then recursively applies this polar transformation, pairing radii together until the entire vector is distilled into a single final radius and a set of compact angles.
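The recursive pairing is easier to see in code. The sketch below implements only the lossless polar transform described above, as a toy: the actual PolarQuant additionally quantizes the resulting angles, which is where the compression comes from, and this sketch assumes the vector length is a power of two.

```python
import math

def polar_reduce(vec):
    """Recursively convert coordinate pairs to (radius, angle) form until
    one final radius remains. Assumes len(vec) is a power of two."""
    angles = []
    values = list(vec)
    while len(values) > 1:
        radii = []
        for x, y in zip(values[::2], values[1::2]):
            radii.append(math.hypot(x, y))   # radius of the pair
            angles.append(math.atan2(y, x))  # angle of the pair
        values = radii  # pair the radii again at the next level
    # The final radius is exactly the vector's Euclidean norm.
    return values[0], angles

def polar_expand(radius, angles, dim):
    """Invert polar_reduce (lossless here, since angles are unquantized)."""
    values = [radius]
    idx = len(angles)
    for _ in range(int(math.log2(dim))):
        n = len(values)
        idx -= n  # consume angles from the last level back to the first
        nxt = []
        for r, t in zip(values, angles[idx:idx + n]):
            nxt.extend([r * math.cos(t), r * math.sin(t)])
        values = nxt
    return values
```

A nice property of the construction is visible in the code: the single final radius is just the vector's Euclidean norm, so the angles carry all of the directional information.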

QJL cleans up what PolarQuant leaves behind. Using the Johnson-Lindenstrauss Transform — a well-known dimensionality reduction technique — QJL compresses the residual error down to a single sign bit per value. That’s it. One bit to capture whatever PolarQuant didn’t.
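The idea of capturing a vector with one sign bit per projection can be illustrated with a classic one-bit Johnson-Lindenstrauss estimator. This is a generic sketch of the underlying principle, not Google's actual QJL kernel:

```python
import math
import numpy as np

def sign_encode(x, sketch):
    """Compress x to one bit per projection: the sign of each <s_i, x>."""
    return np.sign(sketch @ x)

def inner_estimate(bits, norm_x, q, sketch):
    """Estimate <x, q> from x's sign bits plus its norm.

    For Gaussian s, E[sign(<s, x>) * <s, q>] = sqrt(2/pi) * <x, q> / ||x||,
    so rescaling the empirical mean recovers the inner product."""
    m = sketch.shape[0]
    return norm_x * math.sqrt(math.pi / 2) * float(bits @ (sketch @ q)) / m

rng = np.random.default_rng(0)
m, d = 40_000, 8
sketch = rng.standard_normal((m, d))     # shared random projection
x = np.array([3.0, 4.0, 0, 0, 0, 0, 0, 0])
q = x / 5.0                              # true <x, q> = 5.0
bits = sign_encode(x, sketch)            # one bit per projection
est = inner_estimate(bits, np.linalg.norm(x), q, sketch)
```

The estimate concentrates around the true inner product as the number of projections grows, which is why a single sign bit per value is enough to mop up a small residual.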

The combined result: KV cache compressed to approximately 3 bits per value.

What makes this different from traditional Product Quantization (PQ) is that TurboQuant needs zero training. PQ requires running k-means clustering on your dataset to build codebooks, a process that can take hundreds of seconds for large datasets. TurboQuant skips all of that. You can apply it as a post-training compression layer on top of any existing deployed model, instantly.
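For contrast, here is what the training step TurboQuant eliminates looks like in classic Product Quantization. This is a minimal k-means sketch for illustration; production PQ implementations are far more optimized:

```python
import numpy as np

def train_pq_codebooks(data, n_subspaces=4, n_codes=16, iters=10, seed=0):
    """Classic PQ: split dimensions into subspaces and run k-means in each.
    This per-dataset training is exactly the step TurboQuant avoids."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    sub = d // n_subspaces
    books = []
    for s in range(n_subspaces):
        chunk = data[:, s * sub:(s + 1) * sub]
        centers = chunk[rng.choice(n, n_codes, replace=False)]
        for _ in range(iters):
            # assign every vector to its nearest centroid, then recenter
            dists = ((chunk[:, None, :] - centers[None]) ** 2).sum(-1)
            labels = dists.argmin(1)
            for k in range(n_codes):
                pts = chunk[labels == k]
                if len(pts):
                    centers[k] = pts.mean(0)
        books.append(centers)
    return books

def pq_encode(vec, books):
    """Encode one vector as one codebook index per subspace (4 bits each here)."""
    sub = len(vec) // len(books)
    return [((centers - vec[s * sub:(s + 1) * sub]) ** 2).sum(1).argmin()
            for s, centers in enumerate(books)]

# Train on a toy dataset of 200 8-dim vectors, then encode one of them:
data = np.random.default_rng(1).normal(size=(200, 8))
books = train_pq_codebooks(data)
codes = pq_encode(data[0], books)
```

Every codebook above is tied to the training data and must be rebuilt if the distribution shifts. TurboQuant's transforms are fixed in advance, which is why it can be applied to a live KV cache with no preprocessing at all.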

The Numbers: Benchmarks Across Five Tasks

Google tested TurboQuant on Gemma and Mistral models across five long-context benchmarks. The headline results:

  • Memory reduction: At least 6x compared to uncompressed FP32 KV storage
  • Speed: Up to 8x faster attention logit computation on NVIDIA H100 GPUs in 4-bit mode vs. 32-bit baseline
  • Needle-In-A-Haystack: 100% retrieval accuracy maintained up to 104k tokens under 4x compression — matching full-precision performance exactly
  • Task coverage: No measurable accuracy loss across question answering, code generation, and summarization tasks

The Needle-In-A-Haystack result is particularly notable. This benchmark tests whether a model can find a specific piece of information buried deep within a long context. Compression methods that introduce even small errors tend to degrade rapidly on this test as context length increases. TurboQuant held at 100% all the way to 104k tokens.

Beyond LLM inference, Google also tested TurboQuant on nearest-neighbor vector search. It outperformed both standard Product Quantization and RabitQ in recall accuracy on the GloVe dataset (d=200), while reducing indexing time to effectively zero — since there’s no codebook to train.

TurboQuant vs. the Competition

The KV cache quantization space has gotten crowded over the past two years. Here’s how TurboQuant stacks up against the key alternatives:

KIVI (ICML 2024): Uses asymmetric 2-bit quantization — keys quantized per-channel, values per-token. Achieves 2.6x peak memory reduction and 2.35-3.47x throughput improvements. Tuning-free like TurboQuant, but at a lower compression ratio with less dramatic speedups.

KVQuant (NeurIPS 2024): Targets extreme context lengths (up to 10 million tokens). Uses per-channel quantization with outlier handling. Strong results but more complex calibration requirements.

SqueezeLLM: Takes a different approach entirely with non-uniform quantization and dense-and-sparse decomposition. Orthogonal to per-token methods, meaning it could theoretically be combined with approaches like KIVI.

Product Quantization (traditional): The baseline that TurboQuant explicitly aims to replace. Requires expensive k-means training, dataset-specific codebooks, and doesn’t work in online/streaming settings.

TurboQuant’s key differentiator is the combination of aggressive compression (3-bit), zero preprocessing time, and data-oblivious operation. Most competing methods require you to choose two of those three — TurboQuant delivers all three simultaneously.

What This Means for Local and Edge Deployment

The implications for the local LLM community are significant. If you’re running models through llama.cpp on a MacBook or a single consumer GPU, memory is your tightest constraint. The KV cache is often what prevents you from using longer context windows. A 6x reduction in KV cache memory could mean the difference between a 16k and a 96k context window on the same hardware.
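The back-of-the-envelope math is easy to run yourself. The model dimensions and memory budget below are assumptions for illustration (32 layers, 8 KV heads, head dim 128, with 8 GiB left for cache after the weights), not measurements from the paper:

```python
def max_context_tokens(budget_bytes, n_layers, n_kv_heads, head_dim, bits_per_value):
    """How many tokens of KV cache fit in a fixed memory budget."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bits_per_value / 8
    return int(budget_bytes // bytes_per_token)

budget = 8 * 2**30  # 8 GiB free after model weights (illustrative)
fp16_ctx = max_context_tokens(budget, 32, 8, 128, 16)  # 65,536 tokens
q3_ctx = max_context_tokens(budget, 32, 8, 128, 3)     # ~349,000 tokens
```

Relative to an FP16 baseline the raw bit-width ratio is 16/3, a bit over 5x; the paper's headline 6x figure is quoted against FP32 storage. Either way, the same hardware suddenly supports context lengths that were previously out of reach.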

The llama.cpp community noticed immediately. A discussion thread appeared on the repository the same day as the announcement, exploring how TurboQuant could be integrated into the framework’s existing quantization pipeline. The interest makes sense — llama.cpp already supports various quantization formats (Q4_0, Q5_K_M, etc.) for model weights, but KV cache compression has been less explored in the open-source ecosystem.

There’s a significant catch, though. As multiple community members pointed out, Google has not released a downloadable PyTorch or CUDA implementation. The paper is public and will be formally presented at ICLR 2026, but the actual code is not yet available on GitHub. This means that while the research results are impressive, practical adoption depends on either Google releasing reference code or the open-source community implementing it from the paper — which, given the mathematical complexity of PolarQuant’s recursive polar transformations, is not trivial.

The research was led by Amir Zandieh and Vahab Mirrokni at Google Research, in collaboration with researchers from KAIST and NYU.

Beyond KV Cache: Vector Search Applications

One aspect that hasn’t gotten enough attention in the initial coverage is TurboQuant’s applicability to vector search. The same properties that make it effective for KV cache compression — data-oblivious operation, near-zero preprocessing, high recall at low bit-widths — translate directly to approximate nearest-neighbor search.

In vector databases, Product Quantization is the standard compression technique for making large indices fit in memory. But PQ’s reliance on k-means training means it’s slow to index and needs to be retrained when the data distribution shifts. TurboQuant eliminates both problems. Google’s results show it achieving better recall than PQ on standard benchmarks while indexing in effectively zero time.

For companies running large-scale retrieval systems — RAG pipelines, recommendation engines, semantic search — this could be as significant as the KV cache application. A drop-in replacement for PQ that’s both faster to build and more accurate is a hard combination to ignore.

FAQ

Is TurboQuant open source?

Not yet. The research paper is publicly available and will be presented at ICLR 2026, but Google has not released an official implementation on GitHub. The open-source community is actively discussing how to implement it from the paper, particularly for integration with llama.cpp and vLLM, but a production-ready implementation does not exist outside of Google at this time.

Does TurboQuant require retraining or fine-tuning the model?

No. TurboQuant is a post-training quantization (PTQ) method that is completely data-oblivious. It can be applied as a compression layer on top of any existing model without modifying the model weights or running any calibration steps. This is one of its primary advantages over methods that require dataset-specific training.

How does TurboQuant compare to KIVI?

KIVI achieves 2-bit KV cache quantization with 2.6x memory reduction. TurboQuant compresses to 3 bits but achieves 6x memory reduction (compared to FP32 baseline) and up to 8x speedup. The two methods target different points on the compression-accuracy tradeoff, but TurboQuant shows stronger overall results on long-context benchmarks while also being applicable to vector search workloads.

What models has TurboQuant been tested on?

Google evaluated TurboQuant on Gemma and Mistral model families across five long-context benchmarks, including the Needle-In-A-Haystack test. The algorithm maintained accuracy parity with full-precision inference across all tested tasks and models.

Can TurboQuant be used for purposes beyond LLM inference?

Yes. TurboQuant is also applicable to vector search and approximate nearest-neighbor retrieval. Google demonstrated that it outperforms traditional Product Quantization and RabitQ in recall accuracy while reducing indexing time to near zero, making it relevant for RAG pipelines, recommendation systems, and any application that relies on compressed vector indices.

