Research: TriAttention KV cache compression — viability assessment for NVIDIA + Gemma 4¶
Date: 2026-04-19
Spike Ticket: gpumod-lff
Status: Final
Author: researcher (gpumod triattention-spike team)
Reviewed by: tech-lead (approved 2026-04-19)
Summary¶
TriAttention is a newly published (arXiv 2604.04921, 2026-04-07) KV cache
compression method that prunes keys via pre-RoPE Q/K frequency analysis. A
working CUDA implementation exists in an external llama.cpp fork
(atomicmilkshake/llama-cpp-turboquant, branch feature/triattention), so
NVIDIA viability (Q1) is not the blocker.
The blocker is Gemma 4 architectural compatibility: Gemma 4 (released
2026-04-02) uses a dual-RoPE configuration where global-attention layers apply
rope_type="proportional" with partial_rotary_factor=0.25. TriAttention's
trigonometric series (paper eq. 2) sums over all d/2 RoPE frequency bands; with
only 25% of dimensions rotated on the global layers, the series must be
re-derived. Sliding layers inherently bound their KV by sliding_window (512 or
1024 tokens), so pruning them below that is redundant — meaningful compression
would come from the 1/6 global layers, which are precisely the algorithmically
incompatible ones. No Gemma-family calibration stats or benchmarks exist in the
paper, the reference implementation, or the forks.
Gate decision: DEFER. Re-evaluate on or after 2026-08-01 (≈3 months), once the community publishes p-RoPE compatibility analyses or Gemma calibration stats emerge. In the interim, pilot TriAttention on Qwen3-32B (where it is validated and pre-calibrated) to de-risk gpumod's KV-compression plumbing in isolation from the Gemma 4 questions.
Research Questions¶
From tech-lead's task #1 briefing, 13 sub-questions plus 5 cross-cutting integration questions:
- Q1 — Is a CUDA TriAttention impl viable for llama.cpp + NVIDIA? (sub-questions 1.1–1.6)
- Q2a — Is Gemma 4's attention architecture compatible with TriAttention? (sub-questions 2a.1–2a.3)
- Q2b — Is calibration for Gemma 4 feasible on RTX 4090? (sub-questions 2b.1–2b.4)
- Cross-cutting A–E — estimation formula shape, driver scope, preset surface, preflight implications, failure modes.
Q1 — CUDA viability¶
| # | Answer | Evidence | Confidence | Gaps |
|---|---|---|---|---|
| 1.1 | CUDA impl exists: atomicmilkshake/llama-cpp-turboquant, branch feature/triattention. HIP/ROCm sibling fork at domvox/llama-cpp-turboquant-hip, branch feature/triattention-scoring. Upstream ggml-org/llama.cpp: 0 PRs matching "TriAttention" (fork-only). | github.com/atomicmilkshake/llama-cpp-turboquant ; github.com/ggml-org/llama.cpp/pulls?q=TriAttention (0 results) | High | None |
| 1.2 | Active fork, MIT license, not upstreamed. Build flags: -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75;80;86;89;120;121". Requires CUDA 13.x runtime. Turing+ (sm_75) minimum; RTX 4090 = sm_89 is supported. | fork README | High | No tagged release; rolling branch (version pin risk) |
| 1.3a | Paper claim: 10.7× KV reduction at matched accuracy on AIME25 (Qwen3-8B); 2.5× throughput at fixed accuracy. | arXiv 2604.04921 §1, Figure 1 | High (as claim) | Not independently reproduced |
| 1.3b | Fork README reports throughput only (75 tok/s vs 17.5 tok/s baseline); no independent KV-ratio measurement. Combined with TurboQuant turbo3 yields ~6.8× effective on Qwen3.5-27B per community reports. | fork README ; github.com/ggml-org/llama.cpp discussion #20969 | Med | None |
| 1.4 | Quality cost is benchmark-dependent. MATH 500: 68.4% vs 69.6% (–1.2pp at 1024-token budget). AIME24: 42.1% vs 57.1% (–15pp at 2048-token budget, non-trivial). AIME25 long-reasoning: parity at 10.7×. No PPL/KLD reported in paper. | arXiv 2604.04921 Tables 1–2 | High | No PPL/KLD, no non-reasoning benchmarks (MMLU etc.) |
| 1.5 | Compression is budget-bounded, NOT linear. Method retains top-B keys (fixed integer budget). Pruning triggered every β=128 tokens. Once ctx > B, KV cache size is constant. This breaks gpumod's linear kv_per_1k assumption. | arXiv 2604.04921 §4.3 | High | None |
| 1.6 | Per-head, not uniform. Configurable per_head or per_layer_per_head modes via TRIATTN_RUNTIME_PRUNING_MODE. GQA handling via z-score normalize-then-max across query heads sharing a KV head. | arXiv 2604.04921 §4.3 ; WeianMao/triattention README | High | No per-layer compression variance data |
Port effort: ~0 hours, since a working CUDA fork already exists (as an untagged rolling branch, per 1.2). Kill condition (>40 person-hours port) NOT triggered on Q1.
Q2a — Gemma 4 architecture compatibility¶
Configs fetched directly from Hugging Face (2026-04-19).
Gemma 4 E4B (dense): 42 layers (35 sliding_attention + 7
full_attention), num_attention_heads=8, num_key_value_heads=2 (GQA 4:1),
head_dim=256, sliding_window=512, max_position_embeddings=131072,
num_kv_shared_layers=18, attention_k_eq_v=false.
Gemma 4 26B A4B (MoE): 30 layers (25 sliding + 5 full, repeating 5:1
pattern), num_attention_heads=16, num_key_value_heads=8 (local) /
num_global_key_value_heads=2 (global), head_dim=256, global_head_dim=512,
sliding_window=1024, max_position_embeddings=262144, num_experts=128,
num_kv_shared_layers=0, attention_k_eq_v=true.
Gemma 4 31B (dense): 60 layers, attention_k_eq_v=true, otherwise mirrors
the 26B architecture.
Dual RoPE configuration (from 26B config.json#text_config.rope_parameters):
{
"full_attention": { "rope_type": "proportional", "partial_rotary_factor": 0.25, "rope_theta": 1000000.0 },
"sliding_attention": { "rope_type": "default", "rope_theta": 10000.0 }
}
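The layer split and the effect of partial rotation can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the layer list is a stand-in built from the 5:1 pattern described above rather than the real layer_types array, and partial_rotary_factor is assumed to scale the number of rotated dimensions, with each frequency band pairing two dimensions.

```python
from collections import Counter


def count_layer_types(layer_types: list[str]) -> Counter:
    """Count sliding vs full-attention layers from a config's layer_types array."""
    return Counter(layer_types)


def band_split(head_dim: int, partial_rotary_factor: float) -> tuple[int, int]:
    """Return (rotated, pass_through) 2-D frequency bands for one head.

    Assumption: partial_rotary_factor scales the number of rotated dimensions,
    and each band pairs two dimensions.
    """
    total_bands = head_dim // 2
    rotated = int(head_dim * partial_rotary_factor) // 2
    return rotated, total_bands - rotated


# Stand-in for the 26B layer_types array: 5 sliding layers, then 1 full, repeated.
layer_types_26b = (["sliding_attention"] * 5 + ["full_attention"]) * 5
counts = count_layer_types(layer_types_26b)
rotated, pass_through = band_split(head_dim=512, partial_rotary_factor=0.25)

print(counts["sliding_attention"], counts["full_attention"])  # 25 5
print(rotated, pass_through)                                  # 64 192
```

With global_head_dim=512, only 64 of 256 bands would carry positional rotation on the global layers under this reading.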
| # | Answer | Evidence | Confidence | Gaps |
|---|---|---|---|---|
| 2a.1 | Architecture verified via direct config.json fetch (above). Layer pattern is explicit in the layer_types array. p-RoPE applies only on global layers with 25% partial rotation. Last layer is always global. | HF google/gemma-4-E4B / -26B-A4B / -31B-it config.json ; HF blog gemma4 | High | None |
| 2a.2 | TriAttention assumes standard full-dim RoPE. Logit formula (paper eq. 2): logit(q,k) = Σ_f ‖q_f‖‖k_f‖·cos(ω_f·Δ+φ_f) sums over all d/2 frequency bands. With partial_rotary_factor=0.25, only 25% of bands rotate; the remaining 75% are pass-through, so the trigonometric series must be re-derived for partial RoPE. Sliding-window attention is not discussed anywhere in the paper or either fork README. | arXiv 2604.04921 §2.1 & §4.1 ; absence in WeianMao & atomicmilkshake READMEs | High | Unknown whether Q/K concentration (R) still holds on Gemma p-RoPE |
| 2a.3 | No Gemma benchmarks exist. Paper validates Qwen3-8B, Qwen3-32B, Qwen2.5, Llama3 (via DeepSeek-R1-Distill), and GLM-4.7-Flash (MLA) only. Pre-calibrated stats shipped in triattention/vllm/stats/: Qwen3-8B, Qwen3-32B-int4, DeepSeek-R1-Distill-{Llama-8B, Qwen-7B}. No Gemma-family stats were found ([unverified]). | arXiv 2604.04921 §3.3 & §5 ; WeianMao/triattention README stats directory | High | None |
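As a possible starting point for the re-derivation flagged in 2a.2, the sum in eq. 2 can be split by whether a band is rotated. This is a sketch, not the paper's result: it assumes pass-through dimensions behave like bands with ω_f = 0, it ignores any frequency rescaling implied by rope_type="proportional", and the set names 𝓡 (rotated) and 𝓟 (pass-through) are introduced here for illustration.

```latex
\operatorname{logit}(q,k)
  = \underbrace{\sum_{f \in \mathcal{R}} \lVert q_f \rVert \, \lVert k_f \rVert
      \cos(\omega_f \Delta + \phi_f)}_{\text{rotated bands},\ |\mathcal{R}| = 0.25 \cdot d/2}
  \;+\;
    \underbrace{\sum_{f \in \mathcal{P}} \lVert q_f \rVert \, \lVert k_f \rVert
      \cos(\phi_f)}_{\text{pass-through bands, independent of } \Delta}
```

Under that assumption only the first term varies with the distance Δ. Whether the paper's Q/K concentration statistics remain meaningful for either term on Gemma 4 is exactly the open question listed under Gaps.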
Secondary compatibility issues:
- attention_k_eq_v=true on 26B/31B already halves KV vs a standard GQA model, diminishing TriAttention's marginal gain (compression on top of an already-compressed K=V representation).
- num_kv_shared_layers=18 on E4B: 18 of 42 layers reuse K/V from an upstream layer of the same attention type. Pruning must be coherent across the sharing chain, which the paper does not address.
- Sliding layers are inherently bounded by their window (512 / 1024). Pruning below window size is redundant; meaningful compression would come only from the 1/6 global layers, which are the algorithmically incompatible p-RoPE layers. This is the core reason DEFER is the right call for Gemma 4.
Q2b — Calibration feasibility¶
| # | Answer | Evidence | Confidence | Gaps |
|---|---|---|---|---|
| 2b.1 | Calibration required, data-dependent. Not data-free. Computes per-band complex-valued Q center E[q_f] and expected Q norm E[‖q_f‖] from a representative corpus. Produces .pt (vLLM) or .triattention binary (llama.cpp fork) sidecar. | arXiv 2604.04921 §3.3 & eq. 4 ; WeianMao README | High | None |
| 2b.2 | Dataset/sample count not published. Paper states "calibration data" without specifying corpus size or sampling protocol. For Gemma 4 E4B BF16 (~15 GB), VRAM for calibration = weights + activations ≈ 15 + 3 GB ≈ 18 GB → fits RTX 4090 24 GB with ~6 GB headroom. Runtime [unverified]; by analogy with llama.cpp imatrix calibration, ~1–4 h wall-clock for ≲1000 samples. | arXiv 2604.04921 §3.3 ; WeianMao README ; imatrix analogue | Low | Actual runtime, recommended corpus size and sampling |
| 2b.3 | One-time offline step. Command: llama-cli -m model.gguf -ngl 99 --triattention-calibrate corpus.txt --triattention-calibrate-out model.triattention. Referenced at inference via --triattention-stats model.triattention. | atomicmilkshake/llama-cpp-turboquant README | High | None |
| 2b.4 | Sidecar binary file, stored separately from the GGUF. No GGUF-metadata integration path. Artifact size undocumented; by analogy to imatrix output, likely tens of MB. | atomicmilkshake/llama-cpp-turboquant README | Med | Exact artifact size per model |
Risk on Gemma 4 specifically: no one has calibrated Gemma 4 for
TriAttention. A first run would require empirical verification that the Q/K
concentration phenomenon holds on Gemma's dual-RoPE — if the Mean Resultant
Length R drops on global layers because of the 25% partial rotation, the
S_trig/S_norm balance (paper §4.2) would need retuning.
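The Mean Resultant Length mentioned above is a standard circular statistic, and a first Gemma 4 calibration run could log it per band to see whether concentration survives partial rotation. The sketch below uses the textbook definition R = |mean(exp(iθ))|. How the paper aggregates phases per band and head is not reproduced here, and the function name and sample phases are illustrative.

```python
import cmath
import math


def mean_resultant_length(phases: list[float]) -> float:
    """R = |mean(exp(i*theta))|: 1.0 for identical phases, near 0 for a uniform spread."""
    if not phases:
        raise ValueError("need at least one phase")
    resultant = sum(cmath.exp(1j * theta) for theta in phases) / len(phases)
    return abs(resultant)


# Tightly clustered phases (radians): strong concentration.
concentrated = [0.10, 0.12, 0.08, 0.11]
# Phases spread evenly around the circle: no concentration.
spread = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]

r_concentrated = mean_resultant_length(concentrated)
r_spread = mean_resultant_length(spread)
```

A value of R close to 1 on the rotated bands of the global layers would support reusing the paper's S_trig/S_norm balance; a markedly lower R would point to retuning.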
Cross-cutting integration¶
A — Estimation formula shape¶
Must become non-linear and layer-type-aware. Current formula in
src/gpumod/fetchers/huggingface.py:263-299:
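# MB of KV per 1,000 tokens: 2 (K and V) * layers * KV heads * head_dim * 2 bytes (presumably a 16-bit KV cache)
kv_per_1k = 2 * num_layers * num_kv_heads * head_dim * 2 * 1000 / 1024**2
total_vram = base_vram + (ctx / 1000) * kv_per_1k # registry.py:159-163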
is strictly linear in ctx and treats all layers identically. Under TriAttention
on a hybrid-attention model like Gemma 4, KV size becomes piecewise with two
distinct per-token costs (local head_dim=256 vs global global_head_dim=512
on 26B/31B) and two distinct per-layer-type bounds (window for sliding,
triattn_budget for global). Layers that reuse K/V via num_kv_shared_layers
should not be counted in storage at all.
unique_sliding_layers = sliding_layers - shared_sliding_layers
unique_global_layers = global_layers - shared_global_layers
kv_mb(ctx) ≈ unique_sliding_layers * min(ctx, sliding_window) * per_tok_local
+ unique_global_layers * min(ctx, triattn_budget) * per_tok_global
+ base_overhead
Reduces to the current linear formula when triattn_budget=∞,
sliding_window=∞, per_tok_local = per_tok_global, and unique_layers =
num_layers (no shared layers). So the new formula can be adopted as a
generalisation, not a replacement.
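The piecewise estimate can be sketched in Python as follows. This is an illustration rather than gpumod code: HybridKVShape, kv_mb, and per_tok_mb are hypothetical names, a 2-byte KV element is assumed, and the example ignores attention_k_eq_v and base overhead. The example numbers use the 26B A4B shape quoted in Q2a with a hypothetical budget of 4096.

```python
import math
from dataclasses import dataclass


@dataclass
class HybridKVShape:
    """Hypothetical container; field names are illustrative, not gpumod's."""
    unique_sliding_layers: int
    unique_global_layers: int
    sliding_window: float      # tokens; math.inf removes the bound
    triattn_budget: float      # tokens; math.inf disables compression
    per_tok_local_mb: float    # MB per token per sliding layer
    per_tok_global_mb: float   # MB per token per global layer
    base_overhead_mb: float = 0.0


def per_tok_mb(num_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """K plus V for one token in one layer, in MB (1024**2 bytes)."""
    return 2 * num_kv_heads * head_dim * bytes_per_elem / 1024**2


def kv_mb(ctx: int, s: HybridKVShape) -> float:
    """Piecewise KV estimate: each layer type is capped by its own bound."""
    sliding = s.unique_sliding_layers * min(ctx, s.sliding_window) * s.per_tok_local_mb
    global_ = s.unique_global_layers * min(ctx, s.triattn_budget) * s.per_tok_global_mb
    return sliding + global_ + s.base_overhead_mb


shape_26b = HybridKVShape(
    unique_sliding_layers=25,
    unique_global_layers=5,
    sliding_window=1024,
    triattn_budget=4096,
    per_tok_local_mb=per_tok_mb(num_kv_heads=8, head_dim=256),
    per_tok_global_mb=per_tok_mb(num_kv_heads=2, head_dim=512),
)
print(kv_mb(32768, shape_26b), kv_mb(16384, shape_26b))  # 280.0 280.0
```

Both context lengths give the same estimate because every layer type is already at its cap, which is the behaviour the "halving ctx" heuristic fails to capture. Setting both bounds to math.inf with equal per-token costs recovers the linear formula.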
B — Driver scope¶
llama.cpp only (via fork). vLLM has a separate implementation in
WeianMao/triattention with the same calibration artifact requirement and
comparable compression ratios, but it is out of scope for this spike (gpumod's
vLLM driver targets different workloads). Scope: drivers/llamacpp.py only.
C — Preset surface¶
Add an optional preset block. Absence = off, preserving backward compatibility. Implicit per-model lookup would hide too much complexity behind the driver.
kv_compression:
  method: triattention # future: also turboquant, triattention+turboquant
  budget: 4096 # top-B keys retained on global layers
  stats_path: /path/to/model.triattention
  window: 256 # recent tokens protected from eviction
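One way the "absence = off" rule could be enforced when a preset is loaded is sketched below. The class and function names are hypothetical, the preset is assumed to be already parsed into a dict, and the validation rules (positive budget, non-negative window, a single supported method) are assumptions rather than documented constraints.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Assumption: only this method is wired up initially.
SUPPORTED_METHODS = {"triattention"}


@dataclass(frozen=True)
class KVCompression:
    method: str
    budget: int
    stats_path: str
    window: int


def parse_kv_compression(preset: Mapping[str, Any]) -> Optional[KVCompression]:
    """Return None when the block is absent (compression off), else a validated config."""
    block = preset.get("kv_compression")
    if block is None:
        return None
    cfg = KVCompression(
        method=str(block["method"]),
        budget=int(block["budget"]),
        stats_path=str(block["stats_path"]),
        window=int(block["window"]),
    )
    if cfg.method not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported kv_compression method: {cfg.method!r}")
    if cfg.budget <= 0 or cfg.window < 0:
        raise ValueError("budget must be positive and window non-negative")
    return cfg
```

Existing presets without the block parse to None and take the current code path unchanged.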
D — Preflight implications¶
src/gpumod/preflight/vram_check.py currently suggests "halving ctx roughly
halves KV cache" when VRAM is insufficient. Under TriAttention this heuristic
is wrong once ctx > budget — halving ctx would not halve KV. VRAMCheck must
read the new kv_compression field and use the compound formula in §A for
both the estimate and the suggestion. A new preflight check
(TriAttentionStatsCheck) should verify the sidecar .triattention file
exists, matches the model (hash or model-id sidecar metadata), and the driver
binary is a TriAttention-capable build.
E — Failure modes¶
The fork's README does not document fallback behaviour for unsupported architectures or stale stats. Most likely: runtime CUDA crash mid-decode. gpumod preflight MUST enforce: (a) stats file exists, (b) stats file matches model, (c) driver binary is TriAttention-capable (grep for flag support / version sentinel), (d) model architecture is on an explicit allowlist. Without preflight, users hit cryptic CUDA errors with no actionable signal.
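The four preflight conditions (a) to (d) could be combined into a single check along these lines. Everything here is a sketch under assumptions: the function name, the "<stats file>.json" metadata convention, the allowlist contents, and detecting capability by searching the driver's help text are all hypothetical and have not been verified against the fork.

```python
import json
from pathlib import Path

# Hypothetical allowlist; the source only establishes Qwen3 as validated.
ALLOWED_ARCHITECTURES = {"qwen3"}


def triattention_preflight(
    stats_path: str,
    model_sha256: str,
    architecture: str,
    driver_help_text: str,
) -> list[str]:
    """Return human-readable failures; an empty list means all checks passed."""
    failures = []
    stats = Path(stats_path)
    # Hypothetical convention: "<stats file>.json" records which model was calibrated.
    meta = Path(str(stats) + ".json")

    if not stats.is_file():
        failures.append(f"(a) stats file not found: {stats}")
    elif not meta.is_file():
        failures.append(f"(b) no metadata sidecar to match stats against model: {meta}")
    elif json.loads(meta.read_text()).get("model_sha256") != model_sha256:
        failures.append("(b) stats file was calibrated for a different model")

    if "--triattention-stats" not in driver_help_text:
        failures.append("(c) driver binary does not advertise --triattention-stats")

    if architecture not in ALLOWED_ARCHITECTURES:
        failures.append(f"(d) architecture {architecture!r} is not on the allowlist")

    return failures
```

Returning every failure at once, rather than stopping at the first, gives the user a single actionable message before the service starts.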
gpumod-specific implementation implications¶
- src/gpumod/models.py:158: ModelInfo needs optional kv_compression_budget: int | None, and (for Gemma 4-style hybrids) exposure of the layer_types ratio and both head_dim values.
- src/gpumod/fetchers/huggingface.py:88-93, 263-299: non-trivial fetcher refactor required. HuggingFaceFetcher currently consumes huggingface_hub.model_info() (repo metadata), not a parsed config.json. info.config exposes a partial view of the config and does not reliably surface Gemma 4's rope_parameters, layer_types, num_kv_shared_layers, global_head_dim, or attention_k_eq_v. Implementing the layer-type-aware formula in §A requires either:
  - (a) an additional HTTP fetch of raw/main/config.json (new dependency + URL-handling + 401/redirect semantics), or
  - (b) extending the parser to read the extra keys from info.config when available and fall back to the current formula when not.

  This is part of the "redesign gpumod KV estimation" follow-up (#3 below) and should be scoped as a fetcher refactor, not a one-line formula swap.
- src/gpumod/registry.py:159-163: total_vram_for_context becomes piecewise (min of ctx and per-layer-type budget).
- src/gpumod/preflight/vram_check.py: ctx-reduction suggestion must be compression-aware; add TriAttentionStatsCheck.
- src/gpumod/services/drivers/llamacpp.py: translate the kv_compression preset field into --triattention-budget, --triattention-window, --triattention-stats CLI flags on service start.
- presets/llm/gemma4-*.yaml: if TriAttention ever ships for Gemma 4, presets need a kv_compression block + mechanism to fetch/ship the .triattention sidecar.
- No changes needed to vllm driver (out of scope).
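The driver-side translation described for drivers/llamacpp.py might look like the sketch below. The function name is hypothetical, and the three flag names are the ones listed above; they have not been re-checked against the fork's argument parser here.

```python
from typing import Any, Mapping, Optional


def triattention_cli_args(kv_compression: Optional[Mapping[str, Any]]) -> list[str]:
    """Translate a preset's kv_compression block into llama.cpp-fork CLI arguments."""
    if kv_compression is None:
        return []  # block absent: compression off, no extra flags
    method = kv_compression.get("method")
    if method != "triattention":
        raise ValueError(f"unsupported kv_compression method: {method!r}")
    return [
        "--triattention-budget", str(kv_compression["budget"]),
        "--triattention-window", str(kv_compression["window"]),
        "--triattention-stats", str(kv_compression["stats_path"]),
    ]


args = triattention_cli_args({
    "method": "triattention",
    "budget": 4096,
    "stats_path": "/path/to/model.triattention",
    "window": 256,
})
```

Returning an empty list for a missing block keeps the command line of existing presets byte-for-byte unchanged.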
Options Considered¶
| Option | Pros | Cons |
|---|---|---|
| A. GO: integrate TriAttention for Gemma 4 now | Largest VRAM headroom gain on long-context Gemma 4 workloads; research is freshly published (potential differentiation) | Zero Gemma-family evidence; the paper's trig series assumes full-dim RoPE and must be re-derived for partial_rotary_factor=0.25; no calibration stats exist (first-ever attempt on Gemma); requires gpumod estimation + preflight redesign; high risk |
| B. KILL: abandon TriAttention entirely | No engineering cost; avoids fork maintenance | Discards a working CUDA compression method that is viable for Qwen/Llama users; gives up the paper's claimed 10.7× reduction on the reasoning workloads where it was evaluated |
| C. DEFER: pilot on Qwen3-32B first, Gemma 4 later (chosen) | CUDA path validated on a model where TriAttention is proven (pre-calibrated stats ship in repo); exercises gpumod driver/preflight plumbing with controlled risk; unblocks estimation-formula redesign independently | No immediate Gemma 4 benefit; requires follow-up research checkpoint |
Recommendation¶
DEFER. Pursue option C. Proceed with Qwen3-32B pilot to de-risk the integration surface; revisit Gemma 4 on or after 2026-08-01 once community data on p-RoPE compatibility or Gemma-family calibration stats emerges.
Do not commit gpumod to Gemma 4 TriAttention integration in the current quarter. Gate decision drivers:
- TriAttention paper published 2026-04-07, Gemma 4 released 2026-04-02 — both under 3 weeks old as of report date.
- Zero public evidence of TriAttention on Gemma-family.
- Gemma 4's partial_rotary_factor=0.25 p-RoPE breaks the paper's trigonometric-series-over-all-d/2-bands assumption and requires a novel derivation before integration.
- Interleaved sliding+global attention is not addressed by the paper. Sliding layers are already window-bounded; TriAttention would mostly help the 1/6 global layers, which are the incompatible p-RoPE layers.
- AIME24 showed a –15pp accuracy drop at a 2048-token budget (42.1% vs 57.1%); the non-reasoning quality cost on Gemma 4 is unknown.
- The CUDA implementation already works for Qwen/Llama (sm_75+, covers RTX 4090 sm_89), so we can capture the easy wins immediately via the Qwen3-32B pilot without committing to Gemma 4.
References¶
- TriAttention paper, arXiv 2604.04921: primary source; algorithm, experiments, formulae
- WeianMao/triattention (reference impl): calibration pipeline, pre-rolled stats
- atomicmilkshake/llama-cpp-turboquant: CUDA fork, branch feature/triattention
- domvox/triattention-ggml (HIP sibling): ROCm/HIP fork, branch feature/triattention-scoring
- ggml-org/llama.cpp discussion #20969 (TurboQuant): upstream integration status; 0 TriAttention PRs
- Hugging Face blog, "Welcome Gemma 4": dual-RoPE (p-RoPE on global layers), shared KV, sliding-window details
- Gemma 4 model card (Google AI): model sizes, layers, release date (2026-04-02)
- MarkTechPost, TriAttention overview: compression/accuracy tables

Gemma 4 configs directly fetched (raw/main/config.json):
- google/gemma-4-E4B
- google/gemma-4-26B-A4B
- google/gemma-4-31B-it
Follow-up Tasks¶
- Track upstream llama.cpp merge of TurboQuant + TriAttention (watch discussions #20969, #21526)
- Pilot TriAttention on Qwen3-32B (pre-calibrated stats available) to validate gpumod driver-flag plumbing and calibration-artifact fetching, in isolation from Gemma 4 questions
- Spike: redesign gpumod KV estimation to non-linear, layer-type-aware model (prerequisite for any KV-compression driver work; see §A and fetchers/huggingface.py implementation note)
- Research check-back on 2026-08-01: re-evaluate Gemma 4 + TriAttention once p-RoPE compatibility analyses or Gemma calibration stats emerge
- (Deferred) Gemma 4 TriAttention integration — unblock only after the KV-estimation redesign lands and the 2026-08-01 check-back shows progress
Acceptance Criteria¶
- All research questions answered (Q1.1–1.6, Q2a.1–2a.3, Q2b.1–2b.4, A–E)
- Every claim has a cited source
- Resource usage evaluated (VRAM, calibration cost)
- License verified as permissive (MIT for fork; Apache-2.0 for reference impl)
- Python/ecosystem compatibility checked (CUDA 13.x, sm_75+ covers RTX 4090 sm_89)
- Recommendation stated with rationale (DEFER; 6 drivers enumerated)
- Follow-up tickets to be created (see "Follow-up Tasks")