Research: KV Cache Estimation Redesign -- Linear to Compound Formula
Date: 2026-05-24
Spike Ticket: gpumod-cf8
Status: Final
Author: researcher (gpumod-cf8)
Prior art: docs/research/20260419_triattention_viability.md section A
Summary
gpumod's current KV cache estimation is strictly linear:
total_vram = base_vram + (context_size / 1000) * kv_cache_per_1k_tokens_mb. This
formula treats all transformer layers identically and assumes KV cache scales
linearly with context. This is wrong for hybrid-attention models (Gemma 3,
Gemma 3n, Gemma 4) where sliding-window layers cap their KV cache at the window
size, and for models with KV sharing where multiple layers reuse the same K/V
tensors.
The compound formula proposed here shows that the current linear formula overestimates KV cache by 73-88% for Gemma models, while the two formulas produce identical results for dense models (Qwen3, Llama 3.3). The compound formula is therefore a strict generalization that can replace the current formula without breaking backward compatibility.
Recommendation: Adopt option (a) -- keep the existing scalar
kv_cache_per_1k_tokens_mb for backward compatibility and add a new structured
kv_cache_profile field to ModelInfo. Implement a two-phase rollout: first
add the profile-aware estimation path, then backfill existing models.
Table of Contents
- Step 1: Call Site Audit
- Step 2: Compound Formula Derivation
- Step 3: Persistence Shape
- Step 4: Fetcher Refactor Prototype
- Step 5: Reference Values
- Step 6: Preflight Update Sketch
- Step 7: Deliverables
Step 1: Call Site Audit
All consumers of kv_cache_per_1k_tokens_mb identified by grepping src/
(verified 2026-05-24):
Producer (1 site)
| File | Line | Role | What it needs |
|---|---|---|---|
| fetchers/huggingface.py | 263-299 | Compute | _estimate_kv_cache_per_1k() -- the linear formula. Inputs: num_layers, hidden_size, num_kv_heads, num_attention_heads. Returns int (MB). |
Schema / definition (2 sites)
| File | Line | Role |
|---|---|---|
| models.py | 179 | ModelInfo.kv_cache_per_1k_tokens_mb: int \| None -- field definition |
| models.py | 207 | PresetConfig.kv_cache_per_1k: int \| None -- preset config field |
Storage (4 sites)
| File | Line | Role |
|---|---|---|
| db.py | 91 | SQL schema: kv_cache_per_1k_tokens_mb INTEGER column in models table |
| db.py | 222 | _row_to_model_info() -- reads column into ModelInfo |
| db.py | 599 | insert_model() -- INSERT statement column list |
| db.py | 609 | insert_model() -- parameter binding |
Consumer: VRAM estimation (3 sites)
| File | Line | Role | What it needs |
|---|---|---|---|
| registry.py | 129 | Docstring documents the formula | N/A |
| registry.py | 161 | Guard: if model.kv_cache_per_1k_tokens_mb is not None | Boolean check |
| registry.py | 162 | Core consumer: kv_addition = int((context_size / 1000) * model.kv_cache_per_1k_tokens_mb) | Needs either scalar or compound function |
Consumer: Display (4 sites)
| File | Line | Role | What it needs |
|---|---|---|---|
| cli_model.py | 34 | JSON serialization: "kv_cache_per_1k_tokens_mb": model.kv_cache_per_1k_tokens_mb | Any serializable value |
| cli_model.py | 133-134 | Table column: str(row["kv_cache_per_1k_tokens_mb"]) | Displayable string |
| cli_model.py | 184 | Panel output: f"[bold]KV/1K:[/bold] {model.kv_cache_per_1k_tokens_mb or '-'} MB" | Displayable value |
| mcp_resources.py | 175 | MCP resource: f"- **KV Cache per 1k tokens:** {model.kv_cache_per_1k_tokens_mb or '-'} MB" | Displayable value |
Consumer: Preset loading (from presets/*.yaml)
21 preset files contain kv_cache_per_1k: <int> values. These are loaded via
PresetConfig.kv_cache_per_1k and used to populate Service / ModelInfo
during preset registration. Preset values are static overrides, not computed.
Consumer: Test files (15+ sites)
Test files in tests/unit/ and tests/integration/ reference the field in
fixtures, assertions, and mock data. These will need updating if the field
type changes, but NOT if we add a new field alongside.
Consumer summary
The critical consumer is registry.py:162 -- the VRAM estimation formula.
All other consumers are either storage (passthrough), display (format for
humans), or tests. This means the migration can be done with minimal blast
radius: change the estimation logic in registry.py, extend the fetcher,
and add the new field. Display consumers can optionally show the profile
alongside the legacy scalar.
Step 2: Compound Formula Derivation
Current formula (linear, layer-homogeneous)
kv_per_1k = 2 * num_layers * num_kv_heads * head_dim * 2 * 1000 / 1024^2
total_kv = (ctx / 1000) * kv_per_1k
Where:
- 2: K + V tensors
- num_layers: all layers counted equally
- num_kv_heads: same for all layers
- head_dim: same for all layers
- 2: bytes per fp16 element
- 1000: tokens per 1K batch
- 1024^2: bytes to MB
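The linear formula can be written out as a short sketch. The function names here are illustrative and are not the signature of `_estimate_kv_cache_per_1k()`; the inputs are the Qwen3 32B values from Step 5.

```python
def linear_kv_per_1k_mb(num_layers: int, num_kv_heads: int, head_dim: int,
                        bytes_per_elem: int = 2) -> float:
    """KV cache in MB per 1,000 tokens, counting every layer identically."""
    # 2 tensors (K and V) per layer, each num_kv_heads * head_dim elements per token
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * 1000 / 1024**2


def linear_kv_total_mb(ctx: int, kv_per_1k_mb: float) -> float:
    """Total KV cache in MB, scaling linearly with context length."""
    return (ctx / 1000) * kv_per_1k_mb


qwen3_32b = linear_kv_per_1k_mb(num_layers=64, num_kv_heads=8, head_dim=128)
print(qwen3_32b)                              # 250.0
print(linear_kv_total_mb(32_000, qwen3_32b))  # 8000.0
```

The sketch returns a float; the production function returns an int, so stored values are whole numbers of MB.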
Compound formula (layer-type-aware, non-linear)
per_tok_sliding = kv_factor * num_kv_heads * head_dim * bytes_per_elem
per_tok_global = kv_factor * global_kv_heads * global_head_dim * bytes_per_elem
unique_sliding = n_sliding_layers - shared_sliding_layers
unique_global = n_global_layers - shared_global_layers
KV_total(ctx) = unique_sliding * min(ctx, sliding_window) * per_tok_sliding
+ unique_global * min(ctx, triattn_budget) * per_tok_global
Where:
- kv_factor = 1 if attention_k_eq_v else 2 (K=V halves storage)
- global_kv_heads = num_global_key_value_heads if present, else num_kv_heads
- global_head_dim = global_head_dim if present, else head_dim
- shared_*_layers = layers that reuse KV from an upstream layer of the same type
(last num_kv_shared_layers layers, distributed proportionally across types)
- triattn_budget = TriAttention fixed key budget (future, currently infinity)
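A minimal sketch of the compound formula, under stated assumptions: `KVProfile` is a hypothetical stdlib stand-in for the profile model proposed in Step 3, shared layers are split across types proportionally (the fallback heuristic), and elements are fp16. It is not the PoC implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVProfile:  # hypothetical stand-in for the proposed KVCacheProfile
    num_sliding_layers: int
    num_global_layers: int
    num_kv_heads: int
    head_dim: int
    sliding_window: Optional[int] = None      # None = no cap (infinity)
    num_kv_shared_layers: int = 0
    num_global_kv_heads: Optional[int] = None  # None = same as num_kv_heads
    global_head_dim: Optional[int] = None      # None = same as head_dim
    attention_k_eq_v: bool = False
    triattn_budget: Optional[int] = None       # None = no cap (infinity)


def compound_kv_mb(p: KVProfile, ctx: int, bytes_per_elem: int = 2) -> float:
    """Layer-type-aware KV cache size in MB at context length ctx."""
    kv_factor = 1 if p.attention_k_eq_v else 2
    g_heads = p.num_kv_heads if p.num_global_kv_heads is None else p.num_global_kv_heads
    g_dim = p.head_dim if p.global_head_dim is None else p.global_head_dim
    per_tok_sliding = kv_factor * p.num_kv_heads * p.head_dim * bytes_per_elem
    per_tok_global = kv_factor * g_heads * g_dim * bytes_per_elem

    # Assumption: shared layers are distributed proportionally across layer types.
    total_layers = p.num_sliding_layers + p.num_global_layers
    shared_sliding = (round(p.num_kv_shared_layers * p.num_sliding_layers / total_layers)
                      if total_layers else 0)
    shared_global = p.num_kv_shared_layers - shared_sliding
    unique_sliding = p.num_sliding_layers - shared_sliding
    unique_global = p.num_global_layers - shared_global

    sliding_tokens = ctx if p.sliding_window is None else min(ctx, p.sliding_window)
    global_tokens = ctx if p.triattn_budget is None else min(ctx, p.triattn_budget)
    total_bytes = (unique_sliding * sliding_tokens * per_tok_sliding
                   + unique_global * global_tokens * per_tok_global)
    return total_bytes / 1024**2


gemma_3n_e4b = KVProfile(num_sliding_layers=28, num_global_layers=7, num_kv_heads=2,
                         head_dim=256, sliding_window=512, num_kv_shared_layers=15)
print(compound_kv_mb(gemma_3n_e4b, 128_000))  # 1016.0
```

The inputs are the Gemma 3n E4B values from Step 5, and the result matches the reference table there.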
Per-architecture reduction proofs
Case 1: Dense (current behavior)
- layer_types = all full_attention
- sliding_window = None (treated as infinity)
- num_kv_shared_layers = 0
- triattn_budget = None (infinity)
unique_global = num_layers, unique_sliding = 0
KV(ctx) = num_layers * min(ctx, inf) * 2 * num_kv_heads * head_dim * 2
= num_layers * ctx * 2 * num_kv_heads * head_dim * 2
= current formula * ctx / 1000 (when expressed in MB/1K)
Case 2: Sliding-window only (all layers use sliding attention)
- layer_types = all sliding_attention
- sliding_window = W tokens
At ctx <= W, this equals the linear formula. At ctx > W, KV is capped at
num_layers * W * per_tok, which stays constant regardless of context growth.
Case 3: Hybrid (sliding + global)
- layer_types = mix of sliding_attention and full_attention
- sliding_window = W
Sliding layers are capped at W tokens once ctx > W, while global layers keep
growing with ctx. When W = inf, all layers are effectively global and the
formula reduces to Case 1.
Case 4: Hybrid + KV sharing (Gemma 3n style)
- num_kv_shared_layers = S > 0
Same as Case 3 but unique_sliding and unique_global are reduced by
the shared layer counts. When S = 0, reduces to Case 3.
Case 5: Hybrid + asymmetric head_dim (Gemma 4 26B/31B style)
- global_head_dim != head_dim
- num_global_key_value_heads != num_kv_heads
Global layers use per_tok_global with different dimensions. Sliding layers
use per_tok_sliding with base dimensions. When global_head_dim = head_dim
and global_kv_heads = kv_heads, reduces to Case 3/4.
Case 6: Hybrid + TriAttention budget (future)
- triattn_budget = B
KV(ctx) = unique_sliding * min(ctx, W) * per_tok_sliding
+ unique_global * min(ctx, B) * per_tok_global
At ctx > B, global-layer KV is also capped. When B = inf, reduces to
Case 3/4/5.
Backward-compatibility verification
Numerically verified for all 4 reference models: the compound formula with
sliding_window=None, triattn_budget=None, num_kv_shared_layers=0 produces
identical results to the current _estimate_kv_cache_per_1k() at ctx = 1K, 8K,
32K, 128K. See PoC output in Step 5.
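The check described above can be sketched as follows. This is an illustration rather than the PoC itself: both functions are simplified (uniform head dimensions, no KV sharing), and the layer counts are the Step 5 reference values.

```python
import math


def linear_mb(ctx, num_layers, kv_heads, head_dim):
    """Current linear formula: MB per 1K tokens times ctx / 1000."""
    kv_per_1k = 2 * num_layers * kv_heads * head_dim * 2 * 1000 / 1024**2
    return (ctx / 1000) * kv_per_1k


def compound_mb(ctx, n_sliding, n_global, kv_heads, head_dim,
                sliding_window=math.inf, triattn_budget=math.inf):
    """Compound formula with uniform head dims and no shared layers."""
    per_tok = 2 * kv_heads * head_dim * 2  # K + V, fp16
    tokens = n_sliding * min(ctx, sliding_window) + n_global * min(ctx, triattn_budget)
    return tokens * per_tok / 1024**2


# (sliding layers, global layers, kv heads, head_dim)
MODELS = {
    "Gemma 3n E4B": (28, 7, 2, 256),
    "Gemma 3 27B": (52, 10, 16, 128),
    "Qwen3 32B": (0, 64, 8, 128),
    "Llama 3.3 70B": (0, 80, 8, 128),
}

for name, (n_sliding, n_global, heads, dim) in MODELS.items():
    for ctx in (1_000, 8_000, 32_000, 128_000):
        lin = linear_mb(ctx, n_sliding + n_global, heads, dim)
        # Caps disabled (infinite window and budget): must match the linear result
        cmp_ = compound_mb(ctx, n_sliding, n_global, heads, dim)
        assert math.isclose(lin, cmp_, abs_tol=0.01), (name, ctx, lin, cmp_)
print("PASS")
```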
Step 3: Persistence Shape
Option (a): Additive -- keep scalar, add structured profile
from pydantic import BaseModel, ConfigDict


class KVCacheProfile(BaseModel):
    """Structured KV cache estimation data for hybrid-attention models."""

    model_config = ConfigDict(extra="forbid")

    num_sliding_layers: int = 0
    num_global_layers: int = 0
    num_kv_shared_layers: int = 0
    sliding_window: int | None = None  # None = no sliding window
    head_dim: int = 128
    global_head_dim: int | None = None  # None = same as head_dim
    num_kv_heads: int = 1
    num_global_kv_heads: int | None = None  # None = same as num_kv_heads
    attention_k_eq_v: bool = False
    triattn_budget: int | None = None  # None = no TriAttention
    # Redundant but useful for display and backward compat
    kv_per_1k_at_inf: int | None = None  # linear-equivalent rate


class ModelInfo(BaseModel):
    # ... existing fields unchanged ...
    kv_cache_per_1k_tokens_mb: int | None = None  # KEPT for backward compat
    kv_cache_profile: KVCacheProfile | None = None  # NEW, optional
DB migration:
- Add kv_cache_profile TEXT column to models table (JSON-serialized)
- One ALTER TABLE ... ADD COLUMN migration (same pattern as v2->v3)
- Existing rows have kv_cache_profile = NULL -- estimation falls back to
kv_cache_per_1k_tokens_mb scalar
Estimation logic change in registry.py:
# Pseudocode for new estimate_vram()
if model.kv_cache_profile is not None:
    kv_addition = compute_compound_kv(model.kv_cache_profile, context_size)
elif model.kv_cache_per_1k_tokens_mb is not None:
    kv_addition = int((context_size / 1000) * model.kv_cache_per_1k_tokens_mb)
else:
    kv_addition = 0
total = base_vram + kv_addition
Option (b): Replace scalar with struct
Replace kv_cache_per_1k_tokens_mb entirely with kv_cache_profile.
Pros:
- Cleaner schema -- one source of truth
- No ambiguity about which field to use
Cons:
- Breaking change to ModelInfo -- all 15+ test sites need updating
- Breaking change to PresetConfig -- all 21 preset YAML files need updating
- CLI and MCP display code needs rewriting
- DB column rename or migration that drops and re-creates data
- External tools that parse ModelInfo JSON break
Recommendation: Option (a)
Rationale:
1. Zero migration cost for existing data. The 21 preset files and all test
fixtures continue to work unchanged. New models populated via the fetcher
get both fields; old models use the scalar fallback.
2. Gradual rollout. The fetcher can start populating kv_cache_profile
immediately; registry.py can prefer it when available. No flag day.
3. Display backward compat. CLI and MCP already show kv_cache_per_1k_tokens_mb.
The profile is additional detail, not a replacement for the summary metric.
4. DB migration is trivial. One ALTER TABLE models ADD COLUMN kv_cache_profile TEXT
with NULL default. Same pattern used for preflight_required and compat
columns (see db.py:137-152).
Estimated migration cost:
- models.py: +15 lines (new Pydantic model + field)
- db.py: +10 lines (column, read, write)
- registry.py: +20 lines (compound estimation logic)
- fetchers/huggingface.py: +40 lines (parse config.json, build profile)
- Tests: +30 lines (new test cases for compound formula)
- Presets: 0 changes
- CLI/MCP: 0 required changes (optional: show profile detail)
Total: ~115 lines added, 0 lines changed in existing code.
Step 4: Fetcher Refactor Prototype
The current HuggingFaceFetcher.fetch() uses huggingface_hub.model_info() which
returns a partial config view via info.config. This dict contains the top-level
config fields (num_hidden_layers, hidden_size, num_attention_heads,
num_key_value_heads) but does NOT reliably surface:
- layer_types (only present in Gemma 3n, not Gemma 3)
- sliding_window_pattern (not in config.json; derived by transformers)
- sliding_window (present but not in info.config for all models)
- num_kv_shared_layers (Gemma 3n/4 specific)
- global_head_dim (Gemma 4 specific)
- num_global_key_value_heads (Gemma 4 specific)
- attention_k_eq_v (Gemma 4 specific)
- text_config nesting (multimodal models like Gemma 3/3n/4 nest text config)
Solution: additional hf_hub_download call for config.json
hf_hub_download is already a dependency (from huggingface_hub). It downloads
and caches config.json locally. This is a single HTTP GET with local caching,
adding ~50ms latency on cache miss and 0ms on cache hit.
Prototype code
# In fetchers/huggingface.py -- proposed addition (methods of HuggingFaceFetcher).
# KVCacheProfile is the Pydantic model proposed in Step 3.
import json
from collections import Counter

from huggingface_hub import hf_hub_download


def _fetch_raw_config(self, model_id: str) -> dict | None:
    """Fetch and parse raw config.json from HuggingFace Hub.

    Uses hf_hub_download which caches locally. Falls back to None
    on any error (gated repos, network issues, missing file).
    """
    try:
        path = hf_hub_download(repo_id=model_id, filename="config.json")
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except Exception:
        return None


def _build_kv_cache_profile(self, config: dict) -> KVCacheProfile | None:
    """Build KVCacheProfile from raw config.json.

    Handles both top-level configs (Qwen, Llama) and nested
    text_config (Gemma 3, 3n, 4).
    """
    # Resolve text_config nesting for multimodal models
    tc = config.get("text_config", config)

    num_layers = tc.get("num_hidden_layers")
    num_heads = tc.get("num_attention_heads")
    num_kv_heads = tc.get("num_key_value_heads", num_heads)
    hidden_size = tc.get("hidden_size")
    head_dim = tc.get("head_dim")

    if num_layers is None or num_heads is None or hidden_size is None:
        return None
    if head_dim is None:
        if num_heads <= 0:
            return None  # cannot derive head_dim; the profile requires an int
        head_dim = hidden_size // num_heads

    # Determine layer types
    layer_types = tc.get("layer_types")
    sliding_window = tc.get("sliding_window")
    sliding_window_pattern = tc.get("sliding_window_pattern")
    model_type = tc.get("model_type", "")

    # If no layer_types and no pattern but has sliding_window,
    # use the transformers default pattern of 6 for Gemma 3.
    # Source: transformers v5.1.0 configuration_gemma3.py:286
    if (
        layer_types is None
        and sliding_window_pattern is None
        and sliding_window is not None
        and "gemma3" in model_type
    ):
        sliding_window_pattern = 6

    # If no explicit layer_types but a pattern is known, derive them:
    # every sliding_window_pattern-th layer is global, the rest are sliding.
    if layer_types is None and sliding_window_pattern is not None:
        layer_types = [
            "sliding_attention" if (i + 1) % sliding_window_pattern else "full_attention"
            for i in range(num_layers)
        ]

    # Count layer types
    if layer_types is not None:
        counts = Counter(layer_types)
        n_sliding = counts.get("sliding_attention", 0)
        n_global = counts.get("full_attention", 0)
    else:
        n_sliding = 0
        n_global = num_layers

    return KVCacheProfile(
        num_sliding_layers=n_sliding,
        num_global_layers=n_global,
        num_kv_shared_layers=tc.get("num_kv_shared_layers", 0),
        sliding_window=sliding_window if isinstance(sliding_window, int) else None,
        head_dim=head_dim,
        global_head_dim=tc.get("global_head_dim"),
        num_kv_heads=num_kv_heads,
        num_global_kv_heads=tc.get("num_global_key_value_heads"),
        attention_k_eq_v=tc.get("attention_k_eq_v", False),
        triattn_budget=None,  # future: from preset kv_compression block
    )
Key design decisions in the prototype
- text_config resolution: Multimodal models (Gemma 3/3n/4) nest text model config under text_config. The prototype falls through to top-level config for single-modality models (Qwen, Llama).
- Gemma 3 pattern inference: Gemma 3's config.json does NOT contain layer_types or sliding_window_pattern. The 5:1 sliding/global pattern is derived at model load time by the Gemma3TextConfig class using sliding_window_pattern=6 (default). The fetcher must hardcode this knowledge for model_type=gemma3_text. Source: transformers v5.1.0 configuration_gemma3.py:286-292.
- Graceful fallback: If _fetch_raw_config() fails (gated repo, network error), the fetcher still produces a valid ModelInfo with the existing scalar kv_cache_per_1k_tokens_mb and kv_cache_profile = None.
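The pattern inference can be checked in isolation. The helper name `derive_layer_types` is illustrative; the rule it applies (every `sliding_window_pattern`-th layer is global, the rest are sliding) is the one used in the prototype.

```python
from collections import Counter


def derive_layer_types(num_layers: int, sliding_window_pattern: int) -> list[str]:
    """Every pattern-th layer (1-indexed) is global; all others are sliding."""
    return [
        "sliding_attention" if (i + 1) % sliding_window_pattern else "full_attention"
        for i in range(num_layers)
    ]


# Gemma 3 27B: 62 layers with the transformers default pattern of 6
counts = Counter(derive_layer_types(62, 6))
print(counts["sliding_attention"], counts["full_attention"])  # 52 10
```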
Step 5: Reference Values
Model configurations (from config.json, fetched 2026-05-24)
| Field | Gemma 3n E4B | Gemma 3 27B | Qwen3 32B | Llama 3.3 70B |
|---|---|---|---|---|
| Source | google/gemma-3n-E4B-it | google/gemma-3-27b-it | Qwen/Qwen3-32B | unsloth/Llama-3.3-70B-Instruct |
| num_hidden_layers | 35 | 62 | 64 | 80 |
| num_attention_heads | 8 | 32 | 64 | 64 |
| num_key_value_heads | 2 | 16 | 8 | 8 |
| head_dim | 256 | 128 | 128 | 128 |
| hidden_size | 2048 | 5376 | 5120 | 8192 |
| sliding_window | 512 | 1024 | None (null) | None |
| sliding_window_pattern | 5 (explicit layer_types) | 6 (inferred from transformers default) | N/A | N/A |
| Sliding layers | 28 | 52 | 0 | 0 |
| Global layers | 7 | 10 | 64 | 80 |
| num_kv_shared_layers | 15 | 0 | 0 | 0 |
| Unique sliding | 16 | 52 | 0 | 0 |
| Unique global | 4 | 10 | 64 | 80 |
| attention_k_eq_v | false | false | false | false |
| model_type | gemma3n_text | gemma3_text | qwen3 | llama |
| Config nesting | text_config | text_config | top-level | top-level |
Notes:
- Gemma 3n E4B: layer_types is explicit in config.json with 28 sliding + 7 full.
num_kv_shared_layers=15 means the last 15 layers reuse KV from an upstream layer.
Actual distribution: 12 shared sliding + 3 shared global = 15 total.
Source: Botmonster Gemma 4 architecture.
- Gemma 3 27B: config.json has sliding_window=1024 but NO layer_types field.
The 5:1 pattern (52 sliding + 10 global) is derived from sliding_window_pattern=6
which is a transformers default, not a config.json field.
Source: transformers v5.1.0 Gemma3TextConfig.
- Llama 3.3 70B: meta-llama/Llama-3.3-70B-Instruct is gated. Config fetched from
unsloth/Llama-3.3-70B-Instruct mirror. Architecture fields are identical.
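The 12 + 3 split for Gemma 3n E4B follows from applying the "last N layers" rule to the layer list. In this sketch the layer list is reconstructed from the table above (35 layers, every fifth one global), which is an assumption about ordering rather than a copy of the config.json contents; `shared_layers_by_type` is a hypothetical helper.

```python
from collections import Counter


def shared_layers_by_type(layer_types: list[str], num_kv_shared_layers: int) -> Counter:
    """Count, per layer type, the trailing layers that reuse upstream KV."""
    if num_kv_shared_layers <= 0:
        return Counter()  # guard: layer_types[-0:] would select every layer
    return Counter(layer_types[-num_kv_shared_layers:])


# Assumed ordering: layers 5, 10, ..., 35 are global, the rest are sliding
layer_types = [
    "full_attention" if (i + 1) % 5 == 0 else "sliding_attention" for i in range(35)
]
totals = Counter(layer_types)
shared = shared_layers_by_type(layer_types, 15)

print(shared["sliding_attention"], shared["full_attention"])  # 12 3
print(totals["sliding_attention"] - shared["sliding_attention"],
      totals["full_attention"] - shared["full_attention"])    # 16 4
```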
KV per 1K tokens (linear formula, current gpumod)
| Model | Formula | MB/1K |
|---|---|---|
| Gemma 3n E4B | 2 * 35 * 2 * 256 * 2 * 1000 / 1024^2 | 69 |
| Gemma 3 27B | 2 * 62 * 16 * 128 * 2 * 1000 / 1024^2 | 485 |
| Qwen3 32B | 2 * 64 * 8 * 128 * 2 * 1000 / 1024^2 | 250 |
| Llama 3.3 70B | 2 * 80 * 8 * 128 * 2 * 1000 / 1024^2 | 313 |
Reference value table: Linear vs Compound KV cache (MB)
| Model | Context | Linear MB | Compound MB | Delta MB | Overestimate % |
|---|---|---|---|---|---|
| Gemma 3n E4B | 8,000 | 546.9 | 78.5 | 468.4 | 85.6% |
| Gemma 3n E4B | 32,000 | 2,187.5 | 266.0 | 1,921.5 | 87.8% |
| Gemma 3n E4B | 128,000 | 8,750.0 | 1,016.0 | 7,734.0 | 88.4% |
| Gemma 3 27B | 8,000 | 3,875.0 | 1,041.0 | 2,834.0 | 73.1% |
| Gemma 3 27B | 32,000 | 15,500.0 | 2,916.0 | 12,584.0 | 81.2% |
| Gemma 3 27B | 128,000 | 62,000.0 | 10,416.0 | 51,584.0 | 83.2% |
| Qwen3 32B | 8,000 | 2,000.0 | 2,000.0 | 0.0 | 0.0% |
| Qwen3 32B | 32,000 | 8,000.0 | 8,000.0 | 0.0 | 0.0% |
| Qwen3 32B | 128,000 | 32,000.0 | 32,000.0 | 0.0 | 0.0% |
| Llama 3.3 70B | 8,000 | 2,500.0 | 2,500.0 | 0.0 | 0.0% |
| Llama 3.3 70B | 32,000 | 10,000.0 | 10,000.0 | 0.0 | 0.0% |
| Llama 3.3 70B | 128,000 | 40,000.0 | 40,000.0 | 0.0 | 0.0% |
Effective KV MB per 1K tokens (varies with context for hybrid models)
| Model | Context | Linear/1K | Compound/1K |
|---|---|---|---|
| Gemma 3n E4B | 8,000 | 68.4 | 9.8 |
| Gemma 3n E4B | 32,000 | 68.4 | 8.3 |
| Gemma 3n E4B | 128,000 | 68.4 | 7.9 |
| Gemma 3 27B | 8,000 | 484.4 | 130.1 |
| Gemma 3 27B | 32,000 | 484.4 | 91.1 |
| Gemma 3 27B | 128,000 | 484.4 | 81.4 |
| Qwen3 32B | 8K-128K | 250.0 | 250.0 |
| Llama 3.3 70B | 8K-128K | 312.5 | 312.5 |
Key insight: For Gemma 3n E4B, the effective KV/1K drops from 68.4 (linear) to 7.9-9.8 (compound) because most layers are sliding-window-bounded at 512 tokens, and 15 layers share KV entirely. The current formula reports this model as needing 8,750 MB of KV cache at 128K context when it actually needs only 1,016 MB -- a 7.7 GB overestimate that would prevent deploying it on a 24 GB GPU when it actually fits.
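The 128K figures for Gemma 3n E4B can be reproduced by hand from the Step 5 configuration values. The symbol $b_{\text{tok}}$ (bytes per token per layer) is shorthand introduced here for the worked arithmetic only.

```latex
\begin{align*}
b_{\text{tok}} &= 2 \cdot 2 \cdot 256 \cdot 2 = 2048 \ \text{bytes} \\
\text{KV}_{\text{sliding}} &= 16 \cdot \min(128000,\, 512) \cdot 2048
  = 16{,}777{,}216 \ \text{bytes} = 16 \ \text{MB} \\
\text{KV}_{\text{global}} &= 4 \cdot 128000 \cdot 2048
  = 1{,}048{,}576{,}000 \ \text{bytes} = 1000 \ \text{MB} \\
\text{KV}_{\text{compound}} &= 16 + 1000 = 1016 \ \text{MB} \\
\text{KV}_{\text{linear}} &= 35 \cdot 128000 \cdot 2048
  = 9{,}175{,}040{,}000 \ \text{bytes} = 8750 \ \text{MB}
\end{align*}
```

The sliding term is fixed at 16 MB for any context above 512 tokens, so nearly all growth comes from the 4 unique global layers.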
Backward compatibility verification
All 4 models tested: compound formula with sliding_window=None (infinity),
triattn_budget=None (infinity), num_kv_shared_layers=0 produces values
identical (within 0.01 MB) to the current linear formula at ctx = 1K, 8K, 32K,
128K. PASS.
Step 6: Preflight Update Sketch
Current code (vram_check.py:107-119)
# Strategy 2: Reduce context size
# KV cache roughly scales with context size
if ctx_size > 4096:
    # Halving context roughly halves KV cache
    suggested_ctx = ctx_size // 2
    kv_savings = overage // 2  # Conservative estimate
This heuristic is wrong for hybrid-attention models. For Gemma 3 27B at ctx=128K, halving context to 64K does NOT halve KV cache:
- Linear formula: 62,000 MB -> 31,000 MB (halved, as claimed)
- Compound formula: 10,416 MB -> 5,416 MB (48% reduction, not 50%)

At ctx=8K, the difference is more dramatic:
- Linear: 3,875 MB -> 1,937.5 MB (halved)
- Compound: 1,041 MB -> 728.5 MB (30% reduction, NOT halved)
The sliding layers are already at their window cap (1024 tokens). Halving context from 8K to 4K only reduces the global layers' contribution, which is 10 out of 62 layers.
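The effect of halving context can be computed directly from the compound formula. This sketch hardcodes the Gemma 3 27B values from Step 5 (52 sliding layers, 10 global layers, window 1024, 16 KV heads, head_dim 128, fp16); the function names are illustrative.

```python
def gemma3_27b_kv_mb(ctx: int) -> float:
    """Compound KV cache in MB for Gemma 3 27B at context length ctx."""
    per_tok = 2 * 16 * 128 * 2                    # K + V bytes per token per layer
    sliding = 52 * min(ctx, 1024) * per_tok       # capped at the sliding window
    global_ = 10 * ctx * per_tok                  # grows linearly with ctx
    return (sliding + global_) / 1024**2


def halving_reduction(ctx: int) -> float:
    """Fraction of KV cache saved by halving the context."""
    before, after = gemma3_27b_kv_mb(ctx), gemma3_27b_kv_mb(ctx // 2)
    return (before - after) / before


print(f"{halving_reduction(128_000):.1%}")  # 48.0%
print(f"{halving_reduction(8_000):.1%}")
```

The shorter the context, the larger the fixed sliding-window share and the smaller the saving from halving.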
Proposed replacement (description, not code change)
The ctx-reduction suggestion in VRAMSuggestion.for_llamacpp() should be
replaced with a profile-aware calculation:
- If the model has a kv_cache_profile:
  - Compute KV at current ctx and at proposed ctx using compound formula
  - Report actual savings: "Reduce ctx from 32K to 16K to save ~X MB"
  - If sliding window caps mean ctx reduction yields <10% savings, skip this suggestion entirely and suggest quantization instead
- If the model has only scalar kv_cache_per_1k_tokens_mb:
  - Fall back to current heuristic (halving ctx roughly halves KV)
- Add a new suggestion type for hybrid models: "KV cache is dominated by sliding-window layers capped at {W} tokens. Reducing context below {W} will have minimal KV impact. Consider a smaller quantization instead."
Additional preflight check
For models with kv_cache_profile, the VRAMCheck should also:
- Warn if context_size < sliding_window (wasted attention capacity)
- Report the breakdown: "KV: {X} MB sliding + {Y} MB global = {Z} MB total"
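One possible shape for these two checks, as a sketch only: the function names and message wording are hypothetical and not part of the existing VRAMCheck API. The example values are the Gemma 3 27B figures from Step 5.

```python
def kv_breakdown_mb(ctx, unique_sliding, unique_global, sliding_window, per_tok_bytes):
    """Return (sliding_mb, global_mb); assumes no TriAttention budget."""
    sliding_mb = unique_sliding * min(ctx, sliding_window) * per_tok_bytes / 1024**2
    global_mb = unique_global * ctx * per_tok_bytes / 1024**2
    return sliding_mb, global_mb


def preflight_messages(ctx, unique_sliding, unique_global, sliding_window, per_tok_bytes):
    """Build the warning (if any) and the per-layer-type breakdown line."""
    messages = []
    if ctx < sliding_window:
        messages.append(
            f"Warning: context_size {ctx} is below sliding_window {sliding_window}"
        )
    sliding_mb, global_mb = kv_breakdown_mb(
        ctx, unique_sliding, unique_global, sliding_window, per_tok_bytes
    )
    messages.append(
        f"KV: {sliding_mb:.1f} MB sliding + {global_mb:.1f} MB global"
        f" = {sliding_mb + global_mb:.1f} MB total"
    )
    return messages


GEMMA3_27B = dict(unique_sliding=52, unique_global=10, sliding_window=1024,
                  per_tok_bytes=2 * 16 * 128 * 2)
for line in preflight_messages(32_000, **GEMMA3_27B):
    print(line)
# KV: 416.0 MB sliding + 2500.0 MB global = 2916.0 MB total
```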
Step 7: Deliverables
1. Research report
This document: docs/research/20260524_kv_estimation_redesign.md
2. Proposed ModelInfo schema (additive, backward-compatible)
class KVCacheProfile(BaseModel):
    """Structured KV cache data for layer-type-aware estimation."""

    model_config = ConfigDict(extra="forbid")

    num_sliding_layers: int = 0
    num_global_layers: int = 0
    num_kv_shared_layers: int = 0
    sliding_window: int | None = None
    head_dim: int = 128
    global_head_dim: int | None = None
    num_kv_heads: int = 1
    num_global_kv_heads: int | None = None
    attention_k_eq_v: bool = False
    triattn_budget: int | None = None
    kv_per_1k_at_inf: int | None = None  # backward-compat: linear-equivalent rate


class ModelInfo(BaseModel):
    # ... existing fields UNCHANGED ...
    kv_cache_per_1k_tokens_mb: int | None = None  # KEPT
    kv_cache_profile: KVCacheProfile | None = None  # NEW, optional
DB migration: ALTER TABLE models ADD COLUMN kv_cache_profile TEXT (JSON blob).
3. PoC demonstrating compound formula
File: docs/research/poc/kv_estimation_compound.py
Demonstrates:
- Compound formula implementation with all 6 architecture cases
- Backward-compatibility verification (compound with inf windows = linear)
- Reference value computation for all 4 models
- Detail breakdown showing per-layer-type contributions
Run: uv run python docs/research/poc/kv_estimation_compound.py
4. Reference-value table
See Step 5 above. Key finding: current formula overestimates KV cache by 73-88% for Gemma hybrid-attention models.
5. Persistence shape recommendation
Option (a): Additive. Keep kv_cache_per_1k_tokens_mb, add
kv_cache_profile. Migration cost: ~115 lines added, 0 existing lines changed,
0 preset files changed. See Step 3 for full rationale.
Risks and Open Questions
- Gemma 3 pattern inference. The 5:1 sliding/global pattern for Gemma 3 is NOT in config.json -- it's a transformers default (sliding_window_pattern=6). If Google changes this default in a future transformers release, the fetcher's hardcoded inference will produce wrong results. Mitigation: pin the pattern inference to known model_type values and log a warning for unknown types.
- num_kv_shared_layers distribution. The proportional distribution (shared layers distributed across types based on layer ratio) was verified to match the actual Gemma 3n E4B distribution exactly. But this is one model -- future models might distribute shared layers differently. Mitigation: when layer_types is available, count the actual shared layers by type using the "last N layers" rule. The proportional heuristic is the fallback.
- attention_k_eq_v impact. None of the 4 reference models use K=V sharing. Gemma 4 models (26B-A4B, 31B) do. This halves KV storage but was not numerically verified in this PoC because the Gemma 4 models were not in scope. Follow-up ticket should verify.
- KV cache dtype. The formula assumes fp16 (2 bytes per element). Some backends (llama.cpp with --cache-type-k q8_0) use quantized KV caches. The compound formula does not account for this. It is [unverified] whether this should be a profile field or a runtime parameter.
- TriAttention budget integration. The triattn_budget field is included in the formula for forward compatibility but has no current producer. When TriAttention is integrated (see docs/research/20260419_triattention_viability.md), the budget will come from the preset kv_compression block, not from config.json. The formula already handles this case correctly.
References
Primary sources (config.json, fetched 2026-05-24)
- google/gemma-3n-E4B-it config.json text_config -- layer_types, sliding_window, num_kv_shared_layers
- google/gemma-3-27b-it config.json text_config -- sliding_window, head_dim, num_kv_heads
- Qwen/Qwen3-32B config.json -- dense architecture, sliding_window=null
- unsloth/Llama-3.3-70B-Instruct config.json -- dense architecture (mirror of gated meta-llama repo)
Transformers source
- Gemma3TextConfig sliding_window_pattern default -- sliding_window_pattern=6 at line 286
- Gemma3n HF docs -- KV cache sharing, architecture overview
- Gemma 3 HF docs -- sliding window pattern
Architecture documentation
- Botmonster: Gemma 4 Architecture -- num_kv_shared_layers semantics
- HF Blog: Welcome Gemma 3 -- 5:1 sliding/global pattern
- HF Blog: Welcome Gemma 4 -- dual RoPE, KV sharing, asymmetric head_dim
Prior gpumod research
- docs/research/20260419_triattention_viability.md section A -- compound formula first proposal
- src/gpumod/fetchers/huggingface.py:263-299 -- current linear formula
- src/gpumod/registry.py:159-163 -- current VRAM estimation consumer
huggingface_hub API
- hf_hub_download documentation -- file download with caching
Follow-up Tasks
- Implement KVCacheProfile model and DB migration -- add the new Pydantic model and ALTER TABLE migration. ~25 lines.
- Extend HuggingFaceFetcher -- add _fetch_raw_config() and _build_kv_cache_profile() per Step 4 prototype. ~50 lines.
- Update registry.py estimation -- prefer kv_cache_profile when available, fall back to scalar. ~20 lines.
- Update preflight ctx-reduction suggestion -- profile-aware savings calculation per Step 6. ~20 lines.
- Verify Gemma 4 models -- the attention_k_eq_v, global_head_dim, and num_global_key_value_heads fields need numerical verification on Gemma 4 26B-A4B and 31B configs.
- KV cache dtype support -- investigate whether --cache-type-k q8_0 (llama.cpp) and FP8 KV cache (vLLM) should be represented in the profile.