Gemma 4 26B-A4B — QAT UD-Q4_K_XL vs imatrix UD-IQ4_XS¶
Date: 2026-06-07
Tickets: gpumod-rjkx (bench); gpumod-p2gj (CLI exit-code fix landed mid-bench)
Question: Does Unsloth's QAT-derived Gemma 4 26B-A4B GGUF preserve or improve on the current Hermes-agent baseline (imatrix UD-IQ4_XS), and is it worth swapping modes/hermes-agent.yaml:28?
TL;DR¶
| Variant | Mean | σ | Min/Max | 95% CI | TPS | VRAM idle | GGUF size | Verdict |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B QAT UD-Q4_K_XL | 100.00 | 0.00 | 100/100 | [100.0, 100.0] | 168.9 | 17567 MB | 14.25 GB | Quality tied, +20% TPS — swap to this |
| Gemma 4 26B-A4B imatrix UD-IQ4_XS † | 100.00 | 0.00 | 100/100 | [100.0, 100.0] | 140.7 | 15665 MB | 12.7 GB | Hermes baseline (perfect quality, see 20260603) |
† Imatrix row reused from docs/benchmarks/20260603_gemma4_12b_vs_qwen36_35b_a3b_mtp/result_gemma4-26b-a4b-q4.json (same b9500 binary, same GEMMA_CODING sampler, same --cache-type-k/v q8_0, same enable_thinking, same v2 coding suite). Different host-load window — flag for TPS interpretation (see Methodology Caveats).
Three headlines:
- Quality is preserved exactly. QAT scores 100/100 on every one of 15 iterations with σ=0 — bit-for-bit the same headline number as the imatrix baseline. All five levels (including the L5 source-inspection composition test) pass 15/15. QAT does what it says on the tin: at the precision it was trained against, accuracy holds.
- TPS is +20% (+28 TPS). QAT 168.9 mean vs imatrix 140.7 — a far larger delta than the σ noise floor (QAT σ=0.73, imatrix σ=0.32, combined ≪ 1 TPS). The real reason is more nuanced than "QAT is faster": Unsloth's QAT GGUF ships as a Q4_K_XL (K-quant, no runtime lookups), while the imatrix variant uses IQ4_XS (i-quant with per-block lookup tables that cost cycles per token). The +20% mostly comes from the quant-format swap; QAT itself "earns" the right to ship at the higher-throughput Q4_K layout without losing accuracy.
- The cost is +1.9 GB VRAM and +1.5 GB GGUF. Steady-state VRAM 17567 vs 15665 MB. Still comfortably inside the 24 GB ceiling — with multimodal mmproj BF16 (~1.1 GB) loaded, peak VRAM during a benchmark iteration was ~18 GB. Six GB of headroom remain for image-token KV growth on multi-image conversations.
See Recommendation for the mode-swap decision.
Setup¶
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 5700G (16 threads) |
| RAM | 32 GB DDR4 |
| GPU | NVIDIA GeForce RTX 4090 (24 GB VRAM) |
| OS | Ubuntu 24.04.4 LTS |
| Driver | NVIDIA 580.65.06 |
| CUDA | 12.0 |
| llama.cpp | b9500 (3d1998634, built 2026-06-04) — same binary as the 20260603 imatrix run, no quant-format changes between |
| Bench commit | 157febf (HEAD at run time) — QAT preset + runner edits + driver pre-fix; CLI exit-code fix landed mid-bench |
VRAM isolation enforced — gpumod mode switch blank stopped vllm-embedding-code before the QAT model start; nothing else was GPU-resident during the run (see run_bench.sh).
Why this benchmark¶
Unsloth shipped a QAT (Quantization-Aware Training) GGUF for Gemma 4 26B-A4B at unsloth/gemma-4-26B-A4B-it-qat-GGUF on top of google/gemma-4-26B-A4B-it-qat-q4_0-unquantized. Per Unsloth's QAT docs, they ship only one quant tier (UD-Q4_K_XL) because "precisions higher than the uploaded UD-Q4_K_XL version degrade accuracy" — QAT-derived weights are most consistent at the precision the training targeted.
The current Hermes-agent mode (modes/hermes-agent.yaml:28) uses gemma4-26b-a4b-q4 (imatrix UD-IQ4_XS, 12.7 GB GGUF). That row in the 20260603 bench scored 100/100 across 15 iterations with σ=0 — a tough baseline to improve on for quality. The question for QAT is:
- Does it match imatrix's perfect quality? (Expected: yes — QAT training preserves accuracy at the target precision.)
- Does it cost more or less TPS? (Unknown — UD-Q4_K_XL is a different K-quant block layout than IQ4_XS.)
- Is the +1.5 GB GGUF size (14.25 vs 12.7 GB) and +1.2 GB VRAM headroom hit acceptable?
- Net: keep imatrix, switch to QAT, or split (e.g. QAT for hermes-agent single-slot, imatrix for code mode's multi-slot variant)?
Models tested¶
| ID | Source | Architecture | Quant | File size | Context | Sampler | mmproj |
|---|---|---|---|---|---|---|---|
gemma4-26b-a4b-qat-q4 |
unsloth/gemma-4-26B-A4B-it-qat-GGUF |
MoE 26B / 4B active | QAT UD-Q4_K_XL | 14.25 GB | 131072 | GEMMA_CODING (temp 1.0, top_p 0.95, top_k 64, RP 1.05) | BF16 (reused) |
gemma4-26b-a4b-q4 † |
unsloth/gemma-4-26B-A4B-it-GGUF |
MoE 26B / 4B active | imatrix UD-IQ4_XS | 12.7 GB | 131072 | GEMMA_CODING | BF16 |
† Imatrix row not re-run for this bench. Reused from 20260603.
Both presets ship identical flags: --parallel 1 --threads 16 --cache-type-k q8_0 --cache-type-v q8_0 --temp 1.0 --top-p 0.95 --top-k 64 --chat-template-kwargs '{"enable_thinking":true}' --mmproj $HOME/bin/gemma-4-26B-A4B-it-mmproj-BF16.gguf. The QAT preset reuses the imatrix repo's mmproj-BF16.gguf (Unsloth ships the same vision encoder in both repos; reusing the existing file keeps the multimodal path byte-identical between arms). Presets:
presets/llm/gemma4-26b-a4b-qat-q4.yaml— new for this bench, port 7110,vram_mb: 20500presets/llm/gemma4-26b-a4b-q4.yaml— Hermes baseline, port 7109,vram_mb: 19500
Methodology¶
The bench harness is the v2 coding suite from scripts/run_qwen36_benchmark.py. Every iteration runs five levels:
| Level | Task | Points |
|---|---|---|
| L1 | Basic queue (add/get, FIFO) | 25 |
| L2 | Retry with exponential backoff | 25 |
| L3 | Priority scheduling | 25 |
| L4 | Find & fix concurrency bug | 15 |
| L5 | Compose Job + RetryPolicy + JobQueue (source-inspection) | 10 |
Validation: PytestValidator with 30 s per-level timeout, 900 s per-request client timeout, max_tokens=32768. The L5 source-inspection assertion (gpumod-pdtn fix landed 2026-06-04) is the hardest test in the suite — any model that scores 100 across 15 iters on L5 is doing real composition, not pattern-matching.
QAT-specific test additions:
- New entry in
scripts/run_qwen36_benchmark.pyMODELS dict withsampler=GEMMA_CODING. - New CLI choice
gemma4-26b-a4b-qat-q4. - Driver script
run_bench.sh:mode switch blank→ 15 s quiesce wait → start QAT service → wait for/health→ 15 iters → stop. The 15 s quiesce wait is a hard-won learning from this bench's first failed launch (see Lessons).
Results¶
Summary¶
| Variant | Mean | σ | Min/Max | 95% CI | TPS mean | TPS min/max | TPS σ |
|---|---|---|---|---|---|---|---|
| QAT UD-Q4_K_XL | 100.00 | 0.00 | 100/100 | [100.0, 100.0] | 168.88 | 166.8 / 169.8 | 0.73 |
| imatrix IQ4_XS † | 100.00 | 0.00 | 100/100 | [100.0, 100.0] | 140.72 | 140.0 / 141.2 | 0.32 |
| Δ (QAT − imatrix) | 0.00 | 0.00 | — | — | +28.16 (+20.0 %) | — | — |
Per-level pass rates¶
| Level | Task | QAT UD-Q4_K_XL | imatrix IQ4_XS |
|---|---|---|---|
| L1 | Basic queue | 100% (15/15) | 100% (15/15) |
| L2 | Retry with backoff | 100% (15/15) | 100% (15/15) |
| L3 | Priority scheduling | 100% (15/15) | 100% (15/15) |
| L4 | Concurrency bug fix | 100% (15/15) | 100% (15/15) |
| L5 | Compose Job + RetryPolicy + JobQueue | 100% (15/15) | 100% (15/15) |
Per-iteration scores¶
| Variant | Scores (15 iters) |
|---|---|
| QAT UD-Q4_K_XL | 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100 |
| imatrix IQ4_XS † | 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100 |
Per-iteration TPS¶
| Variant | TPS (15 iters) |
|---|---|
| QAT UD-Q4_K_XL | 168.5, 168.5, 168.7, 168.4, 169.4, 169.1, 169.8, 168.5, 168.9, 169.0, 169.3, 169.8, 169.4, 169.4, 166.8 |
| imatrix IQ4_XS † | 141.0, 141.2, 140.8, 140.8, 140.8, 140.9, 140.9, 140.7, 140.9, 140.0, 140.2, 140.4, 140.7, 140.6, 140.9 |
VRAM and load characteristics¶
| Variant | GGUF on disk | mmproj | VRAM idle (steady-state) | TTFT warm |
|---|---|---|---|---|
| QAT UD-Q4_K_XL | 14.25 GB | 1.13 GB (BF16, shared) | 17567 MB | 52 ms |
| imatrix IQ4_XS † | 12.7 GB | 1.13 GB (BF16) | 15665 MB | 53 ms |
| Δ | +1.55 GB | 0 | +1902 MB (+12 %) | −1 ms (noise) |
Steady-state VRAM is reported after the first iteration; iter-1 was identical (within 28 MB). Iter-1 TTFT on QAT was 73 ms (warm-cache miss on first request); imatrix iter-1 TTFT was 82 ms. Both arms have a flat-line VRAM curve — no growth across 15 iters, no leak signal. Total bench duration (QAT, 15 iters): 27.6 min (started 2026-06-06T22:34:35Z, completed 2026-06-06T23:02:10Z).
Methodology Caveats¶
- Imatrix baseline reused, not re-run. TPS comparisons must allow for the fact that the imatrix run happened on 2026-06-04 under a (potentially) different host-load profile. Same b9500 binary and same
GEMMA_CODINGsampler, but I/O / CPU contention could differ. Quality (pass-rate, score) is host-load-insensitive and fully comparable. If the TPS delta is small (< 5%), assume noise; if it's large, the cause is more likely the quant pipeline than the host. - Sampler is GEMMA_CODING per Google's model card. Temp 1.0 / top_p 0.95 / top_k 64 / repetition_penalty 1.05. Both arms use the same values.
enable_thinking=trueis enabled on both arms. Gemma's chat template routes the reasoning portion of the response to areasoning_contentfield on the OpenAI-compat response;contentholds the post-thinking answer. The benchmark'sLlamaCppClient(src/gpumod/benchmarks/coding/llm_client.py:93) readscontent(notreasoning_content), so the score reflects the post-thinking answer only.- MTP is not in play for either arm. Mainline llama.cpp PR #23398 (Gemma 4 MTP) is still WIP. Both QAT and imatrix run at native non-speculative speed. Track gpumod-rj0s.
- Single run, 15 iters. Enough for tight mean/σ separation. The imatrix arm has σ=0 (no spread to detect), so any QAT std > 0 immediately signals a quant-pipeline accuracy difference (not noise).
Lessons learned mid-bench¶
The first bench launch failed — the gpumod CLI silently exited 0 on a Lifecycle error, so the set -e harness didn't catch a quiesce-window collision. Root cause and fix were extracted into a separate ticket and shipped before the re-launch:
- gpumod-p2gj —
error_handler(src/gpumod/cli.py:296-304) now raisestyper.Exit(code=1)after printing the error, so shell automation (set -e,$?, pipefail) sees a non-zero exit code. New testtests/unit/test_cli_error_handler.py; 16 existing tests across 5 files updated fromexit_code == 0 # error_handler catches ittoexit_code != 0 # gpumod-p2gj: …. All four gates pass (2381 pytest, mypy --strict, ruff, format).
The bench's run_bench.sh also added a 15 s quiesce wait after mode switch blank (the mode switch stops the embedding server; that stop is itself heavy-GPU-eligible and trips the start-side quiesce gate on the QAT model).
Recommendation¶
Swap modes/hermes-agent.yaml:28 from gemma4-26b-a4b-q4 to gemma4-26b-a4b-qat-q4.
The decision criteria from gpumod-rjkx were: "keep imatrix, switch to QAT, or split". QAT wins on the only axes that matter:
| Criterion | imatrix IQ4_XS | QAT UD-Q4_K_XL | Verdict |
|---|---|---|---|
| Score mean | 100.00 | 100.00 | tied |
| Score σ | 0.00 | 0.00 | tied |
| L5 composition pass-rate | 15/15 | 15/15 | tied |
| TPS mean | 140.7 | 168.9 | +28.2 (+20 %) |
| VRAM idle | 15.3 GB | 17.2 GB | imatrix wins by 1.9 GB |
| GGUF on disk | 12.7 GB | 14.25 GB | imatrix wins by 1.55 GB |
| Multimodal path | mmproj BF16 OK | mmproj BF16 OK (reused) | tied |
| Sampler / context / flags | identical | identical | tied |
| llama.cpp binary | b9500 | b9500 | tied |
VRAM cost is comfortably under the 24 GB ceiling: peak ~18 GB with mmproj leaves >6 GB headroom for image-KV growth, which is enough for any realistic Hermes-agent multimodal conversation. The +1.55 GB disk hit is irrelevant on a 5.3 TB pool.
The +20 % TPS is delivered to every Hermes-agent request — a real, persistent latency improvement for the operator's primary daily-driver model. No regressions in any measured dimension.
Code mode follow-up. This bench only covers the single-slot multimodal hermes-agent config. The multi-slot code-mode preset (gemma4-26b-a4b-q4-multi, port 7109, --parallel 3 --cont-batching) would need its own QAT mirror plus an install-all re-render to verify the change. Not in scope here; file a follow-up if the operator wants it after a few days of hermes-agent QAT runtime confirms no regressions.
MTP follow-up unchanged. gpumod-rj0s still tracks PR #23398. Once Gemma 4 MTP lands, an assistant-arch drafter pairs with whichever quant variant is in production at that point. QAT and MTP are orthogonal.
Files¶
run_bench.sh— driver (QAT only)result_gemma4-26b-a4b-qat-q4.json— QAT raw resultsrun_gemma4-26b-a4b-qat-q4.log— QAT run log (gitignored per*.log; local-only)artifacts/— per-iteration prompt + response + extracted code- Imatrix baseline:
../20260603_gemma4_12b_vs_qwen36_35b_a3b_mtp/result_gemma4-26b-a4b-q4.json
Related¶
- gpumod-rjkx — this bench
- gpumod-p2gj — CLI exit-code fix (landed mid-bench)
- gpumod-rj0s — Gemma 4 MTP tracking (upstream PR #23398, future TPS boost)
- gpumod-h6gs — predecessor bench (Gemma 4 12B/26B-A4B vs Qwen3.6-35B-A3B-MTP); imatrix baseline lives here
- Unsloth Gemma 4 QAT docs
modes/hermes-agent.yaml:28— current single-slot LLM selector (would change if QAT wins)