Gemma 4 (12B Q4/Q5/Q8 + 26B-A4B IQ4) vs Qwen3.6-35B-A3B-MTP-preserve
Date: 2026-06-04
Tickets: gpumod-h6gs (bench); gpumod-9ial / gpumod-eods / gpumod-7vy8 / gpumod-pdtn (methodology fixes); gpumod-t84m / gpumod-4omn (bench infra)
Question: How does Gemma 4 at the 12B and 26B-A4B sizes compare on the v2 coding benchmark against the current Hermes-agent model (Qwen3.6-35B-A3B-MTP-IQ4_XS preserve_thinking)?
TL;DR
| Model | Mean | σ | Min/Max | 95% CI | TPS | VRAM | Verdict |
|---|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B IT UD-IQ4_XS | 100.0 | 0.0 | 100/100 | [100.0, 100.0] | 140.7 | ~16 GB | Perfect across all 15 iters, all 5 levels. Strongest single result in this suite. |
| Gemma 4 12B IT UD-Q8_K_XL | 89.7 | 13.4 | 65/100 | [82.9, 96.5] | 52.3 | ~15 GB | L2 80% / L5 73% on the composition test. Decent quality, slowest TPS. |
| Gemma 4 12B IT Q5_K_M | 80.7 | 12.8 | 65/100 | [74.2, 87.2] | 73.7 | ~10 GB | Reasonable quality-VRAM-TPS knee. Drops on the harder L5. |
| Gemma 4 12B IT UD-Q4_K_XL | 62.3 | 18.9 | 40/100 | [52.8, 71.9] | 80.0 | ~9 GB | Not recommended for serious coding work on this benchmark. |
| Qwen3.6-35B-A3B-MTP-IQ4_XS preserve † | 88.3 | 6.5 | 65/100 | [84.8, 91.9] | 216.5 | ~22 GB | Hermes baseline. TPS lead is real; quality is now overtaken by Gemma 4 26B-A4B. |
† Qwen row reused from the 20260524 benchmark; not re-run here. Its L5 had a 0% pass rate (different test format then), so the 88.3 mean reflects L1–L4 only. Read the comparison with Gemma 4 26B-A4B's 100 mean / σ=0 as: Qwen falls 12 points short on the lower four levels, while Gemma also clears the harder composition L5.
Three headlines:
- Gemma 4 26B-A4B is the unambiguous quality leader. Perfect 100 mean / σ=0 / 15 of 15 iters at the ceiling, including 100% on the L5 composition test that requires a Job dataclass + RetryPolicy + JobQueue with real composition between them.
- 12B-Q8 sits 10 points below 26B-A4B. 89.67 mean with σ=13.4 — usable but not in the same league. Its L2 80% is a 20pp improvement over what an earlier rev of the L2 prompt produced; the bench infrastructure changes that drove that are listed under Methodology.
- 12B-Q4 is too weak for the benchmark's harder levels. 62.33 mean / 33% L5 — the cheapest dense Gemma is below the bar for any production coding workload that exercises composition.
See Recommendation for mode-swap decisions.
Setup
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 5700G (16 threads) |
| RAM | 32 GB DDR4 |
| GPU | NVIDIA GeForce RTX 4090 (24 GB VRAM) |
| OS | Ubuntu 24.04.4 LTS |
| Driver | NVIDIA 580.65.06 |
| CUDA | 12.0 |
| llama.cpp | b9500 (3d1998634, built 2026-06-04) |
VRAM isolation enforced for every model: only the model under test was GPU-resident; all other gpumod services were stopped via gpumod mode switch blank before each model start (the bench drivers call that themselves).
Models Tested
| ID | Source | Architecture | Quant | File size | Context | Sampler |
|---|---|---|---|---|---|---|
| qwen36-35b-a3b-mtp-iq4xs-preserve † | unsloth/Qwen3.6-35B-A3B-MTP-GGUF | MoE 35B / 3B active + MTP | UD-IQ4_XS | 18.2 GB | 131072 | THINKING_CODING (temp 0.6) |
| gemma4-12b-q4 | unsloth/gemma-4-12b-it-GGUF | Dense 12B | UD-Q4_K_XL | 7.4 GB | 131072 | GEMMA_CODING (temp 1.0, RP 1.05) |
| gemma4-12b-q5 | unsloth/gemma-4-12b-it-GGUF | Dense 12B | Q5_K_M | 8.4 GB | 131072 | GEMMA_CODING |
| gemma4-12b-q8 | unsloth/gemma-4-12b-it-GGUF | Dense 12B | UD-Q8_K_XL | 13.6 GB | 131072 | GEMMA_CODING |
| gemma4-26b-a4b-q4 | unsloth/gemma-4-26B-A4B-it-GGUF | MoE 26B / 4B active | UD-IQ4_XS | 12.7 GB | 131072 | GEMMA_CODING |
† Reused from 20260524 — not re-run.
All Gemma presets ship --cache-type-k q8_0 --cache-type-v q8_0 (matches Qwen baseline's strategy at 131072 ctx), --parallel 1, --flash-attn on, and --chat-template-kwargs '{"enable_thinking":true}'. None ship MTP — see Methodology Caveats.
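As a rough illustration of the flag set described above, the sketch below assembles a llama-server argument list. The function name and the model path are hypothetical, and `--model` / `--ctx-size` are added on the assumption that the preset passes the model file and the 131072 context through those standard llama.cpp flags; the actual gpumod preset format is not shown in this report.

```python
import json


def gemma_server_args(model_path: str, ctx_size: int = 131072) -> list[str]:
    """Build an argv list carrying the flags the Gemma presets are said to ship.

    Hypothetical helper: gpumod's real preset mechanism is not shown in the report.
    """
    return [
        "llama-server",
        "--model", model_path,            # hypothetical path, supplied by the caller
        "--ctx-size", str(ctx_size),      # assumption: context is set via --ctx-size
        "--cache-type-k", "q8_0",         # quantized KV cache, as in the Qwen baseline
        "--cache-type-v", "q8_0",
        "--parallel", "1",                # single slot
        "--flash-attn", "on",
        "--chat-template-kwargs",
        json.dumps({"enable_thinking": True}, separators=(",", ":")),
    ]


args = gemma_server_args("/models/gemma4-26b-a4b-q4.gguf")
print(" ".join(args))
```

Building the list in code rather than as a shell string keeps the JSON argument intact without extra quoting when the list is handed to a process launcher.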
Methodology
This bench's results depend on four bench-infrastructure fixes landed in commit history before the published run:
| Ticket | Fix |
|---|---|
| gpumod-9ial | Code extractor unwraps the <reasoning_content>…<content> artifact wrapper before extraction (so artifact L*_code.py files match what was validated) and dedents indented fences (Gemma's chat template wraps code inside numbered list items at column 4+). Pure-correctness fix — previously the artifacts didn't reflect the scored code. |
| gpumod-eods | GEMMA_CODING.repetition_penalty 1.0 → 1.05 to break degeneration loops on the dense 12B at temp=1.0. |
| gpumod-7vy8 | L2 prompt rewritten. Removed a misleading requests.get example that derailed 12B dense, fixed a data-shape mismatch with the tests, disambiguated "retry up to 3 times" → "4 total attempts (initial + 3 retries)", explicit "do not import external packages". |
| gpumod-pdtn | L5 test rewritten. Previous L5 only asserted from solution import JobQueue/Job (trivially passable — any file with those two class names passed). New L5 requires Job dataclass + RetryPolicy class + JobQueue that composes RetryPolicy (source-inspection assertion). Level renamed "Multi-file Refactor" → "Compose Job + RetryPolicy + JobQueue" so the name matches what is measured. |
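The dedent half of the gpumod-9ial fix can be pictured with a minimal sketch like the one below. It is not the runner's actual extractor: the function name and regex are assumptions, and the `<reasoning_content>` unwrapping step is left out. It only shows how a fenced block indented inside a numbered list item can be recovered at column 0.

```python
import re
import textwrap

FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown

# Match a fence at any indentation, closed by a fence at the same indentation.
FENCE_RE = re.compile(
    rf"^(?P<indent>[ \t]*){FENCE}[^\n]*\n(?P<body>.*?)^(?P=indent){FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_code_blocks(reply: str) -> list[str]:
    """Return every fenced block in `reply`, with common indentation removed."""
    return [textwrap.dedent(m.group("body")) for m in FENCE_RE.finditer(reply)]


# A reply shaped like the list-item output described in the table above.
reply = (
    "1. Define the queue:\n"
    "\n"
    f"    {FENCE}python\n"
    "    class JobQueue:\n"
    "        pass\n"
    f"    {FENCE}\n"
)
blocks = extract_code_blocks(reply)
print(blocks[0])
```

Without the dedent step, the extracted text would start at column 4 and fail to import as a module, which is one plausible way such output could have been scored incorrectly.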
The L5 change is the most significant for cross-benchmark comparisons — any older benchmark report under the v2 methodology used the trivial test, so its L5 numbers are not comparable to this one's. The other three fixes don't change the test bar, they just make the runner correctly evaluate model output.
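To make the difference between the old and new L5 bar concrete, here is a hedged sketch of a source-inspection check in the spirit of gpumod-pdtn. The real test is not reproduced in this report; the function name, the returned keys, and the use of `ast` are all assumptions.

```python
import ast


def check_composition(source: str) -> dict[str, bool]:
    """Inspect solution source for the three structural requirements of the new L5."""
    tree = ast.parse(source)
    classes = {n.name: n for n in tree.body if isinstance(n, ast.ClassDef)}

    def is_dataclass(cls: ast.ClassDef) -> bool:
        # Accept @dataclass, @dataclass(...), and @dataclasses.dataclass forms.
        for dec in cls.decorator_list:
            target = dec.func if isinstance(dec, ast.Call) else dec
            if isinstance(target, ast.Attribute):
                name = target.attr
            else:
                name = getattr(target, "id", None)
            if name == "dataclass":
                return True
        return False

    def references(cls: ast.ClassDef, name: str) -> bool:
        # True if the class body mentions `name` anywhere (call, annotation, default).
        return any(isinstance(n, ast.Name) and n.id == name for n in ast.walk(cls))

    return {
        "job_is_dataclass": "Job" in classes and is_dataclass(classes["Job"]),
        "has_retry_policy": "RetryPolicy" in classes,
        "queue_composes_policy": "JobQueue" in classes
        and references(classes["JobQueue"], "RetryPolicy"),
    }


GOOD = '''
from dataclasses import dataclass

@dataclass
class Job:
    id: int

class RetryPolicy:
    max_retries = 3

class JobQueue:
    def __init__(self, policy=None):
        self.policy = policy or RetryPolicy()
'''

# Passes the old import-only test, fails every structural check here.
TRIVIAL = "class Job: pass\nclass JobQueue: pass\n"

print(check_composition(GOOD))
print(check_composition(TRIVIAL))
```

A file like `TRIVIAL` is exactly what the previous import-only assertion would have accepted, which is why older L5 numbers are not comparable.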
Results
Summary
| Model | Mean | σ | Min/Max | 95% CI | TPS |
|---|---|---|---|---|---|
| Gemma 4 26B-A4B | 100.00 | 0.00 | 100/100 | [100.0, 100.0] | 140.7 |
| Gemma 4 12B Q8 | 89.67 | 13.43 | 65/100 | [82.9, 96.5] | 52.3 |
| Gemma 4 12B Q5 | 80.67 | 12.80 | 65/100 | [74.2, 87.2] | 73.7 |
| Gemma 4 12B Q4 | 62.33 | 18.89 | 40/100 | [52.8, 71.9] | 80.0 |
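The 95% CI column for the Gemma rows appears consistent with a normal-approximation interval, mean ± 1.96·σ/√n with n = 15. That formula is inferred from the published numbers rather than taken from the runner's source, so treat the sketch below as a reading aid only.

```python
import math


def normal_ci95(mean: float, sigma: float, n: int = 15) -> tuple[float, float]:
    """95% confidence interval for the mean under a normal approximation."""
    half_width = 1.96 * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width


# (mean, sigma) pairs copied from the summary table above.
rows = {
    "Gemma 4 26B-A4B": (100.00, 0.00),
    "Gemma 4 12B Q8": (89.67, 13.43),
    "Gemma 4 12B Q5": (80.67, 12.80),
    "Gemma 4 12B Q4": (62.33, 18.89),
}
for name, (mean, sigma) in rows.items():
    lo, hi = normal_ci95(mean, sigma)
    print(f"{name}: [{lo:.1f}, {hi:.1f}]")
```

With σ = 0 the interval collapses to a single point, which is why the 26B-A4B row reads [100.0, 100.0]; that reflects zero observed variance over 15 iterations, not a guarantee of zero variance in general.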
Per-level pass rates
| Level | Task | Q4 | Q5 | Q8 | 26B-A4B |
|---|---|---|---|---|---|
| L1 | Basic queue (add/get, FIFO) | 46% | 100% | 100% | 100% |
| L2 | Retry with backoff | 40% | 46% | 80% | 100% |
| L3 | Priority scheduling | 93% | 100% | 93% | 100% |
| L4 | Find & fix concurrency bug | 93% | 100% | 93% | 100% |
| L5 | Compose Job + RetryPolicy + JobQueue | 33% | 40% | 73% | 100% |
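With 15 iterations per model, each percentage in the table above corresponds to a pass count out of 15 (for example, 80% is 12/15, and 46% looks like 7/15 with the fraction truncated; the truncation is an inference). A Wilson score interval, sketched below, gives a feel for how wide the uncertainty on a single cell is at this sample size. The helper is illustrative and not part of the bench runner.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Q8 on L2: 80% of 15 iterations, i.e. 12 passes.
lo, hi = wilson_interval(12, 15)
print(f"12/15 passes: {lo:.2f} to {hi:.2f}")
```

For 12/15 the interval spans roughly 55% to 93%, so differences of one or two passes between adjacent quant tiers on a single level should be read cautiously.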
Per-iteration scores
| Model | Scores (15 iters) |
|---|---|
| Gemma 4 26B-A4B | 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100 |
| Gemma 4 12B Q8 | 100, 75, 75, 100, 90, 100, 90, 90, 100, 100, 100, 90, 65, 100, 100 |
| Gemma 4 12B Q5 | 90, 75, 75, 75, 65, 90, 100, 75, 90, 100, 75, 90, 75, 90, 65 |
| Gemma 4 12B Q4 | 75, 40, 75, 75, 100, 50, 65, 50, 75, 50, 50, 50, 65, 100, 65 |
What separates 26B-A4B from 12B
The 26B-A4B clears every iteration on every level. The 12B variants cluster failures on L2 (retry-with-backoff: 40–80% pass) and L5 (composition: 33–73% pass). The L1/L3/L4 tasks are within reach of even Q5; the gap is on the two levels that ask for structural correctness (an off-by-one-free retry loop, three classes that compose). On that failure mode the MoE 26B-A4B (26B total parameters, 4B active per token) appears to hold a hard advantage over dense 12B regardless of quant tier.
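The off-by-one hazard on L2 comes from the "4 total attempts (initial + 3 retries)" wording. A minimal sketch of a loop that gets the count right is shown below; the function name, the exponential delay schedule, and the injectable `sleep` are assumptions for illustration and are not the benchmark's reference solution.

```python
import time


def call_with_retry(fn, max_retries: int = 3, base_delay: float = 0.1, sleep=time.sleep):
    """Call `fn`, retrying on failure: 1 initial attempt + `max_retries` retries."""
    for attempt in range(max_retries + 1):      # 4 total attempts when max_retries=3
        try:
            return fn()
        except Exception:
            if attempt == max_retries:          # last attempt failed: give up
                raise
            sleep(base_delay * 2 ** attempt)    # assumed exponential backoff


# Usage: a flaky call that succeeds on its fourth attempt.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 4:
        raise RuntimeError("transient failure")
    return "ok"


result = call_with_retry(flaky, sleep=lambda seconds: None)  # skip real sleeping
print(result, state["calls"])
```

Writing `range(max_retries)` instead of `range(max_retries + 1)` yields three total attempts, which is the kind of slip a stricter L2 test would catch.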
Methodology Caveats
- MTP asymmetry, not by choice. Qwen baseline uses MTP speculative decoding (+24% TPS measured in the 20260524 benchmark). No Gemma 4 variant tested here has any MTP path available:
  - 12B: Google released a 12B drafter (google/gemma-4-12B-it-assistant) but its Gemma4UnifiedAssistantForCausalLM arch has zero upstream PRs in ggml-org/llama.cpp.
  - 26B-A4B: a drafter exists (google/gemma-4-26B-A4B-it-assistant) and AtomicChat ships GGUF conversions. Mainline llama.cpp PR #23398 is still OPEN (WIP) as of 2026-06-04. Tracking via gpumod-rj0s.
  - Interpretation: Gemma TPS columns reflect non-speculative speed. The Gemma 4 26B-A4B 140.7 TPS vs Qwen-MTP 216.5 TPS gap (-35%) is partly architecture, partly the missing speculative-decoding boost. Once PR #23398 merges, a re-bench should close most of that gap.
- Sampler asymmetry, by design. Each model uses its vendor-recommended sampler: Qwen runs THINKING_CODING (temp=0.6, top_p=0.95, top_k=20); Gemma runs GEMMA_CODING (temp=1.0, top_p=0.95, top_k=64, RP=1.05). Plumbed via the sampler field on ModelConfig in scripts/run_qwen36_benchmark.py.
- Iterations: 15 per model. Enough to call meaningful mean/σ differences but light for per-level pass-rate stability on rare-failure levels.
- Validation: PytestValidator with 30 s per-level timeout, 900 s per-request client timeout, max_tokens=32768. Identical to the 20260524 baseline.
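The sampler asymmetry described above could be represented roughly as follows. Only the preset names, their values, and the existence of a sampler field on ModelConfig come from this report; the class layout, the `request_params` method, and the request key names (llama.cpp-style `repeat_penalty`) are assumptions, and the Qwen preset's repetition penalty is left unset because the report does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sampler:
    temperature: float
    top_p: float
    top_k: int
    repetition_penalty: Optional[float] = None  # None: not stated in the report


# Values as listed in the caveats above.
THINKING_CODING = Sampler(temperature=0.6, top_p=0.95, top_k=20)
GEMMA_CODING = Sampler(temperature=1.0, top_p=0.95, top_k=64, repetition_penalty=1.05)


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical shape; the real ModelConfig lives in scripts/run_qwen36_benchmark.py."""
    model_id: str
    sampler: Sampler

    def request_params(self) -> dict:
        params = {
            "temperature": self.sampler.temperature,
            "top_p": self.sampler.top_p,
            "top_k": self.sampler.top_k,
        }
        if self.sampler.repetition_penalty is not None:
            params["repeat_penalty"] = self.sampler.repetition_penalty  # assumed key name
        return params


gemma = ModelConfig("gemma4-26b-a4b-q4", GEMMA_CODING)
qwen = ModelConfig("qwen36-35b-a3b-mtp-iq4xs-preserve", THINKING_CODING)
print(gemma.request_params())
print(qwen.request_params())
```

Keeping the sampler on the model config rather than on the bench driver means each model is always evaluated with its own preset, which matches the "by design" framing of this caveat.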
Recommendation
| Use case | Recommended | Why |
|---|---|---|
| Highest-quality coding model | gemma4-26b-a4b-q4 | 100/100 perfect across 15 iters with σ=0. Clears the L5 composition test that 12B-Q8 only manages 73% on. |
| Hermes-agent slot | Swap to gemma4-26b-a4b-q4 (landed in commit 7523805, gpumod-yxr6 partial) | +12 mean (88.3 → 100), σ collapses to 0, VRAM total drops 22.8 GB → 19 GB (+4 GB headroom). Trade-off: -35% TPS (216 → 140) and loss of preserve_thinking multi-turn kwarg (Gemma's chat template uses enable_thinking). Track gpumod-rj0s for the upstream PR #23398 merge that would let Gemma 4 26B-A4B run with its own MTP drafter and close the TPS gap. |
| Code mode slot | Pending follow-up (gpumod-yxzt) | code mode invariant requires --parallel 3 --cont-batching for concurrent coding tabs. Need gemma4-26b-a4b-q4-multi preset first; then swap. |
| Low-VRAM coding mode | gemma4-12b-q5 (best 12B knee that fits in ~10 GB) | 80.67 mean / 46% L2 / 40% L5. Q8 is clearly better at 89.67 mean if VRAM allows it. |
| Lowest-precision dense Gemma | gemma4-12b-q4 is not recommended for serious work | 62.33 mean / 33% L5; the cheaper variant fails the composition tests too often. |
Why the Hermes swap is defensible
The Qwen baseline used to be defended as "statistically equivalent on mean, ~2-3× faster, lower variance" against the prior Gemma 12B variants. That comparison is no longer the relevant one: gemma4-26b-a4b-q4 is the comparison point now, and its 100 mean vs Qwen's 88.3 is a +12 quality delta with no CI overlap and σ collapsing from 6.5 to 0. The TPS cost is unchanged (-35%), but the quality gain is large enough that, for an interactive agent where the quality of single replies dominates, the swap is defensible. Track gpumod-rj0s: once Gemma 4 MTP lands, the TPS gap should narrow and may close, depending on the speedup the drafter delivers.
What we'd want before treating the swap as production-final
- Real chat / tool-calling session validation. v2 benchmark covers single-shot coding only — multi-turn agent behaviour under the new Gemma chat template is not measured.
- Re-bench 26B-A4B with MTP drafter once PR #23398 lands (gpumod-rj0s). If MTP adds ~50% TPS, 26B-A4B reaches ~210 TPS at 100/0 quality; at the +24% measured for Qwen in the 20260524 benchmark it would reach ~174 TPS.
- n=30 confirmation on Q8's L2 80% to lock in that the L2 prompt-fix gain wasn't a draw of the dice.
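A quick sketch of the TPS arithmetic behind the re-bench item, using the two MTP uplift figures this report mentions (+24% and ~50%). Whether either figure transfers to Gemma 4 is unknown until the drafter can actually be run; the helper name is illustrative.

```python
GEMMA_TPS = 140.7      # Gemma 4 26B-A4B, non-speculative (this report)
QWEN_MTP_TPS = 216.5   # Qwen baseline with MTP (20260524 benchmark)


def projected_tps(base_tps: float, speedup: float) -> float:
    """TPS after applying a fractional speedup, e.g. 0.24 for +24%."""
    return base_tps * (1 + speedup)


gap = (GEMMA_TPS - QWEN_MTP_TPS) / QWEN_MTP_TPS
print(f"current gap vs Qwen-MTP: {gap:+.0%}")

for speedup in (0.24, 0.50):
    tps = projected_tps(GEMMA_TPS, speedup)
    remaining = (tps - QWEN_MTP_TPS) / QWEN_MTP_TPS
    print(f"+{speedup:.0%} MTP uplift: {tps:.0f} TPS, gap {remaining:+.0%}")
```

Under the +24% figure a sizeable gap to the Qwen baseline would remain, while under ~50% the two models would be nearly level, so the re-bench result matters for the TPS side of the swap decision.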
Files
| File | Description |
|---|---|
| result_gemma4-12b-q4.json | 15-iter result, Gemma 4 12B UD-Q4_K_XL |
| result_gemma4-12b-q5.json | 15-iter result, Gemma 4 12B Q5_K_M |
| result_gemma4-12b-q8.json | 15-iter result, Gemma 4 12B UD-Q8_K_XL |
| result_gemma4-26b-a4b-q4.json | 15-iter result, Gemma 4 26B-A4B UD-IQ4_XS |
| run_bench.sh | Driver for Q4 + Q5 sequential run |
| run_bench_extra.sh | Driver for Q8 + 26B-A4B sequential run (includes 12 GiB size guard for the 26B GGUF) |
| run_gemma4-*.log | Per-model benchmark stdout (gitignored per *.log) |
| artifacts/<model>/iter_NN/ | Per-iteration, per-level generated code and validation output |
Per-bench drivers are superseded by the central scripts/run_coding_benchmark.sh (gpumod-4omn) — same patterns, one script, options instead of per-dir wrappers. The local run_bench{,_extra}.sh are kept because they document the exact model list and 26B size guard for this benchmark.
References
- 20260524 Qwen3.6 MTP vs non-MTP benchmark (Hermes-agent baseline)
- 20260423 Qwen3.6 vs Gemma4 E4B comparison (prior generation)
- Unsloth gemma-4-12b-it-GGUF model card
- Unsloth gemma-4-26B-A4B-it-GGUF model card
- google/gemma-4-12B-it (sampler recommendation source)
- ggml-org/llama.cpp PR #23398 — Gemma 4 MTP port to mainline (WIP)
- AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF — drafter pre-conversion for the future PR #23398 swap