Skip to content

Qwen3.6 MTP vs Non-MTP — Hermes-Agent Swap Evaluation

Date: 2026-05-24 (initial), 2026-05-25 (35B-A3B-MTP rerun at 128K + q8_0) Ticket: gpumod-76l.3 (epic gpumod-76l); benchmark refresh gpumod-ttk7 Question: Should modes/hermes-agent.yaml swap from qwen36-35b-a3b-iq4xs to a Multi-Token Prediction (MTP) variant?

Goal

Compare Qwen3.6 with and without Multi-Token Prediction (MTP) on the v2 coding benchmark (15 iterations × 5 levels of increasing difficulty, validated by real pytest tests). MTP is a speculative-decoding technique baked into the GGUF tensors that Unsloth/Qwen claim gives ~1.4–2.2× faster inference with no accuracy loss. The benchmark settles whether MTP delivers on that claim and whether it should replace the current Hermes-agent chat model.

Methodology Caveats

Direct comparison with the prior 20260423_qwen36_gemma4_comparison benchmark has two known confounds:

  • Binary version: prior runs used llama.cpp b8838 (23b8cc499); MTP runs need b9297 (b0df4c0cf). MTP support was merged on 2026-05-13. Between the two binaries are ~459 releases of kernel improvements that affect both quality (numerical determinism) and speed.
  • Quant variant for dense 27B: prior 27B used legacy Q4_K_M; MTP variant ships only as Unsloth's dynamic UD-Q4_K_XL. UD- quants typically add 1–3 score points vs the legacy quant. For the MoE 35B-A3B, both runs used the SAME UD-IQ4_XS quant, so that comparison is cleaner.

Caveats are flagged inline where they affect interpretation.

Setup

Component Specification
CPU AMD Ryzen 7 5700G (16 threads)
RAM 32 GB DDR4
GPU NVIDIA GeForce RTX 4090 (24 GB VRAM)
OS Ubuntu 24.04.4 LTS
Driver NVIDIA 580.65.06
CUDA 12.0 (avoid 13.2 — known gibberish bug)
llama.cpp b9297 (b0df4c0cf) for MTP; b8838 (23b8cc499) for prior non-MTP rows

VRAM isolation: vllm-embedding-code was stopped throughout to remove co-tenant contention. Only the model under test was GPU-resident at any time.

Models Tested

ID Source repo Architecture Quant File Bin Context Thinking flag
qwen36-27b unsloth/Qwen3.6-27B-GGUF Dense (27B all active) Q4_K_M 16.0 GB b8838 40960 (default)
qwen36-27b-mtp-q4 unsloth/Qwen3.6-27B-MTP-GGUF Dense (27B all active) + MTP UD-Q4_K_XL 18.0 GB b9297 40960 enable_thinking
qwen36-35b-a3b-iq4xs unsloth/Qwen3.6-35B-A3B-GGUF MoE (35B total, 3B active) UD-IQ4_XS 17.0 GB b8838 32768 (default)
qwen36-35b-a3b-mtp-iq4xs unsloth/Qwen3.6-35B-A3B-MTP-GGUF MoE (35B total, 3B active) + MTP UD-IQ4_XS 18.2 GB b9297 131072 enable_thinking
qwen36-35b-a3b-mtp-iq4xs-preserve unsloth/Qwen3.6-35B-A3B-MTP-GGUF MoE (35B total, 3B active) + MTP UD-IQ4_XS 18.2 GB b9297 131072 preserve_thinking

† Results reused from the 20260423 benchmark (same v2 methodology, different binary).

Results

Summary Table

All 35B-A3B-MTP rows below reflect the production-aligned 128K context + q8_0 KV cache configuration (gpumod-ttk7). See Configuration Update for the prior 32K-f16 numbers and why we re-ran.

Model Quant MTP Thinking flag Mean Score Std Dev 95% CI Mean TPS Speedup
Qwen3.6-35B-A3B-MTP UD-IQ4_XS enable_thinking 89.0 7.1 [85.1, 92.9] 216.8 1.24× vs non-MTP twin
Qwen3.6-35B-A3B-MTP (preserve) UD-IQ4_XS preserve_thinking 88.3 6.5 [84.8, 91.9] 216.5 1.24× vs non-MTP twin
Qwen3.6-35B-A3B UD-IQ4_XS (default) 87.3 10.3 [81.6, 93.0] 174.5 (baseline)
Qwen3.6-27B-MTP UD-Q4_K_XL enable_thinking 87.3 7.3 [83.3, 91.4] 85.4 1.82× vs non-MTP twin
Qwen3.6-27B Q4_K_M (default) 80.3 6.9 [76.5, 84.2] 46.9 (baseline)

95% CI overlap: the 35B-A3B-MTP enable CI [85.1, 92.9] and preserve CI [84.8, 91.9] are nearly identical, and both overlap the non-MTP CI [81.6, 93.0]. The +1.7/+1.0 mean deltas vs non-MTP are not statistically significant; treat as quality parity. The +24% TPS is the unambiguous win.

enable vs preserve_thinking on single-shot is empirically equivalent at 128K: mean 89.0 vs 88.3, σ 7.1 vs 6.5, TPS 216.8 vs 216.5 — three out of three metrics within run-to-run noise. This matches the chat-template proof (preserve_thinking only fires for prior assistant messages — see research note). The earlier 32K preserve run showed -5.0 mean and 4× the L1 failures, which we now attribute to insufficient thinking budget at 32K context, not to the flag itself.

Score Distribution (per iteration)

Model Scores (15 iters)
Qwen3.6-35B-A3B-MTP (enable) 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 100, 90, 90, 65, 90
Qwen3.6-35B-A3B-MTP (preserve) 90, 90, 90, 65, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90
Qwen3.6-35B-A3B (prior) 90, 90, 90, 90, 90, 90, 90, 90, 90, 50, 90, 90, 90, 90, 90
Qwen3.6-27B-MTP 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 75, 65, 90, 90
Qwen3.6-27B (prior) 90, 90, 85, 75, 75, 75, 75, 75, 90, 75, 85, 75, 75, 90, 75

Both MTP variants now sit firmly at the 90 ceiling. Enable has one 65 outlier and one 100 (first time any model in this suite hit L5 — a partial multi-file refactor pass); preserve has a single 65 outlier. The two distributions are statistically indistinguishable. The non-MTP 27B remains the most variable, frequently landing at 75.

Level Pass Rates (15 iterations × 5 levels)

Level Task Pts 27B (non-MTP) 27B-MTP 35B-A3B (non-MTP) 35B-A3B-MTP (enable) 35B-A3B-MTP (preserve)
L1 Basic queue (add/get, FIFO) 25 100% 93% 93% 93% 93%
L2 Retry with exponential backoff 25 100% 100% 100% 100% 100%
L3 Priority scheduling 25 100% 100% 100% 100% 100%
L4 Find & fix concurrency bug 15 27% 93% 93% 100% 100%
L5 Multi-file refactoring 10 13% 0% 0% 7% 0%

L4 jumped dramatically for 27B-MTP (27% → 93%) — this is partly the new binary (better thinking-mode reasoning), partly the quant upgrade (UD-Q4_K_XL ≥ Q4_K_M), and partly the runner fix that captures reasoning_content. The exact attribution is confounded, but the net result is that 27B-MTP is now nearly as good as 35B-A3B on the concurrency task.

L5 cracked once on the enable run — 1/15 iterations passed (a partial). This is the first non-zero L5 in this suite for the strong MoE models. With 128K context the model can finally hold the full multi-file refactor in its thinking budget. n=15 is too few to call this a stable improvement, but it suggests L5 is no longer an absolute ceiling — just an extremely tight one.

MTP-Specific Metrics

Model n_max Mean draft acceptance Total drafts Drafts accepted
Qwen3.6-27B-MTP 2 80.2% 475,884 364,804
Qwen3.6-35B-A3B-MTP (enable) 2 78.9% 349,154 267,423
Qwen3.6-35B-A3B-MTP (preserve) 2 78.7% 366,406 277,488

MTP draft acceptance is healthy across all three — well above the threshold where speculative decoding pays for itself. q8_0 KV cache compression cost ~1pp of acceptance vs the f16 32K runs (78.9% → 78.9% enable, 79.6% → 78.7% preserve — within noise). The MoE has slightly lower acceptance than the dense 27B, possibly because the 3B-active routing makes the draft head's predictions less aligned with the target's actual sampling path.

Wall-clock

Run Duration Per iteration
Qwen3.6-27B-MTP (15) 2h 03m ~8 min
Qwen3.6-35B-A3B-MTP enable (15, 128K+q8_0) ~0h 35m ~2.3 min
Qwen3.6-35B-A3B-MTP preserve (15, 128K+q8_0) ~0h 35m ~2.3 min

The MoE's 3B-active routing dominates wall-clock; MTP layered on top gives the +24% on raw TPS. q8_0 KV cache compression cost ~3% wall-clock vs the prior f16 32K runs.

Key Findings

  1. MoE MTP is the new top performer on every axis: highest mean (89.0 enable / 88.3 preserve), highest TPS (216.8 / 216.5), low variance (σ=7.1 / 6.5). It dethrones the prior champion (35B-A3B UD-Q4_K_S at 90.0/173.7).
  2. MTP's speed claim holds: 1.82× for dense 27B, 1.24× for MoE 35B-A3B. The MoE gains less because the 3B-active arch was already fast — the draft head still helps but has less wall-clock to save.
  3. MTP quality is at parity with the non-MTP twin within statistical noise (CIs overlap heavily across all three MTP variants). Vendor's "no accuracy loss" claim survives the test.
  4. Variance dropped with MTP: 35B-A3B σ went from 10.3 (non-MTP) → 7.1 (MTP enable) → 6.5 (MTP preserve). Lower variance = more predictable for production agents.
  5. enable vs preserve_thinking is empirically equivalent on single-shot at 128K context: means 89.0 vs 88.3, σ 7.1 vs 6.5, TPS 216.8 vs 216.5 — three out of three metrics within run-to-run noise. Matches the chat-template proof. The earlier 32K preserve regression was a thinking-budget artifact, not a flag difference.
  6. q8_0 KV cache is compatible with MTP draft head: smoke and 15-iter benchmark both pass. Cost: ~3% wall-clock and ~1pp draft acceptance. Benefit: 4× more context for thinking budget.
  7. L4 (concurrency bug fix) is now 100% on 35B-A3B-MTP: confounded by binary + quant + runner-fix history but real.
  8. L5 (multi-file refactoring) cracked once on the enable run (1/15): first non-zero L5 in this suite for strong MoE models. The 128K context finally lets thinking budget hold the full task. n=15 is too few to call this a stable improvement, but L5 is no longer the absolute ceiling it was at 32K.

Methodology

Aspect Configuration
Iterations 15 per model
Validation PytestValidator (real pytest tests, 30 s per-level timeout)
Sampler THINKING_CODING: temp=0.6, top_p=0.95, top_k=20, min_p=0, presence_penalty=0 (matches Unsloth's recommendation for thinking-mode coding)
Client timeout 900 s per request (bumped from prior 300 s — MTP thinking can be long-running)
max_tokens 32768 (Unsloth's recommendation for Qwen3.6 general queries)
Thinking mode Enabled (default for Qwen3.6). The enable variant uses --chat-template-kwargs '{"enable_thinking":true}'; the preserve variant uses '{"preserve_thinking":true}'. Both are functionally equivalent on first-turn inputs per the chat-template proof.
Context 131072 (Unsloth's "minimum 128K for thinking capabilities" recommendation). Earlier runs in this folder at 32768 were under-provisioned — see Configuration Update.
KV cache --cache-type-k q8_0 --cache-type-v q8_0. Halves KV memory so the 128K context fits alongside MTP on 24 GB VRAM. Verified compatible with the draft head (gpumod-ttk7).
MTP flags --spec-type draft-mtp --spec-draft-n-max 2 --parallel 1 (per Unsloth docs; -np>1 unsupported with MTP)
Code extraction Try message.content first (the model's polished final answer); fall back to reasoning_content only if content has no code fence. Earlier benchmarks treated reasoning-first and graded the model's draft code — the fix landed in commit 219d688.

Recommendation

Swap Hermes-agent's chat model to qwen36-35b-a3b-mtp-iq4xs-preserve (128K context + q8_0 KV cache).

Reason Detail
Quality 88.3 mean (preserve) and 89.0 mean (enable) both overlap the non-MTP CI [81.6, 93.0]. Single-shot equivalence between the two flags is now both theoretically proven (chat-template extraction) AND empirically confirmed (means/σ/TPS all within run-to-run noise at 128K).
Speed 216.5 TPS (preserve) / 216.8 TPS (enable) vs 174.5 TPS (non-MTP) — +24% real, sustained
Variance σ=6.5 (preserve) / σ=7.1 (enable) vs σ=10.3 (non-MTP) — -37% / -31%, more predictable for agents
Multi-turn behavior Hermes-agent is multi-turn (chat + tool calling). preserve_thinking:true keeps prior <think> blocks in the conversation history; enable_thinking:true drops them. Preserving improves reasoning consistency across tool-calling turns.
Context 131072 tokens — matches the non-MTP baseline and exceeds Unsloth's 128K minimum for thinking. Production multi-turn needs this; the prior 32K runs in this folder were under-provisioned.
VRAM ~20 GB load on 24 GB GPU (model + 128K KV cache with q8_0). Leaves ~4 GB headroom for co-tenant services like vllm-embedding-code.
Architecture parity Same MoE 35B-A3B base + MTP + 128K context + q8_0 cache. Only delta from the non-MTP baseline is --spec-type draft-mtp.

Why preserve over enable? At 128K the two are statistically indistinguishable on single-shot (mean 88.3 vs 89.0, σ 6.5 vs 7.1, TPS 216.5 vs 216.8). preserve is the correct flag for multi-turn agent workloads — it keeps prior <think> blocks across tool calls, which keeps the model's reasoning chain coherent. enable would drop them every turn.

Implementation — the actual edit to modes/hermes-agent.yaml is gpumod-aop. With this benchmark refresh, that ticket's recommended preset is now qwen36-35b-a3b-mtp-iq4xs-preserve (formerly qwen36-35b-a3b-mtp-iq4xs).

What we'd want before fully committing

  • One real chat/tool-calling session under hermes-agent's actual prompts (this benchmark only covers single-shot coding tasks)
  • VRAM budget verified when co-running vllm-embedding-code (currently isolated for measurement; production needs both)

These can be one-week monitoring after the swap, not gating the swap itself.

Why not 27B-MTP?

It scored 87.3 (≈ baseline 35B-A3B-IQ4XS) at less than half the TPS of 35B-A3B-MTP (85.4 vs 216.5). Smaller VRAM footprint (~21 GB vs ~22 GB) but the speed and variance penalties make it inferior to 35B-A3B-MTP for the Hermes-agent use case. Possible alternative if a multi-agent / batch workload appears, but -np>1 is unsupported with MTP, so 27B-MTP loses its main advantage there too.

Configuration Update

The MTP variants were originally benchmarked at context_size=32768 with f16 KV cache. Cross-checking against the Unsloth Qwen3.6-35B-A3B-MTP model card flagged that 32K is below the vendor's "minimum 128K for thinking capabilities" recommendation. The non-MTP twin already runs at 131072 thanks to --cache-type-k q8_0 --cache-type-v q8_0 halving the KV memory; gpumod-ttk7 tested whether q8_0 cache is compatible with the MTP draft head.

Result: compatible. Both 15-iter benchmarks above were re-run at 131072 context + q8_0 cache.

Metric enable @ 32K f16 enable @ 128K q8_0 preserve @ 32K f16 preserve @ 128K q8_0
Mean score 88.3 89.0 83.3 88.3
Std Dev 6.5 7.1 11.4 6.5
95% CI lower 84.8 85.1 77.0 84.8
95% CI upper 91.9 92.9 89.7 91.9
Mean TPS 222.3 216.8 230.8 216.5
Draft acceptance 78.9% 78.9% 79.6% 78.7%

The 32K→128K change cost ~3% TPS (q8_0 cache slows kernel slightly) and ~1pp draft acceptance, in exchange for: - 4× thinking budget (the smoke that triggered this investigation showed 14891 reasoning chars for a single coding prompt at finish_reason=length) - preserve_thinking quality recovers from 83.3→88.3 (the 32K result was a thinking-budget artifact, not a flag difference) - Multi-turn viability for Hermes-agent (32K saturates after 2-3 preserved-thinking turns)

The prior 32K result JSONs are preserved as result_*.32k-baseline.json for audit. The headline tables in this README use the 128K production-aligned numbers.

Files

File Description
result_qwen36-27b-mtp-q4.json 15-iteration results, 27B-MTP
result_qwen36-35b-a3b-mtp-iq4xs.json 15-iteration results, 35B-A3B-MTP enable_thinking @ 128K + q8_0
result_qwen36-35b-a3b-mtp-iq4xs.32k-baseline.json Earlier 32K f16 run (superseded — see Configuration Update)
result_qwen36-35b-a3b-mtp-iq4xs-preserve.json 15-iteration results, 35B-A3B-MTP preserve_thinking @ 128K + q8_0
result_qwen36-35b-a3b-mtp-iq4xs-preserve.32k-baseline.json Earlier 32K f16 run (superseded — see Configuration Update)
run.log Original 32K stdout from 27B + enable runs
run_preserve.log Original 32K stdout from preserve run
run_preserve_128k.log 128K + q8_0 stdout from preserve rerun (enable rerun log lost to a shell-quoting accident — result_*.json is authoritative)
artifacts/qwen36-27b-mtp-q4/ Per-iteration response artifacts (tagged <reasoning_content> + <content>)
artifacts/qwen36-35b-a3b-mtp-iq4xs/ Per-iteration response artifacts (enable_thinking @ 128K)
artifacts/qwen36-35b-a3b-mtp-iq4xs-preserve/ Per-iteration response artifacts (preserve_thinking @ 128K)

GGML_CUDA_NO_PINNED A/B (gpumod-56md)

Date: 2026-05-26 Question: Does disabling cudaMallocHost (via GGML_CUDA_NO_PINNED=1) measurably degrade TPS or quality?

Background: The cudaHostAlloc-based pinned memory path in llama.cpp requires contiguous high-order physical pages. When MemAvailable is low due to fragmentation, the NVIDIA driver hangs indefinitely waiting for pages — no OOM signal, no PSI spike, eventual hard reboot. Setting GGML_CUDA_NO_PINNED=1 makes ggml_cuda_host_malloc fall back to regular malloc, eliminating the failure class entirely. See docs/research/20260525_oom_protection_findings/FINDINGS.md for root-cause analysis.

Results

Metric Pinned (baseline) No-pinned (variant) Delta
Mean score 88.3 86.7 -1.6
Std Dev 6.5 8.8 +2.3
95% CI [84.8, 91.9] [81.8, 91.5] overlap
Mean TPS 216.5 215.9 -0.28%
Draft acceptance 78.7% 79.1% +0.4pp
L1 pass 93% 93%
L4 pass 100% 100%
L5 pass 0% 0%

Score Distribution

Variant Scores (15 iters)
Pinned (baseline) 90, 90, 90, 65, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90
No-pinned 90, 65, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 65, 90, 90

Both have exactly two 65-score outliers (L1 failures); the remainder hit the 90 ceiling. Distributions are statistically indistinguishable.

Decision

Enable. Both criteria pass:

  1. TPS regression = 0.28% — well within the 5% threshold (216.5 → 215.9 t/s)
  2. 95% CIs overlap — baseline [84.8, 91.9] ∩ no-pinned [81.8, 91.5] = [84.8, 91.5]

Draft acceptance actually improved slightly (78.7% → 79.1%), confirming the MTP speculative-decoding path is unaffected by the host buffer allocation strategy.

Recommendation

GGML_CUDA_NO_PINNED=1 is now the unconditional default for all llamacpp services via src/gpumod/templates/systemd/llamacpp.service.j2. This eliminates the entire cudaHostAlloc fragmentation-hang failure class at negligible cost. Operators who need the pinned path for benchmarking CPU↔GPU transfer bandwidth can override via systemctl --user edit <service> with Environment="GGML_CUDA_NO_PINNED=0".

Files

File Description
result_qwen36-35b-a3b-mtp-iq4xs-preserve.json Pinned baseline (15 iters, 128K + q8_0)
result_qwen36-35b-a3b-mtp-iq4xs-preserve.no-pinned.json No-pinned variant (15 iters, same config)
run_no_pinned.log Benchmark stdout for the no-pinned run

References