Qwen3.6-A3B baseline vs Qwen3.5-A3B heretic (uncensored, MTP-Preserved)¶
Date: 2026-05-26
Ticket: gpumod-ojey
Question: Is the llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF fine-tune a viable swap-in for qwen36-35b-a3b-mtp-iq4xs-preserve in modes/hermes-agent.yaml?
⚠ Cross-Version + Cross-Quant Caveat¶
This comparison conflates three variables; deltas cannot be attributed to any single one.
| Axis | Baseline | Heretic | Effect on comparison |
|---|---|---|---|
| Model version | Qwen 3.6 | Qwen 3.5 | One generation older base |
| Fine-tune | Stock Unsloth | heretic-v2-uncensored |
Different post-training data |
| Quant | UD-IQ4_XS (~4.25 bpw) | Q3_K_L (~3.4 bpw) | ~0.85 bpw less precision |
| Binary | llama.cpp b9297 | llama.cpp b9297 | (clean — no binary confound) |
Quant fallback: the ticket originally specified Q4_K_S (the closest bit-width match to UD-IQ4_XS available in the heretic repo). Q4_K_S would not fit alongside the MTP draft head + 128K q8_0 KV cache on 24 GB VRAM (preflight: 25,012 MB required vs 24,067 MB available, OOM by ~945 MB). We took the ticket's pre-authorized fallback to Q3_K_L. The bit-width gap widens vs the baseline as a result, and Q3_K_L is the dominant quality confound below.
Treat this benchmark as a coarse "is the heretic variant in the same ballpark end-to-end" check, not an ablation of any one factor.
Setup¶
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 5700G (16 threads) |
| RAM | 32 GB DDR4 |
| GPU | NVIDIA GeForce RTX 4090 (24 GB VRAM) |
| OS | Ubuntu 24.04.4 LTS |
| Driver | NVIDIA 580.65.06 |
| CUDA | 12.0 |
| llama.cpp | b9297 (b0df4c0cf) — same binary for both rows; no binary confound |
| GGML_CUDA_NO_PINNED | 1 (default since gpumod-56md) |
VRAM isolation: all other GPU-resident services stopped throughout. Only the model under test was GPU-resident at any time.
Models Tested¶
| ID | Source repo | Quant | File size | Context | Thinking flag |
|---|---|---|---|---|---|
qwen36-35b-a3b-mtp-iq4xs-preserve † |
unsloth/Qwen3.6-35B-A3B-MTP-GGUF |
UD-IQ4_XS | 18.2 GB | 131072 | preserve_thinking |
qwen35-35b-a3b-heretic-mtp-q3kl-preserve |
llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF |
Q3_K_L | 17.4 GB | 131072 | preserve_thinking |
† Baseline row reused from 20260524_qwen36_mtp_vs_a3b (see result_qwen36-35b-a3b-mtp-iq4xs-preserve.json symlink). Not re-run; same 128K + q8_0 KV configuration, same b9297 binary, no fresh measurement noise.
Results¶
Summary Table¶
| Variant | Mean Score | Std Dev | 95% CI | Mean TPS | Draft accept | Wall (15 iters) |
|---|---|---|---|---|---|---|
| Baseline preserve (UD-IQ4_XS) | 88.3 | 6.5 | [84.8, 91.9] | 216.5 | 78.7% | ~35 min |
| Heretic preserve (Q3_K_L) | 83.3 | 11.4 | [77.0, 89.7] | 212.5 | 76.0% | ~32 min |
| Δ | -5.0 | +4.9 | overlap | -1.84% | -2.7pp | — |
95% CI overlap is heavy (baseline upper 91.9 vs heretic lower 77.0). The -5.0 score delta is real-looking but the heretic's variance is ~2× the baseline (σ 11.4 vs 6.5), so the distributions cross substantially. Heretic min/max = 65/90 vs baseline 65/90 — same support; the difference is in the mix of 65s vs 90s.
TPS is a wash at -1.8%. Both at ~213-217 t/s with MTP draft.
Score Distribution (per iteration)¶
| Variant | Scores (15 iters) |
|---|---|
| Baseline preserve | 90, 90, 90, 65, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90 (1× 65) |
| Heretic preserve | 65, 90, 65, 65, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 65 (4× 65) |
Both ceilings are 90. The full -5.0 mean delta is accounted for by 3 extra 65s in the heretic run — and all four heretic 65s come from L1 failures (same failure mode as the baseline's single 65).
Level Pass Rates (15 iterations × 5 levels)¶
| Level | Task | Pts | Baseline preserve | Heretic preserve |
|---|---|---|---|---|
| L1 | Basic queue (add/get, FIFO) | 25 | 14/15 (93%) | 11/15 (73%) |
| L2 | Retry with exponential backoff | 25 | 15/15 (100%) | 15/15 (100%) |
| L3 | Priority scheduling | 25 | 15/15 (100%) | 15/15 (100%) |
| L4 | Find & fix concurrency bug | 15 | 15/15 (100%) | 15/15 (100%) |
| L5 | Multi-file refactoring | 10 | 0/15 (0%) | 0/15 (0%) |
L2–L5 are tied to the iteration. The entire benchmark delta lives at L1. That's the surprising part: L1 is the easiest level (basic FIFO queue), and the heretic — same family, same MTP architecture — flakes on it 4× as often as the baseline.
L5 (multi-file refactor) is a 0/15 hard wall for both; this matches the baseline preserve's L5 history. Multi-file refactor remains an unsolved level at this model size + quant — including the heretic's single Q3_K_L sample size.
MTP-Specific Metrics¶
| Variant | n_max | Total drafts | Drafts accepted | Mean acceptance |
|---|---|---|---|---|
| Baseline preserve | 2 | 366,406 | 277,488 | 78.7% |
| Heretic preserve | 2 | 350,181 | 266,045 | 76.0% |
Heretic's MTP draft acceptance is -2.7pp vs baseline — small but real. Plausible cause: Q3_K_L target weights produce slightly different sampling paths than the draft head's UD-IQ4_XS-derived expectations. Acceptance is still healthy (>75%) — well above the threshold where speculative decoding pays for itself.
Wall-clock¶
| Variant | 15 iters | Per iter |
|---|---|---|
| Baseline preserve | ~35 min | ~2.3 min |
| Heretic preserve | ~32 min | ~2.1 min |
Heretic was ~9% faster wall-clock despite -1.8% TPS, because per-iteration token counts were lower on average (a few iterations hit the 32K thinking budget cap on the baseline that the heretic didn't reproduce).
L1 Standalone Smoke (gpumod-ojey follow-up)¶
The bench's 4-extra-L1-failures result raised the question: is this a model competence regression, or noise from the bench's multi-level conversation accumulation?
To answer it, we ran the same L1 prompt 20 times standalone against each service (fresh chat context per attempt — no prior turns, no preserve_thinking carry-over). Same sampler (THINKING_CODING), same code extractor + validator the main runner uses. See scripts/heretic_l1_smoke.py.
L1 Smoke Results¶
| Variant | Pass | Rate | Wilson 95% CI | Mean dur | Mean tokens (clean) | Failure modes |
|---|---|---|---|---|---|---|
| Baseline UD-IQ4_XS (enable) | 19/20 | 95.0% | [76.4%, 99.1%] | 30.8s | ~4,400 | #2 32K thinking-budget exhaustion |
| Heretic Q3_K_L (preserve) | 18/20 | 90.0% | [69.9%, 97.2%] | 36.0s | ~5,800 | #1 32K thinking-budget exhaustion, #3 SyntaxError in extracted code |
(Baseline smoke used qwen36-35b-a3b-mtp-iq4xs on port 7103 — the enable_thinking sibling — because the preserve service was down for VRAM isolation. enable vs preserve is empirically equivalent on single-shot per the prior bench's chat-template proof, so this is the right surrogate for a "single-shot L1 only" comparison.)
What the smoke tells us¶
- Statistically indistinguishable. Wilson CIs are [69.9%, 97.2%] vs [76.4%, 99.1%] — they overlap from 76% to 97%. With n=20 you cannot separate 18/20 from 19/20.
- Both share the same dominant failure mode: 32K thinking-budget exhaustion. The model goes deep into reasoning and never emits the final code fence within the cap. This is a benchmark-configuration artifact, not model competence.
- Heretic-only failure (#3 SyntaxError) is n=1. Single sample; could be the lossier Q3_K_L producing a slightly-malformed code block once. Not a signal.
- Heretic uses ~32% more reasoning tokens per clean answer (5,800 vs 4,400). Consistent with Q3_K_L needing more chain-of-thought to recover (lossier quant → longer reasoning paths).
- The bench's 11/15 L1 result was conversation-accumulation, not competence. preserve_thinking carries prior
<think>blocks into each subsequent level prompt. By the time L1 fires in iter 4, the model is reasoning against a context already pre-filled with L5 thinking from iter 3. Smaller thinking budget → more 32K caps → more L1 fails. Fresh-context smoke disproves the "Q3_K_L can't do L1" hypothesis.
Implication¶
The heretic-Q3_K_L variant has L1-equivalent competence to baseline-UD-IQ4_XS at this sample size. The benchmark-score gap is dominated by the quant-precision interaction with the preserve_thinking conversation buffer at 32K thinking budget, not by the fine-tune or version difference.
To statistically prove a real competence delta would require ~100+ smoke attempts per side; n=20 establishes only that the two are within noise.
Key Findings¶
- Mean score gap is real but small (-5.0 pts) and dominated by L1 conversation-accumulation under Q3_K_L, not by the fine-tune or version.
- TPS is a wash (-1.8%). Both at ~213-217 t/s with MTP.
- MTP draft acceptance dropped -2.7pp (78.7% → 76.0%) with Q3_K_L. Still healthy.
- L2–L5 are perfectly tied with the baseline. The entire bench-score delta lives at L1.
- Standalone L1 smoke nails 18/20 = 90% for the heretic vs 19/20 = 95% for the baseline. CIs overlap heavily. The bench's 11/15 L1 result was conversation accumulation, not competence.
- L5 multi-file refactor is 0/15 hard wall for both. Unchanged at this model class.
- The Q4_K_S → Q3_K_L fallback was forced by VRAM (24,012 MB Q4_K_S preflight requirement vs 24,067 MB available). Q4_K_S would have given a fairer bit-width comparison; we get the documented fallback's wider quant gap.
Methodology¶
| Aspect | Configuration |
|---|---|
| Iterations | 15 per model |
| Validation | PytestValidator (real pytest tests, 30s per-level timeout) |
| Sampler | THINKING_CODING: temp=0.6, top_p=0.95, top_k=20, min_p=0, presence_penalty=0 |
| Client timeout | 900s per request |
| max_tokens | 32,768 per level |
| Context | 131,072 |
| KV cache | --cache-type-k q8_0 --cache-type-v q8_0 |
| MTP flags | --spec-type draft-mtp --spec-draft-n-max 2 --parallel 1 |
| Code extraction | message.content preferred; falls back to reasoning_content only if content has no code fence |
| Runner | scripts/run_qwen36_benchmark.py --model qwen35-35b-a3b-heretic-mtp-q3kl-preserve |
| L1 smoke script | scripts/heretic_l1_smoke.py (standalone, 20 attempts, fresh context per attempt) |
Recommendation¶
Do not swap. The current qwen36-35b-a3b-mtp-iq4xs-preserve baseline is preferred for modes/hermes-agent.yaml because:
- Quant gap dominates. The heretic repo's lowest bit-width that fits is Q3_K_L. The baseline runs at UD-IQ4_XS (one rung higher). At 24 GB VRAM with MTP + 128K q8_0 KV, the heretic can't match the baseline's effective precision.
- The fine-tune is "uncensored" and unaudited for the agent use case. Hermes-agent is an in-loop tool-calling chat model. We have no calibration on whether the heretic's post-training data introduces failure modes specific to tool-call structure or agent reasoning. The 5-pt single-shot coding gap is the only data we have.
- MTP draft acceptance is lower (-2.7pp) on the heretic. Small but compounds across many turns.
If the use case strictly needs uncensored output, the heretic-Q3_K_L is competitive — same L1 competence in smoke, same L2–L4 in bench, same L5 hard wall. Variance is higher; expect more 65-score iterations under preserve_thinking + long conversations. Acceptable as a specialized swap, not as a default.
What we'd want before reconsidering: - Heretic at a higher bit-width (Q4_K_S or Q4_K_M) — would require 32 GB GPU or aggressive context reduction. - L1 smoke n≥100 on each side to actually separate the pass rates statistically. - One real Hermes-agent multi-turn session with heretic to check for tool-call regression.
Files¶
| File | Description |
|---|---|
result_qwen35-35b-a3b-heretic-mtp-q3kl-preserve.json |
15-iter bench result, heretic Q3_K_L preserve |
result_qwen36-35b-a3b-mtp-iq4xs-preserve.json |
Baseline 15-iter result (symlink to prior bench dir) |
run.log |
Heretic bench stdout |
l1_smoke.json, l1_smoke.log |
L1 standalone smoke, heretic Q3_K_L (port 7105, 20 attempts) |
l1_smoke_baseline_enable.json, l1_smoke_baseline_enable.log |
L1 standalone smoke, baseline UD-IQ4_XS enable (port 7103, 20 attempts) |
artifacts/qwen35-35b-a3b-heretic-mtp-q3kl-preserve/ |
Per-iteration response artifacts (tagged <reasoning_content> + <content>) |
References¶
- Prior bench: Qwen3.6 MTP vs non-MTP — baseline preserve row was first established here
- heretic_l1_smoke.py — standalone L1 flake-rate script
- llmfan46/Qwen3.5-…-MTP-Preserved-GGUF
- Unsloth Qwen3.5 docs