Qwen3.5-35B-A3B IQ4 Provider Comparison¶
Date: 2026-02-27 (v1), 2026-03-01 (v2)
Goal¶
Compare IQ4-class GGUF quantizations from different providers on speed, VRAM, and coding task performance.
Background¶
KLD/PPL rankings for Qwen3.5-35B-A3B quantizations already exist (Qwen3.5-35B-A3B Q4 Quantization Comparison). This benchmark adds:
- TPS (tokens per second) — Speed comparison across providers
- VRAM — Actual nvidia-smi measurements
- Coding tasks — Real-world task performance with pytest validation
Setup¶
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 5700G (16 threads) |
| RAM | 32 GB DDR4 |
| GPU | NVIDIA GeForce RTX 4090 (24 GB VRAM) |
| OS | Ubuntu 24.04.4 LTS |
| Driver | NVIDIA 580.65.06 |
| llama.cpp | b8149-6-g832aa9476 |
Models Tested¶
All models are IQ4-class quantizations (~18-20GB) of the same base model (Qwen3.5-35B-A3B MoE).
| ID | Provider | Quant | Approach |
|---|---|---|---|
aessedai-iq4xs |
AesSedai | IQ4_XS | MoE-optimized: Q8_0 attention + IQ3_S FFN experts (source) |
bartowski-iq4xs |
bartowski | IQ4_XS | imatrix calibration (source) |
unsloth-mxfp4 |
unsloth | MXFP4_MOE | MXFP4 format (source) |
Results (v2 — 2026-03-01)¶
Summary Table¶
| Model | Mean Score | 95% CI | TPS | Perfect Runs |
|---|---|---|---|---|
| AesSedai IQ4_XS | 85.7 | [77.7, 93.7] | 27.3 | 1/15 |
| bartowski IQ4_XS | 84.7 | [78.4, 90.9] | 25.3 | 1/15 |
| unsloth MXFP4 | 83.7 | [75.8, 91.5] | 28.2 | 3/15 |
Finding: All three providers perform within margin of error on coding tasks. The confidence intervals overlap significantly, indicating no statistically significant difference.
Score Distribution¶
| Model | Min | Max | Std Dev | Scores |
|---|---|---|---|---|
| AesSedai | 40 | 100 | 14.5 | 90, 90, 40, 90, 90, 90, 90, 90, 100, 90, 90, 90, 90, 90, 65 |
| bartowski | 65 | 100 | 11.3 | 65, 90, 90, 65, 90, 90, 100, 90, 90, 90, 65, 90, 90, 75, 90 |
| unsloth | 65 | 100 | 14.2 | 65, 100, 65, 90, 90, 90, 100, 65, 100, 65, 90, 90, 90, 65, 90 |
Level Pass Rates (15 iterations)¶
| Level | Task | Points | AesSedai | bartowski | unsloth |
|---|---|---|---|---|---|
| L1 | Basic queue (add/get, FIFO) | 25 | 87% | 80% | 67% |
| L2 | Retry with exponential backoff | 25 | 87% | 87% | 100% |
| L3 | Priority scheduling | 25 | 93% | 100% | 100% |
| L4 | Find & fix concurrency bug | 15 | 100% | 100% | 100% |
| L5 | Multi-file refactoring | 10 | 7% | 20% | 27% |
Observations: - L4 (concurrency bug fix) has 100% pass rate across all providers - L5 (multi-file refactoring) is the hardest, with <30% pass rate - L1 surprisingly has the most variance — syntax errors in code extraction
Methodology Changes (v1 → v2)¶
| Aspect | v1 (2026-02-27) | v2 (2026-03-01) |
|---|---|---|
| Iterations | 5 | 15 (better statistical significance) |
| Context size | 40960 tokens | 40960 tokens (unchanged) |
| Validation | Lambda string matching | PytestValidator (real pytest tests) |
| Sampler | temp=0.1, /no_think prefix |
temp=0.6, top_p=0.95 (THINKING_CODING) |
| Prompts | ~50 chars each | 500+ chars with clear requirements |
| Code storage | Best iteration only | All iterations (full artifacts) |
v2 Test Details¶
L1: Basic Queue — Implement add_job(fn, *args) returning job_id, get_result(job_id) blocking until complete, FIFO ordering verified
L2: Retry with Backoff — Max 3 retries with exponential backoff (1s base), JobFailedError after exhaustion
L3: Priority Queue — add_job(fn, *args, priority=0), higher priority executes first, same priority uses FIFO
L4: Concurrency Bug Fix — Given broken code with race condition in self.results[job_id] = result, fix with proper locking
L5: Multi-file Refactor — Split monolithic queue.py into queue/{__init__,core,retry,priority}.py maintaining all functionality
Key Findings¶
-
No significant difference between providers — All three providers score within 2 points of each other (83.7-85.7), with overlapping confidence intervals.
-
v2 methodology shows higher scores — With proper prompts and thinking mode, all providers achieve 80%+ avg vs v1's erratic results.
-
L5 remains challenging — Multi-file refactoring has <30% success rate across all providers, suggesting this is a genuine capability limit.
-
TPS varies slightly — unsloth MXFP4 is marginally faster (28.2 TPS) but all are within 10% of each other.
Recommendations¶
| Use Case | Recommended |
|---|---|
| Tight VRAM budget | AesSedai IQ4_XS (18.1 GB, lowest VRAM) |
| Maximum speed | unsloth MXFP4 (28.2 TPS, slightly faster) |
| Coding tasks | Any — no significant difference |
| Lowest variance | bartowski IQ4_XS (std dev 11.3) |
Files¶
| File | Description |
|---|---|
result_aessedai.json |
Full benchmark results (15 iterations) |
result_bartowski.json |
Full benchmark results (15 iterations) |
result_unsloth.json |
Full benchmark results (15 iterations) |
artifacts/*/iter_*/ |
Generated code per iteration per level |
References¶
- Why Maybe We're Measuring LLM Compression Wrong — KLD methodology
- Qwen3.5-35B-A3B Q4 Quantization Comparison (Reddit) — Source of KLD/PPL data
- AesSedai MoE-Optimized Explanation — Explains Q8_0 attention + lower FFN approach