Qwen3.6 vs Gemma4 Architecture Comparison¶

Date: 2026-04-24 Ticket: gpumod-4la

Goal¶

Compare local LLMs with different architectures and quantizations on coding task performance, TPS, and VRAM usage on RTX 4090 (24GB VRAM).

Setup¶

Component	Specification
CPU	AMD Ryzen 7 5700G (16 threads)
RAM	32 GB DDR4
GPU	NVIDIA GeForce RTX 4090 (24 GB VRAM)
OS	Ubuntu 24.04.4 LTS
Driver	NVIDIA 580.65.06
llama.cpp	b8838 (23b8cc499)

Models Tested¶

ID	Model	Architecture	Quant	File Size	VRAM est.
`qwen36-27b`	Qwen3.6-27B	Dense (27B all active)	Q4_K_M	16.0 GB	~18 GB
`qwen36-35b-a3b`	Qwen3.6-35B-A3B	MoE (35B total, 3B active)	UD-Q4_K_S	19.9 GB	~22 GB
`qwen36-35b-a3b-iq4xs`	Qwen3.6-35B-A3B	MoE (35B total, 3B active)	UD-IQ4_XS	17.0 GB	~21 GB
`gemma4-e4b`	Gemma 4 E4B	Dense (full precision)	BF16	15.0 GB	~16 GB

Results¶

Summary Table¶

Model	Architecture	Quant	Mean Score	Std Dev	95% CI	TPS	Perfect Runs
Qwen3.6-35B-A3B	MoE (3B active)	UD-Q4_K_S	90.0	0.0	[90.0, 90.0]	173.7	0/15
Gemma 4 E4B	Dense	BF16	88.3	6.5	[84.8, 91.9]	82.9	0/15
Qwen3.6-35B-A3B	MoE (3B active)	UD-IQ4_XS	87.3	10.3	[81.6, 93.0]	174.5	0/15
Qwen3.5-35B-A3B (AesSedai)†	MoE (3B active)	IQ4_XS	85.7	14.5	[77.7, 93.7]	27.3†	1/15
Qwen3.5-35B-A3B (bartowski)†	MoE (3B active)	IQ4_XS	84.7	11.3	[78.4, 90.9]	25.3†	1/15
Qwen3.5-35B-A3B (unsloth)†	MoE (3B active)	MXFP4	83.7	14.2	[75.8, 91.5]	28.2†	3/15
Qwen3.6-27B	Dense (27B)	Q4_K_M	80.3	6.9	[76.5, 84.2]	46.9	0/15

95% CI (Confidence Interval): the range where the true mean score likely falls 95% of the time. A narrow CI like [90.0, 90.0] means highly consistent results; a wide CI like [75.8, 91.5] means high variance across runs. When CIs overlap between models, the difference is not statistically significant.

Perfect Runs: iterations that scored 100/100 (all 5 levels passed). No model in this benchmark achieved a perfect run because L5 (multi-file refactoring) was never solved. The Qwen3.5 models occasionally scored 100 in the prior benchmark due to different L5 behavior.

† Qwen3.5 results from prior benchmark (2026-02-27), same v2 methodology. TPS measured via X-Llama-Timings header (may undercount thinking tokens).

Score Distribution¶

Model	Quant	Min	Max	Scores
Qwen3.6-35B-A3B	UD-Q4_K_S	90	90	90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90
Qwen3.6-35B-A3B	UD-IQ4_XS	50	90	90, 90, 90, 90, 90, 90, 90, 90, 90, 50, 90, 90, 90, 90, 90
Gemma 4 E4B	BF16	65	90	90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 65, 90
Qwen3.6-27B	Q4_K_M	75	90	90, 90, 85, 75, 75, 75, 75, 75, 90, 75, 85, 75, 75, 90, 75

Level Pass Rates (15 iterations)¶

Each iteration runs the model through 5 coding tasks of increasing difficulty. The model receives a prompt, generates Python code, and the output is validated by running real pytest tests against it. A level passes only if all tests pass.

L1 (25 pts): Implement a basic job queue with add_job() and get_result(), verified by FIFO ordering tests.
L2 (25 pts): Add retry logic with exponential backoff (1s base, max 3 retries), raise JobFailedError after exhaustion.
L3 (25 pts): Implement priority scheduling — higher priority jobs execute first, same priority uses FIFO.
L4 (15 pts): Given broken code with a race condition in self.results[job_id] = result, diagnose and fix with proper locking.
L5 (10 pts): Split a monolithic queue.py into queue/{__init__,core,retry,priority}.py while maintaining all functionality.

Level	Task	Points	Qwen3.6-27B	35B-A3B Q4_K_S	35B-A3B IQ4_XS	Gemma 4 E4B
L1	Basic queue (add/get, FIFO)	25	100%	100%	93%	100%
L2	Retry with exponential backoff	25	100%	100%	100%	93%
L3	Priority scheduling	25	100%	100%	100%	100%
L4	Find & fix concurrency bug	15	27%	100%	93%	100%
L5	Multi-file refactoring	10	13%	0%	0%	0%

Key Findings¶

MoE dominates dense on every metric. Qwen3.6-35B-A3B (Q4_K_S) scored 90/100 on every single iteration — zero variance across 15 runs. It is also 3.7x faster than the 27B dense model (173.7 vs 46.9 TPS) because only 3B of 35B params are active per token.
IQ4_XS trades minor quality for 1 GB VRAM savings. The smaller quant (UD-IQ4_XS, ~21 GB) scores 87.3 mean vs 90.0 for Q4_K_S (~22 GB), with one outlier iteration (50/100 where L1 and L4 both failed). TPS is identical (174.5 vs 173.7). The VRAM savings matter for co-hosting with an embedding model on 24 GB GPUs.
Gemma 4 E4B punches above its weight. A smaller model at full BF16 precision scores 88.3 mean — within striking distance of the 35B-A3B MoE — at 82.9 TPS. One outlier iteration (65) dragged the mean down.
Dense 27B struggles with L4 (concurrency). Only 27% pass rate on the bug-fix task. With thinking mode enabled, the model generates extensive reasoning tokens, sometimes timing out or producing incomplete code. The Q4_K_S and Gemma4 handle it reliably (100%); IQ4_XS at 93%.
L5 (multi-file refactoring) remains unsolved. 0% for all 35B-A3B variants and Gemma4, 13% for 27B. This confirms the finding from the prior Qwen3.5 benchmark — multi-file refactoring is a genuine capability ceiling for local models at this scale.
TPS correlates with active parameter count, not total model size or quant:
3B active (35B-A3B MoE, Q4_K_S): 173.7 TPS
3B active (35B-A3B MoE, IQ4_XS): 174.5 TPS
~4B (Gemma4 E4B): 82.9 TPS
27B active (dense): 46.9 TPS

Methodology¶

Reuses v2 methodology from the Qwen3.5 provider comparison:

Aspect	Configuration
Iterations	15
Validation	PytestValidator (real pytest tests)
Sampler	temp=0.6, top_p=0.95 (THINKING_CODING)
Timeout	300s per request
Context size	40960 (27B), 32768 (35B-A3B Q4_K_S & IQ4_XS), 65536 (Gemma4)

Recommendations¶

Use Case	Recommended
Best overall	Qwen3.6-35B-A3B Q4_K_S — highest quality, fastest
Co-hosted w/ embed	Qwen3.6-35B-A3B IQ4_XS (~21 GB) + embedding fits 24 GB
Lowest VRAM	Gemma 4 E4B (~16 GB, near-equal quality)
Maximum speed	Qwen3.6-35B-A3B (173-174 TPS, either quant)
Lowest variance	Qwen3.6-35B-A3B Q4_K_S (std dev 0.0)
Budget GPU (<20GB)	Gemma 4 E4B — best quality under 20 GB VRAM

Files¶

File	Description
`result_qwen36-27b.json`	Full benchmark results (15 iterations)
`result_qwen36-35b-a3b.json`	Full benchmark results (15 iterations)
`result_qwen36-35b-a3b-iq4xs.json`	Full benchmark results (15 iterations)
`result_gemma4-e4b.json`	Full benchmark results (15 iterations)
`artifacts//iter_/`	Generated code per iteration per level