GPU Model Benchmark Framework¶
Reproducible benchmarking framework for comparing LLMs running on the llama.cpp router (RTX 4090, 24 GB VRAM).
Quick Start¶
# Run a single model benchmark
uv run python scripts/benchmark.py \
--model nemotron-3-nano \
--output docs/benchmarks/20260207_nemotron_devstral/
# Run with SelfCheckGPT consistency checking (runs factual prompts 5x)
uv run python scripts/benchmark.py \
--model nemotron-3-nano \
--consistency-check --consistency-runs 5 \
--output docs/benchmarks/20260207_nemotron_devstral/
# Run lifecycle (mode-switching) test
uv run python scripts/benchmark.py \
--lifecycle \
--models nemotron-3-nano,devstral-small-2 \
--output docs/benchmarks/20260207_nemotron_devstral/
# Generate comparison charts from scored JSON files
uv run python scripts/benchmark_chart.py \
docs/benchmarks/20260207_nemotron_devstral/*.json \
--output docs/benchmarks/20260207_nemotron_devstral/charts/
Directory Layout¶
docs/benchmarks/
README.md <- this file (methodology)
20260207_nemotron_devstral/ <- one folder per benchmark run
20260207_nemotron.json <- raw results + scores
20260207_devstral.json
20260207_lifecycle.json
20260207_nemotron_devstral_report.md
charts/
radar.png
performance.png
lifecycle.png
Methodology¶
Temperature¶
All models are tested with temperature = 0.2 for deterministic, comparable outputs. This overrides any model-specific preset defaults.
Prompt Categories (Radar Chart Axes)¶
| # | Category | Prompts | Hallucination Detection |
|---|---|---|---|
| 1 | Factual Knowledge | 3 | Yes - ground truth verification |
| 2 | Reasoning and Logic | 3 | No |
| 3 | Code Generation | 3 | No |
| 4 | Hard Code Generation | 5 | No |
| 5 | Tool Use / Structured Output | 2 | No |
| 6 | Writing and Summarization | 2 | No |
| 7 | German Language | 3 | Yes - for factual prompts |
| 8 | Grammar Evaluation | 3 | No |
Hard Code Generation includes: - LRU cache implementation (data structures) - Async code refactoring (concurrency patterns) - Thread-safe queue (synchronization) - SQL injection detection (security) - Regex parser (parsing)
German Language tests multilingual capabilities: - German historical facts (with ground truth) - German math word problems (MGSM-style) - German-to-English translation
Grammar Evaluation tests linguistic competence: - Grammar correction - Grammaticality judgment (CoLA-style) - Style improvement
Hallucination Detection (Factual Knowledge)¶
Each factual prompt has a ground truth consisting of verifiable facts. After the model responds, the checker runs two tests:
- Presence check - required keywords that MUST appear in the response. Missing = incomplete answer (score penalty, not hallucination).
- Forbidden-claim check - keywords indicating a known false claim. Present = hallucination (e.g., Pluto listed as a planet).
Metrics produced per factual prompt:
| Metric | Meaning |
|---|---|
facts_correct |
Verified facts found in response |
facts_missing |
Expected facts not mentioned |
facts_hallucinated |
Forbidden/false claims detected |
hallucination_rate |
hallucinated / (correct + hallucinated) |
Consistency Checking (SelfCheckGPT)¶
Optional multi-run consistency analysis using the SelfCheckGPT approach. When enabled, each factual prompt is run multiple times (default: 5), and responses are analyzed for self-consistency.
uv run python scripts/benchmark.py \
--model nemotron-3-nano \
--consistency-check \
--consistency-runs 5 \
--output docs/benchmarks/run/
The insight: if a model knows a fact, it will consistently produce it; if hallucinating, responses will diverge across runs.
Metrics produced when consistency checking is enabled:
| Metric | Meaning |
|---|---|
consistency_score |
0.0-1.0, how consistent facts are across runs |
inconsistent_facts |
Facts that appeared in < 50% of runs |
runs |
Number of times the prompt was executed |
A low consistency score (< 0.7) indicates potential hallucinations that keyword-based detection might miss.
Performance Metrics¶
| Metric | Source | Unit |
|---|---|---|
| Time to First Token (TTFT) | Client-side streaming | ms |
| Generation speed | Server timings |
tok/s |
| Prompt processing speed | Server timings |
tok/s |
| Total response time | Client-side | ms |
| VRAM (peak) | nvidia-smi |
MB |
| Model load time | Router load API | ms |
| Model unload time | Router unload API | ms |
Mode-Switching Lifecycle Test¶
Measures user-visible latency for mode transitions:
Each step: 1. Unload current model (timed) 2. Load next model (timed) 3. Verification request - simple prompt confirming model works (timed)
Total switch time = unload + load + first-request TTFT.
Quality Scoring¶
All responses are scored 1-5 by the evaluator (Claude Opus 4.6) after the benchmark run completes. Scoring rubric:
| Score | Meaning |
|---|---|
| 5 | Perfect - complete, accurate, well-formatted |
| 4 | Good - minor issues, mostly correct |
| 3 | Acceptable - some errors or missing info |
| 2 | Poor - significant errors or incomplete |
| 1 | Failed - wrong answer or gibberish |
For factual prompts, the hallucination rate adjusts the score: - 0% hallucination + all facts present -> 5 - 0% hallucination + some facts missing -> 4 - Any hallucination detected -> capped at 2
Adding a New Model¶
- Register the model in
llama_cpp_gguf_presets.ini - Run the benchmark script with
--model <model-id> - The evaluator scores the responses
- Regenerate charts
Job Queue Challenge¶
A graduated difficulty benchmark for evaluating LLM coding capabilities. See 20260226_qwen35_job_queue_challenge/README.md for full details.
Judge: Claude Code (Opus 4.6) — designed prompts, ran benchmarks, scored via pytest
Quick Start¶
# Run single iteration (quick test)
uv run python docs/benchmarks/20260226_qwen35_job_queue_challenge/benchmark_runner.py \
--model qwen35-27b-q3 --port 7093
# Run 5 iterations for reliable results (recommended)
uv run python docs/benchmarks/20260226_qwen35_job_queue_challenge/benchmark_runner.py \
--model qwen35-27b-q3 --port 7093 \
--iterations 5 \
--output docs/benchmarks/20260226_qwen35_job_queue_challenge/
Difficulty Levels¶
| Level | Task | Points |
|---|---|---|
| L1 | Basic queue (add/get, FIFO) | 25 |
| L2 | Retry with exponential backoff | 25 |
| L3 | Priority scheduling | 25 |
| L4 | Find & fix concurrency bug | 15 |
| L5 | Multi-file refactoring | 10 |
Total: 100 points
Multi-Iteration Mode¶
For reliable comparisons, run 5 iterations:
This produces: - Best score (used for artifacts) - Average, min, max scores - All individual scores for variance analysis
Results are saved to result_<model>.json with statistics.