Benchmark Report: Nemotron-3-Nano vs Devstral-Small-2¶

Date: 2026-02-07 Hardware: RTX 4090 (24 GB VRAM) Runtime: llama.cpp router mode, single GPU Temperature: 0.2 (fixed for all models) Evaluator: Claude Opus 4.6

Executive Summary¶

Devstral-Small-2-24B (Q4_K_M) outperforms Nemotron-3-Nano-30B-A3B (UD-Q4_K_XL) across nearly all quality and latency metrics. Nemotron's only advantage is raw generation speed (89 tok/s vs 59 tok/s), but this is negated by its 30x higher time-to-first-token and frequent response truncation due to hidden reasoning tokens.

Recommendation: Use Devstral-Small-2 as the primary code model. Nemotron is not suitable for interactive use at the current 512 max_tokens limit because its reasoning tokens consume the output budget invisibly.

Quality Scores (1-5)¶

Category	Devstral	Nemotron	Winner
Factual Knowledge	5.0	5.0	Tie
Reasoning & Logic	4.0	2.7	Devstral
Code Generation	4.7	3.7	Devstral
Tool Use / Structured Output	4.0	3.0	Devstral
Writing & Summarization	5.0	4.5	Devstral
Overall Average	4.5	3.8	Devstral

Hallucination Detection¶

Both models achieved 0% hallucination rate across all factual prompts (22 facts checked per model). No forbidden claims detected, no fabricated laws, no Pluto-as-planet errors. Both models are factually reliable at temperature 0.2.

Metric	Devstral	Nemotron
Facts checked	22	22
Facts correct	22	22
Facts hallucinated	0	0
Hallucination rate	0.0%	0.0%

Performance Metrics¶

Metric	Devstral	Nemotron	Notes
Avg TTFT	59 ms	1,757 ms	Devstral 30x faster
Min/Max TTFT	37-108 ms	0-4,893 ms	Nemotron TTFT is highly variable
Gen tok/s	59.4	88.9	Nemotron 50% faster generation
Prompt tok/s	1,626-2,740	250-314	Devstral 6-10x faster prompt processing
VRAM (loaded)	21,715 MB	22,641 MB	Both fit on 24 GB with ~2 GB headroom
Model load time	17.0 s	54.2 s	Devstral 3x faster to load
Model unload time	3-4 ms	3 ms	Instant for both
VRAM after unload	2,575 MB	2,575 MB	Same baseline

TTFT Analysis¶

Nemotron's high TTFT is caused by its reasoning/thinking phase. Even at temperature 0.2, the model generates internal reasoning tokens before producing visible output. These tokens consume the max_tokens budget but are not visible in the response. This manifests as:

Long pauses before any output appears (up to 5 seconds)
Truncated responses when reasoning consumes most of the token budget
Empty responses when ALL tokens are consumed by reasoning (constraint puzzle)

Devstral has no reasoning phase, delivering its first token in 37-108ms consistently.

Mode-Switching Lifecycle¶

Transition times for blank -> code_model -> RAG -> code_model:

Transition	Devstral	Nemotron
blank -> model (load)	17,047 ms	54,156 ms
blank -> model (verify TTFT)	83 ms	<1 ms*
model -> RAG (unload)	4 ms	3 ms
VRAM freed after unload	19,140 MB	20,064 MB
RAG -> model (reload)	16,056 ms	51,138 ms
RAG -> model (verify TTFT)	82 ms	<1 ms*
Total round-trip	33,272 ms	105,300 ms

*Nemotron's verify TTFT shows 0ms because the short "READY" response completes without a visible thinking phase. Real TTFT for substantive prompts averages 1,757ms.

User-visible switch time: - Devstral: ~17s to switch modes (acceptable for IDE workflow) - Nemotron: ~54s to switch modes (noticeable delay)

Detailed Scoring Rationale¶

Factual Knowledge (Devstral 5.0 / Nemotron 5.0)¶

Both models answered all three factual prompts perfectly: - Thermodynamics: All four laws correct with core principles - Solar System: All 8 planets in order, Jupiter largest, Mercury smallest - Programming Languages: Correct years and creators for Python/Java/Rust

Nemotron provided slightly more detailed responses (extra context about organizations), but both achieved perfect accuracy.

Reasoning & Logic (Devstral 4.0 / Nemotron 2.7)¶

Syllogism: Both correct (cannot conclude). Nemotron used formal logic notation which was more rigorous.
Math problem: Devstral solved correctly (10:37 AM). Nemotron's response was truncated at 512 tokens with no answer reached -- thinking consumed most tokens.
Constraint puzzle: Devstral gave a wrong answer (yellow adjacent to green, violating constraint). Nemotron returned an empty response (all 512 tokens consumed by thinking, 0 visible output).

Code Generation (Devstral 4.7 / Nemotron 3.7)¶

FizzBuzz: Both correct. Devstral slightly cleaner. Nemotron's docstring had a minor example bug.
Binary Search: Devstral complete with a sorted-check caveat (O(n log n)). Nemotron's code was truncated mid-docstring.
Bug Detection: Both correctly identified the merge_sorted bug. Devstral provided a complete fix with demonstration. Nemotron's was correct but truncated.

Tool Use (Devstral 4.0 / Nemotron 3.0)¶

JSON Extraction: Both correct. Nemotron returned plain JSON (as asked). Devstral wrapped it in markdown fences (minor format issue).
Function Calling: Devstral returned correct JSON. Nemotron returned only { -- thinking consumed all tokens, producing no useful output.

Writing (Devstral 5.0 / Nemotron 4.5)¶

Summarization: Both produced exactly 3 bullet points, accurate and comprehensive. Tie.
Technical Explanation: Both used good analogies. Devstral stayed under 150 words. Nemotron exceeded the limit (~167 words).

Key Findings¶

1. Nemotron's Reasoning Tokens Are a Double-Edged Sword¶

The model's internal reasoning capability (even with enable_thinking=false in the template) still produces hidden tokens that consume the output budget. At 512 max_tokens, this causes: - 3 of 13 responses truncated or empty - Unpredictable response length - User-perceived quality drop

Mitigation: Increase max_tokens to 2048+ for Nemotron, or use the model only for tasks where reasoning overhead adds value (math, complex analysis) and shorter outputs are acceptable.

2. Devstral Is the Better All-Rounder¶

Devstral delivers consistent, predictable output across all categories with excellent latency. It is the better choice for: - Interactive coding (low TTFT, consistent output) - Tool calling and structured output - IDE integration (faster mode switching)

3. Both Models Are Factually Reliable¶

Zero hallucinations detected across 22 fact checks per model at temperature 0.2. Both can be trusted for factual queries at low temperature.

4. VRAM Usage Is Comparable¶

Both models fit comfortably on the 24 GB RTX 4090 with ~2 GB headroom for the embedding model: - Devstral: 21.7 GB loaded - Nemotron: 22.6 GB loaded - Baseline after unload: 2.6 GB

Charts¶

Quality Comparison (Radar)¶

Quality Comparison Radar

Performance Comparison¶

Performance Comparison

Hallucination Detection Results¶

Hallucination Results

Mode-Switching Lifecycle¶

Mode-Switching Lifecycle

Methodology¶

See ../README.md for full methodology, including prompt suite, hallucination detection approach, and scoring rubric.

Benchmark Report: Nemotron-3-Nano vs Devstral-Small-2¶

Executive Summary¶

Quality Scores (1-5)¶

Hallucination Detection¶

Performance Metrics¶

TTFT Analysis¶

Mode-Switching Lifecycle¶

Detailed Scoring Rationale¶

Factual Knowledge (Devstral 5.0 / Nemotron 5.0)¶

Reasoning & Logic (Devstral 4.0 / Nemotron 2.7)¶

Code Generation (Devstral 4.7 / Nemotron 3.7)¶

Tool Use (Devstral 4.0 / Nemotron 3.0)¶

Writing (Devstral 5.0 / Nemotron 4.5)¶

Key Findings¶

1. Nemotron's Reasoning Tokens Are a Double-Edged Sword¶

2. Devstral Is the Better All-Rounder¶

3. Both Models Are Factually Reliable¶

4. VRAM Usage Is Comparable¶

Charts¶

Quality Comparison (Radar)¶

Performance Comparison¶

Hallucination Detection Results¶

Mode-Switching Lifecycle¶

Methodology¶

Raw Data¶