
Benchmark Report: Nemotron-3-Nano vs Devstral-Small-2

Date: 2026-02-07
Hardware: RTX 4090 (24 GB VRAM)
Runtime: llama.cpp router mode, single GPU
Temperature: 0.2 (fixed for all models)
Evaluator: Claude Opus 4.6

Executive Summary

Devstral-Small-2-24B (Q4_K_M) outperforms Nemotron-3-Nano-30B-A3B (UD-Q4_K_XL) across nearly all quality and latency metrics. Nemotron's only advantage is raw generation speed (89 tok/s vs 59 tok/s), but this is negated by its 30x higher average time-to-first-token (1,757 ms vs 59 ms) and by frequent response truncation caused by hidden reasoning tokens.

Recommendation: Use Devstral-Small-2 as the primary code model. Nemotron is not suitable for interactive use at the current 512 max_tokens limit because its reasoning tokens consume the output budget invisibly.

Quality Scores (1-5)

| Category | Devstral | Nemotron | Winner |
|---|---|---|---|
| Factual Knowledge | 5.0 | 5.0 | Tie |
| Reasoning & Logic | 4.0 | 2.7 | Devstral |
| Code Generation | 4.7 | 3.7 | Devstral |
| Tool Use / Structured Output | 4.0 | 3.0 | Devstral |
| Writing & Summarization | 5.0 | 4.5 | Devstral |
| Overall Average | 4.5 | 3.8 | Devstral |
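
The report does not state how the overall average is weighted. As a quick sanity check, the sketch below assumes an unweighted mean of the five category scores, rounded to one decimal; under that assumption it reproduces the figures in the table.

```python
# Category scores copied from the table above: (Devstral, Nemotron).
scores = {
    "Factual Knowledge": (5.0, 5.0),
    "Reasoning & Logic": (4.0, 2.7),
    "Code Generation": (4.7, 3.7),
    "Tool Use / Structured Output": (4.0, 3.0),
    "Writing & Summarization": (5.0, 4.5),
}


def overall_average(column: int) -> float:
    """Unweighted mean of one model's category scores (assumed weighting)."""
    values = [pair[column] for pair in scores.values()]
    return round(sum(values) / len(values), 1)


devstral_avg = overall_average(0)
nemotron_avg = overall_average(1)
print(f"Devstral: {devstral_avg}, Nemotron: {nemotron_avg}")
```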

Hallucination Detection

Both models achieved 0% hallucination rate across all factual prompts (22 facts checked per model). No forbidden claims detected, no fabricated laws, no Pluto-as-planet errors. Both models are factually reliable at temperature 0.2.

| Metric | Devstral | Nemotron |
|---|---|---|
| Facts checked | 22 | 22 |
| Facts correct | 22 | 22 |
| Facts hallucinated | 0 | 0 |
| Hallucination rate | 0.0% | 0.0% |

Performance Metrics

| Metric | Devstral | Nemotron | Notes |
|---|---|---|---|
| Avg TTFT | 59 ms | 1,757 ms | Devstral 30x faster |
| Min/Max TTFT | 37-108 ms | 0-4,893 ms | Nemotron TTFT is highly variable |
| Gen tok/s | 59.4 | 88.9 | Nemotron 50% faster generation |
| Prompt tok/s | 1,626-2,740 | 250-314 | Devstral 6-10x faster prompt processing |
| VRAM (loaded) | 21,715 MB | 22,641 MB | Both fit on 24 GB with ~2 GB headroom |
| Model load time | 17.0 s | 54.2 s | Devstral 3x faster to load |
| Model unload time | 3-4 ms | 3 ms | Instant for both |
| VRAM after unload | 2,575 MB | 2,575 MB | Same baseline |
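
The ratios in the Notes column can be re-derived from the raw measurements. The following is a minimal check using only values from the table; the variable names are illustrative.

```python
# Measurements copied from the Performance Metrics table.
ttft_ms = {"devstral": 59, "nemotron": 1757}
gen_tok_s = {"devstral": 59.4, "nemotron": 88.9}
load_s = {"devstral": 17.0, "nemotron": 54.2}

# How many times longer Nemotron takes to emit its first token.
ttft_ratio = ttft_ms["nemotron"] / ttft_ms["devstral"]

# Nemotron's generation speed advantage, as a percentage over Devstral.
gen_speedup_pct = (gen_tok_s["nemotron"] / gen_tok_s["devstral"] - 1) * 100

# How many times longer Nemotron takes to load.
load_ratio = load_s["nemotron"] / load_s["devstral"]

print(f"TTFT ratio:        {ttft_ratio:.1f}x")
print(f"Generation gain:   {gen_speedup_pct:.1f}%")
print(f"Load time ratio:   {load_ratio:.1f}x")
```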

TTFT Analysis

Nemotron's high TTFT is caused by its reasoning/thinking phase. Even at temperature 0.2, the model generates internal reasoning tokens before producing visible output. These tokens consume the max_tokens budget but are not visible in the response. This manifests as:

  • Long pauses before any output appears (up to 5 seconds)
  • Truncated responses when reasoning consumes most of the token budget
  • Empty responses when ALL tokens are consumed by reasoning (constraint puzzle)

Devstral has no reasoning phase, delivering its first token in 37-108 ms consistently.
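
To see how TTFT trades off against generation speed, the sketch below uses a deliberately simple latency model: total time equals average TTFT plus visible tokens divided by average generation speed. This model is an assumption, not part of the benchmark; it ignores Nemotron's large TTFT variance and any truncation effects. The function names are illustrative.

```python
def total_latency_s(visible_tokens: int, ttft_s: float, gen_tok_s: float) -> float:
    """Assumed model: time to first token plus steady-state generation time."""
    return ttft_s + visible_tokens / gen_tok_s


# Averages from the Performance Metrics table (TTFT converted to seconds).
DEVSTRAL = {"ttft_s": 0.059, "gen_tok_s": 59.4}
NEMOTRON = {"ttft_s": 1.757, "gen_tok_s": 88.9}


def break_even_tokens(slow_start: dict, fast_start: dict) -> float:
    """Visible-token count at which both models finish at the same time."""
    ttft_gap = slow_start["ttft_s"] - fast_start["ttft_s"]
    # Seconds saved per token by the model that generates faster.
    per_token_gain = 1 / fast_start["gen_tok_s"] - 1 / slow_start["gen_tok_s"]
    return ttft_gap / per_token_gain


n = break_even_tokens(NEMOTRON, DEVSTRAL)
print(f"Break-even response length: {n:.0f} visible tokens")
print(f"200-token reply, Devstral: {total_latency_s(200, **DEVSTRAL):.2f} s")
print(f"200-token reply, Nemotron: {total_latency_s(200, **NEMOTRON):.2f} s")
```

Under these assumptions, Nemotron's faster generation only pays off for replies longer than roughly 300 visible tokens, which is a large share of the 512-token budget used in this benchmark.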

Mode-Switching Lifecycle

Transition times for blank -> code_model -> RAG -> code_model:

| Transition | Devstral | Nemotron |
|---|---|---|
| blank -> model (load) | 17,047 ms | 54,156 ms |
| blank -> model (verify TTFT) | 83 ms | <1 ms* |
| model -> RAG (unload) | 4 ms | 3 ms |
| VRAM freed after unload | 19,140 MB | 20,064 MB |
| RAG -> model (reload) | 16,056 ms | 51,138 ms |
| RAG -> model (verify TTFT) | 82 ms | <1 ms* |
| Total round-trip | 33,272 ms | 105,300 ms |

*Nemotron's verify TTFT shows 0 ms because the short "READY" response completes without a visible thinking phase. Real TTFT for substantive prompts averages 1,757 ms.

User-visible switch time:

  • Devstral: ~17 s to switch modes (acceptable for IDE workflow)
  • Nemotron: ~54 s to switch modes (noticeable delay)
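
The round-trip total and the freed-VRAM figure follow directly from the individual measurements. The sketch below re-derives both for Devstral; the dictionary keys are illustrative labels, not names from the benchmark harness.

```python
# Devstral transition timings in milliseconds, copied from the lifecycle table.
devstral_steps_ms = {
    "load": 17_047,
    "verify_ttft_after_load": 83,
    "unload": 4,
    "reload": 16_056,
    "verify_ttft_after_reload": 82,
}
round_trip_ms = sum(devstral_steps_ms.values())

# VRAM freed is the loaded footprint minus the post-unload baseline (MB).
vram_loaded_mb = 21_715
vram_baseline_mb = 2_575
vram_freed_mb = vram_loaded_mb - vram_baseline_mb

print(f"Round trip: {round_trip_ms:,} ms")
print(f"VRAM freed: {vram_freed_mb:,} MB")
```

The same sum for Nemotron lands within a few milliseconds of the reported total, since its verify steps are only reported as under 1 ms.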

Detailed Scoring Rationale

Factual Knowledge (Devstral 5.0 / Nemotron 5.0)

Both models answered all three factual prompts perfectly:

  • Thermodynamics: All four laws correct with core principles
  • Solar System: All 8 planets in order, Jupiter largest, Mercury smallest
  • Programming Languages: Correct years and creators for Python/Java/Rust

Nemotron provided slightly more detailed responses (extra context about organizations), but both achieved perfect accuracy.

Reasoning & Logic (Devstral 4.0 / Nemotron 2.7)

  • Syllogism: Both correct (cannot conclude). Nemotron used formal logic notation, which made its answer more rigorous.
  • Math problem: Devstral solved correctly (10:37 AM). Nemotron's response was truncated at 512 tokens with no answer reached -- thinking consumed most tokens.
  • Constraint puzzle: Devstral gave a wrong answer (yellow adjacent to green, violating constraint). Nemotron returned an empty response (all 512 tokens consumed by thinking, 0 visible output).

Code Generation (Devstral 4.7 / Nemotron 3.7)

  • FizzBuzz: Both correct. Devstral slightly cleaner. Nemotron's docstring had a minor example bug.
  • Binary Search: Devstral complete with a sorted-check caveat (O(n log n)). Nemotron's code was truncated mid-docstring.
  • Bug Detection: Both correctly identified the merge_sorted bug. Devstral provided a complete fix with demonstration. Nemotron's was correct but truncated.

Tool Use (Devstral 4.0 / Nemotron 3.0)

  • JSON Extraction: Both correct. Nemotron returned plain JSON (as asked). Devstral wrapped it in markdown fences (minor format issue).
  • Function Calling: Devstral returned correct JSON. Nemotron returned only { -- thinking consumed all tokens, producing no useful output.

Writing (Devstral 5.0 / Nemotron 4.5)

  • Summarization: Both produced exactly 3 bullet points, accurate and comprehensive. Tie.
  • Technical Explanation: Both used good analogies. Devstral stayed under 150 words. Nemotron exceeded the limit (~167 words).

Key Findings

1. Nemotron's Reasoning Tokens Are a Double-Edged Sword

The model's internal reasoning capability (even with enable_thinking=false in the template) still produces hidden tokens that consume the output budget. At 512 max_tokens, this causes:

  • 5 of 13 responses truncated or empty (math problem, constraint puzzle, binary search, bug detection, function calling)
  • Unpredictable response length
  • User-perceived quality drop

Mitigation: Increase max_tokens to 2048+ for Nemotron, or use the model only for tasks where reasoning overhead adds value (math, complex analysis) and shorter outputs are acceptable.
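
The effect of the budget can be illustrated with a small model in which hidden reasoning tokens and the visible answer share one max_tokens allowance. This is a sketch under that assumption; the function name and the token counts are hypothetical and were chosen only to show the three outcomes, not measured in the benchmark.

```python
def classify_response(max_tokens: int, reasoning_tokens: int, answer_tokens: int) -> str:
    """Classify what the user sees when hidden reasoning shares the output budget."""
    # Whatever the reasoning phase does not consume is left for visible output.
    visible_budget = max(0, max_tokens - reasoning_tokens)
    if visible_budget == 0:
        return "empty"
    if visible_budget < answer_tokens:
        return "truncated"
    return "complete"


# Hypothetical token counts, for illustration only.
print(classify_response(max_tokens=512, reasoning_tokens=600, answer_tokens=150))
print(classify_response(max_tokens=512, reasoning_tokens=450, answer_tokens=150))
print(classify_response(max_tokens=2048, reasoning_tokens=600, answer_tokens=150))
```

With these illustrative numbers, the same prompt moves from empty or truncated at a 512-token budget to complete at 2048, which is the intuition behind the mitigation above.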

2. Devstral Is the Better All-Rounder

Devstral delivers consistent, predictable output across all categories with excellent latency. It is the better choice for:

  • Interactive coding (low TTFT, consistent output)
  • Tool calling and structured output
  • IDE integration (faster mode switching)

3. Both Models Are Factually Reliable

Zero hallucinations detected across 22 fact checks per model at temperature 0.2. Both can be trusted for factual queries at low temperature.

4. VRAM Usage Is Comparable

Both models fit comfortably on the 24 GB RTX 4090 with ~2 GB headroom for the embedding model:

  • Devstral: 21.7 GB loaded
  • Nemotron: 22.6 GB loaded
  • Baseline after unload: 2.6 GB

Charts

Quality Comparison (Radar)

[Figure: Quality Comparison Radar]

Performance Comparison

[Figure: Performance Comparison]

Hallucination Detection Results

[Figure: Hallucination Results]

Mode-Switching Lifecycle

[Figure: Mode-Switching Lifecycle]

Methodology

See ../README.md for full methodology, including prompt suite, hallucination detection approach, and scoring rubric.

Raw Data