Pick a Mode for Your Workload
Learn which bundled mode fits your use case and your GPU, before you switch into it.
A mode is a named bundle of services that gpumod starts and stops together (see Modes deep dive for how switching works). The bundled modes were sized for a 24 GB card; this page helps you pick one — or decide to build your own.
Prerequisites
- gpumod initialized (`gpumod init`) — see Getting Started
- Know your VRAM: `gpumod status` prints it on the first line
Decision tree
Start with your primary task:
- Coding agent(s) (Claude Code, aider, Hermes driving tools)
    - Need 2–3 concurrent agents → `hermes-agent` or `code` — both run Gemma 4 26B-A4B multi-slot (N=3 × 128K ctx, text-only) plus a code-embedding service. See Multi-agent with cont-batching.
    - Need 5 cheap parallel sessions, smaller model → `multi-agent-gpt-oss` (GPT-OSS 20B, 14 GB).
- Research / long-form reasoning → `hermes-agent` (3 symmetric agents on cont-batching is the highest-aggregate-throughput configuration measured) or `nemotron` for a single reasoning model.
- RAG / semantic search only (no local chat LLM) → `rag` — two embedding models, 7.5 GB total, leaves the rest of the GPU free.
- Voice (ASR + TTS + chat) → `speak`.
- Vision (image inputs) → no bundled mode; the multi-slot Gemma preset is text-only. Start the single-slot `gemma4-26b-a4b-q4` service directly, or create a mode around it.
- Benchmarking / measuring a model in isolation → `blank` first (stops everything, 0 MB), then start only the service under test.
- Finetuning on the same GPU → `finetuning` — keeps only the small embedding service (2.5 GB) so the training job gets the rest.
VRAM-fit table
Total VRAM per bundled mode (from `gpumod mode list`) against common GPU
sizes. ✓ = fits with headroom, ⚠ = fits but tight, ✗ = does not fit.
| Mode | Total VRAM | 12 GB | 16 GB | 24 GB |
|---|---|---|---|---|
| `blank` | 0 MB | ✓ | ✓ | ✓ |
| `finetuning` | 2,500 MB | ✓ | ✓ | ✓ |
| `rag` | 7,500 MB | ✓ | ✓ | ✓ |
| `multi-agent-gpt-oss` | 14,000 MB | ✗ | ⚠ | ✓ |
| `multi-agent-qwen3-coder` | 20,000 MB | ✗ | ✗ | ✓ |
| `speak` | 22,000 MB | ✗ | ✗ | ⚠ |
| `multi-agent-qwen3-next` | 22,000 MB | ✗ | ✗ | ⚠ |
| `code` / `hermes-agent` | 22,500 MB | ✗ | ✗ | ⚠ |
| `nemotron` | 22,500 MB | ✗ | ✗ | ⚠ |
| `hacker` | 22,500 MB | ✗ | ✗ | ⚠ |
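
The ✓ / ⚠ / ✗ marks in the table can be approximated with a few lines of Python. This is only a sketch, not gpumod's own logic: `classify_fit`, `MB_PER_GB`, and `TIGHT_HEADROOM_MB` are hypothetical names, and it assumes that 1 GB counts as 1,000 MB and that headroom of 2,000 MB or less is "tight". Those two assumptions happen to reproduce every cell above, but the VRAM your card actually reports may differ from its nominal size.

```python
# Hypothetical helper, not part of gpumod: approximates the table's legend.
MB_PER_GB = 1_000          # assumption: nominal gigabytes (24 GB = 24,000 MB)
TIGHT_HEADROOM_MB = 2_000  # assumption: headroom at or below this is "tight"


def classify_fit(mode_vram_mb: int, gpu_gb: int) -> str:
    """Return the table mark for a mode total on a GPU of the given size."""
    headroom_mb = gpu_gb * MB_PER_GB - mode_vram_mb
    if headroom_mb < 0:
        return "✗"  # does not fit
    if headroom_mb <= TIGHT_HEADROOM_MB:
        return "⚠"  # fits but tight
    return "✓"      # fits with headroom


# Totals copied from the table above.
MODE_TOTALS_MB = {
    "blank": 0,
    "finetuning": 2_500,
    "rag": 7_500,
    "multi-agent-gpt-oss": 14_000,
    "multi-agent-qwen3-coder": 20_000,
    "speak": 22_000,
    "code": 22_500,
}

for name, total_mb in MODE_TOTALS_MB.items():
    marks = [classify_fit(total_mb, gpu_gb) for gpu_gb in (12, 16, 24)]
    print(f"{name:<26} {total_mb:>6} MB  {' '.join(marks)}")
```

To check a card size that is not in the table, call `classify_fit` with your own total and GPU size, for example `classify_fit(14_000, 20)`.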
"Tight" means under ~2 GB of headroom on a 24 GB card — fine for the designed workload, but don't co-run anything else on the GPU. Always confirm with a simulation, which accounts for your GPU and any context overrides:
Expected output:
Which model does a mode load?
Three places to look, in increasing detail:
- `gpumod mode status` (after switching) or `gpumod mode list` — names and VRAM totals.
- The mode YAML in `modes/` — the service ID list, often with extensive design-rationale comments (e.g. `hermes-agent.yaml` documents the whole slots-vs-agents model).
- The service preset in `presets/` — the exact model file, quantization, context size, and launch flags.
What happened
Mode VRAM totals are additive: each member preset declares an empirically
measured `vram_mb`, and gpumod refuses switches whose sum exceeds your GPU.
The bundled catalog therefore encodes the host's tested configurations —
picking a mode is picking a validated VRAM budget, not just a model.
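
The additive rule can be illustrated with a short Python sketch. It is not gpumod's implementation: the `Preset` class, the helper functions, and the service IDs are hypothetical, and the 2,500 + 5,000 MB split is an assumed breakdown chosen only so that the sum matches the 7,500 MB listed for `rag`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """Hypothetical stand-in for a service preset and its declared vram_mb."""
    service_id: str
    vram_mb: int


def mode_total_mb(presets: list[Preset]) -> int:
    """A mode's total is the plain sum of its members' declared vram_mb."""
    return sum(p.vram_mb for p in presets)


def can_switch(presets: list[Preset], gpu_vram_mb: int) -> bool:
    """Allow the switch only when the summed total does not exceed the GPU."""
    return mode_total_mb(presets) <= gpu_vram_mb


# Assumed example: two embedding services that add up to the rag total.
RAG_LIKE = [
    Preset("embedding-small", 2_500),  # hypothetical ID and size
    Preset("embedding-large", 5_000),  # hypothetical ID and size
]

print(mode_total_mb(RAG_LIKE))       # sum of the declared values
print(can_switch(RAG_LIKE, 8_000))   # total is within an 8,000 MB budget
print(can_switch(RAG_LIKE, 6_000))   # total exceeds a 6,000 MB budget
```

Because the check uses declared values rather than live measurements, changing a preset (a larger context size, for instance) only affects the result once its `vram_mb` is updated to match.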
Next steps
- Run your first service — if you haven't started anything yet
- Modes deep dive — lifecycle, sleep, and drift recovery
- OpenAI-compatible clients — connect a client once your mode is up
- CLI Reference — `mode create` for your own bundles