Run Your First Service
Learn the full gpumod service lifecycle (preset → systemd unit → running endpoint) in about five minutes, using a small model that fits on any GPU with about 3 GB of free VRAM.
We use vllm-embedding-code (Qwen3-Embedding-0.6B, ~2.5 GB VRAM, ~60 s boot)
because it's the smallest bundled preset. Every lifecycle step works the same
for the large chat presets; only the VRAM, the boot time, and the request
endpoint in step 5 change.
Prerequisites
- gpumod installed and initialized (uv sync, gpumod init); see Getting Started
- User-level systemd lingering enabled: sudo loginctl enable-linger $USER
- A free GPU with at least 3 GB of VRAM available (gpumod status)
Steps
1. Check the GPU and find the preset
Expected output (abbreviated):
GPU: NVIDIA GeForce RTX 4090 VRAM: 24564 MB
Mode: blank
Services
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Name                                       ┃ State    ┃ VRAM (MB) ┃ Driver   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ Qwen3-Embedding-0.6B                       │ stopped  │ 2500      │ vllm     │
│ ...                                        │          │           │          │
└────────────────────────────────────────────┴──────────┴───────────┴──────────┘
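The service we want appears in the table under its display name,
Qwen3-Embedding-0.6B. Its service ID, used in the commands that follow, is
vllm-embedding-code (from presets/embedding/vllm-embedding-code.yaml).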
2. Install the systemd unit
gpumod renders a systemd user unit from the preset — you never write unit files by hand:
Expected output:
3. Start the service
Expected output (after the model loads, ~60 s):
Quiesce period
If you just stopped another GPU service, the start may be refused for a few seconds:
Error: Lifecycle error for 'vllm-embedding-code' during start: Quiesce period
active: 6s remaining (configured: 10s). A heavy GPU service was stopped 4s ago.
Wait for the GPU driver to fully reclaim memory, or use --no-quiesce to bypass.
This is intentional: the NVIDIA driver needs a moment to reclaim VRAM after a service stops. Wait for the indicated number of seconds, then retry.
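Scripts that start a service right after stopping another can honour the quiesce period rather than bypass it. The sketch below is an illustration under stated assumptions, not part of gpumod: it extracts the remaining seconds from an error message shaped like the sample above and retries once that time has passed. The `start` callable, the function names, and the regular expression are all hypothetical, and the message wording may differ between gpumod versions.

```python
import re
import time
from typing import Callable, Optional, Tuple

# Matches the sample message shown above; \s+ tolerates the line wrap.
# Assumption: the real wording may differ, so treat this pattern as a guess.
_QUIESCE_RE = re.compile(r"Quiesce period\s+active:\s*(\d+)s remaining")


def quiesce_seconds_remaining(error_text: str) -> Optional[int]:
    """Return the remaining quiesce seconds from an error message, or None."""
    match = _QUIESCE_RE.search(error_text)
    return int(match.group(1)) if match else None


def retry_after_quiesce(
    start: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call start() until it succeeds or fails for a non-quiesce reason.

    `start` is a hypothetical wrapper around the start command from step 3
    that returns (ok, error_text).
    """
    for _ in range(max_attempts):
        ok, error_text = start()
        if ok:
            return True
        remaining = quiesce_seconds_remaining(error_text)
        if remaining is None:
            return False  # unrelated failure: do not retry blindly
        sleep(remaining + 1)  # one second of slack past the reported wait
    return False
```

Waiting out the period keeps the driver's reclaim window intact, which is the behaviour the error message recommends.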
4. Verify health
Expected output:
╭──────────────────────── Service: vllm-embedding-code ────────────────────────╮
│ Name: Qwen3-Embedding-0.6B │
│ Driver: vllm │
│ Port: 8210 │
│ VRAM: 2500 MB │
│ State: running │
│ Health: OK │
╰──────────────────────────────────────────────────────────────────────────────╯
Or hit the health endpoint directly:
Expected output:
5. Send your first request
This is an embedding model, so the first request is an embedding (chat
models use /v1/chat/completions instead — see
OpenAI-compatible clients):
curl -s http://localhost:8210/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": "gpumod manages GPU services"}'
Expected output (embedding vector truncated):
{
"object": "list",
"model": "Qwen/Qwen3-Embedding-0.6B",
"data": [{"object": "embedding", "index": 0, "embedding": [0.0026, -0.0789, -0.009, -0.0673, "..."]}],
"usage": {"prompt_tokens": 7, "total_tokens": 7}
}
A 1024-dimension vector comes back — the service is fully operational.
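The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the service from the previous steps is listening on port 8210; the helper names are illustrative and are not part of gpumod or vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8210"  # port taken from the status panel in step 4
MODEL = "Qwen/Qwen3-Embedding-0.6B"


def build_embedding_request(texts, base_url=BASE_URL, model=MODEL):
    """Build a POST request for the OpenAI-style /v1/embeddings route."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response):
    """Return the embedding vectors ordered by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, timeout=30.0):
    """Send the texts to the running service and return one vector per text."""
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_vectors(json.load(reply))
```

With the service running, `embed(["gpumod manages GPU services"])` should return a list holding one vector; going by the output above, that vector would have 1024 entries.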
6. Stop the service (optional)
Expected output:
What happened
gpumod read the preset YAML, rendered a sandboxed Jinja2 template into a
systemd user unit (~/.config/systemd/user/, no sudo needed), and
delegated the lifecycle to systemctl --user. Before launch, the
preflight checks verified that there was enough RAM and VRAM and that the
model files existed; a failing preflight aborts the start instead of
wedging the host. Health was confirmed by polling the preset's
health_endpoint until it returned 200.
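The health polling described above can be approximated in a few lines. This sketch is not gpumod's implementation: the function names, the deadline, and the interval are assumptions, and the URL you pass must be whatever the preset's health_endpoint resolves to.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status of a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as reply:
            return reply.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(
    url,
    deadline_s=120.0,
    interval_s=2.0,
    probe=http_status,
    clock=time.monotonic,
    sleep=time.sleep,
):
    """Poll `url` until it returns 200 (True) or the deadline passes (False)."""
    started = clock()
    while True:
        if probe(url) == 200:
            return True
        if clock() - started >= deadline_s:
            return False
        sleep(interval_s)
```

For example, `wait_until_healthy("http://localhost:8210/health")` would block until the endpoint answers 200 or two minutes pass. vLLM's API server does expose `/health`, but whether this preset points at that path is an assumption to check against the preset YAML.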
Managing services one at a time works, but the real workflow is modes — named bundles of services that gpumod starts and stops together, with VRAM accounting.
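To make the idea of VRAM accounting concrete, here is a rough sketch of the kind of budget check a mode implies. It is illustrative only and does not reflect how gpumod implements modes; `ServiceBudget`, `fits_on_gpu`, and the headroom parameter are hypothetical names. The figures come from the status output in step 1.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ServiceBudget:
    """Hypothetical record: a service name and its declared VRAM in MB."""
    name: str
    vram_mb: int


def fits_on_gpu(
    services: Iterable[ServiceBudget],
    gpu_vram_mb: int,
    headroom_mb: int = 0,
) -> Tuple[bool, int]:
    """Return (fits, free_mb) for running all the services together."""
    needed = sum(service.vram_mb for service in services)
    free_mb = gpu_vram_mb - headroom_mb - needed
    return free_mb >= 0, free_mb
```

With the numbers from this tutorial, `fits_on_gpu([ServiceBudget("vllm-embedding-code", 2500)], gpu_vram_mb=24564)` returns `(True, 22064)`.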
Next steps
- Pick a mode for your workload: switch service bundles instead of individual services
- OpenAI-compatible clients: wire chat clients to a running LLM service
- CLI Reference: all gpumod service, mode, and template commands
- Configuration: environment variables and preflight thresholds