Host Stability and gpumod doctor
Learn why loading a large model can freeze an entire Linux host without any OOM message — and how to verify your machine is protected against it.
The failure mode: a freeze that isn't OOM
The most dangerous failure on a shared GPU host is not running out of
memory. It's cudaHostAlloc — the CUDA pinned-memory allocator — hanging
the NVIDIA driver when contiguous physical RAM pages are unavailable.
Symptoms:
- The host becomes unresponsive during a model load (SSH stalls, editors freeze).
- There is no OOM log: dmesg is silent, systemd-oomd never triggers, and memory-pressure metrics show nothing. The kernel is stuck in uninterruptible I/O wait inside the NVIDIA allocator.
- In the worst case the only way out is a power cycle.
Because there is no OOM signal, the usual defenses (cgroup memory.high,
systemd-oomd, PSI thresholds) cannot catch it. Swap doesn't help
either: pinned memory is non-swappable, and swap cannot make contiguous
pages appear in fragmented RAM. The full root-cause investigation is in the
cudaHostAlloc research findings
(gpumod-x7rv).
The defense stack
gpumod layers four protections, in priority order:
1. GGML_CUDA_NO_PINNED=1 (default for all llama.cpp services)
Every gpumod-rendered llama.cpp unit sets GGML_CUDA_NO_PINNED=1
unconditionally. This makes llama.cpp skip cudaMallocHost entirely: with
no pinned allocation there is no contiguous-page requirement, so this
freeze class is eliminated. The measured cost was a 0.28% throughput regression.
When to override: essentially never. Do not remove this variable from a unit unless you have re-benchmarked pinned memory on your exact host and accept the freeze risk.
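To audit a unit by hand, a rough sketch like the following can confirm that the variable survives in the unit text. It assumes the unit sets the variable through plain Environment= lines; the function name is hypothetical, and EnvironmentFile= entries and drop-in ordering are not handled.

```python
import shlex


def unit_disables_pinned(unit_text: str) -> bool:
    """Return True if the unit text ends up with GGML_CUDA_NO_PINNED=1.

    Only plain Environment= lines are inspected (an assumption about how
    the unit is rendered). Later assignments override earlier ones, and an
    empty Environment= line resets the list, as systemd documents.
    """
    last_value = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line.startswith("Environment="):
            continue
        rest = line[len("Environment="):].strip()
        if not rest:
            # An empty assignment clears all previously set variables.
            last_value = None
            continue
        # shlex handles quoted assignments such as "FOO=bar baz".
        for assignment in shlex.split(rest):
            name, _, value = assignment.partition("=")
            if name == "GGML_CUDA_NO_PINNED":
                last_value = value
    return last_value == "1"
```

The text passed in can come from any source that shows the unit's contents, for example the output of systemctl cat for the service.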
2. Preflight RAM check (runs before every service start)
The systemd unit's ExecStartPre runs gpumod's preflight, which refuses to
start a service when MemAvailable is below model_size × 1.1 + 1024 MB.
This threshold is empirically calibrated: in testing, loads at 18 GiB
available succeeded, while 12 GiB reproduced the hard-reboot freeze. See
Configuration for tuning knobs.
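The threshold arithmetic can be sketched as follows. This is an illustration of the formula above, not gpumod's actual preflight code: the function names are hypothetical, and it assumes the kB figures in /proc/meminfo are converted to MB by dividing by 1024.

```python
def required_mb(model_size_mb: float) -> float:
    """Threshold from the text: model_size x 1.1 + 1024 MB."""
    return model_size_mb * 1.1 + 1024


def parse_mem_available_mb(meminfo_text: str) -> float:
    """Extract MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 1024
    raise ValueError("MemAvailable not found in meminfo text")


def preflight_ok(model_size_mb: float, meminfo_text: str) -> bool:
    """True when MemAvailable is at or above the threshold."""
    return parse_mem_available_mb(meminfo_text) >= required_mb(model_size_mb)


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the live meminfo text on a Linux host."""
    with open(path) as f:
        return f.read()
```

For a hypothetical 12000 MB model the threshold works out to 12000 × 1.1 + 1024 = 14224 MB, so calling preflight_ok(12000, read_meminfo()) would refuse the start on a host reporting less than that.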
3. Kernel reserve: vm.min_free_kbytes
Keeping a 1 GiB kernel page reserve reduces fragmentation pressure.
Installer: scripts/install-gpumod-sysctl.sh (in the repository).
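As a rough illustration of what a 1 GiB reserve means in sysctl units, the sketch below compares the live value against 1048576 kB. The function names are hypothetical, and the exact threshold gpumod itself checks is an assumption here.

```python
from pathlib import Path

# 1 GiB expressed in kB, the unit used by vm.min_free_kbytes.
ONE_GIB_IN_KB = 1024 * 1024


def reserve_ok(value_text: str, minimum_kb: int = ONE_GIB_IN_KB) -> bool:
    """True when the sysctl value (as text) meets the minimum reserve."""
    return int(value_text.strip()) >= minimum_kb


def current_reserve_ok(path: str = "/proc/sys/vm/min_free_kbytes") -> bool:
    """Check the running kernel's setting on a Linux host."""
    return reserve_ok(Path(path).read_text())
```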
4. Editor protection during load spikes
A systemd drop-in gives code-server a guaranteed memory floor so your
editor stays responsive while a model load pushes the host into expected
memory pressure. Installer: scripts/oom-protection/install.sh.
Prerequisites
- gpumod installed — Getting Started
- sudo access for the two installers (one-time)
Steps: verify your host with gpumod doctor
1. Check the kernel reserve
Expected output:
If it fails, install the sysctl drop-in and re-check:
2. Check OOM-protection drop-ins
Expected output:
If it fails:
3. Check a service's driver venv
Driver/library version drift (e.g. a vLLM upgrade that breaks a pinned model) is caught by the venv compatibility check:
Expected output: nothing — the command is silent and exits 0 when
every package in the service's venv satisfies its compat: constraints.
On a mismatch it reports the offending package and exits 1. See
Venv Isolation for how per-service venvs work.
All three checks use exit codes (0 = OK), so they slot directly into
shell scripts and CI.
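Because the checks signal success purely through exit codes, a gate around them can stay generic. The following is a minimal sketch with a hypothetical helper name; the argument vectors for the actual doctor checks are left to the caller.

```python
import subprocess
import sys
from typing import Iterable, Sequence


def all_checks_pass(commands: Iterable[Sequence[str]]) -> bool:
    """Run each command in order and stop at the first non-zero exit code."""
    for argv in commands:
        result = subprocess.run(list(argv))
        if result.returncode != 0:
            print(
                f"check failed (exit {result.returncode}): {' '.join(argv)}",
                file=sys.stderr,
            )
            return False
    return True
```

A CI job could call all_checks_pass with the three check commands and exit non-zero when it returns False, which fails the pipeline before any model is loaded.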
What happened
gpumod doctor verifies the host-level half of the defense stack — the
parts that live outside gpumod's own process (kernel sysctl, systemd
drop-ins, driver venvs). The service-level half (pinned-memory bypass and
RAM preflight) is baked into every rendered unit automatically, so a
correctly installed host plus default presets means the known freeze class
cannot trigger.
One way to reintroduce the freeze
Flags that massively inflate VRAM allocation can still wedge the host
even with all protections installed. The known case: --swa-full on a
Gemma 4 multi-slot preset, which caused a hard-reboot freeze during KV
allocation. See the warning in
Multi-agent with cont-batching.
Next steps
- Configuration — preflight thresholds and environment variables
- Venv Isolation — per-service driver venvs
- Research: cudaHostAlloc freeze findings — the empirical MemAvailable test matrix behind the thresholds