gpumod-x7rv — Spike Findings: LLM Load Host Protection¶

Date: 2026-05-25 Status: Root cause confirmed. Code-server protection installed and verified. Follow-up filed.

TL;DR¶

The host-freeze class documented in gpu-stability.md is cudaHostAlloc hanging the NVIDIA driver when contiguous high-order pages are unavailable. It is NOT OOM. systemd-oomd, cgroup memory.high, and PSI thresholds cannot detect or stop it because there is no OOM signal — the kernel is stuck in uninterruptible I/O wait inside NVIDIA's allocator.

What we found, in priority order:

The current preflight formula (model × 1.1 + 1024 MB) is empirically correct. Going below it (12 GiB MemAvailable test) reproduced the documented hard-reboot. Going above it (18 GiB and 15 GiB tests) succeeded.
Code-server protection works. The drop-in installed via install_protection.sh makes code-server stay responsive at 15 GiB MemAvailable (peak 24ms vs frozen for 55s previously). At 12 GiB it still becomes unresponsive because the whole kernel is busy on uninterruptible I/O — but MemoryMin=1G keeps it from being swapped out, so it recovers immediately when the freeze ends or after reboot.
The escape hatch is GGML_CUDA_NO_PINNED=1. Set this on llama-server's environment and cudaMallocHost is bypassed. No high-order-page requirement, no driver hang risk. Performance cost unclear — filed as gpumod-56md to benchmark. If overhead is small (<5%), this should be the default for all llama.cpp services.

The mistake in our initial reading¶

Phase 2 baseline sampler showed VmPin=0 and Mlocked unchanged during a clean model load. I concluded "no pinned memory is used during load."

That was wrong. NVIDIA's cudaHostAlloc uses kernel-internal page-locking via UVM, which is invisible to /proc/<pid>/status:VmPin and /proc/meminfo:Mlocked. The pinning IS happening; Linux just doesn't expose it via the standard counters.

Confirmed by source-code inspection: ~/bin/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1493:

static void * ggml_cuda_host_malloc(size_t size) {
    if (getenv("GGML_CUDA_NO_PINNED") != nullptr) {
        return nullptr;
    }
    void * ptr = nullptr;
    cudaError_t err = cudaMallocHost((void **) &ptr, size);
    ...
}

cudaMallocHost is cudaHostAlloc(..., cudaHostAllocDefault) — exactly the hang-trigger documented in gpu-stability.md and Ollama issue #11317.

Empirical results¶

All runs use qwen36-35b-a3b-mtp-iq4xs-preserve (131072 ctx + q8_0 KV, 17,365 MB GGUF).

Test	MemAvailable	Protected?	LLM load time	Swap consumed	Peak PSI	code-server	Outcome
Phase 2 baseline	24 GiB	n/a (no pressure)	49s	0 MB	0.0%	not measured	healthy
Phase 4a	18.7 GiB	partial (just code-server)	9s (warm)	0 MB	0.0%	not measured	healthy
Phase 4b unprotected	15 GiB	no	55s	2.8 GB	10.3%	frozen ~55s	healthy LLM but UI dead
Phase 4b protected	15 GiB	yes (drop-in installed)	49s	2.0 GB	13.6%	24ms peak	healthy LLM AND UI
Phase 4c protected	12 GiB	yes	78 min (then aborted)	—	—	frozen 78 min	HARD REBOOT

Reading the table¶

18 GiB to 24 GiB: safe region. The current preflight (20.1 GiB) sits in the middle of this range. Slightly conservative, but appropriate margin.
15 GiB: workable IF code-server is protected. Without protection, the kernel swaps out code-server's anonymous pages to make room for the model's page cache, freezing the UI for the entire load.
12 GiB: danger zone. Anonymous pressure + model load fragments high-order pages → cudaHostAlloc hangs → no recovery without power-cycle.

Why the original bd memory was right (but I misread it)¶

swap-does-not-help-llm-loading-on-this:

cudaHostAlloc (pinned memory) requires contiguous physical RAM, non-swappable. mmap'd model pages are file-backed.

Both halves are correct. The pinned allocation is small but it's the actual hang trigger, and it requires contiguous high-order pages (a stricter condition than "total RAM available"). Swap CAN displace anonymous app RSS to make more headroom (we measured 2.8 GB displaced at 15 GiB), but it CANNOT make contiguous high-order pages appear when memory is fragmented.

I updated the memory with the corrected wording — same conclusion, sharper explanation.

Recommendation (final)¶

Action	Status
Keep current preflight formula (`model × 1.1 + 1024 MB`)	Keep as-is. Don't touch.
Install code-server protection drop-in	Done — verified at 15 GiB
Test `GGML_CUDA_NO_PINNED=1` performance impact	Filed as gpumod-56md
Add zram or aggressive sysctl tuning	Skip — masks the issue without fixing it
Update `gpu-stability.md` with `GGML_CUDA_NO_PINNED` option	TODO (cross-project, separate task)
Update bd memory	Done

For gpumod-aop¶

The Hermes-agent MTP-preserve swap is safe given the current preflight + code-server protection. The operator pain (frequent drop_caches before mode switches) is a steady-state RAM hygiene issue — not an architectural problem. Two paths to relieve it:

Short term: auto drop_caches in gpumod mode switch when MemAvailable falls below a threshold. Cheap; doesn't change the safety envelope.
Long term: if gpumod-56md shows GGML_CUDA_NO_PINNED=1 is cheap, enable it by default and relax preflight to model + 1024 MB. Eliminates the freeze class entirely.

The aop blocker (gpumod-aop depends on gpumod-x7rv) can now be released — the spike's recommendation is "deploy with the current preflight; the protection drop-in is enough."

Files in this folder¶

File	Purpose
`00_baseline_state.md`	Phase 0 snapshot of host-protection state before any changes
`FINDINGS.md`	this file — final summary
`sampler.py`	10 Hz sampler of /proc//status + /proc/meminfo + /proc/pressure/memory + nvidia-smi
`ram_pressure.py`	controlled-pressure helper for Phase 4
`run_load.sh`	run-load wrapper using `gpumod service start` (blocking; captures steady state)
`run_load_direct.sh`	run-load wrapper using raw systemctl + bypass (non-blocking; captures load phase + code-server probe)
`trace_load.bt`	bpftrace script for mmap/mlock/madvise syscalls (not used in final analysis but available)
`configs/[email protected]`	systemd drop-in for code-server
`configs/oomd.conf.d-gpumod.conf`	systemd-oomd tuning drop-in
`configs/install_protection.sh`	sudo installer for the drop-ins
`runs/<label>/`	per-run snapshots, sampler CSV, code-server probe CSV, llama-server log