Multi-Agent Capacity Spike for hermes-agent (gpumod-8xaq)¶

Date: 2026-06-04 Ticket: gpumod-8xaq Host: RTX 4090 (24 GB), gemma4-26b-a4b-q4 + vllm-embedding-code co-tenant, llama.cpp b9500

Question¶

How many concurrent agent slots can hermes-agent serve from a single shared gemma4-26b-a4b-q4 model on one 4090, with vllm-embedding-code (Qwen3-Embedding-0.6B, 2.5 GB) co-running? What is the right --parallel N × context_size knee?

Phase 1 — VRAM ceiling (complete)¶

Method¶

Stopped gemma4-26b-a4b-q4 gpumod service. Left vllm-embedding-code running (the realistic Hermes co-tenant). Launched raw llama-server directly with --parallel N --cont-batching --ctx-size (N × per_slot_ctx), waited for /health, captured nvidia-smi --query-gpu=memory.used,memory.free, then killed the process. 5-second driver-reclaim quiesce between configs.

llama-server flags matched the production preset: q8_0 KV cache, mmproj-BF16 vision encoder loaded, flash_attn on, --chat-template-kwargs '{"enable_thinking":true}', threads=16.

Raw data: phase1_results.tsv.

Results¶

Config	N	per-slot ctx	total -c	GPU used after load	GPU free	Boot
N3 ctx128K	3	131,072	393,216	22,980 MiB	1,102 MiB	20s
N3 ctx64K	3	65,536	196,608	21,110 MiB	2,972 MiB	10s
N5 ctx64K	5	65,536	327,680	23,038 MiB	1,044 MiB	15s
N5 ctx32K	5	32,768	163,840	20,928 MiB	3,154 MiB	10s
N7 ctx32K	7	32,768	229,376	22,184 MiB	1,898 MiB	6s

All 5 configs booted successfully — no OOM at idle. Headroom is what differs.

Per-slot KV cost (empirical)¶

GPU breakdown for each config:

Baseline (always-on): vllm-embedding-code 2.5 GB + node IDE compositor 0.4 GB + CUDA scratch ~0.5 GB = ~3.4 GB shared overhead before gemma loads
gemma backbone (loaded once regardless of N): model 12.97 GB + mmproj 1.14 GB = ~14.1 GB shared
The remainder is per-slot KV cache.

Solving per_slot_KV ≈ (GPU_used − 3.4 − 14.1 − ~0.5 fragmentation) / N:

Per-slot ctx	Theoretical (q8_0: 16 MB / 1K ctx)	Empirical (this run)	Per-slot overhead beyond raw KV
128K	2048 MB	~1,800 MB	~150 MB shared inefficiency captured in scratch
64K	1024 MB	~1,100 MB	~75 MB per slot (slot bookkeeping, attention scratch)
32K	512 MB	~660 MB	~150 MB per slot (fixed per-slot overhead dominates at small ctx)

Takeaway: at small per-slot context, the fixed per-slot overhead (~100-150 MB for slot state) is a non-trivial fraction of total. Doesn't matter at 64K+, matters at 32K-16K.

Headroom interpretation¶

The Phase 1 result is "boots cleanly at idle". Real-world Hermes use needs to also absorb:

Demand	Size
One image input (e.g. 1000×562 PNG)	~200 MiB vision-token KV
KV cache growth per turn (q8_0)	~16 MiB / 1K generated tokens (already counted in per-slot ctx allocation)
CUDA fragmentation, mmproj per-image scratch	~200-500 MiB margin

So practical free-VRAM floor is ≥ 1.5 GiB during a vision-enabled session. By that bar:

Config	Free at idle	Free after one image + 5K tokens response	Verdict
N3 ctx128K	1102 MiB	~700 MiB	Risky — single image acceptable, two concurrent image inputs likely OOM
N3 ctx64K	2972 MiB	~2400 MiB	Comfortable — vision + multi-turn growth all within budget
N5 ctx64K	1044 MiB	~600 MiB	Risky — same problem as N3 ctx128K, worse since 5 slots mean more image events
N5 ctx32K	3154 MiB	~2500 MiB	Comfortable — but 32K is short for tool-heavy turns
N7 ctx32K	1898 MiB	~1300 MiB	Marginal — single-image OK, two concurrent risky

Recommendation from Phase 1 alone¶

Two production-viable configs for a 3+ agent setup:

N=3, per-slot ctx=64K — comfortable 3 GB headroom, 64K per slot is enough for most tool-heavy coding turns IF the agent compacts older turns aggressively. Best balance for Workflow A (TL + Dev + QA, asymmetric load).
N=5, per-slot ctx=32K — comfortable 3.2 GB headroom but 32K is tight for tool-heavy turns. Best for chat-style multi-agent where each conversation is short (less than ~5 tool-heavy turns).

Notable: the original spike question of "N=3 at 128K" boots, but headroom is only 1.1 GB. That's not safe for vision-enabled multi-turn use — recommend dropping the 128K hope and settling on 64K per slot. This changes the tool-overhead analysis in the spike doc: at 64K per slot, a Dev turn with 25-30K tokens of tool overhead occupies ~half the context after one turn. Aggressive compaction (drop oldest tool-result turns) becomes mandatory, not optional.

Open questions for Phase 2¶

Phase 1 establishes which configs FIT. Phase 2 establishes whether they're FAST ENOUGH:

At N=3 ctx=64K, what's the per-slot TPS when all 3 are generating? (Workflow B floor case)
At N=5 ctx=32K, does cont-batching's idle-borrowing cover for asymmetric loads, or does the per-slot floor become ~25 TPS even in Workflow A?
Does mmproj vision encoder become a contention point under concurrent image inputs across slots?
Tool-storm: rapid context churn — does cont-batching slot eviction stay clean?

Production state restored after Phase 1¶

hermes-agent mode restarted with the original gemma4-26b-a4b-q4 (single-slot, 128K ctx) preset — no production state changed by Phase 1.

Phase 2 — synthetic workload bench (complete)¶

Method¶

Built a stdlib-only Python workload runner (phase2_workload_runner.py) that spawns N concurrent threads, each representing one "agent slot" hitting /v1/chat/completions with a configurable role profile. Each turn prepends a synthetic [TOOL RESULT] blob of code-like filler text to the user message to exercise the same KV-cache pressure path as real tool use without requiring a tool executor.

Orchestrators (phase2_orchestrator.sh, phase2_followup_textonly.sh, phase2_n3_200k.sh, phase2_n3_128k.sh) boot raw llama-server at each config, wait for /health, capture VRAM, run the workload for 600 s, kill, quiesce 8 s, next config.

Safeguards (added after initial dry-run)¶

The original orchestrator launched llama-server directly, bypassing all gpumod preflight. With MemAvailable at 14 GiB (between the 12 GiB unrecoverable-hang and 18 GiB safe-floor thresholds from gpumod-x7rv), this re-exposed the cudaHostAlloc-freeze class. Retrofitted before the first config booted:

Safeguard	Threshold	Action
`GGML_CUDA_NO_PINNED=1` env var	always	disables `cudaMallocHost`, eliminates the freeze class (gpumod-56md)
RAM preflight	`MemAvailable ≥ 13 GiB`	refuse boot if below
VRAM preflight	`free VRAM ≥ 400 MiB` after boot	skip workload if below
Side watchdog	`MemAvailable < 8 GiB` OR `free VRAM < 200 MiB` sustained 15 s	SIGTERM orchestrator

No watchdog or preflight ever triggered during Phase 2. All 9 configs completed cleanly.

Workload profiles¶

Defined in phase2_workload_runner.py under PROFILES:

Role	Cadence	max_tokens	Synthetic tool overhead	Thinking
TL	1 req / 120 s	1500	400 tokens	on
Dev	continuous	8000	15000 tokens	on
QA	1 req / 60 s	2500	5000 tokens	on
Research	continuous	5000	8000 tokens	on
ToolStorm	continuous	500	800 tokens	off

Workflow A = TL,Dev,QA (asymmetric coding team, idle-borrowing case). Workflow B = 3xResearch (uniform continuous load, floor case).

Results — all configs¶

Raw JSON: phase2_results/. Re-aggregate with phase2_aggregate.py.

Config	Slots	Workload	Agg TPS	Per-call TPS	Dev p50	Dev p95	Free VRAM at boot
01 N=1 ctx=128K	1	Dev (baseline)	117.2	109.5	73 s	73 s	4,164 MiB
02 N=3 ctx=64K	3	TL,Dev,QA	142.3	69.7	93 s	114 s	2,972 MiB
03 N=3 ctx=64K	3	3×Research	180.2	58.7	—	—	2,972 MiB
04 N=5 ctx=32K	5	TL,2×Dev,2×QA	181.2	42.9	158 s	166 s	3,595 MiB
05 N=5 ctx=32K	5	3×Research+Dev+TL	169.3	38.8	173 s	183 s	3,595 MiB
06 N=3 ctx=64K	3	3×ToolStorm	75.6	25.7	2 s p50	3 s p95	3,413 MiB
07 N=3 ctx=256K	3	TL,Dev,QA	10.5	10.1	— (0 turns ok)	—	1,077 MiB (text-only)
08 N=3 ctx=200K	3	TL,Dev,QA	78.9	29.2	293 s	293 s	971 MiB (text-only)
09 N=3 ctx=128K	3	TL,Dev,QA	132.1	68.6	96 s	122 s	1,823 MiB (text-only)

Headline finding 1 — the N=3 Workflow A scaling curve¶

Per-slot ctx	Agg TPS	Per-call TPS	Dev p50	Verdict
64K	142	70	93 s	Throughput max — narrow win
128K	132	69	96 s	⭐ Sweet spot — 2× per-slot context for only 7% TPS cost
200K	79	29	293 s	-45% TPS, 3× Dev latency — degraded
256K	10	—	—	Collapsed — Dev never finished a turn in 600 s

Headline finding 2 — the 256K throughput collapse (the surprise)¶

VRAM ceiling and TPS ceiling are not the same wall. At N=3 ctx=256K the model fits (1,077 MiB VRAM free), boots cleanly, exchanges HTTP requests — but aggregate throughput collapses by 13× vs ctx=64K, and the Dev role (8000-token turns) never finishes a single turn in the 600 s window.

Likely root cause: llama.cpp's attention compute scales with allocated slot context, not used context, so each generated token pays a per-token compute penalty proportional to slot size even when the conversation occupies a small fraction. Flash-attention with quantized KV may not optimize for sparse slots. The cliff is steep but continuous: 64K → 128K → 200K → 256K shows smooth degradation, then 256K falls off entirely.

Recommend filing a follow-up spike to confirm the mechanism (compare attention kernel performance at different --ctx-size values with otherwise-identical prompts) before deciding whether 256K is reachable with kernel-config changes or is fundamentally bandwidth-bound.

Headline finding 3 — Workflow B (uniform continuous) is the highest-yield case¶

3×Research at N=3 ctx=64K hit 180 TPS aggregate — the highest of any config. Per-slot rate drops from 109 (single-slot baseline) to 58 (~half), but total aggregate grows 54%. Symmetric continuous load is the cont-batching best case: no idle slots to optimize around, no role-mix scheduling overhead.

For research / multi-document-synthesis workloads, N=3 ctx=64K with three identical Research agents is the right config — even better than asymmetric Workflow A.

Headline finding 4 — N=5 doesn't pay back¶

N=5 ctx=32K stretched Workflow A: 181 TPS aggregate (essentially tied with N=3 Workflow B). 5 slots adds capacity but not throughput — per-slot TPS drops to 43, Dev p95 ≈ 166 s (vs 114 s at N=3 ctx=64K). The 4th and 5th slots only make sense when concurrent-agent count itself is the bottleneck (not aggregate throughput) and Dev latency is acceptable.

Mixing 3×Research + Dev + TL at N=5 ctx=32K gave 169 TPS — worse than pure 3×Research at N=3. Mixing burst (TL) + steady (Research) + heavy (Dev) on cont-batching is worse than either uniform load alone.

Headline finding 5 — Tool-storm churn is fine¶

3×ToolStorm at N=3 ctx=64K: 1052 turns in 600 s, p50 turn 1.5 s, p95 3.1 s, per-call 26 TPS. Cont-batching handles rapid small-turn churn cleanly — no slot-eviction errors, no latency spikes. Useful for tool-call-heavy agents that send many short responses.

Recommended production configs¶

Use case	Preset	Mode
Default coding (TL + Dev + QA team)	`gemma4-26b-a4b-q4-multi3-128k` (text-only, `--parallel 3 --cont-batching --ctx-size 393216`)	`hermes-agent-multi`
Research / chat multi-agent	Same preset, three Research-style agents	`hermes-research-multi`
Heavy concurrent agent count (5-agent chat, latency-tolerant)	`gemma4-26b-a4b-q4-multi5-32k`	`hermes-agent-multi5`
Vision-enabled single-agent (existing)	`gemma4-26b-a4b-q4` (vision, `--parallel 1 --ctx-size 131072`)	`hermes-agent` (unchanged)

The multi-slot presets are text-only — they drop --mmproj to free ~1.1 GiB VRAM that would otherwise eat the 128K headroom. Vision-enabled multi-slot would need to fall back to ctx=64K per slot.

Open questions / follow-up tickets¶

256K throughput-collapse root cause — file a spike to confirm attention-compute scaling and check whether --no-context-shift, different --batch-size, or alternative attention backends restore throughput at 256K.
N=2 ctx=256K viability — projected to fit at ~22.4 GiB; would test whether the throughput collapse is a per-slot or aggregate-bandwidth effect.
Workflow B at ctx=128K — measure aggregate TPS for 3×Research at the recommended production ctx to confirm the 180 TPS finding holds at the larger per-slot context.

Production state restored after Phase 2¶

hermes-agent mode restarted with the original gemma4-26b-a4b-q4 (single-slot, 128K ctx, vision-enabled) preset. Multi-slot presets and modes proposed but not yet created — that's Phase 4 work.