Multi-Agent with Continuous Batching¶
Learn how to serve 3 concurrent agents from a single 24 GB GPU using the multi-slot preset — including slot pinning and saving an agent's KV cache to disk so it survives disconnects.
This guide uses gemma4-26b-a4b-q4-multi (preset), the production multi-slot configuration chosen by the multi-agent capacity spike (gpumod-8xaq).
Slots are not agents¶
--parallel 3 makes llama-server allocate three independent KV-cache
slots — it does not create three agents. An "agent" is just any HTTP
client sending requests to the same endpoint:
- Run 3 separate client processes (3 Claude Code tabs, 3 Hermes sessions, 3 aider invocations…), each pointed at the same base URL http://localhost:7109/v1.
- The continuous-batching scheduler routes each incoming request to a free slot, first-come-first-served. A 4th concurrent request queues (milliseconds to seconds) until a slot frees.
- The server is stateless across requests: every /v1/chat/completions POST carries the full messages history. Multi-turn conversations stay fast anyway because the radix-tree prefix cache matches the new request against a slot's existing KV.
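The "agents are just HTTP clients" point can be sketched in a few lines of Python. This is an illustrative sketch, not part of gpumod: the helper names (`build_request`, `http_send`, `run_agents`) are hypothetical, and the only details taken from this guide are the base URL, the `messages` field, and `max_tokens`.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:7109/v1"  # endpoint used throughout this guide


def build_request(history: list) -> urllib.request.Request:
    """Build a chat request; the full history is resent on every turn."""
    body = json.dumps({"messages": history, "max_tokens": 1500}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def http_send(req: urllib.request.Request) -> dict:
    """Send the request to the running server and decode the JSON reply."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_agents(histories: dict, send=http_send) -> dict:
    """Fire one request per agent concurrently; the server assigns the slots."""
    with ThreadPoolExecutor(max_workers=len(histories)) as pool:
        futures = {
            name: pool.submit(send, build_request(history))
            for name, history in histories.items()
        }
        return {name: fut.result() for name, fut in futures.items()}
```

With the preset running, `run_agents({"a": [...], "b": [...], "c": [...]})` sends three requests at once. None of them names a slot, so the scheduler is free to place each one wherever there is room.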
When to use the multi-slot preset¶
| Situation | Preset |
|---|---|
| One interactive session, may send images | gemma4-26b-a4b-q4 (single slot, vision) |
| 2–3 concurrent agents (agent team, cron + chat + subagent) | gemma4-26b-a4b-q4-multi (N=3, text-only) |
| Image inputs and multiple agents | Not on a 24 GB card — the multi preset drops --mmproj to free ~1.1 GiB; fall back to single-slot |
Key numbers from the capacity spike (see the research README for the full data):
- 128K per slot is the sweet spot — 132 TPS aggregate with 3 active agents, only 7% below the 64K-per-slot maximum, at 2× the context.
- Do not raise per-slot context above 128K. Throughput degrades at 200K (79 TPS) and collapses at 256K (10 TPS — a heavy agent never finishes a turn), even though the model still fits in VRAM.
- N=5 doesn't pay back — it adds slots but not aggregate throughput, and forces 32K per slot.
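To put the quoted figures side by side, the arithmetic below computes how much aggregate throughput is lost relative to the 128K setting. It assumes the 200K and 256K figures are aggregate TPS measured under the same 3-agent load as the 128K figure, and the per-agent number assumes an even split; neither assumption is stated explicitly above.

```python
# Aggregate tokens/s keyed by per-slot context, copied from the list above.
MEASURED_TPS = {"128K": 132, "200K": 79, "256K": 10}


def drop_vs_128k(ctx: str) -> float:
    """Percent of aggregate throughput lost relative to the 128K setting."""
    return round(100 * (1 - MEASURED_TPS[ctx] / MEASURED_TPS["128K"]), 1)


def per_agent_tps(ctx: str, agents: int = 3) -> float:
    """Even-split estimate of throughput per agent (an assumption)."""
    return round(MEASURED_TPS[ctx] / agents, 1)


for ctx in ("200K", "256K"):
    print(ctx, f"-{drop_vs_128k(ctx)}%", f"~{per_agent_tps(ctx)} TPS per agent")
```

Under those assumptions, 200K costs roughly 40% of the aggregate throughput and 256K roughly 92%.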
Never add --swa-full to a Gemma 4 multi-slot preset
--swa-full forces every layer to allocate full-context KV instead of
Gemma 4's 1024-token sliding window. At multi-slot context sizes this
blows past the 24 GiB VRAM ceiling during cache allocation and
triggers a silent host freeze that requires a hard reboot — no OOM log,
no recoverable error (see Host stability).
Slot save/restore works correctly without it
(gpumod-8viu incident report).
Prerequisites¶
- gpumod installed and at least one service started previously (see Run your first service)
- ~22 GB of free VRAM (the multi preset + its embedding co-tenant)
- One-time host setup for slot persistence:
(The systemd unit fails loudly at start if this directory is missing.)
Steps¶
1. Switch to the multi-agent mode¶
The hermes-agent mode bundles gemma4-26b-a4b-q4-multi with a small
embedding service:
Verify:
Expected output:
2. Confirm the slots exist¶
curl -s http://localhost:7109/props | python3 -c \
"import json,sys; d=json.load(sys.stdin); print('slots:', d['total_slots'], '| ctx per slot:', d['default_generation_settings']['n_ctx'])"
Expected output:
Three slots, 128K context each (the preset passes --ctx-size 393216;
llama-server divides it by --parallel).
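The division is easy to check by hand. The sketch below reproduces it and formats a `/props` payload the same way as the one-liner above. The function names are hypothetical, and the sample payload used with it should contain only the two fields that command reads.

```python
def ctx_per_slot(ctx_size: int, parallel: int) -> int:
    """Per-slot context when the total --ctx-size is split across --parallel slots."""
    return ctx_size // parallel


def summarize_props(props: dict) -> str:
    """Format the two /props fields read by the curl one-liner."""
    n_ctx = props["default_generation_settings"]["n_ctx"]
    return f"slots: {props['total_slots']} | ctx per slot: {n_ctx}"


# 393216 tokens across 3 slots is 131072 tokens each, i.e. 128 * 1024.
assert ctx_per_slot(393216, 3) == 131072 == 128 * 1024
```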
3. Send a slot-pinned chat request¶
Adding "id_slot": 0 pins the request to slot 0. This is an undocumented
but functional passthrough on /v1/chat/completions — you need it whenever
an agent must land on a specific slot (e.g. to pair with save/restore
below):
curl -s http://localhost:7109/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"id_slot": 0,
"messages": [{"role": "user", "content": "Remember the magic word: tutorialdemo. Reply with one short sentence confirming you memorised it."}],
"max_tokens": 1500
}'
Expected output (abbreviated — the response also contains a
reasoning_content field because the preset enables thinking):
{
"choices": [{"message": {"content": "I have memorized the magic word: tutorialdemo."}, "finish_reason": "stop"}],
"usage": {"completion_tokens": 212, "prompt_tokens": 35, "total_tokens": 247}
}
Note completion_tokens: 212 for a one-sentence answer — the thinking
tokens count against max_tokens. Always send max_tokens >= 1500 on this
preset (see OpenAI-compatible clients).
Without id_slot, requests are still served fine — they just land on
whichever slot is free, which is what you want for stateless concurrent
clients.
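A client-side wrapper can enforce both rules from this step: the `id_slot` pin and the `max_tokens` floor. The sketch below is one hypothetical way to do that. Raising on a low `max_tokens` is a design choice of this example, not server behaviour, and the `truncated` flag relies on the usual OpenAI-compatible convention that `finish_reason` is `"length"` when the token limit is hit.

```python
MIN_MAX_TOKENS = 1500  # floor recommended above, since thinking tokens count too


def pinned_payload(messages: list, slot: int, max_tokens: int = MIN_MAX_TOKENS) -> dict:
    """Build a chat payload pinned to one slot via the id_slot passthrough."""
    if max_tokens < MIN_MAX_TOKENS:
        raise ValueError(f"max_tokens must be >= {MIN_MAX_TOKENS} on this preset")
    return {"id_slot": slot, "messages": messages, "max_tokens": max_tokens}


def summarize_reply(reply: dict) -> dict:
    """Pull out the fields shown in the abbreviated response above."""
    choice = reply["choices"][0]
    return {
        "content": choice["message"]["content"],
        "completion_tokens": reply["usage"]["completion_tokens"],
        # "length" means generation stopped at max_tokens rather than finishing.
        "truncated": choice["finish_reason"] == "length",
    }
```

A reply flagged as `truncated` usually means the thinking tokens consumed the budget before the visible answer was complete.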
4. Save the slot's KV cache to disk¶
The preset enables --slot-save-path, which exposes per-slot save/restore
endpoints. Persist slot 0's state:
curl -s -X POST 'http://localhost:7109/slots/0?action=save' \
-H 'Content-Type: application/json' \
-d '{"filename": "agent_A.bin"}'
Expected output:
{"id_slot":0,"filename":"agent_A.bin","n_saved":246,"n_written":29448948,"timings":{"save_ms":12.457}}
246 cached tokens → ~29 MB on disk (~120 KB/token, full-precision KV dump). Budget ~5–20 GB of disk for 3 long-lived agents and add a cleanup policy — a 50K-token tool-heavy conversation is ~6 GB.
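The disk budget follows from the save response above. The sketch below derives the per-token cost from `n_written` and `n_saved` and extrapolates linearly. Linear scaling with token count is an assumption carried over from the estimate above, not something this example verifies.

```python
# Values copied from the save response above.
N_SAVED = 246            # tokens in the slot when it was saved
N_WRITTEN = 29_448_948   # bytes written to agent_A.bin


def bytes_per_token() -> float:
    """Average on-disk cost of one cached token in this save."""
    return N_WRITTEN / N_SAVED


def estimate_save_bytes(tokens: int) -> int:
    """Linear extrapolation of the save-file size for a longer conversation."""
    return int(tokens * bytes_per_token())


print(f"{bytes_per_token() / 1000:.0f} KB per token")
print(f"{estimate_save_bytes(50_000) / 1e9:.1f} GB for a 50K-token conversation")
```

A cleanup policy can use the same estimate to decide how many save files fit in the disk budget.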
5. Restore and resume¶
After the agent disconnects (and another client may have evicted slot 0 in the meantime), restore before reconnecting:
curl -s -X POST 'http://localhost:7109/slots/0?action=restore' \
-H 'Content-Type: application/json' \
-d '{"filename": "agent_A.bin"}'
Expected output:
{"id_slot":0,"filename":"agent_A.bin","n_restored":246,"n_read":29448948,"timings":{"restore_ms":6.216}}
Then continue the conversation (same id_slot: 0, full message history):
curl -s http://localhost:7109/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"id_slot": 0,
"messages": [
{"role": "user", "content": "Remember the magic word: tutorialdemo. Reply with one short sentence confirming you memorised it."},
{"role": "assistant", "content": "I have memorized the magic word: tutorialdemo."},
{"role": "user", "content": "What was the magic word? Answer in one short sentence."}
],
"max_tokens": 1500
}'
Expected output (abbreviated):
The restored turn skips re-prefilling the saved tokens — the validation run measured a 3× faster prompt-eval rate after restore versus a cold prefill (gpumod-8viu).
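The save and restore calls differ only in the `action` query parameter, so a client can build both from one helper. The sketch below mirrors the curl commands from steps 4 and 5. The function names are hypothetical, and the summary formatter only reads fields that appear in the restore response shown above.

```python
import json
import urllib.request

BASE = "http://localhost:7109"  # server root used in this guide


def slot_action_request(slot: int, action: str, filename: str) -> urllib.request.Request:
    """Build the POST for /slots/<slot>?action=save|restore."""
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    return urllib.request.Request(
        f"{BASE}/slots/{slot}?action={action}",
        data=json.dumps({"filename": filename}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def restore_summary(reply: dict) -> str:
    """One-line summary of a restore response."""
    ms = reply["timings"]["restore_ms"]
    return f"slot {reply['id_slot']}: restored {reply['n_restored']} tokens in {ms} ms"
```

Passing the request to `urllib.request.urlopen` sends it. Keeping construction separate from sending makes the helper easy to check without a running server.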
6. Wrap it into an agent lifecycle¶
The production pattern (also documented in the
hermes-agent mode file):
# Before reconnect (404 on the very first session is fine):
curl -sf -X POST 'http://localhost:7109/slots/0?action=restore' \
-H 'Content-Type: application/json' -d '{"filename":"agent_A.bin"}' || true
# Every chat turn — keep the id_slot pin:
curl -X POST http://localhost:7109/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"id_slot": 0, "messages": [...], "max_tokens": 1500}'
# After each response:
curl -sf -X POST 'http://localhost:7109/slots/0?action=save' \
-H 'Content-Type: application/json' -d '{"filename":"agent_A.bin"}'
Use one filename (and one slot id) per persistent agent: agent_A.bin on
slot 0, agent_B.bin on slot 1, agent_C.bin on slot 2.
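The same lifecycle can live inside a client process. The class below is a hypothetical sketch, not the code from the hermes-agent mode file. It tolerates only a 404 on restore, which is stricter than the `|| true` in the shell version, and it keeps the message history in memory, so reloading history after a client restart is left out.

```python
import json
import urllib.error
import urllib.request

BASE = "http://localhost:7109"


def http_post(path: str, payload: dict) -> dict:
    """POST JSON to the server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


class PinnedAgent:
    """One persistent agent: one slot id, one save file."""

    def __init__(self, slot: int, filename: str, post=http_post):
        self.slot, self.filename, self.post = slot, filename, post
        self.history = []

    def restore(self) -> bool:
        """Restore the slot; a 404 on the very first session is fine."""
        try:
            self.post(f"/slots/{self.slot}?action=restore", {"filename": self.filename})
            return True
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False
            raise

    def turn(self, user_text: str) -> str:
        """Send one pinned chat turn, then save the slot."""
        self.history.append({"role": "user", "content": user_text})
        reply = self.post(
            "/v1/chat/completions",
            {"id_slot": self.slot, "messages": self.history, "max_tokens": 1500},
        )
        answer = reply["choices"][0]["message"]["content"]
        self.history.append({"role": "assistant", "content": answer})
        self.post(f"/slots/{self.slot}?action=save", {"filename": self.filename})
        return answer
```

A typical session would call `agent = PinnedAgent(0, "agent_A.bin")`, then `agent.restore()` once, then `agent.turn(...)` for each message.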
What happened¶
One model, loaded once (~14 GB), serves three independent conversations:
each slot holds its own 128K-token KV cache (~1.8 GB per slot at q8_0), and
the continuous-batching scheduler packs all active generations into shared
forward passes — which is why 3 concurrent agents reach 132 TPS aggregate
instead of one agent's 109 TPS. Slot save/restore dumps a slot's KV tensors
to disk so a returning agent pays a ~6 ms restore instead of a multi-second
re-prefill, and id_slot pinning guarantees the restore and the follow-up
turn land on the same slot.
Next steps¶
- OpenAI-compatible clients: point Hermes, Claude Code, aider, or Continue.dev at this endpoint
- Modes deep dive: what mode switch does under the hood
- Capacity spike research: the full N×ctx measurement matrix
- Slot persistence research: validation details and the --swa-full incident