Configuration

All settings are configurable via environment variables with the GPUMOD_ prefix. Settings are managed by pydantic-settings.

Environment Variables

Environment Variable	Type	Default	Description
`GPUMOD_DB_PATH`	`str`	`~/.config/gpumod/gpumod.db`	Path to the SQLite database file
`GPUMOD_PRESETS_DIR`	`str`	Auto-resolved	Path to the built-in presets directory
`GPUMOD_MODES_DIR`	`str`	Auto-resolved	Path to the built-in modes directory
`GPUMOD_LOG_LEVEL`	`str`	`INFO`	Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`)
`GPUMOD_LLM_BACKEND`	`str`	`openai`	LLM provider backend (`openai`, `anthropic`, `ollama`)
`GPUMOD_LLM_API_KEY`	`str`	None	API key for the LLM backend (stored as `SecretStr`)
`GPUMOD_LLM_MODEL`	`str`	`gpt-4o-mini`	LLM model identifier
`GPUMOD_LLM_BASE_URL`	`str`	None	Custom base URL for the LLM API (e.g., for Ollama or a proxy)
`GPUMOD_MCP_RATE_LIMIT`	`int`	`10`	Maximum MCP requests per minute (must be >= 1)
`GPUMOD_RAM_MIN_FREE_MB`	`int`	`1024`	Minimum free RAM (MB); service start is blocked when free RAM is below this value
`GPUMOD_RAM_WARN_FREE_MB`	`int`	`4096`	Warning threshold for free RAM (MB); a warning is logged when free RAM is below this value
`GPUMOD_VRAM_SAFETY_MARGIN_MB`	`int`	`512`	Extra VRAM buffer (MB) required beyond service allocation
`GPUMOD_QUIESCE_SECONDS`	`float`	`10.0`	Seconds to wait after a heavy service stops before another heavy service may start (range 0-300). The pause gives the kernel time to consolidate freed pages so that the next CUDA-pinned allocation does not hit a fragmentation hang.
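
To make the prefix convention and the documented ranges concrete, here is a minimal standard-library sketch of reading a few of the variables above. gpumod itself uses pydantic-settings, so this is not its implementation; `SettingsSketch` and `load_settings` are hypothetical names, and only the defaults and range limits are taken from the table.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SettingsSketch:
    """Hypothetical stand-in for a subset of gpumod's settings."""

    log_level: str = "INFO"
    mcp_rate_limit: int = 10
    ram_min_free_mb: int = 1024
    ram_warn_free_mb: int = 4096
    vram_safety_margin_mb: int = 512
    quiesce_seconds: float = 10.0


def load_settings(environ=None, prefix="GPUMOD_"):
    """Build a SettingsSketch from prefixed environment variables."""
    environ = os.environ if environ is None else environ
    defaults = SettingsSketch()
    overrides = {}
    for field in fields(SettingsSketch):
        raw = environ.get(prefix + field.name.upper())
        if raw is None:
            continue  # variable not set: keep the default
        # Coerce the raw string to the type of the default value.
        target_type = type(getattr(defaults, field.name))
        overrides[field.name] = target_type(raw)
    settings = SettingsSketch(**overrides)
    # Range checks as documented in the table.
    if settings.mcp_rate_limit < 1:
        raise ValueError("GPUMOD_MCP_RATE_LIMIT must be >= 1")
    if not 0 <= settings.quiesce_seconds <= 300:
        raise ValueError("GPUMOD_QUIESCE_SECONDS must be in the range 0-300")
    return settings
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without touching the real environment, for example `load_settings({"GPUMOD_RAM_MIN_FREE_MB": "512"})`.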

Example: Using Ollama locally

export GPUMOD_LLM_BACKEND=ollama
export GPUMOD_LLM_BASE_URL=http://localhost:11434
export GPUMOD_LLM_MODEL=llama3.1
gpumod plan suggest

Example: Custom database location

export GPUMOD_DB_PATH=/data/gpumod/services.db
gpumod init

Example: Tuning preflight thresholds

On machines with high memory pressure (e.g., large ZFS caches), the default RAM thresholds may be too strict. Lower them via environment variables:

export GPUMOD_RAM_MIN_FREE_MB=512
export GPUMOD_RAM_WARN_FREE_MB=2048
gpumod service start my-model

To give large models extra VRAM headroom:

export GPUMOD_VRAM_SAFETY_MARGIN_MB=1024
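
The way these three thresholds interact can be sketched as follows. This is an illustration under stated assumptions, not gpumod's actual preflight code: `check_ram` and `vram_fits` are hypothetical helpers, and the exact comparison rules (strictly below the threshold, margin added to the service allocation) are assumed from the variable descriptions.

```python
def check_ram(free_mb, min_free_mb=1024, warn_free_mb=4096):
    """Classify free RAM as 'block', 'warn', or 'ok' (assumed semantics)."""
    if free_mb < min_free_mb:
        return "block"  # below GPUMOD_RAM_MIN_FREE_MB: refuse to start
    if free_mb < warn_free_mb:
        return "warn"  # below GPUMOD_RAM_WARN_FREE_MB: start, but log a warning
    return "ok"


def vram_fits(service_vram_mb, free_vram_mb, safety_margin_mb=512):
    """True if the service allocation plus the safety margin fits in free VRAM."""
    return service_vram_mb + safety_margin_mb <= free_vram_mb


# With the defaults, 800 MB of free RAM blocks the start; with the lowered
# thresholds from the example above (512 / 2048) it only triggers a warning.
print(check_ram(800))
print(check_ram(800, min_free_mb=512, warn_free_mb=2048))
```

Under these assumptions, raising the margin from 512 to 1024 means an 8192 MB service needs 9216 MB of free VRAM instead of 8704 MB.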

A .env.example file is included in the repository root. Copy it to .env and uncomment the variables you want to override.

AI Planning

gpumod integrates with LLM APIs to provide AI-assisted VRAM allocation planning via gpumod plan suggest.

How It Works

1. gpumod gathers your registered services, their VRAM requirements, and GPU capacity.
2. A prompt containing only minimal data (service IDs, VRAM amounts, GPU capacity) is sent to the configured LLM backend.
3. The LLM returns a structured JSON plan with service allocations and reasoning.
4. gpumod validates all IDs and values in the LLM response against strict regex patterns and VRAM limits.
5. The plan is simulated through the SimulationEngine to verify feasibility.
6. Results are displayed with advisory CLI commands that you can choose to execute.
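
The validation step described above can be sketched roughly as below. The JSON shape (`services`, `id`, `vram_mb`), the ID pattern, and the `validate_plan` helper are all assumptions made for illustration; the real response schema and regexes used by gpumod are not shown in this document.

```python
import json
import re

# Hypothetical ID pattern: lowercase letters, digits, and hyphens only.
SERVICE_ID_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,63}")


def validate_plan(raw_json, known_ids, gpu_capacity_mb):
    """Parse an LLM plan and return its total VRAM, or raise ValueError."""
    plan = json.loads(raw_json)
    total_mb = 0
    for entry in plan["services"]:
        service_id = entry["id"]
        vram_mb = entry["vram_mb"]
        # Reject anything that does not look like a service ID.
        if not isinstance(service_id, str) or not SERVICE_ID_RE.fullmatch(service_id):
            raise ValueError(f"malformed service ID: {service_id!r}")
        # Reject IDs the LLM may have invented.
        if service_id not in known_ids:
            raise ValueError(f"unknown service ID: {service_id!r}")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(vram_mb, bool) or not isinstance(vram_mb, int):
            raise ValueError(f"VRAM must be an integer: {vram_mb!r}")
        if not 0 < vram_mb <= gpu_capacity_mb:
            raise ValueError(f"VRAM out of range: {vram_mb!r}")
        total_mb += vram_mb
    if total_mb > gpu_capacity_mb:
        raise ValueError("plan exceeds GPU capacity")
    return total_mb
```

Treating the LLM response as untrusted input in this way means a hallucinated service name or an impossible VRAM figure is rejected before the plan ever reaches simulation.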

Supported Backends

Backend	Environment Variable	Notes
OpenAI	`GPUMOD_LLM_API_KEY`	Default backend, uses `gpt-4o-mini`
Anthropic	`GPUMOD_LLM_API_KEY`	Set `GPUMOD_LLM_BACKEND=anthropic`
Ollama	(no key required)	Set `GPUMOD_LLM_BACKEND=ollama`, runs locally

Example

# Configure the LLM backend
export GPUMOD_LLM_BACKEND=openai
export GPUMOD_LLM_API_KEY=sk-...
export GPUMOD_LLM_MODEL=gpt-4o-mini

# Get a plan
gpumod plan suggest

Output:

Fits: 9216 / 24576 MB (headroom: 15360 MB)
            AI-Suggested VRAM Plan
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Service ID     ┃ VRAM (MB) ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ llama-3-1-8b   │      8192 │
│ bge-large      │      1024 │
│ Total          │      9216 │
└────────────────┴───────────┘

Reasoning: The Llama 3.1 8B model is the primary chat service
requiring 8GB VRAM. BGE-Large provides embedding retrieval at
only 1GB. Together they fit well within the 24GB RTX 4090 with
15GB headroom for KV cache growth.

Suggested commands (advisory only):
  gpumod simulate services llama-3-1-8b,bge-large
  gpumod service start llama-3-1-8b
  gpumod service start bge-large

Dry Run

Preview the prompt that would be sent to the LLM without actually calling the API:

gpumod plan suggest --dry-run

This is useful for reviewing what data is sent to the LLM, verifying your configuration, and debugging prompt templates.
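
Conceptually, a dry run builds the prompt and stops before the network call. The sketch below illustrates that idea only; `build_prompt`, `suggest`, the prompt wording, and the payload field names are hypothetical and do not reflect gpumod's actual prompt template.

```python
import json


def build_prompt(services, gpu_capacity_mb):
    """Build a prompt that carries only service IDs, VRAM amounts, and GPU capacity."""
    payload = {
        "gpu_capacity_mb": gpu_capacity_mb,
        # Copy only the two minimal fields; anything else stays local.
        "services": [{"id": s["id"], "vram_mb": s["vram_mb"]} for s in services],
    }
    return "Suggest a VRAM allocation plan for:\n" + json.dumps(payload, indent=2)


def suggest(services, gpu_capacity_mb, call_llm, dry_run=False):
    """Return the prompt itself on a dry run, otherwise the LLM's reply."""
    prompt = build_prompt(services, gpu_capacity_mb)
    if dry_run:
        return prompt  # nothing is sent to the backend
    return call_llm(prompt)
```

Because the payload is assembled from an explicit list of fields, reviewing the dry-run output shows exactly what would leave the machine.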