CLI Reference¶

gpumod uses a subcommand structure. Every command supports --help for detailed usage.

gpumod --help

gpumod status¶

Show system status including GPU info, VRAM usage, active mode, and running services.

# Default table output
gpumod status

# Visual VRAM bar chart
gpumod status --visual

# JSON output for scripting
gpumod status --json

gpumod init¶

Initialize the database and load preset configurations.

# Use default database path (~/.config/gpumod/gpumod.db)
gpumod init

# Specify a custom database path
gpumod init --db-path /path/to/custom.db

# Specify an additional preset directory
gpumod init --preset-dir /path/to/my/presets

gpumod service¶

Manage individual GPU services.

service list¶

List all registered services with their current state.

gpumod service list
gpumod service list --json

service status¶

Show detailed status of a specific service.

gpumod service status llama-3-1-8b
gpumod service status llama-3-1-8b --json

Output includes service name, driver type, port, VRAM allocation, state, uptime, and health check status.

service start¶

Start a registered service.

gpumod service start llama-3-1-8b

service stop¶

Stop a running service.

gpumod service stop llama-3-1-8b

gpumod mode¶

Manage service modes -- named groups of services for specific use cases.

mode list¶

List all defined modes.

gpumod mode list
gpumod mode list --json

mode status¶

Show the currently active mode.

gpumod mode status
gpumod mode status --json

mode switch¶

Switch to a different mode. This starts services in the target mode and stops services not in the target mode.

gpumod mode switch chat-mode
gpumod mode switch chat-mode --json

Drift recovery (gpumod-hrgg): the database may already record the target mode as current while the systemd units have drifted (host reboot, manual stop, prior failed boot). In that case mode switch reconciles against the set of services that are actually running and re-issues systemctl start for every target service that is not active. The Started: block in the output lists each service that was launched.
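The reconciliation described above amounts to two set differences between the target mode's services and the units that are actually running. The sketch below is illustrative only: set_diff and the example service lists are hypothetical, and gpumod's internal implementation may differ.

```shell
# Hypothetical helper: print each item of list $1 that is absent from list $2.
# Both lists are space-separated service IDs.
set_diff() {
  for item in $1; do
    case " $2 " in
      *" $item "*) ;;                 # present in both lists, skip
      *) printf '%s\n' "$item" ;;     # only in the first list
    esac
  done
}

target="llama-3-1-8b bge-large"       # services in the target mode (example values)
running="bge-large mistral-7b"        # units currently active (example values)

to_start=$(set_diff "$target" "$running")   # in target, not running
to_stop=$(set_diff "$running" "$target")    # running, not in target

echo "start: $to_start"
echo "stop:  $to_stop"
```

Because the comparison uses the running set rather than the recorded mode, a target service that was stopped manually still appears under start.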

mode sync¶

Sync YAML mode files into the database. The command compares each modes/*.yaml file against the DB, inserts new modes, and updates changed ones. It calculates each mode's total VRAM from the VRAM values of its member services. Run this after editing any mode YAML file or after template install-all re-renders service units; otherwise mode switch may target a stale service set.

gpumod mode sync
gpumod mode sync --json
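The ordering advice above (re-render units, sync modes, then switch) can be wrapped in a small function. This is a sketch, not part of gpumod: resync_and_switch and the GPUMOD override variable are hypothetical names introduced here, and only commands documented on this page are invoked.

```shell
# Hypothetical wrapper: re-render units, sync modes, then switch.
# GPUMOD can be overridden (for example GPUMOD=echo) to print the commands
# instead of running them. The chain stops at the first failing step.
resync_and_switch() {
  mode="$1"
  "${GPUMOD:-gpumod}" template install-all --yes &&
    "${GPUMOD:-gpumod}" mode sync &&
    "${GPUMOD:-gpumod}" mode switch "$mode"
}
```

Calling resync_and_switch chat-mode runs the three steps in order, so mode switch never sees a service set older than the freshly rendered units.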

mode create¶

Create a new mode from existing services.

gpumod mode create "Coding Mode" \
  --services devstral-small-2,bge-large \
  --description "Code completion with embedding retrieval"

The mode ID is auto-generated from the name (e.g., "Coding Mode" becomes coding-mode).
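The name-to-ID conversion can be approximated with a short slug function. The exact rule gpumod applies is not documented here, so treat the following as an assumption that happens to reproduce the "Coding Mode" example; mode_id_from_name is a hypothetical name.

```shell
# Hypothetical approximation of the name-to-ID rule: lowercase the name and
# collapse every run of non-alphanumeric characters into a single hyphen,
# then trim a leading or trailing hyphen.
mode_id_from_name() {
  printf '%s' "$1" |
    tr '[:upper:]' '[:lower:]' |
    sed -E 's/[^a-z0-9]+/-/g; s/^-//; s/-$//'
}

mode_id_from_name "Coding Mode"   # prints: coding-mode
```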

gpumod template¶

Manage Jinja2 systemd unit file templates.

template list¶

List available template files.

gpumod template list
gpumod template list --json

template show¶

Show a rendered template with sample context.

gpumod template show vllm.service.j2

template generate¶

Generate a systemd unit file for a registered service. The template is selected automatically based on the service driver type.

# Preview the generated unit file
gpumod template generate llama-3-1-8b

# Write to a file
gpumod template generate llama-3-1-8b --output /tmp/llama.service

# JSON output
gpumod template generate llama-3-1-8b --json

template install¶

Install a generated unit file to the systemd user directory. Without --yes the command only previews the unit file; pass --yes to confirm and write it.

# Preview first (no --yes)
gpumod template install llama-3-1-8b

# Install with confirmation
gpumod template install llama-3-1-8b --yes

The unit file is written to ~/.config/systemd/user/{service_id}.service.

template install-all¶

Re-render every registered service against the current templates. This is a mandatory acceptance step for any change to src/gpumod/templates/ or to any preset YAML schema expectation. The test suite covers the template engine but not the full preset matrix, so running install-all surfaces latent preset bugs that pytest misses (the lesson from gpumod-56md).

gpumod template install-all --yes

gpumod model¶

Manage the ML model registry for VRAM estimation.

model list¶

List all registered models.

gpumod model list
gpumod model list --json

model info¶

Show detailed model information and VRAM estimates.

# Default context size (4096 tokens)
gpumod model info meta-llama/Llama-3.1-8B-Instruct

# Custom context size
gpumod model info meta-llama/Llama-3.1-8B-Instruct --context-size 32768
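A larger --context-size raises the VRAM estimate mainly because the key-value cache grows linearly with the number of tokens. The sketch below uses a common back-of-the-envelope formula, which is not necessarily the one gpumod applies. The function name is hypothetical, and the inputs are the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128) with an assumed 16-bit cache.

```shell
# Hypothetical estimate of KV-cache size in MiB:
#   2 (keys and values) * layers * kv_heads * head_dim * bytes_per_value * tokens
kv_cache_mib() {
  layers="$1"; kv_heads="$2"; head_dim="$3"; bytes="$4"; tokens="$5"
  echo $(( 2 * layers * kv_heads * head_dim * bytes * tokens / 1048576 ))
}

kv_cache_mib 32 8 128 2 4096     # prints: 512
kv_cache_mib 32 8 128 2 32768    # prints: 4096
```

Under these assumptions, moving from the default 4096 tokens to 32768 tokens adds roughly 3.5 GiB of cache on top of the model weights.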

model register¶

Register a model in the registry. For models whose source is huggingface, metadata is fetched automatically from HuggingFace Hub.

# Register a HuggingFace model (metadata auto-fetched)
gpumod model register meta-llama/Llama-3.1-8B-Instruct

# Register a GGUF model with file path
gpumod model register my-gguf-model \
  --source gguf \
  --file-path ~/models/model.gguf

# Register a local model with manual metadata
gpumod model register my-local-model \
  --source local \
  --vram 8192 \
  --params 7.0 \
  --architecture llama

model remove¶

Remove a model from the registry.

gpumod model remove meta-llama/Llama-3.1-8B-Instruct

gpumod simulate¶

Simulate VRAM usage without starting or stopping services.

simulate mode¶

Simulate VRAM usage for a defined mode, optionally adding or removing services from the simulation.

# Simulate a mode as-is
gpumod simulate mode chat-mode

# Add a service to the simulation
gpumod simulate mode chat-mode --add mistral-7b

# Remove a service from the simulation
gpumod simulate mode chat-mode --remove bge-large

# Override context sizes for specific services
gpumod simulate mode chat-mode --context llama-3-1-8b=32768

# Visual VRAM bar comparison
gpumod simulate mode chat-mode --visual

# JSON output
gpumod simulate mode chat-mode --json

When services exceed GPU VRAM, gpumod suggests alternatives such as dropping optional services or reducing context window sizes.
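The fit check behind a simulation can be pictured as summing per-service VRAM figures and comparing the total against GPU capacity. This is a minimal sketch that assumes all values are in MB; fits_gpu and the numbers are hypothetical, and it does not reproduce gpumod's estimator or its suggestions.

```shell
# Hypothetical fit check: $1 is GPU capacity in MB, remaining arguments are
# per-service VRAM figures in MB. Returns 0 if the total fits, 1 otherwise.
fits_gpu() {
  capacity_mb="$1"; shift
  total=0
  for vram in "$@"; do
    total=$((total + vram))
  done
  if [ "$total" -le "$capacity_mb" ]; then
    echo "fits: ${total} MB of ${capacity_mb} MB"
  else
    echo "over by $((total - capacity_mb)) MB"
    return 1
  fi
}

fits_gpu 24576 8192 1024            # prints: fits: 9216 MB of 24576 MB
fits_gpu 8192 8192 1024 || true     # prints: over by 1024 MB
```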

simulate services¶

Simulate VRAM usage for an explicit list of services.

# Simulate specific services
gpumod simulate services llama-3-1-8b,bge-large,mistral-7b

# With context overrides
gpumod simulate services llama-3-1-8b,bge-large \
  --context llama-3-1-8b=16384

# Visual output
gpumod simulate services llama-3-1-8b,bge-large --visual

gpumod plan¶

AI-assisted VRAM allocation planning.

plan suggest¶

Get an AI-generated VRAM allocation plan. The LLM analyzes your registered services and GPU capacity to suggest an optimal configuration.

# Get a plan for all registered services
gpumod plan suggest

# Plan for a specific mode
gpumod plan suggest --mode chat-mode

# Set a VRAM budget (e.g., leave headroom for other processes)
gpumod plan suggest --budget 20000

# Preview the prompt without calling the LLM
gpumod plan suggest --dry-run

# JSON output
gpumod plan suggest --json

The plan output includes:

- AI-suggested service allocations with VRAM amounts
- Simulation results showing whether the plan fits the GPU
- Advisory CLI commands you can copy-paste to implement the plan
- Reasoning from the LLM about its allocation decisions

Plans are advisory only -- gpumod never auto-executes LLM suggestions.

gpumod tui¶

Launch an interactive terminal dashboard powered by Textual.

gpumod tui

The TUI displays a live GPU status bar, a service list with state indicators, a command input (accepting /status, /switch <mode>, /simulate, and /quit), and a footer with keyboard shortcuts.

Press q to quit or type /help for available commands.

gpumod discover¶

Discover GGUF models from HuggingFace and generate ready-to-use presets.

# List unsloth models (default)
gpumod discover

# Search by name
gpumod discover --search deepseek

# Search in a specific organization
gpumod discover -s kimi -a moonshotai

# Combine author and search
gpumod discover --author bartowski --search llama

# Filter by task type
gpumod discover --task code

Options¶

| Option | Short | Description |
|---|---|---|
| `--search` | `-s` | Search models by name (e.g., "deepseek", "kimi") |
| `--author` | `-a` | HuggingFace organization (default: unsloth) |
| `--task` | | Filter by task: code, chat, embed, reasoning |
| `--vram` | | VRAM budget in MB (default: detected available) |
| `--context` | | Context size (default: 8192) |
| `--dry-run` | | Preview without writing files |
| `--json` | | Output JSON, no interaction |
| `--no-cache` | | Bypass HuggingFace cache |
| `--verbose` | `-v` | Debug output |

Workflow¶

1. System detection -- Detects GPU, VRAM, and running services
2. Model search -- Fetches GGUF models matching your criteria
3. VRAM filtering -- Shows only models that fit your VRAM budget
4. Interactive selection -- Pick a model and quantization
5. Preset generation -- Creates a ready-to-use YAML preset
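The VRAM filtering step in the workflow above can be illustrated with a one-line filter. The candidate names and sizes below are made up for the example, and filter_by_vram is a hypothetical helper; how gpumod estimates each model's footprint is not shown here.

```shell
# Hypothetical filter: stdin holds "name estimated_mb" pairs, one per line;
# $1 is the VRAM budget in MB. Prints the names that fit the budget.
filter_by_vram() {
  awk -v budget="$1" '($2 + 0) <= (budget + 0) { print $1 }'
}

# Example candidates (invented values) filtered against an 8192 MB budget.
printf '%s\n' \
  'example-7b-q4 4800' \
  'example-7b-q8 8100' \
  'example-7b-f16 15000' |
  filter_by_vram 8192
```

With these sample values the filter keeps the two quantized variants and drops the 16-bit one.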

Examples¶

# Find coding models from unsloth that fit 24GB VRAM
gpumod discover --task code --vram 24576

# Search for DeepSeek models from any author, output as JSON
gpumod discover --search deepseek --author "" --json

# Preview a preset without writing it
gpumod discover --search qwen --dry-run

After generating a preset, follow the printed instructions to download the model and register the service.

gpumod watch¶

Watch preset and mode directories for YAML file changes and automatically sync to the database. Useful for rapid iteration during development.

# Start watching with default settings (500ms debounce)
gpumod watch

# Custom debounce window
gpumod watch --debounce 200

# Skip initial sync before starting
gpumod watch --no-sync

# Timeout after N seconds (for testing)
gpumod watch --timeout 60

The watcher monitors .yaml and .yml files, ignoring editor temp files (.swp, .tmp, ~ backups). Changes are debounced to coalesce rapid saves from editors. Press Ctrl+C to stop.
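The two behaviors described above, file filtering and debouncing, can be sketched in a few lines. Both functions are hypothetical illustrations rather than gpumod's code: is_watched_file mirrors the stated extension rules, and count_syncs models a trailing-edge debounce in which events closer together than the window collapse into one sync.

```shell
# Hypothetical filter: succeed only for YAML files that are not editor
# temp files or backups.
is_watched_file() {
  case "$1" in
    *.swp|*.tmp|*~) return 1 ;;      # editor temp files and ~ backups
    *.yaml|*.yml)   return 0 ;;      # YAML files are watched
    *)              return 1 ;;      # everything else is ignored
  esac
}

# Hypothetical debounce model: $1 is the window in ms; stdin holds event
# timestamps in ms, ascending, one per line. Prints how many syncs would run.
count_syncs() {
  awk -v window="$1" '
    NR == 1                        { syncs = 1 }   # first event opens a burst
    NR > 1 && ($1 - prev) > window { syncs++ }     # a long gap starts a new burst
                                   { prev = $1 }
    END                            { print syncs + 0 }
  '
}

is_watched_file modes/chat-mode.yaml && echo "watched"
printf '%s\n' 0 100 300 2000 2100 | count_syncs 500   # prints: 2
```

In the example, three saves within 300 ms form one burst and the two later saves form a second, so five file events result in two syncs at the default 500 ms window.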

gpumod preset¶

Manage service presets.

preset sync¶

Sync YAML preset files into the database. Compares each preset YAML against the DB and inserts new services or updates changed ones. Unchanged services are skipped.

# Sync presets using the default database
gpumod preset sync

# Use a custom database path
gpumod preset sync --db-path /path/to/custom.db

gpumod install-server¶

Install the gpumod MCP server as a systemd user service. Auto-detects the Python interpreter, venv, and project working directory.

# Preview the generated unit file without writing
gpumod install-server --dry-run

# Install to ~/.config/systemd/user/gpumod-mcp.service
gpumod install-server

# Install with custom host and port
gpumod install-server --host 0.0.0.0 --port 8808

# Install to a custom directory, skip daemon-reload
gpumod install-server --unit-dir /tmp/units --no-reload

After installation, enable and start the service:

systemctl --user enable gpumod-mcp
systemctl --user start gpumod-mcp

gpumod doctor¶

System health checks for GPU stability and host protection.

doctor sysctl¶

Verify kernel sysctl tuning for GPU stability. Checks that vm.min_free_kbytes is high enough to prevent CUDA pinned-memory allocation hangs caused by page fragmentation.

# Check with default threshold (1 GiB)
gpumod doctor sysctl

# Custom threshold
gpumod doctor sysctl --threshold 524288

Exit codes: 0 = OK, 1 = below threshold or unreadable.
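The check can be approximated by hand by reading the kernel value directly. This sketch assumes that --threshold is expressed in KiB, the same unit as vm.min_free_kbytes, so the 1 GiB default corresponds to 1048576. check_min_free_kbytes is a hypothetical function that only mirrors the documented exit codes.

```shell
# Hypothetical stand-in for the sysctl check.
#   $1: threshold in KiB (default 1048576, i.e. 1 GiB)
#   $2: file to read (default /proc/sys/vm/min_free_kbytes)
# Returns 0 if the value meets the threshold, 1 if it is lower or unreadable.
check_min_free_kbytes() {
  threshold_kib="${1:-1048576}"
  path="${2:-/proc/sys/vm/min_free_kbytes}"
  current=$(cat "$path" 2>/dev/null) || return 1      # unreadable
  case "$current" in
    ''|*[!0-9]*) return 1 ;;                           # not a plain number
  esac
  [ "$current" -ge "$threshold_kib" ]
}
```

Running check_min_free_kbytes 524288 corresponds to the custom-threshold example above and succeeds when at least 512 MiB is reserved.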

doctor oom-protection¶

Verify that OOM protection drop-ins are installed for code-server and systemd-oomd. These drop-ins prevent systemd-oomd from killing code-server during LLM model loads.

gpumod doctor oom-protection

If the check fails, install the drop-ins:

sudo scripts/oom-protection/install.sh

Exit codes: 0 = protection correctly installed, 1 = missing or misconfigured.
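Since both doctor subcommands report results through their exit codes, they combine naturally into a preflight script. The sketch below is one possible arrangement; preflight and the GPUMOD override variable are hypothetical names introduced for illustration.

```shell
# Hypothetical preflight: run both doctor checks and fail if either fails.
# GPUMOD can be overridden to substitute another executable when testing.
preflight() {
  status=0
  for check in sysctl oom-protection; do
    if "${GPUMOD:-gpumod}" doctor "$check"; then
      echo "ok: doctor $check"
    else
      echo "FAILED: doctor $check"
      status=1
    fi
  done
  return "$status"
}
```

Running both checks before reporting, instead of stopping at the first failure, shows every problem in a single pass. A caller can then gate a model load with preflight && gpumod mode switch chat-mode.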