MCP Workflows Guide¶
Practical workflows for managing GPU services via MCP tools in Claude Code, Cursor, or other AI assistants.
Quick Navigation¶
| Workflow | Use Case | Key Tools |
|---|---|---|
| 1. Find & Run a Model | Try a new model | search_hf_models, list_gguf_files, switch_mode |
| 2. Set Up RAG | Embedding + chat | simulate_mode, switch_mode |
| 3. Run Large Models | 20GB+ models | gpu_status, switch_mode("blank") |
| 4. Create a Preset | Save model config | generate_preset, fetch_model_config |
| 5. Create a Mode | Group services | simulate_mode, list_modes |
| 6. Daily Mode Switching | Work transitions | switch_mode |
| 7. Troubleshoot | Debug failures | gpu_status, list_services |
Workflow 1: Find and Run a Model¶
Goal: Discover a model on HuggingFace, check if it fits, and run it.
Steps¶
1. Check current state
2. Search for models
Tool: search_hf_models(query="GLM-4 code GGUF")
3. Check quantizations
Tool: list_gguf_files(repo_id="THUDM/GLM-4.7-Flash-UD-GGUF")
{"files": [
{"filename": "Q4_K_M.gguf", "vram_estimate_mb": 17500},
{"filename": "Q4_K_XL.gguf", "vram_estimate_mb": 21000}
]}
4. Simulate fit
Tool: simulate_mode → fits: true, margin: 3.5GB
5. Switch and load
Tool: switch_mode("code")
If the GGUF file doesn't exist, gpumod detects this and asks for confirmation:
Model file not found: /data/models/gguf/GLM-4.7-Flash-UD-Q4_K_XL.gguf
Download from HuggingFace? (21.5 GB) [y/N]:
Downloads require explicit user confirmation (default NO).
Outcome¶
Model loads in 30-60 seconds. Service available at configured port.
Workflow 2: Set Up RAG¶
Goal: Run embedding + chat models together within 24GB.
Steps¶
1. Plan VRAM budget
Assistant calculates: - Embedding (Qwen-2B): ~3GB - Chat (Mistral-Small-24B Q4): ~15GB - Total: 18GB ✓
2. Verify with simulation
Tool: simulate_mode("rag")
3. Switch mode
Tool: switch_mode("rag")
Outcome¶
Both services running. Embedding on port 8200, chat on port 7070.
Workflow 3: Run Large Models¶
Goal: Run a 20GB+ model that needs nearly all VRAM.
Steps¶
1. Check what's using VRAM
2. Clear VRAM first
Tool: switch_mode("blank")
Wait for VRAM release (~5-10s).
3. Verify empty
4. Start large model
Tool: switch_mode("nemotron")
Outcome¶
23GB model loads. Takes 45-60 seconds for large models.
Key Point¶
Large models (>20GB) need blank mode first. Don't try to load over existing services.
Workflow 4: Create a Preset¶
Goal: Save a model configuration for reuse.
Steps¶
1. Research the model
Tool: fetch_model_config(repo_id="mistralai/Devstral-Small-2505")
2. Generate preset
Tool: generate_preset
id: devstral-small
driver: vllm
port: 8000
vram_mb: 16000
model_id: mistralai/Devstral-Small-2505
unit_vars:
max_model_len: 32768
3. Save and register
# Save the YAML under your preset directory, then sync to the DB:
gpumod preset sync
# (or: gpumod init --preset-dir <dir> for an alternate preset root)
4. Verify
Tool: list_services → shows new service (stopped).
Outcome¶
Service registered and available for mode inclusion.
Workflow 5: Create a Mode¶
Goal: Group services into a named configuration.
Steps¶
1. Plan and verify VRAM
Check if it fits:
| Service | VRAM |
|---|---|
| qwen-embedding | 3GB |
| mistral-reasoner | 15GB |
| Total | 18GB ✓ |
2. Create mode
3. Verify
Tool: list_modes
Outcome¶
New mode ready for switch_mode("research").
Workflow 6: Daily Mode Switching¶
Goal: Transition between work contexts throughout the day.
Morning: Code¶
Tool: switch_mode("code") → GLM-4.7 loads (~30s)
Midday: Research¶
Tool: switch_mode("research")
What happens: 1. Stops code mode services 2. Waits for VRAM release 3. Starts embedding + reasoner
Evening: Shutdown¶
Tool: switch_mode("blank") → GPU idle
Key Points¶
- Mode switches handle VRAM automatically
- No manual service management needed
- Same endpoint ports across modes
Workflow 7: Troubleshoot¶
Goal: Diagnose why a model won't load.
Check 1: VRAM conflict¶
Problem: Another model using VRAM.
Fix: switch_mode("blank") first, or switch directly to target mode.
Check 2: Service health¶
Problem: Service started but not responding.
Fix: Check logs with journalctl --user -u gpumod-glm-code.service -n 50
Check 3: Model file missing¶
gpumod's preflight check detects missing GGUF files before starting:
Preflight failed: Model file not found
/data/models/gguf/GLM-4.7-Flash-UD-Q4_K_XL.gguf
Download from HuggingFace? (21.5 GB) [y/N]:
Options:
- Type y to download (requires disk space + time for large models)
- Type n to abort and download manually:
wget -c "https://huggingface.co/THUDM/GLM-4.7-Flash-UD-GGUF/resolve/main/Q4_K_XL.gguf" \
-O /data/models/gguf/GLM-4.7-Flash-UD-Q4_K_XL.gguf
Recovery¶
When in doubt:
This stops everything and releases VRAM. Always works.
Tool Reference¶
Tier 1: Direct Operations¶
| Tool | Purpose | Example |
|---|---|---|
gpu_status |
Check GPU state | "What's my VRAM?" |
list_services |
Show all services | "What's running?" |
list_modes |
Show available modes | "What modes exist?" |
service_info |
Detailed info for one service | "Show me the qwen-embedding preset" |
switch_mode |
Change mode (reconciles drift, gpumod-hrgg) | "Switch to rag" |
simulate_mode |
Test VRAM fit | "Will this fit?" |
start_service |
Start one service | "Start embedding" |
stop_service |
Stop one service | "Stop chat model" |
Tier 2: Discovery¶
| Tool | Purpose | Example |
|---|---|---|
search_hf_models |
Find models | "Find Qwen GGUF" |
list_gguf_files |
Check quantizations | "What sizes available?" |
list_model_files |
List files for any HF repo | "What's in the unsloth gemma 4 26b repo?" |
model_info |
VRAM + arch detail for a registered model | "How big is gemma 4 26b at 128K ctx?" |
fetch_model_config |
Get architecture | "What's the context length?" |
generate_preset |
Create config | "Make a preset" |
fetch_driver_docs |
Get llama.cpp/vLLM docs | "What flags available?" |
Tier 3: Complex Reasoning¶
| Tool | Purpose | Example |
|---|---|---|
consult |
Multi-step analysis | "Can I run Qwen-235B on 24GB?" |
(16 tools total. Run mcp list-tools gpumod from your AI assistant to see
the live list — new tools are added without breaking existing ones.)
Best Practices¶
Before Loading Large Models¶
- Check VRAM:
gpu_status - Switch to blank for 20GB+ models
- Wait for load (30-60s for large models)
For Mode Switching¶
- Trust automatic VRAM management
- Don't force on timeout — wait longer
- Use
switch_mode("blank")for recovery
For RAG Setups¶
- Plan total VRAM (add all services + 1GB margin)
- Start embedding first (smaller, verifies setup)
- Check health before sending requests
Security¶
- Services bind to localhost (127.0.0.1) by default
- Override with
host: "0.0.0.0"in preset for external access
See Also¶
- CLI Reference — Command-line interface
- MCP Integration — Setup for AI assistants
Sources: - GitBook: Documentation Structure Tips - MDN: Creating Effective Technical Documentation