DockerDriver Design Document¶

Ticket: gpumod-86k (P7-S0 SPIKE) Status: Complete Author: AI Architect Agent Date: 2026-02-07

1. Decision Record: Docker SDK vs Subprocess¶

Decision¶

Use the Docker SDK for Python (docker-py) instead of subprocess calls to docker/podman.

Context¶

gpumod needs a DockerDriver to manage containerized services (e.g., Qdrant vector database, Langfuse observability platform) alongside systemd-managed ML inference services. The driver must implement the ServiceDriver ABC: start(), stop(), status(), health_check().

Two approaches were evaluated:

Subprocess (asyncio.create_subprocess_exec calling docker CLI)
Docker SDK (docker-py Python library)

Evaluation¶

Criterion	Subprocess	Docker SDK
Shell injection risk	Requires input validation + command construction; one missed argument = injection vector	Eliminated entirely; typed Python API, no string commands
Command surface area	Docker CLI has 50+ subcommands with complex flag combinations; allowlist approach from `systemd.py` does not scale	SDK handles argument construction internally
Type safety	Returns raw strings; manual parsing required	Returns typed objects (`Container`, `Image`) with properties
Health checks	Must parse `docker inspect` JSON output	Native `.status`, `.attrs["State"]["Health"]` access
Async support	Naturally async via `create_subprocess_exec`	Sync library; requires `asyncio.to_thread()` wrapper
Dependency weight	No extra dependency	Adds `docker>=7.0,<8.0` (~2MB)
Consistency	Different pattern from other drivers (systemd uses subprocess but has tiny surface)	Aligns with "use the best tool" principle
Podman compat	Works with `podman` CLI directly	Works via `podman`'s Docker-compatible API socket

Rationale¶

The existing systemd.py uses subprocess safely because systemctl has only 8 allowlisted commands and a simple unit_name argument. Docker CLI has a fundamentally larger attack surface (50+ commands, complex flags, volume specs, environment variables, port mappings). Reproducing the systemd.py allowlist pattern for Docker would require validating dozens of argument combinations -- brittle and error-prone.

The Docker SDK eliminates the entire class of shell/command injection attacks by using a typed Python API. The tradeoff (sync library requiring asyncio.to_thread()) is acceptable since container operations are infrequent and already involve network I/O to the Docker daemon.

Consequences¶

Add docker>=7.0,<8.0 to pyproject.toml dependencies with upper bound.
Wrap all Docker SDK calls in asyncio.to_thread() for async compatibility.
Mock docker.DockerClient in tests (no real Docker daemon needed for unit tests).

2. Security Threat Analysis¶

2.1 Container-Specific Threats¶

#	Threat	Vector	Impact	Mitigation	Ref
T18	Image name injection	LLM passes `"image; rm -rf /"` as container image via MCP tool	Arbitrary command execution (if using subprocess); with SDK, API error but potential confusion	SEC-D7: Image name validated against strict regex before passing to SDK; reject names containing shell metacharacters	SEC-D7
T19	Volume mount path traversal	LLM passes `"../../etc/shadow:/data"` as volume mount in `extra_config`	Host filesystem read/write outside allowed directories	SEC-D8: Volume source paths resolved via `os.path.realpath()` and validated to be under explicitly allowed base directories (`$HOME`, `/tmp`); reject absolute paths outside allowlist	SEC-D8
T20	Privileged container escape	`extra_config` contains `privileged: true` or dangerous capabilities (`SYS_ADMIN`, `SYS_PTRACE`)	Full host access, container escape	SEC-D9: `privileged=False` hardcoded in driver; capability allowlist enforced (only `--gpus` via nvidia runtime); `pid_mode`, `network_mode="host"`, `ipc_mode="host"` blocked	SEC-D9
T21	Container name / env var injection	Malicious container name with ANSI escapes or env vars with sensitive data in `extra_config`	Terminal escape injection in CLI/TUI output; info disclosure if env vars leak	SEC-D10: Container names constructed from sanitized `service.id` with `gpumod-` prefix; env var keys validated against `^[A-Z_][A-Z0-9_]*$` regex; env var values sanitized (no newlines, no shell expansion)	SEC-D10

2.2 Additional Container Security Controls¶

Docker Socket Access: - gpumod connects to the Docker daemon via the default socket (/var/run/docker.sock). - The user running gpumod must be in the docker group or use rootless Docker. - gpumod does NOT expose the Docker socket to managed containers (no DinD).

GPU Access: - Containers requiring GPU access use runtime="nvidia" (nvidia-container-runtime). - The --gpus flag is the only device-related option exposed. - No --device passthrough for arbitrary host devices.

Resource Limits: - Containers can have mem_limit set via extra_config to prevent host OOM. - No CPU pinning unless explicitly configured.

Image Pull Policy: - gpumod does NOT auto-pull images. Images must be pre-pulled on the host. - This prevents supply-chain attacks via typosquatted image names. - A future enhancement could add allowlisted registries.

3. Interface Design¶

3.1 DockerDriver Class¶

class DockerDriver(ServiceDriver):
    """Driver for Docker-containerized services.

    Manages container lifecycle via the Docker SDK. Does not support
    sleep/wake -- containers use stop/start for memory release.

    Parameters
    ----------
    client:
        Optional Docker client for dependency injection (testing).
        Defaults to docker.from_env().
    http_timeout:
        Timeout in seconds for HTTP health check requests.
    container_prefix:
        Prefix for container names. Default "gpumod".
    """

    def __init__(
        self,
        client: docker.DockerClient | None = None,
        http_timeout: float = 10.0,
        container_prefix: str = "gpumod",
    ) -> None: ...

    async def start(self, service: Service) -> None: ...
    async def stop(self, service: Service) -> None: ...
    async def status(self, service: Service) -> ServiceStatus: ...
    async def health_check(self, service: Service) -> bool: ...

    @property
    def supports_sleep(self) -> bool:
        return False

3.2 Method Specifications¶

start(service) 1. Validate image name via validate_docker_image() (SEC-D7). 2. Validate volume mounts via validate_volume_mounts() (SEC-D8). 3. Construct container config: name={prefix}-{service.id}, image from extra_config["image"], ports/volumes/env from extra_config. 4. Hardcode privileged=False, apply capability restrictions (SEC-D9). 5. Call client.containers.run(detach=True, ...) via asyncio.to_thread().

stop(service) 1. Look up container by name {prefix}-{service.id}. 2. Call container.stop(timeout=30) then container.remove(). 3. Handle docker.errors.NotFound gracefully (container already gone).

status(service) 1. Look up container by name. 2. Map Docker status to ServiceState: - "running" + healthy -> ServiceState.RUNNING - "running" + unhealthy -> ServiceState.UNHEALTHY - "exited" / "created" -> ServiceState.STOPPED - "restarting" -> ServiceState.STARTING - Not found -> ServiceState.STOPPED - Error -> ServiceState.UNKNOWN 3. Combine with HTTP health check (same pattern as VLLMDriver).

health_check(service) 1. Use httpx.AsyncClient to GET http://localhost:{port}{health_endpoint}. 2. Return True if status code 200, False otherwise. 3. Same implementation as VLLMDriver/FastAPIDriver (consistency).

3.3 Container Naming Convention¶

gpumod-{service.id}

service.id is already validated by validate_service_id() (SEC-V1).
The gpumod- prefix prevents name collisions with user containers.
The full name is sanitized via sanitize_name() for display purposes.

3.4 Configuration via `extra_config`¶

The following keys are recognized in service.extra_config for Docker services:

Key	Type	Required	Description	Validation
`image`	`str`	Yes	Docker image name with optional tag	SEC-D7 regex
`volumes`	`dict[str, str]`	No	Host path -> container path mappings	SEC-D8 path validation
`environment`	`dict[str, str]`	No	Environment variables	SEC-D10 key/value validation
`ports`	`dict[str, int]`	No	Container port -> host port mappings	Port range 1-65535
`command`	`str`	No	Override container CMD	No shell metacharacters
`runtime`	`str`	No	Container runtime (e.g., `"nvidia"`)	Allowlist: `"nvidia"`, `"runc"`
`mem_limit`	`str`	No	Memory limit (e.g., `"2g"`)	Regex: `^[0-9]+[kmgKMG]?$`

These keys must be added to EXTRA_CONFIG_ALLOWED_KEYS in validation.py (SEC-D6).

4. Validation Functions¶

Add to src/gpumod/validation.py:

# Docker image name: lowercase alphanum, dots, hyphens, slashes, optional tag
DOCKER_IMAGE_RE = re.compile(
    r"^[a-z0-9][a-z0-9._/-]{0,127}"   # repository name
    r"(:[a-zA-Z0-9._-]{1,128})?$"      # optional tag
)

DOCKER_ENV_KEY_RE = re.compile(r"^[A-Z_][A-Z0-9_]{0,127}$")

ALLOWED_RUNTIMES: frozenset[str] = frozenset({"runc", "nvidia"})

VOLUME_ALLOWED_BASES: tuple[str, ...] = (
    os.path.expanduser("~"),
    "/tmp",
)


def validate_docker_image(value: str) -> str:
    """Validate a Docker image name (SEC-D7)."""
    ...

def validate_volume_mounts(volumes: dict[str, str]) -> dict[str, str]:
    """Validate volume mount paths are under allowed bases (SEC-D8)."""
    ...

def validate_docker_env(env: dict[str, str]) -> dict[str, str]:
    """Validate environment variable keys and values (SEC-D10)."""
    ...

def validate_container_runtime(runtime: str) -> str:
    """Validate container runtime against allowlist (SEC-D9)."""
    ...

5. Container Lifecycle¶

                    ┌─────────────┐
                    │   Service   │
                    │  (DB row)   │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  start()    │
                    │             │
                    │ 1. Validate │
                    │ 2. Run      │
                    │ 3. Wait     │
                    └──────┬──────┘
                           │
              ┌────────────▼────────────┐
              │    Container Running    │
              │                        │
              │  health_check() polls  │
              │  status() inspects     │
              └────────────┬───────────┘
                           │
                    ┌──────▼──────┐
                    │   stop()    │
                    │             │
                    │ 1. Stop     │
                    │ 2. Remove   │
                    └─────────────┘

Key behaviors: - start() is idempotent: if container already exists and is running, no-op. - stop() is idempotent: if container does not exist, no-op. - Containers are removed on stop (not just stopped) to avoid stale state. - No restart policy is set by gpumod; container lifecycle is fully managed.

6. Test Strategy¶

6.1 Unit Tests (`tests/unit/test_docker_driver.py`)¶

Mock docker.DockerClient for all tests. Follow existing patterns from test_vllm_driver.py:

Test	Description	Security
`test_start_runs_container`	Verify `client.containers.run()` called with correct args
`test_start_validates_image`	Reject `"image; rm -rf /"`	SEC-D7
`test_start_validates_volumes`	Reject `"../../etc/shadow:/data"`	SEC-D8
`test_start_rejects_privileged`	Ensure `privileged=False` always set	SEC-D9
`test_start_sanitizes_container_name`	Container name uses sanitized service.id	SEC-D10
`test_start_validates_env_keys`	Reject env keys with special chars	SEC-D10
`test_stop_removes_container`	Verify `container.stop()` + `container.remove()`
`test_stop_handles_not_found`	No error when container already gone
`test_status_maps_running`	Docker "running" -> ServiceState.RUNNING
`test_status_maps_exited`	Docker "exited" -> ServiceState.STOPPED
`test_status_maps_not_found`	NotFound -> ServiceState.STOPPED
`test_status_maps_error`	Exception -> ServiceState.UNKNOWN
`test_health_check_success`	HTTP 200 -> True
`test_health_check_failure`	HTTP 500 / ConnectError -> False
`test_supports_sleep_false`	`supports_sleep` returns `False`
`test_rejects_host_network`	`network_mode="host"` blocked	SEC-D9
`test_rejects_pid_host`	`pid_mode="host"` blocked	SEC-D9
`test_validates_runtime_allowlist`	Only `"runc"` and `"nvidia"` accepted	SEC-D9
`test_validates_port_range`	Ports must be 1-65535
`test_image_no_auto_pull`	Images must be pre-pulled

6.2 Integration Tests¶

End-to-end container lifecycle with mocked Docker daemon.
Security integration: injection attempts rejected at MCP boundary.

6.3 Mocking Pattern¶

from unittest.mock import MagicMock, patch

@pytest.fixture
def mock_docker_client() -> MagicMock:
    client = MagicMock(spec=docker.DockerClient)
    client.containers = MagicMock()
    return client

@pytest.fixture
def driver(mock_docker_client: MagicMock) -> DockerDriver:
    return DockerDriver(client=mock_docker_client)

7. ServiceRegistry Integration¶

DriverType.DOCKER already exists in src/gpumod/models.py. The registry needs one change in src/gpumod/services/registry.py:

from gpumod.services.drivers.docker import DockerDriver

# In ServiceRegistry.__init__():
self._drivers: dict[DriverType, ServiceDriver] = {
    DriverType.VLLM: VLLMDriver(),
    DriverType.LLAMACPP: LlamaCppDriver(),
    DriverType.FASTAPI: FastAPIDriver(),
    DriverType.DOCKER: DockerDriver(),  # NEW
}

This follows the Open/Closed Principle: no changes to existing drivers.

8. Dependency Addition¶

Add to pyproject.toml:

dependencies = [
    # ... existing deps ...
    "docker>=7.0,<8.0",
]

The docker package is a well-maintained library (PyPI, Docker Inc.) with no transitive security concerns beyond requests and urllib3 which are already common in Python ecosystems.

9. Out of Scope (for P7-T2 implementation)¶

Docker Compose support (multi-container stacks).
Image build from Dockerfile.
Container log streaming to gpumod CLI (future enhancement).
Registry authentication (image pull from private registries).
Docker Swarm or Kubernetes integration.
podman specific API differences (works via compatible socket).