After an update, container recreate, or Docker daemon restart, nomad_ollama's
HostConfig.DeviceRequests still lists the nvidia driver, but the NVIDIA
Container Toolkit binding inside the container is broken. `nvidia-smi` returns
"Failed to initialize NVML: Unknown Error" and Ollama silently falls back to
CPU inference. PR #208 detects this and shows a banner with a "Fix: Reinstall
AI Assistant" button. This change applies the same fix automatically on admin boot.
The new provider, GpuPassthroughRemediationProvider, runs once on web env boot:
1. Skip when KV `ai.autoFixGpuPassthrough = false` (default true).
2. Skip when Docker has no `nvidia` runtime registered (AMD-only and CPU-only
hosts unaffected).
3. Skip when nomad_ollama isn't running.
4. Exec `nvidia-smi --query-gpu=name --format=csv,noheader` inside the
container with an 8-second timeout. If the output matches
"Failed to initialize NVML", "Unknown Error", "TIMEOUT", or contains no
alphabetic characters, treat the passthrough as broken.
5. On broken: call DockerService.forceReinstall('nomad_ollama'). The existing
force-reinstall preserves the Ollama volume + installed models. Stamp
`gpu.autoRemediatedAt` on success.
6. On healthy: log and exit.
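
The decision flow in steps 1-6 can be sketched as two pure functions. This is
a minimal illustration, not the provider's actual code: the names
`isPassthroughBroken`, `decide`, and `BootState` are hypothetical, and the
marker matching is assumed to be a plain substring check.

```typescript
// Hypothetical sketch of the checks described above; names are assumptions.
type Decision = 'skip' | 'healthy' | 'reinstall';

interface BootState {
  autoFixEnabled: boolean;   // KV ai.autoFixGpuPassthrough (default true)
  hasNvidiaRuntime: boolean; // Docker reports an `nvidia` runtime
  containerRunning: boolean; // nomad_ollama is running
  nvidiaSmiOutput: string;   // exec output, or 'TIMEOUT' after 8 seconds
}

const FAILURE_MARKERS = ['Failed to initialize NVML', 'Unknown Error', 'TIMEOUT'];

// Step 4: broken if a failure marker appears or no GPU name was printed.
function isPassthroughBroken(output: string): boolean {
  const text = output.trim();
  if (FAILURE_MARKERS.some((marker) => text.includes(marker))) return true;
  return !/[A-Za-z]/.test(text);
}

// Steps 1-3 skip early; steps 5-6 pick between reinstall and no-op.
function decide(state: BootState): Decision {
  if (!state.autoFixEnabled || !state.hasNvidiaRuntime || !state.containerRunning) {
    return 'skip';
  }
  return isPassthroughBroken(state.nvidiaSmiOutput) ? 'reinstall' : 'healthy';
}
```

Keeping the classification free of Docker calls makes the broken-state branch
testable without a GPU host.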
AMD passthrough_failed is intentionally not handled — its fix path is HSA
override handling (PR #804) rather than a simple service recreate, and false
positives during AMD startup log parsing would loop a recreate without fixing
anything. Left to a follow-up if it proves to be a recurring AMD issue.
Validated on NOMAD3 (RTX 5060, v1.32.0-rc.3 + this patch hot-applied):
- After admin restart with passthrough healthy: log line
"[GpuPassthroughRemediationProvider] NVIDIA passthrough healthy — no action
needed." Provider exits cleanly without touching the container.
- The broken-state branch hits the existing forceReinstall path, which was
manually invoked earlier in the same session to fix this exact box and
recovered GPU access in ~45s with model volume intact. No new failure mode
is introduced — the auto-trigger removes the user click but the underlying
operation is the same one the banner Fix button already calls.
Closes #755.
Adds a check to RAG health that verifies nomad_qdrant is online. If it is
not, the user is blocked from clicking any buttons in the KB modal until
they click the start qdrant button and the container has started.
Adds a new file, qdrant_restart_policy_provider.ts, which tries to ensure
that the nomad_qdrant container always has a restart policy, even though
the policy should already have been set when the container was created.
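
A rough sketch of the restart-policy check, assuming the provider compares the
container's inspect data against a desired policy. The function name
`restartPolicyPatch` and the `unless-stopped` default are assumptions for
illustration; the actual policy used by qdrant_restart_policy_provider.ts is
not stated here. The returned object follows the `RestartPolicy` shape that
the Docker Engine container update endpoint accepts.

```typescript
// Hypothetical helper; the desired policy name is an assumption.
interface RestartPolicy {
  Name: string;
  MaximumRetryCount?: number;
}

// Subset of `docker inspect` output that matters for this check.
interface InspectLike {
  HostConfig?: { RestartPolicy?: RestartPolicy };
}

// Returns an update payload when the policy is missing or different,
// or null when the container already has the desired policy.
function restartPolicyPatch(
  inspect: InspectLike,
  desired: string = 'unless-stopped'
): { RestartPolicy: RestartPolicy } | null {
  const current = inspect.HostConfig?.RestartPolicy?.Name ?? '';
  if (current === desired) return null;
  return { RestartPolicy: { Name: desired } };
}
```

A non-null result would be passed to the container update call; a null result
means no Docker write is needed on that boot.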