project-nomad/admin/types
Chris Sherwood 2997637ce0 feat(GPU): auto-remediate nomad_ollama passthrough loss on admin boot (#755)
After an update, container recreate, or docker daemon restart, nomad_ollama's
HostConfig.DeviceRequests still lists the nvidia driver — but the NVIDIA
Container Toolkit binding inside the container is torn. `nvidia-smi` returns
"Failed to initialize NVML: Unknown Error" and Ollama silently falls back to
CPU inference. PR #208 detects this and shows a banner with a "Fix: Reinstall
AI Assistant" button. This change does that click automatically on admin boot.

New provider GpuPassthroughRemediationProvider runs once on web env boot:

1. Skip when KV `ai.autoFixGpuPassthrough = false` (default true).
2. Skip when Docker has no `nvidia` runtime registered (AMD-only and CPU-only
   hosts unaffected).
3. Skip when nomad_ollama isn't running.
4. Exec `nvidia-smi --query-gpu=name --format=csv,noheader` inside the
   container with an 8-second timeout. If the output matches
   "Failed to initialize NVML", "Unknown Error", "TIMEOUT", or contains no
   alphabetic characters, treat the passthrough as broken.
5. On broken: call DockerService.forceReinstall('nomad_ollama'). The existing
   force-reinstall preserves the Ollama volume + installed models. Stamp
   `gpu.autoRemediatedAt` on success.
6. On healthy: log and exit.

AMD passthrough_failed is intentionally not handled — its fix path is HSA
override handling (PR #804) rather than a simple service recreate, and false
positives during AMD startup log parsing would loop a recreate without fixing
anything. Left to a follow-up if it proves to be a recurring AMD issue.

Validated on NOMAD3 (RTX 5060, v1.32.0-rc.3 + this patch hot-applied):

- After admin restart with passthrough healthy: log line
  "[GpuPassthroughRemediationProvider] NVIDIA passthrough healthy — no action
  needed." Provider exits cleanly without touching the container.
- The broken-state branch hits the existing forceReinstall path, which was
  manually invoked earlier in the same session to fix this exact box and
  recovered GPU access in ~45s with model volume intact. No new failure mode
  is introduced — the auto-trigger removes the user click but the underlying
  operation is the same one the banner Fix button already calls.

Closes #755.
2026-05-20 10:16:00 -07:00
..
benchmark.ts refactor(Benchmarks): cleanup api calls 2026-02-01 05:23:11 +00:00
chat.ts fix(AI): improved perf via rewrite and streaming logic 2026-03-03 20:51:38 -08:00
collections.ts feat(content-updates): show size, surface downloads in Active Downloads 2026-05-20 10:16:00 -07:00
docker.ts feat: initial commit 2025-06-29 15:51:08 -07:00
downloads.ts feat(downloads): rich progress, friendly names, cancel, and live status (#554) 2026-04-03 14:26:50 -07:00
files.ts feat: curated content system overhaul 2026-02-11 15:44:46 -08:00
kv_store.ts feat(GPU): auto-remediate nomad_ollama passthrough loss on admin boot (#755) 2026-05-20 10:16:00 -07:00
maps.ts feat(Maps): regional map downloads via go-pmtiles extract (#780) 2026-05-20 10:16:00 -07:00
ollama.ts feat(AI): enable remote AI chat host 2026-04-03 14:26:50 -07:00
rag.ts feat(AI): enable remote AI chat host 2026-04-03 14:26:50 -07:00
services.ts feat: support for updating services 2026-03-11 14:08:09 -07:00
system.ts feat(AI): improved AMD GPU acceleration for Ollama via ROCm + HSA override (#804) 2026-05-20 10:16:00 -07:00
util.ts feat: initial commit 2025-06-29 15:51:08 -07:00
zim.ts fix(ZIM): accumulate across Kiwix pages to prevent empty Content Explorer (#746) 2026-04-21 14:26:28 -07:00