After an update, container recreate, or docker daemon restart, nomad_ollama's
HostConfig.DeviceRequests still lists the nvidia driver — but the NVIDIA
Container Toolkit binding inside the container is torn. `nvidia-smi` returns
"Failed to initialize NVML: Unknown Error" and Ollama silently falls back to
CPU inference. PR #208 detects this and shows a banner with a "Fix: Reinstall
AI Assistant" button. This change performs that click automatically on admin
boot.

New provider GpuPassthroughRemediationProvider runs once on web env boot:
1. Skip when KV `ai.autoFixGpuPassthrough = false` (default true).
2. Skip when Docker has no `nvidia` runtime registered (AMD-only and CPU-only
hosts unaffected).
3. Skip when nomad_ollama isn't running.
4. Exec `nvidia-smi --query-gpu=name --format=csv,noheader` inside the
container with an 8-second timeout. If the output matches
"Failed to initialize NVML", "Unknown Error", "TIMEOUT", or contains no
alphabetic characters, treat the passthrough as broken.
5. On broken: call DockerService.forceReinstall('nomad_ollama'). The existing
force-reinstall preserves the Ollama volume + installed models. Stamp
`gpu.autoRemediatedAt` on success.
6. On healthy: log and exit.
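
The decision flow in steps 1-6 can be sketched as a pure function. This is an
illustrative sketch, not the provider's actual code: the `ProbeState` shape,
`decide`, and `isPassthroughBroken` are hypothetical names, and "matches" is
assumed to mean a plain substring check.

```typescript
// Hypothetical input shape; the real provider would gather these values
// from the KV store, the Docker daemon, and the exec result.
interface ProbeState {
  autoFixEnabled: boolean;          // KV ai.autoFixGpuPassthrough (default true)
  nvidiaRuntimeRegistered: boolean; // Docker reports an `nvidia` runtime
  containerRunning: boolean;        // nomad_ollama is running
  nvidiaSmiOutput: string;          // exec output, or "TIMEOUT" after 8 seconds
}

type Decision = "skip" | "healthy" | "reinstall";

// Markers listed in step 4; substring matching is an assumption.
const FAILURE_MARKERS = ["Failed to initialize NVML", "Unknown Error", "TIMEOUT"];

function isPassthroughBroken(output: string): boolean {
  if (FAILURE_MARKERS.some((marker) => output.includes(marker))) return true;
  // A healthy query prints a GPU name, so output without letters is broken.
  return !/[A-Za-z]/.test(output);
}

function decide(state: ProbeState): Decision {
  if (!state.autoFixEnabled) return "skip";          // step 1
  if (!state.nvidiaRuntimeRegistered) return "skip"; // step 2
  if (!state.containerRunning) return "skip";        // step 3
  // Steps 4-6: "reinstall" maps to forceReinstall, "healthy" to log and exit.
  return isPassthroughBroken(state.nvidiaSmiOutput) ? "reinstall" : "healthy";
}
```

Keeping the classification free of Docker calls makes the broken-state branch
easy to exercise without a torn container on hand.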
AMD passthrough_failed is intentionally not handled — its fix path is HSA
override handling (PR #804) rather than a simple service recreate, and false
positives during AMD startup log parsing would loop a recreate without fixing
anything. Left to a follow-up if it proves to be a recurring AMD issue.
Validated on NOMAD3 (RTX 5060, v1.32.0-rc.3 + this patch hot-applied):
- After admin restart with passthrough healthy: log line
"[GpuPassthroughRemediationProvider] NVIDIA passthrough healthy — no action
needed." Provider exits cleanly without touching the container.
- The broken-state branch hits the existing forceReinstall path, which was
manually invoked earlier in the same session to fix this exact box and
recovered GPU access in ~45s with model volume intact. No new failure mode
is introduced — the auto-trigger removes the user click but the underlying
operation is the same one the banner Fix button already calls.
Closes #755.
* feat(maps): add regional map downloads via go-pmtiles extract
* address Copilot review feedback on PR #780
- auto-refresh preflight on selection/maxzoom change with 400ms debounce and
requestId stale-safety so the confirm button no longer requires a two-step
"Estimate Size" -> "Start Download" dance
- safeUpdateProgress helper replaces fire-and-forget updateProgress().catch()
pattern so cancelled-job errors (code -1) can't surface as unhandled rejections
- gate world basemap source on worldBasemapReady: when ensureWorldBasemap()
fails we already delete world.pmtiles, so emitting the source was producing
404s on every tile request
- verify go-pmtiles binary SHA256 at image build time; upstream doesn't ship a
checksums file, so per-arch hashes are pinned as build args, with a note to
regenerate them when bumping PMTILES_VERSION
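
The requestId stale-safety from the first bullet can be illustrated with a
small guard. This is a sketch of the general pattern only; `LatestRequestGuard`
and its methods are hypothetical names, and the 400ms debounce is left out.

```typescript
// Hands out increasing ids and remembers only the newest one, so a slow
// preflight response can be dropped if a newer request has started since.
class LatestRequestGuard {
  private latest = 0;

  // Call when a preflight request is about to be sent.
  begin(): number {
    this.latest += 1;
    return this.latest;
  }

  // Call when a response arrives; false means the response is stale.
  isCurrent(id: number): boolean {
    return id === this.latest;
  }
}

// Usage: the first estimate is superseded before its response lands.
const guard = new LatestRequestGuard();
const first = guard.begin();  // user selects a region
const second = guard.begin(); // user changes maxzoom before the reply
const applyFirst = guard.isCurrent(first);   // false: ignore this reply
const applySecond = guard.isCurrent(second); // true: update the size estimate
```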
Adds a check to RAG health that verifies nomad_qdrant is online. If it is
not, the user is blocked from clicking any buttons in the KB modal until they
click the start qdrant button and the container has started.

Adds qdrant_restart_policy_provider.ts, which ensures the nomad_qdrant
container always has a restart policy, even though the policy should already
have been set when the container was created.
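
A check for a missing restart policy could look roughly like the sketch below.
It assumes the provider reads the container's inspect data, where Docker
reports `HostConfig.RestartPolicy.Name` as an empty string or "no" when no
policy is set. `InspectLike` and `needsRestartPolicy` are hypothetical names,
not the contents of qdrant_restart_policy_provider.ts.

```typescript
// Minimal slice of a Docker container inspect result (assumed shape).
interface InspectLike {
  HostConfig?: { RestartPolicy?: { Name?: string } };
}

// True when the container has no effective restart policy and the provider
// would need to apply one.
function needsRestartPolicy(info: InspectLike): boolean {
  const name = info.HostConfig?.RestartPolicy?.Name ?? "";
  return name === "" || name === "no";
}
```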