project-nomad

mirror of https://github.com/Crosstalk-Solutions/project-nomad.git synced 2026-05-26 14:25:07 +02:00

Author	SHA1	Message	Date
Chris Sherwood	2997637ce0	feat(GPU): auto-remediate nomad_ollama passthrough loss on admin boot (#755 ) After an update, container recreate, or docker daemon restart, nomad_ollama's HostConfig.DeviceRequests still lists the nvidia driver — but the NVIDIA Container Toolkit binding inside the container is torn. `nvidia-smi` returns "Failed to initialize NVML: Unknown Error" and Ollama silently falls back to CPU inference. PR #208 detects this and shows a banner with a "Fix: Reinstall AI Assistant" button. This change does that click automatically on admin boot. New provider GpuPassthroughRemediationProvider runs once on web env boot: 1. Skip when KV `ai.autoFixGpuPassthrough = false` (default true). 2. Skip when Docker has no `nvidia` runtime registered (AMD-only and CPU-only hosts unaffected). 3. Skip when nomad_ollama isn't running. 4. Exec `nvidia-smi --query-gpu=name --format=csv,noheader` inside the container with an 8-second timeout. If the output matches "Failed to initialize NVML", "Unknown Error", "TIMEOUT", or contains no alphabetic characters, treat the passthrough as broken. 5. On broken: call DockerService.forceReinstall('nomad_ollama'). The existing force-reinstall preserves the Ollama volume + installed models. Stamp `gpu.autoRemediatedAt` on success. 6. On healthy: log and exit. AMD passthrough_failed is intentionally not handled — its fix path is HSA override handling (PR #804) rather than a simple service recreate, and false positives during AMD startup log parsing would loop a recreate without fixing anything. Left to a follow-up if it proves to be a recurring AMD issue. Validated on NOMAD3 (RTX 5060, v1.32.0-rc.3 + this patch hot-applied): - After admin restart with passthrough healthy: log line "[GpuPassthroughRemediationProvider] NVIDIA passthrough healthy — no action needed." Provider exits cleanly without touching the container. - The broken-state branch hits the existing forceReinstall path, which was manually invoked earlier in the same session to fix this exact box and recovered GPU access in ~45s with model volume intact. No new failure mode is introduced — the auto-trigger removes the user click but the underlying operation is the same one the banner Fix button already calls. Closes #755.	2026-05-20 10:16:00 -07:00
Jake Turner	318276c784	fix(System): self-heal stale updateAvailable flag after sidecar-driven update (#825 )	2026-05-20 10:16:00 -07:00
Henry Estela	2d8a02f257	fix(RAG): add start button in kb modal and ensure restart policy exists (#700 ) Adds a check to RAG health to make sure nomad_qdrant is online, if not then the user will be blocked from clicking any buttons in the KB modal until they click the start qdrant button and let the container start There is a new file qdrant_restart_policy_provider.ts, which tries to ensure that the restart policy always exists for the nomad_qdrant container even though the policy should have been there when the container is created.	2026-05-20 10:16:00 -07:00
Jake Turner	9e3828bcba	feat(Kiwix): migrate to Kiwix library mode for improved stability (#622 )	2026-04-03 14:26:50 -07:00
Jake Turner	f49b9abb81	fix(Maps): static path resolution	2026-01-23 14:17:25 -08:00
Jake Turner	7569aa935d	feat: background job overhaul with bullmq	2025-12-06 23:59:01 -08:00

6 Commits