* feat(AI): re-enable AMD GPU acceleration for Ollama via ROCm + HSA override
Re-enables AMD GPU support that was disabled in 77f1868 pending validation
of the ROCm image and device discovery. Validation done 2026-04-28 on a
Minisforum UM890 Pro (Ryzen 9 PRO 8945HS + Radeon 780M iGPU) — Ollama
correctly offloaded all model layers to the iGPU when the container was
started with /dev/kfd + /dev/dri passthrough and HSA_OVERRIDE_GFX_VERSION=11.0.0.
On llama3.2:1b, GPU inference ran at 51.83 tok/s vs 33.16 tok/s on CPU
(same hardware, same prompt) — a 1.56x speedup confirmed by Ollama logs
showing "load_tensors: offloaded 17/17 layers to GPU".
Changes
-------
docker_service.ts
- Restore _discoverAMDDevices(), simplified to pass /dev/dri as a single
  directory entry (mirroring `docker run --device /dev/dri` behavior)
  instead of the prior brittle hardcoded card0/renderD128 fallback, which
  broke on systems where the AMD GPU enumerates as card1+.
- Restore the AMD branch in _createContainer() (sketched after this list):
  - Switches the Ollama image to ollama/ollama:rocm
  - Mounts /dev/kfd + /dev/dri via Devices
  - Sets HSA_OVERRIDE_GFX_VERSION=11.0.0 (required for RDNA3 iGPUs like
    gfx1103 that ROCm doesn't officially support; harmless on supported
    discrete cards)
  - Adds a KV opt-out via ai.amdGpuAcceleration (default on)
- Mirror the AMD branch in updateContainer():
  - Lifts GPU detection above docker.pull() so AMD updates pull :rocm
    rather than the standard :targetVersion tag (per-version ROCm tags
    aren't always published)
  - Replaces the stale HSA_OVERRIDE value in the inspect-captured env on
    update, so containers built before this PR pick up the current value
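For reference, a minimal sketch of the restored branch. The Devices field
shape is the Docker Engine API's; gpuType, finalImage, env, and kv.get
approximate (rather than quote) the real code:

    // Sketch only; helper and variable names are stand-ins.
    const amdEnabled = (await kv.get('ai.amdGpuAcceleration')) !== false; // default on
    if (gpuType === 'amd' && amdEnabled) {
      finalImage = 'ollama/ollama:rocm';
      hostConfig.Devices = [
        { PathOnHost: '/dev/kfd', PathInContainer: '/dev/kfd', CgroupPermissions: 'rwm' },
        // A directory entry mirrors `docker run --device /dev/dri`, so hosts
        // where the GPU enumerates as card1/renderD129 need no special-casing.
        { PathOnHost: '/dev/dri', PathInContainer: '/dev/dri', CgroupPermissions: 'rwm' },
      ];
      // Makes ROCm treat RDNA3/3.5 iGPUs (gfx1103, gfx1150) as gfx1100.
      env.push('HSA_OVERRIDE_GFX_VERSION=11.0.0');
    }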
system_service.ts
- New getOllamaInferenceComputeFromLogs(): parses the Ollama startup log
  line 'msg="inference compute" ... library=CUDA|ROCm ...', which Ollama
  emits for both NVIDIA and AMD. Catches silent CPU fallback (e.g. NVML
  death after an update, or HSA_OVERRIDE failure) that the prior
  nvidia-smi exec probe couldn't detect. A sketch follows this list.
- gpuHealth refactored to use log parsing as the primary probe for both
vendors, with nvidia-smi exec retained as the NVIDIA-only secondary
path for hardware enrichment when log parsing has no startup line yet.
- AMD path uses gpu.type KV value (persisted by DockerService._detectGPUType)
+ ai.amdGpuAcceleration opt-out to determine hasRocmRuntime.
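A sketch of the primary probe; the method name is from this PR, while the
parsing details are illustrative:

    type InferenceCompute = { library: string; compute?: string } | null;

    function getOllamaInferenceComputeFromLogs(logs: string): InferenceCompute {
      // Ollama logs e.g.: msg="inference compute" ... library=ROCm compute=gfx1100
      const line = logs.split('\n').find((l) => l.includes('msg="inference compute"'));
      if (!line) return null; // startup line not emitted yet
      const library = /library=(\S+)/i.exec(line)?.[1];
      const compute = /compute=(\S+)/.exec(line)?.[1];
      return library ? { library, compute } : null;
    }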
types/system.ts
- GpuHealthStatus extended additively: hasRocmRuntime plus an optional
  gpuVendor (rough shape below).
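Roughly, with existing fields elided (types/system.ts has the full shape):

    interface GpuHealthStatus {
      // ...existing fields unchanged...
      hasNvidiaRuntime: boolean;
      hasRocmRuntime: boolean;      // new
      gpuVendor?: 'nvidia' | 'amd'; // new; optional keeps old payloads valid
    }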
types/kv_store.ts
- New ai.amdGpuAcceleration boolean (default-on).
settings/models.tsx, settings/system.tsx
- passthrough_failed banner copy now reads the vendor from
  gpuHealth.gpuVendor ("an AMD GPU" vs "an NVIDIA GPU"), as sketched
  below. The Fix button still hits the same force-reinstall endpoint,
  which now configures AMD correctly.
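Illustratively (the gpuVendor read is from this PR; the banner string
itself is paraphrased, not the real copy):

    const vendorNoun =
      gpuHealth.gpuVendor === 'amd' ? 'an AMD GPU' : 'an NVIDIA GPU';
    const banner = `This system has ${vendorNoun}, but the AI Assistant is not using it.`;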
install_nomad.sh
- AMD detection in verify_gpu_setup() now reports "ROCm acceleration will
  be configured automatically" instead of the misleading "ROCm not
  currently available" message. Also tightens the lspci match to display
  controller classes, avoiding false positives from AMD CPU host bridges
  and matching the fix already in DockerService._detectGPUType (sketched
  below).
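A sketch of the class-tightened match as it looks on the TypeScript side
(the shell side greps equivalently; the exact regexes are illustrative):

    function parseGpuTypeFromLspci(output: string): 'nvidia' | 'amd' | 'none' {
      // Only consider display-class devices; AMD host bridges and root
      // complexes also print "Advanced Micro Devices" but are not GPUs.
      const displayLines = output.split('\n').filter((l) =>
        /VGA compatible controller|Display controller|3D controller/.test(l));
      if (displayLines.some((l) => /NVIDIA/i.test(l))) return 'nvidia';
      if (displayLines.some((l) => /\bAMD\b|\bATI\b/.test(l))) return 'amd';
      return 'none';
    }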
Auto-remediation
----------------
Issue #755 proposes auto-remediation when gpuHealth.status flips to
passthrough_failed (today the user has to click "Fix: Reinstall AI
Assistant"). When that PR lands, AMD coverage nearly falls out for free,
since this PR reports the same passthrough_failed status through the
shared gpuHealth machinery; the one change #755 will need is widening its
guard from hasNvidiaRuntime === true to (hasNvidiaRuntime || hasRocmRuntime).
Closes #124 (AMD GPU support).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(AI): detect AMD GPU presence inside admin container via marker file
The admin container doesn't have lspci installed, and AMD GPUs don't register
a Docker runtime the way NVIDIA does — so DockerService._detectGPUType() and
SystemService.gpuHealth had no way to know an AMD GPU was present.
The previous implementation fell through to lspci, which silently failed inside
the admin container, leaving gpu.type unset and gpuHealth stuck at 'no_gpu'
even on systems with an AMD GPU. (NVIDIA worked because Docker registers the
nvidia runtime, which is reachable via dockerInfo.Runtimes from any container.)
Discovered while testing the AMD acceleration patch on a Minisforum UM890 Pro:
the AMD branch in _createContainer() never fired because _detectGPUType()
returned 'none' even on a host with a working /dev/kfd.
Fix
---
install_nomad.sh writes the host-detected GPU type ('nvidia' | 'amd') to a
marker file in the storage volume the admin container already bind-mounts:
/opt/project-nomad/storage/.nomad-gpu-type → /app/storage/.nomad-gpu-type
DockerService._detectGPUType() reads the marker as a secondary probe
(after the Docker runtime check), which covers AMD detection from inside
the container without requiring lspci or a /dev bind mount. A sketch of
the probe order follows.
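A sketch of the resulting probe order (the marker path is from this
commit; the fs wiring and dockerInfo shape are assumptions):

    import { readFile } from 'node:fs/promises';

    const GPU_MARKER_PATH = '/app/storage/.nomad-gpu-type';

    async function detectGPUType(
      dockerInfo: { Runtimes?: Record<string, unknown> },
    ): Promise<'nvidia' | 'amd' | 'none'> {
      // 1) NVIDIA registers a Docker runtime, visible from any container.
      if (dockerInfo.Runtimes && 'nvidia' in dockerInfo.Runtimes) return 'nvidia';
      // 2) AMD registers no runtime; fall back to the install-time marker.
      try {
        const marker = (await readFile(GPU_MARKER_PATH, 'utf8')).trim();
        if (marker === 'nvidia' || marker === 'amd') return marker;
      } catch {
        // No marker: install predates it, or no GPU was detected.
      }
      return 'none';
    }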
SystemService falls back to the marker file when KV gpu.type is empty so the
System page reflects AMD presence even before the user installs AI Assistant
for the first time. (Without this, the page would say 'no_gpu' until Ollama
was installed, even on hosts with an AMD GPU detected at install time.)
Verified on NOMAD6 (UM890 Pro, Ubuntu 24.04, 780M iGPU): with the marker file
in place and admin restarted, the patch's AMD branch fires correctly on Force
Reinstall AI Assistant. Resulting nomad_ollama runs ollama/ollama:rocm with
/dev/kfd + /dev/dri passthrough and HSA_OVERRIDE_GFX_VERSION=11.0.0; Ollama
logs show 'library=ROCm compute=gfx1100 ... type=iGPU'. NOMAD's in-product
benchmark on the same hardware climbed from 33.8 tok/s (CPU) to 57.3 tok/s
(GPU) — a 1.69x speedup, with TTFT dropping from 148ms to 66ms.
Migration for existing AMD installs
-----------------------------------
Users on an existing NOMAD install with an AMD GPU have no marker file,
since install_nomad.sh only writes it at install time. Two paths get
them onto the GPU:
1. Re-run install_nomad.sh — writes the marker, no other side effects
2. Manually: echo amd | sudo tee /opt/project-nomad/storage/.nomad-gpu-type
Either path then triggers AMD detection on the next AI Assistant
install/reinstall.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(AI): pull ollama/ollama:rocm separately when AMD branch overrides image
The pull-if-missing logic in _createContainer ran against service.container_image
(the DB-pinned tag, e.g. ollama/ollama:0.18.2). The AMD branch then overrode
finalImage to ollama/ollama:rocm — but if that image wasn't already local, the
container creation step failed with "no such image: ollama/ollama:rocm".
Caught while validating on NOMAD2 (Ryzen AI 9 HX 370 + Radeon 890M / RDNA 3.5):
the prior end-to-end test on NOMAD6 had silently passed because the rocm image
was already pulled there from an earlier sidecar test, masking the bug.
Fix: inside the AMD branch, after setting finalImage to ollama/ollama:rocm,
run the same _checkImageExists + docker.pull pull-if-missing step against
the new tag (sketched below).
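A sketch; _checkImageExists and docker.pull name the existing helpers
this commit reuses (with dockerode, the pull stream would also need to
be awaited to completion):

    if (gpuType === 'amd' && amdEnabled) {
      finalImage = 'ollama/ollama:rocm';
      // The earlier pull-if-missing check ran against service.container_image,
      // so the overridden tag needs its own existence check + pull.
      if (!(await this._checkImageExists(finalImage))) {
        await docker.pull(finalImage);
      }
    }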
Also confirmed via this validation: the same HSA_OVERRIDE_GFX_VERSION=11.0.0
override works on the 890M (gfx1150 / RDNA 3.5) — Ollama logs report
'library=ROCm compute=gfx1100 description="AMD Radeon 890M Graphics"' and
inference runs at 51.68 tok/s (matching the existing X1 Pro published tile
of 51.7 tok/s on the same hardware class). RDNA 3 (780M, gfx1103) and RDNA
3.5 (890M, gfx1150) both use the same override successfully.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* build(Dockerfile): include pciutils for the lspci GPU detection fallback
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Jake Turner <jturner@cosmistack.com>