mirror of
https://github.com/Crosstalk-Solutions/project-nomad.git
synced 2026-05-12 16:10:11 +02:00
Closes #810.

## Bug A: HSA_OVERRIDE_GFX_VERSION=11.0.0 was unconditional

PR #804 set HSA_OVERRIDE_GFX_VERSION=11.0.0 for any AMD GPU. The inline comment claimed this was harmless on supported discrete cards (gfx1030 RX 6800, etc.), which is empirically false. With the override, Ollama crashes during GPU discovery on gfx1030 and falls back to CPU silently. Affects every NOMAD user with an RX 6800 or other RDNA 2 discrete card.

The correct value depends on the gfx version:

- gfx1030, gfx1100, gfx1101, gfx1102: officially supported by ROCm, no override
- gfx1031..gfx1036 (RDNA 2 variants + iGPUs like Rembrandt 680M): 10.3.0
- gfx1103, gfx1150, gfx1151 (Phoenix 780M, Strix 890M, Strix Halo): 11.0.0

### Resolution chain in `_resolveAmdHsaOverride()`

1. KV `ai.amdHsaOverride`: manual override; accepts 'none' to disable, or a semver-style value to force.
2. Marker file `/app/storage/.nomad-amd-gfx`: written by install_nomad.sh based on lspci codename. Mapped to override via `_mapGfxToHsaOverride()`.
3. Default `11.0.0`: preserves prior behavior so existing iGPU users (780M / 890M, the dominant AMD population today) don't regress on upgrade.

Discrete RDNA 2 users on existing installs can opt out via `ai.amdHsaOverride='none'` and force-reinstall AI Assistant, OR re-run install_nomad.sh to refresh the marker file.

The helper is used in both `createContainer` (initial install) and `updateContainer` (image update) paths, replacing the unconditional push.

## Bug B: BenchmarkService had no AMD discrete detection path

`BenchmarkService.getHardwareInfo()` had three GPU detection fallbacks:

1. `si.graphics()`: empty inside Docker for AMD
2. nvidia-smi: NVIDIA only
3. AMD APU regex from CPU model: integrated only

Result: AMD discrete cards (RX 6800, RX 7900 XTX, etc.) showed up as "GPU: Not detected" on the leaderboard despite ROCm working. Corrupts leaderboard data quality for that population.
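To make the fall-through concrete, the three fallbacks can be modeled as an ordered list of probes where the first non-empty result wins. This is only a sketch: `GpuProbe`, `detectGpu`, and the simulated probe results are hypothetical names for illustration, not the actual `BenchmarkService` code, and the probes are synchronous here for brevity.

```typescript
// Hypothetical model of an ordered GPU-detection fallback chain.
// Names and simulated results are illustrative, not NOMAD's real code.
type GpuProbe = { name: string; detect: () => string | null };

function detectGpu(probes: GpuProbe[]): string {
  for (const probe of probes) {
    const model = probe.detect();
    if (model !== null && model.trim() !== "") {
      return model.trim(); // first non-empty result wins
    }
  }
  return "Not detected"; // every probe came back empty
}

// Simulated probe results for a discrete AMD card inside a Docker container.
const discreteAmdInDocker: GpuProbe[] = [
  { name: "si.graphics()", detect: () => null }, // empty inside Docker for AMD
  { name: "nvidia-smi", detect: () => null },    // NVIDIA only
  { name: "AMD APU regex", detect: () => null }, // matches integrated GPUs only
];

console.log(detectGpu(discreteAmdInDocker)); // prints "Not detected"
```

With this shape, covering discrete AMD cards only requires appending one more probe to the end of the list; the earlier probes keep their priority.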
Fix: after the existing fallbacks, call `SystemService.getSystemInfo()` and read `graphics.controllers[0].model`. That path already handles AMD via the marker file + Ollama log probe added in PR #804, so we're reusing existing plumbing rather than duplicating detection logic.

## install_nomad.sh changes

The existing AMD detection block already runs lspci. Added a codename parse step that maps Navi 21/22/23/24, Rembrandt, Phoenix1/Phoenix2, Strix/Strix Point/Strix Halo, and Navi 31/32/33 to gfx versions, then writes `/opt/project-nomad/storage/.nomad-amd-gfx`. Unknown codenames write nothing (admin handles missing-marker case via the backward-compat default).

## Validation

Both bugs were originally surfaced and validated empirically on RX 6800 / gfx1030 / Ubuntu 24.04 + kernel 6.17 + ollama/ollama:rocm during the #810 filing. Validation grid from that report:

| Run | NOMAD Score | tok/s | GPU detected |
|---|---|---|---|
| Pre-fix (Bug A active) | n/a | 0 | yes, but library=cpu |
| HSA_OVERRIDE removed, Bug B unfixed | 73.8 | 221.6 | "Not detected" |
| Both fixes hot-patched (this PR's behavior) | 73.7 | 216.0 | AMD Radeon RX 6800 |

Local checks: `npm run typecheck` clean, `npm run build` clean.
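The gfx-to-override table and the three-step resolution chain described under Bug A can be sketched roughly as below. The function names echo `_mapGfxToHsaOverride()` and `_resolveAmdHsaOverride()`, but the signatures and bodies are assumptions: the real helpers read the KV store and the marker file themselves, whereas this sketch takes both values as plain arguments. Falling through to the next step on a malformed KV value or an unrecognized marker is also an assumption.

```typescript
// Sketch under stated assumptions; not the actual NOMAD implementation.
// null means "do not set HSA_OVERRIDE_GFX_VERSION at all".
type HsaOverride = string | null;

const NO_OVERRIDE = new Set(["gfx1030", "gfx1100", "gfx1101", "gfx1102"]);
const OVERRIDE_10_3_0 = new Set([
  "gfx1031", "gfx1032", "gfx1033", "gfx1034", "gfx1035", "gfx1036",
]);
const OVERRIDE_11_0_0 = new Set(["gfx1103", "gfx1150", "gfx1151"]);

// Returns undefined when the gfx id is not in any of the three groups.
function mapGfxToHsaOverride(gfx: string): HsaOverride | undefined {
  const id = gfx.trim().toLowerCase();
  if (NO_OVERRIDE.has(id)) return null;
  if (OVERRIDE_10_3_0.has(id)) return "10.3.0";
  if (OVERRIDE_11_0_0.has(id)) return "11.0.0";
  return undefined;
}

function resolveAmdHsaOverride(
  kvValue: string | null,        // value of ai.amdHsaOverride, if set
  markerContents: string | null, // contents of .nomad-amd-gfx, if present
): HsaOverride {
  // 1. Manual override: 'none' disables, a semver-style value forces.
  if (kvValue !== null) {
    const v = kvValue.trim();
    if (v === "none") return null;
    if (/^\d+\.\d+\.\d+$/.test(v)) return v;
    // Assumption: anything else is ignored and the chain continues.
  }
  // 2. Marker file written by install_nomad.sh.
  if (markerContents !== null) {
    const mapped = mapGfxToHsaOverride(markerContents);
    if (mapped !== undefined) return mapped;
  }
  // 3. Backward-compatible default for existing installs.
  return "11.0.0";
}

console.log(resolveAmdHsaOverride(null, "gfx1030"));   // null (no override)
console.log(resolveAmdHsaOverride(null, "gfx1035\n")); // "10.3.0"
console.log(resolveAmdHsaOverride(null, null));        // "11.0.0"
console.log(resolveAmdHsaOverride("none", "gfx1103")); // null (opted out)
```

A caller would then add `HSA_OVERRIDE_GFX_VERSION=<value>` to the container environment only when the result is non-null, which is what replaces the unconditional push.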
| | | |
|---|---|---|
| sidecar-disk-collector | ||
| sidecar-updater | ||
| collect_disk_info.sh | ||
| entrypoint.sh | ||
| install_nomad.sh | ||
| management_compose.yaml | ||
| migrate-disk-collector.md | ||
| migrate-disk-collector.sh | ||
| run_updater_fixes.sh | ||
| start_nomad.sh | ||
| stop_nomad.sh | ||
| uninstall_nomad.sh | ||
| update_nomad.sh | ||
| wikipedia_en_100_mini_2025-06.zim | ||
| wikipedia_en_100_mini_2026-01.zim | ||