project-nomad/admin
Chris Sherwood a2e2f7fc40 fix(AI): vendor-aware AMD HSA override + benchmark discrete-GPU detection
Closes #810.

## Bug A: HSA_OVERRIDE_GFX_VERSION=11.0.0 was unconditional

PR #804 set HSA_OVERRIDE_GFX_VERSION=11.0.0 for any AMD GPU. The inline comment
claimed this was harmless on supported discrete cards (gfx1030 RX 6800, etc.) — empirically false. With the override, Ollama crashes during GPU discovery on
gfx1030 and falls back to CPU silently. Affects every NOMAD user with an
RX 6800 or other RDNA 2 discrete card.

The correct value depends on the gfx version:
- gfx1030, gfx1100, gfx1101, gfx1102: officially supported by ROCm — no override
- gfx1031..gfx1036 (RDNA 2 variants + iGPUs like Rembrandt 680M): 10.3.0
- gfx1103, gfx1150, gfx1151 (Phoenix 780M, Strix 890M, Strix Halo): 11.0.0

### Resolution chain in `_resolveAmdHsaOverride()`

1. KV `ai.amdHsaOverride` — manual override; accepts 'none' to disable, or a
   semver-style value to force.
2. Marker file `/app/storage/.nomad-amd-gfx` — written by install_nomad.sh
   based on lspci codename. Mapped to override via `_mapGfxToHsaOverride()`.
3. Default: `11.0.0` — preserves prior behavior so existing iGPU users
   (780M / 890M, the dominant AMD population today) don't regress on upgrade.

Discrete RDNA 2 users on existing installs can opt out via
`ai.amdHsaOverride='none'` and force-reinstall AI Assistant, OR re-run
install_nomad.sh to refresh the marker file.

The helper is used in both `createContainer` (initial install) and
`updateContainer` (image update) paths, replacing the unconditional push.

## Bug B: BenchmarkService had no AMD discrete detection path

`BenchmarkService.getHardwareInfo()` had three GPU detection fallbacks:
1. `si.graphics()` — empty inside Docker for AMD
2. nvidia-smi — NVIDIA only
3. AMD APU regex from CPU model — integrated only

Result: AMD discrete cards (RX 6800, RX 7900 XTX, etc.) showed up as
"GPU: Not detected" on the leaderboard despite ROCm working. Corrupts
leaderboard data quality for that population.

Fix: after the existing fallbacks, call `SystemService.getSystemInfo()` and
read `graphics.controllers[0].model`. That path already handles AMD via the
marker file + Ollama log probe added in PR #804, so we're reusing existing
plumbing rather than duplicating detection logic.

## install_nomad.sh changes

The existing AMD detection block already runs lspci. Added a codename parse
step that maps Navi 21/22/23/24, Rembrandt, Phoenix1/Phoenix2, Strix/Strix
Point/Strix Halo, and Navi 31/32/33 to gfx versions, then writes
`/opt/project-nomad/storage/.nomad-amd-gfx`. Unknown codenames write nothing
(admin handles missing-marker case via the backward-compat default).

## Validation

Both bugs were originally surfaced and validated empirically on RX 6800 /
gfx1030 / Ubuntu 24.04 + kernel 6.17 + ollama/ollama:rocm during the #810
filing. Validation grid from that report:

| Run                                           | NOMAD Score | tok/s | GPU detected            |
|-----------------------------------------------|-------------|-------|-------------------------|
| Pre-fix (Bug A active)                        | n/a         | 0     | yes, but library=cpu    |
| HSA_OVERRIDE removed, Bug B unfixed           | 73.8        | 221.6 | "Not detected"          |
| Both fixes hot-patched (this PR's behavior)   | 73.7        | 216.0 | AMD Radeon RX 6800      |

Local checks: `npm run typecheck` clean, `npm run build` clean.
2026-05-20 10:16:00 -07:00
..
app fix(AI): vendor-aware AMD HSA override + benchmark discrete-GPU detection 2026-05-20 10:16:00 -07:00
bin feat: curated content system overhaul 2026-02-11 15:44:46 -08:00
commands feat(Maps): regional map downloads via go-pmtiles extract (#780) 2026-05-20 10:16:00 -07:00
config fix: cache docker list requests, aiAssistantName fetching, and ensure inertia used properly 2026-04-03 14:26:50 -07:00
constants feat(Maps): regional map downloads via go-pmtiles extract (#780) 2026-05-20 10:16:00 -07:00
database feat(Content): custom ZIM library sources with pre-seeded mirrors (#593) 2026-05-20 10:16:00 -07:00
docs docs: update release notes 2026-05-20 10:16:00 -07:00
inertia fix(Maps): render notes in marker popup when populated 2026-05-20 10:16:00 -07:00
providers fix(System): self-heal stale updateAvailable flag after sidecar-driven update (#825) 2026-05-20 10:16:00 -07:00
public feat: switch all PNG images to WEBP (#575) 2026-04-03 14:26:50 -07:00
resources feat(Maps): regional map downloads via go-pmtiles extract (#780) 2026-05-20 10:16:00 -07:00
start feat(Content): custom ZIM library sources with pre-seeded mirrors (#593) 2026-05-20 10:16:00 -07:00
tests fix(UI): improve global map banner display logic (#702) 2026-05-20 10:16:00 -07:00
types fix(AI): vendor-aware AMD HSA override + benchmark discrete-GPU detection 2026-05-20 10:16:00 -07:00
util feat: display model download progress 2026-02-06 16:22:23 -08:00
views feat: initial commit 2025-06-29 15:51:08 -07:00
.editorconfig feat: initial commit 2025-06-29 15:51:08 -07:00
.env.example feat: Add Windows Docker Desktop support for local development 2026-01-19 10:29:24 -08:00
ace.js feat: initial commit 2025-06-29 15:51:08 -07:00
adonisrc.ts fix(System): self-heal stale updateAvailable flag after sidecar-driven update (#825) 2026-05-20 10:16:00 -07:00
eslint.config.js feat: openwebui+ollama and zim management 2025-07-09 09:08:21 -07:00
package-lock.json build(deps): bump picomatch in /admin 2026-05-20 10:16:00 -07:00
package.json chore(deps): pin all deps to exact versions 2026-05-20 10:16:00 -07:00
tailwind.config.ts feat: initial commit 2025-06-29 15:51:08 -07:00
tsconfig.json feat: initial commit 2025-06-29 15:51:08 -07:00
vite.config.ts fix(Maps): ensure proper parsing of hostnames (#640) 2026-04-03 14:26:50 -07:00