Commit Graph

72 Commits

Author SHA1 Message Date
Jake Turner
a0047c1555 fix(KB): surface file-warning compute failures instead of masking as healthy (PR #895 review)
`computeFileWarnings()` previously caught all errors and returned an empty
map, which the frontend rendered as "every file is healthy" — reintroducing
exactly the silent-failure mode this surface exists to expose.

Return `{ ok, warnings }`; flip `ok: false` from the catch. KB modal renders
an inline amber notice under the Stored Files header when `ok === false`,
leaving per-row warning rendering untouched. Transient failures self-heal on
the next 30s poll; no toast spam.
2026-05-20 10:16:00 -07:00
Jake Turner
102998ec96 refactor(KB): move FileWarning to shared types/rag following existing convention 2026-05-20 10:16:00 -07:00
Chris Sherwood
43ca584b6c feat(KB): status pill + last-activity timestamp on Processing Queue (RFC #883 §5/§10)
Each in-flight (or stuck) embedding job gets a colored health pill,
relative-activity timestamp, and chunk counter so users can tell at a
glance whether ingestion is making progress.

## Health states

- **🟢 Active** — last batch < 2 min ago
- **🟡 Slow** — last batch 2-5 min ago (CPU-paced multi-batch ingestion
  lives here naturally; not always a problem)
- **🔴 Stalled** — last batch > 5 min ago (likely real problem)
- ** Waiting** — queued, no batch started yet
- **🔴 Failed** — job recorded failed status

## What lands

- New backend util `kb_job_health.ts` with pure `computeJobHealth(input)`
  decision function. Time-based thresholds (2 min / 5 min) inlined as
  constants. 9 unit tests pin the boundaries.
- `EmbedJobWithProgress` gains `lastBatchAt`, `startedAt`, `chunks` —
  already set by `EmbedFileJob.handle` on every batch transition, just
  not previously surfaced through `listActiveJobs`.
- Frontend `kb_job_health_display.ts` maps each status to a Tailwind
  dot color, label, and aria-label so backend and UI stay in sync.
- `ActiveEmbedJobs.tsx` renders the pill, "last activity Xs ago", and
  chunk counter above each progress bar. Adds a manual Refresh button
  and "Last updated Xs ago" line — the existing 2s/30s auto-poll
  cadence in `useEmbedJobs` is left intact.
- Live tick at 5s keeps the relative timestamps current without
  re-fetching from the API.

## Not in scope

- Per-card Cancel / Retry / Un-index — separate Phase 2 PR
- Conditional warnings A/B/C — separate Phase 2 PR
- Computing throughput rate (chunks/min) — needs ratio registry consumer
  (Phase 2 follow-up); for now the pill answers the "is it stuck?"
  question directly without a rate estimate.
2026-05-20 10:16:00 -07:00
chriscrosstalk
743549ca74 feat(KB): per-file ingest state machine (Phase 1 of RFC #883) (#888)
Adds a persistent state machine for AI knowledge-base ingestion so the
scanner can distinguish "fully indexed", "user opted out", "failed", and
"stalled" from each other — none of which were derivable from the prior
binary "any chunks in Qdrant ⇒ embedded" check.

## What lands

- New table `kb_ingest_state` keyed by `file_path` with enum state column
  (`pending_decision | indexed | browse_only | failed | stalled`).
  Independent of `installed_resources` so it covers both curated downloads
  and manually-uploaded KB files.
- New KV key `rag.defaultIngestPolicy` (string: `Always | Manual`).
  Registered now but not consumed yet — JIT prompt + wizard step land in
  Phase 3 of the RFC.
- `EmbedFileJob.handle` writes state on terminal outcomes:
  - Success (final batch) → `indexed` + chunks count
  - `UnrecoverableError` → `failed` + error message
  - Retryable errors are left to BullMQ's existing retry path
- `scanAndSyncStorage` swaps the binary qdrant check for a state-aware
  decision tree (see `decideScanAction`). Existing installs auto-backfill
  on first scan: files with chunks in Qdrant but no state row become
  `indexed`; new files start as `pending_decision`.
- `deleteFileBySource` drops the state row last, so removed files
  disappear entirely instead of leaving an orphan that the next scan
  would re-dispatch into nothing.

## What does NOT land here

- Ratio registry (separate PR) — needed for partial-stall detection and
  cost estimates, but a separable concern.
- #880 follow-up initial-progress anchor (separate tiny PR).
- Phase 2 UI (status pill, per-card actions, conditional warnings).
- Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal).
- PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All
  / Reset & Rebuild would also want to set state, but #886 isn't merged
  yet; that wiring goes in a follow-up once #886 lands.

## Target

This is forward work for v1.40.0 (RFC #883). Branching off `rc` because
that's the current latest base and post-GA Jake will sync rc→dev; a
retarget at PR-open time is a fast-forward if requested.

## Tests

- 9 new unit tests for `decideScanAction` covering all five states plus
  the no-row / chunks-present / chunks-missing combinations
- Type-check clean
- Smoke-tested end-to-end on NOMAD3 via hot-patch:
  - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all
    came back `indexed` on first scan
  - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`)
    came back `pending_decision` and was correctly re-dispatched (Bull
    deduped to its historical `:completed` jobId — bgauger's #886 fix
    drains that)
  - Delete hook: deleting a KB upload via `DELETE /api/rag/files`
    removed both the disk file and the state row

Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>
2026-05-20 10:16:00 -07:00
Chris Sherwood
2997637ce0 feat(GPU): auto-remediate nomad_ollama passthrough loss on admin boot (#755)
After an update, container recreate, or docker daemon restart, nomad_ollama's
HostConfig.DeviceRequests still lists the nvidia driver — but the NVIDIA
Container Toolkit binding inside the container is torn. `nvidia-smi` returns
"Failed to initialize NVML: Unknown Error" and Ollama silently falls back to
CPU inference. PR #208 detects this and shows a banner with a "Fix: Reinstall
AI Assistant" button. This change does that click automatically on admin boot.

New provider GpuPassthroughRemediationProvider runs once on web env boot:

1. Skip when KV `ai.autoFixGpuPassthrough = false` (default true).
2. Skip when Docker has no `nvidia` runtime registered (AMD-only and CPU-only
   hosts unaffected).
3. Skip when nomad_ollama isn't running.
4. Exec `nvidia-smi --query-gpu=name --format=csv,noheader` inside the
   container with an 8-second timeout. If the output matches
   "Failed to initialize NVML", "Unknown Error", "TIMEOUT", or contains no
   alphabetic characters, treat the passthrough as broken.
5. On broken: call DockerService.forceReinstall('nomad_ollama'). The existing
   force-reinstall preserves the Ollama volume + installed models. Stamp
   `gpu.autoRemediatedAt` on success.
6. On healthy: log and exit.

AMD passthrough_failed is intentionally not handled — its fix path is HSA
override handling (PR #804) rather than a simple service recreate, and false
positives during AMD startup log parsing would loop a recreate without fixing
anything. Left to a follow-up if it proves to be a recurring AMD issue.

Validated on NOMAD3 (RTX 5060, v1.32.0-rc.3 + this patch hot-applied):

- After admin restart with passthrough healthy: log line
  "[GpuPassthroughRemediationProvider] NVIDIA passthrough healthy — no action
  needed." Provider exits cleanly without touching the container.
- The broken-state branch hits the existing forceReinstall path, which was
  manually invoked earlier in the same session to fix this exact box and
  recovered GPU access in ~45s with model volume intact. No new failure mode
  is introduced — the auto-trigger removes the user click but the underlying
  operation is the same one the banner Fix button already calls.

Closes #755.
2026-05-20 10:16:00 -07:00
Chris Sherwood
a2e2f7fc40 fix(AI): vendor-aware AMD HSA override + benchmark discrete-GPU detection
Closes #810.

## Bug A: HSA_OVERRIDE_GFX_VERSION=11.0.0 was unconditional

PR #804 set HSA_OVERRIDE_GFX_VERSION=11.0.0 for any AMD GPU. The inline comment
claimed this was harmless on supported discrete cards (gfx1030 RX 6800, etc.) — empirically false. With the override, Ollama crashes during GPU discovery on
gfx1030 and falls back to CPU silently. Affects every NOMAD user with an
RX 6800 or other RDNA 2 discrete card.

The correct value depends on the gfx version:
- gfx1030, gfx1100, gfx1101, gfx1102: officially supported by ROCm — no override
- gfx1031..gfx1036 (RDNA 2 variants + iGPUs like Rembrandt 680M): 10.3.0
- gfx1103, gfx1150, gfx1151 (Phoenix 780M, Strix 890M, Strix Halo): 11.0.0

### Resolution chain in `_resolveAmdHsaOverride()`

1. KV `ai.amdHsaOverride` — manual override; accepts 'none' to disable, or a
   semver-style value to force.
2. Marker file `/app/storage/.nomad-amd-gfx` — written by install_nomad.sh
   based on lspci codename. Mapped to override via `_mapGfxToHsaOverride()`.
3. Default: `11.0.0` — preserves prior behavior so existing iGPU users
   (780M / 890M, the dominant AMD population today) don't regress on upgrade.

Discrete RDNA 2 users on existing installs can opt out via
`ai.amdHsaOverride='none'` and force-reinstall AI Assistant, OR re-run
install_nomad.sh to refresh the marker file.

The helper is used in both `createContainer` (initial install) and
`updateContainer` (image update) paths, replacing the unconditional push.

## Bug B: BenchmarkService had no AMD discrete detection path

`BenchmarkService.getHardwareInfo()` had three GPU detection fallbacks:
1. `si.graphics()` — empty inside Docker for AMD
2. nvidia-smi — NVIDIA only
3. AMD APU regex from CPU model — integrated only

Result: AMD discrete cards (RX 6800, RX 7900 XTX, etc.) showed up as
"GPU: Not detected" on the leaderboard despite ROCm working. Corrupts
leaderboard data quality for that population.

Fix: after the existing fallbacks, call `SystemService.getSystemInfo()` and
read `graphics.controllers[0].model`. That path already handles AMD via the
marker file + Ollama log probe added in PR #804, so we're reusing existing
plumbing rather than duplicating detection logic.

## install_nomad.sh changes

The existing AMD detection block already runs lspci. Added a codename parse
step that maps Navi 21/22/23/24, Rembrandt, Phoenix1/Phoenix2, Strix/Strix
Point/Strix Halo, and Navi 31/32/33 to gfx versions, then writes
`/opt/project-nomad/storage/.nomad-amd-gfx`. Unknown codenames write nothing
(admin handles missing-marker case via the backward-compat default).

## Validation

Both bugs were originally surfaced and validated empirically on RX 6800 /
gfx1030 / Ubuntu 24.04 + kernel 6.17 + ollama/ollama:rocm during the #810
filing. Validation grid from that report:

| Run                                           | NOMAD Score | tok/s | GPU detected            |
|-----------------------------------------------|-------------|-------|-------------------------|
| Pre-fix (Bug A active)                        | n/a         | 0     | yes, but library=cpu    |
| HSA_OVERRIDE removed, Bug B unfixed           | 73.8        | 221.6 | "Not detected"          |
| Both fixes hot-patched (this PR's behavior)   | 73.7        | 216.0 | AMD Radeon RX 6800      |

Local checks: `npm run typecheck` clean, `npm run build` clean.
2026-05-20 10:16:00 -07:00
0xGlitch
94059b0aaf feat(Maps): regional map downloads via go-pmtiles extract (#780)
* feat(maps): add regional map downloads via go-pmtiles extract

* address Copilot review feedback on PR #780

- auto-refresh preflight on selection/maxzoom change with 400ms debounce and
  requestId stale-safety so the confirm button no longer requires a two-step
  "Estimate Size" -> "Start Download" dance
- safeUpdateProgress helper replaces fire-and-forget updateProgress().catch()
  pattern so cancelled-job errors (code -1) can't surface as unhandled rejections
- gate world basemap source on worldBasemapReady - when ensureWorldBasemap()
  fails we already delete world.pmtiles, so emitting the source was producing
  404s on every tile request
- verify go-pmtiles binary SHA256 at image build time; upstream doesn't ship a
  checksums file so per-arch hashes are pinned as build args with a regenerate
  note when bumping PMTILES_VERSION
2026-05-20 10:16:00 -07:00
Chris Sherwood
299b767e63 feat(content-updates): show size, surface downloads in Active Downloads
Content Updates had three UX problems that compounded:

1. No size column, so users had to guess how big an update would be before
   clicking Update All. Upstream /api/v1/resources/check-updates doesn't
   return size, so CollectionUpdateService now enriches each update with
   a Content-Length HEAD request in parallel (5s timeout, non-fatal on
   failure — the row just renders an em-dash).

2. Small ZIM updates (1-8 MB) never appeared in Active Downloads. Two
   causes, both fixed: handleApply / handleApplyAll didn't invalidate the
   download-jobs query after dispatching, and useDownloads idled at 30s
   between polls — enough for a fast job to dispatch, download, and get
   cleaned up by removeOnComplete before the next refetch.

3. applyUpdate didn't forward title / totalBytes to RunDownloadJob, so
   any update that did briefly surface in Active Downloads had no label
   and no byte-count progress, just a filename and a percentage. It now
   passes both (matching zim_service's dispatch pattern).

Also parallelized applyAllUpdates so dispatching five updates doesn't
serialize five sequential BullMQ round-trips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:16:00 -07:00
chriscrosstalk
73e2115245 feat(AI): improved AMD GPU acceleration for Ollama via ROCm + HSA override (#804)
* feat(AI): re-enable AMD GPU acceleration for Ollama via ROCm + HSA override

Re-enables AMD GPU support that was disabled in 77f1868 pending validation
of the ROCm image and device discovery. Validation done 2026-04-28 on a
Minisforum UM890 Pro (Ryzen 9 PRO 8945HS + Radeon 780M iGPU) — Ollama
correctly offloaded all model layers to the iGPU when the container was
started with /dev/kfd + /dev/dri passthrough and HSA_OVERRIDE_GFX_VERSION=11.0.0.
On llama3.2:1b, GPU inference ran at 51.83 tok/s vs 33.16 tok/s on CPU
(same hardware, same prompt) — a 1.56x speedup confirmed by Ollama logs
showing "load_tensors: offloaded 17/17 layers to GPU".

Changes
-------

docker_service.ts
- Restore _discoverAMDDevices() (simplified — pass /dev/dri as a directory
  entry, mirroring `docker run --device /dev/dri` behavior, instead of the
  prior brittle hardcoded card0/renderD128 fallback that broke on systems
  where the AMD GPU enumerates as card1+).
- Restore the AMD branch in _createContainer():
  - Switches Ollama image to ollama/ollama:rocm
  - Mounts /dev/kfd + /dev/dri via Devices
  - Sets HSA_OVERRIDE_GFX_VERSION=11.0.0 (required for unsupported-but-RDNA3
    iGPUs like gfx1103; harmless on supported discrete cards)
  - KV opt-out via ai.amdGpuAcceleration (default on)
- Mirror the AMD branch in updateContainer():
  - Lifted GPU detection above docker.pull() so AMD updates pull :rocm
    rather than the standard :targetVersion tag (per-version ROCm tags
    aren't always published)
  - Replaces stale HSA_OVERRIDE in the inspect-captured env on update,
    so containers built before this PR pick up the current value

system_service.ts
- New getOllamaInferenceComputeFromLogs() — parses Ollama startup log line
  "msg=\"inference compute\" ... library=CUDA|ROCm ..." which Ollama emits
  for both NVIDIA and AMD. Catches silent CPU fallback (e.g. NVML death
  after update, or HSA_OVERRIDE failure) that the prior nvidia-smi exec
  probe couldn't detect.
- gpuHealth refactored to use log parsing as the primary probe for both
  vendors, with nvidia-smi exec retained as the NVIDIA-only secondary
  path for hardware enrichment when log parsing has no startup line yet.
- AMD path uses gpu.type KV value (persisted by DockerService._detectGPUType)
  + ai.amdGpuAcceleration opt-out to determine hasRocmRuntime.

types/system.ts
- GpuHealthStatus extended additively: hasRocmRuntime + optional gpuVendor.

types/kv_store.ts
- New ai.amdGpuAcceleration boolean (default-on).

settings/models.tsx, settings/system.tsx
- passthrough_failed banner copy now reads vendor from gpuHealth.gpuVendor
  ("an AMD GPU" vs "an NVIDIA GPU"). Same Fix button hits the same
  force-reinstall endpoint, which now configures AMD correctly.

install_nomad.sh
- AMD detection in verify_gpu_setup() upgraded from a strict-positive
  "ROCm not currently available" message to "ROCm acceleration will be
  configured automatically." Also tightens the lspci match to display
  controller classes (avoids false positives from AMD CPU host bridges,
  matching the same fix already in DockerService._detectGPUType).

Auto-remediation
----------------

Issue #755 proposes auto-remediation when gpuHealth.status flips to
passthrough_failed (today the user has to click "Fix: Reinstall AI
Assistant"). When that PR lands, AMD coverage falls out for free since
this PR uses the same passthrough_failed status code via the shared
gpuHealth machinery — #755's guard will need to flip from
hasNvidiaRuntime === true to (hasNvidiaRuntime || hasRocmRuntime).

Closes #124 (AMD GPU support).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(AI): detect AMD GPU presence inside admin container via marker file

The admin container doesn't have lspci installed, and AMD GPUs don't register
a Docker runtime the way NVIDIA does — so DockerService._detectGPUType() and
SystemService.gpuHealth had no way to know an AMD GPU was present.

The previous implementation fell through to lspci, which silently failed inside
the admin container, leaving gpu.type unset and gpuHealth stuck at 'no_gpu'
even on systems with an AMD GPU. (NVIDIA worked because Docker registers the
nvidia runtime, which is reachable via dockerInfo.Runtimes from any container.)

Discovered while testing the AMD acceleration patch on a Minisforum UM890 Pro:
the AMD branch in _createContainer() never fired because _detectGPUType()
returned 'none' even on a host with a working /dev/kfd.

Fix
---

install_nomad.sh writes the host-detected GPU type ('nvidia' | 'amd') to a
marker file in the storage volume the admin container already bind-mounts:

  /opt/project-nomad/storage/.nomad-gpu-type  →  /app/storage/.nomad-gpu-type

DockerService._detectGPUType() reads the marker as a secondary probe (after
the Docker runtime check) — covers AMD detection from inside the container
without requiring lspci or a /dev bind mount.

SystemService falls back to the marker file when KV gpu.type is empty so the
System page reflects AMD presence even before the user installs AI Assistant
for the first time. (Without this, the page would say 'no_gpu' until Ollama
was installed, even on hosts with an AMD GPU detected at install time.)

Verified on NOMAD6 (UM890 Pro, Ubuntu 24.04, 780M iGPU): with the marker file
in place and admin restarted, the patch's AMD branch fires correctly on Force
Reinstall AI Assistant. Resulting nomad_ollama runs ollama/ollama:rocm with
/dev/kfd + /dev/dri passthrough and HSA_OVERRIDE_GFX_VERSION=11.0.0; Ollama
logs show 'library=ROCm compute=gfx1100 ... type=iGPU'. NOMAD's in-product
benchmark on the same hardware climbed from 33.8 tok/s (CPU) to 57.3 tok/s
(GPU) — a 1.69x speedup, with TTFT dropping from 148ms to 66ms.

Migration for existing AMD installs
-----------------------------------

Users on an existing NOMAD install with an AMD GPU have no marker file (the
install script wrote it on a fresh install). Two paths get them on the GPU:
  1. Re-run install_nomad.sh — writes the marker, no other side effects
  2. Manually: echo amd | sudo tee /opt/project-nomad/storage/.nomad-gpu-type

Either then triggers AMD detection on the next AI Assistant install/reinstall.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(AI): pull ollama/ollama:rocm separately when AMD branch overrides image

The pull-if-missing logic in _createContainer ran against service.container_image
(the DB-pinned tag, e.g. ollama/ollama:0.18.2). The AMD branch then overrode
finalImage to ollama/ollama:rocm — but if that image wasn't already local, the
container creation step failed with "no such image: ollama/ollama:rocm".

Caught while validating on NOMAD2 (Ryzen AI 9 HX 370 + Radeon 890M / RDNA 3.5):
the prior end-to-end test on NOMAD6 had silently passed because the rocm image
was already pulled there from an earlier sidecar test, masking the bug.

Fix: inside the AMD branch, after setting finalImage to ollama/ollama:rocm,
run a parallel _checkImageExists + docker.pull dance for the new tag.

Also confirmed via this validation: the same HSA_OVERRIDE_GFX_VERSION=11.0.0
override works on the 890M (gfx1150 / RDNA 3.5) — Ollama logs report
'library=ROCm compute=gfx1100 description="AMD Radeon 890M Graphics"' and
inference runs at 51.68 tok/s (matching the existing X1 Pro published tile
of 51.7 tok/s on the same hardware class). RDNA 3 (780M, gfx1103) and RDNA
3.5 (890M, gfx1150) both use the same override successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* build(Dockerfile): include pciutils for lspci gpu detection fallback

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Jake Turner <jturner@cosmistack.com>
2026-05-20 10:16:00 -07:00
chriscrosstalk
810a70acb7 fix(ZIM): accumulate across Kiwix pages to prevent empty Content Explorer (#746)
When many ZIMs are already installed locally, a single Kiwix catalog page
(12 items) could return 12 already-installed items, which zim_service
would fully filter out client-side. The endpoint returned items: [] with
has_more: true, and the frontend's infinite-scroll guard
(flatData.length > 0) blocked fetchNextPage — leaving the user with
"No records found" despite plenty of uninstalled ZIMs available.

Backend now accumulates across up to 5 Kiwix fetches (60 items each)
until it has enough post-filter results to return, dedupes by entry id,
advances currentStart by actual entries returned (not requested), and
returns a next_start cursor. The frontend consumes that cursor instead
of computing Kiwix offsets locally, and the flatData.length > 0 guard is
removed so the existing on-mount effect drives bounded auto-fetch when
a short page lands.

The pre-existing has_more off-by-one (compared totalResults against the
input start rather than the post-fetch position) is fixed implicitly.

Diagnosis credit: @johno10661.

Closes #731

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:26:28 -07:00
Henry Estela
0edfdead90 feat(AI): enable flash_attn by default and disable ollama cloud (#616)
New defaults:
OLLAMA_NO_CLOUD=1 - "Ollama can run in local only mode by disabling
Ollama’s cloud features. By turning off Ollama’s cloud features, you
will lose the ability to use Ollama’s cloud models and web search."
https://ollama.com/blog/web-search
https://docs.ollama.com/faq#how-do-i-disable-ollama%E2%80%99s-cloud-features
example output:
```
ollama run minimax-m2.7:cloud
Error: ollama cloud is disabled: remote model details are unavailable
```
This setting can be safely disabled as you have to click on a link to
login to ollama cloud and theres no real way to do that in nomad outside
of looking at the nomad_ollama logs.

This one can be disabled in settings in case theres a model out there
that doesn't play nice. but that doesnt seem necessary so far.
OLLAMA_FLASH_ATTENTION=1 - "Flash Attention is a feature of most modern
models that can significantly reduce memory usage as the context size
grows. "

Tested with llama3.2:
```
docker logs nomad_ollama --tail 1000 2>&1 |grep --color -i flash_attn
llama_context: flash_attn    = enabled
```

And with second_constantine/deepseek-coder-v2 with is based on
https://huggingface.co/lmstudio-community/DeepSeek-Coder-V2-Lite-Instruct-GGUF
which is a model that specifically calls out that you should disable
flash attention, but during testing it seems ollama can do this for you
automatically:
```
docker logs nomad_ollama --tail 1000 2>&1 |grep --color -i flash_attn
llama_context: flash_attn    = disabled
```
2026-04-03 14:26:50 -07:00
chriscrosstalk
bac53e28dc feat(downloads): rich progress, friendly names, cancel, and live status (#554)
* feat(downloads): rich progress, friendly names, cancel, and live status

Redesign the Active Downloads UI with four improvements:

- Rich progress: BullMQ jobs now report downloadedBytes/totalBytes instead
  of just a percentage, showing "2.3 GB / 5.1 GB" instead of "78% / 100%"
- Friendly names: dispatch title metadata from curated categories, Content
  Explorer library, Wikipedia selector, and map collections
- Cancel button: Redis-based cross-process abort signal lets users cancel
  active downloads with file cleanup. Confirmation step prevents accidents.
- Live status indicator: green pulsing dot with transfer speed for active
  downloads, orange stall warning after 60s of no data, gray dot for queued

Backward compatible with in-flight jobs that have integer-only progress.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(downloads): fix cancel, dismiss, speed, and retry bugs

- Speed indicator: only set prevBytesRef on first observation to prevent
  intermediate re-renders from inflating the calculated speed
- Cancel: throw UnrecoverableError on abort to prevent BullMQ retries
- Dismiss: remove stale BullMQ lock before job.remove() so cancelled
  jobs can actually be dismissed
- Retry: add getActiveByUrl() helper that checks job state before
  blocking re-download, auto-cleans terminal jobs
- Wikipedia: reset selection status to failed on cancel so the
  "downloading" state doesn't persist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(downloads): improve cancellation logic and surface true BullMQ job states

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jake Turner <jturner@cosmistack.com>
2026-04-03 14:26:50 -07:00
Henry Estela
69c15b8b1e feat(AI): enable remote AI chat host 2026-04-03 14:26:50 -07:00
Chris Sherwood
571f6bb5a2 fix(GPU): persist GPU type to KV store for reliable passthrough
GPU detection results were only applied at container creation time and
never persisted. If live detection failed transiently (Docker daemon
hiccup, runtime temporarily unavailable), Ollama would silently fall
back to CPU-only mode with no way to recover short of force-reinstall.

Now _detectGPUType() persists successful detections to the KV store
(gpu.type = 'nvidia' | 'amd') and uses the saved value as a fallback
when live detection returns nothing. This ensures GPU config survives
across container recreations regardless of transient detection failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:46:10 -07:00
Chris Sherwood
b0b8f07661 fix: improve download reliability with stall detection, failure visibility, and Wikipedia status tracking
Three bugs caused downloads to hang, disappear, or leave stuck spinners:
1. Wikipedia downloads that failed never updated the DB status from 'downloading',
   leaving the spinner stuck forever. Now the worker's failed handler marks them as failed.
2. No stall detection on streaming downloads - if data stopped flowing mid-download,
   the job hung indefinitely. Added a 5-minute stall timer that triggers retry.
3. Failed jobs were invisible to users since only waiting/active/delayed states were
   queried. Now failed jobs appear with error indicators in the download list.

Closes #364, closes #216

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:46:10 -07:00
Chris Sherwood
b1edef27e8 feat(UI): add Night Ops dark mode with theme toggle
Add a warm charcoal dark mode ("Night Ops") using CSS variable swapping
under [data-theme="dark"]. All 23 desert palette variables are overridden
with dark-mode counterparts, and ~313 generic Tailwind classes (bg-white,
text-gray-*, border-gray-*) are replaced with semantic tokens.

Infrastructure:
- CSS variable overrides in app.css for both themes
- ThemeProvider + useTheme hook (localStorage + KV store sync)
- ThemeToggle component (moon/sun icons, "Night Ops"/"Day Ops" labels)
- FOUC prevention script in inertia_layout.edge
- Toggle placed in StyledSidebar and Footer for access on every page

Color replacements across 50 files:
- bg-white → bg-surface-primary
- bg-gray-50/100 → bg-surface-secondary
- text-gray-900/800 → text-text-primary
- text-gray-600/500 → text-text-secondary/text-text-muted
- border-gray-200/300 → border-border-subtle/border-border-default
- text-desert-white → text-white (fixes invisible text on colored bg)
- Button hover/active states use dedicated btn-green-hover/active vars

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 11:46:10 -07:00
Jake Turner
96e5027055 feat(AI Assistant): performance improvements and smarter RAG context usage 2026-03-11 14:08:09 -07:00
Jake Turner
460756f581 feat(AI Assistant): improved state management and performance 2026-03-11 14:08:09 -07:00
Jake Turner
6f0fae0033 feat(AI Assistant): remember last model used 2026-03-11 14:08:09 -07:00
Jake Turner
58b106f388 feat: support for updating services 2026-03-11 14:08:09 -07:00
Chris Sherwood
650ae407f3 feat(GPU): warn when GPU passthrough not working and offer one-click fix
Ollama can silently run on CPU even when the host has an NVIDIA GPU,
resulting in ~3 tok/s instead of ~167 tok/s. This happens when Ollama
was installed before the GPU toolkit, or when the container was
recreated without proper DeviceRequests. Users had zero indication.

Adds a GPU health check to the system info API response that detects
when the host has an NVIDIA runtime but nvidia-smi fails inside the
Ollama container. Shows a warning banner on the System Information
and AI Settings pages with a one-click "Reinstall AI Assistant"
button that force-reinstalls Ollama with GPU passthrough.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 14:08:09 -07:00
Jake Turner
99b96c3df7 feat(RAG): display embedding queue and improve progress tracking 2026-03-04 20:05:14 -08:00
Jake Turner
96beab7e69 feat(AI Assistant): custom name option for AI Assistant 2026-03-04 20:05:14 -08:00
Jake Turner
efa57ec010 feat: early access release channel 2026-03-03 20:51:38 -08:00
Jake Turner
6817e2e47e fix: improve type-safety for KVStore values 2026-03-03 20:51:38 -08:00
Jake Turner
00bd864831 fix(AI): improved perf via rewrite and streaming logic 2026-03-03 20:51:38 -08:00
Jake Turner
98b65c421c feat(AI): thinking and response streaming 2026-02-18 21:22:53 -08:00
Jake Turner
d55ff7b466
feat: curated content update checking 2026-02-11 21:49:46 -08:00
Jake Turner
32d206cfd7
feat: curated content system overhaul 2026-02-11 15:44:46 -08:00
Jake Turner
df6247b425 feat(Easy Setup): visual cue to start at Easy Setup for OOBE 2026-02-11 11:16:52 -08:00
Chris Sherwood
b0be99700d fix(System): show host OS, hostname, GPU instead of container info
Inside Docker, systeminformation reports the container's Alpine Linux
distro, container ID as hostname, and no GPU. This enriches the System
Information page with actual host details via the Docker API:

- Distribution and kernel version from docker.info()
- Real hostname from docker.info().Name
- GPU model and VRAM via nvidia-smi inside the Ollama container
- Graphics card in System Details (Model, Vendor, VRAM)
- Friendly uptime display (days/hours/minutes instead of minutes only)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 13:23:39 -08:00
Jake Turner
8726700a0a feat: zim content embedding 2026-02-08 13:20:10 -08:00
Jake Turner
2e0ab10075 feat: cron job for system update checks 2026-02-06 15:40:30 -08:00
Jake Turner
a91c13867d fix: filter cloud models from API response 2026-02-04 17:05:20 -08:00
Jake Turner
ab07551719 feat: auto add NOMAD docs to KB on AI install 2026-02-03 23:15:54 -08:00
Chris Sherwood
2c4fc59428 feat(ContentManager): Display friendly names instead of filenames
Content Manager now shows Title and Summary columns from Kiwix metadata
instead of just raw filenames. Metadata is captured when files are
downloaded from Content Explorer and stored in a new zim_file_metadata
table. Existing files without metadata gracefully fall back to showing
the filename.

Changes:
- Add zim_file_metadata table and model for storing title, summary, author
- Update download flow to capture and store metadata from Kiwix library
- Update Content Manager UI to display Title and Summary columns
- Clean up metadata when ZIM files are deleted

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 23:14:28 -08:00
Jake Turner
1923cd4cde feat(AI): chat suggestions and assistant settings 2026-02-01 07:24:21 +00:00
Jake Turner
4584844ca6 refactor(Benchmarks): cleanup api calls 2026-02-01 05:23:11 +00:00
Chris Sherwood
68f374e3a8 feat: Add dedicated Wikipedia Selector with smart package management
Adds a standalone Wikipedia selection section that appears prominently in both
the Easy Setup Wizard and Content Explorer. Features include:

- Six Wikipedia package options ranging from Quick Reference (313MB) to Complete
  Wikipedia with Full Media (99.6GB)
- Card-based radio selection UI with clear size indicators
- Smart replacement: downloads new package before deleting old one
- Status tracking: shows Installed, Selected, or Downloading badges
- "No Wikipedia" option for users who want to skip or remove Wikipedia

Technical changes:
- New wikipedia_selections database table and model
- New /api/zim/wikipedia and /api/zim/wikipedia/select endpoints
- WikipediaSelector component with consistent styling
- Integration with existing download queue system
- Callback updates status to 'installed' on successful download
- Wikipedia removed from tiered category system to avoid duplication

UI improvements:
- Added section dividers and icons (AI Models, Wikipedia, Additional Content)
- Consistent spacing between major sections in Easy Setup Wizard
- Content Explorer gets matching Wikipedia section with submit button

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 21:00:51 -08:00
Jake Turner
adf76d272e fix: remove Open WebUI 2026-01-31 20:39:49 -08:00
Jake Turner
243f749090 feat: [wip] native AI chat interface 2026-01-31 20:39:49 -08:00
Jake Turner
50174d2edb feat(RAG): [wip] RAG capabilities 2026-01-31 20:39:49 -08:00
chriscrosstalk
7a5a254dd5
feat(benchmark): Require full benchmark with AI for community sharing (#99)
* feat(benchmark): Require full benchmark with AI for community sharing

Only allow users to share benchmark results with the community leaderboard
when they have completed a full benchmark that includes AI performance data.

Frontend changes:
- Add AI Assistant installation check via service API query
- Show pre-flight warning when clicking Full Benchmark without AI installed
- Disable AI Only button when AI Assistant not installed
- Show "Partial Benchmark" info alert for non-shareable results
- Only display "Share with Community" for full benchmarks with AI data
- Add note about AI installation requirement with link to Apps page

Backend changes:
- Validate benchmark_type is 'full' before allowing submission
- Require ai_tokens_per_second > 0 for community submission
- Return clear error messages explaining requirements

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(benchmark): UI improvements and GPU detection fix

- Fix GPU detection to properly identify AMD discrete GPUs
- Fix gauge colors (high scores now green, low scores red)
- Fix gauge centering (SVG size matches container)
- Add info tooltips for Tokens/sec and Time to First Token

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(benchmark): Extract iGPU from AMD APU CPU name as fallback

When systeminformation doesn't detect graphics controllers (common on
headless Linux), extract the integrated GPU name from AMD APU CPU model
strings like "AMD Ryzen AI 9 HX 370 w/ Radeon 890M".

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(benchmark): Add Builder Tag system for community leaderboard

- Add builder_tag column to benchmark_results table
- Create BuilderTagSelector component with word dropdowns + randomize
- Add 50 adjectives and 50 nouns for NOMAD-themed tags (e.g., Tactical-Llama-1234)
- Add anonymous sharing option checkbox
- Add builder tag display in Benchmark Details section
- Add Benchmark History section showing all past benchmarks
- Update submission API to accept anonymous flag
- Add /api/benchmark/builder-tag endpoint to update tags

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(benchmark): Add HMAC signing for leaderboard submissions

Sign benchmark submissions with HMAC-SHA256 to prevent casual API abuse.
Includes X-NOMAD-Timestamp and X-NOMAD-Signature headers.

Note: Since NOMAD is open source, a determined attacker could extract
the secret. This provides protection against casual abuse only.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 00:24:31 -08:00
Chris Sherwood
5afc3a270a feat: Improve curated collections UX with persistent tier selection
- Add installed_tiers table to persist user's tier selection per category
- Change tier selection behavior: clicking a tier now highlights it locally,
  user must click "Submit" to confirm (previously clicked = immediate download)
- Remove "Recommended" badge and asterisk (*) from tier displays
- Highlight installed tier instead of recommended tier in CategoryCard
- Add "Click to choose" hint when no tier is installed
- Save installed tier when downloading from Content Explorer or Easy Setup
- Pass installed tier to modal as default selection

Database:
- New migration: create installed_tiers table (category_slug unique, tier_slug)
- New model: InstalledTier

Backend:
- ZimService.listCuratedCategories() now includes installedTierSlug
- New ZimService.saveInstalledTier() method
- New POST /api/zim/save-installed-tier endpoint

Frontend:
- TierSelectionModal: local selection state, "Close" → "Submit" button
- CategoryCard: highlight based on installedTierSlug, add "Click to choose"
- Content Explorer: save tier after download, refresh categories
- Easy Setup: save tiers on wizard completion

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:33:50 -08:00
Chris Sherwood
e31f956289 fix(benchmark): Fix AI benchmark connectivity and improve error handling
- Add OLLAMA_API_URL environment variable for Docker networking
- Use host.docker.internal to reach Ollama from NOMAD container
- Add extra_hosts config in compose for Linux compatibility
- Add downloading_ai_model status with clear progress indicator
- Show model download progress on first AI benchmark run
- Fail AI-only benchmarks with clear error if AI unavailable
- Display benchmark errors to users via Alert component
- Improve error messages with error codes for debugging

Fixes issue where AI benchmark silently failed due to NOMAD container
being unable to reach Ollama at localhost:11434.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:27:56 -08:00
Jake Turner
438d683bac fix(Benchmark): cleanup types for SSOT 2026-01-22 21:48:12 -08:00
Chris Sherwood
755807f95e feat: Add system benchmark feature with NOMAD Score
Add comprehensive benchmarking capability to measure server performance:

Backend:
- BenchmarkService with CPU, memory, disk, and AI benchmarks using sysbench
- Database migrations for benchmark_results and benchmark_settings tables
- REST API endpoints for running benchmarks and retrieving results
- CLI commands: benchmark:run, benchmark:results, benchmark:submit
- BullMQ job for async benchmark execution with SSE progress updates
- Synchronous mode option (?sync=true) for simpler local dev setup

Frontend:
- Benchmark settings page with circular gauges for scores
- NOMAD Score display with weighted composite calculation
- System Performance section (CPU, Memory, Disk Read/Write)
- AI Performance section (tokens/sec, time to first token)
- Hardware Information display
- Expandable Benchmark Details section
- Progress simulation during sync benchmark execution

Easy Setup Integration:
- Added System Benchmark to Additional Tools section
- Built-in capability pattern for non-Docker features
- Click-to-navigate behavior for built-in tools

Fixes:
- Docker log multiplexing issue (Tty: true) for proper output parsing
- Consolidated disk benchmarks into single container execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 21:48:12 -08:00
Chris Sherwood
24f10ea3d5 feat: Use friendly app names on Dashboard with open source attribution
Updates the Dashboard to use the same user-friendly names as the Easy Setup
Wizard, giving credit to the open source projects powering each capability:

- Kiwix → Information Library (Powered by Kiwix)
- Kolibri → Education Platform (Powered by Kolibri)
- Open WebUI → AI Assistant (Powered by Open WebUI + Ollama)
- FlatNotes → Notes (Powered by FlatNotes)
- CyberChef → Data Tools (Powered by CyberChef)

Also reorders Dashboard cards to prioritize Core Capabilities first, with
Maps promoted to Core Capability status, followed by Additional Tools,
then system items (Easy Setup, Install Apps, Docs, Settings).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 16:43:32 -08:00
Jake Turner
937da5d869 feat(Open WebUI): manage models via Command Center 2026-01-19 22:15:52 -08:00
Jake Turner
b6e6e10328 fix(CuratedCategories): improve fetching from Github 2026-01-19 14:41:51 -08:00