project-nomad

mirror of https://github.com/Crosstalk-Solutions/project-nomad.git synced 2026-05-26 06:15:07 +02:00

Author	SHA1	Message	Date
Jake Turner	a0047c1555	fix(KB): surface file-warning compute failures instead of masking as healthy (PR #895 review) `computeFileWarnings()` previously caught all errors and returned an empty map, which the frontend rendered as "every file is healthy" — reintroducing exactly the silent-failure mode this surface exists to expose. Return `{ ok, warnings }`; flip `ok: false` from the catch. KB modal renders an inline amber notice under the Stored Files header when `ok === false`, leaving per-row warning rendering untouched. Transient failures self-heal on the next 30s poll; no toast spam.	2026-05-20 10:16:00 -07:00
Jake Turner	102998ec96	refactor(KB): move FileWarning to shared types/rag following existing convention	2026-05-20 10:16:00 -07:00
Chris Sherwood	43ca584b6c	feat(KB): status pill + last-activity timestamp on Processing Queue (RFC #883 §5/§10) Each in-flight (or stuck) embedding job gets a colored health pill, relative-activity timestamp, and chunk counter so users can tell at a glance whether ingestion is making progress. ## Health states - 🟢 Active — last batch < 2 min ago - 🟡 Slow — last batch 2-5 min ago (CPU-paced multi-batch ingestion lives here naturally; not always a problem) - 🔴 Stalled — last batch > 5 min ago (likely real problem) - ⚪ Waiting — queued, no batch started yet - 🔴 Failed — job recorded failed status ## What lands - New backend util `kb_job_health.ts` with pure `computeJobHealth(input)` decision function. Time-based thresholds (2 min / 5 min) inlined as constants. 9 unit tests pin the boundaries. - `EmbedJobWithProgress` gains `lastBatchAt`, `startedAt`, `chunks` — already set by `EmbedFileJob.handle` on every batch transition, just not previously surfaced through `listActiveJobs`. - Frontend `kb_job_health_display.ts` maps each status to a Tailwind dot color, label, and aria-label so backend and UI stay in sync. - `ActiveEmbedJobs.tsx` renders the pill, "last activity Xs ago", and chunk counter above each progress bar. Adds a manual Refresh button and "Last updated Xs ago" line — the existing 2s/30s auto-poll cadence in `useEmbedJobs` is left intact. - Live tick at 5s keeps the relative timestamps current without re-fetching from the API. ## Not in scope - Per-card Cancel / Retry / Un-index — separate Phase 2 PR - Conditional warnings A/B/C — separate Phase 2 PR - Computing throughput rate (chunks/min) — needs ratio registry consumer (Phase 2 follow-up); for now the pill answers the "is it stuck?" question directly without a rate estimate.	2026-05-20 10:16:00 -07:00
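
The health states listed in the commit above can be read as a small pure decision function. The following is a minimal sketch only: the input shape, the lowercase state names, and the treatment of the exact 5-minute boundary are assumptions, not the contents of `kb_job_health.ts`.

```typescript
// Hypothetical sketch; the real computeJobHealth may take a different input shape.
type JobHealth = 'active' | 'slow' | 'stalled' | 'waiting' | 'failed'

interface JobHealthInput {
  failed: boolean // job recorded a failed status
  lastBatchAt: number | null // epoch ms of the last batch; null if no batch started yet
  now: number // epoch ms, passed in so the function stays pure and testable
}

const ACTIVE_MAX_MS = 2 * 60 * 1000 // 2 min threshold from the commit message
const SLOW_MAX_MS = 5 * 60 * 1000 // 5 min threshold from the commit message

function computeJobHealth(input: JobHealthInput): JobHealth {
  if (input.failed) return 'failed'
  if (input.lastBatchAt === null) return 'waiting'
  const idleMs = input.now - input.lastBatchAt
  if (idleMs < ACTIVE_MAX_MS) return 'active'
  // Assumption: exactly 5 minutes still counts as slow; the commit only says "> 5 min" is stalled.
  if (idleMs <= SLOW_MAX_MS) return 'slow'
  return 'stalled'
}
```

Passing `now` in explicitly is what lets boundary tests pin the 2-minute and 5-minute edges without mocking a clock.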
chriscrosstalk	743549ca74	feat(KB): per-file ingest state machine (Phase 1 of RFC #883 ) (#888 ) Adds a persistent state machine for AI knowledge-base ingestion so the scanner can distinguish "fully indexed", "user opted out", "failed", and "stalled" from each other — none of which were derivable from the prior binary "any chunks in Qdrant ⇒ embedded" check. ## What lands - New table `kb_ingest_state` keyed by `file_path` with enum state column (`pending_decision \| indexed \| browse_only \| failed \| stalled`). Independent of `installed_resources` so it covers both curated downloads and manually-uploaded KB files. - New KV key `rag.defaultIngestPolicy` (string: `Always \| Manual`). Registered now but not consumed yet — JIT prompt + wizard step land in Phase 3 of the RFC. - `EmbedFileJob.handle` writes state on terminal outcomes: - Success (final batch) → `indexed` + chunks count - `UnrecoverableError` → `failed` + error message - Retryable errors are left to BullMQ's existing retry path - `scanAndSyncStorage` swaps the binary qdrant check for a state-aware decision tree (see `decideScanAction`). Existing installs auto-backfill on first scan: files with chunks in Qdrant but no state row become `indexed`; new files start as `pending_decision`. - `deleteFileBySource` drops the state row last, so removed files disappear entirely instead of leaving an orphan that the next scan would re-dispatch into nothing. ## What does NOT land here - Ratio registry (separate PR) — needed for partial-stall detection and cost estimates, but a separable concern. - #880 follow-up initial-progress anchor (separate tiny PR). - Phase 2 UI (status pill, per-card actions, conditional warnings). - Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal). - PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All / Reset & Rebuild would also want to set state, but #886 isn't merged yet; that wiring goes in a follow-up once #886 lands. 
## Target This is forward work for v1.40.0 (RFC #883). Branching off `rc` because that's the current latest base and post-GA Jake will sync rc→dev; a retarget at PR-open time is a fast-forward if requested. ## Tests - 9 new unit tests for `decideScanAction` covering all five states plus the no-row / chunks-present / chunks-missing combinations - Type-check clean - Smoke-tested end-to-end on NOMAD3 via hot-patch: - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all came back `indexed` on first scan - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`) came back `pending_decision` and was correctly re-dispatched (Bull deduped to its historical `:completed` jobId — bgauger's #886 fix drains that) - Delete hook: deleting a KB upload via `DELETE /api/rag/files` removed both the disk file and the state row Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>	2026-05-20 10:16:00 -07:00
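
Two rules from the commit above are concrete enough to sketch: the backfill for files that have no state row, and the state written on terminal job outcomes. This is an illustration under assumptions. The helper names and the `JobOutcome` shape are hypothetical, and the full `decideScanAction` tree for rows that already exist is not reproduced because the commit message does not spell it out.

```typescript
type IngestState = 'pending_decision' | 'indexed' | 'browse_only' | 'failed' | 'stalled'

// Hypothetical helper: initial state for a file found on disk with no kb_ingest_state row.
// Files that already have chunks in Qdrant backfill to 'indexed'; new files start as 'pending_decision'.
function initialStateForUntrackedFile(hasChunksInQdrant: boolean): IngestState {
  return hasChunksInQdrant ? 'indexed' : 'pending_decision'
}

// Hypothetical outcome shape; the real job works with BullMQ errors rather than a tagged union.
type JobOutcome =
  | { kind: 'success'; chunks: number }
  | { kind: 'unrecoverable'; message: string }
  | { kind: 'retryable' }

interface StateWrite {
  state: IngestState
  chunks?: number
  error?: string
}

// Returns the state row to write, or null when the outcome is left to the queue's retry path.
function stateWriteForOutcome(outcome: JobOutcome): StateWrite | null {
  switch (outcome.kind) {
    case 'success':
      return { state: 'indexed', chunks: outcome.chunks }
    case 'unrecoverable':
      return { state: 'failed', error: outcome.message }
    case 'retryable':
      return null
  }
}
```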
Chris Sherwood	2997637ce0	feat(GPU): auto-remediate nomad_ollama passthrough loss on admin boot (#755 ) After an update, container recreate, or docker daemon restart, nomad_ollama's HostConfig.DeviceRequests still lists the nvidia driver — but the NVIDIA Container Toolkit binding inside the container is torn. `nvidia-smi` returns "Failed to initialize NVML: Unknown Error" and Ollama silently falls back to CPU inference. PR #208 detects this and shows a banner with a "Fix: Reinstall AI Assistant" button. This change does that click automatically on admin boot. New provider GpuPassthroughRemediationProvider runs once on web env boot: 1. Skip when KV `ai.autoFixGpuPassthrough = false` (default true). 2. Skip when Docker has no `nvidia` runtime registered (AMD-only and CPU-only hosts unaffected). 3. Skip when nomad_ollama isn't running. 4. Exec `nvidia-smi --query-gpu=name --format=csv,noheader` inside the container with an 8-second timeout. If the output matches "Failed to initialize NVML", "Unknown Error", "TIMEOUT", or contains no alphabetic characters, treat the passthrough as broken. 5. On broken: call DockerService.forceReinstall('nomad_ollama'). The existing force-reinstall preserves the Ollama volume + installed models. Stamp `gpu.autoRemediatedAt` on success. 6. On healthy: log and exit. AMD passthrough_failed is intentionally not handled — its fix path is HSA override handling (PR #804) rather than a simple service recreate, and false positives during AMD startup log parsing would loop a recreate without fixing anything. Left to a follow-up if it proves to be a recurring AMD issue. Validated on NOMAD3 (RTX 5060, v1.32.0-rc.3 + this patch hot-applied): - After admin restart with passthrough healthy: log line "[GpuPassthroughRemediationProvider] NVIDIA passthrough healthy — no action needed." Provider exits cleanly without touching the container. 
- The broken-state branch hits the existing forceReinstall path, which was manually invoked earlier in the same session to fix this exact box and recovered GPU access in ~45s with model volume intact. No new failure mode is introduced — the auto-trigger removes the user click but the underlying operation is the same one the banner Fix button already calls. Closes #755.	2026-05-20 10:16:00 -07:00
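
Step 4 of the provider above classifies the `nvidia-smi` output as healthy or broken. A minimal sketch of that check follows. The function name is hypothetical, and it assumes the caller substitutes the literal string `TIMEOUT` when the 8-second exec limit is hit.

```typescript
// Hypothetical helper mirroring the broken-passthrough rules listed in the commit message.
function isPassthroughBroken(nvidiaSmiOutput: string): boolean {
  const brokenMarkers = ['Failed to initialize NVML', 'Unknown Error', 'TIMEOUT']
  if (brokenMarkers.some((marker) => nvidiaSmiOutput.includes(marker))) return true
  // A healthy query prints a GPU name, so output with no letters at all is treated as broken.
  return !/[a-zA-Z]/.test(nvidiaSmiOutput)
}
```

Treating letter-free output as broken covers the case where the exec succeeds but prints nothing useful.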
Chris Sherwood	a2e2f7fc40	fix(AI): vendor-aware AMD HSA override + benchmark discrete-GPU detection Closes #810. ## Bug A: HSA_OVERRIDE_GFX_VERSION=11.0.0 was unconditional PR #804 set HSA_OVERRIDE_GFX_VERSION=11.0.0 for any AMD GPU. The inline comment claimed this was harmless on supported discrete cards (gfx1030 RX 6800, etc.) — empirically false. With the override, Ollama crashes during GPU discovery on gfx1030 and falls back to CPU silently. Affects every NOMAD user with an RX 6800 or other RDNA 2 discrete card. The correct value depends on the gfx version: - gfx1030, gfx1100, gfx1101, gfx1102: officially supported by ROCm — no override - gfx1031..gfx1036 (RDNA 2 variants + iGPUs like Rembrandt 680M): 10.3.0 - gfx1103, gfx1150, gfx1151 (Phoenix 780M, Strix 890M, Strix Halo): 11.0.0 ### Resolution chain in `_resolveAmdHsaOverride()` 1. KV `ai.amdHsaOverride` — manual override; accepts 'none' to disable, or a semver-style value to force. 2. Marker file `/app/storage/.nomad-amd-gfx` — written by install_nomad.sh based on lspci codename. Mapped to override via `_mapGfxToHsaOverride()`. 3. Default: `11.0.0` — preserves prior behavior so existing iGPU users (780M / 890M, the dominant AMD population today) don't regress on upgrade. Discrete RDNA 2 users on existing installs can opt out via `ai.amdHsaOverride='none'` and force-reinstall AI Assistant, OR re-run install_nomad.sh to refresh the marker file. The helper is used in both `createContainer` (initial install) and `updateContainer` (image update) paths, replacing the unconditional push. ## Bug B: BenchmarkService had no AMD discrete detection path `BenchmarkService.getHardwareInfo()` had three GPU detection fallbacks: 1. `si.graphics()` — empty inside Docker for AMD 2. nvidia-smi — NVIDIA only 3. AMD APU regex from CPU model — integrated only Result: AMD discrete cards (RX 6800, RX 7900 XTX, etc.) showed up as "GPU: Not detected" on the leaderboard despite ROCm working. 
Corrupts leaderboard data quality for that population. Fix: after the existing fallbacks, call `SystemService.getSystemInfo()` and read `graphics.controllers[0].model`. That path already handles AMD via the marker file + Ollama log probe added in PR #804, so we're reusing existing plumbing rather than duplicating detection logic. ## install_nomad.sh changes The existing AMD detection block already runs lspci. Added a codename parse step that maps Navi 21/22/23/24, Rembrandt, Phoenix1/Phoenix2, Strix/Strix Point/Strix Halo, and Navi 31/32/33 to gfx versions, then writes `/opt/project-nomad/storage/.nomad-amd-gfx`. Unknown codenames write nothing (admin handles missing-marker case via the backward-compat default). ## Validation Both bugs were originally surfaced and validated empirically on RX 6800 / gfx1030 / Ubuntu 24.04 + kernel 6.17 + ollama/ollama:rocm during the #810 filing. Validation grid from that report: \| Run \| NOMAD Score \| tok/s \| GPU detected \| \|-----------------------------------------------\|-------------\|-------\|-------------------------\| \| Pre-fix (Bug A active) \| n/a \| 0 \| yes, but library=cpu \| \| HSA_OVERRIDE removed, Bug B unfixed \| 73.8 \| 221.6 \| "Not detected" \| \| Both fixes hot-patched (this PR's behavior) \| 73.7 \| 216.0 \| AMD Radeon RX 6800 \| Local checks: `npm run typecheck` clean, `npm run build` clean.	2026-05-20 10:16:00 -07:00
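
The resolution chain and gfx mapping described in the commit above could look roughly like the sketch below. The function signatures are assumptions (the real helpers read the KV store and the marker file themselves), and falling through to the default for an unrecognized gfx value is also an assumption.

```typescript
const NO_OVERRIDE = new Set(['gfx1030', 'gfx1100', 'gfx1101', 'gfx1102'])
const OVERRIDE_10_3_0 = new Set(['gfx1031', 'gfx1032', 'gfx1033', 'gfx1034', 'gfx1035', 'gfx1036'])
const OVERRIDE_11_0_0 = new Set(['gfx1103', 'gfx1150', 'gfx1151'])
const DEFAULT_OVERRIDE = '11.0.0' // backward-compatible default from the commit message

// Returns the override value, null for "officially supported, set nothing",
// or undefined when the gfx version is not in any known group.
function mapGfxToHsaOverride(gfx: string): string | null | undefined {
  if (NO_OVERRIDE.has(gfx)) return null
  if (OVERRIDE_10_3_0.has(gfx)) return '10.3.0'
  if (OVERRIDE_11_0_0.has(gfx)) return '11.0.0'
  return undefined
}

// kvValue stands in for KV `ai.amdHsaOverride`; markerGfx stands in for the marker file contents.
function resolveAmdHsaOverride(kvValue?: string, markerGfx?: string): string | null {
  if (kvValue === 'none') return null // manual opt-out
  if (kvValue) return kvValue // manual forced value
  if (markerGfx) {
    const mapped = mapGfxToHsaOverride(markerGfx.trim())
    // Assumption: an unrecognized gfx value falls through to the default.
    if (mapped !== undefined) return mapped
  }
  return DEFAULT_OVERRIDE
}
```

A `null` result means the container should be created without `HSA_OVERRIDE_GFX_VERSION` in its environment.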
0xGlitch	94059b0aaf	feat(Maps): regional map downloads via go-pmtiles extract (#780) * feat(maps): add regional map downloads via go-pmtiles extract * address Copilot review feedback on PR #780 - auto-refresh preflight on selection/maxzoom change with 400ms debounce and requestId stale-safety so the confirm button no longer requires a two-step "Estimate Size" -> "Start Download" dance - safeUpdateProgress helper replaces fire-and-forget updateProgress().catch() pattern so cancelled-job errors (code -1) can't surface as unhandled rejections - gate world basemap source on worldBasemapReady: when ensureWorldBasemap() fails we already delete world.pmtiles, so emitting the source was producing 404s on every tile request - verify go-pmtiles binary SHA256 at image build time; upstream doesn't ship a checksums file so per-arch hashes are pinned as build args with a regenerate note when bumping PMTILES_VERSION	2026-05-20 10:16:00 -07:00
Chris Sherwood	299b767e63	feat(content-updates): show size, surface downloads in Active Downloads Content Updates had three UX problems that compounded: 1. No size column, so users had to guess how big an update would be before clicking Update All. Upstream /api/v1/resources/check-updates doesn't return size, so CollectionUpdateService now enriches each update with a Content-Length HEAD request in parallel (5s timeout, non-fatal on failure — the row just renders an em-dash). 2. Small ZIM updates (1-8 MB) never appeared in Active Downloads. Two causes, both fixed: handleApply / handleApplyAll didn't invalidate the download-jobs query after dispatching, and useDownloads idled at 30s between polls — enough for a fast job to dispatch, download, and get cleaned up by removeOnComplete before the next refetch. 3. applyUpdate didn't forward title / totalBytes to RunDownloadJob, so any update that did briefly surface in Active Downloads had no label and no byte-count progress, just a filename and a percentage. It now passes both (matching zim_service's dispatch pattern). Also parallelized applyAllUpdates so dispatching five updates doesn't serialize five sequential BullMQ round-trips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 10:16:00 -07:00
chriscrosstalk	73e2115245	feat(AI): improved AMD GPU acceleration for Ollama via ROCm + HSA override (#804 ) * feat(AI): re-enable AMD GPU acceleration for Ollama via ROCm + HSA override Re-enables AMD GPU support that was disabled in `77f1868` pending validation of the ROCm image and device discovery. Validation done 2026-04-28 on a Minisforum UM890 Pro (Ryzen 9 PRO 8945HS + Radeon 780M iGPU) — Ollama correctly offloaded all model layers to the iGPU when the container was started with /dev/kfd + /dev/dri passthrough and HSA_OVERRIDE_GFX_VERSION=11.0.0. On llama3.2:1b, GPU inference ran at 51.83 tok/s vs 33.16 tok/s on CPU (same hardware, same prompt) — a 1.56x speedup confirmed by Ollama logs showing "load_tensors: offloaded 17/17 layers to GPU". Changes ------- docker_service.ts - Restore _discoverAMDDevices() (simplified — pass /dev/dri as a directory entry, mirroring `docker run --device /dev/dri` behavior, instead of the prior brittle hardcoded card0/renderD128 fallback that broke on systems where the AMD GPU enumerates as card1+). - Restore the AMD branch in _createContainer(): - Switches Ollama image to ollama/ollama:rocm - Mounts /dev/kfd + /dev/dri via Devices - Sets HSA_OVERRIDE_GFX_VERSION=11.0.0 (required for unsupported-but-RDNA3 iGPUs like gfx1103; harmless on supported discrete cards) - KV opt-out via ai.amdGpuAcceleration (default on) - Mirror the AMD branch in updateContainer(): - Lifted GPU detection above docker.pull() so AMD updates pull :rocm rather than the standard :targetVersion tag (per-version ROCm tags aren't always published) - Replaces stale HSA_OVERRIDE in the inspect-captured env on update, so containers built before this PR pick up the current value system_service.ts - New getOllamaInferenceComputeFromLogs() — parses Ollama startup log line "msg=\"inference compute\" ... library=CUDA\|ROCm ..." which Ollama emits for both NVIDIA and AMD. Catches silent CPU fallback (e.g. 
NVML death after update, or HSA_OVERRIDE failure) that the prior nvidia-smi exec probe couldn't detect. - gpuHealth refactored to use log parsing as the primary probe for both vendors, with nvidia-smi exec retained as the NVIDIA-only secondary path for hardware enrichment when log parsing has no startup line yet. - AMD path uses gpu.type KV value (persisted by DockerService._detectGPUType) + ai.amdGpuAcceleration opt-out to determine hasRocmRuntime. types/system.ts - GpuHealthStatus extended additively: hasRocmRuntime + optional gpuVendor. types/kv_store.ts - New ai.amdGpuAcceleration boolean (default-on). settings/models.tsx, settings/system.tsx - passthrough_failed banner copy now reads vendor from gpuHealth.gpuVendor ("an AMD GPU" vs "an NVIDIA GPU"). Same Fix button hits the same force-reinstall endpoint, which now configures AMD correctly. install_nomad.sh - AMD detection in verify_gpu_setup() upgraded from a strict-positive "ROCm not currently available" message to "ROCm acceleration will be configured automatically." Also tightens the lspci match to display controller classes (avoids false positives from AMD CPU host bridges, matching the same fix already in DockerService._detectGPUType). Auto-remediation ---------------- Issue #755 proposes auto-remediation when gpuHealth.status flips to passthrough_failed (today the user has to click "Fix: Reinstall AI Assistant"). When that PR lands, AMD coverage falls out for free since this PR uses the same passthrough_failed status code via the shared gpuHealth machinery — #755's guard will need to flip from hasNvidiaRuntime === true to (hasNvidiaRuntime \|\| hasRocmRuntime). Closes #124 (AMD GPU support). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(AI): detect AMD GPU presence inside admin container via marker file The admin container doesn't have lspci installed, and AMD GPUs don't register a Docker runtime the way NVIDIA does — so DockerService._detectGPUType() and SystemService.gpuHealth had no way to know an AMD GPU was present. The previous implementation fell through to lspci, which silently failed inside the admin container, leaving gpu.type unset and gpuHealth stuck at 'no_gpu' even on systems with an AMD GPU. (NVIDIA worked because Docker registers the nvidia runtime, which is reachable via dockerInfo.Runtimes from any container.) Discovered while testing the AMD acceleration patch on a Minisforum UM890 Pro: the AMD branch in _createContainer() never fired because _detectGPUType() returned 'none' even on a host with a working /dev/kfd. Fix --- install_nomad.sh writes the host-detected GPU type ('nvidia' \| 'amd') to a marker file in the storage volume the admin container already bind-mounts: /opt/project-nomad/storage/.nomad-gpu-type → /app/storage/.nomad-gpu-type DockerService._detectGPUType() reads the marker as a secondary probe (after the Docker runtime check) — covers AMD detection from inside the container without requiring lspci or a /dev bind mount. SystemService falls back to the marker file when KV gpu.type is empty so the System page reflects AMD presence even before the user installs AI Assistant for the first time. (Without this, the page would say 'no_gpu' until Ollama was installed, even on hosts with an AMD GPU detected at install time.) Verified on NOMAD6 (UM890 Pro, Ubuntu 24.04, 780M iGPU): with the marker file in place and admin restarted, the patch's AMD branch fires correctly on Force Reinstall AI Assistant. Resulting nomad_ollama runs ollama/ollama:rocm with /dev/kfd + /dev/dri passthrough and HSA_OVERRIDE_GFX_VERSION=11.0.0; Ollama logs show 'library=ROCm compute=gfx1100 ... type=iGPU'. 
NOMAD's in-product benchmark on the same hardware climbed from 33.8 tok/s (CPU) to 57.3 tok/s (GPU) — a 1.69x speedup, with TTFT dropping from 148ms to 66ms. Migration for existing AMD installs ----------------------------------- Users on an existing NOMAD install with an AMD GPU have no marker file (the install script wrote it on a fresh install). Two paths get them on the GPU: 1. Re-run install_nomad.sh — writes the marker, no other side effects 2. Manually: echo amd \| sudo tee /opt/project-nomad/storage/.nomad-gpu-type Either then triggers AMD detection on the next AI Assistant install/reinstall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(AI): pull ollama/ollama:rocm separately when AMD branch overrides image The pull-if-missing logic in _createContainer ran against service.container_image (the DB-pinned tag, e.g. ollama/ollama:0.18.2). The AMD branch then overrode finalImage to ollama/ollama:rocm — but if that image wasn't already local, the container creation step failed with "no such image: ollama/ollama:rocm". Caught while validating on NOMAD2 (Ryzen AI 9 HX 370 + Radeon 890M / RDNA 3.5): the prior end-to-end test on NOMAD6 had silently passed because the rocm image was already pulled there from an earlier sidecar test, masking the bug. Fix: inside the AMD branch, after setting finalImage to ollama/ollama:rocm, run a parallel _checkImageExists + docker.pull dance for the new tag. Also confirmed via this validation: the same HSA_OVERRIDE_GFX_VERSION=11.0.0 override works on the 890M (gfx1150 / RDNA 3.5) — Ollama logs report 'library=ROCm compute=gfx1100 description="AMD Radeon 890M Graphics"' and inference runs at 51.68 tok/s (matching the existing X1 Pro published tile of 51.7 tok/s on the same hardware class). RDNA 3 (780M, gfx1103) and RDNA 3.5 (890M, gfx1150) both use the same override successfully. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * build(Dockerfile): include pciutils for lspci gpu detection fallback --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Jake Turner <jturner@cosmistack.com>	2026-05-20 10:16:00 -07:00
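
The log-based probe described in the commit above comes down to finding the "inference compute" startup line and reading its `library=` field. A hedged sketch follows. The function name and return convention are hypothetical, and the only log format assumed is the `msg="inference compute" ... library=...` fragment quoted in the commit message.

```typescript
// Hypothetical helper: returns the lowercased library name from the most recent
// "inference compute" line, or null when no such line is present yet.
function inferenceLibraryFromLogs(logs: string): string | null {
  let library: string | null = null
  for (const line of logs.split('\n')) {
    if (!line.includes('msg="inference compute"')) continue
    const value = /\blibrary=(\w+)/.exec(line)?.[1]
    if (value) library = value.toLowerCase()
  }
  return library
}
```

A result of `cpu` on a host that has a GPU is the silent-fallback case the commit describes. A `null` result corresponds to "no startup line yet", where the commit falls back to the `nvidia-smi` exec path for NVIDIA.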
chriscrosstalk	810a70acb7	fix(ZIM): accumulate across Kiwix pages to prevent empty Content Explorer (#746) When many ZIMs are already installed locally, a single Kiwix catalog page (12 items) could return 12 already-installed items, which zim_service would fully filter out client-side. The endpoint returned items: [] with has_more: true, and the frontend's infinite-scroll guard (flatData.length > 0) blocked fetchNextPage, leaving the user with "No records found" despite plenty of uninstalled ZIMs available. Backend now accumulates across up to 5 Kiwix fetches (60 items each) until it has enough post-filter results to return, dedupes by entry id, advances currentStart by actual entries returned (not requested), and returns a next_start cursor. The frontend consumes that cursor instead of computing Kiwix offsets locally, and the flatData.length > 0 guard is removed so the existing on-mount effect drives bounded auto-fetch when a short page lands. The pre-existing has_more off-by-one (compared totalResults against the input start rather than the post-fetch position) is fixed implicitly. Diagnosis credit: @johno10661. Closes #731 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 14:26:28 -07:00
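
The accumulation loop described in the commit above can be sketched as follows. This is an illustration under assumptions: the `Entry` and `Page` shapes and the `fetchPage` callback are hypothetical, and the callback is synchronous here only to keep the example self-contained (a real catalog fetch would be async).

```typescript
interface Entry { id: string }
interface Page { entries: Entry[]; totalResults: number }

const MAX_FETCHES = 5 // bound on upstream requests per call
const FETCH_SIZE = 60 // items requested per upstream fetch

function accumulateUninstalled(
  start: number,
  wanted: number,
  installedIds: Set<string>,
  fetchPage: (start: number, count: number) => Page // hypothetical upstream accessor
): { items: Entry[]; has_more: boolean; next_start: number } {
  const seen = new Set<string>()
  const items: Entry[] = []
  let currentStart = start
  let hasMore = true
  for (let i = 0; i < MAX_FETCHES && items.length < wanted && hasMore; i++) {
    const page = fetchPage(currentStart, FETCH_SIZE)
    for (const entry of page.entries) {
      if (seen.has(entry.id) || installedIds.has(entry.id)) continue // dedupe and filter
      seen.add(entry.id)
      items.push(entry)
    }
    // Advance by what was actually returned, not by what was requested.
    currentStart += page.entries.length
    // Compare against the post-fetch position, which avoids the off-by-one.
    hasMore = page.entries.length > 0 && currentStart < page.totalResults
  }
  return { items, has_more: hasMore, next_start: currentStart }
}
```

This sketch may return more than `wanted` items because it keeps every surviving entry from the last fetch. Whether the real endpoint trims the result is not stated in the commit message.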
Henry Estela	0edfdead90	feat(AI): enable flash_attn by default and disable ollama cloud (#616) New defaults: OLLAMA_NO_CLOUD=1 - "Ollama can run in local only mode by disabling Ollama’s cloud features. By turning off Ollama’s cloud features, you will lose the ability to use Ollama’s cloud models and web search." https://ollama.com/blog/web-search https://docs.ollama.com/faq#how-do-i-disable-ollama%E2%80%99s-cloud-features example output: ``` ollama run minimax-m2.7:cloud Error: ollama cloud is disabled: remote model details are unavailable ``` This setting can be safely disabled as you have to click on a link to log in to ollama cloud and there's no real way to do that in nomad outside of looking at the nomad_ollama logs. OLLAMA_FLASH_ATTENTION=1 - "Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows." This one can be disabled in settings in case there's a model out there that doesn't play nice, but that doesn't seem necessary so far. Tested with llama3.2: ``` docker logs nomad_ollama --tail 1000 2>&1 \|grep --color -i flash_attn llama_context: flash_attn = enabled ``` And with second_constantine/deepseek-coder-v2, which is based on https://huggingface.co/lmstudio-community/DeepSeek-Coder-V2-Lite-Instruct-GGUF, a model that specifically calls out that you should disable flash attention, but during testing it seems ollama can do this for you automatically: ``` docker logs nomad_ollama --tail 1000 2>&1 \|grep --color -i flash_attn llama_context: flash_attn = disabled ```	2026-04-03 14:26:50 -07:00
chriscrosstalk	bac53e28dc	feat(downloads): rich progress, friendly names, cancel, and live status (#554 ) * feat(downloads): rich progress, friendly names, cancel, and live status Redesign the Active Downloads UI with four improvements: - Rich progress: BullMQ jobs now report downloadedBytes/totalBytes instead of just a percentage, showing "2.3 GB / 5.1 GB" instead of "78% / 100%" - Friendly names: dispatch title metadata from curated categories, Content Explorer library, Wikipedia selector, and map collections - Cancel button: Redis-based cross-process abort signal lets users cancel active downloads with file cleanup. Confirmation step prevents accidents. - Live status indicator: green pulsing dot with transfer speed for active downloads, orange stall warning after 60s of no data, gray dot for queued Backward compatible with in-flight jobs that have integer-only progress. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(downloads): fix cancel, dismiss, speed, and retry bugs - Speed indicator: only set prevBytesRef on first observation to prevent intermediate re-renders from inflating the calculated speed - Cancel: throw UnrecoverableError on abort to prevent BullMQ retries - Dismiss: remove stale BullMQ lock before job.remove() so cancelled jobs can actually be dismissed - Retry: add getActiveByUrl() helper that checks job state before blocking re-download, auto-cleans terminal jobs - Wikipedia: reset selection status to failed on cancel so the "downloading" state doesn't persist Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(downloads): improve cancellation logic and surface true BullMQ job states --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jake Turner <jturner@cosmistack.com>	2026-04-03 14:26:50 -07:00
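
The rich-progress change above has to keep rendering in-flight jobs that still report an integer percentage. A minimal sketch of that compatibility check follows. The `Progress` union, the function names, and the 1024-based units with one decimal place are assumptions for illustration.

```typescript
// Legacy jobs report a bare percentage; newer jobs report byte counts.
type Progress = number | { downloadedBytes: number; totalBytes: number }

// Assumption: binary (1024-based) units with one decimal place.
function formatBytes(bytes: number): string {
  const units = ['B', 'KB', 'MB', 'GB', 'TB']
  let value = bytes
  let unitIndex = 0
  while (value >= 1024 && unitIndex < units.length - 1) {
    value /= 1024
    unitIndex++
  }
  return `${value.toFixed(1)} ${units[unitIndex]}`
}

function formatProgress(progress: Progress): string {
  if (typeof progress === 'number') return `${progress}%` // legacy integer-only job
  return `${formatBytes(progress.downloadedBytes)} / ${formatBytes(progress.totalBytes)}`
}
```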
Henry Estela	69c15b8b1e	feat(AI): enable remote AI chat host	2026-04-03 14:26:50 -07:00
Chris Sherwood	571f6bb5a2	fix(GPU): persist GPU type to KV store for reliable passthrough GPU detection results were only applied at container creation time and never persisted. If live detection failed transiently (Docker daemon hiccup, runtime temporarily unavailable), Ollama would silently fall back to CPU-only mode with no way to recover short of force-reinstall. Now _detectGPUType() persists successful detections to the KV store (gpu.type = 'nvidia' \| 'amd') and uses the saved value as a fallback when live detection returns nothing. This ensures GPU config survives across container recreations regardless of transient detection failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 11:46:10 -07:00
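
The persist-and-fall-back behaviour described in the commit above can be sketched with a plain `Map` standing in for the KV store. The function name and signature are hypothetical. Only the `gpu.type` key and its two values come from the commit message.

```typescript
type GpuType = 'nvidia' | 'amd'

// `liveDetection` is the result of the live probe (null when it found nothing);
// `kv` is a stand-in for the real KV store.
function resolveGpuType(liveDetection: GpuType | null, kv: Map<string, string>): GpuType | null {
  if (liveDetection) {
    kv.set('gpu.type', liveDetection) // persist successful detections
    return liveDetection
  }
  // Live detection returned nothing: fall back to the last persisted value, if valid.
  const saved = kv.get('gpu.type')
  return saved === 'nvidia' || saved === 'amd' ? saved : null
}
```

With this shape a transient detection failure cannot downgrade a host that was previously detected as having a GPU, because an empty probe result never overwrites the saved value.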
Chris Sherwood	b0b8f07661	fix: improve download reliability with stall detection, failure visibility, and Wikipedia status tracking Three bugs caused downloads to hang, disappear, or leave stuck spinners: 1. Wikipedia downloads that failed never updated the DB status from 'downloading', leaving the spinner stuck forever. Now the worker's failed handler marks them as failed. 2. No stall detection on streaming downloads - if data stopped flowing mid-download, the job hung indefinitely. Added a 5-minute stall timer that triggers retry. 3. Failed jobs were invisible to users since only waiting/active/delayed states were queried. Now failed jobs appear with error indicators in the download list. Closes #364, closes #216 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 11:46:10 -07:00
Chris Sherwood	b1edef27e8	feat(UI): add Night Ops dark mode with theme toggle Add a warm charcoal dark mode ("Night Ops") using CSS variable swapping under [data-theme="dark"]. All 23 desert palette variables are overridden with dark-mode counterparts, and ~313 generic Tailwind classes (bg-white, text-gray-, border-gray-) are replaced with semantic tokens. Infrastructure: - CSS variable overrides in app.css for both themes - ThemeProvider + useTheme hook (localStorage + KV store sync) - ThemeToggle component (moon/sun icons, "Night Ops"/"Day Ops" labels) - FOUC prevention script in inertia_layout.edge - Toggle placed in StyledSidebar and Footer for access on every page Color replacements across 50 files: - bg-white → bg-surface-primary - bg-gray-50/100 → bg-surface-secondary - text-gray-900/800 → text-text-primary - text-gray-600/500 → text-text-secondary/text-text-muted - border-gray-200/300 → border-border-subtle/border-border-default - text-desert-white → text-white (fixes invisible text on colored bg) - Button hover/active states use dedicated btn-green-hover/active vars Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 11:46:10 -07:00
Jake Turner	96e5027055	feat(AI Assistant): performance improvements and smarter RAG context usage	2026-03-11 14:08:09 -07:00
Jake Turner	460756f581	feat(AI Assistant): improved state management and performance	2026-03-11 14:08:09 -07:00
Jake Turner	6f0fae0033	feat(AI Assistant): remember last model used	2026-03-11 14:08:09 -07:00
Jake Turner	58b106f388	feat: support for updating services	2026-03-11 14:08:09 -07:00
Chris Sherwood	650ae407f3	feat(GPU): warn when GPU passthrough not working and offer one-click fix Ollama can silently run on CPU even when the host has an NVIDIA GPU, resulting in ~3 tok/s instead of ~167 tok/s. This happens when Ollama was installed before the GPU toolkit, or when the container was recreated without proper DeviceRequests. Users had zero indication. Adds a GPU health check to the system info API response that detects when the host has an NVIDIA runtime but nvidia-smi fails inside the Ollama container. Shows a warning banner on the System Information and AI Settings pages with a one-click "Reinstall AI Assistant" button that force-reinstalls Ollama with GPU passthrough. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 14:08:09 -07:00
Jake Turner	99b96c3df7	feat(RAG): display embedding queue and improve progress tracking	2026-03-04 20:05:14 -08:00
Jake Turner	96beab7e69	feat(AI Assistant): custom name option for AI Assistant	2026-03-04 20:05:14 -08:00
Jake Turner	efa57ec010	feat: early access release channel	2026-03-03 20:51:38 -08:00
Jake Turner	6817e2e47e	fix: improve type-safety for KVStore values	2026-03-03 20:51:38 -08:00
Jake Turner	00bd864831	fix(AI): improved perf via rewrite and streaming logic	2026-03-03 20:51:38 -08:00
Jake Turner	98b65c421c	feat(AI): thinking and response streaming	2026-02-18 21:22:53 -08:00
Jake Turner	d55ff7b466	feat: curated content update checking	2026-02-11 21:49:46 -08:00
Jake Turner	32d206cfd7	feat: curated content system overhaul	2026-02-11 15:44:46 -08:00
Jake Turner	df6247b425	feat(Easy Setup): visual cue to start at Easy Setup for OOBE	2026-02-11 11:16:52 -08:00
Chris Sherwood	b0be99700d	fix(System): show host OS, hostname, GPU instead of container info Inside Docker, systeminformation reports the container's Alpine Linux distro, container ID as hostname, and no GPU. This enriches the System Information page with actual host details via the Docker API: - Distribution and kernel version from docker.info() - Real hostname from docker.info().Name - GPU model and VRAM via nvidia-smi inside the Ollama container - Graphics card in System Details (Model, Vendor, VRAM) - Friendly uptime display (days/hours/minutes instead of minutes only) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 13:23:39 -08:00
Jake Turner	8726700a0a	feat: zim content embedding	2026-02-08 13:20:10 -08:00
Jake Turner	2e0ab10075	feat: cron job for system update checks	2026-02-06 15:40:30 -08:00
Jake Turner	a91c13867d	fix: filter cloud models from API response	2026-02-04 17:05:20 -08:00
Jake Turner	ab07551719	feat: auto add NOMAD docs to KB on AI install	2026-02-03 23:15:54 -08:00
Chris Sherwood	2c4fc59428	feat(ContentManager): Display friendly names instead of filenames Content Manager now shows Title and Summary columns from Kiwix metadata instead of just raw filenames. Metadata is captured when files are downloaded from Content Explorer and stored in a new zim_file_metadata table. Existing files without metadata gracefully fall back to showing the filename. Changes: - Add zim_file_metadata table and model for storing title, summary, author - Update download flow to capture and store metadata from Kiwix library - Update Content Manager UI to display Title and Summary columns - Clean up metadata when ZIM files are deleted Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 23:14:28 -08:00
Jake Turner	1923cd4cde	feat(AI): chat suggestions and assistant settings	2026-02-01 07:24:21 +00:00
Jake Turner	4584844ca6	refactor(Benchmarks): cleanup api calls	2026-02-01 05:23:11 +00:00
Chris Sherwood	68f374e3a8	feat: Add dedicated Wikipedia Selector with smart package management Adds a standalone Wikipedia selection section that appears prominently in both the Easy Setup Wizard and Content Explorer. Features include: - Six Wikipedia package options ranging from Quick Reference (313MB) to Complete Wikipedia with Full Media (99.6GB) - Card-based radio selection UI with clear size indicators - Smart replacement: downloads new package before deleting old one - Status tracking: shows Installed, Selected, or Downloading badges - "No Wikipedia" option for users who want to skip or remove Wikipedia Technical changes: - New wikipedia_selections database table and model - New /api/zim/wikipedia and /api/zim/wikipedia/select endpoints - WikipediaSelector component with consistent styling - Integration with existing download queue system - Callback updates status to 'installed' on successful download - Wikipedia removed from tiered category system to avoid duplication UI improvements: - Added section dividers and icons (AI Models, Wikipedia, Additional Content) - Consistent spacing between major sections in Easy Setup Wizard - Content Explorer gets matching Wikipedia section with submit button Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 21:00:51 -08:00
Jake Turner	adf76d272e	fix: remove Open WebUI	2026-01-31 20:39:49 -08:00
Jake Turner	243f749090	feat: [wip] native AI chat interface	2026-01-31 20:39:49 -08:00
Jake Turner	50174d2edb	feat(RAG): [wip] RAG capabilities	2026-01-31 20:39:49 -08:00
chriscrosstalk	7a5a254dd5	feat(benchmark): Require full benchmark with AI for community sharing (#99) * feat(benchmark): Require full benchmark with AI for community sharing Only allow users to share benchmark results with the community leaderboard when they have completed a full benchmark that includes AI performance data. Frontend changes: - Add AI Assistant installation check via service API query - Show pre-flight warning when clicking Full Benchmark without AI installed - Disable AI Only button when AI Assistant not installed - Show "Partial Benchmark" info alert for non-shareable results - Only display "Share with Community" for full benchmarks with AI data - Add note about AI installation requirement with link to Apps page Backend changes: - Validate benchmark_type is 'full' before allowing submission - Require ai_tokens_per_second > 0 for community submission - Return clear error messages explaining requirements Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(benchmark): UI improvements and GPU detection fix - Fix GPU detection to properly identify AMD discrete GPUs - Fix gauge colors (high scores now green, low scores red) - Fix gauge centering (SVG size matches container) - Add info tooltips for Tokens/sec and Time to First Token Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(benchmark): Extract iGPU from AMD APU CPU name as fallback When systeminformation doesn't detect graphics controllers (common on headless Linux), extract the integrated GPU name from AMD APU CPU model strings like "AMD Ryzen AI 9 HX 370 w/ Radeon 890M". Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * feat(benchmark): Add Builder Tag system for community leaderboard - Add builder_tag column to benchmark_results table - Create BuilderTagSelector component with word dropdowns + randomize - Add 50 adjectives and 50 nouns for NOMAD-themed tags (e.g., Tactical-Llama-1234) - Add anonymous sharing option checkbox - Add builder tag display in Benchmark Details section - Add Benchmark History section showing all past benchmarks - Update submission API to accept anonymous flag - Add /api/benchmark/builder-tag endpoint to update tags Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * feat(benchmark): Add HMAC signing for leaderboard submissions Sign benchmark submissions with HMAC-SHA256 to prevent casual API abuse. Includes X-NOMAD-Timestamp and X-NOMAD-Signature headers. Note: Since NOMAD is open source, a determined attacker could extract the secret. This provides protection against casual abuse only. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 00:24:31 -08:00
Chris Sherwood	5afc3a270a	feat: Improve curated collections UX with persistent tier selection - Add installed_tiers table to persist user's tier selection per category - Change tier selection behavior: clicking a tier now highlights it locally, user must click "Submit" to confirm (previously clicked = immediate download) - Remove "Recommended" badge and asterisk (*) from tier displays - Highlight installed tier instead of recommended tier in CategoryCard - Add "Click to choose" hint when no tier is installed - Save installed tier when downloading from Content Explorer or Easy Setup - Pass installed tier to modal as default selection Database: - New migration: create installed_tiers table (category_slug unique, tier_slug) - New model: InstalledTier Backend: - ZimService.listCuratedCategories() now includes installedTierSlug - New ZimService.saveInstalledTier() method - New POST /api/zim/save-installed-tier endpoint Frontend: - TierSelectionModal: local selection state, "Close" → "Submit" button - CategoryCard: highlight based on installedTierSlug, add "Click to choose" - Content Explorer: save tier after download, refresh categories - Easy Setup: save tiers on wizard completion Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 15:33:50 -08:00
Chris Sherwood	e31f956289	fix(benchmark): Fix AI benchmark connectivity and improve error handling - Add OLLAMA_API_URL environment variable for Docker networking - Use host.docker.internal to reach Ollama from NOMAD container - Add extra_hosts config in compose for Linux compatibility - Add downloading_ai_model status with clear progress indicator - Show model download progress on first AI benchmark run - Fail AI-only benchmarks with clear error if AI unavailable - Display benchmark errors to users via Alert component - Improve error messages with error codes for debugging Fixes issue where AI benchmark silently failed due to NOMAD container being unable to reach Ollama at localhost:11434. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 15:27:56 -08:00
Jake Turner	438d683bac	fix(Benchmark): cleanup types for SSOT	2026-01-22 21:48:12 -08:00
Chris Sherwood	755807f95e	feat: Add system benchmark feature with NOMAD Score Add comprehensive benchmarking capability to measure server performance: Backend: - BenchmarkService with CPU, memory, disk, and AI benchmarks using sysbench - Database migrations for benchmark_results and benchmark_settings tables - REST API endpoints for running benchmarks and retrieving results - CLI commands: benchmark:run, benchmark:results, benchmark:submit - BullMQ job for async benchmark execution with SSE progress updates - Synchronous mode option (?sync=true) for simpler local dev setup Frontend: - Benchmark settings page with circular gauges for scores - NOMAD Score display with weighted composite calculation - System Performance section (CPU, Memory, Disk Read/Write) - AI Performance section (tokens/sec, time to first token) - Hardware Information display - Expandable Benchmark Details section - Progress simulation during sync benchmark execution Easy Setup Integration: - Added System Benchmark to Additional Tools section - Built-in capability pattern for non-Docker features - Click-to-navigate behavior for built-in tools Fixes: - Docker log multiplexing issue (Tty: true) for proper output parsing - Consolidated disk benchmarks into single container execution Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 21:48:12 -08:00
Chris Sherwood	24f10ea3d5	feat: Use friendly app names on Dashboard with open source attribution Updates the Dashboard to use the same user-friendly names as the Easy Setup Wizard, giving credit to the open source projects powering each capability: - Kiwix → Information Library (Powered by Kiwix) - Kolibri → Education Platform (Powered by Kolibri) - Open WebUI → AI Assistant (Powered by Open WebUI + Ollama) - FlatNotes → Notes (Powered by FlatNotes) - CyberChef → Data Tools (Powered by CyberChef) Also reorders Dashboard cards to prioritize Core Capabilities first, with Maps promoted to Core Capability status, followed by Additional Tools, then system items (Easy Setup, Install Apps, Docs, Settings). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 16:43:32 -08:00
Jake Turner	937da5d869	feat(Open WebUI): manage models via Command Center	2026-01-19 22:15:52 -08:00
Jake Turner	b6e6e10328	fix(CuratedCategories): improve fetching from Github	2026-01-19 14:41:51 -08:00

72 Commits