project-nomad

mirror of https://github.com/Crosstalk-Solutions/project-nomad.git synced 2026-05-25 13:55:05 +02:00

Author	SHA1	Message	Date
Jake Turner	cbae48a3c8	fix(KB): surface file-warning compute failures instead of masking as healthy (PR #895 review) `computeFileWarnings()` previously caught all errors and returned an empty map, which the frontend rendered as "every file is healthy" — reintroducing exactly the silent-failure mode this surface exists to expose. Return `{ ok, warnings }`; flip `ok: false` from the catch. KB modal renders an inline amber notice under the Stored Files header when `ok === false`, leaving per-row warning rendering untouched. Transient failures self-heal on the next 30s poll; no toast spam.	2026-05-16 21:51:29 -07:00
Jake Turner	cbd86b7af9	refactor(KB): move FileWarning to shared types/rag following existing convention	2026-05-16 21:51:29 -07:00
Chris Sherwood	7c2282acf1	feat(KB): conditional warnings A + B on Stored Files (RFC #883 §6) Surfaces two silent failure modes that the prior binary "any-chunks-in-Qdrant ⇒ embedded" check could not distinguish from healthy ingestion: - Warning A — Zero-chunk file (file_size > 100 MB, chunks = 0) Fires on video-only / image-only ZIMs (`lrnselfreliance_en_all`, TED talks, etc.) that the pipeline completes "successfully" with no extractable text. AI Assistant literally cannot reference these. - Warning B — Partial-embed stall (chunks < 50% of expected from the ratio registry). Surfaces the simple_wiki "266 of 600,000 chunks" case observed during NOMAD1 ingestion testing — previously these looked identical to fully-completed embeds in the UI. Both warnings render only when their condition is met (silent by default; noisy only on real problems). Base is `feat/kb-ratio-registry` (#891) because Warning B's "expected chunks" estimate comes from `KbRatioRegistry.estimateChunks()`. GitHub fast-forwards to `rc` once #891 merges. - `app/utils/kb_warning_decision.ts` — pure `decideWarnings(inputs)` with thresholds (`100 MB`, `0.5×`) as exported constants. 10 unit tests cover the healthy case, both warnings, the under/at/over boundary, the registry-miss suppression, and the video-only registry case (`expectedChunks: 0` correctly skips Warning B). - `RagService.computeFileWarnings()` — single Qdrant scroll tallies chunks per source, filesystem walk fills in zero-chunk files, ratio registry estimates the expectation, decision function emits. - New endpoint `GET /api/rag/file-warnings` returns `Record<source, FileWarning[]>` (sources with no warnings are omitted, so the frontend can `warnings[source] ?? []` for clean defaults). - KB modal: warnings render inline under the file name as amber-tinted pills. Polled every 30s alongside the existing health check. - Warning C — chunks skipped due to length. 
PR #890 (#881 fix) prevents the silent drop at the embed boundary, so the underlying condition shouldn't fire anymore. If we still want to surface "we truncated N chunks to fit", that needs separate `skipped_count` tracking in EmbedFileJob — a Phase 2 follow-up. - Suppressing Warning B during active mid-ingestion. The user can cross-reference the Processing Queue to know it's in-flight; suppressing warnings while a job runs would mask real stalls where the job died mid-batch. Will revisit when per-card status is wired through. - Use of `kb_ingest_state.chunks_embedded` (#888) as the chunk count source. This PR uses Qdrant scroll directly so it can land independently of #888. - 10 new unit tests on `decideWarnings`, all pass - Type-check clean - Hot-patch + browser smoke test deferred until #891 lands (the ratio registry needs to exist in the DB for `estimateChunks()` to return non-null estimates — without it, only Warning A fires, which is still useful, but Warning B stays dormant)	2026-05-16 21:51:29 -07:00
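The two warning conditions in the commit above can be sketched as a pure function. This is a minimal illustration, not the project's code: only the name `decideWarnings` and the thresholds (100 MB, 0.5×) come from the commit message. The input shape, the warning codes, the constant names, and the choice of binary megabytes are all assumptions.

```typescript
// Thresholds from the commit message; constant names and binary MB are assumptions.
const ZERO_CHUNK_MIN_BYTES = 100 * 1024 * 1024;
const PARTIAL_EMBED_RATIO = 0.5;

// Hypothetical shapes, not the project's real types.
type WarningCode = "zero_chunks" | "partial_embed";
interface WarningInputs {
  fileSizeBytes: number;
  chunks: number;
  expectedChunks: number | null; // null models a ratio-registry miss
}

function decideWarnings(i: WarningInputs): WarningCode[] {
  const out: WarningCode[] = [];
  // Warning A: a large file that produced no chunks at all.
  if (i.fileSizeBytes > ZERO_CHUNK_MIN_BYTES && i.chunks === 0) {
    out.push("zero_chunks");
  }
  // Warning B: suppressed on a registry miss and when zero chunks are expected
  // (the video-only case), otherwise fires below half of the expectation.
  if (
    i.expectedChunks !== null &&
    i.expectedChunks > 0 &&
    i.chunks < PARTIAL_EMBED_RATIO * i.expectedChunks
  ) {
    out.push("partial_embed");
  }
  return out;
}
```

In this sketch a large text ZIM with zero chunks would raise both warnings; whether the real function deduplicates that case is not stated in the commit.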
Chris Sherwood	ab8281d08b	feat(KB): surface embedding-disk estimate in curated tier-change modal (RFC #883 §1) When a user picks a tier in TierSelectionModal, show how much additional disk space the AI Assistant will need if the new ZIMs are indexed, plus a policy-aware footer explaining whether they'll auto-index (Always) or wait for opt-in (Manual). Estimates consume #891's KbRatioRegistry via a new POST /api/rag/estimate-batch endpoint. Backend - New POST /api/rag/estimate-batch route + RagController.estimateBatch - VineJS schema accepting array of {filename, sizeBytes}, capped at 500 - KbRatioRegistry.estimateBatch aggregates via the existing prefix-match lookup, returns {totalChunks, totalBytes, hasUnknown} - New BYTES_PER_CHUNK_ON_DISK constant (~8 KB: 3 KB vector + ~3 KB chunk text + ~2 KB payload/index overhead). Tunable; will be replaced by Phase 4 self-calibration once we have real measurements. - Controller normalizes incoming filenames via path.basename so callers that send full paths or URLs still match registry prefixes correctly. Frontend - api.estimateEmbeddingBatch() client method - TierSelectionModal: when localSelectedSlug is set, resolve the tier's resources (incl. inherited tiers), POST to /estimate-batch, and render a new info block with the +~X GB figure + ingest-policy copy. Also fetches rag.defaultIngestPolicy so the same block surfaces whether indexing will fire automatically or wait for the user. - resourceFilename() helper extracts the basename from the resource URL so the registry lookup hits the right prefix regardless of mirror. Tests - 4 new cases in tests/unit/kb_ratio_lookup.spec.ts covering the estimateBatch aggregator: standard sum, unknown-flagging, video-only ZIM (0 chunks but known, hasUnknown stays false), empty input. Stacks on feat/kb-ratio-registry (#891) — consumes the registry table seeded by that PR. Once #891 merges to rc, this PR auto-rebases. 
Out of scope for this PR (deferred to follow-ups): - Per-batch opt-in checkbox (RFC §1's '☑ Also index these for AI') needs a per-batch policy override path and is a separate PR - Guardrail modal at 50 GB / 10% free / 6 hr thresholds (RFC §7) is also separate; this PR is informational, not gating - Time-to-embed estimate awaits a chunks-per-second metric per host	2026-05-16 21:33:30 -07:00
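As a rough sketch of the batch aggregation described in the commit above: the result fields (`totalChunks`, `totalBytes`, `hasUnknown`) and the roughly 8 KB per chunk figure come from the commit message. The lookup callback, the exact constant value, and the rounding are assumptions made for illustration.

```typescript
// ~8 KB per chunk per the commit; the exact value used by the project is assumed.
const BYTES_PER_CHUNK_ON_DISK = 8 * 1024;

interface BatchItem {
  filename: string;
  sizeBytes: number;
}
interface BatchEstimate {
  totalChunks: number;
  totalBytes: number;
  hasUnknown: boolean;
}

function estimateBatch(
  items: BatchItem[],
  // Hypothetical lookup standing in for the registry; null means "no match".
  chunksPerMb: (filename: string) => number | null,
): BatchEstimate {
  let totalChunks = 0;
  let hasUnknown = false;
  for (const item of items) {
    const ratio = chunksPerMb(item.filename);
    if (ratio === null) {
      hasUnknown = true; // flag it, but keep summing the known files
      continue;
    }
    // A ratio of 0 (video-only ZIM) is a known value and adds nothing.
    totalChunks += Math.round((item.sizeBytes / (1024 * 1024)) * ratio);
  }
  return {
    totalChunks,
    totalBytes: totalChunks * BYTES_PER_CHUNK_ON_DISK,
    hasUnknown,
  };
}
```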
chriscrosstalk	603a7070e8	feat(KB): Always/Manual ingest policy toggle (RFC #883 §1/§4) (#894 ) * feat(KB): per-file ingest state machine (Phase 1 of RFC #883) Adds a persistent state machine for AI knowledge-base ingestion so the scanner can distinguish "fully indexed", "user opted out", "failed", and "stalled" from each other — none of which were derivable from the prior binary "any chunks in Qdrant ⇒ embedded" check. ## What lands - New table `kb_ingest_state` keyed by `file_path` with enum state column (`pending_decision \| indexed \| browse_only \| failed \| stalled`). Independent of `installed_resources` so it covers both curated downloads and manually-uploaded KB files. - New KV key `rag.defaultIngestPolicy` (string: `Always \| Manual`). Registered now but not consumed yet — JIT prompt + wizard step land in Phase 3 of the RFC. - `EmbedFileJob.handle` writes state on terminal outcomes: - Success (final batch) → `indexed` + chunks count - `UnrecoverableError` → `failed` + error message - Retryable errors are left to BullMQ's existing retry path - `scanAndSyncStorage` swaps the binary qdrant check for a state-aware decision tree (see `decideScanAction`). Existing installs auto-backfill on first scan: files with chunks in Qdrant but no state row become `indexed`; new files start as `pending_decision`. - `deleteFileBySource` drops the state row last, so removed files disappear entirely instead of leaving an orphan that the next scan would re-dispatch into nothing. ## What does NOT land here - Ratio registry (separate PR) — needed for partial-stall detection and cost estimates, but a separable concern. - #880 follow-up initial-progress anchor (separate tiny PR). - Phase 2 UI (status pill, per-card actions, conditional warnings). - Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal). 
- PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All / Reset & Rebuild would also want to set state, but #886 isn't merged yet; that wiring goes in a follow-up once #886 lands. ## Target This is forward work for v1.40.0 (RFC #883). Branching off `rc` because that's the current latest base and post-GA Jake will sync rc→dev; a retarget at PR-open time is a fast-forward if requested. ## Tests - 9 new unit tests for `decideScanAction` covering all five states plus the no-row / chunks-present / chunks-missing combinations - Type-check clean - Smoke-tested end-to-end on NOMAD3 via hot-patch: - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all came back `indexed` on first scan - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`) came back `pending_decision` and was correctly re-dispatched (Bull deduped to its historical `:completed` jobId — bgauger's #886 fix drains that) - Delete hook: deleting a KB upload via `DELETE /api/rag/files` removed both the disk file and the state row * feat(KB): Always/Manual ingest policy toggle (RFC #883 §1/§4) Activates the `rag.defaultIngestPolicy` KV registered in Phase 1 (#888) so users on a fresh install (or anyone who picks Manual mode) no longer get every new ZIM auto-dispatched to the embed pipeline. ## Stacks on #888 This PR's base is `feat/kb-ingest-state-machine` (#888). The state machine has to be in place for the decision function to be policy-aware; GitHub will fast-forward the base to `rc` once #888 merges. ## Backend changes - `decideScanAction` now takes a `policy: 'Always' \| 'Manual'` argument (defaults to `Always` for backward compatibility). - New `ScanAction` kind: `create_pending`. Manual mode records that the scanner has seen a new file (so the UI can surface a per-card Index affordance later) without dispatching an EmbedFileJob. - `scanAndSyncStorage` reads the KV and passes it through. 
The scan-result log line now includes the active policy and a `waiting on user` count for Manual-mode hits. - `rag.defaultIngestPolicy` added to `SETTINGS_KEYS` so it's reachable through the existing `GET/PATCH /api/system/settings` surface — no new endpoint. ## Frontend changes - New section in the KB panel between "Why upload" and "Processing Queue": "Auto-index new content for AI? [Always | Manual]" — segmented radio with copy explaining the 5-10× disk multiplier. Default Always. - `useQuery('ingestPolicy')` reads the current value; clicking the inactive option mutates and shows a notification confirming the new behavior. ## Tests - 14 unit tests on `decideScanAction` (was 9) — split into Always-mode cases (preserves Phase 1's contract) and Manual-mode cases (`create_pending`, `pending_decision → skip`, etc.). - Type-check clean. - Hot-patch + browser verification deferred until #888 lands; the state machine smoke-tested cleanly on NOMAD3 in #888's PR, and this PR's decision-tree changes are exhaustively unit-tested. ## RFC open question §3 — policy-change re-trigger Switching Manual → Always doesn't auto-dispatch existing `pending_decision` rows immediately. The next scan re-evaluates and dispatches them under the new policy. This matches the RFC's "treat the switch as I've-thought-about-it" instinct for the guardrail; full guardrail implementation lands in Phase 3 task 14. --------- Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>	2026-05-16 21:00:11 -07:00
Chris Sherwood	ca5569c8ea	feat(KB): status pill + last-activity timestamp on Processing Queue (RFC #883 §5/§10) Each in-flight (or stuck) embedding job gets a colored health pill, relative-activity timestamp, and chunk counter so users can tell at a glance whether ingestion is making progress. ## Health states - 🟢 Active — last batch < 2 min ago - 🟡 Slow — last batch 2-5 min ago (CPU-paced multi-batch ingestion lives here naturally; not always a problem) - 🔴 Stalled — last batch > 5 min ago (likely real problem) - ⚪ Waiting — queued, no batch started yet - 🔴 Failed — job recorded failed status ## What lands - New backend util `kb_job_health.ts` with pure `computeJobHealth(input)` decision function. Time-based thresholds (2 min / 5 min) inlined as constants. 9 unit tests pin the boundaries. - `EmbedJobWithProgress` gains `lastBatchAt`, `startedAt`, `chunks` — already set by `EmbedFileJob.handle` on every batch transition, just not previously surfaced through `listActiveJobs`. - Frontend `kb_job_health_display.ts` maps each status to a Tailwind dot color, label, and aria-label so backend and UI stay in sync. - `ActiveEmbedJobs.tsx` renders the pill, "last activity Xs ago", and chunk counter above each progress bar. Adds a manual Refresh button and "Last updated Xs ago" line — the existing 2s/30s auto-poll cadence in `useEmbedJobs` is left intact. - Live tick at 5s keeps the relative timestamps current without re-fetching from the API. ## Not in scope - Per-card Cancel / Retry / Un-index — separate Phase 2 PR - Conditional warnings A/B/C — separate Phase 2 PR - Computing throughput rate (chunks/min) — needs ratio registry consumer (Phase 2 follow-up); for now the pill answers the "is it stuck?" question directly without a rate estimate.	2026-05-16 20:37:20 -07:00
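The health states in the commit above map onto a small decision function. The sketch below uses the 2 min / 5 min thresholds from the commit message; the input shape, the lowercase state names, and the handling of the exact boundaries (2 min counts as Slow, 5 min still counts as Slow) are assumptions.

```typescript
type JobHealth = "active" | "slow" | "stalled" | "waiting" | "failed";

// Thresholds from the commit message; constant names are assumptions.
const SLOW_AFTER_MS = 2 * 60 * 1000;
const STALLED_AFTER_MS = 5 * 60 * 1000;

// Hypothetical input shape, not the project's real type.
interface JobHealthInput {
  failed: boolean;
  lastBatchAt: number | null; // epoch ms; null until the first batch starts
  now: number; // epoch ms, passed in so the function stays pure
}

function computeJobHealth(i: JobHealthInput): JobHealth {
  if (i.failed) return "failed";
  if (i.lastBatchAt === null) return "waiting";
  const idleMs = i.now - i.lastBatchAt;
  if (idleMs < SLOW_AFTER_MS) return "active";
  if (idleMs <= STALLED_AFTER_MS) return "slow";
  return "stalled";
}
```

Passing `now` explicitly keeps the function deterministic, which is what makes boundary tests like the ones the commit mentions straightforward to write.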
Chris Sherwood	7d7459bc14	feat(KB): group admin docs into single row in Stored Files (RFC #883 §9) Project NOMAD's bundled docs (`/app/docs/*.md` and `README.md`) each embed as their own KB source — currently rendering as 12+ individual rows that swamp user-uploaded content in the Stored Files table. Collapse them into one informational row: > Project NOMAD documentation · 12 files · Managed by NOMAD The admin-docs row hides the Delete button (those files would be re-embedded on the next sync anyway, so deleting is a footgun). User uploads and ZIMs keep their existing per-row Delete UX. Also adds deterministic sort: ZIMs → user uploads → admin docs → other, alphabetical within each bucket. Pure frontend change — `/api/rag/files` response shape unchanged. Decision logic extracted to `kb_file_grouping.ts` with 9 unit tests covering bucket classification, sort order, count noun pluralization, and empty-input handling.	2026-05-16 20:30:08 -07:00
Jake Turner	460065ae85	fix(KB): align chunks_per_mb column type with TS contract Switch kb_ratio_registry.chunks_per_mb from DECIMAL(10,2) to UNSIGNED INTEGER so the value mysql2 returns matches the `number` type declared on the model. DECIMAL columns deserialize as strings by default, which would break `=== 0` checks for video-only ZIMs and silently coerce through arithmetic in Phase 2 consumers. All seeds are whole numbers and the heuristic's real-world variance (~±50%) makes sub-integer precision meaningless.	2026-05-16 20:23:47 -07:00
Chris Sherwood	68e1bd5ff2	feat(KB): ratio registry for disk + time estimates (Phase 1B of RFC #883) Foundation for the cost estimates and partial-stall detection that Phase 2 will surface. No consumers yet — this PR just lays the table, the seed rows, and the lookup helper so subsequent UI work has estimates available without a per-ZIM benchmark. ## What lands - New table `kb_ratio_registry` (pattern, chunks_per_mb, sample_count, notes). Migration creates and seeds heuristic defaults from the RFC appendix: devdocs (1100/MB), Wikipedia variants (270/MB), iFixit (50/MB), Stack Exchange Q&A (200/MB), video-only ZIMs (0), plus a catch-all fallback at 100/MB. - `KbRatioRegistry` model with static `lookup()` and `estimateChunks()`. - Pure helper `kb_ratio_lookup.ts` doing longest-prefix-match — a specific entry (`wikipedia_en_simple_`) overrides a broader one (`wikipedia_en_`). 9 unit tests covering the lookup boundary. - `sample_count` starts at 0 (heuristic seed) and is reserved for Phase 4 self-calibration to increment as observed ZIMs update each row. ## Not in scope - Self-calibration on successful ingestion (Phase 4) - UI consumers — Warning B (partial-embed stall) and the storage budget meter / time estimates land in Phase 2. ## Tested - Type-check clean - 9 unit tests pass for `findChunksPerMb` and `estimateChunkCount` - Migration applied on NOMAD3 via hot-patch; 9 seed rows verified in DB	2026-05-16 20:23:47 -07:00
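Longest-prefix-match, as described in the commit above, can be sketched in a few lines. The function name `findChunksPerMb` appears in the commit message, but its signature here is an assumption, as are the row shape, the ratio used for the more specific pattern, and the modelling of the catch-all fallback as an empty pattern.

```typescript
// Hypothetical row shape mirroring the (pattern, chunks_per_mb) columns.
interface RatioRow {
  pattern: string;
  chunksPerMb: number;
}

// Returns the ratio of the longest pattern that prefixes the filename,
// or null when nothing matches.
function findChunksPerMb(filename: string, rows: RatioRow[]): number | null {
  let best: RatioRow | null = null;
  for (const row of rows) {
    const matches = filename.startsWith(row.pattern);
    if (matches && (best === null || row.pattern.length > best.pattern.length)) {
      best = row;
    }
  }
  return best === null ? null : best.chunksPerMb;
}
```

With an empty-string pattern in the table every filename matches something, so a more specific row only has to be longer to win.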
Jake Turner	8ce5790ab5	fix(AI): add truncation DEBUG log	2026-05-15 23:09:59 -07:00
Chris Sherwood	c9ccd4a202	fix(AI): pre-cap embed input + log fallback reason (#881) The OpenAI-compatible /v1/embeddings fallback path can't pass `truncate:true` / `num_ctx:8192` to the model, so any chunk that exceeds the model's loaded context_length (often 2048 for nomic-embed-text:v1.5) returns a 400 BadRequestError and is silently dropped from Qdrant. Two CPU-only ingestion runs on NOMAD1 hit this on dense technical content (medlineplus, arduino.stackexchange) even after PR #763's num_ctx fix on the native path. Pre-cap each input string at 4000 chars before either backend call. That's ~1000-2000 tokens depending on density, comfortably under the model's 2048 default. The chunker in RagService is sized for MAX_SAFE_TOKENS=1600 (3200 chars at its conservative 2 chars/token estimate), so well-formed inputs are never touched; this is purely a runtime safety net for the edge cases that slip through. Also stop swallowing the original error in the catch. The bare `} catch {}` here has masked recurring "input length exceeds context length" failures for months (#369, #670, #881). Capture and warn-log the message so future investigations see why we fell back. Same root cause as #369 and #670 which were closed without an actual fix to the fallback path.	2026-05-15 23:09:59 -07:00
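The pre-cap described in the commit above amounts to slicing each input before it reaches either backend. A minimal sketch follows, assuming a hypothetical helper name and return shape; only the 4000-character cap comes from the commit message. The truncation count is included because the follow-up commit mentions a truncation DEBUG log.

```typescript
// Cap from the commit message; the constant and function names are assumptions.
const MAX_EMBED_INPUT_CHARS = 4000;

function capEmbedInputs(inputs: string[]): { capped: string[]; truncatedCount: number } {
  let truncatedCount = 0;
  const capped = inputs.map((text) => {
    if (text.length <= MAX_EMBED_INPUT_CHARS) return text; // well-formed input, untouched
    truncatedCount += 1;
    // slice counts UTF-16 code units, so this is a character cap, not a token cap.
    return text.slice(0, MAX_EMBED_INPUT_CHARS);
  });
  return { capped, truncatedCount };
}
```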
chriscrosstalk	68c0a37cab	fix(RAG): anchor continuation-batch initial progress to overall-file frame (#889) Each continuation batch of a multi-batch ZIM embed runs as a fresh BullMQ job, so handle() ran the hardcoded `safeUpdateProgress(job, 5)` even when the file was already 100k articles into a 600k-article ZIM. The UI gauge briefly dropped to 5% before the per-batch onProgress callback caught up to the true overall percentage, reading as a backward jump every time a new batch started. Compute initialPercent from batchOffset / totalArticles when available, falling back to 5 for single-batch files (uploaded PDFs, txts) where totalArticles isn't set. Capped at 99 to leave headroom for the 100% final-batch marker. Follow-up to PR #880 (which fixed the 0-100% scaling during a batch but still had the initial-frame regression).	2026-05-15 23:01:45 -07:00
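The initial-progress calculation in the commit above can be sketched as follows. The fallback of 5 and the cap of 99 come from the commit message; the function signature and the use of `Math.floor` are assumptions.

```typescript
// Initial progress for a batch, expressed in the overall-file frame.
function initialPercent(batchOffset: number, totalArticles?: number): number {
  // Single-batch files (uploaded PDFs, txts) have no totalArticles: keep the old 5%.
  if (totalArticles === undefined || totalArticles <= 0) return 5;
  // Rounding down is an assumption; the cap leaves room for the final 100% marker.
  return Math.min(99, Math.floor((batchOffset / totalArticles) * 100));
}
```

For the case in the commit message, 100k articles into a 600k-article ZIM, this starts the batch at 16% rather than dropping back to 5%.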
chriscrosstalk	69cf66c1f3	feat(KB): per-file ingest state machine (Phase 1 of RFC #883 ) (#888 ) Adds a persistent state machine for AI knowledge-base ingestion so the scanner can distinguish "fully indexed", "user opted out", "failed", and "stalled" from each other — none of which were derivable from the prior binary "any chunks in Qdrant ⇒ embedded" check. ## What lands - New table `kb_ingest_state` keyed by `file_path` with enum state column (`pending_decision \| indexed \| browse_only \| failed \| stalled`). Independent of `installed_resources` so it covers both curated downloads and manually-uploaded KB files. - New KV key `rag.defaultIngestPolicy` (string: `Always \| Manual`). Registered now but not consumed yet — JIT prompt + wizard step land in Phase 3 of the RFC. - `EmbedFileJob.handle` writes state on terminal outcomes: - Success (final batch) → `indexed` + chunks count - `UnrecoverableError` → `failed` + error message - Retryable errors are left to BullMQ's existing retry path - `scanAndSyncStorage` swaps the binary qdrant check for a state-aware decision tree (see `decideScanAction`). Existing installs auto-backfill on first scan: files with chunks in Qdrant but no state row become `indexed`; new files start as `pending_decision`. - `deleteFileBySource` drops the state row last, so removed files disappear entirely instead of leaving an orphan that the next scan would re-dispatch into nothing. ## What does NOT land here - Ratio registry (separate PR) — needed for partial-stall detection and cost estimates, but a separable concern. - #880 follow-up initial-progress anchor (separate tiny PR). - Phase 2 UI (status pill, per-card actions, conditional warnings). - Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal). - PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All / Reset & Rebuild would also want to set state, but #886 isn't merged yet; that wiring goes in a follow-up once #886 lands. 
## Target This is forward work for v1.40.0 (RFC #883). Branching off `rc` because that's the current latest base and post-GA Jake will sync rc→dev; a retarget at PR-open time is a fast-forward if requested. ## Tests - 9 new unit tests for `decideScanAction` covering all five states plus the no-row / chunks-present / chunks-missing combinations - Type-check clean - Smoke-tested end-to-end on NOMAD3 via hot-patch: - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all came back `indexed` on first scan - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`) came back `pending_decision` and was correctly re-dispatched (Bull deduped to its historical `:completed` jobId — bgauger's #886 fix drains that) - Delete hook: deleting a KB upload via `DELETE /api/rag/files` removed both the disk file and the state row Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>	2026-05-15 22:51:06 -07:00
Chris Sherwood	5193f74410	fix(ZIM): preserve co-existing Wikipedia corpora on cleanup (#884) onWikipediaDownloadComplete was deleting every file whose name starts with `wikipedia_en_`, treating distinct corpora (simple, medicine, wikivoyage, climate_change, etc.) as competing versions of the same selection slot. Whichever wiki finished second silently wiped the other from disk. Match by filename stem instead — strip the trailing `_YYYY-MM(-DD).zim` date suffix and only delete files with the same stem as the new download. Different release dates of the same variant still get cleaned up; distinct variants are preserved. Extracted the predicate to `app/utils/zim_filename.ts` so the boundary is covered by unit tests (8 cases incl. the #884 repro scenario).	2026-05-15 22:29:17 -07:00
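The stem-matching predicate from the commit above could look roughly like this. The `_YYYY-MM(-DD).zim` suffix pattern comes from the commit message; the function names and the example filenames are hypothetical.

```typescript
// Matches a trailing _YYYY-MM.zim or _YYYY-MM-DD.zim date suffix.
const ZIM_DATE_SUFFIX = /_\d{4}-\d{2}(?:-\d{2})?\.zim$/;

function zimStem(filename: string): string {
  return filename.replace(ZIM_DATE_SUFFIX, "");
}

// True when `existing` is an older or newer release of the same variant as
// `downloaded`, and is therefore safe to clean up.
function isSupersededBy(existing: string, downloaded: string): boolean {
  return existing !== downloaded && zimStem(existing) === zimStem(downloaded);
}
```

Two different variants such as `simple` and `medicine` have different stems, so neither can delete the other regardless of which download finishes last.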
Jake Turner	d621761412	fix(KB): add re-embed and reset & rebuild opts to fix broken embeddings (#886)	2026-05-15 22:05:19 -07:00
Chris Sherwood	fe599173ef	fix(RAG): report ZIM ingestion progress in overall-file frame Before this change, the Active Downloads / Processing Queue UI showed the ingestion progress gauge jumping wildly during multi-batch ZIM ingestion (e.g. 5% → 88% → 27% → 5% → 56% → 36% over ~60 seconds for cooking SE). Each continuation batch is a separate BullMQ job, and `EmbedFileJob.handle()` reported `job.progress` in two different reference frames depending on where it was in the batch lifecycle: - During-batch (via the onProgress callback): 5% → 95% scaled across "% through this batch's chunks" - End-of-batch (just before dispatching the next): overwritten to `(nextOffset / totalArticles) * 100` — % through the whole file - Next continuation batch starts with progress = 5% explicitly, then climbs through the per-batch range again `listActiveJobs()` returns the latest active BullMQ job's progress. With GPU-accelerated ingestion completing a batch every ~4 seconds, the UI saw the jobId rotate constantly and the gauge whipsaw between the two reference frames. `totalArticles` was already wired through the EmbedFileJob params shape and used end-of-batch — but RagService never actually populated it, so any frame-scaling that depended on it silently fell back to the per-batch range. Three fixes together: 1. `ZIMExtractionService.extractZIMContent()` now returns `{ chunks: ZIMContentChunk[]; totalArticles: number }` instead of a raw chunks array, surfacing `archive.articleCount` to the caller. Single caller (rag_service) updated to destructure. 2. `RagService.processZimFile()` includes `totalArticles` in its result so `EmbedFileJob.dispatch()` can propagate it to the continuation batch (which the existing code already does via `totalArticles: totalArticles || result.totalArticles`). 3. 
`EmbedFileJob`'s onProgress callback scales the service-reported per-batch percent into the overall-file frame when `totalArticles` is known: `((batchOffset + (percent/100) * ZIM_BATCH_SIZE) / totalArticles) * 100`. Capped at 99% to leave room for the explicit 100% set at file completion. Falls back to the original 5-95% range for single-batch files (uploaded PDFs/txts) where totalArticles is undefined — the gauge then represents % through the only batch, which is what the UI expects for one-shot files. Validated on NOMAD8 (RX 6800, ROCm-accelerated nomic): - devdocs python (small, ~1500 articles): batch progressions seen monotonically across continuation jobIds: 1501@30% → 1510@33% → 1514@43% → 1518@52%. - ifixit (huge, ~100k articles): stays near 3% for the first many batches at offset 0..3000 — correct, the file is enormous. - wikipedia_en_medicine (large, ~70k articles): stays near 0-1% for the first batches — also correct. - Brief 0-5% blip on continuation handoff (the explicit `safeUpdateProgress(job, 5)` at batch start, before the first onProgress callback fires) — visible but quickly resolves to the overall-frame value. No more 5% ↔ 88% chaos.	2026-05-13 16:10:51 -07:00
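The frame conversion described in the commit above is a one-line formula plus a fallback. The sketch below uses the formula and the 99% cap from the commit message; the batch size of 50 is inferred from other commits in this log, and the linear mapping of the 5-95% fallback range is an assumption.

```typescript
// Articles per batch; 50 is inferred from neighbouring commits, not stated here.
const ZIM_BATCH_SIZE = 50;

// Converts a per-batch percent (0-100) into the overall-file frame.
function overallPercent(percent: number, batchOffset: number, totalArticles?: number): number {
  if (!totalArticles) {
    // Single-batch file: map 0-100 onto the original 5-95 range (assumed linear).
    return 5 + (percent / 100) * 90;
  }
  const articlesDone = batchOffset + (percent / 100) * ZIM_BATCH_SIZE;
  // Cap at 99 so the explicit 100% at file completion stays meaningful.
  return Math.min(99, (articlesDone / totalArticles) * 100);
}
```

Halfway through the batch at offset 100 of a 1000-article file, this reports 12.5%, and the value only increases as offsets advance.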
Chris Sherwood	4f82b69572	fix(System): validate StartedAt with fallback to tail:500 (PR review) Jake noted that `inspect.State.StartedAt` could be missing/malformed, which would land NaN inside `container.logs({ since, until })`. Add defensive validation that the parsed timestamp is finite and positive before using it, with a fallback to the previous tail:500 strategy (plus a warn log) when it isn't. Happy path is unchanged.	2026-05-13 15:07:48 -07:00
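The defensive validation in the commit above can be sketched as a pure function that picks between the startup window and the old tail strategy. The 300-second window and the `tail: 500` fallback come from the neighbouring commits; the function name, the return shape, and the use of Unix seconds for `since` / `until` are assumptions.

```typescript
// Either a bounded startup window or the previous tail-based strategy.
type LogWindow = { since: number; until: number } | { tail: number };

function ollamaLogWindow(startedAt: string | undefined): LogWindow {
  // Date.parse returns NaN for missing or malformed input.
  const startedSec = Math.floor(Date.parse(startedAt ?? "") / 1000);
  if (!Number.isFinite(startedSec) || startedSec <= 0) {
    return { tail: 500 }; // the real code also emits a warn log here
  }
  return { since: startedSec, until: startedSec + 300 };
}
```

Rejecting non-positive values also covers a zero-valued timestamp from a container that has never started.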
Chris Sherwood	0390e9584e	fix(System): correct AMD VRAM in Graphics card + harden log probe Two related fixes to make the System Information page reliably show real GPU info instead of misleading lspci BAR0 readings or N/A. 1. Generalize bogus-VRAM detection to AMD. Same root cause as #835 (NVIDIA showing 32 MB), this time for AMD: lspci parses the first PCI memory Region (BAR0, typically 1-16 MiB on Navi cards) as `vram`. On NOMAD8 (Threadripper 3960X + Radeon RX 6800), the System Information page showed "1 MB" instead of "16 GB". PR #850 fixed this for NVIDIA by clearing the bogus value and re-running the Ollama log probe; the check was vendor-gated to NVIDIA only. `isBogusNvidiaVram` becomes `isBogusDgpuVram` with an `isDiscreteGpuVendor` helper matching /nvidia|advanced micro devices|amd|ati/i. Same 256-MiB threshold — no real discrete GPU has less than that, while Intel iGPUs (which legitimately report small shared-memory VRAM via lspci) are left untouched. The probe gate condition is similarly renamed. 2. Read Ollama logs from the startup window, not tail:N. `getOllamaInferenceComputeFromLogs()` was reading the last 500 log lines and grepping for the "inference compute" line. That line is written once during Ollama's GPU discovery phase within seconds of startup. Under active embedding workloads we measured >1000 log lines/min, which pushes the line past any reasonable tail within minutes — at which point the probe returns null and the UI flips to "GPU Not Accessible" even though Ollama is happily using the GPU (size_vram > 0 in /api/ps). Switch from `tail: 500` to `since: containerStartedAt, until: containerStartedAt + 300s`. The 5-minute window is bounded regardless of container uptime and always captures Ollama's GPU discovery output. The inference-compute line is emitted in the first few seconds of startup, so 5 min is generous headroom. 
Validated on NOMAD8 (RX 6800, container uptime ~10 min with sustained ingestion that generated 6,345 log lines): Before: controllers[0]: { model: "Navi 21 ...", vram: 1 } After (bogus AMD VRAM cleared, log probe stale due to tail:500 churn): controllers[0]: { model: "Navi 21 ...", vram: null } gpuHealth: { status: "passthrough_failed" } -> UI shows "N/A" and the banner from PR #208 After (bogus cleared + log probe reads startup window): controllers[0]: { model: "AMD Radeon RX 6800", vram: 16384 } gpuHealth: { status: "ok", hasRocmRuntime: true, ollamaGpuAccessible: true } -> UI shows "16 GB", no banner Both branches of the fix exercise correctly: NVIDIA path unchanged (same code, just renamed identifiers), AMD path now triggers the probe and the probe reliably finds the GPU info regardless of container age.	2026-05-13 15:07:48 -07:00
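The vendor gate and threshold from the commit above fit in two small functions. The names `isBogusDgpuVram` and `isDiscreteGpuVendor`, the vendor pattern, and the 256 MiB threshold come from the commit message; the signatures and the constant names are assumptions.

```typescript
// Vendor pattern and threshold as quoted in the commit message.
const DISCRETE_GPU_VENDOR = /nvidia|advanced micro devices|amd|ati/i;
const MIN_PLAUSIBLE_DGPU_VRAM_MIB = 256;

function isDiscreteGpuVendor(vendor: string): boolean {
  return DISCRETE_GPU_VENDOR.test(vendor);
}

// A discrete GPU reporting less than 256 MiB is almost certainly a BAR0 reading.
function isBogusDgpuVram(vendor: string, vramMiB: number | null): boolean {
  return (
    vramMiB !== null &&
    isDiscreteGpuVendor(vendor) &&
    vramMiB < MIN_PLAUSIBLE_DGPU_VRAM_MIB
  );
}
```

One caveat with the pattern as quoted: the unanchored `ati` alternative also matches vendor strings such as "Intel Corporation" (inside "Corporation"). Whether the real helper guards against that, for example with word boundaries, is not stated in the commit.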
Jake Turner	63170df6f0	fix(DockerService): improve volume logic and documentation in forceReinstall	2026-05-13 14:26:59 -07:00
Chris Sherwood	fe51dc49b0	feat(GPU): auto-remediate nomad_ollama passthrough loss on admin boot (#755) After an update, container recreate, or docker daemon restart, nomad_ollama's HostConfig.DeviceRequests still lists the nvidia driver — but the NVIDIA Container Toolkit binding inside the container is torn. `nvidia-smi` returns "Failed to initialize NVML: Unknown Error" and Ollama silently falls back to CPU inference. PR #208 detects this and shows a banner with a "Fix: Reinstall AI Assistant" button. This change does that click automatically on admin boot. New provider GpuPassthroughRemediationProvider runs once on web env boot: 1. Skip when KV `ai.autoFixGpuPassthrough = false` (default true). 2. Skip when Docker has no `nvidia` runtime registered (AMD-only and CPU-only hosts unaffected). 3. Skip when nomad_ollama isn't running. 4. Exec `nvidia-smi --query-gpu=name --format=csv,noheader` inside the container with an 8-second timeout. If the output matches "Failed to initialize NVML", "Unknown Error", "TIMEOUT", or contains no alphabetic characters, treat the passthrough as broken. 5. On broken: call DockerService.forceReinstall('nomad_ollama'). The existing force-reinstall preserves the Ollama volume + installed models. Stamp `gpu.autoRemediatedAt` on success. 6. On healthy: log and exit. AMD passthrough_failed is intentionally not handled — its fix path is HSA override handling (PR #804) rather than a simple service recreate, and false positives during AMD startup log parsing would loop a recreate without fixing anything. Left to a follow-up if it proves to be a recurring AMD issue. Validated on NOMAD3 (RTX 5060, v1.32.0-rc.3 + this patch hot-applied): - After admin restart with passthrough healthy: log line "[GpuPassthroughRemediationProvider] NVIDIA passthrough healthy — no action needed." Provider exits cleanly without touching the container. 
- The broken-state branch hits the existing forceReinstall path, which was manually invoked earlier in the same session to fix this exact box and recovered GPU access in ~45s with model volume intact. No new failure mode is introduced — the auto-trigger removes the user click but the underlying operation is the same one the banner Fix button already calls. Closes #755.	2026-05-13 14:26:59 -07:00
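Step 4 of the provider described above, classifying the `nvidia-smi` output, can be sketched as a pure predicate. The marker strings and the "no alphabetic characters" rule come from the commit message; the function name is hypothetical, and the sketch assumes the caller substitutes the literal string "TIMEOUT" when the 8-second timeout elapses.

```typescript
// Marker strings quoted in the commit message.
const BROKEN_MARKERS = ["Failed to initialize NVML", "Unknown Error", "TIMEOUT"];

function isPassthroughBroken(nvidiaSmiOutput: string): boolean {
  if (BROKEN_MARKERS.some((marker) => nvidiaSmiOutput.includes(marker))) {
    return true;
  }
  // A healthy run prints a GPU name, so output with no letters is treated as broken.
  return !/[a-z]/i.test(nvidiaSmiOutput);
}
```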
Chris Sherwood	ba661a9da1	fix(RAG): pace continuation batches when embedding is CPU-only Stacks on top of the multi-batch ZIM ingestion fix. After that fix, multi-batch ZIM ingestion completes correctly — but on installs where Ollama runs the embedding model on CPU (currently every AMD ROCm install, since Ollama's ROCm build doesn't accelerate nomic-bert), the now-correct sustained 100% CPU saturation across all cores can starve other services hard enough to take the box down. Confirmed on a Threadripper 3960X + RX 6800 NOMAD: a wikipedia-class ZIM ingestion pegged 48 threads cleanly enough that sshd lost banner-exchange responsiveness and the box ultimately required a power-cycle. NVIDIA installs aren't affected — nomic-embed-text:v1.5 runs at 100% GPU on RTX 5060 (verified via `ollama ps`). Detect placement at runtime, pace only when needed: 1. OllamaService.isEmbeddingGpuAccelerated() — queries /api/ps and returns true if any loaded embedding model reports size_vram > 0. Fails closed (returns false) if /api/ps is unreachable or no embed model is loaded yet — over-pacing is safer than crashing. 2. EmbedFileJob.handle() — between batches (hasMoreBatches: true branch), check placement and `await setTimeout(CPU_BATCH_DELAY_MS)` when CPU-only. CPU_BATCH_DELAY_MS = 1000 (1s) — enough to give the OS scheduler a window for sshd/disk-collector/etc., small enough that total ingestion time isn't meaningfully affected (each batch is ~60-90s of work). GPU-accelerated installs see zero behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:52:02 -07:00
Chris Sherwood	e51ead616f	fix(queue): singleton QueueService to stop ioredis connection leak Every static call site instantiated a fresh QueueService (24 call sites across 8 files). QueueService.getQueue() opens a BullMQ Queue per call when not cached, and each Queue opens two ioredis connections (one for commands, one blocking). Because every static call constructed a new QueueService, its internal `queues` cache was never shared, every call opened a fresh pair, and none were ever closed. In normal operation this leaked a few connections per API hit. During multi-batch ZIM ingestion after PR #872 (where EmbedFileJob.handle() dispatches the next batch every 50 articles), every batch completion opened two new connections. On NOMAD3 at ~one batch every 4s sustained, that's ~1800 leaked connections/hour. Redis hit its 10,000-maxclient ceiling in ~5 hours and the admin container fell into an EPIPE flood that required a restart to recover. Fix: collapse QueueService to a true process-wide singleton with a private constructor and getInstance() accessor. The existing per-queue Map is now shared across every dispatch / status / cleanup call, so each queue's underlying connections are opened exactly once for the lifetime of the process. close() now clears the map so the singleton can be torn down cleanly if a graceful-shutdown hook is ever wired up. Validated on NOMAD3 (RTX 5060, v1.32.0-rc.4 + this patch hot-applied): under sustained multi-batch wikipedia_en_simple_all_nopic ingestion, connected_clients held flat at 21-22 across a 5-minute window. Pre-fix the same scenario climbed to 10,000+ over hours.	2026-05-13 13:48:21 -07:00
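The singleton shape described in this commit (private constructor, `getInstance()` accessor, shared per-queue map) might look like the sketch below. `FakeQueue` is a hypothetical stand-in for a BullMQ `Queue` so the example runs without Redis; it only counts how many queue objects get constructed.

```typescript
// Hypothetical stand-in for a BullMQ Queue; counts constructions instead of
// opening ioredis connections.
class FakeQueue {
  static opened = 0
  readonly queueName: string

  constructor(queueName: string) {
    this.queueName = queueName
    FakeQueue.opened += 1
  }
}

class QueueService {
  private static instance: QueueService | null = null
  private readonly queues = new Map<string, FakeQueue>()

  // Private constructor: callers must go through getInstance().
  private constructor() {}

  static getInstance(): QueueService {
    if (!QueueService.instance) {
      QueueService.instance = new QueueService()
    }
    return QueueService.instance
  }

  // Each named queue is created once and then reused for the process lifetime.
  getQueue(queueName: string): FakeQueue {
    let queue = this.queues.get(queueName)
    if (!queue) {
      queue = new FakeQueue(queueName)
      this.queues.set(queueName, queue)
    }
    return queue
  }

  // Clearing the map lets the singleton be torn down and rebuilt cleanly.
  close(): void {
    this.queues.clear()
  }
}
```

Because every call site shares one map, repeated `getQueue('x')` calls return the same object instead of opening a fresh connection pair each time.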
Chris Sherwood	7acc53444d	fix(RAG): unbreak multi-batch ZIM ingestion (jobId dedupe) EmbedFileJob.dispatch() uses a deterministic per-file jobId (sha256(filePath).slice(0,16)) for every batch. The parent batch's handle() calls EmbedFileJob.dispatch({ batchOffset }) before returning, so the parent is still in `active` state and locked when the continuation tries to enqueue. BullMQ silently returns the locked parent instead of creating a new job — and in newer BullMQ versions it does so without throwing, so the existing `catch (error.message.includes('job already exists'))` branch never fires. After the parent completes, its entry stays in the `completed` ZSET (held by `removeOnComplete: { count: 50 }`), continuing to trip jobId dedupe for any subsequent re-dispatch attempts. Result: every NOMAD install since 2026-02-08 (feat: zim content embedding) with a multi-batch ZIM (wikipedia, cooking SE, ifixit, lrnselfreliance, etc.) has only the first 50 articles indexed in qdrant. The RAG feature has been silently degraded for ~3 months — the user sees the file appear in their KB, qdrant accumulates ~50 articles' worth of vectors, and pagination quietly halts. No error surfaces anywhere. Fix: dispatch() skips the deterministic jobId for continuation batches (batchOffset > 0), letting BullMQ auto-generate a unique one so each batch stacks as an independent queue entry. Initial dispatches keep the deterministic jobId so re-triggering an install (UI re-click, sync rescan) remains idempotent. The existing 'job already exists' branch is now gated on !isContinuation, since by construction continuation batches will never hit dedupe. Validated on NOMAD8 (RX 6800 / Threadripper 3960X, rc.3 + this patch): devdocs_en_python (~1,500 chunks across multiple batches) correctly paginates end-to-end. admin.log shows the expected sequence of "Dispatched embedding job for file: X (continuation @ offset N)" followed by "Starting embedding process for: X (batch offset: N)" for each batch. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:38:23 -07:00
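The jobId rule from this fix can be illustrated with a small pure function. This is a sketch under assumptions: `jobOptionsFor` and `DispatchParams` are hypothetical names, and the real `dispatch()` passes more options to BullMQ than shown here.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical parameter shape for illustration.
interface DispatchParams {
  filePath: string
  batchOffset?: number
}

// Initial dispatches get a deterministic jobId so re-triggers stay idempotent.
// Continuation batches get none, so the queue can assign a unique id.
function jobOptionsFor(params: DispatchParams): { jobId?: string } {
  const isContinuation = (params.batchOffset ?? 0) > 0
  if (isContinuation) return {}
  const jobId = createHash('sha256').update(params.filePath).digest('hex').slice(0, 16)
  return { jobId }
}
```

Two initial dispatches for the same path yield the same 16-character id, while any call with `batchOffset > 0` yields no `jobId` at all.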
Chris Sherwood	5a72560929	fix(AI): rewrite RAG query on first follow-up (off-by-one in skip-rewrite threshold) The short-conversation skip in `rewriteQueryWithContext` used `userMessages.length <= 2`, which short-circuits both the very first turn AND the first follow-up. The follow-up is the moment the rewriter matters most — it's where pronouns and shorthand ("the bars", "how long does it last?") need to be resolved against earlier turns before the embedding search runs. With the rewriter skipped, RAG queries against the raw last message, scores nothing above the 0.3 threshold, and no context gets injected for that turn. The visible symptom is the assistant treating the first follow-up in any chat as a brand-new question — e.g. "great - they threw up 2 of the bars it looks like" answered as if it were a recipe-bars question, with no carry-forward of the prior chocolate- poisoning context. Threshold lowered to `< 2`: skip only when there's exactly one user message (nothing to rewrite from). From the first follow-up onward the rewriter runs, as originally intended before commit `96e5027`. Validated against `mistral-nemo:12b` on NOMAD3 by hot-patching the compiled controller and replaying the dog-chocolate scenario. Post-patch response correctly threads "3 Hershey's bars" from turn 1 into turn 2's answer; pre-patch (per reporter's screenshot) pivoted to peanut butter bar recipes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 20:34:30 -07:00
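The corrected threshold is easiest to see in isolation. The sketch below assumes a simple message shape (`ChatMessage`) and a helper name (`shouldSkipRewrite`) that are illustrative only; the comparison itself is the one named in the commit.

```typescript
// Hypothetical message shape for illustration.
interface ChatMessage {
  role: 'user' | 'assistant' | 'system'
  content: string
}

// Skip the rewrite only when there is a single user message, because then
// there is no earlier turn to resolve pronouns or shorthand against.
function shouldSkipRewrite(messages: ChatMessage[]): boolean {
  const userMessages = messages.filter((m) => m.role === 'user')
  return userMessages.length < 2 // previously `<= 2`, which also skipped the first follow-up
}
```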
Ben Gauger	e42f9331b6	fix(Downloads): treat missing Content-Type as octet-stream (#848) download.kiwix.org (and some of its mirrors) don't always set a Content-Type header on .zim responses. The MIME validator was reading `headers['content-type'] || ''`, then running each allowlist entry through `''.includes(...)` which is always false, so every download from those hosts was rejected with `MIME type is not allowed`. RFC 7231 §3.1.1.5 says missing Content-Type may be treated as application/octet-stream by the recipient, and that's already in every binary-content allowlist we use (ZIM, PMTILES, base assets). Default the missing case to that and the validator does the right thing. Strict callers that don't list octet-stream still reject as before.	2026-05-11 21:09:40 -07:00
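A minimal sketch of the defaulting behaviour, assuming a plain header map and a hypothetical `isMimeAllowed` helper (the project's validator has a different surrounding structure):

```typescript
type HeaderMap = Record<string, string | undefined>

// A missing or empty Content-Type is treated as application/octet-stream
// before the allowlist check runs.
function isMimeAllowed(headers: HeaderMap, allowlist: string[]): boolean {
  const contentType = headers['content-type'] || 'application/octet-stream'
  return allowlist.some((allowed) => contentType.includes(allowed))
}
```

With the old `|| ''` default, the first check in the test below would have failed for every allowlist; strict allowlists that omit octet-stream still reject a header-less response.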
Chris Sherwood	0a7bd9b11b	fix(AI): preserve semver tag in DB on AMD Ollama updates Closes #855. PR #804's AMD branch in `updateContainer()` overrode `newImage` to `ollama/ollama:rocm` and then persisted that literal string to `service.container_image` (line 1273). Two downstream consequences for every AMD user who clicked Update on AI Assistant: 1. Apps page (`apps.tsx`) extracts the displayed version from `container_image` and rendered the literal string "rocm". 2. `ContainerRegistryService.getAvailableUpdates()` parsed `currentTag = "rocm"`, which isn't semver, so `parseMajorVersion` returned NaN, the filter didn't reject newer tags by major-version, and `isNewerVersion` treated any future tag as newer. Result: the same update reappeared on every check, forever. Fix: separate "what we run" from "what we persist". `runtimeImage` holds the tag passed to `docker.pull()` and `createContainer()` (still `:rocm` for AMD), while `newImage` keeps the semver tag and is the value written to the DB. Surgical: 3 references renamed plus 1 declaration added. The install path (`_createContainer`) already had the right shape (runtime-only override, no DB write of the override), so this PR only touches `updateContainer`. Test plan: - `npm run typecheck` passes locally. - Manual repro on NOMAD2 (AMD HX 370 / 890M, rc.2): before fix, DB shows `container_image = ollama/ollama:rocm` after triggering an Ollama update via Settings > Apps; Apps page shows version "rocm"; `/api/system/services/check-updates` immediately re-reports the same update available. After fix, DB shows `container_image = ollama/ollama:<targetVersion>`; Apps page shows the semver; check- updates does not re-report the same update. - nomad_ollama container itself still runs the `:rocm` image (verified via `docker inspect`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 21:08:08 -07:00
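The split between "what we run" and "what we persist" can be sketched as a small pure function. `planOllamaUpdate` and `ImagePlan` are hypothetical names for illustration; only the two variable roles (`runtimeImage`, `newImage`) come from the commit.

```typescript
interface ImagePlan {
  runtimeImage: string // passed to docker pull / create
  newImage: string // written to service.container_image
}

// Hypothetical helper: AMD installs run the :rocm tag but still persist
// the semver tag so version display and update checks keep working.
function planOllamaUpdate(targetVersion: string, isAmd: boolean): ImagePlan {
  const newImage = `ollama/ollama:${targetVersion}`
  const runtimeImage = isAmd ? 'ollama/ollama:rocm' : newImage
  return { runtimeImage, newImage }
}
```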
Ben Gauger	525a1f1789	fix(System): correct NVIDIA VRAM in Graphics card (#835 ) PR #804 added pciutils to the admin image so AMD detection could fall back to lspci. Side effect for NVIDIA: si.graphics() now finds the card via lspci but reads the BAR0 region size (16-32 MiB on most NVIDIA cards) as VRAM, since nvidia-smi isn't installed in the admin image to enrich the result. A GTX 1050 Ti showed 32 MB instead of 4096. The nvidia-smi-via-Ollama and Ollama log probes already give the right number, but they only ran when graphics.controllers came back empty. Extend the trigger so they also run when the only entries are NVIDIA controllers reporting under 256 MiB (no real dGPU has that little). If the probes can't reach a value either (Ollama not installed, passthrough broken), VRAM now falls back to N/A instead of the bogus 32. Verified locally on an RTX 4070 Ti by simulating the container condition (lspci available, nvidia-smi unreachable). Before the fix: vram 32, model "AD104 [GeForce RTX 4070 Ti]". After: vram 12288, model "NVIDIA GeForce RTX 4070 Ti", from the Ollama inference-compute log line. Also confirmed the same result inside the actual admin Docker image.	2026-05-11 15:49:42 -07:00
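The extended trigger condition can be expressed as a predicate over the detected controllers. This is a sketch, assuming a simplified controller shape with VRAM in MiB; `shouldRunVramProbes` is a hypothetical name, not the function in the codebase.

```typescript
// Simplified controller shape for illustration; vram is in MiB.
interface GpuController {
  vendor: string
  vram: number | null
}

const MIN_PLAUSIBLE_NVIDIA_VRAM_MIB = 256

// Run the Ollama-based probes when nothing was detected, or when every entry
// is an NVIDIA controller reporting an implausibly small VRAM figure
// (the BAR0 region size rather than real VRAM).
function shouldRunVramProbes(controllers: GpuController[]): boolean {
  if (controllers.length === 0) return true
  return controllers.every(
    (c) => /nvidia/i.test(c.vendor) && (c.vram ?? 0) < MIN_PLAUSIBLE_NVIDIA_VRAM_MIB
  )
}
```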
Chris Sherwood	0b25638a3e	fix(AI): vendor-aware AMD HSA override + benchmark discrete-GPU detection Closes #810. ## Bug A: HSA_OVERRIDE_GFX_VERSION=11.0.0 was unconditional PR #804 set HSA_OVERRIDE_GFX_VERSION=11.0.0 for any AMD GPU. The inline comment claimed this was harmless on supported discrete cards (gfx1030 RX 6800, etc.) — empirically false. With the override, Ollama crashes during GPU discovery on gfx1030 and falls back to CPU silently. Affects every NOMAD user with an RX 6800 or other RDNA 2 discrete card. The correct value depends on the gfx version: - gfx1030, gfx1100, gfx1101, gfx1102: officially supported by ROCm — no override - gfx1031..gfx1036 (RDNA 2 variants + iGPUs like Rembrandt 680M): 10.3.0 - gfx1103, gfx1150, gfx1151 (Phoenix 780M, Strix 890M, Strix Halo): 11.0.0 ### Resolution chain in `_resolveAmdHsaOverride()` 1. KV `ai.amdHsaOverride` — manual override; accepts 'none' to disable, or a semver-style value to force. 2. Marker file `/app/storage/.nomad-amd-gfx` — written by install_nomad.sh based on lspci codename. Mapped to override via `_mapGfxToHsaOverride()`. 3. Default: `11.0.0` — preserves prior behavior so existing iGPU users (780M / 890M, the dominant AMD population today) don't regress on upgrade. Discrete RDNA 2 users on existing installs can opt out via `ai.amdHsaOverride='none'` and force-reinstall AI Assistant, OR re-run install_nomad.sh to refresh the marker file. The helper is used in both `createContainer` (initial install) and `updateContainer` (image update) paths, replacing the unconditional push. ## Bug B: BenchmarkService had no AMD discrete detection path `BenchmarkService.getHardwareInfo()` had three GPU detection fallbacks: 1. `si.graphics()` — empty inside Docker for AMD 2. nvidia-smi — NVIDIA only 3. AMD APU regex from CPU model — integrated only Result: AMD discrete cards (RX 6800, RX 7900 XTX, etc.) showed up as "GPU: Not detected" on the leaderboard despite ROCm working. 
Corrupts leaderboard data quality for that population. Fix: after the existing fallbacks, call `SystemService.getSystemInfo()` and read `graphics.controllers[0].model`. That path already handles AMD via the marker file + Ollama log probe added in PR #804, so we're reusing existing plumbing rather than duplicating detection logic.

## install_nomad.sh changes

The existing AMD detection block already runs lspci. Added a codename parse step that maps Navi 21/22/23/24, Rembrandt, Phoenix1/Phoenix2, Strix/Strix Point/Strix Halo, and Navi 31/32/33 to gfx versions, then writes `/opt/project-nomad/storage/.nomad-amd-gfx`. Unknown codenames write nothing (admin handles missing-marker case via the backward-compat default).

## Validation

Both bugs were originally surfaced and validated empirically on RX 6800 / gfx1030 / Ubuntu 24.04 + kernel 6.17 + ollama/ollama:rocm during the #810 filing. Validation grid from that report:

| Run | NOMAD Score | tok/s | GPU detected |
|---------------------------------------------|-------------|-------|----------------------|
| Pre-fix (Bug A active) | n/a | 0 | yes, but library=cpu |
| HSA_OVERRIDE removed, Bug B unfixed | 73.8 | 221.6 | "Not detected" |
| Both fixes hot-patched (this PR's behavior) | 73.7 | 216.0 | AMD Radeon RX 6800 |

Local checks: `npm run typecheck` clean, `npm run build` clean.	2026-05-05 12:11:56 -07:00
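The gfx-to-override mapping and the three-step resolution chain from Bug A can be sketched as two pure functions. The mapping values come from the commit message; the function signatures, the use of `null` for "no override", and the choice to fall back to the default when the marker holds an unrecognised gfx string are assumptions made for this example.

```typescript
// null means "do not set HSA_OVERRIDE_GFX_VERSION at all".
type HsaOverride = string | null

const NO_OVERRIDE_GFX = new Set(['gfx1030', 'gfx1100', 'gfx1101', 'gfx1102'])
const OVERRIDE_11_GFX = new Set(['gfx1103', 'gfx1150', 'gfx1151'])

// Returns undefined for gfx strings the mapping does not know about.
function mapGfxToHsaOverride(gfx: string): HsaOverride | undefined {
  if (NO_OVERRIDE_GFX.has(gfx)) return null
  if (/^gfx103[1-6]$/.test(gfx)) return '10.3.0'
  if (OVERRIDE_11_GFX.has(gfx)) return '11.0.0'
  return undefined
}

// Resolution order: manual KV value, then marker file contents, then default.
function resolveAmdHsaOverride(kvValue: string | null, markerGfx: string | null): HsaOverride {
  if (kvValue === 'none') return null
  if (kvValue) return kvValue
  if (markerGfx) {
    const mapped = mapGfxToHsaOverride(markerGfx.trim())
    // Assumption: an unrecognised marker value falls through to the default.
    if (mapped !== undefined) return mapped
  }
  return '11.0.0' // backward-compatible default for existing iGPU installs
}
```

In this sketch a caller would add the environment variable only when the result is non-null, which is how an RX 6800 (gfx1030) ends up with no override.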
Chris Sherwood	63282565a9	fix(Maps): render notes in marker popup when populated Closes #796. The maps API has accepted and persisted `notes` on map markers since PR #770, but the marker popup component still rendered name only and ignored the field. Now the popup shows a notes block beneath the name when it's populated, with whitespace preserved and long text wrapped. Threaded `notes` through the read path: - `api.listMapMarkers` / `api.createMapMarker` response types - `MapMarker` interface in `useMapMarkers` and the data.map projection - `MapComponent`'s selectedMarker popup The create/update UI is unchanged — users still set notes via the API or DB directly, matching the issue's stated scope. A marker entry with empty/whitespace-only notes renders the same as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 10:38:56 -07:00
Ben Gauger	03ab614f99	fix(Maps): send filename instead of full path to delete endpoint	2026-05-05 10:37:12 -07:00
chriscrosstalk	1ad898bc8b	fix(UI): four fixes for the System Update page (#827 ) Closes #826. 1. Heading and subtext now read from `versionInfo` state (which the Check Again mutation already populates) instead of the server-rendered `props.system`. Previously the card kept showing "System Up to Date / Your system is running the latest version!" alongside the new `Latest Version` row + Start Update button after a successful recheck. Status icon also switched to `versionInfo` for consistency. 2. The pulling-state heading rendered the lowercase status enum (`pulling`, `pulled`, ...) and relied on a Tailwind `capitalize` class for the visible glyph. Screen readers and other accessible-name consumers got the lowercase value with no transform applied. Replaced with a `STAGE_LABELS` map so visual + accessible names match. 3. The sidecar (install/sidecar-updater/update-watcher.sh) writes `complete` for ~5s, then resets the status file to `idle`. The SPA could miss that window across the admin container restart, leaving the page parked on its last observed progress percentage indefinitely while the upgrade was actually finished on disk. A `seenAdvancedStageRef` now records whether the session ever observed an advanced stage; a later poll seeing `idle` is treated as the missed completion, and the page reloads as advertised in step 3 of the on-screen process. Reset on each Start Update. 4. Toggling Enable Early Access now triggers a recheck on success, so the eligible-version list updates immediately instead of requiring a manual Check Again click. Single file touched: admin/inertia/pages/settings/update.tsx. Typecheck (tsc --noEmit) passes; static UI changes verified in source. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 10:33:18 -07:00
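The missed-completion handling in item 3 can be sketched as a small state tracker. This is an illustration only: `CompletionTracker` is a hypothetical class standing in for the `seenAdvancedStageRef` logic, and treating every status other than `idle` as an "advanced" stage is an assumption.

```typescript
// Hypothetical stand-in for the seenAdvancedStageRef logic in update.tsx.
class CompletionTracker {
  private seenAdvancedStage = false

  // Called on each Start Update click.
  reset(): void {
    this.seenAdvancedStage = false
  }

  // Returns true when the page should treat the update as finished and reload.
  observe(stage: string): boolean {
    if (stage === 'complete') return true
    if (stage === 'idle') {
      // idle after progress was seen means the short `complete` window was missed.
      return this.seenAdvancedStage
    }
    // Assumption: any non-idle status counts as an advanced stage.
    this.seenAdvancedStage = true
    return false
  }
}
```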
Jake Turner	0fdf31c2e4	docs: update release notes	2026-05-04 19:30:06 +00:00
Jake Turner	0ddcfe9011	fix(System): self-heal stale updateAvailable flag after sidecar-driven update (#825)	2026-05-04 11:54:56 -07:00
chriscrosstalk	a7dbee55c4	feat(Content): custom ZIM library sources with pre-seeded mirrors (#593 ) * feat(content): add custom ZIM library sources with pre-seeded mirrors Users reported slow download speeds from the default Kiwix CDN. This adds the ability to browse and download ZIM files from alternative Kiwix mirrors or self-hosted repositories, all through the GUI. - Add "Custom Libraries" button next to "Browse the Kiwix Library" - Source dropdown to switch between Default (Kiwix) and custom libraries - Browsable directory structure with breadcrumb navigation - 5 pre-seeded official Kiwix mirrors (US, DE, DK, UK, Global CDN) - Built-in mirrors protected from deletion - Downloads use existing pipeline (progress, cancel, Kiwix restart) - Source selection persists across page loads via localStorage - Scrollable directory browser (600px max) with sticky header - SSRF protection on all custom library URLs Closes #576 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(content): recognize Wikipedia downloads from mirror sources When Wikipedia is downloaded via a custom mirror instead of the default Kiwix server, the completion callback now matches by filename instead of exact URL. This ensures the Wikipedia selector correctly shows "Installed" status and triggers old-version cleanup regardless of which mirror was used. Also handles the case where no Wikipedia selection exists yet (file downloaded before visiting the selector), creating the record automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(ZIM): use cheerio for custom mirror directory parsing * fix(ZIM): use URL constructor for more robust joining --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jake Turner <jturner@cosmistack.com>	2026-05-04 11:30:59 -07:00
dependabot[bot]	d66eaa3d42	build(deps): bump picomatch in /admin Bumps and [picomatch](https://github.com/micromatch/picomatch). These dependencies needed to be updated together. Updates `picomatch` from 4.0.3 to 4.0.4 - [Release notes](https://github.com/micromatch/picomatch/releases) - [Changelog](https://github.com/micromatch/picomatch/blob/master/CHANGELOG.md) - [Commits](https://github.com/micromatch/picomatch/compare/4.0.3...4.0.4) Updates `picomatch` from 2.3.1 to 2.3.2 - [Release notes](https://github.com/micromatch/picomatch/releases) - [Changelog](https://github.com/micromatch/picomatch/blob/master/CHANGELOG.md) - [Commits](https://github.com/micromatch/picomatch/compare/4.0.3...4.0.4) --- updated-dependencies: - dependency-name: picomatch dependency-version: 4.0.4 dependency-type: indirect - dependency-name: picomatch dependency-version: 2.3.2 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-04 10:50:13 -07:00
Jake Turner	d81b66bb14	chore(deps): pin all deps to exact versions	2026-05-04 17:45:18 +00:00
Kenneth Brewer	a2f3a84446	feat(maps): show map coordinates on mouse move (#786 ) * feat: Updated the map to show the coordinates as the user moves the cursor over the map. Changed the cursor to a crosshairs to make it easier to place map markers. * Moved the scale unit control to its own component file for easier maintenance. Enhanced the behavior of the coordinate display on the map to not display when over the on screen controls, and the navigation bar. Added a toggle to turn off the coordinate display if the user doesn't wish to see it. Intentionally left the coordinate display when over a map marker so that the coordinates of the map marker can be estimated. In the future I intend to add the coordinates of a map marker when the map marker is clicked so that behavior may change in the future. --------- Co-authored-by: Kenneth Brewer <kennethbrewer3@protonmail.com>	2026-05-03 14:11:03 -07:00
chriscrosstalk	822b94629c	fix(UI): Country Picker UX polish + auto-refresh stored files (#817 ) Three UX issues from manual testing of #780 on NOMAD3. 1. Slider was unusable for multi-step zoom changes `setLoading(true)` fired immediately on every selection or maxzoom change, which disabled the slider until the request returned. Even with the 400ms debounce delaying the network call, the UI was locked the whole time. User couldn't drag through zoom levels to find the right one. Fix: bump debounce to 1500ms, move `setLoading(true)` inside the setTimeout so it only flips after the debounce expires. Slider stays interactive throughout the wait. Slider `disabled` now only ties to `downloading` (active extract dispatch), not `loading` (preflight in flight). The existing requestId stale-safe pattern handles concurrent changes. 2. Newly-downloaded maps didn't show in Stored Map Files until manual refresh `props.maps.regionFiles` is rendered server-side and passed through Inertia props; without a partial reload it stayed stale until the user navigated away and back. Fix: watch `useDownloads({ filetype: 'map' })` count via a ref. When the count drops (a download finished), trigger `router.reload({ only: ['maps'] })` to refresh just the maps prop. Existing pattern from elsewhere in the codebase. 3. Country picker didn't surface already-downloaded countries When a user re-opened "Choose Countries" after downloading UK, UK appeared unchecked with no indication it was already on disk. Fix: pass installed pmtiles filenames into the modal as a prop; parse with regex `^([a-z]{2})_[\w-]+_z\d+\.pmtiles$` to extract country codes from single-country extracts (matching MapService.buildRegionSlug's iso2 lowercase slug pattern). Render an "Installed" badge on those countries with a tooltip explaining they're re-selectable for redownload at a different zoom. Group / custom multi-country extracts don't reverse-map cleanly from filename and are skipped here. Could be a follow-up if useful. 
Files: admin/inertia/components/CountryPickerModal.tsx - SINGLE_COUNTRY_FILENAME_RE: iso2 + flexible date + zoom - installedFilenames prop with default [] - installedCountrySet derivation via useMemo - "Installed" badge rendering on country list rows - Debounce: 400ms -> 1500ms; setLoading inside setTimeout - Slider disabled: only on `downloading` admin/inertia/pages/settings/maps.tsx - import useEffect/useRef - destructure activeMapDownloads from useDownloads - useEffect on download count drop -> router.reload({ only: ['maps'] }) - pass installedFilenames to CountryPickerModal All three fixes tested end-to-end on NOMAD3.	2026-05-03 14:06:56 -07:00
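The filename parsing behind the "Installed" badge can be sketched with the regex quoted in the commit. The helper below is a plain function rather than the `useMemo` derivation in the component, and the example filenames in the note are hypothetical.

```typescript
// Regex as quoted in the commit: iso2 code, flexible middle segment, zoom suffix.
const SINGLE_COUNTRY_FILENAME_RE = /^([a-z]{2})_[\w-]+_z\d+\.pmtiles$/

// Collects the country codes of single-country extracts found on disk.
function installedCountrySet(filenames: string[]): Set<string> {
  const codes = new Set<string>()
  for (const filename of filenames) {
    const code = SINGLE_COUNTRY_FILENAME_RE.exec(filename)?.[1]
    if (code) codes.add(code)
  }
  return codes
}
```

For a hypothetical listing such as `gb_2026-05-01_z10.pmtiles` and `world.pmtiles`, only `gb` lands in the set; files that do not follow the single-country pattern are skipped.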
0xGlitch	27cd803090	feat(Maps): regional map downloads via go-pmtiles extract (#780 ) * feat(maps): add regional map downloads via go-pmtiles extract * address Copilot review feedback on PR #780 - auto-refresh preflight on selection/maxzoom change with 400ms debounce and requestId stale-safety so the confirm button no longer requires a two-step "Estimate Size" -> "Start Download" dance - safeUpdateProgress helper replaces fire-and-forget updateProgress().catch() pattern so cancelled-job errors (code -1) can't surface as unhandled rejections - gate world basemap source on worldBasemapReady - when ensureWorldBasemap() fails we already delete world.pmtiles, so emitting the source was producing 404s on every tile request - verify go-pmtiles binary SHA256 at image build time; upstream doesn't ship a checksums file so per-arch hashes are pinned as build args with a regenerate note when bumping PMTILES_VERSION	2026-05-03 13:47:53 -07:00
Chris Sherwood	360e7a0af4	feat(content-updates): show size, surface downloads in Active Downloads Content Updates had three UX problems that compounded: 1. No size column, so users had to guess how big an update would be before clicking Update All. Upstream /api/v1/resources/check-updates doesn't return size, so CollectionUpdateService now enriches each update with a Content-Length HEAD request in parallel (5s timeout, non-fatal on failure — the row just renders an em-dash). 2. Small ZIM updates (1-8 MB) never appeared in Active Downloads. Two causes, both fixed: handleApply / handleApplyAll didn't invalidate the download-jobs query after dispatching, and useDownloads idled at 30s between polls — enough for a fast job to dispatch, download, and get cleaned up by removeOnComplete before the next refetch. 3. applyUpdate didn't forward title / totalBytes to RunDownloadJob, so any update that did briefly surface in Active Downloads had no label and no byte-count progress, just a filename and a percentage. It now passes both (matching zim_service's dispatch pattern). Also parallelized applyAllUpdates so dispatching five updates doesn't serialize five sequential BullMQ round-trips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 13:17:07 -07:00
cuyua9	bb1834a364	fix(UI): wire map file delete confirmation to API (#732) Co-authored-by: cuyua9 <cuyua9@users.noreply.github.com>	2026-05-03 12:49:06 -07:00
Kenneth Brewer	0836d84bb2	docs: added notes field info to the map pin API reference (#803)	2026-04-28 21:55:11 -07:00
chriscrosstalk	5924056502	feat(AI): improved AMD GPU acceleration for Ollama via ROCm + HSA override (#804 ) * feat(AI): re-enable AMD GPU acceleration for Ollama via ROCm + HSA override Re-enables AMD GPU support that was disabled in `77f1868` pending validation of the ROCm image and device discovery. Validation done 2026-04-28 on a Minisforum UM890 Pro (Ryzen 9 PRO 8945HS + Radeon 780M iGPU) — Ollama correctly offloaded all model layers to the iGPU when the container was started with /dev/kfd + /dev/dri passthrough and HSA_OVERRIDE_GFX_VERSION=11.0.0. On llama3.2:1b, GPU inference ran at 51.83 tok/s vs 33.16 tok/s on CPU (same hardware, same prompt) — a 1.56x speedup confirmed by Ollama logs showing "load_tensors: offloaded 17/17 layers to GPU". Changes ------- docker_service.ts - Restore _discoverAMDDevices() (simplified — pass /dev/dri as a directory entry, mirroring `docker run --device /dev/dri` behavior, instead of the prior brittle hardcoded card0/renderD128 fallback that broke on systems where the AMD GPU enumerates as card1+). - Restore the AMD branch in _createContainer(): - Switches Ollama image to ollama/ollama:rocm - Mounts /dev/kfd + /dev/dri via Devices - Sets HSA_OVERRIDE_GFX_VERSION=11.0.0 (required for unsupported-but-RDNA3 iGPUs like gfx1103; harmless on supported discrete cards) - KV opt-out via ai.amdGpuAcceleration (default on) - Mirror the AMD branch in updateContainer(): - Lifted GPU detection above docker.pull() so AMD updates pull :rocm rather than the standard :targetVersion tag (per-version ROCm tags aren't always published) - Replaces stale HSA_OVERRIDE in the inspect-captured env on update, so containers built before this PR pick up the current value system_service.ts - New getOllamaInferenceComputeFromLogs() — parses Ollama startup log line "msg=\"inference compute\" ... library=CUDA\|ROCm ..." which Ollama emits for both NVIDIA and AMD. Catches silent CPU fallback (e.g. 
NVML death after update, or HSA_OVERRIDE failure) that the prior nvidia-smi exec probe couldn't detect. - gpuHealth refactored to use log parsing as the primary probe for both vendors, with nvidia-smi exec retained as the NVIDIA-only secondary path for hardware enrichment when log parsing has no startup line yet. - AMD path uses gpu.type KV value (persisted by DockerService._detectGPUType) + ai.amdGpuAcceleration opt-out to determine hasRocmRuntime. types/system.ts - GpuHealthStatus extended additively: hasRocmRuntime + optional gpuVendor. types/kv_store.ts - New ai.amdGpuAcceleration boolean (default-on). settings/models.tsx, settings/system.tsx - passthrough_failed banner copy now reads vendor from gpuHealth.gpuVendor ("an AMD GPU" vs "an NVIDIA GPU"). Same Fix button hits the same force-reinstall endpoint, which now configures AMD correctly. install_nomad.sh - AMD detection in verify_gpu_setup() upgraded from a strict-positive "ROCm not currently available" message to "ROCm acceleration will be configured automatically." Also tightens the lspci match to display controller classes (avoids false positives from AMD CPU host bridges, matching the same fix already in DockerService._detectGPUType). Auto-remediation ---------------- Issue #755 proposes auto-remediation when gpuHealth.status flips to passthrough_failed (today the user has to click "Fix: Reinstall AI Assistant"). When that PR lands, AMD coverage falls out for free since this PR uses the same passthrough_failed status code via the shared gpuHealth machinery — #755's guard will need to flip from hasNvidiaRuntime === true to (hasNvidiaRuntime \|\| hasRocmRuntime). Closes #124 (AMD GPU support). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(AI): detect AMD GPU presence inside admin container via marker file The admin container doesn't have lspci installed, and AMD GPUs don't register a Docker runtime the way NVIDIA does — so DockerService._detectGPUType() and SystemService.gpuHealth had no way to know an AMD GPU was present. The previous implementation fell through to lspci, which silently failed inside the admin container, leaving gpu.type unset and gpuHealth stuck at 'no_gpu' even on systems with an AMD GPU. (NVIDIA worked because Docker registers the nvidia runtime, which is reachable via dockerInfo.Runtimes from any container.) Discovered while testing the AMD acceleration patch on a Minisforum UM890 Pro: the AMD branch in _createContainer() never fired because _detectGPUType() returned 'none' even on a host with a working /dev/kfd. Fix --- install_nomad.sh writes the host-detected GPU type ('nvidia' \| 'amd') to a marker file in the storage volume the admin container already bind-mounts: /opt/project-nomad/storage/.nomad-gpu-type → /app/storage/.nomad-gpu-type DockerService._detectGPUType() reads the marker as a secondary probe (after the Docker runtime check) — covers AMD detection from inside the container without requiring lspci or a /dev bind mount. SystemService falls back to the marker file when KV gpu.type is empty so the System page reflects AMD presence even before the user installs AI Assistant for the first time. (Without this, the page would say 'no_gpu' until Ollama was installed, even on hosts with an AMD GPU detected at install time.) Verified on NOMAD6 (UM890 Pro, Ubuntu 24.04, 780M iGPU): with the marker file in place and admin restarted, the patch's AMD branch fires correctly on Force Reinstall AI Assistant. Resulting nomad_ollama runs ollama/ollama:rocm with /dev/kfd + /dev/dri passthrough and HSA_OVERRIDE_GFX_VERSION=11.0.0; Ollama logs show 'library=ROCm compute=gfx1100 ... type=iGPU'. 
NOMAD's in-product benchmark on the same hardware climbed from 33.8 tok/s (CPU) to 57.3 tok/s (GPU), a 1.69x speedup, with TTFT dropping from 148ms to 66ms. Migration for existing AMD installs ----------------------------------- Users on an existing NOMAD install with an AMD GPU have no marker file (the install script writes it only during a fresh install). Two paths get them on the GPU: 1. Re-run install_nomad.sh: writes the marker, no other side effects 2. Manually: echo amd | sudo tee /opt/project-nomad/storage/.nomad-gpu-type Either then triggers AMD detection on the next AI Assistant install/reinstall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(AI): pull ollama/ollama:rocm separately when AMD branch overrides image The pull-if-missing logic in _createContainer ran against service.container_image (the DB-pinned tag, e.g. ollama/ollama:0.18.2). The AMD branch then overrode finalImage to ollama/ollama:rocm, but if that image wasn't already local, the container creation step failed with "no such image: ollama/ollama:rocm". Caught while validating on NOMAD2 (Ryzen AI 9 HX 370 + Radeon 890M / RDNA 3.5): the prior end-to-end test on NOMAD6 had silently passed because the rocm image was already pulled there from an earlier sidecar test, masking the bug. Fix: inside the AMD branch, after setting finalImage to ollama/ollama:rocm, run a parallel _checkImageExists + docker.pull dance for the new tag. Also confirmed via this validation: the same HSA_OVERRIDE_GFX_VERSION=11.0.0 override works on the 890M (gfx1150 / RDNA 3.5). Ollama logs report 'library=ROCm compute=gfx1100 description="AMD Radeon 890M Graphics"' and inference runs at 51.68 tok/s (matching the existing X1 Pro published tile of 51.7 tok/s on the same hardware class). RDNA 3 (780M, gfx1103) and RDNA 3.5 (890M, gfx1150) both use the same override successfully. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * build(Dockerfile): include pciutils for lspci gpu detection fallback --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Jake Turner <jturner@cosmistack.com>	2026-04-28 21:53:56 -07:00
Ryanba	322087c1b7	fix(UI): improve global map banner display logic (#702)	2026-04-28 20:55:25 -07:00
Henry Estela	cc789c1863	fix(RAG): add start button in kb modal and ensure restart policy exists (#700) Adds a check to RAG health to make sure nomad_qdrant is online. If it is not, the user is blocked from clicking any buttons in the KB modal until they click the start qdrant button and let the container start. There is a new file, qdrant_restart_policy_provider.ts, which tries to ensure that the restart policy always exists for the nomad_qdrant container, even though the policy should already have been set when the container was created.	2026-04-27 22:26:46 -07:00
Kenneth Brewer	fe57d59868	docs: add map markers to API reference (#783) Co-authored-by: Kenneth Brewer <kennethbrewer3@protonmail.com>	2026-04-27 22:21:06 -07:00
John Scherer	269c7ce695	fix(API): accept notes, marker_type, and position on markers endpoints (#770) The VineJS validators in createMarker and updateMarker silently dropped fields not in their schema. The MapMarker model and DB include notes and marker_type, and GET responses return them, but POST and PATCH would not persist them. updateMarker additionally did not accept latitude/longitude, so markers could not be repositioned via the API after creation. - Add notes and marker_type to both validators and model assignments. - Add latitude/longitude to the update validator. - Add coordinate range validation on both endpoints. Closes #768	2026-04-27 22:11:19 -07:00
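The coordinate range check mentioned in the commit above can be illustrated with a small standalone sketch. This is not the project's VineJS schema; `MarkerPosition` and `validatePosition` are hypothetical names, and the only assumption is the usual geographic bounds (latitude in [-90, 90], longitude in [-180, 180]).

```typescript
// Hypothetical shape for illustration; the real validators are VineJS schemas.
interface MarkerPosition {
  latitude: number
  longitude: number
}

// Returns a list of problems; an empty list means the position is acceptable.
function validatePosition(pos: MarkerPosition): string[] {
  const errors: string[] = []
  // Number.isFinite also rejects NaN and +/-Infinity.
  if (!Number.isFinite(pos.latitude) || pos.latitude < -90 || pos.latitude > 90) {
    errors.push('latitude must be between -90 and 90')
  }
  if (!Number.isFinite(pos.longitude) || pos.longitude < -180 || pos.longitude > 180) {
    errors.push('longitude must be between -180 and 180')
  }
  return errors
}
```

Returning a list rather than throwing lets a caller report both out-of-range fields in a single response.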
chriscrosstalk	b194dfa136	fix(RAG): pass num_ctx and truncate to Ollama embed call (#763) Some Ollama installs ship nomic-embed-text:v1.5 with the embedding model's default num_ctx=2048, which the RAG chunker (sized for ~1500 tokens of estimated content with ratio=2 chars/token) can exceed on dense PDFs. The result is `400 the input length exceeds the context length` from /api/embed, which then hits the OpenAI-compatible fallback (which also errors), and surfaces as a BadRequestError. Pass options.num_ctx=8192 (nomic-embed-text v1.5's RoPE-extrapolated max) and truncate=true (silent truncation safety net) on every embed call so we don't depend on the local modelfile defaults. Reported on #756 by @NC4WD; same root cause as #369 and #670 which were closed without an actual fix. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:43:10 -07:00
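As a rough sketch of what the commit above describes, the request body sent to Ollama's /api/embed would carry both settings explicitly. `EmbedRequest` and `buildEmbedRequest` are hypothetical names for illustration, not the project's code; the values 8192 and `true` come from the commit message.

```typescript
// Hypothetical request shape, limited to the fields the commit message mentions.
interface EmbedRequest {
  model: string
  input: string[]
  truncate: boolean
  options: { num_ctx: number }
}

// Context size named in the commit message for nomic-embed-text v1.5.
const EMBED_NUM_CTX = 8192

// Builds the body so the call does not rely on the local modelfile defaults.
function buildEmbedRequest(model: string, input: string[]): EmbedRequest {
  return {
    model,
    input,
    truncate: true, // safety net: over-long input is truncated instead of rejected
    options: { num_ctx: EMBED_NUM_CTX },
  }
}

const body = buildEmbedRequest('nomic-embed-text:v1.5', ['some dense PDF chunk'])
console.log(JSON.stringify(body))
```

Setting the options on every call keeps behavior the same across installs whose modelfiles ship different defaults.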
chriscrosstalk	00b4b26224	fix(API): skip compression for Server-Sent Events (#798) * fix(stream): skip compression for Server-Sent Events The global compression middleware (added in v1.31.0-rc.2) buffers response writes to determine encoding, which collapses per-token streaming into a single block delivered after generation completes. This broke the AI chat streaming UX from v1.31.0-rc.2 onward: text no longer appears progressively as the model generates it, only at the end. Adds a filter to compression() that returns false when the response Content-Type is text/event-stream. Other responses still go through the default compression filter (compressible types are still compressed; e.g. text/html via Brotli). Reproduced on NOMAD3 v1.31.1: before fix, all SSE chunks for a 1B model arrive within 10ms of each other after the model finishes. After fix, tokens arrive at ~150ms intervals as they're generated on a 12B model, with no Content-Encoding header on the SSE response. Verified on the same host that /home still returns Content-Encoding: br for HTML responses. Closes #781. Reported and bisected by @toasterking (works in v1.31.0-rc.1, broken from v1.31.0-rc.2 onward). * fix(stream): use any for filter params to match existing as-any pattern The compression library types its filter as (req: Request, res: Response) expecting Express types, but AdonisJS passes raw IncomingMessage/ServerResponse which is why the surrounding middleware uses `as any` casts at the call site. The IncomingMessage/ServerResponse types I added are runtime-correct but fail tsc against the library's declared types. Drop the typed import in favor of `any` parameters, which matches how the existing `compress(request.request as any, response.response as any, ...)` call resolves the same mismatch.	2026-04-27 19:00:31 -07:00
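The filter behavior described in the commit above might look roughly like the following. This sketch is dependency-free and does not reproduce the project's middleware: `HeaderReader`, `CompressionFilter`, and `skipSseFilter` are hypothetical names, and the default filter is passed in as a parameter rather than taken from the compression library.

```typescript
// Minimal response shape the filter needs (assumption: only getHeader is read).
interface HeaderReader {
  getHeader(name: string): string | number | string[] | undefined
}

type CompressionFilter = (req: unknown, res: HeaderReader) => boolean

// Hypothetical helper: wraps a default filter and opts SSE responses out of compression.
function skipSseFilter(defaultFilter: CompressionFilter): CompressionFilter {
  return (req, res) => {
    const contentType = String(res.getHeader('Content-Type') ?? '')
    // Buffering an event stream defeats per-token delivery, so never compress it.
    if (contentType.toLowerCase().startsWith('text/event-stream')) return false
    // Everything else keeps whatever the default filter decides.
    return defaultFilter(req, res)
  }
}

// Usage with a stub response and a default filter that always says "compress".
const stubRes = (type?: string): HeaderReader => ({
  getHeader: (name) => (name.toLowerCase() === 'content-type' ? type : undefined),
})
const sseAwareFilter = skipSseFilter(() => true)
console.log(sseAwareFilter(null, stubRes('text/event-stream')), sseAwareFilter(null, stubRes('text/html')))
```

Matching on the prefix rather than the full header value also covers values such as `text/event-stream; charset=utf-8`.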
chriscrosstalk	3bacd14dbd	feat(content-manager): add sortable file size column (#698) Closes #685 Content Manager now surfaces the on-disk size of each ZIM file alongside title/summary, and lets users sort the list by Size or Title. Defaults to Size descending so the largest files are visible first. - ZimService.list() now stats each file and returns size_bytes - Content Manager table adds a formatted Size column (via formatBytes) - Sortable headers for Title and Size with asc/desc toggle	2026-04-27 18:49:51 -07:00

383 Commits