project-nomad

mirror of https://github.com/Crosstalk-Solutions/project-nomad.git synced 2026-05-24 05:15:05 +02:00

Author	SHA1	Message	Date
Jake Turner	102998ec96	refactor(KB): move FileWarning to shared types/rag following existing convention	2026-05-20 10:16:00 -07:00
Chris Sherwood	563f86a22b	feat(KB): conditional warnings A + B on Stored Files (RFC #883 §6) Surfaces two silent failure modes that the prior binary "any-chunks-in-Qdrant ⇒ embedded" check could not distinguish from healthy ingestion: - Warning A — Zero-chunk file (file_size > 100 MB, chunks = 0) Fires on video-only / image-only ZIMs (`lrnselfreliance_en_all`, TED talks, etc.) that the pipeline completes "successfully" with no extractable text. AI Assistant literally cannot reference these. - Warning B — Partial-embed stall (chunks < 50% of expected from the ratio registry). Surfaces the simple_wiki "266 of 600,000 chunks" case observed during NOMAD1 ingestion testing — previously these looked identical to fully-completed embeds in the UI. Both warnings render only when their condition is met (silent by default; noisy only on real problems). Base is `feat/kb-ratio-registry` (#891) because Warning B's "expected chunks" estimate comes from `KbRatioRegistry.estimateChunks()`. GitHub fast-forwards to `rc` once #891 merges. - `app/utils/kb_warning_decision.ts` — pure `decideWarnings(inputs)` with thresholds (`100 MB`, `0.5×`) as exported constants. 10 unit tests cover the healthy case, both warnings, the under/at/over boundary, the registry-miss suppression, and the video-only registry case (`expectedChunks: 0` correctly skips Warning B). - `RagService.computeFileWarnings()` — single Qdrant scroll tallies chunks per source, filesystem walk fills in zero-chunk files, ratio registry estimates the expectation, decision function emits. - New endpoint `GET /api/rag/file-warnings` returns `Record<source, FileWarning[]>` (sources with no warnings are omitted, so the frontend can `warnings[source] ?? []` for clean defaults). - KB modal: warnings render inline under the file name as amber-tinted pills. Polled every 30s alongside the existing health check. - Warning C — chunks skipped due to length. PR #890 (#881 fix) prevents the silent drop at the embed boundary, so the underlying condition shouldn't fire anymore. If we still want to surface "we truncated N chunks to fit", that needs separate `skipped_count` tracking in EmbedFileJob — a Phase 2 follow-up. - Suppressing Warning B during active mid-ingestion. The user can cross- reference the Processing Queue to know it's in-flight; suppressing warnings while a job runs would mask real stalls where the job died mid-batch. Will revisit when per-card status is wired through. - Use of `kb_ingest_state.chunks_embedded` (#888) as the chunk count source. This PR uses Qdrant scroll directly so it can land independently of #888. - 10 new unit tests on `decideWarnings`, all pass - Type-check clean - Hot-patch + browser smoke test deferred until #891 lands (the ratio registry needs to exist in the DB for `estimateChunks()` to return non-null estimates — without it, only Warning A fires which is still useful but Warning B stays dormant)	2026-05-20 10:16:00 -07:00
Chris Sherwood	e68c753e39	feat(KB): surface embedding-disk estimate in curated tier-change modal (RFC #883 §1) When a user picks a tier in TierSelectionModal, show how much additional disk space the AI Assistant will need if the new ZIMs are indexed, plus a policy-aware footer explaining whether they'll auto-index (Always) or wait for opt-in (Manual). Estimates consume #891's KbRatioRegistry via a new POST /api/rag/estimate-batch endpoint. Backend - New POST /api/rag/estimate-batch route + RagController.estimateBatch - VineJS schema accepting array of {filename, sizeBytes}, capped at 500 - KbRatioRegistry.estimateBatch aggregates via the existing prefix-match lookup, returns {totalChunks, totalBytes, hasUnknown} - New BYTES_PER_CHUNK_ON_DISK constant (~8 KB: 3 KB vector + ~3 KB chunk text + ~2 KB payload/index overhead). Tunable; will be replaced by Phase 4 self-calibration once we have real measurements. - Controller normalizes incoming filenames via path.basename so callers that send full paths or URLs still match registry prefixes correctly. Frontend - api.estimateEmbeddingBatch() client method - TierSelectionModal: when localSelectedSlug is set, resolve the tier's resources (incl. inherited tiers), POST to /estimate-batch, and render a new info block with the +~X GB figure + ingest-policy copy. Also fetches rag.defaultIngestPolicy so the same block surfaces whether indexing will fire automatically or wait for the user. - resourceFilename() helper extracts the basename from the resource URL so the registry lookup hits the right prefix regardless of mirror. Tests - 4 new cases in tests/unit/kb_ratio_lookup.spec.ts covering the estimateBatch aggregator: standard sum, unknown-flagging, video-only ZIM (0 chunks but known, hasUnknown stays false), empty input. Stacks on feat/kb-ratio-registry (#891) — consumes the registry table seeded by that PR. Once #891 merges to rc, this PR auto-rebases. Out of scope for this PR (deferred to follow-ups): - Per-batch opt-in checkbox (RFC §1's '☑ Also index these for AI') needs a per-batch policy override path and is a separate PR - Guardrail modal at 50 GB / 10% free / 6 hr thresholds (RFC §7) is also separate; this PR is informational, not gating - Time-to-embed estimate awaits a chunks-per-second metric per host	2026-05-20 10:16:00 -07:00
chriscrosstalk	8eb8809154	feat(KB): Always/Manual ingest policy toggle (RFC #883 §1/§4) (#894 ) * feat(KB): per-file ingest state machine (Phase 1 of RFC #883) Adds a persistent state machine for AI knowledge-base ingestion so the scanner can distinguish "fully indexed", "user opted out", "failed", and "stalled" from each other — none of which were derivable from the prior binary "any chunks in Qdrant ⇒ embedded" check. ## What lands - New table `kb_ingest_state` keyed by `file_path` with enum state column (`pending_decision \| indexed \| browse_only \| failed \| stalled`). Independent of `installed_resources` so it covers both curated downloads and manually-uploaded KB files. - New KV key `rag.defaultIngestPolicy` (string: `Always \| Manual`). Registered now but not consumed yet — JIT prompt + wizard step land in Phase 3 of the RFC. - `EmbedFileJob.handle` writes state on terminal outcomes: - Success (final batch) → `indexed` + chunks count - `UnrecoverableError` → `failed` + error message - Retryable errors are left to BullMQ's existing retry path - `scanAndSyncStorage` swaps the binary qdrant check for a state-aware decision tree (see `decideScanAction`). Existing installs auto-backfill on first scan: files with chunks in Qdrant but no state row become `indexed`; new files start as `pending_decision`. - `deleteFileBySource` drops the state row last, so removed files disappear entirely instead of leaving an orphan that the next scan would re-dispatch into nothing. ## What does NOT land here - Ratio registry (separate PR) — needed for partial-stall detection and cost estimates, but a separable concern. - #880 follow-up initial-progress anchor (separate tiny PR). - Phase 2 UI (status pill, per-card actions, conditional warnings). - Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal). - PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All / Reset & Rebuild would also want to set state, but #886 isn't merged yet; that wiring goes in a follow-up once #886 lands. ## Target This is forward work for v1.40.0 (RFC #883). Branching off `rc` because that's the current latest base and post-GA Jake will sync rc→dev; a retarget at PR-open time is a fast-forward if requested. ## Tests - 9 new unit tests for `decideScanAction` covering all five states plus the no-row / chunks-present / chunks-missing combinations - Type-check clean - Smoke-tested end-to-end on NOMAD3 via hot-patch: - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all came back `indexed` on first scan - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`) came back `pending_decision` and was correctly re-dispatched (Bull deduped to its historical `:completed` jobId — bgauger's #886 fix drains that) - Delete hook: deleting a KB upload via `DELETE /api/rag/files` removed both the disk file and the state row * feat(KB): Always/Manual ingest policy toggle (RFC #883 §1/§4) Activates the `rag.defaultIngestPolicy` KV registered in Phase 1 (#888) so users on a fresh install (or anyone who picks Manual mode) no longer get every new ZIM auto-dispatched to the embed pipeline. ## Stacks on #888 This PR's base is `feat/kb-ingest-state-machine` (#888). The state machine has to be in place for the decision function to be policy-aware; GitHub will fast-forward the base to `rc` once #888 merges. ## Backend changes - `decideScanAction` now takes a `policy: 'Always' \| 'Manual'` argument (defaults to `Always` for backward compatibility). - New `ScanAction` kind: `create_pending`. Manual mode records that the scanner has seen a new file (so the UI can surface a per-card Index affordance later) without dispatching an EmbedFileJob. - `scanAndSyncStorage` reads the KV and passes it through. The scan-result log line now includes the active policy and a `waiting on user` count for Manual-mode hits. - `rag.defaultIngestPolicy` added to `SETTINGS_KEYS` so it's reachable through the existing `GET/PATCH /api/system/settings` surface — no new endpoint. ## Frontend changes - New section in the KB panel between "Why upload" and "Processing Queue": "Auto-index new content for AI? [Always \| Manual]" — segmented radio with copy explaining the 5-10× disk multiplier. Default Always. - `useQuery('ingestPolicy')` reads the current value; clicking the inactive option mutates and shows a notification confirming the new behavior. ## Tests - 14 unit tests on `decideScanAction` (was 9) — split into Always-mode cases (preserves Phase 1's contract) and Manual-mode cases (`create_pending`, `pending_decision → skip`, etc.). - Type-check clean. - Hot-patch + browser verification deferred until #888 lands; the state machine smoke-tested cleanly on NOMAD3 in #888's PR, and this PR's decision-tree changes are exhaustively unit-tested. ## RFC open question §3 — policy-change re-trigger Switching Manual → Always doesn't auto-dispatch existing `pending_decision` rows immediately. The next scan re-evaluates and dispatches them under the new policy. This matches the RFC's "treat the switch as I've- thought-about-it" instinct for the guardrail; full guardrail implementation lands in Phase 3 task 14. --------- Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>	2026-05-20 10:16:00 -07:00
Chris Sherwood	43ca584b6c	feat(KB): status pill + last-activity timestamp on Processing Queue (RFC #883 §5/§10) Each in-flight (or stuck) embedding job gets a colored health pill, relative-activity timestamp, and chunk counter so users can tell at a glance whether ingestion is making progress. ## Health states - 🟢 Active — last batch < 2 min ago - 🟡 Slow — last batch 2-5 min ago (CPU-paced multi-batch ingestion lives here naturally; not always a problem) - 🔴 Stalled — last batch > 5 min ago (likely real problem) - ⚪ Waiting — queued, no batch started yet - 🔴 Failed — job recorded failed status ## What lands - New backend util `kb_job_health.ts` with pure `computeJobHealth(input)` decision function. Time-based thresholds (2 min / 5 min) inlined as constants. 9 unit tests pin the boundaries. - `EmbedJobWithProgress` gains `lastBatchAt`, `startedAt`, `chunks` — already set by `EmbedFileJob.handle` on every batch transition, just not previously surfaced through `listActiveJobs`. - Frontend `kb_job_health_display.ts` maps each status to a Tailwind dot color, label, and aria-label so backend and UI stay in sync. - `ActiveEmbedJobs.tsx` renders the pill, "last activity Xs ago", and chunk counter above each progress bar. Adds a manual Refresh button and "Last updated Xs ago" line — the existing 2s/30s auto-poll cadence in `useEmbedJobs` is left intact. - Live tick at 5s keeps the relative timestamps current without re-fetching from the API. ## Not in scope - Per-card Cancel / Retry / Un-index — separate Phase 2 PR - Conditional warnings A/B/C — separate Phase 2 PR - Computing throughput rate (chunks/min) — needs ratio registry consumer (Phase 2 follow-up); for now the pill answers the "is it stuck?" question directly without a rate estimate.	2026-05-20 10:16:00 -07:00
Chris Sherwood	159d57b2af	feat(KB): ratio registry for disk + time estimates (Phase 1B of RFC #883 ) Foundation for the cost estimates and partial-stall detection that Phase 2 will surface. No consumers yet — this PR just lays the table, the seed rows, and the lookup helper so subsequent UI work has estimates available without a per-ZIM benchmark. ## What lands - New table `kb_ratio_registry` (pattern, chunks_per_mb, sample_count, notes). Migration creates and seeds heuristic defaults from the RFC appendix: devdocs (1100/MB), Wikipedia variants (270/MB), iFixit (50/MB), Stack Exchange Q&A (200/MB), video-only ZIMs (0), plus a catch-all fallback at 100/MB. - `KbRatioRegistry` model with static `lookup()` and `estimateChunks()`. - Pure helper `kb_ratio_lookup.ts` doing longest-prefix-match — a specific entry (`wikipedia_en_simple_`) overrides a broader one (`wikipedia_en_`). 9 unit tests covering the lookup boundary. - `sample_count` starts at 0 (heuristic seed) and is reserved for Phase 4 self-calibration to increment as observed ZIMs update each row. ## Not in scope - Self-calibration on successful ingestion (Phase 4) - UI consumers — Warning B (partial-embed stall) and the storage budget meter / time estimates land in Phase 2. ## Tested - Type-check clean - 9 unit tests pass for `findChunksPerMb` and `estimateChunkCount` - Migration applied on NOMAD3 via hot-patch; 9 seed rows verified in DB	2026-05-20 10:16:00 -07:00
chriscrosstalk	743549ca74	feat(KB): per-file ingest state machine (Phase 1 of RFC #883 ) (#888 ) Adds a persistent state machine for AI knowledge-base ingestion so the scanner can distinguish "fully indexed", "user opted out", "failed", and "stalled" from each other — none of which were derivable from the prior binary "any chunks in Qdrant ⇒ embedded" check. ## What lands - New table `kb_ingest_state` keyed by `file_path` with enum state column (`pending_decision \| indexed \| browse_only \| failed \| stalled`). Independent of `installed_resources` so it covers both curated downloads and manually-uploaded KB files. - New KV key `rag.defaultIngestPolicy` (string: `Always \| Manual`). Registered now but not consumed yet — JIT prompt + wizard step land in Phase 3 of the RFC. - `EmbedFileJob.handle` writes state on terminal outcomes: - Success (final batch) → `indexed` + chunks count - `UnrecoverableError` → `failed` + error message - Retryable errors are left to BullMQ's existing retry path - `scanAndSyncStorage` swaps the binary qdrant check for a state-aware decision tree (see `decideScanAction`). Existing installs auto-backfill on first scan: files with chunks in Qdrant but no state row become `indexed`; new files start as `pending_decision`. - `deleteFileBySource` drops the state row last, so removed files disappear entirely instead of leaving an orphan that the next scan would re-dispatch into nothing. ## What does NOT land here - Ratio registry (separate PR) — needed for partial-stall detection and cost estimates, but a separable concern. - #880 follow-up initial-progress anchor (separate tiny PR). - Phase 2 UI (status pill, per-card actions, conditional warnings). - Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal). - PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All / Reset & Rebuild would also want to set state, but #886 isn't merged yet; that wiring goes in a follow-up once #886 lands. ## Target This is forward work for v1.40.0 (RFC #883). Branching off `rc` because that's the current latest base and post-GA Jake will sync rc→dev; a retarget at PR-open time is a fast-forward if requested. ## Tests - 9 new unit tests for `decideScanAction` covering all five states plus the no-row / chunks-present / chunks-missing combinations - Type-check clean - Smoke-tested end-to-end on NOMAD3 via hot-patch: - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all came back `indexed` on first scan - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`) came back `pending_decision` and was correctly re-dispatched (Bull deduped to its historical `:completed` jobId — bgauger's #886 fix drains that) - Delete hook: deleting a KB upload via `DELETE /api/rag/files` removed both the disk file and the state row Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>	2026-05-20 10:16:00 -07:00
Chris Sherwood	5e2c599c3e	fix(ZIM): preserve co-existing Wikipedia corpora on cleanup (#884 ) onWikipediaDownloadComplete was deleting every file whose name starts with `wikipedia_en_`, treating distinct corpora (simple, medicine, wikivoyage, climate_change, etc.) as competing versions of the same selection slot. Whichever wiki finished second silently wiped the other from disk. Match by filename stem instead — strip the trailing `_YYYY-MM(-DD).zim` date suffix and only delete files with the same stem as the new download. Different release dates of the same variant still get cleaned up; distinct variants are preserved. Extracted the predicate to `app/utils/zim_filename.ts` so the boundary is covered by unit tests (8 cases incl. the #884 repro scenario).	2026-05-20 10:16:00 -07:00
Ben Gauger	3abf338767	fix(Downloads): treat missing Content-Type as octet-stream (#848 ) download.kiwix.org (and some of its mirrors) don't always set a Content-Type header on .zim responses. The MIME validator was reading `headers['content-type'] \|\| ''`, then running each allowlist entry through `''.includes(...)` which is always false, so every download from those hosts was rejected with `MIME type is not allowed`. RFC 7231 §3.1.1.5 says missing Content-Type may be treated as application/octet-stream by the recipient, and that's already in every binary-content allowlist we use (ZIM, PMTILES, base assets). Default the missing case to that and the validator does the right thing. Strict callers that don't list octet-stream still reject as before.	2026-05-20 10:16:00 -07:00
Jake Turner	2b8c847295	fix(Downloads): remove duplicate err listnr and improv Range req stability	2026-04-21 14:26:28 -07:00
Aaron Bird	8d026da06e	fix(downloads): stage downloads to .tmp to prevent Kiwix loading partial files Downloads are now written to `filepath + '.tmp'` and atomically renamed to the final path only on successful completion. Kiwix globs for `*.zim` and ZimService filters `.endsWith('.zim')`, so `.tmp` files are invisible to both during download. The same staging applies to `.pmtiles` map files. Ref #372 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 14:26:28 -07:00
Jake Turner	c8cb79a3a5	fix: prevent ZIM corrupt file crash and deduplicate Ollama download logs (#741 ) Corrupted ZIM files cause a native C++ abort (ZimFileFormatError) that bypasses JS try/catch and kills the process. Add magic number validation before passing files to @openzim/libzim so invalid files are skipped gracefully. Also deduplicate Ollama download progress broadcasts — both within a single stream (skip unchanged percentages) and across concurrent callers (share one download promise per model). Co-authored-by: aegisman <aegis@manicode.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 14:26:28 -07:00
Jake Turner	9e3828bcba	feat(Kiwix): migrate to Kiwix library mode for improved stability (#622 )	2026-04-03 14:26:50 -07:00
arn6694	ed8918f2e9	feat(rag): add EPUB file support for Knowledge Base uploads (#257 )	2026-04-03 14:26:50 -07:00
Jake Turner	b8cf1b6127	fix(disk): correct storage display by fixing device matching and dedup mount entries	2026-03-20 11:46:10 -07:00
Chris Sherwood	b0b8f07661	fix: improve download reliability with stall detection, failure visibility, and Wikipedia status tracking Three bugs caused downloads to hang, disappear, or leave stuck spinners: 1. Wikipedia downloads that failed never updated the DB status from 'downloading', leaving the spinner stuck forever. Now the worker's failed handler marks them as failed. 2. No stall detection on streaming downloads - if data stopped flowing mid-download, the job hung indefinitely. Added a 5-minute stall timer that triggers retry. 3. Failed jobs were invisible to users since only waiting/active/delayed states were queried. Now failed jobs appear with error indicators in the download list. Closes #364, closes #216 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 11:46:10 -07:00
Jake Turner	58b106f388	feat: support for updating services	2026-03-11 14:08:09 -07:00
Jake Turner	8726700a0a	feat: zim content embedding	2026-02-08 13:20:10 -08:00
Jake Turner	1923cd4cde	feat(AI): chat suggestions and assistant settings	2026-02-01 07:24:21 +00:00
Jake Turner	243f749090	feat: [wip] native AI chat interface	2026-01-31 20:39:49 -08:00
Jake Turner	50174d2edb	feat(RAG): [wip] RAG capabilities	2026-01-31 20:39:49 -08:00
Jake Turner	9ec514e145	fix(Zim): storage path	2025-12-07 20:18:58 -08:00
Jake Turner	5205d5909d	feat: disk info collection	2025-12-07 19:13:43 -08:00
Jake Turner	2ff7b055b5	fix(Kiwix): initial download and setup	2025-12-07 16:04:41 -08:00
Jake Turner	7569aa935d	feat: background job overhaul with bullmq	2025-12-06 23:59:01 -08:00
Jake Turner	95ba0a95c9	fix: download util improvements	2025-12-05 18:16:23 -08:00
Jake Turner	dd4e7c2c4f	feat: curated zim collections	2025-12-05 15:47:22 -08:00
Jake Turner	12a6f2230d	feat: [wip] new maps system	2025-11-30 22:29:16 -08:00

28 Commits