mirror of
https://github.com/Crosstalk-Solutions/project-nomad.git
synced 2026-05-23 12:55:05 +02:00
Adds a persistent state machine for AI knowledge-base ingestion so the scanner can distinguish "fully indexed", "user opted out", "failed", and "stalled" from each other. None of these were derivable from the prior binary "any chunks in Qdrant ⇒ embedded" check.

## What lands

- New table `kb_ingest_state` keyed by `file_path` with enum state column (`pending_decision | indexed | browse_only | failed | stalled`). Independent of `installed_resources` so it covers both curated downloads and manually-uploaded KB files.
- New KV key `rag.defaultIngestPolicy` (string: `Always | Manual`). Registered now but not consumed yet; JIT prompt + wizard step land in Phase 3 of the RFC.
- `EmbedFileJob.handle` writes state on terminal outcomes:
  - Success (final batch) → `indexed` + chunks count
  - `UnrecoverableError` → `failed` + error message
  - Retryable errors are left to BullMQ's existing retry path
- `scanAndSyncStorage` swaps the binary qdrant check for a state-aware decision tree (see `decideScanAction`). Existing installs auto-backfill on first scan: files with chunks in Qdrant but no state row become `indexed`; new files start as `pending_decision`.
- `deleteFileBySource` drops the state row last, so removed files disappear entirely instead of leaving an orphan that the next scan would re-dispatch into nothing.

## What does NOT land here

- Ratio registry (separate PR): needed for partial-stall detection and cost estimates, but a separable concern.
- #880 follow-up initial-progress anchor (separate tiny PR).
- Phase 2 UI (status pill, per-card actions, conditional warnings).
- Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal).
- PR #886's bulk-action hookup: `_deletePointsBySource` / Re-embed All / Reset & Rebuild would also want to set state, but #886 isn't merged yet; that wiring goes in a follow-up once #886 lands.

## Target

This is forward work for v1.40.0 (RFC #883). Branching off `rc` because that's the current latest base and post-GA Jake will sync rc→dev; a retarget at PR-open time is a fast-forward if requested.

## Tests

- 9 new unit tests for `decideScanAction` covering all five states plus the no-row / chunks-present / chunks-missing combinations
- Type-check clean
- Smoke-tested end-to-end on NOMAD3 via hot-patch:
  - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all came back `indexed` on first scan
  - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`) came back `pending_decision` and was correctly re-dispatched (Bull deduped to its historical `:completed` jobId; bgauger's #886 fix drains that)
  - Delete hook: deleting a KB upload via `DELETE /api/rag/files` removed both the disk file and the state row

Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>

| | | |
|---|---|---|
| .. | | |
| migrations | | |
| seeders | | |
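
The state-aware decision tree described in the commit message can be pictured roughly as follows. This is a minimal sketch, not the repository's actual `decideScanAction`: the function name `decideScanActionSketch`, the `ScanInput` and `ScanAction` shapes, and the handling of `browse_only`, `failed`, `stalled`, and "`indexed` but chunks missing" are all assumptions made for illustration. Only the two no-row cases (chunks present ⇒ backfill to `indexed`, no chunks ⇒ `pending_decision`) and the re-dispatch of `pending_decision` files come from the description above.

```typescript
// The five states named for the kb_ingest_state enum column.
type IngestState = 'pending_decision' | 'indexed' | 'browse_only' | 'failed' | 'stalled'

// Hypothetical result shape: what the scanner should do with one file.
type ScanAction =
  | { kind: 'skip' }
  | { kind: 'dispatch'; writeState?: IngestState }
  | { kind: 'backfill'; writeState: 'indexed' }

// Hypothetical input shape: the persisted state (null = no row) and
// whether Qdrant currently holds any chunks for the file.
interface ScanInput {
  state: IngestState | null
  hasChunks: boolean
}

function decideScanActionSketch({ state, hasChunks }: ScanInput): ScanAction {
  if (state === null) {
    // From the commit message: chunks but no row => backfill as indexed;
    // a brand-new file starts as pending_decision and is dispatched.
    return hasChunks
      ? { kind: 'backfill', writeState: 'indexed' }
      : { kind: 'dispatch', writeState: 'pending_decision' }
  }
  switch (state) {
    case 'pending_decision':
      return { kind: 'dispatch' } // matches the smoke test's re-dispatch
    case 'indexed':
      // Assumption: re-embed if the chunks have gone missing.
      return hasChunks ? { kind: 'skip' } : { kind: 'dispatch' }
    case 'browse_only':
      return { kind: 'skip' } // assumption: the user opted out
    case 'failed':
      return { kind: 'skip' } // assumption: no automatic retry
    case 'stalled':
      return { kind: 'dispatch' } // assumption: try again
  }
  // Unreachable when every state is handled; the compiler checks this.
  const unreachable: never = state
  throw new Error(`unhandled state: ${String(unreachable)}`)
}
```

Keeping the decision a pure function of (state, chunks present) is what makes the "five states plus no-row / chunks-present / chunks-missing" test matrix cheap to cover without touching Qdrant or the database.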