28 Commits

Author SHA1 Message Date
Jake Turner
102998ec96 refactor(KB): move FileWarning to shared types/rag following existing convention 2026-05-20 10:16:00 -07:00
Chris Sherwood
563f86a22b feat(KB): conditional warnings A + B on Stored Files (RFC #883 §6)
Surfaces two silent failure modes that the prior binary
"any-chunks-in-Qdrant ⇒ embedded" check could not distinguish from
healthy ingestion:

- **Warning A — Zero-chunk file** (file_size > 100 MB, chunks = 0)
  Fires on video-only / image-only ZIMs (`lrnselfreliance_en_all`,
  TED talks, etc.) that the pipeline completes "successfully" with no
  extractable text. AI Assistant literally cannot reference these.

- **Warning B — Partial-embed stall** (chunks < 50% of expected from
  the ratio registry). Surfaces the simple_wiki "266 of 600,000 chunks"
  case observed during NOMAD1 ingestion testing — previously these
  looked identical to fully-completed embeds in the UI.

Both warnings render only when their condition is met (silent by
default; noisy only on real problems).
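The two conditions above can be sketched as a pure decision function. This is a rough illustration only: the type names, warning identifiers, and the reading of "100 MB" as binary megabytes are assumptions, not the actual contents of `app/utils/kb_warning_decision.ts`. The sketch also lets both warnings fire together for a large zero-chunk file with a positive estimate; the real function may handle that case differently.

```typescript
// Illustrative sketch only: type and constant names are assumptions, not the
// actual exports of app/utils/kb_warning_decision.ts.
type FileWarning = 'zero_chunks' | 'partial_embed'

interface WarningInputs {
  fileSizeBytes: number
  chunks: number
  // null models a ratio-registry miss (no estimate available)
  expectedChunks: number | null
}

const ZERO_CHUNK_MIN_BYTES = 100 * 1024 * 1024 // "100 MB", assumed to be binary megabytes
const PARTIAL_EMBED_RATIO = 0.5

function decideWarnings(inputs: WarningInputs): FileWarning[] {
  const warnings: FileWarning[] = []

  // Warning A: a large file that produced no chunks at all.
  if (inputs.fileSizeBytes > ZERO_CHUNK_MIN_BYTES && inputs.chunks === 0) {
    warnings.push('zero_chunks')
  }

  // Warning B: fewer than half the expected chunks. Skipped when the registry
  // has no estimate (null) or expects zero chunks (video-only content).
  const expected = inputs.expectedChunks
  if (expected !== null && expected > 0 && inputs.chunks < expected * PARTIAL_EMBED_RATIO) {
    warnings.push('partial_embed')
  }

  return warnings
}
```

With these assumptions, the simple_wiki case (266 chunks against an estimate of 600,000) yields only the partial-embed warning, and a file sitting at exactly 50% of the estimate yields none.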

Base is `feat/kb-ratio-registry` (#891) because Warning B's "expected
chunks" estimate comes from `KbRatioRegistry.estimateChunks()`. GitHub
fast-forwards to `rc` once #891 merges.

## What lands

- `app/utils/kb_warning_decision.ts` — pure `decideWarnings(inputs)`
  with thresholds (`100 MB`, `0.5×`) as exported constants. 10 unit
  tests cover the healthy case, both warnings, the under/at/over
  boundary, the registry-miss suppression, and the video-only registry
  case (`expectedChunks: 0` correctly skips Warning B).
- `RagService.computeFileWarnings()` — single Qdrant scroll tallies
  chunks per source, filesystem walk fills in zero-chunk files,
  ratio registry estimates the expectation, decision function emits.
- New endpoint `GET /api/rag/file-warnings` returns
  `Record<source, FileWarning[]>` (sources with no warnings are
  omitted, so the frontend can `warnings[source] ?? []` for clean
  defaults).
- KB modal: warnings render inline under the file name as amber-tinted
  pills. Polled every 30s alongside the existing health check.

## Not in scope

- Warning C — chunks skipped due to length. PR #890 (#881 fix) prevents
  the silent drop at the embed boundary, so the underlying condition
  shouldn't fire anymore. If we still want to surface "we truncated
  N chunks to fit", that needs separate `skipped_count` tracking in
  EmbedFileJob — a Phase 2 follow-up.
- Suppressing Warning B during active mid-ingestion. The user can cross-
  reference the Processing Queue to know it's in-flight; suppressing
  warnings while a job runs would mask real stalls where the job died
  mid-batch. Will revisit when per-card status is wired through.
- Use of `kb_ingest_state.chunks_embedded` (#888) as the chunk count
  source. This PR uses Qdrant scroll directly so it can land
  independently of #888.

## Tests

- 10 new unit tests on `decideWarnings`, all pass
- Type-check clean
- Hot-patch + browser smoke test deferred until #891 lands (the ratio
  registry needs to exist in the DB for `estimateChunks()` to return
  non-null estimates — without it, only Warning A fires which is still
  useful but Warning B stays dormant)
2026-05-20 10:16:00 -07:00
Chris Sherwood
e68c753e39 feat(KB): surface embedding-disk estimate in curated tier-change modal (RFC #883 §1)
When a user picks a tier in TierSelectionModal, show how much additional
disk space the AI Assistant will need if the new ZIMs are indexed, plus
a policy-aware footer explaining whether they'll auto-index (Always) or
wait for opt-in (Manual). Estimates consume #891's KbRatioRegistry via a
new POST /api/rag/estimate-batch endpoint.

Backend
- New POST /api/rag/estimate-batch route + RagController.estimateBatch
- VineJS schema accepting array of {filename, sizeBytes}, capped at 500
- KbRatioRegistry.estimateBatch aggregates via the existing prefix-match
  lookup, returns {totalChunks, totalBytes, hasUnknown}
- New BYTES_PER_CHUNK_ON_DISK constant (~8 KB: 3 KB vector + ~3 KB chunk
  text + ~2 KB payload/index overhead). Tunable; will be replaced by
  Phase 4 self-calibration once we have real measurements.
- Controller normalizes incoming filenames via path.basename so callers
  that send full paths or URLs still match registry prefixes correctly.
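
A minimal sketch of how the batch aggregation described in the Backend list might fit together. Only `BYTES_PER_CHUNK_ON_DISK` and the `{totalChunks, totalBytes, hasUnknown}` result shape come from the commit message. The `RatioLookup` callback, the `baseName` stand-in for `path.basename`, and the assumption that `totalBytes` is simply `totalChunks` times the per-chunk constant are all hypothetical.

```typescript
// Illustrative sketch: names other than BYTES_PER_CHUNK_ON_DISK and the
// {totalChunks, totalBytes, hasUnknown} result shape are assumptions.
interface BatchItem {
  filename: string
  sizeBytes: number
}

interface BatchEstimate {
  totalChunks: number
  totalBytes: number
  hasUnknown: boolean
}

// chunks-per-MB lookup; null means the registry has no matching entry
type RatioLookup = (basename: string) => number | null

const BYTES_PER_CHUNK_ON_DISK = 8 * 1024 // ~8 KB per chunk, per the commit message
const BYTES_PER_MB = 1024 * 1024

// Stand-in for path.basename that also copes with URLs.
function baseName(pathOrUrl: string): string {
  const parts = pathOrUrl.split(/[\\/]/)
  return parts[parts.length - 1] || pathOrUrl
}

function estimateBatch(items: BatchItem[], lookup: RatioLookup): BatchEstimate {
  let totalChunks = 0
  let hasUnknown = false
  for (const item of items) {
    const chunksPerMb = lookup(baseName(item.filename))
    if (chunksPerMb === null) {
      hasUnknown = true // unknown files contribute nothing to the totals
      continue
    }
    totalChunks += Math.round((item.sizeBytes / BYTES_PER_MB) * chunksPerMb)
  }
  return { totalChunks, totalBytes: totalChunks * BYTES_PER_CHUNK_ON_DISK, hasUnknown }
}
```

Note that a video-only ZIM resolves to a known ratio of 0, so it adds nothing to the total but does not set `hasUnknown`.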

Frontend
- api.estimateEmbeddingBatch() client method
- TierSelectionModal: when localSelectedSlug is set, resolve the tier's
  resources (incl. inherited tiers), POST to /estimate-batch, and render
  a new info block with the +~X GB figure + ingest-policy copy. Also
  fetches rag.defaultIngestPolicy so the same block surfaces whether
  indexing will fire automatically or wait for the user.
- resourceFilename() helper extracts the basename from the resource URL
  so the registry lookup hits the right prefix regardless of mirror.

Tests
- 4 new cases in tests/unit/kb_ratio_lookup.spec.ts covering the
  estimateBatch aggregator: standard sum, unknown-flagging, video-only
  ZIM (0 chunks but known, hasUnknown stays false), empty input.

Stacks on feat/kb-ratio-registry (#891) and consumes the registry table
seeded by that PR. Once #891 merges to rc, GitHub retargets this PR's base to rc.

Out of scope for this PR (deferred to follow-ups):
- Per-batch opt-in checkbox (RFC §1's '☑ Also index these for AI') needs
  a per-batch policy override path and is a separate PR
- Guardrail modal at 50 GB / 10% free / 6 hr thresholds (RFC §7) is also
  separate; this PR is informational, not gating
- Time-to-embed estimate awaits a chunks-per-second metric per host
2026-05-20 10:16:00 -07:00
chriscrosstalk
8eb8809154 feat(KB): Always/Manual ingest policy toggle (RFC #883 §1/§4) (#894)
* feat(KB): per-file ingest state machine (Phase 1 of RFC #883)

Adds a persistent state machine for AI knowledge-base ingestion so the
scanner can distinguish "fully indexed", "user opted out", "failed", and
"stalled" from each other — none of which were derivable from the prior
binary "any chunks in Qdrant ⇒ embedded" check.

## What lands

- New table `kb_ingest_state` keyed by `file_path` with enum state column
  (`pending_decision | indexed | browse_only | failed | stalled`).
  Independent of `installed_resources` so it covers both curated downloads
  and manually-uploaded KB files.
- New KV key `rag.defaultIngestPolicy` (string: `Always | Manual`).
  Registered now but not consumed yet — JIT prompt + wizard step land in
  Phase 3 of the RFC.
- `EmbedFileJob.handle` writes state on terminal outcomes:
  - Success (final batch) → `indexed` + chunks count
  - `UnrecoverableError` → `failed` + error message
  - Retryable errors are left to BullMQ's existing retry path
- `scanAndSyncStorage` swaps the binary qdrant check for a state-aware
  decision tree (see `decideScanAction`). Existing installs auto-backfill
  on first scan: files with chunks in Qdrant but no state row become
  `indexed`; new files start as `pending_decision`.
- `deleteFileBySource` drops the state row last, so removed files
  disappear entirely instead of leaving an orphan that the next scan
  would re-dispatch into nothing.

## What does NOT land here

- Ratio registry (separate PR) — needed for partial-stall detection and
  cost estimates, but a separable concern.
- #880 follow-up initial-progress anchor (separate tiny PR).
- Phase 2 UI (status pill, per-card actions, conditional warnings).
- Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal).
- PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All
  / Reset & Rebuild would also want to set state, but #886 isn't merged
  yet; that wiring goes in a follow-up once #886 lands.

## Target

This is forward work for v1.40.0 (RFC #883). Branching off `rc` because
that's the current latest base and post-GA Jake will sync rc→dev; a
retarget at PR-open time is a fast-forward if requested.

## Tests

- 9 new unit tests for `decideScanAction` covering all five states plus
  the no-row / chunks-present / chunks-missing combinations
- Type-check clean
- Smoke-tested end-to-end on NOMAD3 via hot-patch:
  - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all
    came back `indexed` on first scan
  - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`)
    came back `pending_decision` and was correctly re-dispatched (Bull
    deduped to its historical `:completed` jobId — bgauger's #886 fix
    drains that)
  - Delete hook: deleting a KB upload via `DELETE /api/rag/files`
    removed both the disk file and the state row

* feat(KB): Always/Manual ingest policy toggle (RFC #883 §1/§4)

Activates the `rag.defaultIngestPolicy` KV registered in Phase 1
(#888) so users on a fresh install (or anyone who picks Manual mode)
no longer get every new ZIM auto-dispatched to the embed pipeline.

## Stacks on #888

This PR's base is `feat/kb-ingest-state-machine` (#888). The state
machine has to be in place for the decision function to be policy-aware;
GitHub will fast-forward the base to `rc` once #888 merges.

## Backend changes

- `decideScanAction` now takes a `policy: 'Always' | 'Manual'` argument
  (defaults to `Always` for backward compatibility).
- New `ScanAction` kind: `create_pending`. Manual mode records that the
  scanner has seen a new file (so the UI can surface a per-card Index
  affordance later) without dispatching an EmbedFileJob.
- `scanAndSyncStorage` reads the KV and passes it through. The scan-result
  log line now includes the active policy and a `waiting on user` count
  for Manual-mode hits.
- `rag.defaultIngestPolicy` added to `SETTINGS_KEYS` so it's reachable
  through the existing `GET/PATCH /api/system/settings` surface — no new
  endpoint.
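
The policy-aware decision described above might look roughly like the following. Only `create_pending`, the `skip` outcome for `pending_decision` under Manual, the five state names, and the `Always` default are taken from the commit message. The other action names, the flat string return type, and the handling of `indexed`, `browse_only`, `failed`, and `stalled` rows are assumptions made for illustration.

```typescript
type IngestState = 'pending_decision' | 'indexed' | 'browse_only' | 'failed' | 'stalled'
type IngestPolicy = 'Always' | 'Manual'
// 'create_pending' is named in the commit message; the other kinds are hypothetical.
type ScanActionKind = 'dispatch' | 'backfill_indexed' | 'create_pending' | 'skip'

function decideScanAction(
  state: IngestState | null, // null: no kb_ingest_state row yet
  hasChunksInQdrant: boolean,
  policy: IngestPolicy = 'Always'
): ScanActionKind {
  if (state === null) {
    // Existing installs: chunks present but no row means "already indexed".
    if (hasChunksInQdrant) return 'backfill_indexed'
    // New file: Always embeds it, Manual only records that it was seen.
    return policy === 'Always' ? 'dispatch' : 'create_pending'
  }
  if (state === 'pending_decision') {
    // Manual waits on the user; Always picks the file up on the next scan.
    return policy === 'Always' ? 'dispatch' : 'skip'
  }
  // indexed, browse_only, failed, stalled: left alone in this sketch (assumption).
  return 'skip'
}
```

Because the policy is an argument rather than global state, switching Manual to Always needs no special migration step: the next scan simply evaluates the same `pending_decision` rows with a different argument.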

## Frontend changes

- New section in the KB panel between "Why upload" and "Processing Queue":
  "Auto-index new content for AI? [Always | Manual]" — segmented radio
  with copy explaining the 5-10× disk multiplier. Default Always.
- `useQuery('ingestPolicy')` reads the current value; clicking the
  inactive option mutates and shows a notification confirming the new
  behavior.

## Tests

- 14 unit tests on `decideScanAction` (was 9) — split into Always-mode
  cases (preserves Phase 1's contract) and Manual-mode cases
  (`create_pending`, `pending_decision → skip`, etc.).
- Type-check clean.
- Hot-patch + browser verification deferred until #888 lands; the state
  machine smoke-tested cleanly on NOMAD3 in #888's PR, and this PR's
  decision-tree changes are exhaustively unit-tested.

## RFC open question §3 — policy-change re-trigger

Switching Manual → Always doesn't auto-dispatch existing `pending_decision`
rows immediately. The next scan re-evaluates and dispatches them under
the new policy. This matches the RFC's "treat the switch as I've-
thought-about-it" instinct for the guardrail; full guardrail
implementation lands in Phase 3 task 14.

---------

Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>
2026-05-20 10:16:00 -07:00
Chris Sherwood
43ca584b6c feat(KB): status pill + last-activity timestamp on Processing Queue (RFC #883 §5/§10)
Each in-flight (or stuck) embedding job gets a colored health pill,
relative-activity timestamp, and chunk counter so users can tell at a
glance whether ingestion is making progress.

## Health states

- **🟢 Active** — last batch < 2 min ago
- **🟡 Slow** — last batch 2-5 min ago (CPU-paced multi-batch ingestion
  lives here naturally; not always a problem)
- **🔴 Stalled** — last batch > 5 min ago (likely real problem)
- **Waiting** — queued, no batch started yet
- **🔴 Failed** — job recorded failed status
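
The health states listed above reduce to a small pure function. The sketch below is an approximation under stated assumptions: the input shape and lowercase status names are hypothetical, and the exact 2-minute and 5-minute marks are assigned to "slow" because the list describes Active as strictly under 2 minutes and Stalled as strictly over 5.

```typescript
type JobHealth = 'active' | 'slow' | 'stalled' | 'waiting' | 'failed'

interface JobHealthInput {
  failed: boolean
  lastBatchAt: number | null // epoch milliseconds; null when no batch has started
  now: number // epoch milliseconds
}

const SLOW_AFTER_MS = 2 * 60 * 1000
const STALLED_AFTER_MS = 5 * 60 * 1000

function computeJobHealth(input: JobHealthInput): JobHealth {
  if (input.failed) return 'failed' // a recorded failure wins over any timing
  if (input.lastBatchAt === null) return 'waiting'
  const idleMs = input.now - input.lastBatchAt
  if (idleMs < SLOW_AFTER_MS) return 'active'
  if (idleMs <= STALLED_AFTER_MS) return 'slow'
  return 'stalled'
}
```

Passing `now` in as a field keeps the function deterministic, which is what makes the threshold boundaries straightforward to pin in unit tests.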

## What lands

- New backend util `kb_job_health.ts` with pure `computeJobHealth(input)`
  decision function. Time-based thresholds (2 min / 5 min) inlined as
  constants. 9 unit tests pin the boundaries.
- `EmbedJobWithProgress` gains `lastBatchAt`, `startedAt`, `chunks` —
  already set by `EmbedFileJob.handle` on every batch transition, just
  not previously surfaced through `listActiveJobs`.
- Frontend `kb_job_health_display.ts` maps each status to a Tailwind
  dot color, label, and aria-label so backend and UI stay in sync.
- `ActiveEmbedJobs.tsx` renders the pill, "last activity Xs ago", and
  chunk counter above each progress bar. Adds a manual Refresh button
  and "Last updated Xs ago" line — the existing 2s/30s auto-poll
  cadence in `useEmbedJobs` is left intact.
- Live tick at 5s keeps the relative timestamps current without
  re-fetching from the API.

## Not in scope

- Per-card Cancel / Retry / Un-index — separate Phase 2 PR
- Conditional warnings A/B/C — separate Phase 2 PR
- Computing throughput rate (chunks/min) — needs ratio registry consumer
  (Phase 2 follow-up); for now the pill answers the "is it stuck?"
  question directly without a rate estimate.
2026-05-20 10:16:00 -07:00
Chris Sherwood
159d57b2af feat(KB): ratio registry for disk + time estimates (Phase 1B of RFC #883)
Foundation for the cost estimates and partial-stall detection that
Phase 2 will surface. No consumers yet — this PR just lays the table,
the seed rows, and the lookup helper so subsequent UI work has
estimates available without a per-ZIM benchmark.

## What lands

- New table `kb_ratio_registry` (pattern, chunks_per_mb, sample_count,
  notes). Migration creates and seeds heuristic defaults from the RFC
  appendix: devdocs (1100/MB), Wikipedia variants (270/MB), iFixit
  (50/MB), Stack Exchange Q&A (200/MB), video-only ZIMs (0), plus a
  catch-all fallback at 100/MB.
- `KbRatioRegistry` model with static `lookup()` and `estimateChunks()`.
- Pure helper `kb_ratio_lookup.ts` doing longest-prefix-match — a
  specific entry (`wikipedia_en_simple_`) overrides a broader one
  (`wikipedia_en_`). 9 unit tests covering the lookup boundary.
- `sample_count` starts at 0 (heuristic seed) and is reserved for
  Phase 4 self-calibration to increment as observed ZIMs update each row.
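
As a rough illustration of the longest-prefix-match lookup: the function names `findChunksPerMb` and `estimateChunkCount` appear in the test notes of this commit, but their signatures, the `RatioRow` shape, and the use of binary megabytes are assumptions. The ratio given for `wikipedia_en_simple_` is a made-up placeholder, since the commit message does not state the real seed value.

```typescript
interface RatioRow {
  pattern: string // filename prefix
  chunksPerMb: number
}

// Returns the ratio of the longest pattern that prefixes the filename,
// or null when nothing matches.
function findChunksPerMb(filename: string, rows: RatioRow[]): number | null {
  let best: RatioRow | null = null
  for (const row of rows) {
    if (!filename.startsWith(row.pattern)) continue
    if (best === null || row.pattern.length > best.pattern.length) best = row
  }
  return best === null ? null : best.chunksPerMb
}

function estimateChunkCount(filename: string, sizeBytes: number, rows: RatioRow[]): number | null {
  const ratio = findChunksPerMb(filename, rows)
  if (ratio === null) return null
  return Math.round((sizeBytes / (1024 * 1024)) * ratio)
}

const exampleRows: RatioRow[] = [
  { pattern: 'wikipedia_en_', chunksPerMb: 270 },
  { pattern: 'wikipedia_en_simple_', chunksPerMb: 150 }, // placeholder value, not a real seed
  { pattern: 'devdocs_', chunksPerMb: 1100 }, // pattern string is assumed
]

// The more specific pattern wins regardless of row order.
const simpleRatio = findChunksPerMb('wikipedia_en_simple_all_2025-01.zim', exampleRows)
```

A catch-all fallback could be modeled in this sketch as a row with an empty pattern, since every filename starts with the empty string and any real prefix is longer.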

## Not in scope

- Self-calibration on successful ingestion (Phase 4)
- UI consumers — Warning B (partial-embed stall) and the storage budget
  meter / time estimates land in Phase 2.

## Tested

- Type-check clean
- 9 unit tests pass for `findChunksPerMb` and `estimateChunkCount`
- Migration applied on NOMAD3 via hot-patch; 9 seed rows verified in DB
2026-05-20 10:16:00 -07:00
chriscrosstalk
743549ca74 feat(KB): per-file ingest state machine (Phase 1 of RFC #883) (#888)
Adds a persistent state machine for AI knowledge-base ingestion so the
scanner can distinguish "fully indexed", "user opted out", "failed", and
"stalled" from each other — none of which were derivable from the prior
binary "any chunks in Qdrant ⇒ embedded" check.

## What lands

- New table `kb_ingest_state` keyed by `file_path` with enum state column
  (`pending_decision | indexed | browse_only | failed | stalled`).
  Independent of `installed_resources` so it covers both curated downloads
  and manually-uploaded KB files.
- New KV key `rag.defaultIngestPolicy` (string: `Always | Manual`).
  Registered now but not consumed yet — JIT prompt + wizard step land in
  Phase 3 of the RFC.
- `EmbedFileJob.handle` writes state on terminal outcomes:
  - Success (final batch) → `indexed` + chunks count
  - `UnrecoverableError` → `failed` + error message
  - Retryable errors are left to BullMQ's existing retry path
- `scanAndSyncStorage` swaps the binary qdrant check for a state-aware
  decision tree (see `decideScanAction`). Existing installs auto-backfill
  on first scan: files with chunks in Qdrant but no state row become
  `indexed`; new files start as `pending_decision`.
- `deleteFileBySource` drops the state row last, so removed files
  disappear entirely instead of leaving an orphan that the next scan
  would re-dispatch into nothing.
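
The terminal-outcome rule for `EmbedFileJob.handle` in the list above can be pictured as a mapping from job outcome to an optional state write. This is a hypothetical model, not the job's actual code: the `JobOutcome` union stands in for the real control flow (including BullMQ's `UnrecoverableError`), and the field names are assumed.

```typescript
type JobOutcome =
  | { kind: 'batch_done'; isFinalBatch: boolean; totalChunks: number }
  | { kind: 'unrecoverable'; message: string }
  | { kind: 'retryable'; message: string }

type StateWrite =
  | { state: 'indexed'; chunksEmbedded: number }
  | { state: 'failed'; error: string }

// Returns the row update to persist, or null when no state should be written.
function stateWriteFor(outcome: JobOutcome): StateWrite | null {
  switch (outcome.kind) {
    case 'batch_done':
      // Only the final batch is terminal; earlier batches leave the row alone.
      return outcome.isFinalBatch
        ? { state: 'indexed', chunksEmbedded: outcome.totalChunks }
        : null
    case 'unrecoverable':
      return { state: 'failed', error: outcome.message }
    case 'retryable':
      // Left to the queue's retry path, so nothing is recorded yet.
      return null
  }
}
```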

## What does NOT land here

- Ratio registry (separate PR) — needed for partial-stall detection and
  cost estimates, but a separable concern.
- #880 follow-up initial-progress anchor (separate tiny PR).
- Phase 2 UI (status pill, per-card actions, conditional warnings).
- Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal).
- PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All
  / Reset & Rebuild would also want to set state, but #886 isn't merged
  yet; that wiring goes in a follow-up once #886 lands.

## Target

This is forward work for v1.40.0 (RFC #883). Branching off `rc` because
that's the current latest base and post-GA Jake will sync rc→dev; a
retarget at PR-open time is a fast-forward if requested.

## Tests

- 9 new unit tests for `decideScanAction` covering all five states plus
  the no-row / chunks-present / chunks-missing combinations
- Type-check clean
- Smoke-tested end-to-end on NOMAD3 via hot-patch:
  - Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all
    came back `indexed` on first scan
  - Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`)
    came back `pending_decision` and was correctly re-dispatched (Bull
    deduped to its historical `:completed` jobId — bgauger's #886 fix
    drains that)
  - Delete hook: deleting a KB upload via `DELETE /api/rag/files`
    removed both the disk file and the state row

Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>
2026-05-20 10:16:00 -07:00
Chris Sherwood
5e2c599c3e fix(ZIM): preserve co-existing Wikipedia corpora on cleanup (#884)
onWikipediaDownloadComplete was deleting every file whose name starts
with `wikipedia_en_`, treating distinct corpora (simple, medicine,
wikivoyage, climate_change, etc.) as competing versions of the same
selection slot. Whichever wiki finished second silently wiped the
other from disk.

Match by filename stem instead — strip the trailing `_YYYY-MM(-DD).zim`
date suffix and only delete files with the same stem as the new
download. Different release dates of the same variant still get cleaned
up; distinct variants are preserved.
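
A minimal sketch of the stem comparison, assuming the date suffix always has the `_YYYY-MM.zim` or `_YYYY-MM-DD.zim` shape described above. The function names are hypothetical and may not match what `app/utils/zim_filename.ts` exports.

```typescript
// Matches a trailing _YYYY-MM.zim or _YYYY-MM-DD.zim suffix.
const ZIM_DATE_SUFFIX = /_\d{4}-\d{2}(-\d{2})?\.zim$/

function zimStem(filename: string): string {
  return filename.replace(ZIM_DATE_SUFFIX, '')
}

// True when `existing` is an older or newer release of the same variant as
// `incoming`, and so is safe to clean up after the new download completes.
function isSupersededBy(existing: string, incoming: string): boolean {
  return existing !== incoming && zimStem(existing) === zimStem(incoming)
}
```

Under this predicate, `wikipedia_en_simple_all_2024-06.zim` is superseded by `wikipedia_en_simple_all_2025-01.zim`, while a `wikipedia_en_medicine_...` file has a different stem and is left on disk.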

Extracted the predicate to `app/utils/zim_filename.ts` so the boundary
is covered by unit tests (8 cases incl. the #884 repro scenario).
2026-05-20 10:16:00 -07:00
Ben Gauger
3abf338767 fix(Downloads): treat missing Content-Type as octet-stream (#848)
download.kiwix.org (and some of its mirrors) don't always set a
Content-Type header on .zim responses. The MIME validator was reading
`headers['content-type'] || ''`, then running each allowlist entry
through `''.includes(...)` which is always false, so every download
from those hosts was rejected with `MIME type  is not allowed`.

RFC 7231 §3.1.1.5 says missing Content-Type may be treated as
application/octet-stream by the recipient, and that's already in every
binary-content allowlist we use (ZIM, PMTILES, base assets). Default
the missing case to that and the validator does the right thing.

Strict callers that don't list octet-stream still reject as before.
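
The fix can be sketched as a one-line default ahead of the existing substring check. The function name and signature here are hypothetical, and the real validator may normalize the header differently.

```typescript
const DEFAULT_MIME = 'application/octet-stream'

function isMimeAllowed(contentType: string | undefined, allowlist: string[]): boolean {
  // A missing or blank header is treated as octet-stream instead of ''.
  const mime = (contentType && contentType.trim()) || DEFAULT_MIME
  return allowlist.some((allowed) => mime.includes(allowed))
}
```

An allowlist that omits `application/octet-stream` still rejects a header-less response, which matches the strict-caller behavior noted above.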
2026-05-20 10:16:00 -07:00
Jake Turner
2b8c847295 fix(Downloads): remove duplicate error listener and improve Range request stability 2026-04-21 14:26:28 -07:00
Aaron Bird
8d026da06e fix(downloads): stage downloads to .tmp to prevent Kiwix loading partial files
Downloads are now written to `filepath + '.tmp'` and atomically renamed
to the final path only on successful completion. Kiwix globs for `*.zim`
and ZimService filters `.endsWith('.zim')`, so `.tmp` files are invisible
to both during download. The same staging applies to `.pmtiles` map files.
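
The staging pattern can be sketched against a tiny file-operations interface. `FileOps` and the function names are hypothetical stand-ins so the example stays self-contained; a real implementation would presumably use a Node.js write stream and `fs.rename`, which is only atomic when both paths are on the same filesystem.

```typescript
// Minimal file operations the sketch needs; a real implementation would use
// the Node.js fs module instead.
interface FileOps {
  append(path: string, data: string): void
  rename(from: string, to: string): void
}

// Mirrors the `.endsWith('.zim')` filter described in the commit message.
function isVisibleZim(name: string): boolean {
  return name.endsWith('.zim')
}

function stagedDownload(ops: FileOps, finalPath: string, chunks: string[]): void {
  const tmpPath = finalPath + '.tmp'
  for (const chunk of chunks) {
    ops.append(tmpPath, chunk) // partial data only ever lives at the .tmp path
  }
  ops.rename(tmpPath, finalPath) // publish only after every chunk has landed
}
```

If the download fails partway, the loop throws before `rename` runs, so the final `.zim` path never exists in a half-written state.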

Ref #372

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 14:26:28 -07:00
Jake Turner
c8cb79a3a5 fix: prevent ZIM corrupt file crash and deduplicate Ollama download logs (#741)
Corrupted ZIM files cause a native C++ abort (ZimFileFormatError) that
bypasses JS try/catch and kills the process. Add magic number validation
before passing files to @openzim/libzim so invalid files are skipped
gracefully. Also deduplicate Ollama download progress broadcasts — both
within a single stream (skip unchanged percentages) and across concurrent
callers (share one download promise per model).
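
A minimal sketch of the magic-number check, assuming the caller has already read the first bytes of the file. The ZIM header begins with the 32-bit magic number 72173914 (0x044D495A) stored little-endian, which appears on disk as the bytes `5A 49 4D 04`. The function name is hypothetical.

```typescript
// ZIM magic number 0x044D495A as it appears on disk (little-endian).
const ZIM_MAGIC_BYTES = [0x5a, 0x49, 0x4d, 0x04]

// `header` should hold at least the first four bytes of the file.
function hasZimMagic(header: Uint8Array): boolean {
  if (header.length < ZIM_MAGIC_BYTES.length) return false
  return ZIM_MAGIC_BYTES.every((byte, i) => header[i] === byte)
}
```

Files that fail this check can be skipped in plain JavaScript before the native library is ever invoked, which is the point of the fix: the abort cannot be caught once it happens.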

Co-authored-by: aegisman <aegis@manicode.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 14:26:28 -07:00
Jake Turner
9e3828bcba feat(Kiwix): migrate to Kiwix library mode for improved stability (#622) 2026-04-03 14:26:50 -07:00
arn6694
ed8918f2e9 feat(rag): add EPUB file support for Knowledge Base uploads (#257) 2026-04-03 14:26:50 -07:00
Jake Turner
b8cf1b6127 fix(disk): correct storage display by fixing device matching and dedup mount entries 2026-03-20 11:46:10 -07:00
Chris Sherwood
b0b8f07661 fix: improve download reliability with stall detection, failure visibility, and Wikipedia status tracking
Three bugs caused downloads to hang, disappear, or leave stuck spinners:
1. Wikipedia downloads that failed never updated the DB status from 'downloading',
   leaving the spinner stuck forever. Now the worker's failed handler marks them as failed.
2. No stall detection on streaming downloads - if data stopped flowing mid-download,
   the job hung indefinitely. Added a 5-minute stall timer that triggers retry.
3. Failed jobs were invisible to users since only waiting/active/delayed states were
   queried. Now failed jobs appear with error indicators in the download list.
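
The stall detection in item 2 can be pictured as tracking the time of the last received data. The sketch below is a simplified, clock-injected model rather than the actual timer-based implementation; the class and method names are hypothetical.

```typescript
const STALL_TIMEOUT_MS = 5 * 60 * 1000 // 5 minutes, per the commit message

class StallWatch {
  private lastDataAt: number
  private readonly timeoutMs: number

  constructor(startedAt: number, timeoutMs: number = STALL_TIMEOUT_MS) {
    this.lastDataAt = startedAt
    this.timeoutMs = timeoutMs
  }

  // Call whenever a chunk of the response body arrives.
  onData(now: number): void {
    this.lastDataAt = now
  }

  // True once no data has arrived for the full timeout window.
  isStalled(now: number): boolean {
    return now - this.lastDataAt >= this.timeoutMs
  }
}
```

In a streaming download, `onData` would be called from the stream's data handler, and a stalled result would abort the request so the job's retry path can take over.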

Closes #364, closes #216

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:46:10 -07:00
Jake Turner
58b106f388 feat: support for updating services 2026-03-11 14:08:09 -07:00
Jake Turner
8726700a0a feat: zim content embedding 2026-02-08 13:20:10 -08:00
Jake Turner
1923cd4cde feat(AI): chat suggestions and assistant settings 2026-02-01 07:24:21 +00:00
Jake Turner
243f749090 feat: [wip] native AI chat interface 2026-01-31 20:39:49 -08:00
Jake Turner
50174d2edb feat(RAG): [wip] RAG capabilities 2026-01-31 20:39:49 -08:00
Jake Turner
9ec514e145 fix(Zim): storage path 2025-12-07 20:18:58 -08:00
Jake Turner
5205d5909d feat: disk info collection 2025-12-07 19:13:43 -08:00
Jake Turner
2ff7b055b5 fix(Kiwix): initial download and setup 2025-12-07 16:04:41 -08:00
Jake Turner
7569aa935d feat: background job overhaul with bullmq 2025-12-06 23:59:01 -08:00
Jake Turner
95ba0a95c9 fix: download util improvements 2025-12-05 18:16:23 -08:00
Jake Turner
dd4e7c2c4f feat: curated zim collections 2025-12-05 15:47:22 -08:00
Jake Turner
12a6f2230d feat: [wip] new maps system 2025-11-30 22:29:16 -08:00