Surfaces two silent failure modes that the prior binary
"any-chunks-in-Qdrant ⇒ embedded" check could not distinguish from
healthy ingestion:
- **Warning A — Zero-chunk file** (file_size > 100 MB, chunks = 0)
Fires on video-only / image-only ZIMs (`lrnselfreliance_en_all`,
TED talks, etc.) that the pipeline completes "successfully" with no
extractable text. The AI Assistant cannot reference these files at all.
- **Warning B — Partial-embed stall** (chunks < 50% of expected from
the ratio registry). Surfaces the simple_wiki "266 of 600,000 chunks"
case observed during NOMAD1 ingestion testing — previously these
looked identical to fully-completed embeds in the UI.
Both warnings render only when their condition is met (silent by
default; noisy only on real problems).
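The two conditions above can be sketched as a pure function. This is a minimal illustration, not the code in `kb_warning_decision.ts`: the constant names, the `FileWarning` string values, the input field names, and the choice of 1 MB = 1024 × 1024 bytes are all assumptions. Only the thresholds (`100 MB`, `0.5×`) and the name `decideWarnings` come from this description.

```typescript
// Thresholds from the PR text; constant names are illustrative.
const ZERO_CHUNK_MIN_BYTES = 100 * 1024 * 1024; // 100 MB (binary MB is an assumption)
const PARTIAL_EMBED_RATIO = 0.5; // Warning B fires below 0.5x of expected

// Hypothetical warning identifiers.
type FileWarning = 'zero_chunk_file' | 'partial_embed_stall';

interface WarningInputs {
  fileSizeBytes: number;
  chunks: number;
  // null: the ratio registry had no estimate. 0: a known video-only ZIM.
  expectedChunks: number | null;
}

function decideWarnings(inputs: WarningInputs): FileWarning[] {
  const { fileSizeBytes, chunks, expectedChunks } = inputs;
  const warnings: FileWarning[] = [];

  // Warning A: a large file that produced no chunks at all.
  if (fileSizeBytes > ZERO_CHUNK_MIN_BYTES && chunks === 0) {
    warnings.push('zero_chunk_file');
  }

  // Warning B: fewer than half the expected chunks. Skipped on a registry
  // miss (null) and when the registry expects zero chunks.
  if (
    expectedChunks !== null &&
    expectedChunks > 0 &&
    chunks < expectedChunks * PARTIAL_EMBED_RATIO
  ) {
    warnings.push('partial_embed_stall');
  }

  return warnings;
}
```

Under these assumptions the "266 of 600,000" case yields only the partial-embed warning, and a file sitting exactly at 50% of its estimate yields none.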
Base is `feat/kb-ratio-registry` (#891) because Warning B's "expected
chunks" estimate comes from `KbRatioRegistry.estimateChunks()`. GitHub
retargets this PR to `rc` once #891 merges.
## What lands
- `app/utils/kb_warning_decision.ts` — pure `decideWarnings(inputs)`
with thresholds (`100 MB`, `0.5×`) as exported constants. 10 unit
tests cover the healthy case, both warnings, the under/at/over
boundary, the registry-miss suppression, and the video-only registry
case (`expectedChunks: 0` correctly skips Warning B).
- `RagService.computeFileWarnings()` — single Qdrant scroll tallies
chunks per source, filesystem walk fills in zero-chunk files,
ratio registry estimates the expectation, decision function emits.
- New endpoint `GET /api/rag/file-warnings` returns
`Record<source, FileWarning[]>` (sources with no warnings are
omitted, so the frontend can `warnings[source] ?? []` for clean
defaults).
- KB modal: warnings render inline under the file name as amber-tinted
pills. Polled every 30s alongside the existing health check.
## Not in scope
- Warning C — chunks skipped due to length. PR #890 (#881 fix) prevents
the silent drop at the embed boundary, so the underlying condition
shouldn't fire anymore. If we still want to surface "we truncated
N chunks to fit", that needs separate `skipped_count` tracking in
EmbedFileJob — a Phase 2 follow-up.
- Suppressing Warning B during active mid-ingestion. The user can cross-
reference the Processing Queue to know it's in-flight; suppressing
warnings while a job runs would mask real stalls where the job died
mid-batch. Will revisit when per-card status is wired through.
- Use of `kb_ingest_state.chunks_embedded` (#888) as the chunk count
source. This PR uses Qdrant scroll directly so it can land
independently of #888.
## Tests
- 10 new unit tests on `decideWarnings`, all pass
- Type-check clean
- Hot-patch + browser smoke test deferred until #891 lands. The ratio
registry must exist in the DB for `estimateChunks()` to return
non-null estimates; without it only Warning A fires, which is still
useful, while Warning B stays dormant.
When a user picks a tier in TierSelectionModal, show how much additional
disk space the AI Assistant will need if the new ZIMs are indexed, plus
a policy-aware footer explaining whether they'll auto-index (Always) or
wait for opt-in (Manual). Estimates consume #891's KbRatioRegistry via a
new POST /api/rag/estimate-batch endpoint.
Backend
- New POST /api/rag/estimate-batch route + RagController.estimateBatch
- VineJS schema accepting array of {filename, sizeBytes}, capped at 500
- KbRatioRegistry.estimateBatch aggregates via the existing prefix-match
lookup, returns {totalChunks, totalBytes, hasUnknown}
- New BYTES_PER_CHUNK_ON_DISK constant (~8 KB: 3 KB vector + ~3 KB chunk
text + ~2 KB payload/index overhead). Tunable; will be replaced by
Phase 4 self-calibration once we have real measurements.
- Controller normalizes incoming filenames via path.basename so callers
that send full paths or URLs still match registry prefixes correctly.
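The aggregation described in the backend list can be sketched as follows. This is an illustration under stated assumptions rather than the `KbRatioRegistry.estimateBatch` implementation: the `lookup` callback stands in for the registry's prefix-match lookup, rounding and the 1024-based MB are guesses, and `totalBytes` is assumed to be `totalChunks × BYTES_PER_CHUNK_ON_DISK`.

```typescript
// ~8 KB per chunk on disk, per the PR text (exact value assumed to be 8192).
const BYTES_PER_CHUNK_ON_DISK = 8 * 1024;

interface BatchItem {
  filename: string;
  sizeBytes: number;
}

interface BatchEstimate {
  totalChunks: number;
  totalBytes: number;
  hasUnknown: boolean;
}

// Hypothetical stand-in for the registry lookup: chunks per MB, or null
// when no pattern matches the filename.
type ChunksPerMbLookup = (filename: string) => number | null;

function estimateBatch(items: BatchItem[], lookup: ChunksPerMbLookup): BatchEstimate {
  let totalChunks = 0;
  let hasUnknown = false;

  for (const item of items) {
    const chunksPerMb = lookup(item.filename);
    if (chunksPerMb === null) {
      // No registry match: flag it and leave it out of the sum.
      hasUnknown = true;
      continue;
    }
    // A known video-only ZIM has chunksPerMb === 0 and contributes nothing,
    // but it does not count as unknown.
    totalChunks += Math.round((item.sizeBytes / (1024 * 1024)) * chunksPerMb);
  }

  return {
    totalChunks,
    totalBytes: totalChunks * BYTES_PER_CHUNK_ON_DISK,
    hasUnknown,
  };
}
```

Keeping `hasUnknown` separate from the totals lets the UI show a figure and still signal that the figure is a lower bound.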
Frontend
- api.estimateEmbeddingBatch() client method
- TierSelectionModal: when localSelectedSlug is set, resolve the tier's
resources (incl. inherited tiers), POST to /estimate-batch, and render
a new info block with the +~X GB figure + ingest-policy copy. Also
fetches rag.defaultIngestPolicy so the same block surfaces whether
indexing will fire automatically or wait for the user.
- resourceFilename() helper extracts the basename from the resource URL
so the registry lookup hits the right prefix regardless of mirror.
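A basename helper of the kind described for `resourceFilename()` might look like the sketch below. It uses plain string operations so it runs unchanged in the browser; the query-string and fragment handling is an assumption, since this description only says the basename is extracted from the resource URL.

```typescript
// Sketch: return the last path segment of a resource URL so that registry
// prefixes match regardless of which mirror host or directory served it.
function resourceFilename(resourceUrl: string): string {
  // Drop any query string or fragment before splitting on slashes.
  const withoutQuery = resourceUrl.split(/[?#]/)[0] ?? '';
  const segments = withoutQuery.split('/').filter((segment) => segment.length > 0);
  return segments.length > 0 ? segments[segments.length - 1] ?? '' : '';
}
```

For example, a URL ending in `/zim/wikipedia/wikipedia_en_simple_all_2024-05.zim` reduces to `wikipedia_en_simple_all_2024-05.zim` whatever host precedes it.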
Tests
- 4 new cases in tests/unit/kb_ratio_lookup.spec.ts covering the
estimateBatch aggregator: standard sum, unknown-flagging, video-only
ZIM (0 chunks but known, hasUnknown stays false), empty input.
Stacks on feat/kb-ratio-registry (#891) and consumes the registry table
seeded by that PR. Once #891 merges to rc, GitHub retargets this PR to rc.
Out of scope for this PR (deferred to follow-ups):
- Per-batch opt-in checkbox (RFC §1's '☑ Also index these for AI') needs
a per-batch policy override path and is a separate PR
- Guardrail modal at 50 GB / 10% free / 6 hr thresholds (RFC §7) is also
separate; this PR is informational, not gating
- Time-to-embed estimate awaits a chunks-per-second metric per host
* feat(KB): per-file ingest state machine (Phase 1 of RFC #883)
Adds a persistent state machine for AI knowledge-base ingestion so the
scanner can distinguish "fully indexed", "user opted out", "failed", and
"stalled" from each other — none of which were derivable from the prior
binary "any chunks in Qdrant ⇒ embedded" check.
## What lands
- New table `kb_ingest_state` keyed by `file_path` with enum state column
(`pending_decision | indexed | browse_only | failed | stalled`).
Independent of `installed_resources` so it covers both curated downloads
and manually-uploaded KB files.
- New KV key `rag.defaultIngestPolicy` (string: `Always | Manual`).
Registered now but not consumed yet — JIT prompt + wizard step land in
Phase 3 of the RFC.
- `EmbedFileJob.handle` writes state on terminal outcomes:
- Success (final batch) → `indexed` + chunks count
- `UnrecoverableError` → `failed` + error message
- Retryable errors are left to BullMQ's existing retry path
- `scanAndSyncStorage` swaps the binary qdrant check for a state-aware
decision tree (see `decideScanAction`). Existing installs auto-backfill
on first scan: files with chunks in Qdrant but no state row become
`indexed`; new files start as `pending_decision`.
- `deleteFileBySource` drops the state row last, so removed files
disappear entirely instead of leaving an orphan that the next scan
would re-dispatch into nothing.
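The terminal-outcome rule for `EmbedFileJob.handle` can be summarised as a small mapping. The sketch below is illustrative only: the `JobOutcome` and `StateWrite` shapes and the `stateWriteFor` name are hypothetical, while the state names and the three outcomes are taken from the list above.

```typescript
type IngestState = 'pending_decision' | 'indexed' | 'browse_only' | 'failed' | 'stalled';

// Hypothetical description of how a job run ended.
type JobOutcome =
  | { kind: 'success_final_batch'; chunks: number }
  | { kind: 'unrecoverable_error'; message: string }
  | { kind: 'retryable_error' };

// Hypothetical shape of the row update written to kb_ingest_state.
interface StateWrite {
  state: IngestState;
  chunksEmbedded?: number;
  error?: string;
}

// Returns the state row update for a terminal outcome, or null when the
// outcome is not terminal and the queue's retry path should handle it.
function stateWriteFor(outcome: JobOutcome): StateWrite | null {
  switch (outcome.kind) {
    case 'success_final_batch':
      return { state: 'indexed', chunksEmbedded: outcome.chunks };
    case 'unrecoverable_error':
      return { state: 'failed', error: outcome.message };
    case 'retryable_error':
      return null; // no write; the retry will produce a terminal outcome later
  }
}
```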
## What does NOT land here
- Ratio registry (separate PR) — needed for partial-stall detection and
cost estimates, but a separable concern.
- #880 follow-up initial-progress anchor (separate tiny PR).
- Phase 2 UI (status pill, per-card actions, conditional warnings).
- Phase 3 policy surfaces (wizard step, JIT prompt, guardrail modal).
- PR #886's bulk-action hookup — `_deletePointsBySource` / Re-embed All
/ Reset & Rebuild would also want to set state, but #886 isn't merged
yet; that wiring goes in a follow-up once #886 lands.
## Target
This is forward work for v1.40.0 (RFC #883). It branches off `rc` because
that is the latest base, and Jake will sync rc→dev after GA; retargeting
at PR-open time is a fast-forward if requested.
## Tests
- 9 new unit tests for `decideScanAction` covering all five states plus
the no-row / chunks-present / chunks-missing combinations
- Type-check clean
- Smoke-tested end-to-end on NOMAD3 via hot-patch:
- Backfill: 5 ZIMs + 2 KB uploads with existing chunks in Qdrant all
came back `indexed` on first scan
- Pending dispatch: a video-only ZIM with no chunks (`lrnselfreliance`)
came back `pending_decision` and was correctly re-dispatched (Bull
deduped to its historical `:completed` jobId — bgauger's #886 fix
drains that)
- Delete hook: deleting a KB upload via `DELETE /api/rag/files`
removed both the disk file and the state row
* feat(KB): Always/Manual ingest policy toggle (RFC #883 §1/§4)
Activates the `rag.defaultIngestPolicy` KV registered in Phase 1
(#888) so users on a fresh install (or anyone who picks Manual mode)
no longer get every new ZIM auto-dispatched to the embed pipeline.
## Stacks on #888
This PR's base is `feat/kb-ingest-state-machine` (#888). The state
machine has to be in place for the decision function to be policy-aware;
GitHub will retarget the base to `rc` once #888 merges.
## Backend changes
- `decideScanAction` now takes a `policy: 'Always' | 'Manual'` argument
(defaults to `Always` for backward compatibility).
- New `ScanAction` kind: `create_pending`. Manual mode records that the
scanner has seen a new file (so the UI can surface a per-card Index
affordance later) without dispatching an EmbedFileJob.
- `scanAndSyncStorage` reads the KV and passes it through. The scan-result
log line now includes the active policy and a `waiting on user` count
for Manual-mode hits.
- `rag.defaultIngestPolicy` added to `SETTINGS_KEYS` so it's reachable
through the existing `GET/PATCH /api/system/settings` surface — no new
endpoint.
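A policy-aware decision function along these lines might look like the sketch below. Only `create_pending`, the `policy` argument with its `Always` default, the backfill of chunked files to `indexed`, and the Manual-mode `pending_decision → skip` rule come from this description. The other action names and the handling of `failed` and `stalled` rows (skipped here) are assumptions, not the real decision tree.

```typescript
type IngestState = 'pending_decision' | 'indexed' | 'browse_only' | 'failed' | 'stalled';
type IngestPolicy = 'Always' | 'Manual';

// `create_pending` is named in the PR; the other kinds are illustrative.
type ScanAction =
  | { kind: 'dispatch' }
  | { kind: 'backfill_indexed' }
  | { kind: 'create_pending' }
  | { kind: 'skip' };

interface ScanInput {
  state: IngestState | null; // null: no state row exists for this file yet
  hasChunks: boolean; // whether Qdrant already holds chunks for it
}

function decideScanAction(input: ScanInput, policy: IngestPolicy = 'Always'): ScanAction {
  if (input.state === null) {
    // Existing installs: chunks present but no row means it was embedded
    // before the state table existed.
    if (input.hasChunks) return { kind: 'backfill_indexed' };
    // A new file: Always embeds it, Manual only records that it was seen.
    return policy === 'Manual' ? { kind: 'create_pending' } : { kind: 'dispatch' };
  }

  if (input.state === 'pending_decision') {
    // Manual keeps waiting on the user; Always dispatches on this scan.
    return policy === 'Manual' ? { kind: 'skip' } : { kind: 'dispatch' };
  }

  // indexed, browse_only, failed, stalled: left alone in this sketch.
  return { kind: 'skip' };
}
```

Because the policy is read on every scan, a `pending_decision` row created under Manual is dispatched on the first scan after the user switches to Always.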
## Frontend changes
- New section in the KB panel between "Why upload" and "Processing Queue":
"Auto-index new content for AI? [Always | Manual]" — segmented radio
with copy explaining the 5-10× disk multiplier. Default Always.
- `useQuery('ingestPolicy')` reads the current value; clicking the
inactive option mutates and shows a notification confirming the new
behavior.
## Tests
- 14 unit tests on `decideScanAction` (was 9) — split into Always-mode
cases (preserves Phase 1's contract) and Manual-mode cases
(`create_pending`, `pending_decision → skip`, etc.).
- Type-check clean.
- Hot-patch + browser verification deferred until #888 lands; the state
machine smoke-tested cleanly on NOMAD3 in #888's PR, and this PR's
decision-tree changes are exhaustively unit-tested.
## RFC open question §3 — policy-change re-trigger
Switching Manual → Always doesn't auto-dispatch existing `pending_decision`
rows immediately. The next scan re-evaluates and dispatches them under
the new policy. This matches the RFC's "treat the switch as I've-
thought-about-it" instinct for the guardrail; full guardrail
implementation lands in Phase 3 task 14.
---------
Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>
Each in-flight (or stuck) embedding job gets a colored health pill,
relative-activity timestamp, and chunk counter so users can tell at a
glance whether ingestion is making progress.
## Health states
- **🟢 Active** — last batch < 2 min ago
- **🟡 Slow** — last batch 2-5 min ago (CPU-paced multi-batch ingestion
lives here naturally; not always a problem)
- **🔴 Stalled** — last batch > 5 min ago (likely real problem)
- **⚪ Waiting** — queued, no batch started yet
- **🔴 Failed** — job recorded failed status
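As a rough sketch, the five states reduce to a pure function of the job's failure flag and the time since its last batch. The input shape and constant names here are assumptions, and treating exactly 2 and exactly 5 minutes as Slow is one reading of the ranges above, not a confirmed boundary rule.

```typescript
const ACTIVE_BELOW_MS = 2 * 60 * 1000; // under 2 min since last batch: Active
const SLOW_UP_TO_MS = 5 * 60 * 1000; // 2 to 5 min: Slow; beyond that: Stalled

type JobHealth = 'active' | 'slow' | 'stalled' | 'waiting' | 'failed';

// Hypothetical input shape; timestamps are epoch milliseconds.
interface JobHealthInput {
  jobFailed: boolean;
  lastBatchAt: number | null; // null: queued, no batch has started yet
  now: number;
}

function computeJobHealth(input: JobHealthInput): JobHealth {
  if (input.jobFailed) return 'failed';
  if (input.lastBatchAt === null) return 'waiting';

  const idleMs = input.now - input.lastBatchAt;
  if (idleMs < ACTIVE_BELOW_MS) return 'active';
  if (idleMs <= SLOW_UP_TO_MS) return 'slow';
  return 'stalled';
}
```

Passing `now` in, rather than reading the clock inside the function, is what makes the boundaries straightforward to pin in unit tests.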
## What lands
- New backend util `kb_job_health.ts` with pure `computeJobHealth(input)`
decision function. Time-based thresholds (2 min / 5 min) inlined as
constants. 9 unit tests pin the boundaries.
- `EmbedJobWithProgress` gains `lastBatchAt`, `startedAt`, `chunks` —
already set by `EmbedFileJob.handle` on every batch transition, just
not previously surfaced through `listActiveJobs`.
- Frontend `kb_job_health_display.ts` maps each status to a Tailwind
dot color, label, and aria-label so backend and UI stay in sync.
- `ActiveEmbedJobs.tsx` renders the pill, "last activity Xs ago", and
chunk counter above each progress bar. Adds a manual Refresh button
and "Last updated Xs ago" line — the existing 2s/30s auto-poll
cadence in `useEmbedJobs` is left intact.
- Live tick at 5s keeps the relative timestamps current without
re-fetching from the API.
## Not in scope
- Per-card Cancel / Retry / Un-index — separate Phase 2 PR
- Conditional warnings A/B/C — separate Phase 2 PR
- Computing throughput rate (chunks/min) — needs ratio registry consumer
(Phase 2 follow-up); for now the pill answers the "is it stuck?"
question directly without a rate estimate.
Foundation for the cost estimates and partial-stall detection that
Phase 2 will surface. There are no consumers yet: this PR only adds the
table, the seed rows, and the lookup helper so that subsequent UI work has
estimates available without a per-ZIM benchmark.
## What lands
- New table `kb_ratio_registry` (pattern, chunks_per_mb, sample_count,
notes). Migration creates and seeds heuristic defaults from the RFC
appendix: devdocs (1100/MB), Wikipedia variants (270/MB), iFixit
(50/MB), Stack Exchange Q&A (200/MB), video-only ZIMs (0), plus a
catch-all fallback at 100/MB.
- `KbRatioRegistry` model with static `lookup()` and `estimateChunks()`.
- Pure helper `kb_ratio_lookup.ts` doing longest-prefix-match — a
specific entry (`wikipedia_en_simple_`) overrides a broader one
(`wikipedia_en_`). 9 unit tests covering the lookup boundary.
- `sample_count` starts at 0 (heuristic seed) and is reserved for
Phase 4 self-calibration to increment as observed ZIMs update each row.
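Longest-prefix matching over the registry rows can be sketched as below. The function names follow the ones listed under "Tested", but the row shape, the `rows` parameter, the empty-string catch-all pattern, and the 300/MB figure for `wikipedia_en_simple_` are illustrative assumptions, not the seeded values.

```typescript
interface RatioRow {
  pattern: string; // filename prefix
  chunksPerMb: number;
}

// Illustrative rows. Only 270/MB and 100/MB echo the seed list above.
const EXAMPLE_ROWS: RatioRow[] = [
  { pattern: 'wikipedia_en_', chunksPerMb: 270 },
  { pattern: 'wikipedia_en_simple_', chunksPerMb: 300 }, // made-up override value
  { pattern: '', chunksPerMb: 100 }, // catch-all modelled as an empty prefix
];

// Returns the ratio of the longest pattern that prefixes the filename,
// or null when nothing matches.
function findChunksPerMb(filename: string, rows: RatioRow[]): number | null {
  let best: RatioRow | null = null;
  for (const row of rows) {
    const matches = filename.startsWith(row.pattern);
    if (matches && (best === null || row.pattern.length > best.pattern.length)) {
      best = row;
    }
  }
  return best === null ? null : best.chunksPerMb;
}

function estimateChunkCount(filename: string, sizeBytes: number, rows: RatioRow[]): number | null {
  const perMb = findChunksPerMb(filename, rows);
  if (perMb === null) return null;
  return Math.round((sizeBytes / (1024 * 1024)) * perMb);
}
```

With these rows, a 1 GiB file named `wikipedia_en_all_maxi_2024-01.zim` matches `wikipedia_en_` and comes out at 1024 × 270 = 276,480 chunks.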
## Not in scope
- Self-calibration on successful ingestion (Phase 4)
- UI consumers — Warning B (partial-embed stall) and the storage budget
meter / time estimates land in Phase 2.
## Tested
- Type-check clean
- 9 unit tests pass for `findChunksPerMb` and `estimateChunkCount`
- Migration applied on NOMAD3 via hot-patch; 9 seed rows verified in DB
Co-authored-by: Jake Turner <52841588+jakeaturner@users.noreply.github.com>
`onWikipediaDownloadComplete` was deleting every file whose name starts
with `wikipedia_en_`, treating distinct corpora (simple, medicine,
wikivoyage, climate_change, etc.) as competing versions of the same
selection slot. Whichever wiki finished second silently wiped the
other from disk.
Match by filename stem instead — strip the trailing `_YYYY-MM(-DD).zim`
date suffix and only delete files with the same stem as the new
download. Different release dates of the same variant still get cleaned
up; distinct variants are preserved.
Extracted the predicate to `app/utils/zim_filename.ts` so the boundary
is covered by unit tests (8 cases incl. the #884 repro scenario).
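A stem-based predicate of this kind might look like the following sketch. The function names are hypothetical and the regex is one plausible reading of the `_YYYY-MM(-DD).zim` suffix; the code in `app/utils/zim_filename.ts` may differ.

```typescript
// Matches a trailing _YYYY-MM.zim or _YYYY-MM-DD.zim date suffix.
const ZIM_DATE_SUFFIX = /_\d{4}-\d{2}(-\d{2})?\.zim$/;

// Strip the date suffix so different releases of one variant share a stem.
function zimStem(filename: string): string {
  return filename.replace(ZIM_DATE_SUFFIX, '');
}

// True when `existing` is a different release of the same variant as
// `downloaded`, and is therefore safe to clean up.
function isSupersededBy(existing: string, downloaded: string): boolean {
  return existing !== downloaded && zimStem(existing) === zimStem(downloaded);
}
```

Under this predicate `wikipedia_en_simple_all_2024-01.zim` is superseded by the `2024-05` release of the same file, while `wikipedia_en_medicine_2024-06.zim` is left alone.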
download.kiwix.org (and some of its mirrors) doesn't always set a
Content-Type header on .zim responses. The MIME validator was reading
`headers['content-type'] || ''`, then running each allowlist entry
through `''.includes(...)` which is always false, so every download
from those hosts was rejected with `MIME type is not allowed`.
RFC 7231 §3.1.1.5 says missing Content-Type may be treated as
application/octet-stream by the recipient, and that's already in every
binary-content allowlist we use (ZIM, PMTILES, base assets). Default
the missing case to that and the validator does the right thing.
Strict callers that don't list octet-stream still reject as before.
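The defaulting behaviour can be sketched in a few lines. The function name and signature are hypothetical. The substring check mirrors the `includes(...)` comparison described above, and lower-casing the header is an added assumption.

```typescript
const DEFAULT_MIME_TYPE = 'application/octet-stream';

// Sketch: treat a missing or empty Content-Type as octet-stream, then
// apply the same substring check against the caller's allowlist.
function isMimeAllowed(contentType: string | undefined, allowlist: string[]): boolean {
  const effective = (contentType || DEFAULT_MIME_TYPE).toLowerCase();
  return allowlist.some((allowed) => effective.includes(allowed));
}
```

An allowlist that omits `application/octet-stream` still rejects a header-less response, which preserves the strict-caller behaviour.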
Downloads are now written to `filepath + '.tmp'` and atomically renamed
to the final path only on successful completion. Kiwix globs for `*.zim`
and ZimService filters `.endsWith('.zim')`, so `.tmp` files are invisible
to both during download. The same staging applies to `.pmtiles` map files.
Ref #372
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Corrupted ZIM files cause a native C++ abort (ZimFileFormatError) that
bypasses JS try/catch and kills the process. Add magic number validation
before passing files to @openzim/libzim so invalid files are skipped
gracefully. Also deduplicate Ollama download progress broadcasts — both
within a single stream (skip unchanged percentages) and across concurrent
callers (share one download promise per model).
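A magic-number check of this kind could be sketched as below. The ZIM header begins with the 32-bit magic number 72173914 (0x044D495A) stored little-endian, so the first four bytes on disk are `5A 49 4D 04`. The function name and the choice to take an already-read header buffer are assumptions, not the code added in this commit.

```typescript
// First four bytes of a ZIM file: magic 0x044D495A in little-endian order.
const ZIM_MAGIC_BYTES = [0x5a, 0x49, 0x4d, 0x04];

// Returns true only if the buffer starts with the ZIM magic number.
// Callers would read the first bytes of the file and skip it on false,
// before handing the path to the native library.
function hasZimMagic(header: Uint8Array): boolean {
  if (header.length < ZIM_MAGIC_BYTES.length) return false;
  return ZIM_MAGIC_BYTES.every((byte, index) => header[index] === byte);
}
```

The check is a cheap pre-filter: it catches truncated or mislabelled files, but a file with a valid header can still be corrupt further in.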
Co-authored-by: aegisman <aegis@manicode.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three bugs caused downloads to hang, disappear, or leave stuck spinners:
1. Wikipedia downloads that failed never updated the DB status from 'downloading',
leaving the spinner stuck forever. Now the worker's failed handler marks them as failed.
2. No stall detection on streaming downloads - if data stopped flowing mid-download,
the job hung indefinitely. Added a 5-minute stall timer that triggers retry.
3. Failed jobs were invisible to users since only waiting/active/delayed states were
queried. Now failed jobs appear with error indicators in the download list.
Closes #364, closes #216
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>