project-nomad/admin/app
Chris Sherwood fe599173ef fix(RAG): report ZIM ingestion progress in overall-file frame
Before this change, the Active Downloads / Processing Queue UI showed the
ingestion progress gauge jumping wildly during multi-batch ZIM ingestion
(e.g. 5% → 88% → 27% → 5% → 56% → 36% over ~60 seconds for cooking SE).

Each continuation batch is a separate BullMQ job, and `EmbedFileJob.handle()`
reported `job.progress` in two different reference frames depending on
where it was in the batch lifecycle:

  - During-batch (via the onProgress callback): 5% → 95% scaled across
    "% through this batch's chunks"
  - End-of-batch (just before dispatching the next): overwritten to
    `(nextOffset / totalArticles) * 100` — % through the whole file
  - Next continuation batch starts with progress = 5% explicitly, then
    climbs through the per-batch range again

`listActiveJobs()` returns the latest active BullMQ job's progress. With
GPU-accelerated ingestion completing a batch every ~4 seconds, the UI
saw the jobId rotate constantly and the gauge whipsaw between the two
reference frames.
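
A minimal sketch of why mixing the two frames whipsaws the gauge. This is an illustration, not the project's code: the batch size, the helper names (`perBatchFrame`, `overallFrame`, `simulateOldGauge`, `isMonotonic`) and the linear 5-95% mapping are all assumptions, and `totalArticles` is assumed to be known.

```typescript
// Hypothetical batch size; the real ZIM_BATCH_SIZE value is not stated here.
const BATCH_SIZE = 50;

// Per-batch frame: fraction of this batch's chunks, mapped into 5..95 (assumed linear).
function perBatchFrame(chunkFraction: number): number {
  return 5 + chunkFraction * 90;
}

// Overall-file frame: percent through the whole file at a given article offset.
function overallFrame(nextOffset: number, totalArticles: number): number {
  return (nextOffset / totalArticles) * 100;
}

// Readings the gauge would see if each batch reported in both frames in turn.
function simulateOldGauge(totalArticles: number, batches: number): number[] {
  const readings: number[] = [];
  for (let b = 0; b < batches; b++) {
    readings.push(5); // explicit 5% at batch start
    readings.push(perBatchFrame(0.5)); // halfway through this batch's chunks
    readings.push(perBatchFrame(1)); // batch's chunks done
    readings.push(overallFrame((b + 1) * BATCH_SIZE, totalArticles)); // end-of-batch overwrite
  }
  return readings;
}

// True when no reading is lower than the one before it.
function isMonotonic(values: number[]): boolean {
  return values.every((v, i) => i === 0 || v >= values[i - 1]);
}
```

With a 1000-article file, each batch climbs to 95% in the per-batch frame and then drops to a single-digit overall-file value, so `isMonotonic(simulateOldGauge(1000, 2))` is false.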

`totalArticles` was already wired through the EmbedFileJob params shape
and used end-of-batch — but RagService never actually populated it,
so any frame-scaling that depended on it silently fell back to the
per-batch range. Three fixes together:

1. `ZIMExtractionService.extractZIMContent()` now returns
   `{ chunks: ZIMContentChunk[]; totalArticles: number }` instead of a
   raw chunks array, surfacing `archive.articleCount` to the caller.
   Single caller (rag_service) updated to destructure.

2. `RagService.processZimFile()` includes `totalArticles` in its result
   so `EmbedFileJob.dispatch()` can propagate it to the continuation
   batch (which the existing code already does via
   `totalArticles: totalArticles || result.totalArticles`).

3. `EmbedFileJob`'s onProgress callback scales the service-reported
   per-batch percent into the overall-file frame when `totalArticles`
   is known: `((batchOffset + (percent/100) * ZIM_BATCH_SIZE) /
   totalArticles) * 100`. Capped at 99% to leave room for the explicit
   100% set at file completion. Falls back to the original 5-95% range
   for single-batch files (uploaded PDFs/txts) where totalArticles is
   undefined — the gauge then represents % through the only batch,
   which is what the UI expects for one-shot files.
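
The scaling in item 3 can be sketched as a standalone function. This is a sketch under stated assumptions rather than the actual `EmbedFileJob` code: the function name `scaleToOverallFrame`, the `ZIM_BATCH_SIZE` value of 50, and the linear form of the 5-95% fallback are hypothetical; only the formula and the 99% cap come from the description above.

```typescript
// Hypothetical value; the real ZIM_BATCH_SIZE is not given in this message.
const ZIM_BATCH_SIZE = 50;

/**
 * Convert a per-batch percent (0..100) into the overall-file frame.
 * Falls back to a 5..95 per-batch range when totalArticles is unknown
 * (assumed here to be a linear mapping).
 */
function scaleToOverallFrame(
  percent: number,
  batchOffset: number,
  totalArticles?: number
): number {
  if (!totalArticles || totalArticles <= 0) {
    // Single-batch files (e.g. uploaded PDFs/txts): percent through the only batch.
    return 5 + (percent / 100) * 90;
  }
  // Articles completed so far: everything before this batch plus the finished share of it.
  const articlesDone = batchOffset + (percent / 100) * ZIM_BATCH_SIZE;
  const overall = (articlesDone / totalArticles) * 100;
  // Cap at 99 so the explicit 100% at file completion stays distinct.
  return Math.min(99, overall);
}
```

For example, with these assumed values, a batch starting at offset 100 of a 1000-article file that is half done gives `scaleToOverallFrame(50, 100, 1000)`, i.e. (100 + 25) / 1000 = 12.5%. The final batch hits the cap instead of reporting 100%.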

Validated on NOMAD8 (RX 6800, ROCm-accelerated nomic):

  - devdocs python (small, ~1500 articles): progress now increases
    monotonically across continuation jobIds:
    1501@30% → 1510@33% → 1514@43% → 1518@52%.
  - ifixit (huge, ~100k articles): stays near 3% for the first many
    batches at offset 0..3000 — correct, the file is enormous.
  - wikipedia_en_medicine (large, ~70k articles): stays near 0-1% for
    the first batches — also correct.
  - Brief 0-5% blip on continuation handoff (the explicit
    `safeUpdateProgress(job, 5)` at batch start, before the first
    onProgress callback fires) — visible but quickly resolves to the
    overall-frame value. No more 5% ↔ 88% chaos.
2026-05-13 16:10:51 -07:00
controllers fix(AI): rewrite RAG query on first follow-up (off-by-one in skip-rewrite threshold) 2026-05-12 20:34:30 -07:00
exceptions fix(Docs): documentation renderer fixes 2025-12-23 16:00:33 -08:00
jobs fix(RAG): report ZIM ingestion progress in overall-file frame 2026-05-13 16:10:51 -07:00
middleware fix(API): skip compression for Server-Sent Events (#798) 2026-04-27 19:00:31 -07:00
models feat(Content): custom ZIM library sources with pre-seeded mirrors (#593) 2026-05-04 11:30:59 -07:00
services fix(RAG): report ZIM ingestion progress in overall-file frame 2026-05-13 16:10:51 -07:00
utils fix(Downloads): treat missing Content-Type as octet-stream (#848) 2026-05-11 21:09:40 -07:00
validators feat(Content): custom ZIM library sources with pre-seeded mirrors (#593) 2026-05-04 11:30:59 -07:00