project-nomad

mirror of https://github.com/Crosstalk-Solutions/project-nomad.git synced 2026-06-02 09:36:49 +02:00


Chris Sherwood fe599173ef fix(RAG): report ZIM ingestion progress in overall-file frame

Before this change, the Active Downloads / Processing Queue UI showed the ingestion progress gauge jumping wildly during multi-batch ZIM ingestion (e.g. 5% → 88% → 27% → 5% → 56% → 36% over ~60 seconds for cooking SE). Each continuation batch is a separate BullMQ job, and `EmbedFileJob.handle()` reported `job.progress` in two different reference frames depending on where it was in the batch lifecycle:

- During-batch (via the onProgress callback): 5% → 95%, scaled across "% through this batch's chunks"
- End-of-batch (just before dispatching the next): overwritten to `(nextOffset / totalArticles) * 100`, i.e. % through the whole file
- Next continuation batch starts with progress = 5% explicitly, then climbs through the per-batch range again

`listActiveJobs()` returns the latest active BullMQ job's progress. With GPU-accelerated ingestion completing a batch every ~4 seconds, the UI saw the jobId rotate constantly and the gauge whipsaw between the two reference frames.

`totalArticles` was already wired through the EmbedFileJob params shape and used end-of-batch, but RagService never actually populated it, so any frame-scaling that depended on it silently fell back to the per-batch range.

Three changes together:

1. `ZIMExtractionService.extractZIMContent()` now returns `{ chunks: ZIMContentChunk[]; totalArticles: number }` instead of a raw chunks array, surfacing `archive.articleCount` to the caller. Single caller (rag_service) updated to destructure.
2. `RagService.processZimFile()` includes `totalArticles` in its result so `EmbedFileJob.dispatch()` can propagate it to the continuation batch (which the existing code already does via `totalArticles: totalArticles || result.totalArticles`).
3. `EmbedFileJob`'s onProgress callback scales the service-reported per-batch percent into the overall-file frame when `totalArticles` is known: `((batchOffset + (percent/100) * ZIM_BATCH_SIZE) / totalArticles) * 100`. Capped at 99% to leave room for the explicit 100% set at file completion. Falls back to the original 5-95% range for single-batch files (uploaded PDFs/txts) where totalArticles is undefined; the gauge then represents % through the only batch, which is what the UI expects for one-shot files.

Validated on NOMAD8 (RX 6800, ROCm-accelerated nomic):

- devdocs python (small, ~1500 articles): progress rose monotonically across continuation jobIds: 1501@30% → 1510@33% → 1514@43% → 1518@52%.
- ifixit (huge, ~100k articles): stays near 3% for the first many batches at offset 0..3000. This is correct, because the file is enormous.
- wikipedia_en_medicine (large, ~70k articles): stays near 0-1% for the first batches, which is also correct.
- Brief 0-5% blip on continuation handoff (the explicit `safeUpdateProgress(job, 5)` at batch start, before the first onProgress callback fires): visible, but it quickly resolves to the overall-frame value.

No more 5% ↔ 88% chaos.

2026-05-13 16:10:51 -07:00
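The overall-frame scaling described in item 3 of the commit message can be sketched as a small pure function. This is an illustrative sketch, not the code from the commit: the function name `scaleBatchProgress`, the `ZIM_BATCH_SIZE` value of 50, the rounding, and the exact linear mapping used for the 5-95% fallback are all assumptions; only the formula, the 99% cap, and the existence of a fallback come from the message.

```typescript
// Hypothetical value: the commit message names ZIM_BATCH_SIZE but does not state it.
const ZIM_BATCH_SIZE = 50;

/**
 * Map a per-batch percent (0-100) into the overall-file frame.
 * `batchOffset` is the number of articles already processed before this batch.
 */
function scaleBatchProgress(
  percent: number,
  batchOffset: number,
  totalArticles?: number,
  batchSize: number = ZIM_BATCH_SIZE
): number {
  if (!totalArticles || totalArticles <= 0) {
    // Single-batch files (no totalArticles): stay in the 5-95% per-batch range.
    // The linear mapping here is an assumption about how that range is produced.
    return Math.round(5 + (percent / 100) * 90);
  }
  // Formula from the commit message:
  // ((batchOffset + (percent/100) * ZIM_BATCH_SIZE) / totalArticles) * 100
  const overall = ((batchOffset + (percent / 100) * batchSize) / totalArticles) * 100;
  // Cap at 99 so the explicit 100% at file completion remains distinct.
  return Math.min(99, Math.floor(overall));
}

// Halfway through the batch starting at article 450 of 1500.
console.log(scaleBatchProgress(50, 450, 1500));
// Last batch fully processed: capped below 100.
console.log(scaleBatchProgress(100, 1450, 1500));
// One-shot file without totalArticles: per-batch range.
console.log(scaleBatchProgress(50, 0));
```

Because the end of one batch (`percent = 100` at offset `n`) and the start of the next (`percent = 0` at offset `n + batchSize`) map to the same value, the gauge stays monotonic across continuation jobs under these assumptions.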
..
controllers	fix(AI): rewrite RAG query on first follow-up (off-by-one in skip-rewrite threshold)	2026-05-12 20:34:30 -07:00
exceptions	fix(Docs): documentation renderer fixes	2025-12-23 16:00:33 -08:00
jobs	fix(RAG): report ZIM ingestion progress in overall-file frame	2026-05-13 16:10:51 -07:00
middleware	fix(API): skip compression for Server-Sent Events (#798)	2026-04-27 19:00:31 -07:00
models	feat(Content): custom ZIM library sources with pre-seeded mirrors (#593)	2026-05-04 11:30:59 -07:00
services	fix(RAG): report ZIM ingestion progress in overall-file frame	2026-05-13 16:10:51 -07:00
utils	fix(Downloads): treat missing Content-Type as octet-stream (#848)	2026-05-11 21:09:40 -07:00
validators	feat(Content): custom ZIM library sources with pre-seeded mirrors (#593)	2026-05-04 11:30:59 -07:00