mirror of
https://github.com/Crosstalk-Solutions/project-nomad.git
synced 2026-05-27 06:45:07 +02:00
Before this change, the Active Downloads / Processing Queue UI showed the
ingestion progress gauge jumping wildly during multi-batch ZIM ingestion
(e.g. 5% → 88% → 27% → 5% → 56% → 36% over ~60 seconds for cooking SE).
Each continuation batch is a separate BullMQ job, and `EmbedFileJob.handle()`
reported `job.progress` in two different reference frames depending on
where it was in the batch lifecycle:
- During-batch (via the onProgress callback): 5% → 95% scaled across
"% through this batch's chunks"
- End-of-batch (just before dispatching the next): overwritten to
`(nextOffset / totalArticles) * 100` — % through the whole file
- Next continuation batch starts with progress = 5% explicitly, then
climbs through the per-batch range again
`listActiveJobs()` returns the latest active BullMQ job's progress. With
GPU-accelerated ingestion completing a batch every ~4 seconds, the UI
saw the jobId rotate constantly and the gauge whipsaw between the two
reference frames.
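
The whipsaw can be reproduced with a small simulation. This is only an illustrative sketch, not the project's code: `BATCH_SIZE`, `duringBatch`, `endOfBatch`, `simulateOldGauge`, and `isMonotonic` are hypothetical names, the batch size and article count are made up, and the linear 5-95 mapping is an assumption about how the per-batch range was applied.

```typescript
// Hypothetical reconstruction of the pre-fix reporting; all names and numbers are illustrative.
const BATCH_SIZE = 100 // assumed value, not the project's actual ZIM_BATCH_SIZE

// Frame 1 (during a batch): percent through this batch's chunks, mapped onto 5-95.
function duringBatch(percentOfBatch: number): number {
  return 5 + (percentOfBatch / 100) * 90
}

// Frame 2 (end of a batch): percent through the whole file.
function endOfBatch(nextOffset: number, totalArticles: number): number {
  return (nextOffset / totalArticles) * 100
}

// Collect the values a poller could observe over several continuation batches.
function simulateOldGauge(totalArticles: number, batches: number): number[] {
  const samples: number[] = []
  for (let b = 0; b < batches; b++) {
    samples.push(5) // explicit reset at batch start
    samples.push(duringBatch(50)) // halfway through the batch
    samples.push(duringBatch(100)) // last chunk of the batch
    samples.push(endOfBatch((b + 1) * BATCH_SIZE, totalArticles)) // overwritten before dispatch
  }
  return samples
}

// True when the sequence never decreases.
function isMonotonic(values: number[]): boolean {
  let previous = -Infinity
  for (const value of values) {
    if (value < previous) return false
    previous = value
  }
  return true
}

const oldSamples = simulateOldGauge(1000, 3)
console.log(oldSamples.map((v) => Math.round(v)).join(' → '))
// 5 → 50 → 95 → 10 → 5 → 50 → 95 → 20 → 5 → 50 → 95 → 30
console.log('monotonic:', isMonotonic(oldSamples)) // monotonic: false
```

Only every fourth sample is in the whole-file frame, so a UI that polls at arbitrary moments sees a mix of both frames.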
`totalArticles` was already wired through the EmbedFileJob params shape
and used at end-of-batch, but RagService never actually populated it,
so any frame-scaling that depended on it silently fell back to the
per-batch range. Three changes together fix this:
1. `ZIMExtractionService.extractZIMContent()` now returns
`{ chunks: ZIMContentChunk[]; totalArticles: number }` instead of a
raw chunks array, surfacing `archive.articleCount` to the caller.
Single caller (rag_service) updated to destructure.
2. `RagService.processZimFile()` includes `totalArticles` in its result
so `EmbedFileJob.dispatch()` can propagate it to the continuation
batch (which the existing code already does via
`totalArticles: totalArticles || result.totalArticles`).
3. `EmbedFileJob`'s onProgress callback scales the service-reported
per-batch percent into the overall-file frame when `totalArticles`
is known: `((batchOffset + (percent/100) * ZIM_BATCH_SIZE) /
totalArticles) * 100`. Capped at 99% to leave room for the explicit
100% set at file completion. Falls back to the original 5-95% range
for single-batch files (uploaded PDFs/txts) where totalArticles is
undefined — the gauge then represents % through the only batch,
which is what the UI expects for one-shot files.
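
The scaling in item 3 can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the actual `EmbedFileJob` code: `scaleProgress` is a hypothetical helper, the `ZIM_BATCH_SIZE` value is a placeholder, and the linear mapping used for the 5-95% fallback is assumed.

```typescript
// Placeholder value for illustration; the real ZIM_BATCH_SIZE is defined in the project.
const ZIM_BATCH_SIZE = 50

// Hypothetical helper: converts a per-batch percent (0-100) into the gauge value.
function scaleProgress(percent: number, batchOffset: number, totalArticles?: number): number {
  if (!totalArticles) {
    // Single-batch files (no totalArticles): keep the per-batch 5-95 range.
    // The linear mapping here is an assumption about the original behaviour.
    return 5 + (percent / 100) * 90
  }
  // Articles completed so far = offset of this batch + fraction of the batch done.
  const articlesDone = batchOffset + (percent / 100) * ZIM_BATCH_SIZE
  const overall = (articlesDone / totalArticles) * 100
  // Cap at 99 so the explicit 100% at file completion remains the final step.
  return Math.min(99, overall)
}

console.log(scaleProgress(50, 100, 1000)) // 12.5
console.log(scaleProgress(100, 950, 1000)) // 99 (capped)
console.log(scaleProgress(50, 0)) // 50 (fallback range)
```

Because `batchOffset` only grows from one continuation batch to the next, the value reported at the end of one batch matches the value at the start of the next in this sketch, which is what keeps the gauge in a single reference frame.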
Validated on NOMAD8 (RX 6800, ROCm-accelerated nomic):
- devdocs python (small, ~1500 articles): progress rises monotonically
across continuation jobIds:
1501@30% → 1510@33% → 1514@43% → 1518@52%.
- ifixit (huge, ~100k articles): stays near 3% for the first many
batches at offset 0..3000 — correct, the file is enormous.
- wikipedia_en_medicine (large, ~70k articles): stays near 0-1% for
the first batches — also correct.
- Brief 0-5% blip on continuation handoff (the explicit
`safeUpdateProgress(job, 5)` at batch start, before the first
onProgress callback fires) — visible but quickly resolves to the
overall-frame value. No more 5% ↔ 88% chaos.
| Name |
|---|
| app |
| bin |
| commands |
| config |
| constants |
| database |
| docs |
| inertia |
| providers |
| public |
| resources |
| start |
| tests |
| types |
| util |
| views |
| .editorconfig |
| .env.example |
| ace.js |
| adonisrc.ts |
| eslint.config.js |
| package-lock.json |
| package.json |
| tailwind.config.ts |
| tsconfig.json |
| vite.config.ts |