project-nomad/admin
chriscrosstalk 216509ae0d fix(rag): repair ZIM embedding pipeline (sync filter, batch gate, DOM walk) (#745)
Three bugs in the RAG embedding pipeline, diagnosed and patched by @sbruschke
against v1.31.0 with working before/after chunk counts. All three are
root-cause contributors to #388.

1. scanAndSyncStorage queued every file under /storage/zim/ for embedding,
   including Kiwix's generated kiwix-library.xml. EmbedFileJob rejected it
   with "Unsupported file type" and the default 30-attempt retry policy
   kept it looping on every sync, flooding nomad_admin logs. Now gated on
   determineFileType(filePath) !== 'unknown'.

2. hasMoreBatches compared zimChunks.length (section-level chunk count
   under the 'structured' strategy) against ZIM_BATCH_SIZE (an article
   limit). Because articles emit multiple sections, the two are never
   equal for real archives and processing silently stopped after the
   first 50 articles. Now gated on articlesInBatch >= ZIM_BATCH_SIZE.

3. extractStructuredContent walked only direct children of <body>, so any
   ZIM that wraps content in a container div (Devdocs, Wikipedia,
   FreeCodeCamp, React docs, etc.) produced zero sections and silently
   embedded zero chunks while reporting success. Now walks the full DOM
   via $('body').find('h2, h3, h4, p, ul, ol, dl, table'), with a
   whole-body text fallback when the selector walk yields nothing.
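
A minimal sketch of fixes 1 and 2, for illustration only and not the code from this commit. The body of determineFileType, the FileType union, and the helper names filesToQueue, hasMoreBatchesOld and hasMoreBatchesNew are hypothetical. The old gate is assumed to be an equality check, inferred from the wording "never equal" above. ZIM_BATCH_SIZE is set to 50 to match the "first 50 articles" symptom.

```typescript
// Hypothetical stand-ins: the real determineFileType and ZIM_BATCH_SIZE
// live in the admin app and may differ from what is shown here.
type FileType = 'zim' | 'pdf' | 'text' | 'unknown'

// Article limit per batch (value assumed from the "first 50 articles" symptom).
const ZIM_BATCH_SIZE = 50

// Assumed extension-based classifier, for illustration only.
function determineFileType(filePath: string): FileType {
  const ext = filePath.slice(filePath.lastIndexOf('.') + 1).toLowerCase()
  if (ext === 'zim') return 'zim'
  if (ext === 'pdf') return 'pdf'
  if (ext === 'txt' || ext === 'md') return 'text'
  return 'unknown'
}

// Fix 1: queue only files the embedder can handle, so that
// kiwix-library.xml never reaches the job queue and its retry policy.
function filesToQueue(paths: string[]): string[] {
  return paths.filter((p) => determineFileType(p) !== 'unknown')
}

// Old gate (assumed shape): compares a section-level chunk count
// against an article limit, two quantities in different units.
function hasMoreBatchesOld(chunkCount: number): boolean {
  return chunkCount === ZIM_BATCH_SIZE
}

// Fix 2: compare articles against the article limit.
function hasMoreBatchesNew(articlesInBatch: number): boolean {
  return articlesInBatch >= ZIM_BATCH_SIZE
}
```

With a full batch of 50 articles that produced, say, 137 section chunks, hasMoreBatchesOld(137) is false and processing stops, while hasMoreBatchesNew(50) is true and the next batch is scheduled.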

Before/after chunk counts confirmed by @sbruschke on v1.31.0:
  devdocs_en_git     0 -> 916
  devdocs_en_react   0 -> 481
  devdocs_en_node    0 -> 423
  libretexts_en_eng  1 -> 35 (climbing)
Wikipedia resumed progressing normally through its 6M articles.

Closes #718
Closes #719
Closes #720
Closes #388

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:26:28 -07:00
app fix(rag): repair ZIM embedding pipeline (sync filter, batch gate, DOM walk) (#745) 2026-04-21 14:26:28 -07:00
bin feat: curated content system overhaul 2026-02-11 15:44:46 -08:00
commands fix(Jobs): improved error handling and robustness 2026-04-03 14:26:50 -07:00
config fix: cache docker list requests, aiAssistantName fetching, and ensure inertia used properly 2026-04-03 14:26:50 -07:00
constants feat(Kiwix): migrate to Kiwix library mode for improved stability (#622) 2026-04-03 14:26:50 -07:00
database fix(qdrant): disable anonymous telemetry by default (#747) 2026-04-21 14:26:28 -07:00
docs docs: add Community Add-Ons page with field manuals + W3Schools packs (#753) 2026-04-21 14:26:28 -07:00
inertia fix(ZIM): accumulate across Kiwix pages to prevent empty Content Explorer (#746) 2026-04-21 14:26:28 -07:00
providers feat(Kiwix): migrate to Kiwix library mode for improved stability (#622) 2026-04-03 14:26:50 -07:00
public feat: switch all PNG images to WEBP (#575) 2026-04-03 14:26:50 -07:00
resources/views feat: switch all PNG images to WEBP (#575) 2026-04-03 14:26:50 -07:00
start feat(maps): add scale bar and location markers (#636) 2026-04-03 14:26:50 -07:00
tests feat: initial commit 2025-06-29 15:51:08 -07:00
types fix(ZIM): accumulate across Kiwix pages to prevent empty Content Explorer (#746) 2026-04-21 14:26:28 -07:00
util feat: display model download progress 2026-02-06 16:22:23 -08:00
views feat: initial commit 2025-06-29 15:51:08 -07:00
.editorconfig feat: initial commit 2025-06-29 15:51:08 -07:00
.env.example feat: Add Windows Docker Desktop support for local development 2026-01-19 10:29:24 -08:00
ace.js feat: initial commit 2025-06-29 15:51:08 -07:00
adonisrc.ts feat(Kiwix): migrate to Kiwix library mode for improved stability (#622) 2026-04-03 14:26:50 -07:00
eslint.config.js feat: openwebui+ollama and zim management 2025-07-09 09:08:21 -07:00
package-lock.json build(deps): bump lodash from 4.17.23 to 4.18.1 in /admin (#643) 2026-04-21 14:26:28 -07:00
package.json build(deps-dev): bump vite from 6.4.1 to 6.4.2 in /admin (#677) 2026-04-21 14:26:28 -07:00
tailwind.config.ts feat: initial commit 2025-06-29 15:51:08 -07:00
tsconfig.json feat: initial commit 2025-06-29 15:51:08 -07:00
vite.config.ts fix(Maps): ensure proper parsing of hostnames (#640) 2026-04-03 14:26:50 -07:00