Mirror of https://github.com/Crosstalk-Solutions/project-nomad.git (synced 2026-05-12 16:10:11 +02:00)
Three bugs in the RAG embedding pipeline, diagnosed and patched by @sbruschke against v1.31.0 with working before/after chunk counts. All three are root-cause contributors to #388.

1. scanAndSyncStorage queued every file under /storage/zim/ for embedding, including Kiwix's generated kiwix-library.xml. EmbedFileJob rejected it with "Unsupported file type" and the default 30-attempt retry policy kept it looping on every sync, flooding nomad_admin logs. Now gated on determineFileType(filePath) !== 'unknown'.

2. hasMoreBatches compared zimChunks.length (section-level chunk count under the 'structured' strategy) against ZIM_BATCH_SIZE (an article limit). Because articles emit multiple sections, the two are never equal for real archives and processing silently stopped after the first 50 articles. Now gated on articlesInBatch >= ZIM_BATCH_SIZE.

3. extractStructuredContent walked only direct children of <body>, so any ZIM that wraps content in a container div (Devdocs, Wikipedia, FreeCodeCamp, React docs, etc.) produced zero sections and silently embedded zero chunks while reporting success. Now walks the full DOM via $('body').find('h2, h3, h4, p, ul, ol, dl, table'), with a whole-body text fallback when the selector walk yields nothing.

Before/after chunk counts confirmed by @sbruschke on v1.31.0:

- devdocs_en_git: 0 -> 916
- devdocs_en_react: 0 -> 481
- devdocs_en_node: 0 -> 423
- libretexts_en_eng: 1 -> 35 (climbing)

Wikipedia resumed progressing normally through its 6M articles.

Closes #718
Closes #719
Closes #720
Closes #388

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

| | | |
|---|---|---|
| app | ||
| bin | ||
| commands | ||
| config | ||
| constants | ||
| database | ||
| docs | ||
| inertia | ||
| providers | ||
| public | ||
| resources/views | ||
| start | ||
| tests | ||
| types | ||
| util | ||
| views | ||
| .editorconfig | ||
| .env.example | ||
| ace.js | ||
| adonisrc.ts | ||
| eslint.config.js | ||
| package-lock.json | ||
| package.json | ||
| tailwind.config.ts | ||
| tsconfig.json | ||
| vite.config.ts | ||
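
The three fixes described in the commit message can be sketched in isolation as follows. This is a minimal illustration, not the repository's code: the function names `determineFileType` and `hasMoreBatches` and the constant `ZIM_BATCH_SIZE` come from the commit message, but their bodies here are assumptions, as are the helpers `filesToQueue`, `oldHasMoreBatches`, `directSections`, and `allSections`, the extension map, and the `HtmlNode` type (a hypothetical stand-in for the cheerio DOM). The batch size of 50 is inferred from "stopped after the first 50 articles", and the old check is assumed to have been an equality comparison.

```typescript
// Hypothetical stand-ins: names mirror the commit message, bodies are illustrative only.
type FileType = 'zim' | 'pdf' | 'text' | 'unknown';

// Assumed extension map; the real determineFileType may recognise other types.
function determineFileType(filePath: string): FileType {
  const ext = filePath.slice(filePath.lastIndexOf('.') + 1).toLowerCase();
  if (ext === 'zim') return 'zim';
  if (ext === 'pdf') return 'pdf';
  if (ext === 'txt' || ext === 'md') return 'text';
  return 'unknown';
}

// Fix 1: skip unsupported files up front instead of letting the job fail and retry.
function filesToQueue(paths: string[]): string[] {
  return paths.filter((p) => determineFileType(p) !== 'unknown');
}

// Fix 2: decide whether to continue from the article count, not the chunk count.
const ZIM_BATCH_SIZE = 50; // inferred from the commit message

// Assumed shape of the old check: section-level chunks compared to an article limit.
function oldHasMoreBatches(chunkCount: number): boolean {
  return chunkCount === ZIM_BATCH_SIZE;
}

function hasMoreBatches(articlesInBatch: number): boolean {
  return articlesInBatch >= ZIM_BATCH_SIZE;
}

// Fix 3: a tiny tree type standing in for the parsed HTML document.
interface HtmlNode {
  tag: string;
  text?: string;
  children?: HtmlNode[];
}

const CONTENT_TAGS = new Set(['h2', 'h3', 'h4', 'p', 'ul', 'ol', 'dl', 'table']);

// Old behaviour: only direct children of <body> are considered.
function directSections(body: HtmlNode): HtmlNode[] {
  return (body.children ?? []).filter((n) => CONTENT_TAGS.has(n.tag));
}

// New behaviour: every matching descendant, in document order.
function allSections(node: HtmlNode): HtmlNode[] {
  const out: HtmlNode[] = [];
  for (const child of node.children ?? []) {
    if (CONTENT_TAGS.has(child.tag)) out.push(child);
    out.push(...allSections(child));
  }
  return out;
}

// A page that wraps its content in a container div, as Devdocs-style ZIMs do.
const sampleBody: HtmlNode = {
  tag: 'body',
  children: [
    {
      tag: 'div',
      children: [
        { tag: 'h2', text: 'Intro' },
        { tag: 'p', text: 'Hello' },
      ],
    },
  ],
};

console.log(filesToQueue(['/storage/zim/wiki.zim', '/storage/zim/kiwix-library.xml']));
console.log(oldHasMoreBatches(137), hasMoreBatches(50)); // false true
console.log(directSections(sampleBody).length, allSections(sampleBody).length); // 0 2
```

With 50 articles producing, say, 137 section chunks, the assumed old check reports no further batches while the article-based check continues. The wrapped sample page likewise yields no sections from a direct-child walk and two from the full walk, matching the zero-chunk symptom the commit describes.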