mirror of
https://github.com/Crosstalk-Solutions/project-nomad.git
synced 2026-05-23 04:45:06 +02:00
Surfaces NOMAD's previously silent model-stacking behavior and enforces a
"one chat model in VRAM at a time" invariant (the embedding model is
always exempt). Addresses Chris's NOMAD3 testing observation that
switching models via the dropdown in the chat header was slow, with no
visible explanation, on low-VRAM hardware: the prior model was never
unloaded, so Ollama would either evict it under memory pressure or load
the new one on CPU after the runner choked.
Three integration points all funnel through one new helper:
- **User changes the model dropdown** in an active chat session →
confirm modal "Switch to {newModel}? Switching to {newModel} will
start a new chat. Your current conversation stays available in the
sidebar." On confirm, fire `keep_alive: 0` against the previous chat
model, clear active session, set the new selection. Cancel snaps the
visible dropdown back to the previous value (no popup state leaks
into `selectedModel`).
- **User clicks a session in the sidebar** → no popup, because the model
change is system-initiated rather than a dropdown choice. Restore the
session's stored model into the dropdown and fire
`unloadChatModels(targetModel)` so anything that isn't the target
gets the unload hint.
- **Chat page first mount** → page-load normalization. Anything stacked
from a prior session gets the unload hint with the current selected
model as the target-to-preserve. Guarded by a ref so it only fires
once per page lifetime; gated on `selectedModel` being populated.
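The three transitions above can be sketched as a small, framework-free state machine. This is a minimal sketch, not the code from this commit: `ChatState`, `SwitcherDeps`, `createModelSwitcher`, and `unloadModel` are hypothetical names, the confirm modal is reduced to a synchronous callback, and the behavior with no active session (switch without a modal or unload) is an assumption, since only the active-session case is described.

```typescript
// Hypothetical shapes; the real page presumably keeps these in component state.
interface ChatState {
  selectedModel: string | null
  activeSessionId: string | null
}

interface SwitcherDeps {
  // Stand-in for the confirm modal; returns true when the user confirms.
  confirm: (message: string) => boolean
  // Stand-in for firing `keep_alive: 0` against one specific model.
  unloadModel: (model: string) => void
  // Stand-in for the client call that hints everything except `targetModel`.
  unloadChatModels: (targetModel: string | null) => void
}

function createModelSwitcher(state: ChatState, deps: SwitcherDeps) {
  // Plays the role of the ref guard: page-load normalization runs at most once.
  let normalizedOnce = false

  return {
    // Dropdown change. Returns false when nothing changed (including cancel),
    // so the caller can snap the visible dropdown back to the previous value.
    onDropdownChange(newModel: string): boolean {
      const previous = state.selectedModel
      if (newModel === previous) return false
      if (state.activeSessionId !== null) {
        const confirmed = deps.confirm(
          `Switch to ${newModel}? Switching to ${newModel} will start a new chat. ` +
            'Your current conversation stays available in the sidebar.'
        )
        if (!confirmed) return false // selectedModel is left untouched
        if (previous !== null) deps.unloadModel(previous)
        state.activeSessionId = null
      }
      // Assumption: with no active session there is nothing to confirm.
      state.selectedModel = newModel
      return true
    },

    // Sidebar click: no popup; restore the stored model and hint the rest.
    onSessionClick(sessionId: string, storedModel: string): void {
      state.activeSessionId = sessionId
      state.selectedModel = storedModel
      deps.unloadChatModels(storedModel)
    },

    // First mount: gated on selectedModel being populated, fires once.
    onFirstMount(): boolean {
      if (normalizedOnce || state.selectedModel === null) return false
      normalizedOnce = true
      deps.unloadChatModels(state.selectedModel)
      return true
    },
  }
}
```

Returning a boolean from `onDropdownChange` is one way to keep popup state out of `selectedModel`: the visible dropdown value is only committed when the call returns `true`.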
Backend surface is a single new helper and a single new route:
- `OllamaService.unloadAllChatModelsExcept(targetModel: string | null)`
→ queries `/api/ps`, filters out (a) the embedding model name
(hardcoded `nomic-embed-text:v1.5` to avoid the RagService circular
import) and (b) `targetModel`, then fires `POST /api/generate` with an
empty prompt + `keep_alive: 0` in parallel against everything else.
Returns the names that were hinted. Best-effort: network or Ollama
errors are logged and swallowed so callers don't fail on housekeeping.
- `POST /api/ollama/unload-chat-models` → thin wrapper validating
`{ targetModel?: string | null }`.
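The helper and the route's validation can be sketched as follows. This is an illustration under stated assumptions, not the committed implementation: `pickModelsToUnload`, `parseUnloadBody`, `FetchLike`, and the injected `fetchImpl` parameter are hypothetical, the base URL is Ollama's default local address rather than NOMAD's actual configuration, and treating a missing body as `targetModel: null` is a guess about the route's behavior.

```typescript
// Embedding model name as quoted above; it is always exempt from unloading.
const EMBEDDING_MODEL = 'nomic-embed-text:v1.5'

// Only the field this sketch needs from an `/api/ps` entry.
interface RunningModel {
  name: string
}

// Minimal fetch-shaped dependency (hypothetical) so the sketch runs without a
// live Ollama instance; the global `fetch` can be passed in real use.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string }
) => Promise<{ ok: boolean; json(): Promise<unknown> }>

// Pure filter: everything resident except the embedding model and the target.
function pickModelsToUnload(running: RunningModel[], targetModel: string | null): string[] {
  return running
    .map((m) => m.name)
    .filter((name) => name !== EMBEDDING_MODEL && name !== targetModel)
}

// Validates `{ targetModel?: string | null }`; absent and null both mean
// "preserve nothing" (assumption about how the route treats a missing field).
function parseUnloadBody(body: unknown): { targetModel: string | null } {
  if (typeof body !== 'object' || body === null) return { targetModel: null }
  const value = (body as { targetModel?: unknown }).targetModel
  if (value === undefined || value === null) return { targetModel: null }
  if (typeof value !== 'string') throw new TypeError('targetModel must be a string or null')
  return { targetModel: value }
}

async function unloadAllChatModelsExcept(
  targetModel: string | null,
  fetchImpl: FetchLike,
  baseUrl = 'http://localhost:11434' // Ollama's default port; an assumption here
): Promise<string[]> {
  try {
    const res = await fetchImpl(`${baseUrl}/api/ps`)
    if (!res.ok) return []
    const body = (await res.json()) as { models?: RunningModel[] }
    const names = pickModelsToUnload(body.models ?? [], targetModel)
    // Fire the hints in parallel; a failed hint is logged, never rethrown.
    await Promise.all(
      names.map((model) =>
        fetchImpl(`${baseUrl}/api/generate`, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ model, prompt: '', keep_alive: 0 }),
        }).catch((err) => console.warn(`unload hint failed for ${model}`, err))
      )
    )
    return names
  } catch (err) {
    // Best-effort housekeeping: callers never see this failure.
    console.warn('unloadAllChatModelsExcept: skipping housekeeping', err)
    return []
  }
}
```

Keeping the filter as a pure function separates the "what to unload" decision from the network calls, which makes the exemption rules easy to check in isolation.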
Why `keep_alive: 0` is safe against in-flight inference: per Ollama's
scheduler semantics, the hint sets the post-completion eviction timer
to zero; it does not terminate the runner. If Session A is mid-response
on gemma when Session B fires the unload, gemma stays resident until
A's request completes, then evicts. The user-visible worst case is a
race in which A's longer-running request re-extends the timer back to
the default and the unload becomes a no-op; the next transition (or page
reload) gets another chance, and Ollama's own LRU eviction catches up
under memory pressure regardless. Robust in-flight tracking is deferred
to a follow-up if we see stale state in the wild.
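The reasoning above can be illustrated with a toy model of a single runner. This is deliberately not Ollama's scheduler code: `ToyRunner`, `admit`, `complete`, and `unloadHint` are hypothetical, and the rule that the most recently admitted request sets the keep-alive value is a simplifying assumption chosen to reproduce both outcomes described in the text.

```typescript
// Toy model only: one runner, a count of in-flight requests, and the
// keep-alive value set by whichever request was admitted last (assumption).
interface ToyRunner {
  resident: boolean
  inFlight: number
  keepAliveMs: number
}

// Five minutes, Ollama's documented default keep-alive.
const DEFAULT_KEEP_ALIVE_MS = 5 * 60 * 1000

// A request is admitted and overwrites the runner's keep-alive value.
function admit(r: ToyRunner, keepAliveMs: number = DEFAULT_KEEP_ALIVE_MS): void {
  r.inFlight += 1
  r.keepAliveMs = keepAliveMs
}

// A request completes. Eviction only happens once nothing is in flight and
// the keep-alive is zero; the runner is never killed mid-request.
function complete(r: ToyRunner): void {
  r.inFlight -= 1
  if (r.inFlight === 0 && r.keepAliveMs === 0) r.resident = false
}

// The unload hint is an empty request with keep_alive 0 that finishes at once.
function unloadHint(r: ToyRunner): void {
  admit(r, 0)
  complete(r)
}

// Benign case: Session A is mid-response when the hint arrives.
const benign: ToyRunner = { resident: true, inFlight: 0, keepAliveMs: DEFAULT_KEEP_ALIVE_MS }
admit(benign) // Session A starts generating
unloadHint(benign) // Session B fires the unload
const residentWhileBusy = benign.resident // still true: A is in flight
complete(benign) // A finishes; the model evicts now
const residentAfterA = benign.resident // false

// Race: a default keep-alive request lands after the hint and resets the timer.
const raced: ToyRunner = { resident: true, inFlight: 0, keepAliveMs: DEFAULT_KEEP_ALIVE_MS }
admit(raced)
unloadHint(raced)
admit(raced) // re-extends keep-alive back to the default
complete(raced)
complete(raced)
const residentAfterRace = raced.resident // true: the unload was a no-op
```

In the raced case the model simply stays loaded until the next transition or page reload issues another hint, which matches the "gets another chance" behavior described above.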
Base `rc`: v1.40.0 will inherit everything from rc.6 via the backmerge.
Frontend tests are deferred to a follow-up PR; the current inertia
tsconfig errors are pre-existing and unrelated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| Name |
|---|
| env.ts |
| kernel.ts |
| routes.ts |