Surfaces NOMAD's previously-silent model-stacking behavior and enforces a
"one chat model in VRAM at a time" invariant (the embedding model is
always exempt). Addresses Chris's NOMAD3 testing observation that
switching the dropdown in the chat header was invisibly slow on low-VRAM
hardware because the prior model was never unloaded — Ollama would
either evict it under memory pressure or load the new one on CPU after
the runner choked.
Three integration points all funnel through one new helper:
- **User changes the model dropdown** in an active chat session →
confirm modal "Switch to {newModel}? Switching to {newModel} will
start a new chat. Your current conversation stays available in the
sidebar." On confirm, fire `keep_alive: 0` against the previous chat
model, clear active session, set the new selection. Cancel snaps the
visible dropdown back to the previous value (no popup state leaks
into `selectedModel`).
- **User clicks a session in the sidebar** → no popup (system-initiated).
Restore the session's stored model into the dropdown and fire
`unloadChatModels(targetModel)` so anything that isn't the target
gets the unload hint.
- **Chat page first mount** → page-load normalization. Anything stacked
from a prior session gets the unload hint with the current selected
model as the target-to-preserve. Guarded by a ref so it only fires
once per page lifetime; gated on `selectedModel` being populated.
Backend surface is a single new helper and a single new route:
`OllamaService.unloadAllChatModelsExcept(targetModel: string | null)`
→ queries `/api/ps`, filters out (a) the embedding model name
(hardcoded `nomic-embed-text:v1.5` to avoid the RagService circular
import) and (b) `targetModel`, fires `POST /api/generate` with empty
prompt + `keep_alive: 0` in parallel against everything else.
Returns the names that were hinted. Best-effort: network or Ollama
errors are logged and swallowed so callers don't fail on housekeeping.
`POST /api/ollama/unload-chat-models` → thin wrapper validating
`{ targetModel?: string | null }`.
Why `keep_alive: 0` is safe against in-flight inference: per Ollama's
scheduler semantics, the hint sets the post-completion eviction timer
to zero — the runner is not terminated. If Session A is mid-response
on gemma when Session B fires the unload, gemma stays resident until
A's request completes, then evicts. The user-visible worst case is the
race where A's longer-running request re-extends the timer back to the
default and the unload is no-op'd; the next transition (or page reload)
gets another chance, and Ollama's own LRU catches up under memory
pressure regardless. Robust in-flight tracking deferred to a follow-up
if we see stale-state in the wild.
Base `rc`: v1.40.0 will inherit everything from rc.6 via the backmerge.
Frontend tests deferred to a follow-up PR; existing inertia tsconfig
errors are pre-existing and unrelated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The short-conversation skip in `rewriteQueryWithContext` used `userMessages.length <= 2`,
which short-circuits both the very first turn AND the first follow-up. The follow-up is
the moment the rewriter matters most — it's where pronouns and shorthand ("the bars",
"how long does it last?") need to be resolved against earlier turns before the embedding
search runs. With the rewriter skipped, RAG queries against the raw last message, scores
nothing above the 0.3 threshold, and no context gets injected for that turn.
The visible symptom is the assistant treating the first follow-up in any chat as a
brand-new question — e.g. "great - they threw up 2 of the bars it looks like" answered
as if it were a recipe-bars question, with no carry-forward of the prior chocolate-
poisoning context.
Threshold lowered to `< 2`: skip only when there's exactly one user message (nothing to
rewrite from). From the first follow-up onward the rewriter runs, as originally intended
before commit 96e5027.
Validated against `mistral-nemo:12b` on NOMAD3 by hot-patching the compiled controller
and replaying the dog-chocolate scenario. Post-patch response correctly threads "3
Hershey's bars" from turn 1 into turn 2's answer; pre-patch (per reporter's screenshot)
pivoted to peanut butter bar recipes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When users set a remote Ollama URL via AI Settings, the local nomad_ollama
container continued running and competed with the remote host for port 11434
and GPU access. Now configureRemote stops the local container on set and
restores it on clear (if still present). Container and its models volume are
preserved so the local install can be re-enabled later.
Closes#662
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>