project-nomad/admin/start
Chris Sherwood ffa70a54bc feat(chat): confirm-on-switch + one-chat-model-at-a-time enforcement
Surfaces NOMAD's previously-silent model-stacking behavior and enforces a
"one chat model in VRAM at a time" invariant (the embedding model is
always exempt). Addresses Chris's NOMAD3 testing observation that
switching the dropdown in the chat header was invisibly slow on low-VRAM
hardware because the prior model was never unloaded — Ollama would
either evict it under memory pressure or load the new one on CPU after
the runner choked.

Three integration points all funnel through one new helper:

- **User changes the model dropdown** in an active chat session →
  confirm modal "Switch to {newModel}? Switching to {newModel} will
  start a new chat. Your current conversation stays available in the
  sidebar." On confirm, fire `keep_alive: 0` against the previous chat
  model, clear active session, set the new selection. Cancel snaps the
  visible dropdown back to the previous value (no popup state leaks
  into `selectedModel`).

- **User clicks a session in the sidebar** → no popup (system-initiated).
  Restore the session's stored model into the dropdown and fire
  `unloadChatModels(targetModel)` so anything that isn't the target
  gets the unload hint.

- **Chat page first mount** → page-load normalization. Anything stacked
  from a prior session gets the unload hint with the current selected
  model as the target-to-preserve. Guarded by a ref so it only fires
  once per page lifetime; gated on `selectedModel` being populated.

Backend surface is a single new helper and a single new route:

  `OllamaService.unloadAllChatModelsExcept(targetModel: string | null)`
  → queries `/api/ps`, filters out (a) the embedding model name
  (hardcoded `nomic-embed-text:v1.5` to avoid the RagService circular
  import) and (b) `targetModel`, fires `POST /api/generate` with empty
  prompt + `keep_alive: 0` in parallel against everything else.
  Returns the names that were hinted. Best-effort: network or Ollama
  errors are logged and swallowed so callers don't fail on housekeeping.

  `POST /api/ollama/unload-chat-models` → thin wrapper validating
  `{ targetModel?: string | null }`.

Why `keep_alive: 0` is safe against in-flight inference: per Ollama's
scheduler semantics, the hint sets the post-completion eviction timer
to zero — the runner is not terminated. If Session A is mid-response
on gemma when Session B fires the unload, gemma stays resident until
A's request completes, then evicts. The user-visible worst case is the
race where A's longer-running request re-extends the timer back to the
default and the unload is no-op'd; the next transition (or page reload)
gets another chance, and Ollama's own LRU catches up under memory
pressure regardless. Robust in-flight tracking deferred to a follow-up
if we see stale-state in the wild.

Base `rc`: v1.40.0 will inherit everything from rc.6 via the backmerge.
Frontend tests deferred to a follow-up PR; existing inertia tsconfig
errors are pre-existing and unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:16:00 -07:00
..
env.ts feat: gzip compression by default for all registered routes 2026-04-03 14:26:50 -07:00
kernel.ts feat: gzip compression by default for all registered routes 2026-04-03 14:26:50 -07:00
routes.ts feat(chat): confirm-on-switch + one-chat-model-at-a-time enforcement 2026-05-20 10:16:00 -07:00