project-nomad/admin/adonisrc.ts
Chris Sherwood 2997637ce0 feat(GPU): auto-remediate nomad_ollama passthrough loss on admin boot (#755)
After an update, container recreate, or docker daemon restart, nomad_ollama's
HostConfig.DeviceRequests still lists the nvidia driver — but the NVIDIA
Container Toolkit binding inside the container is torn. `nvidia-smi` returns
"Failed to initialize NVML: Unknown Error" and Ollama silently falls back to
CPU inference. PR #208 detects this and shows a banner with a "Fix: Reinstall
AI Assistant" button. This change does that click automatically on admin boot.

New provider GpuPassthroughRemediationProvider runs once on web env boot:

1. Skip when KV `ai.autoFixGpuPassthrough = false` (default true).
2. Skip when Docker has no `nvidia` runtime registered (AMD-only and CPU-only
   hosts unaffected).
3. Skip when nomad_ollama isn't running.
4. Exec `nvidia-smi --query-gpu=name --format=csv,noheader` inside the
   container with an 8-second timeout. If the output matches
   "Failed to initialize NVML", "Unknown Error", "TIMEOUT", or contains no
   alphabetic characters, treat the passthrough as broken.
5. On broken: call DockerService.forceReinstall('nomad_ollama'). The existing
   force-reinstall preserves the Ollama volume + installed models. Stamp
   `gpu.autoRemediatedAt` on success.
6. On healthy: log and exit.

AMD passthrough_failed is intentionally not handled — its fix path is HSA
override handling (PR #804) rather than a simple service recreate, and false
positives during AMD startup log parsing would loop a recreate without fixing
anything. Left to a follow-up if it proves to be a recurring AMD issue.

Validated on NOMAD3 (RTX 5060, v1.32.0-rc.3 + this patch hot-applied):

- After admin restart with passthrough healthy: log line
  "[GpuPassthroughRemediationProvider] NVIDIA passthrough healthy — no action
  needed." Provider exits cleanly without touching the container.
- The broken-state branch hits the existing forceReinstall path, which was
  manually invoked earlier in the same session to fix this exact box and
  recovered GPU access in ~45s with model volume intact. No new failure mode
  is introduced — the auto-trigger removes the user click but the underlying
  operation is the same one the banner Fix button already calls.

Closes #755.
2026-05-20 10:16:00 -07:00

127 lines
3.9 KiB
TypeScript

import { defineConfig } from '@adonisjs/core/app'
export default defineConfig({
/*
|--------------------------------------------------------------------------
| Experimental flags
|--------------------------------------------------------------------------
|
| The following features will be enabled by default in the next major release
| of AdonisJS. You can opt into them today to avoid any breaking changes
| during upgrade.
|
*/
experimental: {
mergeMultipartFieldsAndFiles: true,
shutdownInReverseOrder: true,
},
/*
|--------------------------------------------------------------------------
| Commands
|--------------------------------------------------------------------------
|
| List of ace commands to register from packages. The application commands
| will be scanned automatically from the "./commands" directory.
|
*/
commands: [() => import('@adonisjs/core/commands'), () => import('@adonisjs/lucid/commands')],
/*
|--------------------------------------------------------------------------
| Service providers
|--------------------------------------------------------------------------
|
| List of service providers to import and register when booting the
| application
|
*/
providers: [
() => import('@adonisjs/core/providers/app_provider'),
() => import('@adonisjs/core/providers/hash_provider'),
{
file: () => import('@adonisjs/core/providers/repl_provider'),
environment: ['repl', 'test'],
},
() => import('@adonisjs/core/providers/vinejs_provider'),
() => import('@adonisjs/core/providers/edge_provider'),
() => import('@adonisjs/session/session_provider'),
() => import('@adonisjs/vite/vite_provider'),
() => import('@adonisjs/shield/shield_provider'),
() => import('@adonisjs/static/static_provider'),
() => import('@adonisjs/cors/cors_provider'),
() => import('@adonisjs/lucid/database_provider'),
() => import('@adonisjs/inertia/inertia_provider'),
() => import('@adonisjs/transmit/transmit_provider'),
() => import('#providers/map_static_provider'),
() => import('#providers/kiwix_migration_provider'),
() => import('#providers/qdrant_restart_policy_provider'),
() => import('#providers/version_check_provider'),
() => import('#providers/gpu_passthrough_remediation_provider'),
],
/*
|--------------------------------------------------------------------------
| Preloads
|--------------------------------------------------------------------------
|
| List of modules to import before starting the application.
|
*/
preloads: [() => import('#start/routes'), () => import('#start/kernel')],
/*
|--------------------------------------------------------------------------
| Tests
|--------------------------------------------------------------------------
|
| List of test suites to organize tests by their type. Feel free to remove
| and add additional suites.
|
*/
tests: {
suites: [
{
files: ['tests/unit/**/*.spec(.ts|.js)'],
name: 'unit',
timeout: 2000,
},
{
files: ['tests/functional/**/*.spec(.ts|.js)'],
name: 'functional',
timeout: 30000,
},
],
forceExit: false,
},
/*
|--------------------------------------------------------------------------
| Metafiles
|--------------------------------------------------------------------------
|
| A collection of files you want to copy to the build folder when creating
| the production build.
|
*/
metaFiles: [
{
pattern: 'resources/views/**/*.edge',
reloadServer: false,
},
{
pattern: 'resources/geodata/**/*.geojson',
reloadServer: false,
},
{
pattern: 'public/**',
reloadServer: false,
},
],
assetsBundler: false,
hooks: {
onBuildStarting: [() => import('@adonisjs/vite/build_hook')],
},
})