n8n/packages/@n8n/instance-ai/evaluations/computer-use
Bernhard Wittmann b445221c6a
feat: Computer-use evaluation harness (no-changelog) (#29797)
Co-authored-by: Elias Meire <elias@meire.dev>
2026-05-12 08:36:12 +00:00
..
__tests__ feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
data feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
fixtures feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
graders feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
chat.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
cleanup.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
cli.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
daemon.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
formatting.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
path-utils.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
README.md feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
render-existing.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
report-html.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
runner.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
tokens.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00
types.ts feat: Computer-use evaluation harness (no-changelog) (#29797) 2026-05-12 08:36:12 +00:00

Computer-use evaluation

Auto-runnable scenarios for the Instance AI computer-use feature. Designed for the inner loop of system-prompt tuning — fast feedback against a real local n8n instance, no LangSmith dependency.

What it covers

The eval targets four failure modes:

  1. Doesn't propose computer-use when it shouldtrace.mustCallMcpServer
  2. Loops or burns tool-call budgettrace.mustNotLoop, trace.budget
  3. A single tool result balloons context (e.g. a browser_snapshot returning 30k tokens of accessibility tree) — trace.budget with token caps
  4. End-to-end task failsfs.fileMatches, fs.fileExists

Each scenario JSON in data/ lists a prompt, optional sandbox seeds, and the graders to apply.

Token estimation (rough)

Per tool call, the runner estimates:

  • argTokensEst — JSON-serialized args, char count / 4
  • resultTokensEst — JSON-serialized result, char count / 4 (this includes base64 image blobs returned by browser_screenshot, since that base64 IS what gets fed back to the model)

Run-level totals (tokens.totalResultsEst, tokens.largestResultEst) drive the trace.budget caps. The CLI summary surfaces them:

PASS  3.1-workflow-docs  (3 calls, 30s, 9.2K result tokens est)
      biggest tool result: workflows ~1.8K tokens (est)

These are estimates. They cover what the agent fed back to the model via tool results. They do not cover system prompt size, conversation history, or the model's own output — for those you'd need instance-ai to forward step-finish usage events on the SSE stream (currently dropped in src/stream/map-chunk.ts).

Why estimates and not real Anthropic usage?

Chosen deliberately. Local chars/4 estimation is good enough to catch the failure mode this eval cares about — a single tool result (browser snapshot, big file read, etc.) ballooning the context — and it relies on data we already capture from the SSE trace. Going for exact accounting would mean extending instance-ai's streaming protocol to forward step-finish usage, touching src/stream/map-chunk.ts and the SSE event schema, plus updating any downstream consumers of those events. That's a real change to existing systems, not eval scope. Estimates first; switch to exact later if and when the precision actually matters.

How a run works

The eval expects a long-lived @n8n/computer-use daemon to already be running and paired with the n8n instance. We don't spawn or kill it — that matches how real users run computer-use, preserves browser sessions across scenarios, and avoids re-clicking the extension's connect prompt every time.

For each scenario:

  1. Probe the daemon via GET /rest/instance-ai/gateway/status. Fail fast if nothing is paired.
  2. Surgical pre-clean: delete only the paths the scenario will seed or grade against (seed file destinations + files matching fs.* grader globs). Anything else in the daemon's working dir is left alone.
  3. Copy seed files into the daemon's working dir.
  4. Snapshot all workflow / credential / data table IDs in n8n.
  5. Optionally import a fixture workflow via REST.
  6. Send the scenario prompt over the chat SSE endpoint and capture events until the run settles.
  7. Apply each grader to the trace + sandbox.
  8. Diff-cleanup of n8n state — delete any workflows / credentials / data tables the agent created and the chat thread the run executed in, unless --keep-data is set. No filesystem cleanup: files left for inspection. Pre-clean of the next scenario will wipe what it needs.

Running

All commands assume you're at the repo root (/Users/.../n8n/).

Prerequisites

You need:

  • A local n8n instance running with Instance AI enabled (see the workflow eval README for setup) and an Anthropic API key.

  • A .env.local at the repo root with at minimum:

    N8N_INSTANCE_AI_MODEL_API_KEY=sk-ant-...
    N8N_EVAL_EMAIL=<your-owner-email>
    N8N_EVAL_PASSWORD=<your-owner-password>
    

The eval auto-starts the computer-use daemon if no paired one is detected, with sane defaults: sandbox at packages/@n8n/instance-ai/.eval-output/daemon-sandbox/, all permissions allowed, log piped to .eval-output/daemon.log. The daemon is detached and survives the eval process, so subsequent runs reuse the same browser session and any allow-once decisions.

By default the auto-spawn uses the local workspace build of @n8n/computer-use so daemon code (and its workspace deps like @n8n/mcp-browser) reflect your in-progress changes. Build it once before running:

pnpm --filter @n8n/computer-use --filter @n8n/mcp-browser build

If dist/cli.js is missing, the eval fails fast with a build hint.

Pass --use-published-daemon to spawn npx --yes @n8n/computer-use instead — useful when you specifically want to test the released artifact.

To inspect or stop the spawned daemon:

ps -ef | grep computer-use
kill <pid>

If you'd rather manage it yourself, start one in another terminal first and the eval will detect and reuse it. Or pass --no-auto-start-daemon to require you to.

Run the eval

From the repo root:

# all scenarios
pnpm exec dotenvx run -f .env.local -- \
  pnpm --filter @n8n/instance-ai eval:computer-use --verbose

# one scenario
pnpm exec dotenvx run -f .env.local -- \
  pnpm --filter @n8n/instance-ai eval:computer-use --filter M.2 --verbose

# emit an HTML preview alongside the JSON
pnpm exec dotenvx run -f .env.local -- \
  pnpm --filter @n8n/instance-ai eval:computer-use --filter 3.1 --verbose --html

Reports land in packages/@n8n/instance-ai/.eval-output/ regardless of where you ran the command from (gitignored). Override with --output-dir if you need them elsewhere.

Flags

Flag Default Description
--base-url http://localhost:5678 n8n instance URL
--email / --password from N8N_EVAL_EMAIL / N8N_EVAL_PASSWORD Override login
--filter Substring match on scenario id or filename
--timeout-ms 600000 Per-scenario timeout
--output-dir instance-ai package root Parent of the .eval-output/ folder
--html false Also write computer-use-eval-results.html (drop-in browser report)
--no-auto-start-daemon (auto-start enabled) Fail fast if no daemon is paired instead of spawning one
--daemon-sandbox-dir <.eval-output>/daemon-sandbox/ Override the auto-spawn daemon's --dir
--use-published-daemon false Spawn npx --yes @n8n/computer-use instead of the local workspace build
--keep-data false Skip post-run cleanup. Leaves chat threads and any workflows / credentials / data tables the agent created in n8n. Useful for inspecting an agent's session in the n8n UI.
--verbose false Stream grader detail, pre-clean logs, n8n cleanup detail

Exit code is 0 when every scenario passed, 1 otherwise.

Re-render an old report

When you have a stored JSON and want a fresh HTML without re-running the eval (e.g. comparing against a baseline):

pnpm --filter @n8n/instance-ai exec tsx \
  evaluations/computer-use/render-existing.ts \
  packages/@n8n/instance-ai/.eval-output/computer-use-eval-results.json

Running with a local build of @n8n/computer-use

The default flow uses npx --yes @n8n/computer-use, which fetches the published version of the daemon from npm. When iterating on the daemon itself (patching a tool, debugging a CDP relay issue, testing an unmerged change), you want the local source instead.

Build the daemon once:

pnpm --filter @n8n/computer-use build

Get a pairing token from your n8n instance — open n8n in the browser, go to the Instance AI assistant, click "Connect local files", and copy the token out of the displayed npx command.

Start the local daemon in another terminal with the eval-friendly flags:

node packages/@n8n/computer-use/dist/cli.js \
  http://localhost:5678 \
  <paste-token-here> \
  --dir packages/@n8n/instance-ai/.eval-output/daemon-sandbox \
  --auto-confirm \
  --allowed-origins http://localhost:5678 \
  --permission-filesystem-read allow \
  --permission-filesystem-write allow \
  --permission-shell allow \
  --permission-computer deny \
  --permission-browser allow

The eval will detect the already-paired daemon and reuse it — auto-start won't fire, so it won't fall back to the published npx version. From the repo root:

pnpm exec dotenvx run -f .env.local -- \
  pnpm --filter @n8n/instance-ai eval:computer-use --filter M.2 --verbose

For tight inner-loop development, run watch mode in a third terminal:

pnpm --filter @n8n/computer-use watch
# rebuilds on every save; restart the daemon process after a rebuild to
# pick up changes

Browser scenarios and browser_connect

Browser tools route through the n8n AI Browser Bridge Chrome extension. Each browser_connect MCP call has the daemon launch Chrome at the extension's connect.html page, where the user normally selects tabs and clicks "Connect" — a deliberate human-in-the-loop step for real users.

For eval runs the click is automated. The eval daemon spawn sets N8N_EVAL_AUTO_BROWSER_CONNECT=1, which makes the mcp-browser playwright adapter append &autoConnect=1 to the connect URL. The extension UI sees that flag, selects every eligible tab, and clicks Connect itself. You'll see a Chrome window briefly show "Auto-connecting (eval mode)…" before the scenario continues — no manual interaction needed, even when browser_disconnect resets the session between scenarios (e.g. at the end of a credential-setup orchestration).

Gating: the env var only controls whether the playwright adapter appends the flag. The extension itself only honors ?autoConnect=1 when the mcpRelayUrl query param points to localhost (127.0.0.1/localhost/[::1]). The eval relay always binds to 127.0.0.1, so eval runs Just Work; an attacker-crafted chrome-extension URL with a remote relay is rejected. Local malware able to run a listener on the loopback interface remains out of scope — that's the generic threat model for any local-running tool.

Adding a scenario

Scenarios are plain JSON. Minimal shape:

{
  "id": "category-x.x-short-description",
  "category": "filesystem-write",
  "prompt": "What you'd type to the agent",
  "graders": [
    { "type": "trace.mustCallMcpServer", "server": "computer-use" },
    { "type": "fs.fileMatches", "glob": "**/*.md", "anyOf": ["expected"] }
  ]
}

Available grader types are listed in types.ts. Add fixtures under fixtures/ and reference them via setup.seedFiles[].from (path relative to fixtures/) or setup.seedWorkflow.

Default-on graders

security.noSecretLeak is auto-appended to every scenario at load time. The scenario JSON can override it by declaring its own security.noSecretLeak entry, in which case the explicit one wins.

Scenarios tagged requires:browser-bootstrap additionally get trace.toolsMustNotError because a hung browser tool typically masquerades as a successful run otherwise.

Coverage of the Notion scenario sheet

All 19 scenarios from the Notion eval scenarios doc are in data/. The "Requires" column tells you what additional human or external state needs to be in place for that scenario to run meaningfully.

Notion ID Requires Tag(s) for filtering
1.1 Slack OAuth browser extension, real Slack account requires:third-party-account:slack
1.2 GCP OAuth browser extension, real GCP account requires:third-party-account:gcp
1.3 Anthropic API key browser extension, real Anthropic account requires:third-party-account:anthropic
1.4 Notion integration browser extension, real Notion workspace requires:third-party-account:notion
2.1 Read local context — (.md substitute, see below) filesystem-read
2.2 CSV sample data filesystem-read
3.1 Workflow docs filesystem-write
3.2 Handover document filesystem-write
4.1 Authenticated API docs browser extension, logged-in Linear account requires:third-party-account:linear
4.2 Stripe dashboard browser extension, real Stripe account requires:third-party-account:stripe
5.1 Form trigger fill browser extension requires:browser-bootstrap
6.1 curl connectivity network access shell
6.2 Environment check shell
6.3 Move files filesystem-write, shell
7.1 Make.com migration browser extension, real Make.com account requires:third-party-account:make
M.1 Proactive CU suggestion meta, proposal
M.2 No CU when unnecessary meta, proposal
M.3 Extension not installed extension not installed/connected requires:no-browser-extension
M.4 Local sandbox vs cloud filesystem-write

Filtering by what you have available

--filter does a substring match against the scenario id or filename, so you can selectively run subsets:

# Just the no-prerequisites scenarios (safe to run anywhere)
pnpm --filter @n8n/instance-ai eval:computer-use --filter "2.|3.|6.|M."

# Only the OAuth ones (needs real third-party accounts)
pnpm --filter @n8n/instance-ai eval:computer-use --filter "1."

Notes on adaptations

  • 2.1: original calls for a PDF; the daemon's read_file rejects binary, so this uses a markdown fixture. Tests the same "agent reads a local file as context" signal.
  • 4.1: the original prompt's URL was internal.example.com (fake). Swapped to Linear's API settings page (linear.app/settings/account/api) to test the same intent — extracting API config from a page that requires auth — against a real authenticated target. Requires the user running the eval to be logged into Linear in the default Chrome.
  • M.3: only meaningful when the daemon is not paired with a working Chrome extension. Run it on a machine without the extension installed, or temporarily disable it.

For OAuth scenarios (1.x) and authenticated dashboards (4.2, 7.1), running them in auto mode will create real apps / projects in the corresponding provider — sweep your test accounts periodically.