n8n/packages/@n8n/instance-ai/evaluations/computer-use/README.md
Bernhard Wittmann b445221c6a
feat: Computer-use evaluation harness (no-changelog) (#29797)
Co-authored-by: Elias Meire <elias@meire.dev>
2026-05-12 08:36:12 +00:00


# Computer-use evaluation
Auto-runnable scenarios for the Instance AI computer-use feature. Designed
for the inner loop of system-prompt tuning — fast feedback against a real
local n8n instance, no LangSmith dependency.
## What it covers
The eval targets four failure modes:
1. **Doesn't propose computer-use when it should**: `trace.mustCallMcpServer`
2. **Loops or burns tool-call budget**: `trace.mustNotLoop`, `trace.budget`
3. **A single tool result balloons context** (e.g. a `browser_snapshot` returning
30k tokens of accessibility tree): `trace.budget` with token caps
4. **End-to-end task fails**: `fs.fileMatches`, `fs.fileExists`
Each scenario JSON in `data/` lists a prompt, optional sandbox seeds, and
the graders to apply.
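As a rough illustration of what the trace graders check, here is a minimal sketch of a loop detector and a budget check. The `ToolCall` and `Budget` shapes, the function names, and the definition of a loop (identical consecutive calls) are assumptions made for this example; the real grader options are defined in `types.ts` and may differ.

```typescript
// Hypothetical trace shape; the real types live in types.ts and may differ.
interface ToolCall {
  toolName: string;
  args: unknown;
  resultTokensEst: number;
}

// Assumed definition of "looping": the same tool called with identical
// arguments `maxRepeats` or more times in a row.
function findLoop(calls: ToolCall[], maxRepeats: number): ToolCall | undefined {
  let streak = 0;
  let previousKey = "";
  for (const call of calls) {
    const key = `${call.toolName}:${JSON.stringify(call.args)}`;
    streak = key === previousKey ? streak + 1 : 1;
    previousKey = key;
    if (streak >= maxRepeats) return call;
  }
  return undefined;
}

// Assumed budget shape: a cap on call count and on any single tool result.
interface Budget {
  maxToolCalls: number;
  maxSingleResultTokens: number;
}

// Returns one human-readable message per violated cap (empty when within budget).
function checkBudget(calls: ToolCall[], budget: Budget): string[] {
  const failures: string[] = [];
  if (calls.length > budget.maxToolCalls) {
    failures.push(`${calls.length} tool calls exceeds cap of ${budget.maxToolCalls}`);
  }
  for (const call of calls) {
    if (call.resultTokensEst > budget.maxSingleResultTokens) {
      failures.push(`${call.toolName} returned ~${call.resultTokensEst} tokens (est)`);
    }
  }
  return failures;
}
```

A scenario would fail the loop check when `findLoop` returns a call, and fail the budget check when `checkBudget` returns any message.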
## Token estimation (rough)
Per tool call, the runner estimates:
- `argTokensEst` — JSON-serialized args, char count / 4
- `resultTokensEst` — JSON-serialized result, char count / 4 (this includes
base64 image blobs returned by `browser_screenshot`, since that base64 IS
what gets fed back to the model)
Run-level totals (`tokens.totalResultsEst`, `tokens.largestResultEst`) drive
the `trace.budget` caps. The CLI summary surfaces them:
```
PASS 3.1-workflow-docs (3 calls, 30s, 9.2K result tokens est)
biggest tool result: workflows ~1.8K tokens (est)
```
**These are estimates.** They cover what the agent *fed back to the model
via tool results*. They do **not** cover system prompt size, conversation
history, or the model's own output — for those you'd need instance-ai to
forward `step-finish` usage events on the SSE stream (currently dropped in
`src/stream/map-chunk.ts`).
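The estimation described above can be sketched as follows. The field names `argTokensEst`, `resultTokensEst`, `totalResultsEst`, and `largestResultEst` come from this README; the function names, the `largestResultTool` field, and rounding up are assumptions for illustration, not the runner's actual code.

```typescript
// Rough token estimate: JSON-serialized character count divided by 4.
// Rounding up is an assumption; the runner may round differently.
function estimateTokens(value: unknown): number {
  const serialized: string | undefined = JSON.stringify(value);
  return Math.ceil((serialized ?? "").length / 4);
}

interface ToolCallEstimate {
  toolName: string;
  argTokensEst: number;
  resultTokensEst: number;
}

// Per tool call: estimate both the arguments and the result.
function estimateCall(toolName: string, args: unknown, result: unknown): ToolCallEstimate {
  return {
    toolName,
    argTokensEst: estimateTokens(args),
    resultTokensEst: estimateTokens(result),
  };
}

interface RunSummary {
  tokens: { totalResultsEst: number; largestResultEst: number };
  // Hypothetical field: which tool produced the largest result.
  largestResultTool: string | undefined;
}

// Run level: sum all result estimates and track the single largest one.
function summarizeRun(calls: ToolCallEstimate[]): RunSummary {
  let totalResultsEst = 0;
  let largestResultEst = 0;
  let largestResultTool: string | undefined;
  for (const call of calls) {
    totalResultsEst += call.resultTokensEst;
    if (call.resultTokensEst > largestResultEst) {
      largestResultEst = call.resultTokensEst;
      largestResultTool = call.toolName;
    }
  }
  return { tokens: { totalResultsEst, largestResultEst }, largestResultTool };
}
```

Because the estimate is taken on the serialized result, a base64 screenshot counts at its full encoded length, which is the behavior described for `browser_screenshot`.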
### Why estimates and not real Anthropic usage?
Chosen deliberately. Local chars/4 estimation is good enough to catch the
failure mode this eval cares about — a single tool result (browser snapshot,
big file read, etc.) ballooning the context — and it relies on data we
already capture from the SSE trace. Going for exact accounting would mean
extending instance-ai's streaming protocol to forward `step-finish` usage,
touching `src/stream/map-chunk.ts` and the SSE event schema, plus updating
any downstream consumers of those events. That's a real change to existing
systems, not eval scope. Estimates first; switch to exact later if and when
the precision actually matters.
## How a run works
The eval runs against a long-lived `@n8n/computer-use` daemon that is
paired with the n8n instance. An already-running daemon is reused; if none
is paired, the eval auto-starts one (see Prerequisites) and leaves it
running. It never kills the daemon: that matches how real users run
computer-use, preserves browser sessions across scenarios, and avoids
re-clicking the extension's connect prompt every time.
For each scenario:
1. Probe the daemon via `GET /rest/instance-ai/gateway/status`. If nothing
is paired, auto-start a daemon, or fail fast when `--no-auto-start-daemon`
is set.
2. Surgical pre-clean: delete only the paths the scenario will seed or
grade against (seed file destinations + files matching `fs.*` grader
globs). Anything else in the daemon's working dir is left alone.
3. Copy seed files into the daemon's working dir.
4. Snapshot all workflow / credential / data table IDs in n8n.
5. Optionally import a fixture workflow via REST.
6. Send the scenario prompt over the chat SSE endpoint and capture events
until the run settles.
7. Apply each grader to the trace + sandbox.
8. Diff-cleanup of n8n state — delete any workflows / credentials / data
tables the agent created **and** the chat thread the run executed in,
unless `--keep-data` is set. **No filesystem cleanup**: files are left in
place for inspection, and the next scenario's pre-clean wipes only the
paths it needs.
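The snapshot in step 4 and the diff-cleanup in step 8 amount to a set difference over resource IDs. A minimal sketch, assuming a hypothetical `Snapshot` shape and function names (the runner's real types are not shown in this README):

```typescript
type ResourceKind = "workflows" | "credentials" | "dataTables";

// Hypothetical snapshot shape: the IDs that exist in n8n at a point in time.
type Snapshot = Record<ResourceKind, string[]>;

// IDs present after the run but not before it, i.e. what the agent created.
function diffCreated(before: Snapshot, after: Snapshot): Snapshot {
  const created = (kind: ResourceKind): string[] => {
    const known = new Set(before[kind]);
    return after[kind].filter((id) => !known.has(id));
  };
  return {
    workflows: created("workflows"),
    credentials: created("credentials"),
    dataTables: created("dataTables"),
  };
}

// With --keep-data nothing is scheduled for deletion.
function planCleanup(before: Snapshot, after: Snapshot, keepData: boolean): Snapshot {
  if (keepData) return { workflows: [], credentials: [], dataTables: [] };
  return diffCreated(before, after);
}
```

Only IDs missing from the pre-run snapshot are deleted, so resources that existed before the scenario started are left untouched.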
## Running
All commands assume you're at the **repo root** (`/Users/.../n8n/`).
### Prerequisites
You need:
- A local n8n instance running with Instance AI enabled (see the
workflow eval [README](../README.md) for setup) and an Anthropic API key.
- A `.env.local` at the repo root with at minimum:
```env
N8N_INSTANCE_AI_MODEL_API_KEY=sk-ant-...
N8N_EVAL_EMAIL=<your-owner-email>
N8N_EVAL_PASSWORD=<your-owner-password>
```
The eval **auto-starts the computer-use daemon** if no paired one is
detected, with sane defaults: sandbox at
`packages/@n8n/instance-ai/.eval-output/daemon-sandbox/`, all permissions
allowed, log piped to `.eval-output/daemon.log`. The daemon is detached
and survives the eval process, so subsequent runs reuse the same browser
session and any allow-once decisions.
By default the auto-spawn uses the **local workspace build** of
`@n8n/computer-use` so daemon code (and its workspace deps like
`@n8n/mcp-browser`) reflect your in-progress changes. Build it once
before running:
```bash
pnpm --filter @n8n/computer-use --filter @n8n/mcp-browser build
```
If `dist/cli.js` is missing, the eval fails fast with a build hint.
Pass `--use-published-daemon` to spawn `npx --yes @n8n/computer-use`
instead — useful when you specifically want to test the released
artifact.
To inspect or stop the spawned daemon:
```bash
ps -ef | grep computer-use
kill <pid>
```
If you'd rather manage it yourself, start one in another terminal first
and the eval will detect and reuse it. Pass `--no-auto-start-daemon` to
make the eval fail fast instead of spawning a daemon when none is paired.
### Run the eval
From the repo root:
```bash
# all scenarios
pnpm exec dotenvx run -f .env.local -- \
pnpm --filter @n8n/instance-ai eval:computer-use --verbose
# one scenario
pnpm exec dotenvx run -f .env.local -- \
pnpm --filter @n8n/instance-ai eval:computer-use --filter M.2 --verbose
# emit an HTML preview alongside the JSON
pnpm exec dotenvx run -f .env.local -- \
pnpm --filter @n8n/instance-ai eval:computer-use --filter 3.1 --verbose --html
```
Reports land in `packages/@n8n/instance-ai/.eval-output/` regardless of
where you ran the command from (gitignored). Override with `--output-dir`
if you need them elsewhere.
### Flags
| Flag | Default | Description |
|---|---|---|
| `--base-url` | `http://localhost:5678` | n8n instance URL |
| `--email` / `--password` | from `N8N_EVAL_EMAIL` / `N8N_EVAL_PASSWORD` | Override login |
| `--filter` | — | Substring match on scenario id or filename |
| `--timeout-ms` | `600000` | Per-scenario timeout |
| `--output-dir` | instance-ai package root | Parent of the `.eval-output/` folder |
| `--html` | `false` | Also write `computer-use-eval-results.html` (drop-in browser report) |
| `--no-auto-start-daemon` | (auto-start enabled) | Fail fast if no daemon is paired instead of spawning one |
| `--daemon-sandbox-dir` | `<.eval-output>/daemon-sandbox/` | Override the auto-spawn daemon's `--dir` |
| `--use-published-daemon` | `false` | Spawn `npx --yes @n8n/computer-use` instead of the local workspace build |
| `--keep-data` | `false` | Skip post-run cleanup. Leaves chat threads and any workflows / credentials / data tables the agent created in n8n. Useful for inspecting an agent's session in the n8n UI. |
| `--verbose` | `false` | Stream grader detail, pre-clean logs, n8n cleanup detail |
Exit code is `0` when every scenario passed, `1` otherwise.
### Re-render an old report
When you have a stored JSON and want a fresh HTML without re-running the
eval (e.g. comparing against a baseline):
```bash
pnpm --filter @n8n/instance-ai exec tsx \
evaluations/computer-use/render-existing.ts \
packages/@n8n/instance-ai/.eval-output/computer-use-eval-results.json
```
### Running with a local build of `@n8n/computer-use`
Auto-start already spawns the **local** workspace build unless you pass
`--use-published-daemon`, which fetches the **published** daemon from npm
via `npx --yes @n8n/computer-use`. When iterating on the daemon itself
(patching a tool, debugging a CDP relay issue, testing an unmerged
change), you may prefer to start the local build by hand so that you
control its flags and restarts.
Build the daemon once:
```bash
pnpm --filter @n8n/computer-use build
```
Get a pairing token from your n8n instance — open n8n in the browser,
go to the Instance AI assistant, click "Connect local files", and copy
the token out of the displayed `npx` command.
Start the local daemon in another terminal with the eval-friendly flags:
```bash
node packages/@n8n/computer-use/dist/cli.js \
http://localhost:5678 \
<paste-token-here> \
--dir packages/@n8n/instance-ai/.eval-output/daemon-sandbox \
--auto-confirm \
--allowed-origins http://localhost:5678 \
--permission-filesystem-read allow \
--permission-filesystem-write allow \
--permission-shell allow \
--permission-computer deny \
--permission-browser allow
```
The eval will detect the already-paired daemon and reuse it, so auto-start
won't fire. From the repo root:
```bash
pnpm exec dotenvx run -f .env.local -- \
pnpm --filter @n8n/instance-ai eval:computer-use --filter M.2 --verbose
```
For tight inner-loop development, run watch mode in a third terminal:
```bash
pnpm --filter @n8n/computer-use watch
# rebuilds on every save; restart the daemon process after a rebuild to
# pick up changes
```
### Browser scenarios and `browser_connect`
Browser tools route through the n8n AI Browser Bridge **Chrome extension**.
Each `browser_connect` MCP call has the daemon launch Chrome at the
extension's `connect.html` page, where the user normally selects tabs and
clicks "Connect" — a deliberate human-in-the-loop step for real users.
For eval runs the click is automated. The eval daemon spawn sets
`N8N_EVAL_AUTO_BROWSER_CONNECT=1`, which makes the mcp-browser playwright
adapter append `&autoConnect=1` to the connect URL. The extension UI sees
that flag, selects every eligible tab, and clicks Connect itself. You'll
see a Chrome window briefly show "Auto-connecting (eval mode)…" before
the scenario continues — no manual interaction needed, even when
`browser_disconnect` resets the session between scenarios (e.g. at the
end of a credential-setup orchestration).
**Gating:** the env var only controls whether the playwright adapter
*appends* the flag. The extension itself only honors `?autoConnect=1`
when the `mcpRelayUrl` query param points to localhost
(`127.0.0.1`/`localhost`/`[::1]`). The eval relay always binds to
`127.0.0.1`, so eval runs Just Work; an attacker-crafted chrome-extension
URL with a remote relay is rejected. Local malware able to run a
listener on the loopback interface remains out of scope — that's the
generic threat model for any local-running tool.
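The gating rule can be sketched as a check on the connect URL's query parameters. This is an illustration only, not the extension's actual code: the function name, the example extension ID, and the relay URL scheme are assumptions; the parameter names `autoConnect` and `mcpRelayUrl` and the allowed hosts come from the description above.

```typescript
// Hosts treated as loopback, as listed above. The WHATWG URL parser keeps
// the brackets around IPv6 literals in `hostname`, hence "[::1]".
const LOOPBACK_HOSTS = new Set(["127.0.0.1", "localhost", "[::1]"]);

// Returns true only when auto-connect is requested AND the relay is local.
function shouldAutoConnect(connectUrl: string): boolean {
  try {
    const params = new URL(connectUrl).searchParams;
    if (params.get("autoConnect") !== "1") return false;
    const relayUrl = params.get("mcpRelayUrl");
    if (relayUrl === null) return false;
    return LOOPBACK_HOSTS.has(new URL(relayUrl).hostname);
  } catch {
    // Unparseable URLs are never auto-connected.
    return false;
  }
}

// Example with a made-up extension ID and relay port:
const relay = encodeURIComponent("ws://127.0.0.1:9000/relay");
const exampleUrl = `chrome-extension://abcdefgh/connect.html?mcpRelayUrl=${relay}&autoConnect=1`;
const exampleAllowed = shouldAutoConnect(exampleUrl);
```

Matching on the parsed `hostname` rather than on a string prefix means a lookalike host such as `localhost.evil.example` is rejected.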
## Adding a scenario
Scenarios are plain JSON. Minimal shape:
```json
{
"id": "category-x.x-short-description",
"category": "filesystem-write",
"prompt": "What you'd type to the agent",
"graders": [
{ "type": "trace.mustCallMcpServer", "server": "computer-use" },
{ "type": "fs.fileMatches", "glob": "**/*.md", "anyOf": ["expected"] }
]
}
```
Available grader types are listed in [`types.ts`](./types.ts). Add fixtures
under `fixtures/` and reference them via `setup.seedFiles[].from` (path
relative to `fixtures/`) or `setup.seedWorkflow`.
### Default-on graders
`security.noSecretLeak` is auto-appended to every scenario at load time.
The scenario JSON can override it by declaring its own
`security.noSecretLeak` entry, in which case the explicit one wins.
Scenarios tagged `requires:browser-bootstrap` additionally get
`trace.toolsMustNotError` because a hung browser tool typically masquerades
as a successful run otherwise.
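A minimal sketch of how these defaults could be applied at load time. The `Scenario` and `Grader` shapes, the `tags` field, and the function name are assumptions for illustration; whether an explicit `trace.toolsMustNotError` entry also overrides the default is likewise assumed here.

```typescript
// Hypothetical shapes; the real definitions are in types.ts.
interface Grader {
  type: string;
  [option: string]: unknown;
}

interface Scenario {
  id: string;
  tags?: string[];
  graders: Grader[];
}

// Returns the scenario's graders plus any default-on graders it lacks.
// An explicit entry of the same type wins, so defaults are only appended
// when the scenario has not declared that grader itself.
function withDefaultGraders(scenario: Scenario): Grader[] {
  const graders = [...scenario.graders];
  const has = (type: string): boolean => graders.some((g) => g.type === type);

  if (!has("security.noSecretLeak")) {
    graders.push({ type: "security.noSecretLeak" });
  }
  const needsBrowser = scenario.tags?.includes("requires:browser-bootstrap") ?? false;
  if (needsBrowser && !has("trace.toolsMustNotError")) {
    graders.push({ type: "trace.toolsMustNotError" });
  }
  return graders;
}
```

The input scenario is not mutated, so the JSON loaded from `data/` can still be inspected as written.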
## Coverage of the Notion scenario sheet
All 19 scenarios from the [Notion eval scenarios doc](https://www.notion.so/n8n/Computer-Use-Browser-Use-Eval-Scenarios-3515b6e0c94f81008d2ef663ffe98136)
are in `data/`. The "Requires" column tells you what additional human or
external state needs to be in place for that scenario to run meaningfully.
| Notion ID | Requires | Tag(s) for filtering |
|---|---|---|
| 1.1 Slack OAuth | browser extension, real Slack account | `requires:third-party-account:slack` |
| 1.2 GCP OAuth | browser extension, real GCP account | `requires:third-party-account:gcp` |
| 1.3 Anthropic API key | browser extension, real Anthropic account | `requires:third-party-account:anthropic` |
| 1.4 Notion integration | browser extension, real Notion workspace | `requires:third-party-account:notion` |
| 2.1 Read local context | — (`.md` substitute, see below) | `filesystem-read` |
| 2.2 CSV sample data | — | `filesystem-read` |
| 3.1 Workflow docs | — | `filesystem-write` |
| 3.2 Handover document | — | `filesystem-write` |
| 4.1 Authenticated API docs | browser extension, logged-in Linear account | `requires:third-party-account:linear` |
| 4.2 Stripe dashboard | browser extension, real Stripe account | `requires:third-party-account:stripe` |
| 5.1 Form trigger fill | browser extension | `requires:browser-bootstrap` |
| 6.1 curl connectivity | network access | `shell` |
| 6.2 Environment check | — | `shell` |
| 6.3 Move files | — | `filesystem-write`, `shell` |
| 7.1 Make.com migration | browser extension, real Make.com account | `requires:third-party-account:make` |
| M.1 Proactive CU suggestion | — | `meta`, `proposal` |
| M.2 No CU when unnecessary | — | `meta`, `proposal` |
| M.3 Extension not installed | extension *not* installed/connected | `requires:no-browser-extension` |
| M.4 Local sandbox vs cloud | — | `filesystem-write` |
### Filtering by what you have available
`--filter` does a substring match against the scenario id *or* filename, so
you can selectively run subsets:
```bash
# Just the no-prerequisites scenarios (safe to run anywhere)
pnpm --filter @n8n/instance-ai eval:computer-use --filter "2.|3.|6.|M."
# Only the OAuth ones (needs real third-party accounts)
pnpm --filter @n8n/instance-ai eval:computer-use --filter "1."
```
### Notes on adaptations
- **2.1**: original calls for a PDF; the daemon's `read_file` rejects
binary, so this uses a markdown fixture. Tests the same
"agent reads a local file as context" signal.
- **4.1**: the original prompt's URL was `internal.example.com` (fake).
Swapped to Linear's API settings page (`linear.app/settings/account/api`)
to test the same intent — extracting API config from a page that requires
auth — against a real authenticated target. Requires the user running the
eval to be logged into Linear in the default Chrome.
- **M.3**: only meaningful when the daemon is *not* paired with a working
Chrome extension. Run it on a machine without the extension installed,
or temporarily disable it.
For OAuth scenarios (1.x) and authenticated dashboards (4.2, 7.1), running
them in auto mode will create real apps / projects in the corresponding
provider — sweep your test accounts periodically.