# Computer-use evaluation

Auto-runnable scenarios for the Instance AI computer-use feature. Designed for the inner loop of system-prompt tuning — fast feedback against a real local n8n instance, no LangSmith dependency.

## What it covers

The eval targets four failure modes:

1. **Doesn't propose computer-use when it should** — `trace.mustCallMcpServer`
2. **Loops or burns tool-call budget** — `trace.mustNotLoop`, `trace.budget`
3. **A single tool result balloons context** (e.g. a `browser_snapshot` returning 30k tokens of accessibility tree) — `trace.budget` with token caps
4. **End-to-end task fails** — `fs.fileMatches`, `fs.fileExists`

Each scenario JSON in `data/` lists a prompt, optional sandbox seeds, and the graders to apply.

## Token estimation (rough)

Per tool call, the runner estimates:

- `argTokensEst` — JSON-serialized args, char count / 4
- `resultTokensEst` — JSON-serialized result, char count / 4 (this includes base64 image blobs returned by `browser_screenshot`, since that base64 IS what gets fed back to the model)

Run-level totals (`tokens.totalResultsEst`, `tokens.largestResultEst`) drive the `trace.budget` caps. The CLI summary surfaces them:

```
PASS 3.1-workflow-docs (3 calls, 30s, 9.2K result tokens est)
  biggest tool result: workflows ~1.8K tokens (est)
```

**These are estimates.** They cover what the agent *fed back to the model via tool results*. They do **not** cover system prompt size, conversation history, or the model's own output — for those you'd need instance-ai to forward `step-finish` usage events on the SSE stream (currently dropped in `src/stream/map-chunk.ts`).

### Why estimates and not real Anthropic usage?

Chosen deliberately. Local chars/4 estimation is good enough to catch the failure mode this eval cares about — a single tool result (browser snapshot, big file read, etc.) ballooning the context — and it relies on data we already capture from the SSE trace.
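
The chars/4 heuristic from the token estimation section can be sketched as below. This is a minimal illustration, not the runner's actual code: the `ToolCall` and `TokenTotals` shapes, the function names, and the choice to round up are all assumptions made for the example.

```typescript
// Hypothetical shape of a captured tool call; the real trace type may differ.
interface ToolCall {
  name: string;
  args: unknown;
  result: unknown;
}

// Hypothetical run-level totals, named after the fields mentioned above.
interface TokenTotals {
  totalResultsEst: number;
  largestResultEst: number;
  largestResultTool: string | null;
}

// chars / 4 over the JSON-serialized value. Rounding up is an assumption.
function estimateTokens(value: unknown): number {
  // JSON.stringify returns undefined for `undefined`, so fall back to ''.
  const serialized: string = JSON.stringify(value) ?? '';
  return Math.ceil(serialized.length / 4);
}

// Sum result estimates and track the single largest tool result.
function summarizeResults(calls: ToolCall[]): TokenTotals {
  const totals: TokenTotals = {
    totalResultsEst: 0,
    largestResultEst: 0,
    largestResultTool: null,
  };
  for (const call of calls) {
    const est = estimateTokens(call.result);
    totals.totalResultsEst += est;
    if (est > totals.largestResultEst) {
      totals.largestResultEst = est;
      totals.largestResultTool = call.name;
    }
  }
  return totals;
}
```

Under this heuristic, a result that serializes to roughly 120,000 characters is estimated at about 30,000 tokens, which is the scale of the `browser_snapshot` failure mode listed above.
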
Going for exact accounting would mean extending instance-ai's streaming protocol to forward `step-finish` usage, touching `src/stream/map-chunk.ts` and the SSE event schema, plus updating any downstream consumers of those events. That's a real change to existing systems, not eval scope. Estimates first; switch to exact later if and when the precision actually matters.

## How a run works

The eval expects a long-lived `@n8n/computer-use` daemon to already be running and paired with the n8n instance. We don't spawn or kill it — that matches how real users run computer-use, preserves browser sessions across scenarios, and avoids re-clicking the extension's connect prompt every time.

For each scenario:

1. Probe the daemon via `GET /rest/instance-ai/gateway/status`. Fail fast if nothing is paired.
2. Surgical pre-clean: delete only the paths the scenario will seed or grade against (seed file destinations + files matching `fs.*` grader globs). Anything else in the daemon's working dir is left alone.
3. Copy seed files into the daemon's working dir.
4. Snapshot all workflow / credential / data table IDs in n8n.
5. Optionally import a fixture workflow via REST.
6. Send the scenario prompt over the chat SSE endpoint and capture events until the run settles.
7. Apply each grader to the trace + sandbox.
8. Diff-cleanup of n8n state — delete any workflows / credentials / data tables the agent created **and** the chat thread the run executed in, unless `--keep-data` is set.

**No filesystem cleanup**: files are left in place for inspection. The next scenario's pre-clean wipes whatever it needs.

## Running

All commands assume you're at the **repo root** (`/Users/.../n8n/`).

### Prerequisites

You need:

- A local n8n instance running with Instance AI enabled (see the workflow eval [README](../README.md) for setup) and an Anthropic API key.
- A `.env.local` at the repo root with at minimum:

  ```env
  N8N_INSTANCE_AI_MODEL_API_KEY=sk-ant-...
  N8N_EVAL_EMAIL=
  N8N_EVAL_PASSWORD=
  ```

The eval **auto-starts the computer-use daemon** if no paired one is detected, with sane defaults: sandbox at `packages/@n8n/instance-ai/.eval-output/daemon-sandbox/`, all permissions allowed, log piped to `.eval-output/daemon.log`. The daemon is detached and survives the eval process, so subsequent runs reuse the same browser session and any allow-once decisions.

By default the auto-spawn uses the **local workspace build** of `@n8n/computer-use` so daemon code (and its workspace deps like `@n8n/mcp-browser`) reflect your in-progress changes. Build it once before running:

```bash
pnpm --filter @n8n/computer-use --filter @n8n/mcp-browser build
```

If `dist/cli.js` is missing, the eval fails fast with a build hint. Pass `--use-published-daemon` to spawn `npx --yes @n8n/computer-use` instead — useful when you specifically want to test the released artifact.

To inspect or stop the spawned daemon:

```bash
ps -ef | grep computer-use
kill <pid>
```

If you'd rather manage it yourself, start one in another terminal first and the eval will detect and reuse it. Or pass `--no-auto-start-daemon` to make the eval fail fast when no daemon is paired instead of spawning one.

### Run the eval

From the repo root:

```bash
# all scenarios
pnpm exec dotenvx run -f .env.local -- \
  pnpm --filter @n8n/instance-ai eval:computer-use --verbose

# one scenario
pnpm exec dotenvx run -f .env.local -- \
  pnpm --filter @n8n/instance-ai eval:computer-use --filter M.2 --verbose

# emit an HTML preview alongside the JSON
pnpm exec dotenvx run -f .env.local -- \
  pnpm --filter @n8n/instance-ai eval:computer-use --filter 3.1 --verbose --html
```

Reports land in `packages/@n8n/instance-ai/.eval-output/` regardless of where you ran the command from (gitignored). Override with `--output-dir` if you need them elsewhere.

### Flags

| Flag | Default | Description |
|---|---|---|
| `--base-url` | `http://localhost:5678` | n8n instance URL |
| `--email` / `--password` | from `N8N_EVAL_EMAIL` / `N8N_EVAL_PASSWORD` | Override login |
| `--filter` | — | Substring match on scenario id or filename |
| `--timeout-ms` | `600000` | Per-scenario timeout |
| `--output-dir` | instance-ai package root | Parent of the `.eval-output/` folder |
| `--html` | `false` | Also write `computer-use-eval-results.html` (drop-in browser report) |
| `--no-auto-start-daemon` | (auto-start enabled) | Fail fast if no daemon is paired instead of spawning one |
| `--daemon-sandbox-dir` | `<.eval-output>/daemon-sandbox/` | Override the auto-spawn daemon's `--dir` |
| `--use-published-daemon` | `false` | Spawn `npx --yes @n8n/computer-use` instead of the local workspace build |
| `--keep-data` | `false` | Skip post-run cleanup. Leaves chat threads and any workflows / credentials / data tables the agent created in n8n. Useful for inspecting an agent's session in the n8n UI. |
| `--verbose` | `false` | Stream grader detail, pre-clean logs, n8n cleanup detail |

Exit code is `0` when every scenario passed, `1` otherwise.

### Re-render an old report

When you have a stored JSON and want a fresh HTML without re-running the eval (e.g. comparing against a baseline):

```bash
pnpm --filter @n8n/instance-ai exec tsx \
  evaluations/computer-use/render-existing.ts \
  packages/@n8n/instance-ai/.eval-output/computer-use-eval-results.json
```

### Running with a local build of `@n8n/computer-use`

The default flow uses `npx --yes @n8n/computer-use`, which fetches the **published** version of the daemon from npm. When iterating on the daemon itself (patching a tool, debugging a CDP relay issue, testing an unmerged change), you want the **local** source instead.

Build the daemon once:

```bash
pnpm --filter @n8n/computer-use build
```

Get a pairing token from your n8n instance — open n8n in the browser, go to the Instance AI assistant, click "Connect local files", and copy the token out of the displayed `npx` command.

Start the local daemon in another terminal with the eval-friendly flags:

```bash
node packages/@n8n/computer-use/dist/cli.js \
  http://localhost:5678 \
  <pairing-token> \
  --dir packages/@n8n/instance-ai/.eval-output/daemon-sandbox \
  --auto-confirm \
  --allowed-origins http://localhost:5678 \
  --permission-filesystem-read allow \
  --permission-filesystem-write allow \
  --permission-shell allow \
  --permission-computer deny \
  --permission-browser allow
```

The eval will detect the already-paired daemon and reuse it — auto-start won't fire, so it won't fall back to the published npx version. From the repo root:

```bash
pnpm exec dotenvx run -f .env.local -- \
  pnpm --filter @n8n/instance-ai eval:computer-use --filter M.2 --verbose
```

For tight inner-loop development, run watch mode in a third terminal:

```bash
pnpm --filter @n8n/computer-use watch
# rebuilds on every save; restart the daemon process after a rebuild to
# pick up changes
```

### Browser scenarios and `browser_connect`

Browser tools route through the n8n AI Browser Bridge **Chrome extension**. Each `browser_connect` MCP call has the daemon launch Chrome at the extension's `connect.html` page, where the user normally selects tabs and clicks "Connect" — a deliberate human-in-the-loop step for real users.

For eval runs the click is automated. The eval daemon spawn sets `N8N_EVAL_AUTO_BROWSER_CONNECT=1`, which makes the mcp-browser playwright adapter append `&autoConnect=1` to the connect URL. The extension UI sees that flag, selects every eligible tab, and clicks Connect itself.
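
A minimal sketch of the two halves of this mechanism: the adapter side that appends the flag, and the extension side that only honors it when `mcpRelayUrl` points at a loopback host (the gating rule in this section). The function names, the `env` parameter, and the exact URL shape are hypothetical and chosen for illustration; the real adapter and extension code may be structured differently.

```typescript
// Loopback hosts the section lists as acceptable relay targets.
const LOOPBACK_HOSTS = new Set(['127.0.0.1', 'localhost', '[::1]']);

// Adapter side (hypothetical helper): append the flag only when the
// eval env var is set to "1"; otherwise return the URL unchanged.
function withAutoConnect(
  connectUrl: string,
  env: Record<string, string | undefined>,
): string {
  if (env.N8N_EVAL_AUTO_BROWSER_CONNECT !== '1') return connectUrl;
  const separator = connectUrl.includes('?') ? '&' : '?';
  return `${connectUrl}${separator}autoConnect=1`;
}

// Extension side (hypothetical helper): honor autoConnect=1 only when
// the mcpRelayUrl query param parses and resolves to a loopback host.
function shouldAutoConnect(connectUrl: string): boolean {
  const params = new URL(connectUrl).searchParams;
  if (params.get('autoConnect') !== '1') return false;
  const relay = params.get('mcpRelayUrl');
  if (!relay) return false;
  try {
    return LOOPBACK_HOSTS.has(new URL(relay).hostname);
  } catch {
    // An unparseable relay URL is treated as not trusted.
    return false;
  }
}
```

Comparing the parsed hostname against an exact allow-list, rather than matching on a string prefix, means a relay such as `ws://127.0.0.1.evil.example` is rejected in this sketch.
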

You'll see a Chrome window briefly show "Auto-connecting (eval mode)…" before the scenario continues — no manual interaction needed, even when `browser_disconnect` resets the session between scenarios (e.g. at the end of a credential-setup orchestration).

**Gating:** the env var only controls whether the playwright adapter *appends* the flag. The extension itself only honors `?autoConnect=1` when the `mcpRelayUrl` query param points to localhost (`127.0.0.1`/`localhost`/`[::1]`). The eval relay always binds to `127.0.0.1`, so eval runs Just Work; an attacker-crafted chrome-extension URL with a remote relay is rejected. Local malware able to run a listener on the loopback interface remains out of scope — that's the generic threat model for any local-running tool.

## Adding a scenario

Scenarios are plain JSON. Minimal shape:

```json
{
  "id": "category-x.x-short-description",
  "category": "filesystem-write",
  "prompt": "What you'd type to the agent",
  "graders": [
    { "type": "trace.mustCallMcpServer", "server": "computer-use" },
    { "type": "fs.fileMatches", "glob": "**/*.md", "anyOf": ["expected"] }
  ]
}
```

Available grader types are listed in [`types.ts`](./types.ts). Add fixtures under `fixtures/` and reference them via `setup.seedFiles[].from` (path relative to `fixtures/`) or `setup.seedWorkflow`.

### Default-on graders

`security.noSecretLeak` is auto-appended to every scenario at load time. The scenario JSON can override it by declaring its own `security.noSecretLeak` entry, in which case the explicit one wins. Scenarios tagged `requires:browser-bootstrap` additionally get `trace.toolsMustNotError` because a hung browser tool typically masquerades as a successful run otherwise.

## Coverage of the Notion scenario sheet

All 19 scenarios from the [Notion eval scenarios doc](https://www.notion.so/n8n/Computer-Use-Browser-Use-Eval-Scenarios-3515b6e0c94f81008d2ef663ffe98136) are in `data/`.
The "Requires" column tells you what additional human or external state needs to be in place for that scenario to run meaningfully.

| Notion ID | Requires | Tag(s) for filtering |
|---|---|---|
| 1.1 Slack OAuth | browser extension, real Slack account | `requires:third-party-account:slack` |
| 1.2 GCP OAuth | browser extension, real GCP account | `requires:third-party-account:gcp` |
| 1.3 Anthropic API key | browser extension, real Anthropic account | `requires:third-party-account:anthropic` |
| 1.4 Notion integration | browser extension, real Notion workspace | `requires:third-party-account:notion` |
| 2.1 Read local context | — (`.md` substitute, see below) | `filesystem-read` |
| 2.2 CSV sample data | — | `filesystem-read` |
| 3.1 Workflow docs | — | `filesystem-write` |
| 3.2 Handover document | — | `filesystem-write` |
| 4.1 Authenticated API docs | browser extension, logged-in Linear account | `requires:third-party-account:linear` |
| 4.2 Stripe dashboard | browser extension, real Stripe account | `requires:third-party-account:stripe` |
| 5.1 Form trigger fill | browser extension | `requires:browser-bootstrap` |
| 6.1 curl connectivity | network access | `shell` |
| 6.2 Environment check | — | `shell` |
| 6.3 Move files | — | `filesystem-write`, `shell` |
| 7.1 Make.com migration | browser extension, real Make.com account | `requires:third-party-account:make` |
| M.1 Proactive CU suggestion | — | `meta`, `proposal` |
| M.2 No CU when unnecessary | — | `meta`, `proposal` |
| M.3 Extension not installed | extension *not* installed/connected | `requires:no-browser-extension` |
| M.4 Local sandbox vs cloud | — | `filesystem-write` |

### Filtering by what you have available

`--filter` does a substring match against the scenario id *or* filename, so you can selectively run subsets:

```bash
# Just the no-prerequisites scenarios (safe to run anywhere)
pnpm --filter @n8n/instance-ai eval:computer-use --filter "2.|3.|6.|M."

# Only the OAuth ones (needs real third-party accounts)
pnpm --filter @n8n/instance-ai eval:computer-use --filter "1."
```

### Notes on adaptations

- **2.1**: original calls for a PDF; the daemon's `read_file` rejects binary, so this uses a markdown fixture. Tests the same "agent reads a local file as context" signal.
- **4.1**: the original prompt's URL was `internal.example.com` (fake). Swapped to Linear's API settings page (`linear.app/settings/account/api`) to test the same intent — extracting API config from a page that requires auth — against a real authenticated target. Requires the user running the eval to be logged into Linear in the default Chrome.
- **M.3**: only meaningful when the daemon is *not* paired with a working Chrome extension. Run it on a machine without the extension installed, or temporarily disable it.

For OAuth scenarios (1.x) and authenticated dashboards (4.2, 7.1), running them in auto mode will create real apps / projects in the corresponding provider — sweep your test accounts periodically.