Computer-use evaluation
Auto-runnable scenarios for the Instance AI computer-use feature. Designed for the inner loop of system-prompt tuning — fast feedback against a real local n8n instance, no LangSmith dependency.
What it covers
The eval targets four failure modes:
- Doesn't propose computer-use when it should: trace.mustCallMcpServer
- Loops or burns tool-call budget: trace.mustNotLoop, trace.budget
- A single tool result balloons context (e.g. a browser_snapshot returning 30k tokens of accessibility tree): trace.budget with token caps
- End-to-end task fails: fs.fileMatches, fs.fileExists
Each scenario JSON in data/ lists a prompt, optional sandbox seeds, and
the graders to apply.
Token estimation (rough)
Per tool call, the runner estimates:
- argTokensEst: JSON-serialized args, char count / 4
- resultTokensEst: JSON-serialized result, char count / 4 (this includes base64 image blobs returned by browser_screenshot, since that base64 is what gets fed back to the model)
Run-level totals (tokens.totalResultsEst, tokens.largestResultEst) drive
the trace.budget caps. The CLI summary surfaces them:
PASS 3.1-workflow-docs (3 calls, 30s, 9.2K result tokens est)
biggest tool result: workflows ~1.8K tokens (est)
These are estimates. They cover what the agent fed back to the model
via tool results. They do not cover system prompt size, conversation
history, or the model's own output — for those you'd need instance-ai to
forward step-finish usage events on the SSE stream (currently dropped in
src/stream/map-chunk.ts).
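The estimate described above can be sketched in a few lines. This is an illustration, not the runner's actual code: the ToolCall shape and the function names estimateTokens and summarizeTokens are hypothetical, and rounding up is an assumption. Only the *Est field names come from this README.

```typescript
// Hypothetical shape of one captured tool call (not taken from types.ts).
interface ToolCall {
  toolName: string;
  args: unknown;
  result: unknown;
}

// chars / 4 heuristic over the JSON-serialized value.
// Rounding up is an assumption of this sketch.
function estimateTokens(value: unknown): number {
  // JSON.stringify returns undefined for values like `undefined`.
  const serialized: string | undefined = JSON.stringify(value);
  return Math.ceil((serialized ?? '').length / 4);
}

// Per-call estimates plus the run-level totals that budget caps would read.
function summarizeTokens(calls: ToolCall[]) {
  const perCall = calls.map((call) => ({
    toolName: call.toolName,
    argTokensEst: estimateTokens(call.args),
    resultTokensEst: estimateTokens(call.result),
  }));
  const totalResultsEst = perCall.reduce((sum, c) => sum + c.resultTokensEst, 0);
  const largestResultEst = perCall.reduce((max, c) => Math.max(max, c.resultTokensEst), 0);
  return { perCall, totalResultsEst, largestResultEst };
}
```

Because the serialized string is what gets measured, a base64 screenshot counts at its full encoded length, which is the behaviour the resultTokensEst note describes.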
Why estimates and not real Anthropic usage?
Chosen deliberately. Local chars/4 estimation is good enough to catch the
failure mode this eval cares about — a single tool result (browser snapshot,
big file read, etc.) ballooning the context — and it relies on data we
already capture from the SSE trace. Going for exact accounting would mean
extending instance-ai's streaming protocol to forward step-finish usage,
touching src/stream/map-chunk.ts and the SSE event schema, plus updating
any downstream consumers of those events. That's a real change to existing
systems, not eval scope. Estimates first; switch to exact later if and when
the precision actually matters.
How a run works
The eval runs against a long-lived @n8n/computer-use daemon that is paired
with the n8n instance. It reuses an already-paired daemon when one exists,
auto-starts one otherwise (see Prerequisites), and never kills it between
scenarios. That matches how real users run computer-use, preserves browser
sessions across scenarios, and avoids re-clicking the extension's connect
prompt every time.
For each scenario:
- Probe the daemon via GET /rest/instance-ai/gateway/status. Fail fast if nothing is paired.
- Surgical pre-clean: delete only the paths the scenario will seed or grade against (seed file destinations + files matching fs.* grader globs). Anything else in the daemon's working dir is left alone.
- Copy seed files into the daemon's working dir.
- Snapshot all workflow / credential / data table IDs in n8n.
- Optionally import a fixture workflow via REST.
- Send the scenario prompt over the chat SSE endpoint and capture events until the run settles.
- Apply each grader to the trace + sandbox.
- Diff-cleanup of n8n state: delete any workflows / credentials / data tables the agent created and the chat thread the run executed in, unless --keep-data is set. No filesystem cleanup: files are left for inspection. Pre-clean of the next scenario will wipe what it needs.
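The snapshot and diff-cleanup steps amount to a set difference over resource IDs. A minimal sketch, assuming the snapshot is a plain record of ID arrays; the Snapshot type and the createdDuringRun name are hypothetical and not taken from cleanup.ts:

```typescript
type ResourceKind = 'workflows' | 'credentials' | 'dataTables';

// Hypothetical snapshot shape: the IDs present in n8n at one point in time.
type Snapshot = Record<ResourceKind, string[]>;

// Returns only the IDs that exist after the run but did not exist before it.
// Those are the candidates for deletion; pre-existing resources are never touched.
function createdDuringRun(before: Snapshot, after: Snapshot): Snapshot {
  const diff = (kind: ResourceKind): string[] => {
    const known = new Set(before[kind]);
    return after[kind].filter((id) => !known.has(id));
  };
  return {
    workflows: diff('workflows'),
    credentials: diff('credentials'),
    dataTables: diff('dataTables'),
  };
}
```

Diffing against a pre-run snapshot, rather than deleting by name or tag, means the cleanup cannot remove anything the agent did not create, even if the agent picks names that collide with existing resources.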
Running
All commands assume you're at the repo root (/Users/.../n8n/).
Prerequisites
You need:
- A local n8n instance running with Instance AI enabled (see the workflow eval README for setup) and an Anthropic API key.
- A .env.local at the repo root with at minimum:
  N8N_INSTANCE_AI_MODEL_API_KEY=sk-ant-...
  N8N_EVAL_EMAIL=<your-owner-email>
  N8N_EVAL_PASSWORD=<your-owner-password>
The eval auto-starts the computer-use daemon if no paired one is
detected, with sane defaults: sandbox at
packages/@n8n/instance-ai/.eval-output/daemon-sandbox/, all permissions
allowed, log piped to .eval-output/daemon.log. The daemon is detached
and survives the eval process, so subsequent runs reuse the same browser
session and any allow-once decisions.
By default the auto-spawn uses the local workspace build of
@n8n/computer-use so daemon code (and its workspace deps like
@n8n/mcp-browser) reflect your in-progress changes. Build it once
before running:
pnpm --filter @n8n/computer-use --filter @n8n/mcp-browser build
If dist/cli.js is missing, the eval fails fast with a build hint.
Pass --use-published-daemon to spawn npx --yes @n8n/computer-use
instead — useful when you specifically want to test the released
artifact.
To inspect or stop the spawned daemon:
ps -ef | grep computer-use
kill <pid>
If you'd rather manage it yourself, start one in another terminal first
and the eval will detect and reuse it. Pass --no-auto-start-daemon to make
the eval fail fast, instead of spawning a daemon, when none is paired.
Run the eval
From the repo root:
# all scenarios
pnpm exec dotenvx run -f .env.local -- \
pnpm --filter @n8n/instance-ai eval:computer-use --verbose
# one scenario
pnpm exec dotenvx run -f .env.local -- \
pnpm --filter @n8n/instance-ai eval:computer-use --filter M.2 --verbose
# emit an HTML preview alongside the JSON
pnpm exec dotenvx run -f .env.local -- \
pnpm --filter @n8n/instance-ai eval:computer-use --filter 3.1 --verbose --html
Reports land in packages/@n8n/instance-ai/.eval-output/ (gitignored),
regardless of where you ran the command from. Override with --output-dir
if you need them elsewhere.
Flags
| Flag | Default | Description |
|---|---|---|
| --base-url | http://localhost:5678 | n8n instance URL |
| --email / --password | from N8N_EVAL_EMAIL / N8N_EVAL_PASSWORD | Override login |
| --filter | (none) | Substring match on scenario id or filename |
| --timeout-ms | 600000 | Per-scenario timeout |
| --output-dir | instance-ai package root | Parent of the .eval-output/ folder |
| --html | false | Also write computer-use-eval-results.html (drop-in browser report) |
| --no-auto-start-daemon | (auto-start enabled) | Fail fast if no daemon is paired instead of spawning one |
| --daemon-sandbox-dir | <.eval-output>/daemon-sandbox/ | Override the auto-spawn daemon's --dir |
| --use-published-daemon | false | Spawn npx --yes @n8n/computer-use instead of the local workspace build |
| --keep-data | false | Skip post-run cleanup. Leaves chat threads and any workflows / credentials / data tables the agent created in n8n. Useful for inspecting an agent's session in the n8n UI. |
| --verbose | false | Stream grader detail, pre-clean logs, n8n cleanup detail |
Exit code is 0 when every scenario passed, 1 otherwise.
Re-render an old report
When you have a stored JSON and want a fresh HTML without re-running the eval (e.g. comparing against a baseline):
pnpm --filter @n8n/instance-ai exec tsx \
evaluations/computer-use/render-existing.ts \
packages/@n8n/instance-ai/.eval-output/computer-use-eval-results.json
Running with a local build of @n8n/computer-use
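Auto-start already spawns the local workspace build unless
--use-published-daemon is passed, in which case npx --yes @n8n/computer-use
fetches the published version of the daemon from npm. When iterating on the
daemon itself (patching a tool, debugging a CDP relay issue, testing an
unmerged change), you can also run the local source by hand, which gives
you control over its flags and restarts.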
Build the daemon once:
pnpm --filter @n8n/computer-use build
Get a pairing token from your n8n instance — open n8n in the browser,
go to the Instance AI assistant, click "Connect local files", and copy
the token out of the displayed npx command.
Start the local daemon in another terminal with the eval-friendly flags:
node packages/@n8n/computer-use/dist/cli.js \
http://localhost:5678 \
<paste-token-here> \
--dir packages/@n8n/instance-ai/.eval-output/daemon-sandbox \
--auto-confirm \
--allowed-origins http://localhost:5678 \
--permission-filesystem-read allow \
--permission-filesystem-write allow \
--permission-shell allow \
--permission-computer deny \
--permission-browser allow
The eval will detect the already-paired daemon and reuse it, so auto-start won't fire. From the repo root:
pnpm exec dotenvx run -f .env.local -- \
pnpm --filter @n8n/instance-ai eval:computer-use --filter M.2 --verbose
For tight inner-loop development, run watch mode in a third terminal:
pnpm --filter @n8n/computer-use watch
# rebuilds on every save; restart the daemon process after a rebuild to
# pick up changes
Browser scenarios and browser_connect
Browser tools route through the n8n AI Browser Bridge Chrome extension.
Each browser_connect MCP call has the daemon launch Chrome at the
extension's connect.html page, where the user normally selects tabs and
clicks "Connect" — a deliberate human-in-the-loop step for real users.
For eval runs the click is automated. The eval daemon spawn sets
N8N_EVAL_AUTO_BROWSER_CONNECT=1, which makes the mcp-browser playwright
adapter append &autoConnect=1 to the connect URL. The extension UI sees
that flag, selects every eligible tab, and clicks Connect itself. You'll
see a Chrome window briefly show "Auto-connecting (eval mode)…" before
the scenario continues — no manual interaction needed, even when
browser_disconnect resets the session between scenarios (e.g. at the
end of a credential-setup orchestration).
Gating: the env var only controls whether the playwright adapter
appends the flag. The extension itself only honors ?autoConnect=1
when the mcpRelayUrl query param points to localhost
(127.0.0.1/localhost/[::1]). The eval relay always binds to
127.0.0.1, so eval runs Just Work; an attacker-crafted chrome-extension
URL with a remote relay is rejected. Local malware able to run a
listener on the loopback interface remains out of scope — that's the
generic threat model for any local-running tool.
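The gating rule can be illustrated with a small check on the connect URL. This is a sketch of the rule as described here, not the extension's actual code: the shouldAutoConnect name, the example URLs, and the ws:// relay scheme are assumptions. Only the autoConnect and mcpRelayUrl parameter names and the three loopback hosts come from the text above.

```typescript
// Hosts this README lists as acceptable relay targets.
// WHATWG URL parsing keeps the brackets on IPv6 hostnames.
const LOOPBACK_HOSTS = new Set(['127.0.0.1', 'localhost', '[::1]']);

// Honor autoConnect=1 only when the relay URL points at the loopback interface.
function shouldAutoConnect(connectUrl: string): boolean {
  try {
    const params = new URL(connectUrl).searchParams;
    if (params.get('autoConnect') !== '1') return false;
    const relay = params.get('mcpRelayUrl');
    if (relay === null) return false;
    // Compare the parsed hostname, not a string prefix, so that
    // "localhost.evil.example" does not pass as "localhost".
    return LOOPBACK_HOSTS.has(new URL(relay).hostname);
  } catch {
    return false; // malformed connect or relay URL: never auto-connect
  }
}
```

Parsing the relay URL and comparing its hostname exactly is the important design choice: a prefix or substring test on the raw query string would accept hosts that merely start with a loopback name.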
Adding a scenario
Scenarios are plain JSON. Minimal shape:
{
"id": "category-x.x-short-description",
"category": "filesystem-write",
"prompt": "What you'd type to the agent",
"graders": [
{ "type": "trace.mustCallMcpServer", "server": "computer-use" },
{ "type": "fs.fileMatches", "glob": "**/*.md", "anyOf": ["expected"] }
]
}
Available grader types are listed in types.ts. Add fixtures
under fixtures/ and reference them via setup.seedFiles[].from (path
relative to fixtures/) or setup.seedWorkflow.
Default-on graders
security.noSecretLeak is auto-appended to every scenario at load time.
The scenario JSON can override it by declaring its own
security.noSecretLeak entry, in which case the explicit one wins.
Scenarios tagged requires:browser-bootstrap additionally get
trace.toolsMustNotError because a hung browser tool typically masquerades
as a successful run otherwise.
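The load-time behaviour can be sketched as follows. The Grader and Scenario shapes, the tags field, and the withDefaultGraders name are hypothetical (the real definitions live in types.ts). Skipping trace.toolsMustNotError when the scenario already declares one is also an assumption, made by analogy with the stated override rule for security.noSecretLeak.

```typescript
// Hypothetical minimal shapes; the real ones are defined in types.ts.
interface Grader {
  type: string;
  [option: string]: unknown;
}

interface Scenario {
  id: string;
  tags?: string[]; // assumed location of tags such as "requires:browser-bootstrap"
  graders: Grader[];
}

// Returns the scenario's graders plus any default-on graders it did not declare itself.
function withDefaultGraders(scenario: Scenario): Grader[] {
  const graders = [...scenario.graders];
  const has = (type: string) => graders.some((g) => g.type === type);

  // An explicit security.noSecretLeak entry wins over the default one.
  if (!has('security.noSecretLeak')) {
    graders.push({ type: 'security.noSecretLeak' });
  }
  // Browser scenarios additionally get the tool-error check.
  if (scenario.tags?.includes('requires:browser-bootstrap') && !has('trace.toolsMustNotError')) {
    graders.push({ type: 'trace.toolsMustNotError' });
  }
  return graders;
}
```

The function copies the grader list instead of mutating the scenario, so the loaded JSON still reflects what the scenario file declared.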
Coverage of the Notion scenario sheet
All 19 scenarios from the Notion eval scenarios doc
are in data/. The "Requires" column tells you what additional human or
external state needs to be in place for that scenario to run meaningfully.
| Notion ID | Requires | Tag(s) for filtering |
|---|---|---|
| 1.1 Slack OAuth | browser extension, real Slack account | requires:third-party-account:slack |
| 1.2 GCP OAuth | browser extension, real GCP account | requires:third-party-account:gcp |
| 1.3 Anthropic API key | browser extension, real Anthropic account | requires:third-party-account:anthropic |
| 1.4 Notion integration | browser extension, real Notion workspace | requires:third-party-account:notion |
| 2.1 Read local context | none (.md substitute, see below) | filesystem-read |
| 2.2 CSV sample data | — | filesystem-read |
| 3.1 Workflow docs | — | filesystem-write |
| 3.2 Handover document | — | filesystem-write |
| 4.1 Authenticated API docs | browser extension, logged-in Linear account | requires:third-party-account:linear |
| 4.2 Stripe dashboard | browser extension, real Stripe account | requires:third-party-account:stripe |
| 5.1 Form trigger fill | browser extension | requires:browser-bootstrap |
| 6.1 curl connectivity | network access | shell |
| 6.2 Environment check | — | shell |
| 6.3 Move files | — | filesystem-write, shell |
| 7.1 Make.com migration | browser extension, real Make.com account | requires:third-party-account:make |
| M.1 Proactive CU suggestion | — | meta, proposal |
| M.2 No CU when unnecessary | — | meta, proposal |
| M.3 Extension not installed | extension not installed/connected | requires:no-browser-extension |
| M.4 Local sandbox vs cloud | — | filesystem-write |
Filtering by what you have available
--filter does a substring match against the scenario id or filename, so
you can selectively run subsets:
# Just the no-prerequisites scenarios (safe to run anywhere)
pnpm --filter @n8n/instance-ai eval:computer-use --filter "2.|3.|6.|M."
# Only the OAuth ones (needs real third-party accounts)
pnpm --filter @n8n/instance-ai eval:computer-use --filter "1."
Notes on adaptations
- 2.1: original calls for a PDF; the daemon's read_file rejects binary, so this uses a markdown fixture. Tests the same "agent reads a local file as context" signal.
- 4.1: the original prompt's URL was internal.example.com (fake). Swapped to Linear's API settings page (linear.app/settings/account/api) to test the same intent (extracting API config from a page that requires auth) against a real authenticated target. Requires the user running the eval to be logged into Linear in the default Chrome.
- M.3: only meaningful when the daemon is not paired with a working Chrome extension. Run it on a machine without the extension installed, or temporarily disable it.
For OAuth scenarios (1.x) and authenticated dashboards (4.2, 7.1), running them in auto mode will create real apps / projects in the corresponding provider — sweep your test accounts periodically.