Workflow evaluation framework

Tests whether workflows built by Instance AI actually work by executing them with LLM-generated mock HTTP responses. No real credentials or external services are involved.

Three harnesses live here:

  • eval:instance-ai — end-to-end build + mocked execution + LLM verification (drives a running n8n instance)
  • eval:subagent — builder sub-agent against live n8n, scored by binary checks (drives a running n8n instance)
  • eval:pairwise — builder sub-agent in-process, scored by an LLM judge panel against do/don't lists (no n8n server). Intended for head-to-head comparison with ai-workflow-builder.ee on the same dataset

Running evals

Each run:

  1. Build — the test case prompt goes to Instance AI, which builds a workflow.
  2. Phase 1 — the server analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario).
  3. Phase 2 — the workflow executes with every HTTP request intercepted and answered by an LLM using the node's configuration and API docs from Context7.
  4. Verify — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see Failure categories).

What gets mocked

  • Mocked nodes — anything that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion…). The request is intercepted before it leaves the process; an LLM generates the response.
  • Pinned nodes — nodes that don't go through the HTTP layer (triggers/webhooks, LangChain/AI nodes, database nodes). They receive LLM-generated pin data.
  • Real nodes — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute on the mocked data.

~95% of node types are covered. See Known limitations for the gaps.

Quick start

You need an n8n instance running with Instance AI enabled, a seeded owner account, and an Anthropic API key. Two paths:

Local (pnpm dev:ai)

pnpm dev:ai runs watch mode across n8n, n8n-core, and @n8n/n8n-nodes-langchain in parallel — the set you need when iterating on Instance AI or the eval framework.

  1. Create .env.local at the repo root with at minimum:

    N8N_INSTANCE_AI_MODEL_API_KEY=sk-ant-...
    N8N_EVAL_EMAIL=nathan@n8n.io
    N8N_EVAL_PASSWORD=PlaywrightTest123
    # Optional — see "Environment variables" for the full list
    LANGSMITH_API_KEY=...
    CONTEXT7_API_KEY=...
    
  2. Start the instance:

    dotenvx run -f .env.local -- pnpm dev:ai
    
  3. Create the owner account. First time only: open http://localhost:5678, sign up with the email/password from your .env.local. If you'd rather reset to a known state (useful after deleting the DB or between iterations), start the server with E2E_TESTS=true and curl the reset endpoint:

    E2E_TESTS=true dotenvx run -f .env.local -- pnpm dev:ai
    # then in another shell:
    curl -sf -X POST http://localhost:5678/rest/e2e/reset \
      -H "Content-Type: application/json" \
      -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}'
    
  4. Run evals from packages/@n8n/instance-ai/:

    dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose
    

Docker

Useful when you don't want a live watch process — e.g. reproducing a CI failure. The container starts empty, so owner creation is part of the flow:

# Build the image
INCLUDE_TEST_CONTROLLER=true pnpm build:docker

# Start a container (E2E_TESTS=true exposes /rest/e2e/reset)
docker run -d --name n8n-eval \
  -e E2E_TESTS=true \
  -e N8N_ENABLED_MODULES=instance-ai \
  -e N8N_AI_ENABLED=true \
  -e N8N_INSTANCE_AI_MODEL_API_KEY=your-key \
  -p 5678:5678 \
  n8nio/n8n:local

# Seed the owner
curl -sf -X POST http://localhost:5678/rest/e2e/reset -H "Content-Type: application/json" -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}'

# Run evals against it
pnpm eval:instance-ai --base-url http://localhost:5678 --verbose

CLI reference

Invoke from packages/@n8n/instance-ai/:

# All test cases
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --verbose

# Single test case (filename substring match)
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose

# Keep built workflows for inspection after the run
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --keep-workflows

# Multi-iteration for pass@k / pass^k metrics
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --iterations 3

Flags

  • --verbose (default: false): Log build/execute/verify timing and SSE events
  • --filter: Filter test cases by filename substring. Comma-separated values mean OR (e.g. contact-form,deduplication)
  • --exclude: Skip test cases whose filename matches any of the substrings. Same comma-separated shape as --filter; applied after --filter
  • --prebuilt-workflows: Path to a JSON manifest mapping test-case slugs to existing workflow IDs. Skips the orchestrator build for matched test cases — see Running evals against pre-built workflows
  • --keep-workflows (default: false): Don't delete built workflows after the run
  • --base-url (default: http://localhost:5678): n8n instance URL
  • --email (default: E2E test owner): Override login email (or N8N_EVAL_EMAIL)
  • --password (default: E2E test owner): Override login password (or N8N_EVAL_PASSWORD)
  • --timeout-ms (default: 900000): Per-test-case timeout
  • --output-dir (default: cwd): Where to write eval-results.json
  • --dataset (default: instance-ai-workflow-evals): LangSmith dataset name
  • --concurrency (default: 16): Max concurrent scenarios (builds are separately capped at 4)
  • --experiment-name (default: auto): LangSmith experiment prefix (defaults to {branch}-{sha} in CI or local-{branch}-{sha}-dirty? locally)
  • --iterations (default: 1): Run each test case N times with fresh builds

pass@k / pass^k: with --iterations N, each scenario runs N times. pass@k is the fraction of scenarios that passed at least once; pass^k is the fraction that passed every time. pass@k shows whether something is possible; pass^k shows whether it's reliable.
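
A minimal sketch of how the two metrics fall out of per-scenario trial results (illustrative only, not the harness's actual code):

// trials maps a scenario name to its per-iteration pass/fail outcomes.
function passRates(trials: Map<string, boolean[]>) {
  let atLeastOnce = 0;
  let everyTime = 0;
  for (const outcomes of trials.values()) {
    if (outcomes.some(Boolean)) atLeastOnce++;
    if (outcomes.length > 0 && outcomes.every(Boolean)) everyTime++;
  }
  const total = trials.size || 1;
  return {
    passAtK: atLeastOnce / total, // "is it possible?"
    passHatK: everyTime / total, // "is it reliable?"
  };
}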

Outputs

Every run produces:

  • Console — live progress, per-scenario pass/fail with [failure_category] tag, and a grouped summary.
  • eval-results.json — structured results in --output-dir (or cwd). Consumed by the CI PR comment.
  • .data/workflow-eval-report.html — self-contained debugging view with per-node execution traces, intercepted requests, mock responses, Phase 1 hints, and verifier reasoning.
  • LangSmith experiment — only when LANGSMITH_API_KEY is set. See the caveat in Environment variables.

Environment variables

  • N8N_INSTANCE_AI_MODEL_API_KEY (required): Anthropic API key for the agent, mock generation, and verification
  • N8N_EVAL_EMAIL (optional): n8n login email (defaults to E2E test owner)
  • N8N_EVAL_PASSWORD (optional): n8n login password (defaults to E2E test owner)
  • LANGSMITH_API_KEY (optional): Enables experiment tracking + tracing. See caveat below
  • LANGSMITH_ENDPOINT (optional): Region (https://api.smith.langchain.com US, https://eu.api.smith.langchain.com EU)
  • LANGSMITH_REVISION_ID (optional): Commit SHA to tag the experiment with (auto-set in CI)
  • LANGSMITH_BRANCH (optional): Branch name to tag the experiment with (auto-set in CI)
  • CONTEXT7_API_KEY (optional): Context7 key for API-doc lookups. Improves mock realism for less-common services; the LLM falls back to training data when unset
  • N8N_AI_ASSISTANT_BASE_URL (optional): Set to "" to bypass the hosted AI proxy and hit Anthropic directly — useful to avoid per-tenant quota during large batch runs

LangSmith caveat: if LANGSMITH_API_KEY is set in .env.local, local runs also land in the shared instance-ai-workflow-evals dataset. Unset it (or run without dotenvx) to keep exploratory runs out of team results.

Regression detection

When LANGSMITH_API_KEY is set, every eval run automatically compares its results against the most recent pinned baseline (any experiment whose name starts with instance-ai-baseline-). Two output files are written:

  • eval-results.json — structured data only, including comparison.result when a baseline was found.
  • eval-pr-comment.md — the full PR comment rendered as markdown, including the alert, aggregate, comparison sections, per-test-case results, and failure details. Always written; falls back to a no-baseline summary when no comparison ran.

The CI PR-comment step uses eval-pr-comment.md as the entire comment body (no jq assembly in the workflow). The console output uses a separate aligned-text formatter — same data, no markdown noise in the terminal.

Refreshing the baseline

There is no auto-refresh — refresh explicitly when you want a new reference point, ideally with high N for low noise:

# From packages/@n8n/instance-ai/, on master at the version you want to pin
LANGSMITH_API_KEY=... dotenvx run -f ../../../.env.local -- \
  pnpm eval:instance-ai --experiment-name instance-ai-baseline --iterations 10

LangSmith appends a random suffix (e.g. instance-ai-baseline-7abc1234); the most recently started one becomes the comparison target on the next eval run. The comparison is silently skipped on the baseline-creation run itself.

How scenarios are tiered

Each scenario lands in one of three regression tiers, evaluated in order of strictness:

  • Regression — high-confidence flag, gating-grade. The drop must be statistically significant (chance of seeing it by noise < 5%), at least 30 percentage points in size, and the baseline must have been reliable (≥ 70% pass rate).
  • Likely regression — looser bar for visibility on borderline cases. Looser confidence threshold (chance by noise < 20%), drop ≥ 15 percentage points, baseline ≥ 50%. Frequently natural variance — worth a glance only if your changes touch related code paths.
  • Worth watching — any scenario whose pass rate moved by ≥ 35 percentage points but wasn't flagged as a regression (hard or likely tier). Pure visibility, no implication of cause.

Other verdicts: improvement (PR significantly better, skips the reliability gate), unreliable_baseline (confident drop but baseline was too flaky to call a regression — surfaced but not flagged), stable, insufficient_data.

Why these tiers and not a flat percentage threshold? At the small N PR runs use (typically 3 iterations), a flat threshold can't tell a real regression from coin-flip noise. The confidence cutoff filters out gaps that could plausibly happen by chance, and the reliability gate avoids chasing noise on already-flaky scenarios. Implementation lives in comparison/statistics.ts (Fisher's exact test for the confidence check, Wilson interval for the headline aggregate band). Tune the likely-regression tier first if the false-positive rate looks off — keep the hard tier strict.
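
For intuition, here is a hedged TypeScript sketch of the thresholds described above — not the actual comparison/statistics.ts code. pNoise stands in for the Fisher's-exact p-value, and the improvement / unreliable_baseline / insufficient_data verdicts are omitted:

function wilsonInterval(passes: number, n: number, z = 1.96) {
  // Wilson score interval for a pass rate; used for the headline aggregate band.
  if (n === 0) return { low: 0, high: 1 };
  const p = passes / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

function tierFor(baselineRate: number, prRate: number, pNoise: number) {
  const drop = baselineRate - prRate;
  if (pNoise < 0.05 && drop >= 0.3 && baselineRate >= 0.7) return 'regression';
  if (pNoise < 0.2 && drop >= 0.15 && baselineRate >= 0.5) return 'likely_regression';
  if (Math.abs(drop) >= 0.35) return 'worth_watching';
  return 'stable';
}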

Failure-category drift

When both sides captured per-trial failureCategory values, the comparison also surfaces a run-level table of category rates (PR vs baseline). A category is marked notable when its absolute rate delta is ≥ 5 percentage points and the count change beyond what scenario-count scaling would predict is ≥ 3 trials. This catches cross-scenario shifts (e.g. mock-generation breaking, or a model getting weaker overall) that per-scenario flags can miss.
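
Read literally, the notable rule looks roughly like this (a sketch of the description above, not the implementation):

function isNotableCategoryShift(
  prCount: number, prTotal: number,
  baseCount: number, baseTotal: number,
) {
  const rateDelta = Math.abs(prCount / prTotal - baseCount / baseTotal);
  // How many trials the PR run would show for this category if nothing changed
  // except the number of trials run.
  const expectedFromScaling = baseCount * (prTotal / baseTotal);
  const excessTrials = Math.abs(prCount - expectedFromScaling);
  return rateDelta >= 0.05 && excessTrials >= 3;
}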

Best-effort

Comparison is logged and skipped on any LangSmith failure — it never fails the eval. It is also skipped when no baseline experiment exists yet.

Running evals against pre-built workflows

The eval framework normally builds each workflow with Instance AI and then verifies it. With --prebuilt-workflows <path>, the build step is skipped for matched test cases — the harness fetches the existing workflow from the n8n instance and runs verification against it instead. Use this to score workflows authored by other tools (an MCP-driven session, a hand-built reference, an older Instance AI snapshot) on the same dataset and the same verifier.

The manifest is a JSON file mapping test-case file slugs to workflow IDs:

{
  "contact-form-automation": ["W1abc", "W2def", "W3ghi"],
  "deduplication-trigger": ["W4jkl"]
}
  • Keys are test-case file slugs — the JSON filename without .json (e.g. contact-form-automation for evaluations/data/workflows/contact-form-automation.json). The --filter flag uses the same identifier.
  • Values are arrays of workflow IDs that already exist in the target n8n instance. Multiple iterations rotate through the list with iteration % ids.length, so an --iterations 5 run with 5 IDs gets 5 distinct builds.
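
The lookup and rotation, sketched in TypeScript (names are illustrative; the real loader also validates the manifest against a Zod schema):

type PrebuiltManifest = Record<string, string[]>;

function resolvePrebuiltId(manifest: PrebuiltManifest, slug: string, iteration: number) {
  const ids = manifest[slug];
  if (!ids?.length) return undefined; // no entry: the regular Instance AI build path runs
  return ids[iteration % ids.length]; // --iterations 5 with 5 IDs hits each build once
}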

Test cases not present in the manifest fall back to the regular Instance AI build path. To run only the prebuilt set, pair with --exclude to skip the rest, or --filter to narrow the run.

# Score the prebuilt cohort, skipping anything not in the manifest
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai \
  --prebuilt-workflows ./mcp-manifest.json \
  --filter contact-form-automation,deduplication-trigger \
  --iterations 5 \
  --experiment-name mcp-cohort

The harness leaves prebuilt workflows alone after the run (no auto-delete), so the manifest can be re-used across multiple eval runs.

Producing a manifest

pnpm eval:build-mcp-manifest (evaluations/cli/build-mcp-manifest.ts) drives claude -p against an MCP server — defaults to n8n's instance MCP — and writes a manifest in the schema this flag expects, plus a manifest-stats.json sidecar with per-cohort cost / turn / duration aggregates. The output is validated against the same Zod schema the loader uses, so shape regressions surface here rather than at eval time.

Prerequisites: claude CLI installed; ~/.claude.json has the MCP server block configured (project-scoped under .projects[<repo-root>].mcpServers[<name>] or globally under .mcpServers[<name>]); n8n instance reachable at the URL the MCP block points at. Default MCP server name is "n8n-mcp (instance)" — override with --mcp-server.

# Build N=5 per test case, 4 in parallel
pnpm eval:build-mcp-manifest -n 5 -j 4 --output-dir ./mcp-cohort

# Then score the cohort
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai \
  --prebuilt-workflows ./mcp-cohort/manifest.json \
  --iterations 5 \
  --experiment-name mcp-cohort

For runs that need to leave the n8n repo (for example, driving the build from a separate Claude project where you have skills configured), three flags decouple the script from its default assumptions:

  • --workflow-dir <path> — read test-case JSONs from a directory other than the n8n repo's evaluations/data/workflows/. When set, the script no longer needs git rev-parse to find the repo.
  • --build-cwd <path> — set the working directory the claude subprocess spawns from. Affects which ~/.claude.json projects entry (and which skills) Claude loads.
  • --project-id <id> — instructs the model to pass projectId to create_workflow_from_code so workflows land in a specific n8n project instead of the user's personal one.

Run pnpm eval:build-mcp-manifest --help for the full flag list.

Pairwise evals

Pairwise evals score a built workflow against the dataset's dos / donts criteria using an LLM judge panel (3 judges by default, majority vote on pairwise_primary, mean fraction of criteria satisfied on pairwise_diagnostic). The point is head-to-head comparison with ai-workflow-builder.ee on the same dataset (default notion-pairwise-workflows), so the judge panel, defaults, and metric keys are imported from that package directly.
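
In outline, the two aggregations work like this (an illustrative sketch; the actual panel logic is the one imported from ai-workflow-builder.ee):

interface JudgeVerdict {
  pass: boolean; // did this judge consider the workflow acceptable overall?
  criteriaSatisfiedFraction: number; // fraction of dos/donts the judge found satisfied (0..1)
}

function aggregateJudges(verdicts: JudgeVerdict[]) {
  const passes = verdicts.filter((v) => v.pass).length;
  return {
    pairwise_primary: passes > verdicts.length / 2 ? 1 : 0, // majority vote
    pairwise_diagnostic:
      verdicts.reduce((sum, v) => sum + v.criteriaSatisfiedFraction, 0) / verdicts.length,
  };
}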

Unlike the e2e and sub-agent harnesses, pairwise runs the builder sub-agent in-process — no n8n server, no Docker, no live workflow service. Stub services capture createFromWorkflowJSON calls; HITL suspensions are auto-approved.

Quick start

# From packages/@n8n/instance-ai/

# 1. Local fixture (small smoke set, no LangSmith required)
N8N_AI_ANTHROPIC_KEY="$ANTHROPIC_API_KEY" pnpm eval:pairwise --judges 1

# 2. Full LangSmith dataset
LANGSMITH_API_KEY=... N8N_AI_ANTHROPIC_KEY="$ANTHROPIC_API_KEY" \
  pnpm eval:pairwise:langsmith --judges 3

# 3. Rerun a specific subset (one example ID per line; #-prefixed lines ignored)
pnpm eval:pairwise:langsmith \
  --example-ids-file .output/pairwise/failed-ids.txt \
  --output-dir .output/pairwise/rerun

Sandbox

Pairwise evals always run inside a sandbox — the same path production uses. The agent writes TypeScript to ~/workspace/src/workflow.ts inside the sandbox, runs tsc to validate, and calls submit-workflow to save the parsed WorkflowJSON. This exercises the production builder agent end-to-end (sandbox prompt, file I/O, real type checking).

Required env vars (Daytona provider — the default):

ANTHROPIC_API_KEY=sk-ant-...           # builder + judge LLM
LANGSMITH_API_KEY=ls__...              # only for --backend langsmith
DAYTONA_API_URL=https://app.daytona.io/api
DAYTONA_API_KEY=dtn_...

# Optional
N8N_INSTANCE_AI_SANDBOX_PROVIDER=daytona      # default; set 'local' or 'n8n-sandbox' to switch
N8N_INSTANCE_AI_SANDBOX_IMAGE=daytonaio/sandbox:0.5.0   # default
N8N_INSTANCE_AI_SANDBOX_TIMEOUT=300000        # per-command timeout (ms)

The CLI fails fast at startup if the chosen provider is misconfigured (e.g., Daytona selected without API URL/key). The provider actually used is recorded under summary.json → sandbox.provider.

Daytona cold-start. The very first sandbox creation triggers an image build on Daytona's side (npm install for @n8n/workflow-sdk). That can exceed the SDK's 5-minute create timeout and fail with Sandbox failed to become ready within the timeout period. Once the image is cached, later runs are fast. Workaround: pre-build the image via the Daytona dashboard before kicking off a full eval run.

Flags

  • --backend (default: local): local reads evaluations/data/pairwise/local.json; langsmith pulls from the LangSmith dataset
  • --dataset (default: notion-pairwise-workflows): LangSmith dataset name (langsmith backend only)
  • --judges (default: 3): Number of judges in the LLM panel
  • --judge-model (default: claude-sonnet-4-5-20250929): LangChain model id for the judge LLM
  • --iterations (default: 1): Run each example N times — for measuring judge / build variance
  • --concurrency (default: 5): Parallel example workers (p-limit)
  • --max-examples: Cap dataset to first N examples
  • --example-ids-file: Path to a text file of LangSmith example IDs (one per line). Used for rerunning a subset
  • --timeout-ms (default: 1200000): Per-example build timeout
  • --output-dir (default: .output/pairwise/<iso>): Where to write artifacts
  • --experiment-name (default: pairwise-evals-instance-ai): LangSmith experiment label
  • --verbose (default: false): Per-example log lines

Outputs

Each run writes a self-contained directory:

.output/pairwise/<run>/
├── summary.json           # totals: pass rate, avg diagnostic, build failures by class, interactivity counters
├── results.jsonl          # one line per example: prompt, dos/donts, captured workflow, build metadata, feedback rows
├── workflows/<id>.json    # normalized workflow JSON (matches SimpleWorkflow shape from ai-workflow-builder.ee)
└── chunks/<id>_<iter>.jsonl  # per-example agent trace: tool-calls, tool-results, suspensions, final text

The chunks/*.jsonl traces are the primary tool for root-causing build failures. Each line is one event: tool-call, tool-result, suspension, auto-approve, text, stream-finish, captured-workflows, error.
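
A common loop is to mine results.jsonl for failures and feed them back through --example-ids-file. A hedged sketch — the exampleId / primaryPass field names are assumptions, so check a real results.jsonl line first:

import { readFileSync, writeFileSync } from 'node:fs';

const runDir = '.output/pairwise/2026-05-12T09-00-00'; // placeholder run directory
const rows = readFileSync(`${runDir}/results.jsonl`, 'utf8')
  .split('\n')
  .filter(Boolean)
  .map((line) => JSON.parse(line) as { exampleId?: string; primaryPass?: boolean });

const failedIds = rows
  .filter((row) => row.exampleId && row.primaryPass === false)
  .map((row) => row.exampleId);

writeFileSync('.output/pairwise/failed-ids.txt', failedIds.join('\n') + '\n');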

When LANGSMITH_API_KEY is set, feedback is also posted to LangSmith with metric keys pairwise_primary, pairwise_diagnostic, pairwise_judges_passed, pairwise_total_passes, pairwise_total_violations, and per-judge judge1..N. Experiment metadata includes builder: 'instance-ai' so it can be queried alongside the ai-workflow-builder.ee baseline.

Build failure classes

Build failures are tracked separately from judge scores:

  • build_timeout — exceeded --timeout-ms
  • no_workflow_built — agent finished without invoking build-workflow (no captured workflow)
  • agent_error — stream errored or the agent threw

A failure produces a row with workflow: null, empty feedback, and the error class — it counts as a primary fail in the comparison report.

Interactivity gates

The agent is stubbed for non-interactive use. The summary tracks divergence from this assumption — investigate any non-zero count:

  • askUserCount — ask-user tool was invoked (eval responds with { approved: false })
  • planToolCount — plan tool was invoked (single-prompt dataset shouldn't trigger planning)
  • autoApprovedSuspensions — HITL-gated tool fired (e.g., data-tables create); auto-approved
  • mockedCredentialTypes — credential types the agent referenced (auto-mocked since credentialService.list() returns [])
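
A quick post-run check over those counters might look like this (the counter names come from the list above; their exact nesting inside summary.json is an assumption, so inspect a real file before relying on it):

import { readFileSync } from 'node:fs';

const summary = JSON.parse(readFileSync('.output/pairwise/latest/summary.json', 'utf8'));

for (const gate of ['askUserCount', 'planToolCount', 'autoApprovedSuspensions']) {
  const count = summary[gate] ?? 0;
  if (count > 0) console.warn(`Interactivity gate tripped: ${gate} = ${count}`);
}
if ((summary.mockedCredentialTypes ?? []).length > 0) {
  console.warn('Auto-mocked credential types:', summary.mockedCredentialTypes);
}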

Comparison report

After running both ai-workflow-builder.ee/evaluations/cli (the baseline) and eval:pairwise against the same dataset, generate an HTML side-by-side report:

pnpm eval:pairwise:compare \
  --ee-dir   ../ai-workflow-builder.ee/evaluations/.output/pairwise/<ts> \
  --ia-dir   .output/pairwise/<ts> \
  --out      .output/pairwise/comparison.html

The report shows headline metrics, per-prompt verdicts (TIE / IA-only / Code-only / both-pass / both-fail), and lazy-loaded workflow previews — rows collapse by default and only render the heavy <n8n-demo> preview when expanded.

When pairwise scores wobble

Judge non-determinism + agent retry behavior mean a single run is not a reliable signal. Two specific things to know:

  • The agent will sometimes retry build-workflow after a parser rejection (e.g., security violation) and sometimes give up. Whether a prompt "fails to build" is non-deterministic across runs.
  • If you're comparing two builders to claim a regression or improvement, bump --iterations to ≥3 for both sides.

How the e2e harness works

  1. Build — sends the test case prompt to Instance AI, which builds a workflow
  2. Phase 1 — analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario)
  3. Phase 2 — executes the workflow with all HTTP requests intercepted. Each request goes to an LLM that generates a realistic API response using the node's configuration and API documentation from Context7
  4. Verify — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see Failure categories below)

What gets mocked

  • Mocked nodes — any node that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion, etc.). The request is intercepted before it leaves the process. An LLM generates the response.
  • Pinned nodes — nodes that don't go through the HTTP layer: trigger/webhook nodes, LangChain/AI nodes (they use SDKs directly), database nodes. These receive LLM-generated data as pin data.
  • Real nodes — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute their actual code on the mocked/pinned data.

No real credentials or API connections are needed. ~95% of node types are covered; the main gaps are binary-data nodes (file attachments, image generation) and streaming nodes.
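
If it helps to picture the routing, here is a deliberately naive sketch that pattern-matches on node type names — the real classifier in packages/cli/src/modules/instance-ai/eval/ works off node metadata, not a name list:

type MockStrategy = 'mocked-http' | 'pinned' | 'real';

function strategyFor(nodeType: string): MockStrategy {
  if (/trigger|webhook|langchain|postgres|mysql|mongodb/i.test(nodeType)) return 'pinned'; // never reaches the HTTP layer
  if (/\.(code|set|merge|filter|if|switch)$/i.test(nodeType)) return 'real'; // pure logic, executes for real
  return 'mocked-http'; // outgoing HTTP gets intercepted and answered by the LLM
}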

How the sub-agent harness works

  1. The CLI logs in to n8n with N8N_EVAL_EMAIL / N8N_EVAL_PASSWORD.
  2. For each test case it POSTs /rest/instance-ai/eval/run-sub-agent.
  3. The server builds a real InstanceAiContext via InstanceAiAdapterService.createContext, wraps the workflow service to record created IDs, resolves the builder (or other) role's system prompt, instantiates the sub-agent with the full createAllTools(context) tool surface, and runs it to completion.
  4. The server returns { text, toolCalls, toolResults, capturedWorkflowIds, ... }.
  5. The CLI fetches each captured workflow via GET /rest/workflows/:id (this doubles as a round-trip check through the real importer), scores it with the binary-check suite, and archives and deletes it (unless --keep-workflows).

No tools, services, or workflow imports are mocked. The server path exercised here is the same one the orchestrator takes when it spawns a builder sub-agent.

LangSmith integration

When LANGSMITH_API_KEY is set, each run is recorded as a LangSmith experiment against the instance-ai-workflow-evals dataset (synced from the JSON files before each run). Experiments against the same dataset can be compared side-by-side to spot regressions.

Adding test cases

Test cases live in evaluations/data/workflows/*.json. Drop a file in and the CLI and LangSmith sync pick it up — no registration step.

{
  "prompt": "Create a workflow that...",
  "complexity": "medium",
  "tags": ["build", "webhook", "gmail"],
  "triggerType": "webhook",
  "scenarios": [
    {
      "name": "happy-path",
      "description": "Normal operation",
      "dataSetup": "The webhook receives a submission from Jane (jane@example.com)...",
      "successCriteria": "The workflow executes without errors. An email is sent to jane@example.com..."
    }
  ]
}

One JSON file = one LangSmith split. Scenarios in the same file share a split; split names derive from the filename slug. Pick a slug you're happy to also use as a --filter target.

Prompt tips

  • Be specific about node configuration — document IDs, sheet names, channel names, chat IDs. The agent won't ask for these in eval mode (no multi-turn yet).
  • Add "Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later."
  • If a built-in node doesn't expose a field you need (e.g. the Linear node doesn't query creator.email), tell the agent to use HTTP Request instead.

Scenario tips

  • Don't specify exact counts that depend on mock data ("exactly 7 posts remain"). The LLM is non-deterministic. Say "fewer than the original 10" instead.
  • dataSetup steers the mock — describe what each service should return, not the exact JSON.
  • For error scenarios, describe the condition: "The Telegram node returns an error indicating the chat was not found."
  • successCriteria is what the verifier reads. Be specific: "None of the titles in the Slack message should contain the word 'qui'."

Which scenarios to include

  • happy-path — everything works as expected
  • Edge cases — empty data, missing fields, single vs multiple items
  • Error scenarios only if the workflow is expected to handle them gracefully. Most agent-built workflows don't include error handling, so "the workflow crashes on invalid input" is a legitimate finding, not a test-case failure.

Adding a new credential type

credentials/seeder.ts seeds generic creds (HTTP Header, HTTP Basic) on every run, plus env-gated creds (GitHub, Gmail, Teams, Linear…) when the matching env var is set. If your scenario needs a credential type that isn't there, add it to the appropriate list in seeder.ts — env-gated if it requires a real token, generic if a placeholder is fine.

Failure categories

When a scenario fails, the verifier categorizes the root cause:

  • builder_issue — the agent misconfigured a node, chose the wrong node type, or built the wrong structure.
  • mock_issue — the LLM mock returned incorrect data (_evalMockError, wrong response shape).
  • framework_issue — Phase 1 failed (empty trigger content) or the eval framework itself cascaded an error.
  • verification_failure — the verifier couldn't produce a valid result.
  • build_failure — Instance AI failed to build the workflow or a scenario timed out.

Suite pass rates typically sit between 40% and 65%; most failures are builder_issue on scenarios that require error handling the agent doesn't produce by default.

Troubleshooting

Wrong username or password on login. Your instance has no owner. Run the rest/e2e/reset curl from Quick start (needs E2E_TESTS=true on the server).

Have reached end of quota mid-run. You're hitting the hosted AI proxy's per-tenant quota. Set N8N_AI_ASSISTANT_BASE_URL="" to hit Anthropic directly with your N8N_INSTANCE_AI_MODEL_API_KEY. Also consider lowering --concurrency.

All scenarios timing out. Check that the server is up (curl localhost:5678/healthz) and that N8N_INSTANCE_AI_MODEL_API_KEY is set. A full build is ~60–180s; timeouts past --timeout-ms usually mean the agent is looping.

Flood of framework_issue failures. Phase 1 is returning empty hints — some nodes will degrade to _evalMockError. Check the server logs for Anthropic errors; occasionally it's just a rate-limit spike.

Port 5678 already in use. A previous pnpm dev is still running. lsof -iTCP:5678 -sTCP:LISTEN to find the PID, kill it.

Debugging a specific failure. Open the .data/workflow-eval-report.html artifact — it has per-node execution traces, the exact intercepted request, the mock response the LLM generated, and the verifier's reasoning. Start from the failing node.

CI

Evals run automatically on PRs that change Instance AI code (path-filtered). The workflow starts a single Docker container and runs the CLI against it. See .github/workflows/test-evals-instance-ai.yml.

The job is non-blocking. Results are posted as a PR comment and uploaded as artifacts. When LANGSMITH_API_KEY is set via the EVALS_LANGSMITH_API_KEY secret, runs also land as LangSmith experiments tagged with commit SHA + branch, so you can compare against master side-by-side.

Architecture

evaluations/
├── index.ts              # Public API
├── cli/                  # CLI entries: instance-ai, subagent, pairwise, compare-pairwise, report
├── clients/              # n8n REST + SSE clients
├── checklist/            # LLM verification with retry
├── credentials/          # Test credential seeding
├── data/workflows/       # e2e/sub-agent test case JSON files
├── data/pairwise/        # Local pairwise fixture (small smoke set)
├── harness/              # Runners: buildWorkflow + executeScenario (e2e), in-process-builder (pairwise)
├── langsmith/            # Dataset sync + experiment setup
├── outcome/              # SSE event parsing, workflow discovery
├── report/               # HTML report generator
└── system-prompts/       # LLM prompts for verification

packages/cli/src/modules/instance-ai/eval/
├── execution.service.ts  # Phase 1 + Phase 2 orchestration
├── workflow-analysis.ts  # Hint generation (Phase 1)
├── mock-handler.ts       # Per-request mock generation (Phase 2)
├── api-docs.ts           # Context7 API doc fetcher
├── node-config.ts        # Node config serializer
└── pin-data-generator.ts # LLM pin data for bypass nodes (Phase 1.5)

Known limitations

  • LangChain/AI nodes — use their own SDKs, not the HTTP mock layer. They fail with credential errors; use pin data instead.
  • Binary / file nodes — media attachments, image generation, file downloads. Mock metadata works; realistic binary content is out of scope.
  • Streaming nodes — mocks return complete responses, not streams.
  • GraphQL APIs — response shape depends on the query, not just the endpoint. Quality depends on the LLM knowing the API schema.
  • Non-determinism — the agent builds different workflows each run. Pass rates vary between 40% and 65%.
  • Large workflows — verification artifacts include full execution traces. For 12+ node workflows this can hit token limits. See TRUST-43 for the tool-based verifier approach.