mirror of
https://github.com/n8n-io/n8n.git
synced 2026-05-28 07:17:04 +02:00
554 lines
32 KiB
Markdown
# Workflow evaluation framework

Tests whether workflows built by Instance AI actually work by executing them with LLM-generated mock HTTP responses. No real credentials or external services are involved.

Three harnesses live here:

- **`eval:instance-ai`** — end-to-end build + mocked execution + LLM verification (drives a running n8n instance)
- **`eval:subagent`** — builder sub-agent against live n8n, scored by binary checks (drives a running n8n instance)
- **`eval:pairwise`** — builder sub-agent in-process, scored by an LLM judge panel against do/don't lists (no n8n server). Intended for head-to-head comparison with `ai-workflow-builder.ee` on the same dataset

Sections:

- [Running e2e + sub-agent evals](#running-evals)
- [Regression detection](#regression-detection)
- [Running evals against pre-built workflows](#running-evals-against-pre-built-workflows)
- [Running pairwise evals](#pairwise-evals)
- [How the e2e harness works](#how-the-e2e-harness-works)
- [How the sub-agent harness works](#how-the-sub-agent-harness-works)

## Running evals

Each run:

1. **Build** — the test case prompt goes to Instance AI, which builds a workflow.
2. **Phase 1** — the server analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario).
3. **Phase 2** — the workflow executes with every HTTP request intercepted and answered by an LLM using the node's configuration and API docs from Context7.
4. **Verify** — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see [Failure categories](#failure-categories)).

### What gets mocked

- **Mocked nodes** — anything that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion…). The request is intercepted before it leaves the process; an LLM generates the response.
- **Pinned nodes** — nodes that don't go through the HTTP layer (triggers/webhooks, LangChain/AI nodes, database nodes). They receive LLM-generated pin data.
- **Real nodes** — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute on the mocked data.

~95% of node types are covered. See [Known limitations](#known-limitations) for the gaps.

## Quick start

You need an n8n instance running with Instance AI enabled, a seeded owner account, and an Anthropic API key. Two paths:

### Local (pnpm dev:ai)

`pnpm dev:ai` runs watch mode across `n8n`, `n8n-core`, and `@n8n/n8n-nodes-langchain` in parallel — the set you need when iterating on Instance AI or the eval framework.

1. **Create `.env.local`** at the repo root with at minimum:

   ```env
   N8N_INSTANCE_AI_MODEL_API_KEY=sk-ant-...
   N8N_EVAL_EMAIL=nathan@n8n.io
   N8N_EVAL_PASSWORD=PlaywrightTest123
   # Optional — see "Environment variables" for the full list
   LANGSMITH_API_KEY=...
   CONTEXT7_API_KEY=...
   ```

2. **Start the instance**:

   ```bash
   dotenvx run -f .env.local -- pnpm dev:ai
   ```

3. **Create the owner account**. First time only: open `http://localhost:5678`, sign up with the email/password from your `.env.local`. If you'd rather reset to a known state (useful after deleting the DB or between iterations), start the server with `E2E_TESTS=true` and `curl` the reset endpoint:

   ```bash
   E2E_TESTS=true dotenvx run -f .env.local -- pnpm dev:ai
   # then in another shell:
   curl -sf -X POST http://localhost:5678/rest/e2e/reset \
     -H "Content-Type: application/json" \
     -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}'
   ```

4. **Run evals** from `packages/@n8n/instance-ai/`:

   ```bash
   dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose
   ```

### Docker

Useful when you don't want a live watch process — e.g. reproducing a CI failure. The container starts empty, so owner creation is part of the flow:

```bash
# Build the image
INCLUDE_TEST_CONTROLLER=true pnpm build:docker

# Start a container (E2E_TESTS=true exposes /rest/e2e/reset)
docker run -d --name n8n-eval \
  -e E2E_TESTS=true \
  -e N8N_ENABLED_MODULES=instance-ai \
  -e N8N_AI_ENABLED=true \
  -e N8N_INSTANCE_AI_MODEL_API_KEY=your-key \
  -p 5678:5678 \
  n8nio/n8n:local

# Seed the owner
curl -sf -X POST http://localhost:5678/rest/e2e/reset -H "Content-Type: application/json" -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}'

# Run evals against it
pnpm eval:instance-ai --base-url http://localhost:5678 --verbose
```

## CLI reference

Invoke from `packages/@n8n/instance-ai/`:

```bash
# All test cases
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --verbose

# Single test case (filename substring match)
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose

# Keep built workflows for inspection after the run
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --keep-workflows

# Multi-iteration for pass@k / pass^k metrics
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --iterations 3
```

### Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--verbose` | `false` | Log build/execute/verify timing and SSE events |
| `--filter` | — | Filter test cases by filename substring. Comma-separated values mean OR (e.g. `contact-form,deduplication`) |
| `--exclude` | — | Skip test cases whose filename matches any of the substrings. Same comma-separated shape as `--filter`; applied after `--filter` |
| `--prebuilt-workflows` | — | Path to a JSON manifest mapping test-case slugs to existing workflow IDs. Skips the orchestrator build for matched test cases — see [Running evals against pre-built workflows](#running-evals-against-pre-built-workflows) |
| `--keep-workflows` | `false` | Don't delete built workflows after the run |
| `--base-url` | `http://localhost:5678` | n8n instance URL |
| `--email` | E2E test owner | Override login email (or `N8N_EVAL_EMAIL`) |
| `--password` | E2E test owner | Override login password (or `N8N_EVAL_PASSWORD`) |
| `--timeout-ms` | `900000` | Per-test-case timeout |
| `--output-dir` | cwd | Where to write `eval-results.json` |
| `--dataset` | `instance-ai-workflow-evals` | LangSmith dataset name |
| `--concurrency` | `16` | Max concurrent scenarios (builds are separately capped at 4) |
| `--experiment-name` | auto | LangSmith experiment prefix (defaults to `{branch}-{sha}` in CI or `local-{branch}-{sha}-dirty?` locally) |
| `--iterations` | `1` | Run each test case N times with fresh builds |

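The `--filter` / `--exclude` semantics from the table can be sketched as below. This is an illustration rather than the harness code: `parseList` and `selectTestCases` are hypothetical names, and the sketch assumes only what the table states (filename substring matching, comma-separated values meaning OR, `--exclude` applied after `--filter`).

```typescript
// Hypothetical helpers illustrating --filter / --exclude; not taken from the eval CLI.
function parseList(value: string | undefined): string[] {
  if (!value) return [];
  return value
    .split(',')
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

function selectTestCases(slugs: string[], filter?: string, exclude?: string): string[] {
  const includes = parseList(filter);
  const excludes = parseList(exclude);
  return slugs
    // No --filter keeps everything; otherwise any substring match is enough (OR).
    .filter((slug) => includes.length === 0 || includes.some((s) => slug.includes(s)))
    // --exclude is applied to whatever survived --filter.
    .filter((slug) => !excludes.some((s) => slug.includes(s)));
}
```

For example, filtering `contact-form,dedup` and excluding `trigger` keeps `contact-form-automation` but drops `deduplication-trigger`.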
**pass@k / pass^k**: with `--iterations N`, each scenario runs N times. `pass@k` is the fraction of scenarios that passed *at least once*; `pass^k` is the fraction that passed *every* time. `pass@k` shows whether something is *possible*; `pass^k` shows whether it's *reliable*.

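As a minimal sketch of the two definitions above (the function and type names are hypothetical, and the real aggregation may differ in detail), both metrics can be derived from per-scenario pass/fail arrays:

```typescript
// Each inner array holds one scenario's pass/fail outcome across N iterations.
type ScenarioRuns = boolean[];

interface PassMetrics {
  passAtK: number; // fraction of scenarios that passed at least once
  passHatK: number; // fraction of scenarios that passed every time
}

function computePassMetrics(scenarios: ScenarioRuns[]): PassMetrics {
  if (scenarios.length === 0) return { passAtK: 0, passHatK: 0 };
  const atLeastOnce = scenarios.filter((runs) => runs.some((ok) => ok)).length;
  const everyTime = scenarios.filter((runs) => runs.length > 0 && runs.every((ok) => ok)).length;
  return {
    passAtK: atLeastOnce / scenarios.length,
    passHatK: everyTime / scenarios.length,
  };
}
```

With four scenarios over three iterations where one always passes, two pass sometimes, and one never passes, this gives `pass@k` = 0.75 and `pass^k` = 0.25.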
### Outputs

Every run produces:

- **Console** — live progress, per-scenario pass/fail with `[failure_category]` tag, and a grouped summary.
- **`eval-results.json`** — structured results in `--output-dir` (or cwd). Consumed by the CI PR comment.
- **`.data/workflow-eval-report.html`** — self-contained debugging view with per-node execution traces, intercepted requests, mock responses, Phase 1 hints, and verifier reasoning.
- **LangSmith experiment** — only when `LANGSMITH_API_KEY` is set. See the caveat in [Environment variables](#environment-variables).

## Environment variables

| Variable | Required | Description |
|----------|----------|-------------|
| `N8N_INSTANCE_AI_MODEL_API_KEY` | Yes | Anthropic API key for the agent, mock generation, and verification |
| `N8N_EVAL_EMAIL` | No | n8n login email (defaults to E2E test owner) |
| `N8N_EVAL_PASSWORD` | No | n8n login password (defaults to E2E test owner) |
| `LANGSMITH_API_KEY` | No | Enables experiment tracking + tracing. **See caveat below.** |
| `LANGSMITH_ENDPOINT` | No | Region (`https://api.smith.langchain.com` US, `https://eu.api.smith.langchain.com` EU) |
| `LANGSMITH_REVISION_ID` | No | Commit SHA to tag the experiment with (auto-set in CI) |
| `LANGSMITH_BRANCH` | No | Branch name to tag the experiment with (auto-set in CI) |
| `CONTEXT7_API_KEY` | No | Context7 key for API-doc lookups. Improves mock realism for less-common services; the LLM falls back to training data when unset |
| `N8N_AI_ASSISTANT_BASE_URL` | No | Set to `""` to bypass the hosted AI proxy and hit Anthropic directly — useful to avoid per-tenant quota during large batch runs |

**LangSmith caveat:** if `LANGSMITH_API_KEY` is set in `.env.local`, local runs also land in the shared `instance-ai-workflow-evals` dataset. Unset it (or run without `dotenvx`) to keep exploratory runs out of team results.

## Regression detection

When `LANGSMITH_API_KEY` is set, every eval run automatically compares its results against the most recent pinned baseline (any experiment whose name starts with `instance-ai-baseline-`). Two output files are written:

- `eval-results.json` — structured data only, including `comparison.result` when a baseline was found.
- `eval-pr-comment.md` — the full PR comment rendered as markdown, including the alert, aggregate, comparison sections, per-test-case results, and failure details. Always written; falls back to a no-baseline summary when no comparison ran.

The CI PR-comment step uses `eval-pr-comment.md` as the entire comment body (no jq assembly in the workflow). The console output uses a separate aligned-text formatter — same data, no markdown noise in the terminal.

### Refreshing the baseline

There is no auto-refresh — refresh explicitly when you want a new reference point, ideally with high N for low noise:

```bash
# From packages/@n8n/instance-ai/, on master at the version you want to pin
LANGSMITH_API_KEY=... dotenvx run -f ../../../.env.local -- \
  pnpm eval:instance-ai --experiment-name instance-ai-baseline --iterations 10
```

LangSmith appends a random suffix (e.g. `instance-ai-baseline-7abc1234`); the most recently started one becomes the comparison target on the next eval run. The comparison is silently skipped on the baseline-creation run itself.

### How scenarios are tiered

Each scenario lands in one of three regression tiers, evaluated in order of strictness:

- **Regression** — high-confidence flag, gating-grade. The drop must be statistically significant (chance of seeing it by noise < 5%), at least 30 percentage points in size, and the baseline must have been reliable (≥ 70% pass rate).
- **Likely regression** — looser bar for visibility on borderline cases. Looser confidence threshold (chance by noise < 20%), drop ≥ 15 percentage points, baseline ≥ 50%. Frequently natural variance — worth a glance only if your changes touch related code paths.
- **Worth watching** — any scenario whose pass rate moved by ≥ 35 percentage points but wasn't flagged as a regression (hard or likely tier). Pure visibility, no implication of cause.

Other verdicts: `improvement` (PR significantly better, skips the reliability gate), `unreliable_baseline` (confident drop but baseline was too flaky to call a regression — surfaced but not flagged), `stable`, `insufficient_data`.

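The three tiers can be read as an ordered chain of threshold checks. The sketch below is an illustration under stated assumptions, not the code in `comparison/statistics.ts`: the `Tier` labels and `classifyTier` are hypothetical names, the p-value is taken as an input, and the other verdicts (`improvement`, `unreliable_baseline`, `stable`, `insufficient_data`) are deliberately left out.

```typescript
// Hypothetical tier labels; the real verdict identifiers may differ.
type Tier = 'regression' | 'likely_regression' | 'worth_watching' | 'none';

interface ScenarioComparison {
  baselineRate: number; // baseline pass rate, 0..1
  prRate: number; // PR pass rate, 0..1
  pValue: number; // chance of seeing the drop by noise (computed elsewhere)
}

function classifyTier({ baselineRate, prRate, pValue }: ScenarioComparison): Tier {
  const drop = baselineRate - prRate; // positive when the PR is worse
  const EPS = 1e-9; // tolerate floating-point noise at the thresholds

  // Strictest tier first: significant, large drop from a reliable baseline.
  if (pValue < 0.05 && drop >= 0.3 - EPS && baselineRate >= 0.7 - EPS) return 'regression';
  // Looser bar for borderline cases.
  if (pValue < 0.2 && drop >= 0.15 - EPS && baselineRate >= 0.5 - EPS) return 'likely_regression';
  // Large movement in either direction that neither tier above flagged.
  if (Math.abs(drop) >= 0.35 - EPS) return 'worth_watching';
  return 'none';
}
```

Checking the strict tier first matters: a scenario that satisfies the hard thresholds also satisfies the looser ones, so the order decides which label it gets.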
Why these tiers and not a flat percentage threshold? At the small N PR runs use (typically 3 iterations), a flat threshold can't tell a real regression from coin-flip noise. The confidence cutoff filters out gaps that could plausibly happen by chance, and the reliability gate avoids chasing noise on already-flaky scenarios. Implementation lives in `comparison/statistics.ts` (Fisher's exact test for the confidence check, Wilson interval for the headline aggregate band). Tune the likely-regression tier first if the false-positive rate looks off — keep the hard tier strict.

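For readers unfamiliar with the two statistics named above, here is a self-contained sketch of each. It is not a copy of `comparison/statistics.ts`: the function names are hypothetical, the Fisher test is written as a one-sided test for a drop, and the Wilson interval uses z = 1.96 (a 95% band); whether the real code makes those same choices is an assumption.

```typescript
// Binomial coefficient C(n, k) as a float; adequate for the small trial counts evals use.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  const m = Math.min(k, n - k);
  let result = 1;
  for (let i = 1; i <= m; i++) {
    result = (result * (n - m + i)) / i;
  }
  return result;
}

// One-sided Fisher's exact test: probability that the PR passes this few times
// (or fewer) if PR and baseline shared the same underlying pass rate.
function fisherDropPValue(
  baselinePasses: number,
  baselineTrials: number,
  prPasses: number,
  prTrials: number,
): number {
  const totalTrials = baselineTrials + prTrials;
  const totalPasses = baselinePasses + prPasses;
  const denominator = choose(totalTrials, prTrials);
  // With the margins fixed, the PR cannot have fewer passes than this.
  const minPrPasses = Math.max(0, totalPasses - baselineTrials);
  let p = 0;
  for (let k = minPrPasses; k <= prPasses; k++) {
    p += (choose(totalPasses, k) * choose(totalTrials - totalPasses, prTrials - k)) / denominator;
  }
  return Math.min(1, p);
}

// Wilson score interval for a pass rate; returns [lower, upper].
function wilsonInterval(passes: number, trials: number, z = 1.96): [number, number] {
  if (trials === 0) return [0, 1];
  const p = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}
```

Under these assumptions, a scenario that goes from 3/3 on the baseline to 0/3 on the PR yields p = 1/20 = 0.05, which does not clear a strict `< 5%` cutoff, while 10/10 to 0/3 yields p = 1/286. That is one way to see why a high-N baseline helps.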
### Failure-category drift

When both sides captured per-trial `failureCategory` values, the comparison also surfaces a run-level table of category rates (PR vs baseline). A category is marked **notable** when its absolute rate delta is ≥ 5 percentage points _and_ the count change beyond what scenario-count scaling would predict is ≥ 3 trials. This catches cross-scenario shifts (e.g. mock-generation breaking, or a model getting weaker overall) that per-scenario flags can miss.

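One possible reading of the "notable" rule, as a sketch: scale the baseline count to the PR run's size, then require both the rate delta and the excess count to clear their thresholds. The names below are hypothetical, and scaling by total trial count is an assumption about what "scenario-count scaling" means in the implementation.

```typescript
interface CategoryCounts {
  count: number; // trials tagged with this failure category
  trials: number; // total trials in the run
}

function isNotableDrift(pr: CategoryCounts, baseline: CategoryCounts): boolean {
  if (pr.trials === 0 || baseline.trials === 0) return false;
  const rateDelta = Math.abs(pr.count / pr.trials - baseline.count / baseline.trials);
  // Count the baseline would have produced at the PR run's size (assumed scaling).
  const expectedCount = baseline.count * (pr.trials / baseline.trials);
  const excessCount = Math.abs(pr.count - expectedCount);
  return rateDelta >= 0.05 && excessCount >= 3;
}
```

Requiring both conditions keeps a small run from flagging a category on one or two extra failures, even when the percentage swing looks large.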
### Best-effort

Comparison is logged and skipped on any LangSmith failure — it never fails the eval. It is also skipped when no baseline experiment exists yet.

## Running evals against pre-built workflows

The eval framework normally builds each workflow with Instance AI and then verifies it. With `--prebuilt-workflows <path>`, the build step is skipped for matched test cases — the harness fetches the existing workflow from the n8n instance and runs verification against it instead. Use this to score workflows authored by other tools (an MCP-driven session, a hand-built reference, an older Instance AI snapshot) on the same dataset and the same verifier.

The manifest is a JSON file mapping test-case file slugs to workflow IDs:

```json
{
  "contact-form-automation": ["W1abc", "W2def", "W3ghi"],
  "deduplication-trigger": ["W4jkl"]
}
```

- **Keys** are test-case file slugs — the JSON filename without `.json` (e.g. `contact-form-automation` for `evaluations/data/workflows/contact-form-automation.json`). The `--filter` flag uses the same identifier.
- **Values** are arrays of workflow IDs that already exist in the target n8n instance. Multiple iterations rotate through the list with `iteration % ids.length`, so an `--iterations 5` run with 5 IDs gets 5 distinct builds.

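The rotation rule can be sketched as follows. `PrebuiltManifest` and `pickPrebuiltWorkflowId` are hypothetical names, and a zero-based iteration index is an assumption; only the `iteration % ids.length` expression comes from the description above.

```typescript
// Manifest shape as shown above: test-case slug -> workflow IDs.
type PrebuiltManifest = Record<string, string[]>;

function pickPrebuiltWorkflowId(
  manifest: PrebuiltManifest,
  slug: string,
  iteration: number, // assumed zero-based
): string | undefined {
  const ids = manifest[slug];
  // Returning undefined stands in for "not in the manifest, use the normal build path".
  if (!ids || ids.length === 0) return undefined;
  return ids[iteration % ids.length];
}
```

With three IDs and five iterations, the fourth and fifth iterations wrap around to the first and second IDs, so fewer IDs than iterations means some workflows are scored more than once.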
Test cases not present in the manifest fall back to the regular Instance AI build path. To run *only* the prebuilt set, pair with `--exclude` to skip the rest, or `--filter` to narrow the run.

```bash
# Score the prebuilt cohort, skipping anything not in the manifest
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai \
  --prebuilt-workflows ./mcp-manifest.json \
  --filter contact-form-automation,deduplication-trigger \
  --iterations 5 \
  --experiment-name mcp-cohort
```

The harness leaves prebuilt workflows alone after the run (no auto-delete), so the manifest can be re-used across multiple eval runs.

### Producing a manifest

`pnpm eval:build-mcp-manifest` (`evaluations/cli/build-mcp-manifest.ts`) drives `claude -p` against an MCP server — defaults to n8n's instance MCP — and writes a manifest in the schema this flag expects, plus a `manifest-stats.json` sidecar with per-cohort cost / turn / duration aggregates. The output is validated against the same Zod schema the loader uses, so shape regressions surface here rather than at eval time.

**Prerequisites**: `claude` CLI installed; `~/.claude.json` has the MCP server block configured (project-scoped under `.projects[<repo-root>].mcpServers[<name>]` or globally under `.mcpServers[<name>]`); n8n instance reachable at the URL the MCP block points at. Default MCP server name is `"n8n-mcp (instance)"` — override with `--mcp-server`.

```bash
# Build N=5 per test case, 4 in parallel
pnpm eval:build-mcp-manifest -n 5 -j 4 --output-dir ./mcp-cohort

# Then score the cohort
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai \
  --prebuilt-workflows ./mcp-cohort/manifest.json \
  --iterations 5 \
  --experiment-name mcp-cohort
```

For runs that need to leave the n8n repo (for example, driving the build from a separate Claude project where you have skills configured), three flags decouple the script from its default assumptions:

- `--workflow-dir <path>` — read test-case JSONs from a directory other than the n8n repo's `evaluations/data/workflows/`. When set, the script no longer needs `git rev-parse` to find the repo.
- `--build-cwd <path>` — set the working directory the `claude` subprocess spawns from. Affects which `~/.claude.json` `projects` entry (and which skills) Claude loads.
- `--project-id <id>` — instructs the model to pass `projectId` to `create_workflow_from_code` so workflows land in a specific n8n project instead of the user's personal one.

Run `pnpm eval:build-mcp-manifest --help` for the full flag list.

## Pairwise evals

Pairwise evals score a built workflow against the dataset's `dos` / `donts`
criteria using an LLM judge panel (3 judges by default, majority vote on
`pairwise_primary`, mean fraction of criteria satisfied on
`pairwise_diagnostic`). The point is **head-to-head comparison with
`ai-workflow-builder.ee`** on the same dataset (default
`notion-pairwise-workflows`), so the judge panel, defaults, and metric keys
are imported from that package directly.

Unlike the e2e and sub-agent harnesses, pairwise runs the **builder
sub-agent in-process** — no n8n server, no Docker, no live workflow service.
Stub services capture `createFromWorkflowJSON` calls; HITL suspensions are
auto-approved.

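To make the two aggregate metrics concrete, here is a minimal sketch of combining a judge panel's output. It is not the code imported from `ai-workflow-builder.ee`: `JudgeVerdict` and `aggregateJudges` are hypothetical names, and how an individual judge arrives at its pass/fail verdict is treated as an input rather than modelled.

```typescript
interface JudgeVerdict {
  passed: boolean; // this judge's overall verdict (how it is decided is out of scope here)
  criteriaSatisfied: number; // dos/donts this judge considered satisfied
  criteriaTotal: number; // dos/donts this judge evaluated
}

function aggregateJudges(verdicts: JudgeVerdict[]): { primary: 0 | 1; diagnostic: number } {
  if (verdicts.length === 0) return { primary: 0, diagnostic: 0 };
  const passes = verdicts.filter((v) => v.passed).length;
  // Majority vote: strictly more than half of the judges must pass the workflow.
  const primary: 0 | 1 = passes * 2 > verdicts.length ? 1 : 0;
  // Mean over judges of the fraction of criteria each one saw satisfied.
  const fractions = verdicts.map((v) =>
    v.criteriaTotal === 0 ? 0 : v.criteriaSatisfied / v.criteriaTotal,
  );
  const diagnostic = fractions.reduce((sum, f) => sum + f, 0) / fractions.length;
  return { primary, diagnostic };
}
```

The primary metric is binary and therefore coarse, while the diagnostic score still moves when a workflow satisfies more criteria without flipping the vote.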
### Quick start

```bash
# From packages/@n8n/instance-ai/

# 1. Local fixture (small smoke set, no LangSmith required)
N8N_AI_ANTHROPIC_KEY="$ANTHROPIC_API_KEY" pnpm eval:pairwise --judges 1

# 2. Full LangSmith dataset
LANGSMITH_API_KEY=... N8N_AI_ANTHROPIC_KEY="$ANTHROPIC_API_KEY" \
  pnpm eval:pairwise:langsmith --judges 3

# 3. Rerun a specific subset (one example ID per line; #-prefixed lines ignored)
pnpm eval:pairwise:langsmith \
  --example-ids-file .output/pairwise/failed-ids.txt \
  --output-dir .output/pairwise/rerun
```

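The `--example-ids-file` format noted in the third command (one ID per line, `#`-prefixed lines ignored) can be parsed along these lines. `parseExampleIds` is a hypothetical name, and trimming whitespace and skipping blank lines are assumptions beyond what the comment states.

```typescript
// Parse the contents of an example-IDs file into a list of IDs.
function parseExampleIds(fileContents: string): string[] {
  return fileContents
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Drop blank lines (assumed) and comment lines starting with '#'.
    .filter((line) => line.length > 0 && !line.startsWith('#'));
}
```

Comment lines make it practical to keep a note next to each ID about why it is being rerun.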
### Sandbox

Pairwise evals always run inside a sandbox — the same path production uses.
The agent writes TypeScript to `~/workspace/src/workflow.ts` inside the
sandbox, runs `tsc` to validate, and calls `submit-workflow` to save the
parsed `WorkflowJSON`. This exercises the production builder agent
end-to-end (sandbox prompt, file I/O, real type checking).

Required env vars (Daytona provider — the default):

```bash
ANTHROPIC_API_KEY=sk-ant-...                  # builder + judge LLM
LANGSMITH_API_KEY=ls__...                     # only for --backend langsmith
DAYTONA_API_URL=https://app.daytona.io/api
DAYTONA_API_KEY=dtn_...

# Optional
N8N_INSTANCE_AI_SANDBOX_PROVIDER=daytona                # default; set 'local' or 'n8n-sandbox' to switch
N8N_INSTANCE_AI_SANDBOX_IMAGE=daytonaio/sandbox:0.5.0   # default
N8N_INSTANCE_AI_SANDBOX_TIMEOUT=300000                  # per-command timeout (ms)
```

The CLI fails fast at startup if the chosen provider is misconfigured (e.g.,
Daytona selected without API URL/key). The chosen provider is recorded under
`summary.json → sandbox.provider`.

> **Daytona cold-start.** The very first sandbox creation triggers an image
> build on Daytona's side (`npm install` for `@n8n/workflow-sdk`). That can
> exceed the SDK's 5-minute create timeout and fail with `Sandbox failed to
> become ready within the timeout period`. Once the image is cached, later
> runs are fast. Workaround: pre-build the image via the Daytona dashboard
> before kicking off a full eval run.

### Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--backend` | `local` | `local` reads `evaluations/data/pairwise/local.json`; `langsmith` pulls from the LangSmith dataset |
| `--dataset` | `notion-pairwise-workflows` | LangSmith dataset name (langsmith backend only) |
| `--judges` | `3` | Number of judges in the LLM panel |
| `--judge-model` | `claude-sonnet-4-5-20250929` | LangChain model id for the judge LLM |
| `--iterations` | `1` | Run each example N times — for measuring judge / build variance |
| `--concurrency` | `5` | Parallel example workers (`p-limit`) |
| `--max-examples` | — | Cap dataset to first N examples |
| `--example-ids-file` | — | Path to a text file of LangSmith example IDs (one per line). Used for rerunning a subset |
| `--timeout-ms` | `1200000` | Per-example build timeout |
| `--output-dir` | `.output/pairwise/<iso>` | Where to write artifacts |
| `--experiment-name` | `pairwise-evals-instance-ai` | LangSmith experiment label |
| `--verbose` | `false` | Per-example log lines |

### Outputs

Each run writes a self-contained directory:

```
.output/pairwise/<run>/
├── summary.json              # totals: pass rate, avg diagnostic, build failures by class, interactivity counters
├── results.jsonl             # one line per example: prompt, dos/donts, captured workflow, build metadata, feedback rows
├── workflows/<id>.json       # normalized workflow JSON (matches SimpleWorkflow shape from ai-workflow-builder.ee)
└── chunks/<id>_<iter>.jsonl  # per-example agent trace: tool-calls, tool-results, suspensions, final text
```

The `chunks/*.jsonl` traces are the primary tool for root-causing build
failures. Each line is one event: `tool-call`, `tool-result`, `suspension`,
`auto-approve`, `text`, `stream-finish`, `captured-workflows`, `error`.

When `LANGSMITH_API_KEY` is set, feedback is also posted to LangSmith with
metric keys `pairwise_primary`, `pairwise_diagnostic`,
`pairwise_judges_passed`, `pairwise_total_passes`, `pairwise_total_violations`,
and per-judge `judge1..N`. Experiment metadata includes
`builder: 'instance-ai'` so it can be queried alongside the
`ai-workflow-builder.ee` baseline.

### Build failure classes

Build failures are tracked separately from judge scores:

- **`build_timeout`** — exceeded `--timeout-ms`
- **`no_workflow_built`** — agent finished without invoking `build-workflow` (no captured workflow)
- **`agent_error`** — stream errored or the agent threw

A failure produces a row with `workflow: null`, empty `feedback`, and the
error class — it counts as a primary fail in the comparison report.

### Interactivity gates

The agent is stubbed for non-interactive use. The summary tracks divergence
from this assumption — investigate any non-zero count:

- `askUserCount` — `ask-user` tool was invoked (eval responds with `{ approved: false }`)
- `planToolCount` — `plan` tool was invoked (single-prompt dataset shouldn't trigger planning)
- `autoApprovedSuspensions` — HITL-gated tool fired (e.g., `data-tables` create); auto-approved
- `mockedCredentialTypes` — credential types the agent referenced (auto-mocked since `credentialService.list()` returns `[]`)

### Comparison report

After running both `ai-workflow-builder.ee/evaluations/cli` (the baseline) and
`eval:pairwise` against the same dataset, generate an HTML side-by-side
report:

```bash
pnpm eval:pairwise:compare \
  --ee-dir ../ai-workflow-builder.ee/evaluations/.output/pairwise/<ts> \
  --ia-dir .output/pairwise/<ts> \
  --out .output/pairwise/comparison.html
```

The report shows headline metrics, per-prompt verdicts (TIE / IA-only /
Code-only / both-pass / both-fail), and lazy-loaded workflow previews — rows
collapse by default and only render the heavy `<n8n-demo>` preview when
expanded.

### When pairwise scores wobble

Judge non-determinism + agent retry behavior mean a single run is not a
reliable signal. Two specific things to know:

- The agent will sometimes retry `build-workflow` after a parser rejection
  (e.g., security violation) and sometimes give up. Whether a prompt
  "fails to build" is non-deterministic across runs.
- If you're comparing two builders to claim a regression or improvement,
  bump `--iterations` to ≥3 for both sides.

## How the e2e harness works

1. **Build** — sends the test case prompt to Instance AI, which builds a workflow
2. **Phase 1** — analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario)
3. **Phase 2** — executes the workflow with all HTTP requests intercepted. Each request goes to an LLM that generates a realistic API response using the node's configuration and API documentation from Context7
4. **Verify** — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see Failure categories below)

### What gets mocked

- **Mocked nodes** — any node that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion, etc.). The request is intercepted before it leaves the process. An LLM generates the response.
- **Pinned nodes** — nodes that don't go through the HTTP layer: trigger/webhook nodes, LangChain/AI nodes (they use SDKs directly), database nodes. These receive LLM-generated data as pin data.
- **Real nodes** — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute their actual code on the mocked/pinned data.

No real credentials or API connections are needed. ~95% of node types are covered; the main gaps are binary-data nodes (file attachments, image generation) and streaming nodes.

## How the sub-agent harness works

1. The CLI logs in to n8n with `N8N_EVAL_EMAIL` / `N8N_EVAL_PASSWORD`.
2. For each test case it POSTs `/rest/instance-ai/eval/run-sub-agent`.
3. The server builds a real `InstanceAiContext` via `InstanceAiAdapterService.createContext`, wraps the workflow service to record created IDs, resolves the `builder` (or other) role's system prompt, instantiates the sub-agent with the full `createAllTools(context)` tool surface, and runs it to completion.
4. The server returns `{ text, toolCalls, toolResults, capturedWorkflowIds, ... }`.
5. The CLI fetches each captured workflow via `GET /rest/workflows/:id` (this doubles as a round-trip check through the real importer), scores it with the binary-check suite, and archives+deletes it (unless `--keep-workflows`).

No tools, services, or workflow imports are mocked. The server path exercised here is the same one the orchestrator takes when it spawns a builder sub-agent.

## LangSmith integration

When `LANGSMITH_API_KEY` is set, each run is recorded as a LangSmith experiment against the `instance-ai-workflow-evals` dataset (synced from the JSON files before each run). Experiments against the same dataset can be compared side-by-side to spot regressions.

## Adding test cases

Test cases live in `evaluations/data/workflows/*.json`. Drop a file in and the CLI and LangSmith sync pick it up; there is no registration step.

```json
{
  "prompt": "Create a workflow that...",
  "complexity": "medium",
  "tags": ["build", "webhook", "gmail"],
  "triggerType": "webhook",
  "scenarios": [
    {
      "name": "happy-path",
      "description": "Normal operation",
      "dataSetup": "The webhook receives a submission from Jane (jane@example.com)...",
      "successCriteria": "The workflow executes without errors. An email is sent to jane@example.com..."
    }
  ]
}
```

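When authoring test cases by hand, a quick shape check can catch typos before a run. The sketch below mirrors the fields in the example file; `validateTestCase` is a hypothetical helper, and treating every field as a required string (or string array) is an assumption, since the authoritative schema lives in the eval code.

```typescript
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns a list of problems; an empty list means the shape looks right.
function validateTestCase(value: unknown): string[] {
  if (!isRecord(value)) return ['test case must be a JSON object'];
  const problems: string[] = [];
  for (const key of ['prompt', 'complexity', 'triggerType']) {
    if (typeof value[key] !== 'string') problems.push(`${key} must be a string`);
  }
  const tags = value['tags'];
  if (!Array.isArray(tags) || tags.some((t) => typeof t !== 'string')) {
    problems.push('tags must be an array of strings');
  }
  const scenarios = value['scenarios'];
  if (!Array.isArray(scenarios) || scenarios.length === 0) {
    problems.push('scenarios must be a non-empty array');
    return problems;
  }
  scenarios.forEach((scenario: unknown, index: number) => {
    for (const key of ['name', 'description', 'dataSetup', 'successCriteria']) {
      if (!isRecord(scenario) || typeof scenario[key] !== 'string') {
        problems.push(`scenarios[${index}].${key} must be a string`);
      }
    }
  });
  return problems;
}
```

Running it over `JSON.parse` of a new file before committing gives a readable list of missing fields instead of a failure partway through an eval run.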
**One JSON file = one LangSmith split.** Scenarios in the same file share a split; split names derive from the filename slug. Pick a slug you're happy to also use as a `--filter` target.

**Prompt tips**

- Be specific about node configuration — document IDs, sheet names, channel names, chat IDs. The agent won't ask for these in eval mode (no multi-turn yet).
- Add "Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later."
- If a built-in node doesn't expose a field you need (e.g. the Linear node doesn't query `creator.email`), tell the agent to use HTTP Request instead.

**Scenario tips**

- Don't specify exact counts that depend on mock data ("exactly 7 posts remain"). The LLM is non-deterministic. Say "fewer than the original 10" instead.
- `dataSetup` steers the mock — describe what each service should return, not the exact JSON.
- For error scenarios, describe the condition: "The Telegram node returns an error indicating the chat was not found."
- `successCriteria` is what the verifier reads. Be specific: "None of the titles in the Slack message should contain the word 'qui'."

**Which scenarios to include**

- `happy-path` — everything works as expected
- Edge cases — empty data, missing fields, single vs multiple items
- Error scenarios only if the workflow is expected to handle them gracefully. Most agent-built workflows don't include error handling, so "the workflow crashes on invalid input" is a legitimate finding, not a test-case failure.

### Adding a new credential type

`credentials/seeder.ts` seeds generic creds (HTTP Header, HTTP Basic) on every run, plus env-gated creds (GitHub, Gmail, Teams, Linear…) when the matching env var is set. If your scenario needs a credential type that isn't there, add it to the appropriate list in `seeder.ts` — env-gated if it requires a real token, generic if a placeholder is fine.

## Failure categories

When a scenario fails, the verifier categorizes the root cause:

- **builder_issue** — the agent misconfigured a node, chose the wrong node type, or built the wrong structure.
- **mock_issue** — the LLM mock returned incorrect data (`_evalMockError`, wrong response shape).
- **framework_issue** — Phase 1 failed (empty trigger content) or the eval framework itself cascaded an error.
- **verification_failure** — the verifier couldn't produce a valid result.
- **build_failure** — Instance AI failed to build the workflow or a scenario timed out.

Suite pass rates typically sit between 40% and 65%; most failures are `builder_issue` on scenarios that require error handling the agent doesn't produce by default.

## Troubleshooting

**`Wrong username or password` on login.** Your instance has no owner. Run the `/rest/e2e/reset` curl from [Quick start](#quick-start) (needs `E2E_TESTS=true` on the server).

**`Have reached end of quota` mid-run.** You're hitting the hosted AI proxy's per-tenant quota. Set `N8N_AI_ASSISTANT_BASE_URL=""` to hit Anthropic directly with your `N8N_INSTANCE_AI_MODEL_API_KEY`. Also consider lowering `--concurrency`.

**All scenarios timing out.** Check that the server is up (`curl localhost:5678/healthz`) and that `N8N_INSTANCE_AI_MODEL_API_KEY` is set. A full build is ~60–180s; timeouts past `--timeout-ms` usually mean the agent is looping.

**Flood of `framework_issue` failures.** Phase 1 is returning empty hints, so some nodes will degrade to `_evalMockError`. Check server logs for Anthropic errors; occasionally the cause is a rate-limit spike.

**Port 5678 already in use.** A previous `pnpm dev` is still running. Run `lsof -iTCP:5678 -sTCP:LISTEN` to find the PID, then `kill` it.

**Debugging a specific failure.** Open the `.data/workflow-eval-report.html` artifact — it has per-node execution traces, the exact intercepted request, the mock response the LLM generated, and the verifier's reasoning. Start from the failing node.

## CI

Evals run automatically on PRs that change Instance AI code (path-filtered). The workflow starts a single Docker container and runs the CLI against it. See `.github/workflows/test-evals-instance-ai.yml`.

The job is **non-blocking**. Results are posted as a PR comment and uploaded as artifacts. When `LANGSMITH_API_KEY` is set via the `EVALS_LANGSMITH_API_KEY` secret, runs also land as LangSmith experiments tagged with commit SHA + branch, so you can compare against master side-by-side.

## Architecture

```
evaluations/
├── index.ts              # Public API
├── cli/                  # CLI entries: instance-ai, subagent, pairwise, compare-pairwise, report
├── clients/              # n8n REST + SSE clients
├── checklist/            # LLM verification with retry
├── credentials/          # Test credential seeding
├── data/workflows/       # e2e/sub-agent test case JSON files
├── data/pairwise/        # Local pairwise fixture (small smoke set)
├── harness/              # Runners: buildWorkflow + executeScenario (e2e), in-process-builder (pairwise)
├── langsmith/            # Dataset sync + experiment setup
├── outcome/              # SSE event parsing, workflow discovery
├── report/               # HTML report generator
└── system-prompts/       # LLM prompts for verification

packages/cli/src/modules/instance-ai/eval/
├── execution.service.ts  # Phase 1 + Phase 2 orchestration
├── workflow-analysis.ts  # Hint generation (Phase 1)
├── mock-handler.ts       # Per-request mock generation (Phase 2)
├── api-docs.ts           # Context7 API doc fetcher
├── node-config.ts        # Node config serializer
└── pin-data-generator.ts # LLM pin data for bypass nodes (Phase 1.5)
```

## Known limitations

- **LangChain/AI nodes** — use their own SDKs, not the HTTP mock layer. They fail with credential errors; use pin data instead.
- **Binary / file nodes** — media attachments, image generation, file downloads. Mock metadata works; realistic binary content is out of scope.
- **Streaming nodes** — mocks return complete responses, not streams.
- **GraphQL APIs** — response shape depends on the query, not just the endpoint. Quality depends on the LLM knowing the API schema.
- **Non-determinism** — the agent builds different workflows each run. Pass rates vary between 40% and 65%.
- **Large workflows** — verification artifacts include full execution traces. For 12+ node workflows this can hit token limits. See TRUST-43 for the tool-based verifier approach.