# Workflow evaluation framework

Tests whether workflows built by Instance AI actually work by executing them with LLM-generated mock HTTP responses. No real credentials or external services are involved.

Three harnesses live here:

- **`eval:instance-ai`** — end-to-end build + mocked execution + LLM verification (drives a running n8n instance)
- **`eval:subagent`** — builder sub-agent against live n8n, scored by binary checks (drives a running n8n instance)
- **`eval:pairwise`** — builder sub-agent in-process, scored by an LLM judge panel against do/don't lists (no n8n server). Intended for head-to-head comparison with `ai-workflow-builder.ee` on the same dataset

Sections:

- [Running e2e + sub-agent evals](#running-evals)
- [Regression detection](#regression-detection)
- [Running evals against pre-built workflows](#running-evals-against-pre-built-workflows)
- [Running pairwise evals](#pairwise-evals)
- [How the e2e harness works](#how-the-e2e-harness-works)
- [How the sub-agent harness works](#how-the-sub-agent-harness-works)

## Running evals

Each run:

1. **Build** — the test case prompt goes to Instance AI, which builds a workflow.
2. **Phase 1** — the server analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario).
3. **Phase 2** — the workflow executes with every HTTP request intercepted and answered by an LLM using the node's configuration and API docs from Context7.
4. **Verify** — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see [Failure categories](#failure-categories)).

### What gets mocked

- **Mocked nodes** — anything that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion…). The request is intercepted before it leaves the process; an LLM generates the response.
- **Pinned nodes** — nodes that don't go through the HTTP layer (triggers/webhooks, LangChain/AI nodes, database nodes). They receive LLM-generated pin data.
- **Real nodes** — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute on the mocked data.

~95% of node types are covered. See [Known limitations](#known-limitations) for the gaps.

## Quick start

You need an n8n instance running with Instance AI enabled, a seeded owner account, and an Anthropic API key. Two paths:

### Local (pnpm dev:ai)

`pnpm dev:ai` runs watch mode across `n8n`, `n8n-core`, and `@n8n/n8n-nodes-langchain` in parallel — the set you need when iterating on Instance AI or the eval framework.

1. **Create `.env.local`** at the repo root with at minimum:
   ```env
   N8N_INSTANCE_AI_MODEL_API_KEY=sk-ant-...
   N8N_EVAL_EMAIL=nathan@n8n.io
   N8N_EVAL_PASSWORD=PlaywrightTest123
   # Optional — see "Environment variables" for the full list
   LANGSMITH_API_KEY=...
   CONTEXT7_API_KEY=...
   ```

2. **Start the instance**:
   ```bash
   dotenvx run -f .env.local -- pnpm dev:ai
   ```

3. **Create the owner account**. First time only: open `http://localhost:5678` and sign up with the email/password from your `.env.local`. If you'd rather reset to a known state (useful after deleting the DB or between iterations), start the server with `E2E_TESTS=true` and `curl` the reset endpoint:
   ```bash
   E2E_TESTS=true dotenvx run -f .env.local -- pnpm dev:ai
   # then in another shell:
   curl -sf -X POST http://localhost:5678/rest/e2e/reset \
     -H "Content-Type: application/json" \
     -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}'
   ```

4. **Run evals** from `packages/@n8n/instance-ai/`:
   ```bash
   dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose
   ```

### Docker

Useful when you don't want a live watch process — e.g. reproducing a CI failure. The container starts empty, so owner creation is part of the flow:

```bash
# Build the image
INCLUDE_TEST_CONTROLLER=true pnpm build:docker

# Start a container (E2E_TESTS=true exposes /rest/e2e/reset)
docker run -d --name n8n-eval \
  -e E2E_TESTS=true \
  -e N8N_ENABLED_MODULES=instance-ai \
  -e N8N_AI_ENABLED=true \
  -e N8N_INSTANCE_AI_MODEL_API_KEY=your-key \
  -p 5678:5678 \
  n8nio/n8n:local

# Seed the owner
curl -sf -X POST http://localhost:5678/rest/e2e/reset -H "Content-Type: application/json" -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}'

# Run evals against it
pnpm eval:instance-ai --base-url http://localhost:5678 --verbose
```

## CLI reference

Invoke from `packages/@n8n/instance-ai/`:

```bash
# All test cases
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --verbose

# Single test case (filename substring match)
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose

# Keep built workflows for inspection after the run
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --keep-workflows

# Multi-iteration for pass@k / pass^k metrics
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --iterations 3
```

### Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--verbose` | `false` | Log build/execute/verify timing and SSE events |
| `--filter` | — | Filter test cases by filename substring. Comma-separated values mean OR (e.g. `contact-form,deduplication`) |
| `--exclude` | — | Skip test cases whose filename matches any of the substrings. Same comma-separated shape as `--filter`; applied after `--filter` |
| `--prebuilt-workflows` | — | Path to a JSON manifest mapping test-case slugs to existing workflow IDs. Skips the orchestrator build for matched test cases — see [Running evals against pre-built workflows](#running-evals-against-pre-built-workflows) |
| `--keep-workflows` | `false` | Don't delete built workflows after the run |
| `--base-url` | `http://localhost:5678` | n8n instance URL |
| `--email` | E2E test owner | Override login email (or `N8N_EVAL_EMAIL`) |
| `--password` | E2E test owner | Override login password (or `N8N_EVAL_PASSWORD`) |
| `--timeout-ms` | `900000` | Per-test-case timeout |
| `--output-dir` | cwd | Where to write `eval-results.json` |
| `--dataset` | `instance-ai-workflow-evals` | LangSmith dataset name |
| `--concurrency` | `16` | Max concurrent scenarios (builds are separately capped at 4) |
| `--experiment-name` | auto | LangSmith experiment prefix (defaults to `{branch}-{sha}` in CI or `local-{branch}-{sha}-dirty?` locally) |
| `--iterations` | `1` | Run each test case N times with fresh builds |
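
The `--filter` / `--exclude` semantics from the table can be sketched as below. This is a minimal illustration, not the CLI's actual code: `parseList` and `selectTestCases` are hypothetical helper names, and `weather-alert.json` is a made-up filename.

```typescript
// Hypothetical helpers that mirror the documented --filter / --exclude behavior.
function parseList(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

function selectTestCases(
  filenames: string[],
  filter?: string,
  exclude?: string,
): string[] {
  const include = parseList(filter);
  const skip = parseList(exclude);
  return filenames
    // --filter: keep a file if ANY substring matches (no filter keeps everything)
    .filter((f) => include.length === 0 || include.some((s) => f.includes(s)))
    // --exclude: applied after --filter, drops a file if ANY substring matches
    .filter((f) => !skip.some((s) => f.includes(s)));
}

const files = [
  "contact-form-automation.json",
  "deduplication-trigger.json",
  "weather-alert.json", // made-up filename for illustration
];
const selected = selectTestCases(files, "contact-form,deduplication", "trigger");
console.log(selected); // ["contact-form-automation.json"]
```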

**pass@k / pass^k**: with `--iterations N`, each scenario runs N times. `pass@k` is the fraction of scenarios that passed *at least once*; `pass^k` is the fraction that passed *every* time. `pass@k` shows whether something is *possible*; `pass^k` shows whether it's *reliable*.
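
As a rough sketch of the arithmetic, assuming per-scenario results are available as one boolean per iteration (the `ScenarioRuns` shape and the scenario names are illustrative, not the harness's real data model):

```typescript
// Each scenario maps to one boolean per iteration (true = passed).
type ScenarioRuns = Record<string, boolean[]>;

function passMetrics(runs: ScenarioRuns): { passAtK: number; passHatK: number } {
  const outcomes = Object.values(runs);
  if (outcomes.length === 0) return { passAtK: 0, passHatK: 0 };
  // pass@k: scenarios that passed at least once
  const atLeastOnce = outcomes.filter((r) => r.some(Boolean)).length;
  // pass^k: scenarios that passed on every iteration
  const everyTime = outcomes.filter((r) => r.length > 0 && r.every(Boolean)).length;
  return {
    passAtK: atLeastOnce / outcomes.length,
    passHatK: everyTime / outcomes.length,
  };
}

const metrics = passMetrics({
  "happy-path": [true, true, true],
  "empty-input": [true, false, true],
  "api-error": [false, false, false],
});
console.log(metrics); // passAtK is 2/3, passHatK is 1/3
```

With three iterations, the flaky `empty-input` scenario counts toward `pass@k` but not `pass^k`, which is why the gap between the two numbers is a quick read on reliability.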

### Outputs

Every run produces:

- **Console** — live progress, per-scenario pass/fail with `[failure_category]` tag, and a grouped summary.
- **`eval-results.json`** — structured results in `--output-dir` (or cwd). Consumed by the CI PR comment.
- **`.data/workflow-eval-report.html`** — self-contained debugging view with per-node execution traces, intercepted requests, mock responses, Phase 1 hints, and verifier reasoning.
- **LangSmith experiment** — only when `LANGSMITH_API_KEY` is set. See the caveat in [Environment variables](#environment-variables).

## Environment variables

| Variable | Required | Description |
|----------|----------|-------------|
| `N8N_INSTANCE_AI_MODEL_API_KEY` | Yes | Anthropic API key for the agent, mock generation, and verification |
| `N8N_EVAL_EMAIL` | No | n8n login email (defaults to E2E test owner) |
| `N8N_EVAL_PASSWORD` | No | n8n login password (defaults to E2E test owner) |
| `LANGSMITH_API_KEY` | No | Enables experiment tracking + tracing. **See caveat below.** |
| `LANGSMITH_ENDPOINT` | No | Region (`https://api.smith.langchain.com` US, `https://eu.api.smith.langchain.com` EU) |
| `LANGSMITH_REVISION_ID` | No | Commit SHA to tag the experiment with (auto-set in CI) |
| `LANGSMITH_BRANCH` | No | Branch name to tag the experiment with (auto-set in CI) |
| `CONTEXT7_API_KEY` | No | Context7 key for API-doc lookups. Improves mock realism for less-common services; the LLM falls back to training data when unset |
| `N8N_AI_ASSISTANT_BASE_URL` | No | Set to `""` to bypass the hosted AI proxy and hit Anthropic directly — useful to avoid per-tenant quota during large batch runs |

**LangSmith caveat:** if `LANGSMITH_API_KEY` is set in `.env.local`, local runs also land in the shared `instance-ai-workflow-evals` dataset. Unset it (or run without `dotenvx`) to keep exploratory runs out of team results.

## Regression detection

When `LANGSMITH_API_KEY` is set, every eval run automatically compares its results against the most recent pinned baseline (any experiment whose name starts with `instance-ai-baseline-`). Each run writes two output files:

- `eval-results.json` — structured data only, including `comparison.result` when a baseline was found.
- `eval-pr-comment.md` — the full PR comment rendered as markdown, including the alert, aggregate, comparison sections, per-test-case results, and failure details. Always written; falls back to a no-baseline summary when no comparison ran.

The CI PR-comment step uses `eval-pr-comment.md` as the entire comment body (no jq assembly in the workflow). The console output uses a separate aligned-text formatter — same data, no markdown noise in the terminal.

### Refreshing the baseline

There is no auto-refresh — refresh explicitly when you want a new reference point, ideally with high N for low noise:

```bash
# From packages/@n8n/instance-ai/, on master at the version you want to pin
LANGSMITH_API_KEY=... dotenvx run -f ../../../.env.local -- \
  pnpm eval:instance-ai --experiment-name instance-ai-baseline --iterations 10
```

LangSmith appends a random suffix (e.g. `instance-ai-baseline-7abc1234`); the most recently started one becomes the comparison target on the next eval run. The comparison is silently skipped on the baseline-creation run itself.

### How scenarios are tiered

Each scenario lands in one of three regression tiers, evaluated in order of strictness:

- **Regression** — high-confidence flag, gating-grade. The drop must be statistically significant (chance of seeing it by noise < 5%), at least 30 percentage points in size, and the baseline must have been reliable (≥ 70% pass rate).
- **Likely regression** — looser bar for visibility on borderline cases: a looser confidence threshold (chance by noise < 20%), drop ≥ 15 percentage points, baseline ≥ 50%. These flags are frequently natural variance, so they are worth a glance only if your changes touch related code paths.
- **Worth watching** — any scenario whose pass rate moved by ≥ 35 percentage points but wasn't flagged as a regression (hard or likely tier). Pure visibility, no implication of cause.

Other verdicts: `improvement` (PR significantly better, skips the reliability gate), `unreliable_baseline` (confident drop but baseline was too flaky to call a regression — surfaced but not flagged), `stable`, `insufficient_data`.

Why these tiers and not a flat percentage threshold? At the small N that PR runs use (typically 3 iterations), a flat threshold can't tell a real regression from coin-flip noise. The confidence cutoff filters out gaps that could plausibly happen by chance, and the reliability gate avoids chasing noise on already-flaky scenarios. Implementation lives in `comparison/statistics.ts` (Fisher's exact test for the confidence check, Wilson interval for the headline aggregate band). If the false-positive rate looks off, tune the likely-regression tier first and keep the hard tier strict.
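
To make the three tiers concrete, here is a minimal sketch of the decision logic using the thresholds listed above. It is not the code in `comparison/statistics.ts`: the function names and tier labels are hypothetical, the one-sided form of Fisher's exact test is an assumption, and the other verdicts (`improvement`, `unreliable_baseline`, `insufficient_data`) are left out.

```typescript
// Tier labels here are illustrative, not the identifiers used in statistics.ts.
type Tier = "regression" | "likely_regression" | "worth_watching" | "stable";

function logFactorial(n: number): number {
  let sum = 0;
  for (let i = 2; i <= n; i++) sum += Math.log(i);
  return sum;
}

function logChoose(n: number, k: number): number {
  return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}

// One-sided Fisher's exact test: probability of the PR passing `prPass` or fewer
// trials by chance, with the row and column totals of the 2x2 table held fixed.
function fisherDropPValue(
  basePass: number,
  baseTotal: number,
  prPass: number,
  prTotal: number,
): number {
  const total = baseTotal + prTotal;
  const totalPass = basePass + prPass;
  const totalFail = total - totalPass;
  const minPrPass = Math.max(0, prTotal - totalFail);
  let p = 0;
  for (let x = minPrPass; x <= prPass; x++) {
    p += Math.exp(
      logChoose(totalPass, x) + logChoose(totalFail, prTotal - x) - logChoose(total, prTotal),
    );
  }
  return Math.min(1, p);
}

function classify(basePass: number, baseTotal: number, prPass: number, prTotal: number): Tier {
  const baseRate = basePass / baseTotal;
  const drop = baseRate - prPass / prTotal;
  const p = fisherDropPValue(basePass, baseTotal, prPass, prTotal);
  // Tiers are checked from strictest to loosest.
  if (p < 0.05 && drop >= 0.3 && baseRate >= 0.7) return "regression";
  if (p < 0.2 && drop >= 0.15 && baseRate >= 0.5) return "likely_regression";
  if (Math.abs(drop) >= 0.35) return "worth_watching";
  return "stable";
}

console.log(classify(10, 10, 0, 3)); // "regression"
console.log(classify(6, 10, 0, 3)); // "likely_regression"
console.log(classify(8, 10, 1, 3)); // "worth_watching"
```

The third call shows why the confidence check matters at N=3: a drop from 80% to 33% looks large, but with so few trials the p-value is about 0.20, so under these assumptions it only reaches the visibility tier.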

### Failure-category drift

When both sides captured per-trial `failureCategory` values, the comparison also surfaces a run-level table of category rates (PR vs baseline). A category is marked **notable** when its absolute rate delta is ≥ 5 percentage points _and_ the count change beyond what scenario-count scaling would predict is ≥ 3 trials. This catches cross-scenario shifts (e.g. mock-generation breaking, or a model getting weaker overall) that per-scenario flags can miss.
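
One way to read the "notable" rule, sketched under assumptions: the expected PR count is taken to be the baseline count scaled by the ratio of trial totals, and the `CategoryCounts` shape is hypothetical. The real comparison code may scale differently.

```typescript
// Hypothetical input shape: total trials plus failure counts per category.
interface CategoryCounts {
  total: number;
  byCategory: Record<string, number>;
}

function notableCategories(baseline: CategoryCounts, pr: CategoryCounts): string[] {
  const names = Object.keys({ ...baseline.byCategory, ...pr.byCategory });
  const scale = pr.total / baseline.total; // assumed scaling factor
  const notable: string[] = [];
  for (const name of names) {
    const baseCount = baseline.byCategory[name] ?? 0;
    const prCount = pr.byCategory[name] ?? 0;
    // Condition 1: absolute rate delta of at least 5 percentage points
    const rateDelta = Math.abs(prCount / pr.total - baseCount / baseline.total);
    // Condition 2: at least 3 trials beyond what scaling alone would predict
    const excess = Math.abs(prCount - baseCount * scale);
    if (rateDelta >= 0.05 && excess >= 3) notable.push(name);
  }
  return notable.sort();
}

const drift = notableCategories(
  { total: 100, byCategory: { builder_issue: 30, mock_issue: 5 } },
  { total: 30, byCategory: { builder_issue: 10, mock_issue: 6 } },
);
console.log(drift); // ["mock_issue"]
```

Here `builder_issue` stays near its baseline rate (30% vs 33%), while `mock_issue` jumps from 5% to 20% with 4.5 trials more than scaling predicts, so only the latter is flagged.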

### Best-effort

Comparison is logged and skipped on any LangSmith failure — it never fails the eval. It is also skipped when no baseline experiment exists yet.

## Running evals against pre-built workflows

The eval framework normally builds each workflow with Instance AI and then verifies it. With `--prebuilt-workflows <path>`, the build step is skipped for matched test cases — the harness fetches the existing workflow from the n8n instance and runs verification against it instead. Use this to score workflows authored by other tools (an MCP-driven session, a hand-built reference, an older Instance AI snapshot) on the same dataset and the same verifier.

The manifest is a JSON file mapping test-case file slugs to workflow IDs:

```json
{
  "contact-form-automation": ["W1abc", "W2def", "W3ghi"],
  "deduplication-trigger": ["W4jkl"]
}
```

- **Keys** are test-case file slugs — the JSON filename without `.json` (e.g. `contact-form-automation` for `evaluations/data/workflows/contact-form-automation.json`). The `--filter` flag uses the same identifier.
- **Values** are arrays of workflow IDs that already exist in the target n8n instance. Multiple iterations rotate through the list with `iteration % ids.length`, so an `--iterations 5` run with 5 IDs gets 5 distinct builds.
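
The rotation rule can be sketched as follows. `pickWorkflowId` is a hypothetical helper, and zero-based iteration numbering is an assumption:

```typescript
type Manifest = Record<string, string[]>;

// Returns the prebuilt workflow ID for this iteration, or undefined when the
// slug is not in the manifest (the harness then falls back to a normal build).
function pickWorkflowId(manifest: Manifest, slug: string, iteration: number): string | undefined {
  const ids = manifest[slug];
  if (!ids || ids.length === 0) return undefined;
  return ids[iteration % ids.length];
}

const manifest: Manifest = {
  "contact-form-automation": ["W1abc", "W2def", "W3ghi"],
  "deduplication-trigger": ["W4jkl"],
};

const picks = [0, 1, 2, 3, 4].map((i) => pickWorkflowId(manifest, "contact-form-automation", i));
console.log(picks); // ["W1abc", "W2def", "W3ghi", "W1abc", "W2def"]
```

With fewer IDs than iterations, IDs repeat, so the later iterations measure mock and verifier variance on the same workflow instead of build variance.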

Test cases not present in the manifest fall back to the regular Instance AI build path. To run *only* the prebuilt set, pair with `--exclude` to skip the rest, or `--filter` to narrow the run.

```bash
# Score the prebuilt cohort, skipping anything not in the manifest
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai \
  --prebuilt-workflows ./mcp-manifest.json \
  --filter contact-form-automation,deduplication-trigger \
  --iterations 5 \
  --experiment-name mcp-cohort
```

The harness leaves prebuilt workflows alone after the run (no auto-delete), so the manifest can be re-used across multiple eval runs.

### Producing a manifest

`pnpm eval:build-mcp-manifest` (`evaluations/cli/build-mcp-manifest.ts`) drives `claude -p` against an MCP server — defaults to n8n's instance MCP — and writes a manifest in the schema this flag expects, plus a `manifest-stats.json` sidecar with per-cohort cost / turn / duration aggregates. The output is validated against the same Zod schema the loader uses, so shape regressions surface here rather than at eval time.

**Prerequisites**: `claude` CLI installed; `~/.claude.json` has the MCP server block configured (project-scoped under `.projects[<repo-root>].mcpServers[<name>]` or globally under `.mcpServers[<name>]`); n8n instance reachable at the URL the MCP block points at. Default MCP server name is `"n8n-mcp (instance)"` — override with `--mcp-server`.

```bash
# Build N=5 per test case, 4 in parallel
pnpm eval:build-mcp-manifest -n 5 -j 4 --output-dir ./mcp-cohort

# Then score the cohort
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai \
  --prebuilt-workflows ./mcp-cohort/manifest.json \
  --iterations 5 \
  --experiment-name mcp-cohort
```

For runs that need to leave the n8n repo (for example, driving the build from a separate Claude project where you have skills configured), three flags decouple the script from its default assumptions:

- `--workflow-dir <path>` — read test-case JSONs from a directory other than the n8n repo's `evaluations/data/workflows/`. When set, the script no longer needs `git rev-parse` to find the repo.
- `--build-cwd <path>` — set the working directory the `claude` subprocess spawns from. Affects which `~/.claude.json` `projects` entry (and which skills) Claude loads.
- `--project-id <id>` — instructs the model to pass `projectId` to `create_workflow_from_code` so workflows land in a specific n8n project instead of the user's personal one.

Run `pnpm eval:build-mcp-manifest --help` for the full flag list.

## Pairwise evals

Pairwise evals score a built workflow against the dataset's `dos` / `donts`
criteria using an LLM judge panel (3 judges by default, majority vote on
`pairwise_primary`, mean fraction of criteria satisfied on
`pairwise_diagnostic`). The point is **head-to-head comparison with
`ai-workflow-builder.ee`** on the same dataset (default
`notion-pairwise-workflows`), so the judge panel, defaults, and metric keys
are imported from that package directly.
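
The aggregation arithmetic can be illustrated like this. The real panel is imported from `ai-workflow-builder.ee`; the `JudgeVerdict` shape and `aggregatePanel` name below are assumptions made for the sketch, as is the idea that each judge reports its own pass/fail alongside a criteria count.

```typescript
// Hypothetical per-judge result: an overall pass/fail plus criteria counts.
interface JudgeVerdict {
  primaryPass: boolean;
  criteriaSatisfied: number;
  criteriaTotal: number;
}

function aggregatePanel(verdicts: JudgeVerdict[]): {
  primary: number;
  diagnostic: number;
  judgesPassed: number;
} {
  const judgesPassed = verdicts.filter((v) => v.primaryPass).length;
  const fractions = verdicts.map((v) =>
    v.criteriaTotal === 0 ? 0 : v.criteriaSatisfied / v.criteriaTotal,
  );
  const sum = fractions.reduce((a, b) => a + b, 0);
  return {
    // Majority vote: more than half of the judges must pass the workflow
    primary: judgesPassed > verdicts.length / 2 ? 1 : 0,
    // Mean fraction of criteria satisfied across judges
    diagnostic: verdicts.length === 0 ? 0 : sum / verdicts.length,
    judgesPassed,
  };
}

const panel = aggregatePanel([
  { primaryPass: true, criteriaSatisfied: 5, criteriaTotal: 5 },
  { primaryPass: true, criteriaSatisfied: 4, criteriaTotal: 5 },
  { primaryPass: false, criteriaSatisfied: 3, criteriaTotal: 5 },
]);
console.log(panel); // primary 1, diagnostic approximately 0.8, judgesPassed 2
```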

Unlike the e2e and sub-agent harnesses, pairwise runs the **builder
sub-agent in-process** — no n8n server, no Docker, no live workflow service.
Stub services capture `createFromWorkflowJSON` calls; HITL suspensions are
auto-approved.

### Quick start

```bash
# From packages/@n8n/instance-ai/

# 1. Local fixture (small smoke set, no LangSmith required)
N8N_AI_ANTHROPIC_KEY="$ANTHROPIC_API_KEY" pnpm eval:pairwise --judges 1

# 2. Full LangSmith dataset
LANGSMITH_API_KEY=... N8N_AI_ANTHROPIC_KEY="$ANTHROPIC_API_KEY" \
  pnpm eval:pairwise:langsmith --judges 3

# 3. Rerun a specific subset (one example ID per line; #-prefixed lines ignored)
pnpm eval:pairwise:langsmith \
  --example-ids-file .output/pairwise/failed-ids.txt \
  --output-dir .output/pairwise/rerun
```

### Sandbox

Pairwise evals always run inside a sandbox — the same path production uses.
The agent writes TypeScript to `~/workspace/src/workflow.ts` inside the
sandbox, runs `tsc` to validate, and calls `submit-workflow` to save the
parsed `WorkflowJSON`. This exercises the production builder agent
end-to-end (sandbox prompt, file I/O, real type checking).

Required env vars (Daytona provider — the default):

```bash
ANTHROPIC_API_KEY=sk-ant-...           # builder + judge LLM
LANGSMITH_API_KEY=ls__...              # only for --backend langsmith
DAYTONA_API_URL=https://app.daytona.io/api
DAYTONA_API_KEY=dtn_...

# Optional
N8N_INSTANCE_AI_SANDBOX_PROVIDER=daytona      # default; set 'local' or 'n8n-sandbox' to switch
N8N_INSTANCE_AI_SANDBOX_IMAGE=daytonaio/sandbox:0.5.0   # default
N8N_INSTANCE_AI_SANDBOX_TIMEOUT=300000        # per-command timeout (ms)
```

The CLI fails fast at startup if the chosen provider is misconfigured (e.g.,
Daytona selected without API URL/key). The chosen provider is recorded under
`summary.json → sandbox.provider`.

> **Daytona cold-start.** The very first sandbox creation triggers an image
> build on Daytona's side (`npm install` for `@n8n/workflow-sdk`). That can
> exceed the SDK's 5-minute create timeout and fail with `Sandbox failed to
> become ready within the timeout period`. Once the image is cached, later
> runs are fast. Workaround: pre-build the image via the Daytona dashboard
> before kicking off a full eval run.

### Flags

| Flag | Default | Description |
|------|---------|-------------|
| `--backend` | `local` | `local` reads `evaluations/data/pairwise/local.json`; `langsmith` pulls from the LangSmith dataset |
| `--dataset` | `notion-pairwise-workflows` | LangSmith dataset name (langsmith backend only) |
| `--judges` | `3` | Number of judges in the LLM panel |
| `--judge-model` | `claude-sonnet-4-5-20250929` | LangChain model id for the judge LLM |
| `--iterations` | `1` | Run each example N times — for measuring judge / build variance |
| `--concurrency` | `5` | Parallel example workers (`p-limit`) |
| `--max-examples` | — | Cap dataset to first N examples |
| `--example-ids-file` | — | Path to a text file of LangSmith example IDs (one per line). Used for rerunning a subset |
| `--timeout-ms` | `1200000` | Per-example build timeout |
| `--output-dir` | `.output/pairwise/<iso>` | Where to write artifacts |
| `--experiment-name` | `pairwise-evals-instance-ai` | LangSmith experiment label |
| `--verbose` | `false` | Per-example log lines |

### Outputs

Each run writes a self-contained directory:

```
.output/pairwise/<run>/
├── summary.json           # totals: pass rate, avg diagnostic, build failures by class, interactivity counters
├── results.jsonl          # one line per example: prompt, dos/donts, captured workflow, build metadata, feedback rows
├── workflows/<id>.json    # normalized workflow JSON (matches SimpleWorkflow shape from ai-workflow-builder.ee)
└── chunks/<id>_<iter>.jsonl  # per-example agent trace: tool-calls, tool-results, suspensions, final text
```

The `chunks/*.jsonl` traces are the primary tool for root-causing build
failures. Each line is one event: `tool-call`, `tool-result`, `suspension`,
`auto-approve`, `text`, `stream-finish`, `captured-workflows`, `error`.
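
A quick way to triage a trace is to count events by kind. The sketch below assumes each line is a JSON object with a `type` field holding the event name; that field name, and the extra `name` / `message` fields in the sample, are assumptions and may not match the real trace schema.

```typescript
// Counts events per type in the contents of one chunks/*.jsonl file.
function countEvents(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.length === 0) continue; // skip blank lines
    const event = JSON.parse(trimmed) as { type?: string };
    const key = event.type ?? "unknown";
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Made-up sample trace for illustration
const sampleTrace = [
  '{"type":"tool-call","name":"build-workflow"}',
  '{"type":"tool-result"}',
  '{"type":"tool-call","name":"build-workflow"}',
  '{"type":"error","message":"parser rejection"}',
  '{"type":"stream-finish"}',
].join("\n");

const eventCounts = countEvents(sampleTrace);
console.log(eventCounts);
```

A trace with `tool-call` events but no `captured-workflows` event, for example, is a first hint that the build was attempted and rejected, not skipped.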

When `LANGSMITH_API_KEY` is set, feedback is also posted to LangSmith with
metric keys `pairwise_primary`, `pairwise_diagnostic`,
`pairwise_judges_passed`, `pairwise_total_passes`, `pairwise_total_violations`,
and per-judge `judge1..N`. Experiment metadata includes
`builder: 'instance-ai'` so it can be queried alongside the
`ai-workflow-builder.ee` baseline.

### Build failure classes

Build failures are tracked separately from judge scores:

- **`build_timeout`** — exceeded `--timeout-ms`
- **`no_workflow_built`** — agent finished without invoking `build-workflow` (no captured workflow)
- **`agent_error`** — stream errored or the agent threw

A failure produces a row with `workflow: null`, empty `feedback`, and the
error class — it counts as a primary fail in the comparison report.

### Interactivity gates

The agent is stubbed for non-interactive use. The summary tracks divergence
from this assumption — investigate any non-zero count:

- `askUserCount` — `ask-user` tool was invoked (eval responds with `{ approved: false }`)
- `planToolCount` — `plan` tool was invoked (single-prompt dataset shouldn't trigger planning)
- `autoApprovedSuspensions` — HITL-gated tool fired (e.g., `data-tables` create); auto-approved
- `mockedCredentialTypes` — credential types the agent referenced (auto-mocked since `credentialService.list()` returns `[]`)

### Comparison report

After running both `ai-workflow-builder.ee/evaluations/cli` (the baseline) and
`eval:pairwise` against the same dataset, generate an HTML side-by-side
report:

```bash
pnpm eval:pairwise:compare \
  --ee-dir   ../ai-workflow-builder.ee/evaluations/.output/pairwise/<ts> \
  --ia-dir   .output/pairwise/<ts> \
  --out      .output/pairwise/comparison.html
```

The report shows headline metrics, per-prompt verdicts (TIE / IA-only /
Code-only / both-pass / both-fail), and lazy-loaded workflow previews — rows
collapse by default and only render the heavy `<n8n-demo>` preview when
expanded.

### When pairwise scores wobble

Judge non-determinism and agent retry behavior mean that a single run is
not a reliable signal. Two specific things to know:

- The agent will sometimes retry `build-workflow` after a parser rejection
  (e.g., security violation) and sometimes give up. Whether a prompt
  "fails to build" is non-deterministic across runs.
- If you're comparing two builders to claim a regression or improvement,
  bump `--iterations` to ≥3 for both sides.

## How the e2e harness works

1. **Build** — sends the test case prompt to Instance AI, which builds a workflow
2. **Phase 1** — analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario)
3. **Phase 2** — executes the workflow with all HTTP requests intercepted. Each request goes to an LLM that generates a realistic API response using the node's configuration and API documentation from Context7
4. **Verify** — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see [Failure categories](#failure-categories))

### What gets mocked

- **Mocked nodes** — any node that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion, etc.). The request is intercepted before it leaves the process. An LLM generates the response.
- **Pinned nodes** — nodes that don't go through the HTTP layer: trigger/webhook nodes, LangChain/AI nodes (they use SDKs directly), database nodes. These receive LLM-generated data as pin data.
- **Real nodes** — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute their actual code on the mocked/pinned data.

No real credentials or API connections are needed. ~95% of node types are covered; the main gaps are binary-data nodes (file attachments, image generation) and streaming nodes.

## How the sub-agent harness works

1. The CLI logs in to n8n with `N8N_EVAL_EMAIL` / `N8N_EVAL_PASSWORD`.
2. For each test case it POSTs `/rest/instance-ai/eval/run-sub-agent`.
3. The server builds a real `InstanceAiContext` via `InstanceAiAdapterService.createContext`, wraps the workflow service to record created IDs, resolves the `builder` (or other) role's system prompt, instantiates the sub-agent with the full `createAllTools(context)` tool surface, and runs it to completion.
4. The server returns `{ text, toolCalls, toolResults, capturedWorkflowIds, ... }`.
5. The CLI fetches each captured workflow via `GET /rest/workflows/:id` (this doubles as a round-trip check through the real importer), scores it with the binary-check suite, and archives+deletes it (unless `--keep-workflows`).

No tools, services, or workflow imports are mocked. The server path exercised here is the same one the orchestrator takes when it spawns a builder sub-agent.

## LangSmith integration

When `LANGSMITH_API_KEY` is set, each run is recorded as a LangSmith experiment against the `instance-ai-workflow-evals` dataset (synced from the JSON files before each run). Experiments against the same dataset can be compared side-by-side to spot regressions.

## Adding test cases

Test cases live in `evaluations/data/workflows/*.json`. Drop a file in and both the CLI and the LangSmith sync pick it up; there is no registration step.

```json
{
  "prompt": "Create a workflow that...",
  "complexity": "medium",
  "tags": ["build", "webhook", "gmail"],
  "triggerType": "webhook",
  "scenarios": [
    {
      "name": "happy-path",
      "description": "Normal operation",
      "dataSetup": "The webhook receives a submission from Jane (jane@example.com)...",
      "successCriteria": "The workflow executes without errors. An email is sent to jane@example.com..."
    }
  ]
}
```

**One JSON file = one LangSmith split.** Scenarios in the same file share a split; split names derive from the filename slug. Pick a slug you're happy to also use as a `--filter` target.
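
For a pre-flight check before committing a new file, the shape above can be written down as types. This is a sketch inferred from the example JSON only: which fields are required is an assumption, and the harness's real schema may be stricter or carry more fields.

```typescript
interface Scenario {
  name: string;
  description: string;
  dataSetup: string;
  successCriteria: string;
}

interface TestCase {
  prompt: string;
  complexity: string;
  tags: string[];
  triggerType: string;
  scenarios: Scenario[];
}

function isNonEmptyString(v: unknown): v is string {
  return typeof v === "string" && v.trim().length > 0;
}

// Returns a list of problems; an empty list means the file looks usable.
function validateTestCase(raw: unknown): string[] {
  const problems: string[] = [];
  const tc = (raw ?? {}) as Partial<TestCase>;
  if (!isNonEmptyString(tc.prompt)) problems.push("prompt must be a non-empty string");
  if (!Array.isArray(tc.scenarios) || tc.scenarios.length === 0) {
    problems.push("scenarios must be a non-empty array");
    return problems;
  }
  tc.scenarios.forEach((s, i) => {
    for (const key of ["name", "dataSetup", "successCriteria"] as const) {
      if (!isNonEmptyString(s[key])) {
        problems.push(`scenarios[${i}].${key} must be a non-empty string`);
      }
    }
  });
  return problems;
}

// The slug (and therefore the split name and --filter target) is the filename minus ".json".
function slugFromFilename(filename: string): string {
  return filename.replace(/\.json$/, "");
}

const sampleCase: TestCase = {
  prompt: "Create a workflow that...",
  complexity: "medium",
  tags: ["build", "webhook", "gmail"],
  triggerType: "webhook",
  scenarios: [
    {
      name: "happy-path",
      description: "Normal operation",
      dataSetup: "The webhook receives a submission from Jane (jane@example.com)...",
      successCriteria: "The workflow executes without errors.",
    },
  ],
};
console.log(validateTestCase(sampleCase)); // []
console.log(slugFromFilename("contact-form-automation.json")); // "contact-form-automation"
```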

**Prompt tips**

- Be specific about node configuration — document IDs, sheet names, channel names, chat IDs. The agent won't ask for these in eval mode (no multi-turn yet).
- Add "Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later."
- If a built-in node doesn't expose a field you need (e.g. the Linear node doesn't query `creator.email`), tell the agent to use HTTP Request instead.

**Scenario tips**

- Don't specify exact counts that depend on mock data ("exactly 7 posts remain"). The LLM is non-deterministic. Say "fewer than the original 10" instead.
- `dataSetup` steers the mock — describe what each service should return, not the exact JSON.
- For error scenarios, describe the condition: "The Telegram node returns an error indicating the chat was not found."
- `successCriteria` is what the verifier reads. Be specific: "None of the titles in the Slack message should contain the word 'qui'."

**Which scenarios to include**

- `happy-path` — everything works as expected
- Edge cases — empty data, missing fields, single vs multiple items
- Error scenarios only if the workflow is expected to handle them gracefully. Most agent-built workflows don't include error handling, so "the workflow crashes on invalid input" is a legitimate finding, not a test-case failure.

### Adding a new credential type

`credentials/seeder.ts` seeds generic creds (HTTP Header, HTTP Basic) on every run, plus env-gated creds (GitHub, Gmail, Teams, Linear…) when the matching env var is set. If your scenario needs a credential type that isn't there, add it to the appropriate list in `seeder.ts` — env-gated if it requires a real token, generic if a placeholder is fine.

## Failure categories

When a scenario fails, the verifier categorizes the root cause:

- **builder_issue** — the agent misconfigured a node, chose the wrong node type, or built the wrong structure.
- **mock_issue** — the LLM mock returned incorrect data (`_evalMockError`, wrong response shape).
- **framework_issue** — Phase 1 failed (empty trigger content) or the eval framework itself cascaded an error.
- **verification_failure** — the verifier couldn't produce a valid result.
- **build_failure** — Instance AI failed to build the workflow or a scenario timed out.

Suite pass rates typically sit between 40% and 65%; most failures are `builder_issue` on scenarios that require error handling the agent doesn't produce by default.

## Troubleshooting

**`Wrong username or password` on login.** Your instance has no owner. Run the `/rest/e2e/reset` curl from [Quick start](#quick-start) (needs `E2E_TESTS=true` on the server).

**`Have reached end of quota` mid-run.** You're hitting the hosted AI proxy's per-tenant quota. Set `N8N_AI_ASSISTANT_BASE_URL=""` to hit Anthropic directly with your `N8N_INSTANCE_AI_MODEL_API_KEY`. Also consider lowering `--concurrency`.

**All scenarios timing out.** Check that the server is up (`curl localhost:5678/healthz`) and that `N8N_INSTANCE_AI_MODEL_API_KEY` is set. A full build is ~60–180s; timeouts past `--timeout-ms` usually mean the agent is looping.

**Flood of `framework_issue` failures.** Phase 1 is returning empty hints, so some nodes degrade to `_evalMockError`. Check server logs for Anthropic errors; the cause is occasionally a rate-limit spike.

**Port 5678 already in use.** A previous `pnpm dev` is still running. Run `lsof -iTCP:5678 -sTCP:LISTEN` to find the PID, then `kill` it.

**Debugging a specific failure.** Open the `.data/workflow-eval-report.html` artifact — it has per-node execution traces, the exact intercepted request, the mock response the LLM generated, and the verifier's reasoning. Start from the failing node.

## CI

Evals run automatically on PRs that change Instance AI code (path-filtered). The workflow starts a single Docker container and runs the CLI against it. See `.github/workflows/test-evals-instance-ai.yml`.

The job is **non-blocking**. Results are posted as a PR comment and uploaded as artifacts. When `LANGSMITH_API_KEY` is set via the `EVALS_LANGSMITH_API_KEY` secret, runs also land as LangSmith experiments tagged with commit SHA + branch, so you can compare against master side-by-side.

## Architecture

```
evaluations/
├── index.ts              # Public API
├── cli/                  # CLI entries: instance-ai, subagent, pairwise, compare-pairwise, report
├── clients/              # n8n REST + SSE clients
├── checklist/            # LLM verification with retry
├── credentials/          # Test credential seeding
├── data/workflows/       # e2e/sub-agent test case JSON files
├── data/pairwise/        # Local pairwise fixture (small smoke set)
├── harness/              # Runners: buildWorkflow + executeScenario (e2e), in-process-builder (pairwise)
├── langsmith/            # Dataset sync + experiment setup
├── outcome/              # SSE event parsing, workflow discovery
├── report/               # HTML report generator
└── system-prompts/       # LLM prompts for verification

packages/cli/src/modules/instance-ai/eval/
├── execution.service.ts  # Phase 1 + Phase 2 orchestration
├── workflow-analysis.ts  # Hint generation (Phase 1)
├── mock-handler.ts       # Per-request mock generation (Phase 2)
├── api-docs.ts           # Context7 API doc fetcher
├── node-config.ts        # Node config serializer
└── pin-data-generator.ts # LLM pin data for bypass nodes (Phase 1.5)
```

## Known limitations

- **LangChain/AI nodes** — use their own SDKs, not the HTTP mock layer. Executed for real, they fail with credential errors, so they run on pin data instead.
- **Binary / file nodes** — media attachments, image generation, file downloads. Mock metadata works; realistic binary content is out of scope.
- **Streaming nodes** — mocks return complete responses, not streams.
- **GraphQL APIs** — response shape depends on the query, not just the endpoint. Quality depends on the LLM knowing the API schema.
- **Non-determinism** — the agent builds different workflows each run. Pass rates vary between 40% and 65%.
- **Large workflows** — verification artifacts include full execution traces. For 12+ node workflows this can hit token limits. See TRUST-43 for the tool-based verifier approach.