# Workflow evaluation framework Tests whether workflows built by Instance AI actually work by executing them with LLM-generated mock HTTP responses. No real credentials or external services are involved. Three harnesses live here: - **`eval:instance-ai`** — end-to-end build + mocked execution + LLM verification (drives a running n8n instance) - **`eval:subagent`** — builder sub-agent against live n8n, scored by binary checks (drives a running n8n instance) - **`eval:pairwise`** — builder sub-agent in-process, scored by an LLM judge panel against do/don't lists (no n8n server). Intended for head-to-head comparison with `ai-workflow-builder.ee` on the same dataset Sections: - [Running e2e + sub-agent evals](#running-evals) - [Regression detection](#regression-detection) - [Running evals against pre-built workflows](#running-evals-against-pre-built-workflows) - [Running pairwise evals](#pairwise-evals) - [How the e2e harness works](#how-the-e2e-harness-works) - [How the sub-agent harness works](#how-the-sub-agent-harness-works) ## Running evals Each run: 1. **Build** — the test case prompt goes to Instance AI, which builds a workflow. 2. **Phase 1** — the server analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario). 3. **Phase 2** — the workflow executes with every HTTP request intercepted and answered by an LLM using the node's configuration and API docs from Context7. 4. **Verify** — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see [Failure categories](#failure-categories)). ### What gets mocked - **Mocked nodes** — anything that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion…). The request is intercepted before it leaves the process; an LLM generates the response. - **Pinned nodes** — nodes that don't go through the HTTP layer (triggers/webhooks, LangChain/AI nodes, database nodes). They receive LLM-generated pin data. - **Real nodes** — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute on the mocked data. ~95% of node types are covered. See [Known limitations](#known-limitations) for the gaps. ## Quick start You need an n8n instance running with Instance AI enabled, a seeded owner account, and an Anthropic API key. Two paths: ### Local (pnpm dev:ai) `pnpm dev:ai` runs watch mode across `n8n`, `n8n-core`, and `@n8n/n8n-nodes-langchain` in parallel — the set you need when iterating on Instance AI or the eval framework. 1. **Create `.env.local`** at the repo root with at minimum: ```env N8N_INSTANCE_AI_MODEL_API_KEY=sk-ant-... N8N_EVAL_EMAIL=nathan@n8n.io N8N_EVAL_PASSWORD=PlaywrightTest123 # Optional — see "Environment variables" for the full list LANGSMITH_API_KEY=... CONTEXT7_API_KEY=... ``` 2. **Start the instance**: ```bash dotenvx run -f .env.local -- pnpm dev:ai ``` 3. **Create the owner account**. First time only: open `http://localhost:5678`, sign up with the email/password from your `.env.local`. If you'd rather reset to a known state (useful after deleting the DB or between iterations), start the server with `E2E_TESTS=true` and `curl` the reset endpoint: ```bash E2E_TESTS=true dotenvx run -f .env.local -- pnpm dev:ai # then in another shell: curl -sf -X POST http://localhost:5678/rest/e2e/reset \ -H "Content-Type: application/json" \ -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}' ``` 4. **Run evals** from `packages/@n8n/instance-ai/`: ```bash dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose ``` ### Docker Useful when you don't want a live watch process — e.g. reproducing a CI failure. The container starts empty, so owner creation is part of the flow: ```bash # Build the image INCLUDE_TEST_CONTROLLER=true pnpm build:docker # Start a container (E2E_TESTS=true exposes /rest/e2e/reset) docker run -d --name n8n-eval \ -e E2E_TESTS=true \ -e N8N_ENABLED_MODULES=instance-ai \ -e N8N_AI_ENABLED=true \ -e N8N_INSTANCE_AI_MODEL_API_KEY=your-key \ -p 5678:5678 \ n8nio/n8n:local # Seed the owner curl -sf -X POST http://localhost:5678/rest/e2e/reset -H "Content-Type: application/json" -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}' # Run evals against it pnpm eval:instance-ai --base-url http://localhost:5678 --verbose ``` ## CLI reference Invoke from `packages/@n8n/instance-ai/`: ```bash # All test cases dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --verbose # Single test case (filename substring match) dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose # Keep built workflows for inspection after the run dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --keep-workflows # Multi-iteration for pass@k / pass^k metrics dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --iterations 3 ``` ### Flags | Flag | Default | Description | |------|---------|-------------| | `--verbose` | `false` | Log build/execute/verify timing and SSE events | | `--filter` | — | Filter test cases by filename substring. Comma-separated values mean OR (e.g. `contact-form,deduplication`) | | `--exclude` | — | Skip test cases whose filename matches any of the substrings. Same comma-separated shape as `--filter`; applied after `--filter` | | `--prebuilt-workflows` | — | Path to a JSON manifest mapping test-case slugs to existing workflow IDs. Skips the orchestrator build for matched test cases — see [Running evals against pre-built workflows](#running-evals-against-pre-built-workflows) | | `--keep-workflows` | `false` | Don't delete built workflows after the run | | `--base-url` | `http://localhost:5678` | n8n instance URL | | `--email` | E2E test owner | Override login email (or `N8N_EVAL_EMAIL`) | | `--password` | E2E test owner | Override login password (or `N8N_EVAL_PASSWORD`) | | `--timeout-ms` | `900000` | Per-test-case timeout | | `--output-dir` | cwd | Where to write `eval-results.json` | | `--dataset` | `instance-ai-workflow-evals` | LangSmith dataset name | | `--concurrency` | `16` | Max concurrent scenarios (builds are separately capped at 4) | | `--experiment-name` | auto | LangSmith experiment prefix (defaults to `{branch}-{sha}` in CI or `local-{branch}-{sha}-dirty?` locally) | | `--iterations` | `1` | Run each test case N times with fresh builds | **pass@k / pass^k**: with `--iterations N`, each scenario runs N times. `pass@k` is the fraction of scenarios that passed *at least once*; `pass^k` is the fraction that passed *every* time. `pass@k` shows whether something is *possible*; `pass^k` shows whether it's *reliable*. ### Outputs Every run produces: - **Console** — live progress, per-scenario pass/fail with `[failure_category]` tag, and a grouped summary. - **`eval-results.json`** — structured results in `--output-dir` (or cwd). Consumed by the CI PR comment. - **`.data/workflow-eval-report.html`** — self-contained debugging view with per-node execution traces, intercepted requests, mock responses, Phase 1 hints, and verifier reasoning. - **LangSmith experiment** — only when `LANGSMITH_API_KEY` is set. See the caveat in [Environment variables](#environment-variables). ## Environment variables | Variable | Required | Description | |----------|----------|-------------| | `N8N_INSTANCE_AI_MODEL_API_KEY` | Yes | Anthropic API key for the agent, mock generation, and verification | | `N8N_EVAL_EMAIL` | No | n8n login email (defaults to E2E test owner) | | `N8N_EVAL_PASSWORD` | No | n8n login password (defaults to E2E test owner) | | `LANGSMITH_API_KEY` | No | Enables experiment tracking + tracing. **See caveat below.** | | `LANGSMITH_ENDPOINT` | No | Region (`https://api.smith.langchain.com` US, `https://eu.api.smith.langchain.com` EU) | | `LANGSMITH_REVISION_ID` | No | Commit SHA to tag the experiment with (auto-set in CI) | | `LANGSMITH_BRANCH` | No | Branch name to tag the experiment with (auto-set in CI) | | `CONTEXT7_API_KEY` | No | Context7 key for API-doc lookups. Improves mock realism for less-common services; the LLM falls back to training data when unset | | `N8N_AI_ASSISTANT_BASE_URL` | No | Set to `""` to bypass the hosted AI proxy and hit Anthropic directly — useful to avoid per-tenant quota during large batch runs | **LangSmith caveat:** if `LANGSMITH_API_KEY` is set in `.env.local`, local runs also land in the shared `instance-ai-workflow-evals` dataset. Unset it (or run without `dotenvx`) to keep exploratory runs out of team results. ## Regression detection When `LANGSMITH_API_KEY` is set, every eval run automatically compares its results against the most recent pinned baseline (any experiment whose name starts with `instance-ai-baseline-`). Two output files are written: - `eval-results.json` — structured data only, including `comparison.result` when a baseline was found. - `eval-pr-comment.md` — the full PR comment rendered as markdown, including the alert, aggregate, comparison sections, per-test-case results, and failure details. Always written; falls back to a no-baseline summary when no comparison ran. The CI PR-comment step uses `eval-pr-comment.md` as the entire comment body (no jq assembly in the workflow). The console output uses a separate aligned-text formatter — same data, no markdown noise in the terminal. ### Refreshing the baseline There is no auto-refresh — refresh explicitly when you want a new reference point, ideally with high N for low noise: ```bash # From packages/@n8n/instance-ai/, on master at the version you want to pin LANGSMITH_API_KEY=... dotenvx run -f ../../../.env.local -- \ pnpm eval:instance-ai --experiment-name instance-ai-baseline --iterations 10 ``` LangSmith appends a random suffix (e.g. `instance-ai-baseline-7abc1234`); the most recently started one becomes the comparison target on the next eval run. The comparison is silently skipped on the baseline-creation run itself. ### How scenarios are tiered Each scenario lands in one of three regression tiers, evaluated in order of strictness: - **Regression** — high-confidence flag, gating-grade. The drop must be statistically significant (chance of seeing it by noise < 5%), at least 30 percentage points in size, and the baseline must have been reliable (≥ 70% pass rate). - **Likely regression** — looser bar for visibility on borderline cases. Looser confidence threshold (chance by noise < 20%), drop ≥ 15 percentage points, baseline ≥ 50%. Frequently natural variance — worth a glance only if your changes touch related code paths. - **Worth watching** — any scenario whose pass rate moved by ≥ 35 percentage points but wasn't flagged as a regression (hard or likely tier). Pure visibility, no implication of cause. Other verdicts: `improvement` (PR significantly better, skips the reliability gate), `unreliable_baseline` (confident drop but baseline was too flaky to call a regression — surfaced but not flagged), `stable`, `insufficient_data`. Why these tiers and not a flat percentage threshold? At the small N PR runs use (typically 3 iterations), a flat threshold can't tell a real regression from coin-flip noise. The confidence cutoff filters out gaps that could plausibly happen by chance, and the reliability gate avoids chasing noise on already-flaky scenarios. Implementation lives in `comparison/statistics.ts` (Fisher's exact test for the confidence check, Wilson interval for the headline aggregate band). Tune the likely-regression tier first if the false-positive rate looks off — keep the hard tier strict. ### Failure-category drift When both sides captured per-trial `failureCategory` values, the comparison also surfaces a run-level table of category rates (PR vs baseline). A category is marked **notable** when its absolute rate delta is ≥ 5 percentage points _and_ the count change beyond what scenario-count scaling would predict is ≥ 3 trials. This catches cross-scenario shifts (e.g. mock-generation breaking, or a model getting weaker overall) that per-scenario flags can miss. ### Best-effort Comparison is logged and skipped on any LangSmith failure — it never fails the eval. It is also skipped when no baseline experiment exists yet. ## Running evals against pre-built workflows The eval framework normally builds each workflow with Instance AI and then verifies it. With `--prebuilt-workflows `, the build step is skipped for matched test cases — the harness fetches the existing workflow from the n8n instance and runs verification against it instead. Use this to score workflows authored by other tools (an MCP-driven session, a hand-built reference, an older Instance AI snapshot) on the same dataset and the same verifier. The manifest is a JSON file mapping test-case file slugs to workflow IDs: ```json { "contact-form-automation": ["W1abc", "W2def", "W3ghi"], "deduplication-trigger": ["W4jkl"] } ``` - **Keys** are test-case file slugs — the JSON filename without `.json` (e.g. `contact-form-automation` for `evaluations/data/workflows/contact-form-automation.json`). The `--filter` flag uses the same identifier. - **Values** are arrays of workflow IDs that already exist in the target n8n instance. Multiple iterations rotate through the list with `iteration % ids.length`, so an `--iterations 5` run with 5 IDs gets 5 distinct builds. Test cases not present in the manifest fall back to the regular Instance AI build path. To run *only* the prebuilt set, pair with `--exclude` to skip the rest, or `--filter` to narrow the run. ```bash # Score the prebuilt cohort, skipping anything not in the manifest dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai \ --prebuilt-workflows ./mcp-manifest.json \ --filter contact-form-automation,deduplication-trigger \ --iterations 5 \ --experiment-name mcp-cohort ``` The harness leaves prebuilt workflows alone after the run (no auto-delete), so the manifest can be re-used across multiple eval runs. ### Producing a manifest `pnpm eval:build-mcp-manifest` (`evaluations/cli/build-mcp-manifest.ts`) drives `claude -p` against an MCP server — defaults to n8n's instance MCP — and writes a manifest in the schema this flag expects, plus a `manifest-stats.json` sidecar with per-cohort cost / turn / duration aggregates. The output is validated against the same Zod schema the loader uses, so shape regressions surface here rather than at eval time. **Prerequisites**: `claude` CLI installed; `~/.claude.json` has the MCP server block configured (project-scoped under `.projects[].mcpServers[]` or globally under `.mcpServers[]`); n8n instance reachable at the URL the MCP block points at. Default MCP server name is `"n8n-mcp (instance)"` — override with `--mcp-server`. ```bash # Build N=5 per test case, 4 in parallel pnpm eval:build-mcp-manifest -n 5 -j 4 --output-dir ./mcp-cohort # Then score the cohort dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai \ --prebuilt-workflows ./mcp-cohort/manifest.json \ --iterations 5 \ --experiment-name mcp-cohort ``` For runs that need to leave the n8n repo (for example, driving the build from a separate Claude project where you have skills configured), three flags decouple the script from its default assumptions: - `--workflow-dir ` — read test-case JSONs from a directory other than the n8n repo's `evaluations/data/workflows/`. When set, the script no longer needs `git rev-parse` to find the repo. - `--build-cwd ` — set the working directory the `claude` subprocess spawns from. Affects which `~/.claude.json` `projects` entry (and which skills) Claude loads. - `--project-id ` — instructs the model to pass `projectId` to `create_workflow_from_code` so workflows land in a specific n8n project instead of the user's personal one. Run `pnpm eval:build-mcp-manifest --help` for the full flag list. ## Pairwise evals Pairwise evals score a built workflow against the dataset's `dos` / `donts` criteria using an LLM judge panel (3 judges by default, majority vote on `pairwise_primary`, mean fraction of criteria satisfied on `pairwise_diagnostic`). The point is **head-to-head comparison with `ai-workflow-builder.ee`** on the same dataset (default `notion-pairwise-workflows`), so the judge panel, defaults, and metric keys are imported from that package directly. Unlike the e2e and sub-agent harnesses, pairwise runs the **builder sub-agent in-process** — no n8n server, no Docker, no live workflow service. Stub services capture `createFromWorkflowJSON` calls; HITL suspensions are auto-approved. ### Quick start ```bash # From packages/@n8n/instance-ai/ # 1. Local fixture (small smoke set, no LangSmith required) N8N_AI_ANTHROPIC_KEY="$ANTHROPIC_API_KEY" pnpm eval:pairwise --judges 1 # 2. Full LangSmith dataset LANGSMITH_API_KEY=... N8N_AI_ANTHROPIC_KEY="$ANTHROPIC_API_KEY" \ pnpm eval:pairwise:langsmith --judges 3 # 3. Rerun a specific subset (one example ID per line; #-prefixed lines ignored) pnpm eval:pairwise:langsmith \ --example-ids-file .output/pairwise/failed-ids.txt \ --output-dir .output/pairwise/rerun ``` ### Sandbox Pairwise evals always run inside a sandbox — the same path production uses. The agent writes TypeScript to `~/workspace/src/workflow.ts` inside the sandbox, runs `tsc` to validate, and calls `submit-workflow` to save the parsed `WorkflowJSON`. This exercises the production builder agent end-to-end (sandbox prompt, file I/O, real type checking). Required env vars (Daytona provider — the default): ```bash ANTHROPIC_API_KEY=sk-ant-... # builder + judge LLM LANGSMITH_API_KEY=ls__... # only for --backend langsmith DAYTONA_API_URL=https://app.daytona.io/api DAYTONA_API_KEY=dtn_... # Optional N8N_INSTANCE_AI_SANDBOX_PROVIDER=daytona # default; set 'local' or 'n8n-sandbox' to switch N8N_INSTANCE_AI_SANDBOX_IMAGE=daytonaio/sandbox:0.5.0 # default N8N_INSTANCE_AI_SANDBOX_TIMEOUT=300000 # per-command timeout (ms) ``` The CLI fails fast at startup if the chosen provider is misconfigured (e.g., Daytona selected without API URL/key). The chosen provider is recorded under `summary.json → sandbox.provider`. > **Daytona cold-start.** The very first sandbox creation triggers an image > build on Daytona's side (`npm install` for `@n8n/workflow-sdk`). That can > exceed the SDK's 5-minute create timeout and fail with `Sandbox failed to > become ready within the timeout period`. Once the image is cached, later > runs are fast. Workaround: pre-build the image via the Daytona dashboard > before kicking off a full eval run. ### Flags | Flag | Default | Description | |------|---------|-------------| | `--backend` | `local` | `local` reads `evaluations/data/pairwise/local.json`; `langsmith` pulls from the LangSmith dataset | | `--dataset` | `notion-pairwise-workflows` | LangSmith dataset name (langsmith backend only) | | `--judges` | `3` | Number of judges in the LLM panel | | `--judge-model` | `claude-sonnet-4-5-20250929` | LangChain model id for the judge LLM | | `--iterations` | `1` | Run each example N times — for measuring judge / build variance | | `--concurrency` | `5` | Parallel example workers (`p-limit`) | | `--max-examples` | — | Cap dataset to first N examples | | `--example-ids-file` | — | Path to a text file of LangSmith example IDs (one per line). Used for rerunning a subset | | `--timeout-ms` | `1200000` | Per-example build timeout | | `--output-dir` | `.output/pairwise/` | Where to write artifacts | | `--experiment-name` | `pairwise-evals-instance-ai` | LangSmith experiment label | | `--verbose` | `false` | Per-example log lines | ### Outputs Each run writes a self-contained directory: ``` .output/pairwise// ├── summary.json # totals: pass rate, avg diagnostic, build failures by class, interactivity counters ├── results.jsonl # one line per example: prompt, dos/donts, captured workflow, build metadata, feedback rows ├── workflows/.json # normalized workflow JSON (matches SimpleWorkflow shape from ai-workflow-builder.ee) └── chunks/_.jsonl # per-example agent trace: tool-calls, tool-results, suspensions, final text ``` The `chunks/*.jsonl` traces are the primary tool for root-causing build failures. Each line is one event: `tool-call`, `tool-result`, `suspension`, `auto-approve`, `text`, `stream-finish`, `captured-workflows`, `error`. When `LANGSMITH_API_KEY` is set, feedback is also posted to LangSmith with metric keys `pairwise_primary`, `pairwise_diagnostic`, `pairwise_judges_passed`, `pairwise_total_passes`, `pairwise_total_violations`, and per-judge `judge1..N`. Experiment metadata includes `builder: 'instance-ai'` so it can be queried alongside the `ai-workflow-builder.ee` baseline. ### Build failure classes Build failures are tracked separately from judge scores: - **`build_timeout`** — exceeded `--timeout-ms` - **`no_workflow_built`** — agent finished without invoking `build-workflow` (no captured workflow) - **`agent_error`** — stream errored or the agent threw A failure produces a row with `workflow: null`, empty `feedback`, and the error class — it counts as a primary fail in the comparison report. ### Interactivity gates The agent is stubbed for non-interactive use. The summary tracks divergence from this assumption — investigate any non-zero count: - `askUserCount` — `ask-user` tool was invoked (eval responds with `{ approved: false }`) - `planToolCount` — `plan` tool was invoked (single-prompt dataset shouldn't trigger planning) - `autoApprovedSuspensions` — HITL-gated tool fired (e.g., `data-tables` create); auto-approved - `mockedCredentialTypes` — credential types the agent referenced (auto-mocked since `credentialService.list()` returns `[]`) ### Comparison report After running both `ai-workflow-builder.ee/evaluations/cli` (the baseline) and `eval:pairwise` against the same dataset, generate an HTML side-by-side report: ```bash pnpm eval:pairwise:compare \ --ee-dir ../ai-workflow-builder.ee/evaluations/.output/pairwise/ \ --ia-dir .output/pairwise/ \ --out .output/pairwise/comparison.html ``` The report shows headline metrics, per-prompt verdicts (TIE / IA-only / Code-only / both-pass / both-fail), and lazy-loaded workflow previews — rows collapse by default and only render the heavy `` preview when expanded. ### When pairwise scores wobble Judge non-determinism + agent retry behavior mean a single run is not a reliable signal. Two specific things to know: - The agent will sometimes retry `build-workflow` after a parser rejection (e.g., security violation) and sometimes give up. Whether a prompt "fails to build" is non-deterministic across runs. - If you're comparing two builders to claim a regression or improvement, bump `--iterations` to ≥3 for both sides. ## How the e2e harness works 1. **Build** — sends the test case prompt to Instance AI, which builds a workflow 2. **Phase 1** — analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario) 3. **Phase 2** — executes the workflow with all HTTP requests intercepted. Each request goes to an LLM that generates a realistic API response using the node's configuration and API documentation from Context7 4. **Verify** — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see Failure categories below) ### What gets mocked - **Mocked nodes** — any node that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion, etc.). The request is intercepted before it leaves the process. An LLM generates the response. - **Pinned nodes** — nodes that don't go through the HTTP layer: trigger/webhook nodes, LangChain/AI nodes (they use SDKs directly), database nodes. These receive LLM-generated data as pin data. - **Real nodes** — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute their actual code on the mocked/pinned data. No real credentials or API connections are needed. ~95% of node types are covered; the main gaps are binary-data nodes (file attachments, image generation) and streaming nodes. ## How the sub-agent harness works 1. The CLI logs in to n8n with `N8N_EVAL_EMAIL` / `N8N_EVAL_PASSWORD`. 2. For each test case it POSTs `/rest/instance-ai/eval/run-sub-agent`. 3. The server builds a real `InstanceAiContext` via `InstanceAiAdapterService.createContext`, wraps the workflow service to record created IDs, resolves the `builder` (or other) role's system prompt, instantiates the sub-agent with the full `createAllTools(context)` tool surface, and runs it to completion. 4. The server returns `{ text, toolCalls, toolResults, capturedWorkflowIds, ... }`. 5. The CLI fetches each captured workflow via `GET /rest/workflows/:id` (this doubles as a round-trip check through the real importer), scores it with the binary-check suite, and archives+deletes it (unless `--keep-workflows`). No tools, services, or workflow imports are mocked. The server path exercised here is the same one the orchestrator takes when it spawns a builder sub-agent. ## LangSmith integration When `LANGSMITH_API_KEY` is set, each run is recorded as a LangSmith experiment against the `instance-ai-workflow-evals` dataset (synced from the JSON files before each run). Experiments against the same dataset can be compared side-by-side to spot regressions. ## Adding test cases Test cases live in `evaluations/data/workflows/*.json`. Drop a file in, the CLI and LangSmith sync picks it up — no registration step. ```json { "prompt": "Create a workflow that...", "complexity": "medium", "tags": ["build", "webhook", "gmail"], "triggerType": "webhook", "scenarios": [ { "name": "happy-path", "description": "Normal operation", "dataSetup": "The webhook receives a submission from Jane (jane@example.com)...", "successCriteria": "The workflow executes without errors. An email is sent to jane@example.com..." } ] } ``` **One JSON file = one LangSmith split.** Scenarios in the same file share a split; split names derive from the filename slug. Pick a slug you're happy to also use as a `--filter` target. **Prompt tips** - Be specific about node configuration — document IDs, sheet names, channel names, chat IDs. The agent won't ask for these in eval mode (no multi-turn yet). - Add "Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later." - If a built-in node doesn't expose a field you need (e.g. the Linear node doesn't query `creator.email`), tell the agent to use HTTP Request instead. **Scenario tips** - Don't specify exact counts that depend on mock data ("exactly 7 posts remain"). The LLM is non-deterministic. Say "fewer than the original 10" instead. - `dataSetup` steers the mock — describe what each service should return, not the exact JSON. - For error scenarios, describe the condition: "The Telegram node returns an error indicating the chat was not found." - `successCriteria` is what the verifier reads. Be specific: "None of the titles in the Slack message should contain the word 'qui'." **Which scenarios to include** - `happy-path` — everything works as expected - Edge cases — empty data, missing fields, single vs multiple items - Error scenarios only if the workflow is expected to handle them gracefully. Most agent-built workflows don't include error handling, so "the workflow crashes on invalid input" is a legitimate finding, not a test-case failure. ### Adding a new credential type `credentials/seeder.ts` seeds generic creds (HTTP Header, HTTP Basic) on every run, plus env-gated creds (GitHub, Gmail, Teams, Linear…) when the matching env var is set. If your scenario needs a credential type that isn't there, add it to the appropriate list in `seeder.ts` — env-gated if it requires a real token, generic if a placeholder is fine. ## Failure categories When a scenario fails, the verifier categorizes the root cause: - **builder_issue** — the agent misconfigured a node, chose the wrong node type, or built the wrong structure. - **mock_issue** — the LLM mock returned incorrect data (`_evalMockError`, wrong response shape). - **framework_issue** — Phase 1 failed (empty trigger content) or the eval framework itself cascaded an error. - **verification_failure** — the verifier couldn't produce a valid result. - **build_failure** — Instance AI failed to build the workflow or a scenario timed out. Suite pass rates typically sit between 40–65%; most failures are `builder_issue` on scenarios that require error handling the agent doesn't produce by default. ## Troubleshooting **`Wrong username or password` on login.** Your instance has no owner. Run the `rest/e2e/reset` curl from [Quick start](#quick-start) (needs `E2E_TESTS=true` on the server). **`Have reached end of quota` mid-run.** You're hitting the hosted AI proxy's per-tenant quota. Set `N8N_AI_ASSISTANT_BASE_URL=""` to hit Anthropic directly with your `N8N_INSTANCE_AI_MODEL_API_KEY`. Also consider lowering `--concurrency`. **All scenarios timing out.** Check that the server is up (`curl localhost:5678/healthz`) and that `N8N_INSTANCE_AI_MODEL_API_KEY` is set. A full build is ~60–180s; timeouts past `--timeout-ms` usually mean the agent is looping. **Flood of `framework_issue` failures.** Phase 1 is returning empty hints — some nodes will degrade to `_evalMockError`. Check server logs for Anthropic errors; occasionally a rate-limit spike. **Port 5678 already in use.** A previous `pnpm dev` is still running. `lsof -iTCP:5678 -sTCP:LISTEN` to find the PID, `kill` it. **Debugging a specific failure.** Open the `.data/workflow-eval-report.html` artifact — it has per-node execution traces, the exact intercepted request, the mock response the LLM generated, and the verifier's reasoning. Start from the failing node. ## CI Evals run automatically on PRs that change Instance AI code (path-filtered). The workflow starts a single Docker container and runs the CLI against it. See `.github/workflows/test-evals-instance-ai.yml`. The job is **non-blocking**. Results are posted as a PR comment and uploaded as artifacts. When `LANGSMITH_API_KEY` is set via the `EVALS_LANGSMITH_API_KEY` secret, runs also land as LangSmith experiments tagged with commit SHA + branch, so you can compare against master side-by-side. ## Architecture ``` evaluations/ ├── index.ts # Public API ├── cli/ # CLI entries: instance-ai, subagent, pairwise, compare-pairwise, report ├── clients/ # n8n REST + SSE clients ├── checklist/ # LLM verification with retry ├── credentials/ # Test credential seeding ├── data/workflows/ # e2e/sub-agent test case JSON files ├── data/pairwise/ # Local pairwise fixture (small smoke set) ├── harness/ # Runners: buildWorkflow + executeScenario (e2e), in-process-builder (pairwise) ├── langsmith/ # Dataset sync + experiment setup ├── outcome/ # SSE event parsing, workflow discovery ├── report/ # HTML report generator └── system-prompts/ # LLM prompts for verification packages/cli/src/modules/instance-ai/eval/ ├── execution.service.ts # Phase 1 + Phase 2 orchestration ├── workflow-analysis.ts # Hint generation (Phase 1) ├── mock-handler.ts # Per-request mock generation (Phase 2) ├── api-docs.ts # Context7 API doc fetcher ├── node-config.ts # Node config serializer └── pin-data-generator.ts # LLM pin data for bypass nodes (Phase 1.5) ``` ## Known limitations - **LangChain/AI nodes** — use their own SDKs, not the HTTP mock layer. They fail with credential errors; use pin data instead. - **Binary / file nodes** — media attachments, image generation, file downloads. Mock metadata works; realistic binary content is out of scope. - **Streaming nodes** — mocks return complete responses, not streams. - **GraphQL APIs** — response shape depends on the query, not just the endpoint. Quality depends on the LLM knowing the API schema. - **Non-determinism** — the agent builds different workflows each run. Pass rates vary between 40–65%. - **Large workflows** — verification artifacts include full execution traces. For 12+ node workflows this can hit token limits. See TRUST-43 for the tool-based verifier approach.