mirror of https://github.com/n8n-io/n8n.git synced 2026-05-30 08:17:06 +02:00

History

José Braulio González Valido 540faa7c1d fix(ai-builder): Update eval dataset prompts to match scenario checklists (no-changelog) (#29011 ) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>		2026-04-24 09:55:18 +00:00
..
__tests__	feat(ai-builder): Workflow evaluation framework with LLM mock execution (#27818 )	2026-04-07 13:31:16 +00:00
binaryChecks	feat(ai-builder): Add sub-agent evaluation harness with binary checks (no-changelog) (#28289 )	2026-04-24 07:50:46 +00:00
checklist	feat(ai-builder): Add LangSmith integration for workflow eval tracking (no-changelog) (#28835 )	2026-04-23 08:47:02 +00:00
cli	feat(ai-builder): Add LangSmith integration for workflow eval tracking (no-changelog) (#28835 )	2026-04-23 08:47:02 +00:00
clients	feat(ai-builder): Add sub-agent evaluation harness with binary checks (no-changelog) (#28289 )	2026-04-24 07:50:46 +00:00
credentials	test: Add Instance AI workflow evals CI pipeline (no-changelog) (#28366 )	2026-04-20 14:15:41 +00:00
data	fix(ai-builder): Update eval dataset prompts to match scenario checklists (no-changelog) (#29011 )	2026-04-24 09:55:18 +00:00
harness	test: Add Instance AI workflow evals CI pipeline (no-changelog) (#28366 )	2026-04-20 14:15:41 +00:00
langsmith	feat(ai-builder): Add LangSmith integration for workflow eval tracking (no-changelog) (#28835 )	2026-04-23 08:47:02 +00:00
outcome	feat(ai-builder): Workflow evaluation framework with LLM mock execution (#27818 )	2026-04-07 13:31:16 +00:00
report	feat(ai-builder): Add LangSmith integration for workflow eval tracking (no-changelog) (#28835 )	2026-04-23 08:47:02 +00:00
subagent	feat(ai-builder): Add sub-agent evaluation harness with binary checks (no-changelog) (#28289 )	2026-04-24 07:50:46 +00:00
system-prompts	fix(ai-builder): Expose outputCount when eval node output is truncated (no-changelog) (#28977 )	2026-04-23 11:06:27 +00:00
index.ts	feat(ai-builder): Add LangSmith integration for workflow eval tracking (no-changelog) (#28835 )	2026-04-23 08:47:02 +00:00
README.md	feat(ai-builder): Add sub-agent evaluation harness with binary checks (no-changelog) (#28289 )	2026-04-24 07:50:46 +00:00
types.ts	feat: Add multiple runs to instanceAI eval (no-changelog) (#28493 )	2026-04-22 07:11:10 +00:00

README.md

Workflow evaluation framework

Tests whether workflows built by Instance AI actually work by executing them with LLM-generated mock HTTP responses.

Running evals

CLI

Two harnesses (eval:instance-ai and eval:subagent) drive a running n8n instance over HTTP. Start n8n locally with N8N_ENABLED_MODULES=instance-ai before running either one. The sub-agent harness also requires E2E_TESTS=true on the n8n instance — the /eval/run-sub-agent endpoint is disabled otherwise.

# From packages/@n8n/instance-ai/, with n8n running via pnpm dev:ai

# End-to-end workflow evaluation (builder + mock execution + verification)
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --verbose
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --keep-workflows --verbose

# Sub-agent evaluation (builder only, real tools on live n8n, binary checks)
dotenvx run -f ../../../.env.local -- pnpm eval:subagent --filter webhook-to-slack --verbose

Outputs

Every run produces three artifacts:

Console — live progress, per-scenario pass/fail with [failure_category] tag, and a grouped summary at the end.
eval-results.json — structured results in the current working directory. Consumed by the CI PR comment.
.data/workflow-eval-report.html — rich debugging view with per-node execution traces, intercepted requests, mock responses, Phase 1 hints, and verifier reasoning. Self-contained HTML you can open in a browser.

If LANGSMITH_API_KEY is set, results are also sent to LangSmith as an experiment for historical comparison.

CLI flags

Flag	Default	Description
`--verbose`	`false`	Log build/execute/verify timing and SSE events
`--filter`	—	Filter test cases by filename substring (e.g. `contact-form`)
`--keep-workflows`	`false`	Don't delete built workflows after the run
`--base-url`	`http://localhost:5678`	n8n instance URL
`--email`	E2E test owner	Override login email (also via `N8N_EVAL_EMAIL`)
`--password`	E2E test owner	Override login password (also via `N8N_EVAL_PASSWORD`)
`--timeout-ms`	`600000`	Per-test-case timeout
`--output-dir`	cwd	Where to write `eval-results.json`
`--dataset`	`instance-ai-workflow-evals`	LangSmith dataset name
`--concurrency`	`16`	Max concurrent scenarios (builds are separately capped at 4)
`--experiment-name`	auto	LangSmith experiment prefix (defaults to `{branch}-{sha}` in CI or `local-{branch}-{sha}-dirty?` locally)
`--iterations`	`1`	Run each test case N times with fresh builds — powers pass@k / pass^k metrics

Docker (without pnpm dev:ai)

# Build the Docker image
INCLUDE_TEST_CONTROLLER=true pnpm build:docker

# Start a container
docker run -d --name n8n-eval \
  -e E2E_TESTS=true \
  -e N8N_ENABLED_MODULES=instance-ai \
  -e N8N_AI_ENABLED=true \
  -e N8N_INSTANCE_AI_MODEL_API_KEY=your-key \
  -p 5678:5678 \
  n8nio/n8n:local

# Seed the test user
curl -sf -X POST http://localhost:5678/rest/e2e/reset \
  -H "Content-Type: application/json" \
  -d '{"owner":{"email":"nathan@n8n.io","password":"PlaywrightTest123","firstName":"Eval","lastName":"Owner"},"admin":{"email":"admin@n8n.io","password":"PlaywrightTest123","firstName":"Admin","lastName":"User"},"members":[],"chat":{"email":"chat@n8n.io","password":"PlaywrightTest123","firstName":"Chat","lastName":"User"}}'

# Run evals against it
pnpm eval:instance-ai --base-url http://localhost:5678 --verbose
pnpm eval:subagent --base-url http://localhost:5678 --verbose

CI

Evals run automatically on PRs that change Instance AI code (path-filtered). The CI workflow starts a single Docker container and runs the CLI against it. See .github/workflows/test-evals-instance-ai.yml.

The eval job is non-blocking. Results are posted as a PR comment and uploaded as artifacts. When LANGSMITH_API_KEY is set (via the EVALS_LANGSMITH_API_KEY secret), the run also lands as an experiment in LangSmith with commit SHA + branch tagged.

Environment variables

Variable	Required	Description
`N8N_EVAL_BASE_URL`	No	n8n base URL. Defaults to `http://localhost:5678`.
`N8N_EVAL_EMAIL`	No	n8n login email (defaults to E2E test owner).
`N8N_EVAL_PASSWORD`	No	n8n login password (defaults to E2E test owner).
`N8N_INSTANCE_AI_MODEL_API_KEY`	On server	Anthropic key for the Instance AI agent, mock generation, and verification. Set on the n8n instance, not the eval CLI — the sub-agent runs inside n8n.
`E2E_TESTS`	On server, for `eval:subagent`	Must be `true` on the n8n instance. The `/eval/run-sub-agent` endpoint returns 403 otherwise.
`LANGSMITH_API_KEY`	For `--dataset` mode	Enables LangSmith experiment tracking + tracing. Without it, the CLI still runs and writes JSON/HTML. Required for `eval:subagent --dataset` mode.
`LANGSMITH_ENDPOINT`	No	LangSmith region endpoint (`https://api.smith.langchain.com` for US, `https://eu.api.smith.langchain.com` for EU).
`LANGSMITH_REVISION_ID`	No	Commit SHA to tag the experiment with (set automatically in CI).
`LANGSMITH_BRANCH`	No	Branch name to tag the experiment with (set automatically in CI).
`CONTEXT7_API_KEY`	No	Higher API-doc rate limits. Set on the n8n instance. Free tier is 1,000 req/month.

Sub-agent evaluation flags

Flag	Description
`--base-url <url>`	n8n base URL (overrides `N8N_EVAL_BASE_URL`)
`--filter <substring>`	Run only test cases whose file name matches
`--prompt <text>`	Run a single ad-hoc prompt
`--dataset <name>`	Run against a LangSmith dataset
`--experiment <name>`	LangSmith experiment label
`--subagent <role>`	Sub-agent role. Default: `builder`
`--max-steps <n>`	Max agent steps. Default: 40
`--timeout <ms>`	Per-test timeout. Default: 120000
`--concurrency <n>`	Parallel test cases. Default: 5
`--keep-workflows`	Do not archive/delete workflows created during the run
`--verbose`	Verbose output

How the e2e harness works

Build — sends the test case prompt to Instance AI, which builds a workflow
Phase 1 — analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario)
Phase 2 — executes the workflow with all HTTP requests intercepted. Each request goes to an LLM that generates a realistic API response using the node's configuration and API documentation from Context7
Verify — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see Failure categories below)

What gets mocked

Mocked nodes — any node that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, Notion, etc.). The request is intercepted before it leaves the process. An LLM generates the response.
Pinned nodes — nodes that don't go through the HTTP layer: trigger/webhook nodes, LangChain/AI nodes (they use SDKs directly), database nodes. These receive LLM-generated data as pin data.
Real nodes — logic nodes (Code, Set, Merge, Filter, IF, Switch) execute their actual code on the mocked/pinned data.

No real credentials or API connections are needed. ~95% of node types are covered; the main gaps are binary-data nodes (file attachments, image generation) and streaming nodes.

How the sub-agent harness works

The CLI logs in to n8n with N8N_EVAL_EMAIL / N8N_EVAL_PASSWORD.
For each test case it POSTs /rest/instance-ai/eval/run-sub-agent.
The server builds a real InstanceAiContext via InstanceAiAdapterService.createContext, wraps the workflow service to record created IDs, resolves the builder (or other) role's system prompt, instantiates the sub-agent with the full createAllTools(context) tool surface, and runs it to completion.
The server returns { text, toolCalls, toolResults, capturedWorkflowIds, ... }.
The CLI fetches each captured workflow via GET /rest/workflows/:id (this doubles as a round-trip check through the real importer), scores it with the binary-check suite, and archives+deletes it (unless --keep-workflows).

No tools, services, or workflow imports are mocked. The server path exercised here is the same one the orchestrator takes when it spawns a builder sub-agent.

LangSmith integration

When LANGSMITH_API_KEY is set, each run is recorded as a LangSmith experiment against the instance-ai-workflow-evals dataset (synced from the JSON files before each run). Experiments against the same dataset can be compared side-by-side to spot regressions.

Adding test cases

Test cases live in evaluations/data/workflows/*.json:

{
  "prompt": "Create a workflow that...",
  "complexity": "medium",
  "tags": ["build", "webhook", "gmail"],
  "triggerType": "webhook",
  "scenarios": [
    {
      "name": "happy-path",
      "description": "Normal operation",
      "dataSetup": "The webhook receives a submission from Jane (jane@example.com)...",
      "successCriteria": "The workflow executes without errors. An email is sent to jane@example.com..."
    }
  ]
}

Writing good test cases

Prompt tips:

Be specific about node configuration — include document IDs, sheet names, channel names, chat IDs. The agent won't ask for these in eval mode (no multi-turn support yet).
Say "Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later."
If a built-in node doesn't expose the fields you need (e.g., Linear node doesn't query creator.email), tell the agent to use an HTTP Request node with a custom API call instead.

Scenario tips:

Don't specify exact counts that depend on mock data (e.g., "exactly 7 posts remain"). The LLM generates data non-deterministically. Instead say "some posts are filtered out — fewer remain than the original 10."
The dataSetup field steers the mock data generation. Describe what each service should return, not the exact JSON.
For error scenarios, describe the error condition: "The Telegram node returns an error indicating the chat was not found."
The successCriteria is what the verification LLM checks. Be specific about what constitutes success — "None of the titles in the Slack message should contain the word 'qui'."

Scenarios to include:

happy-path — everything works as expected
Edge cases — empty data, missing fields, single vs multiple items
Error scenarios only if the workflow is expected to handle them gracefully. Most agent-built workflows don't include error handling, so testing "the workflow crashes on invalid input" is a legitimate finding, not a test case failure.

Failure categories

When a scenario fails, the verifier categorizes the root cause:

builder_issue — the agent misconfigured a node, chose the wrong node type, or the workflow structure doesn't match what was asked
mock_issue — the LLM mock returned incorrect data (e.g., _evalMockError, wrong response shape)
framework_issue — Phase 1 failed (empty trigger content), cascading errors from the eval framework itself
verification_failure — the LLM verifier couldn't produce a valid result
build_failure — Instance AI failed to build the workflow or a scenario timed out

Architecture

evaluations/
├── index.ts              # Public API
├── cli/                  # CLI entry point, arg parsing, CI metadata
├── clients/              # n8n REST + SSE clients
├── checklist/            # LLM verification with retry
├── credentials/          # Test credential seeding
├── data/workflows/       # Test case JSON files
├── harness/              # Runner: buildWorkflow, executeScenario, cleanupBuild
├── langsmith/            # Dataset sync + experiment setup
├── outcome/              # SSE event parsing, workflow discovery
├── report/               # HTML report generator
└── system-prompts/       # LLM prompts for verification

packages/cli/src/modules/instance-ai/eval/
├── execution.service.ts  # Phase 1 + Phase 2 orchestration
├── workflow-analysis.ts  # Hint generation (Phase 1)
├── mock-handler.ts       # Per-request mock generation (Phase 2)
├── api-docs.ts           # Context7 API doc fetcher
├── node-config.ts        # Node config serializer
└── pin-data-generator.ts # LLM pin data for bypass nodes (Phase 1.5)

Known limitations

LangChain/AI nodes — use their own SDKs, not intercepted by the HTTP mock layer. These nodes will fail with credential errors. Use pin data for these.
Binary / file nodes — media attachments, image generation, file downloads. Mock metadata works but realistic binary content is out of scope.
Streaming nodes — our mock returns complete responses, not streams.
GraphQL APIs — response shape depends on the query, not just the endpoint. Quality depends on the LLM knowing the API schema.
Context7 quota — free tier is 1,000 requests/month, 60/hour. A full suite run uses ~100 requests. When quota is exceeded, the LLM falls back to its training data.
Non-determinism — the agent builds different workflows each run. Pass rates vary between 40-65%.
Large workflows — the verification artifact includes full execution traces. For complex workflows (12+ nodes) this can hit token limits. See TRUST-43 for the tool-based verifier approach.