test(ai-builder): Add multi-turn capability for IAI evals (no-changelog) (#30586)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:07:12 +02:00 · 2026-05-21 14:03:35 +01:00 · 2026-05-21 14:03:35 +01:00 · 81ea56fa6b
commit 81ea56fa6b
parent e9b1c7c48f
52 changed files with 3441 additions and 320 deletions
--- a/.github/workflows/test-evals-instance-ai.yml
+++ b/.github/workflows/test-evals-instance-ai.yml
@@ -32,9 +32,9 @@ jobs:
    env:
      # Each port hosts an independent n8n container. The eval CLI's
      # work-stealing allocator dispatches builds across them, capped per-lane.
-      # 9 lanes on 4vcpu — builds are LLM-bound so CPU headroom is sufficient;
+      # 11 lanes on 4vcpu — builds are LLM-bound so CPU headroom is sufficient;
      # bump back to 8vcpu if contention shows up.
-      LANE_PORTS: '5678,5679,5680,5681,5682,5683,5684,5685,5686'
+      LANE_PORTS: '5678,5679,5680,5681,5682,5683,5684,5685,5686,5687,5688'
    permissions:
      contents: read
      pull-requests: write
--- a/packages/@n8n/instance-ai/.gitignore
+++ b/packages/@n8n/instance-ai/.gitignore
@@ -1 +1,3 @@
 .output/
+eval-pr-comment.md
+eval-results.json
--- a/packages/@n8n/instance-ai/docs/evals/evals-rubric-v1.md
+++ b/packages/@n8n/instance-ai/docs/evals/evals-rubric-v1.md
@@ -0,0 +1,161 @@
+# Instance AI evals rubric — v1
+
+The rubric defines the axes Instance AI's workflow-builder is scored on. Each axis is a named group of binary checks already running today; the rubric just gives them a taxonomy so per-axis pass/fail can be surfaced in reports and per-axis judge-vs-human agreement can be measured during calibration.
+
+This is the M0 deliverable from `roadmap.md`. M2 takes this file and applies it mechanically: it tags each of the 28 binary checks in `packages/@n8n/instance-ai/evaluations/binaryChecks/checks/` with one of the seven axis names below, adds the union type, and emits per-axis Feedback alongside the existing per-check signal. The rubric does not invent new checks — it organizes the ones that already exist.
+
+Seven axes for v1, scoring single-turn workflow artifacts. Two multi-turn axes (`clarification_quality`, `recovery`) are deferred to M3 when behaviour judges land.
+
+```ts
+export type RubricAxis =
+  | 'structure'
+  | 'connection_topology'
+  | 'parameter_correctness'
+  | 'intent_match'
+  | 'ai_nodes'
+  | 'quality'
+  | 'security';
+```
+
+Scoring levels for v1 are pass/fail (matching the current binary checks). M2's Phase B optionally adds a `partial` level to LLM checks where the distinction between "wrong" and "incomplete" is worth capturing — out of scope for this file.
+
+Anchors describe patterns in n8n workflow JSON — node shapes, connection structures, parameter values, expressions — rather than specific scenarios. The rubric applies to any workflow Instance AI produces, including future authored scenarios and real-user conversation traces from Phase E. The rubric is also additive: no existing check name, Feedback shape, or downstream consumer contract changes. Per-axis signal is layered on top of what's already running.
+
+---
+
+## `structure`
+
+**Definition.** The workflow exists as a runnable artifact: at least one node, at least one trigger node, no nodes left disabled, and the trigger is wired to something downstream. This is the floor for being graded on anything else — if `structure` fails, downstream axes are usually meaningless.
+
+**Positive anchor.** The saved workflow has a non-empty `nodes` array containing at least one trigger node (any of Schedule Trigger, Webhook, Form Trigger, Manual Trigger, Telegram Trigger, MCP Trigger, etc.). At least one trigger appears as a key in `connections` with `main[0]` populated by at least one downstream link object. No node has `disabled: true`. The artifact could be opened in n8n and an execution attempted from a trigger.
+
+**Negative anchor.** The saved workflow has `nodes: []` — zero nodes — despite the agent's response describing what it built. Or: a single trigger node is present with no downstream nodes (the trigger appears in `nodes` but its key in `connections` is missing, or its `main` array is empty). Or: every node carries `disabled: true`, leaving scaffolding behind instead of finishing the build.
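The four `structure` checks can be sketched in TypeScript as below. This is only an illustration under stated assumptions: `StructNode`, `StructWorkflow`, and the caller-supplied `isTrigger` predicate are hypothetical stand-ins, not the types or helpers used in `binaryChecks/checks/`, and how the real checks recognise a trigger is not shown in this diff.

```typescript
// Hypothetical, simplified shapes; real n8n workflow JSON carries more fields.
interface StructNode {
  name: string;
  type: string;
  disabled?: boolean;
}
interface StructWorkflow {
  nodes: StructNode[];
  connections: Record<string, { main?: Array<Array<{ node: string }> | null> }>;
}

// Evaluates the four structure checks named in the rubric.
// `isTrigger` is passed in because trigger detection is an assumption here.
function structureChecks(
  wf: StructWorkflow,
  isTrigger: (node: StructNode) => boolean,
): Record<string, boolean> {
  const triggers = wf.nodes.filter(isTrigger);
  // A trigger counts as a start node when any `main` output has a downstream link.
  const hasStartNode = triggers.some((t) => {
    const main = wf.connections[t.name]?.main ?? [];
    return main.some((branch) => (branch ?? []).length > 0);
  });
  return {
    has_nodes: wf.nodes.length > 0,
    has_trigger: triggers.length > 0,
    has_start_node: hasStartNode,
    no_disabled_nodes: wf.nodes.every((n) => n.disabled !== true),
  };
}
```

A lone trigger with no entry in `connections` passes `has_trigger` but fails `has_start_node`, which is the second negative anchor above.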
+
+---
+
+## `connection_topology`
+
+**Definition.** Nodes in the workflow are wired correctly into the execution graph: every active node is reachable from a trigger via BFS through `connections`, no node is dangling, and routing nodes (Switch, IF, anything with multiple `main` outputs) have each enumerated branch populated with at least one downstream link.
+
+**Positive anchor.** Every active node in `nodes` is reachable from some trigger by walking forward through `connections[source].main[i][j].node`. Routing nodes (Switch with N rules, IF with two branches, etc.) have an N-length `connections[name].main` array, each element being a non-empty list of downstream link objects, so the count of populated branches matches the count of rules the node declares. Every active non-trigger node in `nodes` is the target of some edge reachable from a trigger.
+
+**Negative anchor.** A Switch node has `connections[name].main === []` — none of the branches are wired, even though the rules define them and the downstream action nodes exist elsewhere in `nodes`. The workflow looks complete on the canvas (correct nodes are present) but execution never reaches the routed actions; every routed path fails simultaneously because none are connected. A weaker variant: a Switch declares three rules but `connections[name].main` is `[ [link], [link], [] ]` — the third branch is an empty array, so items matching that rule are silently dropped. Or: an action node exists in `nodes` but no edge anywhere in `connections` targets it — it's dangling and `all_nodes_connected` fails.
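The reachability walk and the per-branch check described above might look roughly like this in TypeScript. It is a sketch, not the implementation of `no_unreachable_nodes` or `all_nodes_connected`: the `TopoWorkflow` shape, the explicit `triggerNames` argument, and the `declared` branch count are all assumptions made for the example.

```typescript
// Hypothetical, simplified shapes; only the fields the walk needs.
interface TopoLink {
  node: string;
}
interface TopoWorkflow {
  nodes: Array<{ name: string; disabled?: boolean }>;
  connections: Record<string, { main?: Array<TopoLink[] | null> }>;
}

// Breadth-first walk from the triggers through connections[source].main[i][j].node.
// Returns the names of active nodes that no trigger can reach.
function unreachableNodes(wf: TopoWorkflow, triggerNames: string[]): string[] {
  // Prototype-free map so node names such as "constructor" are handled safely.
  const seen: Record<string, boolean> = Object.create(null);
  const queue: string[] = [];
  for (const name of triggerNames) {
    seen[name] = true;
    queue.push(name);
  }
  while (queue.length > 0) {
    const source = queue.shift() as string;
    const branches = wf.connections[source]?.main ?? [];
    for (const branch of branches) {
      for (const link of branch ?? []) {
        if (!seen[link.node]) {
          seen[link.node] = true;
          queue.push(link.node);
        }
      }
    }
  }
  return wf.nodes.filter((n) => n.disabled !== true && !seen[n.name]).map((n) => n.name);
}

// Indexes of declared routing branches that have no downstream link.
function emptyBranches(wf: TopoWorkflow, name: string, declared: number): number[] {
  const main = wf.connections[name]?.main ?? [];
  const empty: number[] = [];
  for (let i = 0; i < declared; i++) {
    const branch = main[i];
    if (!branch || branch.length === 0) empty.push(i);
  }
  return empty;
}
```

For a Switch wired as `[ [link], [link], [] ]` with three declared rules, `emptyBranches` returns `[2]`, matching the weaker variant in the negative anchor.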
+
+---
+
+## `parameter_correctness`
+
+**Definition.** Individual node configurations are valid and meaningful: required parameters present with valid values, enum-typed parameters use legal values, expressions reference nodes and fields that exist upstream, `$fromAI()` used only where the tool API allows it, and parameter values match the prompt's stated intent (e.g. Bearer auth when the prompt says Bearer). Contrast with `intent_match`, which asks "does the workflow as a whole contain every feature the user asked for"; the two overlap on resource/operation correctness (a misconfigured operation may fire both `correct_node_operations` here and `fulfills_user_request` over in intent_match) — that's redundant detection, not a bug.
+
+**Positive anchor.** Every node's required parameters are populated with valid values (`url` non-empty on HTTP Request, `assignments.assignments` non-empty on Set, conditions populated on Filter/IF). Enum parameters use values from the legal set. Expressions like `={{ $json.fieldName }}` reference fields that the immediately upstream node actually outputs; expressions like `={{ $('Other Node').item.json.field }}` reference both a node that exists in the workflow and a field that node produces. `$fromAI()` appears only inside tool node parameters. Authentication-type parameters align with what the prompt requested (Bearer prompts → `httpBearerAuth`, custom-header prompts → `httpHeaderAuth`).
+
+**Negative anchor.** An expression references `$('Build Caption')` but the workflow has no node named exactly "Build Caption" (only "Build caption" — case mismatch). Or: an expression references `$json.transcript` but the upstream node's output schema doesn't include that field. Or: a Set node has `assignments.assignments: []` — present but does nothing at runtime. Or: an HTTP Request node has `url: ''` or a missing required parameter. Or: a regular (non-tool) Set or HTTP Request node has `$fromAI("level")` inside a value — `$fromAI` is only legal inside tool node parameters. Or: the prompt says "Authorization: Bearer ..." and the builder picks `genericAuthType: 'httpHeaderAuth'` (custom header type) instead of `'httpBearerAuth'`. Or: an HTTP Request intended to GET a resource is configured with `method: 'POST'` and an empty body — schema-valid but semantically wrong for the request being made.
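The case-mismatch failure above (`Build Caption` versus `Build caption`) comes down to an exact string comparison between referenced names and node names. A minimal TypeScript sketch, assuming expressions have already been collected into a flat string array; `missingNodeReferences` is a hypothetical helper, not the code behind `expressions_reference_existing_nodes`:

```typescript
// Collects node names referenced via $('Name') or $("Name") that match no node exactly.
// The comparison is case-sensitive, so 'Build Caption' does not match 'Build caption'.
function missingNodeReferences(expressions: string[], nodeNames: string[]): string[] {
  const missing: string[] = [];
  for (const expression of expressions) {
    // Fresh regex per string so the global lastIndex always starts at 0.
    const pattern = /\$\(\s*(['"])(.*?)\1\s*\)/g;
    let match: RegExpExecArray | null = pattern.exec(expression);
    while (match !== null) {
      const referenced = match[2];
      if (nodeNames.indexOf(referenced) === -1 && missing.indexOf(referenced) === -1) {
        missing.push(referenced);
      }
      match = pattern.exec(expression);
    }
  }
  return missing;
}
```

Bare `$json` references are ignored here; checking them needs the upstream node's output schema, which this sketch does not model.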
+
+---
+
+## `intent_match`
+
+**Definition.** The workflow does what the user asked for: every explicitly requested feature has a node of the right type *and* the right operation, and iteration over collections matches the use case. A node existing with the right name but the wrong operation does not count as fulfilling the request.
+
+**Positive anchor.** For each feature the prompt explicitly requests, the workflow contains a node of the correct type configured with the correct `resource` and `operation` (e.g., a node intended to fetch captions has `resource: 'caption'` rather than `resource: 'video'`). If the prompt requires per-item processing of a returned collection (an API returning `{ records: [...] }`, a list endpoint, a multi-row response), the workflow includes the iteration mechanism (Split Out, Loop Over Items, or equivalent) so downstream nodes execute per item rather than once on the wrapping object.
+
+**Negative anchor.** A node with the right type is present but its operation is wrong — e.g., a node named "Get Captions" with `resource: 'video'` when the prompt asks for captions (which require `resource: 'caption'`). The node has the right name; it does the wrong thing. Or: the prompt asks for N independent actions (notify + log + email) and only N−1 are present in `nodes` — one feature is silently missing. Or: an API returns a list envelope (`{ records: [...] }`) and the workflow posts the raw envelope to a downstream action once, instead of iterating and acting per item.
+
+---
+
+## `ai_nodes`
+
+**Definition.** AI-specific sub-node wiring and configuration is correct: Agent nodes have a language model attached via `ai_languageModel`, dynamic prompts use expressions (not hardcoded strings), memory nodes connect to a parent via `ai_memory` and key sessions by an explicit named-node reference rather than `$json`, vector stores have embeddings via `ai_embedding`, and tool nodes are configured with at least the parameters they need. Only graded when AI nodes are present in the workflow.
+
+**Positive anchor.** Each Agent node has at least one `ai_languageModel` edge feeding it from a chat-model node. Memory nodes connect back to a parent Agent via `ai_memory` and use a session key like `={{ $('Telegram Trigger').first().json.message.chat.id }}` — an explicit named-node reference, so the session boundary is anchored to the trigger and stable across branches. Vector store nodes have an embeddings model wired via `ai_embedding`. Tool nodes have non-empty `parameters` (or are one of the parameterless tool types like `toolCalculator` / `toolWikipedia`). Agent `text` parameters contain expressions referencing trigger payload (e.g., `={{ $json.message.text }}`); `options.systemMessage` is a non-empty instruction string.
+
+**Negative anchor.** An Agent node exists with no `ai_languageModel` edge feeding it. Or: a memory node sits in `nodes` but is not the source of any `ai_memory` connection — it's wired into nothing. Or: a memory node's `sessionKey` is `={{ $json.chat.id }}` (implicit upstream); under branching or merging this silently couples session scope to whichever node flows in at execution time, not to the trigger. Or: an Agent's `text` parameter is a literal string with no `{{ ... }}` expression — the agent receives the same prompt regardless of payload. Or: a tool node was added but `parameters: {}` and the tool type isn't in the parameterless allowlist.
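Two of the `ai_nodes` anchors lend themselves to a short TypeScript sketch: finding Agent nodes with no incoming `ai_languageModel` edge, and telling a named-node session key from an implicit `$json` one. The shapes, the `agentType` parameter, and both function names are assumptions for illustration; the sketch assumes sub-node edges are stored under the source node's key in `connections`, keyed by connection type.

```typescript
// Hypothetical, simplified shapes.
interface AiLink {
  node: string;
}
type AiOutputs = { [connectionType: string]: Array<AiLink[] | null> | undefined };
interface AiWorkflow {
  nodes: Array<{ name: string; type: string }>;
  connections: Record<string, AiOutputs>;
}

// Names of nodes of `agentType` that no `ai_languageModel` edge points at.
function agentsWithoutModel(wf: AiWorkflow, agentType: string): string[] {
  const hasModel: Record<string, boolean> = Object.create(null);
  for (const source of Object.keys(wf.connections)) {
    const branches = wf.connections[source]['ai_languageModel'] ?? [];
    for (const branch of branches) {
      for (const link of branch ?? []) {
        hasModel[link.node] = true;
      }
    }
  }
  return wf.nodes.filter((n) => n.type === agentType && !hasModel[n.name]).map((n) => n.name);
}

// True when the session key contains an explicit $('Node Name') reference.
function sessionKeyIsAnchored(sessionKey: string): boolean {
  return /\$\(\s*['"][^'"]+['"]\s*\)/.test(sessionKey);
}
```

With these definitions, `={{ $('Telegram Trigger').first().json.message.chat.id }}` counts as anchored and `={{ $json.chat.id }}` does not.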
+
+---
+
+## `quality`
+
+**Definition.** Soft signals about workflow craft: nodes have meaningful names, the workflow uses built-in nodes rather than Code nodes for transformations already supported, and the agent's spoken response accurately describes the workflow it actually built. The most opinionated axis — `no_unnecessary_code_nodes` is explicitly overridable per scenario via `annotations.code_necessary: true`. Quality is informational; it surfaces preference and craft, not correctness.
+
+**Positive anchor.** Nodes have purpose-descriptive names ("Fetch GitHub Issues", "Filter Bug Label", "Create Notion Page") rather than generic defaults. The workflow uses built-in nodes (Set, Filter, IF, Switch, Split Out, Merge, Aggregate, Sort) for transformations they natively handle, instead of a Code node. The agent's text response accurately describes the nodes added and connections made — if the response says "I added a Slack node and connected it to the Filter," the saved workflow contains exactly that node and that connection.
+
+**Negative anchor.** Nodes are named `"HTTP Request1"`, `"HTTP Request2"`, `"Set"`, `"Code"`, `"IF"` — defaults that obscure purpose at a glance on the canvas. Or: a Code node containing `return items.flatMap(item => item.json.records)` is doing array unwrapping a Split Out node handles natively (and no scenario annotation marks the Code node as necessary). Or: a Code node maps fields with `return items.map(i => ({ json: { caption: i.json.title } }))` when a Set node would do the same. Or: the agent's response claims "I created a Slack node and wired it to the trigger," but the saved workflow contains no Slack node — or contains a Discord node instead. The text and the artifact disagree.
+
+---
+
+## `security`
+
+**Definition.** No hardcoded credentials in node parameters (headers, query params, Set values), and inbound trigger nodes (webhook, form, chat, MCP) keep authentication disabled unless the prompt explicitly asks for protected access. Security stays a distinct axis so any regression in this class surfaces clearly — credentials in workflow JSON are a separate kind of bad outcome from a misconfigured field.
+
+**Positive anchor.** HTTP Request nodes use either `authentication: 'genericCredentialType'` with a named credential reference, or credential expressions like `=Bearer {{ $credentials.token }}` — actual token values never appear as literal strings in the saved JSON. Inbound trigger nodes (Webhook, Form Trigger, Chat Trigger, MCP Trigger) carry `authentication: 'none'` unless the prompt explicitly asked for protected access. Sensitive Set assignments use `={{ $credentials.<field> }}` expressions, never raw strings.
+
+**Negative anchor.** A header parameter has `{ name: 'Authorization', value: 'Bearer sk-prod-1234...' }` — a real-looking token baked into the saved workflow. Or: a Set assignment is `apiKey: 'AKIA1234...'` as a literal string. Or: the prompt asks for a public form trigger (no auth mentioned) and the builder configures the Form Trigger with `authentication: 'basicAuth'` — friction the user didn't ask for. Or: a query parameter named `api_key` carries a literal value rather than a credential expression.
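One way to approximate the hardcoded-credential anchors is a recursive scan over node parameters, sketched below in TypeScript. The two regular expressions and the `findHardcodedSecrets` helper are purely illustrative heuristics invented for this example; the actual logic of `no_hardcoded_credentials` is not shown in this diff and may differ substantially.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Heuristic patterns, chosen for illustration only.
const SECRET_KEY = /(authorization|api[_-]?key|token|secret|password)/i;
const SECRET_VALUE = /^(Bearer\s+\S{8,}|sk-[A-Za-z0-9_-]{8,}|AKIA[A-Z0-9]{4,})/;

// Returns the paths of literal strings that look like secrets.
// Strings starting with "=" are n8n expressions and are never flagged.
function findHardcodedSecrets(value: Json, path: string = 'parameters', keyHint: string = ''): string[] {
  if (typeof value === 'string') {
    const isExpression = value.charAt(0) === '=';
    const suspicious = SECRET_VALUE.test(value) || (SECRET_KEY.test(keyHint) && value.length > 0);
    return !isExpression && suspicious ? [path] : [];
  }
  if (Array.isArray(value)) {
    let found: string[] = [];
    value.forEach((item, i) => {
      found = found.concat(findHardcodedSecrets(item, `${path}[${i}]`, keyHint));
    });
    return found;
  }
  if (value !== null && typeof value === 'object') {
    // For { name, value } pairs, the `name` field is the hint for `value`.
    const nameField = value['name'];
    const nameHint = typeof nameField === 'string' ? nameField : '';
    let found: string[] = [];
    for (const key of Object.keys(value)) {
      const hint = key === 'value' && nameHint !== '' ? nameHint : key;
      found = found.concat(findHardcodedSecrets(value[key], `${path}.${key}`, hint));
    }
    return found;
  }
  return [];
}
```

A key-name heuristic like this will produce false positives (for example a `tokenType` field), so it is better read as a description of the failure class than as a usable detector.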
+
+---
+
+## Mapping: 28 binary checks → axes
+
+This is the M0 commit. M2 applies it by adding `axis: RubricAxis` to each `BinaryCheck` entry.
+
+| Check | Axis | Kind | Notes |
+|---|---|---|---|
+| `has_nodes` | `structure` | det | |
+| `has_trigger` | `structure` | det | |
+| `has_start_node` | `structure` | det | Borderline — see open question 1 |
+| `no_disabled_nodes` | `structure` | det | |
+| `all_nodes_connected` | `connection_topology` | det | |
+| `no_unreachable_nodes` | `connection_topology` | det | |
+| `switch_fallback_output_enabled` | `connection_topology` | det | Catches the Switch `main: []` family |
+| `expressions_reference_existing_nodes` | `parameter_correctness` | det | |
+| `valid_field_references` | `parameter_correctness` | det | |
+| `valid_node_config` | `parameter_correctness` | det | Required params, valid enums |
+| `no_empty_set_nodes` | `parameter_correctness` | det | |
+| `no_invalid_from_ai` | `parameter_correctness` | det | |
+| `http_generic_auth_type_matches_prompt` | `parameter_correctness` | det | Uses prompt — see open question 2 |
+| `correct_node_operations` | `parameter_correctness` | LLM | |
+| `valid_data_flow` | `parameter_correctness` | LLM | LLM sibling of `valid_field_references` — see open question 3 |
+| `fulfills_user_request` | `intent_match` | LLM | |
+| `handles_multiple_items` | `intent_match` | LLM | Borderline — see open question 4 |
+| `agent_has_dynamic_prompt` | `ai_nodes` | det | |
+| `agent_has_language_model` | `ai_nodes` | det | |
+| `memory_properly_connected` | `ai_nodes` | det | |
+| `memory_session_key_expression` | `ai_nodes` | det | |
+| `vector_store_has_embeddings` | `ai_nodes` | det | |
+| `tools_have_parameters` | `ai_nodes` | det | |
+| `no_hardcoded_credentials` | `security` | det | |
+| `inbound_trigger_auth_defaults` | `security` | det | |
+| `no_unnecessary_code_nodes` | `quality` | det | Opinionated; overridable via `annotations.code_necessary` |
+| `descriptive_node_names` | `quality` | LLM | |
+| `response_matches_workflow_changes` | `quality` | LLM | Agent-honesty check — see open question 5 |
+
+Counts: `structure` 4 · `connection_topology` 3 · `parameter_correctness` 8 · `intent_match` 2 · `ai_nodes` 6 · `quality` 3 · `security` 2 = 28.
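The counts line can be cross-checked mechanically. The TypeScript sketch below copies the check names and axes from the mapping table and tallies them; the `CHECK_AXES` constant and `countByAxis` helper are hypothetical and only illustrate what tagging each check with `axis: RubricAxis` makes possible.

```typescript
type RubricAxis =
  | 'structure'
  | 'connection_topology'
  | 'parameter_correctness'
  | 'intent_match'
  | 'ai_nodes'
  | 'quality'
  | 'security';

// Check names and axes copied from the mapping table.
const CHECK_AXES: Record<string, RubricAxis> = {
  has_nodes: 'structure',
  has_trigger: 'structure',
  has_start_node: 'structure',
  no_disabled_nodes: 'structure',
  all_nodes_connected: 'connection_topology',
  no_unreachable_nodes: 'connection_topology',
  switch_fallback_output_enabled: 'connection_topology',
  expressions_reference_existing_nodes: 'parameter_correctness',
  valid_field_references: 'parameter_correctness',
  valid_node_config: 'parameter_correctness',
  no_empty_set_nodes: 'parameter_correctness',
  no_invalid_from_ai: 'parameter_correctness',
  http_generic_auth_type_matches_prompt: 'parameter_correctness',
  correct_node_operations: 'parameter_correctness',
  valid_data_flow: 'parameter_correctness',
  fulfills_user_request: 'intent_match',
  handles_multiple_items: 'intent_match',
  agent_has_dynamic_prompt: 'ai_nodes',
  agent_has_language_model: 'ai_nodes',
  memory_properly_connected: 'ai_nodes',
  memory_session_key_expression: 'ai_nodes',
  vector_store_has_embeddings: 'ai_nodes',
  tools_have_parameters: 'ai_nodes',
  no_hardcoded_credentials: 'security',
  inbound_trigger_auth_defaults: 'security',
  no_unnecessary_code_nodes: 'quality',
  descriptive_node_names: 'quality',
  response_matches_workflow_changes: 'quality',
};

// Tallies how many checks sit under each axis.
function countByAxis(map: Record<string, RubricAxis>): Record<RubricAxis, number> {
  const counts: Record<RubricAxis, number> = {
    structure: 0,
    connection_topology: 0,
    parameter_correctness: 0,
    intent_match: 0,
    ai_nodes: 0,
    quality: 0,
    security: 0,
  };
  for (const name of Object.keys(map)) {
    counts[map[name]] += 1;
  }
  return counts;
}
```

Running `countByAxis(CHECK_AXES)` reproduces the 4, 3, 8, 2, 6, 3, 2 split, which sums to 28.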
+
+---
+
+## Open questions and things flagged
+
+**1. `has_start_node` straddles `structure` and `connection_topology`.** The check passes when a trigger has at least one downstream edge. That's a connection. I'm assigning it to `structure` because the failure mode in practice is "the builder committed an empty or trivial workflow" (e.g., a trigger and nothing else) — which is a structural problem upstream of any topology question. If somebody disagrees and prefers `connection_topology`, the rubric still works.
+
+**2. `http_generic_auth_type_matches_prompt` uses the user prompt to score a parameter.** I've kept it in `parameter_correctness` because the failure-mode answer is "the parameter value was wrong." But it has prompt context like the LLM intent checks do. If we later split parameter_correctness into "intrinsic" vs "intent-aware" sub-categories, this and `correct_node_operations` move together.
+
+**3. `valid_data_flow` is in `parameter_correctness`, not `intent_match`.** The strategy doc's Phase A proposal originally placed it under `intent_match` (its name suggests something semantic). After reading the check, it's a near-exact LLM-version sibling of `valid_field_references` — both ask "do expressions reference fields that exist upstream?" Keeping them together makes the parameter_correctness axis a coherent group (anything that fails per-node config validation, including expressions). This deviation from the strategy doc proposal is intentional but worth flagging. It also leaves `intent_match` thin — see open question 6 below.
+
+**4. `handles_multiple_items` could fit `connection_topology` instead of `intent_match`.** The check is about loop/iteration shape — is there a Split Out where one is needed, an aggregate where one is needed. That's structural data-flow shape more than user intent. But the LLM judge uses the prompt to know what "right for the use case" means. Leaving it in intent_match for now; flag for revisit if calibration shows it correlates more with topology failures than intent failures.
+
+**5. `response_matches_workflow_changes` doesn't really fit any axis.** It scores agent communication honesty, not workflow quality. I've parked it under `quality` because that's the least-bad home in the v1 axes. If multi-turn brings axes like `clarification_quality` and `recovery`, this might belong with them in a future `agent_communication` axis. Worth a follow-up after M3 when we see what shape those axes take.
+
+**6. `intent_match` is thin (2 checks, both LLM).** Only `fulfills_user_request` and `handles_multiple_items` sit under this axis, and both are LLM checks — the whole axis depends on judge calibration, with no deterministic intent signal. This is intrinsic to the problem (intent is judged against the prompt, which is in English) but worth knowing: a regression in the judge prompt or model will hit this entire axis at once.
+
+**7. Error-handling structure is a chronic failure mode that no check directly scores.** In current evals, scenarios that test partial-failure handling (one branch erroring while others should still complete) have never passed in any baseline — the builder almost never adds on-error branching to action nodes. None of the 28 checks specifically scores "does the workflow have on-error wiring for actions the scenario tests failure for." This is a known gap; if we want axis-level signal to drive these specific chronic failures down, we'd need a new check (probably in `connection_topology` or a new `error_handling` axis). Out of scope for M0, but flag for the M2 PR — the gap exists in the *underlying check set*, not in the rubric. Adding a check is the fix; the rubric just makes the gap visible.
+
+**8. `execution_outcome` is missing from this rubric on purpose.** The strategy doc (Section 6, rollout step 3) talks about adding an `execution_outcome` axis from the post-execution verifier, alongside the artifact-level binary checks. That's not in v1 because the rubric is scoped to the artifact-level binary checks in this milestone. M2 binds these seven axes to the binary checks; whenever the execution verifier joins the same rubric (later in M2 or in M4 close-out), the union type adds `execution_outcome`. Worth signposting that the rubric isn't finished growing — we lock these seven now and add the eighth alongside the execution-verifier integration.
+
+**9. `descriptive_node_names` and `no_unnecessary_code_nodes` are preferences, not correctness.** Both sit under `quality` today. Worth being explicit when calibration starts: low judge-vs-human agreement here doesn't necessarily mean the judge is broken; it may mean humans disagree on taste. Calibration should report κ separately for `quality` and should not treat low κ on this axis as a system-broken signal the way low κ on `intent_match` would be.
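The rubric does not say which agreement statistic κ denotes. Assuming it means Cohen's kappa over paired pass/fail verdicts, a per-axis figure could be computed roughly as in this TypeScript sketch; `cohensKappa` is a hypothetical helper, not part of the eval harness.

```typescript
// Cohen's kappa for two raters giving pass/fail verdicts on the same items:
// kappa = (po - pe) / (1 - pe), where po is observed agreement and
// pe is the agreement expected by chance from each rater's pass rate.
function cohensKappa(judge: boolean[], human: boolean[]): number {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error('verdict arrays must be non-empty and of equal length');
  }
  const n = judge.length;
  let agree = 0;
  let judgePass = 0;
  let humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === human[i]) agree++;
    if (judge[i]) judgePass++;
    if (human[i]) humanPass++;
  }
  const po = agree / n;
  const pJudge = judgePass / n;
  const pHuman = humanPass / n;
  const pe = pJudge * pHuman + (1 - pJudge) * (1 - pHuman);
  // When both raters give the same constant verdict, kappa is undefined (0 / 0).
  if (pe === 1) return NaN;
  return (po - pe) / (1 - pe);
}
```

Reporting per axis would mean grouping verdict pairs by each check's axis first and calling this once per group, so a low value on `quality` can be read separately from one on `intent_match`.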
+
+---
+
+## Deferred — multi-turn axes
+
+`clarification_quality` and `recovery` get defined when M3 prep begins. They score conversational behaviour (was the clarification on-topic? did the agent re-plan after a tool error?), not single-turn workflow artifacts. They'll plug into the same `RubricAxis` union when defined.
+
+M3 is intended to be additive — adding two axes for multi-turn behaviour without restructuring the seven above. One likely exception: `response_matches_workflow_changes` (see open question 5) may move into a new `agent_communication` axis if that proves the right shape. That decision belongs to M3.
--- a/packages/@n8n/instance-ai/evaluations/tests/comparison-format.test.ts
+++ b/packages/@n8n/instance-ai/evaluations/tests/comparison-format.test.ts
@@ -6,7 +6,7 @@ import {
 	type ScenarioCounts,
 } from '../comparison/compare';
 import { formatComparisonMarkdown, formatComparisonTerminal } from '../comparison/format';
-import type { MultiRunEvaluation, WorkflowTestCase, ScenarioResult } from '../types';
+import type { MultiRunEvaluation, WorkflowTestCase, ExecutionScenarioResult } from '../types';

 function ok(result: ComparisonResult): ComparisonOutcome {
 	return { kind: 'ok', result };
@@ -32,7 +32,7 @@ function evaluation(
 	opts: {
 		totalRuns?: number;
 		testCases?: Array<{
-			prompt?: string;
+			userText?: string;
 			buildSuccessCount?: number;
 			scenarios?: Array<{
 				name: string;
@@ -49,10 +49,10 @@ function evaluation(
 		totalRuns,
 		testCases: (opts.testCases ?? []).map((tc) => {
 			const testCase = {
-				prompt: tc.prompt ?? 'Test workflow prompt',
+				conversation: [{ role: 'user', text: tc.userText ?? 'Test workflow prompt' }],
 				complexity: 'medium' as const,
 				tags: [],
-				scenarios: (tc.scenarios ?? []).map((sa) => ({
+				executionScenarios: (tc.scenarios ?? []).map((sa) => ({
 					name: sa.name,
 					description: '',
 					dataSetup: '',
@@ -61,14 +61,14 @@ function evaluation(
 			} as WorkflowTestCase;
 			const buildSuccessCount = tc.buildSuccessCount ?? totalRuns;
 			const scenarios = (tc.scenarios ?? []).map((sa) => ({
-				scenario: testCase.scenarios.find((sc) => sc.name === sa.name)!,
+				scenario: testCase.executionScenarios.find((sc) => sc.name === sa.name)!,
 				passCount: sa.passCount,
 				passRate: totalRuns > 0 ? sa.passCount / totalRuns : 0,
 				passAtK: new Array(totalRuns).fill(sa.passCount > 0 ? 1 : 0) as number[],
 				passHatK: new Array(totalRuns).fill(sa.passCount === totalRuns ? 1 : 0) as number[],
 				runs: sa.passes.map(
-					(passed): ScenarioResult => ({
-						scenario: testCase.scenarios.find((sc) => sc.name === sa.name)!,
+					(passed): ExecutionScenarioResult => ({
+						scenario: testCase.executionScenarios.find((sc) => sc.name === sa.name)!,
 						success: passed,
 						score: passed ? 1 : 0,
 						reasoning: sa.reasoning ?? '',
@@ -79,12 +79,12 @@ function evaluation(
 			return {
 				testCase,
 				workflowBuildSuccess: buildSuccessCount > 0,
-				scenarioResults: [],
-				scenarios,
+				executionScenarioResults: [],
+				executionScenarios: scenarios,
 				runs: new Array(totalRuns).fill(null).map(() => ({
 					testCase,
 					workflowBuildSuccess: buildSuccessCount > 0,
-					scenarioResults: [],
+					executionScenarioResults: [],
 				})),
 				buildSuccessCount,
 			};
@@ -97,7 +97,7 @@ describe('formatComparisonMarkdown', () => {
 		totalRuns: 3,
 		testCases: [
 			{
-				prompt: 'a',
+				userText: 'a',
 				scenarios: [{ name: 'happy', passCount: 0, passes: [false, false, false] }],
 			},
 		],
@@ -239,7 +239,7 @@ describe('formatComparisonMarkdown', () => {
 			totalRuns: 3,
 			testCases: [
 				{
-					prompt: 'a',
+					userText: 'a',
 					scenarios: [
 						{
 							name: 'happy',
@@ -275,7 +275,7 @@ describe('formatComparisonMarkdown', () => {
 			totalRuns: 3,
 			testCases: [
 				{
-					prompt: 'Build a cross-team Linear report digest',
+					userText: 'Build a cross-team Linear report digest',
 					scenarios: [
 						{
 							name: 'no-cross-team-issues',
@@ -307,7 +307,7 @@ describe('formatComparisonMarkdown', () => {
 			totalRuns: 3,
 			testCases: [
 				{
-					prompt: 'cross-team prompt',
+					userText: 'cross-team prompt',
 					scenarios: [
 						{
 							name: 'happy-path',
@@ -319,7 +319,7 @@ describe('formatComparisonMarkdown', () => {
 					],
 				},
 				{
-					prompt: 'weather prompt',
+					userText: 'weather prompt',
 					scenarios: [
 						{
 							name: 'happy-path',
@@ -365,7 +365,7 @@ describe('formatComparisonMarkdown', () => {
 			totalRuns: 3,
 			testCases: [
 				{
-					prompt: 'Build a cross-team Linear report digest from open issues',
+					userText: 'Build a cross-team Linear report digest from open issues',
 					scenarios: [{ name: 'happy', passCount: 0, passes: [false, false, false] }],
 				},
 			],
@@ -389,7 +389,7 @@ describe('formatComparisonMarkdown', () => {
 			totalRuns: 3,
 			testCases: [
 				{
-					prompt: 'a',
+					userText: 'a',
 					scenarios: [
 						{
 							name: 'happy',
@@ -441,7 +441,7 @@ describe('formatComparisonTerminal', () => {
 		totalRuns: 3,
 		testCases: [
 			{
-				prompt: 'a',
+				userText: 'a',
 				scenarios: [{ name: 'happy', passCount: 0, passes: [false, false, false] }],
 			},
 		],
--- a/packages/@n8n/instance-ai/evaluations/tests/data-workflows.test.ts
+++ b/packages/@n8n/instance-ai/evaluations/tests/data-workflows.test.ts
@@ -41,7 +41,7 @@ function slugs(filter?: string, exclude?: string): string[] {
 }

 describe('loadWorkflowTestCasesWithFiles', () => {
-	it('returns every .json slug when no filter or exclude is given', () => {
+	it('returns every .json slug from workflows/ when no filter or exclude is given', () => {
 		expect(slugs()).toEqual([
 			'contact-form-automation',
 			'cross-team-linear-report',
--- a/packages/@n8n/instance-ai/evaluations/tests/dataset-sync.test.ts
+++ b/packages/@n8n/instance-ai/evaluations/tests/dataset-sync.test.ts
@@ -14,11 +14,11 @@ const mockedLoad = jest.mocked(loadWorkflowTestCasesWithFiles);
 function scenarioFixture(testCaseFile: string, scenarioName: string) {
 	return {
 		testCase: {
-			prompt: `prompt for ${testCaseFile}`,
+			conversation: [{ role: 'user' as const, text: `prompt for ${testCaseFile}` }],
 			complexity: 'medium' as const,
 			tags: ['test'],
 			triggerType: 'manual' as const,
-			scenarios: [
+			executionScenarios: [
 				{
 					name: scenarioName,
 					description: `desc for ${scenarioName}`,
@@ -38,7 +38,6 @@ function existingExample(id: string, testCaseFile: string, scenarioName: string)
 		created_at: '2024-01-01',
 		modified_at: '2024-01-01',
 		inputs: {
-			prompt: `prompt for ${testCaseFile}`,
 			testCaseFile,
 			scenarioName,
 			scenarioDescription: `desc for ${scenarioName}`,
--- a/packages/@n8n/instance-ai/evaluations/tests/event-parser.test.ts
+++ b/packages/@n8n/instance-ai/evaluations/tests/event-parser.test.ts
@@ -1,4 +1,8 @@
-import { extractOutcomeFromEvents, buildMetrics } from '../outcome/event-parser';
+import {
+	buildConversationMetrics,
+	buildMetrics,
+	extractOutcomeFromEvents,
+} from '../outcome/event-parser';
 import type { CapturedEvent } from '../types';

 // ---------------------------------------------------------------------------
@@ -339,3 +343,238 @@ describe('buildMetrics', () => {
 		expect(metrics.totalTimeMs).toBe(4000); // 5000 - 1000
 	});
 });
+
+// ---------------------------------------------------------------------------
+// buildConversationMetrics — per-turn counters
+// ---------------------------------------------------------------------------
+
+describe('buildConversationMetrics', () => {
+	it('returns empty metrics for no events', () => {
+		const result = buildConversationMetrics([]);
+		expect(result.turnCount).toBe(0);
+		expect(result.perTurn).toEqual([]);
+		expect(result.confirmationAskedTotal).toBe(0);
+		expect(result.confirmationAskedByKind).toEqual({});
+		expect(result.reachedRunFinishCleanly).toBe(false);
+	});
+
+	it('segments a single turn and counts tool calls + errors', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 2,
+				type: 'tool-call',
+				data: { type: 'tool-call', payload: { toolName: 'foo' } },
+			},
+			{ timestamp: 3, type: 'tool-error', data: { type: 'tool-error' } },
+			{
+				timestamp: 4,
+				type: 'tool-call',
+				data: { type: 'tool-call', payload: { toolName: 'bar' } },
+			},
+			{
+				timestamp: 5,
+				type: 'run-finish',
+				data: { type: 'run-finish', payload: { status: 'completed' } },
+			},
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.turnCount).toBe(1);
+		expect(result.perTurn).toHaveLength(1);
+		expect(result.perTurn[0].turn).toBe(1);
+		expect(result.perTurn[0].toolCallCount).toBe(2);
+		expect(result.perTurn[0].toolErrorCount).toBe(1);
+		expect(result.perTurn[0].runFinishStatus).toBe('completed');
+		expect(result.reachedRunFinishCleanly).toBe(true);
+	});
+
+	it('segments multiple turns by run-start boundaries', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 2,
+				type: 'tool-call',
+				data: { type: 'tool-call', payload: { toolName: 'a' } },
+			},
+			{
+				timestamp: 3,
+				type: 'run-finish',
+				data: { type: 'run-finish', payload: { status: 'completed' } },
+			},
+			{ timestamp: 4, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 5,
+				type: 'tool-call',
+				data: { type: 'tool-call', payload: { toolName: 'b' } },
+			},
+			{
+				timestamp: 6,
+				type: 'tool-call',
+				data: { type: 'tool-call', payload: { toolName: 'c' } },
+			},
+			{
+				timestamp: 7,
+				type: 'run-finish',
+				data: { type: 'run-finish', payload: { status: 'completed' } },
+			},
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.turnCount).toBe(2);
+		expect(result.perTurn).toHaveLength(2);
+		expect(result.perTurn[0].toolCallCount).toBe(1);
+		expect(result.perTurn[1].toolCallCount).toBe(2);
+	});
+
+	it('groups confirmations by inputType', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 2,
+				type: 'confirmation-request',
+				data: {
+					type: 'confirmation-request',
+					payload: { requestId: 'r1', inputType: 'questions' },
+				},
+			},
+			{
+				timestamp: 3,
+				type: 'confirmation-request',
+				data: {
+					type: 'confirmation-request',
+					payload: { requestId: 'r2', inputType: 'plan-review' },
+				},
+			},
+			{
+				timestamp: 4,
+				type: 'confirmation-request',
+				data: {
+					type: 'confirmation-request',
+					payload: { requestId: 'r3', inputType: 'questions' },
+				},
+			},
+			{
+				timestamp: 5,
+				type: 'run-finish',
+				data: { type: 'run-finish', payload: { status: 'completed' } },
+			},
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.confirmationAskedTotal).toBe(3);
+		expect(result.confirmationAskedByKind).toEqual({ questions: 2, 'plan-review': 1 });
+		expect(result.perTurn[0].confirmationAskedTotal).toBe(3);
+		expect(result.perTurn[0].confirmationAskedByKind).toEqual({
+			questions: 2,
+			'plan-review': 1,
+		});
+	});
+
+	it('defaults inputType to "approval" when omitted', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 2,
+				type: 'confirmation-request',
+				data: { type: 'confirmation-request', payload: { requestId: 'r1' } },
+			},
+			{ timestamp: 3, type: 'run-finish', data: { type: 'run-finish' } },
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.confirmationAskedByKind).toEqual({ approval: 1 });
+	});
+
+	it('detects repeat questions by requestId across turns', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 2,
+				type: 'confirmation-request',
+				data: {
+					type: 'confirmation-request',
+					payload: { requestId: 'shared', inputType: 'questions' },
+				},
+			},
+			{ timestamp: 3, type: 'run-finish', data: { type: 'run-finish' } },
+			{ timestamp: 4, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 5,
+				type: 'confirmation-request',
+				data: {
+					type: 'confirmation-request',
+					payload: { requestId: 'shared', inputType: 'questions' },
+				},
+			},
+			{ timestamp: 6, type: 'run-finish', data: { type: 'run-finish' } },
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.perTurn[0].repeatQuestionCount).toBe(0);
+		expect(result.perTurn[1].repeatQuestionCount).toBe(1);
+	});
+
+	it('counts replan_after_error when a tool-error is followed by tasks-update in the same turn', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{ timestamp: 2, type: 'tool-error', data: { type: 'tool-error' } },
+			{ timestamp: 3, type: 'tasks-update', data: { type: 'tasks-update' } },
+			{ timestamp: 4, type: 'run-finish', data: { type: 'run-finish' } },
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.perTurn[0].replanAfterErrorCount).toBe(1);
+	});
+
+	it('counts replan_after_error when a tool-error is followed by a plan-typed tool-call', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{ timestamp: 2, type: 'tool-error', data: { type: 'tool-error' } },
+			{
+				timestamp: 3,
+				type: 'tool-call',
+				data: { type: 'tool-call', payload: { toolName: 'plan' } },
+			},
+			{ timestamp: 4, type: 'run-finish', data: { type: 'run-finish' } },
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.perTurn[0].replanAfterErrorCount).toBe(1);
+	});
+
+	it('does NOT count replan_after_error when the recovery is in a previous turn', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{ timestamp: 2, type: 'tasks-update', data: { type: 'tasks-update' } },
+			{ timestamp: 3, type: 'run-finish', data: { type: 'run-finish' } },
+			{ timestamp: 4, type: 'run-start', data: { type: 'run-start' } },
+			{ timestamp: 5, type: 'tool-error', data: { type: 'tool-error' } },
+			{ timestamp: 6, type: 'run-finish', data: { type: 'run-finish' } },
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.perTurn[1].replanAfterErrorCount).toBe(0);
+	});
+
+	it('marks reachedRunFinishCleanly false when the last run-finish is not completed', () => {
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 2,
+				type: 'run-finish',
+				data: { type: 'run-finish', payload: { status: 'completed' } },
+			},
+			{ timestamp: 3, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 4,
+				type: 'run-finish',
+				data: { type: 'run-finish', payload: { status: 'cancelled' } },
+			},
+		];
+
+		const result = buildConversationMetrics(events);
+		expect(result.reachedRunFinishCleanly).toBe(false);
+		expect(result.perTurn[1].runFinishStatus).toBe('cancelled');
+	});
+});
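
Reviewer note: the tests above pin down how `buildConversationMetrics` segments a captured event stream into turns. The real implementation is not shown in this part of the diff, so the following is only a minimal sketch of the behaviour the tests imply. `SketchEvent`, `SketchTurn`, and `segmentTurns` are hypothetical names, and confirmation, repeat-question, and replan counters are left out.

```typescript
// Hypothetical, simplified stand-in for CapturedEvent (the real type lives in ../types).
interface SketchEvent {
	timestamp: number;
	type: string;
	data: Record<string, unknown>;
}

// Hypothetical per-turn summary covering only the counters exercised by the first two tests.
interface SketchTurn {
	turn: number;
	toolCallCount: number;
	toolErrorCount: number;
	runFinishStatus?: string;
}

// Assumption: every run-start opens a new turn, and later events are attributed
// to the most recently opened turn.
function segmentTurns(events: SketchEvent[]): SketchTurn[] {
	const turns: SketchTurn[] = [];
	let current: SketchTurn | undefined;
	for (const event of events) {
		if (event.type === 'run-start') {
			current = { turn: turns.length + 1, toolCallCount: 0, toolErrorCount: 0 };
			turns.push(current);
			continue;
		}
		if (!current) continue; // ignore events that arrive before the first run-start
		if (event.type === 'tool-call') {
			current.toolCallCount += 1;
		} else if (event.type === 'tool-error') {
			current.toolErrorCount += 1;
		} else if (event.type === 'run-finish') {
			const payload = event.data.payload as { status?: string } | undefined;
			current.runFinishStatus = payload?.status;
		}
	}
	return turns;
}
```

Under this sketch, "reached run-finish cleanly" would reduce to checking that the last turn's `runFinishStatus` is `'completed'`, which matches the final test in the file.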
--- a/packages/@n8n/instance-ai/evaluations/tests/lane-allocator.test.ts
+++ b/packages/@n8n/instance-ai/evaluations/tests/lane-allocator.test.ts
@ -8,7 +8,7 @@ function newLanes(count: number): TestLane[] {
 	return Array.from({ length: count }, (_, i) => ({
 		id: i,
 		activeBuilds: 0,
-		inflightPrompts: new Set<string>(),
+		inflightKeys: new Set<string>(),
 	}));
 }

@ -30,8 +30,8 @@ describe('LaneAllocator', () => {
 		const l2 = await a.acquire('p1');
 		expect(l1.id).toBe(0);
 		expect(l2.id).toBe(1);
-		expect(lanes[0].inflightPrompts.has('p1')).toBe(true);
-		expect(lanes[1].inflightPrompts.has('p1')).toBe(true);
+		expect(lanes[0].inflightKeys.has('p1')).toBe(true);
+		expect(lanes[1].inflightKeys.has('p1')).toBe(true);
 	});

 	it('queues acquires when no lane can serve the prompt', async () => {
@ -48,7 +48,7 @@ describe('LaneAllocator', () => {
 		a.release(lanes[0], 'p1');
 		const lane = await second;
 		expect(lane.id).toBe(0);
-		expect(lanes[0].inflightPrompts.has('p1')).toBe(true);
+		expect(lanes[0].inflightKeys.has('p1')).toBe(true);
 	});

 	it('respects maxConcurrentBuilds per lane', async () => {
--- a/packages/@n8n/instance-ai/evaluations/tests/runner-prebuilt.test.ts
+++ b/packages/@n8n/instance-ai/evaluations/tests/runner-prebuilt.test.ts
@ -47,10 +47,10 @@ function makeTestCase(): WorkflowTestCase {
 	// Empty scenarios => runWorkflowTestCase short-circuits past the
 	// scenario-execution loop, so we don't need to mock executeScenario.
 	return {
-		prompt: 'build me something',
+		conversation: [{ role: 'user', text: 'build me something' }],
 		complexity: 'simple',
 		tags: ['test'],
-		scenarios: [],
+		executionScenarios: [],
 	};
 }

--- a/packages/@n8n/instance-ai/evaluations/tests/transcript-from-events.test.ts
+++ b/packages/@n8n/instance-ai/evaluations/tests/transcript-from-events.test.ts
@ -0,0 +1,177 @@
+import { buildTranscriptFromEvents } from '../outcome/transcript-from-events';
+import type { CapturedEvent } from '../types';
+
+function evt(type: string, data: Record<string, unknown> = {}): CapturedEvent {
+	return { timestamp: 0, type, data };
+}
+
+const RUN_START = evt('run-start');
+
+describe('buildTranscriptFromEvents', () => {
+	it('returns empty when there are no events', () => {
+		expect(buildTranscriptFromEvents({ events: [] })).toEqual([]);
+	});
+
+	it('assembles agent text and the opening user message per turn', () => {
+		const turns = buildTranscriptFromEvents({
+			events: [RUN_START, evt('text-delta', { text: 'Hello there' })],
+			openingMessage: 'Build me a workflow',
+		});
+		expect(turns).toHaveLength(1);
+		expect(turns[0]).toMatchObject({
+			userMessage: 'Build me a workflow',
+			agentText: 'Hello there',
+		});
+	});
+
+	describe('ask-user routing', () => {
+		const questions = [{ id: 'q1', question: 'Which channels?' }];
+
+		it('renders ask-user from confirmation-request and skips the tool-call twin', () => {
+			const turns = buildTranscriptFromEvents({
+				events: [
+					RUN_START,
+					evt('tool-call', { payload: { toolName: 'ask-user', args: { questions } } }),
+					evt('confirmation-request', {
+						payload: { requestId: 'r1', questions, inputType: 'questions' },
+					}),
+				],
+				proxyResponses: new Map([
+					[
+						'r1',
+						{
+							kind: 'questions' as const,
+							answers: [{ questionId: 'q1', selectedOptions: ['#general'] }],
+						},
+					],
+				]),
+			});
+			const interactions = turns[0].toolInteractions;
+			expect(interactions).toHaveLength(1);
+			expect(interactions[0]).toMatchObject({
+				kind: 'ask-user',
+				questions: [{ id: 'q1', question: 'Which channels?' }],
+				answers: [{ questionId: 'q1', selectedOptions: ['#general'] }],
+			});
+		});
+
+		it('renders only the questions on the confirmation-request, not the tool-call', () => {
+			const turns = buildTranscriptFromEvents({
+				events: [
+					RUN_START,
+					evt('tool-call', { payload: { toolName: 'ask-user', args: { questions } } }),
+					evt('confirmation-request', {
+						payload: { requestId: 'r1', questions, inputType: 'questions' },
+					}),
+				],
+			});
+			const askUserInteractions = turns[0].toolInteractions.filter((i) => i.kind === 'ask-user');
+			expect(askUserInteractions).toHaveLength(1);
+		});
+	});
+
+	describe('plan routing', () => {
+		it('renders plan from tool-call args', () => {
+			const turns = buildTranscriptFromEvents({
+				events: [
+					RUN_START,
+					evt('tool-call', {
+						payload: {
+							toolName: 'plan',
+							args: { tasks: [{ title: 'Fetch posts', description: 'GET /posts' }] },
+						},
+					}),
+				],
+			});
+			expect(turns[0].toolInteractions[0]).toMatchObject({
+				kind: 'plan',
+				tasks: [{ title: 'Fetch posts', description: 'GET /posts' }],
+			});
+		});
+
+		it('treats `add-plan-item` as plan (alias in the dispatch table)', () => {
+			const turns = buildTranscriptFromEvents({
+				events: [
+					RUN_START,
+					evt('tool-call', {
+						payload: { toolName: 'add-plan-item', args: { tasks: [{ title: 'A' }] } },
+					}),
+				],
+			});
+			expect(turns[0].toolInteractions[0]).toMatchObject({ kind: 'plan' });
+		});
+	});
+
+	describe('setup wizard routing', () => {
+		it('renders the outcome from tool-result and skips the confirmation-request twin', () => {
+			const turns = buildTranscriptFromEvents({
+				events: [
+					RUN_START,
+					evt('confirmation-request', {
+						payload: { requestId: 'r1', setupRequests: [{ nodeName: 'Slack' }] },
+					}),
+					evt('tool-result', {
+						payload: {
+							toolName: 'workflows',
+							result: {
+								completedNodes: [{ nodeName: 'Schedule', parametersSet: ['cron'] }],
+								skippedNodes: [{ nodeName: 'Slack', credentialType: 'slackApi' }],
+							},
+						},
+					}),
+				],
+			});
+			const interactions = turns[0].toolInteractions;
+			expect(interactions).toHaveLength(1);
+			expect(interactions[0]).toMatchObject({
+				kind: 'setup-wizard',
+				completedNodes: [{ nodeName: 'Schedule', parametersSet: ['cron'] }],
+				skippedNodes: [{ nodeName: 'Slack', credentialType: 'slackApi' }],
+			});
+		});
+	});
+
+	describe('generic confirmation', () => {
+		it('records the resume reason and the proxy approval flag', () => {
+			const turns = buildTranscriptFromEvents({
+				events: [
+					RUN_START,
+					evt('confirmation-request', {
+						payload: { requestId: 'r1', toolName: 'submit-plan' },
+					}),
+				],
+				proxyResponses: new Map([['r1', { kind: 'approval' as const, approved: false }]]),
+			});
+			expect(turns[0].toolInteractions[0]).toMatchObject({
+				kind: 'confirmation',
+				toolName: 'submit-plan',
+				resumeReason: 'approval',
+				approved: false,
+			});
+		});
+	});
+
+	describe('plain tool-call dedupe', () => {
+		it('collapses repeat invocations of the same tool name within a turn', () => {
+			const turns = buildTranscriptFromEvents({
+				events: [
+					RUN_START,
+					evt('tool-call', { payload: { toolName: 'credentials', args: {} } }),
+					evt('tool-call', { payload: { toolName: 'credentials', args: {} } }),
+					evt('tool-call', { payload: { toolName: 'credentials', args: {} } }),
+				],
+			});
+			const calls = turns[0].toolInteractions.filter((i) => i.kind === 'tool-call');
+			expect(calls).toHaveLength(1);
+		});
+	});
+
+	it('drops a turn that produced nothing visible (e.g. stray run-start)', () => {
+		const turns = buildTranscriptFromEvents({
+			events: [RUN_START, evt('text-delta', { text: 'hi' }), RUN_START],
+			openingMessage: 'go',
+		});
+		expect(turns).toHaveLength(1);
+		expect(turns[0]).toMatchObject({ userMessage: 'go', agentText: 'hi' });
+	});
+});
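
Reviewer note: the "plain tool-call dedupe" test above expects repeat invocations of the same tool name within one turn to collapse into a single interaction. A minimal sketch of that rule, assuming dedupe is keyed on the tool name only and keeps the first occurrence, could look like this. `dedupeToolNames` is a hypothetical helper, not a function from `transcript-from-events`.

```typescript
// Hypothetical helper: collapse repeated tool names within one turn,
// keeping the order in which each name was first seen.
function dedupeToolNames(toolNamesInTurn: string[]): string[] {
	const seen = new Set<string>();
	const unique: string[] = [];
	for (const name of toolNamesInTurn) {
		if (seen.has(name)) continue; // already rendered once in this turn
		seen.add(name);
		unique.push(name);
	}
	return unique;
}
```

Because the set is created per call, the same tool name can still appear once in each turn if the helper is applied turn by turn.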
--- a/packages/@n8n/instance-ai/evaluations/tests/user-proxy.test.ts
+++ b/packages/@n8n/instance-ai/evaluations/tests/user-proxy.test.ts
@ -0,0 +1,642 @@
+// ---------------------------------------------------------------------------
+// Tests for UserProxyLlm — structured-output dispatch with deterministic shortcuts.
+//
+// The proxy delegates LLM-driven decisions to an injectable agent
+// (UserProxyAgent). Tests pass a programmable fake agent to assert routing,
+// deterministic shortcuts, repeat detection, and budget enforcement.
+// ---------------------------------------------------------------------------
+
+import type { CapturedEvent } from '../types';
+import { UserProxyLlm } from '../utils/user-proxy';
+import type { UserProxyAgent } from '../utils/user-proxy/agent';
+import type { Decision } from '../utils/user-proxy/tools';
+
+// ---------------------------------------------------------------------------
+// FakeAgent — programmable agent for tests
+// ---------------------------------------------------------------------------
+
+class FakeAgent implements UserProxyAgent {
+	readonly prompts: string[] = [];
+	private queue: Array<Decision | undefined | Error> = [];
+
+	enqueue(...decisions: Array<Decision | undefined | Error>): void {
+		this.queue.push(...decisions);
+	}
+
+	// eslint-disable-next-line @typescript-eslint/require-await
+	async decide(userPrompt: string): Promise<Decision | undefined> {
+		this.prompts.push(userPrompt);
+		const next = this.queue.shift();
+		if (next instanceof Error) throw next;
+		return next;
+	}
+
+	get callCount(): number {
+		return this.prompts.length;
+	}
+}
+
+// ---------------------------------------------------------------------------
+// Event helpers
+// ---------------------------------------------------------------------------
+
+function questionEvent(
+	requestId: string,
+	questions: Array<{
+		id: string;
+		question: string;
+		type: 'single' | 'multi' | 'text';
+		options?: string[];
+	}>,
+): CapturedEvent {
+	return {
+		timestamp: 100,
+		type: 'confirmation-request',
+		data: {
+			type: 'confirmation-request',
+			payload: {
+				requestId,
+				toolCallId: 'tc-x',
+				toolName: 'ask-user',
+				args: {},
+				severity: 'info',
+				message: 'Please answer',
+				inputType: 'questions',
+				questions,
+			},
+		},
+	};
+}
+
+function planReviewEvent(requestId: string): CapturedEvent {
+	return {
+		timestamp: 100,
+		type: 'confirmation-request',
+		data: {
+			type: 'confirmation-request',
+			payload: {
+				requestId,
+				toolCallId: 'tc-x',
+				toolName: 'plan',
+				args: {},
+				severity: 'info',
+				message: 'Approve plan?',
+				inputType: 'plan-review',
+			},
+		},
+	};
+}
+
+function setupWizardEvent(requestId: string): CapturedEvent {
+	return {
+		timestamp: 100,
+		type: 'confirmation-request',
+		data: {
+			type: 'confirmation-request',
+			payload: {
+				requestId,
+				toolCallId: 'tc-x',
+				toolName: 'setup-workflow',
+				args: {},
+				severity: 'info',
+				message: 'Set up the workflow',
+				setupRequests: [{ nodeId: 'n1', nodeName: 'Send Slack Message', parameterRequests: [] }],
+			},
+		},
+	};
+}
+
+function credentialEvent(requestId: string): CapturedEvent {
+	return {
+		timestamp: 100,
+		type: 'confirmation-request',
+		data: {
+			type: 'confirmation-request',
+			payload: {
+				requestId,
+				toolCallId: 'tc-x',
+				toolName: 'credential-setup',
+				args: {},
+				severity: 'info',
+				message: 'Set up credentials',
+				credentialRequests: [{ type: 'slackApi' }],
+			},
+		},
+	};
+}
+
+function domainAccessEvent(requestId: string): CapturedEvent {
+	return {
+		timestamp: 100,
+		type: 'confirmation-request',
+		data: {
+			type: 'confirmation-request',
+			payload: {
+				requestId,
+				toolCallId: 'tc-x',
+				toolName: 'web-research',
+				args: {},
+				severity: 'info',
+				message: 'Allow domain?',
+				domainAccess: { url: 'https://docs.example.com', host: 'docs.example.com' },
+			},
+		},
+	};
+}
+
+function resourceDecisionEvent(requestId: string, options: string[]): CapturedEvent {
+	return {
+		timestamp: 100,
+		type: 'confirmation-request',
+		data: {
+			type: 'confirmation-request',
+			payload: {
+				requestId,
+				toolCallId: 'tc-x',
+				toolName: 'gateway-resource',
+				args: {},
+				severity: 'info',
+				message: 'Pick option',
+				resourceDecision: { options },
+			},
+		},
+	};
+}
+
+function textInputEvent(requestId: string): CapturedEvent {
+	return {
+		timestamp: 100,
+		type: 'confirmation-request',
+		data: {
+			type: 'confirmation-request',
+			payload: {
+				requestId,
+				toolCallId: 'tc-x',
+				toolName: 'pause-for-user',
+				args: {},
+				severity: 'info',
+				message: 'Please respond',
+				inputType: 'text',
+			},
+		},
+	};
+}
+
+// ---------------------------------------------------------------------------
+// respondToConfirmation
+// ---------------------------------------------------------------------------
+
+describe('UserProxyLlm.respondToConfirmation', () => {
+	it('answers questions when the agent returns answer_questions', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({
+			action: 'answer_questions',
+			answers: [{ questionId: 'q1', selectedOptions: ['#general'] }],
+		});
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'post to #general' }],
+			agent,
+		});
+
+		const event = questionEvent('req-1', [
+			{ id: 'q1', question: 'Which channel?', type: 'single', options: ['#general'] },
+		]);
+		const response = await proxy.respondToConfirmation(event);
+
+		expect(response.kind).toBe('questions');
+		if (response.kind === 'questions') {
+			expect(response.answers).toEqual([{ questionId: 'q1', selectedOptions: ['#general'] }]);
+		}
+		expect(agent.callCount).toBe(1);
+	});
+
+	it('returns approval with userInput when the agent picks approve_or_reject', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({
+			action: 'approve_or_reject',
+			approved: true,
+			userInput: 'looks good',
+		});
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'approve' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(planReviewEvent('req-pr'));
+		expect(response.kind).toBe('approval');
+		if (response.kind === 'approval') {
+			expect(response.approved).toBe(true);
+			expect(response.userInput).toBe('looks good');
+		}
+	});
+
+	it('returns approval with no userInput when the agent omits it', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({ action: 'approve_or_reject', approved: true });
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'approve' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(planReviewEvent('req-pr'));
+		expect(response.kind).toBe('approval');
+		if (response.kind === 'approval') {
+			expect(response.approved).toBe(true);
+			expect(response.userInput).toBeUndefined();
+		}
+	});
+
+	it('rejects a plan when the agent returns approve_or_reject with approved=false', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({
+			action: 'approve_or_reject',
+			approved: false,
+			userInput: 'I wanted email, not data table',
+		});
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'send an email' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(planReviewEvent('req-pr'));
+		expect(response.kind).toBe('approval');
+		if (response.kind === 'approval') {
+			expect(response.approved).toBe(false);
+			expect(response.userInput).toContain('email');
+		}
+	});
+
+	it('encodes apply_setup_wizard into setupWorkflowApply with nodeParameters', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({
+			action: 'apply_setup_wizard',
+			nodeParametersJson: JSON.stringify({
+				'Send Slack Message': { channelId: 'general', text: 'hi' },
+			}),
+		});
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'post hi to #general' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(setupWizardEvent('req-sw'));
+		expect(response.kind).toBe('setupWorkflowApply');
+		if (response.kind === 'setupWorkflowApply') {
+			expect(response.nodeParameters).toEqual({
+				'Send Slack Message': { channelId: 'general', text: 'hi' },
+			});
+			expect(response.nodeCredentials).toBeUndefined();
+		}
+	});
+
+	it('handles credential events deterministically without invoking the agent', async () => {
+		const agent = new FakeAgent();
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(credentialEvent('req-cred'));
+		expect(response.kind).toBe('credentialSelection');
+		if (response.kind === 'credentialSelection') {
+			expect(response.credentials).toEqual({});
+		}
+		expect(agent.callCount).toBe(0);
+	});
+
+	it('handles domain-access events deterministically with allow_all', async () => {
+		const agent = new FakeAgent();
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(domainAccessEvent('req-dom'));
+		expect(response.kind).toBe('domainAccessApprove');
+		if (response.kind === 'domainAccessApprove') {
+			expect(response.domainAccessAction).toBe('allow_all');
+		}
+		expect(agent.callCount).toBe(0);
+	});
+
+	it('handles resource-decision events deterministically with the first allow option', async () => {
+		const agent = new FakeAgent();
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(
+			resourceDecisionEvent('req-res', ['deny', 'allowOnce', 'allowAll']),
+		);
+		expect(response.kind).toBe('resourceDecision');
+		if (response.kind === 'resourceDecision') {
+			expect(response.resourceDecision).toBe('allowOnce');
+		}
+		expect(agent.callCount).toBe(0);
+	});
+
+	it('routes setup-wizard events to the agent even when they include credentialRequests', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({
+			action: 'apply_setup_wizard',
+			nodeParametersJson: JSON.stringify({ Node1: { p1: 'v1' } }),
+		});
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			agent,
+		});
+
+		const event: CapturedEvent = {
+			timestamp: 100,
+			type: 'confirmation-request',
+			data: {
+				type: 'confirmation-request',
+				payload: {
+					requestId: 'req-mixed',
+					setupRequests: [{ nodeId: 'n1', nodeName: 'Node1' }],
+					credentialRequests: [{ type: 'slackApi' }],
+				},
+			},
+		};
+
+		const response = await proxy.respondToConfirmation(event);
+		expect(response.kind).toBe('setupWorkflowApply');
+		expect(agent.callCount).toBe(1);
+	});
+
+	it('falls back to the permissive payload when the agent returns undefined', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue(undefined);
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(planReviewEvent('req-fail'));
+		// buildAutoApprovePayload returns { kind: 'approval', approved: true } for plan-review.
+		expect(response.kind).toBe('approval');
+	});
+
+	it('falls back to the permissive payload when the agent picks a between-run action', async () => {
+		const agent = new FakeAgent();
+		// declare_done is a between-run action, invalid as a confirmation response.
+		agent.enqueue({ action: 'declare_done' });
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(planReviewEvent('req-mis'));
+		expect(response.kind).toBe('approval');
+	});
+
+	it('returns the permissive payload on a repeat requestId without consulting the agent', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({
+			action: 'answer_questions',
+			answers: [{ questionId: 'q1', selectedOptions: ['#general'] }],
+		});
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			agent,
+		});
+
+		const event = questionEvent('req-repeat', [
+			{ id: 'q1', question: 'Q?', type: 'single', options: ['#general'] },
+		]);
+		await proxy.respondToConfirmation(event);
+		const second = await proxy.respondToConfirmation(event);
+
+		// A repeat falls back to buildAutoApprovePayload; for the 'questions'
+		// inputType that means kind: 'questions' with empty answers.
+		expect(second.kind).toBe('questions');
+		if (second.kind === 'questions') expect(second.answers).toEqual([]);
+		expect(agent.callCount).toBe(1); // only the first call invoked the agent
+	});
+
+	it('handles text input by routing to the agent and encoding as approval', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({
+			action: 'approve_or_reject',
+			approved: true,
+			userInput: 'continue',
+		});
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			agent,
+		});
+
+		const response = await proxy.respondToConfirmation(textInputEvent('req-txt'));
+		expect(response.kind).toBe('approval');
+		if (response.kind === 'approval') {
+			expect(response.userInput).toBe('continue');
+		}
+	});
+});
+
+// ---------------------------------------------------------------------------
+// decideFollowUp
+// ---------------------------------------------------------------------------
+
+describe('UserProxyLlm.decideFollowUp', () => {
+	it('returns done immediately when messageBudget is 0 without invoking the agent', async () => {
+		const agent = new FakeAgent();
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'do it' }],
+			messageBudget: 0,
+			agent,
+		});
+
+		const decision = await proxy.decideFollowUp();
+		expect(decision.kind).toBe('done');
+		expect(agent.callCount).toBe(0);
+	});
+
+	it('always invokes the agent to compose the next user turn', async () => {
+		// Previously the proxy short-circuited to "next script user turn
+		// verbatim". The new design always defers to the agent so the message
+		// can adapt to whatever the assistant just said while staying faithful
+		// to the script's intent.
+		const agent = new FakeAgent();
+		agent.enqueue({ action: 'send_follow_up_message', message: 'also log to sheets' });
+		const proxy = new UserProxyLlm({
+			conversation: [
+				{ role: 'user', text: 'build the workflow' },
+				{ role: 'assistant', text: 'done!' },
+				{ role: 'user', text: 'now also log to sheets' },
+			],
+			messageBudget: 5,
+			agent,
+		});
+
+		const decision = await proxy.decideFollowUp();
+		expect(decision.kind).toBe('followUp');
+		if (decision.kind === 'followUp') {
+			expect(decision.message).toBe('also log to sheets');
+		}
+		expect(proxy.getMessagesSent()).toBe(1);
+		expect(agent.callCount).toBe(1);
+	});
+
+	it('invokes the agent on every follow-up — no verbatim shortcut for short scripts', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({ action: 'send_follow_up_message', message: 'one more thing' });
+		const proxy = new UserProxyLlm({
+			// Only one user turn in the script.
+			conversation: [{ role: 'user', text: 'build it' }],
+			messageBudget: 5,
+			agent,
+		});
+
+		const decision = await proxy.decideFollowUp();
+		expect(decision.kind).toBe('followUp');
+		if (decision.kind === 'followUp') {
+			expect(decision.message).toBe('one more thing');
+		}
+		expect(agent.callCount).toBe(1);
+	});
+
+	it('treats declare_done as done', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({ action: 'declare_done' });
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'all set' }],
+			messageBudget: 3,
+			agent,
+		});
+
+		const decision = await proxy.decideFollowUp();
+		expect(decision.kind).toBe('done');
+	});
+
+	it('returns done when the agent returns undefined', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue(undefined);
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			messageBudget: 3,
+			agent,
+		});
+
+		const decision = await proxy.decideFollowUp();
+		expect(decision.kind).toBe('done');
+	});
+
+	it('returns done when the agent picks a confirmation-only action', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({
+			action: 'answer_questions',
+			answers: [{ questionId: 'q1', selectedOptions: [] }],
+		});
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			messageBudget: 3,
+			agent,
+		});
+
+		const decision = await proxy.decideFollowUp();
+		expect(decision.kind).toBe('done');
+	});
+
+	it('treats an empty follow-up message as done without consuming budget', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({ action: 'send_follow_up_message', message: '   ' });
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			messageBudget: 3,
+			agent,
+		});
+
+		const decision = await proxy.decideFollowUp();
+		expect(decision.kind).toBe('done');
+		expect(proxy.getMessagesSent()).toBe(0);
+	});
+
+	it('caps follow-ups at messageBudget across multiple invocations', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue(
+			{ action: 'send_follow_up_message', message: 'msg1' },
+			{ action: 'send_follow_up_message', message: 'msg2' },
+		);
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			messageBudget: 2,
+			agent,
+		});
+
+		expect((await proxy.decideFollowUp()).kind).toBe('followUp');
+		expect((await proxy.decideFollowUp()).kind).toBe('followUp');
+		const third = await proxy.decideFollowUp();
+		expect(third.kind).toBe('done');
+		expect(proxy.getMessagesSent()).toBe(2);
+	});
+});
+
+// ---------------------------------------------------------------------------
+// ingestEvents
+// ---------------------------------------------------------------------------
+
+describe('UserProxyLlm.ingestEvents', () => {
+	it('accumulates text-delta payloads into the rolling transcript', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({ action: 'declare_done' });
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'open a ticket' }],
+			messageBudget: 3,
+			agent,
+		});
+
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 2,
+				type: 'text-delta',
+				data: { type: 'text-delta', payload: { text: 'Hello ' } },
+			},
+			{
+				timestamp: 3,
+				type: 'text-delta',
+				data: { type: 'text-delta', payload: { text: 'world' } },
+			},
+			{ timestamp: 4, type: 'run-finish', data: { type: 'run-finish' } },
+			{ timestamp: 5, type: 'run-start', data: { type: 'run-start' } },
+			{ timestamp: 6, type: 'text-delta', data: { type: 'text-delta', text: 'second' } },
+			{ timestamp: 7, type: 'run-finish', data: { type: 'run-finish' } },
+		];
+		proxy.ingestEvents(events);
+
+		await proxy.decideFollowUp();
+		const lastPrompt = agent.prompts[agent.prompts.length - 1];
+		expect(lastPrompt).toContain('Hello world');
+		expect(lastPrompt).toContain('second');
+	});
+
+	it('is idempotent — re-ingesting the same array does not duplicate transcript entries', async () => {
+		const agent = new FakeAgent();
+		agent.enqueue({ action: 'declare_done' });
+		const proxy = new UserProxyLlm({
+			conversation: [{ role: 'user', text: 'go' }],
+			messageBudget: 3,
+			agent,
+		});
+
+		const events: CapturedEvent[] = [
+			{ timestamp: 1, type: 'run-start', data: { type: 'run-start' } },
+			{
+				timestamp: 2,
+				type: 'text-delta',
+				data: { type: 'text-delta', payload: { text: 'echoed' } },
+			},
+			{ timestamp: 3, type: 'run-finish', data: { type: 'run-finish' } },
+		];
+		proxy.ingestEvents(events);
+		proxy.ingestEvents(events); // second call should be a no-op
+		proxy.ingestEvents(events); // and a third
+
+		await proxy.decideFollowUp();
+		const prompt = agent.prompts[0];
+		// 'echoed' should appear once in the transcript, not three times.
+		expect((prompt.match(/echoed/g) ?? []).length).toBe(1);
+	});
+});
--- a/packages/@n8n/instance-ai/evaluations/cli/aggregator.ts
+++ b/packages/@n8n/instance-ai/evaluations/cli/aggregator.ts
@ -2,7 +2,7 @@ import type {
 	WorkflowTestCaseResult,
 	MultiRunEvaluation,
 	TestCaseAggregation,
-	ScenarioAggregation,
+	ExecutionScenarioAggregation,
 } from '../types';

 /**
@ -65,14 +65,14 @@ export function aggregateResults(
 		const testCase = runs[0].testCase;
 		const buildSuccessCount = runs.filter((r) => r.workflowBuildSuccess).length;

-		const scenarioCount = testCase.scenarios.length;
-		const scenarios: ScenarioAggregation[] = [];
+		const scenarioCount = testCase.executionScenarios.length;
+		const executionScenarios: ExecutionScenarioAggregation[] = [];

 		for (let sIdx = 0; sIdx < scenarioCount; sIdx++) {
-			const scenario = testCase.scenarios[sIdx];
+			const scenario = testCase.executionScenarios[sIdx];
 			const scenarioRuns = runs.map(
 				(r) =>
-					r.scenarioResults[sIdx] ?? {
+					r.executionScenarioResults[sIdx] ?? {
 						scenario,
 						success: false,
 						score: 0,
@ -82,7 +82,7 @@ export function aggregateResults(
 			const passCount = scenarioRuns.filter((sr) => sr.success).length;
 			const { passAtKValues, passHatKValues } = computePassMetrics(totalRuns, passCount);

-			scenarios.push({
+			executionScenarios.push({
 				scenario,
 				runs: scenarioRuns,
 				passCount,
@ -92,7 +92,7 @@ export function aggregateResults(
 			});
 		}

-		testCases.push({ testCase, runs, buildSuccessCount, scenarios });
+		testCases.push({ testCase, runs, buildSuccessCount, executionScenarios });
 	}

 	return { totalRuns, testCases };
--- a/packages/@n8n/instance-ai/evaluations/cli/build-mcp-manifest.ts
+++ b/packages/@n8n/instance-ai/evaluations/cli/build-mcp-manifest.ts
@ -358,7 +358,11 @@ interface BuildOutcome {
 	durationMs: number;
 }

-const testCaseSchema = z.object({ prompt: z.string() }).passthrough();
+const testCaseSchema = z
+	.object({
+		conversation: z.array(z.object({ role: z.string(), text: z.string() })).min(1),
+	})
+	.passthrough();

 function tailWorkflowId(text: string): string | null {
 	const matches = [...text.matchAll(/WORKFLOW_ID=([A-Za-z0-9_-]+)/g)];
@ -384,7 +388,7 @@ async function buildOne(
 		? `\n\nWhen calling create_workflow_from_code, pass projectId: '${args.projectId}' so the workflow is created in that n8n project.`
 		: '';

-	const userMessage = `${testCase.prompt}${projectInstruction}
+	const userMessage = `${testCase.conversation[0].text}${projectInstruction}

 ---
 After you have created the workflow with create_workflow_from_code, print a final line of the exact form:
--- a/packages/@n8n/instance-ai/evaluations/cli/index.ts
+++ b/packages/@n8n/instance-ai/evaluations/cli/index.ts
@ -56,8 +56,9 @@ import { snapshotWorkflowIds } from '../outcome/workflow-discovery';
 import { writeWorkflowReport } from '../report/workflow-report';
 import type {
 	MultiRunEvaluation,
-	ScenarioResult,
-	TestScenario,
+	ExecutionScenarioResult,
+	ExecutionScenario,
+	TranscriptTurn,
 	WorkflowTestCase,
 	WorkflowTestCaseResult,
 } from '../types';
@ -79,6 +80,8 @@ const targetOutputSchema = z.object({
 	buildDurationMs: z.number().optional(),
 	execDurationMs: z.number().default(0),
 	nodeCount: z.number().default(0),
+	/** The thread id used during the build — keys the LangSmith trace lookup. */
+	threadId: z.string().optional(),
 });

 type TargetOutput = Omit<z.infer<typeof targetOutputSchema>, 'evalResult'> & {
@ -118,7 +121,6 @@ function parseTargetOutput(raw: unknown): TargetOutput | undefined {

 const runInputsSchema = z
 	.object({
-		prompt: z.string().default(''),
 		testCaseFile: z.string().default(''),
 		scenarioName: z.string().default(''),
 		/** 0-based iteration index; injected during multi-run expansion. */
@ -236,7 +238,8 @@ async function main(): Promise<void> {
 		);
 		console.log(`Results:    ${jsonPath}`);
 		console.log(`PR comment: ${prCommentPath}`);
-		const htmlPath = writeWorkflowReport(flattenRunsForReport(evaluation));
+		const reportResults = flattenRunsForReport(evaluation);
+		const htmlPath = writeWorkflowReport(reportResults);
 		console.log(`Report:     ${htmlPath}`);
 		console.log(
 			'\n' + formatComparisonTerminal(evaluation, outcome, { commitSha, slugByTestCase }),
@ -266,18 +269,38 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 	const datasetName = await syncDataset(lsClient, args.dataset, logger, args.filter, args.exclude);
 	const testCasesWithFiles = loadWorkflowTestCasesWithFiles(args.filter, args.exclude);

+	// Stash transcripts by threadId so reshapeLangSmithRuns can merge them in —
+	// the LangSmith target() output schema doesn't carry the full transcript.
+	const transcriptByThreadId = new Map<string, TranscriptTurn[]>();
+
+	// LangSmith dataset rows carry only per-scenario fields. The conversation
+	// for the build is sourced locally, keyed by fileSlug.
+	const conversationByFileSlug = new Map<
+		string,
+		{ conversation: WorkflowTestCase['conversation']; messageBudget?: number }
+	>();
+	for (const { testCase, fileSlug } of testCasesWithFiles) {
+		conversationByFileSlug.set(fileSlug, {
+			conversation: testCase.conversation,
+			messageBudget: testCase.messageBudget,
+		});
+	}
+
 	// LaneState carries the allocator-managed counters (activeBuilds,
-	// inflightPrompts) plus the lane's traced LangSmith wrappers. `runner` is
+	// inflightKeys) plus the lane's traced LangSmith wrappers. `runner` is
 	// the underlying Lane (n8n client, credential state) — named distinctly so
 	// it doesn't shadow the iteration variable `lane` in lanes.map().
 	interface LaneState {
 		runner: Lane;
 		activeBuilds: number;
-		inflightPrompts: Set<string>;
-		tracedBuild: (prompt: string) => Promise<BuildResult>;
+		inflightKeys: Set<string>;
+		tracedBuild: (buildArgs: {
+			conversation: WorkflowTestCase['conversation'];
+			messageBudget?: number;
+		}) => Promise<BuildResult>;
 		tracedExecute: (execArgs: {
 			workflowId: string;
-			scenario: TestScenario;
+			scenario: ExecutionScenario;
 			workflowJsons: BuildResult['workflowJsons'];
 		}) => Promise<Awaited<ReturnType<typeof executeScenario>>>;
 	}
@ -288,12 +311,16 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 		return {
 			runner: lane,
 			activeBuilds: 0,
-			inflightPrompts: new Set<string>(),
+			inflightKeys: new Set<string>(),
 			tracedBuild: traceable(
-				async (prompt: string) =>
+				async (buildArgs: {
+					conversation: WorkflowTestCase['conversation'];
+					messageBudget?: number;
+				}) =>
 					await buildWorkflow({
 						client: lane.client,
-						prompt,
+						conversation: buildArgs.conversation,
+						messageBudget: buildArgs.messageBudget,
 						timeoutMs: args.timeoutMs,
 						preRunWorkflowIds: lane.preRunWorkflowIds,
 						claimedWorkflowIds: lane.claimedWorkflowIds,
@ -310,7 +337,7 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 			tracedExecute: traceable(
 				async (execArgs: {
 					workflowId: string;
-					scenario: TestScenario;
+					scenario: ExecutionScenario;
 					workflowJsons: BuildResult['workflowJsons'];
 				}) =>
 					await executeScenario(
@ -332,7 +359,7 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 	});

 	// Work-stealing: each build acquires a lane that isn't already running its
-	// prompt, runs there (capped per-lane), then releases. Scenarios re-use the
+	// fileSlug, runs there (capped per-lane), then releases. Scenarios re-use the
 	// lane that built their workflow.
 	const allocator = new LaneAllocator(laneStates, MAX_CONCURRENT_BUILDS);
 	const buildCache = new Map<
@ -342,7 +369,6 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 	const buildDurations = new Map<string, number>();

 	async function getOrBuild(
-		prompt: string,
 		iteration: number,
 		fileSlug: string,
 	): Promise<{ build: BuildResult; lane: LaneState; buildDurationMs: number }> {
@ -362,31 +388,41 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 				const build = await fetchPrebuiltBuild(lane.runner.client, prebuiltId, logger);
 				const buildDurationMs = Date.now() - start;
 				buildDurations.set(key, buildDurationMs);
+				stashTranscript(build);
 				return { build, lane, buildDurationMs };
 			}
-			// Orchestrator path: allocator is keyed on prompt while the build
-			// cache is keyed on (iter, fileSlug). Granularity intentionally
-			// differs — the allocator wants to spread distinct prompts across
-			// lanes, while the cache dedupes scenarios within one file (which
-			// share both prompt and slug).
-			const lane = await allocator.acquire(prompt);
+			// Orchestrator path: the allocator keeps concurrent builds of one fileSlug on separate
+			// lanes; the build cache, keyed on (iteration, fileSlug), dedupes scenarios in one file.
+			const lane = await allocator.acquire(fileSlug);
+			const entry = conversationByFileSlug.get(fileSlug);
+			if (!entry) throw new Error(`No conversation found for fileSlug=${fileSlug}`);
 			try {
 				const start = Date.now();
-				const build = await lane.tracedBuild(prompt);
+				const build = await lane.tracedBuild({
+					conversation: entry.conversation,
+					messageBudget: entry.messageBudget,
+				});
 				const buildDurationMs = Date.now() - start;
 				buildDurations.set(key, buildDurationMs);
+				stashTranscript(build);
 				return { build, lane, buildDurationMs };
 			} finally {
-				allocator.release(lane, prompt);
+				allocator.release(lane, fileSlug);
 			}
 		})();
 		buildCache.set(key, promise);
 		return await promise;
 	}

+	function stashTranscript(build: BuildResult): void {
+		if (build.threadId && build.transcript) {
+			transcriptByThreadId.set(build.threadId, build.transcript);
+		}
+	}
+
 	const target = async (inputs: TargetInputs): Promise<TargetOutput> => {
 		const iteration = inputs._iteration ?? 0;
-		const scenario: TestScenario = {
+		const scenario: ExecutionScenario = {
 			name: inputs.scenarioName,
 			description: inputs.scenarioDescription,
 			dataSetup: inputs.dataSetup,
@ -397,7 +433,7 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 			build,
 			lane: builtOnLane,
 			buildDurationMs,
-		} = await getOrBuild(inputs.prompt, iteration, inputs.testCaseFile);
+		} = await getOrBuild(iteration, inputs.testCaseFile);

 		if (!build.success || !build.workflowId) {
 			return {
@ -410,6 +446,7 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 				buildDurationMs,
 				execDurationMs: 0,
 				nodeCount: 0,
+				threadId: build.threadId,
 			};
 		}

@ -462,6 +499,7 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 			buildDurationMs,
 			execDurationMs,
 			nodeCount,
+			threadId: build.threadId,
 		};
 	};

@ -546,6 +584,7 @@ async function runWithLangSmith(config: RunConfig): Promise<{
 			experimentResults.results,
 			testCasesWithFiles,
 			args.iterations,
+			transcriptByThreadId,
 		);
 		const evaluation = aggregateResults(allRunResults, args.iterations);

@ -745,6 +784,7 @@ function reshapeLangSmithRuns(
 	rows: Array<{ run: Run }>,
 	testCasesWithFiles: WorkflowTestCaseWithFile[],
 	numIterations: number,
+	transcriptByThreadId: Map<string, TranscriptTurn[]>,
 ): WorkflowTestCaseResult[][] {
 	// Index runs by (iteration, testCaseFile, scenarioName) using the `_iteration`
 	// we injected in expandExamplesForIterations. Falls back to 0 for single-run.
@ -760,16 +800,17 @@ function reshapeLangSmithRuns(
 	for (let iter = 0; iter < numIterations; iter++) {
 		const runResults: WorkflowTestCaseResult[] = [];
 		for (const { testCase, fileSlug } of testCasesWithFiles) {
-			const scenarioResults: ScenarioResult[] = [];
+			const executionScenarioResults: ExecutionScenarioResult[] = [];
 			let workflowBuildSuccess = false;
 			let workflowId: string | undefined;
 			let buildError: string | undefined;
+			let threadId: string | undefined;

-			for (const scenario of testCase.scenarios) {
+			for (const scenario of testCase.executionScenarios) {
 				const run = byKey.get(`${String(iter)}/${fileSlug}/${scenario.name}`);
 				const output = run ? parseTargetOutput(run.outputs) : undefined;
 				if (!run || !output) {
-					scenarioResults.push({
+					executionScenarioResults.push({
 						scenario,
 						success: false,
 						score: 0,
@ -779,8 +820,9 @@ function reshapeLangSmithRuns(
 				}
 				if (output.buildSuccess) workflowBuildSuccess = true;
 				if (output.workflowId) workflowId = output.workflowId;
+				if (output.threadId) threadId = output.threadId;
 				if (!output.buildSuccess && output.reasoning) buildError = output.reasoning;
-				scenarioResults.push({
+				executionScenarioResults.push({
 					scenario,
 					success: output.passed,
 					evalResult: output.evalResult,
@ -791,12 +833,15 @@ function reshapeLangSmithRuns(
 				});
 			}

+			const transcript = threadId ? transcriptByThreadId.get(threadId) : undefined;
 			runResults.push({
 				testCase,
 				workflowBuildSuccess,
 				workflowId,
-				scenarioResults,
+				executionScenarioResults,
 				buildError,
+				threadId,
+				transcript,
 			});
 		}
 		allRunResults.push(runResults);
@ -818,7 +863,7 @@ async function runDirectLoop(config: RunConfig): Promise<MultiRunEvaluation> {
 	}

 	const totalScenarios = testCasesWithFiles.reduce(
-		(sum, { testCase }) => sum + testCase.scenarios.length,
+		(sum, { testCase }) => sum + testCase.executionScenarios.length,
 		0,
 	);
 	logger.info(
@ -880,21 +925,30 @@ async function runDirectLoop(config: RunConfig): Promise<MultiRunEvaluation> {
 * HTML report. Previously we rendered only `tc.runs[0]`, which silently hid
 * iterations 2..N — a flaky scenario that passed once and failed twice would
 * appear clean in the uploaded artifact. For multi-iteration runs we prefix
- * each prompt with its iteration number so the cards are distinguishable at
- * a glance.
+ * the opening user turn with its iteration number so the cards are
+ * distinguishable at a glance.
 */
 function flattenRunsForReport(evaluation: MultiRunEvaluation): WorkflowTestCaseResult[] {
 	if (evaluation.totalRuns <= 1) {
 		return evaluation.testCases.map((tc) => tc.runs[0]);
 	}
 	return evaluation.testCases.flatMap((tc) =>
-		tc.runs.map((run, iter) => ({
-			...run,
-			testCase: {
-				...run.testCase,
-				prompt: `[iter ${String(iter + 1)}/${String(evaluation.totalRuns)}] ${run.testCase.prompt}`,
-			},
-		})),
+		tc.runs.map((run, iter) => {
+			const [opening, ...rest] = run.testCase.conversation;
+			return {
+				...run,
+				testCase: {
+					...run.testCase,
+					conversation: [
+						{
+							...opening,
+							text: `[iter ${String(iter + 1)}/${String(evaluation.totalRuns)}] ${opening.text}`,
+						},
+						...rest,
+					],
+				},
+			};
+		}),
 	);
 }

@ -915,7 +969,7 @@ interface AggregateMetrics {

 function computeAggregateMetrics(evaluation: MultiRunEvaluation): AggregateMetrics {
 	const { totalRuns, testCases } = evaluation;
-	const allScenarios = testCases.flatMap((tc) => tc.scenarios);
+	const allScenarios = testCases.flatMap((tc) => tc.executionScenarios);
 	const total = allScenarios.length;
 	const kIndex = Math.max(totalRuns - 1, 0);
 	const built = testCases.filter((tc) => tc.buildSuccessCount > 0).length;
@ -936,7 +990,7 @@ function computeAggregateMetrics(evaluation: MultiRunEvaluation): AggregateMetri
 /** Pass rate of each iteration formatted as e.g. "37% / 37% / 37%". */
 function computePassRatePerIter(evaluation: MultiRunEvaluation): string {
 	const { totalRuns, testCases } = evaluation;
-	const allScenarios = testCases.flatMap((tc) => tc.scenarios);
+	const allScenarios = testCases.flatMap((tc) => tc.executionScenarios);
 	if (allScenarios.length === 0) return '';
 	const rates: string[] = [];
 	for (let i = 0; i < totalRuns; i++) {
@ -987,10 +1041,10 @@ function writeEvalResults(
 		comparisonStatus: outcome?.kind ?? 'not_attempted',
 		comparisonError: outcome?.kind === 'fetch_failed' ? outcome.error : undefined,
 		testCases: testCases.map((tc) => ({
-			name: tc.testCase.prompt.slice(0, 70),
+			name: tc.testCase.conversation[0].text.slice(0, 70),
 			buildSuccessCount: tc.buildSuccessCount,
 			totalRuns,
-			scenarios: tc.scenarios.map((sa) => ({
+			scenarios: tc.executionScenarios.map((sa) => ({
 				name: sa.scenario.name,
 				passCount: sa.passCount,
 				totalRuns,
@ -1117,11 +1171,11 @@ function bucketFromEvaluation(
 		const fileSlug = slugByTestCase.get(tc.testCase);
 		if (!fileSlug) {
 			throw new Error(
-				`bucketFromEvaluation: no fileSlug for test case "${tc.testCase.prompt.slice(0, 60)}"`,
+				`bucketFromEvaluation: no fileSlug for test case "${tc.testCase.conversation[0].text.slice(0, 60)}"`,
 			);
 		}
 		const total = tc.runs.length;
-		for (const sa of tc.scenarios) {
+		for (const sa of tc.executionScenarios) {
 			const key = `${fileSlug}/${sa.scenario.name}`;
 			const failureCategories: Record<string, number> = {};
 			for (const sr of sa.runs) {
--- a/packages/@n8n/instance-ai/evaluations/cli/lane-allocator.ts
+++ b/packages/@n8n/instance-ai/evaluations/cli/lane-allocator.ts
@ -1,14 +1,14 @@
 // Pull-based lane allocator. Each lane caps at `maxConcurrentBuilds` and never
-// runs the same prompt twice concurrently — pairing those rules eliminates the
-// same-prompt concentration that breaks the agent under load.
+// runs the same key twice concurrently — pairing those rules eliminates the
+// same-key concentration that breaks the agent under load.

 export interface AllocatableLane {
 	activeBuilds: number;
-	inflightPrompts: Set<string>;
+	inflightKeys: Set<string>;
 }

 interface Waiter<L> {
-	prompt: string;
+	key: string;
 	resolve: (lane: L) => void;
 }

@ -20,50 +20,50 @@ export class LaneAllocator<L extends AllocatableLane> {
 		private readonly maxConcurrentBuilds: number,
 	) {}

-	async acquire(prompt: string): Promise<L> {
-		const lane = this.findFree(prompt);
+	async acquire(key: string): Promise<L> {
+		const lane = this.findFree(key);
 		if (lane) {
-			this.markBusy(lane, prompt);
+			this.markBusy(lane, key);
 			return lane;
 		}
 		return await new Promise<L>((resolve) => {
-			this.waiters.push({ prompt, resolve });
+			this.waiters.push({ key, resolve });
 		});
 	}

-	release(lane: L, prompt: string): void {
+	release(lane: L, key: string): void {
 		lane.activeBuilds--;
-		lane.inflightPrompts.delete(prompt);
+		lane.inflightKeys.delete(key);
 		this.wakeNext(lane);
 	}

-	private findFree(prompt: string): L | undefined {
+	private findFree(key: string): L | undefined {
 		// Least-loaded policy: spread builds evenly across lanes rather than
 		// filling lane 0 to cap before touching lane 1. Avoids hot-spotting.
 		let best: L | undefined;
 		for (const lane of this.lanes) {
-			if (!this.canRun(lane, prompt)) continue;
+			if (!this.canRun(lane, key)) continue;
 			if (best === undefined || lane.activeBuilds < best.activeBuilds) best = lane;
 		}
 		return best;
 	}

-	private canRun(lane: L, prompt: string): boolean {
-		return lane.activeBuilds < this.maxConcurrentBuilds && !lane.inflightPrompts.has(prompt);
+	private canRun(lane: L, key: string): boolean {
+		return lane.activeBuilds < this.maxConcurrentBuilds && !lane.inflightKeys.has(key);
 	}

-	private markBusy(lane: L, prompt: string): void {
+	private markBusy(lane: L, key: string): void {
 		lane.activeBuilds++;
-		lane.inflightPrompts.add(prompt);
+		lane.inflightKeys.add(key);
 	}

 	private wakeNext(lane: L): void {
 		// Wake the first waiter this lane can now serve. FIFO ordering.
 		for (let i = 0; i < this.waiters.length; i++) {
 			const w = this.waiters[i];
-			if (this.canRun(lane, w.prompt)) {
+			if (this.canRun(lane, w.key)) {
 				this.waiters.splice(i, 1);
-				this.markBusy(lane, w.prompt);
+				this.markBusy(lane, w.key);
 				w.resolve(lane);
 				return;
 			}
--- a/packages/@n8n/instance-ai/evaluations/comparison/format.ts
+++ b/packages/@n8n/instance-ai/evaluations/comparison/format.ts
@ -221,7 +221,7 @@ function formatAggregateBlock(
 	comparison?: ComparisonResult,
 ): string {
 	if (!comparison) {
-		const allScenarios = evaluation.testCases.flatMap((tc) => tc.scenarios);
+		const allScenarios = evaluation.testCases.flatMap((tc) => tc.executionScenarios);
 		const passed = allScenarios.reduce((sum, sa) => sum + sa.passCount, 0);
 		const total = allScenarios.reduce((sum, sa) => sum + sa.runs.length, 0);
 		const rate = total > 0 ? (passed / total) * 100 : 0;
@ -371,23 +371,23 @@ function renderPerTestCaseDetails(
 	lines.push('');
 	const renderName = (tc: TestCaseAggregation): string => {
 		const slug = slugByTestCase?.get(tc.testCase);
-		return slug ? `\`${slug}\`` : `\`${tc.testCase.prompt.slice(0, 70)}\``;
+		return slug ? `\`${slug}\`` : `\`${tc.testCase.conversation[0].text.slice(0, 70)}\``;
 	};
 	if (totalRuns > 1) {
 		lines.push(`| Workflow | Built | pass@${totalRuns} | pass^${totalRuns} |`);
 		lines.push('|---|---|---|---|');
 		for (const tc of testCases) {
-			const meanPassAtK = tc.scenarios.length
+			const meanPassAtK = tc.executionScenarios.length
 				? Math.round(
-						(tc.scenarios.reduce((sum, sa) => sum + (sa.passAtK[totalRuns - 1] ?? 0), 0) /
-							tc.scenarios.length) *
+						(tc.executionScenarios.reduce((sum, sa) => sum + (sa.passAtK[totalRuns - 1] ?? 0), 0) /
+							tc.executionScenarios.length) *
 							100,
 					)
 				: 0;
-			const meanPassHatK = tc.scenarios.length
+			const meanPassHatK = tc.executionScenarios.length
 				? Math.round(
-						(tc.scenarios.reduce((sum, sa) => sum + (sa.passHatK[totalRuns - 1] ?? 0), 0) /
-							tc.scenarios.length) *
+						(tc.executionScenarios.reduce((sum, sa) => sum + (sa.passHatK[totalRuns - 1] ?? 0), 0) /
+							tc.executionScenarios.length) *
 							100,
 					)
 				: 0;
@ -400,8 +400,8 @@ function renderPerTestCaseDetails(
 		lines.push('|---|---|---|');
 		for (const tc of testCases) {
 			const built = tc.runs[0]?.workflowBuildSuccess ? '✓' : '✗';
-			const passed = tc.scenarios.filter((sa) => sa.runs[0]?.success).length;
-			const total = tc.scenarios.length;
+			const passed = tc.executionScenarios.filter((sa) => sa.runs[0]?.success).length;
+			const total = tc.executionScenarios.length;
 			lines.push(`| ${renderName(tc)} | ${built} | ${passed}/${total} |`);
 		}
 	}
@ -476,7 +476,7 @@ function renderFailureDetails(
 	}> = [];
 	for (const tc of evaluation.testCases) {
 		const fileSlug = slugByTestCase?.get(tc.testCase);
-		for (const sa of tc.scenarios) {
+		for (const sa of tc.executionScenarios) {
 			const failedRuns = sa.runs
 				.filter((r) => !r.success)
 				.map((r) => ({ category: r.failureCategory, reasoning: r.reasoning }));
@ -493,7 +493,7 @@ function renderFailureDetails(
 	for (const { tc, fileSlug, scenarioName, failedRuns } of failed) {
 		const slug = fileSlug
 			? `${fileSlug}/${scenarioName}`
-			: `${tc.testCase.prompt.slice(0, 50).trim()} / ${scenarioName}`;
+			: `${tc.testCase.conversation[0].text.slice(0, 50).trim()} / ${scenarioName}`;
 		lines.push(`**\`${slug}\`** — ${failedRuns.length} failed`);
 		for (const fr of failedRuns) {
 			const tag = fr.category ? ` [${fr.category}]` : '';
@ -534,7 +534,7 @@ function buildFailedRunsIndex(
 	for (const tc of evaluation.testCases) {
 		const fileSlug = slugByTestCase.get(tc.testCase);
 		if (!fileSlug) continue; // testCase not in the slug map — skip rather than misattribute
-		for (const sa of tc.scenarios) {
+		for (const sa of tc.executionScenarios) {
 			const failedRuns: FailedRunDetail[] = [];
 			sa.runs.forEach((r, i) => {
 				if (!r.success) {
@ -748,7 +748,7 @@ function formatTerminalAggregate(
 ): string[] {
 	const lines: string[] = [];
 	if (!comparison) {
-		const allScenarios = evaluation.testCases.flatMap((tc) => tc.scenarios);
+		const allScenarios = evaluation.testCases.flatMap((tc) => tc.executionScenarios);
 		const passed = allScenarios.reduce((sum, sa) => sum + sa.passCount, 0);
 		const total = allScenarios.reduce((sum, sa) => sum + sa.runs.length, 0);
 		const rate = total > 0 ? (passed / total) * 100 : 0;
@ -803,24 +803,30 @@ function formatTerminalPerTestCase(

 	const nameOf = (tc: TestCaseAggregation, max: number): string => {
 		const slug = slugByTestCase?.get(tc.testCase);
-		return slug ?? tc.testCase.prompt.slice(0, max);
+		return slug ?? tc.testCase.conversation[0].text.slice(0, max);
 	};

 	if (totalRuns > 1) {
 		const rows = testCases.map((tc) => {
 			const meanPassAtK =
-				tc.scenarios.length > 0
+				tc.executionScenarios.length > 0
 					? Math.round(
-							(tc.scenarios.reduce((sum, sa) => sum + (sa.passAtK[totalRuns - 1] ?? 0), 0) /
-								tc.scenarios.length) *
+							(tc.executionScenarios.reduce(
+								(sum, sa) => sum + (sa.passAtK[totalRuns - 1] ?? 0),
+								0,
+							) /
+								tc.executionScenarios.length) *
 								100,
 						)
 					: 0;
 			const meanPassHatK =
-				tc.scenarios.length > 0
+				tc.executionScenarios.length > 0
 					? Math.round(
-							(tc.scenarios.reduce((sum, sa) => sum + (sa.passHatK[totalRuns - 1] ?? 0), 0) /
-								tc.scenarios.length) *
+							(tc.executionScenarios.reduce(
+								(sum, sa) => sum + (sa.passHatK[totalRuns - 1] ?? 0),
+								0,
+							) /
+								tc.executionScenarios.length) *
 								100,
 						)
 					: 0;
@ -871,7 +877,7 @@ function formatTerminalPerTestCase(
 			lines.push(TERMINAL_INDENT + `${nameOf(tc, 70)}…`);
 			lines.push(TERMINAL_INDENT + `  ${buildStatus}${r.workflowId ? ` (${r.workflowId})` : ''}`);
 			if (r.buildError) lines.push(TERMINAL_INDENT + `  error: ${r.buildError.slice(0, 200)}`);
-			for (const sa of tc.scenarios) {
+			for (const sa of tc.executionScenarios) {
 				const sr = sa.runs[0];
 				const status = sr.success ? 'PASS' : 'FAIL';
 				const category = sr.failureCategory ? ` [${sr.failureCategory}]` : '';
--- a/packages/@n8n/instance-ai/evaluations/computer-use/cli.ts
+++ b/packages/@n8n/instance-ai/evaluations/computer-use/cli.ts
@ -19,7 +19,7 @@ import { ensureDaemon } from './daemon';
 import { formatTokens } from './formatting';
 import { renderHtml } from './report-html';
 import { runScenario } from './runner';
-import type { RunManifest, RunReport, Scenario, ScenarioResult } from './types';
+import type { RunManifest, RunReport, Scenario, ExecutionScenarioResult } from './types';
 import { N8nClient } from '../clients/n8n-client';
 import { createLogger } from '../harness/logger';

@ -256,7 +256,7 @@ async function main(): Promise<void> {
 	);

 	const startedAt = new Date().toISOString();
-	const results: ScenarioResult[] = [];
+	const results: ExecutionScenarioResult[] = [];

 	for (const scenario of scenarios) {
 		const result = await runScenario({
--- a/packages/@n8n/instance-ai/evaluations/computer-use/report-html.ts
+++ b/packages/@n8n/instance-ai/evaluations/computer-use/report-html.ts
@ -13,7 +13,7 @@ import type {
 	GraderResult,
 	RunManifest,
 	RunReport,
-	ScenarioResult,
+	ExecutionScenarioResult,
 } from './types';

 export function renderHtml(report: RunReport): string {
@ -78,7 +78,7 @@ ${report.results.map(renderScenario).join('\n')}
 // Per-scenario card
 // ---------------------------------------------------------------------------

-function renderScenario(result: ScenarioResult): string {
+function renderScenario(result: ExecutionScenarioResult): string {
 	const failedGraders = result.graderResults.filter((g) => !g.pass);
 	const tagChips = (result.scenario.tags ?? [])
 		.map((t) => `<span class="chip">${escapeHtml(t)}</span>`)
@ -164,7 +164,7 @@ function renderAllGraders(results: GraderResult[]): string {
  </div>`;
 }

-function renderToolCalls(r: ScenarioResult): string {
+function renderToolCalls(r: ExecutionScenarioResult): string {
 	if (r.toolCalls.length === 0) {
 		return '<div class="tools"><div class="section-label">Tool calls</div><div class="muted">none</div></div>';
 	}
--- a/packages/@n8n/instance-ai/evaluations/computer-use/runner.ts
+++ b/packages/@n8n/instance-ai/evaluations/computer-use/runner.ts
@ -22,7 +22,7 @@ import type { DaemonInfo } from './daemon';
 import { applyGrader } from './graders';
 import { findFiles } from './graders/fs';
 import { isContained } from './path-utils';
-import type { GraderResult, Scenario, ScenarioResult, ScenarioTrace } from './types';
+import type { GraderResult, Scenario, ExecutionScenarioResult, ScenarioTrace } from './types';
 import type { N8nClient } from '../clients/n8n-client';
 import type { EvalLogger } from '../harness/logger';

@ -39,7 +39,7 @@ export interface RunScenarioOptions {
 	keepData?: boolean;
 }

-export async function runScenario(options: RunScenarioOptions): Promise<ScenarioResult> {
+export async function runScenario(options: RunScenarioOptions): Promise<ExecutionScenarioResult> {
 	const { client, scenario, daemon, logger } = options;
 	const timeoutMs = options.timeoutMs ?? DEFAULT_TIMEOUT_MS;
 	const sandboxDir = daemon.directory;
--- a/packages/@n8n/instance-ai/evaluations/computer-use/types.ts
+++ b/packages/@n8n/instance-ai/evaluations/computer-use/types.ts
@ -245,7 +245,7 @@ export interface GraderResult {
 	reason: string;
 }

-export interface ScenarioResult {
+export interface ExecutionScenarioResult {
 	scenario: Scenario;
 	pass: boolean;
 	graderResults: GraderResult[];
@ -291,5 +291,5 @@ export interface RunReport {
 	finishedAt: string;
 	totalScenarios: number;
 	passCount: number;
-	results: ScenarioResult[];
+	results: ExecutionScenarioResult[];
 }
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/airtable-split-to-slack.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/airtable-split-to-slack.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Every hour, fetch all records from an Airtable table. Use the HTTP Request node to call GET https://api.airtable.com/v0/app123abc/Tasks with a Bearer token auth header — Airtable responds with a JSON object of the shape { \"records\": [...] } where each record has an id and a fields object. For each record, post a message to the Slack channel #daily-tasks containing the task name ({{ fields.Name }}) and status ({{ fields.Status }}). Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "Every hour, get our tasks from Airtable and ping each one to a Slack channel."
+		},
+		{
+			"role": "assistant",
+			"text": "Which Airtable table and which Slack channel?"
+		},
+		{
+			"role": "user",
+			"text": "Table at GET https://api.airtable.com/v0/app123abc/Tasks (Bearer auth via HTTP Request, not the Airtable node). Slack channel #daily-tasks."
+		},
+		{
+			"role": "assistant",
+			"text": "What should each Slack message contain?"
+		},
+		{
+			"role": "user",
+			"text": "The task name and status — both come from the record's `fields` object (fields.Name and fields.Status). Note Airtable returns a JSON envelope `{records: [...]}`, so you'll need to split that into individual items before posting."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "medium",
 	"tags": ["build", "schedule", "http-request", "airtable", "slack", "split-out"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "Airtable returns a named 'records' array with 3 tasks; each is posted to Slack",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/contact-form-automation.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/contact-form-automation.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Create a workflow that handles contact form submissions via a webhook. It should send an auto-reply email to the person who submitted the form, notify my team on Telegram, and log each submission to Google Sheets. For the Google Sheets node use documentId: '1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms' and sheet name: 'Contact Submissions'. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "I want a workflow that handles contact-form submissions coming in over a webhook."
+		},
+		{
+			"role": "assistant",
+			"text": "What should happen on each submission?"
+		},
+		{
+			"role": "user",
+			"text": "Send an auto-reply email to the person who filled the form, ping my team on Telegram, and log the submission to Google Sheets."
+		},
+		{
+			"role": "assistant",
+			"text": "Got it. For the Google Sheet, do you have a specific document and tab I should target?"
+		},
+		{
+			"role": "user",
+			"text": "Yes — documentId '1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms', sheet name 'Contact Submissions'."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "medium",
 	"tags": ["build", "webhook", "gmail", "telegram", "google-sheets", "multi-action"],
 	"triggerType": "webhook",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "A valid contact form submission triggers all 3 actions",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/cross-team-linear-report.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/cross-team-linear-report.json
@ -1,14 +1,36 @@
 {
-	"prompt": "Get all the Linear issues created in the last 2 weeks. Filter them for issues created for a different team than the one the creator is in. I have this team mapping to use: Alice (alice@company.com) belongs to both Frontend AND AI, Bob (bob@company.com) to Backend, Carol (carol@company.com) to Backend. Store this mapping in the workflow (note that a person can belong to multiple teams) and use it to filter cross-team issues — an issue is cross-team only if its team is NOT in the creator's team list. Then calculate the number of tickets per creator and post the list (ordered descending) to a Slack channel called #cross-team-reports. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "I want a workflow that reports who on the team is filing Linear issues into teams they don't belong to — sort of a cross-team contribution report, posted to Slack."
+		},
+		{
+			"role": "assistant",
+			"text": "What's the time window, and how do you map people to teams?"
+		},
+		{
+			"role": "user",
+			"text": "Last 2 weeks. Map by email: Alice (alice@company.com) → Frontend AND AI, Bob (bob@company.com) → Backend, Carol (carol@company.com) → Backend. Store this mapping inside the workflow."
+		},
+		{
+			"role": "assistant",
+			"text": "What counts as cross-team — issue team different from any of the creator's teams, or only different from their primary?"
+		},
+		{
+			"role": "user",
+			"text": "An issue is cross-team only if its team is NOT in the creator's team list (so Alice filing into AI is fine, but Alice filing into Backend counts). Then count per creator, sort descending by count, post the list to Slack #cross-team-reports."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "complex",
 	"tags": ["build", "linear", "slack", "schedule", "data-processing"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "Multiple issues from different creators, some cross-team, some same-team",
 			"dataSetup": "The data source returns 8 issues total. Within the last 2 weeks: Alice (alice@company.com) created 3 issues for team 'Backend' and 1 for team 'Frontend', Bob (bob@company.com) created 1 issue for team 'Frontend', Carol (carol@company.com) created 1 issue for team 'Backend'. Outside the 2-week window (3 weeks ago): Alice created 1 issue for team 'Backend', Bob created 1 issue for team 'Frontend'. The people/teams data maps Alice to team 'Frontend', Bob to team 'Backend', and Carol to team 'Backend'. The Slack post node returns a success response.",
-			"successCriteria": "The workflow executes without errors. Only the 6 issues from the last 2 weeks are processed \u2014 the 2 older issues are excluded. The cross-team filter correctly identifies Alice's 3 Backend issues and Bob's 1 Frontend issue as cross-team. Carol's issue is filtered out (same team). The count per creator shows Alice: 3, Bob: 1. The list is sorted descending by count. The result is posted to Slack."
+			"successCriteria": "The workflow executes without errors. Only the 6 issues from the last 2 weeks are processed — the 2 older issues are excluded. The cross-team filter correctly identifies Alice's 3 Backend issues and Bob's 1 Frontend issue as cross-team. Carol's issue is filtered out (same team). The count per creator shows Alice: 3, Bob: 1. The list is sorted descending by count. The result is posted to Slack."
 		},
 		{
 			"name": "multi-team-creator",
@ -20,20 +42,13 @@
 			"name": "no-cross-team-issues",
 			"description": "All issues are created for the creator's own team",
 			"dataSetup": "The data source returns 4 issues from the last 2 weeks: Alice (alice@company.com) created 2 issues for team 'Frontend', Bob (bob@company.com) created 2 issues for team 'Backend'. The people/teams data maps Alice to 'Frontend' and Bob to 'Backend'. All issues match the creator's team. The Slack post node returns a success response.",
-			"successCriteria": "The workflow executes without errors. The cross-team filter removes all issues. The workflow handles the empty result gracefully \u2014 either posting a 'no cross-team issues' message or completing without error."
+			"successCriteria": "The workflow executes without errors. The cross-team filter removes all issues. The workflow handles the empty result gracefully — either posting a 'no cross-team issues' message or completing without error."
 		},
 		{
 			"name": "unknown-creator",
 			"description": "An issue creator is not in the people/teams list",
 			"dataSetup": "The data source returns 4 issues from the last 2 weeks. Two are by Alice (alice@company.com, mapped to team 'Frontend'), two are by Dave (dave@company.com) who is not in the people/teams data at all. The Slack post node returns a success response.",
 			"successCriteria": "The workflow handles the unknown creator without crashing. Dave's issues are either excluded from the cross-team report or handled with a sensible default. Alice's cross-team issues are still correctly processed."
-		},
-		{
-			"name": "api-error",
-			"description": "Linear API returns an authentication error",
-			"dataSetup": "The Linear/data source node returns an authentication error. The Slack post node returns a success response.",
-			"successCriteria": "The workflow handles the API error gracefully. It should not crash silently or post empty/misleading data to Slack. The error is either reported or the workflow stops cleanly.",
-			"requires": "mock-server"
 		}
 	]
 }
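Review note: the cross-team rule spelled out in the conversation above (an issue counts only if its team is not in the creator's team list, then count per creator and sort descending) can be sketched as below. The `Issue` shape and `crossTeamCounts` are hypothetical and do not reflect the Linear node's real output. Skipping unmapped creators is an assumption, chosen because the `unknown-creator` scenario accepts exclusion as one valid behaviour.

```typescript
// Hypothetical shapes for illustration; the generated workflow's real data
// comes from the Linear node and will look different.
interface Issue {
	creatorEmail: string;
	team: string;
}

// Mapping taken from the fixture conversation: a person can be on several teams.
const teamsByEmail: Record<string, string[]> = {
	'alice@company.com': ['Frontend', 'AI'],
	'bob@company.com': ['Backend'],
	'carol@company.com': ['Backend'],
};

function crossTeamCounts(
	issues: Issue[],
	mapping: Record<string, string[]>,
): Array<[string, number]> {
	const counts = new Map<string, number>();
	for (const issue of issues) {
		const teams = mapping[issue.creatorEmail];
		// Assumption: creators missing from the mapping are skipped.
		if (teams === undefined) continue;
		// Same-team issue: the issue's team is in the creator's team list.
		if (teams.includes(issue.team)) continue;
		counts.set(issue.creatorEmail, (counts.get(issue.creatorEmail) ?? 0) + 1);
	}
	// Sort descending by count.
	return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

With this mapping, Alice filing into AI or Frontend is not counted, while Alice filing into Backend is.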
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/daily-slack-summary.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/daily-slack-summary.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Every day, get the posts made in the past day on 3 different Slack channels: #general (C04GENERAL01), #engineering (C04ENGINEER1), and #product (C04PRODUCT01). Summarize them using AI, and post the summary on #daily-digest (C04DAILYDG01). Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "Every day, fetch the last day of messages from a few Slack channels, summarize them with an LLM, and post the summary to another Slack channel."
+		},
+		{
+			"role": "assistant",
+			"text": "Which channels should I read from, and where should the summary go?"
+		},
+		{
+			"role": "user",
+			"text": "Read from #general (C04GENERAL01), #engineering (C04ENGINEER1), and #product (C04PRODUCT01). Post the summary to #daily-digest (C04DAILYDG01)."
+		},
+		{
+			"role": "assistant",
+			"text": "Any preference on what time of day the digest goes out?"
+		},
+		{
+			"role": "user",
+			"text": "Morning is fine."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "medium",
 	"tags": ["build", "slack", "ai", "schedule"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "All 3 channels return messages, AI summarizes, posts to channel",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/form-to-hubspot.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/form-to-hubspot.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Create a form that collects: name, email, company, and interest level (dropdown: starter, professional, enterprise). When submitted, create a new contact in HubSpot with firstname, lastname (split from name), email, company, and a custom property 'interest_level'. Then send a confirmation email via SendGrid to the submitted email address — the subject should be 'Thanks for reaching out, {name}!' and the body should mention their company. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "We need a lead-capture form on our site that pipes new sign-ups into our CRM and sends them a thank-you email."
+		},
+		{
+			"role": "assistant",
+			"text": "What fields on the form, and what CRM are you using?"
+		},
+		{
+			"role": "user",
+			"text": "HubSpot. Form fields: name, email, company, and an interest-level dropdown with options starter / professional / enterprise."
+		},
+		{
+			"role": "assistant",
+			"text": "Plan: form trigger → HubSpot create contact (firstname + lastname split from `name`, email, company, custom property `interest_level`) → SendGrid confirmation email. What should the email say?"
+		},
+		{
+			"role": "user",
+			"text": "Subject 'Thanks for reaching out, {name}!' and the body should mention their company."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "medium",
 	"tags": ["build", "form-trigger", "hubspot", "sendgrid", "crm"],
 	"triggerType": "form",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "A complete form submission creates a HubSpot contact and sends confirmation",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/github-notion-sync.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/github-notion-sync.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Every day, fetch all open GitHub issues from repository 'acme-corp/backend' that have the label 'bug'. For each issue, create a page in a Notion database (database ID: 'a1b2c3d4e5f6789012345678abcdef01') with properties: Name (issue title), URL (issue html_url), Created (issue created_at date), Assignee (assignee login name or 'Unassigned'), and Status set to 'Open'. Use the HTTP Request node to call the GitHub API directly (GET https://api.github.com/repos/acme-corp/backend/issues?labels=bug&state=open) with a Bearer token authorization header. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "We track bugs in GitHub but our PM wants a Notion view. Build a daily workflow that creates a Notion page per open bug."
+		},
+		{
+			"role": "assistant",
+			"text": "Which repo, which Notion database, and what filter on the GitHub side?"
+		},
+		{
+			"role": "user",
+			"text": "Repo 'acme-corp/backend', open issues with the 'bug' label only. Notion database ID is 'a1b2c3d4e5f6789012345678abcdef01'."
+		},
+		{
+			"role": "assistant",
+			"text": "Which fields should each Notion page carry, and how should I authenticate to GitHub?"
+		},
+		{
+			"role": "user",
+			"text": "Per page: Name (issue title), URL (html_url), Created (issue created_at), Assignee (assignee login, or 'Unassigned' if none), Status='Open'. Use the HTTP Request node — call GET https://api.github.com/repos/acme-corp/backend/issues?labels=bug&state=open with a Bearer token header."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "complex",
 	"tags": ["build", "schedule", "http-request", "notion", "github-api", "data-sync"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "GitHub returns issues, each is synced to Notion",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/index.ts
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/index.ts
@ -30,13 +30,15 @@ function parseSubstringList(value: string | undefined): string[] {
 }

 function getJsonFiles(filter?: string, exclude?: string): string[] {
-	const dir = __dirname;
-	let files = readdirSync(dir).filter((f) => f.endsWith('.json'));
+	const allFiles = readdirSync(__dirname)
+		.filter((f) => f.endsWith('.json'))
+		.map((f) => join(__dirname, f));

+	let files = allFiles;
 	const includeTokens = parseSubstringList(filter);
 	if (includeTokens.length > 0) {
 		files = files.filter((f) => {
-			const lower = f.toLowerCase();
+			const lower = basename(f).toLowerCase();
 			return includeTokens.some((t) => lower.includes(t));
 		});
 	}
@ -44,12 +46,12 @@ function getJsonFiles(filter?: string, exclude?: string): string[] {
 	const excludeTokens = parseSubstringList(exclude);
 	if (excludeTokens.length > 0) {
 		files = files.filter((f) => {
-			const lower = f.toLowerCase();
+			const lower = basename(f).toLowerCase();
 			return !excludeTokens.some((t) => lower.includes(t));
 		});
 	}

-	return files.map((f) => join(dir, f));
+	return files;
 }

 /** Load test cases with their file slugs (for LangSmith dataset sync derived IDs). */
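Review note: because `getJsonFiles` now joins the directory onto each file name before filtering, the filters match on `basename(f)`, not on the full path. Otherwise a token that appears in a directory name would match every file. A minimal sketch of the difference follows. `filterByToken` and the `/repo/...` path are hypothetical, and tokens are assumed to be lowercase already.

```typescript
import { basename, join } from 'node:path';

// Hypothetical helper mirroring the include-filter step; not the repo's function.
// Assumes tokens are already lowercase.
function filterByToken(paths: string[], tokens: string[]): string[] {
	if (tokens.length === 0) return paths;
	return paths.filter((p) => {
		// Match against the file name only, never the directory portion.
		const lower = basename(p).toLowerCase();
		return tokens.some((t) => lower.includes(t));
	});
}

// Illustrative directory; note that it contains the word "workflows".
const dir = join('/repo', 'evaluations', 'data', 'workflows');
const paths = ['weather-alert.json', 'form-to-hubspot.json'].map((f) => join(dir, f));

// 'workflows' occurs in every full path but in no file name, so nothing matches.
const byDirToken = filterByToken(paths, ['workflows']);
// 'weather' occurs only in the first file name.
const byFileToken = filterByToken(paths, ['weather']);
```

Matching on the full path instead would make `byDirToken` contain both files, which is the failure mode the `basename` calls avoid.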
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/linear-bq-leaderboard.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/linear-bq-leaderboard.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Every two weeks I want to check the amount of n8n usage and bug reporting that the team has done and produce a leaderboard that then gets posted to Slack (channel ID: D034WT7G4CW).\n\nHere are the users in the team:\n\n- David Roberts (id: 1)\n- David Arens (id: 2)\n- Niklas Hatje (id: 3)\n\nHere is an example leaderboard:\n\n```\nUsage in the last two weeks:\n\nJonathan Clift: 7 tickets (5 execs, 3 hours)\nFabian Puehringer: 7 tickets (4 execs, 1 hours)\nTuukka Kantola: 6 tickets (16 execs, 6 hours)\n\nTickets = Linear bug tickets created\nExecs = Manual execs on registered accounts\n```\n\nIt is ordered by the number of tickets created (desc) then the number of execs (desc).\n\nTo get the number of bugs that a user has reported, query Linear and get the number of issues created by them in any team that have the `bug` label (case-sensitive), matched by name.\n\nTo get the number of hours that each user was using n8n for, connect to BigQuery and use something similar to the following query:\n\nwith\nsettings as (\n    select\n        timestamp('<start_cutoff>') as start_cutoff,\n        timestamp('<end_cutoff>') as end_cutoff,\n),\nuser_accounts as (\n    select * from unnest([\n        struct<name string, user_id string>\n        ...\n    ])\n),\nnode_exec as (\n    select\n        timestamp,\n        timestamp_trunc(timestamp, hour) as timestamp_hour,\n        name,\n        f.user_id,\n        f.instance_id,\n        status,\n    from rudder_schema.manual_node_exec_finished f\n    inner join user_accounts a on a.user_id = f.user_id\n    cross join settings\n    where f.timestamp between start_cutoff and end_cutoff\n),\nworkflow_exec as (\n    select\n        timestamp,\n        timestamp_trunc(timestamp, hour) as timestamp_hour,\n        name,\n        f.user_id,\n        f.instance_id,\n        status,\n    from rudder_schema.manual_workflow_exec_finished f\n    inner join user_accounts a on a.user_id = f.user_id\n    cross join settings\n    where f.timestamp between start_cutoff and end_cutoff\n),\nexec as (\n    select * from node_exec\n    union all\n    select * from workflow_exec\n),\nexec_summary as (\n    select\n        name,\n        count(distinct instance_id) as instances,\n        count(distinct timestamp_hour) as hours,\n        count(*) as manual_execs,\n    from exec\n    group by 1\n)\n\nselect * from exec_summary\n\nConfigure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "I want a workflow that posts a bi-weekly Slack message ranking my team by how engaged they are with our product."
+		},
+		{
+			"role": "assistant",
+			"text": "How should engagement be measured, and where should the ranking post?"
+		},
+		{
+			"role": "user",
+			"text": "Each person's score combines their Linear bugs filed (any team, `bug` label, case-sensitive, matched by name) and their n8n usage hours from BigQuery. Order the list by bugs filed (desc), then by execs (desc). Team is David Roberts (id 1), David Arens (id 2), Niklas Hatje (id 3). Post to Slack channel D034WT7G4CW."
+		},
+		{
+			"role": "assistant",
+			"text": "What format should the Slack message use, and do you have a BigQuery query I should reuse?"
+		},
+		{
+			"role": "user",
+			"text": "Format like this:\n\n```\nUsage in the last two weeks:\n\nJonathan Clift: 7 tickets (5 execs, 3 hours)\nFabian Puehringer: 7 tickets (4 execs, 1 hours)\nTuukka Kantola: 6 tickets (16 execs, 6 hours)\n\nTickets = Linear bug tickets created\nExecs = Manual execs on registered accounts\n```\n\nBigQuery:\n\nwith\nsettings as (\n    select\n        timestamp('<start_cutoff>') as start_cutoff,\n        timestamp('<end_cutoff>') as end_cutoff,\n),\nuser_accounts as (\n    select * from unnest([\n        struct<name string, user_id string>\n        ...\n    ])\n),\nnode_exec as (\n    select\n        timestamp,\n        timestamp_trunc(timestamp, hour) as timestamp_hour,\n        name,\n        f.user_id,\n        f.instance_id,\n        status,\n    from rudder_schema.manual_node_exec_finished f\n    inner join user_accounts a on a.user_id = f.user_id\n    cross join settings\n    where f.timestamp between start_cutoff and end_cutoff\n),\nworkflow_exec as (\n    select\n        timestamp,\n        timestamp_trunc(timestamp, hour) as timestamp_hour,\n        name,\n        f.user_id,\n        f.instance_id,\n        status,\n    from rudder_schema.manual_workflow_exec_finished f\n    inner join user_accounts a on a.user_id = f.user_id\n    cross join settings\n    where f.timestamp between start_cutoff and end_cutoff\n),\nexec as (\n    select * from node_exec\n    union all\n    select * from workflow_exec\n),\nexec_summary as (\n    select\n        name,\n        count(distinct instance_id) as instances,\n        count(distinct timestamp_hour) as hours,\n        count(*) as manual_execs,\n    from exec\n    group by 1\n)\n\nselect * from exec_summary"
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "simple",
 	"tags": ["build", "schedule", "http-request", "bigquery"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "Leaderboard is posted to Slack",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/notification-router.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/notification-router.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Create a workflow that receives webhook notifications with a JSON body containing 'level' (high, medium, or low), 'title', and 'message'. Route them based on level: high priority goes to Microsoft Teams (use team ID '9b4c3a2f-1d8e-4f5b-a6c7-8e9f0b1d2c3a' and channel ID '19:a1b2c3d4e5f6@thread.tacv2'), medium goes to Slack channel #notifications, and low goes to Gmail (send to alerts@ourcompany.com). Each notification should include the title and message from the payload. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "I want a webhook that fans out incoming notifications to different places depending on how urgent they are."
+		},
+		{
+			"role": "assistant",
+			"text": "What does each incoming payload look like, and what are the urgency levels?"
+		},
+		{
+			"role": "user",
+			"text": "JSON with `level` (high, medium, or low), `title`, and `message`. Each notification should be forwarded with title + message intact."
+		},
+		{
+			"role": "assistant",
+			"text": "Where should each level go?"
+		},
+		{
+			"role": "user",
+			"text": "High → Microsoft Teams (team ID '9b4c3a2f-1d8e-4f5b-a6c7-8e9f0b1d2c3a', channel ID '19:a1b2c3d4e5f6@thread.tacv2'). Medium → Slack #notifications. Low → email to alerts@ourcompany.com via Gmail."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "medium",
 	"tags": ["build", "webhook", "switch", "microsoft-teams", "slack", "gmail", "routing"],
 	"triggerType": "webhook",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "high-priority",
 			"description": "A high-priority notification is routed to Microsoft Teams",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/rest-api-data-pipeline.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/rest-api-data-pipeline.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Fetch the latest posts from the JSONPlaceholder API (GET https://jsonplaceholder.typicode.com/posts). Filter out any posts where the title contains the word 'qui'. Then post a summary message to a Slack channel called #api-digest that says how many posts remain and lists their titles. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "Daily, fetch posts from an API, filter some out, and post the rest to a Slack channel."
+		},
+		{
+			"role": "assistant",
+			"text": "Which API endpoint, what's the filter, and which Slack channel?"
+		},
+		{
+			"role": "user",
+			"text": "GET https://jsonplaceholder.typicode.com/posts. Drop any post whose title contains the word 'qui'. Post to #api-digest."
+		},
+		{
+			"role": "assistant",
+			"text": "What should the Slack message look like?"
+		},
+		{
+			"role": "user",
+			"text": "It should say how many posts remain after filtering and list each remaining post's title."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "medium",
 	"tags": ["build", "http-request", "slack", "data-transformation", "schedule"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "API returns posts, some contain 'qui' and get filtered, summary posted to Slack",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/set-edit-fields-contract.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/set-edit-fields-contract.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Every day, fetch one post from the JSONPlaceholder API (GET https://jsonplaceholder.typicode.com/posts/1). Then use an Edit Fields (Set) node, not a Code node, to add a field called caption from the post title and a field called source with the value jsonplaceholder, while preserving all original fields from the HTTP response. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "Every day fetch a post from JSONPlaceholder and add two fields to it before passing it on."
+		},
+		{
+			"role": "assistant",
+			"text": "Which post, and which two fields?"
+		},
+		{
+			"role": "user",
+			"text": "GET https://jsonplaceholder.typicode.com/posts/1. Add `caption` from the post title, and `source` with the literal value 'jsonplaceholder'. Keep all the original fields too."
+		},
+		{
+			"role": "assistant",
+			"text": "Plan: schedule daily → HTTP GET the post → Code node to compose the new object with caption, source, and the original fields preserved. Sound good?"
+		},
+		{
+			"role": "user",
+			"text": "No Code node — use Edit Fields (Set). That's exactly what it's for. Preserve original fields via the includeOtherFields option, not by manually re-mapping them."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "medium",
 	"tags": ["build", "schedule", "http-request", "set", "data-transformation"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "preserve-fields",
 			"description": "HTTP data is reshaped with Set/Edit Fields while preserving the original response fields",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/telegram-chatbot-memory-session.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/telegram-chatbot-memory-session.json
@ -1,8 +1,30 @@
 {
-	"prompt": "Build a Telegram chatbot workflow for a family assistant. It should receive Telegram messages, answer with an AI Agent using an OpenAI chat model, keep short-term conversation memory scoped separately for each Telegram chat, and send the AI Agent's answer back to the same Telegram chat. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "I want to build a Telegram chatbot for my family. It should be an AI assistant that can answer questions."
+		},
+		{
+			"role": "assistant",
+			"text": "What model should the AI Agent use, and should the bot remember prior messages in the conversation?"
+		},
+		{
+			"role": "user",
+			"text": "OpenAI chat model. Yes, it should remember the last few messages — but the memory needs to be separate per Telegram chat, so that if my sister and my mum both talk to it, they each get their own thread."
+		},
+		{
+			"role": "assistant",
+			"text": "Plan: Telegram Trigger → AI Agent (OpenAI chat model + memory) → Telegram send back to the same chat. Memory will be scoped by the trigger's chat id. Good?"
+		},
+		{
+			"role": "user",
+			"text": "Yes."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "medium",
 	"tags": ["build", "telegram", "chatbot", "ai-agent", "memory", "expressions"],
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "distinct-telegram-chat",
 			"description": "A Telegram message from one chat is answered with memory scoped to that chat id",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/weather-alert.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/weather-alert.json
@ -1,9 +1,31 @@
 {
-	"prompt": "Every day at 8am, check the weather in Berlin using the OpenMeteo API and send me an email to david@thedavid.co.uk using the gmail node if it's going to rain",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "I want a daily rain warning every morning by email."
+		},
+		{
+			"role": "assistant",
+			"text": "What city, what time, and which email address?"
+		},
+		{
+			"role": "user",
+			"text": "Berlin, 8am, david@thedavid.co.uk. Use Gmail to send it."
+		},
+		{
+			"role": "assistant",
+			"text": "I'll use the OpenMeteo API for the forecast. Any precipitation should trigger the email, or only above a threshold?"
+		},
+		{
+			"role": "user",
+			"text": "Any rain at all."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "simple",
 	"tags": ["build", "schedule", "http-request", "gmail", "conditional"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "Email is sent warning of rain",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/weather-monitoring.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/weather-monitoring.json
@ -1,5 +1,27 @@
 {
-	"prompt": "Every hour, check the current weather for London, New York, and Tokyo using the OpenWeatherMap API. Use 3 separate HTTP Request nodes, one per city. If any city has a temperature above 30°C, send a Telegram alert to chat ID -1001234567890 listing the hot cities. Log all readings to an Airtable table (base ID: 'appK2xGfGNOIrl2gT', table ID: 'tbl8xK3nP5mQ7rS9W') with columns: city, temperature, humidity, timestamp. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "I want a workflow that monitors weather hourly for three cities and alerts me if it gets too hot anywhere."
+		},
+		{
+			"role": "assistant",
+			"text": "Which cities, what defines 'too hot', and where should the alert go?"
+		},
+		{
+			"role": "user",
+			"text": "London, New York, Tokyo. Alert if any city is above 30°C. Send the alert via Telegram to chat ID -1001234567890."
+		},
+		{
+			"role": "assistant",
+			"text": "I'll use OpenWeatherMap, one HTTP Request per city. Should I log every reading anywhere, or only the hot ones?"
+		},
+		{
+			"role": "user",
+			"text": "Log all of them to Airtable — base ID 'appK2xGfGNOIrl2gT', table ID 'tbl8xK3nP5mQ7rS9W'. Columns: city, temperature, humidity, timestamp."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "complex",
 	"tags": [
 		"build",
@ -11,7 +33,7 @@
 		"multi-request"
 	],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "All 3 cities return weather data, one is above 30°C",
--- a/packages/@n8n/instance-ai/evaluations/data/workflows/workflow-data-table.json
+++ b/packages/@n8n/instance-ai/evaluations/data/workflows/workflow-data-table.json
@ -1,9 +1,31 @@
 {
-	"prompt": "I want you to build a workflow that will read n8n workflow databases and extract certain information and then populate that information in a data table called 'workflows'.\n\nThe schema of the data table should be as follows:\n- instanceId\n- workflowId\n- workflowName\n- tags\n\nIf the workflow is run multiple times it should update the current rows rather than creating dupes.\n\nThe instance with the workflows is https://wonderman.users.n8n.cloud/. Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later.",
+	"conversation": [
+		{
+			"role": "user",
+			"text": "I'd like to pull workflow metadata from another n8n instance into a data table on this one. Mostly for reporting."
+		},
+		{
+			"role": "assistant",
+			"text": "Which instance, and what columns should the data table have?"
+		},
+		{
+			"role": "user",
+			"text": "Source instance is https://wonderman.users.n8n.cloud/. The data table is called 'workflows' and should have instanceId, workflowId, workflowName, tags."
+		},
+		{
+			"role": "assistant",
+			"text": "Got it. Should it re-fetch on a schedule? And how should it handle workflows that already exist in the table — re-insert, or update in place?"
+		},
+		{
+			"role": "user",
+			"text": "Yes, run on a schedule. And it must update in place — no duplicate rows when the same workflow comes through twice. Use upsert."
+		}
+	],
+	"messageBudget": 6,
 	"complexity": "complex",
 	"tags": ["build", "schedule", "data-table", "n8n-api"],
 	"triggerType": "schedule",
-	"scenarios": [
+	"executionScenarios": [
 		{
 			"name": "happy-path",
 			"description": "Data table is populated",
--- a/packages/@n8n/instance-ai/evaluations/discovery/runner.ts
+++ b/packages/@n8n/instance-ai/evaluations/discovery/runner.ts
@ -264,7 +264,7 @@ function createStubOrchestrationContext(
 	const taskStorage: TaskStorage = {
 		// eslint-disable-next-line @typescript-eslint/require-await
 		get: async (): Promise<TaskList | null> => null,
-		// eslint-disable-next-line @typescript-eslint/require-await
+
 		save: async (): Promise<void> => {},
 	};

--- a/packages/@n8n/instance-ai/evaluations/harness/chat-loop.ts
+++ b/packages/@n8n/instance-ai/evaluations/harness/chat-loop.ts
@ -17,6 +17,8 @@ import type { EvalLogger } from './logger';
 import type { N8nClient } from '../clients/n8n-client';
 import { consumeSseStream } from '../clients/sse-client';
 import type { CapturedEvent } from '../types';
+import { getEventPayload, tryInfrastructureResponse } from '../utils/confirmation-payload';
+import { getNestedRecord } from '../utils/safe-extract';

 // ---------------------------------------------------------------------------
 // Constants
@ -63,6 +65,10 @@ export async function startSseConnection(
 // Wait for all activity: run-finish -> background tasks -> possible new run
 // ---------------------------------------------------------------------------

+export type ConfirmationStrategy = (
+	event: CapturedEvent,
+) => InstanceAiConfirmRequest | Promise<InstanceAiConfirmRequest>;
+
 export interface WaitConfig {
 	client: N8nClient;
 	threadId: string;
@ -71,9 +77,18 @@ export interface WaitConfig {
 	startTime: number;
 	timeoutMs: number;
 	logger: EvalLogger;
+	confirmationStrategy?: ConfirmationStrategy;
+	/** Per-conversation retry count by requestId. Auto-allocated when omitted. */
+	confirmationRetries?: Map<string, number>;
+	/** Caller-supplied sink for proxy confirmation payloads, keyed by requestId. */
+	proxyResponses?: Map<string, InstanceAiConfirmRequest>;
 }

 export async function waitForAllActivity(config: WaitConfig): Promise<void> {
+	// Allocate the retries map once per conversation if the caller didn't
+	// pass one; per-call allocation would reset attempt counts every poll.
+	config.confirmationRetries ??= new Map<string, number>();
+
 	let runFinishCount = 0;

 	while (true) {
@ -172,14 +187,54 @@ async function waitForBackgroundTasks(config: WaitConfig, timeoutMs: number): Pr
 	);
 }

+// ---------------------------------------------------------------------------
+// Multi-turn conversation loop
+// ---------------------------------------------------------------------------
+
+export type NextMessageDecision = { kind: 'followUp'; message: string } | { kind: 'done' };
+
+export interface MultiTurnConfig extends WaitConfig {
+	nextMessageDecider: () => Promise<NextMessageDecision>;
+}
+
+export async function runMultiTurnConversation(config: MultiTurnConfig): Promise<void> {
+	while (true) {
+		await waitForAllActivity(config);
+
+		if (Date.now() - config.startTime > config.timeoutMs) {
+			config.logger.verbose(
+				`[multi-turn] Timeout reached after ${String(Date.now() - config.startTime)}ms — exiting loop`,
+			);
+			return;
+		}
+
+		const decision = await config.nextMessageDecider();
+		if (decision.kind === 'done') {
+			config.logger.verbose('[multi-turn] Proxy returned done — exiting loop');
+			return;
+		}
+
+		config.logger.verbose(
+			`[multi-turn] Sending follow-up: ${decision.message.slice(0, 80)}${decision.message.length > 80 ? '...' : ''}`,
+		);
+		try {
+			await config.client.sendMessage(config.threadId, decision.message);
+		} catch (error: unknown) {
+			const msg = error instanceof Error ? error.message : String(error);
+			config.logger.verbose(`[multi-turn] sendMessage failed: ${msg} — exiting loop`);
+			return;
+		}
+	}
+}
+
 // ---------------------------------------------------------------------------
 // Confirmation auto-approval
 // ---------------------------------------------------------------------------

-const confirmationRetries = new Map<string, number>();
-
 export async function processConfirmationRequests(config: WaitConfig): Promise<void> {
 	const confirmationEvents = config.events.filter((e) => e.type === 'confirmation-request');
+	const strategy = config.confirmationStrategy ?? buildAutoApprovePayload;
+	const retries = config.confirmationRetries ?? new Map<string, number>();

 	for (const event of confirmationEvents) {
 		const requestId = extractConfirmationRequestId(event);
@ -187,24 +242,26 @@ export async function processConfirmationRequests(config: WaitConfig): Promise<v
 			continue;
 		}

-		const retryCount = confirmationRetries.get(requestId) ?? 0;
+		const retryCount = retries.get(requestId) ?? 0;
 		if (retryCount >= MAX_CONFIRMATION_RETRIES) {
 			continue;
 		}

 		if (retryCount === 0) {
-			config.logger.verbose(`[auto-approve] Approving confirmation: ${requestId}`);
+			config.logger.verbose(`[confirm] Responding to confirmation: ${requestId}`);
 		}

 		try {
-			await config.client.confirmAction(requestId, buildAutoApprovePayload(event));
+			const payload = await strategy(event);
+			await config.client.confirmAction(requestId, payload);
 			config.approvedRequests.add(requestId);
-			confirmationRetries.delete(requestId);
+			config.proxyResponses?.set(requestId, payload);
+			retries.delete(requestId);
 		} catch (error: unknown) {
-			confirmationRetries.set(requestId, retryCount + 1);
+			retries.set(requestId, retryCount + 1);
 			const msg = error instanceof Error ? error.message : String(error);
 			config.logger.verbose(
-				`[auto-approve] Failed to approve ${requestId} (attempt ${String(retryCount + 1)}/${String(MAX_CONFIRMATION_RETRIES)}): ${msg}`,
+				`[confirm] Failed to respond to ${requestId} (attempt ${String(retryCount + 1)}/${String(MAX_CONFIRMATION_RETRIES)}): ${msg}`,
 			);
 		}
 	}
@ -214,32 +271,15 @@ export async function processConfirmationRequests(config: WaitConfig): Promise<v
 *  matching kind. The eval runner has no real credentials and no human in the loop —
 *  we just need a structurally-valid payload that lets the agent proceed. */
 export function buildAutoApprovePayload(event: CapturedEvent): InstanceAiConfirmRequest {
-	const payload = getNestedRecord(event.data, 'payload') ?? {};
+	const infra = tryInfrastructureResponse(event);
+	if (infra) return infra;

-	if (getNestedRecord(payload, 'domainAccess')) {
-		return { kind: 'domainAccessApprove', domainAccessAction: 'allow_all' };
-	}
-
-	const resourceDecision = getNestedRecord(payload, 'resourceDecision');
-	if (resourceDecision) {
-		const options = Array.isArray(resourceDecision.options)
-			? (resourceDecision.options as unknown[]).filter((o): o is string => typeof o === 'string')
-			: [];
-		const allowOption = options.find((o) => o.toLowerCase().includes('allow')) ?? options[0];
-		return {
-			kind: 'resourceDecision',
-			resourceDecision: isResourceDecision(allowOption) ? allowOption : 'allowOnce',
-		};
-	}
+	const payload = getEventPayload(event);

 	if (Array.isArray(payload.setupRequests)) {
 		return { kind: 'setupWorkflowApply' };
 	}

-	if (Array.isArray(payload.credentialRequests)) {
-		return { kind: 'credentialSelection', credentials: {} };
-	}
-
 	if (payload.inputType === 'questions') {
 		return { kind: 'questions', answers: [] };
 	}
@ -295,14 +335,3 @@ export function extractAgentId(event: CapturedEvent): string | undefined {

 	return undefined;
 }
-
-function getNestedRecord(
-	obj: Record<string, unknown>,
-	key: string,
-): Record<string, unknown> | undefined {
-	const value = obj[key];
-	if (typeof value === 'object' && value !== null && !Array.isArray(value)) {
-		return value as Record<string, unknown>;
-	}
-	return undefined;
-}
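Reviewer sketch (not part of the patch): a simplified, synchronous restatement of the per-conversation retry bookkeeping that `processConfirmationRequests` now reads from `config.confirmationRetries`. The real code is async and calls `client.confirmAction`. Here `respond` is a hypothetical callback, the cap of 3 is an assumed value because `MAX_CONFIRMATION_RETRIES` is not shown in this diff, and the already-approved guard is assumed because it sits outside the shown hunk.

```typescript
// Assumed cap for illustration; the real MAX_CONFIRMATION_RETRIES is not shown in the diff.
const SKETCH_MAX_RETRIES = 3;

// One polling pass over pending request IDs. `retries` must be shared across
// passes (one map per conversation); a fresh map per pass would reset the counts.
function confirmationPassSketch(
	requestIds: string[],
	approved: Set<string>,
	retries: Map<string, number>,
	respond: (requestId: string) => void,
): void {
	for (const requestId of requestIds) {
		// Assumed guard: skip requests that were already answered.
		if (approved.has(requestId)) continue;
		const attempts = retries.get(requestId) ?? 0;
		// Stop retrying once the cap is reached.
		if (attempts >= SKETCH_MAX_RETRIES) continue;
		try {
			respond(requestId);
			approved.add(requestId);
			retries.delete(requestId);
		} catch {
			retries.set(requestId, attempts + 1);
		}
	}
}

// Usage: a responder that always fails, polled five times with a shared map.
const sketchApproved = new Set<string>();
const sketchRetries = new Map<string, number>();
let sketchAttempts = 0;
for (let pass = 0; pass < 5; pass++) {
	confirmationPassSketch(['req-1'], sketchApproved, sketchRetries, () => {
		sketchAttempts++;
		throw new Error('simulated failure');
	});
}
```

Because the map outlives each pass, the failing responder is only attempted up to the cap. This is the behaviour the `??=` allocation in `waitForAllActivity` is meant to preserve.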
--- a/packages/@n8n/instance-ai/evaluations/harness/runner.ts
+++ b/packages/@n8n/instance-ai/evaluations/harness/runner.ts
@ -6,25 +6,36 @@
 // LLM-mocked HTTP, checklist verification, and result aggregation.
 // ---------------------------------------------------------------------------

-import type { InstanceAiEvalExecutionResult } from '@n8n/api-types';
+import type { InstanceAiConfirmRequest, InstanceAiEvalExecutionResult } from '@n8n/api-types';
 import crypto from 'node:crypto';
 import { setTimeout as delay } from 'node:timers/promises';

-import { SSE_SETTLE_DELAY_MS, startSseConnection, waitForAllActivity } from './chat-loop';
+import {
+	SSE_SETTLE_DELAY_MS,
+	startSseConnection,
+	waitForAllActivity,
+	runMultiTurnConversation,
+	type ConfirmationStrategy,
+} from './chat-loop';
 import { type EvalLogger } from './logger';
 import { fetchPrebuiltBuild } from './prebuilt-workflows';
 import { verifyChecklist } from '../checklist/verifier';
 import type { N8nClient, WorkflowResponse } from '../clients/n8n-client';
-import { extractOutcomeFromEvents } from '../outcome/event-parser';
+import { buildConversationMetrics, extractOutcomeFromEvents } from '../outcome/event-parser';
+import { buildTranscriptFromEvents } from '../outcome/transcript-from-events';
 import { buildAgentOutcome, extractWorkflowIdsFromMessages } from '../outcome/workflow-discovery';
 import type {
 	ChecklistItem,
 	CapturedEvent,
-	ScenarioResult,
-	TestScenario,
+	ConversationMetrics,
+	ConversationTurn,
+	ExecutionScenarioResult,
+	ExecutionScenario,
+	TranscriptTurn,
 	WorkflowTestCase,
 	WorkflowTestCaseResult,
 } from '../types';
+import { UserProxyLlm, type ProxyDecisionStats } from '../utils/user-proxy';

 // ---------------------------------------------------------------------------
 // Constants
@ -69,14 +80,15 @@ export async function runWorkflowTestCase(
 	const result: WorkflowTestCaseResult = {
 		testCase,
 		workflowBuildSuccess: false,
-		scenarioResults: [],
+		executionScenarioResults: [],
 	};

 	const build = config.prebuiltWorkflowId
 		? await fetchPrebuiltBuild(client, config.prebuiltWorkflowId, logger)
 		: await buildWorkflow({
 				client,
-				prompt: testCase.prompt,
+				conversation: testCase.conversation,
+				messageBudget: testCase.messageBudget,
 				timeoutMs,
 				preRunWorkflowIds: config.preRunWorkflowIds,
 				claimedWorkflowIds: config.claimedWorkflowIds,
@ -84,6 +96,16 @@ export async function runWorkflowTestCase(
 				laneTag: config.laneTag,
 			});

+	if (build.conversationMetrics) {
+		result.conversationMetrics = build.conversationMetrics;
+	}
+	if (build.threadId) {
+		result.threadId = build.threadId;
+	}
+	if (build.transcript) {
+		result.transcript = build.transcript;
+	}
+
 	if (!build.success || !build.workflowId) {
 		result.buildError = build.error;
 		return result;
@ -94,8 +116,8 @@ export async function runWorkflowTestCase(
 	result.workflowJson = build.workflowJsons[0];

 	const scenarioStart = Date.now();
-	result.scenarioResults = await runWithConcurrency(
-		testCase.scenarios,
+	result.executionScenarioResults = await runWithConcurrency(
+		testCase.executionScenarios,
 		async (scenario) => {
 			try {
 				return await executeScenario(
@ -114,7 +136,7 @@ export async function runWorkflowTestCase(
 					success: false,
 					score: 0,
 					reasoning: `Error: ${errorMessage}`,
-				} satisfies ScenarioResult;
+				} satisfies ExecutionScenarioResult;
 			}
 		},
 		MAX_CONCURRENT_SCENARIOS,
@ -122,7 +144,7 @@ export async function runWorkflowTestCase(

 	const scenarioMs = Date.now() - scenarioStart;
 	logger.info(
-		`  Scenarios done: ${String(result.scenarioResults.length)} scenarios [${String(Math.round(scenarioMs / 1000))}s]${config.laneTag ?? ''}`,
+		`  Scenarios done: ${String(result.executionScenarioResults.length)} scenarios [${String(Math.round(scenarioMs / 1000))}s]${config.laneTag ?? ''}`,
 	);

 	if (!config.keepWorkflows) {
@ -132,6 +154,64 @@ export async function runWorkflowTestCase(
 	return result;
 }

+// ---------------------------------------------------------------------------
+// Multi-turn driver — wires UserProxyLlm into runMultiTurnConversation
+// ---------------------------------------------------------------------------
+
+interface MultiTurnDriverConfig {
+	client: N8nClient;
+	threadId: string;
+	conversation: ConversationTurn[];
+	messageBudget?: number;
+	events: CapturedEvent[];
+	approvedRequests: Set<string>;
+	startTime: number;
+	timeoutMs: number;
+	logger: EvalLogger;
+	proxyResponses?: Map<string, InstanceAiConfirmRequest>;
+	followUpMessagesOut?: string[];
+}
+
+async function driveMultiTurnConversation(
+	config: MultiTurnDriverConfig,
+): Promise<ProxyDecisionStats> {
+	const openingMessage = config.conversation[0]?.text ?? '';
+
+	const proxy = new UserProxyLlm({
+		conversation: config.conversation,
+		messageBudget: config.messageBudget,
+		logger: config.logger,
+	});
+
+	const confirmationStrategy: ConfirmationStrategy = proxy.respondToConfirmation.bind(proxy);
+
+	const nextMessageDecider = async () => {
+		proxy.ingestEvents(config.events);
+		const decision = await proxy.decideFollowUp();
+		if (decision.kind === 'followUp') {
+			config.followUpMessagesOut?.push(decision.message);
+		}
+		return decision;
+	};
+
+	await config.client.sendMessage(config.threadId, openingMessage);
+
+	await runMultiTurnConversation({
+		client: config.client,
+		threadId: config.threadId,
+		events: config.events,
+		approvedRequests: config.approvedRequests,
+		startTime: config.startTime,
+		timeoutMs: config.timeoutMs,
+		logger: config.logger,
+		confirmationStrategy,
+		nextMessageDecider,
+		proxyResponses: config.proxyResponses,
+	});
+
+	return { ...proxy.getDecisionStats() };
+}
+
 // ---------------------------------------------------------------------------
 // Split API: build once, run scenarios independently
 // ---------------------------------------------------------------------------
@ -144,11 +224,27 @@ export interface BuildResult {
 	/** IDs to pass to cleanupBuild() */
 	createdWorkflowIds: string[];
 	createdDataTableIds: string[];
+	/** Per-turn deterministic counters extracted from the captured event stream. */
+	conversationMetrics?: ConversationMetrics;
+	/** The thread id used during the build — keys the LangSmith trace lookup. */
+	threadId?: string;
+	/** Counts of UserProxyLlm decisions by category (multi-turn builds only). */
+	proxyDecisionStats?: ProxyDecisionStats;
+	/** Chat-style transcript built from the SSE event stream + proxy responses. */
+	transcript?: TranscriptTurn[];
 }

 export interface BuildWorkflowConfig {
 	client: N8nClient;
-	prompt: string;
+	/**
+	 * Hand-authored conversation. ≥1 turn, first turn must be `user`.
+	 *
+	 * - One user turn, no assistant turns → auto-approve all confirmations.
+	 * - Anything else → UserProxyLlm engages.
+	 */
+	conversation: ConversationTurn[];
+	/** Max follow-up messages the proxy will send. Ignored in auto-approve mode. */
+	messageBudget?: number;
 	timeoutMs?: number;
 	preRunWorkflowIds: Set<string>;
 	claimedWorkflowIds: Set<string>;
@ -157,12 +253,21 @@ export interface BuildWorkflowConfig {
 	laneTag?: string;
 }

+/** A conversation is multi-turn if it has more than one turn, or if the only
+ *  turn is from the assistant. Empty conversations are treated as single-turn. */
+function isMultiTurnConversation(conversation: ConversationTurn[]): boolean {
+	if (conversation.length === 0) return false;
+	if (conversation.length > 1) return true;
+	return conversation[0].role !== 'user';
+}
+
 /**
 * Build a workflow via Instance AI. Returns the workflow ID for use with
 * executeScenario(). Call cleanupBuild() when done.
 */
 export async function buildWorkflow(config: BuildWorkflowConfig): Promise<BuildResult> {
-	const { client, prompt, logger } = config;
+	const { client, conversation, logger } = config;
+	const openingMessage = conversation[0]?.text ?? '';
 	const threadId = `eval-${crypto.randomUUID()}`;
 	const startTime = Date.now();
 	const timeoutMs = config.timeoutMs ?? DEFAULT_TIMEOUT_MS;
@ -170,31 +275,62 @@ export async function buildWorkflow(config: BuildWorkflowConfig): Promise<BuildR
 	const abortController = new AbortController();
 	const events: CapturedEvent[] = [];
 	const approvedRequests = new Set<string>();
+	const proxyResponses = new Map<string, InstanceAiConfirmRequest>();
+	const followUpMessages: string[] = [];

 	try {
 		const buildStart = Date.now();
-		logger.info(`  Building workflow: "${truncate(prompt, 60)}"${config.laneTag ?? ''}`);
+		const isMultiTurn = isMultiTurnConversation(conversation);
+		logger.info(
+			`  Building workflow${isMultiTurn ? ' [multi-turn]' : ''}: "${truncate(openingMessage, 60)}"${config.laneTag ?? ''}`,
+		);

 		const ssePromise = startSseConnection(client, threadId, events, abortController.signal).catch(
 			() => {},
 		);

 		await delay(SSE_SETTLE_DELAY_MS);
-		await client.sendMessage(threadId, prompt);

-		await waitForAllActivity({
-			client,
-			threadId,
-			events,
-			approvedRequests,
-			startTime,
-			timeoutMs,
-			logger,
-		});
+		let proxyDecisionStats: ProxyDecisionStats | undefined;
+		if (isMultiTurn) {
+			proxyDecisionStats = await driveMultiTurnConversation({
+				client,
+				threadId,
+				conversation,
+				messageBudget: config.messageBudget,
+				events,
+				approvedRequests,
+				startTime,
+				timeoutMs,
+				logger,
+				proxyResponses,
+				followUpMessagesOut: followUpMessages,
+			});
+		} else {
+			await client.sendMessage(threadId, openingMessage);
+			await waitForAllActivity({
+				client,
+				threadId,
+				events,
+				approvedRequests,
+				startTime,
+				timeoutMs,
+				logger,
+				proxyResponses,
+			});
+		}

 		abortController.abort();
 		await ssePromise.catch(() => {});

+		const conversationMetrics = buildConversationMetrics(events);
+		const transcript = buildTranscriptFromEvents({
+			events,
+			openingMessage,
+			followUpMessages,
+			proxyResponses,
+		});
+
 		let threadMessages;
 		try {
 			threadMessages = await client.getThreadMessages(threadId);
@ -259,12 +395,17 @@ export async function buildWorkflow(config: BuildWorkflowConfig): Promise<BuildR
 				workflowJsons: [],
 				createdWorkflowIds: [],
 				createdDataTableIds: outcome.dataTablesCreated,
+				conversationMetrics,
+				threadId,
+				proxyDecisionStats,
+				transcript,
 			};
 		}

 		const buildMs = Date.now() - buildStart;
+		const proxySuffix = formatProxyStatsSuffix(proxyDecisionStats);
 		logger.info(
-			`  Workflow built: ${outcome.workflowsCreated[0].name} (${String(outcome.workflowsCreated[0].nodeCount)} nodes) [${String(Math.round(buildMs / 1000))}s]`,
+			`  Workflow built: ${outcome.workflowsCreated[0].name} (${String(outcome.workflowsCreated[0].nodeCount)} nodes) [${String(Math.round(buildMs / 1000))}s]${isMultiTurn ? ` (${String(conversationMetrics.turnCount)} turn${conversationMetrics.turnCount === 1 ? '' : 's'})` : ''}${proxySuffix}`,
 		);

 		return {
@ -273,30 +414,45 @@ export async function buildWorkflow(config: BuildWorkflowConfig): Promise<BuildR
 			workflowJsons: outcome.workflowJsons,
 			createdWorkflowIds: outcome.workflowsCreated.map((wf) => wf.id),
 			createdDataTableIds: outcome.dataTablesCreated,
+			conversationMetrics,
+			threadId,
+			proxyDecisionStats,
+			transcript,
 		};
 	} catch (error: unknown) {
 		abortController.abort();
+		// Try to surface partial metrics so timeouts still produce a per-turn report.
+		const conversationMetrics = events.length > 0 ? buildConversationMetrics(events) : undefined;
 		return {
 			success: false,
 			error: error instanceof Error ? error.message : String(error),
 			workflowJsons: [],
 			createdWorkflowIds: [],
 			createdDataTableIds: [],
+			conversationMetrics,
+			threadId,
 		};
 	}
 }

+function formatProxyStatsSuffix(stats: ProxyDecisionStats | undefined): string {
+	if (!stats) return '';
+	const entries = Object.entries(stats).sort(([, a], [, b]) => b - a);
+	if (entries.length === 0) return '';
+	return ` [proxy: ${entries.map(([k, v]) => `${k}=${String(v)}`).join(', ')}]`;
+}
+
 /**
 * Execute a single scenario against a pre-built workflow and verify the result.
 */
 export async function executeScenario(
 	client: N8nClient,
 	workflowId: string,
-	scenario: TestScenario,
+	scenario: ExecutionScenario,
 	workflowJsons: WorkflowResponse[],
 	logger: EvalLogger,
 	timeoutMs?: number,
-): Promise<ScenarioResult> {
+): Promise<ExecutionScenarioResult> {
 	return await runScenario(client, scenario, workflowId, workflowJsons, logger, timeoutMs);
 }

@ -339,12 +495,12 @@ export async function cleanupBuild(

 async function runScenario(
 	client: N8nClient,
-	scenario: TestScenario,
+	scenario: ExecutionScenario,
 	workflowId: string,
 	workflowJsons: WorkflowResponse[],
 	logger: EvalLogger,
 	timeoutMs?: number,
-): Promise<ScenarioResult> {
+): Promise<ExecutionScenarioResult> {
 	const execStart = Date.now();
 	const evalResult = await client.executeWithLlmMock(workflowId, scenario.dataSetup, timeoutMs);
 	const execMs = Date.now() - execStart;
@ -407,7 +563,7 @@ async function runScenario(
 * and pre-analysis flags so the verifier can diagnose root causes.
 */
 function buildVerificationArtifact(
-	scenario: TestScenario,
+	scenario: ExecutionScenario,
 	evalResult: InstanceAiEvalExecutionResult,
 	workflowJsons: WorkflowResponse[],
 ): string {
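Reviewer sketch (not part of the patch): the mode-selection rule from `isMultiTurnConversation`, restated as a standalone function. `SketchTurn`, `BuildMode`, and `selectBuildMode` are hypothetical names used for illustration only. The rule itself comes from the diff: a lone user turn or an empty conversation takes the auto-approve path, and anything else engages the user proxy.

```typescript
// Hypothetical, simplified stand-in for ConversationTurn.
interface SketchTurn {
	role: 'user' | 'assistant';
	text: string;
}

// Hypothetical label for the two build paths in buildWorkflow().
type BuildMode = 'auto-approve' | 'user-proxy';

function selectBuildMode(conversation: SketchTurn[]): BuildMode {
	const first = conversation[0];
	// Empty conversations are treated as single-turn.
	if (first === undefined) return 'auto-approve';
	// Any scripted back-and-forth engages the proxy.
	if (conversation.length > 1) return 'user-proxy';
	// Exactly one turn: only a user turn counts as single-turn.
	return first.role === 'user' ? 'auto-approve' : 'user-proxy';
}

const singleTurnMode = selectBuildMode([{ role: 'user', text: 'Build a weather workflow' }]);
const multiTurnMode = selectBuildMode([
	{ role: 'user', text: 'Pull workflow metadata into a data table' },
	{ role: 'assistant', text: 'Which instance, and what columns?' },
	{ role: 'user', text: "The table is called 'workflows'" },
]);
```

Existing single-prompt test cases can therefore be migrated as a one-element `conversation` without changing how they run.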
--- a/packages/@n8n/instance-ai/evaluations/index.ts
+++ b/packages/@n8n/instance-ai/evaluations/index.ts
@ -33,9 +33,9 @@ export { type EvalLogger, createLogger } from './harness/logger';
 // -- Types --
 export type {
 	WorkflowTestCase,
-	TestScenario,
+	ExecutionScenario,
 	WorkflowTestCaseResult,
-	ScenarioResult,
+	ExecutionScenarioResult,
 	ChecklistItem,
 	ChecklistResult,
 } from './types';
--- a/packages/@n8n/instance-ai/evaluations/langsmith/dataset-sync.ts
+++ b/packages/@n8n/instance-ai/evaluations/langsmith/dataset-sync.ts
@ -13,7 +13,7 @@ import type { Client } from 'langsmith';
 import type { Example, KVMap } from 'langsmith/schemas';
 import { z } from 'zod';

-import { loadWorkflowTestCasesWithFiles } from '../data/workflows';
+import { loadWorkflowTestCasesWithFiles, type WorkflowTestCaseWithFile } from '../data/workflows';
 import type { EvalLogger } from '../harness/logger';

 /**
@ -22,7 +22,6 @@ import type { EvalLogger } from '../harness/logger';
 * workflow a scenario belongs to (metadata is hidden by default).
 */
 export const datasetExampleInputsSchema = z.object({
-	prompt: z.string(),
 	testCaseFile: z.string(),
 	scenarioName: z.string(),
 	scenarioDescription: z.string(),
@ -101,7 +100,6 @@ export async function syncDataset(
 		const derivedId = `${scenario.testCaseFile}/${scenario.scenarioName}`;

 		const inputs: DatasetExampleInputs = {
-			prompt: scenario.prompt,
 			testCaseFile: scenario.testCaseFile,
 			scenarioName: scenario.scenarioName,
 			scenarioDescription: scenario.scenarioDescription,
@ -177,7 +175,6 @@ export async function syncDataset(
 // ---------------------------------------------------------------------------

 interface FlatScenario {
-	prompt: string;
 	testCaseFile: string;
 	scenarioName: string;
 	scenarioDescription: string;
@ -194,32 +191,18 @@ interface FlatScenario {
 * Input:  [tc1(s1,s2,s3), tc2(s1,s2), tc3(s1)]
 * Output: [tc1/s1, tc2/s1, tc3/s1, tc1/s2, tc2/s2, tc1/s3]
 */
-function buildRoundRobinScenarios(
-	testCasesWithFiles: Array<{
-		testCase: {
-			prompt: string;
-			complexity?: 'simple' | 'medium' | 'complex';
-			tags?: string[];
-			triggerType?: 'manual' | 'webhook' | 'schedule' | 'form';
-			scenarios: Array<{
-				name: string;
-				description: string;
-				dataSetup: string;
-				successCriteria: string;
-			}>;
-		};
-		fileSlug: string;
-	}>,
-): FlatScenario[] {
+function buildRoundRobinScenarios(testCasesWithFiles: WorkflowTestCaseWithFile[]): FlatScenario[] {
 	const result: FlatScenario[] = [];
-	const maxScenarios = Math.max(...testCasesWithFiles.map((tc) => tc.testCase.scenarios.length), 0);
+	const maxScenarios = Math.max(
+		...testCasesWithFiles.map((tc) => tc.testCase.executionScenarios.length),
+		0,
+	);

 	for (let i = 0; i < maxScenarios; i++) {
 		for (const { testCase, fileSlug } of testCasesWithFiles) {
-			const scenario = testCase.scenarios[i];
+			const scenario = testCase.executionScenarios[i];
 			if (scenario) {
 				result.push({
-					prompt: testCase.prompt,
 					testCaseFile: fileSlug,
 					scenarioName: scenario.name,
 					scenarioDescription: scenario.description,
@ -241,7 +224,6 @@ function buildRoundRobinScenarios(

 const existingInputsSchema = z
 	.object({
-		prompt: z.string().default(''),
 		testCaseFile: z.string().default(''),
 		scenarioName: z.string().default(''),
 		scenarioDescription: z.string().default(''),
@ -266,7 +248,6 @@ function hasInputsChanged(existing: unknown, incoming: DatasetExampleInputs): bo
 	if (!parsed.success) return true;
 	const e = parsed.data;
 	return (
-		e.prompt !== incoming.prompt ||
 		e.testCaseFile !== incoming.testCaseFile ||
 		e.dataSetup !== incoming.dataSetup ||
 		e.successCriteria !== incoming.successCriteria ||
--- a/packages/@n8n/instance-ai/evaluations/outcome/event-parser.ts
+++ b/packages/@n8n/instance-ai/evaluations/outcome/event-parser.ts
@ -6,9 +6,12 @@ import type {
 	AgentActivity,
 	CapturedEvent,
 	CapturedToolCall,
+	ConversationMetrics,
 	EventOutcome,
 	InstanceAiMetrics,
+	TurnCounter,
 } from '../types';
+import { getNestedRecord as getRecord, getString, isRecord } from '../utils/safe-extract';

 // ---------------------------------------------------------------------------
 // Tool names whose results contain resource IDs we need to track
@ -24,24 +27,6 @@ const WORKFLOW_TOOLS = new Set([
 const EXECUTION_TOOL = 'run-workflow';
 const DATA_TABLE_TOOL = 'create-data-table';

-// ---------------------------------------------------------------------------
-// Type guards for event payloads
-// ---------------------------------------------------------------------------
-
-function isRecord(value: unknown): value is Record<string, unknown> {
-	return typeof value === 'object' && value !== null && !Array.isArray(value);
-}
-
-function getString(obj: Record<string, unknown>, key: string): string | undefined {
-	const value = obj[key];
-	return typeof value === 'string' ? value : undefined;
-}
-
-function getRecord(obj: Record<string, unknown>, key: string): Record<string, unknown> | undefined {
-	const value = obj[key];
-	return isRecord(value) ? value : undefined;
-}
-
 // ---------------------------------------------------------------------------
 // extractOutcomeFromEvents
 // ---------------------------------------------------------------------------
@ -321,6 +306,129 @@ export function buildMetrics(events: CapturedEvent[], startTime: number): Instan
 	};
 }

+// ---------------------------------------------------------------------------
+// Per-turn conversation metrics
+// ---------------------------------------------------------------------------
+
+const PLAN_RECOVERY_TOOL_NAMES = new Set(['plan', 'planWithAgent', 'plan-with-agent']);
+
+export function buildConversationMetrics(events: CapturedEvent[]): ConversationMetrics {
+	const turns = splitEventsIntoTurns(events);
+	const perTurn: TurnCounter[] = [];
+	const seenRequestIds = new Set<string>();
+	const aggregateByKind: Record<string, number> = {};
+	let aggregateTotal = 0;
+
+	for (let i = 0; i < turns.length; i++) {
+		const turnEvents = turns[i];
+		const counter: TurnCounter = {
+			turn: i + 1,
+			toolCallCount: 0,
+			toolErrorCount: 0,
+			confirmationAskedTotal: 0,
+			confirmationAskedByKind: {},
+			replanAfterErrorCount: 0,
+			repeatQuestionCount: 0,
+		};
+
+		const errorPositions: number[] = [];
+		const planRecoveryPositions: number[] = [];
+
+		for (let j = 0; j < turnEvents.length; j++) {
+			const event = turnEvents[j];
+			const payload = getRecord(event.data, 'payload') ?? event.data;
+
+			switch (event.type) {
+				case 'tool-call': {
+					counter.toolCallCount++;
+					const toolName = getString(payload, 'toolName');
+					if (toolName && PLAN_RECOVERY_TOOL_NAMES.has(toolName)) {
+						planRecoveryPositions.push(j);
+					}
+					break;
+				}
+				case 'tool-error': {
+					counter.toolErrorCount++;
+					errorPositions.push(j);
+					break;
+				}
+				case 'tasks-update': {
+					planRecoveryPositions.push(j);
+					break;
+				}
+				case 'confirmation-request': {
+					counter.confirmationAskedTotal++;
+					aggregateTotal++;
+					const inputType = getString(payload, 'inputType') ?? 'approval';
+					counter.confirmationAskedByKind[inputType] =
+						(counter.confirmationAskedByKind[inputType] ?? 0) + 1;
+					aggregateByKind[inputType] = (aggregateByKind[inputType] ?? 0) + 1;
+					const requestId = getString(payload, 'requestId');
+					if (requestId) {
+						if (seenRequestIds.has(requestId)) {
+							counter.repeatQuestionCount++;
+						} else {
+							seenRequestIds.add(requestId);
+						}
+					}
+					break;
+				}
+				case 'run-finish': {
+					counter.runFinishStatus = getString(payload, 'status') ?? counter.runFinishStatus;
+					break;
+				}
+				default:
+					break;
+			}
+		}
+
+		for (const errPos of errorPositions) {
+			if (planRecoveryPositions.some((recPos) => recPos > errPos)) {
+				counter.replanAfterErrorCount++;
+			}
+		}
+
+		perTurn.push(counter);
+	}
+
+	const turnCount = countEvents(events, 'run-finish');
+	const lastTurn = perTurn[perTurn.length - 1];
+	const reachedRunFinishCleanly = lastTurn?.runFinishStatus === 'completed';
+
+	return {
+		turnCount,
+		perTurn,
+		confirmationAskedTotal: aggregateTotal,
+		confirmationAskedByKind: aggregateByKind,
+		reachedRunFinishCleanly,
+	};
+}
+
+/** Split events into turns. Each turn begins at a `run-start` event; events
+ *  before the first `run-start` form a leading pseudo-turn (unusual but handled). */
+export function splitEventsIntoTurns(events: CapturedEvent[]): CapturedEvent[][] {
+	const turns: CapturedEvent[][] = [];
+	let current: CapturedEvent[] = [];
+	for (const event of events) {
+		if (event.type === 'run-start' && current.length > 0) {
+			turns.push(current);
+			current = [event];
+		} else if (event.type === 'run-start') {
+			current = [event];
+		} else {
+			current.push(event);
+		}
+	}
+	if (current.length > 0) turns.push(current);
+	return turns;
+}
+
+function countEvents(events: CapturedEvent[], type: string): number {
+	let n = 0;
+	for (const event of events) if (event.type === type) n++;
+	return n;
+}
+
 // ---------------------------------------------------------------------------
 // Internal helpers
 // ---------------------------------------------------------------------------
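Reviewer sketch (not part of the patch): a standalone restatement of the turn-splitting rule in `splitEventsIntoTurns`, assuming a simplified `SketchEvent` type (hypothetical; the real `CapturedEvent` carries more fields). It shows how events that arrive before the first `run-start` end up in a leading pseudo-turn.

```typescript
// Hypothetical, simplified stand-in for CapturedEvent.
interface SketchEvent {
	type: string;
}

// Start a new turn at every `run-start`; anything before the first
// `run-start` is kept as a leading pseudo-turn.
function splitIntoTurnsSketch(events: SketchEvent[]): SketchEvent[][] {
	const turns: SketchEvent[][] = [];
	let current: SketchEvent[] = [];
	for (const event of events) {
		if (event.type === 'run-start' && current.length > 0) {
			turns.push(current);
			current = [];
		}
		current.push(event);
	}
	if (current.length > 0) turns.push(current);
	return turns;
}

const sketchTurns = splitIntoTurnsSketch([
	{ type: 'text-delta' }, // arrives before any run-start
	{ type: 'run-start' },
	{ type: 'tool-call' },
	{ type: 'run-finish' },
	{ type: 'run-start' },
	{ type: 'run-finish' },
]);
const sketchTurnSizes = sketchTurns.map((turn) => turn.length);
```

In this input the split yields three groups while only two `run-finish` events exist, so `perTurn.length` and `turnCount` can differ whenever a leading pseudo-turn is present.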
--- a/packages/@n8n/instance-ai/evaluations/outcome/transcript-from-events.ts
+++ b/packages/@n8n/instance-ai/evaluations/outcome/transcript-from-events.ts
@ -0,0 +1,269 @@
+/**
+ * Build a chat-style transcript from the captured SSE event stream + the
+ * proxy's confirmation responses. In-process, no LangSmith roundtrip.
+ *
+ * Reasoning/thinking blocks aren't included: they live only in the agent's
+ * LangSmith LLM-run outputs and are not forwarded over SSE.
+ */
+
+import type { InstanceAiConfirmRequest } from '@n8n/api-types';
+
+import type {
+	AskUserAnswer,
+	AskUserQuestion,
+	CapturedEvent,
+	PlanTask,
+	SetupWizardCompletedNode,
+	SetupWizardSkippedNode,
+	ToolInteraction,
+	TranscriptTurn,
+} from '../types';
+import { splitEventsIntoTurns } from './event-parser';
+import { getNestedRecord as getRecord, getString, isRecord } from '../utils/safe-extract';
+
+type ProxyResponses = Map<string, InstanceAiConfirmRequest>;
+
+export interface BuildTranscriptOptions {
+	events: CapturedEvent[];
+	openingMessage?: string;
+	followUpMessages?: string[];
+	proxyResponses?: ProxyResponses;
+}
+
+export function buildTranscriptFromEvents(opts: BuildTranscriptOptions): TranscriptTurn[] {
+	const { events, openingMessage, followUpMessages = [], proxyResponses } = opts;
+	if (events.length === 0) return [];
+
+	const userMessages: string[] = [];
+	if (openingMessage) userMessages.push(openingMessage);
+	userMessages.push(...followUpMessages);
+
+	const turns: TranscriptTurn[] = [];
+	for (const turnEvents of splitEventsIntoTurns(events)) {
+		const turn = buildTurn(turnEvents, userMessages.shift(), proxyResponses);
+		if (turn.userMessage || turn.agentText || turn.toolInteractions.length > 0) {
+			turns.push(turn);
+		}
+	}
+	return turns;
+}
+
+// ---------------------------------------------------------------------------
+// Per-turn assembly
+//
+// Some tools emit two events for one logical interaction (e.g. ask-user
+// fires both a tool-call and a confirmation-request). To render it once,
+// only the handler for the event with the richer payload emits an entry;
+// the other event is skipped. This assumes both events arrive in the same
+// turn, which they do in practice.
+// ---------------------------------------------------------------------------
+
+function buildTurn(
+	events: CapturedEvent[],
+	userMessage: string | undefined,
+	proxyResponses: ProxyResponses | undefined,
+): TranscriptTurn {
+	const textChunks: string[] = [];
+	const toolInteractions: ToolInteraction[] = [];
+	const seenPlainTools = new Set<string>();
+
+	for (const event of events) {
+		if (event.type === 'text-delta') {
+			const text =
+				getString(event.data, 'text') ?? getString(getRecord(event.data, 'payload') ?? {}, 'text');
+			if (text) textChunks.push(text);
+			continue;
+		}
+
+		if (event.type === 'tool-call') {
+			handleToolCall(event, toolInteractions, seenPlainTools);
+			continue;
+		}
+
+		if (event.type === 'tool-result') {
+			handleToolResult(event, toolInteractions);
+			continue;
+		}
+
+		if (event.type === 'confirmation-request') {
+			handleConfirmationRequest(event, proxyResponses, toolInteractions);
+			continue;
+		}
+	}
+
+	return {
+		userMessage,
+		agentText: textChunks.join(''),
+		toolInteractions,
+	};
+}
+
+function handleToolCall(
+	event: CapturedEvent,
+	out: ToolInteraction[],
+	seenPlainTools: Set<string>,
+): void {
+	const payload = getRecord(event.data, 'payload') ?? event.data;
+	const toolName = getString(payload, 'toolName') ?? '';
+	const args = getRecord(payload, 'args') ?? {};
+
+	// ask-user is rendered from the confirmation-request (which has the answers).
+	if (toolName === 'ask-user') return;
+
+	if (toolName === 'plan' || toolName === 'add-plan-item') {
+		const tasks = Array.isArray(args.tasks) ? extractPlanTasks(args.tasks) : [];
+		if (tasks.length > 0) out.push({ kind: 'plan', tasks });
+		return;
+	}
+
+	// Plain tool-call — collapsed to one entry per tool name within the turn.
+	if (!toolName || seenPlainTools.has(toolName)) return;
+	seenPlainTools.add(toolName);
+	out.push({ kind: 'tool-call', toolName });
+}
+
+function handleToolResult(event: CapturedEvent, out: ToolInteraction[]): void {
+	const payload = getRecord(event.data, 'payload') ?? event.data;
+	const toolName = getString(payload, 'toolName') ?? '';
+	const result = payload.result;
+
+	if (toolName === 'workflows' && isRecord(result)) {
+		const interaction = extractSetupWizardOutcome(result);
+		if (interaction) out.push(interaction);
+	}
+}
+
+function handleConfirmationRequest(
+	event: CapturedEvent,
+	proxyResponses: ProxyResponses | undefined,
+	out: ToolInteraction[],
+): void {
+	const payload = getRecord(event.data, 'payload') ?? {};
+	const requestId = getString(payload, 'requestId');
+	const response = requestId ? proxyResponses?.get(requestId) : undefined;
+
+	if (payload.inputType === 'questions') {
+		const questions = Array.isArray(payload.questions)
+			? extractAskUserQuestions(payload.questions)
+			: [];
+		if (questions.length === 0) return;
+		const answers =
+			response?.kind === 'questions' ? extractAskUserAnswers(response.answers) : undefined;
+		out.push({ kind: 'ask-user', questions, answers });
+		return;
+	}
+
+	// setup wizard suspend — its outcome is rendered from the tool-result instead.
+	if (Array.isArray(payload.setupRequests)) return;
+
+	const toolName =
+		getString(payload, 'toolName') ?? getString(payload, 'agentId') ?? 'confirmation';
+	out.push({
+		kind: 'confirmation',
+		toolName,
+		resumeReason: inferResumeReason(payload, response),
+		approved: inferApproval(response),
+	});
+}
+
+// ---------------------------------------------------------------------------
+// Helpers
+// ---------------------------------------------------------------------------
+
+function extractSetupWizardOutcome(result: Record<string, unknown>): ToolInteraction | null {
+	const completed = Array.isArray(result.completedNodes)
+		? extractCompletedNodes(result.completedNodes)
+		: [];
+	const skipped = Array.isArray(result.skippedNodes)
+		? extractSkippedNodes(result.skippedNodes)
+		: [];
+	if (completed.length === 0 && skipped.length === 0) return null;
+	const reason = typeof result.reason === 'string' ? result.reason : undefined;
+	return { kind: 'setup-wizard', completedNodes: completed, skippedNodes: skipped, reason };
+}
+
+function inferResumeReason(
+	payload: Record<string, unknown>,
+	response: InstanceAiConfirmRequest | undefined,
+): string {
+	if (response?.kind === 'domainAccessApprove') return 'domain-access';
+	if (response?.kind === 'resourceDecision') return 'resource-decision';
+	if (response?.kind === 'credentialSelection') return 'credential-selection';
+	if (payload.domainAccess) return 'domain-access';
+	if (payload.resourceDecision) return 'resource-decision';
+	if (Array.isArray(payload.credentialRequests)) return 'credential-selection';
+	return 'approval';
+}
+
+function inferApproval(response: InstanceAiConfirmRequest | undefined): boolean | undefined {
+	if (!response) return undefined;
+	if (response.kind === 'approval') return response.approved;
+	return true;
+}
+
+function extractPlanTasks(raw: unknown[]): PlanTask[] {
+	const tasks: PlanTask[] = [];
+	for (const item of raw) {
+		if (!isRecord(item)) continue;
+		const title = typeof item.title === 'string' ? item.title : undefined;
+		const description = typeof item.description === 'string' ? item.description : undefined;
+		if (title || description) tasks.push({ title, description });
+	}
+	return tasks;
+}
+
+function extractAskUserQuestions(raw: unknown[]): AskUserQuestion[] {
+	const questions: AskUserQuestion[] = [];
+	for (const item of raw) {
+		if (!isRecord(item)) continue;
+		const id = typeof item.id === 'string' ? item.id : '';
+		const question = typeof item.question === 'string' ? item.question : '';
+		const options = Array.isArray(item.options)
+			? item.options.filter((o): o is string => typeof o === 'string')
+			: undefined;
+		if (id || question) questions.push({ id, question, options });
+	}
+	return questions;
+}
+
+function extractAskUserAnswers(raw: unknown): AskUserAnswer[] {
+	if (!Array.isArray(raw)) return [];
+	const answers: AskUserAnswer[] = [];
+	for (const item of raw) {
+		if (!isRecord(item) || typeof item.questionId !== 'string') continue;
+		const selectedOptions = Array.isArray(item.selectedOptions)
+			? item.selectedOptions.filter((o): o is string => typeof o === 'string')
+			: [];
+		answers.push({
+			questionId: item.questionId,
+			selectedOptions,
+			customText: typeof item.customText === 'string' ? item.customText : undefined,
+			skipped: typeof item.skipped === 'boolean' ? item.skipped : undefined,
+		});
+	}
+	return answers;
+}
+
+function extractCompletedNodes(raw: unknown[]): SetupWizardCompletedNode[] {
+	const nodes: SetupWizardCompletedNode[] = [];
+	for (const item of raw) {
+		if (!isRecord(item) || typeof item.nodeName !== 'string') continue;
+		const parametersSet = Array.isArray(item.parametersSet)
+			? item.parametersSet.filter((p): p is string => typeof p === 'string')
+			: undefined;
+		nodes.push({ nodeName: item.nodeName, parametersSet });
+	}
+	return nodes;
+}
+
+function extractSkippedNodes(raw: unknown[]): SetupWizardSkippedNode[] {
+	const nodes: SetupWizardSkippedNode[] = [];
+	for (const item of raw) {
+		if (!isRecord(item) || typeof item.nodeName !== 'string') continue;
+		nodes.push({
+			nodeName: item.nodeName,
+			credentialType: typeof item.credentialType === 'string' ? item.credentialType : undefined,
+		});
+	}
+	return nodes;
+}
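The "render once" rule from the per-turn assembly comment can be illustrated with a minimal, self-contained sketch. It is not the code from this commit: `SketchEvent`, `SketchInteraction`, `collapseTurn`, and the tool name `search-nodes` are hypothetical, and the sketch reads fields directly from `data` instead of using the `payload` fallback that `transcript-from-events.ts` applies.

```typescript
// Hypothetical, simplified event shape for illustration only.
type SketchEvent = { type: string; data: Record<string, unknown> };

type SketchInteraction =
	| { kind: 'tool-call'; toolName: string }
	| { kind: 'ask-user'; questionCount: number };

// Collapse one turn's events so each logical interaction appears once.
function collapseTurn(events: SketchEvent[]): SketchInteraction[] {
	const out: SketchInteraction[] = [];
	const seenPlainTools = new Set<string>();

	for (const event of events) {
		if (event.type === 'tool-call') {
			const toolName = typeof event.data.toolName === 'string' ? event.data.toolName : '';
			// ask-user is rendered from its confirmation-request, so its tool-call is skipped.
			if (toolName === 'ask-user') continue;
			// Plain tool calls collapse to one entry per tool name within the turn.
			if (!toolName || seenPlainTools.has(toolName)) continue;
			seenPlainTools.add(toolName);
			out.push({ kind: 'tool-call', toolName });
		} else if (event.type === 'confirmation-request' && event.data.inputType === 'questions') {
			const questions = Array.isArray(event.data.questions) ? event.data.questions : [];
			if (questions.length > 0) out.push({ kind: 'ask-user', questionCount: questions.length });
		}
	}
	return out;
}

// Two identical plain tool calls plus an ask-user pair yield two entries.
const sketchTurn = collapseTurn([
	{ type: 'tool-call', data: { toolName: 'search-nodes' } },
	{ type: 'tool-call', data: { toolName: 'search-nodes' } },
	{ type: 'tool-call', data: { toolName: 'ask-user' } },
	{ type: 'confirmation-request', data: { inputType: 'questions', questions: [{ id: 'q1' }] } },
]);
console.log(sketchTurn);
```

Skipping the `ask-user` tool-call rather than the confirmation-request keeps the entry that can later be paired with the proxy's answers, which is the reason the comment gives for choosing the richer payload.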
--- a/packages/@n8n/instance-ai/evaluations/report/workflow-report.ts
+++ b/packages/@n8n/instance-ai/evaluations/report/workflow-report.ts
@ -10,7 +10,14 @@
 import fs from 'fs';
 import path from 'path';

-import type { WorkflowTestCaseResult, ScenarioResult } from '../types';
+import type {
+	ConversationMetrics,
+	ExecutionScenarioResult,
+	ToolInteraction,
+	TranscriptTurn,
+	TurnCounter,
+	WorkflowTestCaseResult,
+} from '../types';

 // ---------------------------------------------------------------------------
 // Helpers
@ -29,7 +36,7 @@ function escapeHtml(str: string): string {
 // Scenario rendering
 // ---------------------------------------------------------------------------

-function renderScenario(sr: ScenarioResult, index: number): string {
+function renderScenario(sr: ExecutionScenarioResult, index: number): string {
 	const icon = sr.success ? '&#10003;' : '&#10007;';
 	const statusClass = sr.success ? 'pass' : 'fail';

@ -61,7 +68,7 @@ function renderScenario(sr: ScenarioResult, index: number): string {
 	</div>`;
 }

-function renderScenarioDetail(sr: ScenarioResult): string {
+function renderScenarioDetail(sr: ExecutionScenarioResult): string {
 	let html = '';

 	if (!sr.evalResult) {
@ -186,12 +193,205 @@ function renderScenarioDetail(sr: ScenarioResult): string {
 	return html;
 }

+// ---------------------------------------------------------------------------
+// Conversation metrics (per-turn deterministic counters)
+// ---------------------------------------------------------------------------
+
+function renderConversationMetrics(metrics: ConversationMetrics | undefined): string {
+	if (!metrics || metrics.perTurn.length === 0) return '';
+
+	const turnRows = metrics.perTurn.map((turn) => renderTurnRow(turn)).join('');
+	const finishBadge = metrics.reachedRunFinishCleanly
+		? '<span class="badge badge-pass">finished cleanly</span>'
+		: '<span class="badge badge-fail">incomplete</span>';
+
+	const byKindBits = Object.entries(metrics.confirmationAskedByKind)
+		.map(([kind, count]) => `${escapeHtml(kind)}×${String(count)}`)
+		.join(' · ');
+
+	const summary = [
+		`<strong>${String(metrics.turnCount)}</strong> turn${metrics.turnCount === 1 ? '' : 's'}`,
+		`<strong>${String(metrics.confirmationAskedTotal)}</strong> confirmation${metrics.confirmationAskedTotal === 1 ? '' : 's'} asked${byKindBits ? ` (${byKindBits})` : ''}`,
+		finishBadge,
+	].join(' · ');
+
+	return `<details class="section"><summary>Conversation metrics</summary>
+		<div class="conv-summary">${summary}</div>
+		<table class="conv-table">
+			<thead><tr>
+				<th>Turn</th><th>Tool calls</th><th>Tool errors</th><th>Confirmations</th>
+				<th>Replan after error</th><th>Repeat questions</th><th>Finish status</th>
+			</tr></thead>
+			<tbody>${turnRows}</tbody>
+		</table>
+	</details>`;
+}
+
+function renderTurnRow(turn: TurnCounter): string {
+	const status = turn.runFinishStatus ?? '—';
+	const statusClass =
+		turn.runFinishStatus === 'completed'
+			? 'turn-status-ok'
+			: turn.runFinishStatus === undefined
+				? 'turn-status-pending'
+				: 'turn-status-fail';
+	const confByKind = Object.entries(turn.confirmationAskedByKind)
+		.map(([kind, count]) => `${escapeHtml(kind)}×${String(count)}`)
+		.join(', ');
+	const confDetail = confByKind ? ` <span class="muted">(${confByKind})</span>` : '';
+	return `<tr>
+		<td>#${String(turn.turn)}</td>
+		<td>${String(turn.toolCallCount)}</td>
+		<td>${String(turn.toolErrorCount)}</td>
+		<td>${String(turn.confirmationAskedTotal)}${confDetail}</td>
+		<td>${String(turn.replanAfterErrorCount)}</td>
+		<td>${String(turn.repeatQuestionCount)}</td>
+		<td class="${statusClass}">${escapeHtml(status)}</td>
+	</tr>`;
+}
+
+// ---------------------------------------------------------------------------
+// Conversation transcript — chat-style view built from the captured event stream
+// ---------------------------------------------------------------------------
+
+function renderConversationTranscript(transcript: TranscriptTurn[] | undefined): string {
+	if (!transcript || transcript.length === 0) return '';
+	const turnsHtml = transcript.map((turn, i) => renderTranscriptTurn(turn, i + 1)).join('');
+	return `<details class="section" open><summary>Conversation transcript</summary>
+		<div class="transcript">${turnsHtml}</div>
+	</details>`;
+}
+
+function renderTranscriptTurn(turn: TranscriptTurn, turnNum: number): string {
+	const parts: string[] = [`<div class="transcript-turn-header">Turn ${String(turnNum)}</div>`];
+	if (turn.userMessage) {
+		parts.push(
+			`<div class="transcript-line transcript-user"><span class="transcript-icon">👤</span><span class="transcript-text">${escapeHtml(turn.userMessage)}</span></div>`,
+		);
+	}
+	if (turn.agentText) {
+		parts.push(
+			`<div class="transcript-line transcript-assistant"><span class="transcript-icon">🤖</span><span class="transcript-text">${escapeHtml(turn.agentText)}</span></div>`,
+		);
+	}
+
+	const toolNames: string[] = [];
+	for (const interaction of turn.toolInteractions) {
+		const block = renderInteraction(interaction);
+		if (block) parts.push(block);
+		if (interaction.kind === 'tool-call') toolNames.push(interaction.toolName);
+	}
+
+	if (toolNames.length > 0) {
+		parts.push(
+			`<div class="transcript-tools">🔧 ${toolNames.map((t) => escapeHtml(t)).join(', ')}</div>`,
+		);
+	}
+	return `<div class="transcript-turn">${parts.join('')}</div>`;
+}
+
+function renderInteraction(interaction: ToolInteraction): string | null {
+	switch (interaction.kind) {
+		case 'plan': {
+			if (interaction.tasks.length === 0) return null;
+			const lines = interaction.tasks
+				.map((t, i) => {
+					const title = t.title ?? `Task ${String(i + 1)}`;
+					const desc = t.description ? `: ${escapeHtml(t.description)}` : '';
+					return `<li><strong>${escapeHtml(title)}</strong>${desc}</li>`;
+				})
+				.join('');
+			const word = interaction.tasks.length === 1 ? 'task' : 'tasks';
+			return `<details class="transcript-aside" open><summary>📋 plan (${String(interaction.tasks.length)} ${word})</summary><ul class="transcript-plan">${lines}</ul></details>`;
+		}
+		case 'ask-user': {
+			if (interaction.questions.length === 0) return null;
+			const answerByQId = new Map<string, string>();
+			for (const a of interaction.answers ?? []) {
+				const selected = a.selectedOptions.join(', ');
+				const text = [selected, a.customText].filter(Boolean).join(' — ');
+				if (text) answerByQId.set(a.questionId, text);
+			}
+			const lines = interaction.questions
+				.map((q) => {
+					const opts =
+						q.options && q.options.length > 0
+							? ` <em>(${q.options.map((o) => escapeHtml(o)).join(' / ')})</em>`
+							: '';
+					const answer = answerByQId.get(q.id);
+					const answerHtml = answer
+						? `<div class="transcript-answer">👤 ${escapeHtml(answer)}</div>`
+						: '';
+					return `<li>${escapeHtml(q.question)}${opts}${answerHtml}</li>`;
+				})
+				.join('');
+			const summary =
+				answerByQId.size > 0
+					? '❓ ask-user (with answers)'
+					: `❓ ask-user (${String(interaction.questions.length)} question${interaction.questions.length === 1 ? '' : 's'})`;
+			return `<details class="transcript-aside" open><summary>${summary}</summary><ul class="transcript-questions">${lines}</ul></details>`;
+		}
+		case 'setup-wizard': {
+			const skipped = interaction.skippedNodes;
+			const needCreds = skipped.filter((s) => Boolean(s.credentialType)).length;
+			const needParams = skipped.length - needCreds;
+			const breakdown: string[] = [];
+			if (needCreds > 0) breakdown.push(`${String(needCreds)} need credentials`);
+			if (needParams > 0) breakdown.push(`${String(needParams)} need parameters`);
+			const headerParts: string[] = [];
+			if (interaction.completedNodes.length > 0) {
+				headerParts.push(`${String(interaction.completedNodes.length)} configured`);
+			}
+			if (skipped.length > 0) {
+				headerParts.push(
+					`${String(skipped.length)} skipped${breakdown.length > 0 ? ` (${breakdown.join(', ')})` : ''}`,
+				);
+			}
+			const header = headerParts.length > 0 ? headerParts.join(', ') : 'nothing to apply';
+
+			const sections: string[] = [];
+			if (interaction.completedNodes.length > 0) {
+				const items = interaction.completedNodes
+					.map((c) => {
+						const params = c.parametersSet ? c.parametersSet.join(', ') : '';
+						return `<li>${escapeHtml(c.nodeName)}${params ? ` — params: ${escapeHtml(params)}` : ''}</li>`;
+					})
+					.join('');
+				sections.push(
+					`<div class="transcript-section-label">configured (${String(interaction.completedNodes.length)})</div><ul class="transcript-plan">${items}</ul>`,
+				);
+			}
+			if (skipped.length > 0) {
+				const items = skipped
+					.map(
+						(s) =>
+							`<li>${escapeHtml(s.nodeName)}${s.credentialType ? ` — needs <code>${escapeHtml(s.credentialType)}</code> credential` : ' — needs parameters'}</li>`,
+					)
+					.join('');
+				sections.push(
+					`<div class="transcript-section-label">skipped (${String(skipped.length)})</div><ul class="transcript-plan">${items}</ul>`,
+				);
+			}
+			return `<details class="transcript-aside" open><summary>🛠 setup wizard — ${escapeHtml(header)}</summary>${sections.join('')}</details>`;
+		}
+		case 'confirmation': {
+			const decisionTag =
+				typeof interaction.approved === 'boolean'
+					? ` <em>(${interaction.approved ? 'approved' : 'rejected'})</em>`
+					: '';
+			return `<div class="transcript-resume">↪ resume <code>${escapeHtml(interaction.toolName)}</code>: ${escapeHtml(interaction.resumeReason)}${decisionTag}</div>`;
+		}
+		case 'tool-call':
+			return null; // surfaced in the aggregate tool-names line at the bottom
+	}
+}
+
 // ---------------------------------------------------------------------------
 // Workflow summary
 // ---------------------------------------------------------------------------

 function renderWorkflowSummary(result: WorkflowTestCaseResult): string {
-	const firstEval = result.scenarioResults[0]?.evalResult;
+	const firstEval = result.executionScenarioResults[0]?.evalResult;

 	let nodesHtml = '';
 	if (firstEval) {
@ -227,8 +427,8 @@ function renderWorkflowSummary(result: WorkflowTestCaseResult): string {
 // ---------------------------------------------------------------------------

 function renderTestCase(result: WorkflowTestCaseResult, tcIndex: number): string {
-	const passCount = result.scenarioResults.filter((sr) => sr.success).length;
-	const totalCount = result.scenarioResults.length;
+	const passCount = result.executionScenarioResults.filter((sr) => sr.success).length;
+	const totalCount = result.executionScenarioResults.length;
 	const allPass = passCount === totalCount && totalCount > 0;
 	const statusClass = result.workflowBuildSuccess ? (allPass ? 'pass' : 'mixed') : 'fail';

@ -241,11 +441,11 @@ function renderTestCase(result: WorkflowTestCaseResult, tcIndex: number): string
 			? `<span class="badge badge-${allPass ? 'pass' : 'fail'}">${String(passCount)}/${String(totalCount)}</span>`
 			: '';

-	const prompt = result.testCase.prompt;
+	const prompt = result.testCase.conversation[0]?.text ?? '';
 	const truncatedPrompt = prompt.length > 100 ? prompt.slice(0, 100) + '...' : prompt;

 	// Inline scenario indicators for quick triage without expanding
-	const scenarioIndicators = result.scenarioResults
+	const scenarioIndicators = result.executionScenarioResults
 		.map(
 			(sr) =>
 				`<span class="scenario-indicator ${sr.success ? 'pass' : 'fail'}" title="${escapeHtml(sr.scenario.name)}">${sr.success ? '✓' : '✗'} ${escapeHtml(sr.scenario.name)}</span>`,
@ -253,8 +453,8 @@ function renderTestCase(result: WorkflowTestCaseResult, tcIndex: number): string
 		.join(' ');

 	let scenariosHtml = '';
-	if (result.scenarioResults.length > 0) {
-		scenariosHtml = result.scenarioResults
+	if (result.executionScenarioResults.length > 0) {
+		scenariosHtml = result.executionScenarioResults
 			.map((sr, i) => renderScenario(sr, tcIndex * 100 + i))
 			.join('');
 	} else if (!result.workflowBuildSuccess) {
@ -278,6 +478,8 @@ function renderTestCase(result: WorkflowTestCaseResult, tcIndex: number): string
 		</div>
 		<div class="test-case-detail">
 			<details class="section"><summary>Prompt</summary><div class="prompt-text">${escapeHtml(prompt)}</div></details>
+			${renderConversationMetrics(result.conversationMetrics)}
+			${renderConversationTranscript(result.transcript)}
 			${renderWorkflowSummary(result)}
 			${scenariosHtml}
 		</div>
@ -291,7 +493,7 @@ function renderTestCase(result: WorkflowTestCaseResult, tcIndex: number): string
 export function generateWorkflowReport(results: WorkflowTestCaseResult[]): string {
 	const totalTestCases = results.length;
 	const builtCount = results.filter((r) => r.workflowBuildSuccess).length;
-	const allScenarios = results.flatMap((r) => r.scenarioResults);
+	const allScenarios = results.flatMap((r) => r.executionScenarioResults);
 	const passCount = allScenarios.filter((sr) => sr.success).length;
 	const failCount = allScenarios.length - passCount;
 	const totalScenarios = allScenarios.length;
@ -442,6 +644,41 @@ export function generateWorkflowReport(results: WorkflowTestCaseResult[]): strin

 	/* Utilities */
 	.muted { color: var(--text-muted); font-size: 12px; }
+
+	/* Conversation metrics */
+	.conv-summary { color: var(--text-secondary); font-size: 12px; padding: 6px 0; }
+	.conv-table { width: 100%; border-collapse: collapse; font-size: 12px; margin-top: 6px; }
+	.conv-table th, .conv-table td { text-align: left; padding: 4px 8px; border-bottom: 1px solid var(--border-light); }
+	.conv-table th { color: var(--text-muted); font-weight: 600; font-size: 11px; text-transform: uppercase; letter-spacing: 0.04em; }
+	.conv-table td { font-family: monospace; }
+	.turn-status-ok { color: var(--color-pass); }
+	.turn-status-fail { color: var(--color-fail); }
+	.turn-status-pending { color: var(--text-muted); }
+
+	/* Conversation transcript */
+	.transcript { padding: 4px 0; }
+	.transcript-turn { padding: 8px 0; border-bottom: 1px dashed var(--border-light); }
+	.transcript-turn:last-child { border-bottom: none; }
+	.transcript-turn-header { font-size: 10px; text-transform: uppercase; letter-spacing: 0.06em; color: var(--text-muted); margin-bottom: 6px; }
+	.transcript-line { display: flex; gap: 8px; padding: 4px 0; align-items: flex-start; font-size: 13px; line-height: 1.5; }
+	.transcript-icon { width: 18px; text-align: center; flex-shrink: 0; }
+	.transcript-text { color: var(--text-primary); white-space: pre-wrap; }
+	.transcript-user .transcript-text { color: var(--text-primary); }
+	.transcript-assistant .transcript-text { color: var(--text-secondary); }
+	.transcript-internal > summary { cursor: pointer; padding: 4px 0; font-size: 12px; color: var(--text-muted); display: flex; gap: 8px; align-items: flex-start; }
+	.transcript-internal > summary:hover { color: var(--text-secondary); }
+	.transcript-internal .transcript-text { color: var(--text-muted); font-style: italic; }
+	.transcript-aside { margin: 4px 0 4px 26px; }
+	.transcript-aside > summary { cursor: pointer; color: var(--text-muted); font-size: 11px; padding: 2px 0; }
+	.transcript-reasoning { color: var(--text-muted); font-size: 12px; line-height: 1.5; padding: 6px 8px; background: var(--bg-primary); border-left: 2px solid var(--border); border-radius: 2px; white-space: pre-wrap; margin-top: 4px; }
+	.transcript-tools { color: var(--text-muted); font-size: 11px; font-family: monospace; padding: 4px 0 0 26px; }
+	.transcript-plan, .transcript-questions { margin: 4px 0 4px 18px; padding: 0; font-size: 12px; line-height: 1.5; color: var(--text-primary); }
+	.transcript-plan li, .transcript-questions li { margin: 4px 0; }
+	.transcript-answer { color: var(--text-secondary); font-size: 12px; margin: 2px 0 6px 16px; padding: 2px 0; }
+	.transcript-resume { font-size: 11px; font-family: monospace; color: var(--text-muted); padding: 2px 0 2px 26px; }
+	.transcript-resume code { background: var(--bg-tertiary); padding: 0 4px; border-radius: 2px; }
+	.transcript-section-label { font-size: 11px; color: var(--text-muted); margin: 6px 0 2px 18px; text-transform: uppercase; letter-spacing: 0.04em; }
+	.transcript-empty { font-size: 12px; color: var(--text-muted); font-style: italic; margin: 4px 0 4px 18px; }
 </style>
 </head>
 <body>
--- a/packages/@n8n/instance-ai/evaluations/types.ts
+++ b/packages/@n8n/instance-ai/evaluations/types.ts
@ -76,6 +76,30 @@ export interface InstanceAiMetrics {
 	events: CapturedEvent[];
 }

+// ---------------------------------------------------------------------------
+// Per-turn conversation metrics
+// ---------------------------------------------------------------------------
+
+/** Counters for one turn (run-start → run-finish). */
+export interface TurnCounter {
+	turn: number;
+	toolCallCount: number;
+	toolErrorCount: number;
+	confirmationAskedTotal: number;
+	confirmationAskedByKind: Record<string, number>;
+	replanAfterErrorCount: number;
+	repeatQuestionCount: number;
+	runFinishStatus?: string;
+}
+
+export interface ConversationMetrics {
+	turnCount: number;
+	perTurn: TurnCounter[];
+	confirmationAskedTotal: number;
+	confirmationAskedByKind: Record<string, number>;
+	reachedRunFinishCleanly: boolean;
+}
+
 // ---------------------------------------------------------------------------
 // Outcome types
 // ---------------------------------------------------------------------------
@ -129,7 +153,7 @@ export interface EventOutcome {
 // Workflow evaluation test cases
 // ---------------------------------------------------------------------------

-export interface TestScenario {
+export interface ExecutionScenario {
 	name: string;
 	description: string;
 	/** Instructions for mock data generation — passed as scenario hints to the LLM mock endpoint */
@ -138,20 +162,35 @@ export interface TestScenario {
 	successCriteria: string;
 }

+export interface ConversationTurn {
+	role: 'user' | 'assistant';
+	text: string;
+}
+
 export interface WorkflowTestCase {
-	prompt: string;
+	/**
+	 * Hand-authored conversation that drives the build. Must have ≥1 turn,
+	 * and the first turn must be `user`.
+	 *
+	 * - One user turn, no assistant turns → auto-approve mode (single-prompt build).
+	 * - Anything else → multi-turn UserProxyLlm engages (answers clarifications,
+	 *   sends follow-ups consuming `messageBudget`).
+	 */
+	conversation: ConversationTurn[];
 	complexity: 'simple' | 'medium' | 'complex';
 	tags: string[];
 	triggerType?: 'manual' | 'webhook' | 'schedule' | 'form';
-	scenarios: TestScenario[];
+	executionScenarios: ExecutionScenario[];
+	/** Max follow-up messages the proxy will send. Ignored in auto-approve mode. */
+	messageBudget?: number;
 }

 // ---------------------------------------------------------------------------
 // Workflow test case results
 // ---------------------------------------------------------------------------

-export interface ScenarioResult {
-	scenario: TestScenario;
+export interface ExecutionScenarioResult {
+	scenario: ExecutionScenario;
 	success: boolean;
 	evalResult?: InstanceAiEvalExecutionResult;
 	score: number;
@ -167,18 +206,71 @@ export interface WorkflowTestCaseResult {
 	workflowId?: string;
 	workflowBuildSuccess: boolean;
 	buildError?: string;
-	scenarioResults: ScenarioResult[];
+	executionScenarioResults: ExecutionScenarioResult[];
 	/** The built workflow JSON — saved for debugging and cross-run comparison */
 	workflowJson?: WorkflowResponse;
+	conversationMetrics?: ConversationMetrics;
+	threadId?: string;
+	transcript?: TranscriptTurn[];
+}
+
+// ---------------------------------------------------------------------------
+// Conversation transcript (synthesized from the SSE event stream)
+// ---------------------------------------------------------------------------
+
+export interface TranscriptTurn {
+	userMessage?: string;
+	agentText: string;
+	toolInteractions: ToolInteraction[];
+}
+
+export type ToolInteraction =
+	| { kind: 'plan'; tasks: PlanTask[] }
+	| { kind: 'ask-user'; questions: AskUserQuestion[]; answers?: AskUserAnswer[] }
+	| {
+			kind: 'setup-wizard';
+			completedNodes: SetupWizardCompletedNode[];
+			skippedNodes: SetupWizardSkippedNode[];
+			reason?: string;
+	  }
+	| { kind: 'confirmation'; toolName: string; resumeReason: string; approved?: boolean }
+	| { kind: 'tool-call'; toolName: string };
+
+export interface PlanTask {
+	title?: string;
+	description?: string;
+}
+
+export interface AskUserQuestion {
+	id: string;
+	question: string;
+	options?: string[];
+}
+
+export interface AskUserAnswer {
+	questionId: string;
+	selectedOptions: string[];
+	customText?: string;
+	skipped?: boolean;
+}
+
+export interface SetupWizardCompletedNode {
+	nodeName: string;
+	parametersSet?: string[];
+}
+
+export interface SetupWizardSkippedNode {
+	nodeName: string;
+	credentialType?: string;
 }

 // ---------------------------------------------------------------------------
 // Multi-run aggregation
 // ---------------------------------------------------------------------------

-export interface ScenarioAggregation {
-	scenario: TestScenario;
-	runs: ScenarioResult[];
+export interface ExecutionScenarioAggregation {
+	scenario: ExecutionScenario;
+	runs: ExecutionScenarioResult[];
 	passCount: number;
 	passRate: number;
 	/** probability at least 1 of k attempts passes */
@ -191,7 +283,7 @@ export interface TestCaseAggregation {
 	testCase: WorkflowTestCase;
 	runs: WorkflowTestCaseResult[];
 	buildSuccessCount: number;
-	scenarios: ScenarioAggregation[];
+	executionScenarios: ExecutionScenarioAggregation[];
 }

 export interface MultiRunEvaluation {
--- a/packages/@n8n/instance-ai/evaluations/utils/confirmation-payload.ts
+++ b/packages/@n8n/instance-ai/evaluations/utils/confirmation-payload.ts
@ -0,0 +1,52 @@
+// Shared confirmation-event helpers — used by both the deterministic shortcut
+// (utils/user-proxy/deterministic.ts) and the autoApprove fallback
+// (harness/chat-loop.ts) to avoid copy-pasted dispatch logic.
+
+import { instanceGatewayResourceDecisionSchema } from '@n8n/api-types';
+import type { InstanceAiConfirmRequest } from '@n8n/api-types';
+
+import { getNestedRecord } from './safe-extract';
+import type { CapturedEvent } from '../types';
+
+/**
+ * Handle confirmation events that carry no user-intent signal — domain access,
+ * resource decisions, standalone credential requests. The eval grants all
+ * access, has no credentials, and picks the most-permissive option for
+ * resource gates. Returns `undefined` for events that need caller-specific
+ * handling: setup wizards, ask-user questions, plan reviews.
+ */
+export function tryInfrastructureResponse(
+	event: CapturedEvent,
+): InstanceAiConfirmRequest | undefined {
+	const payload = getNestedRecord(event.data, 'payload') ?? {};
+
+	if (getNestedRecord(payload, 'domainAccess')) {
+		return { kind: 'domainAccessApprove', domainAccessAction: 'allow_all' };
+	}
+
+	const resourceDecision = getNestedRecord(payload, 'resourceDecision');
+	if (resourceDecision) {
+		const options = Array.isArray(resourceDecision.options)
+			? (resourceDecision.options as unknown[]).filter((o): o is string => typeof o === 'string')
+			: [];
+		const allowOption = options.find((o) => o.toLowerCase().includes('allow')) ?? options[0];
+		const parsed = instanceGatewayResourceDecisionSchema.safeParse(allowOption);
+		return {
+			kind: 'resourceDecision',
+			resourceDecision: parsed.success ? parsed.data : 'allowOnce',
+		};
+	}
+
+	// Standalone credential request only — when setupRequests is also present,
+	// the setup wizard takes priority because it carries node parameters to
+	// fill (handled by the caller).
+	if (Array.isArray(payload.credentialRequests) && !Array.isArray(payload.setupRequests)) {
+		return { kind: 'credentialSelection', credentials: {} };
+	}
+
+	return undefined;
+}
+
+export function getEventPayload(event: CapturedEvent): Record<string, unknown> {
+	return getNestedRecord(event.data, 'payload') ?? {};
+}
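
The option-picking rule inside `tryInfrastructureResponse` can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper name `pickAllowOption` and invented option strings; it leaves out the `instanceGatewayResourceDecisionSchema` validation step, whose accepted values are not shown in this diff.

```typescript
// Sketch of the option-picking rule in tryInfrastructureResponse.
// `pickAllowOption` is a hypothetical name; schema validation is left out.
function pickAllowOption(rawOptions: unknown): string | undefined {
	const options = Array.isArray(rawOptions)
		? rawOptions.filter((o): o is string => typeof o === 'string')
		: [];
	// First option mentioning "allow" (case-insensitive), else the first option.
	return options.find((o) => o.toLowerCase().includes('allow')) ?? options[0];
}

// Invented option lists, for illustration only.
const preferredOption = pickAllowOption(['denyOnce', 'allowOnce', 'allowForSession']);
const firstOptionFallback = pickAllowOption(['denyOnce', 42, 'denyAlways']);
const noOption = pickAllowOption('not an array');
```

Here `preferredOption` is `'allowOnce'`, `firstOptionFallback` is `'denyOnce'` (non-string entries are dropped first), and `noOption` is `undefined`. In the diff, a candidate that fails schema parsing is replaced by `'allowOnce'`.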
--- a/packages/@n8n/instance-ai/evaluations/utils/safe-extract.ts
+++ b/packages/@n8n/instance-ai/evaluations/utils/safe-extract.ts
@ -0,0 +1,19 @@
+// Type guards for pulling fields off `unknown` records — used wherever we
+// inspect event payloads, run inputs/outputs, or other loosely-typed JSON.
+
+export function isRecord(value: unknown): value is Record<string, unknown> {
+	return typeof value === 'object' && value !== null && !Array.isArray(value);
+}
+
+export function getNestedRecord(
+	obj: Record<string, unknown>,
+	key: string,
+): Record<string, unknown> | undefined {
+	const value = obj[key];
+	return isRecord(value) ? value : undefined;
+}
+
+export function getString(obj: Record<string, unknown>, key: string): string | undefined {
+	const value = obj[key];
+	return typeof value === 'string' ? value : undefined;
+}
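
A short usage sketch of the guards above, with the three helpers re-declared so the block runs on its own. The sample event data is invented; the lookup pattern mirrors how `extractRequestId` reads `payload.requestId` elsewhere in this commit.

```typescript
// Re-declared from safe-extract.ts so this sketch is self-contained.
function isRecord(value: unknown): value is Record<string, unknown> {
	return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function getNestedRecord(
	obj: Record<string, unknown>,
	key: string,
): Record<string, unknown> | undefined {
	const value = obj[key];
	return isRecord(value) ? value : undefined;
}

function getString(obj: Record<string, unknown>, key: string): string | undefined {
	const value = obj[key];
	return typeof value === 'string' ? value : undefined;
}

// Hypothetical event data; real payloads come from captured events.
const sampleData: Record<string, unknown> = {
	payload: { requestId: 'req-1', attempts: 3 },
	list: [1, 2, 3],
};

const samplePayload = getNestedRecord(sampleData, 'payload');
const sampleRequestId = samplePayload ? getString(samplePayload, 'requestId') : undefined;
// A number is not a string, so getString yields undefined instead of coercing.
const attemptsAsString = samplePayload ? getString(samplePayload, 'attempts') : undefined;
// Arrays are objects at runtime, but isRecord rejects them.
const arrayAsRecord = getNestedRecord(sampleData, 'list');
```

Each helper returns `undefined` rather than throwing, so callers can chain lookups with `??` as the rest of the commit does.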
--- a/packages/@n8n/instance-ai/evaluations/utils/user-proxy/agent.ts
+++ b/packages/@n8n/instance-ai/evaluations/utils/user-proxy/agent.ts
@ -0,0 +1,52 @@
+import { SYSTEM_PROMPT, TOOL_DESCRIPTIONS } from './prompts';
+import { decisionSchema, type Decision } from './tools';
+import { createEvalAgent } from '../../../src/utils/eval-agents';
+import type { EvalLogger } from '../../harness/logger';
+
+export interface UserProxyAgentConfig {
+	modelId?: string;
+	logger?: EvalLogger;
+}
+
+export interface UserProxyAgent {
+	decide(userPrompt: string): Promise<Decision | undefined>;
+}
+
+export function createUserProxyAgent(config: UserProxyAgentConfig = {}): UserProxyAgent {
+	const instructions = `${SYSTEM_PROMPT}\n\n${TOOL_DESCRIPTIONS}`;
+
+	return {
+		async decide(userPrompt: string): Promise<Decision | undefined> {
+			const agent = createEvalAgent('eval-user-proxy', {
+				...(config.modelId ? { model: config.modelId } : {}),
+				instructions,
+				cache: true,
+			}).structuredOutput(decisionSchema);
+
+			try {
+				const result = await agent.generate(userPrompt);
+				const decision = (result.structuredOutput as Decision | undefined) ?? undefined;
+				if (!decision) {
+					config.logger?.warn(
+						`[user-proxy] no structuredOutput; error=${describeFailure((result as { error?: unknown }).error)}`,
+					);
+				}
+				return decision;
+			} catch (caught) {
+				config.logger?.warn(`[user-proxy] agent.generate threw: ${describeFailure(caught)}`);
+				return undefined;
+			}
+		},
+	};
+}
+
+function describeFailure(value: unknown): string {
+	if (value === undefined) return 'undefined';
+	if (value instanceof Error) return `${value.name}: ${value.message}`;
+	if (typeof value === 'string') return value;
+	try {
+		return JSON.stringify(value).slice(0, 600);
+	} catch {
+		return '[unable to stringify]';
+	}
+}
--- a/packages/@n8n/instance-ai/evaluations/utils/user-proxy/deterministic.ts
+++ b/packages/@n8n/instance-ai/evaluations/utils/user-proxy/deterministic.ts
@ -0,0 +1,40 @@
+// Deterministic shortcuts that bypass the LLM for events with no user-intent signal.
+
+import type { InstanceAiConfirmRequest } from '@n8n/api-types';
+
+import type { CapturedEvent } from '../../types';
+import { getEventPayload, tryInfrastructureResponse } from '../confirmation-payload';
+
+export function tryDeterministicConfirmationResponse(
+	event: CapturedEvent,
+): InstanceAiConfirmRequest | undefined {
+	const infra = tryInfrastructureResponse(event);
+	if (infra) return infra;
+
+	const payload = getEventPayload(event);
+
+	// Setup wizard with credential-only requests: decline (approved: false). The
+	// eval has no credentials, and applying an empty payload loops the agent
+	// ("partial 0/N"). Mixed or parameter-only requests fall through to the LLM.
+	if (Array.isArray(payload.setupRequests)) {
+		if (
+			payload.setupRequests.length > 0 &&
+			payload.setupRequests.every(isCredentialOnlySetupRequest)
+		) {
+			return { kind: 'approval', approved: false };
+		}
+		return undefined;
+	}
+
+	// inputType=questions, text, plan-review, or default approval — LLM handles.
+	return undefined;
+}
+
+function isCredentialOnlySetupRequest(value: unknown): boolean {
+	if (typeof value !== 'object' || value === null) return false;
+	const req = value as Record<string, unknown>;
+	if (typeof req.credentialType !== 'string') return false;
+	const issues = req.parameterIssues;
+	if (issues && typeof issues === 'object' && Object.keys(issues).length > 0) return false;
+	return true;
+}
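
To illustrate which setup-wizard payloads take the deterministic path, here is a small sketch that copies `isCredentialOnlySetupRequest` and wraps the surrounding condition in a hypothetical helper, `shouldDeclineSetup`. The credential type names and parameter issue messages are invented sample values.

```typescript
// Copied from deterministic.ts so the sketch is self-contained.
function isCredentialOnlySetupRequest(value: unknown): boolean {
	if (typeof value !== 'object' || value === null) return false;
	const req = value as Record<string, unknown>;
	if (typeof req.credentialType !== 'string') return false;
	const issues = req.parameterIssues;
	if (issues && typeof issues === 'object' && Object.keys(issues).length > 0) return false;
	return true;
}

// Hypothetical helper: the same non-empty, all-credential-only test the shortcut applies.
function shouldDeclineSetup(setupRequests: unknown[]): boolean {
	return setupRequests.length > 0 && setupRequests.every(isCredentialOnlySetupRequest);
}

// Invented sample requests.
const credentialOnly = [
	{ credentialType: 'slackApi' },
	{ credentialType: 'notionApi', parameterIssues: {} },
];
const mixedRequests = [
	{ credentialType: 'slackApi', parameterIssues: { channel: ['Channel is required'] } },
];
const parameterOnly = [{ parameterIssues: { url: ['URL is required'] } }];

const declineCredentialOnly = shouldDeclineSetup(credentialOnly);
const declineMixed = shouldDeclineSetup(mixedRequests);
const declineParameterOnly = shouldDeclineSetup(parameterOnly);
const declineEmpty = shouldDeclineSetup([]);
```

Only the first list is declined deterministically; the mixed, parameter-only, and empty lists all return `false`, which corresponds to the `return undefined` branch that hands the event to the LLM.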
--- a/packages/@n8n/instance-ai/evaluations/utils/user-proxy/index.ts
+++ b/packages/@n8n/instance-ai/evaluations/utils/user-proxy/index.ts
@ -0,0 +1,257 @@
+// LLM-backed user simulator for multi-turn workflow evals.
+
+import type { InstanceAiConfirmRequest } from '@n8n/api-types';
+
+import { createUserProxyAgent, type UserProxyAgent } from './agent';
+import { tryDeterministicConfirmationResponse } from './deterministic';
+import { buildConfirmationPrompt, buildFollowUpPrompt } from './prompts';
+import { encodeConfirmationDecision, type Decision } from './tools';
+import { buildAutoApprovePayload } from '../../harness/chat-loop';
+import type { NextMessageDecision } from '../../harness/chat-loop';
+import type { EvalLogger } from '../../harness/logger';
+import type { CapturedEvent, ConversationTurn } from '../../types';
+import { getNestedRecord, getString } from '../safe-extract';
+
+/**
+ * What category of response the proxy sent for a confirmation event.
+ * Mostly mirrors the `kind` of the InstanceAiConfirmRequest, with overlay
+ * categories that describe WHERE the response came from:
+ *
+ *  - `dismissal`: LLM decision that skips every question or applies no params
+ *  - `rejection`: LLM decision that sets `approved: false` on an approval
+ *  - `deterministic`: handled by the deterministic shortcut (no LLM call)
+ *  - `repeat`: a confirmation requestId we already responded to
+ *  - `fallback-no-decision`: LLM returned no decision; sent autoApprove
+ *  - `fallback-unencoded`: between-run action with no confirmation encoding; sent autoApprove
+ */
+export type ProxyDecisionCategory =
+	| InstanceAiConfirmRequest['kind']
+	| 'dismissal'
+	| 'rejection'
+	| 'deterministic'
+	| 'repeat'
+	| 'fallback-no-decision'
+	| 'fallback-unencoded';
+
+export type ProxyDecisionStats = Partial<Record<ProxyDecisionCategory, number>>;
+
+// ---------------------------------------------------------------------------
+// Constants
+// ---------------------------------------------------------------------------
+
+const DEFAULT_MESSAGE_BUDGET = 5;
+
+// ---------------------------------------------------------------------------
+// Public types
+// ---------------------------------------------------------------------------
+
+export interface UserProxyConfig {
+	conversation: ConversationTurn[];
+	messageBudget?: number;
+	modelId?: string;
+	logger?: EvalLogger;
+	/** Test seam — inject a fake agent. */
+	agent?: UserProxyAgent;
+}
+
+// ---------------------------------------------------------------------------
+// UserProxyLlm
+// ---------------------------------------------------------------------------
+
+export class UserProxyLlm {
+	/** The intended conversation — read-only, what the user wants overall. */
+	private readonly script: ConversationTurn[];
+	private readonly messageBudget: number;
+	private readonly agent: UserProxyAgent;
+	private readonly logger?: EvalLogger;
+
+	/** What's actually been sent and received this run, both sides. The
+	 *  opening turn is seeded here on construction because the harness sends
+	 *  it directly via `client.sendMessage` before the first SSE event. */
+	private readonly actualTranscript: ConversationTurn[];
+
+	private messagesSent = 0;
+	private ingestedEventCount = 0;
+	private readonly seenRequestIds = new Set<string>();
+	private readonly decisionStats: ProxyDecisionStats = {};
+
+	constructor(config: UserProxyConfig) {
+		this.script = config.conversation;
+		this.messageBudget = config.messageBudget ?? DEFAULT_MESSAGE_BUDGET;
+		this.logger = config.logger;
+		this.agent =
+			config.agent ?? createUserProxyAgent({ modelId: config.modelId, logger: config.logger });
+		// Seed with the opener — the harness has already sent it.
+		const opener = this.script[0];
+		this.actualTranscript = opener ? [{ role: opener.role, text: opener.text }] : [];
+	}
+
+	getMessagesSent(): number {
+		return this.messagesSent;
+	}
+
+	ingestEvents(events: CapturedEvent[]): void {
+		const newEvents = events.slice(this.ingestedEventCount);
+		this.ingestedEventCount = events.length;
+
+		let pendingAssistantText = '';
+		for (const event of newEvents) {
+			if (event.type === 'text-delta') {
+				const text = extractTextDelta(event);
+				if (text) pendingAssistantText += text;
+			} else if (event.type === 'run-finish' && pendingAssistantText.length > 0) {
+				this.actualTranscript.push({ role: 'assistant', text: pendingAssistantText });
+				pendingAssistantText = '';
+			}
+		}
+
+		if (pendingAssistantText.length > 0) {
+			const last = this.actualTranscript[this.actualTranscript.length - 1];
+			if (last?.role === 'assistant') {
+				last.text = last.text + pendingAssistantText;
+			} else {
+				this.actualTranscript.push({ role: 'assistant', text: pendingAssistantText });
+			}
+		}
+	}
+
+	async respondToConfirmation(event: CapturedEvent): Promise<InstanceAiConfirmRequest> {
+		const requestId = extractRequestId(event);
+		const isRepeat = requestId !== undefined && this.seenRequestIds.has(requestId);
+		if (requestId) this.seenRequestIds.add(requestId);
+
+		if (isRepeat) {
+			this.bumpStat('repeat');
+			return buildAutoApprovePayload(event);
+		}
+
+		const det = tryDeterministicConfirmationResponse(event);
+		if (det) {
+			this.bumpStat('deterministic');
+			return det;
+		}
+
+		const prompt = buildConfirmationPrompt(this.promptContext(), event);
+		const decision = await this.agent.decide(prompt);
+		if (!decision) {
+			this.logger?.warn(`[user-proxy] no decision; event=${summarizeEvent(event)}`);
+			this.bumpStat('fallback-no-decision');
+			return buildAutoApprovePayload(event);
+		}
+
+		const encoded = encodeConfirmationDecision(decision, (raw, parseError) =>
+			this.logger?.warn(
+				`[user-proxy] nodeParametersJson failed to parse (${String(parseError)}); raw=${raw.slice(0, 200)}`,
+			),
+		);
+		if (!encoded) {
+			this.logger?.warn(
+				`[user-proxy] action=${decision.action} did not encode to a confirmation payload`,
+			);
+			this.bumpStat('fallback-unencoded');
+			return buildAutoApprovePayload(event);
+		}
+
+		this.recordDecision(decision, encoded, event);
+		return encoded;
+	}
+
+	private bumpStat(category: ProxyDecisionCategory): void {
+		this.decisionStats[category] = (this.decisionStats[category] ?? 0) + 1;
+	}
+
+	/** Counts of proxy decisions by category. Read after the build completes. */
+	getDecisionStats(): Readonly<ProxyDecisionStats> {
+		return { ...this.decisionStats };
+	}
+
+	private recordDecision(
+		decision: Decision,
+		encoded: InstanceAiConfirmRequest,
+		event: CapturedEvent,
+	): void {
+		const category = classifyDecision(encoded);
+		this.bumpStat(category);
+		this.logger?.verbose(`[user-proxy] decision action=${decision.action} category=${category}`);
+		if (category === 'dismissal') {
+			this.logger?.warn(
+				`[user-proxy] dismissal-like response kind=${encoded.kind}; event=${summarizeEvent(event)}`,
+			);
+		}
+	}
+
+	async decideFollowUp(): Promise<NextMessageDecision> {
+		if (this.messagesSent >= this.messageBudget) {
+			this.logger?.warn(
+				`[user-proxy] message budget exhausted (${String(this.messagesSent)}/${String(this.messageBudget)}); ending conversation`,
+			);
+			return { kind: 'done' };
+		}
+
+		const prompt = buildFollowUpPrompt(this.promptContext());
+		const decision = await this.agent.decide(prompt);
+		if (!decision) return { kind: 'done' };
+
+		if (decision.action === 'send_follow_up_message') {
+			const message = decision.message.trim();
+			if (!message) return { kind: 'done' };
+			this.messagesSent++;
+			this.actualTranscript.push({ role: 'user', text: message });
+			return { kind: 'followUp', message };
+		}
+		return { kind: 'done' };
+	}
+
+	// -------------------------------------------------------------------------
+	// Internal
+	// -------------------------------------------------------------------------
+
+	private promptContext() {
+		return {
+			script: this.script,
+			actualTranscript: this.actualTranscript,
+		};
+	}
+}
+
+// ---------------------------------------------------------------------------
+// Event helpers
+// ---------------------------------------------------------------------------
+
+function extractTextDelta(event: CapturedEvent): string | undefined {
+	const directText = event.data.text;
+	if (typeof directText === 'string') return directText;
+	const payload = getNestedRecord(event.data, 'payload');
+	if (payload && typeof payload.text === 'string') return payload.text;
+	return undefined;
+}
+
+function extractRequestId(event: CapturedEvent): string | undefined {
+	const payload = getNestedRecord(event.data, 'payload');
+	if (payload) {
+		const id = getString(payload, 'requestId');
+		if (id) return id;
+	}
+	return getString(event.data, 'requestId');
+}
+
+/** Compact JSON of the event payload, truncated for log readability. */
+function summarizeEvent(event: CapturedEvent): string {
+	const payload = getNestedRecord(event.data, 'payload') ?? event.data;
+	const summary = JSON.stringify(payload);
+	return summary.length > 800 ? `${summary.slice(0, 800)}…` : summary;
+}
+
+/** Coarse category for accounting: how the proxy responded to a confirmation. */
+function classifyDecision(encoded: InstanceAiConfirmRequest): ProxyDecisionCategory {
+	if (
+		(encoded.kind === 'questions' &&
+			(encoded.answers.length === 0 || encoded.answers.every((a) => a.skipped))) ||
+		(encoded.kind === 'setupWorkflowApply' &&
+			(!encoded.nodeParameters || Object.keys(encoded.nodeParameters).length === 0))
+	) {
+		return 'dismissal';
+	}
+	if (encoded.kind === 'approval' && !encoded.approved) return 'rejection';
+	return encoded.kind;
+}
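
The classification and tallying logic can be exercised without the harness. This sketch assumes a reduced stand-in type, `ConfirmRequest`, that models only the three request kinds `classifyDecision` inspects; the real `InstanceAiConfirmRequest` from `@n8n/api-types` has more kinds and fields than are shown here.

```typescript
// Reduced stand-in for InstanceAiConfirmRequest: an assumption covering only
// the three kinds that classifyDecision inspects.
type ConfirmRequest =
	| { kind: 'questions'; answers: Array<{ questionId: string; skipped?: boolean }> }
	| { kind: 'setupWorkflowApply'; nodeParameters?: Record<string, Record<string, unknown>> }
	| { kind: 'approval'; approved: boolean };

type Category = ConfirmRequest['kind'] | 'dismissal' | 'rejection';

// Same branching as classifyDecision in the diff.
function classify(encoded: ConfirmRequest): Category {
	if (
		(encoded.kind === 'questions' &&
			(encoded.answers.length === 0 || encoded.answers.every((a) => a.skipped))) ||
		(encoded.kind === 'setupWorkflowApply' &&
			(!encoded.nodeParameters || Object.keys(encoded.nodeParameters).length === 0))
	) {
		return 'dismissal';
	}
	if (encoded.kind === 'approval' && !encoded.approved) return 'rejection';
	return encoded.kind;
}

// Tally categories the same way bumpStat does.
const categoryStats: Partial<Record<Category, number>> = {};
const sampleResponses: ConfirmRequest[] = [
	{ kind: 'questions', answers: [{ questionId: 'q1', skipped: true }] },
	{ kind: 'setupWorkflowApply', nodeParameters: {} },
	{ kind: 'approval', approved: false },
	{ kind: 'approval', approved: true },
];
for (const sample of sampleResponses) {
	const category = classify(sample);
	categoryStats[category] = (categoryStats[category] ?? 0) + 1;
}
```

After the loop, `categoryStats` holds two dismissals, one rejection, and one approval. A high dismissal count in a real run would suggest the proxy is skipping questions or sending empty setup payloads.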
--- a/packages/@n8n/instance-ai/evaluations/utils/user-proxy/prompts.ts
+++ b/packages/@n8n/instance-ai/evaluations/utils/user-proxy/prompts.ts
@ -0,0 +1,118 @@
+// Prompts for the user-proxy agent. System prompt frames the model as the
+// user; per-event prompts assemble the script + actual transcript + event.
+
+import type { CapturedEvent, ConversationTurn } from '../../types';
+import { getEventPayload } from '../confirmation-payload';
+
+export interface PromptContext {
+	/** What the user INTENDS to say across the build — the authored script. */
+	script: ConversationTurn[];
+	/** What's actually been said this run, both sides. */
+	actualTranscript: ConversationTurn[];
+}
+
+export const SYSTEM_PROMPT = `You are simulating a real user in a workflow-building conversation with an AI assistant.
+
+Stay in character as the USER. Never describe what the assistant should do — say what you, the user, want.
+
+Be brief. Real users send 1–2 sentence messages.
+
+## Always answer. Never leave fields blank.
+
+A real user shown a form does not walk away — they type something in. Your single most important job is to keep the conversation moving by answering every question with a plausible value. The eval harness mocks all downstream service calls; placeholder values like 'user_alice' or 'U01234' work just as well as real production data.
+
+Pick the value to use in this order:
+1. **Stated** — the user said it in the script or transcript. Use it verbatim.
+2. **Implied** — the user said something nearby that points at a natural reading.
+   e.g. "schedule" → daily; "Slack" without a channel → '#general'; "Linear bugs" → label='bug', state=open.
+3. **Invented but plausible** — the user never mentioned it. Make one up that's the obvious shape and would let the workflow run.
+   e.g. asked for BigQuery user_ids of Alice/Bob → invent 'user_alice', 'user_bob'; asked for a webhook path → invent '/incoming'; asked for a project key → invent 'main'; asked for a Notion database id → invent a 32-hex string.
+
+Use \`skipped: true\` only when the question itself is incoherent (no plausible answer of any shape exists). Reluctance to invent is a bug — invent.
+
+## One exception: credentials
+
+Never set credentials. They're deferred and the user will configure them via the UI. Credentials are the one and only thing left blank.
+
+## Pushing back on plans and summaries
+
+When the agent shows a plan, summary, or "here's what I'll build" preview, **audit it against the script**. The agent is designed to make assumptions rather than ask, so its plan often omits or substitutes things the user actually stated in the script.
+
+Reject when the plan misses any of the following from the script:
+- **Concrete values** — channel IDs, table names, URLs, schedules, specific node configurations. Example: "Use #engineering (C04ENGINEER1), not the generic channel you picked."
+- **Stated behaviours** — sort/order rules ("sort descending by count"), filter conditions ("only include issues outside the creator's team"), branching logic ("if X then post to Y else …"), error handling, deduplication, retry behaviour. These are as load-bearing as concrete values. Example: "The script said 'sort descending by count' but the plan doesn't include a sort step — add an explicit sort by violation count."
+
+Be specific in the rejection — quote the requirement that's missing or wrong. Don't just say "this is wrong."
+
+Accept when the plan covers every concrete value AND every stated behaviour from the script, even if the agent invented other reasonable details the script didn't specify.
+
+Real users say "no, I wanted X, not Y" — that's the proxy's primary lever for steering the build.
+
+## Composing the next user message (between-run decisions)
+
+You'll be given a SCRIPT (what the user wants overall) and the ACTUAL CONVERSATION SO FAR. After the agent's most recent turn, decide what the user would say next.
+
+- The script is a reference for what the user MIGHT say — not a checklist to mechanically deliver. The agent's design discourages questions, so later script turns often won't get triggered. That's expected.
+- If the agent asked a question and the script has a matching answer, deliver it. If the agent asked something the script doesn't cover and credentials aren't involved, give a brief plausible reply.
+- If the agent finished without asking and the plan was already approved or rejected appropriately, pick \`declare_done\`. Don't volunteer late script content as a proactive follow-up — the plan-rejection path is the right channel for steering.
+- When delivering a script user turn, adapt its wording so it reads as a real reply to the agent's last message — but keep every concrete value verbatim.
+- Don't restate what's already in the transcript.
+- Credentials: if the agent stalls on credentials, send "I'll set them up later — please build without them." Do not provide credentials.
+
+## Format
+
+On each event, pick exactly one action from the schema. The action represents what the user would do at this moment in the conversation.`;
+
+export function buildConfirmationPrompt(ctx: PromptContext, event: CapturedEvent): string {
+	return [
+		formatScriptSection(ctx),
+		formatTranscriptSection(ctx),
+		formatEventSection(event),
+		'Pick one action to respond to this confirmation as the user.',
+	].join('\n\n');
+}
+
+export function buildFollowUpPrompt(ctx: PromptContext): string {
+	return [
+		formatScriptSection(ctx),
+		formatTranscriptSection(ctx),
+		'The agent has just finished a run. Decide what the user would say next.',
+		'',
+		"Pick `send_follow_up_message` when the agent asked a question (in its last response) or stalled and needs unblocking. If the script answers the question, deliver that answer with concrete values verbatim. If the script doesn't cover it and credentials aren't involved, give a brief plausible reply.",
+		'Pick `declare_done` when the agent finished a build, the plan was already approved or rejected appropriately, or there is otherwise no open thread for the user to respond to. The script is a reference, not a checklist: late script content gets surfaced via plan rejection, not unsolicited follow-ups.',
+	].join('\n\n');
+}
+
+// ---------------------------------------------------------------------------
+// Section formatters
+// ---------------------------------------------------------------------------
+
+function formatScriptSection(ctx: PromptContext): string {
+	const lines: string[] = ['## Script (what the user intends to say across this build)'];
+	for (const turn of ctx.script) {
+		lines.push(`${turn.role === 'user' ? 'USER' : 'ASSISTANT'}: ${turn.text}`);
+	}
+	return lines.join('\n');
+}
+
+function formatTranscriptSection(ctx: PromptContext): string {
+	const lines: string[] = ['## Actual conversation so far'];
+	if (ctx.actualTranscript.length === 0) {
+		lines.push('(nothing yet)');
+	} else {
+		for (const turn of ctx.actualTranscript) {
+			lines.push(`${turn.role === 'user' ? 'USER' : 'ASSISTANT'}: ${turn.text}`);
+		}
+	}
+	return lines.join('\n');
+}
+
+function formatEventSection(event: CapturedEvent): string {
+	const payload = getEventPayload(event);
+	return [
+		'## New event requiring a response',
+		'```json',
+		JSON.stringify(payload, null, 2),
+		'```',
+	].join('\n');
+}
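
For a sense of what the section formatters hand to the model, here is a sketch of the transcript formatter applied to invented turns. The `ConversationTurn` shape is an assumption inferred from how the diff reads `role` and `text`; the function takes the turns directly instead of a `PromptContext`.

```typescript
// Assumed turn shape, inferred from how the diff reads `role` and `text`.
interface ConversationTurn {
	role: 'user' | 'assistant';
	text: string;
}

// Same logic as formatTranscriptSection, taking the turns directly.
function formatTranscript(turns: ConversationTurn[]): string {
	const lines: string[] = ['## Actual conversation so far'];
	if (turns.length === 0) {
		lines.push('(nothing yet)');
	} else {
		for (const turn of turns) {
			lines.push(`${turn.role === 'user' ? 'USER' : 'ASSISTANT'}: ${turn.text}`);
		}
	}
	return lines.join('\n');
}

// Invented conversation, for illustration only.
const emptySection = formatTranscript([]);
const filledSection = formatTranscript([
	{ role: 'user', text: 'Build a daily Slack digest.' },
	{ role: 'assistant', text: 'Which channel should I post to?' },
]);
```

`emptySection` is the heading followed by `(nothing yet)`, and `filledSection` lists one `USER:` line and one `ASSISTANT:` line under the same heading.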
--- a/packages/@n8n/instance-ai/evaluations/utils/user-proxy/tools.ts
+++ b/packages/@n8n/instance-ai/evaluations/utils/user-proxy/tools.ts
@ -0,0 +1,130 @@
+// Decision schema (structured-output target) + encoders to InstanceAiConfirmRequest.
+
+import type { InstanceAiConfirmRequest } from '@n8n/api-types';
+import { z } from 'zod';
+
+// ---------------------------------------------------------------------------
+// Decision schema — the structured-output shape the model fills
+// ---------------------------------------------------------------------------
+
+const answerSchema = z.object({
+	questionId: z.string(),
+	selectedOptions: z.array(z.string()),
+	customText: z.string().optional(),
+	skipped: z.boolean().optional(),
+});
+
+export const decisionSchema = z.discriminatedUnion('action', [
+	z.object({
+		action: z.literal('answer_questions'),
+		answers: z.array(answerSchema),
+	}),
+	z.object({
+		action: z.literal('apply_setup_wizard'),
+		// JSON-encoded object mapping nodeId -> parameter map. Emitted as a string
+		// because Anthropic structured output rejects nested z.record schemas.
+		nodeParametersJson: z.string(),
+	}),
+	z.object({
+		action: z.literal('approve_or_reject'),
+		approved: z.boolean(),
+		userInput: z.string().optional(),
+	}),
+	z.object({
+		action: z.literal('respond_to_domain_access'),
+		response: z.enum(['allow_once', 'allow_all', 'deny']),
+	}),
+	z.object({
+		action: z.literal('pick_resource_decision'),
+		decision: z.string(),
+	}),
+	z.object({
+		action: z.literal('send_follow_up_message'),
+		message: z.string(),
+	}),
+	z.object({
+		action: z.literal('declare_done'),
+	}),
+]);
+
+export type Decision = z.infer<typeof decisionSchema>;
+
+// ---------------------------------------------------------------------------
+// Tool descriptions — bundled with the prompt so the model picks the right action
+// ---------------------------------------------------------------------------
+
+export const TOOL_DESCRIPTIONS = `Available actions:
+
+- answer_questions(answers[]): The agent fired an ask-user confirmation (inputType=questions). Answer every question with a plausible value — stated → implied → invented. Invent rather than skip. Only set skipped=true when the question has no plausible answer of any shape.
+
+- apply_setup_wizard(nodeParametersJson): The agent fired a setup-wizard event with placeholder parameters. Emit a JSON string that decodes to { "<nodeId>": { "<paramName>": <value>, ... }, ... }. Fill every non-credential placeholder with a plausible value — stated → implied → invented. Never set credentials.
+
+- approve_or_reject(approved, userInput?): The agent showed a plan (plan-review) or asked an open free-text question (inputType=text). Approve if the plan matches user intent; reject with reason if it diverges.
+
+- respond_to_domain_access(response): The agent is asking for domain access permissions. Pick allow_once, allow_all, or deny. Default to allow_all unless the user would deny.
+
+- pick_resource_decision(decision): The agent is asking the user to pick a gateway resource access option. Pick the option the user would choose.
+
+- send_follow_up_message(message): Between-run decision. Send the user's next message — use when the user would continue.
+
+- declare_done(): Between-run decision. Signal that the user has gotten what they wanted and the conversation ends.`;
+
+// ---------------------------------------------------------------------------
+// Decision → InstanceAiConfirmRequest encoders
+// ---------------------------------------------------------------------------
+
+/**
+ * Encode a confirmation-response action into an InstanceAiConfirmRequest.
+ * Returns null for between-run actions (send_follow_up_message, declare_done),
+ * which the caller routes separately.
+ */
+export function encodeConfirmationDecision(
+	decision: Decision,
+	onParseFailure?: (raw: string, error: unknown) => void,
+): InstanceAiConfirmRequest | null {
+	switch (decision.action) {
+		case 'answer_questions':
+			return { kind: 'questions', answers: decision.answers };
+
+		case 'apply_setup_wizard':
+			return {
+				kind: 'setupWorkflowApply',
+				nodeParameters: parseNodeParametersJson(decision.nodeParametersJson, onParseFailure),
+			};
+
+		case 'approve_or_reject':
+			return {
+				kind: 'approval',
+				approved: decision.approved,
+				...(decision.userInput ? { userInput: decision.userInput } : {}),
+			};
+
+		case 'respond_to_domain_access':
+			return decision.response === 'deny'
+				? { kind: 'domainAccessDeny' }
+				: { kind: 'domainAccessApprove', domainAccessAction: decision.response };
+
+		case 'pick_resource_decision':
+			return { kind: 'resourceDecision', resourceDecision: decision.decision };
+
+		case 'send_follow_up_message':
+		case 'declare_done':
+			return null;
+	}
+}
+
+function parseNodeParametersJson(
+	json: string,
+	onFailure?: (raw: string, error: unknown) => void,
+): Record<string, Record<string, unknown>> {
+	try {
+		const parsed: unknown = JSON.parse(json);
+		if (parsed && typeof parsed === 'object' && !Array.isArray(parsed)) {
+			return parsed as Record<string, Record<string, unknown>>;
+		}
+		onFailure?.(json, new Error('parsed value is not a plain object'));
+	} catch (error) {
+		onFailure?.(json, error);
+	}
+	return {};
+}
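
The failure handling in `parseNodeParametersJson` is easiest to see with three inputs: valid JSON, JSON that is not a plain object, and malformed text. The sketch below copies the function and collects failures in an array; the node id and parameter values are invented.

```typescript
// Copied from tools.ts so the sketch is self-contained.
function parseNodeParametersJson(
	json: string,
	onFailure?: (raw: string, error: unknown) => void,
): Record<string, Record<string, unknown>> {
	try {
		const parsed: unknown = JSON.parse(json);
		if (parsed && typeof parsed === 'object' && !Array.isArray(parsed)) {
			return parsed as Record<string, Record<string, unknown>>;
		}
		onFailure?.(json, new Error('parsed value is not a plain object'));
	} catch (error) {
		onFailure?.(json, error);
	}
	return {};
}

// Collect the raw inputs that failed, in call order.
const failedInputs: string[] = [];
const recordFailure = (raw: string): void => {
	failedInputs.push(raw);
};

const fromValidJson = parseNodeParametersJson('{"node-1":{"channel":"#general"}}', recordFailure);
const fromJsonArray = parseNodeParametersJson('[1,2]', recordFailure);
const fromMalformed = parseNodeParametersJson('not json', recordFailure);
```

Both failing inputs produce `{}` and one callback each, so `failedInputs` ends up as `['[1,2]', 'not json']`. An empty result matters downstream: `classifyDecision` counts a `setupWorkflowApply` request with no node parameters as a `dismissal`.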
--- a/packages/@n8n/instance-ai/src/utils/eval-agents.ts
+++ b/packages/@n8n/instance-ai/src/utils/eval-agents.ts
@ -44,12 +44,6 @@ export function createEvalAgent(
 		model?: string;
 		instructions: string;
 		cache?: boolean;
-		/**
-		 * Extended-thinking config:
-		 * - 'adaptive' (default): model decides per request.
-		 * - 'off': no thinking.
-		 * - { budgetTokens: N }: fixed budget mode.
-		 */
 		thinking?: 'adaptive' | 'off' | { budgetTokens: number };
 	},
 ): Agent {
@ -64,7 +58,7 @@ export function createEvalAgent(
 		agent.instructions(options.instructions);
 	}

-	const thinking = options.thinking ?? 'adaptive';
+	const thinking = options.thinking ?? 'off';
 	if (thinking === 'adaptive') {
 		agent.thinking('anthropic', { mode: 'adaptive' });
 	} else if (typeof thinking === 'object') {