mirror of https://github.com/n8n-io/n8n.git synced 2026-05-31 08:46:58 +02:00

History

oleg ccd974ea22 feat(ai-builder): Implement Core Subgraph Infrastructure (no-changelog) (#22325 ) Signed-off-by: Oleg Ivaniv <me@olegivaniv.com>		2025-12-01 07:53:55 +01:00
..
chains	feat(ai-builder): Implement Core Subgraph Infrastructure (no-changelog) (#22325 )	2025-12-01 07:53:55 +01:00
cli	feat(ai-builder): CSV input for evals (no-changelog) (#21150 )	2025-10-24 14:15:36 +02:00
core	feat(ai-builder): Implement Core Subgraph Infrastructure (no-changelog) (#22325 )	2025-12-01 07:53:55 +01:00
langsmith	test(ai-builder): Add pairwise evaluations (no-changelog) (#22438 )	2025-11-28 15:35:53 +01:00
programmatic	fix(ai-builder): Fix import of multiple nodes with maxNode, add validation (#22348 )	2025-11-26 17:34:34 +01:00
reference-workflows	test(ai-builder): Integrate structural workflow comparison (#22209 )	2025-11-24 15:25:12 +01:00
types	test(ai-builder): Integrate structural workflow comparison (#22209 )	2025-11-24 15:25:12 +01:00
utils	test(ai-builder): Integrate structural workflow comparison (#22209 )	2025-11-24 15:25:12 +01:00
index.ts	test(ai-builder): Add pairwise evaluations (no-changelog) (#22438 )	2025-11-28 15:35:53 +01:00
load-nodes.ts	feat: Evaluation framework for AI Workflow Builder (#18016 )	2025-08-20 11:11:14 +02:00
prompts-example.ts	feat(ai-builder): Categorize prompts for taxonomy approach (#20862 )	2025-10-29 14:26:56 +00:00
README.md	test(ai-builder): Add pairwise evaluations (no-changelog) (#22438 )	2025-11-28 15:35:53 +01:00

README.md

AI Workflow Builder Evaluations

This module provides a evaluation framework for testing the AI Workflow Builder's ability to generate correct n8n workflows from natural language prompts.

Architecture Overview

The evaluation system is split into three distinct modes with a parallel evaluation architecture for optimal performance:

CLI Evaluation - Runs predefined test cases locally with progress tracking and parallel metric evaluation
Langsmith Evaluation - Integrates with Langsmith for dataset-based evaluation and experiment tracking
Pairwise Evaluation - Evaluates workflows against custom do/don't criteria from a dataset

Directory Structure

evaluations/
├── cli/                 # CLI evaluation implementation
│   ├── runner.ts       # Main CLI evaluation orchestrator
│   └── display.ts      # Console output and progress tracking
├── langsmith/          # Langsmith integration
│   ├── evaluator.ts    # Langsmith-compatible evaluator function
│   ├── runner.ts       # Langsmith evaluation orchestrator
│   └── pairwise-runner.ts # Pairwise evaluation orchestrator
├── core/               # Shared evaluation logic
│   ├── environment.ts  # Test environment setup and configuration
│   └── test-runner.ts  # Core test execution logic
├── types/              # Type definitions
│   ├── evaluation.ts   # Evaluation result schemas
│   ├── test-result.ts  # Test result interfaces
│   └── langsmith.ts    # Langsmith-specific types and guards
├── chains/             # LLM evaluation chains
│   ├── test-case-generator.ts    # Dynamic test case generation
│   ├── workflow-evaluator.ts     # Main orchestrator for parallel evaluation
│   ├── pairwise-evaluator.ts     # Pairwise do/don't criteria evaluator
│   └── evaluators/               # Individual metric evaluators
│       ├── index.ts              # Evaluator exports
│       ├── functionality-evaluator.ts      # Functional correctness evaluation
│       ├── connections-evaluator.ts        # Node connection evaluation
│       ├── expressions-evaluator.ts        # n8n expression syntax evaluation
│       ├── node-configuration-evaluator.ts # Node parameter evaluation
│       ├── efficiency-evaluator.ts         # Workflow efficiency evaluation
│       ├── data-flow-evaluator.ts          # Data flow logic evaluation
│       └── maintainability-evaluator.ts    # Code maintainability evaluation
├── utils/              # Utility functions
│   ├── evaluation-calculator.ts  # Metrics calculation
│   ├── evaluation-helpers.ts     # Common helper functions
│   ├── evaluation-reporter.ts    # Report generation
└── index.ts            # Main entry point

Implementation Details

Core Components

1. Test Runner (`core/test-runner.ts`)

The core test runner handles individual test execution:

Generates workflows using the WorkflowBuilderAgent
Validates generated workflows using type guards
Evaluates workflows against test criteria
Returns structured test results with error handling

2. Environment Setup (`core/environment.ts`)

Centralizes environment configuration:

LLM initialization with API key validation
Langsmith client setup
Node types loading
Concurrency and test generation settings

3. Workflow Evaluator (`chains/workflow-evaluator.ts`)

The main orchestrator that coordinates parallel evaluation across all metric categories:

Parallel Execution: Runs all 7 evaluators concurrently using Promise.all() for optimal performance
Score Calculation: Computes weighted overall score using the weight distribution
Summary Generation: Creates evaluation summaries based on all metric results
Critical Issues Identification: Aggregates critical violations from all evaluator categories

4. Individual Evaluators (`chains/evaluators/`)

Each metric category has its own specialized evaluator chain with tailored prompts and scoring logic:

Functionality Evaluator: Focuses on whether the workflow achieves explicitly requested goals Connections Evaluator: Analyzes node connections and data flow paths Expressions Evaluator: Validates n8n expression syntax and data references Node Configuration Evaluator: Checks parameter configuration and required fields Efficiency Evaluator: Evaluates redundancy, path optimization, and node count efficiency Data Flow Evaluator: Analyzes data transformations and validation logic Maintainability Evaluator: Assesses naming, organization, and structural quality

5. Langsmith Integration

The Langsmith integration provides two key components:

Evaluator (langsmith/evaluator.ts):

Converts Langsmith Run objects to evaluation inputs
Validates all data using type guards before processing
Safely extracts usage metadata without type coercion
Returns structured evaluation results from the parallel evaluation system

Runner (langsmith/runner.ts):

Creates workflow generation functions compatible with Langsmith
Validates message content before processing
Extracts usage metrics safely from message metadata
Handles dataset verification and error reporting

6. Pairwise Evaluation

Pairwise evaluation provides a simpler, criteria-based approach to workflow evaluation. Instead of using the complex multi-metric evaluation system, it evaluates workflows against a custom set of "do" and "don't" rules defined in the dataset.

Evaluator (chains/pairwise-evaluator.ts):

Evaluates workflows against a checklist of criteria (dos and don'ts)
Uses an LLM to determine if each criterion passes or fails
Requires evidence-based justification for each decision
Calculates a simple pass/fail score (passes / total rules)

Runner (langsmith/pairwise-runner.ts):

Generates workflows from prompts in the dataset
Applies pairwise evaluation to each generated workflow
Reports three metrics to Langsmith:
- pairwise_score: Overall score (0-1)
- pairwise_passed_count: Number of criteria passed
- pairwise_failed_count: Number of criteria violated

Dataset Format: The pairwise evaluation expects a Langsmith dataset with examples containing:

{
  "inputs": {
    "prompt": "Create a workflow that...",
    "evals": {
      "dos": "Use HTTP Request node for API calls\nInclude error handling",
      "donts": "Don't use deprecated nodes\nDon't hardcode credentials"
    }
  }
}

Note: dos and donts are newline-separated strings, not arrays.

7. CLI Evaluation

The CLI evaluation provides local testing capabilities:

Runner (cli/runner.ts):

Orchestrates parallel test execution with concurrency control
Manages test case generation when enabled
Generates detailed reports and saves results

Display (cli/display.ts):

Progress bar management for real-time feedback
Console output formatting
Error display and reporting

Evaluation Metrics

The system evaluates workflows across seven categories, with each category having its own specialized evaluator chain that runs in parallel:

Functionality (25% weight)
- Does the workflow achieve the intended goal?
- Are the right nodes selected?
- Is core functionality explicitly requested implemented?
Connections (15% weight)
- Are nodes properly connected?
- Is data flow logical?
- Are connection paths optimized?
Expressions (15% weight)
- Are n8n expressions syntactically correct?
- Do they reference valid data paths?
- Are expressions efficient and maintainable?
Node Configuration (15% weight)
- Are node parameters properly set?
- Are required fields populated?
- Are configurations appropriate for the use case?
Efficiency (10% weight)
- Redundancy Score: Avoiding duplicate operations that could be consolidated
- Path Optimization: Using optimal execution paths
- Node Count Efficiency: Using minimal necessary nodes
- Are backup/fallback paths intentional vs. wasteful?
Data Flow (10% weight)
- Is data flowing correctly between nodes?
- Are data transformations logical and necessary?
- Is data validation properly implemented?
Maintainability (5% weight)
- Node Naming Quality: Are nodes descriptively named?
- Workflow Organization: Is the structure logically organized?
- Modularity: Are components reusable and well-structured?
Structural Similarity (5% weight, optional)
- How closely does the structure match a reference workflow?
- Only evaluated when reference workflow is provided

Violation Severity Levels

Violations are categorized by severity:

Critical (-40 to -50 points): Workflow-breaking issues
Major (-15 to -25 points): Significant problems affecting functionality
Minor (-5 to -15 points): Non-critical issues or inefficiencies

Running Evaluations

CLI Evaluation

# Run with default settings
pnpm eval

# Run a specific test case
pnpm eval --test-case google-sheets-processing
pnpm eval --test-case extract-from-file

# With additional generated test cases
GENERATE_TEST_CASES=true pnpm eval

# With custom concurrency
EVALUATION_CONCURRENCY=10 pnpm eval

Langsmith Evaluation

# Set required environment variables
export LANGSMITH_API_KEY=your_api_key
# Optionally specify dataset
export LANGSMITH_DATASET_NAME=your_dataset_name

# Run evaluation
pnpm eval:langsmith

Pairwise Evaluation

Pairwise evaluation uses a dataset with custom do/don't criteria for each prompt.

# Set required environment variables
export LANGSMITH_API_KEY=your_api_key

# Run pairwise evaluation (uses default dataset: notion-pairwise-workflows)
pnpm eval:pairwise

# Use a custom dataset
LANGSMITH_DATASET_NAME=my-pairwise-dataset pnpm eval:pairwise

# Limit to specific number of examples (useful for testing)
EVAL_MAX_EXAMPLES=2 pnpm eval:pairwise

# Run with multiple repetitions
pnpm eval:pairwise --repetitions 3

Configuration

Required Files

nodes.json

IMPORTANT: The evaluation framework requires a nodes.json file in the evaluations root directory (evaluations/nodes.json).

This file contains all n8n node type definitions and is used by the AI Workflow Builder agent to:

Know what nodes are available in n8n
Understand node parameters and their schemas
Generate valid workflows with proper node configurations

Why is this required? The AI Workflow Builder agent needs access to node definitions to generate workflows. In a normal n8n runtime, these definitions are loaded automatically. However, since the evaluation framework instantiates the agent without a running n8n instance, we must provide the node definitions manually via nodes.json.

How to generate nodes.json:

Run your n8n instance
Download the node definitions from locally running n8n instance(http://localhost:5678/types/nodes.json)
Save the node definitions to evaluations/nodes.json curl -o evaluations/nodes.json http://localhost:5678/types/nodes.json

The evaluation will fail with a clear error message if nodes.json is missing.

Environment Variables

N8N_AI_ANTHROPIC_KEY - Required for LLM access
LANGSMITH_API_KEY - Required for Langsmith evaluation
USE_LANGSMITH_EVAL - Set to "true" to use Langsmith mode
USE_PAIRWISE_EVAL - Set to "true" to use pairwise evaluation mode
LANGSMITH_DATASET_NAME - Override default dataset name
EVAL_MAX_EXAMPLES - Limit number of examples to evaluate (useful for testing)
EVALUATION_CONCURRENCY - Number of parallel test executions (default: 5)
GENERATE_TEST_CASES - Set to "true" to generate additional test cases
LLM_MODEL - Model identifier for metadata tracking

Output

CLI Evaluation Output

Console Display: Real-time progress, test results, and summary statistics
Markdown Report: results/evaluation-report-[timestamp].md
JSON Results: results/evaluation-results-[timestamp].json

Langsmith Evaluation Output

Results are stored in Langsmith dashboard
Experiment name format: workflow-builder-evaluation-[date]
Includes detailed metrics for each evaluation category

Pairwise Evaluation Output

Results are stored in Langsmith dashboard
Experiment name format: pairwise-evals-[uuid]
Metrics reported:
- pairwise_score: Overall pass rate (0-1)
- pairwise_passed_count: Number of criteria that passed
- pairwise_failed_count: Number of criteria that were violated
Each result includes detailed comments with:
- List of violations with justifications
- List of passes with justifications

Adding New Test Cases

Test cases are defined in chains/test-case-generator.ts. Each test case requires:

id: Unique identifier
name: Descriptive name
prompt: Natural language description of the workflow to generate
referenceWorkflow (optional): Expected workflow structure for comparison

Extending the Framework

To add new evaluation metrics:

Create a new evaluator file in chains/evaluators/ following the existing pattern
Update the EvaluationResult schema in types/evaluation.ts to include the new metric
Add the new evaluator to the exports in chains/evaluators/index.ts
Import and call the new evaluator in chains/workflow-evaluator.ts's Promise.all() array
Adjust weight calculations in the calculateWeightedScore function
Update the evaluator in langsmith/evaluator.ts to include new metrics