feat: add LLM eval suite for Payload conventions and code generation #15710

Open
kendelljoseph wants to merge 17 commits into main from ai/evals

Conversation


@kendelljoseph kendelljoseph commented Feb 20, 2026

Overview

The suite tests two complementary things:

  • QA evals — does the model correctly answer questions about Payload's API and conventions?
  • Codegen evals — can the model apply a specific change to a real payload.config.ts file, producing valid TypeScript with the right outcome?

Codegen evals use a three-step pipeline: LLM generation → TypeScript compilation → LLM scoring.

Skills Evaluation

Each QA suite runs in two modes to measure the impact of injecting SKILL.md as passive context:

| Spec file | System prompt | Purpose |
| --- | --- | --- |
| `eval.<suite>.spec.ts` | `qaWithSkill` — SKILL.md injected | Primary eval |
| `eval.<suite>.baseline.spec.ts` | `qaNoSkill` — no context doc | Baseline for comparison |

Both modes are passive context injection (the document goes directly into the system: field). There is no tool-call indirection. The delta between the two is a direct measure of what SKILL.md contributes.

Cache keys include systemPromptKey, so qaWithSkill and qaNoSkill results are always stored as separate entries and never collide.

Running the evals

```sh
# Run all evals (with skill, high-power model)
pnpm run test:eval

# Run all evals — baseline (no skill context, high-power model)
pnpm run test:eval -- eval.baseline

# Run a specific suite only
pnpm run test:eval -- eval.config
pnpm run test:eval -- eval.conventions

# Force a fresh run, bypassing the result cache
EVAL_NO_CACHE=true pnpm run test:eval

# Run with an interactive HTML report (opens in browser after run)
pnpm run test:eval:report

# Report for a specific suite
pnpm run test:eval:report -- eval.config
```

OPENAI_API_KEY must be set in your environment.

The test:eval:report script generates test/evals/eval-results/report.html and serves it locally via Vitest UI. The file is gitignored.

Pipelines

QA Pipeline

```mermaid
flowchart LR
    qaCase["EvalCase"]
    optFixture["fixture"]
    systemPrompt["system prompt\n(qaWithSkill or qaNoSkill)"]
    runEval["runEval"]
    scoreAnswer["scoreAnswer"]
    qaResult["EvalResult"]

    qaCase --> runEval
    optFixture -->|"injected into prompt"| runEval
    systemPrompt --> runEval
    runEval --> scoreAnswer
    scoreAnswer --> qaResult
```

Codegen Pipeline

```mermaid
flowchart LR
    codegenCase["CodegenEvalCase"]
    fixture["fixture"]
    runCodegenEval["runCodegenEval"]
    tsc["validateConfigTypes"]
    scoreConfigChange["scoreConfigChange"]
    codegenResult["EvalResult"]

    codegenCase --> fixture
    fixture --> runCodegenEval
    runCodegenEval --> tsc
    tsc -->|"valid"| scoreConfigChange
    tsc -->|"invalid"| codegenResult
    scoreConfigChange --> codegenResult
```

The tsc check is the hard gate — if the generated TypeScript does not compile, the case fails immediately without calling the scorer. This keeps the scorer focused on semantic correctness rather than syntax errors.

Codegen always uses the configModify system prompt regardless of skill variant. Codegen cache keys do not include systemPromptKey, so codegen results are shared between with-skill and baseline runs — this is intentional and correct.

Result Caching

```mermaid
flowchart LR
    start["Eval"]
    cacheCheck{"cache hit?"}
    cached["cached EvalResult"]
    run["Run full pipeline"]
    write["eval-results/cache/<hash>.json"]
    done["EvalResult"]

    start --> cacheCheck
    cacheCheck -->|"yes + EVAL_NO_CACHE unset"| cached
    cacheCheck -->|"no or EVAL_NO_CACHE=true"| run
    run --> write
    write --> done
    cached --> done
```

Cache keys include the model ID and (for QA) the systemPromptKey, so the following never collide:

  • eval.spec.ts (gpt-5.2 + qaWithSkill)
  • eval.baseline.spec.ts (gpt-5.2 + qaNoSkill)
  • eval.low-power.spec.ts (gpt-4o + qaWithSkill)

Token Usage Tracking

Every EvalResult includes a usage object covering all LLM calls for that case:

```json
{
  "result": {
    "pass": true,
    "score": 0.92,
    "usage": {
      "runner": {
        "inputTokens": 3499,
        "cachedInputTokens": 3328,
        "outputTokens": 280,
        "totalTokens": 3779
      },
      "scorer": {
        "inputTokens": 669,
        "cachedInputTokens": 0,
        "outputTokens": 89,
        "totalTokens": 758
      },
      "total": {
        "inputTokens": 4168,
        "cachedInputTokens": 3328,
        "outputTokens": 369,
        "totalTokens": 4537
      }
    }
  }
}
```
  • runner — tokens spent generating the answer or modified config.
  • scorer — tokens spent evaluating the result (consistent across skill variants since the scorer prompt is fixed).
  • total — sum of runner + scorer for full per-case cost.
  • cachedInputTokens — the key signal for skill efficiency. qaWithSkill injects SKILL.md (~3,400 tokens) into every system prompt. Once the API warms the prompt cache, ~95% of those tokens are cachedInputTokens (billed at a reduced rate), so the net new tokens per call drop to ~170 — nearly identical to the qaNoSkill baseline.

For codegen cases that fail tsc, scorer is absent and total equals runner.

Usage is stored in the cache alongside the result, so historical runs retain their token data for cost comparisons across model variants and skill configurations.

Negative Tests

The negative suite tests the evaluation pipeline itself as much as the model:

| Test | What it checks |
| --- | --- |
| Detection (QA) | Given a broken config, does the model identify the specific error? Expects ≥ 70% accuracy. |
| Correction (Codegen) | Given a broken config, does the model fix the error? tsc must pass after correction. |
| Invalid instruction | The model is explicitly told to introduce a bad field type. The test passes only if tsc catches the error and the pipeline correctly reports it as a failure. |

The three broken fixtures (invalid-field-type, invalid-access-return, missing-beforechange-return) are shared by both the detection and correction datasets.

Adding a new eval case

QA case — add an entry to the appropriate datasets/<category>/qa.ts:

```ts
{
  input: 'How do you configure Payload to send emails?',
  expected: 'set the email property in buildConfig with an adapter like nodemailerAdapter',
  category: 'config',
}
```

Codegen case — create a fixture first, then add the dataset entry:

  1. Add test/evals/fixtures/<category>/codegen/<name>/payload.config.ts — a minimal but valid config that gives the LLM context for the specific task.
  2. Add an entry to datasets/<category>/codegen.ts:
```ts
{
  input: 'Add a text field named "excerpt" to the posts collection.',
  expected: 'text field with name "excerpt" added to posts.fields',
  category: 'collections',
  fixturePath: 'collections/codegen/<name>',
}
```

The cache key for codegen includes the fixture file's content (not just its path), so updating a fixture automatically invalidates its cached result.

Debugging failed cases

Every failed case writes a JSON file to eval-results/failed-assertions/<label-slug>/. For codegen cases this includes the starter config, the LLM-generated config, tsc errors (if any), and the scorer's reasoning. For QA cases it includes the question, expected answer, actual answer, and reasoning.

The generated .ts files in eval-results/<category>/codegen/ show the last LLM output for each fixture and can be opened directly in the editor for manual inspection.

@kendelljoseph kendelljoseph changed the title feat(test): add LLM eval suite for Payload conventions and code generation feat: add LLM eval suite for Payload conventions and code generation Feb 20, 2026

github-actions bot commented Feb 20, 2026

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌


@denolfe denolfe left a comment


Good start. I think we should also use the TypeScript compiler to evaluate "correctness" of the LLM output. LLM-as-a-judge is great for free-form text, but I'd think it's possible that the LLM could evaluate output as "correct", but it still wouldn't be correct in a real TypeScript project.

What I'd like to see:

  • Each test should have its own payload.config.ts that the LLM can insert the code into
  • The TypeScript compiler can then run against the modified config
  • We still should keep the LLM-as-a-judge piece that you have here, as it's possible to get compiling code that doesn't actually fulfill the spirit of the test.

Let's look at https://github.com/vercel/next-evals-oss as a good example of this structure, which leverages their agent eval package: https://github.com/vercel-labs/agent-eval.

With the above, we should be able to get a good output on both measures of correctness of the LLM outputs.

…completeness scores

Refactors the eval scoring system to use weighted correctness and completeness subscores instead of a single boolean pass. Extracts runCodegenCase as a standalone exported function, adds averageScore to accuracy summaries, and introduces a thresholds.ts file for SCORE_THRESHOLD and ACCURACY_THRESHOLD constants.
… HTML report

Tracks token usage (input, output, cached) across runner and scorer LLM calls and attaches it to EvalResult. Renames the qa system prompt to qaWithSkill and adds a qaNoSkill baseline variant sourced from SKILL.md instead of CLAUDE.md, with one new baseline spec file per suite to enable A/B comparison. Adds @vitest/ui and a test:eval:report script for generating HTML reports.
@kendelljoseph kendelljoseph marked this pull request as ready for review February 25, 2026 21:02
Adds an eval dashboard with results and compare table views, a Payload config and generated types for storing eval runs, an eval report handler, and supporting icons/nav components. Also updates runDataset/runCodegenDataset to persist results and registers the evals app in vitest config.
