feat: add LLM eval suite for Payload conventions and code generation #15710

Open
kendelljoseph wants to merge 17 commits into main from ai/evals

Conversation


@kendelljoseph kendelljoseph commented Feb 20, 2026

Overview

The suite tests two complementary things:

  • QA evals — does the model correctly answer questions about Payload's API and conventions?
  • Codegen evals — can the model apply a specific change to a real payload.config.ts file, producing valid TypeScript with the right outcome?

Codegen evals use a three-step pipeline: LLM generation → TypeScript compilation → LLM scoring.

Skills Evaluation

Each QA suite runs in two modes to measure the impact of injecting SKILL.md as passive context:

| Spec file | System prompt | Purpose |
| --- | --- | --- |
| `eval.<suite>.spec.ts` | `qaWithSkill` — SKILL.md injected | Primary eval |
| `eval.<suite>.baseline.spec.ts` | `qaNoSkill` — no context doc | Baseline for comparison |

Both modes are passive context injection (the document goes directly into the system: field). There is no tool-call indirection. The delta between the two is a direct measure of what SKILL.md contributes.

Cache keys include systemPromptKey, so qaWithSkill and qaNoSkill results are always stored as separate entries and never collide.

Running the evals

```sh
# Run all evals (with skill, high-power model)
pnpm run test:eval

# Run all evals — baseline (no skill context, high-power model)
pnpm run test:eval -- eval.baseline

# Run a specific suite only
pnpm run test:eval -- eval.config
pnpm run test:eval -- eval.conventions

# Force a fresh run, bypassing the result cache
EVAL_NO_CACHE=true pnpm run test:eval

# Run with an interactive HTML report (opens in browser after run)
pnpm run test:eval:report

# Report for a specific suite
pnpm run test:eval:report -- eval.config
```

OPENAI_API_KEY must be set in your environment.

The test:eval:report script generates test/evals/eval-results/report.html and serves it locally via Vitest UI. The file is gitignored.

Pipelines

QA Pipeline

```mermaid
flowchart LR
    qaCase["EvalCase"]
    optFixture["fixture"]
    systemPrompt["system prompt\n(qaWithSkill or qaNoSkill)"]
    runEval["runEval"]
    scoreAnswer["scoreAnswer"]
    qaResult["EvalResult"]

    qaCase --> runEval
    optFixture -->|"injected into prompt"| runEval
    systemPrompt --> runEval
    runEval --> scoreAnswer
    scoreAnswer --> qaResult
```

Codegen Pipeline

```mermaid
flowchart LR
    codegenCase["CodegenEvalCase"]
    fixture["fixture"]
    runCodegenEval["runCodegenEval"]
    tsc["validateConfigTypes"]
    scoreConfigChange["scoreConfigChange"]
    codegenResult["EvalResult"]

    codegenCase --> fixture
    fixture --> runCodegenEval
    runCodegenEval --> tsc
    tsc -->|"valid"| scoreConfigChange
    tsc -->|"invalid"| codegenResult
    scoreConfigChange --> codegenResult
```

The tsc check is the hard gate — if the generated TypeScript does not compile, the case fails immediately without calling the scorer. This keeps the scorer focused on semantic correctness rather than syntax errors.

Codegen always uses the configModify system prompt regardless of skill variant. Codegen cache keys do not include systemPromptKey, so codegen results are shared between with-skill and baseline runs — this is intentional and correct.

Result Caching

```mermaid
flowchart LR
    start["Eval"]
    cacheCheck{"cache hit?"}
    cached["cached EvalResult"]
    run["Run full pipeline"]
    write["eval-results/cache/<hash>.json"]
    done["EvalResult"]

    start --> cacheCheck
    cacheCheck -->|"yes + EVAL_NO_CACHE unset"| cached
    cacheCheck -->|"no or EVAL_NO_CACHE=true"| run
    run --> write
    write --> done
    cached --> done
```

Cache keys include the model ID and (for QA) the systemPromptKey, so the following never collide:

  • eval.spec.ts (gpt-5.2 + qaWithSkill)
  • eval.baseline.spec.ts (gpt-5.2 + qaNoSkill)
  • eval.low-power.spec.ts (gpt-4o + qaWithSkill)

Token Usage Tracking

Every EvalResult includes a usage object covering all LLM calls for that case:

```json
{
  "result": {
    "pass": true,
    "score": 0.92,
    "usage": {
      "runner": {
        "inputTokens": 3499,
        "cachedInputTokens": 3328,
        "outputTokens": 280,
        "totalTokens": 3779
      },
      "scorer": {
        "inputTokens": 669,
        "cachedInputTokens": 0,
        "outputTokens": 89,
        "totalTokens": 758
      },
      "total": {
        "inputTokens": 4168,
        "cachedInputTokens": 3328,
        "outputTokens": 369,
        "totalTokens": 4537
      }
    }
  }
}
```
  • runner — tokens spent generating the answer or modified config.
  • scorer — tokens spent evaluating the result (consistent across skill variants since the scorer prompt is fixed).
  • total — sum of runner + scorer for full per-case cost.
  • cachedInputTokens — the key signal for skill efficiency. qaWithSkill injects SKILL.md (~3,400 tokens) into every system prompt. Once the API warms the prompt cache, ~95% of those tokens are cachedInputTokens (billed at a reduced rate), so the net new tokens per call drop to ~170 — nearly identical to the qaNoSkill baseline.

For codegen cases that fail tsc, scorer is absent and total equals runner.

Usage is stored in the cache alongside the result, so historical runs retain their token data for cost comparisons across model variants and skill configurations.

Negative Tests

The negative suite tests the evaluation pipeline itself as much as the model:

| Test | What it checks |
| --- | --- |
| Detection (QA) | Given a broken config, does the model identify the specific error? Expects ≥ 70% accuracy. |
| Correction (Codegen) | Given a broken config, does the model fix the error? tsc must pass after correction. |
| Invalid instruction | The model is explicitly told to introduce a bad field type. The test passes only if tsc catches the error and the pipeline correctly reports it as a failure. |

The three broken fixtures (invalid-field-type, invalid-access-return, missing-beforechange-return) are shared by both the detection and correction datasets.

Adding a new eval case

QA case — add an entry to the appropriate datasets/<category>/qa.ts:

```ts
{
  input: 'How do you configure Payload to send emails?',
  expected: 'set the email property in buildConfig with an adapter like nodemailerAdapter',
  category: 'config',
}
```

Codegen case — create a fixture first, then add the dataset entry:

  1. Add test/evals/fixtures/<category>/codegen/<name>/payload.config.ts — a minimal but valid config that gives the LLM context for the specific task.
  2. Add an entry to datasets/<category>/codegen.ts:
```ts
{
  input: 'Add a text field named "excerpt" to the posts collection.',
  expected: 'text field with name "excerpt" added to posts.fields',
  category: 'collections',
  fixturePath: 'collections/codegen/<name>',
}
```

The cache key for codegen includes the fixture file's content (not just its path), so updating a fixture automatically invalidates its cached result.

Debugging failed cases

Every failed case writes a JSON file to eval-results/failed-assertions/<label-slug>/. For codegen cases this includes the starter config, the LLM-generated config, tsc errors (if any), and the scorer's reasoning. For QA cases it includes the question, expected answer, actual answer, and reasoning.

The generated .ts files in eval-results/<category>/codegen/ show the last LLM output for each fixture and can be opened directly in the editor for manual inspection.

@kendelljoseph kendelljoseph changed the title feat(test): add LLM eval suite for Payload conventions and code generation feat: add LLM eval suite for Payload conventions and code generation Feb 20, 2026

github-actions bot commented Feb 20, 2026

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌


@denolfe denolfe left a comment


Good start. I think we should also use the TypeScript compiler to evaluate "correctness" of the LLM output. LLM-as-a-judge is great for free-form text, but I'd think it's possible that the LLM could evaluate output as "correct", but it still wouldn't be correct in a real TypeScript project.

What I'd like to see:

  • Each test should have its own payload.config.ts that the LLM can insert the code into
  • The TypeScript compiler can then run against the modified config
  • We still should keep the LLM-as-a-judge piece that you have here, as it's possible to get compiling code that doesn't actually fulfill the spirit of the test.

Let's look at https://github.com/vercel/next-evals-oss as a good example of this structure, which leverages their agent eval package: https://github.com/vercel-labs/agent-eval.

With the above, we should be able to get a good output on both measures of correctness of the LLM outputs.

…completeness scores

Refactors the eval scoring system to use weighted correctness and completeness subscores instead of a single boolean pass. Extracts runCodegenCase as a standalone exported function, adds averageScore to accuracy summaries, and introduces a thresholds.ts file for SCORE_THRESHOLD and ACCURACY_THRESHOLD constants.
… HTML report

Tracks token usage (input, output, cached) across runner and scorer LLM calls and attaches it to EvalResult. Renames the qa system prompt to qaWithSkill and adds a qaNoSkill baseline variant sourced from SKILL.md instead of CLAUDE.md, with one new baseline spec file per suite to enable A/B comparison. Adds @vitest/ui and a test:eval:report script for generating HTML reports.
@kendelljoseph kendelljoseph marked this pull request as ready for review February 25, 2026 21:02
Adds an eval dashboard with results and compare table views, a Payload config and generated types for storing eval runs, an eval report handler, and supporting icons/nav components. Also updates runDataset/runCodegenDataset to persist results and registers the evals app in vitest config.
