feat: add LLM eval suite for Payload conventions and code generation #15710
Open
kendelljoseph wants to merge 17 commits into main from
Conversation
Contributor
📦 esbuild Bundle Analysis for payload
This analysis was generated by esbuild-bundle-analyzer. 🤖
denolfe (Member) reviewed on Feb 20, 2026 and left a comment:
Good start. I think we should also use the TypeScript compiler to evaluate "correctness" of the LLM output. LLM-as-a-judge is great for free-form text, but I'd think it's possible that the LLM could evaluate output as "correct", but it still wouldn't be correct in a real TypeScript project.
What I'd like to see:
- Each test should have its own `payload.config.ts` that the LLM can insert the code into
- The TypeScript compiler can then run against the modified config
- We still should keep the LLM-as-a-judge piece that you have here, as it's possible to get compiling code that doesn't actually fulfill the spirit of the test.
Let's look at https://github.com/vercel/next-evals-oss as a good example of this structure, which leverages their agent eval package: https://github.com/vercel-labs/agent-eval.
With the above, we should be able to get a good output on both measures of correctness of the LLM outputs.
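The compile-check step described above could be sketched roughly as follows. This is an illustration, not the PR's actual implementation: the helper names, the `npx tsc --noEmit` invocation, and the diagnostic format are assumptions.

```typescript
import { spawnSync } from "node:child_process";

interface CompileResult {
  ok: boolean;
  errors: string[];
}

// Pure helper: extract "error TSxxxx" diagnostic lines from tsc output so
// they can be attached to an eval result.
function parseTscErrors(output: string): string[] {
  return output.split("\n").filter((line) => /error TS\d+/.test(line));
}

// Hypothetical sketch: after the LLM's code has been inserted into a test's
// payload.config.ts, run the TypeScript compiler against it and report
// whether it type-checks.
function validateConfig(configPath: string): CompileResult {
  // --noEmit: we only want diagnostics, not build artifacts.
  const proc = spawnSync("npx", ["tsc", "--noEmit", configPath], {
    encoding: "utf8",
  });
  const errors = parseTscErrors(`${proc.stdout ?? ""}\n${proc.stderr ?? ""}`);
  return { ok: proc.status === 0 && errors.length === 0, errors };
}
```

A compile failure can then short-circuit the pipeline before the LLM-as-a-judge step runs.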
…completeness scores
Refactors the eval scoring system to use weighted correctness and completeness subscores instead of a single boolean pass. Extracts `runCodegenCase` as a standalone exported function, adds `averageScore` to accuracy summaries, and introduces a `thresholds.ts` file for `SCORE_THRESHOLD` and `ACCURACY_THRESHOLD` constants.
… HTML report
Tracks token usage (input, output, cached) across runner and scorer LLM calls and attaches it to `EvalResult`. Renames the qa system prompt to `qaWithSkill` and adds a `qaNoSkill` baseline variant sourced from SKILL.md instead of CLAUDE.md, with one new baseline spec file per suite to enable A/B comparison. Adds `@vitest/ui` and a `test:eval:report` script for generating HTML reports.
Adds an eval dashboard with results and compare table views, a Payload config and generated types for storing eval runs, an eval report handler, and supporting icons/nav components. Also updates runDataset/runCodegenDataset to persist results and registers the evals app in vitest config.
Overview
The suite tests two complementary things:

- QA — does the model answer questions about Payload conventions correctly?
- Codegen — can the model modify a `payload.config.ts` file, producing valid TypeScript with the right outcome?

Codegen evals use a three-step pipeline: LLM generation → TypeScript compilation → LLM scoring.

Skills Evaluation
Each QA suite runs in two modes to measure the impact of injecting SKILL.md as passive context:

- `eval.<suite>.spec.ts` — `qaWithSkill` — SKILL.md injected
- `eval.<suite>.baseline.spec.ts` — `qaNoSkill` — no context doc

Both modes are passive context injection (the document goes directly into the `system:` field). There is no tool-call indirection. The delta between the two is a direct measure of what SKILL.md contributes.

Running the evals
`OPENAI_API_KEY` must be set in your environment.

The `test:eval:report` script generates `test/evals/eval-results/report.html` and serves it locally via Vitest UI. The file is gitignored.

Pipelines
QA Pipeline
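The flowchart below can be read as roughly the following orchestration. The step names (`runEval`, `scoreAnswer`, `EvalResult`) come from the diagram; the parameter shapes and helper signatures here are assumptions for illustration only.

```typescript
// Assumed case/result shapes for the sketch; the real types live in the suite.
interface EvalCase {
  question: string;
  expected: string;
  fixture?: string; // optional document injected into the prompt
}

interface EvalResult {
  pass: boolean;
  score: number;
  reasoning: string;
}

type LlmCall = (system: string, user: string) => Promise<string>;

// Sketch of one QA case: build the prompt (fixture injected when present),
// generate an answer, then hand it to the LLM-as-a-judge scorer.
async function runQaCase(
  qaCase: EvalCase,
  systemPrompt: string, // qaWithSkill (SKILL.md injected) or qaNoSkill
  llm: LlmCall,
  scoreAnswer: (expected: string, actual: string) => Promise<EvalResult>,
): Promise<EvalResult> {
  const user = qaCase.fixture
    ? `${qaCase.fixture}\n\n${qaCase.question}`
    : qaCase.question;
  const answer = await llm(systemPrompt, user);
  return scoreAnswer(qaCase.expected, answer);
}
```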
```mermaid
flowchart LR
  qaCase["EvalCase"]
  optFixture["fixture"]
  systemPrompt["system prompt\n(qaWithSkill or qaNoSkill)"]
  runEval["runEval"]
  scoreAnswer["scoreAnswer"]
  qaResult["EvalResult"]
  qaCase --> runEval
  optFixture -->|"injected into prompt"| runEval
  systemPrompt --> runEval
  runEval --> scoreAnswer
  scoreAnswer --> qaResult
```

Codegen Pipeline
```mermaid
flowchart LR
  codegenCase["CodegenEvalCase"]
  fixture["fixture"]
  runCodegenEval["runCodegenEval"]
  tsc["validateConfigTypes"]
  scoreConfigChange["scoreConfigChange"]
  codegenResult["EvalResult"]
  codegenCase --> fixture
  fixture --> runCodegenEval
  runCodegenEval --> tsc
  tsc -->|"valid"| scoreConfigChange
  tsc -->|"invalid"| codegenResult
  scoreConfigChange --> codegenResult
```

Result Caching
```mermaid
flowchart LR
  start["Eval"]
  cacheCheck{"cache hit?"}
  cached["cached EvalResult"]
  run["Run full pipeline"]
  write["eval-results/cache/<hash>.json"]
  done["EvalResult"]
  start --> cacheCheck
  cacheCheck -->|"yes + EVAL_NO_CACHE unset"| cached
  cacheCheck -->|"no or EVAL_NO_CACHE=true"| run
  run --> write
  write --> done
  cached --> done
```

Cache keys include the model ID and (for QA) the `systemPromptKey`, so the following never collide:

- `eval.spec.ts` (gpt-5.2 + qaWithSkill)
- `eval.baseline.spec.ts` (gpt-5.2 + qaNoSkill)
- `eval.low-power.spec.ts` (gpt-4o + qaWithSkill)

Token Usage Tracking
Every `EvalResult` includes a `usage` object covering all LLM calls for that case:

```json
{
  "result": {
    "pass": true,
    "score": 0.92,
    "usage": {
      "runner": {
        "inputTokens": 3499,
        "cachedInputTokens": 3328,
        "outputTokens": 280,
        "totalTokens": 3779
      },
      "scorer": {
        "inputTokens": 669,
        "cachedInputTokens": 0,
        "outputTokens": 89,
        "totalTokens": 758
      },
      "total": {
        "inputTokens": 4168,
        "cachedInputTokens": 3328,
        "outputTokens": 369,
        "totalTokens": 4537
      }
    }
  }
}
```

- `runner` — tokens spent generating the answer or modified config.
- `scorer` — tokens spent evaluating the result (consistent across skill variants since the scorer prompt is fixed).
- `total` — sum of runner + scorer for full per-case cost.
- `cachedInputTokens` — the key signal for skill efficiency. `qaWithSkill` injects SKILL.md (~3,400 tokens) into every system prompt. Once the API warms the prompt cache, ~95% of those tokens are `cachedInputTokens` (billed at a reduced rate), so the net new token count per call drops to ~170, nearly identical to the `qaNoSkill` baseline.

For codegen cases that fail tsc, `scorer` is absent and `total` equals `runner`.

Usage is stored in the cache alongside the result, so historical runs retain their token data for cost comparisons across model variants and skill configurations.
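The `total` block is derivable from the other two. A minimal sketch, using the field names from the JSON above (the function name is hypothetical; the absent-scorer branch mirrors the tsc-failure rule):

```typescript
interface TokenUsage {
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

// Sum runner + scorer into the `total` block. When a codegen case fails
// tsc, the scorer never runs, so `total` is just the runner usage.
function sumUsage(runner: TokenUsage, scorer?: TokenUsage): TokenUsage {
  if (!scorer) return { ...runner };
  return {
    inputTokens: runner.inputTokens + scorer.inputTokens,
    cachedInputTokens: runner.cachedInputTokens + scorer.cachedInputTokens,
    outputTokens: runner.outputTokens + scorer.outputTokens,
    totalTokens: runner.totalTokens + scorer.totalTokens,
  };
}
```

Applied to the example above, `sumUsage(runner, scorer)` reproduces the `total` block exactly.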
Negative Tests
The negative suite tests the evaluation pipeline itself as much as the model.

The three broken fixtures (`invalid-field-type`, `invalid-access-return`, `missing-beforechange-return`) are shared by both the detection and correction datasets.

Adding a new eval case
QA case — add an entry to the appropriate `datasets/<category>/qa.ts`.

Codegen case — create a fixture first, then add the dataset entry:

- `test/evals/fixtures/<category>/codegen/<name>/payload.config.ts` — a minimal but valid config that gives the LLM context for the specific task.
- `datasets/<category>/codegen.ts` — the dataset entry itself.

The cache key for codegen includes the fixture file's content (not just its path), so updating a fixture automatically invalidates its cached result.
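A cache key with those properties (model ID, QA `systemPromptKey`, codegen fixture content) could be built as below. This is a sketch of the invalidation behavior, not the suite's actual hashing scheme; the field names and separator are assumptions.

```typescript
import { createHash } from "node:crypto";

// Hash everything that should invalidate a cached result: the model ID,
// the system prompt variant (QA), and for codegen the fixture file's
// *content*, so editing a fixture automatically busts its cache entry.
function cacheKey(parts: {
  modelId: string;
  systemPromptKey?: string; // QA only
  fixtureContent?: string; // codegen only
}): string {
  const hash = createHash("sha256");
  // NUL separators prevent ambiguous concatenations from colliding.
  hash.update(parts.modelId);
  hash.update("\0");
  hash.update(parts.systemPromptKey ?? "");
  hash.update("\0");
  hash.update(parts.fixtureContent ?? "");
  return hash.digest("hex");
}
```

With this shape, `gpt-5.2 + qaWithSkill`, `gpt-5.2 + qaNoSkill`, and `gpt-4o + qaWithSkill` all map to distinct keys, and any edit to a codegen fixture yields a fresh key.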
Debugging failed cases
Every failed case writes a JSON file to `eval-results/failed-assertions/<label-slug>/`. For codegen cases this includes the starter config, the LLM-generated config, tsc errors (if any), and the scorer's reasoning. For QA cases it includes the question, expected answer, actual answer, and reasoning.

The generated `.ts` files in `eval-results/<category>/codegen/` show the last LLM output for each fixture and can be opened directly in the editor for manual inspection.