feat(evals): add Promptfoo-based AI tool call evaluation suite #2351
Add automated evaluation infrastructure for validating LLM tool call quality across SuperDoc's Document Engine API. Tests whether models select the correct tools and construct valid arguments when given document editing tasks. The suite extracts 6 essential tool definitions from the SDK and runs them against multiple OpenAI models and cross-provider comparisons (Anthropic, Google). Includes deterministic assertions for tool selection, argument accuracy, and production correctness rules learned from the labs agent implementation.
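One of the deterministic assertions described above can be illustrated with a small check function. This is a hypothetical sketch (`assertToolCall` and the specific rule shown are illustrative, not the suite's actual assertion code):

```javascript
// Hypothetical deterministic assertion (illustrative names; the suite's real
// assertion code may differ): verify which tool was picked and the argument shape.
function assertToolCall(calls, expectedTool) {
  if (!calls || calls.length === 0) return { pass: false, reason: "no tool call" };
  const call = calls[0];
  if (call.function.name !== expectedTool) {
    return { pass: false, reason: `expected ${expectedTool}, got ${call.function.name}` };
  }
  let args;
  try {
    args = JSON.parse(call.function.arguments); // OpenAI encodes arguments as a JSON string
  } catch {
    return { pass: false, reason: "arguments are not valid JSON" };
  }
  // Example production-correctness rule: `select` must be a typed object,
  // not a bare string.
  if (typeof args.select === "string") {
    return { pass: false, reason: 'select must be { type: "text", ... }, not a bare string' };
  }
  return { pass: true };
}
```

Checks like this stay deterministic (no LLM grading), so a failing run always points at a concrete tool-selection or argument-construction mistake.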
Updated the GDPval benchmark configuration to include distinct prompts for SuperDoc tool-augmented and baseline models. Enhanced the test assertions in the GDPval workflows to provide clearer scoring criteria for model responses, focusing on the specificity and executable nature of the responses. Adjusted thresholds for scoring to better reflect the quality of tool calls and text descriptions in document editing tasks.
Changed the model identifier from GPT-4o to GPT-5.4 in the GDPval benchmark configuration for both SuperDoc tool-augmented and baseline prompts, ensuring alignment with the latest model updates.
…oc agent Introduced a new execution test suite for the SuperDoc agent, validating real document editing capabilities through the CLI. Added a new script command for executing these tests and updated the GDPval configuration to reflect the latest GPT model version. Included necessary dependencies and created a new provider for the SuperDoc agent to facilitate the execution of tool calls against DOCX files.
… validation Updated the SuperDoc agent to create temporary copies of documents for editing, ensuring original fixtures remain unaltered. Implemented round-trip validation to verify that edits persist after saving and re-opening DOCX files. Added a new memorandum fixture and expanded execution tests to cover various document editing scenarios, enhancing overall test coverage and reliability.
…rvation Enhanced the SuperDoc agent to include a `keepFile` option, allowing users to save edited documents to a specified output directory. Updated the logic to create the output directory if it doesn't exist and modified the cleanup process to conditionally copy the edited document based on this new option. Adjusted execution tests to validate the new functionality, ensuring comprehensive coverage of document editing scenarios.
…ate execution logic Enhanced the SuperDoc agent's execution configuration by increasing the `maxConcurrency` from 1 to 5, allowing for more efficient concurrent test execution. Updated the cleanup process to ensure isolated state directories are properly managed, improving resource handling during tests. Adjusted execution tests to reflect these changes, ensuring robust validation of document editing capabilities.
…ool configuration Refactored the SuperDoc agent's evaluation scripts to streamline the execution process and improve clarity. Removed the deprecated cross-provider configuration and consolidated tool evaluation logic into a unified structure. Introduced new assertion checks for tool quality and argument accuracy, ensuring comprehensive validation of document editing tasks. Updated the test suite to reflect these changes, enhancing overall test coverage and reliability.
…or SuperDoc agent Introduced the AI Gateway API key in the environment configuration to enable optional integration with Vercel AI Gateway. Added a new script command for executing evaluations through the gateway, enhancing the SuperDoc agent's capabilities. Created a new YAML configuration file for execution tests via the AI Gateway, allowing for testing across multiple models. Updated the package dependencies to include the necessary SDK for AI Gateway functionality.
…mer prompt tests Updated the SuperDoc agent to include tracking of total usage and steps during text generation, improving performance insights. Added a series of customer prompt tests in YAML format to validate various document editing tasks, ensuring comprehensive coverage of real-world scenarios. This enhancement aims to bolster the agent's capabilities and testing framework.
…d files Removed the JavaScript assertion file and context builder, simplifying the evaluation framework. Updated the prompt configuration to eliminate unused metrics and added new document fixtures for testing. Enhanced execution tests to validate document editing capabilities with the new fixtures, ensuring comprehensive coverage of various scenarios.
Updated the model labels in the execution gateway configuration for clarity and accuracy. Refined the execution test descriptions to better reflect the specific tasks being validated, enhancing the readability and intent of the tests. Commented out deprecated Google provider configurations to streamline the YAML files.
…files Updated the .gitignore to exclude temporary files and removed deprecated YAML configuration files related to GDPval and execution tests. Streamlined the package.json by eliminating unused evaluation scripts, enhancing overall project organization and clarity.
…agement Updated pnpm-lock.yaml to reflect new versions of dependencies, including @types/node and added new SDK entries for SuperDoc. Modified .gitignore to exclude additional temporary files and states, improving project cleanliness and organization.
Added a caching system to the SuperDoc agent and gateway providers to improve performance by storing and retrieving results based on a generated cache key. Updated the utility functions to handle cache operations, ensuring efficient reuse of previous evaluation results. Modified the evaluation logic to check for cached results before executing tasks, enhancing overall efficiency in the evaluation framework. Additionally, updated the package.json to reflect changes in evaluation scripts and added a new YAML configuration for end-to-end tests via the AI Gateway.
…nhanced documentation Updated the evaluation framework to include two levels of testing: tool quality and execution. Enhanced the README to clarify testing processes, commands, and configurations. Introduced new YAML files for tool quality and execution tests, detailing the number of tests and providers involved. Improved command descriptions for better usability and added new document fixtures for comprehensive testing of document editing capabilities.
Introduced a new Vercel tools provider for the SuperDoc evaluation framework, enabling structured tool calls with the Vercel AI SDK. Updated the package.json to include a new script for evaluating tools with the Vercel configuration. Enhanced the prompt configuration by adding a new YAML file for tool evaluations and refined existing evaluation scripts to support the new provider. Additionally, made minor adjustments to the presentation HTML for improved accessibility and clarity.
…oc/common dependency Removed outdated naive-ui entries and added @superdoc/common as a workspace dependency in pnpm-lock.yaml, ensuring the project reflects the latest dependency structure.
Refined evaluation scripts in package.json to output results to specific JSON files for better organization. Updated the Vercel tools provider to support live discovery of tools and improved error handling. Enhanced YAML configurations for tool evaluations, including clearer descriptions and adjustments to thresholds for tool-call metrics. Added caching functionality to optimize performance and ensure efficient reuse of evaluation results.
Summary
Adds an automated evaluation suite for validating LLM tool call quality across SuperDoc's Document Engine API. This is the foundation for measuring and improving how well models use our tools.
What it does
Given a document editing task (e.g., "Find the indemnification clause and rewrite it"), the suite checks:
- Tool selection (e.g. `query_match`, not `get_document_text`)
- Argument structure (e.g. `select` is `{type: "text"}`, not a bare string)
- Production correctness rules (e.g. `require` values)

Architecture
Three eval configs for different purposes
File structure
How tools are loaded
Tool definitions come from the SDK-generated `packages/sdk/tools/tools.openai.json`. The extraction script reads `tools-policy.json` for the essential tool list and writes a subset to `lib/essential.json` (gitignored). This keeps tool definitions DRY and automatically picks up upstream changes.

Cross-provider output normalization
LLM providers return tool calls in different formats:
- OpenAI: `[{function: {name, arguments}}]`
- Anthropic: `{type: "tool_use", name, input}`
- Google: `[{functionCall: {name, args}}]`

`lib/normalize.cjs` converts all formats to OpenAI's array format so assertions work across providers.

Test results (initial baseline)
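The provider formats listed above can be converted by a small normalizer. This is a hypothetical sketch (the actual `lib/normalize.cjs` may be structured differently):

```javascript
// Illustrative cross-provider normalization: convert Anthropic and Google
// tool calls into OpenAI's array format so one set of assertions applies.
function normalizeToolCalls(raw, provider) {
  switch (provider) {
    case "openai":
      // Already [{function: {name, arguments}}]; arguments is a JSON string.
      return raw;
    case "anthropic":
      // Anthropic content blocks: {type: "tool_use", name, input}
      return raw
        .filter((block) => block.type === "tool_use")
        .map((block) => ({
          function: { name: block.name, arguments: JSON.stringify(block.input) },
        }));
    case "google":
      // Gemini: [{functionCall: {name, args}}]
      return raw.map((call) => ({
        function: { name: call.functionCall.name, arguments: JSON.stringify(call.functionCall.args) },
      }));
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}
```

Note that Anthropic and Google return arguments as objects while OpenAI uses a JSON string, so the normalizer must re-serialize them.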
- `tool_choice: required`
- `get_document_text` for list ops
- `text.insert` for headings

Cross-provider results (minimal vs full prompt)
Key finding: Our system prompt doubles Claude's accuracy (40% to 80%). This proves the value of our tool documentation.
Why Promptfoo
Evaluated Promptfoo, Braintrust, DeepEval, Langfuse, and OpenAI Evals. Promptfoo won because:
- `tool-call-f1` assertion type

Commands
Test plan
- `pnpm run extract-tools` extracts 6 tools
- `pnpm run eval` passes 97%+ on 4 OpenAI models
- `pnpm run eval:cross` runs across 3 providers