feat(evals): add Promptfoo-based AI tool call evaluation suite #2351
Add automated evaluation infrastructure for validating LLM tool call quality across SuperDoc's Document Engine API. Tests whether models select the correct tools and construct valid arguments when given document editing tasks. The suite extracts 6 essential tool definitions from the SDK and runs them against multiple OpenAI models and cross-provider comparisons (Anthropic, Google). Includes deterministic assertions for tool selection, argument accuracy, and production correctness rules learned from the labs agent implementation.
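One of the deterministic assertions described above can be illustrated with a small check function. This is a hypothetical sketch (`assertToolCall` and the specific rule shown are illustrative, not the suite's actual assertion code):

```javascript
// Hypothetical deterministic assertion (illustrative names; the suite's real
// assertion code may differ): verify which tool was picked and the argument shape.
function assertToolCall(calls, expectedTool) {
  if (!calls || calls.length === 0) return { pass: false, reason: "no tool call" };
  const call = calls[0];
  if (call.function.name !== expectedTool) {
    return { pass: false, reason: `expected ${expectedTool}, got ${call.function.name}` };
  }
  let args;
  try {
    args = JSON.parse(call.function.arguments); // OpenAI encodes arguments as a JSON string
  } catch {
    return { pass: false, reason: "arguments are not valid JSON" };
  }
  // Example production-correctness rule: `select` must be a typed object,
  // not a bare string.
  if (typeof args.select === "string") {
    return { pass: false, reason: 'select must be { type: "text", ... }, not a bare string' };
  }
  return { pass: true };
}
```

Checks like this stay deterministic (no LLM grading), so a failing run always points at a concrete tool-selection or argument-construction mistake.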
Updated the GDPval benchmark configuration to include distinct prompts for SuperDoc tool-augmented and baseline models. Enhanced the test assertions in the GDPval workflows to provide clearer scoring criteria for model responses, focusing on the specificity and executable nature of the responses. Adjusted thresholds for scoring to better reflect the quality of tool calls and text descriptions in document editing tasks.
Changed the model identifier from GPT-4o to GPT-5.4 in the GDPval benchmark configuration for both SuperDoc tool-augmented and baseline prompts, ensuring alignment with the latest model updates.
…oc agent Introduced a new execution test suite for the SuperDoc agent, validating real document editing capabilities through the CLI. Added a new script command for executing these tests and updated the GDPval configuration to reflect the latest GPT model version. Included necessary dependencies and created a new provider for the SuperDoc agent to facilitate the execution of tool calls against DOCX files.
… validation Updated the SuperDoc agent to create temporary copies of documents for editing, ensuring original fixtures remain unaltered. Implemented round-trip validation to verify that edits persist after saving and re-opening DOCX files. Added a new memorandum fixture and expanded execution tests to cover various document editing scenarios, enhancing overall test coverage and reliability.
…rvation Enhanced the SuperDoc agent to include a `keepFile` option, allowing users to save edited documents to a specified output directory. Updated the logic to create the output directory if it doesn't exist and modified the cleanup process to conditionally copy the edited document based on this new option. Adjusted execution tests to validate the new functionality, ensuring comprehensive coverage of document editing scenarios.
…ate execution logic Enhanced the SuperDoc agent's execution configuration by increasing the `maxConcurrency` from 1 to 5, allowing for more efficient concurrent test execution. Updated the cleanup process to ensure isolated state directories are properly managed, improving resource handling during tests. Adjusted execution tests to reflect these changes, ensuring robust validation of document editing capabilities.
…ool configuration Refactored the SuperDoc agent's evaluation scripts to streamline the execution process and improve clarity. Removed the deprecated cross-provider configuration and consolidated tool evaluation logic into a unified structure. Introduced new assertion checks for tool quality and argument accuracy, ensuring comprehensive validation of document editing tasks. Updated the test suite to reflect these changes, enhancing overall test coverage and reliability.
…or SuperDoc agent Introduced the AI Gateway API key in the environment configuration to enable optional integration with Vercel AI Gateway. Added a new script command for executing evaluations through the gateway, enhancing the SuperDoc agent's capabilities. Created a new YAML configuration file for execution tests via the AI Gateway, allowing for testing across multiple models. Updated the package dependencies to include the necessary SDK for AI Gateway functionality.
…mer prompt tests Updated the SuperDoc agent to include tracking of total usage and steps during text generation, improving performance insights. Added a series of customer prompt tests in YAML format to validate various document editing tasks, ensuring comprehensive coverage of real-world scenarios. This enhancement aims to bolster the agent's capabilities and testing framework.
…d files Removed the JavaScript assertion file and context builder, simplifying the evaluation framework. Updated the prompt configuration to eliminate unused metrics and added new document fixtures for testing. Enhanced execution tests to validate document editing capabilities with the new fixtures, ensuring comprehensive coverage of various scenarios.
Updated the model labels in the execution gateway configuration for clarity and accuracy. Refined the execution test descriptions to better reflect the specific tasks being validated, enhancing the readability and intent of the tests. Commented out deprecated Google provider configurations to streamline the YAML files.
…files Updated the .gitignore to exclude temporary files and removed deprecated YAML configuration files related to GDPval and execution tests. Streamlined the package.json by eliminating unused evaluation scripts, enhancing overall project organization and clarity.
…agement Updated pnpm-lock.yaml to reflect new versions of dependencies, including @types/node and added new SDK entries for SuperDoc. Modified .gitignore to exclude additional temporary files and states, improving project cleanliness and organization.
Added a caching system to the SuperDoc agent and gateway providers to improve performance by storing and retrieving results based on a generated cache key. Updated the utility functions to handle cache operations, ensuring efficient reuse of previous evaluation results. Modified the evaluation logic to check for cached results before executing tasks, enhancing overall efficiency in the evaluation framework. Additionally, updated the package.json to reflect changes in evaluation scripts and added a new YAML configuration for end-to-end tests via the AI Gateway.
…nhanced documentation Updated the evaluation framework to include two levels of testing: tool quality and execution. Enhanced the README to clarify testing processes, commands, and configurations. Introduced new YAML files for tool quality and execution tests, detailing the number of tests and providers involved. Improved command descriptions for better usability and added new document fixtures for comprehensive testing of document editing capabilities.
Introduced a new Vercel tools provider for the SuperDoc evaluation framework, enabling structured tool calls with the Vercel AI SDK. Updated the package.json to include a new script for evaluating tools with the Vercel configuration. Enhanced the prompt configuration by adding a new YAML file for tool evaluations and refined existing evaluation scripts to support the new provider. Additionally, made minor adjustments to the presentation HTML for improved accessibility and clarity.
…oc/common dependency Removed outdated naive-ui entries and added @superdoc/common as a workspace dependency in pnpm-lock.yaml, ensuring the project reflects the latest dependency structure.
Refined evaluation scripts in package.json to output results to specific JSON files for better organization. Updated the Vercel tools provider to support live discovery of tools and improved error handling. Enhanced YAML configurations for tool evaluations, including clearer descriptions and adjustments to thresholds for tool-call metrics. Added caching functionality to optimize performance and ensure efficient reuse of evaluation results.
Summary
Adds an automated evaluation suite for validating LLM tool call quality across SuperDoc's Document Engine API. This is the foundation for measuring and improving how well models use our tools.
What it does
Given a document editing task (e.g., "Find the indemnification clause and rewrite it"), the suite checks:
- Tool selection (e.g. `query_match`, not `get_document_text`)
- Argument structure (e.g. `select` is `{type: "text"}`, not a bare string)
- Production correctness rules (e.g. `require` values)

Architecture
Three eval configs for different purposes
File structure
How tools are loaded
Tool definitions come from the SDK-generated `packages/sdk/tools/tools.openai.json`. The extraction script reads `tools-policy.json` for the essential tool list and writes a subset to `lib/essential.json` (gitignored). This keeps tool definitions DRY and automatically picks up upstream changes.

Cross-provider output normalization
LLM providers return tool calls in different formats:
- OpenAI: `[{function: {name, arguments}}]`
- Anthropic: `{type: "tool_use", name, input}`
- Google: `[{functionCall: {name, args}}]`

`lib/normalize.cjs` converts all formats to OpenAI's array format so assertions work across providers.

Test results (initial baseline)
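The provider formats listed above can be converted by a small normalizer. This is a hypothetical sketch (the actual `lib/normalize.cjs` may be structured differently):

```javascript
// Illustrative cross-provider normalization: convert Anthropic and Google
// tool calls into OpenAI's array format so one set of assertions applies.
function normalizeToolCalls(raw, provider) {
  switch (provider) {
    case "openai":
      // Already [{function: {name, arguments}}]; arguments is a JSON string.
      return raw;
    case "anthropic":
      // Anthropic content blocks: {type: "tool_use", name, input}
      return raw
        .filter((block) => block.type === "tool_use")
        .map((block) => ({
          function: { name: block.name, arguments: JSON.stringify(block.input) },
        }));
    case "google":
      // Gemini: [{functionCall: {name, args}}]
      return raw.map((call) => ({
        function: { name: call.functionCall.name, arguments: JSON.stringify(call.functionCall.args) },
      }));
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}
```

Note that Anthropic and Google return arguments as objects while OpenAI uses a JSON string, so the normalizer must re-serialize them.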
- `tool_choice: required`
- `get_document_text` for list ops
- `text.insert` for headings

Cross-provider results (minimal vs full prompt)
Key finding: Our system prompt doubles Claude's accuracy (40% to 80%). This proves the value of our tool documentation.
Why Promptfoo
Evaluated Promptfoo, Braintrust, DeepEval, Langfuse, and OpenAI Evals. Promptfoo won because:
- `tool-call-f1` assertion type

Commands
Test plan
- `pnpm run extract-tools` extracts 6 tools
- `pnpm run eval` passes 97%+ on 4 OpenAI models
- `pnpm run eval:cross` runs across 3 providers