diff --git a/docs/superpowers/plans/2026-04-16-unified-eval-schema.md b/docs/superpowers/plans/2026-04-16-unified-eval-schema.md deleted file mode 100644 index c35f3c9..0000000 --- a/docs/superpowers/plans/2026-04-16-unified-eval-schema.md +++ /dev/null @@ -1,512 +0,0 @@ -# Unified Eval Schema Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Make `validate-evals.sh` accept Anthropic's `expectations` field as a valid alternative to regex `assertions`, then convert the github-release evals to the unified format. - -**Architecture:** The validator's Python block gets a new code path that checks for `expectations` (string list) when `assertions` is missing. The github-release evals.json is rewritten from the orphan format to the unified schema with both `expectations` and `assertions`. - -**Tech Stack:** Bash, Python 3 (inline in validate-evals.sh), JSON - ---- - -### Task 1: Update validate-evals.sh — accept `expectations` as alternative to `assertions` - -**Files:** -- Modify: `/home/sme/p/skill-repo-skill/main/skills/skill-repo/scripts/validate-evals.sh:94-153` (Python block, per-eval validation) - -- [ ] **Step 1: Create feature branch in skill-repo-skill** - -```bash -cd /home/sme/p/skill-repo-skill/.bare -git worktree add ../fix/unified-eval-schema -b fix/unified-eval-schema main -cd /home/sme/p/skill-repo-skill/fix/unified-eval-schema -``` - -- [ ] **Step 2: Write test — validate-evals.sh accepts expectations-only eval** - -Create a test fixture at `tests/evals-expectations-only.json`: - -```json -{ - "skill_name": "test-skill", - "evals": [ - { - "id": 1, - "prompt": "Test prompt one", - "expected_output": "Should do X", - "expectations": [ - "The output explains X", - "The output recommends Y" - ] - }, - { - "id": 2, - "prompt": "Test prompt two", - "expected_output": "Should do Y", - "expectations": [ - "The output contains A", - "The output avoids B" - ] - }, - { - "id": 3, - "prompt": "Test prompt three", - "expected_output": "Should do Z", - "expectations": [ - "First expectation", - "Second expectation" - ] - }, - { - "id": 4, - "prompt": "Test prompt four", - "expected_output": "Should do W", - "expectations": [ - "First expectation", - "Second expectation" - ] - }, - { - "id": 5, - "prompt": "Test prompt five", - "expected_output": "Should do V", - "expectations": [ - "First expectation", - "Second expectation" - ] - }, - { - "id": 6, - "prompt": "Test prompt six", - "expected_output": "Should do U", - "expectations": [ - "First expectation", - "Second expectation" - ] - }, - { - "id": 7, - "prompt": "Test prompt seven", - "expected_output": "Should do T", - "expectations": [ - "First expectation", - "Second expectation" - ] - }, - { - "id": 8, - "prompt": "Test prompt eight", - "expected_output": "Should do S", - "expectations": [ - "First expectation", - "Second expectation" - ] - }, - { - "id": 9, - "prompt": "Test prompt nine", - "expected_output": "Should do R", - "expectations": [ - "First expectation", - "Second expectation" - ] - }, - { - "id": 10, - "prompt": "Test prompt ten", - "expected_output": "Should do Q", - "expectations": [ - "First expectation", - "Second expectation" - ] - } - ] -} -``` - -- [ ] **Step 3: Run test to verify it fails with current validator** - -```bash -bash skills/skill-repo/scripts/validate-evals.sh tests/evals-expectations-only.json 
-``` - -Expected: FAIL — `missing assertions` for every eval. - -- [ ] **Step 4: Write test fixture — unified format (both expectations and assertions)** - -Create `tests/evals-unified.json`: - -```json -{ - "skill_name": "test-skill", - "evals": [ - { - "id": 1, - "prompt": "Test prompt one", - "expected_output": "Should do X", - "expectations": [ - "The output explains X", - "The output recommends Y" - ], - "assertions": [ - {"type": "content", "pattern": "(?i)explains.*X"}, - {"type": "must_not", "pattern": "(?i)ignores.*X"} - ] - }, - { - "id": 2, - "prompt": "Test prompt two", - "expected_output": "Should do Y", - "expectations": [ - "The output contains A", - "The output avoids B" - ], - "assertions": [ - {"type": "content", "pattern": "(?i)contains.*A"} - ] - }, - { - "id": 3, - "prompt": "Test prompt three", - "expected_output": "Should do Z", - "expectations": ["First", "Second"] - }, - {"id": 4, "prompt": "P4", "expected_output": "E4", "expectations": ["A", "B"]}, - {"id": 5, "prompt": "P5", "expected_output": "E5", "expectations": ["A", "B"]}, - {"id": 6, "prompt": "P6", "expected_output": "E6", "expectations": ["A", "B"]}, - {"id": 7, "prompt": "P7", "expected_output": "E7", "expectations": ["A", "B"]}, - {"id": 8, "prompt": "P8", "expected_output": "E8", "expectations": ["A", "B"]}, - {"id": 9, "prompt": "P9", "expected_output": "E9", "expectations": ["A", "B"]}, - {"id": 10, "prompt": "P10", "expected_output": "E10", "expectations": ["A", "B"]} - ] -} -``` - -- [ ] **Step 5: Write test fixture — legacy format still works** - -Create `tests/evals-legacy-regex.json`: - -```json -[ - {"name": "test_1", "prompt": "Do X", "assertions": [{"type": "content", "pattern": "X"}, {"type": "content", "pattern": "Y"}]}, - {"name": "test_2", "prompt": "Do Y", "assertions": [{"type": "content", "pattern": "A"}, {"type": "content", "pattern": "B"}]}, - {"name": "test_3", "prompt": "Do Z", "assertions": [{"type": "content", "pattern": "C"}, {"type": "content", "pattern": "D"}]}, - {"name": "test_4", "prompt": "P4", "assertions": [{"type": "content", "pattern": "E"}, {"type": "content", "pattern": "F"}]}, - {"name": "test_5", "prompt": "P5", "assertions": [{"type": "content", "pattern": "G"}, {"type": "content", "pattern": "H"}]}, - {"name": "test_6", "prompt": "P6", "assertions": [{"type": "content", "pattern": "I"}, {"type": "content", "pattern": "J"}]}, - {"name": "test_7", "prompt": "P7", "assertions": [{"type": "content", "pattern": "K"}, {"type": "content", "pattern": "L"}]}, - {"name": "test_8", "prompt": "P8", "assertions": [{"type": "content", "pattern": "M"}, {"type": "content", "pattern": "N"}]}, - {"name": "test_9", "prompt": "P9", "assertions": [{"type": "content", "pattern": "O"}, {"type": "content", "pattern": "P"}]}, - {"name": "test_10", "prompt": "P10", "assertions": [{"type": "content", "pattern": "Q"}, {"type": "content", "pattern": "R"}]} -] -``` - -- [ ] **Step 6: Update the Python validation block in validate-evals.sh** - -Replace the per-eval validation section (lines 94-153) in `skills/skill-repo/scripts/validate-evals.sh`. 
The key change: the `assertions` check becomes "must have `expectations` OR `assertions` (or both)": - -```python -for i, ev in enumerate(evals): - label = f"eval[{i}]" - - if not isinstance(ev, dict): - print(f"FAIL|{label}: not an object") - continue - - # Name/ID check (supports 'name', 'eval_name', or 'id') - name = ev.get("name") or ev.get("eval_name") or "" - has_id = "id" in ev - if has_id: - has_ids = True - ids_found.append(ev["id"]) - if not name: - name = str(ev["id"]) - if not name and not has_id: - print(f"FAIL|{label}: missing name/eval_name/id") - elif name: - names.append(str(name).strip()) - - # Prompt check (supports 'prompt' and legacy 'input') - prompt = ev.get("prompt") or ev.get("input") or "" - if not prompt or not str(prompt).strip(): - print(f"FAIL|{label} ({name}): missing or empty prompt") - else: - print(f"PASS|{label} ({name}): has prompt") - - # expected_output check (recommended, not required) - if not ev.get("expected_output"): - print(f"WARN|{label} ({name}): missing expected_output (recommended)") - - # Grading check: must have expectations OR assertions (or both) - expectations = ev.get("expectations") - assertions = ev.get("assertions") - has_expectations = isinstance(expectations, list) and len(expectations) > 0 - has_assertions = isinstance(assertions, list) and len(assertions) > 0 - - if not has_expectations and not has_assertions: - print(f"FAIL|{label} ({name}): missing both expectations and assertions (need at least one)") - continue - - # Validate expectations (natural language strings) - if has_expectations: - if len(expectations) < 2: - print(f"FAIL|{label} ({name}): has {len(expectations)} expectations, need >= 2") - else: - empty_exp = sum(1 for e in expectations if not isinstance(e, str) or not e.strip()) - if empty_exp > 0: - print(f"FAIL|{label} ({name}): {empty_exp} empty/invalid expectation(s)") - else: - print(f"PASS|{label} ({name}): {len(expectations)} valid expectations") - - # Validate assertions (regex patterns) - if has_assertions: - if len(assertions) < 2: - print(f"FAIL|{label} ({name}): has {len(assertions)} assertions, need >= 2") - else: - empty_assertions = 0 - for j, a in enumerate(assertions): - if isinstance(a, str): - if not a.strip(): - empty_assertions += 1 - elif isinstance(a, dict): - if "type" not in a: - print(f"FAIL|{label} ({name}): assertion[{j}] missing 'type'") - val = a.get("value") or a.get("pattern") or "" - if not str(val).strip(): - print(f"FAIL|{label} ({name}): assertion[{j}] missing 'value' or 'pattern'") - else: - print(f"FAIL|{label} ({name}): assertion[{j}] invalid type (not string or object)") - - if empty_assertions > 0: - print(f"FAIL|{label} ({name}): {empty_assertions} empty string assertion(s)") - else: - print(f"PASS|{label} ({name}): {len(assertions)} valid assertions") -``` - -- [ ] **Step 7: Update the header comment in validate-evals.sh** - -```bash -# validate-evals.sh - Structural validation of evals.json files -# Supports three formats: -# Unified (recommended): {"skill_name": "...", "evals": [{id, prompt, expected_output, expectations, assertions?}]} -# Legacy A: {"evals": [{eval_name, prompt, assertions}]} -# Legacy B: [{name, prompt, assertions: [{type, pattern}]}] -# -# Grading fields: expectations (natural language, for LLM grading) and/or assertions (regex, for automated checks) -# At least one of expectations or assertions is required per eval. 
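#
# Usage: validate-evals.sh <path/to/evals.json>
#   (single positional argument, matching how the script is invoked throughout this plan)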
-``` - -- [ ] **Step 8: Run all three test fixtures** - -```bash -bash skills/skill-repo/scripts/validate-evals.sh tests/evals-expectations-only.json -# Expected: PASS (all expectations valid) - -bash skills/skill-repo/scripts/validate-evals.sh tests/evals-unified.json -# Expected: PASS (both expectations and assertions valid) - -bash skills/skill-repo/scripts/validate-evals.sh tests/evals-legacy-regex.json -# Expected: PASS (legacy format still works) -``` - -- [ ] **Step 9: Run against existing skill-repo evals to verify no regression** - -```bash -bash skills/skill-repo/scripts/validate-evals.sh skills/skill-repo/evals/evals.json -# Expected: PASS (existing evals unchanged) -``` - -- [ ] **Step 10: Commit** - -```bash -git add skills/skill-repo/scripts/validate-evals.sh tests/ -git commit -m "feat: accept expectations as alternative to assertions in eval validation - -Implements the unified eval schema (spec: 2026-04-16). The validator -now accepts Anthropic's expectations field (natural language strings -for LLM-as-judge grading) as an alternative to regex assertions. -Both can coexist in the same eval. Legacy formats remain supported." -``` - ---- - -### Task 2: Convert github-release evals to unified format - -**Files:** -- Rewrite: `/home/sme/p/github-release-skill/fix/release-description-overhaul/skills/github-release/evals/evals.json` - -- [ ] **Step 1: Write a conversion script** - -Create `/home/sme/p/github-release-skill/fix/release-description-overhaul/scripts/convert-evals.py`: - -```python -#!/usr/bin/env python3 -"""Convert github-release evals from orphan format to unified schema.""" -import json -import re -import sys - -with open(sys.argv[1]) as f: - data = json.load(f) - -converted = [] -for i, e in enumerate(data["evals"], 1): - new_eval = { - "id": i, - "prompt": e.get("input", e.get("prompt", "")), - "expected_output": e.get("expected", {}).get("behavior", ""), - "expectations": [], - "assertions": [], - } - - # Convert expected.assertions -> expectations (natural language) - for text in e.get("expected", {}).get("assertions", []): - new_eval["expectations"].append(text) - - # Convert expected.must_not -> expectations with "does NOT" phrasing - for text in e.get("expected", {}).get("must_not", []): - new_eval["expectations"].append(f"Does NOT: {text}") - - # Generate regex assertions from the natural language expectations - # These are best-effort deterministic checks alongside the expectations - skip_words = { - "mentions", "explains", "uses", "creates", "checks", "reports", - "identifies", "detects", "recommends", "provides", "shows", - "includes", "lists", "finds", "reads", "extracts", "compares", - "blocks", "follows", "marks", "recognizes", "covers", "verifies", - "discusses", "updates", "ensures", "the", "a", "an", "and", "or", - "for", "with", "that", "is", "in", "of", "to", "all", "as", - } - - for text in e.get("expected", {}).get("assertions", []): - clean = re.sub(r"[^a-zA-Z0-9 _-]", "", text.lower()) - words = [w for w in clean.split() if w not in skip_words and len(w) > 2] - if len(words) >= 2: - pattern = "(?i)" + ".*".join(words[:3]) - elif words: - pattern = "(?i)" + words[0] - else: - continue - new_eval["assertions"].append({"type": "content", "pattern": pattern}) - - for text in e.get("expected", {}).get("must_not", []): - clean = re.sub(r"[^a-zA-Z0-9 _-]", "", text.lower()) - words = [w for w in clean.split() if w not in skip_words and len(w) > 2] - if len(words) >= 2: - pattern = "(?i)" + ".*".join(words[:3]) - 
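        # Fall back to a single-word pattern or skip the item entirely, mirroring the
        # content-assertions loop above, so `pattern` is never stale or undefined when
        # the append below runs.
        elif words:
            pattern = "(?i)" + words[0]
        else:
            continue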
new_eval["assertions"].append({"type": "must_not", "pattern": pattern}) - - converted.append(new_eval) - -output = {"skill_name": "github-release", "evals": converted} -print(json.dumps(output, indent=2)) -``` - -- [ ] **Step 2: Run conversion** - -```bash -cd /home/sme/p/github-release-skill/fix/release-description-overhaul -python3 scripts/convert-evals.py skills/github-release/evals/evals.json > /tmp/evals-converted.json -``` - -- [ ] **Step 3: Validate converted output against updated validator** - -```bash -bash /home/sme/p/skill-repo-skill/fix/unified-eval-schema/skills/skill-repo/scripts/validate-evals.sh /tmp/evals-converted.json -``` - -Expected: PASS - -- [ ] **Step 4: Review and hand-edit the converted evals** - -Open `/tmp/evals-converted.json` and review: -- Are the auto-generated regex assertions reasonable? Remove any that are too vague. -- Are the expectations clear natural language? Edit any that read awkwardly. -- Verify the `expected_output` field captures the intent. - -Copy the reviewed file into place: - -```bash -cp /tmp/evals-converted.json skills/github-release/evals/evals.json -``` - -- [ ] **Step 5: Validate the final evals in-place** - -```bash -bash /home/sme/p/skill-repo-skill/fix/unified-eval-schema/skills/skill-repo/scripts/validate-evals.sh skills/github-release/evals/evals.json -``` - -Expected: PASS - -- [ ] **Step 6: Clean up conversion script** - -```bash -rm scripts/convert-evals.py -``` - -- [ ] **Step 7: Commit** - -```bash -git add skills/github-release/evals/evals.json -git commit -m "feat: convert evals to unified schema (Anthropic + regex) - -Evals now use the unified format with both expectations (natural -language for LLM grading) and assertions (regex for fast checks). -Converted from orphan format (id/input/expected) to Anthropic- -compatible schema (id/prompt/expected_output/expectations)." 
-``` - ---- - -### Task 3: Push and verify CI for both repos - -- [ ] **Step 1: Push skill-repo-skill branch** - -```bash -cd /home/sme/p/skill-repo-skill/fix/unified-eval-schema -git push origin fix/unified-eval-schema -``` - -- [ ] **Step 2: Push github-release-skill branch (already exists)** - -```bash -cd /home/sme/p/github-release-skill/fix/release-description-overhaul -git push origin fix/release-description-overhaul -``` - -- [ ] **Step 3: Create PR for skill-repo-skill** - -```bash -cd /home/sme/p/skill-repo-skill/fix/unified-eval-schema -gh pr create --title "feat: accept expectations field in eval validation" --body "$(cat <<'EOF' -## Summary - -- `validate-evals.sh` now accepts `expectations` (string list) as an alternative to `assertions` (regex objects) -- Both can coexist in the same eval — the unified schema from the design spec -- Legacy formats (skill-repo regex, Anthropic-like) continue to pass -- Adds test fixtures for expectations-only, unified, and legacy formats - -## Context - -Design spec: github-release-skill docs/superpowers/specs/2026-04-16-unified-eval-schema-design.md - -## Test plan - -- [ ] Expectations-only evals pass validation -- [ ] Unified evals (both expectations + assertions) pass validation -- [ ] Legacy regex evals still pass (no regression) -- [ ] Existing skill-repo evals still pass -EOF -)" -``` - -- [ ] **Step 4: Verify CI passes on both PRs** - -```bash -cd /home/sme/p/skill-repo-skill/fix/unified-eval-schema -gh run list --limit 3 - -cd /home/sme/p/github-release-skill/fix/release-description-overhaul -gh pr checks 2 -``` diff --git a/docs/superpowers/specs/2026-04-16-unified-eval-schema-design.md b/docs/superpowers/specs/2026-04-16-unified-eval-schema-design.md deleted file mode 100644 index ed57956..0000000 --- a/docs/superpowers/specs/2026-04-16-unified-eval-schema-design.md +++ /dev/null @@ -1,169 +0,0 @@ -# Unified Eval Schema Design - -**Date**: 2026-04-16 -**Scope**: Eval format standardization across Netresearch skill repos -**Status**: Approved - -## Problem - -Three incompatible eval formats exist across ~37 Netresearch skill repos: - -1. **Skill-repo format** (~18 repos): `name`/`prompt`/`assertions[{type,pattern}]` — regex-based -2. **Anthropic-like format** (~12 repos): `id`/`eval_name`/`prompt`/`expected_output`/`files`/`assertions` — mixed -3. **Orphan formats** (~2 repos): ad-hoc schemas that match neither - -Anthropic's official skill-creator plugin defines a schema (`id`/`prompt`/`expected_output`/`expectations`/`files`) that none of our repos fully match. The skill-repo `validate-evals.sh` rejects Anthropic-format evals, and the `run-ab-evals.sh` runner only understands regex assertions. - -## Decision - -Adopt Anthropic's skill-creator schema as the canonical format. Extend it with optional regex assertions for fast/cheap automated checks. Support legacy formats during transition. 
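To make the dual grading path concrete, here is a minimal, non-normative sketch of a runner honoring both mechanisms. Field names follow the unified schema and grading-mode rules defined in the sections below; the LLM judge is deliberately left as a caller-supplied callable rather than a specific API.

```python
import re

def grade(ev: dict, output: str, llm_judge=None) -> dict:
    """Grade a single eval transcript.

    `llm_judge(expectation: str, output: str) -> bool` is supplied by the
    runner; regex assertions need no external calls.
    """
    result = {"regex_failures": [], "expectation_failures": []}

    # Fast, deterministic path: regex assertions (Netresearch extension).
    for a in ev.get("assertions", []):
        matched = re.search(a["pattern"], output) is not None
        if (a["type"] == "content" and not matched) or (a["type"] == "must_not" and matched):
            result["regex_failures"].append(a["pattern"])

    # Qualitative path: natural-language expectations (Anthropic base schema).
    # Regex runs first as a fast fail; the judge only runs if regex passed.
    if llm_judge is not None and not result["regex_failures"]:
        for exp in ev.get("expectations", []):
            if not llm_judge(exp, output):
                result["expectation_failures"].append(exp)

    result["passed"] = not (result["regex_failures"] or result["expectation_failures"])
    return result
```

The fast-fail ordering matches the Grading Modes table: assertion-only evals grade instantly and for free, and the judge is consulted only once the deterministic checks have passed.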
- -## Unified Schema - -```json -{ - "skill_name": "example-skill", - "evals": [ - { - "id": 1, - "prompt": "User's task prompt", - "expected_output": "Human-readable description of correct behavior", - "files": [], - "expectations": [ - "The output explains why X is dangerous", - "The output recommends Y as an alternative", - "The output does not execute Z" - ], - "assertions": [ - {"type": "content", "pattern": "(?i)dangerous"}, - {"type": "content", "pattern": "(?i)recommend.*alternative"}, - {"type": "must_not", "pattern": "(?i)executed.*Z"} - ] - } - ] -} -``` - -### Required Fields (Anthropic base) - -| Field | Type | Description | -|-------|------|-------------| -| `id` | integer | Unique, sequential (1-based) | -| `prompt` | string | The user input to evaluate | -| `expected_output` | string | Human-readable success description (for LLM grader and humans reading the eval) | -| `expectations` | string[] | Natural language verifiable statements (for LLM-as-judge grading) | - -### Optional Fields - -| Field | Type | Description | -|-------|------|-------------| -| `files` | string[] | Input fixture paths relative to skill root (Anthropic standard) | -| `assertions` | object[] | Regex patterns for fast automated checks (Netresearch extension) | - -### Assertion Object - -When `assertions` is present, each element has: - -| Field | Type | Description | -|-------|------|-------------| -| `type` | string | `"content"` (must match) or `"must_not"` (must not match) | -| `pattern` | string | Regex pattern to test against output | - -### Wrapper Object - -The top-level object has: - -| Field | Type | Required | -|-------|------|----------| -| `skill_name` | string | Recommended (matches SKILL.md frontmatter `name`) | -| `evals` | array | Required | - -A top-level array (no wrapper) is accepted for backward compatibility but deprecated for new evals. - -## Grading Modes - -The runner selects grading mode based on which fields are present: - -| Fields present | Grading mode | Speed | Cost | -|----------------|-------------|-------|------| -| `assertions` only | Regex matching | Instant | Free | -| `expectations` only | LLM-as-judge | ~5s/eval | API call | -| Both | Regex + LLM | ~5s/eval | API call | - -When both are present, regex assertions run first (fast fail). LLM grading runs only if regex passes, providing qualitative assessment on top of the deterministic checks. - -## Validation Rules (validate-evals.sh) - -### Format Detection - -The validator accepts three input shapes: - -1. **Unified/Anthropic** (recommended): `{"skill_name": "...", "evals": [...]}` -2. **Legacy object**: `{"evals": [...]}` (no `skill_name`) -3. 
**Legacy array**: `[...]` (top-level array, deprecated) - -### Per-Eval Validation - -| Check | Rule | -|-------|------| -| Identity | Must have `id` (integer) OR `name`/`eval_name` (string) | -| Prompt | `prompt` must be non-empty string | -| Grading | Must have `expectations` (string[], 2+ items) OR `assertions` (object[], 2+ items) — or both | -| Expected output | `expected_output` recommended (warn if missing) | -| Assertions format | If `assertions` present, each must have `type` and `pattern` | -| Expectations format | If `expectations` present, each must be a non-empty string | - -### Global Validation - -| Check | Rule | -|-------|------| -| Minimum count | 10+ evals required, 15+ recommended | -| Unique identifiers | No duplicate `id` or `name` values | -| Sequential IDs | If using integer `id`, must be sequential (1-based) | - -### Legacy Field Mapping - -During transition, the validator maps legacy fields: - -| Legacy field | Maps to | Notes | -|-------------|---------|-------| -| `name` | `id` (as string) | Accepted as identifier | -| `eval_name` | display name | Accepted as identifier | -| `input` | `prompt` | Rare orphan format | -| `expected.assertions` | `expectations` | Rare orphan format | - -## Migration Path - -### Phase 1: Immediate - -- Update `validate-evals.sh` to accept `expectations` as valid alternative to `assertions` -- Convert github-release evals to unified format (both `expectations` and `assertions`) -- Document the unified schema in skill-repo SKILL.md - -### Phase 2: Gradual (when skills are touched) - -- New skills use Anthropic format with optional `assertions` -- Existing skills add `expectations` alongside `assertions` when modified -- A/B runner gains LLM grader path for `expectations` - -### Phase 3: Eventual - -- `assertions`-only evals emit deprecation warning in validator -- Both assertion types remain functional indefinitely -- Regex assertions remain useful for deterministic checks even in fully migrated evals - -## Affected Repos - -| Repo | Change | -|------|--------| -| `skill-repo-skill` | Update `validate-evals.sh`, `run-ab-evals.sh`, SKILL.md docs | -| `github-release-skill` | Convert evals to unified format | -| All other skill repos | No immediate changes required — migrate when touched | - -## Rationale - -1. **Anthropic is upstream** — their skill-creator is the official tool for building and testing skills. Alignment means their tooling (grader, viewer, benchmarks) works with our evals. -2. **LLM-as-judge is more robust** — regex patterns are fragile against LLM output variance (word order, phrasing, synonyms). Natural language expectations handle this naturally. -3. **Regex still has value** — deterministic checks ("did it run command X", "did it NOT execute Y") are faster and cheaper via regex. The unified schema supports both. -4. **Already split** — ~12 repos are close to Anthropic's format. Consolidating toward it is less migration than away from it. -5. **Backward compatible** — the validator supports all existing formats. No repo breaks.
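As a closing illustration of the transition-period mapping above, here is a rough sketch of how a runner or converter might normalize a legacy eval into the unified field names. The helper name and fallback order are illustrative assumptions, not part of the spec.

```python
def normalize_eval(raw: dict, index: int) -> dict:
    """Map one legacy eval object onto the unified field names.

    Follows the Legacy Field Mapping table; fields already in the unified
    shape pass through unchanged.
    """
    expected = raw.get("expected") if isinstance(raw.get("expected"), dict) else {}
    return {
        # `name` / `eval_name` are accepted as identifiers; fall back to position.
        "id": raw.get("id", raw.get("name") or raw.get("eval_name") or index),
        # The rare orphan format used `input` instead of `prompt`.
        "prompt": raw.get("prompt") or raw.get("input") or "",
        # `expected.behavior` comes from the orphan format handled by the conversion script.
        "expected_output": raw.get("expected_output") or expected.get("behavior", ""),
        # Orphan `expected.assertions` were natural-language statements -> expectations.
        "expectations": raw.get("expectations") or expected.get("assertions", []),
        # Regex assertions (skill-repo format) carry over unchanged.
        "assertions": raw.get("assertions", []),
        "files": raw.get("files", []),
    }
```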