Skip to content

feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring#140

Open
nanookclaw wants to merge 2 commits intoagentcontrol:mainfrom
nanookclaw:feat/contrib-drift-evaluator
Open

feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring#140
nanookclaw wants to merge 2 commits intoagentcontrol:mainfrom
nanookclaw:feat/contrib-drift-evaluator

Conversation

@nanookclaw
Copy link

Summary

Adds drift.temporal — a new contrib evaluator that detects gradual behavioral degradation patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

This follows from the discussion in #118, where @lan17 suggested implementing the evaluator in a standalone repository first. I built it (nanookclaw/agent-control-drift-evaluator, 2 ⭐) and am now integrating it into the contrib ecosystem using the galileo pattern.

The Gap

Built-in evaluators answer: "Is this response OK right now?"

They don't answer: "Is this agent getting worse over time?"

This evaluator fills that gap.

How It Works

  1. Records a numeric score (0.0–1.0) per agent per interaction
  2. Compares the recent window (last N observations) to a baseline (first M observations)
  3. Returns matched=True when recent average drops below baseline by more than the configured threshold
  4. Stores history as local JSON — no external API or service required

Usage

controls:
  - name: "drift-check"
    selector: "$.quality_score"
    evaluator: "drift.temporal"
    config:
      agent_id: "sales-agent-prod"
      window_size: 10
      baseline_size: 20
      drift_threshold: 0.10
    action: alert

Design Decisions (Empirically Grounded)

Two findings from published longitudinal research (DOI: 10.5281/zenodo.19028012) shaped the defaults:

  1. min_observations ≥ 5 — Drift signals are noisy below 5 observations. Default prevents early false positives.
  2. Windowed comparison, not cumulative — Agents can drift and recover without intervention. A rolling window captures current state; a cumulative average masks it.

Both patterns were independently replicated by a production deployment (NexusGuard fleet, v0.5.36, 48 passing tests).

Package Structure

Follows the galileo contrib pattern exactly:

evaluators/contrib/drift/
├── pyproject.toml           # agent-control-evaluator-drift
├── Makefile                 # test / lint / typecheck / build
├── README.md
└── src/
    └── agent_control_evaluator_drift/
        └── drift/
            ├── config.py    # DriftEvaluatorConfig (Pydantic)
            └── evaluator.py # DriftEvaluator (@register_evaluator)

Tests

31 tests covering:

  • Config validation (bounds, window vs baseline, on_error modes)
  • Core drift computation (insufficient data, baseline building, stable, drift detected, threshold boundaries)
  • File I/O helpers (load/save roundtrip, missing file, corrupt JSON, auto-created directories)
  • Full evaluator integration (persistence across instances, independent agent_id tracking, score clamping, fail-open/closed, metadata completeness)

Checklist

  • Tests pass (verified locally against the builtin evaluator interface)
  • No external API required (requires_api_key = False)
  • Follows contrib package structure (galileo pattern)
  • pyproject.toml entry point registered: drift.temporal = agent_control_evaluator_drift.drift:DriftEvaluator
  • README with config reference, research background, and usage examples
  • on_error fail-open/closed behavior consistent with galileo evaluator

Relates to: #118

…al monitoring

Adds a new contrib evaluator that detects gradual behavioral degradation
patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

## Motivation

Follows from discussion in agentcontrol#118 (temporal behavioral drift). The maintainer
(lan17) asked for a standalone implementation; the package was built at
https://github.com/nanookclaw/agent-control-drift-evaluator. This PR
integrates it into the contrib ecosystem so it can be installed directly
alongside other Agent Control evaluators.

## What it does

- Records a numeric behavioral score (0.0–1.0) per agent per interaction
- Compares the recent window (last N observations) to a baseline
  (first M observations)
- Returns matched=True when recent average drops below baseline by
  more than the configured threshold
- Stores history as local JSON — no external API or service required

## Design decisions grounded in empirical research

Two findings from published longitudinal work (DOI: 10.5281/zenodo.19028012)
shaped the implementation:

1. **min_observations ≥ 5**: Drift signals are noisy below 5 observations.
   Default min_observations=5 prevents early false positives.

2. **Non-monotonic degradation**: Agents can drift and recover without
   intervention. The evaluator tracks the window, not just a cumulative
   average, so it detects current state rather than all-time performance.

Both patterns were independently validated by a second production deployment
(NexusGuard fleet, v0.5.36, 48 tests merged).

## Package structure

Follows the galileo contrib pattern:

  evaluators/contrib/drift/
  ├── pyproject.toml           # agent-control-evaluator-drift
  ├── Makefile                 # test / lint / typecheck / build
  ├── README.md
  └── src/
      └── agent_control_evaluator_drift/
          └── drift/
              ├── config.py    # DriftEvaluatorConfig (Pydantic)
              └── evaluator.py # DriftEvaluator (@register_evaluator)

## Tests

31 tests covering:
- Config validation (bounds, window vs baseline, on_error)
- Core drift computation (insufficient data, baseline building, stable,
  drift detected, threshold boundary conditions)
- File I/O helpers (load/save roundtrip, missing file, corrupt JSON,
  directory creation)
- Full evaluator integration (persistence across instances, independent
  agent_id tracking, score clamping, fail-open/closed error handling,
  metadata completeness)

Relates to: agentcontrol#118
Copy link
Contributor

@lan17 lan17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together. I like the direction overall, and the package structure is easy to follow. I left a few comments on things I think we should tighten up before merging.

recent_avg = sum(recent_scores) / len(recent_scores)
drift_magnitude = baseline_avg - recent_avg # positive = drop

matched = drift_magnitude >= drift_threshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch to compare recent vs baseline here. One thing I think will bite us is raw float precision on the exact-threshold boundary. For the added test case with a 1.0 baseline, 0.9 recent window, and 0.10 threshold, this ends up as 0.099999..., so the evaluator returns False even though the documented behavior says >= should trigger. That also lines up with why test_exactly_at_threshold_triggers is currently failing. I think we should compare with a small tolerance or round before the threshold check.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed with round(drift_magnitude, 10) >= drift_threshold — rounds at 10 decimal places which eliminates the ULP-level noise (1.0 - 0.9 = 0.09999999... → rounds to exactly 0.1) while preserving all meaningful precision. The test_exactly_at_threshold_triggers case should pass now.

@@ -0,0 +1,37 @@
[project]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see this added as a standalone contrib package. I think we still need the repo-level release wiring though. Right now semantic-release, scripts/build.py, test-extras, and the release workflow still only know about the Galileo contrib package, so I do not think this one will actually get versioned, tested, and published from this repo yet.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — the drift package was orphaned from the release pipeline. Fixed in the same commit:

  • Added DRIFT_DIR := evaluators/contrib/drift to the Makefile and drift-{test,lint,lint-fix,typecheck,build} targets matching the galileo pattern
  • Wired drift-test into test-extras (was galileo-only)
  • Added build_evaluator_drift() to scripts/build.py with 'drift' and 'all' targets
    The drift package will now get versioned, tested, and built alongside galileo on every release.


# Persist updated history
try:
_save_history(history_path, scores)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs some synchronization around the history update path. Right now each call does load, append, and overwrite with no lock, so if two workers hit the same agent at once, the last writer wins and we drop observations. I was able to reproduce that with a small multiprocess harness. Since the drift result depends on having a complete history, this feels worth fixing before merge.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good reproduction. Fixed by replacing the separate _load_history() / _save_history() calls with a single _load_and_append_history() that holds fcntl.LOCK_EX for the full read-modify-write cycle. The lock is scoped to a per-agent .lock file (e.g. customer-support.lock next to customer-support.json), so different agents remain fully parallel while same-agent concurrent calls serialize correctly.

No new dependency — fcntl is stdlib on POSIX. The lock file is never deleted (advisory locks don't need cleanup; the file descriptor releasing the lock is enough). Let me know if you'd prefer an explicit try/finally delete or if there's a project convention I should follow.

Three issues raised by lan17 in PR review:

1. Float precision on threshold boundary (agentcontrol#1)
   baseline=1.0, window=0.9, threshold=0.10: IEEE 754 gives
   drift_magnitude=0.09999999... which fails >= 0.10. Fixed with
   round(drift_magnitude, 10) >= drift_threshold in _compute_drift().

2. Race condition on concurrent history writes (agentcontrol#3)
   load→append→save was not atomic: two workers for the same agent_id
   would both read stale history and the last writer would silently drop
   the other's observation. Replaced _load_history() / _save_history()
   pair with _load_and_append_history() which holds fcntl.LOCK_EX for
   the full read-modify-write cycle. Lock is per-agent (.lock file),
   so independent agents remain fully parallel.

3. Release wiring missing for drift package (agentcontrol#2)
   test-extras, scripts/build.py, Makefile and .PHONY only referenced
   galileo. Added drift-{test,lint,lint-fix,typecheck,build} targets to
   Makefile, wired drift-test into test-extras, and added
   build_evaluator_drift() to scripts/build.py (including 'drift' and
   'all' targets).
@nanookclaw
Copy link
Author

All three issues from the review are addressed in commit 12ed7e9:

  1. Float precision ( failure) — fixed with round(drift_magnitude, 10) >= drift_threshold. Eliminates IEEE 754 ULP noise at exactly the 0.1 boundary while preserving all meaningful precision.

  2. Race condition on concurrent writes — replaced the _load_history()/_save_history() pair with a single _load_and_append_history() that holds fcntl.LOCK_EX for the full read-modify-write cycle. Lock is per-agent (e.g. customer-support.lock alongside customer-support.json), so different agents remain fully parallel while concurrent calls for the same agent serialize correctly.

  3. Release pipeline gap — drift package is now wired into test-extras, scripts/build.py ('drift' and 'all' targets), and Makefile (drift-{test,lint,lint-fix,typecheck,build} targets), matching the galileo pattern exactly.

Happy to address any follow-up questions. Thanks again for the thorough review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants