feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring #140

nanookclaw wants to merge 2 commits into agentcontrol:main from
Conversation
Adds a new contrib evaluator that detects gradual behavioral degradation patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

## Motivation

Follows from the discussion in agentcontrol#118 (temporal behavioral drift). The maintainer (lan17) asked for a standalone implementation; the package was built at https://github.com/nanookclaw/agent-control-drift-evaluator. This PR integrates it into the contrib ecosystem so it can be installed directly alongside other Agent Control evaluators.

## What it does

- Records a numeric behavioral score (0.0–1.0) per agent per interaction
- Compares the recent window (last N observations) to a baseline (first M observations)
- Returns matched=True when the recent average drops below the baseline by more than the configured threshold
- Stores history as local JSON; no external API or service required

## Design decisions grounded in empirical research

Two findings from published longitudinal work (DOI: 10.5281/zenodo.19028012) shaped the implementation:

1. **min_observations ≥ 5**: Drift signals are noisy below 5 observations. The default min_observations=5 prevents early false positives.
2. **Non-monotonic degradation**: Agents can drift and recover without intervention. The evaluator tracks the recent window, not just a cumulative average, so it detects current state rather than all-time performance.

Both patterns were independently validated by a second production deployment (NexusGuard fleet, v0.5.36, 48 tests merged).
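The windowed comparison described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code; the helper name `compute_drift` and its parameter names are assumptions:

```python
def compute_drift(
    scores: list[float],
    baseline_size: int = 5,
    window_size: int = 5,
    drift_threshold: float = 0.10,
    min_observations: int = 5,
) -> bool:
    """Return True when the recent window's average has dropped below
    the baseline average by at least drift_threshold."""
    if len(scores) < max(min_observations, baseline_size + window_size):
        return False  # not enough history for a reliable drift signal
    baseline_avg = sum(scores[:baseline_size]) / baseline_size
    recent_avg = sum(scores[-window_size:]) / window_size
    drift_magnitude = baseline_avg - recent_avg  # positive = drop
    # round() guards the exact-threshold boundary against IEEE 754 noise
    return round(drift_magnitude, 10) >= drift_threshold
```

Because the comparison always uses the last N observations, an agent that dips and later recovers stops being flagged, matching the non-monotonic degradation finding.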
## Package structure

Follows the galileo contrib pattern:

```
evaluators/contrib/drift/
├── pyproject.toml   # agent-control-evaluator-drift
├── Makefile         # test / lint / typecheck / build
├── README.md
└── src/
    └── agent_control_evaluator_drift/
        └── drift/
            ├── config.py     # DriftEvaluatorConfig (Pydantic)
            └── evaluator.py  # DriftEvaluator (@register_evaluator)
```

## Tests

31 tests covering:

- Config validation (bounds, window vs baseline, on_error)
- Core drift computation (insufficient data, baseline building, stable, drift detected, threshold boundary conditions)
- File I/O helpers (load/save roundtrip, missing file, corrupt JSON, directory creation)
- Full evaluator integration (persistence across instances, independent agent_id tracking, score clamping, fail-open/closed error handling, metadata completeness)

Relates to: agentcontrol#118
lan17
left a comment
Thanks for putting this together. I like the direction overall, and the package structure is easy to follow. I left a few comments on things I think we should tighten up before merging.
```python
recent_avg = sum(recent_scores) / len(recent_scores)
drift_magnitude = baseline_avg - recent_avg  # positive = drop

matched = drift_magnitude >= drift_threshold
```
Nice catch to compare recent vs baseline here. One thing I think will bite us is raw float precision on the exact-threshold boundary. For the added test case with a 1.0 baseline, 0.9 recent window, and 0.10 threshold, this ends up as 0.099999..., so the evaluator returns False even though the documented behavior says >= should trigger. That also lines up with why test_exactly_at_threshold_triggers is currently failing. I think we should compare with a small tolerance or round before the threshold check.
Good catch. Fixed with round(drift_magnitude, 10) >= drift_threshold — rounds at 10 decimal places which eliminates the ULP-level noise (1.0 - 0.9 = 0.09999999... → rounds to exactly 0.1) while preserving all meaningful precision. The test_exactly_at_threshold_triggers case should pass now.
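The boundary case is easy to reproduce in isolation (a standalone snippet, not the package's code):

```python
baseline_avg = 1.0
recent_avg = 0.9
drift_threshold = 0.10

drift_magnitude = baseline_avg - recent_avg

# Raw subtraction misses the documented >= trigger by one ULP.
print(drift_magnitude)                     # 0.09999999999999998
print(drift_magnitude >= drift_threshold)  # False

# Rounding at 10 decimal places removes the ULP-level noise while
# keeping every meaningful digit of a 0.0-1.0 score.
print(round(drift_magnitude, 10) >= drift_threshold)  # True
```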
```
@@ -0,0 +1,37 @@
[project]
```
Happy to see this added as a standalone contrib package. I think we still need the repo-level release wiring though. Right now semantic-release, scripts/build.py, test-extras, and the release workflow still only know about the Galileo contrib package, so I do not think this one will actually get versioned, tested, and published from this repo yet.
Agreed: the drift package was orphaned from the release pipeline. Fixed in the same commit:

- Added `DRIFT_DIR := evaluators/contrib/drift` to the Makefile and `drift-{test,lint,lint-fix,typecheck,build}` targets matching the galileo pattern
- Wired `drift-test` into `test-extras` (was galileo-only)
- Added `build_evaluator_drift()` to `scripts/build.py` with `'drift'` and `'all'` targets
The drift package will now get versioned, tested, and built alongside galileo on every release.
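For reference, the Makefile wiring might look like the fragment below. This is a sketch mirroring the described galileo pattern; the exact recipes and tool invocations are assumptions, not the repository's actual Makefile:

```make
# Hypothetical sketch; recipe bodies are illustrative.
DRIFT_DIR := evaluators/contrib/drift

.PHONY: drift-test drift-lint drift-typecheck drift-build

drift-test:
	cd $(DRIFT_DIR) && python -m pytest

drift-build:
	cd $(DRIFT_DIR) && python -m build

# test-extras now fans out to both contrib packages
test-extras: galileo-test drift-test
```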
```python
# Persist updated history
try:
    _save_history(history_path, scores)
```
I think this needs some synchronization around the history update path. Right now each call does load, append, and overwrite with no lock, so if two workers hit the same agent at once, the last writer wins and we drop observations. I was able to reproduce that with a small multiprocess harness. Since the drift result depends on having a complete history, this feels worth fixing before merge.
Good reproduction. Fixed by replacing the separate _load_history() / _save_history() calls with a single _load_and_append_history() that holds fcntl.LOCK_EX for the full read-modify-write cycle. The lock is scoped to a per-agent .lock file (e.g. customer-support.lock next to customer-support.json), so different agents remain fully parallel while same-agent concurrent calls serialize correctly.
No new dependency — fcntl is stdlib on POSIX. The lock file is never deleted (advisory locks don't need cleanup; the file descriptor releasing the lock is enough). Let me know if you'd prefer an explicit try/finally delete or if there's a project convention I should follow.
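A minimal sketch of the locking approach described above. The function and file names here are illustrative, not the package's actual helpers:

```python
import fcntl
import json
from pathlib import Path

def load_and_append_history(history_path: Path, score: float) -> list[float]:
    """Append a score under an exclusive per-agent lock.

    Holds fcntl.LOCK_EX on a sibling .lock file for the full
    read-modify-write cycle, so concurrent writers for the same agent
    serialize instead of silently overwriting each other. Different
    agents use different lock files and remain fully parallel.
    """
    lock_path = history_path.with_suffix(".lock")
    history_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # released when lock_file closes
        try:
            scores = json.loads(history_path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            scores = []  # missing or corrupt history starts fresh
        scores.append(score)
        history_path.write_text(json.dumps(scores))
    return scores
```

Note that `fcntl` locks are advisory and POSIX-only, which matches the reply above: the lock file never needs deleting, since releasing the descriptor releases the lock.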
Three issues raised by lan17 in PR review:

1. Float precision on threshold boundary (agentcontrol#1)
   baseline=1.0, window=0.9, threshold=0.10: IEEE 754 gives drift_magnitude=0.09999999... which fails >= 0.10. Fixed with round(drift_magnitude, 10) >= drift_threshold in _compute_drift().

2. Race condition on concurrent history writes (agentcontrol#3)
   load→append→save was not atomic: two workers for the same agent_id would both read stale history, and the last writer would silently drop the other's observation. Replaced the _load_history() / _save_history() pair with _load_and_append_history(), which holds fcntl.LOCK_EX for the full read-modify-write cycle. The lock is per-agent (.lock file), so independent agents remain fully parallel.

3. Release wiring missing for drift package (agentcontrol#2)
   test-extras, scripts/build.py, the Makefile, and .PHONY only referenced galileo. Added drift-{test,lint,lint-fix,typecheck,build} targets to the Makefile, wired drift-test into test-extras, and added build_evaluator_drift() to scripts/build.py (including 'drift' and 'all' targets).
All three issues from the review are addressed in commit 12ed7e9.

Happy to address any follow-up questions. Thanks again for the thorough review.
Summary
Adds `drift.temporal`, a new contrib evaluator that detects gradual behavioral degradation patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

This follows from the discussion in #118, where @lan17 suggested implementing the evaluator in a standalone repository first. I built it (nanookclaw/agent-control-drift-evaluator, 2 ⭐) and am now integrating it into the contrib ecosystem using the galileo pattern.
The Gap
Built-in evaluators answer: "Is this response OK right now?"
They don't answer: "Is this agent getting worse over time?"
This evaluator fills that gap.
How It Works
- Records a numeric behavioral score (0.0–1.0) per agent per interaction
- Compares the recent window (last N observations) to a baseline (first M observations)
- Returns `matched=True` when the recent average drops below the baseline by more than the configured threshold
- Stores history as local JSON; no external API or service required

Usage
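A sketch of what configuring the evaluator might look like. The package defines `DriftEvaluatorConfig` with Pydantic; the stand-in below uses a plain dataclass, and the `window_size` / `baseline_size` field names are assumptions (only `drift_threshold` and `min_observations` appear in the actual code):

```python
from dataclasses import dataclass

# Illustrative stand-in for the package's Pydantic DriftEvaluatorConfig.
@dataclass
class DriftConfig:
    drift_threshold: float = 0.10  # minimum baseline-to-recent drop counted as drift
    min_observations: int = 5      # research-backed floor before any verdict
    window_size: int = 5           # recent observations compared against baseline
    baseline_size: int = 5         # first M observations forming the baseline

# Defaults can be overridden per agent or per deployment.
strict = DriftConfig(drift_threshold=0.05)
```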
Design Decisions (Empirically Grounded)
Two findings from published longitudinal research (DOI: 10.5281/zenodo.19028012) shaped the defaults:
1. `min_observations ≥ 5`: Drift signals are noisy below 5 observations. The default prevents early false positives.
2. Non-monotonic degradation: Agents can drift and recover without intervention. The evaluator tracks the recent window, not a cumulative average, so it detects current state rather than all-time performance.

Both patterns were independently replicated by a production deployment (NexusGuard fleet, v0.5.36, 48 passing tests).
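A toy illustration of why the windowed comparison matters (scores invented for the example): an agent that dipped and then recovered looks healthy in its recent window, while a cumulative all-time average would keep flagging it long after recovery.

```python
# Scores for an agent that drifted and then recovered (invented data).
scores = [1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0]

baseline = scores[:3]        # first M observations
recent_window = scores[-3:]  # last N observations

baseline_avg = sum(baseline) / len(baseline)          # 1.0
window_avg = sum(recent_window) / len(recent_window)  # 1.0: recovered
cumulative_avg = sum(scores) / len(scores)            # ~0.83: still penalized

# Windowed comparison: no current drift despite the historical dip.
assert baseline_avg - window_avg < 0.10
# A cumulative comparison would still flag drift after recovery.
assert baseline_avg - cumulative_avg >= 0.10
```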
Package Structure
Follows the galileo contrib pattern exactly:
Tests
31 tests covering:
- Config validation (bounds, window vs baseline, `on_error` modes)
- Core drift computation (insufficient data, baseline building, stable, drift detected, threshold boundaries)
- File I/O helpers (load/save roundtrip, missing file, corrupt JSON, directory creation)
- Full evaluator integration (persistence across instances, independent `agent_id` tracking, score clamping, fail-open/closed, metadata completeness)

Checklist
- No external API or service required (`requires_api_key = False`)
- `pyproject.toml` entry point registered: `drift.temporal = agent_control_evaluator_drift.drift:DriftEvaluator`
- `on_error` fail-open/closed behavior consistent with the galileo evaluator

Relates to: #118