feat(evaluators): add built-in budget evaluator for per-agent cost tracking by amabito · Pull Request #144 · agentcontrol/agent-control

amabito · 2026-03-21T02:09:46Z

Summary

New built-in evaluator "budget" that tracks cumulative token and cost usage
per agent, per channel, and per user, over configurable time windows (daily/weekly/monthly).
Addresses #130 (feat(evaluators): add built-in budget evaluator for per-agent cost tracking).
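
For reviewers who want a concrete picture of how a time window could map to a bucket, here is a minimal sketch of period key derivation. The function name `derive_period_key` and the key formats (ISO date, ISO week, year-month, and the literal `"all"`) are assumptions for illustration and are not taken from the PR's code.

```python
from datetime import datetime, timezone


def derive_period_key(window: str, now: datetime) -> str:
    """Map a timestamp to a bucket key for the given window (hypothetical formats)."""
    # Normalise to UTC so every caller agrees on where a day or week starts.
    now = now.astimezone(timezone.utc)
    if window == "daily":
        return now.strftime("%Y-%m-%d")
    if window == "weekly":
        iso = now.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if window == "monthly":
        return now.strftime("%Y-%m")
    if window == "cumulative":
        return "all"  # a single bucket that never rolls over
    raise ValueError(f"unknown window: {window!r}")
```

With keys like these, usage recorded on two different days lands in different daily buckets but in the same monthly bucket, so no explicit reset job is needed.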

Scope

User-facing/API changes:
- "budget" evaluator registered alongside regex/list/json/sql
- BudgetStore protocol + InMemoryBudgetStore (dict + threading.Lock)
- BudgetEvaluatorConfig: limits list, optional pricing table, path configs
Internal changes:
- evaluators/builtin/src/agent_control_evaluators/budget/ -- 4 files, ~650 LOC
- evaluators/builtin/tests/budget/test_budget.py -- 63 tests
Out of scope:
- No Postgres store, no DB tables, no new dependencies

Risk and Rollout

Risk level: low -- new evaluator, no changes to existing code. 230 existing tests untouched.
Rollback plan: revert PR

Testing

Added or updated automated tests (63 tests incl. thread safety, NaN/Inf, scope injection, double-count)
Ran pytest tests/ -- 293 passed
Manually verified behavior

Checklist

Linked issue/spec -- feat(evaluators): add built-in budget evaluator for per-agent cost tracking #130
Updated docs/examples -- follow-up: config example in docs/evaluators/
Included follow-up tasks -- Postgres BudgetStore (separate package)

…acking

Closes agentcontrol#130

Add BudgetEvaluator -- a deterministic evaluator that tracks cumulative LLM token and cost usage per agent, per channel, per user, with configurable time windows (daily/weekly/monthly/cumulative).

Components:
- BudgetStore protocol + InMemoryBudgetStore (dict + threading.Lock)
- BudgetSnapshot frozen dataclass for atomic state reads
- BudgetEvaluator with scope key building, period key derivation, token extraction, and optional model pricing estimation
- BudgetLimitRule config with scope, per, window, limit_usd, limit_tokens
- 48 tests covering store, config, evaluator, registration

Design:
- In-memory only (no PostgreSQL, no new dependencies)
- Store is "dumb" (accumulate + check), evaluator is "smart" (resolve scope, derive period, extract tokens, check limits)
- record_and_check() is atomic (single lock acquisition)
- Evaluator instances are cached per config (thread-safe by design)
- matched=True only when limit exceeded, confidence=1.0 always
- Utilization ratio in metadata, not confidence
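
To illustrate the "dumb store" design and the single-lock `record_and_check()`, here is a minimal sketch. The method signature, the snapshot field names, and the bucket layout are assumptions, not the PR's actual code. The comparison uses `>=`, which is what the follow-up commit settles on.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class BudgetSnapshot:
    """Immutable view of one bucket (field names are assumptions)."""
    spent_usd: float
    spent_tokens: int
    exceeded: bool


class InMemoryBudgetStore:
    """Dict-backed store; every read-modify-write happens under one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._buckets: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def record_and_check(
        self,
        scope_key: str,
        period_key: str,
        cost_usd: float,
        tokens: int,
        limit_usd: Optional[float] = None,
        limit_tokens: Optional[int] = None,
    ) -> BudgetSnapshot:
        key = (scope_key, period_key)
        # Accumulate and compare inside a single lock acquisition so that
        # no other thread can observe or modify the bucket in between.
        with self._lock:
            usd, tok = self._buckets.get(key, (0.0, 0))
            usd += cost_usd
            tok += tokens
            self._buckets[key] = (usd, tok)
            exceeded = (limit_usd is not None and usd >= limit_usd) or (
                limit_tokens is not None and tok >= limit_tokens
            )
            return BudgetSnapshot(usd, tok, exceeded)
```

Returning a frozen snapshot means the caller gets a consistent view of the bucket without holding the lock any longer than the update itself.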

…arial tests

3-body review findings:

Security:
- Sanitize pipe/equals in scope key metadata values (injection prevention)
- Add max_buckets=100K to InMemoryBudgetStore (OOM prevention, fail-closed)
- Block dunder attribute access in _extract_by_path
- Add math.isfinite guard on extracted cost values
- Skip per-user rules when per field missing from metadata (was collapsing per-user budgets into global bucket)

Correctness:
- Changed exceeded check from > to >= (utilization=100% now triggers exceeded)
- Removed unused BudgetSnapshot import from evaluator.py

Tests (6 adversarial):
- Exact limit boundary (USD and tokens)
- Scope key injection via pipe character
- max_buckets OOM prevention
- per-field missing skips rule
- dunder path rejection

54 budget tests, 284 total evaluator tests passing.
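
As a rough picture of the dunder guard and the finite-cost guard, here is a sketch of dotted-path extraction. The names `extract_by_path` and `extract_cost`, the dotted-path syntax, and the choice to return `None` on rejection are assumptions. The real `_extract_by_path` may differ.

```python
import math
from typing import Any, Optional


def extract_by_path(data: Any, path: str) -> Any:
    """Walk a dotted path through dicts and objects (hypothetical helper)."""
    current = data
    for part in path.split("."):
        # Refuse dunder segments such as "__class__" so a config-supplied
        # path cannot reach Python internals.
        if part.startswith("__"):
            return None
        if isinstance(current, dict):
            current = current.get(part)
        else:
            current = getattr(current, part, None)
        if current is None:
            return None
    return current


def extract_cost(data: Any, path: str) -> Optional[float]:
    """Return a usable cost, or None if the value is missing, non-numeric, NaN, or Inf."""
    value = extract_by_path(data, path)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value) if math.isfinite(value) else None
```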

… dunder guard

R2 findings:
- _sanitize_scope_value: percent-encode |/= instead of replacing with _ (was causing key collisions between "a|b" and "a_b")
- max_buckets fail-closed: spent_usd/spent_tokens now 0.0/0 (not recorded; previously reported current-call-only values, misleading callers)
- _extract_by_path: narrowed guard from startswith("_") to startswith("__") (single-underscore dict keys are legitimate data fields)
- Fixed tautological test assertion in test_scope_key_injection_pipe
- Added 3 tests: no-collision, single-underscore access, NaN/Inf cost

57 budget tests, 287 total evaluator tests passing.
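
The collision described in the first R2 finding can be shown with a small sketch of percent-encoding. The function names and the `key=value|key=value` layout are assumptions. The sketch also encodes `%` itself so the mapping stays reversible, and the commit message does not say whether the PR does that.

```python
from typing import Mapping


def sanitize_scope_value(value: str) -> str:
    """Percent-encode the delimiter characters used in scope keys."""
    # "%" goes first (an assumption beyond the commit text) so that a literal
    # "%7C" in the input cannot be confused with an encoded "|".
    return value.replace("%", "%25").replace("|", "%7C").replace("=", "%3D")


def build_scope_key(parts: Mapping[str, str]) -> str:
    """Join sanitized key=value pairs with "|" (hypothetical key layout)."""
    return "|".join(f"{name}={sanitize_scope_value(val)}" for name, val in parts.items())
```

Replacing `|` with `_` maps both "a|b" and "a_b" to the same key. An encoding that gives each input a distinct output keeps their budgets in separate buckets.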

R4 finding: negative pricing rates in config caused _estimate_cost to return negative cost_usd, which subtracted from spent_usd and disabled USD limit enforcement entirely.

Fix: max(0.0, cost) in _estimate_cost return.

Test: negative pricing rates produce spent_usd >= 0.

58 budget tests, 288 total evaluator tests passing.

R5 finding: Inf pricing rates produced inf cost, permanently locking buckets in exceeded state, because max(0.0, inf) = inf.

Fix: isfinite + negative check on _estimate_cost return value.

Tests: Inf pricing rate test, strengthened negative pricing assertion.

59 budget tests passing.
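
A sketch of the combined R4/R5 guard follows. The rate keys (`input_per_1k`, `output_per_1k`) and the decision to return `None` for an unusable estimate are assumptions. The commit message only says that the return value is checked for finiteness and sign.

```python
import math
from typing import Mapping, Optional


def estimate_cost(
    input_tokens: int, output_tokens: int, rates: Mapping[str, float]
) -> Optional[float]:
    """Estimate USD cost from per-1K-token rates (hypothetical rate keys)."""
    cost = (
        input_tokens * rates["input_per_1k"] + output_tokens * rates["output_per_1k"]
    ) / 1000.0
    # max(0.0, cost) alone is not enough: max(0.0, inf) is still inf.
    # Check the final value so NaN, Inf, and negative results never
    # reach the store.
    if not math.isfinite(cost) or cost < 0.0:
        return None
    return cost
```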

…dation

R8 finding: float("nan") passed the `v <= 0` validator (IEEE 754: nan <= 0 is False). NaN limit_usd silently disabled budget enforcement because all NaN comparisons return False.

Fix: added math.isfinite(v) guard to validate_limit_usd.

Tests: NaN and Inf limit_usd rejection.

61 budget tests, 291 total evaluator tests passing.
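
The NaN behaviour is easy to reproduce in isolation. Below is a sketch of the guarded validator as a plain function. Treating `None` as "no USD limit" and the `is_valid_limit` helper are assumptions added for illustration.

```python
import math
from typing import Optional


def validate_limit_usd(v: Optional[float]) -> Optional[float]:
    """Reject non-positive, NaN, and Inf limits; None means no USD limit (assumed)."""
    if v is None:
        return None
    # `v <= 0` alone lets NaN through because every ordered comparison
    # with NaN is False, so check finiteness first.
    if not math.isfinite(v) or v <= 0:
        raise ValueError("limit_usd must be a finite positive number")
    return v


def is_valid_limit(v: Optional[float]) -> bool:
    """Small helper (hypothetical) to probe the validator without try/except."""
    try:
        validate_limit_usd(v)
    except ValueError:
        return False
    return True
```

Without the guard, a NaN limit is accepted, and a later check such as `spent >= limit` is always False, so the budget is never reported as exceeded.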

…pe+period

R10 finding: when multiple limit rules share the same (scope_key, period_key), each rule called record_and_check() independently, causing the same tokens and cost to be counted N times in the store.

Fix: track recorded (scope_key, period_key) pairs per evaluate() call. First rule records; subsequent rules for the same pair use get_snapshot().

Tests: 2 new tests for same-scope double-count prevention.

63 budget tests, 293 total evaluator tests passing.

Review loop: R9 CLEAN, R10 fix, R11 CLEAN -- 3 consecutive clean achieved.
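
The R10 fix can be sketched as a per-call set of already-recorded pairs. The `CountingStore` stand-in, the rule tuple shape, and the token-only accounting are simplifications assumed for illustration. Only the record-once, read-afterwards pattern comes from the commit message.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str]  # (scope_key, period_key)


class CountingStore:
    """Tiny stand-in for the budget store that tracks tokens only."""

    def __init__(self) -> None:
        self.totals: Dict[Key, int] = {}

    def record_and_check(self, key: Key, tokens: int) -> int:
        self.totals[key] = self.totals.get(key, 0) + tokens
        return self.totals[key]

    def get_snapshot(self, key: Key) -> int:
        return self.totals.get(key, 0)


def evaluate(store: CountingStore, rules: List[Tuple[str, str, int]], tokens: int) -> List[bool]:
    """Return one exceeded flag per rule, recording usage once per bucket."""
    recorded: Set[Key] = set()  # reset on every evaluate() call
    results: List[bool] = []
    for scope_key, period_key, limit_tokens in rules:
        key = (scope_key, period_key)
        if key in recorded:
            spent = store.get_snapshot(key)  # read only, no second write
        else:
            spent = store.record_and_check(key, tokens)
            recorded.add(key)
        results.append(spent >= limit_tokens)
    return results
```

Two rules on the same bucket (for example a soft and a hard limit) then see the same total, and the bucket grows by the call's usage exactly once.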

amabito added 7 commits March 21, 2026 09:30

amabito wants to merge 7 commits into agentcontrol:main from amabito:feat/budget-evaluator