Skip to content

feat: add financial governance evaluators (spend limits + transaction policy)#141

Draft
up2itnow0822 wants to merge 2 commits intoagentcontrol:mainfrom
up2itnow0822:feat/financial-governance-evaluator
Draft

feat: add financial governance evaluators (spend limits + transaction policy)#141
up2itnow0822 wants to merge 2 commits intoagentcontrol:mainfrom
up2itnow0822:feat/financial-governance-evaluator

Conversation

@up2itnow0822
Copy link

@up2itnow0822 up2itnow0822 commented Mar 20, 2026

Summary

Implements the financial governance evaluator proposed in #129, following @lan17's technical guidance.

Two contrib evaluators for enforcing financial spend limits on autonomous AI agent transactions:

financial_governance.spend_limit

Tracks cumulative agent spend and enforces rolling budget limits:

  • Per-transaction caps
  • Rolling period budgets (configurable window, default 24h)
  • Context-aware overrides (different limits per channel/agent/session via evaluate metadata)
  • Pluggable SpendStore protocol — InMemorySpendStore included, bring your own PostgreSQL/Redis

financial_governance.transaction_policy

Static policy checks with no state tracking:

  • Currency allowlist
  • Recipient blocklist/allowlist
  • Min/max amount bounds

Design Decisions (per #129 discussion)

  1. Decoupled from data sourceSpendStore protocol means no new tables in core. Custom backends implement two methods: record_spend() and get_spend().
  2. No changes to core — Self-contained contrib package under evaluators/contrib/financial-governance/.
  3. Context-aware limits — Override keys in the evaluate data dict (channel_max_per_transaction, channel_max_per_period) allow per-context policies without separate evaluator instances.
  4. Python SDK compatible — Standard Evaluator interface, works with both server and SDK evaluation engine.

Tests

53 tests passing — covers both evaluators, the InMemorySpendStore, config validation, edge cases, and context overrides.

pytest tests/ -v
53 passed in 2.79s

Related

@up2itnow0822 up2itnow0822 force-pushed the feat/financial-governance-evaluator branch from 9901522 to c55d52c Compare March 20, 2026 17:21
@up2itnow0822
Copy link
Author

Hey @lan17, Sorry for not getting back sooner. I appreciate you reaching out.

I followed your guidance and put together a quick draft — PR #141. Implements both evaluators as a contrib package with the decoupled SpendStore approach you described. Hopefully, I got close to what you were after. Let me know what you think!

@codecov
Copy link

codecov bot commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

matched=False,
confidence=1.0,
message="Transaction data missing required field 'amount'",
error="Missing required field: amount",
Copy link
Contributor

@lan17 lan17 Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is conflating malformed runtime payload with evaluator failure. The config itself is already validated via the Pydantic config model; what this code is checking here is the selector output passed into evaluate(). EvaluatorResult.error is supposed to be for crashes/timeouts/missing deps, and the engine treats any non-null error as an errored control. So with a deny rule, a malformed transaction now fail-closes as an evaluator error instead of behaving like a normal policy result. I think these branches should stay in the regular matched/non-matched path and reserve error for actual evaluator failures. Same issue shows up in transaction_policy.



@pytest.mark.asyncio
async def test_context_override_channel_max_per_period() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d add a stronger test around the context-aware budget story here. Right now this only proves that the override number is read; it doesn’t prove spend is actually isolated by channel / agent_id / session_id. A case like “90 USDC in channel A, then 20 USDC in channel B with channel_max_per_period=100” would catch whether the store lookup is scoped or just summing all USDC spend. I’d also like at least one engine-level test for the invalid-input path so we don’t accidentally lock in result.error as a policy outcome.

"""
...

def get_spend(self, currency: str, since_timestamp: float) -> float:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there’s a deeper issue with the API shape here. get_spend() only lets the evaluator query by (currency, since_timestamp), but the evaluator is also recording channel / agent_id / session_id metadata and the README/tests are positioning this as context-aware budgeting. As written, channel_max_per_period is still checking against all spend in that currency, not spend for the current channel/agent/session.

I reproduced it locally with 90 USDC in channel A, then a 20 USDC transaction in channel B with channel_max_per_period=100, and the second transaction gets denied because the lookup is summing global USDC spend. Feels like the store API needs a scoped query (or an atomic check+record API) before this behavior is actually correct.


{
"condition": {
"selector": {"path": "*"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is meant to be a real Agent Control config example, I think selector.path: "*" is misleading here. In the engine, * passes the whole Step object into the evaluator, not the raw transaction dict, so amount / currency / recipient won’t be top-level fields.

input seems closer to what this evaluator actually expects, but then the context-aware override story probably needs to be spelled out separately since step context lives under context, not inside input.

@up2itnow0822 up2itnow0822 force-pushed the feat/financial-governance-evaluator branch from c55d52c to 614860b Compare March 20, 2026 18:43
… policy)

Implements the financial governance evaluator proposed in agentcontrol#129, following
the technical guidance from @lan17:

1. Decoupled from data source — SpendStore protocol with pluggable backends
   (InMemorySpendStore included, PostgreSQL/Redis via custom implementation)
2. No new tables in core agent control — self-contained contrib package
3. Context-aware limits — channel/agent/session overrides via evaluate metadata
4. Python SDK compatible — standard Evaluator interface, works with both
   server and SDK evaluation engine

Two evaluators:
- financial_governance.spend_limit: Cumulative spend tracking with
  per-transaction caps and rolling period budgets
- financial_governance.transaction_policy: Static policy enforcement
  (currency allowlists, recipient blocklists, amount bounds)

53 tests passing.

Closes agentcontrol#129

Signed-off-by: up2itnow0822 <up2itnow0822@users.noreply.github.com>
Signed-off-by: up2itnow0822 <up2itnow0822@gmail.com>
Signed-off-by: up2itnow0822 <up2itnow0822@users.noreply.github.com>
@up2itnow0822 up2itnow0822 force-pushed the feat/financial-governance-evaluator branch from 614860b to c36c7e2 Compare March 20, 2026 19:15
@up2itnow0822
Copy link
Author

@lan17 I made some changes based on your feedback. Let me know if I'm getting closer.

I think I've addressed your concerns. Pushed to the same branch:

  • error field removed from malformed-input paths - now matched=False with descriptive message only
  • get_spend() takes an optional scope dict - budgets are isolated per channel/agent/session. Added the 90-in-A + 20-in-B test.
  • Both evaluators handle selector.path: "input" and "*" - Step context merges into the transaction data.

README updated with proper controls: config using selector.path: input.

Best,
Bill

Copy link
Contributor

@lan17 lan17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest pass looks much better overall. The original evaluator-error and scoped-budget issues seem fixed. I just left a few follow-ups on packaging/docs/test coverage.

@@ -0,0 +1,55 @@
[project]
name = "agent-control-evaluator-financial-governance"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have this as a standalone package, but I do not think it is actually reachable for end users yet. As-is, I do not think pip install "agent-control-evaluators[financial-governance]" will pull this in, since agent-control-evaluators only exposes galileo and cisco extras today, and I do not see release wiring to publish this contrib package either. If the goal is for this to be installable the same way as the other optional evaluators, I think we still need the extra in evaluators/builtin/pyproject.toml plus the publish/release wiring.

result = await ev.evaluate(step_data)
assert result.matched is False
# input's channel should win (not clobbered)
assert result.metadata and result.metadata.get("channel") is None or True # just verify no crash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this test is asserting the intended behavior right now. ... or True makes it always pass, and on the success path result.metadata does not even carry channel, so this will not catch a regression in input-vs-context precedence. I would tighten this to assert against the store state or another observable that proves input["channel"] won over context["channel"].

(amount, currency, json.dumps(metadata)),
)

def get_spend(self, currency: str, since_timestamp: float) -> float:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The custom store example is already stale relative to the protocol above. SpendStore.get_spend() now takes scope, so anyone copying this signature will implement the wrong interface and miss the context-scoped budget behavior. I would update the example to include scope: dict[str, str] | None = None and show how the metadata filter is applied.

1. **Decoupled from data source** — The `SpendStore` protocol means no new tables in core Agent Control. Bring your own persistence.
2. **Context-aware limits** — Override keys in the evaluate data dict allow per-channel, per-agent, or per-session limits without multiple evaluator instances.
3. **Python SDK compatible** — Uses the standard evaluator interface; works with both the server and the Python SDK evaluation engine.
4. **Fail-open on errors** — Missing or malformed data returns `matched=False` with an `error` field, following Agent Control conventions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line does not match the implementation anymore. After the fix, malformed runtime payload returns matched=False with error=None, which is the right shape. Leaving this as written is going to send people back toward the old behavior.

Copy link
Contributor

@lan17 lan17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more pass on the store shape. I think it is a reasonable MVP, but I still have two concerns around enforceability and scoping semantics.

if not scope:
scope = None

period_spend = self._store.get_spend(tx_currency, since, scope=scope)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the store contract is still racy for real budget enforcement. We read get_spend(...), decide, and only then record_spend(...), so two concurrent requests can both pass this check and both record, overshooting the cap. If this is only meant as a lightweight contrib example that may be fine, but if we want hard spend limits I think the store API needs an atomic check_and_record or reservation-style primitive.

# When channel/agent/session overrides are present, query only
# spend matching that context — not global spend.
scope: dict[str, str] | None = None
if any(k in data for k in ("channel", "agent_id", "session_id")):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scoping semantics here are a little different from how I first read the docs. If a transaction carries both channel and agent_id, we scope to the exact tuple, not to an independent per-channel budget and per-agent budget. That means spend can effectively bypass a "channel budget" just by varying agent/session within the same channel. If tuple-scoped budgets are the intent, I would make that explicit.

Copy link
Contributor

@lan17 lan17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more API-design thoughts here. The current shape works for the narrow rolling-budget case, but I think there are two places where we may be locking ourselves in more than we want.

}
"""

max_per_transaction: float = Field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be pretty hesitant to use float for money here. If this is meant to be a real spend-control API, I think we should either use Decimal or make amounts explicit integer minor/atomic units before more code starts depending on the current shape.

"""
...

def get_spend(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config/store API feels a bit too tied to one budget shape: one rolling window defined by max_per_period + period_seconds, queried as get_spend(..., since_timestamp, ...). That works for this MVP, but if we ever want fixed daily/weekly windows or multiple concurrent limits, I think this will get awkward fast. I would consider moving toward structured limit definitions plus a range-based query (start, end) instead of baking rolling-window semantics into the store contract.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concrete sketch of what I had in mind, roughly:

class BudgetWindow(BaseModel):
    kind: Literal["rolling", "fixed"]
    seconds: int | None = None          # rolling window
    unit: Literal["day", "week", "month"] | None = None
    timezone: str | None = None         # fixed calendar windows

class BudgetLimit(BaseModel):
    amount: Decimal
    currency: str
    scope_by: tuple[Literal["channel", "agent_id", "session_id"], ...] = ()
    window: BudgetWindow | None = None  # None => per-transaction cap

Then config can just be limits: list[BudgetLimit], and the store/query side can move toward get_spend(currency, start, end, scope=...) instead of baking rolling-window semantics into since_timestamp. Not saying this exact API, just the shape.

Addresses all feedback from lan17's review:

- Float → Decimal: All money amounts use Decimal for precision. Config,
  store protocol, evaluator, and transaction_policy all updated.
  Decimal(str(raw)) for safe conversion, float() only in metadata output.

- Scoped budget semantics: Documented tuple-based scope behavior.
  Channel+agent_id+session_id form a composite scope key. Independent
  per-dimension budgets documented as requiring separate get_spend calls.

- Store API: get_spend() now accepts start/end range instead of just
  since_timestamp. Backward compatible (end defaults to None).

- Fixed always-passing test: Removed 'or True' from context override
  test. Now asserts concrete store state per scope.

- Added lan17's exact test case: 90 USDC channel A, then 20 USDC
  channel B with channel_max_per_period=100. Second tx allowed.

- README: Updated custom store example with scope param and Decimal
  return. Fixed error handling docs. Added Known Limitations section
  (race condition, tuple scoping, package wiring).

- __init__.py: selector.path '*' → 'input' with context merge note.

67/67 tests passing.

Signed-off-by: up2itnow0822 <up2itnow0822@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Add financial governance evaluator for agent spend limits

2 participants