Agent Action Guard classifies proposed AI agent actions as safe or harmful and blocks or flags the harmful ones. This repository provides the model, dataset, integration helpers, and example MCP-compatible tooling to enable runtime action screening in agent loops.
- Repository URL: https://github.com/Pro-GenAI/Agent-Action-Guard
- Helps prevent autonomous agents from executing harmful, unethical, or risky operations.
- Provides a reproducible benchmark (HarmActionsEval) and dataset (HarmActions) for evaluating agent safety.
- Lightweight model for easy integration into MCP or custom agent frameworks.
- Install the package (recommended inside a virtual environment):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install agent-action-guard
  ```
- Start or configure an embedding server if you use the vector features (see `USAGE.md`).
- In your agent runtime, call the convenience API to check actions before execution:
  ```python
  from agent_action_guard import is_action_harmful, action_guarded

  # Manual check
  is_harmful, confidence = is_action_harmful(action_dict)
  if is_harmful:
      raise Exception("Harmful action blocked")

  # Decorator (automatic safety check based on the function name and kwargs)
  @action_guarded(conf_threshold=0.8)
  def send_email(to, subject, body):
      # This tool is blocked if the model classifies the 'send_email' action as harmful
      print(f"Sending email to {to}")
  ```

Repository layout:

- `agent_action_guard/`: implementation package (classifier, runtime helpers, dataset loaders).
- `training/`: training scripts and dataset artifacts used to produce the classifier.
- `examples/`: sample integrations and MCP server examples.
- `tests/`: unit tests validating core behavior.
- `USAGE.md`: detailed usage examples and environment setup.
- `README.md`: project overview, demos, and citations.
The action-screening pipeline:

- Input: a proposed agent action (a structured dict describing the tool call, intent, and parameters).
- Preprocessing: optional embedding and metadata normalization.
- Classifier: a lightweight neural network (PyTorch / ONNX) that outputs harmful/safe logits and a confidence score.
- Policy: a decision layer in the agent runtime that blocks, allows, or requests human approval.
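A minimal sketch of such a decision layer, assuming the `is_action_harmful()` API shown above; the thresholds, the `decide()` helper, and the three-way outcome are illustrative assumptions, not part of the package:

```python
from agent_action_guard import is_action_harmful

# Illustrative thresholds; tune them to your own risk tolerance.
BLOCK_THRESHOLD = 0.8
REVIEW_THRESHOLD = 0.5

def decide(action_dict):
    """Return 'block', 'review', or 'allow' for a proposed agent action."""
    is_harmful, confidence = is_action_harmful(action_dict)
    if is_harmful and confidence >= BLOCK_THRESHOLD:
        return "block"    # refuse to execute the tool call
    if is_harmful and confidence >= REVIEW_THRESHOLD:
        return "review"   # escalate to a human for approval
    return "allow"        # safe enough to execute
```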
- Formatting and linting: `make format` and `make lint` (defined in the `Makefile`).
- Tests: run `pytest` (configured by `pytest.ini`) to execute the test cases in the `tests/` directory.
- Use `USAGE.md` and `examples/` for integration patterns rather than reproducing code.
- Prefer the runtime API `is_action_harmful()` for decision making.
- Respect model limitations: the classifier is trained on a limited dataset; combine it with rule-based checks for high-risk systems (a sketch follows this list).
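For example, a rule-based denylist can be layered in front of the classifier. The sketch below assumes the `is_action_harmful()` API above; the tool names and the `check_action()` wrapper are hypothetical:

```python
from agent_action_guard import is_action_harmful

# Hypothetical hard denylist: these tools are always blocked,
# regardless of what the classifier says.
DENYLISTED_TOOLS = {"delete_database", "transfer_funds"}

def check_action(action_dict, conf_threshold=0.8):
    """Return True if the action may execute, False if it should be blocked."""
    if action_dict.get("tool") in DENYLISTED_TOOLS:
        return False
    is_harmful, confidence = is_action_harmful(action_dict)
    return not (is_harmful and confidence >= conf_threshold)
```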