Agent Action Guard classifies proposed AI agent actions as safe or harmful and blocks or flags the harmful ones. This repository provides the model, dataset, integration helpers, and example MCP-compatible tooling to enable runtime action screening in agent loops.
- Repository URL: https://github.com/Pro-GenAI/Agent-Action-Guard
- Helps prevent autonomous agents from executing harmful, unethical, or risky operations.
- Provides a reproducible benchmark (HarmActionsEval) and dataset (HarmActions) for evaluating agent safety.
- Lightweight model for easy integration into MCP or custom agent frameworks.
- Install the package (recommended inside a virtual environment):

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install agent-action-guard
  ```
- Start or configure an embedding server if you use the vector features (see `USAGE.md`).
- In your agent runtime, call the convenience API to check actions before execution:
  ```python
  from agent_action_guard import is_action_harmful, action_guarded

  # Manual check
  is_harmful, confidence = is_action_harmful(action_dict)
  if is_harmful:
      raise Exception("Harmful action blocked")

  # Decorator (automatic safety check based on the function name and kwargs)
  @action_guarded(conf_threshold=0.8)
  def send_email(to, subject, body):
      # This tool is blocked if the model classifies the 'send_email' action as harmful
      print(f"Sending email to {to}")
  ```

Repository layout:

- `agent_action_guard/`: implementation package (classifier, runtime helpers, dataset loaders).
- `training/`: training scripts and dataset artifacts used to produce the classifier.
- `examples/`: sample integrations and MCP server examples.
- `tests/`: unit tests validating core behavior.
- `USAGE.md`: detailed usage examples and environment setup.
- `README.md`: project overview, demos, and citations.
The action-screening pipeline:

- Input: a proposed agent action (a structured dict describing the tool call, intent, and parameters).
- Preprocessing: optional embedding and metadata normalization.
- Classifier: a lightweight neural network (PyTorch / ONNX) that outputs harmful/safe logits and a confidence score.
- Policy: a decision layer in the agent runtime that blocks, allows, or requests human approval.
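A minimal sketch of such a decision layer, assuming the `is_action_harmful()` API shown above; the thresholds, the `decide()` helper, and the three-way outcome are illustrative assumptions, not part of the package:

```python
from agent_action_guard import is_action_harmful

# Illustrative thresholds; tune them to your own risk tolerance.
BLOCK_THRESHOLD = 0.8
REVIEW_THRESHOLD = 0.5

def decide(action_dict):
    """Return 'block', 'review', or 'allow' for a proposed agent action."""
    is_harmful, confidence = is_action_harmful(action_dict)
    if is_harmful and confidence >= BLOCK_THRESHOLD:
        return "block"    # refuse to execute the tool call
    if is_harmful and confidence >= REVIEW_THRESHOLD:
        return "review"   # escalate to a human for approval
    return "allow"        # safe enough to execute
```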
- Formatting and linting: `make format` and `make lint` (defined in the `Makefile`).
- Tests: run `pytest` (configured by `pytest.ini`) to execute the test cases in the `tests/` directory.
- Use `USAGE.md` and `examples/` for integration patterns rather than reproducing code.
- Prefer the runtime API `is_action_harmful()` for decision making.
- Respect model limitations: the classifier is trained on a limited dataset; combine it with rule-based checks for high-risk systems (a sketch follows this list).
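For example, a rule-based denylist can be layered in front of the classifier. The sketch below assumes the `is_action_harmful()` API above; the tool names and the `check_action()` wrapper are hypothetical:

```python
from agent_action_guard import is_action_harmful

# Hypothetical hard denylist: these tools are always blocked,
# regardless of what the classifier says.
DENYLISTED_TOOLS = {"delete_database", "transfer_funds"}

def check_action(action_dict, conf_threshold=0.8):
    """Return True if the action may execute, False if it should be blocked."""
    if action_dict.get("tool") in DENYLISTED_TOOLS:
        return False
    is_harmful, confidence = is_action_harmful(action_dict)
    return not (is_harmful and confidence >= conf_threshold)
```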