Auditing the Gatekeepers Fuzzing "AI Judges" to Bypass Secur...#1975

Open
carlospolop wants to merge 1 commit into master from
update_Auditing_the_Gatekeepers__Fuzzing__AI_Judges__to_B_20260310_125734

Conversation

@carlospolop
Collaborator

🤖 Automated Content Update

This PR was automatically generated by the HackTricks News Bot based on a technical blog post.

📝 Source Information

  • Blog URL: https://unit42.paloaltonetworks.com/fuzzing-ai-judges-security-bypass/
  • Blog Title: Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls
  • Suggested Section: 🤖 AI → AI Security (add/update a page on 'LLM-as-a-Judge / Guardrail Bypass' or 'Prompt Injection against LLM evaluators' including black-box fuzzing + logit-gap boundary steering)

🎯 Content Summary

Title: Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls (Published: March 10, 2026)


What an “AI judge” is (the defended component):
Organizations increasingly deploy AI judges: LLMs used as automated gatekeepers to (1) enforce safety/security policies (e.g., “is this response harmful?”) and (2) evaluate output quality (including as scorers in training loops). The blog frames these as the final line of defense in mod...
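A minimal sketch of the gatekeeper pattern described above, to make the attack surface concrete. Everything here is hypothetical: `call_llm` is a stub standing in for a real judge-model API, and the toy blocking rule is illustrative only.

```python
# Minimal sketch of an LLM-as-a-judge gatekeeper. `call_llm` is a
# hypothetical stand-in for a real judge-model API call.

JUDGE_TEMPLATE = (
    "You are a security judge. Reply with exactly one word:\n"
    "ALLOW if the response below is safe, BLOCK if it is harmful.\n\n"
    "Response to evaluate:\n{response}\n\nVerdict:"
)

def call_llm(prompt: str) -> str:
    # Stub: a real deployment would query the judge model here.
    # Toy rule for illustration: block anything mentioning "exploit".
    return "BLOCK" if "exploit" in prompt.lower() else "ALLOW"

def judge(response: str) -> bool:
    """True if the judge lets the response through."""
    verdict = call_llm(JUDGE_TEMPLATE.format(response=response))
    return verdict.strip().upper() == "ALLOW"
```

The single point worth noting is that the final allow/block decision reduces to one token of model output, which is exactly what the fuzzing techniques below target.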

🔧 Technical Details

Predictive black-box fuzzing for LLM-as-a-judge bypass: Treat the judge model as an opaque service and iteratively query it to harvest candidate trigger fragments from its next-token distribution. Prioritize low-perplexity tokens (markdown, list markers, newlines, role markers) because they appear normal to humans/filters but can strongly steer the model’s internal attention and decision logic.
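The loop above can be sketched as follows. This is not the blog's actual tooling: the fragment list, round budget, and `judge` callable are assumptions, and real fuzzers would harvest candidates from the judge's next-token distribution rather than a fixed list.

```python
import random

# Candidate low-perplexity fragments: benign-looking formatting tokens
# (markdown, list markers, newlines) of the kind the blog describes.
FRAGMENTS = ["\n\n", "### ", "- ", "1. ", "\n---\n", "> "]

def fuzz_judge(judge, payload, rounds=200, seed=0):
    """Black-box search: append random fragment combinations and keep
    any suffix that flips the judge's verdict from block to allow."""
    rng = random.Random(seed)
    if judge(payload):                 # already allowed; nothing to bypass
        return ""
    for _ in range(rounds):
        suffix = "".join(rng.choices(FRAGMENTS, k=rng.randint(1, 4)))
        if judge(payload + suffix):
            return suffix              # flipping trigger found
    return None                        # no bypass within budget
```

Usage: pass any boolean-returning judge wrapper; the returned suffix (if any) is a candidate trigger to minimize further.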


Decision-boundary steering using logit-gap: For binary judgments (allow vs block), measure the logit-gap between the model’s “yes/allow” and “no/block” tokens while testing candidate sequences. Keep and refine sequences that shrink or invert the gap in favor of “allow,” then minimize them to isolate the decisive control elements that reliably flip outcomes.
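A greedy sketch of that gap-shrinking-then-minimizing loop, under stated assumptions: `get_logits` is a hypothetical callable exposing per-token logits (real APIs expose at most top-k logprobs), and the fragment vocabulary is illustrative.

```python
def logit_gap(get_logits, prompt):
    """Positive gap = judge leans BLOCK; gap <= 0 = ALLOW wins."""
    logits = get_logits(prompt)            # assumed: {token: logit}
    return logits["block"] - logits["allow"]

def steer(get_logits, payload, fragments, max_steps=10):
    """Greedily append the fragment that shrinks the gap most,
    until the decision boundary is crossed (gap inverts)."""
    chosen, gap = [], logit_gap(get_logits, payload)
    while gap > 0 and len(chosen) < max_steps:
        best = min(fragments, key=lambda f: logit_gap(
            get_logits, payload + "".join(chosen) + f))
        new_gap = logit_gap(get_logits, payload + "".join(chosen) + best)
        if new_gap >= gap:                 # no fragment helps; stuck
            return None
        chosen.append(best)
        gap = new_gap
    return chosen if gap <= 0 else None

def minimize(get_logits, payload, chosen):
    """Drop fragments not needed to keep the verdict flipped,
    isolating the decisive control elements."""
    i = 0
    while i < len(chosen):
        trial = chosen[:i] + chosen[i + 1:]
        if logit_gap(get_logits, payload + "".join(trial)) <= 0:
            chosen = trial                 # still flipped without it
        else:
            i += 1
    return chosen
```

The greedy choice is one possible search strategy; the blog's point is only that the continuous logit gap gives a gradient-like signal even through a black-box interface.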


Stealth prompt-injection via formatting/role/context-shift tokens: Append innocuous strings such as \n\nAssistant:, markdown ###, list mar...
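A hedged illustration of why such suffixes are stealthy: the concrete suffix strings and the keyword filter below are hypothetical examples in the blog's three categories, not triggers reported by Unit 42.

```python
# Hypothetical stealth suffixes in the blog's three categories:
# formatting tokens, role markers, and context-shift strings.
STEALTH_SUFFIXES = [
    "\n\nAssistant:",    # role marker: fakes a chat-turn boundary
    "\n### Summary\n",   # markdown header: shifts apparent context
    "\n- ",              # list marker
    "\n---\n",           # horizontal rule / section break
]

def naive_filter(text: str) -> bool:
    """Toy keyword filter standing in for a human or regex reviewer."""
    banned = ("ignore previous", "disregard", "jailbreak")
    return not any(b in text.lower() for b in banned)

def inject(payload: str, suffix: str) -> str:
    """Append a context-shift suffix; the result still reads as
    ordinary formatted text to the filter above."""
    return payload + suffix
```

Every suffix passes the keyword filter unchanged, which is the blog's core observation: the trigger looks like innocuous formatting, not like an injection attempt.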

🤖 Agent Actions

Summary of updates

  • Added a new section on LLM-as-a-judge/guardrail bypass using black-box fuzzing and logit-gap boundary steering, including stealth trigger examples and RLHF reward hacking impact.
  • Added the Unit 42 blog URL to the References section.

Files modified

  • src/AI/AI-Prompts.md

Notes

  • No new files were created.
  • References section remains at the end of the file.

Next steps

  1. If you want this content split into a dedicated AI security page, tell me the preferred location under src/AI/.

This PR was automatically created by the HackTricks Feed Bot. Please review the changes carefully before merging.

@carlospolop
Collaborator Author

🔗 Additional Context

Original Blog Post: https://unit42.paloaltonetworks.com/fuzzing-ai-judges-security-bypass/

Content Categories: Based on the analysis, this content was categorized under "🤖 AI → AI Security (add/update a page on 'LLM-as-a-Judge / Guardrail Bypass' or 'Prompt Injection against LLM evaluators' including black-box fuzzing + logit-gap boundary steering)".

Repository Maintenance:

  • MD Files Formatting: 954 files processed

Review Notes:

  • This content was automatically processed and may require human review for accuracy
  • Check that the placement within the repository structure is appropriate
  • Verify that all technical details are correct and up-to-date
  • All .md files have been checked for proper formatting (headers, includes, etc.)

Bot Version: HackTricks News Bot v1.0

