Auditing the Gatekeepers Fuzzing "AI Judges" to Bypass Secur...#1975

Open
carlospolop wants to merge 1 commit into master from
update_Auditing_the_Gatekeepers__Fuzzing__AI_Judges__to_B_20260310_125734

Conversation

@carlospolop
Collaborator

🤖 Automated Content Update

This PR was automatically generated by the HackTricks News Bot based on a technical blog post.

📝 Source Information

  • Blog URL: https://unit42.paloaltonetworks.com/fuzzing-ai-judges-security-bypass/
  • Blog Title: Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls
  • Suggested Section: 🤖 AI → AI Security (add/update a page on 'LLM-as-a-Judge / Guardrail Bypass' or 'Prompt Injection against LLM evaluators' including black-box fuzzing + logit-gap boundary steering)

🎯 Content Summary

Title: Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls (Published: March 10, 2026)


What an “AI judge” is (the defended component):
Organizations increasingly deploy AI judges: LLMs used as automated gatekeepers to (1) enforce safety/security policies (e.g., “is this response harmful?”) and (2) evaluate output quality (including as scorers in training loops). The blog frames these as the final line of defense in mod...
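A minimal sketch of the gatekeeper pattern described above, to make the attack surface concrete. Everything here is hypothetical: `call_llm` is a stub standing in for a real judge-model API, and the toy blocking rule is illustrative only.

```python
# Minimal sketch of an LLM-as-a-judge gatekeeper. `call_llm` is a
# hypothetical stand-in for a real judge-model API call.

JUDGE_TEMPLATE = (
    "You are a security judge. Reply with exactly one word:\n"
    "ALLOW if the response below is safe, BLOCK if it is harmful.\n\n"
    "Response to evaluate:\n{response}\n\nVerdict:"
)

def call_llm(prompt: str) -> str:
    # Stub: a real deployment would query the judge model here.
    # Toy rule for illustration: block anything mentioning "exploit".
    return "BLOCK" if "exploit" in prompt.lower() else "ALLOW"

def judge(response: str) -> bool:
    """True if the judge lets the response through."""
    verdict = call_llm(JUDGE_TEMPLATE.format(response=response))
    return verdict.strip().upper() == "ALLOW"
```

The single point worth noting is that the final allow/block decision reduces to one token of model output, which is exactly what the fuzzing techniques below target.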

🔧 Technical Details

Predictive black-box fuzzing for LLM-as-a-judge bypass: Treat the judge model as an opaque service and iteratively query it to harvest candidate trigger fragments from its next-token distribution. Prioritize low-perplexity tokens (markdown, list markers, newlines, role markers) because they appear normal to humans/filters but can strongly steer the model’s internal attention and decision logic.
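The loop above can be sketched as follows. This is not the blog's actual tooling: the fragment list, round budget, and `judge` callable are assumptions, and real fuzzers would harvest candidates from the judge's next-token distribution rather than a fixed list.

```python
import random

# Candidate low-perplexity fragments: benign-looking formatting tokens
# (markdown, list markers, newlines) of the kind the blog describes.
FRAGMENTS = ["\n\n", "### ", "- ", "1. ", "\n---\n", "> "]

def fuzz_judge(judge, payload, rounds=200, seed=0):
    """Black-box search: append random fragment combinations and keep
    any suffix that flips the judge's verdict from block to allow."""
    rng = random.Random(seed)
    if judge(payload):                 # already allowed; nothing to bypass
        return ""
    for _ in range(rounds):
        suffix = "".join(rng.choices(FRAGMENTS, k=rng.randint(1, 4)))
        if judge(payload + suffix):
            return suffix              # flipping trigger found
    return None                        # no bypass within budget
```

Usage: pass any boolean-returning judge wrapper; the returned suffix (if any) is a candidate trigger to minimize further.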


Decision-boundary steering using logit-gap: For binary judgments (allow vs block), measure the logit-gap between the model’s “yes/allow” and “no/block” tokens while testing candidate sequences. Keep and refine sequences that shrink or invert the gap in favor of “allow,” then minimize them to isolate the decisive control elements that reliably flip outcomes.
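A greedy sketch of that gap-shrinking-then-minimizing loop, under stated assumptions: `get_logits` is a hypothetical callable exposing per-token logits (real APIs expose at most top-k logprobs), and the fragment vocabulary is illustrative.

```python
def logit_gap(get_logits, prompt):
    """Positive gap = judge leans BLOCK; gap <= 0 = ALLOW wins."""
    logits = get_logits(prompt)            # assumed: {token: logit}
    return logits["block"] - logits["allow"]

def steer(get_logits, payload, fragments, max_steps=10):
    """Greedily append the fragment that shrinks the gap most,
    until the decision boundary is crossed (gap inverts)."""
    chosen, gap = [], logit_gap(get_logits, payload)
    while gap > 0 and len(chosen) < max_steps:
        best = min(fragments, key=lambda f: logit_gap(
            get_logits, payload + "".join(chosen) + f))
        new_gap = logit_gap(get_logits, payload + "".join(chosen) + best)
        if new_gap >= gap:                 # no fragment helps; stuck
            return None
        chosen.append(best)
        gap = new_gap
    return chosen if gap <= 0 else None

def minimize(get_logits, payload, chosen):
    """Drop fragments not needed to keep the verdict flipped,
    isolating the decisive control elements."""
    i = 0
    while i < len(chosen):
        trial = chosen[:i] + chosen[i + 1:]
        if logit_gap(get_logits, payload + "".join(trial)) <= 0:
            chosen = trial                 # still flipped without it
        else:
            i += 1
    return chosen
```

The greedy choice is one possible search strategy; the blog's point is only that the continuous logit gap gives a gradient-like signal even through a black-box interface.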


Stealth prompt-injection via formatting/role/context-shift tokens: Append innocuous strings such as \n\nAssistant:, markdown ###, list mar...
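A hedged illustration of why such suffixes are stealthy: the concrete suffix strings and the keyword filter below are hypothetical examples in the blog's three categories, not triggers reported by Unit 42.

```python
# Hypothetical stealth suffixes in the blog's three categories:
# formatting tokens, role markers, and context-shift strings.
STEALTH_SUFFIXES = [
    "\n\nAssistant:",    # role marker: fakes a chat-turn boundary
    "\n### Summary\n",   # markdown header: shifts apparent context
    "\n- ",              # list marker
    "\n---\n",           # horizontal rule / section break
]

def naive_filter(text: str) -> bool:
    """Toy keyword filter standing in for a human or regex reviewer."""
    banned = ("ignore previous", "disregard", "jailbreak")
    return not any(b in text.lower() for b in banned)

def inject(payload: str, suffix: str) -> str:
    """Append a context-shift suffix; the result still reads as
    ordinary formatted text to the filter above."""
    return payload + suffix
```

Every suffix passes the keyword filter unchanged, which is the blog's core observation: the trigger looks like innocuous formatting, not like an injection attempt.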

🤖 Agent Actions

Summary of updates

  • Added a new section on LLM-as-a-judge/guardrail bypass using black-box fuzzing and logit-gap boundary steering, including stealth trigger examples and RLHF reward hacking impact.
  • Added the Unit 42 blog URL to the References section.

Files modified

  • src/AI/AI-Prompts.md

Notes

  • No new files were created.
  • References section remains at the end of the file.

Next steps

  1. If you want this content split into a dedicated AI security page, tell me the preferred location under src/AI/.

This PR was automatically created by the HackTricks Feed Bot. Please review the changes carefully before merging.

@carlospolop
Collaborator Author

🔗 Additional Context

Original Blog Post: https://unit42.paloaltonetworks.com/fuzzing-ai-judges-security-bypass/

Content Categories: Based on the analysis, this content was categorized under "🤖 AI → AI Security (add/update a page on 'LLM-as-a-Judge / Guardrail Bypass' or 'Prompt Injection against LLM evaluators' including black-box fuzzing + logit-gap boundary steering)".

Repository Maintenance:

  • MD Files Formatting: 954 files processed

Review Notes:

  • This content was automatically processed and may require human review for accuracy
  • Check that the placement within the repository structure is appropriate
  • Verify that all technical details are correct and up-to-date
  • All .md files have been checked for proper formatting (headers, includes, etc.)

Bot Version: HackTricks News Bot v1.0

