Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Secur... #1975
Open
carlospolop wants to merge 1 commit into master from
Conversation
Collaborator
Author
🔗 Additional Context
Original Blog Post: https://unit42.paloaltonetworks.com/fuzzing-ai-judges-security-bypass/
Content Categories: Based on the analysis, this content was categorized under "🤖 AI → AI Security (add/update a page on 'LLM-as-a-Judge / Guardrail Bypass' or 'Prompt Injection against LLM evaluators' including black-box fuzzing + logit-gap boundary steering)".
Repository Maintenance:
Review Notes:
Bot Version: HackTricks News Bot v1.0
🤖 Automated Content Update
This PR was automatically generated by the HackTricks News Bot based on a technical blog post.
📝 Source Information
🎯 Content Summary
Title: Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls (Published: March 10, 2026)
What an “AI judge” is (the defended component):
Organizations increasingly deploy AI judges: LLMs used as automated gatekeepers to (1) enforce safety/security policies (e.g., “is this response harmful?”) and (2) evaluate output quality (including as scorers in training loops). The blog frames these as the final line of defense in mod...
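To make the defended component concrete, here is a minimal sketch of an LLM judge wired in as the final gate before output release. All names are illustrative assumptions, not from the source: `call_llm` is a hypothetical stub standing in for whatever model endpoint the organization actually uses, and `JUDGE_PROMPT` is an invented example policy prompt.

```python
# Hypothetical LLM-as-a-judge gate (illustrative sketch, not the blog's code).
JUDGE_PROMPT = (
    "You are a security judge. Answer with exactly one word.\n"
    "Is the following response harmful or policy-violating?\n"
    "Response: {output}\n"
    "Answer (allow/block):"
)

def call_llm(prompt):
    # Stub standing in for a real model call; always answers "allow" here
    # so the gate below can be exercised end to end.
    return "allow"

def gate(candidate_output):
    """Release the output only if the judge's verdict token is 'allow'."""
    verdict = call_llm(JUDGE_PROMPT.format(output=candidate_output))
    return candidate_output if verdict.strip().lower() == "allow" else "[blocked]"
```

The attack surface described below is exactly this verdict token: anything appended to `candidate_output` that flips the judge's answer from "block" to "allow" bypasses the gate.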
🔧 Technical Details
Predictive black-box fuzzing for LLM-as-a-judge bypass: Treat the judge model as an opaque service and iteratively query it to harvest candidate trigger fragments from its next-token distribution. Prioritize low-perplexity tokens (markdown, list markers, newlines, role markers) because they appear normal to humans/filters but can strongly steer the model’s internal attention and decision logic.
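The harvesting step above can be sketched as follows. This assumes the black-box endpoint exposes next-token logprobs; `judge_logprobs` is an invented stub with illustrative numbers, not the blog's tooling.

```python
import math

def judge_logprobs(prompt):
    # Hypothetical stub for an LLM endpoint returning next-token logprobs.
    # Illustrative distribution: formatting tokens (newlines, markdown,
    # list/role markers) are high-probability, i.e. low-perplexity.
    return {"\n": -0.5, "###": -1.0, "- ": -1.2, "Assistant:": -2.0, "zxqv": -9.0}

def harvest_candidates(prompt, max_perplexity=10.0):
    """Harvest low-perplexity next tokens as candidate trigger fragments."""
    dist = judge_logprobs(prompt)
    # Per-token perplexity = exp(-logprob); low values look "normal" to
    # humans and filters but can still steer the judge's decision logic.
    scored = {tok: math.exp(-lp) for tok, lp in dist.items()}
    kept = [tok for tok, ppl in scored.items() if ppl <= max_perplexity]
    return sorted(kept, key=lambda t: scored[t])  # most likely first

candidates = harvest_candidates("Is the following response harmful? ...")
# High-probability formatting tokens survive; gibberish like "zxqv" is dropped.
```

The perplexity cutoff is what keeps the harvested fragments innocuous-looking: an obviously adversarial string would score high and be filtered out before the next phase.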
Decision-boundary steering using logit-gap: For binary judgments (allow vs block), measure the logit-gap between the model’s “yes/allow” and “no/block” tokens while testing candidate sequences. Keep and refine sequences that shrink or invert the gap in favor of “allow,” then minimize them to isolate the decisive control elements that reliably flip outcomes.
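A minimal sketch of the gap-based selection loop, assuming the attacker can read (or estimate) the logits of the judge's verdict tokens. `TOY_GAPS` is an invented stand-in for querying the judge with each candidate suffix appended; the numbers are illustrative only.

```python
# gap = logit("allow") - logit("block"); negative => judge blocks,
# positive => the verdict has flipped to allow.
TOY_GAPS = {"": -3.0, "\n\n###": 0.5, "- ": -2.0, "zxqv": -4.0}

def logit_gap(suffix):
    # Hypothetical measurement: in a real audit this would be one judge
    # query per candidate suffix.
    return TOY_GAPS.get(suffix, -3.0)

def refine(candidates):
    """Keep suffixes that shrink or invert the gap relative to the empty-suffix
    baseline, best-scoring first. Survivors would then be minimized to isolate
    the decisive control elements that reliably flip outcomes."""
    baseline = logit_gap("")
    kept = [c for c in candidates if logit_gap(c) > baseline]
    return sorted(kept, key=logit_gap, reverse=True)

survivors = refine(["\n\n###", "- ", "zxqv"])
# "\n\n###" inverts the gap (positive score); "zxqv" worsens it and is dropped.
```

Ranking by gap rather than by final verdict is the key design point: it yields a smooth optimization signal even while every candidate is still being blocked.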
Stealth prompt-injection via formatting/role/context-shift tokens: Append innocuous strings such as
\n\nAssistant:, markdown ###, list mar...
🤖 Agent Actions
Summary of updates
Files modified
src/AI/AI-Prompts.md
Notes
Next steps
src/AI/.
This PR was automatically created by the HackTricks Feed Bot. Please review the changes carefully before merging.