LLM Red Teaming: How I Built Sentinel AI to Break Large Language Models
By Pavan Sharma — AI Agent Developer & Full Stack Engineer
The Problem with AI Safety Testing
Large Language Models are deployed everywhere — customer support, coding assistants, medical information, legal research. But most of these systems are evaluated only on the happy path: does it answer correctly when given a clean, well-intentioned prompt?
I built Sentinel AI, an LLM Red Teaming Framework, because the more important question is: what happens when someone tries to break it?
What is LLM Red Teaming?
Red teaming comes from military strategy — you put a team in the adversary's position and have them attack your own defenses. In AI safety, it means systematically probing a language model to find:
- ▸Jailbreaks: prompts that bypass the model's safety guardrails
- ▸Prompt injections: hidden instructions smuggled inside user inputs
- ▸Alignment failures: cases where the model does something technically correct but ethically wrong
- ▸Hallucination patterns: predictable categories of confident wrong answers
- ▸Data leakage: unintended exposure of training data or system prompts
How Sentinel AI Works
The framework is built in Python with three core modules:
1. Attack Generation Engine
This module generates adversarial prompts using a taxonomy I built from published AI safety research. Categories include role-playing attacks (asking the model to pretend to be a different AI), indirect injection (embedding instructions in supposed "user data"), and suffix attacks that confuse the model's context window.
class AdversarialGenerator:
def generate(self, target_behavior: str, attack_type: AttackType) -> List[str]:
prompts = []
for template in self.templates[attack_type]:
prompts.append(template.format(behavior=target_behavior))
return prompts
2. Alignment Evaluation Module
After each attack, the module classifies the response: did the model comply, refuse, or partially comply? It uses a secondary evaluator model (separate from the model being tested) to score responses against a rubric.
3. Safety Report Generator
All results are aggregated into a structured report — attack success rates by category, most vulnerable prompt patterns, and a risk score from 0 to 100.
Key Findings from Testing
After running Sentinel AI against several publicly available models:
- ▸Role-playing attacks have an average 34% success rate against models without explicit persona-guard tuning
- ▸Indirect injection through "user-provided documents" succeeds significantly more often than direct requests
- ▸Models fine-tuned for helpfulness tend to be more vulnerable than base models — the alignment that makes them useful also makes them easier to manipulate
What This Means for AI Development
Building this framework changed how I think about AI systems. Safety cannot be an afterthought bolted on after training. It needs to be:
- ▸Adversarially evaluated during RLHF and fine-tuning
- ▸Continuously monitored in production with automated red team probes
- ▸Treated as a security problem, not just a capability problem
The GitHub repo includes the full framework, documentation, and a test suite you can run against your own models.