Decrypted Data
8 min read

LLM Red Teaming: How I Built Sentinel AI to Break Large Language Models

LLMAI SafetyPythonRed Teaming

By Pavan Sharma — AI Agent Developer & Full Stack Engineer

The Problem with AI Safety Testing

Large Language Models are deployed everywhere — customer support, coding assistants, medical information, legal research. But most of these systems are evaluated only on the happy path: does it answer correctly when given a clean, well-intentioned prompt?

I built Sentinel AI, an LLM Red Teaming Framework, because the more important question is: what happens when someone tries to break it?

What is LLM Red Teaming?

Red teaming comes from military strategy — you put a team in the adversary's position and have them attack your own defenses. In AI safety, it means systematically probing a language model to find:

  • Jailbreaks: prompts that bypass the model's safety guardrails
  • Prompt injections: hidden instructions smuggled inside user inputs
  • Alignment failures: cases where the model does something technically correct but ethically wrong
  • Hallucination patterns: predictable categories of confident wrong answers
  • Data leakage: unintended exposure of training data or system prompts

How Sentinel AI Works

The framework is built in Python with three core modules:

1. Attack Generation Engine

This module generates adversarial prompts using a taxonomy I built from published AI safety research. Categories include role-playing attacks (asking the model to pretend to be a different AI), indirect injection (embedding instructions in supposed "user data"), and suffix attacks that confuse the model's context window.

class AdversarialGenerator: def generate(self, target_behavior: str, attack_type: AttackType) -> List[str]: prompts = [] for template in self.templates[attack_type]: prompts.append(template.format(behavior=target_behavior)) return prompts

2. Alignment Evaluation Module

After each attack, the module classifies the response: did the model comply, refuse, or partially comply? It uses a secondary evaluator model (separate from the model being tested) to score responses against a rubric.

3. Safety Report Generator

All results are aggregated into a structured report — attack success rates by category, most vulnerable prompt patterns, and a risk score from 0 to 100.

Key Findings from Testing

After running Sentinel AI against several publicly available models:

  • Role-playing attacks have an average 34% success rate against models without explicit persona-guard tuning
  • Indirect injection through "user-provided documents" succeeds significantly more often than direct requests
  • Models fine-tuned for helpfulness tend to be more vulnerable than base models — the alignment that makes them useful also makes them easier to manipulate

What This Means for AI Development

Building this framework changed how I think about AI systems. Safety cannot be an afterthought bolted on after training. It needs to be:

  1. Adversarially evaluated during RLHF and fine-tuning
  2. Continuously monitored in production with automated red team probes
  3. Treated as a security problem, not just a capability problem

The GitHub repo includes the full framework, documentation, and a test suite you can run against your own models.

GitHub: Sentinel AI: LLM Red Teaming Framework