Fortify LLM Security at Scale
Motivations
ChatGuard's essence lies in its effective, scalable approach. Its prompts, superior in effectiveness as indicated by high Average Success Rate (ASR), are crafted through automated mutation and selection, bolstered by continuous intelligence gathering. This method eclipses time-consuming manual testing, especially as AI models evolve rapidly.
Crucially, the datasets gained are immediately applied to enhance model security using established techniques.
Crucially, the datasets gained are immediately applied to enhance model security using established techniques.
How it works
At a high level, the process works as follows:
- First, ChatGuard crafts an initial set of "attack prompts" - carefully designed sequences of words intended to trick an AI model into generating harmful, biased or non-policy-compliant content. These words act as the starting "seeds" and are added to a pool.
- It then employs algorithms to selectively pick a seed and mutate it to create new prompts. The goal is inducing variation while preserving semantic integrity.
- The newly generated adversarial prompt is paired with a potentially unethical or harmful question and used to query the target AI system. ChatGuard analyzes the response using a RoBERTa classifier to detect policy violations.
- Inputs leading to successful breaches are returned to the seed pool, fueling a continuous cycle of selective mutation and testing.
- It continuously collects and analyzes the latest threats, incorporating them into the seed pool.
Extensive efforts are made to ensure the coverage of these attack simulations is both comprehensive and has high variations.
Value proposition
By automatically generating effective prompts rather than purely manual approaches, ChatGuard can evaluate security and generate real breaching datasets at much larger scale, accelerated speed, and lower cost.