Glossary term
Security
Attempt to bypass a model or application safety policy.
The 'DAN' (Do Anything Now) jailbreak became widely known in 2023: users convinced GPT-3.5 and GPT-4 to role-play as a version of themselves with no restrictions, bypassing content filters and producing harmful content.
Researchers at Carnegie Mellon demonstrated automated generation of jailbreak strings using gradient-based attacks: adversarial suffixes appended to prompts reliably bypassed the safety training of multiple open-source models.
A 'many-shot jailbreaking' technique bypassed Llama 2's safety training by embedding 256 examples of a model complying with harmful requests in the context window ahead of the final harmful request.