Glossary term
Safety and Alignment
A model capability that could enable severe harm if misused or poorly controlled, such as advanced cyber offense, biological design assistance, scalable deception, or autonomous execution of harmful tasks. Dangerous-capability findings should drive access restrictions, additional safeguards, leadership review, and sometimes non-deployment.
Anthropic's Responsible Scaling Policy defines AI Safety Level (ASL) thresholds for dangerous capabilities, including biological weapons uplift and autonomous self-replication.
OpenAI's Preparedness Framework defines dangerous-capability bands for each of its risk categories: chemical, biological, radiological, and nuclear (CBRN) threats, Cybersecurity, Persuasion, and Model Autonomy.
In December 2024, Apollo Research published evaluations of in-context scheming behaviors in frontier models, documenting nascent dangerous capabilities.