Glossary term
Glossary term
Safety and Alignment
A documented statement of intended model behavior, prohibited behavior, escalation rules, and acceptable limitations. It helps align testing, monitoring, policy enforcement, and stakeholder expectations. This specification gives auditors a clear target for testing whether outputs, refusals, escalation, and tool use match approved expectations.
OpenAI's Model Spec (published May 2024 and updated multiple times) defines intended model behavior including objectives, defaults, and rules.
Anthropic's Constitutional AI publishes the constitution used to align Claude, functioning as a model behavior specification.
Google DeepMind's Frontier Safety Framework and OpenAI's Preparedness Framework include behavior expectations tied to capability thresholds.