Safety Classifier

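A model or ruleset that detects, labels, blocks, or routes unsafe, policy-violating, or sensitive AI inputs and outputs. Classifiers need ongoing evaluation because both error types carry business and governance risk: false positives block legitimate content, while false negatives let violating content through. Because error rates can differ across policy categories and deployment contexts, performance should be monitored per category and per context rather than as a single aggregate score.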
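As a rough illustration of monitoring by category, the sketch below computes false positive and false negative rates per policy category from human-reviewed decisions. The `Record` type, the `error_rates_by_category` helper, and the sample data are hypothetical names and values made up for this example; they are not taken from any particular classifier or vendor API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical record shape, assumed for illustration only.
    category: str    # policy category, e.g. "hate" or "self-harm"
    flagged: bool    # the classifier's decision
    violating: bool  # human-reviewed ground truth


def error_rates_by_category(records):
    """Return {category: {"fpr": ..., "fnr": ...}} from labeled records."""
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for r in records:
        if r.violating:
            key = "tp" if r.flagged else "fn"
        else:
            key = "fp" if r.flagged else "tn"
        counts[r.category][key] += 1

    rates = {}
    for category, c in counts.items():
        negatives = c["fp"] + c["tn"]  # truly benign items
        positives = c["fn"] + c["tp"]  # truly violating items
        rates[category] = {
            # None marks a rate that cannot be computed from this sample.
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
        }
    return rates


# Made-up sample data, not real evaluation results.
sample = [
    Record("hate", flagged=True, violating=True),
    Record("hate", flagged=True, violating=False),
    Record("hate", flagged=False, violating=False),
    Record("self-harm", flagged=False, violating=True),
    Record("self-harm", flagged=True, violating=True),
]
rates = error_rates_by_category(sample)
```

Keeping the rates separate per category makes it visible when a classifier that looks acceptable in aggregate is over-blocking in one category or missing violations in another. The same grouping could be extended with a context field (for example product surface or language) if that information is logged.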

Examples

1. OpenAI's Moderation API, free for developers, classifies content across categories such as hate, harassment, self-harm, sexual content, and violence.
2. Meta's Llama Guard (2023, updated 2024) is an open-weights safety classifier benchmarked against the ToxicChat and OpenAI Moderation datasets.
3. Anthropic's Constitutional Classifiers research (2025) demonstrates classifier-based defenses against jailbreaks, with a measurable reduction in attack success rates.
