Glossary term
Glossary term
Safety and Alignment
Toxicity is when AI says something harmful, offensive, or inappropriate. It is not always intentional, it is just repeating patterns it has seen. That is why filters and safeguards are used to catch and prevent it from showing up in responses.
The degree to which content is abusive, threatening, or offensive. Many machine learning models can identify, measure, and classify toxicity. Most of these models identify toxicity along multiple parameters, such as the level of abusive language and the level of threatening language.
TPU
Abbreviation for Tensor Processing Unit.
Microsoft's 2016 Tay chatbot was shut down within 24 hours after producing toxic content on Twitter.
Google Perspective API by Jigsaw scores toxicity for moderation in publishers like the New York Times.
OpenAI Moderation API, Anthropic Claude's safety filters, and Lakera Guard detect and block toxic outputs.
Created for this library
A content moderation team uses a toxicity classifier to flag harmful content for human reviewer attention.
An LLM safety team measures toxicity rates on a prompt set known to elicit harmful content before each model release.
A social-media platform uses a toxicity model to demote harmful comments in the feed without removing them outright.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License