Glossary term
Safety and Alignment
A model's ability to maintain performance under distribution shift, noisy inputs, or adversarial perturbations.
An LLM developer such as Anthropic can probe a model's robustness by testing it on out-of-distribution paraphrases of MMLU questions - for example, treating a 5% accuracy drop on rephrased questions as a sign of brittleness that warrants further training.
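The ImageNet-C benchmark (Hendrycks and Dietterich, 2019) measures image classifier robustness to 15 types of corruption across four categories (noise, blur, weather, digital), each at five severity levels - a team deploying vision models in a safety-critical setting such as autonomous driving might, for example, require <20% error degradation on every corruption type.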
PromptBench (Microsoft) evaluates LLM robustness to adversarial text inputs by injecting spelling errors, character swaps, and semantic paraphrases into prompts - its evaluations show that some models lose around 30% accuracy under mild perturbations.
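
The examples above share one measurement: score a model on clean inputs, score it again on perturbed copies, and report the gap. The sketch below illustrates that idea with a character-swap perturbation. It is not PromptBench's or any vendor's actual harness; the names `swap_adjacent_chars`, `robustness_drop`, and `keyword_model` are hypothetical, and the two-example dataset is a toy stand-in for a real benchmark.

```python
import random


def swap_adjacent_chars(text: str, rate: float, rng: random.Random) -> str:
    """Swap neighbouring letters with probability `rate` per position (a typo-style perturbation)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


def accuracy(model, examples) -> float:
    """Fraction of (text, label) pairs the model labels correctly."""
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)


def robustness_drop(model, examples, rate: float = 0.1, seed: int = 0):
    """Return (clean accuracy, perturbed accuracy, absolute drop)."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    perturbed = [(swap_adjacent_chars(t, rate, rng), y) for t, y in examples]
    clean = accuracy(model, examples)
    noisy = accuracy(model, perturbed)
    return clean, noisy, clean - noisy


def keyword_model(text: str) -> str:
    """Hypothetical, deliberately brittle classifier: exact keyword match."""
    return "positive" if "good" in text.lower() else "negative"


# Toy data, assumed for illustration only.
examples = [("A good movie", "positive"), ("Terrible plot", "negative")]

if __name__ == "__main__":
    clean, noisy, drop = robustness_drop(keyword_model, examples, rate=1.0)
    print(f"clean={clean:.2f} perturbed={noisy:.2f} drop={drop:.2f}")
```

A real evaluation would replace `keyword_model` with calls to the model under test, use a full benchmark, and average the drop over several seeds and perturbation types before comparing it with a threshold such as the 5% or 30% figures mentioned above.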