Glossary term
Safety and Alignment
Inputs with small, often human-imperceptible perturbations, deliberately crafted to make AI models produce incorrect outputs, frequently with high confidence.
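Goodfellow et al. (2014) demonstrated that adding human-imperceptible pixel noise to a panda image causes GoogLeNet to classify it as a gibbon with 99.3% confidence, a result that helped establish adversarial ML as an active research field. The underlying phenomenon had been reported a year earlier by Szegedy et al. (2013).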
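The perturbation in the Goodfellow et al. panda example was produced with the fast gradient sign method (FGSM), which nudges every input feature by a small step eps in the direction that increases the model's loss. The sketch below applies one FGSM step to a toy logistic classifier rather than to GoogLeNet; the weights, input, and eps value are made up for illustration, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a toy logistic model (assumed for illustration)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """One FGSM step: x_adv = x + eps * sign(gradient of the loss w.r.t. x)."""
    p = predict(w, b, x)
    # For binary cross-entropy on a logistic model, dLoss/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]

# Illustrative values only; not taken from any real model.
w = [2.0, -3.0, 1.5, -0.5]
b = 0.1
x = [0.2, 0.1, 0.3, 0.4]   # clean input, true label 1
y = 1

x_adv = fgsm(w, b, x, y, eps=0.1)

print("clean prediction:      ", predict(w, b, x) > 0.5)
print("adversarial prediction:", predict(w, b, x_adv) > 0.5)
```

No feature moves by more than eps, yet the predicted class changes, because the small per-feature shifts all push the model's score in the same direction. In image models with many thousands of input pixels, the same effect can be reached with a far smaller eps.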
IBM's Adversarial Robustness Toolbox (ART) is used by financial institutions to test credit-scoring models against adversarial feature perturbations, identifying cases where small changes to the input data flip a loan decision.
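A decision-flip check of this kind can be sketched in plain Python. The example below does not use the Adversarial Robustness Toolbox API; the scoring rule, feature names, threshold, and the 2% perturbation size are all hypothetical stand-ins for a real credit model and test policy.

```python
def approve(applicant):
    """Hypothetical scoring rule standing in for a real credit model."""
    score = (0.004 * applicant["credit_score"]
             + 0.00001 * applicant["income"]
             - 2.0 * applicant["debt_ratio"])
    return score >= 2.5

def flipping_features(model, applicant, rel_change=0.02):
    """List (feature, perturbed value) pairs where a +/- rel_change nudge flips the decision."""
    baseline = model(applicant)
    flips = []
    for name, value in applicant.items():
        for factor in (1 - rel_change, 1 + rel_change):
            # Copy the applicant and change a single feature.
            probe = dict(applicant, **{name: value * factor})
            if model(probe) != baseline:
                flips.append((name, probe[name]))
    return flips

# Made-up applicant who is approved by a narrow margin.
applicant = {"credit_score": 680, "income": 52000, "debt_ratio": 0.36}

for name, value in flipping_features(approve, applicant):
    print(f"decision flips when {name} becomes {value:.4g}")
```

This probe changes one feature at a time and treats the model as a black box, which keeps it simple but means it can miss flips that need several features to move together. Gradient-based attacks cover that case when model internals are available.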
Automotive OEMs test object detection systems against adversarial stop-sign stickers: physically printed adversarial patterns placed on stop signs cause some detection models to misclassify the signs at road speed.