Glossary term
Safety and Alignment
Calibration is the agreement between a model's expressed confidence and its actual accuracy; for a well-calibrated model, predictions made with 70% confidence are correct about 70% of the time.
OpenAI measures the calibration of GPT-4 on MMLU by comparing the model's stated probabilities with its empirical accuracy: GPT-4 shows strong calibration (expected calibration error, ECE, below 3%), while smaller models show significant overconfidence.
Alignment-tuned models can be less well calibrated than their base models: OpenAI's GPT-4 technical report, for example, shows calibration on MMLU degrading after RLHF post-training. Training that improves helpfulness can therefore worsen calibration, which motivates calibration-aware fine-tuning.
High-stakes deployments such as medical AI depend on calibration: for clinicians to rely on a radiology model's stated confidence, its 90%-confidence predictions should be correct roughly 88-92% of the time, and this should be verified before clinical use.
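The ECE figure mentioned in this entry can be made concrete with a short sketch. This is a minimal illustration of the common equal-width binning estimate of expected calibration error, not the procedure used in any report cited here; the function name `expected_calibration_error`, the `n_bins` parameter, and the sample data are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Estimate ECE with equal-width confidence bins (illustrative sketch).

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct:     whether each prediction was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Group predictions into n_bins equal-width bins over [0, 1].
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    # ECE is the size-weighted average gap between confidence and accuracy.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical data: ten predictions at 70% confidence, seven of them correct.
    well_calibrated = expected_calibration_error([0.7] * 10, [True] * 7 + [False] * 3)
    # Hypothetical data: ten predictions at 90% confidence, only six correct.
    overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
    print(f"well calibrated: {well_calibrated:.3f}")
    print(f"overconfident:   {overconfident:.3f}")
```

An ECE near zero means stated confidence tracks accuracy closely; a larger value means the two diverge, though ECE alone does not say whether the model is overconfident or underconfident. The estimate also depends on the number of bins and on how many predictions fall into each one, so values computed with different binning choices are not directly comparable.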