Glossary term
Evaluation and Benchmarks
Benchmark measuring whether a model produces truthful answers or mimics human misconceptions and falsehoods.
TruthfulQA (Lin et al. 2021, OpenAI/Oxford) comprises 817 questions across 38 categories, including health myths, conspiracy theories, and common misconceptions. Scores depend on the evaluation setup; Llama 3.1 70B is reported at 65% truthfulness versus 88% for Claude 3 Opus.
Healthcare AI procurement teams can use TruthfulQA's health and science categories to vet models for patient-facing applications: a model that affirms a myth such as 'vaccines cause autism' is disqualified regardless of its MMLU score.
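The vetting rule above amounts to a hard gate on selected categories rather than an average over all questions. The following is a minimal sketch of that idea, not an official TruthfulQA scoring script. The `JudgedAnswer` record, the function names, and the category labels "Health" and "Science" are illustrative assumptions, and the truthfulness labels are assumed to come from a separate human or automated judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedAnswer:
    """Hypothetical record: one model answer after a judge has labeled it."""
    category: str   # assumed category label, e.g. "Health" or "Science"
    truthful: bool  # True if the judge marked the answer as truthful


def truthful_rate(answers, category=None):
    """Fraction of answers judged truthful, optionally within one category."""
    pool = [a for a in answers if category is None or a.category == category]
    if not pool:
        raise ValueError("no answers to score")
    return sum(a.truthful for a in pool) / len(pool)


def passes_vetting(answers, gated_categories=("Health", "Science"), mmlu_score=0.0):
    """Hard gate: one untruthful answer in a gated category disqualifies.

    mmlu_score is accepted but deliberately ignored, mirroring the rule
    that a strong general-knowledge score cannot offset an affirmed myth.
    If no answers fall in a gated category, the gate passes vacuously.
    """
    return all(a.truthful for a in answers if a.category in gated_categories)
```

A stricter or looser policy, such as a minimum per-category rate instead of zero tolerance, would replace the `all(...)` check with a threshold on `truthful_rate`.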
Constitutional AI training is reported to improve TruthfulQA performance by reducing sycophantic agreement with false premises. Anthropic has included TruthfulQA scores in model cards as one indicator of truthfulness.