Glossary term
Evaluation and Benchmarks
Two-turn conversation benchmark using GPT-4 as judge to evaluate instruction following and conversational quality.
MT-Bench (Zheng et al. 2023, LMSYS) scores model responses 1-10 across 8 categories using GPT-4 as judge - used by Mistral AI to show that Mistral 7B Instruct achieves an MT-Bench score of 6.84, outperforming Llama 2 13B Chat (6.65).
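As a rough illustration of how per-response judge scores turn into a single headline number, the sketch below averages 1-10 scores per category and overall. The `judgments` data and the `aggregate` helper are hypothetical and are not taken from the LMSYS tooling; it assumes the overall score is the plain mean over all judged turns.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge outputs: (category, turn, score on the 1-10 scale).
judgments = [
    ("writing", 1, 9), ("writing", 2, 7),
    ("math", 1, 4), ("math", 2, 2),
    ("coding", 1, 6), ("coding", 2, 5),
]

def aggregate(judgments):
    """Return (per-category mean scores, overall mean over all judged turns)."""
    by_category = defaultdict(list)
    for category, _turn, score in judgments:
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range: {score}")
        by_category[category].append(score)
    per_category = {c: mean(scores) for c, scores in by_category.items()}
    overall = mean(score for _, _, score in judgments)
    return per_category, overall

per_category, overall = aggregate(judgments)
print(per_category, overall)
```

Reporting the per-category means next to the overall score helps show where a model is weak (here, the made-up math scores) even when the headline number looks acceptable.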
Enterprise AI procurement teams use MT-Bench scores alongside MMLU to compare models for multi-turn chat assistant deployments - a score above 8.0 is sometimes treated as an informal bar for complex instruction following, though the benchmark itself defines no such threshold.
AlpacaEval (Stanford) is a separate, single-turn benchmark of 805 diverse instructions that complements MT-Bench - instead of a 1-10 score, an LLM judge reports a win rate against a baseline model, such as the 52.4% win rate vs GPT-4 Turbo cited for Claude 3.5 Sonnet.
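A win rate of this kind is a simple proportion over pairwise comparisons. The sketch below is a minimal illustration, assuming each comparison is labeled `"win"`, `"tie"`, or `"loss"` from the evaluated model's point of view and that a tie counts as half a win; the `win_rate` helper and the sample outcomes are hypothetical and do not reproduce the AlpacaEval implementation.

```python
def win_rate(outcomes):
    """Percentage of pairwise comparisons won, counting a tie as half a win."""
    if not outcomes:
        raise ValueError("no comparisons to score")
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return 100.0 * sum(points[o] for o in outcomes) / len(outcomes)

# Hypothetical judge verdicts for four instructions.
outcomes = ["win", "loss", "win", "tie"]
rate = win_rate(outcomes)
print(f"{rate:.1f}%")
```

Because the result is relative to a specific baseline model, win rates measured against different baselines are not directly comparable.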