Glossary term
Evaluation and Benchmarks
Crowd-sourced model ranking system using Elo scoring from blind pairwise human preference votes.
Chatbot Arena (LMSYS) has collected 1M+ human preference votes comparing model outputs side by side. GPT-4o tops the Elo leaderboard at 1290, and enterprises use the ranking as a human-preference complement to benchmark scores.
Chatbot Arena has revealed that model performance on standard benchmarks (MMLU, HumanEval) does not always predict human preference, which motivates using Arena Elo alongside traditional benchmarks in procurement.
Mistral Large 2 rose to the top 5 in Arena Elo within weeks of release, validating its conversational quality despite limited independent benchmarking. Procurement teams use such rankings as a real-world quality signal.
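To make the Elo scoring in the definition concrete, here is a minimal sketch of the textbook Elo update applied to blind pairwise votes. It is illustrative only: the function names, the starting rating of 1000, the K-factor of 32, and the 400-point scale are assumptions chosen for the example, and the leaderboard's actual computation is not described in this entry and may differ.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a, rating_b, outcome_a, k=32.0):
    """Return updated (rating_a, rating_b) after one vote.

    outcome_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rate_from_votes(votes, initial=1000.0, k=32.0):
    """Process votes in order; each vote is (model_a, model_b, winner).

    winner is "a", "b", or "tie". Unseen models start at `initial` (assumed).
    """
    outcome_for = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, winner in votes:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(
            ra, rb, outcome_for[winner], k
        )
    return ratings


if __name__ == "__main__":
    # Hypothetical votes between placeholder model names.
    sample_votes = [
        ("model-x", "model-y", "a"),
        ("model-y", "model-z", "tie"),
        ("model-x", "model-z", "a"),
    ]
    for name, rating in sorted(
        rate_from_votes(sample_votes).items(), key=lambda kv: -kv[1]
    ):
        print(f"{name}: {rating:.1f}")
```

Because each update is zero-sum, ratings are only meaningful relative to one another: a gap between two models maps to an expected preference rate, not to an absolute quality score.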