Glossary term
Evaluation and Benchmarks
Using a language model to score or compare generated outputs against a rubric, in place of or alongside human raters.
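MT-Bench, from the LMSYS team behind Chatbot Arena, uses GPT-4 as an LLM judge to grade model responses, including pairwise comparisons, while Chatbot Arena itself ranks models from crowdsourced human votes - in the team's study, GPT-4's verdicts agreed with human preferences more than 80% of the time, about the same rate at which human raters agree with each other.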
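The agreement figure is the share of comparisons where the judge and the human pick the same winner. A minimal Python sketch of that calculation follows; the function name `agreement_rate` and the sample verdicts are hypothetical, made up for illustration, and are not taken from any published dataset.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where judge and human pick the same winner."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Hypothetical verdicts for five response pairs: "A", "B", or "tie".
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge, human)
print(f"{rate:.0%}")  # 4 of 5 verdicts match
```

Raw agreement does not correct for matches that happen by chance, so it is best read alongside the human-to-human agreement rate on the same items.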
Anthropic uses Claude as an LLM judge in its internal model evaluation pipeline - scoring generated responses for helpfulness, harmlessness, and honesty before human raters validate a sample.
LangSmith supports LLM-as-judge evaluators - enterprise teams define custom rubrics (e.g., 'Is the answer grounded in the retrieved context? Is it concise?') and a judge model such as GPT-4 scores production traces automatically.
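A rubric-based judge of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LangSmith's API: `build_judge_prompt`, `judge_answer`, and `stub_model` are hypothetical names, the JSON reply format is an assumed convention, and `call_model` stands in for whatever client sends a prompt to the judge model and returns its text.

```python
import json
from typing import Callable, Dict, List

# Rubric questions taken from the example above.
RUBRIC = [
    "Is the answer grounded in the retrieved context?",
    "Is it concise?",
]


def build_judge_prompt(question: str, context: str, answer: str, rubric: List[str]) -> str:
    """Assemble a grading prompt that asks for one true/false verdict per criterion."""
    criteria = "\n".join(f"{i}. {c}" for i, c in enumerate(rubric, start=1))
    return (
        "You are grading an answer against a rubric.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n"
        f"Rubric:\n{criteria}\n"
        "Reply with only a JSON object mapping each criterion number to true or false, "
        'for example {"1": true, "2": false}.'
    )


def judge_answer(
    call_model: Callable[[str], str],
    question: str,
    context: str,
    answer: str,
    rubric: List[str],
) -> Dict[str, bool]:
    """Ask the judge model for verdicts and validate the reply before trusting it."""
    raw = call_model(build_judge_prompt(question, context, answer, rubric))
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("judge reply is not a JSON object")
    verdicts: Dict[str, bool] = {}
    for i, criterion in enumerate(rubric, start=1):
        value = parsed.get(str(i))
        if not isinstance(value, bool):
            raise ValueError(f"judge gave no true/false verdict for criterion {i}")
        verdicts[criterion] = value
    return verdicts


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always returns the same verdicts."""
    return json.dumps({"1": True, "2": False})


result = judge_answer(
    stub_model,
    question="What is the refund window?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="You can get a refund within 30 days, and many customers find this generous.",
    rubric=RUBRIC,
)
print(result)
```

Passing the model call in as a function keeps the rubric logic testable without network access, and validating the reply guards against a judge that returns free text instead of the requested JSON.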