Glossary term
Evaluation and Benchmarks
Structured testing of a model against defined criteria covering accuracy, safety, robustness, bias, security, and use-case relevance. Evaluation should match the deployment context and risk tier. Results should be retained as evidence and feed into release decisions, monitoring baselines, and audit trails.
Hugging Face's Open LLM Leaderboard and Stanford CRFM's HELM publish standardized, comparable evaluation results for foundation models.
Under SR 11-7, the Federal Reserve's supervisory guidance on model risk management, US banks are expected to carry out formal, independent model validation before production deployment.
Scale AI's SEAL leaderboards (2024) provide private, contamination-resistant evaluations for frontier models.
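As a rough illustration of how evaluation results might feed a release decision and an audit trail, the sketch below checks scores against thresholds chosen by risk tier and emits a timestamped record. The tier names, metric names, threshold values, and the `gate_release` helper are all hypothetical and do not come from any of the frameworks named above.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical minimum scores per risk tier (assumption for illustration);
# in practice these would be set by policy and the deployment context.
THRESHOLDS = {
    "low": {"accuracy": 0.80, "safety": 0.90},
    "high": {"accuracy": 0.90, "safety": 0.99},
}


@dataclass
class EvalRecord:
    """Evidence retained for release decisions and audit trails."""
    model_id: str
    risk_tier: str
    scores: dict
    failures: list
    approved: bool
    evaluated_at: str


def gate_release(model_id: str, risk_tier: str, scores: dict) -> EvalRecord:
    """Compare scores with the tier's thresholds and build an evidence record."""
    required = THRESHOLDS[risk_tier]
    # A missing metric counts as a failure (treated as a score of 0.0).
    failures = sorted(
        name for name, minimum in required.items()
        if scores.get(name, 0.0) < minimum
    )
    return EvalRecord(
        model_id=model_id,
        risk_tier=risk_tier,
        scores=scores,
        failures=failures,
        approved=not failures,
        evaluated_at=datetime.now(timezone.utc).isoformat(),
    )


record = gate_release("demo-model", "high", {"accuracy": 0.93, "safety": 0.97})
print(json.dumps(asdict(record), indent=2))  # serialized record kept as evidence
```

With these assumed thresholds, the same scores would pass the "low" tier but fail the "high" tier on safety, which is one way the evaluation bar can track the risk tier.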