Glossary term
Evaluation and Benchmarks
Testing is how teams verify that an AI system works as expected before it goes live. It covers accuracy, behavior, and edge cases, so that customers and internal teams encounter fewer surprises once they start using it.
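As a rough illustration of checking accuracy, behavior, and edge cases together, the sketch below runs a small set of cases against a model and reports the pass rate. Everything in it is an assumption made for the example: `fake_model` is a hypothetical stand-in for a real model call, and `EvalCase`, `run_eval`, and the substring-match scoring are illustrative names and choices, not the API of any platform named in this entry.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    expected: str  # substring a passing answer must contain (assumed scoring rule)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; swap in an actual client."""
    text = prompt.strip().lower()
    if not text:
        return "Please provide a question."
    if "capital of france" in text:
        return "The capital of France is Paris."
    if "2 + 2" in text:
        return "2 + 2 equals 4."
    return "I don't know."


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case and record whether the expected substring appears."""
    return {c.name: c.expected.lower() in model(c.prompt).lower() for c in cases}


def accuracy(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


CASES = [
    EvalCase("accuracy_geography", "What is the capital of France?", "Paris"),
    EvalCase("accuracy_arithmetic", "What is 2 + 2?", "4"),
    EvalCase("edge_empty_input", "   ", "provide a question"),
    EvalCase("behavior_unknown", "Who won the 2090 World Cup?", "don't know"),
]

results = run_eval(fake_model, CASES)
score = accuracy(results)
print(results, score)
```

Substring matching is the simplest possible scorer and is easy to fool; real evaluations often use exact-match, rubric-based, or model-graded scoring instead.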
LangSmith Evaluations, Promptfoo, and Patronus AI are widely used LLM testing platforms.
DeepEval is an open-source pytest-style framework for testing LLM applications.
Giskard and Robust Intelligence test models for bias, robustness, and vulnerability to adversarial attacks.
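One simple form of robustness testing is to perturb an input in ways that should not change the answer and measure how often the output stays the same. The sketch below shows that idea only; it does not use Giskard's or Robust Intelligence's APIs. `toy_classifier` is a hypothetical model, made case-sensitive on purpose so the check has a flaw to find, and `perturb` and `consistency_rate` are illustrative names.

```python
from typing import Callable, List


def perturb(prompt: str) -> List[str]:
    """Return surface-level variants that should not change the meaning."""
    return [
        prompt.upper(),             # all caps
        prompt.lower(),             # all lowercase
        "  " + prompt + "  ",       # stray whitespace
        prompt.replace("?", " ?"),  # spacing before punctuation
    ]


def toy_classifier(text: str) -> str:
    """Hypothetical intent model; deliberately case-sensitive."""
    return "refund" if "refund" in text else "other"


def consistency_rate(model: Callable[[str], str], prompt: str) -> float:
    """Share of perturbed prompts whose output matches the unperturbed one."""
    baseline = model(prompt)
    variants = perturb(prompt)
    same = sum(model(v) == baseline for v in variants)
    return same / len(variants)


rate = consistency_rate(toy_classifier, "Can I get a refund?")
print(rate)
```

A rate below 1.0 flags inputs where the model's output depends on formatting rather than meaning; here the all-caps variant is the one that changes the result.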