Glossary term
Evaluation and Benchmarks
The ability to understand a system's behavior through its logs, metrics, traces, and feedback.
Weights & Biases (W&B) Weave provides LLM observability by logging every prompt, completion, tool call, and evaluation score; it is used by Cohere and Mistral customers to monitor production agent quality.
Arize Phoenix (open source) provides a full LLM observability stack, including traces, evaluations, and drift detection; enterprise AI teams deploy it to monitor RAG pipeline quality over time.
The Datadog LLM Observability dashboard tracks token usage, p95 latency, error rates, guardrail trigger rates, and cost per request across all LLM endpoints; it is used for SLA monitoring and cost optimisation.
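As a rough illustration of how metrics like those named above can be derived from per-request logs, the following Python sketch aggregates them from a list of records. The `LLMCall` record, its field names, the `summarize` helper, and the flat `usd_per_1k_tokens` price are all assumptions made for illustration; this is not the Datadog, W&B Weave, or Arize Phoenix API, and real platforms price prompt and completion tokens separately.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    """One logged LLM request (hypothetical schema, for illustration only)."""
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    error: bool = False
    guardrail_triggered: bool = False


def p95(values):
    """Return the 95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(calls, usd_per_1k_tokens):
    """Aggregate per-request records into dashboard-style metrics.

    usd_per_1k_tokens is an assumed flat price applied to all tokens.
    """
    if not calls:
        raise ValueError("summarize requires at least one call")
    n = len(calls)
    total_tokens = sum(c.prompt_tokens + c.completion_tokens for c in calls)
    return {
        "requests": n,
        "total_tokens": total_tokens,
        "latency_p95_ms": p95([c.latency_ms for c in calls]),
        "error_rate": sum(c.error for c in calls) / n,
        "guardrail_trigger_rate": sum(c.guardrail_triggered for c in calls) / n,
        "cost_per_request_usd": total_tokens / 1000 * usd_per_1k_tokens / n,
    }
```

Calling `summarize(calls, usd_per_1k_tokens=0.002)` on a batch of `LLMCall` records returns a dictionary that could back a simple dashboard or an SLA check. The nearest-rank percentile is one of several common definitions; other tools may interpolate and report slightly different p95 values.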