Glossary term
Evaluation and Benchmarks
A recorded sequence of the model calls, tool calls, inputs, outputs, and events that make up a single run of an LLM application or agent.
LangSmith, when tracing is enabled, captures a full trace for each LangChain agent invocation, storing every LLM call, tool call, input and output, latency, and token count. Teams use these traces to debug unexpected agent behavior in production.
Arize AI's tracing platform ingests traces from OpenAI and Anthropic agents and visualizes multi-step reasoning chains as waterfall diagrams. ML engineers use these views to identify where agents make incorrect tool selections.
OpenTelemetry's semantic conventions for LLM traces are adopted by Datadog and Grafana, enabling standard trace collection from any LLM framework (such as LangChain, LlamaIndex, or Semantic Kernel) into existing observability stacks.
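To make the definition concrete, the fields mentioned in the examples above (LLM calls, tool calls, inputs, outputs, latency, token counts) can be sketched as a small data structure. This is only an illustrative sketch and not the schema used by LangSmith, Arize, or OpenTelemetry: the names Span, Trace, and example_trace, the kind labels, and all sample values are hypothetical, and the latency total assumes steps run one after another.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One recorded step in a run: a model call, a tool call, or another event."""
    kind: str                  # hypothetical labels: "llm", "tool", "event"
    name: str
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    latency_ms: float
    tokens: int = 0            # token count; left at zero for non-model steps


@dataclass
class Trace:
    """An ordered record of the spans produced by a single run."""
    run_id: str
    spans: list[Span] = field(default_factory=list)

    def record(self, span: Span) -> None:
        # Spans are appended in the order the steps occurred.
        self.spans.append(span)

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.spans)

    def total_latency_ms(self) -> float:
        # Assumes sequential steps; parallel steps would need start/end times.
        return sum(s.latency_ms for s in self.spans)

    def tool_calls(self) -> list[Span]:
        return [s for s in self.spans if s.kind == "tool"]


def example_trace() -> Trace:
    """Build a made-up three-step agent run for illustration."""
    trace = Trace(run_id="run-001")
    trace.record(Span("llm", "plan", {"prompt": "Weather in Paris?"},
                      {"tool": "get_weather"}, latency_ms=420.0, tokens=57))
    trace.record(Span("tool", "get_weather", {"city": "Paris"},
                      {"temp_c": 18}, latency_ms=130.0))
    trace.record(Span("llm", "answer", {"observation": {"temp_c": 18}},
                      {"text": "It is 18 C in Paris."}, latency_ms=380.0, tokens=41))
    return trace
```

Filtering with tool_calls() mirrors the debugging use described above: once every step is recorded with its inputs and outputs, an incorrect tool selection can be located by inspecting the tool spans in order.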