Glossary term
Evaluation and Benchmarks
Benchmark testing whether models correctly select tools and produce valid tool arguments.
The Berkeley Function-Calling Leaderboard (BFCL) evaluates 50+ models on tool selection and argument generation across 2,000 function-calling scenarios; GPT-4o and Claude 3.5 Sonnet score above 90% on simple tasks.
Gorilla LLM (UC Berkeley) fine-tunes LLaMA for function calling and benchmarks it against Hugging Face Hub, Torch Hub, and TensorFlow Hub APIs, showing that a model fine-tuned on API documentation can outperform GPT-4 on tool selection.
ToolBench evaluates models on real-world API calls drawn from 16,000 APIs; enterprise teams use it to select base models for API-heavy agent deployments in CRM, ERP, and ITSM integrations.
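The two checks named in the definition, tool selection and argument validity, can be sketched in a few lines of Python. This is a minimal illustration only, not the scoring procedure of any benchmark listed here; the `TOOLS` registry, the tool names, and the `score_call` helper are all hypothetical.

```python
from typing import Any

# Hypothetical tool registry: parameter name -> (expected Python type, required?)
TOOLS = {
    "get_weather": {"city": (str, True), "unit": (str, False)},
    "create_ticket": {"title": (str, True), "priority": (int, True)},
}


def score_call(predicted: dict[str, Any], expected_tool: str) -> dict[str, bool]:
    """Score one model-produced call on tool selection and argument validity."""
    name = predicted.get("name")
    args = predicted.get("arguments", {})

    # Check 1: did the model pick the tool the scenario expects?
    tool_selected = name == expected_tool

    # Check 2: are the arguments valid for the tool the model actually called?
    schema = TOOLS.get(name)
    arguments_valid = False
    if schema is not None and isinstance(args, dict):
        required = {param for param, (_, is_required) in schema.items() if is_required}
        arguments_valid = required.issubset(args) and all(
            key in schema and isinstance(value, schema[key][0])
            for key, value in args.items()
        )

    return {"tool_selected": tool_selected, "arguments_valid": arguments_valid}


good = score_call({"name": "get_weather", "arguments": {"city": "Oslo"}}, "get_weather")
bad = score_call(
    {"name": "create_ticket", "arguments": {"title": "VPN down", "priority": "high"}},
    "get_weather",
)
```

Here `good` passes both checks, while `bad` fails both: it calls the wrong tool and passes a string where the hypothetical schema expects an integer. Reporting the two checks separately keeps selection errors distinguishable from argument errors.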