Glossary term
Evaluation and Benchmarks
A test that checks whether a change breaks behavior that previously worked.
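As a minimal sketch of this definition, the check can be framed as: run the same labelled cases through the old and the new version, and report the cases the old version got right but the new one gets wrong. Everything named here (`Case`, `find_regressions`, `agent_v1`, `agent_v2`, and the toy contract snippets) is hypothetical and for illustration only; real eval suites usually score outputs with something looser than exact string equality.

```python
from typing import Callable, Iterable, NamedTuple


class Case(NamedTuple):
    """One labelled test case (hypothetical structure)."""
    case_id: str
    input_text: str
    expected: str


def find_regressions(
    cases: Iterable[Case],
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
) -> list[str]:
    """Return IDs of cases the baseline got right but the candidate gets wrong."""
    regressed = []
    for case in cases:
        baseline_ok = baseline(case.input_text) == case.expected
        candidate_ok = candidate(case.input_text) == case.expected
        if baseline_ok and not candidate_ok:
            regressed.append(case.case_id)
    return regressed


# Toy stand-ins for two versions of a system (assumptions, not real agents).
def agent_v1(text: str) -> str:
    return "flag" if "indemnity" in text else "pass"


def agent_v2(text: str) -> str:
    # Narrower match than v1, so some previously flagged inputs now slip through.
    return "flag" if "indemnity clause" in text else "pass"


cases = [
    Case("c1", "standard indemnity clause", "flag"),
    Case("c2", "broad indemnity wording", "flag"),
    Case("c3", "plain payment terms", "pass"),
]

regressions = find_regressions(cases, agent_v1, agent_v2)
print(regressions)
```

Cases that both versions fail are deliberately not reported: they are existing defects, not regressions introduced by the change.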
OpenAI runs thousands of regression evals before each GPT-4 update, comparing performance on curated benchmark suites to ensure that new versions don't degrade accuracy on known-good tasks.
LangSmith's dataset versioning lets teams run regression suites against new prompt versions. For example, a contract-review agent team runs 500 historical contracts through the new agent version before deploying.
Brex's AI platform team runs weekly regression tests on their LLM-powered spend-categorisation model, flagging any category accuracy drop above 2% as a blocking issue before production deployment.
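A threshold gate like the one in the last example could look roughly like the sketch below. It assumes the 2% figure means an absolute drop of two percentage points per category, which the example does not specify; the function name `blocking_categories`, the category names, and the accuracy figures are all invented for illustration.

```python
def blocking_categories(baseline, candidate, max_drop=0.02):
    """Return {category: drop} for categories whose accuracy fell by more than max_drop.

    Accuracies are fractions in [0, 1]; max_drop is an absolute difference
    (0.02 = two percentage points), which is an assumption of this sketch.
    """
    blocked = {}
    for category, base_acc in baseline.items():
        # A category missing from the candidate run is treated as 0% accurate.
        new_acc = candidate.get(category, 0.0)
        drop = base_acc - new_acc
        if drop > max_drop:
            blocked[category] = round(drop, 4)
    return blocked


# Made-up per-category accuracies for the current and proposed model versions.
baseline_acc = {"travel": 0.95, "software": 0.91, "meals": 0.88}
candidate_acc = {"travel": 0.96, "software": 0.86, "meals": 0.87}

blocked = blocking_categories(baseline_acc, candidate_acc)
if blocked:
    print("Deployment blocked:", blocked)
else:
    print("No blocking regressions")
```

Checking each category separately matters because an overall accuracy figure can stay flat while one category degrades sharply.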