Glossary term
Evaluation and Benchmarks
COPA (Choice of Plausible Alternatives) is a dataset for evaluating how well an LLM can identify the better of two alternative answers to a premise. Each challenge in the dataset consists of three components:
A premise, which is typically a statement followed by a question
Two possible answers to the question posed in the premise, one of which is correct and the other incorrect
The correct answer
For example:
Premise: The man broke his toe. What was the CAUSE of this?
Possible answers:
1. He got a hole in his sock.
2. He dropped a hammer on his foot.
Correct answer: 2
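The three components map naturally onto a small record type, and scoring reduces to counting how often a model picks the labeled answer. The following is a minimal sketch, not the dataset's actual file format or any library's API: the names `CopaItem`, `accuracy`, and `always_second` are hypothetical, and the 1-based label simply mirrors the example above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class CopaItem:
    """One challenge: a premise, two candidate answers, and the correct one."""
    premise: str                # statement followed by a question
    choices: Tuple[str, str]    # the two possible answers
    label: int                  # 1-based index of the correct answer (assumption)


def accuracy(items: Sequence[CopaItem], predict: Callable[[CopaItem], int]) -> float:
    """Return the fraction of items where predict() matches the label."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.label)
    return correct / len(items)


example = CopaItem(
    premise="The man broke his toe. What was the CAUSE of this?",
    choices=("He got a hole in his sock.", "He dropped a hammer on his foot."),
    label=2,
)


def always_second(item: CopaItem) -> int:
    # Placeholder standing in for a real model call; it always answers 2.
    return 2


print(accuracy([example], always_second))  # 1.0
```

Because every challenge has exactly two answers, a predictor that guesses at random is expected to score about 0.5, which is the usual floor to compare a model against.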
COPA is a component of the SuperGLUE ensemble.
Created for this library
An evaluation team includes COPA in its reasoning benchmark suite to test commonsense cause-and-effect reasoning in candidate LLMs.
A model release group tracks COPA performance over time to catch regressions in commonsense reasoning across fine-tuning passes.
A research lab reports COPA scores in its preprint to compare its model against published commonsense reasoning baselines.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License