Glossary term
Evaluation and Benchmarks
A metric to determine the quality of code (for example, Python) that a large language model generates. More specifically, pass at k (also written pass@k) tells you the likelihood that at least one generated block of code out of k generated blocks of code will pass all of its unit tests.
Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple (k) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:
If one or more of those solutions pass all of the unit tests, then the LLM passes that code generation challenge.
If none of the solutions pass all of the unit tests, then the LLM fails that code generation challenge.
The formula for pass at k is as follows:
pass at k = (total number of passes) / (total number of challenges)
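The pass/fail rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the score is the fraction of challenges the LLM passes, and the function name `pass_at_k`, the nested-list layout of `results`, and the sample data are all hypothetical.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of challenges where at least one of the first k samples passed.

    results[i][j] is True if generated solution j for challenge i passed
    all of its unit tests (hypothetical data layout).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("need at least one challenge")
    # A challenge counts as a pass if any of its first k solutions passed.
    passes = sum(1 for samples in results if any(samples[:k]))
    return passes / len(results)


# Made-up outcomes: 4 challenges, 3 generated solutions each.
results = [
    [False, True, False],   # passes (second solution works)
    [False, False, False],  # fails (no solution works)
    [True, True, False],    # passes
    [False, False, True],   # passes (third solution works)
]

score = pass_at_k(results, k=3)
print(score)  # 3 of the 4 challenges pass
```

Here three of the four challenges have at least one passing solution, so the score is 3/4.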
In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.
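The tradeoff between score and cost can be illustrated with a small sweep over k. This is a sketch under stated assumptions: the helper `pass_at_k_curve` and the sample outcomes are hypothetical, the score is taken to be the fraction of challenges with at least one passing solution among the first k, and cost is counted simply as the number of solutions generated and tested.

```python
from typing import List, Sequence, Tuple


def pass_at_k_curve(
    results: Sequence[Sequence[bool]], max_k: int
) -> List[Tuple[int, float, int]]:
    """Return (k, pass at k, solutions generated) for k = 1..max_k.

    results[i][j] is True if solution j for challenge i passed all of its
    unit tests (hypothetical data layout).
    """
    curve = []
    for k in range(1, max_k + 1):
        # Count challenges with at least one passing solution among the first k.
        passes = sum(1 for samples in results if any(samples[:k]))
        # Each challenge needs k generations and k unit-test runs.
        cost = k * len(results)
        curve.append((k, passes / len(results), cost))
    return curve


# Made-up outcomes: 4 challenges, 3 generated solutions each.
results = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]

for k, score, cost in pass_at_k_curve(results, max_k=3):
    print(f"k={k}  pass at k={score:.2f}  solutions generated={cost}")
```

On this made-up data the score rises from 0.25 at k=1 to 0.75 at k=3, while the number of solutions to generate and test grows from 4 to 12.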
Created for this library
An LLM evaluation team uses pass@k as the headline metric for coding tasks where multiple candidate solutions are produced per problem.
A research lab reports pass@k on coding benchmarks in its model card so enterprise developers can compare model versions.
A model release team uses pass@k to grade candidate models on whether at least one of k samples is correct.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License