Pass at K (pass@k)

A metric to determine the quality of code (for example, Python) that a large language model generates. More specifically, pass at k tells you the likelihood that at least one generated block of code out of k generated blocks of code will pass all of its unit tests.

Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple (k) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:

If one or more of those solutions pass the unit test, then the LLM Passes that code generation challenge.

If none of the solutions pass the unit test, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.

Click the icon for an example.

Real-world uses

Created for this library

1.
An LLM evaluation team uses pass@k as the headline metric for coding tasks where multiple candidate solutions are produced per problem.
2.
A research lab reports pass@k on coding benchmarks in its model card so enterprise developers can compare model versions.
3.
A model release team uses pass@k to grade candidate models on whether at least one of k samples is correct.

Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License

Back to glossary

If one or more of those solutions pass the unit test, then the LLM Passes that code generation challenge.

If none of the solutions pass the unit test, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.

Click the icon for an example.

Real-world uses

Created for this library

1.
An LLM evaluation team uses pass@k as the headline metric for coding tasks where multiple candidate solutions are produced per problem.
2.
A research lab reports pass@k on coding benchmarks in its model card so enterprise developers can compare model versions.
3.
A model release team uses pass@k to grade candidate models on whether at least one of k samples is correct.

Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License

Back to glossary

Real-world uses

Loading…

Real-world uses