Glossary term
Evaluation and Benchmarks
Mostly Basic Python Problems (MBPP) is a dataset for evaluating an LLM's proficiency in generating Python code. It provides about 1,000 crowd-sourced programming problems. Each problem in the dataset contains:
A task description
Solution code
Three automated test cases
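The three parts listed above suggest how a generated solution might be scored: run the candidate code, then run each test case against it. The following is a minimal sketch under stated assumptions, not the benchmark's official harness. The `Problem` class, its field names, the `passes_all_tests` helper, and the sample problem are all hypothetical and written only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical container mirroring the three parts of a problem."""
    description: str   # the task description shown to the model
    solution: str      # reference solution code
    tests: list[str]   # assert-style automated test cases


def passes_all_tests(candidate_code: str, tests: list[str]) -> bool:
    """Run candidate code, then each test, in one fresh namespace."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # defines the candidate's functions
        for test in tests:
            exec(test, namespace)        # raises AssertionError on failure
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a fail.
        return False
    return True


# Illustrative problem written for this sketch; not taken from the dataset.
example = Problem(
    description="Write a function to add two numbers.",
    solution="def add(a, b):\n    return a + b",
    tests=[
        "assert add(1, 2) == 3",
        "assert add(-1, 1) == 0",
        "assert add(0, 0) == 0",
    ],
)

# A deliberately wrong candidate, standing in for a bad model output.
wrong_candidate = "def add(a, b):\n    return a - b"

reference_ok = passes_all_tests(example.solution, example.tests)
wrong_ok = passes_all_tests(wrong_candidate, example.tests)
print(reference_ok, wrong_ok)
```

Because `exec` runs arbitrary code, a real evaluation would normally execute model output in a sandboxed process with a timeout rather than in the evaluator's own interpreter.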
Created for this library
An LLM evaluation team uses Mostly Basic Python Problems to measure basic Python coding ability before promoting a new model.
A research lab reports MBPP scores in its model card so enterprise developers can compare basic coding ability across model versions.
A model release team uses MBPP as a baseline coding benchmark alongside harder coding benchmarks when evaluating releases.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License