Glossary term
Foundations
A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:
1,000,000 negative labels
10 positive labels
The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.
In contrast, the following dataset is class-balanced because the ratio of negative labels to positive labels is relatively close to 1:
517 negative labels
483 positive labels
Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is class-imbalanced because one label has far more examples than the other two:
1,000,000 labels with class "green"
200 labels with class "purple"
350 labels with class "orange"
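As a rough illustration, the counts above can be reduced to a single number by dividing the largest class count by the smallest. This is only a sketch: the helper name `imbalance_ratio` and the dictionary layout are assumptions made for this example, and the glossary does not define a threshold at which a dataset counts as imbalanced.

```python
def imbalance_ratio(label_counts):
    """Return the largest class count divided by the smallest class count."""
    counts = list(label_counts.values())
    if not counts or min(counts) <= 0:
        raise ValueError("every class needs at least one label")
    return max(counts) / min(counts)


# Label counts copied from the three example datasets in this entry.
binary_imbalanced = {"negative": 1_000_000, "positive": 10}
binary_balanced = {"negative": 517, "positive": 483}
multi_class = {"green": 1_000_000, "purple": 200, "orange": 350}

for name, counts in [
    ("binary, imbalanced", binary_imbalanced),
    ("binary, balanced", binary_balanced),
    ("multi-class, imbalanced", multi_class),
]:
    print(f"{name}: {imbalance_ratio(counts):,.2f} to 1")
```

A ratio near 1 suggests a class-balanced dataset, while a ratio in the thousands or more suggests a class-imbalanced one.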
Training on class-imbalanced datasets can present special challenges. See Imbalanced datasets in Machine Learning Crash Course for details.
See also entropy, majority class, and minority class.
Created for this library
A fraud team trains on a class-imbalanced dataset in which positive cases make up less than 1 percent of examples and uses class weights to keep the model from predicting only the majority class.
A medical screening team works with a class-imbalanced dataset for rare cancers and reports precision-recall curves rather than ROC curves, because ROC curves can look overly optimistic when negatives vastly outnumber positives.
A churn team handles a class-imbalanced dataset by undersampling retained customers during training so that gradient updates are not dominated by the majority class.
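Two of the techniques mentioned in these examples, class weights and undersampling, can be sketched with the Python standard library alone. The function names, the 990-to-10 toy labels, and the fixed seed are assumptions made for this illustration. The weighting formula, n_samples / (n_classes * class_count), is one common inverse-frequency heuristic and not necessarily what any particular team uses.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def undersample_majority(examples, labels, seed=0):
    """Randomly keep as many examples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for example, label in zip(examples, labels):
        by_class.setdefault(label, []).append(example)
    target = min(len(items) for items in by_class.values())
    sampled = []
    for label, items in by_class.items():
        # rng.sample draws without replacement, so no example is duplicated.
        for example in rng.sample(items, target):
            sampled.append((example, label))
    rng.shuffle(sampled)
    return sampled


# Hypothetical toy data: 990 negatives (label 0) and 10 positives (label 1).
labels = [0] * 990 + [1] * 10
examples = list(range(len(labels)))

weights = balanced_class_weights(labels)
balanced = undersample_majority(examples, labels)

print(weights)
print(Counter(label for _, label in balanced))
```

Class weights keep every example but scale each one's contribution to the loss, whereas undersampling discards majority-class examples. Which trade-off is preferable depends on how much data the minority class provides.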
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License