Glossary term
Training and Fine-Tuning
Data augmentation expands a training set by transforming existing examples, for instance by rephrasing sentences or adding noise. It can improve model robustness and performance without collecting large amounts of new data.
Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
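The rotate, stretch, and reflect transformations described above can be illustrated with a minimal sketch. It assumes an image is a plain list of pixel rows. The helper names (`reflect`, `rotate90`, `stretch_horizontal`, `augment`) are hypothetical and chosen only for this illustration. A real pipeline would normally use an image library rather than nested lists.

```python
def reflect(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def stretch_horizontal(img, factor=2):
    """Widen the image by repeating each pixel `factor` times (nearest neighbour)."""
    return [[px for px in row for _ in range(factor)] for row in img]


def augment(img, label):
    """Return the original plus three transformed variants, all sharing one label."""
    variants = [img, reflect(img), rotate90(img), stretch_horizontal(img)]
    return [(variant, label) for variant in variants]


# A 2x2 toy "image" with a hypothetical label.
image = [[1, 2],
         [3, 4]]
for variant, label in augment(image, "cat"):
    print(label, variant)
```

Each variant keeps the original label, which is why one labeled image can yield several labeled training examples. This only holds for transformations that do not change what the label means. Reflecting a handwritten digit, for example, may not be label-preserving.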
Albumentations is a Python library widely used for image data augmentation in PyTorch training pipelines.
Gretel.ai and Mostly AI provide synthetic data augmentation for tabular enterprise datasets.
NLP teams use libraries like nlpaug to augment training data with paraphrasing and back-translation.
Created for this library
A computer vision team uses data augmentation with random crops, flips, and color jitter to make its detector robust without collecting more images.
A speech recognition vendor uses data augmentation with simulated noise and reverberation to make its model robust on smartphone microphones.
An NLP team uses back-translation as a data augmentation strategy to expand its training set for low-resource languages.
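As a rough illustration of the simulated-noise case above, the sketch below adds Gaussian noise to a waveform at a chosen signal-to-noise ratio (SNR). The function name `add_noise`, the list-of-floats waveform, and the choice of white Gaussian noise are all assumptions made for this example. Production systems often mix in recorded background noise and room impulse responses instead.

```python
import math
import random


def add_noise(signal, snr_db, rng=None):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db` dB SNR."""
    rng = rng or random.Random()
    # Average power of the clean signal.
    signal_power = sum(s * s for s in signal) / len(signal)
    # SNR in dB is 10 * log10(signal_power / noise_power); solve for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# One period of a sine wave as a stand-in for a speech waveform.
clean = [math.sin(2 * math.pi * n / 100) for n in range(100)]
noisy = add_noise(clean, snr_db=10, rng=random.Random(0))
print(len(noisy))
```

Lower `snr_db` values produce noisier copies. Training on several SNR levels per utterance is one way to expose a model to a range of recording conditions.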
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License