Glossary term
Multimodal AI
Training objective that aligns visual and language representations in a shared embedding space, so that semantically matching images and texts have similar embeddings and mismatched ones do not.
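One common way to implement this objective is a symmetric softmax contrastive loss over a batch of paired embeddings. The sketch below is a minimal pure-Python illustration, not any particular model's implementation; the function names and the temperature value of 0.07 are assumptions chosen for the example.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs[k] and text_embs[k] are assumed to be a matching pair.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] is the scaled similarity of image i and text j.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    columns = [list(col) for col in zip(*logits)]
    # Each image should pick its own text (rows) and vice versa (columns).
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy 2-D embeddings (hypothetical values for illustration only).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]

print(contrastive_loss(images, aligned_texts))
print(contrastive_loss(images, swapped_texts))
```

The loss is low when each image is most similar to its own text and high when the pairs are swapped, which is the behavior the definition describes.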
ALIGN (Google, 2021) trains on 1.8 billion noisy image alt-text pairs with only minimal frequency-based filtering, showing that scale can compensate for noise in contrastive alignment and yielding zero-shot quality comparable to or better than CLIP.
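SigLIP (Google, 2023) replaces CLIP's softmax contrastive loss with a pairwise sigmoid loss that scores each image-text pair as an independent binary classification. Because no batch-wide normalization is needed, training is efficient at smaller batch sizes, and zero-shot image classification accuracy improves.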
BLIP-2 uses a lightweight Q-Former module to bridge a frozen pretrained image encoder (such as a CLIP ViT) with a frozen LLM. Only the Q-Former is trained, so image-text alignment is learned without retraining either component, reducing alignment training cost by roughly 10x.
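To make the SigLIP entry concrete, the pairwise sigmoid loss can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function names are hypothetical, and the scale and bias are fixed to illustrative values here, whereas in practice they are learnable parameters.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    # image_embs[k] and text_embs[k] are assumed to be a matching pair.
    # scale and bias are illustrative constants (assumption for this sketch).
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            # Label +1 for the matching pair, -1 for every other pair.
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * dot(img, txt) + bias))
    # Sum over all pairs, averaged over the number of images in the batch.
    return -total / n

# Toy 2-D embeddings (hypothetical values for illustration only).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]

print(sigmoid_pairwise_loss(images, aligned_texts))
print(sigmoid_pairwise_loss(images, swapped_texts))
```

Unlike the softmax form, each pair contributes its own independent term, so the loss does not require normalizing over the whole batch.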