Image-Text Alignment

A training objective that aligns visual and language representations in a shared embedding space, so that semantically matching images and texts have similar embeddings.

1. ALIGN (Google, 2021) trains on 1.8 billion noisy image-text pairs with only minimal filtering, showing that scale can compensate for noise in contrastive alignment and reaching CLIP-level quality with a larger but noisier dataset.
2. SigLIP (Google, 2023) replaces CLIP's softmax contrastive loss with a pairwise sigmoid loss that treats each image-text pair as an independent binary classification. Because the loss needs no normalization across the batch, it trains efficiently at smaller batch sizes, and it improves zero-shot image classification accuracy.
3. BLIP-2 (Salesforce, 2023) uses a lightweight Q-Former module to bridge a frozen pre-trained image encoder (such as a CLIP ViT) with a frozen LLM. It learns image-text alignment without retraining either component, reducing alignment training cost by roughly 10x.
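
The difference between the softmax contrastive loss used by CLIP and ALIGN and the sigmoid loss used by SigLIP can be sketched in a few lines of NumPy. This is a minimal illustration, not any paper's reference implementation: the function names are hypothetical, the temperature, scale, and bias are fixed constants here (in the published models they are learned), and row i of the image batch is assumed to be the true match for row i of the text batch.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def softmax_contrastive_loss(img, txt, temperature=0.07):
    """CLIP-style symmetric loss: each image must pick its text out of the
    whole batch, and each text must pick its image. temperature is an
    assumed constant for this sketch."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def sigmoid_pairwise_loss(img, txt, scale=10.0, bias=-10.0):
    """SigLIP-style loss: every (image, text) pair is scored independently
    as match (+1) or non-match (-1). scale and bias are assumed constants."""
    logits = scale * (l2_normalize(img) @ l2_normalize(txt).T) + bias
    labels = 2.0 * np.eye(logits.shape[0]) - 1.0  # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) equals log(1 + exp(-x)), computed stably with logaddexp.
    return np.logaddexp(0.0, -labels * logits).sum() / logits.shape[0]


# Toy batch: four one-hot "embeddings", so aligned pairs have similarity 1.
images = np.eye(4)
aligned_texts = np.eye(4)
shuffled_texts = np.roll(aligned_texts, 1, axis=0)  # every pair is mismatched

softmax_aligned = softmax_contrastive_loss(images, aligned_texts)
softmax_shuffled = softmax_contrastive_loss(images, shuffled_texts)
sigmoid_aligned = sigmoid_pairwise_loss(images, aligned_texts)
sigmoid_shuffled = sigmoid_pairwise_loss(images, shuffled_texts)
```

In both cases the aligned batch yields a much lower loss than the shuffled one. The practical difference is that the softmax version normalizes over a full row or column of the similarity matrix, so each term depends on every other example in the batch, while each term of the sigmoid version depends on a single pair only.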
