Glossary term
Multimodal AI
A training approach that pulls the representations of similar (positive) pairs together and pushes the representations of dissimilar (negative) pairs apart in embedding space.
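As a rough illustration of the pull-together, push-apart objective, here is a minimal sketch of an InfoNCE-style loss, one common formulation of contrastive learning. The function names, the temperature of 0.1, and the toy 2-D embeddings are assumptions made for this example; real systems compute the same quantity over batches of encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a batch.

    positives[i] is the matching partner of anchors[i]; every other
    positives[j] is treated as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matching pair as the correct "class".
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy 2-D embeddings, made up for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each positive sits near its anchor
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # positives swapped between anchors

loss_aligned = info_nce_loss(anchors, aligned)
loss_misaligned = info_nce_loss(anchors, misaligned)
```

The loss is small when each anchor is closest to its own positive and large when it is closer to another item in the batch, so minimizing it moves matching pairs together and non-matching pairs apart.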
SimCLR (Google) uses contrastive learning to train visual representations without labels: two augmented views of the same image are pulled together while views of other images in the batch are pushed apart, yielding a transferable encoder.
CLIP (OpenAI) uses contrastive learning on 400M image-text pairs to align visual and language embeddings, which enables zero-shot image classification by comparing an image embedding against the text embeddings of the class labels.
E5-Mistral (Microsoft) uses contrastive learning to train a universal text embedding model, pulling semantically similar text pairs together and pushing dissimilar pairs apart, with synthetic training data spanning 93 languages.
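The zero-shot classification step mentioned for CLIP can be sketched as a nearest-neighbor lookup in the shared embedding space. This is a hedged illustration, not CLIP's actual API: `zero_shot_classify` is a hypothetical helper, and the 3-D vectors stand in for embeddings that trained image and text encoders would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image."""
    return max(
        label_embeddings,
        key=lambda label: cosine(image_embedding, label_embeddings[label]),
    )


# Hypothetical embeddings; in practice these come from the text encoder
# applied to prompts built from the class labels.
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

# Hypothetical image embedding from the image encoder.
image_embedding = [0.8, 0.3, 0.1]

predicted = zero_shot_classify(image_embedding, label_embeddings)
```

Because the classes are specified only as text, new labels can be added by embedding new prompts, without retraining the encoders.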