Glossary term
Multimodal AI
CLIP (Contrastive Language-Image Pre-training): a model that learns a joint embedding space for text and images via contrastive learning on 400M image-text pairs, pulling matching pairs together and pushing mismatched pairs apart.
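The contrastive objective can be sketched as a symmetric cross-entropy over a batch of paired embeddings. This is a minimal illustration, not the original training code: the function names are hypothetical, the embeddings are toy vectors standing in for encoder outputs, and the fixed temperature of 0.07 is an assumption (in practice the temperature is typically a learned parameter).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch; pair i is (image i, text i)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: aligned pairs should score a much lower loss than shuffled pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Averaging the image-to-text and text-to-image directions is what makes the loss symmetric: each image must pick out its caption, and each caption must pick out its image.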
OpenAI's CLIP is used by Unsplash to power semantic image search - users query 'warm sunset over mountains' and CLIP retrieves visually matching images without manual keyword tagging.
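Semantic search of this kind reduces to nearest-neighbor lookup in the shared embedding space. The sketch below assumes the query text and the images have already been embedded; the file names, the three-dimensional vectors, and the `search` helper are all hypothetical stand-ins, not any production system's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, image_index, top_k=2):
    """Return the names of the top_k images most similar to the query."""
    scored = [(cosine(query_embedding, emb), name)
              for name, emb in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Hypothetical precomputed image embeddings (real ones have hundreds of dimensions).
image_index = {
    "sunset_mountains.jpg": [0.9, 0.1, 0.0],
    "city_night.jpg": [0.1, 0.9, 0.1],
    "forest_fog.jpg": [0.2, 0.1, 0.9],
}

# Hypothetical embedding of the text query 'warm sunset over mountains'.
query = [0.8, 0.2, 0.1]
results = search(query, image_index, top_k=2)
```

No keyword tags are involved: the ranking depends only on how close each image vector lies to the query vector. At larger scale the linear scan would usually be replaced by an approximate nearest-neighbor index.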
CLIP encoders supply the text-image alignment used to condition generation in several text-to-image systems: Stable Diffusion v1 conditions its denoiser on the output of a frozen CLIP text encoder, and DALL-E 2 generates images from CLIP image embeddings predicted from the caption.
Meta's AI infrastructure uses CLIP-based embeddings to power content moderation at scale - images are embedded and compared against known policy-violating content clusters before human review.
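A pre-review filter of the kind described can be sketched as a threshold test against cluster centroids. Everything here is illustrative: the cluster labels, the toy vectors, the 0.85 threshold, and the `needs_review` helper are assumptions, not details of any deployed moderation pipeline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def needs_review(image_embedding, violation_centroids, threshold=0.85):
    """Return (flagged, closest_label, score) for the nearest known cluster."""
    best_label, best_score = None, -1.0
    for label, center in violation_centroids.items():
        score = cosine(image_embedding, center)
        if score > best_score:
            best_label, best_score = label, score
    return best_score >= threshold, best_label, best_score

# Hypothetical clusters built from embeddings of previously labeled content.
centroids = {
    "cluster_a": centroid([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]),
    "cluster_b": centroid([[0.0, 0.2, 0.9], [0.1, 0.1, 1.0]]),
}

flagged, label, score = needs_review([0.95, 0.05, 0.05], centroids)
clean_flagged, _, _ = needs_review([0.1, 1.0, 0.1], centroids)
```

The threshold trades reviewer workload against missed violations: a lower value routes more images to human review, a higher value routes fewer.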