Glossary term
Multimodal AI
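A generative model that learns to reverse a gradual noising process. During training, data is corrupted with small amounts of noise over many steps, and the model learns to undo each step. At generation time it starts from pure noise and denoises step by step to produce high-quality images or other data.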
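The noising and denoising idea in the definition can be sketched in a few lines of Python. This is a minimal illustration, assuming the DDPM formulation (Ho et al., 2020) with a linear noise schedule and toy one-dimensional Gaussian data. The names q_sample, ideal_eps, p_sample_loop, DATA_MEAN and DATA_STD, and the schedule values, are hypothetical choices for this sketch. Because the toy data is Gaussian, the best possible noise predictor has a closed form, and it stands in for the trained neural network a real image model would use.

```python
import math
import random

# Hypothetical toy setup: 1-D data drawn from N(DATA_MEAN, DATA_STD**2).
# For Gaussian data the ideal noise predictor has a closed form, so no
# neural network is needed to demonstrate the reverse (denoising) process.
DATA_MEAN, DATA_STD = 2.0, 0.5
T = 1000  # number of diffusion steps (assumed; a common DDPM choice)

# Linear noise schedule beta_1..beta_T (assumed values, as in DDPM).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []  # cumulative products: how much signal survives up to step t
_prod = 1.0
for _a in alphas:
    _prod *= _a
    alpha_bars.append(_prod)


def q_sample(x0, t, eps):
    """Forward process: jump straight to step t by mixing data with noise."""
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps


def ideal_eps(x_t, t):
    """E[eps | x_t] for Gaussian data; stands in for a trained network."""
    ab = alpha_bars[t]
    a, b = math.sqrt(ab), math.sqrt(1.0 - ab)
    return b * (x_t - a * DATA_MEAN) / (ab * DATA_STD**2 + 1.0 - ab)


def p_sample_loop(rng):
    """Reverse process: start from pure noise and denoise step by step."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(T)):
        eps_hat = ideal_eps(x, t)
        # Remove the predicted noise contribution for this step.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
        # Add fresh noise on every step except the last (sigma_t^2 = beta_t).
        noise = math.sqrt(betas[t]) * rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + noise
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [p_sample_loop(rng) for _ in range(500)]
    m = sum(samples) / len(samples)
    s = math.sqrt(sum((v - m) ** 2 for v in samples) / len(samples))
    print(f"sample mean {m:.2f}, sample std {s:.2f}")
```

The samples should land near a mean of 2.0 and a standard deviation of 0.5, meaning the reverse process recovers the data distribution starting from pure noise. In a real model, ideal_eps would be replaced by a network trained to predict the added noise, and x would be an image tensor rather than a single number.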
DALL-E 3 (OpenAI) uses a diffusion model conditioned on text embeddings of the prompt to generate photorealistic images. It is integrated into ChatGPT Plus and Bing Image Creator and generates 4 million images per day.
Stability AI's Stable Diffusion (2022) brought open-source diffusion models to consumer hardware. It generates a 512x512 image in about 5 seconds on a single RTX 3090 and has spawned an ecosystem of more than 100,000 community models.
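Google's Imagen Video uses a cascade of diffusion models, conditioned on text and, in its upsampling stages, on lower-resolution frames, to generate temporally consistent video. It is used in Google's creative tools for automated storyboard generation. (Google's VideoPoet, by contrast, is a language-model-based video generator rather than a diffusion model.)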