Glossary term
Multimodal AI
A model that can accept, and in some cases generate, more than one modality (text, image, audio, video) within a single system.
GPT-4o (OpenAI) accepts and generates text, images, and audio in a unified model architecture - enabling a single API endpoint to handle document Q&A, image captioning, and voice conversation tasks.
Gemini 1.5 Pro processes text, images, video, audio, and code in a single context window - for example, a media company could use it to analyse a 1-hour video alongside its subtitle file and audio transcript in one API call.
Meta's ImageBind (2023) aligns 6 modalities (text, image, audio, depth, thermal, IMU) in a shared embedding space - enabling cross-modal retrieval where an image query returns semantically matching audio clips.
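Cross-modal retrieval in a shared embedding space can be sketched as a nearest-neighbour search by cosine similarity. The example below is a minimal illustration, not ImageBind's actual API: the vectors are hand-written stand-ins for encoder outputs, and the names `cosine_similarity`, `retrieve`, `image_query`, and `audio_index` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, top_k=2):
    """Return the top_k (name, score) pairs from index, best match first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Assumption: toy 3-dimensional embeddings written by hand. A real system
# would obtain these from modality-specific encoders that were trained to
# map into the same space.
image_query = [0.9, 0.1, 0.0]  # stands in for the embedding of a dog photo

audio_index = {
    "dog_bark.wav": [0.8, 0.2, 0.1],
    "rain.wav": [0.0, 0.1, 0.9],
    "engine.wav": [0.1, 0.9, 0.2],
}

results = retrieve(image_query, audio_index, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because every modality lands in the same vector space, the retrieval step needs no modality-specific logic: the image query is compared against audio embeddings exactly as it would be against other images.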