Glossary term
Foundations
A model that processes or generates more than one data modality (for example text, image, audio, or video).
A model whose inputs, outputs, or both include more than one modality. For example, consider a model that takes both an image and a text caption (two modalities) as features and outputs a score indicating how appropriate the caption is for the image. This model's inputs are multimodal and its output is unimodal.
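The image-and-caption example above can be sketched in a few lines of Python. This is a toy illustration only, not a trained model: the encoders `embed_text` and `embed_image`, the embedding size `DIM`, and the use of cosine similarity as the scoring step are all assumptions made for the sketch. A real system would use learned encoders and a learned scoring head. The point is the shape of the interface: two modalities go in, one number comes out.

```python
import hashlib
import math

DIM = 8  # hypothetical size of the shared embedding space


def embed_text(caption):
    """Toy text encoder: hashed bag-of-words counts in DIM buckets."""
    vec = [0.0] * DIM
    for token in caption.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    return vec


def embed_image(pixels):
    """Toy image encoder: mean intensity of DIM consecutive pixel chunks."""
    flat = [p for row in pixels for p in row]
    vec = []
    for i in range(DIM):
        chunk = flat[i * len(flat) // DIM:(i + 1) * len(flat) // DIM]
        vec.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return vec


def caption_score(pixels, caption):
    """Multimodal input (image + text) -> unimodal output (one number)."""
    img, txt = embed_image(pixels), embed_text(caption)
    norm = math.sqrt(sum(a * a for a in img)) * math.sqrt(sum(b * b for b in txt))
    if norm == 0.0:
        return 0.0  # empty caption or all-black image: nothing to compare
    # Cosine similarity stands in for a learned appropriateness score.
    return sum(a * b for a, b in zip(img, txt)) / norm


image = [[0.1, 0.9, 0.4, 0.7], [0.3, 0.8, 0.2, 0.6]]  # made-up 2x4 grayscale image
score = caption_score(image, "a cat sitting on a sofa")
print(f"appropriateness score: {score:.3f}")
```

Because the toy encoders are not trained, the printed score carries no real meaning. Swapping in learned image and text encoders would leave the interface of `caption_score` unchanged.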
GPT-4o processes images, audio, and text together. For example, a radiologist-assist tool at Mayo Clinic uses it to analyse chest X-rays alongside patient history text and return structured diagnostic observations.
Google Gemini 1.5 Pro processes video frames, audio transcripts, and text in a single context. For example, a media company uses it to automatically generate time-coded content summaries from one-hour documentary footage.
Anthropic Claude 3.5 Sonnet is used by Figma to analyse UI mockup images and generate React code, processing both the visual layout and developer comments in a single multimodal request.
Created for this library
A retail e-commerce team uses a single multimodal ranker that scores products from text, image, and tabular inputs.
A medical AI team uses a multimodal model that combines imaging and lab data for risk stratification of inpatients.
An insurance underwriting team uses a multimodal model that handles claim photos and text descriptions together for first-pass evaluation.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License