Multimodal Model

A model that processes or generates more than one data modality (such as text, image, audio, or video).

A model whose inputs, outputs, or both include more than one modality. For example, consider a model that takes both an image and a text caption (two modalities) as features and outputs a score indicating how appropriate the caption is for the image. This model's inputs are multimodal, while its output is unimodal.
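The image-and-caption example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two feature vectors stand in for embeddings that an image encoder and a text encoder would produce in a shared space, and the names CaptionedImage and caption_score are hypothetical, not part of any real library. The score here is plain cosine similarity rescaled to the range 0 to 1, which is one simple choice among many.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CaptionedImage:
    """One multimodal input: two modalities, each already encoded as a vector."""
    image_features: Sequence[float]  # assumed output of some image encoder
    text_features: Sequence[float]   # assumed output of some text encoder


def caption_score(example: CaptionedImage) -> float:
    """Return a single (unimodal) score in [0, 1] for how well the caption fits.

    Assumption: both encoders map into the same vector space, so cosine
    similarity between the two vectors is meaningful.
    """
    a, b = example.image_features, example.text_features
    if len(a) != len(b):
        raise ValueError("image and text features must have the same length")

    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("feature vectors must be non-zero")

    cosine = dot / norm                  # in [-1, 1]
    score = (cosine + 1.0) / 2.0         # rescale to [0, 1]
    return min(1.0, max(0.0, score))     # guard against floating-point drift


if __name__ == "__main__":
    good = CaptionedImage(image_features=[1.0, 0.0], text_features=[1.0, 0.0])
    unrelated = CaptionedImage(image_features=[1.0, 0.0], text_features=[0.0, 1.0])
    print(caption_score(good), caption_score(unrelated))
```

The function accepts two modalities and returns one number, which matches the description above: multimodal input, unimodal output.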

Examples

1. GPT-4o processes images, audio, and text simultaneously. A radiologist-assist tool at Mayo Clinic uses it to analyse chest X-rays alongside patient history text and return structured diagnostic observations.
2. Google Gemini 1.5 Pro processes video frames, audio transcripts, and text in a single context. A media company uses it to automatically generate time-coded content summaries from 1-hour documentary footage.
3. Anthropic Claude 3.5 Sonnet is used by Figma to analyse UI mockup images and generate React code, processing both the visual layout and developer comments in a single multimodal request.

Real-world uses

Created for this library

1. A retail e-commerce team uses a multimodal model that ranks products from text, image, and tabular inputs in a single ranker.
2. A medical AI team uses a multimodal model that combines imaging and lab data for risk stratification of inpatients.
3. An insurance underwriting team uses a multimodal model that handles claim photos and text descriptions together for first-pass evaluation.
