Glossary term
Multimodal AI
The capability of an AI system to understand and reason about the spatial relationships, positions, orientations, and physical arrangements of objects.
SpatialVLM (Google, 2024) trains a VLM on about 2 billion synthetic spatial Q&A pairs generated from roughly 10 million real-world images and supports spatial chain-of-thought reasoning, enabling robots to answer 'Is the cup to the left or right of the plate?' from a camera image.
GPT-4V can be used in architectural design review workflows to reason about floor-plan images: identifying spatial relationships between rooms, checking layouts against building-code requirements, and estimating circulation paths.
Gemini 1.5 Pro demonstrates spatial reasoning over video by tracking object positions across frames, for example in warehouse robotics to plan pick-and-place sequences from a single overhead camera feed.
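To make the cup-and-plate question above concrete, here is a minimal sketch of how a left/right relation could be derived from 2D bounding boxes and turned into a synthetic Q&A pair. It is an illustration only, not the pipeline of any system named above: the `Box` class, the `horizontal_relation` and `make_qa_pair` helpers, the pixel coordinates, and the 5-pixel tolerance are all hypothetical, and "left" and "right" are taken from the camera's point of view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in image pixels (hypothetical detector output)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2.0


def horizontal_relation(a: Box, b: Box, tolerance: float = 5.0) -> str:
    """Describe where `a` sits relative to `b` along the image x-axis.

    `tolerance` (pixels) is an assumed threshold below which the two
    centers are treated as horizontally aligned.
    """
    dx = a.center_x - b.center_x
    if abs(dx) <= tolerance:
        return "horizontally aligned with"
    return "to the left of" if dx < 0 else "to the right of"


def make_qa_pair(a: Box, b: Box) -> tuple[str, str]:
    """Build one synthetic spatial question and its answer from two boxes."""
    question = f"Is the {a.label} to the left or right of the {b.label}?"
    answer = f"The {a.label} is {horizontal_relation(a, b)} the {b.label}."
    return question, answer


# Made-up coordinates for illustration only.
cup = Box("cup", x_min=40, y_min=120, x_max=90, y_max=180)
plate = Box("plate", x_min=150, y_min=110, x_max=260, y_max=200)

question, answer = make_qa_pair(cup, plate)
print(question)
print(answer)
```

Comparing box centers in the image plane ignores depth and camera pose, so a sketch like this only captures the viewer-relative relation; reasoning about metric distances or object-relative directions needs depth or 3D information.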