Vision Language Model (VLM)

A multimodal model that jointly processes visual inputs (images, video) and text, enabling image-grounded language tasks.

1.
GPT-4V (OpenAI) is used by Be My Eyes to describe images to visually impaired users - answering questions about product labels, navigation, and documents in real time via a mobile app.
2.
Claude 3.5 Sonnet is used by Anthropic's computer use demo to interpret screenshots of desktop UIs and determine which elements to click, drag, or type into - grounding UI automation in visual understanding.
3.
Google Gemini 1.5 Pro is used by manufacturers for visual quality inspection - the VLM is shown a reference image alongside a production image and asked to identify defects, deviations, or missing components.

Vision Language ModelVLM