Glossary term
Multimodal AI
Visual question answering (VQA): the task of answering natural-language questions about an image or video.
Google Lens uses VQA capabilities to answer questions about photographed objects, plants, animals, and text, processing a reported 8 billion visual searches per month across Android and iOS.
InstructBLIP (Salesforce) reported state-of-the-art VQA results at release by instruction-tuning a Q-Former that connects a frozen image encoder to a frozen language model; researchers use it to benchmark multimodal understanding on datasets such as ScienceQA and TextVQA.
PaLI-X (Google, 55B parameters) is used for medical imaging Q&A: answering questions about radiological images, pathology slides, and retinal scans to support clinical decision-making workflows.