Glossary term
Multimodal AI
Image captioning: the task of generating natural-language descriptions of image content.
Microsoft Azure AI Vision's dense captioning feature generates a separate caption for each of several regions of an image, in addition to a caption for the image as a whole. A typical use is auto-generating alt text for large archives of news photos, which improves web accessibility.
BLIP-2 (Salesforce) generates captions by connecting a frozen image encoder to a frozen LLM through a lightweight Querying Transformer (Q-Former), so only the small bridging module is trained. One application is auto-generating e-commerce product descriptions from photos.
GIT (Generative Image-to-text Transformer, Microsoft) reported state-of-the-art results on the nocaps benchmark when it was published in 2022, after pre-training on 0.8B image-text pairs. Captioning models of this kind are used in accessibility tools to describe images for screen-reader users.
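The alt-text use case mentioned above can be illustrated with a small post-processing step. The sketch below assumes a captioning service has already returned a list of region captions, each with a text, a confidence score, and a bounding box. The dictionary shape, the function name `build_alt_text`, and the thresholds are hypothetical choices for illustration, not the response format of any particular API.

```python
from typing import Dict, List


def build_alt_text(
    regions: List[Dict],
    min_confidence: float = 0.5,  # assumed cutoff, tune per application
    max_regions: int = 3,         # assumed cap to keep alt text short
) -> str:
    """Combine region-level captions into a single alt-text string.

    Each region is assumed to look like:
    {"text": str, "confidence": float, "box": {"x", "y", "w", "h"}}
    """
    # Drop captions the model is unsure about.
    kept = [r for r in regions if r["confidence"] >= min_confidence]

    # Put the largest region first; it usually describes the overall scene.
    kept.sort(key=lambda r: r["box"]["w"] * r["box"]["h"], reverse=True)

    seen = set()
    parts: List[str] = []
    for region in kept:
        if len(parts) >= max_regions:
            break
        text = region["text"].strip().rstrip(".")
        key = text.lower()
        # Skip empty captions and case-insensitive duplicates.
        if text and key not in seen:
            seen.add(key)
            parts.append(text)

    if not parts:
        return ""

    # Capitalize the first caption and end with a period.
    parts[0] = parts[0][0].upper() + parts[0][1:]
    return "; ".join(parts) + "."
```

With one whole-image caption and several overlapping region captions, the function keeps the confident ones, removes duplicates, and returns a short string suitable for an `alt` attribute. An empty string signals that no caption was reliable enough, so a human-written description is needed.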