Glossary term
Multimodal AI
The capability of a vision-language model (VLM) to locate specific regions or objects in an image from a natural-language description.
Grounding DINO (IDEA Research) performs zero-shot object detection guided by text queries. It is used in robotics pipelines to locate objects from natural-language descriptions such as 'the red coffee mug on the left side of the table'.
Florence-2 (Microsoft) supports visual grounding tasks, including referring expression comprehension and region-level captioning. It is used in Azure AI Vision for document layout analysis and UI element detection.
OWLv2 (Google) is a zero-shot object detector built on vision-language contrastive training. It is used in robotics manipulation pipelines to locate novel objects from language descriptions without retraining.
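As a rough illustration of what a downstream pipeline does with a grounding result, the sketch below post-processes detector output for a single text query. It assumes the detector returns (phrase, score, box) triples with boxes in normalized (cx, cy, w, h) form, as DETR-style detectors commonly do. The names `Detection`, `cxcywh_to_xyxy`, and `ground`, the 0.3 threshold, and the sample predictions are all hypothetical and do not come from any of the libraries named above.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    label: str     # text phrase the box was matched to
    score: float   # detector confidence in [0, 1]
    box_xyxy: Box  # pixel coordinates (x_min, y_min, x_max, y_max)


def cxcywh_to_xyxy(box: Box, image_width: int, image_height: int) -> Box:
    """Convert a normalized (cx, cy, w, h) box to pixel corner coordinates."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    )


def ground(
    raw_predictions: List[Tuple[str, float, Box]],
    image_width: int,
    image_height: int,
    score_threshold: float = 0.3,  # assumed cutoff; tune per model and task
) -> List[Detection]:
    """Keep predictions at or above the threshold, highest score first."""
    detections = [
        Detection(label, score, cxcywh_to_xyxy(box, image_width, image_height))
        for label, score, box in raw_predictions
        if score >= score_threshold
    ]
    return sorted(detections, key=lambda d: d.score, reverse=True)


# Hypothetical raw detector output for the query "red coffee mug".
raw = [
    ("red coffee mug", 0.12, (0.75, 0.5, 0.25, 0.25)),
    ("red coffee mug", 0.82, (0.25, 0.5, 0.125, 0.25)),
]

results = ground(raw, image_width=640, image_height=480)
best = results[0]
print(best.label, best.box_xyxy)  # red coffee mug (120.0, 180.0, 200.0, 300.0)
```

A robotics pipeline would typically pass the highest-scoring box on to a grasp planner or tracker. The threshold trades missed objects against false matches and generally needs tuning for each model.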