Visual Grounding

The capability of a vision-language model (VLM) to locate specific regions or objects in an image from a natural language description.

1. Grounding DINO (IDEA Research) achieves zero-shot object detection guided by text queries - used in robotics pipelines to locate objects by natural language description ('the red coffee mug on the left side of the table').
2. Florence-2 (Microsoft) supports visual grounding tasks including referring expression comprehension and region-level captioning - used in Azure AI Vision for document layout analysis and UI element detection.
3. OWLv2 (Google) is a zero-shot object detector using vision-language contrastive training - used in robotics manipulation pipelines to locate novel objects from language descriptions without retraining.
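
Text-guided detectors of this kind typically return candidate boxes, each with a confidence score for the text query. The sketch below shows one plausible way a downstream pipeline might pick the grounded region from such output. It is a minimal illustration, not the API of any model listed above: the Detection structure, the best_match and box_center helpers, the 0.3 threshold, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Detection:
    """One candidate region scored against a text query (hypothetical structure)."""
    query: str                               # text query the box was scored against
    box: tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
    score: float                             # text-region match confidence in [0, 1]


def best_match(detections: list[Detection], query: str,
               threshold: float = 0.3) -> Optional[Detection]:
    """Return the highest-scoring detection for `query`, or None if none reaches `threshold`."""
    candidates = [d for d in detections if d.query == query and d.score >= threshold]
    return max(candidates, key=lambda d: d.score, default=None)


def box_center(box: tuple[float, float, float, float]) -> tuple[float, float]:
    """Return the (x, y) center of a box, e.g. as a target point for a robot gripper."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


# Made-up detections for illustration only; not output from any real model.
detections = [
    Detection("red coffee mug", (40.0, 120.0, 110.0, 200.0), 0.82),
    Detection("red coffee mug", (300.0, 130.0, 360.0, 205.0), 0.41),
    Detection("laptop", (150.0, 60.0, 420.0, 260.0), 0.18),
]

mug = best_match(detections, "red coffee mug")   # keeps the 0.82 candidate
laptop = best_match(detections, "laptop")        # None: 0.18 is below the threshold
```

Returning None when no candidate reaches the threshold lets the caller tell "object not found" apart from a low-confidence guess, which matters when the result drives a physical action.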
