Glossary term
Multimodal AI
Open-source vision-language model (VLM) series from Shanghai AI Lab that reaches GPT-4V-class multimodal performance, with models ranging from 1B to 108B parameters.
InternVL2 (2024) achieves performance comparable to GPT-4V on MMMU, a multimodal understanding benchmark, while being fully open-source. The 108B variant scores 55.2% on MMMU, close to GPT-4V's 56.8%, and is used by researchers who need open-weight frontier VLMs.
InternVL2-8B outperforms GPT-4V and Gemini 1.0 Pro on document understanding and chart comprehension benchmarks, which makes it a strong open-source choice for enterprise document-analysis pipelines that cannot use proprietary APIs.
InternVL's dynamic-resolution strategy divides a high-resolution image into a variable number of fixed-size tiles (448×448 pixels), choosing a tile grid that matches the image's aspect ratio. This enables accurate OCR and document understanding on images up to 4K resolution without the memory overhead of a single large fixed-size patch grid.
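The tiling idea can be pictured as a grid-selection step followed by cropping. The sketch below is illustrative only and is not InternVL's actual preprocessing code. It assumes a 448-pixel tile side and a tile budget of 12, and the helper names `choose_grid` and `tile_boxes` are hypothetical. It picks the grid whose aspect ratio is closest to the image's, then lists the crop boxes that would apply once the image is resized to fit that grid.

```python
from typing import List, Tuple

TILE = 448  # assumed tile side in pixels


def choose_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Return (cols, rows) with cols * rows <= max_tiles that best fits the image."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    # Primary key: closest aspect ratio.
    # Tie-break: grid area closest to the image area, so small images
    # are not upsampled into many tiles.
    return min(
        candidates,
        key=lambda g: (
            abs(target - g[0] / g[1]),
            abs(g[0] * g[1] * TILE * TILE - width * height),
        ),
    )


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom), row by row, for an image
    already resized to (cols * TILE) x (rows * TILE)."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 2160, max_tiles=40)  # a 4K frame
    print(cols, rows, len(tile_boxes(cols, rows)))
```

Because every tile has the same size, the vision encoder always sees inputs of one shape, and only the number of tiles grows with image resolution.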