Glossary term
Multimodal AI
Open-source large multimodal model connecting a CLIP visual encoder to a language model via a projection layer, trained with GPT-4-generated visual instruction data.
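The connection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LLaVA's actual implementation: it assumes the visual encoder has already produced one feature vector per image patch, and that a single linear projection maps each vector to the language model's embedding width before the visual tokens are placed in front of the text embeddings. All names and the toy dimensions are hypothetical.

```python
import random

def linear_project(patch_features, weight, bias):
    """Map each visual feature vector to the language model's embedding width.

    weight has one row per output dimension; each row has one entry per
    input (visual) dimension.
    """
    projected = []
    for feat in patch_features:
        row = [
            sum(f * w for f, w in zip(feat, w_row)) + b
            for w_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected

def build_input_sequence(visual_tokens, text_embeddings):
    """Place projected visual tokens ahead of the text token embeddings."""
    return visual_tokens + text_embeddings

# Toy sizes chosen for readability (hypothetical, far smaller than real models).
D_VISION, D_MODEL = 4, 6
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 2

rng = random.Random(0)
# Stand-ins for visual encoder outputs and text embeddings (random values).
patch_features = [[rng.uniform(-1, 1) for _ in range(D_VISION)]
                  for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(D_MODEL)]
                   for _ in range(NUM_TEXT_TOKENS)]
# Projection parameters; these would be learned during training.
weight = [[rng.uniform(-1, 1) for _ in range(D_VISION)]
          for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

visual_tokens = linear_project(patch_features, weight, bias)
sequence = build_input_sequence(visual_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))  # prints "5 6"
```

After projection the language model sees image patches and text tokens as vectors of the same width, so it can attend over both in one sequence.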
LLaVA (NeurIPS 2023 Oral, Haotian Liu et al.) achieves an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset, and 92.53% accuracy on ScienceQA when fine-tuned and combined with GPT-4. These results demonstrate that visual instruction tuning with GPT-4-generated data produces strong multimodal capabilities.
LLaVA-1.5 replaces the original linear projection with a simple MLP connector and adds academic VQA data, achieving state-of-the-art results across 11 benchmarks. It trains in approximately 1 day on a single 8xA100 node using only about 1.2M publicly available data samples.
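As a rough illustration of what an MLP connector computes, the sketch below applies two linear layers with a GELU nonlinearity between them to a single visual feature vector. The layer sizes, weights, and function names are hypothetical and chosen only for readability; treat this as a sketch of the general technique rather than the released LLaVA-1.5 code.

```python
import math

def gelu(x):
    """Exact GELU activation: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weight, bias):
    """Dense layer: weight has one row per output dimension."""
    return [sum(v * w for v, w in zip(vec, row)) + b
            for row, b in zip(weight, bias)]

def mlp_connector(vec, w1, b1, w2, b2):
    """Two-layer MLP: linear -> GELU -> linear."""
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

# Hypothetical toy parameters: 2-dim visual feature -> 3-dim hidden -> 2-dim output.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b2 = [0.1, 0.2]

visual_feature = [0.5, -0.25]
language_token = mlp_connector(visual_feature, w1, b1, w2, b2)
print(len(language_token))  # prints "2"
```

Compared with a single linear map, the nonlinearity lets the connector learn a more expressive mapping from visual features to the language model's embedding space while adding few parameters.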
LLaVA-NeXT-110B (2024) approaches GPT-4V performance on selected benchmarks. The research community has widely adopted the LLaVA family as a standard open-source vision-language model (VLM) baseline; the project has 30,000+ GitHub stars and is integrated into llama.cpp and Ollama.