Glossary term
Multimodal AI
Vision-language model (VLM) from DeepMind that interleaves gated cross-attention layers with a frozen LLM, enabling few-shot visual question answering from interleaved image-text sequences.
Flamingo (Alayrac et al., DeepMind, 2022) achieves few-shot learning on 16 image and video benchmarks, including visual question answering and captioning, by conditioning a frozen Chinchilla LLM on visual features via cross-attention, without updating the LLM weights.
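The gating idea can be illustrated with a minimal sketch. It assumes a single attention head with no learned projections and a scalar gate parameter `alpha`; the function names are hypothetical and this is not DeepMind's implementation. The visual contribution is scaled by tanh(alpha) and added to the text hidden states as a residual, so with alpha at 0 the frozen LLM's activations pass through unchanged.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, visual_feats):
    # Simplification (assumption): single head, no learned Q/K/V projections.
    # Each text token queries the visual tokens, which act as keys and values.
    dim = len(visual_feats[0])
    attended = []
    for q in text_states:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in visual_feats
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, visual_feats)) for d in range(dim)]
        )
    return attended

def gated_cross_attention(text_states, visual_feats, alpha):
    # Residual connection with a tanh gate: y = x + tanh(alpha) * attn(x, visual).
    gate = math.tanh(alpha)
    attended = cross_attention(text_states, visual_feats)
    return [
        [x + gate * a for x, a in zip(x_row, a_row)]
        for x_row, a_row in zip(text_states, attended)
    ]

# Toy inputs: two text tokens and two visual tokens, each of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [1.0, -1.0]]

closed_gate = gated_cross_attention(text, visual, alpha=0.0)  # gate is 0
open_gate = gated_cross_attention(text, visual, alpha=1.0)    # gate is tanh(1)
```

With `alpha=0.0` the output equals the input text states, which is why such a layer can be inserted into a frozen LLM without disturbing its behavior at the start of training. The real layers also include learned projections, multiple heads, and a gated feed-forward block, all omitted here.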
Flamingo's architectural pattern of freezing the LLM and connecting vision through cross-attention layers directly inspired OpenFlamingo, IDEFICS, and other open-source VLMs, and became a widely used approach for efficient VLM training.
Flamingo demonstrates that 32-shot visual in-context learning outperforms prior zero-shot and few-shot methods on VQAv2, and that on several benchmarks a general-purpose few-shot VLM can match or exceed fine-tuned task-specific models without any weight updates.
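Few-shot visual in-context learning amounts to building one interleaved sequence of support examples followed by the query. The sketch below shows the text side of such a prompt. The `<image>` and `<EOC>` markers, the "Question: ... Answer: ..." template, and the function name are illustrative assumptions, not a documented API; a real system would also pass the corresponding images in the same order.

```python
IMAGE_TOKEN = "<image>"  # assumed placeholder marking where an image is inserted
END_OF_CHUNK = "<EOC>"   # assumed separator closing each support example

def build_few_shot_prompt(support, query_question):
    # support: list of (question, answer) pairs, one per support image.
    parts = [
        f"{IMAGE_TOKEN}Question: {question} Answer: {answer}{END_OF_CHUNK}"
        for question, answer in support
    ]
    # The query follows the same template but leaves the answer open
    # for the model to complete.
    parts.append(f"{IMAGE_TOKEN}Question: {query_question} Answer:")
    return "".join(parts)

# Hypothetical 2-shot example; a 32-shot prompt would list 32 support pairs.
support_examples = [
    ("What color is the bus?", "red"),
    ("How many dogs are there?", "two"),
]
prompt = build_few_shot_prompt(support_examples, "What is the man holding?")
```

Because the task is specified entirely in the prompt, switching to a different benchmark means changing the support examples rather than retraining the model.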