Interleaved Multimodal

Model that processes and generates text and images interleaved in a single sequence, not as separate modalities.

1.
Chameleon (Meta, 2024) tokenises both text and images as discrete tokens and trains a unified causal transformer - enabling the model to generate image and text tokens interleaved in a single coherent narrative.
2.
Claude 3's vision capabilities process interleaved text and image inputs - a user submits a recipe image followed by text followed by another ingredient image and asks a question spanning all inputs in a single context.
3.
MM-Interleaved (HuggingFace) fine-tunes LLaMA on interleaved image-text documents from scientific papers - enabling the model to reason about figures and text jointly when answering paper-content questions.

Loading…