Glossary term
Architecture
Attention where queries come from one sequence and keys/values come from another, enabling multimodal or encoder-decoder interaction.
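The definition can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any particular model's implementation: the function name `cross_attention`, the weight matrices `w_q`, `w_k`, `w_v`, and all dimensions are assumptions chosen for the example. Note that the two sequences may differ in both length and feature width; only the projected dimension has to match.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, context, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative sketch).

    x:       (n_x, d_x)     sequence that supplies the queries
    context: (n_ctx, d_ctx) other sequence that supplies keys and values
    """
    q = x @ w_q                # (n_x, d)   queries from the first sequence
    k = context @ w_k          # (n_ctx, d) keys from the second sequence
    v = context @ w_v          # (n_ctx, d) values from the second sequence
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_x, n_ctx) scaled dot products
    weights = softmax(scores, axis=-1)        # each query's distribution over context
    return weights @ v, weights

# Hypothetical sizes: 3 query tokens of width 8, 5 context tokens of width 6.
rng = np.random.default_rng(0)
d_x, d_ctx, d = 8, 6, 4
x = rng.normal(size=(3, d_x))
context = rng.normal(size=(5, d_ctx))
w_q = rng.normal(size=(d_x, d))
w_k = rng.normal(size=(d_ctx, d))
w_v = rng.normal(size=(d_ctx, d))

out, weights = cross_attention(x, context, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

Each row of `weights` sums to 1, so every query position receives a weighted average of the other sequence's values. Self-attention is the special case where `x` and `context` are the same sequence.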
In encoder-decoder translation models, the decoder uses cross-attention to attend to encoder representations of the source sentence at each decoding step - giving the model access to the full source context.
Stable Diffusion's UNet uses cross-attention to condition image generation on text embeddings - CLIP text features are projected to keys/values, and queries projected from the UNet's image latent features attend to them at multiple resolution levels.
Flamingo (DeepMind) inserts gated cross-attention layers between the layers of a frozen LLM to fuse visual features with language representations - enabling few-shot visual question answering without retraining the LLM itself.
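The gating idea can be illustrated with a small sketch. It assumes a tanh gate on a learnable scalar that starts at zero, so the new layer initially passes the frozen model's activations through unchanged. The names `attend`, `gated_cross_attention`, and `alpha` are hypothetical, and the learned query/key/value projections and feed-forward sublayer are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x, context):
    # Projection-free cross-attention: x supplies queries,
    # context supplies both keys and values (same width assumed).
    scores = x @ context.T / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ context

def gated_cross_attention(text_hidden, visual_feats, alpha):
    # Residual connection scaled by tanh(alpha). With alpha == 0 the
    # gate is closed and the output equals the input exactly.
    return text_hidden + np.tanh(alpha) * attend(text_hidden, visual_feats)

rng = np.random.default_rng(0)
text_hidden = rng.normal(size=(4, 8))    # 4 text tokens (hypothetical sizes)
visual_feats = rng.normal(size=(6, 8))   # 6 visual tokens

closed = gated_cross_attention(text_hidden, visual_feats, alpha=0.0)
opened = gated_cross_attention(text_hidden, visual_feats, alpha=1.0)
print(np.allclose(closed, text_hidden))  # True: gate closed, input unchanged
print(np.allclose(opened, text_hidden))  # False: visual information mixed in
```

Starting the gate at zero means training begins from the frozen model's original behavior and the visual pathway is blended in gradually as `alpha` moves away from zero.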