Vision Transformer (ViT)

A Transformer architecture applied to image patches instead of text tokens: the image is split into fixed-size patches, and each patch is embedded and treated as one element of a sequence.
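
The patch-to-sequence step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of any particular ViT release: the function names `patchify` and `embed_patches`, the 224x224 input size, the embedding width of 64, and the random weights are all assumptions chosen for the example. A real model learns the projection and position embeddings and passes the tokens through Transformer encoder layers afterwards.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (left to right, top to bottom).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches, weight, pos_embed):
    """Linearly project each flattened patch and add a position embedding."""
    return patches @ weight + pos_embed


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # hypothetical RGB input
patches = patchify(image, patch_size=14)  # 16 x 16 grid of 14x14 patches

embed_dim = 64                            # illustrative width, far smaller than real models
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], embed_dim))
tokens = embed_patches(patches, weight, pos_embed)

print(patches.shape, tokens.shape)
```

With 14x14 patches, a 224x224 image becomes 16 x 16 = 256 tokens, so the sequence length grows quadratically with image resolution at a fixed patch size.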

1. Google's ViT-G/14 (roughly 2B parameters; the larger ViT-e has about 4B) is used as an image encoder in PaLI and, reportedly, Gemini. The image is cut into 14x14-pixel patches whose embeddings are passed to the language model as visual tokens.
2. Apple's on-device Vision framework uses ViT-based models for real-time object recognition and scene understanding in iOS, processing camera frames at 60 fps on the Neural Engine without cloud calls.
3. SAM 2 (Meta, 2024) uses a hierarchical ViT image encoder (Hiera) with a streaming memory to segment objects in video in real time. It is used by Adobe Premiere and DaVinci Resolve for AI-assisted rotoscoping and background removal.
