Glossary term
Glossary term
Multimodal AI
Mistral AI's multimodal model family combining a natively trained vision encoder with Mistral language models for image and multi-image understanding.
Pixtral 12B (Mistral AI, September 2024) is the first open-weight VLM from Mistral, achieving competitive scores on MMMU (52.5%), MathVista (58.0%), and DocVQA (90.7%) while running on a single 24GB consumer GPU with 4-bit quantisation.
Pixtral Large 124B (November 2024) achieves 52.2% on MMMU, competitive with GPT-4V and Claude 3.5 Sonnet. It is deployed via Mistral's Le Chat product as the multimodal reasoning capability for enterprise document analysis.
Pixtral's variable image resolution capability (up to 1024x1024 per image, multiple images per prompt) is used by legal-tech companies to process multi-page contract PDFs as image sequences, extracting clause details without OCR pre-processing.