Multimodal Model
An AI model that processes multiple input types at once - text, images, audio, and video.
What is a multimodal model?
A multimodal model is a model, typically built on a large language model (LLM), that understands more than just text - it also handles other "modalities" such as images, audio, and video (including documents like PDFs). It does not treat each input in isolation but combines them: it can describe a photo, read a chart from an image, or turn a hand-drawn sketch into HTML.
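In practice, "combining modalities" means a single request can mix text and image parts. A minimal sketch of what such a request body might look like, using the OpenAI-style "content parts" message format as an assumption (other providers use a similar text-plus-image structure); no API call is made here:

```python
import base64

def build_vision_message(prompt: str, image_path: str) -> dict:
    """Build one user message that combines a text prompt with an image.

    Assumes the OpenAI-style "content parts" format; the exact field
    names differ between providers, but the shape is similar.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # Image sent inline as a base64 data URL.
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }
```

The message would then be sent as one element of the `messages` list in a normal chat-completion request.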
Examples of multimodal models
- GPT-4o (OpenAI) - text, image, audio, video
- Claude 3.5 Sonnet (Anthropic) - text and images (including PDFs and charts)
- Gemini 1.5 Pro (Google) - text, image, audio, video
Typical use cases
- Extracting data from scanned invoices and contracts (alternative to OCR + parsing)
- Describing product photos for an e-shop
- Quality control - pairing a product photo with a text description of the defect
- Transcribing and summarizing meeting audio
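For structured tasks like the invoice example above, a common pattern is to ask the model to reply in strict JSON and then validate the reply before using it. A small sketch of that validation step - the field names are a hypothetical schema chosen for illustration:

```python
import json

# Fields we ask the model to extract from the scanned invoice.
# Hypothetical schema - adjust to your documents.
INVOICE_FIELDS = {"invoice_number", "issue_date", "total_amount", "currency"}

def parse_invoice_reply(reply: str) -> dict:
    """Parse the model's JSON reply and check all expected fields are present."""
    data = json.loads(reply)
    missing = INVOICE_FIELDS - data.keys()
    if missing:
        raise ValueError(f"model reply missing fields: {sorted(missing)}")
    return data
```

Validating up front catches the two common failure modes - the model returning prose instead of JSON, or silently omitting a field - before bad data reaches your pipeline.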
What to watch out for
- Images consume significantly more tokens than text - watch the cost
- Image inputs count against the context window just like text
- Recognition quality drops on small print or low-quality scans
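The token cost of images in the first point above can be estimated ahead of time. A sketch using the tile-based counting rule OpenAI has published for GPT-4o-class models in high-detail mode - an assumption here, since other providers count image tokens differently:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for one high-detail image.

    Follows the tile-based rule OpenAI documents for GPT-4o-class
    models; treat the result as an estimate, not a billing guarantee.
    """
    # 1. Scale the image down to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2. Scale down again so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 3. Count 512 x 512 tiles: 170 tokens per tile plus an 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85
```

For comparison: a 1024 x 1024 photo comes out to roughly 765 tokens - several paragraphs' worth of text for a single image, which is why image-heavy workloads need cost monitoring.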