Multimodal Model
An AI model that processes multiple input types at once - text, images, audio, and video.
What is a multimodal model?
A multimodal model is a model, typically built on a large language model (LLM), that understands more than just text - it also handles other "modalities" such as images, audio, and video (including documents like PDFs). It does not treat each input in isolation but combines them: it can describe a photo, read a chart from an image, or turn a hand-drawn sketch into HTML.
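In practice, "combining modalities" means a single request can mix text and image parts. A minimal sketch of what such a request body might look like, using the OpenAI-style "content parts" message format as an assumption (other providers use a similar text-plus-image structure); no API call is made here:

```python
import base64

def build_vision_message(prompt: str, image_path: str) -> dict:
    """Build one user message that combines a text prompt with an image.

    Assumes the OpenAI-style "content parts" format; the exact field
    names differ between providers, but the shape is similar.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # Image sent inline as a base64 data URL.
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }
```

The message would then be sent as one element of the `messages` list in a normal chat-completion request.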
Examples of multimodal models
- GPT-4o (OpenAI) - text, image, audio, video
- Claude 3.5 Sonnet (Anthropic) - text and images (including PDFs and charts)
- Gemini 1.5 Pro (Google) - text, image, audio, video
Typical use cases
- Extracting data from scanned invoices and contracts (alternative to OCR + parsing)
- Describing product photos for an e-shop
- Quality control - pairing a product photo with a text description of the defect
- Transcribing and summarizing meeting audio
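For structured tasks like the invoice example above, a common pattern is to ask the model to reply in strict JSON and then validate the reply before using it. A small sketch of that validation step - the field names are a hypothetical schema chosen for illustration:

```python
import json

# Fields we ask the model to extract from the scanned invoice.
# Hypothetical schema - adjust to your documents.
INVOICE_FIELDS = {"invoice_number", "issue_date", "total_amount", "currency"}

def parse_invoice_reply(reply: str) -> dict:
    """Parse the model's JSON reply and check all expected fields are present."""
    data = json.loads(reply)
    missing = INVOICE_FIELDS - data.keys()
    if missing:
        raise ValueError(f"model reply missing fields: {sorted(missing)}")
    return data
```

Validating up front catches the two common failure modes - the model returning prose instead of JSON, or silently omitting a field - before bad data reaches your pipeline.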
What to watch out for
- Images consume significantly more tokens than text - watch the cost
- Image inputs count against the context window just like text
- Recognition quality drops on small print or low-quality scans
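The token cost of images in the first point above can be estimated ahead of time. A sketch using the tile-based counting rule OpenAI has published for GPT-4o-class models in high-detail mode - an assumption here, since other providers count image tokens differently:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for one high-detail image.

    Follows the tile-based rule OpenAI documents for GPT-4o-class
    models; treat the result as an estimate, not a billing guarantee.
    """
    # 1. Scale the image down to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2. Scale down again so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 3. Count 512 x 512 tiles: 170 tokens per tile plus an 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85
```

For comparison: a 1024 x 1024 photo comes out to roughly 765 tokens - several paragraphs' worth of text for a single image, which is why image-heavy workloads need cost monitoring.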