Daniel Hladik - AI Automation Engineer


Multimodal Model

An AI model that processes multiple input types at once - text, images, audio, and video.

What is a multimodal model?

A multimodal model is an AI model, usually built on an LLM, that understands more than just text - it handles other "modalities" such as images, audio, video, and PDFs. It does not treat each input in isolation but combines them: it can describe a photo, read a chart from an image, or turn a hand-drawn sketch into HTML.
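In practice, "combining modalities" means sending text and images in a single request. A minimal sketch of how such a request body looks, using the content-parts shape of the OpenAI Chat Completions API with the image embedded as a base64 data URL (field names follow that API; other providers use a similar but not identical structure):

```python
import base64

def build_multimodal_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build one chat message that combines a text prompt with an image.

    The image is embedded as a base64 data URL, so no separate image
    hosting is needed.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Example: the message would be sent as part of the `messages` list
# in a chat completion request (bytes here are a placeholder).
message = build_multimodal_message("Describe this photo.", b"<png bytes>")
```

The same message structure works for charts, scanned documents, or sketches - only the prompt changes.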

Examples of multimodal models

  • GPT-4o (OpenAI) - text, image, audio, video
  • Claude 3.5 Sonnet (Anthropic) - text and images (including PDFs and charts)
  • Gemini 1.5 Pro (Google) - text, image, audio, video, code

Typical use cases

  • Extracting data from scanned invoices and contracts (alternative to OCR + parsing)
  • Describing product photos for an e-shop
  • Quality control - product photo + text description of the defect
  • Transcribing and summarizing meeting audio

What to watch out for

  • Images consume significantly more tokens than text - watch the cost
  • The context window applies to image inputs too
  • Recognition quality drops on small print or low-quality scans
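To get a feel for the token cost of images, here is a rough estimator based on the tiling scheme OpenAI documents for GPT-4o-class vision input (scale to fit 2048x2048, then so the shorter side is at most 768 px, then bill 85 base tokens plus 170 tokens per 512x512 tile). Other providers and models use different formulas, so treat this strictly as an approximation:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token cost of one high-detail image input.

    Follows OpenAI's published tiling scheme for GPT-4o-class models;
    numbers differ for other providers and models.
    """
    # Fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Scale so the shorter side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 85 base tokens + 170 tokens per 512x512 tile.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

estimate_image_tokens(1024, 1024)  # 765 tokens - hundreds of words' worth
```

A single 1024x1024 photo already costs several hundred tokens, which is why batching many images into one request can blow through both the budget and the context window faster than expected.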