Article Issue #5201

Multimodal Model

What to know

  • A multimodal model is an AI model capable of accepting input or producing output in multiple modalities, such as text, images, audio, video, or structured data, within a single model rather than requiring separate specialized models for each modality.
  • Multimodal models typically use modality-specific encoders, such as a vision encoder for images, to convert non-text inputs into token-like representations that share the same embedding space as text tokens.
  • Multimodal capabilities unlock product features that were previously impossible or required separate specialized models chained together.

A multimodal model is an AI model capable of accepting input or producing output in multiple modalities, such as text, images, audio, video, or structured data, within a single model rather than requiring separate specialized models for each modality. Multimodal capabilities allow applications to reason over images and text together, transcribe audio to text, or generate images from textual descriptions.

How it works

Multimodal models typically use modality-specific encoders, such as a vision encoder for images, to convert non-text inputs into token-like representations that share the same embedding space as text tokens. These representations are concatenated with text token embeddings and processed by the core transformer. This enables cross-modal attention, allowing the model to reason about relationships between image regions and text passages.
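The flow described above can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not any particular model's implementation: the dimensions, the random "learned" matrices, and the helper names (`build_sequence`, `attention_weights`) are all hypothetical, and real transformers apply separate learned query, key, and value projections that are omitted here.

```python
import math
import random

random.seed(0)

D_VISION = 8   # hypothetical width of vision-encoder patch features
D_MODEL = 4    # hypothetical width of the shared text embedding space
VOCAB = 50     # hypothetical vocabulary size


def random_matrix(rows, cols):
    """Stand-in for learned parameters; random values for illustration only."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


projection = random_matrix(D_VISION, D_MODEL)        # maps patch features to model width
text_embedding_table = random_matrix(VOCAB, D_MODEL)  # one row per text token id


def build_sequence(patch_features, token_ids):
    """Project image patches into the text embedding space, then concatenate."""
    image_tokens = matmul(patch_features, projection)
    text_tokens = [text_embedding_table[i] for i in token_ids]
    return image_tokens + text_tokens


def attention_weights(sequence):
    """Simplified self-attention weights (softmax of scaled dot products)."""
    scale = math.sqrt(len(sequence[0]))
    scores = matmul(sequence, [list(col) for col in zip(*sequence)])
    weights = []
    for row in scores:
        scaled = [s / scale for s in row]
        peak = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


patches = random_matrix(3, D_VISION)   # pretend output of a vision encoder: 3 patches
token_ids = [5, 17]                    # pretend tokenized text prompt
sequence = build_sequence(patches, token_ids)
weights = attention_weights(sequence)

# Rows for the text tokens, columns for the image patches:
# how strongly each text token attends to each image region.
text_to_image = [row[:len(patches)] for row in weights[len(patches):]]
```

Because image and text positions sit in one sequence, every attention row spans both modalities, which is what lets a text token draw on image regions and vice versa.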

Key facts

  • Input modalities: Frontier models variously accept text, images, audio, video, and documents (PDFs); the supported set differs from model to model.
  • Output modalities: Most models produce text only; native image generation (for example, GPT-4o) and text-to-speech output are emerging capabilities.
  • Pricing: Image inputs are priced by image size or converted to a fixed token count; a 1080p image typically costs 800 to 1,600 input tokens, depending on the provider and its resolution settings.
  • Use cases: OCR, chart interpretation, visual QA, screenshot-to-code, and audio transcription are common builder applications.
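To make the pricing point concrete, here is a rough sketch of a tile-based token estimate. The function name `estimate_image_tokens` and every default constant are illustrative assumptions, not any provider's current published rates; providers use different schemes, so consult the relevant pricing documentation before budgeting.

```python
import math


def estimate_image_tokens(width, height, *, max_side=2048, short_side=768,
                          tile=512, base_tokens=85, tokens_per_tile=170):
    """Estimate input tokens for an image under a hypothetical tiling scheme.

    All keyword defaults are assumptions chosen for illustration.
    """
    # Step 1: shrink the image so its longest side fits within max_side.
    longest = max(width, height)
    if longest > max_side:
        scale = max_side / longest
        width, height = width * scale, height * scale

    # Step 2: shrink again so the shortest side is at most short_side.
    shortest = min(width, height)
    if shortest > short_side:
        scale = short_side / shortest
        width, height = width * scale, height * scale

    # Step 3: count the fixed-size tiles needed to cover the resized image.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base_tokens + tokens_per_tile * tiles


print(estimate_image_tokens(1920, 1080))  # 1105 under these assumed constants
```

With these assumed constants a 1080p frame is resized to roughly 1365 by 768, covered by six tiles, and lands inside the 800 to 1,600 token range quoted above.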

For builders

Multimodal capabilities unlock product features that were previously impossible or required separate specialized models chained together. Document processing pipelines that extract information from scanned PDFs, screenshots, or diagrams can now be built with a single multimodal model call. Builders should benchmark image understanding quality on their specific document types, as performance varies significantly across chart types, handwriting, and image resolution.
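One way to act on the benchmarking advice is a small harness that reports accuracy per document type. This is a minimal sketch: `Example`, `accuracy_by_doc_type`, and the `extract` callable are hypothetical names, and `extract` stands in for whatever multimodal model call your pipeline makes. Exact-match scoring is also an assumption; many tasks need a fuzzier metric.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Example(NamedTuple):
    doc_type: str    # e.g. "bar_chart", "handwriting", "low_res_scan"
    image_path: str  # path to the test image
    expected: str    # ground-truth answer for this image


def accuracy_by_doc_type(examples: List[Example],
                         extract: Callable[[str], str]) -> Dict[str, float]:
    """Return exact-match accuracy for each document type.

    `extract` is a placeholder for a model call that takes an image path
    and returns the extracted text.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        total[ex.doc_type] += 1
        predicted = extract(ex.image_path)
        # Compare case-insensitively and ignore surrounding whitespace.
        if predicted.strip().lower() == ex.expected.strip().lower():
            correct[ex.doc_type] += 1
    return {doc_type: correct[doc_type] / total[doc_type] for doc_type in total}


# Usage with a fake extractor that returns canned answers.
fake_answers = {"a.png": "42", "b.png": "wrong", "c.png": "Hello"}
examples = [
    Example("bar_chart", "a.png", "42"),
    Example("bar_chart", "b.png", "17"),
    Example("handwriting", "c.png", "hello"),
]
report = accuracy_by_doc_type(examples, lambda path: fake_answers[path])
print(report)  # {'bar_chart': 0.5, 'handwriting': 1.0}
```

Splitting results by document type surfaces the variation the paragraph above warns about, which a single aggregate score would hide.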

