Multimodal Model
A multimodal model is an AI model that can accept input or produce output in multiple modalities, such as text, images, audio, video, or structured data, within a single model rather than requiring a separate specialized model for each modality. Such a model lets an application reason across an image and text together, transcribe audio to text, or generate images from textual descriptions.
How it works
Multimodal models typically use modality-specific encoders, such as a vision encoder for images, to convert non-text inputs into token-like representations that share the same embedding space as text tokens. These representations are concatenated with text token embeddings and processed by the core transformer. This enables cross-modal attention, allowing the model to reason about relationships between image regions and text passages.
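The encode, project, and concatenate steps described above can be sketched in a few lines. This is a minimal illustration rather than any particular model's implementation: the linear projection layer, the dimensions `D_VISION` and `D_MODEL`, and the random values standing in for real encoder outputs are all assumptions made for the example.

```python
import random

def linear_project(vectors, weights):
    """Multiply each vector (length d_in) by a d_in x d_out weight matrix."""
    d_out = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]

def build_multimodal_sequence(patch_features, text_embeddings, projection):
    """Map image patch features into the text embedding width, then
    place them in front of the text token embeddings as one sequence."""
    image_tokens = linear_project(patch_features, projection)
    return image_tokens + text_embeddings

random.seed(0)
D_VISION, D_MODEL = 6, 4  # hypothetical encoder and transformer widths

# Random stand-ins for vision-encoder outputs (3 patches) and text embeddings (5 tokens).
patch_features = [[random.random() for _ in range(D_VISION)] for _ in range(3)]
text_embeddings = [[random.random() for _ in range(D_MODEL)] for _ in range(5)]
projection = [[random.random() for _ in range(D_MODEL)] for _ in range(D_VISION)]

sequence = build_multimodal_sequence(patch_features, text_embeddings, projection)
print(len(sequence), len(sequence[0]))  # 8 positions, each of width 4
```

Because every position in `sequence` has the same width, the transformer's attention layers can treat image-derived and text-derived positions uniformly, which is what makes attention across modalities possible.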
Key facts
- Input modalities: Text, images, audio, video, and documents (PDFs) are supported across different frontier models.
- Output modalities: Most models produce only text; native image generation (for example, in GPT-4o) and text-to-speech are emerging capabilities.
- Pricing: Image inputs are priced by image size or converted to a fixed token count; a 1080p image typically costs roughly 800 to 1,600 input tokens, depending on the provider and its image-detail settings.
- Use cases: OCR, chart interpretation, visual QA, screenshot-to-code, and audio transcription are common builder applications.
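To make the pricing bullet concrete, the sketch below estimates an image's input-token cost under a tile-based scheme. The function name `estimate_image_tokens` and every default value (`short_side`, `tile`, `tokens_per_tile`, `base_tokens`) are illustrative assumptions, not any provider's published pricing; check the provider's documentation for real figures.

```python
import math

def estimate_image_tokens(width, height, short_side=768, tile=512,
                          tokens_per_tile=170, base_tokens=85):
    """Estimate input tokens for an image under a hypothetical tile-based scheme.

    The image is downscaled so its shorter side is at most `short_side`,
    then covered with square tiles; each tile costs `tokens_per_tile`,
    plus a flat `base_tokens` charge per image. All defaults are assumed.
    """
    scale = min(1.0, short_side / min(width, height))  # never upscale
    scaled_w = round(width * scale)
    scaled_h = round(height * scale)
    tiles = math.ceil(scaled_w / tile) * math.ceil(scaled_h / tile)
    return base_tokens + tokens_per_tile * tiles

print(estimate_image_tokens(1920, 1080))  # 1105 under the assumed constants
print(estimate_image_tokens(512, 512))    # 255 under the assumed constants
```

Under these assumed constants a 1080p image lands inside the 800 to 1,600 token range quoted above, and a small image costs far less, which is why downscaling inputs before sending them can reduce cost.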
For builders
Multimodal capabilities unlock product features that were previously impossible or required separate specialized models chained together. Document processing pipelines that extract information from scanned PDFs, screenshots, or diagrams can now be built with a single multimodal model call. Builders should benchmark image understanding quality on their specific document types, as performance varies significantly across chart types, handwriting, and image resolution.
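One way to act on the benchmarking advice is to score a model separately for each document type, so that weak categories are not hidden by a single overall average. The sketch below assumes a labeled example set and a `predict(image, question)` callable; `fake_predict`, the file names, and the exact-match scoring rule are hypothetical stand-ins for a real model call and a real metric.

```python
from collections import defaultdict

def accuracy_by_document_type(examples, predict):
    """Return exact-match accuracy per document type.

    examples: iterable of dicts with 'doc_type', 'image', 'question', 'expected'.
    predict:  callable (image, question) -> str, standing in for a model call.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        answer = predict(ex["image"], ex["question"])
        total[ex["doc_type"]] += 1
        # Case-insensitive exact match; real evaluations may need a looser metric.
        if answer.strip().lower() == ex["expected"].strip().lower():
            correct[ex["doc_type"]] += 1
    return {doc_type: correct[doc_type] / total[doc_type] for doc_type in total}

def fake_predict(image, question):
    # Hypothetical stub: answers bar-chart questions correctly, fails on everything else.
    return "42" if image.startswith("bar_chart") else "unknown"

examples = [
    {"doc_type": "bar_chart", "image": "bar_chart_01.png", "question": "Peak value?", "expected": "42"},
    {"doc_type": "bar_chart", "image": "bar_chart_02.png", "question": "Peak value?", "expected": "42"},
    {"doc_type": "handwriting", "image": "note_01.png", "question": "Total due?", "expected": "17"},
]

results = accuracy_by_document_type(examples, fake_predict)
print(results)  # {'bar_chart': 1.0, 'handwriting': 0.0}
```

Replacing `fake_predict` with a function that sends the image and question to the model under evaluation yields a per-type breakdown across chart types, handwriting, and resolutions.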