For most of their history, language models were text-only systems: text in, text out. Multimodal models extend this to images, audio, video, and other data types, as inputs, outputs, or both.
A model that can see an image and answer questions about it, or generate an image from a text description, enables qualitatively different applications than a text-only system.
How multimodal inputs work
The fundamental challenge of multimodal AI is that different modalities (text, images, audio) have very different representations. Text is a discrete sequence of tokens. An image is a grid of pixel values. Audio is a waveform sampled over time.
The solution used by most modern systems: convert each modality into a representation compatible with the language model's architecture.
For images: An image encoder (typically a Vision Transformer, or ViT) processes the image and produces a sequence of embedding vectors. These visual embeddings are projected into the same embedding space as text tokens and concatenated with the text tokens before being passed to the language model. The model can then "attend" to both text and visual information in its self-attention layers.
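The image path described above can be sketched in a few lines. This is a minimal illustration, not a real Vision Transformer: the "encoder" is a single random linear map per patch, and the names `PATCH`, `VISION_DIM`, `TEXT_DIM`, `encode_image`, and `project` are all hypothetical choices made for the example.

```python
import random

random.seed(0)

PATCH = 2        # patch side length in pixels (hypothetical)
VISION_DIM = 4   # width of the image encoder's output (hypothetical)
TEXT_DIM = 6     # width of the language model's embeddings (hypothetical)


def random_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def image_to_patches(image, patch):
    """Split an H x W grayscale image into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


encoder_weights = random_matrix(VISION_DIM, PATCH * PATCH)
projection_weights = random_matrix(TEXT_DIM, VISION_DIM)


def encode_image(image):
    """Stand-in encoder: one linear map per patch (a real ViT adds attention layers)."""
    return [matvec(encoder_weights, p) for p in image_to_patches(image, PATCH)]


def project(visual_embeddings):
    """Map visual embeddings into the language model's embedding width."""
    return [matvec(projection_weights, v) for v in visual_embeddings]


image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]  # 4 x 4 image
text_embeddings = random_matrix(3, TEXT_DIM)  # 3 text tokens, already embedded

visual_tokens = project(encode_image(image))
sequence = visual_tokens + text_embeddings  # one sequence for self-attention
print(len(visual_tokens), len(sequence), len(sequence[0]))  # 4 7 6
```

The point of the sketch is the shapes: after projection, each image patch is a vector of the same width as a text token embedding, so the language model can treat the combined list as one sequence.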
For audio: The approach is similar: an audio encoder converts the waveform into embeddings compatible with the language model. Whisper, OpenAI's speech recognition model, is a common component in audio processing pipelines.
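As a rough illustration of the first step an audio encoder takes, the sketch below cuts a waveform into short overlapping frames and computes one number per frame. The sample rate, frame length, and hop are assumed values, and the per-frame energy is only a placeholder: real encoders typically compute spectrogram features per frame and pass them through a neural network.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed)
FRAME_MS = 25         # frame length in milliseconds (assumed)
HOP_MS = 10           # stride between frame starts in milliseconds (assumed)


def frame_waveform(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS, hop_ms=HOP_MS):
    """Split a waveform into overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]


def frame_energy(frame):
    """Mean squared amplitude: a placeholder for real spectral features."""
    return sum(x * x for x in frame) / len(frame)


# One second of a 440 Hz sine wave as test input.
waveform = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

frames = frame_waveform(waveform)
features = [frame_energy(f) for f in frames]
print(len(frames), len(frames[0]))  # 98 400
```

Each frame becomes one position in the sequence the encoder produces, which is how a continuous waveform ends up as a list of embeddings.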
For video: Systems either treat the video as a sequence of sampled frames and encode each frame as an image, or use specialized video encoders that capture temporal relationships across frames.
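For the frame-by-frame option, a system has to decide which frames to encode, since encoding every frame is rarely affordable. One simple policy, shown here as an assumption rather than what any particular model does, is to sample a fixed number of frames spread evenly across the video. The function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the middle frame of each of num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


indices = sample_frame_indices(total_frames=300, num_samples=8)
print(indices)  # [18, 56, 93, 131, 168, 206, 243, 281]
```

Each selected frame would then go through the image encoder, and the resulting embeddings would be concatenated in time order.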
Training multimodal models
Multimodal models are typically trained in stages:
- Pre-train modality-specific encoders: Train the image encoder on image classification or image-text matching tasks.
- Align modalities: Train the projection layer that maps visual embeddings to the language model's embedding space, often using image-caption pairs.
- Joint fine-tuning: Fine-tune the combined model on multimodal tasks — visual question answering, image captioning, chart interpretation.
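The stages above differ mainly in which components receive gradient updates and what data they see. The sketch below records one plausible schedule as plain data. Which parts are frozen at each stage is an assumption made for illustration (the text above does not specify it), and the names `Stage`, `STAGES`, and `frozen_components` are hypothetical.

```python
from dataclasses import dataclass

COMPONENTS = ("image_encoder", "projection", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset  # components updated during this stage


# Assumed schedule: everything not listed as trainable is kept frozen.
STAGES = [
    Stage("pretrain_encoder", "labeled images or image-text pairs",
          frozenset({"image_encoder"})),
    Stage("align", "image-caption pairs",
          frozenset({"projection"})),
    Stage("joint_finetune", "VQA, captioning, chart interpretation",
          frozenset({"projection", "language_model"})),
]


def frozen_components(stage):
    """Components whose weights stay fixed during the given stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]


for stage in STAGES:
    print(f"{stage.name}: train {sorted(stage.trainable)}, "
          f"freeze {frozen_components(stage)}")
```

Training only the projection layer during alignment keeps that stage cheap, because the encoder and the language model are far larger than the layer that connects them.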
Large models such as GPT-4V and the vision-capable Claude 3 models are understood to be trained on large amounts of image-text data, which lets them handle many visual tasks without specialized per-task training.
What multimodal models can do
Visual question answering: Describe an image, answer questions about its contents, identify objects, read text within images.
Document understanding: Extract structured information from PDFs, forms, charts, and tables — going beyond what OCR alone can do.
Chart and graph interpretation: Understand the meaning of a bar chart or line graph and answer questions about the data it represents.
Code from UI: Describe a screenshot of a user interface, or generate code that reproduces it.
Medical imaging: Describe what is visible in medical scans (with appropriate caveats about the current limitations of AI for medical use).
Video understanding: Summarize video content, answer questions about events in a video, extract information from presentations.
Image generation
Separate from multimodal understanding models are image generation models: systems that create images from text descriptions.
The dominant approaches:
- Diffusion models (Stable Diffusion, DALL-E 2 and 3; Midjourney is widely reported to use diffusion as well): Start from random noise and iteratively denoise it, guided by a text embedding, to produce an image.
- Autoregressive models (the original DALL-E, Parti): Generate images token by token, similar to how language models generate text.
Although architecturally distinct, these models are increasingly integrated with language models: recent systems pair a language model for understanding and planning with an image generation model that produces the image.
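The denoising loop behind diffusion models can be shown with a toy example. This is a sketch under heavy assumptions: the "image" is four numbers, the noise schedule `ALPHA_BAR` is an arbitrary linear one, and `predict_noise` is an oracle that looks up the target for a prompt, where a real system would use a trained network conditioned on a text embedding. The update rule is a deterministic, DDIM-style step. All names are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical prompt-to-image lookup standing in for a trained, text-conditioned network.
TARGETS = {
    "bright": [0.9, 0.8, 0.9, 0.8],
    "dark": [0.1, 0.2, 0.1, 0.2],
}

STEPS = 50
# ALPHA_BAR[t] is the fraction of signal left at step t: 1.0 when clean, 0.001 at t = STEPS.
ALPHA_BAR = [1 - 0.999 * t / STEPS for t in range(STEPS + 1)]


def predict_noise(x_t, t, prompt):
    """Oracle noise estimate; a real model learns this from (x_t, t, text embedding)."""
    a = ALPHA_BAR[t]
    x0 = TARGETS[prompt]
    return [(x - math.sqrt(a) * c) / math.sqrt(1 - a) for x, c in zip(x_t, x0)]


def denoise_step(x_t, t, prompt):
    """One deterministic (DDIM-style) update from step t to step t - 1."""
    a_t, a_prev = ALPHA_BAR[t], ALPHA_BAR[t - 1]
    eps = predict_noise(x_t, t, prompt)
    # Estimate the clean image implied by the predicted noise.
    x0_hat = [(x - math.sqrt(1 - a_t) * e) / math.sqrt(a_t) for x, e in zip(x_t, eps)]
    # Re-noise that estimate to the slightly lower noise level of step t - 1.
    return [math.sqrt(a_prev) * c + math.sqrt(1 - a_prev) * e
            for c, e in zip(x0_hat, eps)]


def generate(prompt):
    """Start from Gaussian noise and denoise step by step."""
    x = [random.gauss(0, 1) for _ in TARGETS[prompt]]
    for t in range(STEPS, 0, -1):
        x = denoise_step(x, t, prompt)
    return x


image = generate("bright")
print([round(v, 3) for v in image])
```

Because the noise predictor here is an oracle, the loop lands on the target regardless of the starting noise. With a learned predictor the estimate is imperfect at every step, which is why many small steps are used and why different starting noise yields different images for the same prompt.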
Limitations
Despite these capabilities, multimodal models have real limitations:
Counting is hard: Models often count objects in images inaccurately.
Spatial reasoning: Understanding exact positions and orientations of objects remains challenging.
Small details: Reading small text, identifying subtle differences, or interpreting highly technical diagrams can be unreliable.
Hallucination: Models can describe features that are not present in an image, especially when context makes those features seem likely (for example, mentioning a keyboard on a desk that has none).
Video understanding at scale: Processing long videos remains expensive, and accuracy degrades as videos get longer.
Summary
Multimodal AI systems handle inputs across modalities (text, images, audio, video) by using modality-specific encoders to convert each type into embeddings compatible with a central language model. This enables visual question answering, document understanding, chart interpretation, and more. Image generation is handled by separate models, most commonly diffusion models guided by text embeddings, with autoregressive models as an alternative. Despite broad capability, multimodal models have documented limitations in counting, spatial reasoning, fine-grained visual detail, hallucination, and long-video processing.