Modern AI models process images, audio, and video alongside text. Images are split into patches, each patch embedded into a vector, and those vectors join the text token stream. The model learns to reason across modalities through contrastive pre-training (CLIP) and visual instruction tuning. The architecture is the same transformer — only the tokenizer changed.
Course: Advanced.
This lesson covers 5 concepts: Beyond Text, How Images Become Tokens, Vision-Language Alignment, Multimodal in Practice, Limitations and Failures.
Modern AI models process images, audio, and video by converting them into token sequences that join the text token stream. The transformer architecture is unchanged; only the input stage (the "tokenizer" in a broad sense) expands to handle new modalities.
Multimodal models unlock use cases impossible for text-only models: document analysis, chart interpretation, visual question answering, image generation guidance, and screen understanding.
Once everything is a token, the model can reason across modalities. A question about text, an image, and a chart can all be part of the same conversation.
GPT-4V processes an image of a whiteboard and explains the equations written on it. Claude 3.5 processes a PDF with mixed text and charts and answers questions across both modalities. Same transformer, new tokenizer.
A 224×224px image splits into 256 patches of 14×14 pixels each (16 patches per side), and each patch is embedded into a vector. A 448px image produces 1,024 patches. An 896px image produces 4,096 patches, comparable to a long text context in token count.
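The patch arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes square images and a fixed 14-pixel patch side, as in the figures quoted; the function name `patch_count` is made up for illustration and does not come from any particular library.

```python
def patch_count(image_side: int, patch_side: int = 14) -> int:
    """Number of patch tokens for a square image cut into square patches."""
    if image_side % patch_side != 0:
        raise ValueError("image side must be a multiple of the patch side")
    per_side = image_side // patch_side  # patches along one edge
    return per_side * per_side           # patches in the full grid


baseline = patch_count(224)
for side in (224, 448, 896):
    n = patch_count(side)
    # Relative cost is measured in patch tokens against the 224px baseline.
    print(f"{side}px -> {n} patches ({n // baseline}x the 224px token count)")
```

Doubling the side length quadruples the patch count, which is why the step from 224px to 896px costs 16× the tokens.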
For reading small text in documents, use high resolution (448px or more). For general scene understanding, 224px is sufficient and uses 4× fewer patch tokens than 448px. Choose resolution based on task requirements.
More patches means more detail — but also more tokens, which means longer processing time and higher cost. The model sees the image as a sequence of patch tokens, not as a picture.
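To make "a sequence of patch tokens" concrete, here is a toy sketch that cuts a small grid of pixel values into flattened patches in row-major order. It assumes a square, single-channel image stored as a list of rows; the helper `patchify` is hypothetical. In a real model each flattened patch would then go through a learned projection to become an embedding vector, a step omitted here.

```python
def patchify(image, patch_side):
    """Split a square 2D grid (list of rows) into a row-major list of flattened patches."""
    size = len(image)
    if any(len(row) != size for row in image):
        raise ValueError("image must be square")
    if size % patch_side != 0:
        raise ValueError("image side must be a multiple of the patch side")

    patches = []
    for top in range(0, size, patch_side):        # walk patch rows, top to bottom
        for left in range(0, size, patch_side):   # walk patch columns, left to right
            patch = [
                image[r][c]
                for r in range(top, top + patch_side)
                for c in range(left, left + patch_side)
            ]
            patches.append(patch)
    return patches


# A 4x4 "image" whose pixel values are just their row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
print(len(tokens), "patches of", len(tokens[0]), "values each")
```

The 2D layout is gone after this step: the model receives an ordered list of patches, and spatial position has to be reintroduced separately (typically through position information added to each patch embedding).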
Screenshot of a spreadsheet at 224px: numbers are blurry, text is unreadable by the model. At 896px: every cell readable, formulas parseable. 16× more tokens but the task becomes feasible.
CLIP (Contrastive Language-Image Pre-training) aligns image and text embeddings in the same vector space. In an illustrative example, "photo of a cat" embeds close to a cat image (similarity score 0.91), while "text about finance" embeds far from the cat image (0.06). Cosine similarity is the measure that bridges the modalities.
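Cosine similarity itself is simple to compute. The sketch below uses made-up three-dimensional vectors as stand-ins for embeddings; they are not real CLIP outputs, and real embeddings have hundreds of dimensions. It only illustrates that a matching image and caption point in a similar direction while an unrelated text does not.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed values, for illustration only).
cat_image = [0.9, 0.1, 0.2]
cat_caption = [0.8, 0.2, 0.1]
finance_text = [0.0, 0.1, 0.9]

print("image vs cat caption: ", round(cosine_similarity(cat_image, cat_caption), 2))
print("image vs finance text:", round(cosine_similarity(cat_image, finance_text), 2))
```

Because the score depends only on direction, not vector length, embeddings from two different encoders can be compared once training has placed them in a shared space.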
Without alignment, the image encoder and text encoder produce incompatible vectors. Alignment training bridges the gap — making visual question answering, image captioning, and zero-shot classification possible.
This shared embedding space is what allows multimodal models to connect images and text — images become searchable by text, and text descriptions can be visualised.
Zero-shot image classification with CLIP: no training examples needed. Embed the image + embed class names as text → find the closest text label. 76% zero-shot accuracy on ImageNet — approaching supervised baselines.
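The zero-shot recipe (embed the image, embed each class name as text, pick the closest label) can be sketched as follows. The embeddings here are hard-coded toy values standing in for the outputs of an image encoder and a text encoder; `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image, plus all scores."""
    scores = {
        label: _cosine(image_embedding, text_embedding)
        for label, text_embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Assumed toy embeddings; a real system would get these from trained encoders.
label_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.3, 0.9, 0.1],
    "a photo of a car": [0.1, 0.1, 0.9],
}
image_embedding = [0.9, 0.1, 0.2]

best, scores = zero_shot_classify(image_embedding, label_embeddings)
print("predicted label:", best)
```

No classifier is trained at any point: adding a new class only means adding one more text prompt to `label_embeddings`.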
The model received an image of a revenue chart alongside a text question and extracted specific numbers from the visual, performed arithmetic, and gave a precise answer. Text tokens and image patch tokens were processed together in the same attention layers.
This is the practical value of multimodal AI: documents, charts, screenshots, photos, and diagrams become queryable without manual text extraction. The model bridges the visual and textual information.
Multimodal models turn any document into a conversational interface. Upload a chart, a screenshot, a photo of a whiteboard, or a scanned form — and ask questions directly.
Without multimodal: extract chart data → transcribe to text → feed to LLM = manual, brittle pipeline. With multimodal: upload chart image → ask question → get answer in one step. 90% reduction in pipeline complexity.
Current multimodal models fail predictably on precise counting (off-by-one errors are common), exact spatial relationships, and small text in complex images, and they can confidently hallucinate objects that are not present. Understanding these failure modes determines when to trust multimodal outputs.
Knowing the failure modes prevents misapplication. Do not rely on multimodal models for precise object counting, exact measurements, or handwritten text recognition without verification.
Understanding failure modes is as important as understanding capabilities. Multimodal AI is a tool with specific strengths and specific weaknesses — knowing both makes you a better user.
Medical imaging: multimodal models should NOT be trusted for counting lesions, measuring tumour size, or reading handwritten clinical notes without expert verification — all are high-failure tasks.