Multi-Modal AI: Working with Images and Text
Multi-modal models accept images and text in a single request, enabling document OCR, UI screenshot analysis, and visual Q&A.
Vision API Example
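import OpenAI from 'openai';

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

// imageDataUrl is a data URL ('data:image/png;base64,...') or an https URL.
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      // Text and image parts travel together in one user message.
      content: [
        { type: 'text', text: 'What error is shown in this screenshot?' },
        { type: 'image_url', image_url: { url: imageDataUrl } },
      ],
    },
  ],
});

// The model's answer is the message content of the first choice.
console.log(response.choices[0].message.content);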
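The request above expects imageDataUrl to exist already. One way to build it from raw image bytes is sketched below, assuming a Node.js runtime where Buffer is available. The helper name toImageDataUrl and its default MIME type are illustrative choices, not part of any SDK.

```javascript
// Hypothetical helper: encode raw image bytes as a base64 data URL.
// Assumes Node.js (uses the global Buffer). The caller must pass a MIME
// type that matches the actual image format.
function toImageDataUrl(bytes, mimeType = 'image/png') {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}
```

With this in place, reading a file with fs.promises.readFile and passing the resulting buffer to toImageDataUrl would produce a value suitable for the url field.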
Use Cases
Common use cases include receipt parsing, diagram explanation, accessibility alt-text generation, and visual regression triage.
Considerations
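Images are billed as tokens, and a single high-resolution image can consume far more tokens than a typical text prompt. Resize images before upload to keep cost and latency down. Screenshots often contain PII such as names, email addresses, and account numbers, so redact or restrict them before sending or storing.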
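As a rough sketch of the resize step, the function below computes target dimensions that fit an image within a maximum side length while preserving aspect ratio. The name fitWithin and the 1024-pixel default are assumptions for illustration. Check the provider's documentation for the sizes it actually bills by, and use an image library of your choice to perform the resize itself.

```javascript
// Hypothetical helper: compute dimensions that fit within maxSide pixels
// on the longest edge, preserving aspect ratio. Never upscales.
// The 1024 default is an illustrative assumption, not a provider limit.
function fitWithin(width, height, maxSide = 1024) {
  const longest = Math.max(width, height);
  if (longest <= maxSide) {
    return { width, height }; // already small enough
  }
  const scale = maxSide / longest;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```

Computing the target size separately from the pixel work keeps the policy (how large is large enough) easy to test and change without touching the image-processing code.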
Conclusion
Multi-modal APIs blur the line between document pipelines and chat interfaces. Design uploads, storage, and retention policies before shipping user-facing features.