Vision & Screenshots
PaddleOCR text extraction, vision-language model analysis, and Playwright screenshot capture, all running locally on your infrastructure.
Vision capabilities
PaddleOCR text extraction
Fast, local OCR with bounding boxes and confidence scores. Detects rotated text and returns structured output with per-line positions, at roughly 100 ms per image.
Vision-language model analysis
Analyze images with custom prompts using qwen2.5vl. Describe UI layouts, identify bugs, extract meaning from diagrams.
Structured data extraction
Extract structured data from images using JSON schemas. Pull form data, table contents, or UI element properties into typed objects.
Screenshot capture
Playwright-powered screenshot capture with configurable viewports (desktop, laptop, tablet, mobile). Full-page captures for visual verification.
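To make the viewport presets above concrete, here is a minimal sketch of the kind of preset table a client might keep. The specific dimensions are assumptions for illustration, not the tool's built-in values:

```python
# Hypothetical viewport presets matching the named sizes above
# (desktop, laptop, tablet, mobile). Dimensions are (width, height)
# in CSS pixels and are illustrative, not the tool's actual defaults.
VIEWPORTS = {
    "desktop": (1920, 1080),
    "laptop": (1366, 768),
    "tablet": (768, 1024),
    "mobile": (390, 844),
}

width, height = VIEWPORTS["mobile"]
```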
API endpoints
POST /vision/analyze
General image analysis with a custom prompt. Accepts base64 data, a file path, a URL, or a screenshot ID as the image source.
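A minimal sketch of building a request body for this endpoint with a base64 image source. The field names (`image`, `prompt`) are assumptions; check your deployment's API reference for the exact request shape:

```python
import base64
import json

def build_analyze_request(image_bytes: bytes, prompt: str) -> str:
    """Serialize a hypothetical POST /vision/analyze body.

    Encodes the raw image bytes as base64 and pairs them with a
    free-form analysis prompt. Field names are illustrative.
    """
    payload = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "prompt": prompt,
    }
    return json.dumps(payload)

body = build_analyze_request(b"<png bytes>", "Describe the UI layout.")
```

The same payload would be POSTed to the endpoint with any HTTP client.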
POST /vision/ocr
LLM-enhanced OCR that uses an optimized prompt and zero temperature for precise text extraction.
POST /vision/extract
Structured data extraction using a JSON schema. Returns typed objects matching your schema definition.
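For example, a schema for pulling login-form details out of a screenshot might look like the sketch below. The schema itself follows standard JSON Schema; the surrounding request shape (`image`, `schema` keys) is an assumption:

```python
import json

# Hypothetical JSON Schema describing the typed object we want back:
# a form's username label, whether it has a "remember me" checkbox,
# and how many input fields it contains.
schema = {
    "type": "object",
    "properties": {
        "username_label": {"type": "string"},
        "has_remember_me": {"type": "boolean"},
        "field_count": {"type": "integer"},
    },
    "required": ["username_label", "field_count"],
}

# Illustrative request body; field names are assumptions.
request_body = json.dumps({"image": "<base64 or screenshot ID>", "schema": schema})
```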
POST /vision/paddle-ocr
Direct PaddleOCR without the LLM. Fast (~100 ms); returns bounding boxes and confidence scores for each text line.
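Since each line comes back with a confidence score, a common consumer-side step is filtering out low-confidence detections. The response shape sketched here (a list of objects with `text`, `confidence`, and `box`) is an assumption based on the description above:

```python
def high_confidence_lines(ocr_lines, min_conf=0.9):
    """Keep only the text of lines at or above a confidence threshold.

    Assumes each line is a dict with "text", "confidence", and "box"
    keys, per the hypothetical response shape described above.
    """
    return [line["text"] for line in ocr_lines if line["confidence"] >= min_conf]

sample = [
    {"text": "Submit", "confidence": 0.98,
     "box": [[10, 10], [80, 10], [80, 30], [10, 30]]},
    {"text": "???", "confidence": 0.42,
     "box": [[5, 50], [40, 50], [40, 70], [5, 70]]},
]

high_confidence_lines(sample)  # -> ["Submit"]
```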
MCP tools for vision
Vision and screenshot tools are available via MCP for use in chat, CLI, and editor integrations:
iter_vision_analyze
Analyze any image with a custom prompt using the vision-language model.
iter_vision_ocr
Extract text from images using LLM-enhanced OCR.
iter_capture_screenshot
Capture a URL with viewport and delay options.
iter_vision_extract
Extract structured data from images using a JSON schema.
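As a sketch of how a client would invoke one of these tools, here is a JSON-RPC `tools/call` request in the MCP wire format for `iter_capture_screenshot`. The tool name comes from the list above; the argument names (`url`, `viewport`, `delay`) are assumptions based on its description:

```python
import json

# Hypothetical MCP tools/call request. The envelope (jsonrpc, method,
# params.name, params.arguments) follows the MCP specification; the
# argument keys are illustrative, not confirmed parameter names.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "iter_capture_screenshot",
        "arguments": {"url": "https://example.com", "viewport": "mobile", "delay": 500},
    },
}
wire = json.dumps(request)
```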