Iter

Vision & Screenshots

PaddleOCR text extraction, vision-language model analysis, and Playwright screenshot capture - all running locally on your infrastructure.

Vision analysis and screenshot capture

Vision capabilities

1. PaddleOCR text extraction

Fast, local OCR with bounding boxes and confidence scores. Detects rotated text and returns structured output with line positions. ~100ms per image.

2. Vision-language model analysis

Analyze images with custom prompts using qwen2.5vl. Describe UI layouts, identify bugs, extract meaning from diagrams.

3. Structured data extraction

Extract structured data from images using JSON schemas. Pull form data, table contents, or UI element properties into typed objects.

4. Screenshot capture

Playwright-powered screenshot capture with configurable viewports (desktop, laptop, tablet, mobile). Full-page captures for visual verification.
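The viewport presets mentioned above might be sketched as below; the exact dimensions and the Playwright calls are illustrative assumptions, not Iter's internal implementation:

```python
# Viewport presets like those described above (desktop, laptop, tablet,
# mobile). The specific dimensions here are illustrative assumptions.
VIEWPORTS = {
    "desktop": {"width": 1920, "height": 1080},
    "laptop": {"width": 1366, "height": 768},
    "tablet": {"width": 768, "height": 1024},
    "mobile": {"width": 375, "height": 812},
}

# A full-page capture with Playwright's sync API would look like this
# (requires `pip install playwright` plus browser binaries, so it is
# left commented out here):
#
# from playwright.sync_api import sync_playwright
#
# with sync_playwright() as p:
#     browser = p.chromium.launch()
#     page = browser.new_page(viewport=VIEWPORTS["mobile"])
#     page.goto("https://example.com")
#     page.screenshot(path="shot.png", full_page=True)
#     browser.close()
```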

API endpoints

POST /vision/analyze

General image analysis with a custom prompt. Supports base64 data, a file path, a URL, or a screenshot ID as the image source.
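A minimal sketch of building a request body for this endpoint, assuming a base64 image source; the field names (`image`, `prompt`) are assumptions and may differ from the actual API:

```python
import base64


def build_analyze_request(image_path: str, prompt: str) -> dict:
    """Build a hypothetical request body for POST /vision/analyze.

    Here the image is sent as base64; per the docs, a file path, URL,
    or screenshot ID could be used instead. Field names are assumed.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"image": image_b64, "prompt": prompt}
```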

POST /vision/ocr

LLM-enhanced OCR with an optimized prompt and zero temperature for precise text extraction.
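Calling the endpoint can be sketched with only the standard library; the base URL and port below are placeholders for your local deployment, not documented values:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical port; adjust for your setup


def post_json(path: str, body: dict) -> urllib.request.Request:
    """Prepare a JSON POST request to the Iter server.

    Returned Request can be sent with urllib.request.urlopen() once a
    server is actually running.
    """
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (performs a network call, so it needs a running server):
# req = post_json("/vision/ocr", {"image": image_b64})
# result = json.load(urllib.request.urlopen(req))
```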

POST /vision/extract

Structured data extraction using a JSON schema. Returns typed objects matching your schema definition.
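An illustrative schema and request body for this endpoint; the field names (`image`, `schema`) and the invoice schema itself are assumptions for the sketch, not the confirmed API shape:

```python
# A JSON schema describing the typed object you want back, e.g. pulling
# invoice fields out of a scanned document. Schema content is illustrative.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["vendor", "total"],
}


def build_extract_request(image_b64: str, schema: dict) -> dict:
    """Build a hypothetical body for POST /vision/extract."""
    return {"image": image_b64, "schema": schema}
```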

POST /vision/paddle-ocr

Direct PaddleOCR without the LLM. Fast (~100ms); returns bounding boxes and confidence scores per text line.
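A sketch of consuming the per-line output; the response field names here (`lines`, `text`, `confidence`, `box`) are assumptions based on the description above, not the documented schema:

```python
# Hypothetical response shape for POST /vision/paddle-ocr: one entry per
# detected text line, with a quadrilateral bounding box and a confidence.
sample_response = {
    "lines": [
        {
            "text": "Submit",
            "confidence": 0.98,
            "box": [[10, 10], [80, 10], [80, 30], [10, 30]],
        },
        {
            "text": "??",
            "confidence": 0.41,
            "box": [[5, 50], [25, 50], [25, 60], [5, 60]],
        },
    ]
}


def confident_text(response: dict, threshold: float = 0.8) -> list[str]:
    """Keep only the lines PaddleOCR is confident about."""
    return [
        line["text"]
        for line in response["lines"]
        if line["confidence"] >= threshold
    ]
```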

MCP tools for vision

Vision and screenshot tools are available via MCP for use in chat, CLI, and editor integrations:

iter_vision_analyze

Analyze any image with a custom prompt using the vision-language model.

iter_vision_ocr

Extract text from images using LLM-enhanced OCR.

iter_capture_screenshot

Capture a URL with viewport and delay options.

iter_vision_extract

Extract structured data from images using a JSON schema.
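On the wire, an MCP client invokes these tools via a standard JSON-RPC `tools/call` request; the envelope below follows the MCP specification, while the argument names are assumptions based on the tool descriptions above:

```python
# A JSON-RPC 2.0 request as sent by an MCP client. "method" and "params"
# follow the MCP spec; the tool's argument names are assumed for this sketch.
call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "iter_vision_extract",
        "arguments": {
            "image_path": "ui.png",
            "schema": {
                "type": "object",
                "properties": {"title": {"type": "string"}},
            },
        },
    },
}
```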

See what your AI sees