GPT Vision

LocalAI supports image understanding through vision models such as LLaVA, and implements OpenAI's GPT Vision API.

Usage

OpenAI docs: https://platform.openai.com/docs/guides/vision

First install a vision-capable model from the gallery (the examples below use moondream2-20250414, a small vision model):

local-ai run moondream2-20250414

To have LocalAI describe what it sees in an image, send the image to the /v1/chat/completions endpoint, for example with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "moondream2-20250414",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}]}],
     "temperature": 0.9}'

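The same kind of request can be sent from a script. The following is a minimal Python sketch using only the standard library, not an official client. The helper names (build_vision_payload, file_to_data_url, ask_about_image) are hypothetical, the address is the default from the curl example, and the response is assumed to follow the OpenAI chat completion shape. Passing a local file as a base64 data URL is described in the OpenAI vision guide; that LocalAI accepts it in the same way is an assumption here.

```python
import base64
import json
import mimetypes
import urllib.request

# Assumption: LocalAI listens on the default address used in the curl example.
BASE_URL = "http://localhost:8080"


def file_to_data_url(path):
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def build_vision_payload(model, prompt, image_url, temperature=0.9):
    """Build a chat completion body with one user message holding a text part and an image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        # Sampling options sit at the top level of the request body.
        "temperature": temperature,
    }


def ask_about_image(payload, base_url=BASE_URL):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response with the text in choices[0].message.content.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    image = (
        "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/"
        "Gfp-wisconsin-madison-the-nature-boardwalk.jpg/"
        "2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    )
    payload = build_vision_payload("moondream2-20250414", "What is in the image?", image)
    print(ask_about_image(payload))
```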
Grammars and function tools can also be used together with the vision API. For example, a grammar can restrict the reply to "yes" or "no":

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "moondream2-20250414", "grammar": "root ::= (\"yes\" | \"no\")",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}]}],
     "temperature": 0.9}'

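Because the grammar limits the reply to two tokens, a caller can turn the answer into a boolean. The sketch below illustrates one way to do that in Python. The function names (build_yes_no_payload, parse_yes_no) are hypothetical, the server address is the default from the curl example, and the response is assumed to follow the OpenAI chat completion shape.

```python
import json
import urllib.request

# Grammar string taken from the curl example: the reply must be "yes" or "no".
GRAMMAR = 'root ::= ("yes" | "no")'


def build_yes_no_payload(model, question, image_url):
    """Build a vision request whose reply is constrained by GRAMMAR."""
    return {
        "model": model,
        "grammar": GRAMMAR,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "temperature": 0.9,
    }


def parse_yes_no(completion):
    """Map a constrained reply to a boolean; raise ValueError on anything else."""
    # Assumption: OpenAI-style response with the text in choices[0].message.content.
    text = completion["choices"][0]["message"]["content"].strip()
    if text == "yes":
        return True
    if text == "no":
        return False
    raise ValueError(f"unexpected reply: {text!r}")


if __name__ == "__main__":
    image = (
        "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/"
        "Gfp-wisconsin-madison-the-nature-boardwalk.jpg/"
        "2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    )
    payload = build_yes_no_payload(
        "moondream2-20250414", "Is there some grass in the image?", image
    )
    request = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",  # assumed default address
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        print(parse_yes_no(json.load(response)))
```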
Setup

Install a vision-capable model from the gallery, either from the Models page in the web UI or from the CLI:

local-ai run moondream2-20250414

Other vision models are available in the gallery (for example smolvlm-instruct and smolvlm2-2.2b-instruct). Browse them on the Models page in the web UI, or see the model gallery at /models/.
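
After installing a model, it can be useful to confirm that the server knows about it before sending vision requests. The Python sketch below assumes that LocalAI exposes the OpenAI-style GET /v1/models listing on the default address and that the response has a "data" list of objects with an "id" field. The function names (fetch_models, installed_model_ids) are hypothetical.

```python
import json
import urllib.request


def fetch_models(base_url="http://localhost:8080"):
    """Fetch the model listing (assumed OpenAI-compatible GET /v1/models)."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as response:
        return json.load(response)


def installed_model_ids(models_response):
    """Extract model names from a listing shaped like {"data": [{"id": ...}, ...]}."""
    return [entry["id"] for entry in models_response.get("data", [])]


if __name__ == "__main__":
    wanted = "moondream2-20250414"
    ids = installed_model_ids(fetch_models())
    if wanted in ids:
        print(f"{wanted} is installed")
    else:
        print(f"{wanted} not found; install it with: local-ai run {wanted}")
```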