LocalAI is a free, open-source alternative to OpenAI (Anthropic, etc.), functioning as a drop-in replacement REST API for local inferencing. It allows you to run LLMs, generate images, and produce audio, all locally or on-premises with consumer-grade hardware, supporting multiple model families and architectures.
LocalAI comes with a built-in web interface for chatting with models, managing installations, configuring AI agents, and more — no extra tools needed.
Tip
Security considerations
If you are exposing LocalAI remotely, make sure you protect the API endpoints adequately. You have two options:
Simple API keys: Run with LOCALAI_API_KEY=your-key to gate access. API keys grant full admin access with no role separation.
User authentication: Run with LOCALAI_AUTH=true for multi-user support with admin/user roles, OAuth login, per-user API keys, and usage tracking. See Authentication & Authorization for details.
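For instance, a minimal sketch of the API-key option (the key value is an example; the key is passed as a Bearer token, as with the OpenAI API):

```shell
# Example API key; use a strong, secret value in practice
API_KEY=my-secret-key
# Start the server gated on this key (shown for reference):
#   LOCALAI_API_KEY=$API_KEY local-ai run
# Clients must then authenticate each request:
curl http://localhost:8080/v1/models -H "Authorization: Bearer $API_KEY"
```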
Once installed, start LocalAI. For Docker installations:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest-cpu
For GPU acceleration, choose the image that matches your hardware:
Hardware
Docker image
CPU only
localai/localai:latest-cpu
NVIDIA CUDA
localai/localai:latest-gpu-nvidia-cuda-12
AMD (ROCm)
localai/localai:latest-gpu-hipblas
Intel GPU
localai/localai:latest-gpu-intel
Vulkan
localai/localai:latest-gpu-vulkan
For NVIDIA GPUs, add --gpus all. For AMD/Intel/Vulkan, add the appropriate --device flags. See Container images for the full reference.
Using the Web Interface
Open http://localhost:8080 in your browser. The web interface lets you:
Chat with any installed model
Install models from the built-in gallery (Models page)
Generate images, audio, and more
Create and manage AI agents with MCP tool support
Monitor system resources and loaded models
Configure settings including GPU acceleration
To get started, navigate to the Models page, browse the gallery, and install a model. Once installed, head to the Chat page to start a conversation.
Downloading models from the CLI
When starting LocalAI (either via Docker or via the CLI) you can pass as arguments a list of models to install automatically before the API starts, for example:
local-ai run llama-3.2-1b-instruct:q4_k_m
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
You can also manage models with the CLI:
local-ai models list             # List available models in the gallery
local-ai models install <name>   # Install a model
Tip
Automatic Backend Detection: When you install models from the gallery or YAML files, LocalAI automatically detects your system’s GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see GPU Acceleration.
For a full list of options, you can run LocalAI with --help or refer to the Linux Installation guide for installer configuration options.
Using the API
LocalAI exposes an OpenAI-compatible API. You can use it with any OpenAI SDK or client by pointing it to http://localhost:8080. For example:
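A minimal curl sketch against the chat-completions endpoint (gpt-4 is an example alias; substitute a model you have installed):

```shell
# Build the request body, then POST it to the OpenAI-compatible endpoint
cat > /tmp/chat-request.json <<'EOF'
{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7
}
EOF
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/chat-request.json
```

Any OpenAI SDK works the same way: point its base URL at http://localhost:8080/v1 and pass the name of an installed model.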
LocalAI also supports the Anthropic Messages API, the Open Responses API, and more. See Try it out for examples of all supported endpoints.
Built-in AI Agents
LocalAI includes a built-in AI agent platform with support for the Model Context Protocol (MCP). You can create agents that use tools, browse the web, execute code, and interact with external services — all from the web interface.
To get started with agents:
Install a model that supports tool calling (most modern LLMs do)
Navigate to the Agents page in the web interface
Create a new agent, configure its tools and system prompt
Start chatting — the agent will use tools autonomously
No separate installation required — agents are part of LocalAI.
Scaling with Distributed Mode
For production deployments or when you need more compute, LocalAI supports distributed mode with horizontal scaling:
Distributed nodes: Add GPU worker nodes that self-register with a frontend coordinator
P2P federation: Connect multiple LocalAI instances for load-balanced inference
Model sharding: Split large models across multiple machines
See the Nodes page in the web interface or the Distribution docs for setup instructions.
What’s Next?
There is much more to explore! LocalAI supports video generation, voice cloning, embeddings, image understanding, and more.
This section covers everything you need to know about installing and configuring models in LocalAI. You’ll learn multiple methods to get models running.
Prerequisites
LocalAI installed and running (see Quickstart if you haven’t set it up yet)
Basic understanding of command line usage
Method 1: Using the Model Gallery (Easiest)
The Model Gallery is the simplest way to install models. It provides pre-configured models ready to use.
# List available models
local-ai models list

# Install a specific model
local-ai models install llama-3.2-1b-instruct:q4_k_m

# Start LocalAI with a model from the gallery
local-ai run llama-3.2-1b-instruct:q4_k_m
To run models available in the LocalAI gallery, you can use the model name as the URI. For example, to run LocalAI with the Hermes model, execute:
local-ai run hermes-2-theta-llama-3-8b
To install only the model, use:
local-ai models install hermes-2-theta-llama-3-8b
Note: The galleries available in LocalAI can be customized to point to a different URL or a local directory. For more information on how to setup your own gallery, see the Gallery Documentation.
Browse Online
Visit models.localai.io to browse all available models in your browser.
Method 1.5: Import Models via WebUI
The WebUI provides a powerful model import interface that supports both simple and advanced configuration:
Simple Import Mode
Open the LocalAI WebUI at http://localhost:8080
Click “Import Model”
Enter the model URI (e.g., https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF)
Optionally configure preferences:
Backend selection
Model name
Description
Quantizations
Embeddings support
Custom preferences
Click “Import Model” to start the import process
Advanced Import Mode
For full control over model configuration:
In the WebUI, click “Import Model”
Toggle to “Advanced Mode”
Edit the YAML configuration directly in the code editor
Use the “Validate” button to check your configuration
Click “Create” or “Update” to save
The advanced editor includes:
Syntax highlighting
YAML validation
Format and copy tools
Full configuration options
This is especially useful for:
Custom model configurations
Fine-tuning model parameters
Setting up complex model setups
Editing existing model configurations
Method 2: Installing from Hugging Face
LocalAI can directly install models from Hugging Face:
# Install and run a model from Hugging Face
local-ai run huggingface://TheBloke/phi-2-GGUF
The format is: huggingface://<repository>/<model-file> (the model file is optional)
Examples
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
Method 3: Installing from OCI Registries
Ollama Registry
local-ai run ollama://gemma:2b
Standard OCI Registry
local-ai run oci://localai/phi-2:latest
Run Models via URI
To run models via URI, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:
file://path/to/model (absolute path to a file within your models directory)
From OCIs: oci://container_image:tag, ollama://model_id:tag
From configuration files: https://gist.githubusercontent.com/.../phi-2.yaml
Note
When using file:// URLs, the path must point to a file within your models directory (specified by MODELS_PATH). Files outside this directory are rejected for security reasons.
Configuration files can be used to customize the model defaults and settings. For advanced configurations, refer to the Customize Models section.
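As an illustration, a minimal configuration file might look like the following (field names as used elsewhere in these docs; values are examples):

```yaml
name: phi-2
backend: llama-cpp
context_size: 2048
parameters:
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
  temperature: 0.2
```

Place the file in your models directory, or pass its URL when starting LocalAI.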
Examples
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
Method 4: Manual Installation
For full control, you can manually download and configure models.
If you are running on Apple Silicon (ARM), Docker is not recommended because it runs under emulation. Follow the build instructions to use Metal acceleration for full GPU support.
If you are running on Apple x86_64, you can use Docker; building from source offers no additional gain.
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
cp your-model.gguf models/
docker compose up -d --pull always
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Tip
Other Docker Images:
For other Docker images, please refer to the table in Getting Started.
Note: If you are on Windows, ensure the project is on the Linux filesystem to avoid slow model loading. For more information, see the Microsoft Docs.
# Via API
curl http://localhost:8080/v1/models

# Via CLI
local-ai models list
Remove Models
Simply delete the model file and configuration from your models directory:
rm models/model-name.gguf
rm models/model-name.yaml # if exists
Troubleshooting
Model Not Loading
Check backend: Ensure the required backend is installed
local-ai backends list
local-ai backends install llama-cpp # if needed
Check logs: Enable debug mode
DEBUG=true local-ai
Verify file: Ensure the model file is not corrupted
Out of Memory
Use a smaller quantization (Q4_K_S or Q2_K)
Reduce context_size in configuration
Close other applications to free RAM
Wrong Backend
Check the Compatibility Table to ensure you’re using the correct backend for your model.
Best Practices
Start small: Begin with smaller models to test your setup
Use quantized models: Q4_K_M is a good balance for most use cases
Organize models: Keep your models directory organized
Backup configurations: Save your YAML configurations
Monitor resources: Watch RAM and disk usage
Try it out
Once LocalAI is installed, you can start it (using Docker, the CLI, or the systemd service).
By default the LocalAI WebUI is accessible at http://localhost:8080. You can also use third-party projects to interact with LocalAI as you would with OpenAI (see also Integrations).
After installation, install new models by navigating the model gallery, or by using the local-ai CLI.
Tip
To install models with the WebUI, see the Models section.
With the CLI you can list the models with local-ai models list and install them with local-ai models install <model-name>.
You can also run models manually by copying files into the models directory.
You can test the API endpoints using curl; a few examples are listed below. The models referred to here (gpt-4, gpt-4-vision-preview, tts-1, whisper-1) are examples; replace them with the model names you have installed.
LocalAI supports the Open Responses API specification, including background processing, streaming, and advanced features; see the Open Responses documentation.
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"input": "Say this is a test!",
"max_output_tokens": 1024,
"temperature": 0.7
}'
For background processing:
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"input": "Generate a long story",
"max_output_tokens": 4096,
"background": true
}'
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "The quick brown fox jumped over the lazy dog.",
"voice": "alloy"
}' \
--output speech.mp3
Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms; see the OpenAI Embeddings documentation.
curl http://localhost:8080/embeddings \
-X POST -H "Content-Type: application/json" \
-d '{
"input": "Your text string goes here",
"model": "text-embedding-ada-002"
}'
Tip
Don't use the model file name as the model in the request unless you want to handle the prompt template yourself.
Use model names as you would with OpenAI, as in the examples above: for instance gpt-4-vision-preview or gpt-4.
Customizing the Model
To customize the prompt template or the default settings of the model, use a configuration file. This file must adhere to the LocalAI YAML configuration standards; for comprehensive syntax details, refer to the advanced documentation. The configuration file can live on the local filesystem or at a remote URL (such as a GitHub Gist).
LocalAI can be started from either its container image or its binary with a command that includes the URLs of model config files, or a shorthand format (like huggingface:// or github://) that is expanded into a complete URL.
The configuration can also be set via an environment variable. For instance:
name: phi-2
context_size: 2048
f16: true
threads: 11
gpu_layers: 90
mmap: true
parameters:
  # Reference any HF model or a local file here
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
template:
  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template here ^^^ as per your requirements
  completion: *template
Then, launch LocalAI using your gist’s URL:
## Important! Substitute with your gist's URL!
docker run -p 8080:8080 localai/localai:v4.1.3 https://gist.githubusercontent.com/xxxx/phi-2.yaml
Next Steps
Visit the advanced section for more insights on prompt templates and configuration files.
Building LocalAI from source is an installation method that allows you to compile LocalAI yourself, which is useful for custom configurations, development, or when you need specific build options.
For complete build instructions, see the Build from Source documentation in the Installation section.
Run with container images
LocalAI provides a variety of images to support different environments. These images are available on quay.io and Docker Hub.
For GPU acceleration on Nvidia graphics cards, use the Nvidia/CUDA images; if you don't have a GPU, use the CPU images. If you have AMD or Apple Silicon, see the build section.
Tip
Available Images Types:
Images ending with -core are smaller images without pre-downloaded Python dependencies. Use these images if you plan to use the llama.cpp, stablediffusion-ncn, or rwkv backends; if you are not sure which one to use, do not use these images.
Prerequisites
Before you begin, ensure you have a container engine installed if you are not using the binaries. Suitable options include Docker and Podman; for installation instructions, refer to their respective documentation.
Hardware Requirements: The hardware requirements for LocalAI vary based on the model size and quantization method used. For performance benchmarks with different backends, such as llama.cpp, visit this link. The rwkv backend is noted for its lower resource consumption.
Standard container images
Standard container images do not have pre-installed models. Use these if you want to configure models manually.
Nvidia L4T images
These images are compatible with Nvidia ARM64 devices with CUDA 12, such as the Jetson Nano, Jetson Xavier NX, and Jetson AGX Orin. For more information, see the Nvidia L4T guide.
This guide covers common issues you may encounter when using LocalAI, organized by category. For each issue, diagnostic steps and solutions are provided.
Quick Diagnostics
Before diving into specific issues, run these commands to gather diagnostic information:
# Check LocalAI is running and responsive
curl http://localhost:8080/readyz

# List loaded models
curl http://localhost:8080/v1/models

# Check LocalAI version
local-ai --version

# Enable debug logging for detailed output
DEBUG=true local-ai run
# or
local-ai run --log-level=debug
For Docker deployments:
# View container logs
docker logs local-ai

# Check container status
docker ps -a | grep local-ai

# Test GPU access (NVIDIA)
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
Installation Issues
Binary Won’t Execute on Linux
Symptoms: Permission denied or “cannot execute binary file” errors.
Solution:
chmod +x local-ai-*
./local-ai-Linux-x86_64 run
If you see “cannot execute binary file: Exec format error”, you downloaded the wrong architecture. Verify with:
uname -m
# x86_64  → download the x86_64 binary
# aarch64 → download the arm64 binary
macOS: Application Is Quarantined
Symptoms: macOS blocks LocalAI from running because the DMG is not signed by Apple.
Solution: After verifying the download, remove the quarantine attribute (e.g. xattr -dr com.apple.quarantine /Applications/LocalAI.app, adjusting the path to your install location), or allow the app under System Settings → Privacy & Security.
Model Not Found
Symptoms: API returns 404 or "model not found" error.
Diagnostic steps:
Check the model exists in your models directory:
ls -la /path/to/models/
Verify your models path is correct:
# Check what path LocalAI is using
local-ai run --models-path /path/to/models --log-level=debug
Confirm the model name matches your request:
# List available models
curl http://localhost:8080/v1/models | jq '.data[].id'
Model Fails to Load (Backend Error)
Symptoms: Model is found but fails to load, with backend errors in the logs.
Common causes and fixes:
Wrong backend: Ensure the backend in your model YAML matches the model format. GGUF models use llama-cpp, diffusion models use diffusers, etc. See the compatibility table for details.
Backend not installed: Check installed backends:
local-ai backends list
# Install a missing backend:
local-ai backends install llama-cpp
Corrupt model file: Re-download the model. Partial downloads or disk errors can corrupt files.
Wrong model format: LocalAI uses GGUF format for llama.cpp models. Older GGML format is deprecated.
Model Configuration Issues
Symptoms: Model loads but produces unexpected results or errors during inference.
Check your model YAML configuration:
# Example model config
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf  # Relative to models directory
context_size: 2048
threads: 4  # Should match physical CPU cores
Common mistakes:
model path must be relative to the models directory, not an absolute path
threads set higher than physical CPU cores causes contention
context_size too large for available RAM causes OOM errors
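On Linux, one way to check the physical core count (as opposed to logical CPUs reported by nproc) is:

```shell
# Physical cores = unique (core, socket) pairs reported by lscpu
lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l
```

Set threads to at most this number.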
GPU and Memory Issues
GPU Not Detected
NVIDIA (CUDA):
# Verify CUDA is available
nvidia-smi

# For Docker, verify GPU passthrough
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
When working correctly, LocalAI logs should show: ggml_init_cublas: found X CUDA devices.
Ensure you are using a CUDA-enabled container image (tags containing cuda11, cuda12, or cuda13). CPU-only images cannot use NVIDIA GPUs.
AMD (ROCm):
If your GPU is not in the default target list, open an Issue. Supported targets include: gfx908, gfx90a, gfx942, gfx950, gfx1030, gfx1100, gfx1101, gfx1102, gfx1200, gfx1201.
Intel (SYCL):
# Docker requires device passthrough
docker run --device /dev/dri ...
Use container images with gpu-intel in the tag. Known issue: SYCL hangs when mmap: true is set — disable it in your model config:
mmap: false
Overriding backend auto-detection:
If LocalAI picks the wrong GPU backend, override it:
LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia local-ai run
# Options: default, nvidia, amd, intel
Out of Memory (OOM)
Symptoms: Model loading fails or the process is killed by the OS.
Solutions:
Use smaller quantizations: Q4_K_S or Q2_K use significantly less memory than Q8_0 or Q6_K
Reduce context size: Lower context_size in your model YAML
Enable low VRAM mode: Add low_vram: true to your model config
Limit active models: Only keep one model loaded at a time:
By default, models remain loaded in memory after first use. This can exhaust VRAM when switching between models.
Configure LRU eviction:
# Keep at most 2 models loaded; evict least recently used
local-ai run --max-active-backends=2
Configure watchdog auto-unload:
local-ai run \
--enable-watchdog-idle --watchdog-idle-timeout=15m \
--enable-watchdog-busy --watchdog-busy-timeout=5m
These can also be set via environment variables (LOCALAI_WATCHDOG_IDLE=true, LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m) or in the Web UI under Settings → Watchdog Settings.
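The config-side mitigations from the Out of Memory section (smaller context, low VRAM mode) can be combined in a model YAML, for example:

```yaml
# Illustrative memory-saving settings in a model config
context_size: 1024   # smaller context window
low_vram: true       # low VRAM mode
```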