LocalAI

The free, open-source alternative to OpenAI and Anthropic. Your all-in-one AI stack: run powerful language models, autonomous agents, and document intelligence locally on your hardware.

No cloud, no limits, no compromise.

Tip

⭐ Star us on GitHub - 33.3k+ stars and growing!

Drop-in replacement for the OpenAI API - a modular suite of tools that work seamlessly together or independently.

Start with LocalAI’s OpenAI-compatible API, extend with LocalAGI’s autonomous agents, and enhance with LocalRecall’s semantic search - all running locally on your hardware.

Open Source MIT Licensed.

Why Choose LocalAI?

OpenAI API Compatible - Run AI models locally with our modular ecosystem. From language models to autonomous agents and semantic search, build your complete AI stack without the cloud.

Key Features

  • LLM Inferencing: LocalAI is a free, open-source OpenAI alternative. Run LLMs, generate images, audio, and more locally on consumer-grade hardware.
  • Agentic-first: Extend LocalAI with LocalAGI, an autonomous AI agent platform that runs locally, no coding required. Build and deploy autonomous agents with ease.
  • Memory and Knowledge base: Extend LocalAI with LocalRecall, a local REST API for semantic search and memory management. Perfect for AI applications.
  • OpenAI Compatible: Drop-in replacement for OpenAI API. Compatible with existing applications and libraries.
  • No GPU Required: Run on consumer-grade hardware. No need for expensive GPUs or cloud services.
  • Multiple Models: Support for various model families including LLMs, image generation, and audio models. Supports multiple backends for inferencing.
  • Privacy Focused: Keep your data local. No data leaves your machine, ensuring complete privacy.
  • Easy Setup: Simple installation and configuration. Get started in minutes with binaries, Docker, Podman, Kubernetes, or a local installation.
  • Community Driven: Active community support and regular updates. Contribute and help shape the future of LocalAI.

Quick Start

Docker is the recommended installation method for most users:

docker run -p 8080:8080 --name local-ai -ti localai/localai:latest

For complete installation instructions, see the Installation guide.

Get Started

  1. Install LocalAI - Choose your installation method (Docker recommended)
  2. Quickstart Guide - Get started quickly after installation
  3. Install and Run Models - Learn how to work with AI models
  4. Try It Out - Explore examples and use cases

Learn More

Subsections of LocalAI

Overview

LocalAI is your complete AI stack for running AI models locally. It’s designed to be simple, efficient, and accessible, providing a drop-in replacement for OpenAI’s API while keeping your data private and secure.

Why LocalAI?

In today’s AI landscape, privacy, control, and flexibility are paramount. LocalAI addresses these needs by:

  • Privacy First: Your data never leaves your machine
  • Complete Control: Run models on your terms, with your hardware
  • Open Source: MIT licensed and community-driven
  • Flexible Deployment: From laptops to servers, with or without GPUs
  • Extensible: Add new models and features as needed

Core Components

LocalAI is more than just a single tool - it’s a complete ecosystem:

  1. LocalAI Core

    • OpenAI-compatible API
    • Multiple model support (LLMs, image, audio)
    • Model Context Protocol (MCP) for agentic capabilities
    • No GPU required
    • Fast inference with native bindings
    • Github repository
  2. LocalAGI

    • Autonomous AI agents
    • No coding required
    • WebUI and REST API support
    • Extensible agent framework
    • Github repository
  3. LocalRecall

    • Semantic search
    • Memory management
    • Vector database
    • Perfect for AI applications
    • Github repository

Getting Started

LocalAI can be installed in several ways. Docker is the recommended installation method for most users as it provides the easiest setup and works across all platforms.

The quickest way to get started with LocalAI is using Docker:

docker run -p 8080:8080 --name local-ai -ti localai/localai:latest

For complete installation instructions including Docker, macOS, Linux, Kubernetes, and building from source, see the Installation guide.

Key Features

  • Text Generation: Run various LLMs locally
  • Image Generation: Create images with stable diffusion
  • Audio Processing: Text-to-speech and speech-to-text
  • Vision API: Image understanding and analysis
  • Embeddings: Vector database support
  • Functions: OpenAI-compatible function calling
  • MCP Support: Model Context Protocol for agentic capabilities
  • P2P: Distributed inference capabilities

Community and Support

LocalAI is a community-driven project. You can:

Next Steps

Ready to dive in? Here are some recommended next steps:

  1. Install LocalAI - Start with Docker installation (recommended) or choose another method
  2. Explore available models
  3. Model compatibility
  4. Try out examples
  5. Join the community
  6. Check the LocalAI Github repository
  7. Check the LocalAGI Github repository

License

LocalAI is MIT licensed, created and maintained by Ettore Di Giacinto.

Chapter 2

Installation

LocalAI can be installed in multiple ways depending on your platform and preferences.

Tip

Recommended: Docker Installation

Docker is the recommended installation method for most users as it works across all platforms (Linux, macOS, Windows) and provides the easiest setup experience. It’s the fastest way to get started with LocalAI.

Installation Methods

Choose the installation method that best suits your needs:

  1. Docker (Recommended) - Works on all platforms, easiest setup
  2. macOS - Download and install the DMG application
  3. Linux - Install on Linux using the one-liner script or binaries
  4. Kubernetes - Deploy LocalAI on Kubernetes clusters
  5. Build from Source - Build LocalAI from source code

Quick Start

Recommended: Docker (works on all platforms)

docker run -p 8080:8080 --name local-ai -ti localai/localai:latest

This will start LocalAI. The API will be available at http://localhost:8080. For images with pre-configured models, see All-in-One images.
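To confirm the server is up before going further, you can hit the readiness endpoint and list the installed models - a minimal check, assuming the default port 8080:

# Should return 200 once LocalAI is ready
curl http://localhost:8080/readyz

# Lists installed models (empty until you install one)
curl http://localhost:8080/v1/models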

For other platforms:

  • macOS: Download the DMG
  • Linux: Use the curl https://localai.io/install.sh | sh one-liner

For detailed instructions, see the Docker installation guide.

Subsections of Installation

Docker Installation

Tip

Recommended Installation Method

Docker is the recommended way to install LocalAI and provides the easiest setup experience.

LocalAI provides Docker images that work with Docker, Podman, and other container engines. These images are available on Docker Hub and Quay.io.

Prerequisites

Before you begin, ensure you have Docker or Podman installed:

Quick Start

The fastest way to get started is with the CPU image:

docker run -p 8080:8080 --name local-ai -ti localai/localai:latest

This will:

  • Start LocalAI (you’ll need to install models separately)
  • Make the API available at http://localhost:8080
Tip

Docker Run vs Docker Start

  • docker run creates and starts a new container. If a container with the same name already exists, this command will fail.
  • docker start starts an existing container that was previously created with docker run.

If you’ve already run LocalAI before and want to start it again, use: docker start -i local-ai
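For reference, a typical container lifecycle with the name used in these examples might look like this (a sketch; adjust the flags to your setup):

# First run: create and start the container
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest

# Later: stop and restart the same container
docker stop local-ai
docker start -i local-ai

# Remove the container if you want to recreate it with different options
docker rm local-ai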

Image Types

LocalAI provides several image types to suit different needs:

Standard Images

Standard images don’t include pre-configured models. Use these if you want to configure models manually.

CPU Image

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest

GPU Images

NVIDIA CUDA 12:

docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12

NVIDIA CUDA 11:

docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-11

AMD GPU (ROCm):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas

Intel GPU:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-intel

Vulkan:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan

NVIDIA Jetson (L4T ARM64):

docker run -ti --name local-ai -p 8080:8080 --runtime nvidia --gpus all localai/localai:latest-nvidia-l4t-arm64

All-in-One (AIO) Images

Recommended for beginners - These images come pre-configured with models and backends, ready to use immediately.

CPU Image

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-cpu

GPU Images

NVIDIA CUDA 12:

docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-12

NVIDIA CUDA 11:

docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-11

AMD GPU (ROCm):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-aio-gpu-hipblas

Intel GPU:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-gpu-intel

Using Docker Compose

For a more manageable setup, especially with persistent volumes, use Docker Compose:

version: "3.9"
services:
  api:
    image: localai/localai:latest-aio-cpu
    # For GPU support, use one of:
    # image: localai/localai:latest-aio-gpu-nvidia-cuda-12
    # image: localai/localai:latest-aio-gpu-nvidia-cuda-11
    # image: localai/localai:latest-aio-gpu-hipblas
    # image: localai/localai:latest-aio-gpu-intel
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 5
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
    volumes:
      - ./models:/models:cached
    # For NVIDIA GPUs, uncomment:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

Save this as docker-compose.yml and run:

docker compose up -d

Persistent Storage

To persist models and configurations, mount a volume:

docker run -ti --name local-ai -p 8080:8080 \
  -v $PWD/models:/models \
  localai/localai:latest-aio-cpu

Or use a named volume:

docker volume create localai-models
docker run -ti --name local-ai -p 8080:8080 \
  -v localai-models:/models \
  localai/localai:latest-aio-cpu

What’s Included in AIO Images

All-in-One images come pre-configured with:

  • Text Generation: LLM models for chat and completion
  • Image Generation: Stable Diffusion models
  • Text to Speech: TTS models
  • Speech to Text: Whisper models
  • Embeddings: Vector embedding models
  • Function Calling: Support for OpenAI-compatible function calling

The AIO images use OpenAI-compatible model names (like gpt-4, gpt-4-vision-preview) but are backed by open-source models. See the container images documentation for the complete mapping.
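For example, with an AIO image running you can send a chat request to the gpt-4 alias directly - a minimal sketch assuming the default port and the AIO model mapping described above:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello!"}]}'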

Next Steps

After installation:

  1. Access the WebUI at http://localhost:8080
  2. Check available models: curl http://localhost:8080/v1/models
  3. Install additional models
  4. Try out examples

Advanced Configuration

For detailed information about:

  • All available image tags and versions
  • Advanced Docker configuration options
  • Custom image builds
  • Backend management

See the Container Images documentation.

Troubleshooting

Container won’t start

  • Check Docker is running: docker ps
  • Check port 8080 is available: netstat -an | grep 8080 (Linux/Mac)
  • View logs: docker logs local-ai

GPU not detected

  • Ensure Docker has GPU access: docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
  • For NVIDIA: Install NVIDIA Container Toolkit
  • For AMD: Ensure devices are accessible: ls -la /dev/kfd /dev/dri

Models not downloading

  • Check internet connection
  • Verify disk space: df -h
  • Check Docker logs for errors: docker logs local-ai

See Also

macOS Installation

The easiest way to install LocalAI on macOS is using the DMG application.

Download

Download the latest DMG from GitHub releases:

Download LocalAI for macOS

Installation Steps

  1. Download the LocalAI.dmg file from the link above
  2. Open the downloaded DMG file
  3. Drag the LocalAI application to your Applications folder
  4. Launch LocalAI from your Applications folder

Known Issues

Note: The DMGs are not signed by Apple and may show as quarantined.

Workaround: See this issue for details on how to bypass the quarantine.

Fix tracking: The signing issue is being tracked in this issue.

Next Steps

After installing LocalAI, you can:

Linux Installation

The fastest way to install LocalAI on Linux is with the installation script:

curl https://localai.io/install.sh | sh

This script will:

  • Detect your system architecture
  • Download the appropriate LocalAI binary
  • Set up the necessary configuration
  • Start LocalAI automatically

Installer Configuration Options

The installer can be configured using environment variables:

curl https://localai.io/install.sh | VAR=value sh

Environment Variables

Environment Variable | Description
DOCKER_INSTALL | Set to "true" to enable the installation of Docker images
USE_AIO | Set to "true" to use the all-in-one LocalAI Docker image
USE_VULKAN | Set to "true" to use Vulkan GPU support
API_KEY | Specify an API key for accessing LocalAI, if required
PORT | Specifies the port on which LocalAI will run (default is 8080)
THREADS | Number of processor threads the application should use. Defaults to the number of logical cores minus one
VERSION | Specifies the version of LocalAI to install. Defaults to the latest available version
MODELS_PATH | Directory path where LocalAI models are stored (default is /usr/share/local-ai/models)
P2P_TOKEN | Token to use for the federation or for starting workers. See distributed inferencing documentation
WORKER | Set to "true" to make the instance a worker (p2p token is required)
FEDERATED | Set to "true" to share the instance with the federation (p2p token is required)
FEDERATED_SERVER | Set to "true" to run the instance as a federation server which forwards requests to the federation (p2p token is required)
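For example, several of these options can be combined in a single invocation (the values below are illustrative, not defaults):

curl https://localai.io/install.sh | PORT=9090 API_KEY=my-secret-key THREADS=8 sh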

Image Selection

The installer will automatically detect your GPU and select the appropriate image. By default, it uses the standard images without extra Python dependencies. You can customize the image selection:

  • USE_AIO=true: Use all-in-one images that include all dependencies
  • USE_VULKAN=true: Use Vulkan GPU support instead of vendor-specific GPU support

Uninstallation

To uninstall LocalAI installed via the script:

curl https://localai.io/install.sh | sh -s -- --uninstall

Manual Installation

Download Binary

You can manually download the appropriate binary for your system from the releases page:

  1. Go to GitHub Releases
  2. Download the binary for your architecture (amd64, arm64, etc.)
  3. Make it executable:
    chmod +x local-ai-*
  4. Run LocalAI:
    ./local-ai-*

System Requirements

Hardware requirements vary based on:

  • Model size
  • Quantization method
  • Backend used

For performance benchmarks with different backends like llama.cpp, visit this link.

Configuration

After installation, you can:

  • Access the WebUI at http://localhost:8080
  • Configure models in the models directory
  • Customize settings via environment variables or config files
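As a quick sketch, the binary can be started with flags and environment variables that appear elsewhere in this guide (the values shown are examples, not recommendations):

# Run with a custom models directory, a smaller context window, and debug logging
DEBUG=true local-ai --models-path ./models --context-size 700 --threads 4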

Next Steps

Run with Kubernetes

For installing LocalAI in Kubernetes, the deployment file from the examples can be used and customized as preferred:

kubectl apply -f https://raw.githubusercontent.com/mudler/LocalAI-examples/refs/heads/main/kubernetes/deployment.yaml

For Nvidia GPUs:

kubectl apply -f https://raw.githubusercontent.com/mudler/LocalAI-examples/refs/heads/main/kubernetes/deployment-nvidia.yaml

Alternatively, the helm chart can be used as well:

helm repo add go-skynet https://go-skynet.github.io/helm-charts/
helm repo update
helm show values go-skynet/local-ai > values.yaml


helm install local-ai go-skynet/local-ai -f values.yaml
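Once the pods are running, a quick way to reach the API from your workstation is a port-forward. The service name below is an assumption; check kubectl get svc for the name your deployment or chart actually created:

kubectl port-forward svc/local-ai 8080:8080
curl http://localhost:8080/readyz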

Build LocalAI

Build

LocalAI can be built as a container image or as a single, portable binary. Note that some model architectures might require Python libraries, which are not included in the binary.

LocalAI’s extensible architecture allows you to add your own backends, which can be written in any language. As a result, the container images also include the Python dependencies needed to run all the available backends (for example, backends like Diffusers, which generate images and videos from text).

This section contains instructions on how to build LocalAI from source.

Build LocalAI locally

Requirements

In order to build LocalAI locally, you need the following requirements:

  • Golang >= 1.21
  • GCC
  • GRPC

To install the dependencies follow the instructions below:

On macOS, install Xcode from the App Store, then:

brew install go protobuf protoc-gen-go protoc-gen-go-grpc wget

On Debian-based Linux (apt):

apt install golang make protobuf-compiler-grpc

After you have Go installed and working, you can install the required binaries for compiling the Go protobuf components with the following commands:

go install google.golang.org/protobuf/cmd/[email protected]
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
make build
Build

To build LocalAI with make:

git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build

This should produce the local-ai binary.

Container image

Requirements:

  • Docker, Podman, or another container engine

To build the LocalAI container image locally you can use docker, for example:

docker build -t localai .
docker run localai

Example: Build on mac

Building on Mac (M1, M2 or M3) works, but you may need to install some prerequisites using brew.

The below has been tested by one mac user and found to work. Note that this doesn’t use Docker to run the server:

Install Xcode from the App Store (needed for metalkit)

brew install abseil cmake go grpc protobuf wget protoc-gen-go protoc-gen-go-grpc

git clone https://github.com/go-skynet/LocalAI.git

cd LocalAI

make build

wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q2_K.gguf -O models/phi-2.Q2_K

cp -rf prompt-templates/ggml-gpt4all-j.tmpl models/phi-2.Q2_K.tmpl

./local-ai backends install llama-cpp

./local-ai --models-path=./models/ --debug=true

curl http://localhost:8080/v1/models

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "phi-2.Q2_K",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9 
   }'

Troubleshooting mac

  • If you encounter errors regarding a missing utility metal, install Xcode from the App Store.

  • After installing Xcode, if you receive an xcrun error like 'xcrun: error: unable to find utility "metal", not a developer tool or in PATH', you might have installed the Xcode command line tools before installing Xcode; in that case they point to an incomplete SDK. Check and switch the active developer directory:

xcode-select --print-path

sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
  • If completions are slow, ensure that gpu-layers in your model yaml matches the number of layers from the model in use (or simply use a high number such as 256).

  • If you get a compile error: error: only virtual member functions can be marked 'final', reinstall all the necessary brew packages, clean the build, and try again.

brew reinstall go grpc protobuf wget

make clean

make build

Build backends

LocalAI has several backends available for installation in the backend gallery. Backends can also be built from source. Since backends vary in the languages and dependencies they require, this documentation provides generic guidance for a few of them, which can be applied with slight modifications to the others.

Manually

Typically, each backend includes a Makefile that allows you to package the backend.

In the LocalAI repository, for instance, you can build bark-cpp by doing:

git clone https://github.com/go-skynet/LocalAI.git

make -C LocalAI/backend/go/bark-cpp build package

# Python-based backends, for example vllm:
make -C LocalAI/backend/python/vllm

With Docker

Building with Docker is simpler, as it abstracts away the requirements and focuses on building the final OCI images that are available in the gallery. This also allows you, for instance, to build a backend locally and install it with LocalAI. You can refer to Backends for general guidance on how to install and develop backends.

In the LocalAI repository, you can build bark-cpp by doing:

git clone https://github.com/go-skynet/LocalAI.git

make docker-build-bark-cpp

Note that make is used only for convenience; in reality it just runs a simple docker command such as:

docker build --build-arg BUILD_TYPE=$(BUILD_TYPE) --build-arg BASE_IMAGE=$(BASE_IMAGE) -t local-ai-backend:bark-cpp -f LocalAI/backend/Dockerfile.golang --build-arg BACKEND=bark-cpp .               

Note:

  • BUILD_TYPE can be either: cublas, hipblas, sycl_f16, sycl_f32, metal.
  • BASE_IMAGE is tested on ubuntu:22.04 (and defaults to it), and quay.io/go-skynet/intel-oneapi-base:latest for Intel/SYCL builds.
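As an illustration, building the CUDA variant of the bark-cpp backend with the defaults above might look like this (the BUILD_TYPE and BASE_IMAGE values are examples taken from the list above):

docker build --build-arg BUILD_TYPE=cublas --build-arg BASE_IMAGE=ubuntu:22.04 \
  -t local-ai-backend:bark-cpp -f LocalAI/backend/Dockerfile.golang --build-arg BACKEND=bark-cpp .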
Chapter 3

Getting started

Welcome to LocalAI! This section covers everything you need to know after installation to start using LocalAI effectively.

Tip

Haven’t installed LocalAI yet?

See the Installation guide to install LocalAI first. Docker is the recommended installation method for most users.

What’s in This Section

Subsections of Getting started

Quickstart

LocalAI is a free, open-source alternative to OpenAI (Anthropic, etc.), functioning as a drop-in replacement REST API for local inferencing. It allows you to run LLMs, generate images, and produce audio, all locally or on-premises with consumer-grade hardware, supporting multiple model families and architectures.

Tip

Security considerations

If you are exposing LocalAI remotely, make sure you protect the API endpoints adequately with a mechanism that filters incoming traffic, or alternatively run LocalAI with API_KEY to gate access with an API key. The API key grants full access to all features (there is no role separation), so it should be treated like an admin credential.
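For instance, a sketch of gating the API with a key and passing it as a bearer token, OpenAI-style (the key value is illustrative):

docker run -p 8080:8080 -e API_KEY=my-secret-key --name local-ai -ti localai/localai:latest

# Requests must then include the key
curl http://localhost:8080/v1/models -H "Authorization: Bearer my-secret-key"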

Quickstart

This guide assumes you have already installed LocalAI. If you haven’t installed it yet, see the Installation guide first.

Starting LocalAI

Once installed, start LocalAI. For Docker installations:

docker run -p 8080:8080 --name local-ai -ti localai/localai:latest

The API will be available at http://localhost:8080.

Downloading models on start

When starting LocalAI (either via Docker or via the CLI) you can pass a list of models as arguments, and they will be installed automatically before the API starts, for example:

local-ai run llama-3.2-1b-instruct:q4_k_m
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
Tip

Automatic Backend Detection: When you install models from the gallery or YAML files, LocalAI automatically detects your system’s GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see GPU Acceleration.

For a full list of options, you can run LocalAI with --help or refer to the Linux Installation guide for installer configuration options.

Using LocalAI and the full stack with LocalAGI

LocalAI is part of the Local family stack, along with LocalAGI and LocalRecall.

LocalAGI is a powerful, self-hostable AI agent platform designed for maximum privacy and flexibility that encompasses and uses the whole software stack. It provides a complete drop-in replacement for OpenAI’s Responses APIs with advanced agentic capabilities, working entirely locally on consumer-grade hardware (CPU and GPU).

Quick Start

git clone https://github.com/mudler/LocalAGI
cd LocalAGI

# CPU setup
docker compose up

# NVIDIA GPU setup
docker compose -f docker-compose.nvidia.yaml up

# Intel GPU setup
docker compose -f docker-compose.intel.yaml up

# Start with a specific model
MODEL_NAME=gemma-3-12b-it docker compose up

# NVIDIA GPU setup with custom multimodal and image models
MODEL_NAME=gemma-3-12b-it \
MULTIMODAL_MODEL=minicpm-v-4_5 \
IMAGE_MODEL=flux.1-dev-ggml \
docker compose -f docker-compose.nvidia.yaml up

Key Features

  • Privacy-Focused: All processing happens locally, ensuring your data never leaves your machine
  • Flexible Deployment: Supports CPU, NVIDIA GPU, and Intel GPU configurations
  • Multiple Model Support: Compatible with various models from Hugging Face and other sources
  • Web Interface: User-friendly chat interface for interacting with AI agents
  • Advanced Capabilities: Supports multimodal models, image generation, and more
  • Docker Integration: Easy deployment using Docker Compose

Environment Variables

You can customize your LocalAGI setup using the following environment variables:

  • MODEL_NAME: Specify the model to use (e.g., gemma-3-12b-it)
  • MULTIMODAL_MODEL: Set a custom multimodal model
  • IMAGE_MODEL: Configure an image generation model

For more advanced configuration and API documentation, visit the LocalAGI GitHub repository.

What’s Next?

There is much more to explore with LocalAI! You can run any model from Hugging Face, perform video generation, and also voice cloning. For a comprehensive overview, check out the features section.

Explore additional resources and community contributions:

Setting Up Models

This section covers everything you need to know about installing and configuring models in LocalAI. You’ll learn multiple methods to get models running.

Prerequisites

  • LocalAI installed and running (see Quickstart if you haven’t set it up yet)
  • Basic understanding of command line usage

Method 1: Install from the Model Gallery (Recommended)

The Model Gallery is the simplest way to install models. It provides pre-configured models ready to use.

Via WebUI

  1. Open the LocalAI WebUI at http://localhost:8080
  2. Navigate to the “Models” tab
  3. Browse available models
  4. Click “Install” on any model you want
  5. Wait for installation to complete

For more details, refer to the Gallery Documentation.

Via CLI

# List available models
local-ai models list

# Install a specific model
local-ai models install llama-3.2-1b-instruct:q4_k_m

# Start LocalAI with a model from the gallery
local-ai run llama-3.2-1b-instruct:q4_k_m

To run models available in the LocalAI gallery, you can use the model name as the URI. For example, to run LocalAI with the Hermes model, execute:

local-ai run hermes-2-theta-llama-3-8b

To install only the model, use:

local-ai models install hermes-2-theta-llama-3-8b

Note: The galleries available in LocalAI can be customized to point to a different URL or a local directory. For more information on how to set up your own gallery, see the Gallery Documentation.
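A sketch of pointing LocalAI at a custom gallery via the GALLERIES environment variable; the URL is a placeholder and the exact JSON format should be checked against the Gallery Documentation:

GALLERIES='[{"name":"my-gallery","url":"https://example.com/index.yaml"}]' local-ai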

Browse Online

Visit models.localai.io to browse all available models in your browser.

Method 1.5: Import Models via WebUI

The WebUI provides a powerful model import interface that supports both simple and advanced configuration:

Simple Import Mode

  1. Open the LocalAI WebUI at http://localhost:8080
  2. Click “Import Model”
  3. Enter the model URI (e.g., https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF)
  4. Optionally configure preferences:
    • Backend selection
    • Model name
    • Description
    • Quantizations
    • Embeddings support
    • Custom preferences
  5. Click “Import Model” to start the import process

Advanced Import Mode

For full control over model configuration:

  1. In the WebUI, click “Import Model”
  2. Toggle to “Advanced Mode”
  3. Edit the YAML configuration directly in the code editor
  4. Use the “Validate” button to check your configuration
  5. Click “Create” or “Update” to save

The advanced editor includes:

  • Syntax highlighting
  • YAML validation
  • Format and copy tools
  • Full configuration options

This is especially useful for:

  • Custom model configurations
  • Fine-tuning model parameters
  • Setting up complex model setups
  • Editing existing model configurations

Method 2: Installing from Hugging Face

LocalAI can directly install models from Hugging Face:

# Install and run a model from Hugging Face
local-ai run huggingface://TheBloke/phi-2-GGUF

The format is: huggingface://<repository>/<model-file> (the model file part is optional)

Examples

local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf

Method 3: Installing from OCI Registries

Ollama Registry

local-ai run ollama://gemma:2b

Standard OCI Registry

local-ai run oci://localai/phi-2:latest

Run Models via URI

To run models via URI, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:

  • file://path/to/model
  • huggingface://repository_id/model_file (e.g., huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf)
  • From OCIs: oci://container_image:tag, ollama://model_id:tag
  • From configuration files: https://gist.githubusercontent.com/.../phi-2.yaml

Configuration files can be used to customize the model defaults and settings. For advanced configurations, refer to the Customize Models section.

Examples

local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest

Method 4: Manual Installation

For full control, you can manually download and configure models.

Step 1: Download a Model

Download a GGUF model file. Popular sources:

Example:

mkdir -p models

wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
  -O models/phi-2.Q4_K_M.gguf

Step 2: Create a Configuration File (Optional)

Create a YAML file to configure the model:

# models/phi-2.yaml
name: phi-2
parameters:
  model: phi-2.Q4_K_M.gguf
  temperature: 0.7
context_size: 2048
threads: 4
backend: llama-cpp

Customize model defaults and settings with a configuration file. For advanced configurations, refer to the Advanced Documentation.
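With the phi-2.yaml file above in the models directory, the model can then be addressed by its configured name (phi-2) rather than the raw file name once LocalAI is running (see Step 3 below), for example:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "phi-2",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.7
   }'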

Step 3: Run LocalAI

Choose one of the following methods to run LocalAI:

# Create a models directory and copy your model into it
mkdir models

cp your-model.gguf models/

# Start LocalAI with the models directory mounted
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4

# Query the model
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.gguf",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
Tip

Other Docker Images:

For other Docker images, please refer to the table in the container images section.

Example:

mkdir models

wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2

cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl

docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4

curl http://localhost:8080/v1/models

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "luna-ai-llama2",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'
Note
  • If you are running on Apple Silicon (ARM), running LocalAI in Docker is not recommended due to emulation. Follow the build instructions to use Metal acceleration for full GPU support.
  • If you are running on an Intel (x86_64) Mac, you can use Docker; there is no additional gain from building from source.
git clone https://github.com/go-skynet/LocalAI

cd LocalAI

cp your-model.gguf models/

docker compose up -d --pull always

curl http://localhost:8080/v1/models

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.gguf",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
Tip

Other Docker Images:

For other Docker images, please refer to the table in Getting Started.

Note: If you are on Windows, ensure the project is on the Linux filesystem to avoid slow model loading. For more information, see the Microsoft Docs.

For Kubernetes deployment, see the Kubernetes installation guide.

LocalAI binary releases are available on GitHub.

# With binary
local-ai --models-path ./models
Tip

If installing on macOS, you might encounter a message saying:

“local-ai-git-Darwin-arm64” (or the name you gave the binary) can’t be opened because Apple cannot check it for malicious software.

Hit OK, then go to Settings > Privacy & Security > Security and look for the message:

“local-ai-git-Darwin-arm64” was blocked from use because it is not from an identified developer.

Press “Allow Anyway.”

For instructions on building LocalAI from source, see the Build from Source guide.

GPU Acceleration

For instructions on GPU acceleration, visit the GPU Acceleration page.

For more model configurations, visit the Examples Section.

Understanding Model Files

File Formats

  • GGUF: Modern format, recommended for most use cases
  • GGML: Older format, still supported but deprecated

Quantization Levels

Models come in different quantization levels (quality vs. size trade-off):

Quantization | Size | Quality | Use Case
Q8_0 | Largest | Highest | Best quality, requires more RAM
Q6_K | Large | Very High | High quality
Q4_K_M | Medium | High | Balanced (recommended)
Q4_K_S | Small | Medium | Lower RAM usage
Q2_K | Smallest | Lower | Minimal RAM, lower quality

Choosing the Right Model

Consider:

  • RAM available: Larger models need more RAM
  • Use case: Different models excel at different tasks
  • Speed: Smaller quantizations are faster
  • Quality: Higher quantizations produce better output

Model Configuration

Basic Configuration

Create a YAML file in your models directory:

name: my-model
parameters:
  model: model.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 2048
threads: 4
backend: llama-cpp

Advanced Configuration

See the Model Configuration guide for all available options.

Managing Models

List Installed Models

# Via API
curl http://localhost:8080/v1/models

# Via CLI
local-ai models list

Remove Models

Simply delete the model file and configuration from your models directory:

rm models/model-name.gguf
rm models/model-name.yaml  # if exists

Troubleshooting

Model Not Loading

  1. Check backend: Ensure the required backend is installed

    local-ai backends list
    local-ai backends install llama-cpp  # if needed
  2. Check logs: Enable debug mode

    DEBUG=true local-ai
  3. Verify file: Ensure the model file is not corrupted

Out of Memory

  • Use a smaller quantization (Q4_K_S or Q2_K)
  • Reduce context_size in configuration
  • Close other applications to free RAM

Wrong Backend

Check the Compatibility Table to ensure you’re using the correct backend for your model.

Best Practices

  1. Start small: Begin with smaller models to test your setup
  2. Use quantized models: Q4_K_M is a good balance for most use cases
  3. Organize models: Keep your models directory organized
  4. Backup configurations: Save your YAML configurations
  5. Monitor resources: Watch RAM and disk usage

Try it out

Once LocalAI is installed, you can start it (either by using docker, or the cli, or the systemd service).

By default the LocalAI WebUI should be accessible at http://localhost:8080. You can also use third-party projects to interact with LocalAI as you would use OpenAI (see also Integrations).

After installation, install new models by navigating the model gallery, or by using the local-ai CLI.

Tip

To install models with the WebUI, see the Models section. With the CLI you can list the models with local-ai models list and install them with local-ai models install <model-name>.

You can also run models manually by copying files into the models directory.

You can test out the API endpoints using curl; a few examples are listed below. The models referred to here (gpt-4, gpt-4-vision-preview, tts-1, whisper-1) are the default models that come with the AIO images - you can also use any other model you have installed.

Text Generation

Creates a model response for the given chat conversation. OpenAI documentation.

curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{ "model": "gpt-4", "messages": [{"role": "user", "content": "How are you doing?"}], "temperature": 0.1 }'

GPT Vision

Understand images.

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "gpt-4-vision-preview",
        "messages": [
          {
            "role": "user", "content": [
              {"type":"text", "text": "What is in the image?"},
              {
                "type": "image_url",
                "image_url": {
                  "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
              }
            ]
          }
        ],
        "temperature": 0.9
      }'

Function calling

Call functions

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in Boston?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather in a given location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"]
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'

Image Generation

Creates an image given a prompt. OpenAI documentation.

curl http://localhost:8080/v1/images/generations \
      -H "Content-Type: application/json" -d '{
          "prompt": "A cute baby sea otter",
          "size": "256x256"
        }'

Text to speech

Generates audio from the input text. OpenAI documentation.

curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
  }' \
  --output speech.mp3

Audio Transcription

Transcribes audio into the input language. OpenAI Documentation.

Download first a sample to transcribe:

wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg 

Send the example audio file to the transcriptions endpoint :

curl http://localhost:8080/v1/audio/transcriptions \
    -H "Content-Type: multipart/form-data" \
    -F file="@$PWD/gb1.ogg" -F model="whisper-1"

Embeddings Generation

Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. OpenAI Embeddings.

curl http://localhost:8080/embeddings \
    -X POST -H "Content-Type: application/json" \
    -d '{ 
        "input": "Your text string goes here", 
        "model": "text-embedding-ada-002"
      }'
Tip

Don’t use the model file as model in the request unless you want to handle the prompt template for yourself.

Use the model names like you would do with OpenAI like in the examples below. For instance gpt-4-vision-preview, or gpt-4.

Customizing the Model

To customize the prompt template or the default settings of the model, a configuration file is used. This file must adhere to the LocalAI YAML configuration standards. For comprehensive syntax details, refer to the advanced documentation. The configuration file can live either remotely (for example in a GitHub Gist or any other URL) or on the local filesystem.

LocalAI can be initiated using either its container image or binary, with a command that includes URLs of model config files or utilizes a shorthand format (like huggingface:// or github://), which is then expanded into complete URLs.

The configuration can also be set via an environment variable. For instance:

local-ai github://owner/repo/file.yaml@branch

MODELS="github://owner/repo/file.yaml@branch,github://owner/repo/file.yaml@branch" local-ai

Here’s an example to initiate the phi-2 model:

docker run -p 8080:8080 localai/localai:v3.7.0 https://gist.githubusercontent.com/mudler/ad601a0488b497b69ec549150d9edd18/raw/a8a8869ef1bb7e3830bf5c0bae29a0cce991ff8d/phi-2.yaml

You can also check all the embedded models configurations here.

Tip

The model configurations used in the quickstart are accessible here: https://github.com/mudler/LocalAI/tree/master/embedded/models. Contributions are welcome; please feel free to submit a Pull Request.

The phi-2 model configuration from the quickstart is expanded from https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml.

Example: Customizing the Prompt Template

To modify the prompt template, create a Github gist or a Pastebin file, and copy the content from https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml. Alter the fields as needed:

name: phi-2
context_size: 2048
f16: true
threads: 11
gpu_layers: 90
mmap: true
parameters:
  # Reference any HF model or a local file here
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
template:
  
  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template here ^^^ as per your requirements
  completion: *template

Then, launch LocalAI using your gist’s URL:

## Important! Substitute with your gist's URL!
docker run -p 8080:8080 localai/localai:v3.7.0 https://gist.githubusercontent.com/xxxx/phi-2.yaml

Next Steps

Build LocalAI from source

Building LocalAI from source is an installation method that allows you to compile LocalAI yourself, which is useful for custom configurations, development, or when you need specific build options.

For complete build instructions, see the Build from Source documentation in the Installation section.

Run with container images

LocalAI provides a variety of images to support different environments. These images are available on quay.io and Docker Hub.

All-in-One images come with a pre-configured set of models and backends, while standard images do not have any models pre-configured or installed.

For GPU acceleration on Nvidia graphics cards, use the Nvidia/CUDA images; if you don’t have a GPU, use the CPU images. If you have an AMD GPU or Apple Silicon, see the build section.

Tip

Available Images Types:

  • Images ending with -core are smaller images without pre-downloaded Python dependencies. Use these images if you plan to use the llama.cpp, stablediffusion-ncn or rwkv backends - if you are not sure which one to use, do not use these images.
  • Images containing the aio tag are all-in-one images with all the features enabled, and come with an opinionated set of configurations.

Prerequisites

Before you begin, ensure you have a container engine installed if you are not using the binaries. Suitable options include Docker or Podman. For installation instructions, refer to the following guides:

Tip

Hardware Requirements: The hardware requirements for LocalAI vary based on the model size and quantization method used. For performance benchmarks with different backends, such as llama.cpp, visit this link. The rwkv backend is noted for its lower resource consumption.

Standard container images

Standard container images do not have pre-installed models. Use these if you want to configure models manually.

CPU images:

Description | Quay | Docker Hub
Latest images from the branch (development) | quay.io/go-skynet/local-ai:master | localai/localai:master
Latest tag | quay.io/go-skynet/local-ai:latest | localai/localai:latest
Versioned image | quay.io/go-skynet/local-ai:v3.7.0 | localai/localai:v3.7.0

Nvidia GPU (CUDA 11) images:

Description | Quay | Docker Hub
Latest images from the branch (development) | quay.io/go-skynet/local-ai:master-gpu-nvidia-cuda-11 | localai/localai:master-gpu-nvidia-cuda-11
Latest tag | quay.io/go-skynet/local-ai:latest-gpu-nvidia-cuda-11 | localai/localai:latest-gpu-nvidia-cuda-11
Versioned image | quay.io/go-skynet/local-ai:v3.7.0-gpu-nvidia-cuda-11 | localai/localai:v3.7.0-gpu-nvidia-cuda-11

Nvidia GPU (CUDA 12) images:

Description | Quay | Docker Hub
Latest images from the branch (development) | quay.io/go-skynet/local-ai:master-gpu-nvidia-cuda-12 | localai/localai:master-gpu-nvidia-cuda-12
Latest tag | quay.io/go-skynet/local-ai:latest-gpu-nvidia-cuda-12 | localai/localai:latest-gpu-nvidia-cuda-12
Versioned image | quay.io/go-skynet/local-ai:v3.7.0-gpu-nvidia-cuda-12 | localai/localai:v3.7.0-gpu-nvidia-cuda-12

Intel GPU images:

Description | Quay | Docker Hub
Latest images from the branch (development) | quay.io/go-skynet/local-ai:master-gpu-intel | localai/localai:master-gpu-intel
Latest tag | quay.io/go-skynet/local-ai:latest-gpu-intel | localai/localai:latest-gpu-intel
Versioned image | quay.io/go-skynet/local-ai:v3.7.0-gpu-intel | localai/localai:v3.7.0-gpu-intel

AMD GPU (ROCm/hipblas) images:

Description | Quay | Docker Hub
Latest images from the branch (development) | quay.io/go-skynet/local-ai:master-gpu-hipblas | localai/localai:master-gpu-hipblas
Latest tag | quay.io/go-skynet/local-ai:latest-gpu-hipblas | localai/localai:latest-gpu-hipblas
Versioned image | quay.io/go-skynet/local-ai:v3.7.0-gpu-hipblas | localai/localai:v3.7.0-gpu-hipblas

Vulkan images:

Description | Quay | Docker Hub
Latest images from the branch (development) | quay.io/go-skynet/local-ai:master-vulkan | localai/localai:master-vulkan
Latest tag | quay.io/go-skynet/local-ai:latest-gpu-vulkan | localai/localai:latest-gpu-vulkan
Versioned image | quay.io/go-skynet/local-ai:v3.7.0-vulkan | localai/localai:v3.7.0-vulkan

Nvidia L4T (ARM64) images - compatible with Nvidia ARM64 devices such as the Jetson Nano, Jetson Xavier NX, and Jetson AGX Xavier. For more information, see the Nvidia L4T guide.

Description | Quay | Docker Hub
Latest images from the branch (development) | quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64 | localai/localai:master-nvidia-l4t-arm64
Latest tag | quay.io/go-skynet/local-ai:latest-nvidia-l4t-arm64 | localai/localai:latest-nvidia-l4t-arm64
Versioned image | quay.io/go-skynet/local-ai:v3.7.0-nvidia-l4t-arm64 | localai/localai:v3.7.0-nvidia-l4t-arm64

All-in-one images

All-In-One images come pre-configured with a set of models and backends to fully leverage almost all of the LocalAI feature set. These images are available for both CPU and GPU environments. The AIO images are designed to be easy to use and require no configuration. Model configurations can be found here, separated by size.

In the AIO images, models are configured with the names of OpenAI models; however, they are actually backed by open-source models. The mapping is shown in the table below:

Category | Model name | Real model (CPU) | Real model (GPU)
Text Generation | gpt-4 | phi-2 | hermes-2-pro-mistral
Multimodal Vision | gpt-4-vision-preview | bakllava | llava-1.6-mistral
Image Generation | stablediffusion | stablediffusion | dreamshaper-8
Speech to Text | whisper-1 | whisper with whisper-base model | <= same
Text to Speech | tts-1 | en-us-amy-low.onnx from rhasspy/piper | <= same
Embeddings | text-embedding-ada-002 | all-MiniLM-L6-v2 in Q4 | all-MiniLM-L6-v2

Usage

Select the image (CPU or GPU) and start the container with Docker:

docker run -p 8080:8080 --name local-ai -ti localai/localai:latest-aio-cpu

LocalAI will automatically download all the required models, and the API will be available at localhost:8080.

Or with a docker-compose file:

version: "3.9"
services:
  api:
    image: localai/localai:latest-aio-cpu
    # For a specific version:
    # image: localai/localai:v3.7.0-aio-cpu
    # For Nvidia GPUs decomment one of the following (cuda11 or cuda12):
    # image: localai/localai:v3.7.0-aio-gpu-nvidia-cuda-11
    # image: localai/localai:v3.7.0-aio-gpu-nvidia-cuda-12
    # image: localai/localai:latest-aio-gpu-nvidia-cuda-11
    # image: localai/localai:latest-aio-gpu-nvidia-cuda-12
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 5
    ports:
      - 8080:8080
    environment:
      - DEBUG=true
      # ...
    volumes:
      - ./models:/models:cached
    # decomment the following piece if running with Nvidia GPUs
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
Tip

Models caching: The AIO image will download the needed models on the first run if not already present and store those in /models inside the container. The AIO models will be automatically updated with new versions of AIO images.

You can change the directory inside the container by specifying a MODELS_PATH environment variable (or --models-path).
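For example, a sketch of pointing the AIO image at a different models directory inside the container (the paths are illustrative):

docker run -p 8080:8080 --name local-ai -ti \
  -e MODELS_PATH=/data/models \
  -v $PWD/models:/data/models \
  localai/localai:latest-aio-cpu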

If you want to use your own model files or a local models directory, you can mount it as a volume to /models:

docker run -p 8080:8080 --name local-ai -ti -v $PWD/models:/models localai/localai:latest-aio-cpu

or use a named volume:

docker volume create localai-models
docker run -p 8080:8080 --name local-ai -ti -v localai-models:/models localai/localai:latest-aio-cpu

Available AIO images

Description | Quay | Docker Hub
Latest images for CPU | quay.io/go-skynet/local-ai:latest-aio-cpu | localai/localai:latest-aio-cpu
Versioned image (e.g. for CPU) | quay.io/go-skynet/local-ai:v3.7.0-aio-cpu | localai/localai:v3.7.0-aio-cpu
Latest images for Nvidia GPU (CUDA11) | quay.io/go-skynet/local-ai:latest-aio-gpu-nvidia-cuda-11 | localai/localai:latest-aio-gpu-nvidia-cuda-11
Latest images for Nvidia GPU (CUDA12) | quay.io/go-skynet/local-ai:latest-aio-gpu-nvidia-cuda-12 | localai/localai:latest-aio-gpu-nvidia-cuda-12
Latest images for AMD GPU | quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas | localai/localai:latest-aio-gpu-hipblas
Latest images for Intel GPU | quay.io/go-skynet/local-ai:latest-aio-gpu-intel | localai/localai:latest-aio-gpu-intel

Available environment variables

The AIO images inherit the same environment variables as the base images and the environment of LocalAI (which you can inspect by calling --help). In addition, they support the following environment variables available only in the container image:

Variable | Default | Description
PROFILE | Auto-detected | The size of the model to use. Available: cpu, gpu-8g
MODELS | Auto-detected | A list of model YAML configuration file URIs/URLs (see also running models)
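For instance, a sketch that forces the CPU profile and preloads a model configuration from a URL (the URL is a placeholder):

docker run -p 8080:8080 --name local-ai -ti \
  -e PROFILE=cpu \
  -e MODELS="https://example.com/phi-2.yaml" \
  localai/localai:latest-aio-cpu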

See Also

Run with Kubernetes

For installing LocalAI in Kubernetes, the deployment file from the examples can be used and customized as preferred:

kubectl apply -f https://raw.githubusercontent.com/mudler/LocalAI-examples/refs/heads/main/kubernetes/deployment.yaml

For Nvidia GPUs:

kubectl apply -f https://raw.githubusercontent.com/mudler/LocalAI-examples/refs/heads/main/kubernetes/deployment-nvidia.yaml

Alternatively, the helm chart can be used as well:

helm repo add go-skynet https://go-skynet.github.io/helm-charts/
helm repo update
helm show values go-skynet/local-ai > values.yaml


helm install local-ai go-skynet/local-ai -f values.yaml

News

Release notes have now been moved completely to GitHub releases.

You can see the release notes here.

04-12-2023: v2.0.0

This release brings a major overhaul in some backends.

Breaking/important changes:

  • Backend rename: llama-stable renamed to llama-ggml 1287
  • Prompt template changes: 1254 (extra space in roles)
  • Apple metal bugfixes: 1365

New:

  • Added support for LLaVa and OpenAI Vision API support ( 1254 )
  • Python based backends are now using conda to track env dependencies ( 1144 )
  • Support for parallel requests ( 1290 )
  • Support for transformers-embeddings ( 1308 )
  • Watchdog for backends ( 1341 ). As https://github.com/ggerganov/llama.cpp/issues/3969 is hitting LocalAI’s llama-cpp implementation, we have now a watchdog that can be used to make sure backends are not stalling. This is a generic mechanism that can be enabled for all the backends now.
  • Whisper.cpp updates ( 1302 )
  • Petals backend ( 1350 )
  • Full LLM fine-tuning example to use with LocalAI: https://localai.io/advanced/fine-tuning/

Due to the Python dependencies, the images grew in size. If you still want to use smaller images without Python dependencies, you can use the corresponding image tags ending with -core.

Full changelog: https://github.com/mudler/LocalAI/releases/tag/v2.0.0

30-10-2023: v1.40.0

This release is a preparation before v2 - the efforts now will be to refactor, polish and add new backends. Follow up on: https://github.com/mudler/LocalAI/issues/1126

Hot topics

This release now brings the llama-cpp backend, which is a C++ backend tied to llama.cpp. It follows and tracks recent versions of llama.cpp more closely. It is not feature compatible with the current llama backend, but the plan is to sunset the current llama backend in favor of this one. This will probably be the last release containing the older llama backend written in Go and C++. The major improvement with this change is that there are fewer layers that could expose potential bugs, and it also eases maintenance.

Support for ROCm/HIPBLAS

This release brings support for AMD GPUs thanks to @65a. See more details in 1100.

More CLI commands

Thanks to @jespino, the local-ai binary now has more subcommands, allowing you to manage the gallery or try out inferencing directly. Check it out!

Release notes

25-09-2023: v1.30.0

This is an exciting LocalAI release! Besides bug-fixes and enhancements this release brings the new backend to a whole new level by extending support to vllm and vall-e-x for audio generation!

Check out the documentation for vllm here and Vall-E-X here

Release notes

26-08-2023: v1.25.0

Hey everyone, Ettore here, I’m so happy to share this release - while this summer is hot, it apparently doesn’t stop LocalAI development :)

This release brings a lot of new features, bugfixes and updates! Also a big shout out to the community, this was a great release!

Attention 🚨

From this release the llama backend supports only gguf files (see 943). LocalAI, however, still supports ggml files. We ship a version of llama.cpp from before that change in a separate backend, named llama-stable, to still allow loading ggml files. If you were specifying the llama backend manually to load ggml files, from this release you should use llama-stable instead, or not specify a backend at all (LocalAI will handle this automatically).

Image generation enhancements

The Diffusers backend received various enhancements, including support for generating images from images, longer prompts, and more kernel schedulers. See the Diffusers documentation for more information.

Lora adapters

Now it’s possible to load lora adapters for llama.cpp. See 955 for more information.

Device management

It is now possible for single devices with one GPU to specify --single-active-backend to allow only one backend to be active at a time (925).

Community spotlight

Resources management

Thanks to continuous community efforts (another cool contribution from dave-gray101), it is now possible to shut down a backend programmatically via the API. There is an ongoing effort in the community to handle resources better. See also the 🔥Roadmap.

New how-to section

Thanks to community efforts, we now have a new how-to website with various examples of how to use LocalAI. This is a great starting point for new users! We are currently working on improving it; a huge shout out to lunamidori5 from the community for the impressive efforts on this!

💡 More examples!

LocalAGI in discord!

Did you know that we now have a few cool bots in our Discord? Come check them out! We also have an instance of LocalAGI ready to help you out!

Changelog summary

Breaking Changes 🛠

  • feat: bump llama.cpp, add gguf support by mudler in 943

Exciting New Features 🎉

  • feat(Makefile): allow to restrict backend builds by mudler in 890
  • feat(diffusers): various enhancements by mudler in 895
  • feat: make initializer accept gRPC delay times by mudler in 900
  • feat(diffusers): add DPMSolverMultistepScheduler++, DPMSolverMultistepSchedulerSDE++, guidance_scale by mudler in 903
  • feat(diffusers): overcome prompt limit by mudler in 904
  • feat(diffusers): add img2img and clip_skip, support more kernels schedulers by mudler in 906
  • Usage Features by dave-gray101 in 863
  • feat(diffusers): be consistent with pipelines, support also depthimg2img by mudler in 926
  • feat: add –single-active-backend to allow only one backend active at the time by mudler in 925
  • feat: add llama-stable backend by mudler in 932
  • feat: allow to customize rwkv tokenizer by dave-gray101 in 937
  • feat: backend monitor shutdown endpoint, process based by dave-gray101 in 938
  • feat: Allow to load lora adapters for llama.cpp by mudler in 955

Join our Discord community! our vibrant community is growing fast, and we are always happy to help! https://discord.gg/uJAeKSAGDy

The full changelog is available here.


🔥🔥🔥🔥 12-08-2023: v1.24.0 🔥🔥🔥🔥

This release brings four(!) new backends to LocalAI: 🐶 Bark, 🦙 AutoGPTQ, 🧨 Diffusers, and 🦙 exllama, plus a lot of improvements!

Major improvements:

🐶 Bark

Bark is a text-prompted generative audio model - it uses GPT-style techniques to generate audio from text. It is a great addition to LocalAI, and it's available in the container images by default.

It can also generate music, see the example: lion.webm

🦙 AutoGPTQ

AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

It is targeted mainly at GPU usage. Check out the documentation for usage.

🦙 Exllama

Exllama is "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". It is a faster alternative for running LLaMA models on GPU. Check out the Exllama documentation for usage.

🧨 Diffusers

Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Currently it is experimental and supports image generation only, so you might encounter issues with models that haven't been tested yet. Check out the Diffusers documentation for usage.

🔑 API Keys

Thanks to community contributions, it's now possible to specify a list of API keys that can be used to gate API requests.

API Keys can be specified with the API_KEY environment variable as a comma-separated list of keys.
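
For example (keys are illustrative), you could start the container with API_KEY set and then authenticate requests with an OpenAI-style Bearer token:

docker run -p 8080:8080 -e API_KEY="key-1,key-2" -ti localai/localai:latest

curl http://localhost:8080/v1/models -H "Authorization: Bearer key-1"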

🖼️ Galleries

The model-gallery repositories are now configured by default in the container images.

💡 New project

LocalAGI is a simple agent that uses LocalAI functions to have a full locally runnable assistant (with no API keys needed).

See it here in action planning a trip for San Francisco!

The full changelog is available here.


🔥🔥 29-07-2023: v1.23.0 🚀

This release focuses mostly on bugfixing and updates, with just a couple of new features:

  • feat: add rope settings and negative prompt, drop grammar backend by mudler in 797
  • Added CPU information to entrypoint.sh by @finger42 in 794
  • feat: cancel stream generation if client disappears by @tmm1 in 792

Most notably, this release brings important fixes for CUDA (and not only):

  • fix: add rope settings during model load, fix CUDA by mudler in 821
  • fix: select function calls if ’name’ is set in the request by mudler in 827
  • fix: symlink libphonemize in the container by mudler in 831
Note

From this release OpenAI functions are available in the llama backend. The llama-grammar has been deprecated. See also OpenAI functions.

The full changelog is available here


🔥🔥🔥 23-07-2023: v1.22.0 🚀

  • feat: add llama-master backend by mudler in 752
  • [build] pass build type to cmake on libtransformers.a build by @TonDar0n in 741
  • feat: resolve JSONSchema refs (planners) by mudler in 774
  • feat: backends improvements by mudler in 778
  • feat(llama2): add template for chat messages by dave-gray101 in 782
Note

From this release, to use OpenAI functions you need to use the llama-grammar backend. A llama backend has been added for tracking llama.cpp master, and llama-grammar for the grammar functionality that has not yet been merged upstream. See also OpenAI functions. Until the feature is merged we will have two llama backends.

Huggingface embeddings

In this release it is now possible to specify external gRPC backends to LocalAI that can be used for inferencing ( 778 ). Backends can now be written in any language, and a huggingface-embeddings backend is available in the container image to be used with https://github.com/UKPLab/sentence-transformers. See also Embeddings.

LLaMa 2 has been released!

Thanks to community efforts, LocalAI now supports templating for LLaMa2! More at 782 , until we update the model gallery with LLaMa2 models.

Official langchain integration

Progress has been made to support LocalAI with langchain. See: https://github.com/langchain-ai/langchain/pull/8134


🔥🔥🔥 17-07-2023: v1.21.0 🚀

  • [whisper] Partial support for verbose_json format in transcribe endpoint by @ldotlopez in 721
  • LocalAI functions by @mudler in 726
  • gRPC-based backends by @mudler in 743
  • falcon support (7b and 40b) with ggllm.cpp by @mudler in 743

LocalAI functions

This allows running OpenAI functions as described in the OpenAI blog post and documentation: https://openai.com/blog/function-calling-and-other-api-updates.

This is a video of running the same example locally with LocalAI (video: localai-functions-1).

And here is when it decides to reply to the user instead of using functions (video: functions-2).

Note: functions are supported only with llama.cpp-compatible models.

A full example is available here: https://github.com/mudler/LocalAI-examples/tree/main/functions

gRPC backends

This is an internal refactor which is not user-facing; however, it eases maintenance and the addition of new backends to LocalAI!

falcon support

Now Falcon 7b and 40b models compatible with https://github.com/cmp-nct/ggllm.cpp are supported as well.

The former, ggml-based backend has been renamed to falcon-ggml.

Default pre-compiled binaries

From this release the default behavior of the images has changed: compilation is no longer triggered automatically on start. To recompile local-ai from scratch on start and switch back to the old behavior, you can set REBUILD=true in the environment variables. Rebuilding can be necessary if your CPU and/or architecture is old and the pre-compiled binaries are not compatible with your platform. See the build section for more information.
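
For example, a sketch of switching back to the old rebuild-on-start behavior with Docker (image tag is illustrative):

docker run -p 8080:8080 -e REBUILD=true -v $PWD/models:/models -ti quay.io/go-skynet/local-ai:latest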

Full release changelog


🔥🔥🔥 28-06-2023: v1.20.0 🚀

Exciting New Features 🎉

Container images

  • Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.20.0
  • FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-ffmpeg
  • CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-gpu-nvidia-cuda11-ffmpeg
  • CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-gpu-nvidia-cuda12-ffmpeg

Updates

Updates to llama.cpp, go-transformers, gpt4all.cpp and rwkv.cpp.

The NUMA option was enabled by mudler in 684 , along with many new parameters (mmap, mmlock, ...). See advanced for the full list of parameters.

In this release there is support for gallery repositories. These are repositories that contain models, and can be used to install models. The default gallery which contains only freely licensed models is in Github: https://github.com/go-skynet/model-gallery, but you can use your own gallery by setting the GALLERIES environment variable. An automatic index of huggingface models is available as well.

For example, now you can start LocalAI with the following environment variable to use both galleries:

GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:ci-robbot/localai-huggingface-zoo/index.yaml","name":"huggingface"}]

And in runtime you can install a model from huggingface now with:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{ "id": "huggingface@thebloke__open-llama-7b-open-instruct-ggml__open-llama-7b-open-instruct.ggmlv3.q4_0.bin" }'

or a tts voice with:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{ "id": "model-gallery@voice-en-us-kathleen-low" }'

See also models for a complete documentation.

Text to Audio

Now LocalAI uses piper and go-piper to generate audio from text. This is an experimental feature, and it requires GO_TAGS=tts to be set during build. It is enabled by default in the pre-built container images.

To setup audio models, you can use the new galleries, or setup the models manually as described in the API section of the documentation.

You can check the full changelog in Github


🔥🔥🔥 19-06-2023: v1.19.0 🚀

  • Full CUDA GPU offload support ( PR by mudler. Thanks to chnyda for handing over the GPU access, and lu-zero to help in debugging )
  • Full GPU Metal Support is now fully functional. Thanks to Soleblaze to iron out the Metal Apple silicon support!

Container images:

  • Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.19.2
  • FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-ffmpeg
  • CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-gpu-nvidia-cuda11-ffmpeg
  • CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-gpu-nvidia-cuda12-ffmpeg

🔥🔥🔥 06-06-2023: v1.18.0 🚀

This LocalAI release is packed with new features, bugfixes and updates! Thanks to the community for the help - this was a great community release!

We now support a vast variety of models while staying backward compatible with prior quantization formats: this new release can still load older formats as well as the new k-quants!

New features

  • ✨ Added support for falcon-based model families (7b) ( mudler )
  • ✨ Experimental support for Metal Apple Silicon GPU - ( mudler and thanks to Soleblaze for testing! ). See the build section.
  • ✨ Support for token stream in the /v1/completions endpoint ( samm81 )
  • ✨ Added huggingface backend ( Evilfreelancer )
  • 📷 Stablediffusion now can output 2048x2048 images size with esrgan! ( mudler )

Container images

Dependencies updates

  • 🆙 Bloomz has been updated to the latest ggml changes, including new quantization format ( mudler )
  • 🆙 RWKV has been updated to the new quantization format( mudler )
  • 🆙 k-quants format support for the llama models ( mudler )
  • 🆙 gpt4all has been updated, incorporating upstream changes that allow loading older models, and supporting different CPU instruction sets (AVX only, AVX2) from the same binary! ( mudler )

Generic

  • 🐧 Fully Linux static binary releases ( mudler )
  • 📷 Stablediffusion has been enabled on container images by default ( mudler ). Note: You can disable container image rebuilds with REBUILD=false

Examples

Two new projects now offer direct integration with LocalAI!

Full release changelog


29-05-2023: v1.17.0

Support for OpenCL has been added while building from sources.

You can now build LocalAI from source with BUILD_TYPE=clblas to have an OpenCL build. See also the build section.

For instructions on how to install OpenCL/CLBlast see here.

rwkv.cpp has been updated to the new ggml format commit.


27-05-2023: v1.16.0

Now it’s possible to automatically download pre-configured models before starting the API.

Start local-ai with the PRELOAD_MODELS containing a list of models from the gallery, for instance to install gpt4all-j as gpt-3.5-turbo:

PRELOAD_MODELS=[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"}]
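
For example, a minimal sketch of passing this with Docker (image tag is illustrative):

docker run -p 8080:8080 -v $PWD/models:/models \
  -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/gpt4all-j.yaml", "name": "gpt-3.5-turbo"}]' \
  -ti quay.io/go-skynet/local-ai:latest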

llama.cpp models now can also automatically save the prompt cache state as well by specifying in the model YAML configuration file:

prompt_cache_path: "alpaca-cache"

prompt_cache_all: true

See also the advanced section.

Media, Blogs, Social

Previous

  • 23-05-2023: v1.15.0 released. The go-gpt2.cpp backend was renamed to go-ggml-transformers.cpp and updated, including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. This impacts RedPajama, GptNeoX, MPT (not gpt4all-mpt), Dolly, GPT2 and Starcoder based models. Binary releases available, various fixes, including 341 .
  • 21-05-2023: v1.14.0 released. Minor updates to the /models/apply endpoint, llama.cpp backend updated including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. gpt4all is still compatible with the old format.
  • 19-05-2023: v1.13.0 released! 🔥🔥 Updates to the gpt4all and llama backends, consolidated CUDA support ( 310 thanks to @bubthegreat and @Thireus ), preliminary support for installing models via API.
  • 17-05-2023: v1.12.0 released! 🔥🔥 Minor fixes, plus CUDA ( 258 ) support for llama.cpp-compatible models and image generation ( 272 ).
  • 16-05-2023: 🔥🔥🔥 Experimental support for CUDA ( 258 ) in the llama.cpp backend and Stable diffusion CPU image generation ( 272 ) in master.

Now LocalAI can generate images too:

(Example image outputs for mode=0 and mode=1 (winograd/sgemm))
  • 14-05-2023: v1.11.1 released! rwkv backend patch release
  • 13-05-2023: v1.11.0 released! 🔥 Updated llama.cpp bindings: This update includes a breaking change in the model files ( https://github.com/ggerganov/llama.cpp/pull/1405 ) - old models should still work with the gpt4all-llama backend.
  • 12-05-2023: v1.10.0 released! 🔥🔥 Updated gpt4all bindings. Added support for GPTNeox (experimental), RedPajama (experimental), Starcoder (experimental), Replit (experimental), MosaicML MPT. Also now embeddings endpoint supports tokens arrays. See the langchain-chroma example! Note - this update does NOT include https://github.com/ggerganov/llama.cpp/pull/1405 which makes models incompatible.
  • 11-05-2023: v1.9.0 released! 🔥 Important whisper updates ( 233 229 ) and extended gpt4all model families support ( 232 ). Redpajama/dolly experimental ( 214 )
  • 10-05-2023: v1.8.0 released! 🔥 Added support for fast and accurate embeddings with bert.cpp ( 222 )
  • 09-05-2023: Added experimental support for transcriptions endpoint ( 211 )
  • 08-05-2023: Support for embeddings with models using the llama.cpp backend ( 207 )
  • 02-05-2023: Support for rwkv.cpp models ( 158 ) and for /edits endpoint
  • 01-05-2023: Support for SSE stream of tokens in llama.cpp backends ( 152 )
Chapter 8

Features

LocalAI provides a comprehensive set of features for running AI models locally. This section covers all the capabilities and functionalities available in LocalAI.

Core Features

  • Text Generation - Generate text with GPT-compatible models using various backends
  • Image Generation - Create images with Stable Diffusion and other diffusion models
  • Audio Processing - Transcribe audio to text and generate speech from text
  • Embeddings - Generate vector embeddings for semantic search and RAG applications
  • GPT Vision - Analyze and understand images with vision-language models

Advanced Features

Specialized Features

  • Object Detection - Detect and locate objects in images
  • Reranker - Improve retrieval accuracy with cross-encoder models
  • Stores - Vector similarity search for embeddings
  • Model Gallery - Browse and install pre-configured models
  • Backends - Learn about available backends and how to manage them
  • Runtime Settings - Configure application settings via web UI without restarting

Getting Started

To start using these features, make sure you have LocalAI installed and have downloaded some models. Then explore the feature pages above to learn how to use each capability.

Subsections of Features

⚙️ Backends

LocalAI supports a variety of backends that can be used to run different types of AI models. There are core backends which are included out of the box, and containerized backends that provide the runtime environment for specific model types, such as LLMs, diffusion models, or text-to-speech models.

Managing Backends in the UI

The LocalAI web interface provides an intuitive way to manage your backends:

  1. Navigate to the “Backends” section in the navigation menu
  2. Browse available backends from configured galleries
  3. Use the search bar to find specific backends by name, description, or type
  4. Filter backends by type using the quick filter buttons (LLM, Diffusion, TTS, Whisper)
  5. Install or delete backends with a single click
  6. Monitor installation progress in real-time

Each backend card displays:

  • Backend name and description
  • Type of models it supports
  • Installation status
  • Action buttons (Install/Delete)
  • Additional information via the info button

Backend Galleries

Backend galleries are repositories that contain backend definitions. They work similarly to model galleries but are specifically for backends.

You can add backend galleries by specifying the Environment Variable LOCALAI_BACKEND_GALLERIES:

export LOCALAI_BACKEND_GALLERIES='[{"name":"my-gallery","url":"https://raw.githubusercontent.com/username/repo/main/backends"}]'

The URL needs to point to a valid yaml file, for example:

- name: "test-backend"
  uri: "quay.io/image/tests:localai-backend-test"
  alias: "foo-backend"

Where URI is the path to an OCI container image.

A backend gallery is a collection of YAML files, each defining a backend. Here’s an example structure:

name: "llm-backend"
description: "A backend for running LLM models"
uri: "quay.io/username/llm-backend:latest"
alias: "llm"
tags:
  - "llm"
  - "text-generation"

Pre-installing Backends

You can pre-install backends when starting LocalAI using the LOCALAI_EXTERNAL_BACKENDS environment variable:

export LOCALAI_EXTERNAL_BACKENDS="llm-backend,diffusion-backend"
local-ai run

Creating a Backend

To create a new backend, you need to:

  1. Create a container image that implements the LocalAI backend interface
  2. Define a backend YAML file
  3. Publish your backend to a container registry

Backend Container Requirements

Your backend container should:

  1. Implement the LocalAI backend interface (gRPC or HTTP)
  2. Handle model loading and inference
  3. Support the required model types
  4. Include necessary dependencies
  5. Have a top-level run.sh file that will be used to run the backend
  6. Be pushed to a registry so it can be used in a gallery (see the sketch below)
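
As a rough sketch of such a container (base image, file names and entrypoint layout are illustrative assumptions, not a reference implementation):

FROM python:3.11-slim
WORKDIR /backend
# The backend implementation (for example a gRPC server) and its dependencies
COPY backend.py requirements.txt ./
RUN pip install -r requirements.txt
# Top-level run.sh used to start the backend (exact expected location is an assumption)
COPY run.sh ./run.sh
RUN chmod +x ./run.sh
ENTRYPOINT ["./run.sh"]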

Getting started

For getting started, see the available backends in LocalAI here: https://github.com/mudler/LocalAI/tree/master/backend .

Publishing Your Backend

  1. Build your container image:

    docker build -t quay.io/username/my-backend:latest .
  2. Push to a container registry:

    docker push quay.io/username/my-backend:latest
  3. Add your backend to a gallery:

    • Create a YAML entry in your gallery repository
    • Include the backend definition
    • Make the gallery accessible via HTTP/HTTPS

Backend Types

LocalAI supports various types of backends:

  • LLM Backends: For running language models
  • Diffusion Backends: For image generation
  • TTS Backends: For text-to-speech conversion
  • Whisper Backends: For speech-to-text conversion

⚡ GPU acceleration

Details

Section under construction

This section contains instructions on how to use LocalAI with GPU acceleration.

Details

Acceleration for AMD or Metal hardware is still in development; for additional details see the build section.

Automatic Backend Detection

When you install a model from the gallery (or a YAML file), LocalAI intelligently detects the required backend and your system’s capabilities, then downloads the correct version for you. Whether you’re running on a standard CPU, an NVIDIA GPU, an AMD GPU, or an Intel GPU, LocalAI handles it automatically.

For advanced use cases or to override auto-detection, you can use the LOCALAI_FORCE_META_BACKEND_CAPABILITY environment variable. Here are the available options:

  • default: Forces CPU-only backend. This is the fallback if no specific hardware is detected.
  • nvidia: Forces backends compiled with CUDA support for NVIDIA GPUs.
  • amd: Forces backends compiled with ROCm support for AMD GPUs.
  • intel: Forces backends compiled with SYCL/oneAPI support for Intel GPUs.
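
For example, to force the CUDA-enabled backends regardless of what is auto-detected (the run command is illustrative):

export LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
local-ai run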

Model configuration

Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. It is required to configure the model you intend to use with a YAML config file. For example, for llama.cpp workloads a configuration file might look like this (where gpu_layers is the number of layers to offload to the GPU):

name: my-model-name
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin

context_size: 1024
threads: 1

f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)

For diffusers instead, it might look like this instead:

name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"

CUDA(NVIDIA) acceleration

Requirements

Requirement: nvidia-container-toolkit (installation instructions 1 2)

If using a system with SELinux, ensure you have the policies installed, such as those provided by nvidia

To check which CUDA version you need, you can either run nvidia-smi or nvcc --version.

Alternatively, you can also check nvidia-smi with docker:

docker run --runtime=nvidia --rm nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi

To use CUDA, use the images with the cublas tag, for example the ones listed below.

The image list is on quay:

  • CUDA 11 tags: master-gpu-nvidia-cuda-11, v1.40.0-gpu-nvidia-cuda-11, …
  • CUDA 12 tags: master-gpu-nvidia-cuda-12, v1.40.0-gpu-nvidia-cuda-12, …

In addition to the commands to run LocalAI normally, you need to specify --gpus all to docker, for example:

docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-gpu-nvidia-cuda12

If the GPU inferencing is working, you should be able to see something like:

5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size  =  512.00 MB

ROCM(AMD) acceleration

There are a limited number of tested configurations for ROCm systems; however, most newer dedicated consumer-grade GPUs seem to be supported under the current ROCm 6 implementation.

Due to the nature of ROCm, it is best to run all implementations in containers, as this limits the number of packages required for installation on the host system. Compatibility and package versions for dependencies across all OS variations must be tested independently if desired; please refer to the build documentation.

Requirements

  • ROCm 6.x.x compatible GPU/accelerator
  • OS: Ubuntu (22.04, 20.04), RHEL (9.3, 9.2, 8.9, 8.8), SLES (15.5, 15.4)
  • Installed to host: amdgpu-dkms and rocm >=6.0.0 as per ROCm documentation.

Recommendations

  • Make sure not to use a GPU assigned for compute for desktop rendering.
  • Ensure at least 100GB of free space on the disk hosting the container runtime and storing images prior to installation.

Limitations

Verification testing of ROCm compatibility with the integrated backends is ongoing. Please note the following list of verified backends and devices.

LocalAI hipblas images are built against the following targets: gfx900,gfx906,gfx908,gfx940,gfx941,gfx942,gfx90a,gfx1030,gfx1031,gfx1100,gfx1101

If your device is not one of these, you must specify the corresponding GPU_TARGETS and set REBUILD=true. Otherwise you don't need to specify these in the commands below.

Verified

The devices in the following list have been tested with hipblas images running ROCm 6.0.0

Backend | Verified | Devices
llama.cpp | yes | Radeon VII (gfx906)
diffusers | yes | Radeon VII (gfx906)
piper | yes | Radeon VII (gfx906)
whisper | no | none
bark | no | none
coqui | no | none
transformers | no | none
exllama | no | none
exllama2 | no | none
mamba | no | none
sentencetransformers | no | none
transformers-musicgen | no | none
vall-e-x | no | none
vllm | no | none

You can help by expanding this list.

System Prep

  1. Check your GPU LLVM target is compatible with the version of ROCm. This can be found in the LLVM Docs.
  2. Check which ROCm version is compatible with your LLVM target and your chosen OS (pay special attention to supported kernel versions). See the following for compatibility for (ROCm 6.0.0) or (ROCm 6.0.2)
  3. Install your chosen version of the dkms and rocm packages (it is recommended to use the native package manager for this process on any OS, as version changes are executed more easily via this method if updates are required). Take care to restart after installing amdgpu-dkms and before installing rocm; for details see the installation documentation for your chosen OS (6.0.2 or 6.0.0)
  4. Deploy. Yes it’s that easy.

Setup Example (Docker/containerd)

The following are examples of the ROCm specific configuration elements required.

    # For full functionality select a non-'core' image, version locking the image is recommended for debug purposes.
    image: quay.io/go-skynet/local-ai:master-aio-gpu-hipblas
    environment:
      - DEBUG=true
      # If your gpu is not already included in the current list of default targets the following build details are required.
      - REBUILD=true
      - BUILD_TYPE=hipblas
      - GPU_TARGETS=gfx906 # Example for Radeon VII
    devices:
      # AMD GPU only require the following devices be passed through to the container for offloading to occur.
      - /dev/dri
      - /dev/kfd

The same can also be executed as a run command for your container runtime:

docker run \
 -e DEBUG=true \
 -e REBUILD=true \
 -e BUILD_TYPE=hipblas \
 -e GPU_TARGETS=gfx906 \
 --device /dev/dri \
 --device /dev/kfd \
 quay.io/go-skynet/local-ai:master-aio-gpu-hipblas

Please ensure to add all other required environment variables, port forwardings, etc to your compose file or run command.

The rebuild process will take some time to complete when deploying these containers, and it is recommended that you pull the image prior to deployment, as depending on the version these images may be ~20GB in size.

Example (k8s) (Advanced Deployment/WIP)

For k8s deployments there is an additional step required before deployment: the deployment of the ROCm/k8s-device-plugin. For any k8s environment, following the documentation provided by AMD for the ROCm project should be successful. If you use rke2 or OpenShift, it is recommended that you deploy the SUSE or Red Hat provided version of this resource to ensure compatibility. After this has been completed, the helm chart from go-skynet can be configured and deployed mostly un-edited.

The following are details of the changes that should be made to ensure proper function. While these details may be configurable in values.yaml, development of this Helm chart is ongoing and is subject to change.

The following details indicate the final state of the localai deployment relevant to GPU function.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {NAME}-local-ai
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - env:
            - name: HIP_VISIBLE_DEVICES
              value: '0'
              # This variable indicates the devices available to container (0:device1 1:device2 2:device3) etc.
              # For multiple devices (say device 1 and 3) the value would be equivalent to HIP_VISIBLE_DEVICES="0,2"
              # Please take note of this when an iGPU is present in host system as compatibility is not assured.
          ...
          resources:
            limits:
              amd.com/gpu: '1'
            requests:
              amd.com/gpu: '1'

This configuration has been tested on a 'custom' cluster managed by SUSE Rancher that was deployed on top of Ubuntu 22.04.4; certification of other configurations is ongoing and compatibility is not guaranteed.

Notes

  • When installing the ROCm kernel driver on your system, ensure that you are installing an equal or newer version than the one currently implemented in LocalAI (6.0.0 at time of writing).
  • AMD documentation indicates that this will ensure functionality; however, your mileage may vary depending on the GPU and distro you are using.
  • If you encounter an Error 413 on attempting to upload an audio file or image for whisper or llava/bakllava on a k8s deployment, note that the ingress for your deployment may require the annotation nginx.ingress.kubernetes.io/proxy-body-size: "25m" to allow larger uploads. This may be included in future versions of the helm chart.

Intel acceleration (sycl)

Requirements

If building from source, you need to install Intel oneAPI Base Toolkit and have the Intel drivers available in the system.

Container images

To use SYCL, use the images with gpu-intel in the tag, for example v3.7.0-gpu-intel, …

The image list is on quay.

Example

To run LocalAI with Docker and sycl starting phi-2, you can use the following command as an example:

docker run -e DEBUG=true --privileged -ti -v $PWD/models:/models -p 8080:8080  -v /dev/dri:/dev/dri --rm quay.io/go-skynet/local-ai:master-gpu-intel phi-2

Notes

In addition to the commands to run LocalAI normally, you need to specify --device /dev/dri to docker, for example:

docker run --rm -ti --device /dev/dri -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v3.7.0-gpu-intel

Note also that SYCL has a known issue where it hangs with mmap: true. You have to disable it in the model configuration if it is explicitly enabled.
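
A minimal sketch of disabling mmap in a model configuration (model name and file are illustrative):

name: my-sycl-model
parameters:
  model: model.gguf
mmap: false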

Vulkan acceleration

Requirements

If using nvidia, follow the steps in the CUDA section to configure your docker runtime to allow access to the GPU.

Container images

To use Vulkan, use the images with the vulkan tag, for example v3.7.0-gpu-vulkan.

Example

To run LocalAI with Docker and Vulkan, you can use the following command as an example:

docker run -p 8080:8080 -e DEBUG=true -v $PWD/models:/models localai/localai:latest-gpu-vulkan

Notes

In addition to the commands to run LocalAI normally, you need to specify additional flags to pass the GPU hardware to the container.

These flags are the same as the sections above, depending on the hardware, for nvidia, AMD or Intel.

If you have mixed hardware, you can pass flags for multiple GPUs, for example:

docker run -p 8080:8080 -e DEBUG=true -v $PWD/models:/models \
--gpus=all \ # nvidia passthrough
--device /dev/dri --device /dev/kfd \ # AMD/Intel passthrough
localai/localai:latest-gpu-vulkan

📖 Text generation (GPT)

LocalAI supports generating text with GPT-style models using llama.cpp and other backends (such as rwkv.cpp). See also Model compatibility for an up-to-date list of the supported model families.

Note:

  • You can also specify the model name as part of the OpenAI token.
  • If only one model is available, the API will use it for all the requests.

API Reference

Chat completions

https://platform.openai.com/docs/api-reference/chat

For example, to generate a chat completion, you can send a POST request to the /v1/chat/completions endpoint with the instruction as the request body:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens

Edit completions

https://platform.openai.com/docs/api-reference/edits

To generate an edit completion you can send a POST request to the /v1/edits endpoint with the instruction as the request body:

curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "instruction": "rephrase",
  "input": "Black cat jumped out of the window",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens.

Completions

https://platform.openai.com/docs/api-reference/completions

To generate a completion, you can send a POST request to the /v1/completions endpoint with the instruction as per the request body:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "prompt": "A long time ago in a galaxy far, far away",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens

List models

You can list all the models available with:

curl http://localhost:8080/v1/models

Backends

RWKV

RWKV support is available through llama.cpp (see below)

llama.cpp

llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.

Note

The ggml file format has been deprecated. If you are using ggml models and are configuring your model with a YAML file, use a LocalAI version older than v2.25.0. For gguf models, use the llama backend. The go backend is deprecated as well, but is still available as go-llama.

Features

The llama.cpp backend supports the following features:

Setup

LocalAI supports llama.cpp models out of the box. You can use the llama.cpp model in the same way as any other model.

Manual setup

It is sufficient to copy the ggml or gguf model files into the models folder. You can then refer to the model via the model parameter in the API calls.

You can optionally create an associated YAML model config file to tune the model’s parameters or apply a template to the prompt.

Prompt templates are useful for models that are fine-tuned towards a specific prompt.
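
For instance, a minimal config sketch that applies a completion template to a gguf model (file name and template text are illustrative):

name: my-model
parameters:
  model: my-model.gguf
template:
  completion: |
    Complete the following text: {{.Input}}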

Automatic setup

LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.

For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'

LocalAI will automatically download and configure the model in the model directory.

Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.

YAML configuration

To use the llama.cpp backend, specify llama-cpp as the backend in the YAML file:

name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf

Backend Options

The llama.cpp backend supports additional configuration options that can be specified in the options field of your model YAML configuration. These options allow fine-tuning of the backend behavior:

Option | Type | Description | Example
use_jinja or jinja | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | use_jinja:true
context_shift | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | context_shift:true
cache_ram | integer | Set the maximum RAM cache size in MiB for the KV cache. Use -1 for unlimited (default). | cache_ram:2048
parallel or n_parallel | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | parallel:4
grpc_servers or rpc_servers | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | grpc_servers:localhost:50051,localhost:50052

Example configuration with options:

name: llama-model
backend: llama
parameters:
  model: model.gguf
options:
  - use_jinja:true
  - context_shift:true
  - cache_ram:4096
  - parallel:2

Note: The parallel option can also be set via the LLAMACPP_PARALLEL environment variable, and grpc_servers can be set via the LLAMACPP_GRPC_SERVERS environment variable. Options specified in the YAML file take precedence over environment variables.
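
Equivalently, a sketch of setting the same options via environment variables before starting LocalAI:

export LLAMACPP_PARALLEL=2
export LLAMACPP_GRPC_SERVERS="localhost:50051,localhost:50052"
local-ai run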

Reference

exllama/2

Exllama is "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". Both exllama and exllama2 are supported.

Model setup

Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:

$ git lfs install
$ cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ
$ ls models/                                                                 
.keep                        WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml                                                     
name: exllama
parameters:
  model: WizardLM-7B-uncensored-GPTQ
backend: exllama

Test with:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{                                                                                                         
   "model": "exllama",
   "messages": [{"role": "user", "content": "How are you?"}],
   "temperature": 0.1
 }'

vLLM

vLLM is a fast and easy-to-use library for LLM inference.

LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out vllm performance here.

Setup

Create a YAML file for the model you want to use with vllm.

To set up a model, you just need to specify the model name in the YAML config file:

name: vllm
backend: vllm
parameters:
    model: "facebook/opt-125m"

The backend will automatically download the required files in order to run the model.

Usage

Use the completions endpoint by specifying the vllm backend:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{   
   "model": "vllm",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'

Transformers

Transformers is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.

LocalAI has a built-in integration with Transformers, and it can be used to run models.

This is an extra backend: it is already available in the container images (the extra images already contain the Python dependencies for Transformers), and there is nothing to do for the setup.

Setup

Create a YAML file for the model you want to use with transformers.

To set up a model, you just need to specify the model name in the YAML config file:

name: transformers
backend: transformers
parameters:
    model: "facebook/opt-125m"
type: AutoModelForCausalLM
quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)

The backend will automatically download the required files in order to run the model.

Parameters

Type
Type | Description
AutoModelForCausalLM | A model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extensions for PyTorch acceleration
OVModelForCausalLM | For Intel CPU/GPU/NPU OpenVINO Text Generation models
OVModelForFeatureExtraction | For Intel CPU/GPU/NPU OpenVINO Embedding acceleration
N/A | Defaults to AutoModel
  • OVModelForCausalLM requires OpenVINO IR Text Generation models from Hugging face
  • OVModelForFeatureExtraction works with any Safetensors Transformer Feature Extraction model from Huggingface (Embedding Model)

Please note that streaming is currently not implemented in AutoModelForCausalLM for Intel GPU. AMD GPU support is not implemented. Although AMD CPUs are not officially supported by OpenVINO, there are reports that they work: YMMV.

Embeddings

Use embeddings: true if the model is an embedding model
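
A minimal sketch of an embeddings model configuration (model name is illustrative):

name: my-embeddings
backend: transformers
parameters:
  model: "sentence-transformers/all-MiniLM-L6-v2"
embeddings: true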

Inference device selection

The Transformers backend tries to automatically select the best device for inference; you can override this decision manually with the main_gpu parameter.

Inference Engine | Applicable Values
CUDA | cuda, cuda.X where X is the GPU device, as in the nvidia-smi -L output
OpenVINO | Any applicable value from the Inference Modes, like AUTO, CPU, GPU, NPU, MULTI, HETERO

Example for CUDA: main_gpu: cuda.0

Example for OpenVINO: main_gpu: AUTO:-CPU

This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.

Inference Precision

The Transformers backend automatically selects the fastest applicable inference precision according to device support. On CUDA, you can manually enable bfloat16 if your hardware supports it with the following parameter:

f16: true

Quantization
Quantization | Description
bnb_8bit | 8-bit quantization
bnb_4bit | 4-bit quantization
xpu_8bit | 8-bit quantization for Intel XPUs
xpu_4bit | 4-bit quantization for Intel XPUs

Trust Remote Code

Some models, like Microsoft Phi-3, require external code beyond what is provided by the transformers library. By default this is disabled for security reasons. It can be manually enabled with: trust_remote_code: true

Maximum Context Size

The maximum context size (in tokens) can be specified with the context_size parameter. Do not use values higher than what your model supports.

Usage example: context_size: 8192

Auto Prompt Template

Usually the chat template is defined by the model author in the tokenizer_config.json file. To enable it, use the use_tokenizer_template: true parameter in the template section.

Usage example:

template:
  use_tokenizer_template: true
Custom Stop Words

Stopwords are usually defined in the tokenizer_config.json file. They can be overridden with the stopwords parameter when needed, as for the llama3-Instruct model.

Usage example:

stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"

Usage

Use the completions endpoint by specifying the transformers model:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{   
   "model": "transformers",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'

Examples

OpenVINO

A model configuration file for OpenVINO and the Starling model:

name: starling-openvino
backend: transformers
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
  chat_message: |
    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}

  chat: |
    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:

  completion: |
    {{.Input}}

📈 Reranker

A reranking model, often referred to as a cross-encoder, is a core component in the two-stage retrieval systems used in information retrieval and natural language processing tasks. Given a query and a set of documents, it will output similarity scores.

We can then use the scores to reorder the documents by relevance in our RAG system, increasing its overall accuracy and filtering out non-relevant results.


LocalAI supports reranker models, and you can use them via the rerankers backend, which is based on the rerankers library.

Usage

You can test rerankers by using container images with python (this does NOT work with core images) and a model config file like this, or by installing cross-encoder from the gallery in the UI:

name: jina-reranker-v1-base-en
backend: rerankers
parameters:
  model: cross-encoder

and test it with:

    curl http://localhost:8080/v1/rerank \
      -H "Content-Type: application/json" \
      -d '{
      "model": "jina-reranker-v1-base-en",
      "query": "Organic skincare products for sensitive skin",
      "documents": [
        "Eco-friendly kitchenware for modern homes",
        "Biodegradable cleaning supplies for eco-conscious consumers",
        "Organic cotton baby clothes for sensitive skin",
        "Natural organic skincare range for sensitive skin",
        "Tech gadgets for smart homes: 2024 edition",
        "Sustainable gardening tools and compost solutions",
        "Sensitive skin-friendly facial cleansers and toners",
        "Organic food wraps and storage solutions",
        "All-natural pet food for dogs with allergies",
        "Yoga mats made from recycled materials"
      ],
      "top_n": 3
    }'

🗣 Text to audio (TTS)

API Compatibility

The LocalAI TTS API is compatible with the OpenAI TTS API and the Elevenlabs API.

LocalAI API

The /tts endpoint can also be used to generate speech from text.

Usage

Input: input, model

For example, to generate an audio file, you can send a POST request to the /tts endpoint with the instruction as the request body:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}'

Returns an audio/wav file.
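
To save the result to a file instead of streaming raw bytes to the terminal, you can redirect the output, for example:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}' --output hello.wav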

Backends

🐸 Coqui

Required: Don't use LocalAI images ending with the -core tag. Python dependencies are required in order to use this backend.

Coqui works without any configuration; to test it, you can run the following curl command:

    curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
        "backend": "coqui",
        "model": "tts_models/en/ljspeech/glow-tts",
        "input":"Hello, this is a test!"
        }'

You can use the env variable COQUI_LANGUAGE to set the language used by the coqui backend.

You can also use config files to configure tts models (see section below on how to use config files).

Bark

Bark allows generating audio from text prompts.

This is an extra backend: it is already available in the container images and there is nothing to do for the setup.

Model setup

There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.

Usage

Use the tts endpoint by specifying the bark backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "bark",
     "input":"Hello!"
   }' | aplay

To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the model parameter:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "bark",
     "input":"Hello!",
     "model": "v2/en_speaker_4"
   }' | aplay

Piper

To install the piper audio models manually:

To use the tts endpoint, run the following command. You can specify a backend with the backend parameter. For example, to use the piper backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model":"it-riccardo_fasol-x-low.onnx",
  "backend": "piper",
  "input": "Ciao, sono Ettore"
}' | aplay

Note:

  • aplay is a Linux command. You can use other tools to play the audio file.
  • The model name is the filename with the extension.
  • The model name is case sensitive.
  • LocalAI must be compiled with the GO_TAGS=tts flag.

Transformers-musicgen

LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:

curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-medium",
    "input": "Cello Rave"
}' | aplay

Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.

Vall-E-X

VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.

Setup

The backend will automatically download the required files in order to run the model.

This is an extra backend: it is already available in the container images and there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X first.

Usage

Use the tts endpoint by specifying the vall-e-x backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "vall-e-x",
     "input":"Hello!"
   }' | aplay

Voice cloning

In order to use voice cloning capabilities you must create a YAML configuration file to setup a model:

name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
    vall-e:
      # The path to the audio file to be cloned
      # relative to the models directory
      # Max 15s
      audio_path: "audio-sample.wav"

Then you can specify the model name in the requests:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "model": "cloned-voice",
     "input":"Hello!"
   }' | aplay

Using config files

You can also use a config-file to specify TTS models and their parameters.

In the following example we define a custom config to load the xtts_v2 model, and specify a voice and language.

name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2

tts:
  voice: Ana Florence

With this config, you can now use the following curl command to generate a text-to-speech audio file:

curl -L http://localhost:8080/tts \
    -H "Content-Type: application/json" \
    -d '{
"model": "xtts_v2",
"input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
}' | aplay

Response format

To provide compatibility with the OpenAI API's response_format parameter, ffmpeg must be installed (or a Docker image including ffmpeg must be used) so that the generated wav file can be converted before the API returns its response.

Warning regarding a change in behaviour: before this addition, the parameter was ignored and a wav file was always returned, with potential codec errors later in the integration (for example trying to decode an mp3 file, the default format used by OpenAI, when a wav was actually returned).

Supported formats, thanks to ffmpeg, are wav, mp3, aac, flac and opus, defaulting to wav if an unknown format or no format is provided.

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts",
  "response_format": "mp3"
}'

If a response_format is added in the query (other than wav) and ffmpeg is not available, the call will fail.

🎨 Image generation

(Example image generated with AnimagineXL)

LocalAI supports generating images with Stable diffusion, running on CPU using C++ and Python implementations.

Usage

OpenAI docs: https://platform.openai.com/docs/api-reference/images/create

To generate an image you can send a POST request to the /v1/images/generations endpoint with the instruction as the request body:

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "A cute baby sea otter",
  "size": "256x256"
}'

Available additional parameters: mode, step.

Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
  "size": "256x256"
}'

Backends

stablediffusion-ggml

This backend is based on stable-diffusion.cpp. Every model supported by that project is also supported by LocalAI.

Setup

There are already several models in the gallery that can be installed and run with this backend. For example, you can run flux by searching for it in the Model gallery (flux.1-dev-ggml), or by starting LocalAI with:

local-ai run flux.1-dev-ggml

To use a custom model, you can follow these steps:

  1. Create a model file stablediffusion.yaml in the models folder:
name: stablediffusion
backend: stablediffusion-ggml
parameters:
  model: gguf_model.gguf
step: 25
cfg_scale: 4.5
options:
- "clip_l_path:clip_l.safetensors"
- "clip_g_path:clip_g.safetensors"
- "t5xxl_path:t5xxl-Q5_0.gguf"
- "sampler:euler"
  2. Download the required assets to the models directory
  3. Start LocalAI

Diffusers

Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.

(Example image generated with AnimagineXL)

Model setup

The models will be downloaded automatically from Hugging Face the first time you use the backend.

Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:

name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers

f16: false
diffusers:
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a

Dependencies

This is an extra backend: it is already available in the container images and there is nothing to do for the setup. Do not use core images (ending with -core). If you are building manually, see the build instructions.

Model setup

The models will be downloaded automatically from Hugging Face the first time you use the backend.

Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CUDA:

name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
cuda: true
f16: true
diffusers:
  scheduler_type: euler_a

Local models

You can also use local models, or modify some parameters like clip_skip, scheduler_type, for instance:

name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
  clip_skip: 11

cfg_scale: 8

Configuration parameters

The following parameters are available in the configuration file:

Parameter | Description | Default
f16 | Force the usage of float16 instead of float32 | false
step | Number of steps to run the model for | 30
cuda | Enable CUDA acceleration | false
enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip
scheduler_type | Scheduler type | k_dpp_sde
cfg_scale | Configuration scale | 8
clip_skip | Clip skip | None
pipeline_type | Pipeline type | AutoPipelineForText2Image
lora_adapters | A list of lora adapters (file names relative to model directory) to apply | None
lora_scales | A list of lora scales (floats) to apply | None
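
For example, a sketch of applying LoRA adapters on top of a local model (adapter file name, scale, and the exact placement of these keys are assumptions):

name: stablediffusion-lora
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
cuda: true
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
lora_adapters:
  - "my_lora.safetensors"
lora_scales:
  - 0.8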

Several types of schedulers are available:

Scheduler | Description
ddim | DDIM
pndm | PNDM
heun | Heun
unipc | UniPC
euler | Euler
euler_a | Euler a
lms | LMS
k_lms | LMS Karras
dpm_2 | DPM2
k_dpm_2 | DPM2 Karras
dpm_2_a | DPM2 a
k_dpm_2_a | DPM2 a Karras
dpmpp_2m | DPM++ 2M
k_dpmpp_2m | DPM++ 2M Karras
dpmpp_sde | DPM++ SDE
k_dpmpp_sde | DPM++ SDE Karras
dpmpp_2m_sde | DPM++ 2M SDE
k_dpmpp_2m_sde | DPM++ 2M SDE Karras

Available pipeline types:

Pipeline type | Description
StableDiffusionPipeline | Stable diffusion pipeline
StableDiffusionImg2ImgPipeline | Stable diffusion image-to-image pipeline
StableDiffusionDepth2ImgPipeline | Stable diffusion depth-to-image pipeline
DiffusionPipeline | Diffusion pipeline
StableDiffusionXLPipeline | Stable diffusion XL pipeline
StableVideoDiffusionPipeline | Stable video diffusion pipeline
AutoPipelineForText2Image | Automatic detection pipeline for text-to-image
VideoDiffusionPipeline | Video diffusion pipeline
StableDiffusion3Pipeline | Stable diffusion 3 pipeline
FluxPipeline | Flux pipeline
FluxTransformer2DModel | Flux transformer 2D model
SanaPipeline | Sana pipeline

Advanced: Additional parameters

Additional arbitrary parameters can be specified in the options field as key/value pairs separated by a colon (:):

name: animagine-xl
options:
- "cfg_scale:6"

Note: There is no complete parameter list. Any parameter can be passed arbitrarily and is forwarded directly to the pipeline as an argument. Different pipelines/implementations support different parameters.

The example above will result in the following Python code when generating images:

pipe(
    prompt="A cute baby sea otter", # Options passed via API
    size="256x256", # Options passed via API
    cfg_scale=6 # Additional parameter passed via configuration file
)

Usage

Text to Image

Use the image generation endpoint with the model name from the configuration file:

curl http://localhost:8080/v1/images/generations \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "<positive prompt>|<negative prompt>", 
      "model": "animagine-xl", 
      "step": 51,
      "size": "1024x1024" 
    }'
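The same request can also be sent from Python using the OpenAI client pointed at LocalAI. A minimal sketch, assuming the animagine-xl configuration above and the openai package installed (the API key is a placeholder, since LocalAI does not require one by default):

from openai import OpenAI

# Point the OpenAI client at the local LocalAI instance
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

result = client.images.generate(
    model="animagine-xl",                        # model name from the YAML configuration above
    prompt="a cute baby sea otter|ugly, blurry", # positive|negative prompt syntax supported by LocalAI
    size="1024x1024",
)

# Depending on the backend configuration, each entry contains a URL or base64 data
print(result.data[0].url)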

Image to Image

https://huggingface.co/docs/diffusers/using-diffusers/img2img

An example model (GPU):

name: stablediffusion-edit
parameters:
  model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25
cuda: true
f16: true
diffusers:
  pipeline_type: StableDiffusionImg2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
Then send the input image and a prompt to the endpoint:

IMAGE_PATH=/path/to/your/image
(echo -n '{"file": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @-  http://localhost:8080/v1/images/generations
🖼️ Flux kontext with stable-diffusion.cpp

LocalAI supports Flux Kontext, which can be used to edit images via the API:

Install with:

local-ai run flux.1-kontext-dev

To test:

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "model": "flux.1-kontext-dev",
  "prompt": "change 'flux.cpp' to 'LocalAI'",
  "size": "256x256",
  "ref_images": [
  	"https://raw.githubusercontent.com/leejet/stable-diffusion.cpp/master/assets/flux/flux1-dev-q8_0.png"
  ]
}'

Depth to Image

https://huggingface.co/docs/diffusers/using-diffusers/depth2img

name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"

cfg_scale: 6
(echo -n '{"file": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @-  http://localhost:8080/v1/images/generations

img2vid

name: img2vid
parameters:
  model: stabilityai/stable-video-diffusion-img2vid
backend: diffusers
step: 25
f16: true
cuda: true
diffusers:
  pipeline_type: StableVideoDiffusionPipeline
(echo -n '{"file": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true","size": "512x512","model":"img2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations

txt2vid

name: txt2vid
parameters:
  model: damo-vilab/text-to-video-ms-1.7b
backend: diffusers
step: 25
f16: true
cuda: true
diffusers:
  pipeline_type: VideoDiffusionPipeline
  cuda: true
(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations

🔍 Object detection

LocalAI supports object detection through various backends. This feature allows you to identify and locate objects within images with high accuracy and real-time performance. Currently, RF-DETR is available as an implementation.

Overview

Object detection in LocalAI is implemented through dedicated backends that can identify and locate objects within images. Each backend provides different capabilities and model architectures.

Key Features:

  • Real-time object detection
  • High accuracy detection with bounding boxes
  • Support for multiple hardware accelerators (CPU, NVIDIA GPU, Intel GPU, AMD GPU)
  • Structured detection results with confidence scores
  • Easy integration through the /v1/detection endpoint

Usage

Detection Endpoint

LocalAI provides a dedicated /v1/detection endpoint for object detection tasks. This endpoint is specifically designed for object detection and returns structured detection results with bounding boxes and confidence scores.

API Reference

To perform object detection, send a POST request to the /v1/detection endpoint:

curl -X POST http://localhost:8080/v1/detection \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rfdetr-base",
    "image": "https://media.roboflow.com/dog.jpeg"
  }'

Request Format

The request body should contain:

  • model: The name of the object detection model (e.g., “rfdetr-base”)
  • image: The image to analyze, which can be:
    • A URL to an image
    • A base64-encoded image

Response Format

The API returns a JSON response with detected objects:

{
  "detections": [
    {
      "x": 100.5,
      "y": 150.2,
      "width": 200.0,
      "height": 300.0,
      "confidence": 0.95,
      "class_name": "dog"
    },
    {
      "x": 400.0,
      "y": 200.0,
      "width": 150.0,
      "height": 250.0,
      "confidence": 0.87,
      "class_name": "person"
    }
  ]
}

Each detection includes:

  • x, y: Coordinates of the bounding box top-left corner
  • width, height: Dimensions of the bounding box
  • confidence: Detection confidence score (0.0 to 1.0)
  • class_name: The detected object class
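As a quick illustration of consuming these fields, here is a minimal Python sketch (using the requests package, assumed installed) that calls the endpoint and prints one line per detection:

import requests

payload = {
    "model": "rfdetr-base",
    "image": "https://media.roboflow.com/dog.jpeg",
}

resp = requests.post("http://localhost:8080/v1/detection", json=payload)
resp.raise_for_status()

# Print class, confidence and bounding box for each detected object
for det in resp.json().get("detections", []):
    print(f"{det['class_name']}: {det['confidence']:.2f} "
          f"at ({det['x']}, {det['y']}) size {det['width']}x{det['height']}")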

Backends

RF-DETR Backend

The RF-DETR backend is implemented as a Python-based gRPC service that integrates seamlessly with LocalAI. It provides object detection capabilities using the RF-DETR model architecture and supports multiple hardware configurations:

  • CPU: Optimized for CPU inference
  • NVIDIA GPU: CUDA acceleration for NVIDIA GPUs
  • Intel GPU: Intel oneAPI optimization
  • AMD GPU: ROCm acceleration for AMD GPUs
  • NVIDIA Jetson: Optimized for ARM64 NVIDIA Jetson devices

Setup

  1. Using the Model Gallery (Recommended)

    The easiest way to get started is using the model gallery. The rfdetr-base model is available in the official LocalAI gallery:

    # Install and run the rfdetr-base model
    local-ai run rfdetr-base

    You can also install it through the web interface by navigating to the Models section and searching for “rfdetr-base”.

  2. Manual Configuration

    Create a model configuration file in your models directory:

    name: rfdetr
    backend: rfdetr
    parameters:
      model: rfdetr-base

Available Models

Currently, the following model is available in the Model Gallery:

  • rfdetr-base: Base model with balanced performance and accuracy

You can browse and install this model through the LocalAI web interface or using the command line.

Examples

Basic Object Detection

curl -X POST http://localhost:8080/v1/detection \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rfdetr-base",
    "image": "https://example.com/image.jpg"
  }'

Base64 Image Detection

base64_image=$(base64 -w 0 image.jpg)
curl -X POST http://localhost:8080/v1/detection \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"rfdetr-base\",
    \"image\": \"data:image/jpeg;base64,$base64_image\"
  }"

Troubleshooting

Common Issues

  1. Model Loading Errors

    • Ensure the model file is properly downloaded
    • Check available disk space
    • Verify model compatibility with your backend version
  2. Low Detection Accuracy

    • Ensure good image quality and lighting
    • Check if objects are clearly visible
    • Consider using a larger model for better accuracy
  3. Slow Performance

    • Enable GPU acceleration if available
    • Use a smaller model for faster inference
    • Optimize image resolution

Debug Mode

Enable debug logging for troubleshooting:

local-ai run --debug rfdetr-base

Object Detection Category

LocalAI includes a dedicated object-detection category for models and backends that specialize in identifying and locating objects within images. This category currently includes:

  • RF-DETR: Real-time transformer-based object detection

Additional object detection models and backends will be added to this category in the future. You can filter models by the object-detection tag in the model gallery to find all available object detection models.

🧠 Embeddings

LocalAI supports generating embeddings for text or list of tokens.

For the API documentation you can refer to the OpenAI docs: https://platform.openai.com/docs/api-reference/embeddings

Model compatibility

The embedding endpoint is compatible with llama.cpp models, bert.cpp models and sentence-transformers models available on Hugging Face.

Manual Setup

Create a YAML config file in the models directory. Specify the backend and the model file.

name: text-embedding-ada-002 # The model name used in the API
parameters:
  model: <model_file>
backend: "<backend>"
embeddings: true

Huggingface embeddings

To use sentence-transformers models from Hugging Face, you can use the sentencetransformers embedding backend.

name: text-embedding-ada-002
backend: sentencetransformers
embeddings: true
parameters:
  model: all-MiniLM-L6-v2

The sentencetransformers backend uses Python sentence-transformers. For a list of all pre-trained models available see here: https://github.com/UKPLab/sentence-transformers#pre-trained-models

Note
  • The sentencetransformers backend is an optional backend of LocalAI and uses Python. If you are running LocalAI from the container images, it is already configured and ready to use.
  • For local execution, you also have to specify the extra backend in the EXTERNAL_GRPC_BACKENDS environment variable.
    • Example: EXTERNAL_GRPC_BACKENDS="sentencetransformers:/path/to/LocalAI/backend/python/sentencetransformers/sentencetransformers.py"
  • The sentencetransformers backend supports only text embeddings, not token embeddings. If you need to embed tokens, you can use the bert backend or llama.cpp.
  • No models are required to be downloaded before using the sentencetransformers backend. The models will be downloaded automatically the first time the API is used.

Llama.cpp embeddings

Embeddings with llama.cpp are supported with the llama-cpp backend; enable them by setting embeddings to true.

name: my-awesome-model
backend: llama-cpp
embeddings: true
parameters:
  model: ggml-file.bin

Then you can use the API to generate embeddings:

curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "My text",
  "model": "my-awesome-model"
}' | jq "."
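Embeddings can also be requested through the OpenAI Python client pointed at LocalAI. A minimal sketch, assuming the model configured above and the openai package installed:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.embeddings.create(
    model="my-awesome-model",  # or text-embedding-ada-002, depending on your configuration
    input="My text",
)

vector = resp.data[0].embedding
print(len(vector), vector[:5])  # embedding dimensionality and the first few values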

💡 Examples

  • Example that uses LLamaIndex and LocalAI as embedding: here.

🥽 GPT Vision

LocalAI supports understanding images by using LLaVA, and implements the GPT Vision API from OpenAI.


Usage

OpenAI docs: https://platform.openai.com/docs/guides/vision

To let LocalAI understand and reply with what it sees in the image, use the /v1/chat/completions endpoint, for example with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llava",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

Grammars and function tools can be used as well in conjunction with vision APIs:

 curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

Setup

All-in-One images already ship the llava model as gpt-4-vision-preview, so no setup is needed in that case.

To set up the LLaVA models, follow the full example in the configuration examples.

✍️ Constrained Grammars

Overview

The chat endpoint supports the grammar parameter, which allows users to specify a grammar in Backus-Naur Form (BNF). This feature enables the Large Language Model (LLM) to generate outputs adhering to a user-defined schema, such as JSON, YAML, or any other format that can be defined using BNF. For more details about BNF, see Backus-Naur Form on Wikipedia.

Note

Compatibility Notice: This feature is only supported by models that use the llama.cpp backend. For a complete list of compatible models, refer to the Model Compatibility page. For technical details, see the related pull requests: PR #1773 and PR #1887.

Setup

To use this feature, follow the installation and setup instructions on the LocalAI Functions page. Ensure that your local setup meets all the prerequisites specified for the llama.cpp backend.

💡 Usage Example

The following example demonstrates how to use the grammar parameter to constrain the model’s output to either “yes” or “no”. This can be particularly useful in scenarios where the response format needs to be strictly controlled.

Example: Binary Response Constraint

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Do you like apples?"}],
  "grammar": "root ::= (\"yes\" | \"no\")"
}'

In this example, the grammar parameter is set to a simple choice between “yes” and “no”, ensuring that the model’s response adheres strictly to one of these options regardless of the context.

Example: JSON Output Constraint

You can also use grammars to enforce JSON output format:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Generate a person object with name and age"}],
  "grammar": "root ::= \"{\" \"\\\"name\\\":\" string \",\\\"age\\\":\" number \"}\"\nstring ::= \"\\\"\" [a-z]+ \"\\\"\"\nnumber ::= [0-9]+"
}'

Example: YAML Output Constraint

Similarly, you can enforce YAML format:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Generate a YAML list of fruits"}],
  "grammar": "root ::= \"fruits:\" newline (\"  - \" string newline)+\nstring ::= [a-z]+\nnewline ::= \"\\n\""
}'

Advanced Usage

For more complex grammars, you can define multi-line BNF rules. The grammar parser supports:

  • Alternation (|)
  • Repetition (*, +)
  • Optional elements (?)
  • Character classes ([a-z])
  • String literals ("text")
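From Python, the grammar parameter can be passed through the OpenAI client's extra_body argument, since it is a LocalAI-specific field. A minimal sketch of the binary yes/no constraint (openai package assumed installed):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Do you like apples?"}],
    # "grammar" is not part of the standard OpenAI API, so it is passed via extra_body
    extra_body={"grammar": 'root ::= ("yes" | "no")'},
)

print(response.choices[0].message.content)  # expected to be either "yes" or "no"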

🆕🖧 Distributed Inference

This functionality enables LocalAI to distribute inference requests across multiple worker nodes, improving efficiency and performance. Nodes are automatically discovered and connect via p2p using a shared token, which keeps communication between the nodes of the network secure and private.

LocalAI supports two modes of distributed inferencing via p2p:

  • Federated Mode: Requests are shared between the cluster and routed to a single worker node in the network based on the load balancer’s decision.
  • Worker Mode (aka “model sharding” or “splitting weights”): Requests are processed by all the workers, which contribute to the final inference result (by sharing the model weights).

A list of global instances shared by the community is available at explorer.localai.io.

Usage

Starting LocalAI with --p2p generates a shared token for connecting multiple instances: that's all you need to create AI clusters, eliminating the need for intricate network setups.

Simply navigate to the “Swarm” section in the WebUI and follow the on-screen instructions.

For fully shared instances, start LocalAI with --p2p --federated and follow the Swarm section's guidance. This feature is still experimental and is offered as a tech preview.

Federated mode

Federated mode allows you to launch multiple LocalAI instances and connect them together in a federated network. This mode is useful when you want to distribute the inference load across multiple nodes but keep a single point of entry for the API. In the Swarm section of the WebUI, you can see the instructions to connect multiple instances together.


To start a LocalAI server in federated mode, run:

local-ai run --p2p --federated

This will generate a token that you can use to connect other LocalAI instances to the network or others can use to join the network. If you already have a token, you can specify it using the TOKEN environment variable.

To start a load balanced server that routes the requests to the network, run with the TOKEN:

local-ai federated

To see all the available options, run local-ai federated --help.

The instructions are displayed in the “Swarm” section of the WebUI, guiding you through the process of connecting multiple instances.

Workers mode

Note

This feature is available exclusively with llama-cpp compatible models.

This feature was introduced in LocalAI pull request #2324 and is based on the upstream work in llama.cpp pull request #6829.

To connect multiple workers to a single LocalAI instance, start first a server in p2p mode:

local-ai run --p2p

And navigate the WebUI to the “Swarm” section to see the instructions to connect multiple workers to the network.


Without P2P

To start workers for distributing the computational load, run:

local-ai worker llama-cpp-rpc --llama-cpp-args="-H <listening_address> -p <listening_port> -m <memory>" 

And you can specify the address of the workers when starting LocalAI with the LLAMACPP_GRPC_SERVERS environment variable:

LLAMACPP_GRPC_SERVERS="address1:port,address2:port" local-ai run

The workload on the LocalAI server will then be distributed across the specified nodes.

Alternatively, you can build the RPC workers/server following the llama.cpp README, which is compatible with LocalAI.

Manual example (worker)

Use the WebUI to guide you in the process of starting new workers. This example shows the manual steps to highlight the process.

  1. Start the server with --p2p:
./local-ai run --p2p

Copy the token from the WebUI or via API call (e.g., curl http://localhost:8080/p2p/token) and save it for later use.

To reuse the same token later, restart the server with --p2ptoken or P2P_TOKEN.

  2. Start the workers. Copy the local-ai binary to other hosts and run as many workers as needed using the token:
TOKEN=XXX ./local-ai worker p2p-llama-cpp-rpc --llama-cpp-args="-m <memory>" 

(Note: You can also supply the token via command-line arguments)

The server logs should indicate that new workers are being discovered.

  3. Start inference as usual on the server initiated in step 1.


Environment Variables

The following options can be tweaked using environment variables:

| Environment Variable | Description |
|----------------------|-------------|
| LOCALAI_P2P | Set to "true" to enable p2p |
| LOCALAI_FEDERATED | Set to "true" to enable federated mode |
| FEDERATED_SERVER | Set to "true" to enable federated server |
| LOCALAI_P2P_DISABLE_DHT | Set to "true" to disable DHT and make the p2p layer local only (mDNS) |
| LOCALAI_P2P_ENABLE_LIMITS | Set to "true" to enable connection limits and resource management (useful with poor connectivity or to limit resource consumption) |
| LOCALAI_P2P_LISTEN_MADDRS | Comma-separated list of multiaddresses to override the default libp2p 0.0.0.0 multiaddresses |
| LOCALAI_P2P_DHT_ANNOUNCE_MADDRS | Comma-separated list of multiaddresses to override announcing of listen multiaddresses (useful when the external address:port is remapped) |
| LOCALAI_P2P_BOOTSTRAP_PEERS_MADDRS | Comma-separated list of multiaddresses to specify custom DHT bootstrap nodes |
| LOCALAI_P2P_TOKEN | Set the token for the p2p network |
| LOCALAI_P2P_LOGLEVEL | Set the loglevel for the LocalAI p2p stack (default: info) |
| LOCALAI_P2P_LIB_LOGLEVEL | Set the loglevel for the underlying libp2p stack (default: fatal) |

Architecture

LocalAI uses https://github.com/libp2p/go-libp2p under the hood, the same project powering IPFS. Unlike other frameworks, LocalAI's peer-to-peer setup has no single master server; instead it uses gossip (pub/sub) and ledger functionalities to achieve consensus across the different peers.

EdgeVPN is used as a library to establish the network and expose the ledger functionality under a shared token, easing automatic discovery and providing separate, private peer-to-peer networks.

In worker mode, the model weights are split across the workers proportionally to their available memory. In federated mode, requests are distributed across the nodes, and each node has to load the model fully.

Debugging

To debug, it’s often useful to run in debug mode, for instance:

LOCALAI_P2P_LOGLEVEL=debug LOCALAI_P2P_LIB_LOGLEVEL=debug LOCALAI_P2P_ENABLE_LIMITS=true LOCALAI_P2P_DISABLE_DHT=true LOCALAI_P2P_TOKEN="<TOKEN>" ./local-ai ...

Notes

  • If running in p2p mode with container images, make sure you start the container with --net host or network_mode: host in the docker-compose file.
  • Only a single model is supported currently.
  • Ensure the server detects new workers before starting inference. Currently, additional workers cannot be added once inference has begun.
  • For more details on the implementation, refer to LocalAI pull request #2343

🔈 Audio to text

Audio to text models are models that can generate text from an audio file.

The transcription endpoint allows you to convert audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription. The endpoint input supports all the audio formats supported by ffmpeg.

Usage

Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.

For instance, with cURL:

curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@<FILE_PATH>" -F model="<MODEL_NAME>"

Example

Download one of the models from here in the models folder, and create a YAML file for your model:

name: whisper-1
backend: whisper
parameters:
  model: whisper-en

The transcription endpoint can then be tested like so:

## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"

## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}

🔥 OpenAI functions and tools

LocalAI supports running OpenAI functions and tools API with llama.cpp compatible models.


To learn more about OpenAI functions, see also the OpenAI API blog post.

LocalAI also supports JSON mode out of the box with llama.cpp-compatible models.

💡 Check out also LocalAGI for an example on how to use LocalAI functions.

Setup

OpenAI functions are available only with ggml or gguf models compatible with llama.cpp.

You don’t need to do anything specific - just use ggml or gguf models.

Usage example

You can configure a model manually with a YAML config file in the models directory, for example:

name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1

To use the functions with the OpenAI client in python:

from openai import OpenAI

messages = [{"role": "user", "content": "What is the weather like in Beijing now?"}]
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Return the temperature of the specified region specified by the user",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "User specified region",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "temperature unit"
                    },
                },
                "required": ["location"],
            },
        },
    }
]

client = OpenAI(
    # This is the default and can be omitted
    api_key="test",
    base_url="http://localhost:8080/v1/"
)

response = client.chat.completions.create(
    messages=messages,
    tools=tools,
    tool_choice="auto",
    model="gpt-4",
)
#...

For example, with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "What is the weather like in Beijing now?"}],
  "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Return the temperature of the specified region specified by the user",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "User specified region"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "temperature unit"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ],
    "tool_choice":"auto"
}'

Return data:

{
    "created": 1724210813,
    "object": "chat.completion",
    "id": "16b57014-477c-4e6b-8d25-aad028a5625e",
    "model": "gpt-4",
    "choices": [
        {
            "index": 0,
            "finish_reason": "tool_calls",
            "message": {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "index": 0,
                        "id": "16b57014-477c-4e6b-8d25-aad028a5625e",
                        "type": "function",
                        "function": {
                            "name": "get_current_weather",
                            "arguments": "{\"location\":\"Beijing\",\"unit\":\"celsius\"}"
                        }
                    }
                ]
            }
        }
    ],
    "usage": {
        "prompt_tokens": 221,
        "completion_tokens": 26,
        "total_tokens": 247
    }
}

Advanced

Use functions without grammars

Function calls are automatically mapped to grammars, which are currently supported only by llama.cpp. However, it is possible to turn off the use of grammars and extract the tool arguments from the LLM responses by specifying no_grammar and a response regex in the YAML file:

name: model_name
parameters:
  # Model file name
  model: model/name

function:
  # set to true to not use grammars
  no_grammar: true
  # set one or more regexes used to extract the function tool arguments from the LLM response
  response_regex:
  - "(?P<function>\w+)\s*\((?P<arguments>.*)\)"

The response regex has to be a regex with named parameters that capture the function name and the arguments. For instance:

(?P<function>\w+)\s*\((?P<arguments>.*)\)

will catch

function_name({ "foo": "bar"})
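To make the extraction concrete, here is a small standalone Python sketch applying that regex to such a response (this only illustrates how the named groups are read, not LocalAI's internal implementation):

import re

# Same regex as in the YAML above: named groups for the function name and its arguments
pattern = re.compile(r"(?P<function>\w+)\s*\((?P<arguments>.*)\)")

llm_response = 'function_name({ "foo": "bar"})'

match = pattern.search(llm_response)
if match:
    print(match.group("function"))   # -> function_name
    print(match.group("arguments"))  # -> { "foo": "bar"}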

Parallel tools calls

This feature is experimental and has to be configured in the YAML of the model by enabling function.parallel_calls:

name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1

function:
  # set to true to allow the model to call multiple functions in parallel
  parallel_calls: true

Use functions with grammar

It is also possible to specify the full function signature (for debugging, or to use with other clients).

The chat endpoint accepts the grammar_json_functions additional parameter which takes a JSON schema object.

For example, with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-4",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.1,
     "grammar_json_functions": {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "function": {"const": "create_event"},
                    "arguments": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "date": {"type": "string"},
                            "time": {"type": "string"}
                        }
                    }
                }
            },
            {
                "type": "object",
                "properties": {
                    "function": {"const": "search"},
                    "arguments": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string"}
                        }
                    }
                }
            }
        ]
    }
   }'

Grammars and function tools can be used as well in conjunction with vision APIs:

 curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

💡 Examples

A full e2e example with docker-compose is available here.

💾 Stores

Stores are an experimental feature to help with querying data using similarity search. It is a low level API that consists of only get, set, delete and find.

For example, if you have an embedding of some text and want to find text with similar embeddings, you can create embeddings for chunks of all your text and then compare them against the embedding of the text you are searching on.

An embedding here means a vector of numbers that represents some information about the text. Embeddings are created by an AI model such as BERT, or by a more traditional method such as word frequency.

Previously you would have to integrate with an external vector database or library directly. With the stores feature you can now do it through the LocalAI API.

Note however that doing a similarity search on embeddings is just one way to do retrieval. A higher level API can take this into account, so this may not be the best place to start.

API overview

There is an internal gRPC API and an external-facing HTTP JSON API. We'll only discuss the external HTTP API here; it mirrors the gRPC API. Consult pkg/store/client for internal usage.

Everything is in columnar format, meaning that instead of getting an array of objects, each with a key and a value, you get two separate arrays of keys and values.

Keys are arrays of floating point numbers with a maximum width of 32 bits. Values are strings (in gRPC they are bytes).

The key vectors must all be the same length, and search performs best if they are normalized. When adding keys, it is detected whether they are normalized and what length they are.

All endpoints accept a store field which specifies which store to operate on. Presently they are created on the fly and there is only one store backend so no configuration is required.

Set

To set some keys you can do

curl -X POST http://localhost:8080/stores/set \
     -H "Content-Type: application/json" \
     -d '{"keys": [[0.1, 0.2], [0.3, 0.4]], "values": ["foo", "bar"]}'

Setting the same keys again will update their values.

On success 200 OK is returned with no body.

Get

To get some keys you can do

curl -X POST http://localhost:8080/stores/get \
     -H "Content-Type: application/json" \
     -d '{"keys": [[0.1, 0.2]]}'

Both the keys and values are returned, e.g: {"keys":[[0.1,0.2]],"values":["foo"]}

The order of the keys is not preserved! If a key does not exist then nothing is returned.

Delete

To delete keys and values you can do

curl -X POST http://localhost:8080/stores/delete \
     -H "Content-Type: application/json" \
     -d '{"keys": [[0.1, 0.2]]}'

If a key doesn’t exist then it is ignored.

On success 200 OK is returned with no body.

Find

To do a similarity search you can do

curl -X POST http://localhost:8080/stores/find \
     -H "Content-Type: application/json" \
     -d '{"topk": 2, "key": [0.2, 0.1]}'

topk limits the number of results returned. The result is the same as for get, except that it also includes an array of similarities, where 1.0 is the maximum similarity. Results are returned in order from most similar to least.
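A minimal end-to-end sketch of the stores API from Python (using the requests package, assumed installed; the similarities field name follows the description above): set two keys, then find the nearest ones.

import requests

BASE = "http://localhost:8080/stores"

# Store two 2-dimensional keys with their values
requests.post(f"{BASE}/set", json={
    "keys": [[0.1, 0.2], [0.3, 0.4]],
    "values": ["foo", "bar"],
}).raise_for_status()

# Find the 2 keys most similar to [0.2, 0.1]
resp = requests.post(f"{BASE}/find", json={"topk": 2, "key": [0.2, 0.1]})
resp.raise_for_status()

result = resp.json()
for key, value, sim in zip(result["keys"], result["values"], result["similarities"]):
    print(f"{value}: similarity {sim:.3f} (key {key})")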

🖼️ Model gallery

The model gallery is a curated collection of model configurations for LocalAI that enables one-click install of models directly from the LocalAI Web interface.

A list of the models available can also be browsed at the Public LocalAI Gallery.

To ease model installation, LocalAI provides a way to preload models on start and to download and install them at runtime. You can install models manually by copying them into the models directory, or use the API or the Web interface to configure, download and verify the model assets for you.

Note

The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.

Note

GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.


  • Open LLM Leaderboard - here you can find a list of the best performing models on the Open LLM benchmark. Keep in mind that models compatible with LocalAI must be quantized in the gguf format.

How it works

Navigate the WebUI interface in the “Models” section from the navbar at the top. Here you can find a list of models that can be installed, and you can install them by clicking the “Install” button.

Add other galleries

You can add other galleries by:

  1. Using the Web UI: Navigate to the Runtime Settings page and configure galleries through the interface.

  2. Using Environment Variables: Set the GALLERIES environment variable. The GALLERIES environment variable is a list of JSON objects, where each object has a name and a url field. The name field is the name of the gallery, and the url field is the URL of the gallery’s index file, for example:

GALLERIES=[{"name":"<GALLERY_NAME>", "url":"<GALLERY_URL"}]
  1. Using Configuration Files: Add galleries to runtime_settings.json in the LOCALAI_CONFIG_DIR directory.

The models in the gallery will be automatically indexed and available for installation.

API Reference

Model repositories

You can install a model at runtime, while the API is already running, or before starting the API by preloading the models.

To install a model at runtime you will need to use the /models/apply LocalAI API endpoint.

By default LocalAI is configured with the localai repository.

To use additional repositories you need to start local-ai with the GALLERIES environment variable:

GALLERIES=[{"name":"<GALLERY_NAME>", "url":"<GALLERY_URL"}]

For example, to enable the default localai repository, you can start local-ai with:

GALLERIES=[{"name":"localai", "url":"github:mudler/localai/gallery/index.yaml"}]

where github:mudler/localai/gallery/index.yaml will be expanded automatically to https://raw.githubusercontent.com/mudler/LocalAI/main/gallery/index.yaml.

Note: URLs are expanded automatically for github and huggingface, however the https:// and http:// prefixes work as well.

Note

If you want to build your own gallery, there is no documentation yet. However you can find the source of the default gallery in the LocalAI repository.

List Models

To list all the available models, use the /models/available endpoint:

curl http://localhost:8080/models/available

To search for a model, you can use jq:

curl http://localhost:8080/models/available | jq '.[] | select(.name | contains("replit"))'

curl http://localhost:8080/models/available | jq '.[] | .name | select(contains("localmodels"))'

curl http://localhost:8080/models/available | jq '.[] | .urls | select(. != null) | add | select(contains("orca"))'

How to install a model from the repositories

Models can be installed by passing the full URL of the YAML config file, or an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.

To install a model from the gallery repository, you can pass the model name in the id field. For instance, to install the bert-embeddings model, you can use the following command:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "localai@bert-embeddings"
   }'  

where:

  • localai is the repository. It is optional and can be omitted. If the repository is omitted, LocalAI will search for the model by name in all the repositories. If the same model name is present in multiple galleries, the first match wins.
  • bert-embeddings is the model name in the gallery (read its config here).

If you don’t want to set any gallery repository, you can still install models by loading a model configuration file.

In the body of the request you must specify the model configuration file URL (url), optionally a name to install the model (name), extra files to install (files), and configuration overrides (overrides). When calling the API endpoint, LocalAI will download the models files and write the configuration to the folder used to store models.

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "config_url": "<MODEL_CONFIG_FILE_URL>"
   }' 
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "<GALLERY>@<MODEL_NAME>"
   }' 
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE_URL>"
   }' 

An example that installs hermes-2-pro-mistral can be:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "config_url": "https://raw.githubusercontent.com/mudler/LocalAI/v2.25.0/embedded/models/hermes-2-pro-mistral.yaml"
   }' 

The API will return a job uuid that you can use to track the job progress:

{"uuid":"1059474d-f4f9-11ed-8d99-c4cbe106d571","status":"http://localhost:8080/models/jobs/1059474d-f4f9-11ed-8d99-c4cbe106d571"}

For instance, a small example bash script that waits for a job to complete can be (requires jq):

response=$(curl -s http://localhost:8080/models/apply -H "Content-Type: application/json" -d "{\"url\": \"$model_url\"}")

job_id=$(echo "$response" | jq -r '.uuid')

while [ "$(curl -s http://localhost:8080/models/jobs/"$job_id" | jq -r '.processed')" != "true" ]; do 
  sleep 1
done

echo "Job completed"

To preload models on start instead, use the PRELOAD_MODELS environment variable, setting it to a JSON array of model URIs:

PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'

Note: url or id must be specified. url points to a model gallery configuration file, while id refers to a model inside a repository. If both are specified, the id will be used.

For example:

PRELOAD_MODELS=[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]

or as arg:

local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:mudler/LocalAI/gallery/stablediffusion.yaml@master
Note

You can find already some open licensed models in the LocalAI gallery.

If you don’t find the model in the gallery you can try to use the “base” model and provide a URL to LocalAI:

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "github:mudler/LocalAI/gallery/base.yaml@master",
     "name": "model-name",
     "files": [
        {
            "uri": "<URL>",
            "sha256": "<SHA>",
            "filename": "model"
        }
     ]
   }'

Override a model name

To install a model with a different name, specify a name parameter in the request body.

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>"
   }'  

For example, to install a model as gpt-3.5-turbo:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
      "url": "github:mudler/LocalAI/gallery/gpt4all-j.yaml",
      "name": "gpt-3.5-turbo"
   }'  

Additional Files

To download additional files with the model, use the files parameter:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>",
     "files": [
        {
            "uri": "<additional_file_url>",
            "sha256": "<additional_file_hash>",
            "filename": "<additional_file_name>"
        }
     ]
   }'  

Overriding configuration files

To override portions of the configuration file, such as the backend or the model file, use the overrides parameter:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>",
     "overrides": {
        "backend": "llama",
        "f16": true,
        ...
     }
   }'  

Examples

Embeddings: Bert

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "bert-embeddings",
     "name": "text-embedding-ada-002"
   }'  

To test it:

LOCALAI=http://localhost:8080
curl $LOCALAI/v1/embeddings -H "Content-Type: application/json" -d '{
    "input": "Test",
    "model": "text-embedding-ada-002"
  }'

Image generation: Stable diffusion

URL: https://github.com/EdVince/Stable-Diffusion-NCNN

While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{         
     "url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"
   }'

You can set the PRELOAD_MODELS environment variable:

PRELOAD_MODELS=[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]

or as arg:

local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:mudler/LocalAI/gallery/stablediffusion.yaml@master

Test it:

curl $LOCALAI/v1/images/generations -H "Content-Type: application/json" -d '{
            "prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
            "mode": 2,  "seed":9000,
            "size": "256x256", "n":2
}'

Audio transcription: Whisper

URL: https://github.com/ggerganov/whisper.cpp

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{         
     "url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master",
     "name": "whisper-1"
   }'

You can set the PRELOAD_MODELS environment variable:

PRELOAD_MODELS=[{"url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master", "name": "whisper-1"}]

or as arg:

local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master", "name": "whisper-1"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:mudler/LocalAI/gallery/whisper-base.yaml@master
  name: whisper-1

Note

LocalAI will create a batch process that downloads the required files from a model definition and automatically reload itself to include the new model.

Input: url or id (required), name (optional), files (optional)

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_DEFINITION_URL>",
     "id": "<GALLERY>@<MODEL_NAME>",
     "name": "<INSTALLED_MODEL_NAME>",
     "files": [
        {
            "uri": "<additional_file>",
            "sha256": "<additional_file_hash>",
            "filename": "<additional_file_name>"
        }
     ],
     "overrides": { "backend": "...", "f16": true }
   }'

An optional list of additional files to download can be specified within files. The name field allows you to override the model name. Finally, it is possible to override the model config file with overrides.

The url is a full URL, or a github url (github:org/repo/file.yaml), or a local file (file:///path/to/file.yaml). The id is a string in the form <GALLERY>@<MODEL_NAME>, where <GALLERY> is the name of the gallery, and <MODEL_NAME> is the name of the model in the gallery. Galleries can be specified during startup with the GALLERIES environment variable.

Returns a uuid and a url to follow the state of the process:

{ "uuid":"251475c9-f666-11ed-95e0-9a8a4480ac58", "status":"http://localhost:8080/models/jobs/251475c9-f666-11ed-95e0-9a8a4480ac58"}

To see a collection example of curated models definition files, see the LocalAI repository.

Get model job state /models/jobs/<uid>

This endpoint returns the state of the batch job associated to a model installation.

curl http://localhost:8080/models/jobs/<JOB_ID>

Returns a JSON object containing the error (if any) and whether the job has been processed:

{"error":null,"processed":true,"message":"completed"}

🔗 Model Context Protocol (MCP)

LocalAI now supports the Model Context Protocol (MCP), enabling powerful agentic capabilities by connecting AI models to external tools and services. This feature allows your LocalAI models to interact with various MCP servers, providing access to real-time data, APIs, and specialized tools.

What is MCP?

The Model Context Protocol is a standard for connecting AI models to external tools and data sources. It enables AI agents to:

  • Access real-time information from external APIs
  • Execute commands and interact with external systems
  • Use specialized tools for specific tasks
  • Maintain context across multiple tool interactions

Key Features

  • 🔄 Real-time Tool Access: Connect to external MCP servers for live data
  • 🛠️ Multiple Server Support: Configure both remote HTTP and local stdio servers
  • ⚡ Cached Connections: Efficient tool caching for better performance
  • 🔒 Secure Authentication: Support for bearer token authentication
  • 🎯 OpenAI Compatible: Uses the familiar /mcp/v1/chat/completions endpoint
  • 🧠 Advanced Reasoning: Configurable reasoning and re-evaluation capabilities
  • 📋 Auto-Planning: Break down complex tasks into manageable steps
  • 🎯 MCP Prompts: Specialized prompts for better MCP server interaction
  • 🔄 Plan Re-evaluation: Dynamic plan adjustment based on results
  • ⚙️ Flexible Agent Control: Customizable execution limits and retry behavior

Configuration

MCP support is configured in your model’s YAML configuration file using the mcp section:

name: my-agentic-model
backend: llama-cpp
parameters:
  model: qwen3-4b.gguf

mcp:
  remote: |
    {
      "mcpServers": {
        "weather-api": {
          "url": "https://api.weather.com/v1",
          "token": "your-api-token"
        },
        "search-engine": {
          "url": "https://search.example.com/mcp",
          "token": "your-search-token"
        }
      }
    }
  
  stdio: |
    {
      "mcpServers": {
        "file-manager": {
          "command": "python",
          "args": ["-m", "mcp_file_manager"],
          "env": {
            "API_KEY": "your-key"
          }
        },
        "database-tools": {
          "command": "node",
          "args": ["database-mcp-server.js"],
          "env": {
            "DB_URL": "postgresql://localhost/mydb"
          }
        }
      }
    }

agent:
  max_attempts: 3        # Maximum number of tool execution attempts
  max_iterations: 3     # Maximum number of reasoning iterations
  enable_reasoning: true # Enable tool reasoning capabilities
  enable_planning: false # Enable auto-planning capabilities
  enable_mcp_prompts: false # Enable MCP prompts
  enable_plan_re_evaluator: false # Enable plan re-evaluation

Configuration Options

Remote Servers (remote)

Configure HTTP-based MCP servers:

  • url: The MCP server endpoint URL
  • token: Bearer token for authentication (optional)

STDIO Servers (stdio)

Configure local command-based MCP servers:

  • command: The executable command to run
  • args: Array of command-line arguments
  • env: Environment variables (optional)

Agent Configuration (agent)

Configure agent behavior and tool execution:

  • max_attempts: Maximum number of tool execution attempts (default: 3)
  • max_iterations: Maximum number of reasoning iterations (default: 3)
  • enable_reasoning: Enable tool reasoning capabilities (default: false)
  • enable_planning: Enable auto-planning capabilities (default: false)
  • enable_mcp_prompts: Enable MCP prompts (default: false)
  • enable_plan_re_evaluator: Enable plan re-evaluation (default: false)

Usage

API Endpoint

Use the MCP-enabled completion endpoint:

curl http://localhost:8080/mcp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-agentic-model",
    "messages": [
      {"role": "user", "content": "What is the current weather in New York?"}
    ],
    "temperature": 0.7
  }'

Example Response

{
  "id": "chatcmpl-123",
  "created": 1699123456,
  "model": "my-agentic-model",
  "choices": [
    {
      "text": "The current weather in New York is 72°F (22°C) with partly cloudy skies. The humidity is 65% and there's a light breeze from the west at 8 mph."
    }
  ],
  "object": "text_completion"
}

Example Configurations

Docker-based Tools

name: docker-agent
backend: llama-cpp
parameters:
  model: qwen3-4b.gguf

mcp:
  stdio: |
    {
      "mcpServers": {
        "searxng": {
          "command": "docker",
          "args": [
            "run", "-i", "--rm",
            "quay.io/mudler/tests:duckduckgo-localai"
          ]
        }
      }
    }

agent:
  max_attempts: 5
  max_iterations: 5
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: true
  enable_plan_re_evaluator: true

Agent Configuration Details

The agent section controls how the AI model interacts with MCP tools:

Execution Control

  • max_attempts: Limits how many times a tool can be retried if it fails. Higher values provide more resilience but may increase response time.
  • max_iterations: Controls the maximum number of reasoning cycles the agent can perform. More iterations allow for complex multi-step problem solving.

Reasoning Capabilities

  • enable_reasoning: When enabled, the agent uses advanced reasoning to better understand tool results and plan next steps.

Planning Capabilities

  • enable_planning: When enabled, the agent uses auto-planning to break down complex tasks into manageable steps and execute them systematically. The agent will automatically detect when planning is needed.
  • enable_mcp_prompts: When enabled, the agent uses specialized prompts exposed by the MCP servers to interact with the exposed tools.
  • enable_plan_re_evaluator: When enabled, the agent can re-evaluate and adjust its execution plan based on intermediate results.
Recommended configurations for common scenarios:

  • Simple tasks: max_attempts: 2, max_iterations: 2, enable_reasoning: false, enable_planning: false
  • Complex tasks: max_attempts: 5, max_iterations: 5, enable_reasoning: true, enable_planning: true, enable_mcp_prompts: true
  • Advanced planning: max_attempts: 5, max_iterations: 5, enable_reasoning: true, enable_planning: true, enable_mcp_prompts: true, enable_plan_re_evaluator: true
  • Development/Debugging: max_attempts: 1, max_iterations: 1, enable_reasoning: true, enable_planning: true

How It Works

  1. Tool Discovery: LocalAI connects to configured MCP servers and discovers available tools
  2. Tool Caching: Tools are cached per model for efficient reuse
  3. Agent Execution: The AI model uses the Cogito framework to execute tools
  4. Response Generation: The model generates responses incorporating tool results

Supported MCP Servers

LocalAI is compatible with any MCP-compliant server.

Best Practices

Security

  • Use environment variables for sensitive tokens
  • Validate MCP server endpoints before deployment
  • Implement proper authentication for remote servers

Performance

  • Cache frequently used tools
  • Use appropriate timeout values for external APIs
  • Monitor resource usage for stdio servers

Error Handling

  • Implement fallback mechanisms for tool failures
  • Log tool execution for debugging
  • Handle network timeouts gracefully

With External Applications

Use MCP-enabled models in your applications:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/mcp/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="my-agentic-model",
    messages=[
        {"role": "user", "content": "Analyze the latest research papers on AI"}
    ]
)

MCP and adding packages

It might be handy to install packages before starting the container to set up the environment. This is an example of how you can do that with docker-compose (installing and configuring Docker inside the container):

services:
  local-ai:
    image: localai/localai:latest
    #image: localai/localai:latest-gpu-nvidia-cuda-12
    container_name: local-ai
    restart: always
    entrypoint: [ "/bin/bash" ]
    command: >
     -c "apt-get update &&
         apt-get install -y docker.io &&
         /entrypoint.sh"
    environment:
      - DEBUG=true
      - LOCALAI_WATCHDOG_IDLE=true
      - LOCALAI_WATCHDOG_BUSY=true
      - LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
      - LOCALAI_WATCHDOG_BUSY_TIMEOUT=15m
      - LOCALAI_API_KEY=my-beautiful-api-key
      - DOCKER_HOST=tcp://docker:2376
      - DOCKER_TLS_VERIFY=1
      - DOCKER_CERT_PATH=/certs/client
    ports:
      - "8080:8080"
    volumes:
      - /data/models:/models
      - /data/backends:/backends
      - certs:/certs:ro
    # uncomment for nvidia
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - capabilities: [gpu]
    #           device_ids: ['7']
    # runtime: nvidia

  docker:
    image: docker:dind
    privileged: true
    container_name: docker
    volumes:
      - certs:/certs
    healthcheck:
      test: ["CMD", "docker", "info"]
      interval: 10s
      timeout: 5s
volumes:
  certs:

An example model configuration (with an mcp section to append to any existing model configuration you have) can be:

mcp:
  stdio: |
     {
      "mcpServers": {
        "weather": {
          "command": "docker",
          "args": [
            "run", "-i", "--rm",
            "ghcr.io/mudler/mcps/weather:master"
          ]
        },
        "memory": {
          "command": "docker",
          "env": {
            "MEMORY_FILE_PATH": "/data/memory.json"
          },
          "args": [
            "run", "-i", "--rm", "-v", "/host/data:/data",
            "ghcr.io/mudler/mcps/memory:master"
          ]
        },
        "ddg": {
          "command": "docker",
          "env": {
            "MAX_RESULTS": "10"
          },
          "args": [
            "run", "-i", "--rm", "-e", "MAX_RESULTS",
            "ghcr.io/mudler/mcps/duckduckgo:master"
          ]
        }
      }
     }

⚙️ Runtime Settings

LocalAI provides a web-based interface for managing application settings at runtime. These settings can be configured through the web UI and are automatically persisted to a configuration file, allowing changes to take effect immediately without requiring a restart.

Accessing Runtime Settings

Navigate to the Settings page from the management interface at http://localhost:8080/manage. The settings page provides a comprehensive interface for configuring various aspects of LocalAI.

Available Settings

Watchdog Settings

The watchdog monitors backend activity and can automatically stop idle or overly busy models to free up resources.

  • Watchdog Enabled: Master switch to enable/disable the watchdog
  • Watchdog Idle Enabled: Enable stopping backends that are idle longer than the idle timeout
  • Watchdog Busy Enabled: Enable stopping backends that are busy longer than the busy timeout
  • Watchdog Idle Timeout: Duration threshold for idle backends (default: 15m)
  • Watchdog Busy Timeout: Duration threshold for busy backends (default: 5m)

Changes to watchdog settings are applied immediately by restarting the watchdog service.

Backend Configuration

  • Single Backend: Allow only one backend to run at a time
  • Parallel Backend Requests: Enable backends to handle multiple requests in parallel if supported

Performance Settings

  • Threads: Number of threads used for parallel computation (recommended: number of physical cores)
  • Context Size: Default context size for models (default: 512)
  • F16: Enable GPU acceleration using 16-bit floating point

Debug and Logging

  • Debug Mode: Enable debug logging (deprecated, use log-level instead)

API Security

  • CORS: Enable Cross-Origin Resource Sharing
  • CORS Allow Origins: Comma-separated list of allowed CORS origins
  • CSRF: Enable CSRF protection middleware
  • API Keys: Manage API keys for authentication (one per line or comma-separated)

P2P Settings

Configure peer-to-peer networking for distributed inference:

  • P2P Token: Authentication token for P2P network
  • P2P Network ID: Network identifier for P2P connections
  • Federated Mode: Enable federated mode for P2P network

Changes to P2P settings automatically restart the P2P stack with the new configuration.

Gallery Settings

Manage model and backend galleries:

  • Model Galleries: JSON array of gallery objects with url and name fields
  • Backend Galleries: JSON array of backend gallery objects
  • Autoload Galleries: Automatically load model galleries on startup
  • Autoload Backend Galleries: Automatically load backend galleries on startup

Configuration Persistence

All settings are automatically saved to runtime_settings.json in the LOCALAI_CONFIG_DIR directory (default: BASEPATH/configuration). This file is watched for changes, so modifications made directly to the file will also be applied at runtime.
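
For example, to point the configuration directory at a persistent location (the path is illustrative):

LOCALAI_CONFIG_DIR=/data/localai/configuration ./local-ai
# Settings changed in the web UI are then persisted to
# /data/localai/configuration/runtime_settings.json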

Environment Variable Precedence

Environment variables take precedence over settings configured via the web UI or configuration files. If a setting is controlled by an environment variable, it cannot be modified through the web interface. The settings page will indicate when a setting is controlled by an environment variable.

The precedence order is:

  1. Environment variables (highest priority)
  2. Configuration files (runtime_settings.json, api_keys.json)
  3. Default values (lowest priority)
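
As a quick illustration of this precedence (the variable name comes from the CLI reference):

# runtime_settings.json may contain "watchdog_idle_timeout": "15m",
# but the environment variable below wins, and the setting is then
# shown as controlled by an environment variable in the web UI:
LOCALAI_WATCHDOG_IDLE_TIMEOUT=30m ./local-ai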

Example Configuration

The runtime_settings.json file follows this structure:

{
  "watchdog_enabled": true,
  "watchdog_idle_enabled": true,
  "watchdog_busy_enabled": false,
  "watchdog_idle_timeout": "15m",
  "watchdog_busy_timeout": "5m",
  "single_backend": false,
  "parallel_backend_requests": true,
  "threads": 8,
  "context_size": 2048,
  "f16": false,
  "debug": false,
  "cors": true,
  "csrf": false,
  "cors_allow_origins": "*",
  "p2p_token": "",
  "p2p_network_id": "",
  "federated": false,
  "galleries": [
    {
      "url": "github:mudler/LocalAI/gallery/index.yaml@master",
      "name": "localai"
    }
  ],
  "backend_galleries": [
    {
      "url": "github:mudler/LocalAI/backend/index.yaml@master",
      "name": "localai"
    }
  ],
  "autoload_galleries": true,
  "autoload_backend_galleries": true,
  "api_keys": []
}

API Keys Management

API keys can be managed through the runtime settings interface. Keys can be entered one per line or comma-separated.

Important Notes:

  • API keys from environment variables are always included and cannot be removed via the UI
  • Runtime API keys are stored in runtime_settings.json
  • For backward compatibility, API keys can also be managed via api_keys.json
  • Empty arrays will clear all runtime API keys (but preserve environment variable keys)

Dynamic Configuration

The runtime settings system supports dynamic configuration file watching. When LOCALAI_CONFIG_DIR is set, LocalAI monitors the following files for changes:

  • runtime_settings.json - Unified runtime settings
  • api_keys.json - API keys (for backward compatibility)
  • external_backends.json - External backend configurations

Changes to these files are automatically detected and applied without requiring a restart.

Best Practices

  1. Use Environment Variables for Production: For production deployments, use environment variables for critical settings to ensure they cannot be accidentally changed via the web UI.

  2. Backup Configuration Files: Before making significant changes, consider backing up your runtime_settings.json file.

  3. Monitor Resource Usage: When enabling watchdog features, monitor your system to ensure the timeout values are appropriate for your workload.

  4. Secure API Keys: API keys are sensitive information. Ensure proper file permissions on configuration files (they should be readable only by the LocalAI process); see the example after this list.

  5. Test Changes: Some settings (like watchdog timeouts) may require testing to find optimal values for your specific use case.
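
A minimal sketch for points 2 and 4 above, assuming the default configuration directory (adjust the paths to your LOCALAI_CONFIG_DIR):

# Back up the runtime settings before making significant changes
cp configuration/runtime_settings.json configuration/runtime_settings.json.bak

# Restrict configuration files containing API keys to the LocalAI process user
chmod 600 configuration/runtime_settings.json configuration/api_keys.json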

Troubleshooting

Settings Not Applying

If settings are not being applied:

  1. Check if the setting is controlled by an environment variable
  2. Verify the LOCALAI_CONFIG_DIR is set correctly
  3. Check file permissions on runtime_settings.json
  4. Review application logs for configuration errors

Watchdog Not Working

If the watchdog is not functioning:

  1. Ensure “Watchdog Enabled” is turned on
  2. Verify at least one of the idle or busy watchdogs is enabled
  3. Check that timeout values are reasonable for your workload
  4. Review logs for watchdog-related messages

P2P Not Starting

If P2P is not starting:

  1. Verify the P2P token is set (non-empty)
  2. Check network connectivity
  3. Ensure the P2P network ID matches across nodes (if using federated mode)
  4. Review logs for P2P-related errors

Integrations

Community integrations

A list of projects that use LocalAI directly behind the scenes can be found here.

The list below contains software that integrates with LocalAI.

Feel free to open a pull request (by clicking the “Edit page” link below) to get a page made for your project, or if you see an error on one of the pages!

Subsections of Advanced

Advanced usage

Model Configuration with YAML Files

LocalAI uses YAML configuration files to define model parameters, templates, and behavior. You can create individual YAML files in the models directory or use a single configuration file with multiple models.

Quick Example:

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-stable

template:
  completion: completion
  chat: chat

For a complete reference of all available configuration options, see the Model Configuration page.

Configuration File Locations:

  1. Individual files: Create .yaml files in your models directory (e.g., models/gpt-3.5-turbo.yaml)
  2. Single config file: Use --models-config-file or LOCALAI_MODELS_CONFIG_FILE to specify a file containing multiple models
  3. Remote URLs: Specify a URL to a YAML configuration file at startup:
    local-ai run github://mudler/LocalAI/examples/configurations/phi-2.yaml@master

See also chatbot-ui as an example of how to use config files.

Prompt templates

The API doesn’t inject a default prompt for talking to the model. You have to use a prompt similar to what’s described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.

You can use a default template for every model present in your model path by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl`, which will be used as the default prompt. For example, for alpaca:

The below instruction describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:

See the prompt-templates directory in this repository for templates for some of the most popular models.

For the edit endpoint, an example template for alpaca-based models can be:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:

Install models using the API

Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.

A curated collection of model files is in the model-gallery. The files of the model gallery are different from the model files used to configure LocalAI models. The model gallery files contain information about the model setup, and the files necessary to run the model locally.

To install for example lunademo, you can send a POST call to the /models/apply endpoint with the model definition url (url) and the name the model should have in LocalAI (name, optional):

curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "TheBloke/Luna-AI-Llama2-Uncensored-GGML/luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin",
    "name": "lunademo"
}'

Preloading models during startup

In order to allow the API to start up with all the needed models on first start, the model gallery files can be used during startup.

PRELOAD_MODELS='[{"url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml","name": "gpt4all-j"}]' local-ai

PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls to the /models/apply endpoint.

Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):

- url: https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml
  name: gpt4all-j

Automatic prompt caching

LocalAI can automatically cache prompts for faster loading of the prompt. This can be useful if your model needs a prompt template with prefixed text before the input.

To enable prompt caching, you can control the settings in the model config YAML file:

prompt_cache_path: "cache"
prompt_cache_all: true

prompt_cache_path is relative to the models folder. You can enter a name for the file that will be automatically created during the first load if prompt_cache_all is set to true.

Configuring a specific backend for the model

By default, LocalAI will try to autoload the model by trying all the backends. This works for most models, but some backends are NOT configured to autoload.

The available backends are listed in the model compatibility table.

In order to specify a backend for your models, create a model config file in your models directory specifying the backend:

name: gpt-3.5-turbo

parameters:
  # Relative to the models path
  model: ...

backend: llama-stable

Connect external backends

LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.

The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.

So for instance, to register a new backend which is a local file:

./local-ai --debug --external-grpc-backends "my-awesome-backend:/path/to/my/backend.py"

Or a remote URI:

./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port"

For example, to start vllm manually after compiling LocalAI (also assuming running the command from the root of the repository):

./local-ai --external-grpc-backends "vllm:$PWD/backend/python/vllm/run.sh"

Note that it is first necessary to create the environment with:

make -C backend/python/vllm

Environment variables

When LocalAI runs in a container, there are additional environment variables available that modify the behavior of LocalAI on startup:

| Environment variable | Default | Description |
|---|---|---|
| REBUILD | false | Rebuild LocalAI on startup |
| BUILD_TYPE | | Build type. Available: cublas, openblas, clblas, intel (intel core), sycl_f16, sycl_f32 (intel backends) |
| GO_TAGS | | Go tags. Available: stablediffusion |
| HUGGINGFACEHUB_API_TOKEN | | Special token for interacting with HuggingFace Inference API, required only when using the langchain-huggingface backend |
| EXTRA_BACKENDS | | A space separated list of backends to prepare. For example EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers" prepares the python environment on start |
| DISABLE_AUTODETECT | false | Disable autodetect of CPU flagset on start |
| LLAMACPP_GRPC_SERVERS | | A list of llama.cpp workers to distribute the workload. For example LLAMACPP_GRPC_SERVERS="address1:port,address2:port" |

Here is how to configure these variables:

docker run --env REBUILD=true localai
docker run --env-file .env localai

CLI Parameters

For a complete reference of all CLI parameters, environment variables, and command-line options, see the CLI Reference page.

You can control LocalAI with command line arguments to specify a binding address, number of threads, model paths, and many other options. Any command line parameter can be specified via an environment variable.

.env files

Any settings being provided by an Environment Variable can also be provided from within .env files. There are several locations that will be checked for relevant .env files. In order of precedence they are:

  • .env within the current directory
  • localai.env within the current directory
  • localai.env within the home directory
  • .config/localai.env within the home directory
  • /etc/localai.env

Environment variables within files earlier in the list will take precedence over environment variables defined in files later in the list.

An example .env file is:

LOCALAI_THREADS=10
LOCALAI_MODELS_PATH=/mnt/storage/localai/models
LOCALAI_F16=true

Request headers

You can use the ‘Extra-Usage’ request header (‘Extra-Usage: true’) to receive inference timings in milliseconds, extending the default OpenAI response model in the usage field:

...
{
  "id": "...",
  "created": ...,
  "model": "...",
  "choices": [
    {
      ...
    },
    ...
  ],
  "object": "...",
  "usage": {
    "prompt_tokens": ...,
    "completion_tokens": ...,
    "total_tokens": ...,
    // Extra-Usage header key will include these two float fields:
    "timing_prompt_processing: ...,
    "timing_token_generation": ...,
  },
}
...
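
For example, a chat completion request opting in to the extended usage timings (the model name is illustrative):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Extra-Usage: true" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}]
  }'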

Extra backends

LocalAI can be extended with extra backends. The backends are implemented as gRPC services and can be written in any language. See the backend section for more details on how to install and build new backends for LocalAI.

In runtime

When using the -core container image it is possible to prepare the python backends you are interested in by using the EXTRA_BACKENDS variable, for instance:

docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master

Concurrent requests

LocalAI supports parallel requests for the backends that support it. For instance, vLLM and llama.cpp support parallel requests, and thus LocalAI allows you to run multiple requests in parallel.

In order to enable parallel requests, you have to pass --parallel-requests or set the PARALLEL_REQUESTS environment variable to true.

The environment variables that tweak parallelism are the following:

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
LLAMACPP_PARALLEL=1

### Enable to run parallel requests
PARALLEL_REQUESTS=true
Note that, for llama.cpp you need to set accordingly LLAMACPP_PARALLEL to the number of parallel processes your GPU/CPU can handle. For python-based backends (like vLLM) you can set PYTHON_GRPC_MAX_WORKERS to the number of parallel requests.
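
For instance, a sketch that enables parallel requests with four llama.cpp workers and four Python gRPC workers (the values are illustrative; size them to your hardware):

PARALLEL_REQUESTS=true \
LLAMACPP_PARALLEL=4 \
PYTHON_GRPC_MAX_WORKERS=4 \
./local-ai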

VRAM and Memory Management

For detailed information on managing VRAM when running multiple models, see the dedicated VRAM and Memory Management page.

Disable CPU flagset auto detection in llama.cpp

LocalAI will automatically discover the CPU flagset available in your host and will use the most optimized version of the backends.

If you want to disable this behavior, you can set DISABLE_AUTODETECT to true in the environment variables.
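
For example, both for the binary and for the container image:

DISABLE_AUTODETECT=true ./local-ai
# or, when running the container image:
docker run --env DISABLE_AUTODETECT=true localai/localai:latest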

Fine-tuning LLMs for text generation

Note

Section under construction

This section covers how to fine-tune a language model for text generation and consume it in LocalAI.

Open In Colab

Requirements

For this example you will need a GPU with at least 12GB of VRAM and a Linux box.

Fine-tuning

Fine-tuning a language model is a process that requires a lot of computational power and time.

Currently LocalAI doesn’t support a fine-tuning endpoint, but there are plans to support that. For the time being, a guide is proposed here to give a simple starting point on how to fine-tune a model and use it with LocalAI (but also with llama.cpp).

There is an e2e example of fine-tuning a LLM model to use with LocalAI written by @mudler available here.

The steps involved are:

  • Preparing a dataset
  • Prepare the environment and install dependencies
  • Fine-tune the model
  • Merge the Lora base with the model
  • Convert the model to gguf
  • Use the model with LocalAI

Dataset preparation

We are going to need a dataset or a set of datasets.

Axolotl supports a variety of formats. In the notebook and in this example we are aiming for a very simple dataset built manually, so we are going to use the completion format, which requires the full text to be used for fine-tuning.

A dataset for an instructor model (like Alpaca) can look like the following:

[
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ...",
 },
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ...",
 }
]

Every block in the text is the whole text that is used to fine-tune. For example, for an instructor model it follows (more or less) the following format:

<System prompt>

## Instruction

<Question, instruction>

## Response

<Expected response from the LLM>

The instruction format works as follows: at inference time, we feed the model only the first part up to the ## Instruction block, and the model completes the text with the ## Response block.

Prepare a dataset, and upload it to your Google Drive if you are using the Google Colab. Otherwise, place it next to the axolotl.yaml file as dataset.json.

Install dependencies

git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
pip install packaging
pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Configure accelerate:

accelerate config default

Fine-tuning

We will need to configure axolotl. This example provides an axolotl.yaml file that uses openllama-3b for fine-tuning. Copy the axolotl.yaml file and edit it to your needs. The dataset needs to be next to it as dataset.json. You can find the axolotl.yaml file here.

If you have a big dataset, you can pre-tokenize it to speedup the fine-tuning process:

python -m axolotl.cli.preprocess axolotl.yaml

Now we are ready to start the fine-tuning process:

accelerate launch -m axolotl.cli.train axolotl.yaml

After we have finished the fine-tuning, we merge the Lora base with the model:

python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False

And we convert it to the gguf format that LocalAI can consume:

git clone https://github.com/ggerganov/llama.cpp.git
pushd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release && popd

pushd llama.cpp && python3 convert_hf_to_gguf.py ../qlora-out/merged && popd

pushd llama.cpp/build/bin &&  ./llama-quantize ../../../qlora-out/merged/Merged-33B-F16.gguf \
    ../../../custom-model-q4_0.gguf q4_0

Now you should have ended up with a custom-model-q4_0.gguf file that you can copy into the LocalAI models directory and use with LocalAI.

VRAM and Memory Management

When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn’t enough available VRAM. LocalAI provides two mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion.

The Problem

By default, LocalAI keeps models loaded in memory once they’re first used. This means:

  • If you load a large model that uses most of your VRAM, subsequent requests for other models may fail
  • Models remain in memory even when not actively being used
  • There’s no automatic mechanism to unload models to make room for new ones (unloading can only be done manually via the web interface)

This is a common issue when working with GPU-accelerated models, as VRAM is typically more limited than system RAM. For more context, see issues #6068, #7269, and #5352.

Solution 1: Single Active Backend

The simplest approach is to ensure only one model is loaded at a time. When a new model is requested, LocalAI will automatically unload the currently active model before loading the new one.

Configuration

./local-ai --single-active-backend

LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai

Use cases

  • Single GPU systems with limited VRAM
  • When you only need one model active at a time
  • Simple deployments where model switching is acceptable

Example

LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}'

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'

Solution 2: Watchdog Mechanisms

For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.

Note: Watchdog settings can be configured via the Runtime Settings web interface, which allows you to adjust settings without restarting the application.

Idle Watchdog

The idle watchdog monitors models that haven’t been used for a specified period and automatically unloads them to free VRAM.

Configuration

Via environment variables or CLI:

LOCALAI_WATCHDOG_IDLE=true ./local-ai

LOCALAI_WATCHDOG_IDLE=true LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m ./local-ai

./local-ai --enable-watchdog-idle --watchdog-idle-timeout=10m

Via web UI: Navigate to Settings → Watchdog Settings and enable “Watchdog Idle Enabled” with your desired timeout.

Busy Watchdog

The busy watchdog monitors models that have been processing requests for an unusually long time and terminates them if they exceed a threshold. This is useful for detecting and recovering from stuck or hung backends.

Configuration

Via environment variables or CLI:

LOCALAI_WATCHDOG_BUSY=true ./local-ai

LOCALAI_WATCHDOG_BUSY=true LOCALAI_WATCHDOG_BUSY_TIMEOUT=10m ./local-ai

./local-ai --enable-watchdog-busy --watchdog-busy-timeout=10m

Via web UI: Navigate to Settings → Watchdog Settings and enable “Watchdog Busy Enabled” with your desired timeout.

Combined Configuration

You can enable both watchdogs simultaneously for comprehensive memory management:

LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
./local-ai

Or using command line flags:

./local-ai \
  --enable-watchdog-idle --watchdog-idle-timeout=15m \
  --enable-watchdog-busy --watchdog-busy-timeout=5m

Use cases

  • Multi-model deployments where different models may be used intermittently
  • Systems where you want to keep frequently-used models loaded but free memory from unused ones
  • Recovery from stuck or hung backend processes
  • Production environments requiring automatic resource management

Example

LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
./local-ai

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}'
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'

Timeout Format

Timeouts can be specified using Go’s duration format:

  • 15m - 15 minutes
  • 1h - 1 hour
  • 30s - 30 seconds
  • 2h30m - 2 hours and 30 minutes

Limitations and Considerations

VRAM Usage Estimation

LocalAI cannot reliably estimate VRAM usage of new models to load across different backends (llama.cpp, vLLM, diffusers, etc.) because:

  • Different backends report memory usage differently
  • VRAM requirements vary based on model architecture, quantization, and configuration
  • Some backends may not expose memory usage information before loading the model

Manual Management

If automatic management doesn’t meet your needs, you can manually stop models using the LocalAI management API:

curl -X POST http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "model-name"}'

To stop all models, you’ll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.
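
A minimal sketch that loops over the models you want to unload and calls the shutdown endpoint for each (the model names are illustrative; replace them with the models currently loaded on your instance):

for model in model-a model-b; do
  curl -X POST http://localhost:8080/backend/shutdown \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\"}"
done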

Best Practices

  1. Monitor VRAM usage: Use nvidia-smi (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage (see the example after this list)
  2. Start with single active backend: For single-GPU systems, --single-active-backend is often the simplest solution
  3. Tune watchdog timeouts: Adjust timeouts based on your usage patterns - shorter timeouts free memory faster but may cause more frequent reloads
  4. Consider model size: Ensure your VRAM can accommodate at least one of your largest models
  5. Use quantization: Smaller quantized models use less VRAM and allow more flexibility
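
For point 1, a simple way to keep an eye on VRAM usage on NVIDIA GPUs while models load and unload (assuming the watch utility is available):

# Refresh GPU memory and utilization stats every 2 seconds
watch -n 2 nvidia-smi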

Model Configuration

LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.

Overview

Model configuration files allow you to:

  • Define default parameters (temperature, top_p, etc.)
  • Configure prompt templates
  • Specify backend settings
  • Set up function calling
  • Configure GPU and memory options
  • And much more

Configuration File Locations

You can create model configuration files in several ways:

  1. Individual YAML files in the models directory (e.g., models/gpt-3.5-turbo.yaml)
  2. Single config file with multiple models using --models-config-file or LOCALAI_MODELS_CONFIG_FILE
  3. Remote URLs - specify a URL to a YAML configuration file at startup

Example: Basic Configuration

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-stable

template:
  completion: completion
  chat: chat

Example: Multiple Models in One File

When using --models-config-file, you can define multiple models as a list:

- name: model1
  parameters:
    model: model1.bin
  context_size: 512
  backend: llama-stable

- name: model2
  parameters:
    model: model2.bin
  context_size: 1024
  backend: llama-stable

Core Configuration Fields

Basic Model Settings

| Field | Type | Description | Example |
|---|---|---|---|
| name | string | Model name, used to identify the model in API calls | gpt-3.5-turbo |
| backend | string | Backend to use (e.g. llama-cpp, vllm, diffusers, whisper) | llama-cpp |
| description | string | Human-readable description of the model | A conversational AI model |
| usage | string | Usage instructions or notes | Best for general conversation |

Model File and Downloads

| Field | Type | Description |
|---|---|---|
| parameters.model | string | Path to the model file (relative to models directory) or URL |
| download_files | array | List of files to download. Each entry has filename, uri, and optional sha256 |

Example:

parameters:
  model: my-model.gguf

download_files:
  - filename: my-model.gguf
    uri: https://example.com/model.gguf
    sha256: abc123...

Parameters Section

The parameters section contains all OpenAI-compatible request parameters and model-specific options.

OpenAI-Compatible Parameters

These settings will be used as defaults for all the API calls to the model.

| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 0.9 | Sampling temperature (0.0-2.0). Higher values make output more random |
| top_p | float | 0.95 | Nucleus sampling: consider tokens with top_p probability mass |
| top_k | int | 40 | Consider only the top K most likely tokens |
| max_tokens | int | 0 | Maximum number of tokens to generate (0 = unlimited) |
| frequency_penalty | float | 0.0 | Penalty for token frequency (-2.0 to 2.0) |
| presence_penalty | float | 0.0 | Penalty for token presence (-2.0 to 2.0) |
| repeat_penalty | float | 1.1 | Penalty for repeating tokens |
| repeat_last_n | int | 64 | Number of previous tokens to consider for repeat penalty |
| seed | int | -1 | Random seed (omit for random) |
| echo | bool | false | Echo back the prompt in the response |
| n | int | 1 | Number of completions to generate |
| logprobs | bool/int | false | Return log probabilities of tokens |
| top_logprobs | int | 0 | Number of top logprobs to return per token (0-20) |
| logit_bias | map | {} | Map of token IDs to bias values (-100 to 100) |
| typical_p | float | 1.0 | Typical sampling parameter |
| tfz | float | 1.0 | Tail free z parameter |
| keep | int | 0 | Number of tokens to keep from the prompt |

Language and Translation

| Field | Type | Description |
|---|---|---|
| language | string | Language code for transcription/translation |
| translate | bool | Whether to translate audio transcription |

Custom Parameters

| Field | Type | Description |
|---|---|---|
| batch | int | Batch size for processing |
| ignore_eos | bool | Ignore end-of-sequence tokens |
| negative_prompt | string | Negative prompt for image generation |
| rope_freq_base | float32 | RoPE frequency base |
| rope_freq_scale | float32 | RoPE frequency scale |
| negative_prompt_scale | float32 | Scale for negative prompt |
| tokenizer | string | Tokenizer to use (RWKV) |

LLM Configuration

These settings apply to most LLM backends (llama.cpp, vLLM, etc.):

Performance Settings

| Field | Type | Default | Description |
|---|---|---|---|
| threads | int | processor count | Number of threads for parallel computation |
| context_size | int | 512 | Maximum context size (number of tokens) |
| f16 | bool | false | Enable 16-bit floating point precision (GPU acceleration) |
| gpu_layers | int | 0 | Number of layers to offload to GPU (0 = CPU only) |

Memory Management

| Field | Type | Default | Description |
|---|---|---|---|
| mmap | bool | true | Use memory mapping for model loading (faster, less RAM) |
| mmlock | bool | false | Lock model in memory (prevents swapping) |
| low_vram | bool | false | Use minimal VRAM mode |
| no_kv_offloading | bool | false | Disable KV cache offloading |

GPU Configuration

| Field | Type | Description |
|---|---|---|
| tensor_split | string | Comma-separated GPU memory allocation (e.g., "0.8,0.2" for 80%/20%) |
| main_gpu | string | Main GPU identifier for multi-GPU setups |
| cuda | bool | Explicitly enable/disable CUDA |

Sampling and Generation

| Field | Type | Default | Description |
|---|---|---|---|
| mirostat | int | 0 | Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0) |
| mirostat_tau | float | 5.0 | Mirostat target entropy |
| mirostat_eta | float | 0.1 | Mirostat learning rate |

LoRA Configuration

| Field | Type | Description |
|---|---|---|
| lora_adapter | string | Path to LoRA adapter file |
| lora_base | string | Base model for LoRA |
| lora_scale | float32 | LoRA scale factor |
| lora_adapters | array | Multiple LoRA adapters |
| lora_scales | array | Scales for multiple LoRA adapters |

Advanced Options

| Field | Type | Description |
|---|---|---|
| no_mulmatq | bool | Disable matrix multiplication queuing |
| draft_model | string | Draft model for speculative decoding |
| n_draft | int32 | Number of draft tokens |
| quantization | string | Quantization format |
| load_format | string | Model load format |
| numa | bool | Enable NUMA (Non-Uniform Memory Access) |
| rms_norm_eps | float32 | RMS normalization epsilon |
| ngqa | int32 | Natural question generation parameter |
| rope_scaling | string | RoPE scaling configuration |
| type | string | Model type/architecture |
| grammar | string | Grammar file path for constrained generation |

YARN Configuration

YARN (Yet Another RoPE extensioN) settings for context extension:

| Field | Type | Description |
|---|---|---|
| yarn_ext_factor | float32 | YARN extension factor |
| yarn_attn_factor | float32 | YARN attention factor |
| yarn_beta_fast | float32 | YARN beta fast parameter |
| yarn_beta_slow | float32 | YARN beta slow parameter |

Prompt Caching

| Field | Type | Description |
|---|---|---|
| prompt_cache_path | string | Path to store prompt cache (relative to models directory) |
| prompt_cache_all | bool | Cache all prompts automatically |
| prompt_cache_ro | bool | Read-only prompt cache |

Text Processing

| Field | Type | Description |
|---|---|---|
| stopwords | array | Words or phrases that stop generation |
| cutstrings | array | Strings to cut from responses |
| trimspace | array | Strings to trim whitespace from |
| trimsuffix | array | Suffixes to trim from responses |
| extract_regex | array | Regular expressions to extract content |

System Prompt

| Field | Type | Description |
|---|---|---|
| system_prompt | string | Default system prompt for the model |

vLLM-Specific Configuration

These options apply when using the vllm backend:

| Field | Type | Description |
|---|---|---|
| gpu_memory_utilization | float32 | GPU memory utilization (0.0-1.0, default 0.9) |
| trust_remote_code | bool | Trust and execute remote code |
| enforce_eager | bool | Force eager execution mode |
| swap_space | int | Swap space in GB |
| max_model_len | int | Maximum model length |
| tensor_parallel_size | int | Tensor parallelism size |
| disable_log_stats | bool | Disable logging statistics |
| dtype | string | Data type (e.g., float16, bfloat16) |
| flash_attention | string | Flash attention configuration |
| cache_type_k | string | Key cache type |
| cache_type_v | string | Value cache type |
| limit_mm_per_prompt | object | Limit multimodal content per prompt: {image: int, video: int, audio: int} |
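
A minimal sketch of a model configuration for the vllm backend using a few of the fields above (the model name and values are illustrative; field placement follows the reference in this section):

cat > models/my-vllm-model.yaml <<'EOF'
name: my-vllm-model
backend: vllm
parameters:
  # For vLLM this is typically a HuggingFace model identifier (illustrative value)
  model: org/model-name
gpu_memory_utilization: 0.90
max_model_len: 8192
dtype: bfloat16
EOF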

Template Configuration

Templates use Go templates with Sprig functions.

| Field | Type | Description |
|---|---|---|
| template.chat | string | Template for chat completion endpoint |
| template.chat_message | string | Template for individual chat messages |
| template.completion | string | Template for text completion |
| template.edit | string | Template for edit operations |
| template.function | string | Template for function/tool calls |
| template.multimodal | string | Template for multimodal interactions |
| template.reply_prefix | string | Prefix to add to model replies |
| template.use_tokenizer_template | bool | Use tokenizer’s built-in template (vLLM/transformers) |
| template.join_chat_messages_by_character | string | Character to join chat messages (default: \n) |

Template Variables

Templating supports Sprig functions.

The following common variables are available in templates:

  • {{.Input}} - User input
  • {{.Instruction}} - Instruction for edit operations
  • {{.System}} - System message
  • {{.Prompt}} - Full prompt
  • {{.Functions}} - Function definitions (for function calling)
  • {{.FunctionCall}} - Function call result

Example Template

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}{{end}}
    {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
    {{end}}
    Assistant:

Function Calling Configuration

Configure how the model handles function/tool calls:

| Field | Type | Default | Description |
|---|---|---|---|
| function.disable_no_action | bool | false | Disable the no-action behavior |
| function.no_action_function_name | string | answer | Name of the no-action function |
| function.no_action_description_name | string | | Description for no-action function |
| function.function_name_key | string | name | JSON key for function name |
| function.function_arguments_key | string | arguments | JSON key for function arguments |
| function.response_regex | array | | Named regex patterns to extract function calls |
| function.argument_regex | array | | Named regex to extract function arguments |
| function.argument_regex_key_name | string | key | Named regex capture for argument key |
| function.argument_regex_value_name | string | value | Named regex capture for argument value |
| function.json_regex_match | array | | Regex patterns to match JSON in tool mode |
| function.replace_function_results | array | | Replace function call results with patterns |
| function.replace_llm_results | array | | Replace LLM results with patterns |
| function.capture_llm_results | array | | Capture LLM results as text (e.g., for “thinking” blocks) |

Grammar Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| function.grammar.disable | bool | false | Completely disable grammar enforcement |
| function.grammar.parallel_calls | bool | false | Allow parallel function calls |
| function.grammar.mixed_mode | bool | false | Allow mixed-mode grammar enforcing |
| function.grammar.no_mixed_free_string | bool | false | Disallow free strings in mixed mode |
| function.grammar.disable_parallel_new_lines | bool | false | Disable parallel processing for new lines |
| function.grammar.prefix | string | | Prefix to add before grammar rules |
| function.grammar.expect_strings_after_json | bool | false | Expect strings after JSON data |

Diffusers Configuration

For image generation models using the diffusers backend:

| Field | Type | Description |
|---|---|---|
| diffusers.cuda | bool | Enable CUDA for diffusers |
| diffusers.pipeline_type | string | Pipeline type (e.g., stable-diffusion, stable-diffusion-xl) |
| diffusers.scheduler_type | string | Scheduler type (e.g., euler, ddpm) |
| diffusers.enable_parameters | string | Comma-separated parameters to enable |
| diffusers.cfg_scale | float32 | Classifier-free guidance scale |
| diffusers.img2img | bool | Enable image-to-image transformation |
| diffusers.clip_skip | int | Number of CLIP layers to skip |
| diffusers.clip_model | string | CLIP model to use |
| diffusers.clip_subfolder | string | CLIP model subfolder |
| diffusers.control_net | string | ControlNet model to use |
| step | int | Number of diffusion steps |

TTS Configuration

For text-to-speech models:

| Field | Type | Description |
|---|---|---|
| tts.voice | string | Voice file path or voice ID |
| tts.audio_path | string | Path to audio files (for Vall-E) |

Roles Configuration

Map conversation roles to specific strings:

roles:
  user: "### Instruction:"
  assistant: "### Response:"
  system: "### System Instruction:"

Feature Flags

Enable or disable experimental features:

feature_flags:
  feature_name: true
  another_feature: false

MCP Configuration

Model Context Protocol (MCP) configuration:

| Field | Type | Description |
|---|---|---|
| mcp.remote | string | YAML string defining remote MCP servers |
| mcp.stdio | string | YAML string defining STDIO MCP servers |

Agent Configuration

Agent/autonomous agent configuration:

| Field | Type | Description |
|---|---|---|
| agent.max_attempts | int | Maximum number of attempts |
| agent.max_iterations | int | Maximum number of iterations |
| agent.enable_reasoning | bool | Enable reasoning capabilities |
| agent.enable_planning | bool | Enable planning capabilities |
| agent.enable_mcp_prompts | bool | Enable MCP prompts |
| agent.enable_plan_re_evaluator | bool | Enable plan re-evaluation |

Pipeline Configuration

Define pipelines for audio-to-audio processing:

| Field | Type | Description |
|---|---|---|
| pipeline.tts | string | TTS model name |
| pipeline.llm | string | LLM model name |
| pipeline.transcription | string | Transcription model name |
| pipeline.vad | string | Voice activity detection model name |

gRPC Configuration

Backend gRPC communication settings:

| Field | Type | Description |
|---|---|---|
| grpc.attempts | int | Number of retry attempts |
| grpc.attempts_sleep_time | int | Sleep time between retries (seconds) |

Overrides

Override model configuration values at runtime (llama.cpp):

overrides:
  - "qwen3moe.expert_used_count=int:10"
  - "some_key=string:value"

Format: KEY=TYPE:VALUE where TYPE is int, float, string, or bool.

Known Use Cases

Specify which endpoints this model supports:

known_usecases:
  - chat
  - completion
  - embeddings

Available flags: chat, completion, edit, embeddings, rerank, image, transcript, tts, sound_generation, tokenize, vad, video, detection, llm (combination of CHAT, COMPLETION, EDIT).

Complete Example

Here’s a comprehensive example combining many options:

name: my-llm-model
description: A high-performance LLM model
backend: llama-stable

parameters:
  model: my-model.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 2048

context_size: 4096
threads: 8
f16: true
gpu_layers: 35

system_prompt: "You are a helpful AI assistant."

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:

roles:
  user: "User:"
  assistant: "Assistant:"
  system: "System:"

stopwords:
  - "\n\nUser:"
  - "\n\nHuman:"

prompt_cache_path: "cache/my-model"
prompt_cache_all: true

function:
  grammar:
    parallel_calls: true
    mixed_mode: false

feature_flags:
  experimental_feature: true

Subsections of References

Model compatibility table

Besides llama-based models, LocalAI is also compatible with other architectures. The table below lists all the backends, compatible model families, and the associated repository.

Note

LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.

Text Generation & Language Models

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| llama.cpp | LLama, Mamba, RWKV, Falcon, Starcoder, GPT-2, and many others | yes | GPT and Functions | yes | yes | CUDA 11/12, ROCm, Intel SYCL, Vulkan, Metal, CPU |
| vLLM | Various GPTs and quantization formats | yes | GPT | no | no | CUDA 12, ROCm, Intel |
| transformers | Various GPTs and quantization formats | yes | GPT, embeddings, Audio generation | yes | yes* | CUDA 11/12, ROCm, Intel, CPU |
| exllama2 | GPTQ | yes | GPT only | no | no | CUDA 12 |
| MLX | Various LLMs | yes | GPT | no | no | Metal (Apple Silicon) |
| MLX-VLM | Vision-Language Models | yes | Multimodal GPT | no | no | Metal (Apple Silicon) |
| langchain-huggingface | Any text generators available on HuggingFace through API | yes | GPT | no | no | N/A |

Audio & Speech Processing

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| whisper.cpp | whisper | no | Audio transcription | no | no | CUDA 12, ROCm, Intel SYCL, Vulkan, CPU |
| faster-whisper | whisper | no | Audio transcription | no | no | CUDA 12, ROCm, Intel, CPU |
| piper (binding) | Any piper onnx model | no | Text to voice | no | no | CPU |
| bark | bark | no | Audio generation | no | no | CUDA 12, ROCm, Intel |
| bark-cpp | bark | no | Audio-Only | no | no | CUDA, Metal, CPU |
| coqui | Coqui TTS | no | Audio generation and Voice cloning | no | no | CUDA 12, ROCm, Intel, CPU |
| kokoro | Kokoro TTS | no | Text-to-speech | no | no | CUDA 12, ROCm, Intel, CPU |
| chatterbox | Chatterbox TTS | no | Text-to-speech | no | no | CUDA 11/12, CPU |
| kitten-tts | Kitten TTS | no | Text-to-speech | no | no | CPU |
| silero-vad with Golang bindings | Silero VAD | no | Voice Activity Detection | no | no | CPU |
| neutts | NeuTTSAir | no | Text-to-speech with voice cloning | no | no | CUDA 12, ROCm, CPU |
| mlx-audio | MLX | no | Text-to-speech | no | no | Metal (Apple Silicon) |

Image & Video Generation

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| stablediffusion.cpp | stablediffusion-1, stablediffusion-2, stablediffusion-3, flux, PhotoMaker | no | Image | no | no | CUDA 12, Intel SYCL, Vulkan, CPU |
| diffusers | SD, various diffusion models,… | no | Image/Video generation | no | no | CUDA 11/12, ROCm, Intel, Metal, CPU |
| transformers-musicgen | MusicGen | no | Audio generation | no | no | CUDA, CPU |

Specialized AI Tasks

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| rfdetr | RF-DETR | no | Object Detection | no | no | CUDA 12, Intel, CPU |
| rerankers | Reranking API | no | Reranking | no | no | CUDA 11/12, ROCm, Intel, CPU |
| local-store | Vector database | no | Vector storage | yes | no | CPU |
| huggingface | HuggingFace API models | yes | Various AI tasks | yes | yes | API-based |

Acceleration Support Summary

GPU Acceleration

  • NVIDIA CUDA: CUDA 11.7, CUDA 12.0 support across most backends
  • AMD ROCm: HIP-based acceleration for AMD GPUs
  • Intel oneAPI: SYCL-based acceleration for Intel GPUs (F16/F32 precision)
  • Vulkan: Cross-platform GPU acceleration
  • Metal: Apple Silicon GPU acceleration (M1/M2/M3+)

Specialized Hardware

  • NVIDIA Jetson (L4T): ARM64 support for embedded AI
  • Apple Silicon: Native Metal acceleration for Mac M1/M2/M3+
  • Darwin x86: Intel Mac support

CPU Optimization

  • AVX/AVX2/AVX512: Advanced vector extensions for x86
  • Quantization: 4-bit, 5-bit, 8-bit integer quantization support
  • Mixed Precision: F16/F32 mixed precision support

Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).

  • * Only for CUDA and OpenVINO CPU/XPU acceleration.

Architecture

LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including ggml, to perform inference on LLMs using both CPU and, if desired, GPU. Internally, LocalAI backends are just gRPC servers; indeed, you can specify and build your own gRPC server and extend LocalAI at runtime as well. It is possible to specify external gRPC servers and/or binaries that LocalAI will manage internally.

LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, …). You can check the model compatibility table to learn about all the components of LocalAI.

Backstory

As with many typical open source projects, I, mudler, was fiddling around with llama.cpp over my long nights and wanted to have a way to call it from Go, as I am a Golang developer and use it extensively. So I’ve created LocalAI (or what was initially known as llama-cli) and added an API to it.

But guess what? The more I dived into this rabbit hole, the more I realized that I had stumbled upon something big. With all the fantastic C++ projects floating around the community, it dawned on me that I could piece them together to create a full-fledged OpenAI replacement. So, ta-da! LocalAI was born, and it quickly overshadowed its humble origins.

Now, why did I choose to go with C++ bindings, you ask? Well, I wanted to keep LocalAI snappy and lightweight, allowing it to run like a champ on any system, avoid any Golang GC penalties, and, most importantly, build on the shoulders of giants like llama.cpp. Go is good at backends and APIs and is easy to maintain. And hey, don’t forget that I’m all about sharing the love. That’s why I made LocalAI MIT licensed, so everyone can hop on board and benefit from it.

As if that wasn’t exciting enough, as the project gained traction, mkellerman and Aisuko jumped in to lend a hand. mkellerman helped set up some killer examples, while Aisuko is becoming our community maestro. The community now is growing even more with new contributors and users, and I couldn’t be happier about it!

Oh, and let’s not forget the real MVP here—llama.cpp. Without this extraordinary piece of software, LocalAI wouldn’t even exist. So, a big shoutout to the community for making this magic happen!

CLI Reference

Complete reference for all LocalAI command-line interface (CLI) parameters and environment variables.

Note: All CLI flags can also be set via environment variables. Environment variables take precedence over CLI flags. See .env files for configuration file support.

Global Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| -h, --help | | Show context-sensitive help | |
| --log-level | info | Set the level of logs to output [error,warn,info,debug,trace] | $LOCALAI_LOG_LEVEL |
| --debug | false | DEPRECATED - Use --log-level=debug instead. Enable debug logging | $LOCALAI_DEBUG, $DEBUG |

Storage Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --models-path | BASEPATH/models | Path containing models used for inferencing | $LOCALAI_MODELS_PATH, $MODELS_PATH |
| --generated-content-path | /tmp/generated/content | Location for assets generated by backends (e.g. stablediffusion, images, audio, videos) | $LOCALAI_GENERATED_CONTENT_PATH, $GENERATED_CONTENT_PATH |
| --upload-path | /tmp/localai/upload | Path to store uploads from files API | $LOCALAI_UPLOAD_PATH, $UPLOAD_PATH |
| --localai-config-dir | BASEPATH/configuration | Directory for dynamic loading of certain configuration files (currently runtime_settings.json, api_keys.json, and external_backends.json). See Runtime Settings for web-based configuration. | $LOCALAI_CONFIG_DIR |
| --localai-config-dir-poll-interval | | Time duration to poll the LocalAI Config Dir if your system has broken fsnotify events (example: 1m) | $LOCALAI_CONFIG_DIR_POLL_INTERVAL |
| --models-config-file | | YAML file containing a list of model backend configs (alias: --config-file) | $LOCALAI_MODELS_CONFIG_FILE, $CONFIG_FILE |

Backend Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --backends-path | BASEPATH/backends | Path containing backends used for inferencing | $LOCALAI_BACKENDS_PATH, $BACKENDS_PATH |
| --backends-system-path | /usr/share/localai/backends | Path containing system backends used for inferencing | $LOCALAI_BACKENDS_SYSTEM_PATH, $BACKEND_SYSTEM_PATH |
| --external-backends | | A list of external backends to load from gallery on boot | $LOCALAI_EXTERNAL_BACKENDS, $EXTERNAL_BACKENDS |
| --external-grpc-backends | | A list of external gRPC backends (format: BACKEND_NAME:URI) | $LOCALAI_EXTERNAL_GRPC_BACKENDS, $EXTERNAL_GRPC_BACKENDS |
| --backend-galleries | | JSON list of backend galleries | $LOCALAI_BACKEND_GALLERIES, $BACKEND_GALLERIES |
| --autoload-backend-galleries | true | Automatically load backend galleries on startup | $LOCALAI_AUTOLOAD_BACKEND_GALLERIES, $AUTOLOAD_BACKEND_GALLERIES |
| --parallel-requests | false | Enable backends to handle multiple requests in parallel if they support it (e.g.: llama.cpp or vllm) | $LOCALAI_PARALLEL_REQUESTS, $PARALLEL_REQUESTS |
| --single-active-backend | false | Allow only one backend to be run at a time | $LOCALAI_SINGLE_ACTIVE_BACKEND, $SINGLE_ACTIVE_BACKEND |
| --preload-backend-only | false | Do not launch the API services, only the preloaded models/backends are started (useful for multi-node setups) | $LOCALAI_PRELOAD_BACKEND_ONLY, $PRELOAD_BACKEND_ONLY |
| --enable-watchdog-idle | false | Enable watchdog for stopping backends that are idle longer than the watchdog-idle-timeout | $LOCALAI_WATCHDOG_IDLE, $WATCHDOG_IDLE |
| --watchdog-idle-timeout | 15m | Threshold beyond which an idle backend should be stopped | $LOCALAI_WATCHDOG_IDLE_TIMEOUT, $WATCHDOG_IDLE_TIMEOUT |
| --enable-watchdog-busy | false | Enable watchdog for stopping backends that are busy longer than the watchdog-busy-timeout | $LOCALAI_WATCHDOG_BUSY, $WATCHDOG_BUSY |
| --watchdog-busy-timeout | 5m | Threshold beyond which a busy backend should be stopped | $LOCALAI_WATCHDOG_BUSY_TIMEOUT, $WATCHDOG_BUSY_TIMEOUT |

For more information on VRAM management, see VRAM and Memory Management.

Models Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --galleries | | JSON list of galleries | $LOCALAI_GALLERIES, $GALLERIES |
| --autoload-galleries | true | Automatically load galleries on startup | $LOCALAI_AUTOLOAD_GALLERIES, $AUTOLOAD_GALLERIES |
| --preload-models | | A list of models to apply in JSON at start | $LOCALAI_PRELOAD_MODELS, $PRELOAD_MODELS |
| --models | | A list of model configuration URLs to load | $LOCALAI_MODELS, $MODELS |
| --preload-models-config | | A list of models to apply at startup. Path to a YAML config file | $LOCALAI_PRELOAD_MODELS_CONFIG, $PRELOAD_MODELS_CONFIG |
| --load-to-memory | | A list of models to load into memory at startup | $LOCALAI_LOAD_TO_MEMORY, $LOAD_TO_MEMORY |

Note: You can also pass model configuration URLs as positional arguments: local-ai run MODEL_URL1 MODEL_URL2 ...

Performance Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --f16 | false | Enable GPU acceleration | $LOCALAI_F16, $F16 |
| -t, --threads | | Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested | $LOCALAI_THREADS, $THREADS |
| --context-size | | Default context size for models | $LOCALAI_CONTEXT_SIZE, $CONTEXT_SIZE |

API Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --address | :8080 | Bind address for the API server | $LOCALAI_ADDRESS, $ADDRESS |
| --cors | false | Enable CORS (Cross-Origin Resource Sharing) | $LOCALAI_CORS, $CORS |
| --cors-allow-origins | | Comma-separated list of allowed CORS origins | $LOCALAI_CORS_ALLOW_ORIGINS, $CORS_ALLOW_ORIGINS |
| --csrf | false | Enable Fiber CSRF middleware | $LOCALAI_CSRF |
| --upload-limit | 15 | Default upload-limit in MB | $LOCALAI_UPLOAD_LIMIT, $UPLOAD_LIMIT |
| --api-keys | | List of API Keys to enable API authentication. When this is set, all requests must be authenticated with one of these API keys | $LOCALAI_API_KEY, $API_KEY |
| --disable-webui | false | Disables the web user interface. When set to true, the server will only expose API endpoints without serving the web interface | $LOCALAI_DISABLE_WEBUI, $DISABLE_WEBUI |
| --disable-runtime-settings | false | Disables the runtime settings feature. When set to true, the server will not load runtime settings from the runtime_settings.json file and the settings web interface will be disabled | $LOCALAI_DISABLE_RUNTIME_SETTINGS, $DISABLE_RUNTIME_SETTINGS |
| --disable-gallery-endpoint | false | Disable the gallery endpoints | $LOCALAI_DISABLE_GALLERY_ENDPOINT, $DISABLE_GALLERY_ENDPOINT |
| --disable-metrics-endpoint | false | Disable the /metrics endpoint | $LOCALAI_DISABLE_METRICS_ENDPOINT, $DISABLE_METRICS_ENDPOINT |
| --machine-tag | | If not empty, add that string to Machine-Tag header in each response. Useful to track response from different machines using multiple P2P federated nodes | $LOCALAI_MACHINE_TAG, $MACHINE_TAG |

Hardening Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --disable-predownload-scan | false | If true, disables the best-effort security scanner before downloading any files | $LOCALAI_DISABLE_PREDOWNLOAD_SCAN |
| --opaque-errors | false | If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended | $LOCALAI_OPAQUE_ERRORS |
| --use-subtle-key-comparison | false | If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resilience against timing attacks | $LOCALAI_SUBTLE_KEY_COMPARISON |
| --disable-api-key-requirement-for-http-get | false | If true, a valid API key is not required to issue GET requests to portions of the web UI. This should only be enabled in secure testing environments | $LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET |
| --http-get-exempted-endpoints | ^/$,^/browse/?$,^/talk/?$,^/p2p/?$,^/chat/?$,^/text2image/?$,^/tts/?$,^/static/.*$,^/swagger.*$ | If --disable-api-key-requirement-for-http-get is overridden to true, this is the list of endpoints to exempt. Only adjust this in case of a security incident or as a result of a personal security posture review | $LOCALAI_HTTP_GET_EXEMPTED_ENDPOINTS |

P2P Flags

| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|-----------------------|
| --p2p | false | Enable P2P mode | $LOCALAI_P2P, $P2P |
| --p2p-dht-interval | 360 | Interval for DHT refresh (used during token generation) | $LOCALAI_P2P_DHT_INTERVAL, $P2P_DHT_INTERVAL |
| --p2p-otp-interval | 9000 | Interval for OTP refresh (used during token generation) | $LOCALAI_P2P_OTP_INTERVAL, $P2P_OTP_INTERVAL |
| --p2ptoken | | Token for P2P mode (optional) | $LOCALAI_P2P_TOKEN, $P2P_TOKEN, $TOKEN |
| --p2p-network-id | | Network ID for P2P mode, can be set arbitrarily by the user for grouping a set of instances | $LOCALAI_P2P_NETWORK_ID, $P2P_NETWORK_ID |
| --federated | false | Enable federated instance | $LOCALAI_FEDERATED, $FEDERATED |
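
For instance, a sketch of joining an instance to a shared P2P network (the token and network ID below are placeholders):

LOCALAI_P2P_TOKEN="<shared-token>" ./local-ai run \
  --p2p \
  --p2p-network-id my-cluster \
  --federated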

Other Commands

LocalAI supports several subcommands beyond run:

  • local-ai models - Manage LocalAI models and definitions
  • local-ai backends - Manage LocalAI backends and definitions
  • local-ai tts - Convert text to speech
  • local-ai sound-generation - Generate audio files from text or audio
  • local-ai transcript - Convert audio to text
  • local-ai worker - Run workers to distribute workload (llama.cpp-only)
  • local-ai util - Utility commands
  • local-ai explorer - Run P2P explorer
  • local-ai federated - Run LocalAI in federated mode

Use local-ai <command> --help for more information on each command.
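
For example (a sketch only; subcommand arguments vary between releases, so confirm with local-ai <command> --help):

# List models available in the configured galleries and install one by name
./local-ai models list
./local-ai models install <model-name>

# Manage inference backends
./local-ai backends list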

Examples

Basic Usage

# Start LocalAI with default settings
./local-ai run

# Use a custom models path and bind address
./local-ai run --models-path /path/to/models --address :9090

# Enable f16 mode
./local-ai run --f16

Environment Variables

export LOCALAI_MODELS_PATH=/path/to/models
export LOCALAI_ADDRESS=:9090
export LOCALAI_F16=true
./local-ai run

Advanced Configuration

./local-ai run \
  --models model1.yaml model2.yaml \
  --enable-watchdog-idle \
  --watchdog-idle-timeout=10m \
  --p2p \
  --federated

LocalAI binaries

LocalAI binaries are available for both Linux and MacOS platforms and can be executed directly from your command line. These binaries are continuously updated and hosted on our GitHub Releases page. This method also supports Windows users via the Windows Subsystem for Linux (WSL).

macOS Download

You can download the DMG and install the application:

Download LocalAI for macOS

Note: the DMGs are not signed by Apple, so macOS will quarantine them on first launch. See https://github.com/mudler/LocalAI/issues/6268 for a workaround; the fix is tracked in https://github.com/mudler/LocalAI/issues/6244.

Alternatively, use the following one-liner in your terminal to download and run LocalAI on Linux or macOS:

curl -Lo local-ai "https://github.com/mudler/LocalAI/releases/download/v3.7.0/local-ai-$(uname -s)-$(uname -m)" && chmod +x local-ai && ./local-ai
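
If you prefer a system-wide install instead of running the binary from the download directory, a minimal sketch (the target path is just a common choice):

sudo install -m 755 local-ai /usr/local/bin/local-ai
local-ai run --help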

Alternatively, here are the direct links to the binaries:

| OS | Link |
|----|------|
| Linux (amd64) | Download |
| Linux (arm64) | Download |
| MacOS (arm64) | Download |
Details

Binaries have limited support compared to container images:

  • Python-based backends are not shipped with binaries (e.g. bark, diffusers or transformers)
  • macOS and Linux arm64 binaries do not ship the TTS or stablediffusion-cpp backends
  • Linux binaries do not ship the stablediffusion-cpp backend

Running on Nvidia ARM64

LocalAI can be run on Nvidia ARM64 devices, such as the Jetson Nano, Jetson Xavier NX, and Jetson AGX Xavier. The following instructions will guide you through building the LocalAI container for Nvidia ARM64 devices.

Prerequisites

Build the container

Build the LocalAI container for Nvidia ARM64 devices using the following command:

git clone https://github.com/mudler/LocalAI

cd LocalAI

docker build --build-arg SKIP_DRIVERS=true --build-arg BUILD_TYPE=cublas --build-arg BASE_IMAGE=nvcr.io/nvidia/l4t-jetpack:r36.4.0 --build-arg IMAGE_TYPE=core -t quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core .

Alternatively, pre-built images are available on quay.io and Docker Hub:

docker pull quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core

Usage

Run the LocalAI container on Nvidia ARM64 devices using the following command, where /data/models is the directory containing the models:

docker run -e DEBUG=true -p 8080:8080 -v /data/models:/models  -ti --restart=always --name local-ai --runtime nvidia --gpus all quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core

Note: replace /data/models with the directory containing your models.
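
Once the container is up, a quick way to verify that the API is reachable is to query the OpenAI-compatible models endpoint:

curl http://localhost:8080/v1/models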

FAQ

Frequently asked questions

Here are answers to some of the most common questions.

How do I get models?

Most gguf-based models should work, but newer models may require additions to the API. If a model doesn’t work, please feel free to open up issues. However, be cautious about downloading models from the internet directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf, and models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
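
As a sketch, a GGUF file from Hugging Face can be placed directly in the models directory so LocalAI can pick it up (the URL below is a placeholder; substitute the repository and file you trust):

curl -L -o models/my-model.gguf \
  "https://huggingface.co/<org>/<repo>/resolve/main/<file>.gguf"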

Where are models stored?

LocalAI stores downloaded models in the following locations by default:

  • Command line: ./models (relative to current working directory)
  • Docker: /models (inside the container, typically mounted to ./models on host)
  • Launcher application: ~/.localai/models (in your home directory)

You can customize the model storage location using the LOCALAI_MODELS_PATH environment variable or --models-path command line flag. This is useful if you want to store models outside your home directory for backup purposes or to avoid filling up your home directory with large model files.
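
For example, to keep models on a larger disk (the path below is just an example):

./local-ai run --models-path /mnt/storage/localai-models

# or, equivalently, via the environment variable
LOCALAI_MODELS_PATH=/mnt/storage/localai-models ./local-ai run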

How much storage space do models require?

Model sizes vary significantly depending on the model and quantization level:

  • Small models (1-3B parameters): 1-3 GB
  • Medium models (7-13B parameters): 4-8 GB
  • Large models (30B+ parameters): 15-30+ GB

Quantization levels (smaller files, slightly reduced quality):

  • Q4_K_M: ~75% of original size
  • Q4_K_S: ~60% of original size
  • Q2_K: ~50% of original size

Storage recommendations:

  • Ensure you have at least 2-3x the model size available for downloads and temporary files
  • Use SSD storage for better performance
  • Consider the model size relative to your system RAM - models larger than your RAM may not run efficiently
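
A quick way to check the headroom recommendation above is to compare free disk space against the size of your models directory:

df -h .          # free space on the filesystem holding your models
du -sh ./models  # total size of the models already downloaded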

Benchmarking LocalAI and llama.cpp shows different results!

LocalAI applies a set of defaults when loading models with the llama.cpp backend; one of these is mirostat sampling, which improves output quality but slows down inference. You can disable it by setting mirostat: 0 in the model config file. See also the advanced section (/advanced/) for more information.
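
A minimal sketch of disabling mirostat, assuming the model’s config file lives at models/my-model.yaml (a hypothetical name) and does not already set mirostat:

cat <<'EOF' >> models/my-model.yaml
# Disable mirostat sampling so results are comparable with plain llama.cpp defaults
mirostat: 0
EOF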

What’s the difference with Serge, or XXX?

LocalAI is a multi-model solution that doesn’t focus on a specific model type (e.g., llama.cpp or alpaca.cpp); it handles all of them internally for faster inference, and it is easy to set up locally and to deploy to Kubernetes.

Everything is slow, how is it possible?

There are a few situations where this could occur. Some tips:

  • Don’t use an HDD to store your models. Prefer SSD over HDD. If you are stuck with an HDD, disable mmap in the model config file so the model is loaded entirely into memory.
  • Watch out for CPU overbooking. Ideally, --threads should match the number of physical cores. For instance, if your CPU has 4 cores, allocate at most 4 threads to a model.
  • Run LocalAI with DEBUG=true. This gives more information, including stats on the token inference speed.
  • Check that you are actually getting an output: run a simple curl request with "stream": true to see how fast the model is responding (see the sketch below).
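
A minimal streaming request looks like this (the model name is a placeholder; use one that is installed on your instance):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<installed-model-name>",
    "messages": [{"role": "user", "content": "Say hello"}],
    "stream": true
  }'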

Can I use it with a Discord bot, or XXX?

Yes! If the client uses the OpenAI API and supports setting a different base URL for requests, you can point it at the LocalAI endpoint. This lets you use LocalAI with any application built to work with OpenAI, without changing the application itself!
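
The exact setting depends on the client; many OpenAI-compatible clients read the base URL from an environment variable. A sketch under that assumption (variable names vary between clients, so check your client’s documentation):

# Point an OpenAI-compatible client at LocalAI instead of api.openai.com
export OPENAI_API_BASE="http://localhost:8080/v1"   # some clients expect OPENAI_BASE_URL instead
export OPENAI_API_KEY="sk-local"                    # placeholder; only checked if --api-keys is set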

Can this leverage GPUs?

There is GPU support, see /features/gpu-acceleration/.

Where is the webUI?

localai-webui and chatbot-ui are available in the examples section and can be set up following the instructions there. However, since LocalAI is an API, you can also plug it into existing projects that provide UI interfaces for OpenAI’s APIs. There are several on GitHub, and they should already be compatible with LocalAI (as it mimics the OpenAI API).

Does it work with AutoGPT?

Yes, see the examples!

How can I troubleshoot when something is wrong?

Enable the debug mode by setting DEBUG=true in the environment variables. This will give you more information on what’s going on. You can also specify --debug in the command line.
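
For example:

DEBUG=true ./local-ai run

# or
./local-ai run --debug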

I’m getting ‘invalid pitch’ error when running with CUDA, what’s wrong?

This typically happens when your prompt exceeds the context size. Try to reduce the prompt size, or increase the context size.

I’m getting a ‘SIGILL’ error, what’s wrong?

Your CPU probably does not support some of the instruction sets that are enabled by default in the pre-built binaries. If you are running in a container, try setting REBUILD=true and disabling the CPU instructions that are not compatible with your CPU. For instance:

CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" make build
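
A container-based sketch of the same approach (rebuilding at startup can take a while; the image tag follows the quick start example):

docker run -p 8080:8080 --name local-ai -ti \
  -e REBUILD=true \
  -e CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" \
  localai/localai:latest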