Model Configuration

LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.

Overview

Model configuration files allow you to:

Define default parameters (temperature, top_p, etc.)
Configure prompt templates
Specify backend settings
Set up function calling
Configure GPU and memory options
And much more

Configuration File Locations

You can create model configuration files in several ways:

Individual YAML files in the models directory (e.g., models/gpt-3.5-turbo.yaml)
Single config file with multiple models using --models-config-file or LOCALAI_MODELS_CONFIG_FILE
Remote URLs - specify a URL to a YAML configuration file at startup

Example: Basic Configuration

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-cpp

template:
  completion: completion
  chat: chat

Example: Multiple Models in One File

When using --models-config-file, you can define multiple models as a list:

- name: model1
  parameters:
    model: model1.bin
  context_size: 512
  backend: llama-cpp

- name: model2
  parameters:
    model: model2.bin
  context_size: 1024
  backend: llama-cpp

Core Configuration Fields

Basic Model Settings

Field	Type	Description	Example
`name`	string	Model name, used to identify the model in API calls	`gpt-3.5-turbo`
`backend`	string	Backend to use (e.g. `llama-cpp`, `vllm`, `diffusers`, `whisper`)	`llama-cpp`
`description`	string	Human-readable description of the model	`A conversational AI model`
`usage`	string	Usage instructions or notes	`Best for general conversation`

Model File and Downloads

Field	Type	Description
`parameters.model`	string	Path to the model file (relative to models directory) or URL
`download_files`	array	List of files to download. Each entry has `filename`, `uri`, and optional `sha256`

Example:

parameters:
  model: my-model.gguf

download_files:
  - filename: my-model.gguf
    uri: https://example.com/model.gguf
    sha256: abc123...

Model artifacts

The artifacts section makes installation of a Hugging Face model eager and repeatable. LocalAI resolves the requested revision to an immutable commit, downloads the selected repository files, and commits the complete snapshot before the model installation succeeds.

artifacts:
  - name: model
    target: model
    source:
      type: huggingface
      repo: Qwen/Qwen3-ASR-1.7B
      revision: main
      token_env: HF_TOKEN
    resolved:
      endpoint: https://huggingface.co
      revision: 0123456789abcdef0123456789abcdef01234567
      cache_key: 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef

parameters:
  model: Qwen/Qwen3-ASR-1.7B

Declare source when authoring a configuration. LocalAI owns the resolved block and writes it after installation; do not choose its values manually. For a public repository, omit token_env. For a private or gated repository, set it to HF_TOKEN and provide that environment variable to the LocalAI controller.

Field	Meaning
`name`	Logical artifact name; `model` for the initial primary artifact
`target`	Binding target; only `model` is supported initially
`source.type`	`huggingface`
`source.repo`	`owner/repository` or `hf://owner/repository`
`source.revision`	Branch, tag, or commit; defaults to `main` and resolves to a commit
`source.token_env`	Empty or `HF_TOKEN`; the secret value is never persisted
`source.allow_patterns`	Optional slash-separated glob allow-list
`source.ignore_patterns`	Optional slash-separated glob deny-list
`resolved`	Installer-owned immutable endpoint, revision, and cache key

Managed installation finishes only after every selected file is committed locally. parameters.model remains the logical repository ID. Once resolved.cache_key is present, LocalAI derives .artifacts/huggingface/<cache-key>/snapshot as the runtime ModelFile. Configurations without artifacts keep the existing lazy repository-ID behavior.

The initially migrated backend families are transformers and its aliases, diffusers, qwen-asr, fish-speech, nemo, voxcpm, qwen-tts, liquid-audio, vllm, vllm-omni, and sglang. Automatic imports add artifact declarations only for this set. Compatible external backends may opt in by declaring the artifact explicitly.
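
As a sketch of the authoring side, the following declares an artifact for a public repository and narrows the download with the pattern fields from the table above. The glob patterns are illustrative assumptions, not values taken from the repository, and the example assumes the pattern fields take YAML lists. token_env is omitted because the repository is public, and the resolved block is left out because LocalAI writes it after installation.

```yaml
artifacts:
  - name: model
    target: model
    source:
      type: huggingface
      repo: Qwen/Qwen3-ASR-1.7B
      revision: main
      # Hypothetical patterns: keep JSON configs and safetensors weights only.
      allow_patterns:
        - "*.json"
        - "*.safetensors"
      # Hypothetical pattern: skip documentation files.
      ignore_patterns:
        - "*.md"

parameters:
  model: Qwen/Qwen3-ASR-1.7B
```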

Parameters Section

The parameters section contains all OpenAI-compatible request parameters and model-specific options.

OpenAI-Compatible Parameters

These settings will be used as defaults for all the API calls to the model.

Field	Type	Default	Description
`temperature`	float	`0.9`	Sampling temperature (0.0-2.0). Higher values make output more random
`top_p`	float	`0.95`	Nucleus sampling: consider tokens with top_p probability mass
`top_k`	int	`40`	Consider only the top K most likely tokens
`max_tokens`	int	`0`	Maximum number of tokens to generate (0 = unlimited)
`frequency_penalty`	float	`0.0`	Penalty for token frequency (-2.0 to 2.0)
`presence_penalty`	float	`0.0`	Penalty for token presence (-2.0 to 2.0)
`repeat_penalty`	float	`1.1`	Penalty for repeating tokens
`repeat_last_n`	int	`64`	Number of previous tokens to consider for repeat penalty
`seed`	int	`-1`	Random seed (`-1` or omitted = random)
`echo`	bool	`false`	Echo back the prompt in the response
`n`	int	`1`	Number of completions to generate
`logprobs`	bool/int	`false`	Return log probabilities of tokens
`top_logprobs`	int	`0`	Number of top logprobs to return per token (0-20)
`logit_bias`	map	`{}`	Map of token IDs to bias values (-100 to 100)
`typical_p`	float	`1.0`	Typical sampling parameter
`tfz`	float	`1.0`	Tail-free sampling z parameter
`keep`	int	`0`	Number of tokens to keep from the prompt
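
A minimal sketch of how these defaults might be set for one model. The model name, file name, and values are illustrative assumptions; only the field names come from the table above.

```yaml
name: my-model              # hypothetical model name
parameters:
  model: my-model.gguf      # hypothetical model file
  temperature: 0.7          # lower than the 0.9 default for less random output
  top_p: 0.9
  top_k: 40
  max_tokens: 256           # cap generation instead of the unlimited default (0)
  repeat_penalty: 1.1
  seed: 42                  # fixed seed for repeatable sampling
```

A request that sets one of these parameters explicitly overrides the default for that call only.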

Language and Translation

Field	Type	Description
`language`	string	Language code for transcription/translation
`translate`	bool	Whether to translate audio transcription

Custom Parameters

Field	Type	Description
`batch`	int	Batch size for processing
`ignore_eos`	bool	Ignore end-of-sequence tokens
`negative_prompt`	string	Negative prompt for image generation
`rope_freq_base`	float32	RoPE frequency base
`rope_freq_scale`	float32	RoPE frequency scale
`negative_prompt_scale`	float32	Scale for negative prompt
`tokenizer`	string	Tokenizer to use (RWKV)

LLM Configuration

These settings apply to most LLM backends (llama.cpp, vLLM, etc.):

Performance Settings

Field	Type	Default	Description
`threads`	int	`processor count`	Number of threads for parallel computation
`context_size`	int	`512`	Maximum context size in tokens. Set to `-1` to auto-use the model’s full trained context from GGUF metadata (raw max, no VRAM capping; a warning is logged if it may not fit detected VRAM).
`f16`	bool	`false`	Enable 16-bit floating point precision (GPU acceleration)
`gpu_layers`	int	`0`	Number of layers to offload to GPU (0 = CPU only)

Memory Management

Field	Type	Default	Description
`mmap`	bool	`true`	Use memory mapping for model loading (faster, less RAM)
`mmlock`	bool	`false`	Lock model in memory (prevents swapping)
`low_vram`	bool	`false`	Use minimal VRAM mode
`no_kv_offloading`	bool	`false`	Disable KV cache offloading

GPU Configuration

Field	Type	Description
`tensor_split`	string	Comma-separated GPU memory allocation (e.g., `"0.8,0.2"` for 80%/20%)
`main_gpu`	string	Main GPU identifier for multi-GPU setups
`cuda`	bool	Explicitly enable/disable CUDA
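
The performance, memory, and GPU fields above can be combined in one model file. This is a sketch under assumptions: the model name, file name, and all numeric values are illustrative, and the form of the main_gpu identifier depends on your setup.

```yaml
name: my-model              # hypothetical model name
backend: llama-cpp
parameters:
  model: my-model.gguf      # hypothetical model file
context_size: 4096          # or -1 to use the model's full trained context
threads: 8
f16: true
gpu_layers: 35              # illustrative; 0 keeps everything on the CPU
mmap: true
mmlock: false
tensor_split: "0.8,0.2"     # 80%/20% split across two GPUs
main_gpu: "0"               # assumed identifier format
```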

Sampling and Generation

Field	Type	Default	Description
`mirostat`	int	`0`	Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0)
`mirostat_tau`	float	`5.0`	Mirostat target entropy
`mirostat_eta`	float	`0.1`	Mirostat learning rate
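
For illustration, a fragment that switches a model to Mirostat 2.0 while keeping the default target entropy and learning rate from the table above:

```yaml
mirostat: 2          # 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0
mirostat_tau: 5.0    # target entropy (default)
mirostat_eta: 0.1    # learning rate (default)
```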

LoRA Configuration

Field	Type	Description
`lora_adapter`	string	Path to LoRA adapter file
`lora_base`	string	Base model for LoRA
`lora_scale`	float32	LoRA scale factor
`lora_adapters`	array	Multiple LoRA adapters
`lora_scales`	array	Scales for multiple LoRA adapters
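
A minimal sketch of a single-adapter setup using the fields above. The model and adapter file names are hypothetical, and the scale value is illustrative.

```yaml
name: my-model-lora             # hypothetical model name
backend: llama-cpp
parameters:
  model: base-model.gguf        # hypothetical base model file
lora_adapter: my-adapter.gguf   # hypothetical adapter file
lora_scale: 1.0                 # illustrative scale factor
```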

Advanced Options

Field	Type	Description
`no_mulmatq`	bool	Disable the quantized matrix multiplication kernels (`mul_mat_q`)
`draft_model`	string	Draft model GGUF file for speculative decoding (see Speculative Decoding)
`n_draft`	int32	Maximum number of draft tokens per speculative step (default: 16)
`quantization`	string	Quantization format
`load_format`	string	Model load format
`numa`	bool	Enable NUMA (Non-Uniform Memory Access)
`rms_norm_eps`	float32	RMS normalization epsilon
`ngqa`	int32	Grouped-query attention (GQA) factor
`rope_scaling`	string	RoPE scaling configuration
`type`	string	Model type/architecture
`grammar`	string	Grammar file path for constrained generation

YARN Configuration

YARN (Yet Another RoPE extensioN) settings for context extension:

Field	Type	Description
`yarn_ext_factor`	float32	YARN extension factor
`yarn_attn_factor`	float32	YARN attention factor
`yarn_beta_fast`	float32	YARN beta fast parameter
`yarn_beta_slow`	float32	YARN beta slow parameter

Speculative Decoding

Speculative decoding speeds up text generation by predicting multiple tokens ahead and verifying them in a single forward pass. The output is identical to normal decoding - only faster. This feature is only available with the llama-cpp backend.

There are two approaches:

Draft Model Speculative Decoding

Uses a smaller, faster model from the same model family to draft candidate tokens, which the main model then verifies. Requires a separate GGUF file for the draft model.

name: my-model
backend: llama-cpp
parameters:
  model: large-model.gguf
draft_model: small-draft-model.gguf
n_draft: 8
options:
  - spec_p_min:0.8
  - draft_gpu_layers:99

N-gram Self-Speculative Decoding

Uses patterns from the token history to predict future tokens - no extra model required. Works well for repetitive or structured output (code, JSON, lists).

name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf
options:
  - spec_type:ngram_simple
  - spec_n_max:16

Speculative Decoding Options

These are set via the options: array in the model configuration (format: key:value):

Common options

Option	Type	Default	Description
`spec_type` / `speculative_type`	string	`none`	Speculative decoding type, or comma-separated list to chain multiple (see table below)
`spec_n_max` / `draft_max`	int	16	Maximum number of tokens to draft per step
`spec_n_min` / `draft_min`	int	0	Minimum draft tokens required to use speculation
`spec_p_min` / `draft_p_min`	float	0.75	Minimum probability threshold for greedy acceptance
`spec_p_split`	float	0.1	Split probability for tree-based branching

Draft-model options (apply when spec_type=draft, i.e. a draft_model is configured)

Option	Type	Default	Description
`draft_gpu_layers`	int	-1	GPU layers for the draft model (-1 = use default)
`draft_threads` / `spec_draft_threads`	int	same as main	Threads used by the draft model (`<= 0` = hardware concurrency)
`draft_threads_batch` / `spec_draft_threads_batch`	int	same as `draft_threads`	Threads used by the draft model during batch / prompt processing
`draft_cache_type_k` / `spec_draft_cache_type_k`	string	`f16`	KV cache K data type for the draft model (same values as `cache_type_k`)
`draft_cache_type_v` / `spec_draft_cache_type_v`	string	`f16`	KV cache V data type for the draft model
`draft_cpu_moe` / `spec_draft_cpu_moe`	bool	false	Keep all MoE expert weights of the draft model on CPU
`draft_n_cpu_moe` / `spec_draft_n_cpu_moe`	int	0	Keep MoE expert weights of the first N draft-model layers on CPU
`draft_override_tensor` / `spec_draft_override_tensor`	string	""	Comma-separated `<tensor regex>=<buffer type>` overrides for the draft model
`draft_ctx_size`	int	(ignored)	Deprecated upstream: the draft now shares the target context size. Accepted for backward compatibility but has no effect.

ngram_simple options (used when spec_type includes ngram_simple)

Option	Type	Default	Description
`spec_ngram_size_n` / `ngram_size_n`	int	12	N-gram lookup size
`spec_ngram_size_m` / `ngram_size_m`	int	48	M-gram proposal size
`spec_ngram_min_hits` / `ngram_min_hits`	int	1	Minimum hits for accepting n-gram proposals

ngram_mod options (used when spec_type includes ngram_mod)

Option	Type	Default	Description
`spec_ngram_mod_n_min`	int	48	Minimum number of ngram tokens to use
`spec_ngram_mod_n_max`	int	64	Maximum number of ngram tokens to use
`spec_ngram_mod_n_match`	int	24	Ngram lookup length

ngram_map_k options (used when spec_type includes ngram_map_k)

Option	Type	Default	Description
`spec_ngram_map_k_size_n`	int	12	N-gram lookup size
`spec_ngram_map_k_size_m`	int	48	M-gram proposal size
`spec_ngram_map_k_min_hits`	int	1	Minimum hits for accepting proposals

ngram_map_k4v options (used when spec_type includes ngram_map_k4v)

Option	Type	Default	Description
`spec_ngram_map_k4v_size_n`	int	12	N-gram lookup size
`spec_ngram_map_k4v_size_m`	int	48	M-gram proposal size
`spec_ngram_map_k4v_min_hits`	int	1	Minimum hits for accepting proposals

ngram_cache lookup files

Option	Type	Default	Description
`spec_lookup_cache_static` / `lookup_cache_static`	string	""	Path to a static ngram lookup cache file
`spec_lookup_cache_dynamic` / `lookup_cache_dynamic`	string	""	Path to a dynamic ngram lookup cache file (updated by generation)

Speculative Type Values

The canonical names match upstream llama.cpp (dash-separated). For backward compatibility LocalAI also accepts the underscore-separated forms and the bare draft / eagle3 aliases.

Type	Aliases accepted	Description
`none`		No speculative decoding (default)
`draft-simple`	`draft`, `draft_simple`	Draft model-based speculation (auto-set when `draft_model` is configured)
`draft-eagle3`	`eagle3`, `draft_eagle3`	EAGLE3 draft model architecture
`draft-mtp`	`draft_mtp`	Multi-Token Prediction. Reuses the target model’s embedded MTP head; no separate draft GGUF required (`draft_model` can be omitted).
`ngram-simple`	`ngram_simple`	Simple self-speculative using token history
`ngram-map-k`	`ngram_map_k`	N-gram with key-only map
`ngram-map-k4v`	`ngram_map_k4v`	N-gram with keys and 4 m-gram values
`ngram-mod`	`ngram_mod`	Modified n-gram speculation
`ngram-cache`	`ngram_cache`	3-level n-gram cache

Multiple types can be chained by passing a comma-separated list to spec_type (e.g. spec_type:ngram-simple,ngram-mod). The runtime tries them in order and accepts the first proposal that meets the acceptance criteria.
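
As a sketch, the fragment below chains two n-gram types and spells out the per-type options each one reads. All values are the defaults listed in the tables above, so the fragment only makes the configuration explicit.

```yaml
options:
  - spec_type:ngram-simple,ngram-mod   # tried in this order
  - spec_n_max:16
  - spec_ngram_size_n:12               # read by ngram-simple
  - spec_ngram_size_m:48               # read by ngram-simple
  - spec_ngram_mod_n_match:24          # read by ngram-mod
```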

Note

Speculative decoding is automatically disabled when multimodal models (with mmproj) are active. The n_draft parameter can also be overridden per-request.

Multi-Token Prediction (MTP)

draft-mtp enables Multi-Token Prediction (ggml-org/llama.cpp#22673). MTP uses a small prediction head trained into the target model: the head runs alongside the main forward pass and proposes the next few tokens, which the target then verifies in a single batched step. Upstream reports ~1.85x-2.1x token throughput at ~72-82% draft acceptance on Qwen3.6 27B / 35B A3B.

Auto-detection (default). When a GGUF declares an MTP head (the upstream <arch>.nextn_predict_layers metadata key, set by convert_hf_to_gguf.py for Qwen3.5/3.6 family models and similar), LocalAI auto-enables MTP with the following defaults:

options:
  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

Detection runs both at import time (the /import-model UI / POST /models/import-uri flow range-fetches the GGUF header and writes the options into the generated YAML before you save it) and at load time (every llama-cpp model start re-checks the local header and appends the options if spec_type isn’t already set). To opt out, set an explicit spec_type: / speculative_type: in your YAML - auto-detection always preserves the user value, including spec_type:none.

Two ways to load the MTP head:

Embedded in the target GGUF (the recommended path for LocalAI, and what auto-detection assumes). When spec_type includes draft-mtp and draft_model is empty, the backend builds the MTP draft context directly from the target model’s weights. The GGUF must have been converted with the MTP tensors included.
Separate mtp-*.gguf sibling file. If you point draft_model at the separate MTP-head GGUF that ships next to the main weights on HuggingFace, the backend will load it as a draft model. Note: upstream’s -hf auto-discovery of mtp-*.gguf siblings is not wired into LocalAI’s gRPC layer - you need to download the sibling file and configure draft_model explicitly.

Manual override knobs (overlap with the auto-detect defaults above):

Option	Recommended	Notes
`spec_type`	`draft-mtp`	Activates MTP. Can be chained with other types (see below).
`spec_n_max` / `draft_max`	`2`-`6`	Number of draft tokens per step. Upstream’s PR suggests 2-3 for the tightest acceptance window; LocalAI’s auto-default is 6 to favour throughput on models with high acceptance.
`spec_p_min`	`0.75`	Pinned because upstream marks the current default with a “change to 0.0f” TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps.
`mmproj_use_gpu`	`false` (or unset `mmproj`)	MTP has a prompt-processing overhead; if the model is non-vision, drop the mmproj entirely to save VRAM.

Minimal config (override-only, since auto-detection already covers this for MTP-capable GGUFs):

name: qwen3-mtp
backend: llama-cpp
parameters:
  model: qwen3-27b-with-mtp.gguf
options:
  - spec_type:draft-mtp
  - spec_n_max:3

With a separate MTP head file:

name: qwen3-mtp
backend: llama-cpp
parameters:
  model: qwen3-27b.gguf
draft_model: qwen3-27b-mtp-head.gguf
options:
  - spec_type:draft-mtp
  - spec_n_max:3

Chaining MTP with n-gram fallback (experimental, from the PR’s usage notes - useful when MTP acceptance drops on highly repetitive output):

options:
  - spec_type:draft-mtp,ngram-mod
  - spec_n_max:3
  - spec_ngram_mod_n_match:24

Pre-converted GGUFs with MTP heads are published on the ggml-org HuggingFace org (initially Qwen3.6 27B and Qwen3.6 35B A3B).

Reasoning Models (DeepSeek-R1, Qwen3, etc.)

These load-time options control how the backend parses <think> reasoning blocks and how much budget the model is allowed for thinking. They are set per model via the options: array. For how reasoning is returned alongside tool calls and survives the tool-result round trip, see Interleaved Thinking with Tool Calls.

Option	Type	Default	Description
`reasoning_format`	string	`deepseek`	Parser for reasoning/thinking blocks. One of `none`, `auto`, `deepseek`, `deepseek-legacy` (alias `deepseek_legacy`).
`enable_reasoning` / `reasoning_budget`	int	`-1`	Reasoning budget in tokens: `-1` unlimited, `0` disabled, `>0` token cap for the thinking section.
`prefill_assistant`	bool	`true`	When `false`, the trailing assistant message is not pre-filled by the chat template.
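
A minimal sketch of these options in a model file. The model name, file name, and the 1024-token budget are illustrative assumptions.

```yaml
name: my-reasoning-model             # hypothetical model name
backend: llama-cpp
parameters:
  model: my-reasoning-model.gguf     # hypothetical model file
options:
  - reasoning_format:deepseek        # the default parser, shown for clarity
  - reasoning_budget:1024            # illustrative cap on thinking tokens
```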

Note

This is the load-time reasoning configuration. The orthogonal per-request enable_thinking chat-template kwarg toggles thinking on/off per call without restarting the model. It can be driven either by the YAML reasoning.disable field (model default) or per request via the OpenAI reasoning_effort field on /v1/chat/completions:

reasoning_effort: "none" disables thinking for that request (enable_thinking=false) - useful to run a single reasoning model like Qwen3 for low-latency tasks while still enabling reasoning on other requests.
reasoning_effort: "minimal" | "low" | "medium" | "high" enables thinking, unless the model config explicitly set reasoning.disable: true (an operator’s explicit disable wins and is never re-enabled by a request).

`reasoning_effort` as a chat-template kwarg

reasoning_effort is also forwarded to the backend as a chat_template_kwarg, so models whose jinja chat template keys on it - e.g. gpt-oss (Harmony) or LFM2.5 - honor the level, not just the on/off enable_thinking flag. This matters for models that ignore enable_thinking entirely (LFM2.5 keeps emitting <think> for enable_thinking=false, but respects reasoning_effort).

Set a per-model default in the config so every request inherits it (a per-request reasoning_effort still overrides):

name: my-model
reasoning_effort: none   # none | minimal | low | medium | high

For realtime pipelines, set it on the pipeline so it applies to the pipeline’s LLM without editing that model’s own config:

name: gpt-realtime
pipeline:
  llm: lfm2.5
  reasoning_effort: none   # overrides the LLM model's own reasoning_effort

Custom `chat_template_kwargs`

Some jinja chat templates expose extra variables beyond enable_thinking / reasoning_effort (for example Qwen3’s preserve_thinking). Set arbitrary key/values in the model config and they are forwarded to the backend’s chat_template_kwargs as-is, so you don’t need a dedicated server option per template variable:

name: qwen3
chat_template_kwargs:
  preserve_thinking: true

You can also override (or add) any of these per request through the OpenAI metadata field on /v1/chat/completions. Values are strings; "true" / "false" are coerced to booleans, anything else is passed through as a string:

{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "hi"}],
  "metadata": { "preserve_thinking": "true", "enable_thinking": "false" }
}

Per-request metadata overrides the model config defaults and the reasoning-config levers, and (for enable_thinking / reasoning_effort) takes effect across every backend that reads them, not just llama.cpp. Typed (non-boolean) values are only supported through the model YAML chat_template_kwargs, where YAML preserves the type.

Multimodal Backend Options

Option	Type	Default	Description
`mmproj_use_gpu` / `mmproj_offload`	bool	`true`	Set `false` to keep the multimodal projector on CPU (saves VRAM at cost of speed).
`image_min_tokens`	int	`-1`	Minimum vision tokens per image. `-1` keeps the model default.
`image_max_tokens`	int	`-1`	Maximum vision tokens per image. `-1` keeps the model default.
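
For illustration, a fragment that keeps the projector on the CPU and caps the vision tokens per image. The token limit is a hypothetical value; suitable limits depend on the model.

```yaml
options:
  - mmproj_use_gpu:false     # keep the multimodal projector on the CPU
  - image_max_tokens:1024    # hypothetical per-image cap
```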

Embedding & Reranking Backend Options

Option	Type	Default	Description
`pooling_type` / `pooling`	string	auto	Pooling strategy for embeddings: `none`, `mean`, `cls`, `last`, `rank`. Reranking automatically uses `rank`.
`embd_normalize` / `embedding_normalize`	int	`2`	Normalization: `-1` none, `0` max-abs, `1` taxicab, `2` Euclidean (L2), `>2` p-norm.
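
A sketch of an embedding setup that selects mean pooling and keeps the default Euclidean normalization:

```yaml
options:
  - pooling_type:mean
  - embd_normalize:2    # 2 = Euclidean (L2), the default
```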

Other Backend Tuning Options

These llama.cpp options are passed through the options: array.

Option	Type	Default	Description
`n_ubatch` / `ubatch`	int	same as `batch`	Physical batch size. Decouple from `n_batch` when an embedding/rerank workload needs a different value.
`threads_batch` / `n_threads_batch`	int	same as `threads`	Threads used during prompt processing. `<= 0` means `hardware_concurrency()`.
`direct_io` / `use_direct_io`	bool	`false`	Open the model with `O_DIRECT` (faster cold loads on NVMe; ignored if not supported).
`verbosity`	int	`3`	llama.cpp internal log verbosity threshold. Higher = more verbose.
`device` / `devices`	string	all devices	Select the llama.cpp backend devices to use. Repeat the option or pass a comma-separated list; unlisted devices are excluded. Use the names reported by `llama-server --list-devices`.
`override_tensor` / `tensor_buft_overrides`	string	""	Per-tensor buffer-type overrides for the main model. Format: `<tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...`. Mirrors the existing `draft_override_tensor` syntax for the draft model.
`cpu_moe`	bool	false	Keep all MoE expert weights of the main model on CPU (upstream `--cpu-moe`). Frees VRAM on large MoE models (DeepSeek, Qwen3 `*-A3B`).
`n_cpu_moe`	int	0	Keep MoE expert weights of the first N main-model layers on CPU (upstream `--n-cpu-moe`).

Generic option passthrough

Any options: entry whose name starts with - is forwarded verbatim to upstream llama.cpp’s own llama-server argument parser. This means any flag the bundled llama.cpp supports works without LocalAI needing a dedicated option, even ones added after your LocalAI version was built. See the upstream server flags reference.

Format mirrors the rest of the array - --flag for a boolean, or --flag:value for a flag that takes a value. Everything after the first : is the value, so embedded colons (e.g. host:port) are preserved:

options:
  - "--cpu-moe"                 # boolean flag
  - "--n-cpu-moe:4"             # flag with a value
  - "--override-tensor:exps=CPU"
  - "devices:CUDA1,CUDA2,CUDA3" # skip CUDA0, e.g. a display GPU

Notes:

Precedence: passthrough flags are applied last, so an explicit flag overrides the LocalAI option it maps to (e.g. --ctx-size:8192 overrides context_size).
Power-user territory: an invalid flag or value is rejected by the upstream parser exactly as it would be by llama-server, which can fail model loading. Prefer the named options above when one exists.
Flags that would terminate the process (such as --help, --usage, --version, --license, --list-devices, --cache-list, and --completion*) are ignored.

Prompt Caching

The recommended way to enable prompt caching for the llama-cpp backend is the server-side prompt cache controlled by cache_ram / kv_unified / cache_idle_slots in the options: array (see llama.cpp backend options). It’s on by default since LocalAI v4.3 and is what gives repeated system prompts a near-zero prefill on the second call.

The fields below come from upstream llama.cpp’s CLI completion tool and are passed through to the gRPC backend for compatibility, but the gRPC server itself does not consume them: keep them empty unless you’re targeting a non-llama-cpp backend that reads them.

Field	Type	Description
`prompt_cache_path`	string	(legacy / unused by llama-cpp gRPC server) Path to a file-backed prompt cache for upstream’s CLI completion tool.
`prompt_cache_all`	bool	(legacy / unused by llama-cpp gRPC server)
`prompt_cache_ro`	bool	(legacy / unused by llama-cpp gRPC server)

Text Processing

Field	Type	Description
`stopwords`	array	Words or phrases that stop generation
`cutstrings`	array	Strings to cut from responses
`trimspace`	array	Strings to trim whitespace from
`trimsuffix`	array	Suffixes to trim from responses
`extract_regex`	array	Regular expressions to extract content
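
A minimal sketch of two of these fields, assuming each takes a plain YAML list of strings. The stop strings are illustrative and should match the markers your prompt template actually emits.

```yaml
stopwords:
  - "</s>"               # illustrative end-of-sequence marker
  - "### Instruction:"   # illustrative turn marker
trimsuffix:
  - "\n"                 # drop a trailing newline from responses
```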

System Prompt

Field	Type	Description
`system_prompt`	string	Default system prompt for the model

vLLM-Specific Configuration

These options apply when using the vllm backend:

Field	Type	Description
`gpu_memory_utilization`	float32	GPU memory utilization (0.0-1.0, default 0.9)
`trust_remote_code`	bool	Trust and execute remote code
`enforce_eager`	bool	Force eager execution mode
`swap_space`	int	Swap space in GB
`max_model_len`	int	Maximum model length
`tensor_parallel_size`	int	Tensor parallelism size
`disable_log_stats`	bool	Disable logging statistics
`dtype`	string	Data type (e.g., `float16`, `bfloat16`)
`flash_attention`	string	Flash attention configuration
`cache_type_k`	string	Key cache quantization type. Maps to llama.cpp’s `-ctk`. Accepted values for llama.cpp-family backends (`llama-cpp`, `ik-llama-cpp`, `turboquant`): `f16`, `f32`, `q8_0`, `q4_0`, `q4_1`, `q5_0`, `q5_1`. The `turboquant` backend additionally accepts `turbo2`, `turbo3`, `turbo4` - the fork’s TurboQuant KV-cache schemes. `turbo3`/`turbo4` auto-enable flash_attention.
`cache_type_v`	string	Value cache quantization type. Maps to llama.cpp’s `-ctv`. Same accepted values as `cache_type_k`. Note: any quantized V cache requires flash_attention to be enabled.
`limit_mm_per_prompt`	object	Limit multimodal content per prompt: `{image: int, video: int, audio: int}`
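
As a sketch, a vLLM model file using several of the fields above. It assumes these fields sit at the top level of the model configuration, like the other LLM fields on this page. The model name, repository ID, and values are illustrative.

```yaml
name: my-vllm-model            # hypothetical model name
backend: vllm
parameters:
  model: owner/repository      # hypothetical repository ID
gpu_memory_utilization: 0.9    # the default
max_model_len: 8192            # illustrative
tensor_parallel_size: 2        # illustrative: shard across two GPUs
dtype: bfloat16
limit_mm_per_prompt:
  image: 2                     # illustrative: at most two images per prompt
```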

Template Configuration

Templates use Go templates with Sprig functions.

Field	Type	Description
`template.chat`	string	Template for chat completion endpoint
`template.chat_message`	string	Template for individual chat messages
`template.completion`	string	Template for text completion
`template.edit`	string	Template for edit operations
`template.function`	string	Template for function/tool calls
`template.multimodal`	string	Template for multimodal interactions
`template.reply_prefix`	string	Prefix to add to model replies
`template.use_tokenizer_template`	bool	Use tokenizer’s built-in template (vLLM/transformers)
`template.join_chat_messages_by_character`	string	Character to join chat messages (default: `\n`)
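
For backends that ship a tokenizer template, a sketch of delegating prompt formatting to it rather than writing a Go template. The model name and repository ID are hypothetical.

```yaml
name: my-model               # hypothetical model name
backend: vllm
parameters:
  model: owner/repository    # hypothetical repository ID
template:
  use_tokenizer_template: true
```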

Template Variables

The following variables are commonly available in templates:

{{.Input}} - User input
{{.Instruction}} - Instruction for edit operations
{{.System}} - System message
{{.Prompt}} - Full prompt
{{.Functions}} - Function definitions (for function calling)
{{.FunctionCall}} - Function call result

Example Template

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}{{end}}
    {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
    {{end}}
    Assistant:

Function Calling Configuration

Configure how the model handles function/tool calls:

Field	Type	Default	Description
`function.disable_no_action`	bool	`false`	Disable the no-action behavior
`function.no_action_function_name`	string	`answer`	Name of the no-action function
`function.no_action_description_name`	string		Description for no-action function
`function.function_name_key`	string	`name`	JSON key for function name
`function.function_arguments_key`	string	`arguments`	JSON key for function arguments
`function.response_regex`	array		Named regex patterns to extract function calls
`function.argument_regex`	array		Named regex to extract function arguments
`function.argument_regex_key_name`	string	`key`	Named regex capture for argument key
`function.argument_regex_value_name`	string	`value`	Named regex capture for argument value
`function.json_regex_match`	array		Regex patterns to match JSON in tool mode
`function.replace_function_results`	array		Replace function call results with patterns
`function.replace_llm_results`	array		Replace LLM results with patterns
`function.capture_llm_results`	array		Capture LLM results as text (e.g., for “thinking” blocks)

Grammar Configuration

Field	Type	Default	Description
`function.grammar.disable`	bool	`false`	Completely disable grammar enforcement
`function.grammar.parallel_calls`	bool	`false`	Allow parallel function calls
`function.grammar.mixed_mode`	bool	`false`	Allow mixed-mode grammar enforcing
`function.grammar.no_mixed_free_string`	bool	`false`	Disallow free strings in mixed mode
`function.grammar.disable_parallel_new_lines`	bool	`false`	Disable parallel processing for new lines
`function.grammar.prefix`	string		Prefix to add before grammar rules
`function.grammar.expect_strings_after_json`	bool	`false`	Expect strings after JSON data
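
A sketch of how the dotted field names above might nest in YAML, assuming each dot maps to one level of nesting. The first three values are the documented defaults; the grammar switches are illustrative.

```yaml
function:
  no_action_function_name: answer
  function_name_key: name
  function_arguments_key: arguments
  grammar:
    parallel_calls: true   # illustrative: allow several calls per response
    mixed_mode: true       # illustrative
```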

Diffusers Configuration

For image generation models using the diffusers backend:

Field	Type	Description
`diffusers.cuda`	bool	Enable CUDA for diffusers
`diffusers.pipeline_type`	string	Pipeline type (e.g., `stable-diffusion`, `stable-diffusion-xl`)
`diffusers.scheduler_type`	string	Scheduler type (e.g., `euler`, `ddpm`)
`diffusers.enable_parameters`	string	Comma-separated parameters to enable
`diffusers.cfg_scale`	float32	Classifier-free guidance scale
`diffusers.img2img`	bool	Enable image-to-image transformation
`diffusers.clip_skip`	int	Number of CLIP layers to skip
`diffusers.clip_model`	string	CLIP model to use
`diffusers.clip_subfolder`	string	CLIP model subfolder
`diffusers.control_net`	string	ControlNet model to use
`step`	int	Number of diffusion steps
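
A minimal sketch of a diffusers model file. It assumes the `diffusers.*` fields nest under a diffusers key and that step sits at the top level, as the table lists it without a prefix. The model name, repository ID, and numeric values are illustrative.

```yaml
name: my-image-model         # hypothetical model name
backend: diffusers
parameters:
  model: owner/repository    # hypothetical repository ID
diffusers:
  cuda: true
  pipeline_type: stable-diffusion-xl
  scheduler_type: euler
  cfg_scale: 7.0             # illustrative guidance scale
step: 25                     # illustrative number of diffusion steps
```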

TTS Configuration

For text-to-speech models:

Field	Type	Description
`tts.voice`	string	Default backend voice ID, speaker name, or reference path. A request `voice` takes precedence.
`tts.audio_path`	string	Default reference-audio path for cloning backends. A request voice or saved Voice Library profile takes precedence.
`tts.voice_cloning`	bool	Optional Voice Library capability override. Omit for automatic backend/variant detection; `true` opts in a verified custom-named variant and `false` rejects saved profile references.

For example, a custom-named model on a known cloning backend can declare support explicitly while retaining a model-wide reference fallback:

name: private-voice-model
backend: qwen3-tts-cpp
parameters:
  model: private/qwen-talker-base.gguf
known_usecases:
  - tts
tts:
  voice_cloning: true
  audio_path: voices/default-reference.wav

tts.voice_cloning: true only overrides model-variant detection. It cannot enable cloning on a backend that does not implement LocalAI’s reference-audio contract.

Roles Configuration

Map conversation roles to specific strings:

roles:
  user: "### Instruction:"
  assistant: "### Response:"
  system: "### System Instruction:"

Feature Flags

Enable or disable experimental features:

feature_flags:
  feature_name: true
  another_feature: false

MCP Configuration

Model Context Protocol (MCP) configuration:

Field	Type	Description
`mcp.remote`	string	YAML string defining remote MCP servers
`mcp.stdio`	string	YAML string defining STDIO MCP servers
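
Because both fields are strings rather than nested maps, a YAML block scalar (|) is a natural way to embed the server definitions. The sketch below only illustrates that shape. The inner mcpServers layout and its keys (url, command, args), as well as the server names and URL, are assumptions borrowed from common MCP client configuration and are not defined on this page; confirm the expected schema in the MCP feature documentation.

```yaml
mcp:
  # Each value is a single string; "|" keeps the embedded document's line breaks.
  remote: |
    mcpServers:                  # assumed layout, not confirmed by this page
      example-remote:            # hypothetical server name
        url: "https://mcp.example.com/v1"
  stdio: |
    mcpServers:                  # assumed layout, not confirmed by this page
      example-local:             # hypothetical server name
        command: "python"
        args: ["-m", "example_mcp_server"]
```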

Agent Configuration

Settings that control autonomous agent behavior:

Field	Type	Description
`agent.max_attempts`	int	Maximum number of attempts
`agent.max_iterations`	int	Maximum number of iterations
`agent.enable_reasoning`	bool	Enable reasoning capabilities
`agent.enable_planning`	bool	Enable planning capabilities
`agent.enable_mcp_prompts`	bool	Enable MCP prompts
`agent.enable_plan_re_evaluator`	bool	Enable plan re-evaluation
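
A minimal sketch of an agent block using the fields above. The numbers and flags are illustrative choices, not LocalAI defaults:

```yaml
agent:
  max_attempts: 3                  # illustrative value
  max_iterations: 10               # illustrative value
  enable_reasoning: true
  enable_planning: false
  enable_mcp_prompts: true
  enable_plan_re_evaluator: false
```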

Reasoning Configuration

Configure how reasoning tags are extracted and processed from model output. Reasoning tags are used by models like DeepSeek, Command-R, and others to include internal reasoning steps in their responses.

Field	Type	Default	Description
`reasoning.disable`	bool	`false`	When `true`, disables reasoning extraction entirely. The original content is returned without any processing.
`reasoning.disable_reasoning_tag_prefill`	bool	`false`	When `true`, disables automatic prepending of thinking start tokens. Use this when your model already includes reasoning tags in its output format.
`reasoning.strip_reasoning_only`	bool	`false`	When `true`, extracts and removes reasoning tags from content but discards the reasoning text. Useful when you want to clean reasoning tags from output without storing the reasoning content.
`reasoning.thinking_start_tokens`	array	`[]`	List of custom thinking start tokens to detect in prompts. Custom tokens are checked before default tokens.
`reasoning.tag_pairs`	array	`[]`	List of custom tag pairs for reasoning extraction. Each entry has `start` and `end` fields. Custom pairs are checked before default pairs.

Reasoning Tag Formats

The reasoning extraction supports multiple tag formats used by different models:

<thinking>...</thinking> - General thinking tag
<think>...</think> - DeepSeek, Granite, ExaOne, GLM models
<|START_THINKING|>...<|END_THINKING|> - Command-R models
<|inner_prefix|>...<|inner_suffix|> - Apertus models
<seed:think>...</seed:think> - Seed models
<|think|>...<|end|><|begin|>assistant<|content|> - Solar Open models
[THINK]...[/THINK] - Magistral models

Examples

Disable reasoning extraction:

reasoning:
  disable: true

Extract reasoning but don’t prepend tags:

reasoning:
  disable_reasoning_tag_prefill: true

Strip reasoning tags without storing reasoning content:

reasoning:
  strip_reasoning_only: true

Complete example with reasoning configuration:

name: deepseek-model
backend: llama-cpp
parameters:
  model: deepseek.gguf

reasoning:
  disable: false
  disable_reasoning_tag_prefill: false
  strip_reasoning_only: false

Example with custom tokens and tag pairs:

name: custom-reasoning-model
backend: llama-cpp
parameters:
  model: custom.gguf

reasoning:
  thinking_start_tokens:
    - "<custom:think>"
    - "<my:reasoning>"
  tag_pairs:
    - start: "<custom:think>"
      end: "</custom:think>"
    - start: "<my:reasoning>"
      end: "</my:reasoning>"

Note: Custom tokens and tag pairs are checked before the default ones, giving them priority. This allows you to override default behavior or add support for new reasoning tag formats.

Per-Request Override via Metadata

The reasoning.disable setting from model configuration can be overridden on a per-request basis using the metadata field in the OpenAI chat completion request. This allows you to enable or disable thinking for individual requests without changing the model configuration.

The metadata field accepts a map[string]string that is forwarded to the backend, so values are passed as strings ("true" or "false") rather than JSON booleans. The enable_thinking key controls thinking behavior:

# Enable thinking for a single request (overrides model config)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "metadata": {"enable_thinking": "true"}
  }'

# Disable thinking for a single request (overrides model config)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Hello"}],
    "metadata": {"enable_thinking": "false"}
  }'

Priority order:

Request-level metadata.enable_thinking (highest priority)
Model config reasoning.disable (fallback)
Auto-detected from model template (default)

Pipeline Configuration

Define pipelines for audio-to-audio processing and the Realtime API:

Field	Type	Description
`pipeline.tts`	string	TTS model name
`pipeline.llm`	string	LLM model name
`pipeline.transcription`	string	Transcription model name
`pipeline.vad`	string	Voice activity detection model name
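
For example, a pipeline definition might wire four separately configured models together. Every model name below is a placeholder and is assumed to match the name of another model configuration:

```yaml
name: voice-assistant              # hypothetical pipeline model name
pipeline:
  vad: my-vad-model                # placeholder: voice activity detection
  transcription: my-whisper-model  # placeholder: speech-to-text
  llm: my-chat-model               # placeholder: text generation
  tts: my-tts-model                # placeholder: text-to-speech
```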

gRPC Configuration

Backend gRPC communication settings. These control the readiness handshake between LocalAI and a freshly spawned backend process - LocalAI polls the backend’s Health gRPC method up to grpc.attempts times, sleeping grpc.attempts_sleep_time seconds between polls, before giving up and terminating the backend as unresponsive.

Field	Type	Default	Description
`grpc.attempts`	int	20	Number of health-check attempts before the backend is killed as unresponsive
`grpc.attempts_sleep_time`	int	2	Sleep time between health-check attempts (seconds)

Total load window ≈ grpc.attempts × (grpc.attempts_sleep_time + per-call gRPC dial timeout). With the defaults this is about 20 × 2 s ≈ 40 s, which is fine for typical backends but too short for large models that need substantial time to become gRPC-ready after the process starts. Examples include NVFP4 / FP8 models whose shard loading and CUDA-graph capture can take several minutes, and models served from slow storage. If the backend keeps getting killed while it is still legitimately loading, raise these values. The symptom is exitCode=120 plus rpc error: code = Canceled desc = context canceled in the LocalAI log, while the backend’s own stderr shows continued forward progress.

Example configuration for a model that needs up to ~10 minutes to become gRPC-ready (large NVFP4 model, cold shard load + CUDA-graph capture):

grpc:
  attempts: 140
  attempts_sleep_time: 5

This gives a ~700 s window while keeping health-check polling frequent enough to detect real backend crashes quickly. The values only affect the initial readiness handshake - inference-request timeouts and the watchdog are unchanged.

Overrides

Override model configuration values at runtime (llama.cpp):

overrides:
  - "qwen3moe.expert_used_count=int:10"
  - "some_key=string:value"

Format: KEY=TYPE:VALUE where TYPE is int, float, string, or bool.
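
The example above covers int and string; bool and float entries follow the same KEY=TYPE:VALUE shape. In this sketch the first key is a GGUF metadata key used purely for illustration and the second is a placeholder. Whether a given key has any effect is assumed to depend on the model and backend.

```yaml
overrides:
  - "tokenizer.ggml.add_bos_token=bool:false"   # illustrative metadata key
  - "some_float_key=float:0.5"                  # placeholder key
```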

Known Use Cases

Specify which endpoints this model supports:

known_usecases:
  - chat
  - completion
  - embeddings

Available flags: chat, completion, edit, embeddings, rerank, image, transcript, tts, sound_generation, tokenize, vad, video, detection, llm (shorthand for chat, completion, and edit).

token_classify marks a model as a token-classification (NER) provider for the PII filter (e.g. an openai-privacy-filter GGUF). Declare it explicitly together with embeddings: true (the classifier loads via TOKEN_CLS pooling). It runs on the dedicated privacy-filter backend (backend/cpp/privacy-filter), a standalone GGML engine for the openai-privacy-filter family - separate from llama-cpp, which no longer carries the token-classification path.

Known Input and Output Modalities

Use known_input_modalities and known_output_modalities when a use case does not fully describe a model’s I/O. For example, both text-to-video and audio-driven avatar models use the video use case, but only the avatar model accepts audio:

known_usecases:
  - video
known_input_modalities:
  - text
  - image
  - audio
known_output_modalities:
  - video

Valid modality values are text, image, audio, and video. Explicit values are combined with modalities LocalAI can infer from the model use cases and configuration. The resulting canonical, de-duplicated lists are exposed by GET /v1/models/capabilities.
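
By contrast, a text-to-video model with the same use case could declare text as its only input. This is a sketch of that counterpart to the avatar example above, not a configuration for any specific model:

```yaml
# Hypothetical text-to-video model: same use case, no audio input.
known_usecases:
  - video
known_input_modalities:
  - text
known_output_modalities:
  - video
```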

PII Filtering

PII redaction is NER-based and runs on the request (input) side. It has two halves:

Detector models are token_classify models that carry the detection policy in a top-level pii_detection: block. The policy is defined once, on the model itself:

name: privacy-filter-multilingual
backend: privacy-filter
embeddings: true
known_usecases:
  - token_classify
pii_detection:
  min_score: 0.5            # drop detections below this confidence
  default_action: mask      # mask | block | allow - applied to any detected
                            # group with no explicit entry (empty = mask)
  entity_actions:           # which PII to block vs mask vs allow-log
    PASSWORD: block
    CREDITCARD: block
    EMAIL: mask

Consuming models opt in and reference one or more detectors by name - no per-consumer policy:

name: my-assistant
pii:
  enabled: true             # default: off for local backends, on for cloud-proxy
  detectors:
    - privacy-filter-multilingual

Multiple detectors union their detections; overlapping spans resolve to the strongest action (block > mask > allow). A configured detector that can’t be loaded fails the request closed (HTTP 503) rather than silently skipping the check. Detections are audited at /api/pii/events (hash-prefix only, never the raw value).
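
A consuming model that references two detectors could look like the sketch below. The second detector name is hypothetical and is assumed to be another token_classify model carrying its own pii_detection block. Under the rule above, if one detector masks a span and the other blocks an overlapping span, the request is blocked.

```yaml
name: my-assistant
pii:
  enabled: true
  detectors:
    - privacy-filter-multilingual
    - privacy-filter-secrets       # hypothetical second detector
```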

The earlier regex pattern tier (pii.patterns, the global pattern catalogue, --pii-config, and the /api/pii/patterns admin endpoints) has been removed, along with response/streaming-side redaction. Those keys are now ignored and log a startup warning; migrate to pii.detectors plus a pii_detection block on the detector model.

Complete Example

Here’s a comprehensive example combining many options:

name: my-llm-model
description: A high-performance LLM model
backend: llama-cpp

parameters:
  model: my-model.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 2048

context_size: 4096
threads: 8
f16: true
gpu_layers: 35

system_prompt: "You are a helpful AI assistant."

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:

roles:
  user: "User:"
  assistant: "Assistant:"
  system: "System:"

stopwords:
  - "\n\nUser:"
  - "\n\nHuman:"

prompt_cache_path: "cache/my-model"
prompt_cache_all: true

function:
  grammar:
    parallel_calls: true
    mixed_mode: false

feature_flags:
  experimental_feature: true

See Advanced Usage for other configuration options
See Prompt Templates for template examples
See CLI Reference for command-line options

GPU Auto-Fit Mode

Note: By default, LocalAI sets gpu_layers to a very large value (9999999), which effectively disables llama-cpp’s auto-fit functionality. This is intentional to work with LocalAI’s VRAM-based model unloading mechanism.

To enable llama-cpp’s auto-fit mode, set gpu_layers: -1 in your model configuration. However, be aware of the following:

Trade-off: Enabling auto-fit conflicts with LocalAI’s built-in VRAM threshold-based unloading. Auto-fit attempts to fit all tensors into GPU memory automatically, while LocalAI’s unloading mechanism removes models when VRAM usage exceeds thresholds.
Known Issues: Setting gpu_layers: -1 may trigger tensor_buft_override buffer errors in some configurations, particularly when the model exceeds available GPU memory.
Recommendation:
- Use the default settings for most use cases (LocalAI manages VRAM automatically)
- Only enable gpu_layers: -1 if you understand the implications and have tested on your specific hardware
- Monitor VRAM usage carefully when using auto-fit mode
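
If you do opt in, the change is a single field. A minimal sketch, with a placeholder model name and file:

```yaml
name: my-autofit-model             # placeholder name
backend: llama-cpp
parameters:
  model: my-model.gguf             # placeholder file
gpu_layers: -1                     # hand layer placement to llama-cpp's auto-fit
```

Leaving gpu_layers unset keeps the default behavior described in the note above.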

This is a known limitation being tracked in issue #8562. A future implementation may provide a runtime toggle or custom logic to reconcile auto-fit with threshold-based unloading.