Model Configuration

LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.

Overview

Model configuration files allow you to:

  • Define default parameters (temperature, top_p, etc.)
  • Configure prompt templates
  • Specify backend settings
  • Set up function calling
  • Configure GPU and memory options
  • And much more

Configuration File Locations

You can create model configuration files in several ways:

  1. Individual YAML files in the models directory (e.g., models/gpt-3.5-turbo.yaml)
  2. A single config file that defines multiple models, passed with --models-config-file or LOCALAI_MODELS_CONFIG_FILE
  3. Remote URLs: specify a URL to a YAML configuration file at startup

Example: Basic Configuration

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-stable

template:
  completion: completion
  chat: chat

Example: Multiple Models in One File

When using --models-config-file, you can define multiple models as a list:

- name: model1
  parameters:
    model: model1.bin
  context_size: 512
  backend: llama-stable

- name: model2
  parameters:
    model: model2.bin
  context_size: 1024
  backend: llama-stable

Core Configuration Fields

Basic Model Settings

Field | Type | Description | Example
--- | --- | --- | ---
name | string | Model name, used to identify the model in API calls | gpt-3.5-turbo
backend | string | Backend to use (e.g. llama-cpp, vllm, diffusers, whisper) | llama-cpp
description | string | Human-readable description of the model | A conversational AI model
usage | string | Usage instructions or notes | Best for general conversation

Model File and Downloads

Field | Type | Description
--- | --- | ---
parameters.model | string | Path to the model file (relative to models directory) or URL
download_files | array | List of files to download. Each entry has filename, uri, and optional sha256

Example:

parameters:
  model: my-model.gguf

download_files:
  - filename: my-model.gguf
    uri: https://example.com/model.gguf
    sha256: abc123...

Parameters Section

The parameters section contains all OpenAI-compatible request parameters and model-specific options.

OpenAI-Compatible Parameters

These settings are used as defaults for all API calls to the model.

Field | Type | Default | Description
--- | --- | --- | ---
temperature | float | 0.9 | Sampling temperature (0.0-2.0). Higher values make output more random
top_p | float | 0.95 | Nucleus sampling: consider tokens with top_p probability mass
top_k | int | 40 | Consider only the top K most likely tokens
max_tokens | int | 0 | Maximum number of tokens to generate (0 = unlimited)
frequency_penalty | float | 0.0 | Penalty for token frequency (-2.0 to 2.0)
presence_penalty | float | 0.0 | Penalty for token presence (-2.0 to 2.0)
repeat_penalty | float | 1.1 | Penalty for repeating tokens
repeat_last_n | int | 64 | Number of previous tokens to consider for repeat penalty
seed | int | -1 | Random seed (omit for random)
echo | bool | false | Echo back the prompt in the response
n | int | 1 | Number of completions to generate
logprobs | bool/int | false | Return log probabilities of tokens
top_logprobs | int | 0 | Number of top logprobs to return per token (0-20)
logit_bias | map | {} | Map of token IDs to bias values (-100 to 100)
typical_p | float | 1.0 | Typical sampling parameter
tfz | float | 1.0 | Tail free z parameter
keep | int | 0 | Number of tokens to keep from the prompt
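
As a rough sketch of how these defaults might be set, the fragment below places a few of the fields above under the parameters section. The model name, file name, and values are illustrative assumptions, not recommendations.

```yaml
# Hypothetical model; all values are illustrative only.
name: my-model
parameters:
  model: my-model.gguf   # hypothetical file in the models directory
  temperature: 0.7       # lower than the 0.9 default for steadier output
  top_p: 0.9
  top_k: 40
  max_tokens: 512        # cap generation instead of the unlimited default (0)
  repeat_penalty: 1.1
  seed: 42               # fixed seed; omit for a random seed
```

Because these act as defaults, a request that supplies its own value for one of these parameters would be expected to take precedence for that call.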

Language and Translation

Field | Type | Description
--- | --- | ---
language | string | Language code for transcription/translation
translate | bool | Whether to translate audio transcription

Custom Parameters

Field | Type | Description
--- | --- | ---
batch | int | Batch size for processing
ignore_eos | bool | Ignore end-of-sequence tokens
negative_prompt | string | Negative prompt for image generation
rope_freq_base | float32 | RoPE frequency base
rope_freq_scale | float32 | RoPE frequency scale
negative_prompt_scale | float32 | Scale for negative prompt
tokenizer | string | Tokenizer to use (RWKV)

LLM Configuration

These settings apply to most LLM backends (llama.cpp, vLLM, etc.):

Performance Settings

Field | Type | Default | Description
--- | --- | --- | ---
threads | int | processor count | Number of threads for parallel computation
context_size | int | 512 | Maximum context size (number of tokens)
f16 | bool | false | Enable 16-bit floating point precision (GPU acceleration)
gpu_layers | int | 0 | Number of layers to offload to GPU (0 = CPU only)

Memory Management

Field | Type | Default | Description
--- | --- | --- | ---
mmap | bool | true | Use memory mapping for model loading (faster, less RAM)
mmlock | bool | false | Lock model in memory (prevents swapping)
low_vram | bool | false | Use minimal VRAM mode
no_kv_offloading | bool | false | Disable KV cache offloading

GPU Configuration

Field | Type | Description
--- | --- | ---
tensor_split | string | Comma-separated GPU memory allocation (e.g., "0.8,0.2" for 80%/20%)
main_gpu | string | Main GPU identifier for multi-GPU setups
cuda | bool | Explicitly enable/disable CUDA
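
To make the performance, memory, and GPU fields more concrete, here is a minimal sketch for a two-GPU machine. The model name, file name, layer count, and the "0" GPU identifier are assumptions chosen for illustration.

```yaml
# Hypothetical two-GPU setup; values are illustrative only.
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf   # hypothetical file name
context_size: 4096
threads: 8
f16: true
gpu_layers: 35           # offload 35 layers; 0 would keep everything on CPU
mmap: true
mmlock: false
tensor_split: "0.8,0.2"  # 80% on the first GPU, 20% on the second
main_gpu: "0"            # assumed identifier for the primary GPU
```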

Sampling and Generation

Field | Type | Default | Description
--- | --- | --- | ---
mirostat | int | 0 | Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0)
mirostat_tau | float | 5.0 | Mirostat target entropy
mirostat_eta | float | 0.1 | Mirostat learning rate
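
A short sketch of enabling Mirostat 2.0 with the documented default tau and eta values spelled out explicitly. It assumes these fields sit at the top level of the model configuration, like the other LLM settings in this section.

```yaml
# Enable Mirostat 2.0; tau and eta repeat the documented defaults.
mirostat: 2
mirostat_tau: 5.0   # target entropy
mirostat_eta: 0.1   # learning rate
```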

LoRA Configuration

Field | Type | Description
--- | --- | ---
lora_adapter | string | Path to LoRA adapter file
lora_base | string | Base model for LoRA
lora_scale | float32 | LoRA scale factor
lora_adapters | array | Multiple LoRA adapters
lora_scales | array | Scales for multiple LoRA adapters
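
The fields above might be combined as in the sketch below. The adapter file names are hypothetical, and the assumption that lora_scales entries pair up with lora_adapters entries by position is not confirmed by this page.

```yaml
# Single adapter (hypothetical path):
lora_adapter: adapters/my-adapter.gguf
lora_scale: 1.0

# Alternative with several adapters; scales are assumed to match by position:
# lora_adapters:
#   - adapters/style.gguf
#   - adapters/domain.gguf
# lora_scales:
#   - 1.0
#   - 0.5
```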

Advanced Options

Field | Type | Description
--- | --- | ---
no_mulmatq | bool | Disable quantized matrix multiplication kernels (mul_mat_q)
draft_model | string | Draft model GGUF file for speculative decoding (see Speculative Decoding)
n_draft | int32 | Maximum number of draft tokens per speculative step (default: 16)
quantization | string | Quantization format
load_format | string | Model load format
numa | bool | Enable NUMA (Non-Uniform Memory Access)
rms_norm_eps | float32 | RMS normalization epsilon
ngqa | int32 | Grouped-query attention (GQA) factor
rope_scaling | string | RoPE scaling configuration
type | string | Model type/architecture
grammar | string | Grammar file path for constrained generation

YARN Configuration

YARN (Yet Another RoPE extensioN) settings for context extension:

Field | Type | Description
--- | --- | ---
yarn_ext_factor | float32 | YARN extension factor
yarn_attn_factor | float32 | YARN attention factor
yarn_beta_fast | float32 | YARN beta fast parameter
yarn_beta_slow | float32 | YARN beta slow parameter

Speculative Decoding

Speculative decoding speeds up text generation by predicting multiple tokens ahead and verifying them in a single forward pass. The output is identical to normal decoding — only faster. This feature is only available with the llama-cpp backend.

There are two approaches:

Draft Model Speculative Decoding

Uses a smaller, faster model from the same model family to draft candidate tokens, which the main model then verifies. Requires a separate GGUF file for the draft model.

name: my-model
backend: llama-cpp
parameters:
  model: large-model.gguf
draft_model: small-draft-model.gguf
n_draft: 8
options:
  - spec_p_min:0.8
  - draft_gpu_layers:99

N-gram Self-Speculative Decoding

Uses patterns from the token history to predict future tokens — no extra model required. Works well for repetitive or structured output (code, JSON, lists).

name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf
options:
  - spec_type:ngram_simple
  - spec_n_max:16

Speculative Decoding Options

These are set via the options: array in the model configuration (format: key:value):

Common options

Option | Type | Default | Description
--- | --- | --- | ---
spec_type / speculative_type | string | none | Speculative decoding type, or comma-separated list to chain multiple (see table below)
spec_n_max / draft_max | int | 16 | Maximum number of tokens to draft per step
spec_n_min / draft_min | int | 0 | Minimum draft tokens required to use speculation
spec_p_min / draft_p_min | float | 0.75 | Minimum probability threshold for greedy acceptance
spec_p_split | float | 0.1 | Split probability for tree-based branching

Draft-model options (apply when spec_type=draft, i.e. a draft_model is configured)

Option | Type | Default | Description
--- | --- | --- | ---
draft_gpu_layers | int | -1 | GPU layers for the draft model (-1 = use default)
draft_threads / spec_draft_threads | int | same as main | Threads used by the draft model (<= 0 = hardware concurrency)
draft_threads_batch / spec_draft_threads_batch | int | same as draft_threads | Threads used by the draft model during batch / prompt processing
draft_cache_type_k / spec_draft_cache_type_k | string | f16 | KV cache K data type for the draft model (same values as cache_type_k)
draft_cache_type_v / spec_draft_cache_type_v | string | f16 | KV cache V data type for the draft model
draft_cpu_moe / spec_draft_cpu_moe | bool | false | Keep all MoE expert weights of the draft model on CPU
draft_n_cpu_moe / spec_draft_n_cpu_moe | int | 0 | Keep MoE expert weights of the first N draft-model layers on CPU
draft_override_tensor / spec_draft_override_tensor | string | "" | Comma-separated <tensor regex>=<buffer type> overrides for the draft model
draft_ctx_size | int | (ignored) | Deprecated upstream: the draft now shares the target context size. Accepted for backward compatibility but has no effect.

ngram_simple options (used when spec_type includes ngram_simple)

Option | Type | Default | Description
--- | --- | --- | ---
spec_ngram_size_n / ngram_size_n | int | 12 | N-gram lookup size
spec_ngram_size_m / ngram_size_m | int | 48 | M-gram proposal size
spec_ngram_min_hits / ngram_min_hits | int | 1 | Minimum hits for accepting n-gram proposals

ngram_mod options (used when spec_type includes ngram_mod)

Option | Type | Default | Description
--- | --- | --- | ---
spec_ngram_mod_n_min | int | 48 | Minimum number of ngram tokens to use
spec_ngram_mod_n_max | int | 64 | Maximum number of ngram tokens to use
spec_ngram_mod_n_match | int | 24 | Ngram lookup length

ngram_map_k options (used when spec_type includes ngram_map_k)

Option | Type | Default | Description
--- | --- | --- | ---
spec_ngram_map_k_size_n | int | 12 | N-gram lookup size
spec_ngram_map_k_size_m | int | 48 | M-gram proposal size
spec_ngram_map_k_min_hits | int | 1 | Minimum hits for accepting proposals

ngram_map_k4v options (used when spec_type includes ngram_map_k4v)

Option | Type | Default | Description
--- | --- | --- | ---
spec_ngram_map_k4v_size_n | int | 12 | N-gram lookup size
spec_ngram_map_k4v_size_m | int | 48 | M-gram proposal size
spec_ngram_map_k4v_min_hits | int | 1 | Minimum hits for accepting proposals

ngram_cache lookup files

Option | Type | Default | Description
--- | --- | --- | ---
spec_lookup_cache_static / lookup_cache_static | string | "" | Path to a static ngram lookup cache file
spec_lookup_cache_dynamic / lookup_cache_dynamic | string | "" | Path to a dynamic ngram lookup cache file (updated by generation)
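
As a hedged illustration, the lookup-cache options could be combined with the ngram-cache speculative type roughly as follows. The model and cache file names are hypothetical, and how relative paths are resolved is an assumption this page does not confirm.

```yaml
# Hypothetical ngram-cache setup; file names are placeholders.
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf
options:
  - spec_type:ngram-cache
  - spec_lookup_cache_static:caches/static-ngrams.bin    # read-only corpus statistics
  - spec_lookup_cache_dynamic:caches/dynamic-ngrams.bin  # updated by generation
```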

Speculative Type Values

The canonical names match upstream llama.cpp (dash-separated). For backward compatibility LocalAI also accepts the underscore-separated forms and the bare draft / eagle3 aliases.

Type | Aliases accepted | Description
--- | --- | ---
none | - | No speculative decoding (default)
draft-simple | draft, draft_simple | Draft model-based speculation (auto-set when draft_model is configured)
draft-eagle3 | eagle3, draft_eagle3 | EAGLE3 draft model architecture
draft-mtp | draft_mtp | Multi-Token Prediction. Reuses the target model’s embedded MTP head; no separate draft GGUF required (draft_model can be omitted).
ngram-simple | ngram_simple | Simple self-speculative using token history
ngram-map-k | ngram_map_k | N-gram with key-only map
ngram-map-k4v | ngram_map_k4v | N-gram with keys and 4 m-gram values
ngram-mod | ngram_mod | Modified n-gram speculation
ngram-cache | ngram_cache | 3-level n-gram cache

Multiple types can be chained by passing a comma-separated list to spec_type (e.g. spec_type:ngram-simple,ngram-mod). The runtime tries them in order and accepts the first proposal that meets the acceptance criteria.

Note

Speculative decoding is automatically disabled when multimodal models (with mmproj) are active. The n_draft parameter can also be overridden per-request.

Multi-Token Prediction (MTP)

draft-mtp enables Multi-Token Prediction (ggml-org/llama.cpp#22673). MTP uses a small prediction head trained into the target model: the head runs alongside the main forward pass and proposes the next few tokens, which the target then verifies in a single batched step. Upstream reports ~1.85x-2.1x token throughput at ~72-82% draft acceptance on Qwen3.6 27B / 35B A3B.

Auto-detection (default). When a GGUF declares an MTP head (the upstream <arch>.nextn_predict_layers metadata key, set by convert_hf_to_gguf.py for Qwen3.5/3.6 family models and similar), LocalAI auto-enables MTP with the following defaults:

options:
  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

Detection runs both at import time (the /import-model UI / POST /models/import-uri flow range-fetches the GGUF header and writes the options into the generated YAML before you save it) and at load time (every llama-cpp model start re-checks the local header and appends the options if spec_type isn’t already set). To opt out, set an explicit spec_type: / speculative_type: in your YAML - auto-detection always preserves the user value, including spec_type:none.

Two ways to load the MTP head:

  1. Embedded in the target GGUF (the recommended path for LocalAI, and what auto-detection assumes). When spec_type includes draft-mtp and draft_model is empty, the backend builds the MTP draft context directly from the target model’s weights. The GGUF must have been converted with the MTP tensors included.
  2. Separate mtp-*.gguf sibling file. If you point draft_model at the separate MTP-head GGUF that ships next to the main weights on HuggingFace, the backend will load it as a draft model. Note: upstream’s -hf auto-discovery of mtp-*.gguf siblings is not wired into LocalAI’s gRPC layer - you need to download the sibling file and configure draft_model explicitly.

Manual override knobs (overlap with the auto-detect defaults above):

Option | Recommended | Notes
--- | --- | ---
spec_type | draft-mtp | Activates MTP. Can be chained with other types (see below).
spec_n_max / draft_max | 2-6 | Number of draft tokens per step. Upstream’s PR suggests 2-3 for the tightest acceptance window; LocalAI’s auto-default is 6 to favour throughput on models with high acceptance.
spec_p_min | 0.75 | Pinned because upstream marks the current default with a “change to 0.0f” TODO; locking it here keeps acceptance thresholds stable across future llama.cpp bumps.
mmproj_use_gpu | false (or unset mmproj) | MTP has a prompt-processing overhead; if the model is non-vision, drop the mmproj entirely to save VRAM.

Minimal config (override-only, since auto-detection already covers this for MTP-capable GGUFs):

name: qwen3-mtp
backend: llama-cpp
parameters:
  model: qwen3-27b-with-mtp.gguf
options:
  - spec_type:draft-mtp
  - spec_n_max:3

With a separate MTP head file:

name: qwen3-mtp
backend: llama-cpp
parameters:
  model: qwen3-27b.gguf
draft_model: qwen3-27b-mtp-head.gguf
options:
  - spec_type:draft-mtp
  - spec_n_max:3

Chaining MTP with n-gram fallback (experimental, from the PR’s usage notes - useful when MTP acceptance drops on highly repetitive output):

options:
  - spec_type:draft-mtp,ngram-mod
  - spec_n_max:3
  - spec_ngram_mod_n_match:24

Pre-converted GGUFs with MTP heads are published on the ggml-org HuggingFace org (initially Qwen3.6 27B and Qwen3.6 35B A3B).

Reasoning Models (DeepSeek-R1, Qwen3, etc.)

These load-time options control how the backend parses <think> reasoning blocks and how much budget the model is allowed for thinking. They are set per model via the options: array.

Option | Type | Default | Description
--- | --- | --- | ---
reasoning_format | string | deepseek | Parser for reasoning/thinking blocks. One of none, auto, deepseek, deepseek-legacy (alias deepseek_legacy).
enable_reasoning / reasoning_budget | int | -1 | Reasoning budget in tokens: -1 unlimited, 0 disabled, >0 token cap for the thinking section.
prefill_assistant | bool | true | When false, the trailing assistant message is not pre-filled by the chat template.

Note

This is the load-time reasoning configuration. The orthogonal per-request enable_thinking chat-template kwarg (set via the YAML reasoning.disable field) toggles thinking on/off per call without restarting the model.
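
A minimal sketch of the load-time reasoning options described above, set through the options: array. The model name, file name, and the 1024-token budget are illustrative assumptions.

```yaml
# Hypothetical reasoning model; values are illustrative only.
name: my-reasoning-model
backend: llama-cpp
parameters:
  model: my-reasoning-model.gguf
options:
  - reasoning_format:deepseek   # documented default parser
  - reasoning_budget:1024       # cap the thinking section at 1024 tokens
  - prefill_assistant:true
```

Setting reasoning_budget:0 instead would disable thinking at load time, according to the table above.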

Multimodal Backend Options

Option | Type | Default | Description
--- | --- | --- | ---
mmproj_use_gpu / mmproj_offload | bool | true | Set false to keep the multimodal projector on CPU (saves VRAM at cost of speed).
image_min_tokens | int | -1 | Minimum vision tokens per image. -1 keeps the model default.
image_max_tokens | int | -1 | Maximum vision tokens per image. -1 keeps the model default.
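
For example, a VRAM-constrained setup might keep the projector on CPU and bound the number of vision tokens. This is only a sketch; the token limit is an arbitrary illustrative value.

```yaml
# Illustrative multimodal tuning for a VRAM-constrained machine.
options:
  - mmproj_use_gpu:false     # keep the multimodal projector on CPU
  - image_max_tokens:1024    # arbitrary cap; -1 keeps the model default
```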

Embedding & Reranking Backend Options

Option | Type | Default | Description
--- | --- | --- | ---
pooling_type / pooling | string | auto | Pooling strategy for embeddings: none, mean, cls, last, rank. Reranking automatically uses rank.
embd_normalize / embedding_normalize | int | 2 | Normalization: -1 none, 0 max-abs, 1 taxicab, 2 Euclidean (L2), >2 p-norm.
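
The sketch below shows how an embedding model might select mean pooling with L2 normalization. The model name and file name are hypothetical; known_usecases is used here only to mark the model as an embedding model.

```yaml
# Hypothetical embedding model; names are placeholders.
name: my-embedder
backend: llama-cpp
parameters:
  model: my-embedder.gguf
known_usecases:
  - embeddings
options:
  - pooling_type:mean
  - embd_normalize:2   # Euclidean (L2), the documented default
```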

Other Backend Tuning Options

These llama.cpp options are passed through the options: array.

Option | Type | Default | Description
--- | --- | --- | ---
n_ubatch / ubatch | int | same as batch | Physical batch size. Decouple from n_batch when an embedding/rerank workload needs a different value.
threads_batch / n_threads_batch | int | same as threads | Threads used during prompt processing. <= 0 means hardware_concurrency().
direct_io / use_direct_io | bool | false | Open the model with O_DIRECT (faster cold loads on NVMe; ignored if not supported).
verbosity | int | 3 | llama.cpp internal log verbosity threshold. Higher = more verbose.
override_tensor / tensor_buft_overrides | string | "" | Per-tensor buffer-type overrides for the main model. Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,.... Mirrors the existing draft_override_tensor syntax for the draft model.

Prompt Caching

Field | Type | Description
--- | --- | ---
prompt_cache_path | string | Path to store prompt cache (relative to models directory)
prompt_cache_all | bool | Cache all prompts automatically
prompt_cache_ro | bool | Read-only prompt cache

Text Processing

Field | Type | Description
--- | --- | ---
stopwords | array | Words or phrases that stop generation
cutstrings | array | Strings to cut from responses
trimspace | array | Strings to trim whitespace from
trimsuffix | array | Suffixes to trim from responses
extract_regex | array | Regular expressions to extract content
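
As an illustration of the text-processing fields, the fragment below stops generation at a new user turn and tidies the response. The specific strings are assumptions picked for a simple "User:/Assistant:" prompt format.

```yaml
# Illustrative post-processing for a "User:/Assistant:" style prompt.
stopwords:
  - "\n\nUser:"     # stop when the model starts a new user turn
cutstrings:
  - "Assistant:"    # remove a stray role label from the response
trimsuffix:
  - "\n"            # drop a trailing newline
```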

System Prompt

Field | Type | Description
--- | --- | ---
system_prompt | string | Default system prompt for the model

vLLM-Specific Configuration

These options apply when using the vllm backend, except cache_type_k and cache_type_v, which are listed here for the llama.cpp-family backends:

Field | Type | Description
--- | --- | ---
gpu_memory_utilization | float32 | GPU memory utilization (0.0-1.0, default 0.9)
trust_remote_code | bool | Trust and execute remote code
enforce_eager | bool | Force eager execution mode
swap_space | int | Swap space in GB
max_model_len | int | Maximum model length
tensor_parallel_size | int | Tensor parallelism size
disable_log_stats | bool | Disable logging statistics
dtype | string | Data type (e.g., float16, bfloat16)
flash_attention | string | Flash attention configuration
cache_type_k | string | Key cache quantization type. Maps to llama.cpp’s -ctk. Accepted values for llama.cpp-family backends (llama-cpp, ik-llama-cpp, turboquant): f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1. The turboquant backend additionally accepts turbo2, turbo3, turbo4 (the fork’s TurboQuant KV-cache schemes). turbo3/turbo4 auto-enable flash_attention.
cache_type_v | string | Value cache quantization type. Maps to llama.cpp’s -ctv. Same accepted values as cache_type_k. Note: any quantized V cache requires flash_attention to be enabled.
limit_mm_per_prompt | object | Limit multimodal content per prompt: {image: int, video: int, audio: int}
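
A sketch of a vLLM-backed model using several of the fields above. The model identifier is a hypothetical placeholder, and the numeric values are illustrative rather than tuned.

```yaml
# Hypothetical vLLM model; identifier and values are placeholders.
name: my-vllm-model
backend: vllm
parameters:
  model: my-org/my-model    # hypothetical model identifier
gpu_memory_utilization: 0.9 # documented default
max_model_len: 8192
tensor_parallel_size: 2     # split the model across two GPUs
dtype: bfloat16
trust_remote_code: false
limit_mm_per_prompt:
  image: 2                  # at most two images per prompt
```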

Template Configuration

Templates use Go templates with Sprig functions.

Field | Type | Description
--- | --- | ---
template.chat | string | Template for chat completion endpoint
template.chat_message | string | Template for individual chat messages
template.completion | string | Template for text completion
template.edit | string | Template for edit operations
template.function | string | Template for function/tool calls
template.multimodal | string | Template for multimodal interactions
template.reply_prefix | string | Prefix to add to model replies
template.use_tokenizer_template | bool | Use tokenizer’s built-in template (vLLM/transformers)
template.join_chat_messages_by_character | string | Character to join chat messages (default: \n)

Template Variables

The following variables are commonly available in templates:

  • {{.Input}} - User input
  • {{.Instruction}} - Instruction for edit operations
  • {{.System}} - System message
  • {{.Prompt}} - Full prompt
  • {{.Functions}} - Function definitions (for function calling)
  • {{.FunctionCall}} - Function call result

Example Template

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}{{end}}
    {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
    {{end}}
    Assistant:

Function Calling Configuration

Configure how the model handles function/tool calls:

Field | Type | Default | Description
--- | --- | --- | ---
function.disable_no_action | bool | false | Disable the no-action behavior
function.no_action_function_name | string | answer | Name of the no-action function
function.no_action_description_name | string | - | Description for no-action function
function.function_name_key | string | name | JSON key for function name
function.function_arguments_key | string | arguments | JSON key for function arguments
function.response_regex | array | - | Named regex patterns to extract function calls
function.argument_regex | array | - | Named regex to extract function arguments
function.argument_regex_key_name | string | key | Named regex capture for argument key
function.argument_regex_value_name | string | value | Named regex capture for argument value
function.json_regex_match | array | - | Regex patterns to match JSON in tool mode
function.replace_function_results | array | - | Replace function call results with patterns
function.replace_llm_results | array | - | Replace LLM results with patterns
function.capture_llm_results | array | - | Capture LLM results as text (e.g., for “thinking” blocks)

Grammar Configuration

Field | Type | Default | Description
--- | --- | --- | ---
function.grammar.disable | bool | false | Completely disable grammar enforcement
function.grammar.parallel_calls | bool | false | Allow parallel function calls
function.grammar.mixed_mode | bool | false | Allow mixed-mode grammar enforcing
function.grammar.no_mixed_free_string | bool | false | Disallow free strings in mixed mode
function.grammar.disable_parallel_new_lines | bool | false | Disable parallel processing for new lines
function.grammar.prefix | string | - | Prefix to add before grammar rules
function.grammar.expect_strings_after_json | bool | false | Expect strings after JSON data
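
The dotted field names in the two tables above map to nested YAML keys. A minimal sketch, with the documented defaults spelled out where they apply and illustrative choices for the grammar flags:

```yaml
# Function-calling settings; key names repeat the documented defaults.
function:
  disable_no_action: false
  no_action_function_name: answer
  function_name_key: name
  function_arguments_key: arguments
  grammar:
    parallel_calls: true   # allow several tool calls in one reply
    mixed_mode: true       # illustrative choice, off by default
```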

Diffusers Configuration

For image generation models using the diffusers backend:

Field | Type | Description
--- | --- | ---
diffusers.cuda | bool | Enable CUDA for diffusers
diffusers.pipeline_type | string | Pipeline type (e.g., stable-diffusion, stable-diffusion-xl)
diffusers.scheduler_type | string | Scheduler type (e.g., euler, ddpm)
diffusers.enable_parameters | string | Comma-separated parameters to enable
diffusers.cfg_scale | float32 | Classifier-free guidance scale
diffusers.img2img | bool | Enable image-to-image transformation
diffusers.clip_skip | int | Number of CLIP layers to skip
diffusers.clip_model | string | CLIP model to use
diffusers.clip_subfolder | string | CLIP model subfolder
diffusers.control_net | string | ControlNet model to use
step | int | Number of diffusion steps
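
A hedged example of an image-generation model built from these fields. The model name and model reference are hypothetical; pipeline_type and scheduler_type use the example values from the table, and the numeric values are illustrative.

```yaml
# Hypothetical diffusers model; names and values are placeholders.
name: my-image-model
backend: diffusers
parameters:
  model: my-sdxl-model   # hypothetical model reference
step: 25                 # number of diffusion steps (top-level field)
diffusers:
  cuda: true
  pipeline_type: stable-diffusion-xl
  scheduler_type: euler
  cfg_scale: 7.5
```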

TTS Configuration

For text-to-speech models:

Field | Type | Description
--- | --- | ---
tts.voice | string | Voice file path or voice ID
tts.audio_path | string | Path to audio files (for Vall-E)

Roles Configuration

Map conversation roles to specific strings:

roles:
  user: "### Instruction:"
  assistant: "### Response:"
  system: "### System Instruction:"

Feature Flags

Enable or disable experimental features:

feature_flags:
  feature_name: true
  another_feature: false

MCP Configuration

Model Context Protocol (MCP) configuration:

Field | Type | Description
--- | --- | ---
mcp.remote | string | YAML string defining remote MCP servers
mcp.stdio | string | YAML string defining STDIO MCP servers

Agent Configuration

Settings for autonomous agent behavior:

Field | Type | Description
--- | --- | ---
agent.max_attempts | int | Maximum number of attempts
agent.max_iterations | int | Maximum number of iterations
agent.enable_reasoning | bool | Enable reasoning capabilities
agent.enable_planning | bool | Enable planning capabilities
agent.enable_mcp_prompts | bool | Enable MCP prompts
agent.enable_plan_re_evaluator | bool | Enable plan re-evaluation
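
For instance, an agent block might look like the sketch below. The limits are arbitrary illustrative numbers, not documented defaults.

```yaml
# Illustrative agent limits; the numbers are arbitrary.
agent:
  max_attempts: 3
  max_iterations: 10
  enable_reasoning: true
  enable_planning: false
  enable_mcp_prompts: true
```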

Reasoning Configuration

Configure how reasoning tags are extracted and processed from model output. Reasoning tags are used by models like DeepSeek, Command-R, and others to include internal reasoning steps in their responses.

Field | Type | Default | Description
--- | --- | --- | ---
reasoning.disable | bool | false | When true, disables reasoning extraction entirely. The original content is returned without any processing.
reasoning.disable_reasoning_tag_prefill | bool | false | When true, disables automatic prepending of thinking start tokens. Use this when your model already includes reasoning tags in its output format.
reasoning.strip_reasoning_only | bool | false | When true, extracts and removes reasoning tags from content but discards the reasoning text. Useful when you want to clean reasoning tags from output without storing the reasoning content.
reasoning.thinking_start_tokens | array | [] | List of custom thinking start tokens to detect in prompts. Custom tokens are checked before default tokens.
reasoning.tag_pairs | array | [] | List of custom tag pairs for reasoning extraction. Each entry has start and end fields. Custom pairs are checked before default pairs.

Reasoning Tag Formats

The reasoning extraction supports multiple tag formats used by different models:

  • <thinking>...</thinking> - General thinking tag
  • <think>...</think> - DeepSeek, Granite, ExaOne, GLM models
  • <|START_THINKING|>...<|END_THINKING|> - Command-R models
  • <|inner_prefix|>...<|inner_suffix|> - Apertus models
  • <seed:think>...</seed:think> - Seed models
  • <|think|>...<|end|><|begin|>assistant<|content|> - Solar Open models
  • [THINK]...[/THINK] - Magistral models

Examples

Disable reasoning extraction:

reasoning:
  disable: true

Extract reasoning but don’t prepend tags:

reasoning:
  disable_reasoning_tag_prefill: true

Strip reasoning tags without storing reasoning content:

reasoning:
  strip_reasoning_only: true

Complete example with reasoning configuration:

name: deepseek-model
backend: llama-cpp
parameters:
  model: deepseek.gguf

reasoning:
  disable: false
  disable_reasoning_tag_prefill: false
  strip_reasoning_only: false

Example with custom tokens and tag pairs:

name: custom-reasoning-model
backend: llama-cpp
parameters:
  model: custom.gguf

reasoning:
  thinking_start_tokens:
    - "<custom:think>"
    - "<my:reasoning>"
  tag_pairs:
    - start: "<custom:think>"
      end: "</custom:think>"
    - start: "<my:reasoning>"
      end: "</my:reasoning>"

Note: Custom tokens and tag pairs are checked before the default ones, giving them priority. This allows you to override default behavior or add support for new reasoning tag formats.

Per-Request Override via Metadata

The reasoning.disable setting from model configuration can be overridden on a per-request basis using the metadata field in the OpenAI chat completion request. This allows you to enable or disable thinking for individual requests without changing the model configuration.

The metadata field accepts a map[string]string that is forwarded to the backend. The enable_thinking key controls thinking behavior:

# Enable thinking for a single request (overrides model config)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "metadata": {"enable_thinking": "true"}
  }'

# Disable thinking for a single request (overrides model config)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Hello"}],
    "metadata": {"enable_thinking": "false"}
  }'

Priority order:

  1. Request-level metadata.enable_thinking (highest priority)
  2. Model config reasoning.disable (fallback)
  3. Auto-detected from model template (default)
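
To illustrate the priority order, the sketch below sets a model-level default that turns thinking off. The model file name is a hypothetical placeholder.

```yaml
# Model-level default: thinking is off unless a request overrides it.
name: qwen3
backend: llama-cpp
parameters:
  model: qwen3.gguf   # hypothetical file name
reasoning:
  disable: true       # priority 2: used when the request sends no metadata
```

With this configuration, a request carrying "metadata": {"enable_thinking": "true"} would still enable thinking for that call, since request-level metadata ranks above the model config.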

Pipeline Configuration

Define pipelines for audio-to-audio processing and the Realtime API:

Field | Type | Description
--- | --- | ---
pipeline.tts | string | TTS model name
pipeline.llm | string | LLM model name
pipeline.transcription | string | Transcription model name
pipeline.vad | string | Voice activity detection model name
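
A pipeline definition might then reference four separately configured models by name, as in this sketch. All model names here are hypothetical and would need to match the name field of existing model configurations.

```yaml
# Hypothetical audio-to-audio pipeline; every model name is a placeholder.
name: my-realtime-pipeline
pipeline:
  vad: my-vad-model
  transcription: my-transcription-model
  llm: my-llm-model
  tts: my-tts-model
```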

gRPC Configuration

Backend gRPC communication settings:

Field | Type | Description
--- | --- | ---
grpc.attempts | int | Number of retry attempts
grpc.attempts_sleep_time | int | Sleep time between retries (seconds)
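
A brief sketch of the retry settings; the values are illustrative, for example for a backend that is slow to start.

```yaml
# Illustrative retry policy for a slow-starting backend.
grpc:
  attempts: 5
  attempts_sleep_time: 2   # seconds between retries
```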

Overrides

Override model configuration values at runtime (llama.cpp):

overrides:
  - "qwen3moe.expert_used_count=int:10"
  - "some_key=string:value"

Format: KEY=TYPE:VALUE where TYPE is int, float, string, or bool.

Known Use Cases

Specify which endpoints this model supports:

known_usecases:
  - chat
  - completion
  - embeddings

Available flags: chat, completion, edit, embeddings, rerank, image, transcript, tts, sound_generation, tokenize, vad, video, detection, llm (a combination of chat, completion, and edit).

Complete Example

Here’s a comprehensive example combining many options:

name: my-llm-model
description: A high-performance LLM model
backend: llama-cpp

parameters:
  model: my-model.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 2048

context_size: 4096
threads: 8
f16: true
gpu_layers: 35

system_prompt: "You are a helpful AI assistant."

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:

roles:
  user: "User:"
  assistant: "Assistant:"
  system: "System:"

stopwords:
  - "\n\nUser:"
  - "\n\nHuman:"

prompt_cache_path: "cache/my-model"
prompt_cache_all: true

function:
  grammar:
    parallel_calls: true
    mixed_mode: false

feature_flags:
  experimental_feature: true

GPU Auto-Fit Mode

Note: By default, LocalAI sets gpu_layers to a very large value (9999999), which effectively disables llama-cpp’s auto-fit functionality. This is intentional to work with LocalAI’s VRAM-based model unloading mechanism.

To enable llama-cpp’s auto-fit mode, set gpu_layers: -1 in your model configuration. However, be aware of the following:

  1. Trade-off: Enabling auto-fit conflicts with LocalAI’s built-in VRAM threshold-based unloading. Auto-fit attempts to fit all tensors into GPU memory automatically, while LocalAI’s unloading mechanism removes models when VRAM usage exceeds thresholds.

  2. Known Issues: Setting gpu_layers: -1 may trigger tensor_buft_override buffer errors in some configurations, particularly when the model exceeds available GPU memory.

  3. Recommendation:

    • Use the default settings for most use cases (LocalAI manages VRAM automatically)
    • Only enable gpu_layers: -1 if you understand the implications and have tested on your specific hardware
    • Monitor VRAM usage carefully when using auto-fit mode

This is a known limitation being tracked in issue #8562. A future implementation may provide a runtime toggle or custom logic to reconcile auto-fit with threshold-based unloading.
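
For reference, opting in to auto-fit as described above would look roughly like this. The model name and file name are hypothetical, and the caveats listed in this section still apply.

```yaml
# Opt in to llama-cpp auto-fit; test on your own hardware first.
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf   # hypothetical file name
gpu_layers: -1           # -1 enables auto-fit instead of LocalAI's default
```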