Model Configuration
LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.
Overview
Model configuration files allow you to:
- Define default parameters (temperature, top_p, etc.)
- Configure prompt templates
- Specify backend settings
- Set up function calling
- Configure GPU and memory options
- And much more
Configuration File Locations
You can create model configuration files in several ways:
- Individual YAML files in the models directory (e.g., models/gpt-3.5-turbo.yaml)
- A single config file with multiple models, using --models-config-file or the LOCALAI_MODELS_CONFIG_FILE environment variable
- Remote URLs - specify a URL to a YAML configuration file at startup
Example: Basic Configuration
```yaml
name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3
context_size: 512
threads: 10
backend: llama-stable
template:
  completion: completion
  chat: chat
```
Example: Multiple Models in One File
When using --models-config-file, you can define multiple models as a list:
```yaml
- name: model1
  parameters:
    model: model1.bin
  context_size: 512
  backend: llama-stable
- name: model2
  parameters:
    model: model2.bin
  context_size: 1024
  backend: llama-stable
```
Core Configuration Fields
Basic Model Settings
| Field | Type | Description | Example |
|---|---|---|---|
name | string | Model name, used to identify the model in API calls | gpt-3.5-turbo |
backend | string | Backend to use (e.g. llama-cpp, vllm, diffusers, whisper) | llama-cpp |
description | string | Human-readable description of the model | A conversational AI model |
usage | string | Usage instructions or notes | Best for general conversation |
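For example, these fields sit at the top level of a model configuration file (the description and usage text here are illustrative):

```yaml
name: gpt-3.5-turbo
backend: llama-cpp
description: A conversational AI model
usage: Best for general conversation
```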
Model File and Downloads
| Field | Type | Description |
|---|---|---|
parameters.model | string | Path to the model file (relative to models directory) or URL |
download_files | array | List of files to download. Each entry has filename, uri, and optional sha256 |
Example:
```yaml
parameters:
  model: my-model.gguf
download_files:
  - filename: my-model.gguf
    uri: https://example.com/model.gguf
    sha256: abc123...
```
Parameters Section
The parameters section contains all OpenAI-compatible request parameters and model-specific options.
OpenAI-Compatible Parameters
These settings will be used as defaults for all the API calls to the model.
| Field | Type | Default | Description |
|---|---|---|---|
temperature | float | 0.9 | Sampling temperature (0.0-2.0). Higher values make output more random |
top_p | float | 0.95 | Nucleus sampling: consider tokens with top_p probability mass |
top_k | int | 40 | Consider only the top K most likely tokens |
max_tokens | int | 0 | Maximum number of tokens to generate (0 = unlimited) |
frequency_penalty | float | 0.0 | Penalty for token frequency (-2.0 to 2.0) |
presence_penalty | float | 0.0 | Penalty for token presence (-2.0 to 2.0) |
repeat_penalty | float | 1.1 | Penalty for repeating tokens |
repeat_last_n | int | 64 | Number of previous tokens to consider for repeat penalty |
seed | int | -1 | Random seed (omit for random) |
echo | bool | false | Echo back the prompt in the response |
n | int | 1 | Number of completions to generate |
logprobs | bool/int | false | Return log probabilities of tokens |
top_logprobs | int | 0 | Number of top logprobs to return per token (0-20) |
logit_bias | map | {} | Map of token IDs to bias values (-100 to 100) |
typical_p | float | 1.0 | Typical sampling parameter |
tfz | float | 1.0 | Tail free z parameter |
keep | int | 0 | Number of tokens to keep from the prompt |
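As a minimal sketch, a few of these defaults might be pinned in the parameters section like this (the values are illustrative; they only act as defaults for API calls):

```yaml
parameters:
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 512
  seed: 42
```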
Language and Translation
| Field | Type | Description |
|---|---|---|
language | string | Language code for transcription/translation |
translate | bool | Whether to translate audio transcription |
Custom Parameters
| Field | Type | Description |
|---|---|---|
batch | int | Batch size for processing |
ignore_eos | bool | Ignore end-of-sequence tokens |
negative_prompt | string | Negative prompt for image generation |
rope_freq_base | float32 | RoPE frequency base |
rope_freq_scale | float32 | RoPE frequency scale |
negative_prompt_scale | float32 | Scale for negative prompt |
tokenizer | string | Tokenizer to use (RWKV) |
LLM Configuration
These settings apply to most LLM backends (llama.cpp, vLLM, etc.):
Performance Settings
| Field | Type | Default | Description |
|---|---|---|---|
threads | int | processor count | Number of threads for parallel computation |
context_size | int | 512 | Maximum context size (number of tokens) |
f16 | bool | false | Enable 16-bit floating point precision (GPU acceleration) |
gpu_layers | int | 0 | Number of layers to offload to GPU (0 = CPU only) |
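For example, a hypothetical GPU-accelerated setup might combine these top-level fields as follows (the layer count depends on the model and available VRAM):

```yaml
context_size: 4096
threads: 8
f16: true
gpu_layers: 35
```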
Memory Management
| Field | Type | Default | Description |
|---|---|---|---|
mmap | bool | true | Use memory mapping for model loading (faster, less RAM) |
mmlock | bool | false | Lock model in memory (prevents swapping) |
low_vram | bool | false | Use minimal VRAM mode |
no_kv_offloading | bool | false | Disable KV cache offloading |
GPU Configuration
| Field | Type | Description |
|---|---|---|
tensor_split | string | Comma-separated GPU memory allocation (e.g., "0.8,0.2" for 80%/20%) |
main_gpu | string | Main GPU identifier for multi-GPU setups |
cuda | bool | Explicitly enable/disable CUDA |
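A sketch of a two-GPU setup combining the memory and GPU options above (the split ratio and GPU identifier are placeholders):

```yaml
mmap: true
mmlock: false
low_vram: false
cuda: true
gpu_layers: 35
tensor_split: "0.8,0.2"
main_gpu: "0"
```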
Sampling and Generation
| Field | Type | Default | Description |
|---|---|---|---|
mirostat | int | 0 | Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0) |
mirostat_tau | float | 5.0 | Mirostat target entropy |
mirostat_eta | float | 0.1 | Mirostat learning rate |
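For example, to enable Mirostat 2.0 with the default tuning values from the table above:

```yaml
mirostat: 2
mirostat_tau: 5.0
mirostat_eta: 0.1
```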
LoRA Configuration
| Field | Type | Description |
|---|---|---|
lora_adapter | string | Path to LoRA adapter file |
lora_base | string | Base model for LoRA |
lora_scale | float32 | LoRA scale factor |
lora_adapters | array | Multiple LoRA adapters |
lora_scales | array | Scales for multiple LoRA adapters |
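A sketch attaching two adapters with per-adapter scales (the file names and scale values are placeholders):

```yaml
lora_adapters:
  - adapters/style.bin
  - adapters/domain.bin
lora_scales:
  - 0.8
  - 0.5
```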
Advanced Options
| Field | Type | Description |
|---|---|---|
no_mulmatq | bool | Disable the mul_mat_q (quantized matrix multiplication) kernels |
draft_model | string | Draft model for speculative decoding |
n_draft | int32 | Number of draft tokens |
quantization | string | Quantization format |
load_format | string | Model load format |
numa | bool | Enable NUMA (Non-Uniform Memory Access) |
rms_norm_eps | float32 | RMS normalization epsilon |
ngqa | int32 | Grouped-query attention (GQA) factor |
rope_scaling | string | RoPE scaling configuration |
type | string | Model type/architecture |
grammar | string | Grammar file path for constrained generation |
YaRN Configuration
YaRN (Yet another RoPE extensioN) settings for context extension:
| Field | Type | Description |
|---|---|---|
yarn_ext_factor | float32 | YARN extension factor |
yarn_attn_factor | float32 | YARN attention factor |
yarn_beta_fast | float32 | YARN beta fast parameter |
yarn_beta_slow | float32 | YARN beta slow parameter |
Prompt Caching
| Field | Type | Description |
|---|---|---|
prompt_cache_path | string | Path to store prompt cache (relative to models directory) |
prompt_cache_all | bool | Cache all prompts automatically |
prompt_cache_ro | bool | Read-only prompt cache |
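For example, to persist the prompt cache under the models directory (the path is illustrative):

```yaml
prompt_cache_path: "cache/my-model"
prompt_cache_all: true
prompt_cache_ro: false
```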
Text Processing
| Field | Type | Description |
|---|---|---|
stopwords | array | Words or phrases that stop generation |
cutstrings | array | Strings to cut from responses |
trimspace | array | Strings to trim whitespace from |
trimsuffix | array | Suffixes to trim from responses |
extract_regex | array | Regular expressions to extract content |
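A sketch combining these post-processing lists (the exact strings depend on the model's prompt format; these are placeholders):

```yaml
stopwords:
  - "\n\nUser:"
  - "<|im_end|>"
cutstrings:
  - "<|im_start|>assistant"
trimsuffix:
  - "\n"
```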
System Prompt
| Field | Type | Description |
|---|---|---|
system_prompt | string | Default system prompt for the model |
vLLM-Specific Configuration
These options apply when using the vllm backend:
| Field | Type | Description |
|---|---|---|
gpu_memory_utilization | float32 | GPU memory utilization (0.0-1.0, default 0.9) |
trust_remote_code | bool | Trust and execute remote code |
enforce_eager | bool | Force eager execution mode |
swap_space | int | Swap space in GB |
max_model_len | int | Maximum model length |
tensor_parallel_size | int | Tensor parallelism size |
disable_log_stats | bool | Disable logging statistics |
dtype | string | Data type (e.g., float16, bfloat16) |
flash_attention | string | Flash attention configuration |
cache_type_k | string | Key cache type |
cache_type_v | string | Value cache type |
limit_mm_per_prompt | object | Limit multimodal content per prompt: {image: int, video: int, audio: int} |
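A hypothetical vLLM model definition using several of these fields (the model reference and values are illustrative):

```yaml
name: my-vllm-model
backend: vllm
parameters:
  model: org/model-name
gpu_memory_utilization: 0.9
max_model_len: 4096
tensor_parallel_size: 2
dtype: bfloat16
limit_mm_per_prompt:
  image: 1
  video: 0
  audio: 0
```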
Template Configuration
Templates use Go templates with Sprig functions.
| Field | Type | Description |
|---|---|---|
template.chat | string | Template for chat completion endpoint |
template.chat_message | string | Template for individual chat messages |
template.completion | string | Template for text completion |
template.edit | string | Template for edit operations |
template.function | string | Template for function/tool calls |
template.multimodal | string | Template for multimodal interactions |
template.reply_prefix | string | Prefix to add to model replies |
template.use_tokenizer_template | bool | Use tokenizer’s built-in template (vLLM/transformers) |
template.join_chat_messages_by_character | string | Character to join chat messages (default: \n) |
Template Variables
Templating supports Sprig functions. The following variables are commonly available in templates:
- {{.Input}} - User input
- {{.Instruction}} - Instruction for edit operations
- {{.System}} - System message
- {{.Prompt}} - Full prompt
- {{.Functions}} - Function definitions (for function calling)
- {{.FunctionCall}} - Function call result
Example Template
```yaml
template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}{{end}}
    {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
    {{end}}
    Assistant:
```
Function Calling Configuration
Configure how the model handles function/tool calls:
| Field | Type | Default | Description |
|---|---|---|---|
function.disable_no_action | bool | false | Disable the no-action behavior |
function.no_action_function_name | string | answer | Name of the no-action function |
function.no_action_description_name | string | | Description for the no-action function
function.function_name_key | string | name | JSON key for function name |
function.function_arguments_key | string | arguments | JSON key for function arguments |
function.response_regex | array | | Named regex patterns to extract function calls
function.argument_regex | array | | Named regex to extract function arguments
function.argument_regex_key_name | string | key | Named regex capture for argument key |
function.argument_regex_value_name | string | value | Named regex capture for argument value |
function.json_regex_match | array | | Regex patterns to match JSON in tool mode
function.replace_function_results | array | | Replace function call results with patterns
function.replace_llm_results | array | | Replace LLM results with patterns
function.capture_llm_results | array | | Capture LLM results as text (e.g., for “thinking” blocks)
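A sketch of a function-calling section that extracts calls with a named regex; the pattern and its capture-group names are assumptions here, not prescribed values:

```yaml
function:
  disable_no_action: false
  no_action_function_name: answer
  function_name_key: name
  function_arguments_key: arguments
  response_regex:
    # assumed convention: named groups matching the configured keys
    - '(?P<name>\w+)\((?P<arguments>.*)\)'
```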
Grammar Configuration
| Field | Type | Default | Description |
|---|---|---|---|
function.grammar.disable | bool | false | Completely disable grammar enforcement |
function.grammar.parallel_calls | bool | false | Allow parallel function calls |
function.grammar.mixed_mode | bool | false | Allow mixed-mode grammar enforcement |
function.grammar.no_mixed_free_string | bool | false | Disallow free strings in mixed mode |
function.grammar.disable_parallel_new_lines | bool | false | Disable parallel processing for new lines |
function.grammar.prefix | string | | Prefix to add before grammar rules
function.grammar.expect_strings_after_json | bool | false | Expect strings after JSON data |
Diffusers Configuration
For image generation models using the diffusers backend:
| Field | Type | Description |
|---|---|---|
diffusers.cuda | bool | Enable CUDA for diffusers |
diffusers.pipeline_type | string | Pipeline type (e.g., stable-diffusion, stable-diffusion-xl) |
diffusers.scheduler_type | string | Scheduler type (e.g., euler, ddpm) |
diffusers.enable_parameters | string | Comma-separated parameters to enable |
diffusers.cfg_scale | float32 | Classifier-free guidance scale |
diffusers.img2img | bool | Enable image-to-image transformation |
diffusers.clip_skip | int | Number of CLIP layers to skip |
diffusers.clip_model | string | CLIP model to use |
diffusers.clip_subfolder | string | CLIP model subfolder |
diffusers.control_net | string | ControlNet model to use |
step | int | Number of diffusion steps |
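An illustrative diffusers setup; the model reference and values are placeholders, and the pipeline and scheduler names follow the examples in the table above:

```yaml
name: my-image-model
backend: diffusers
parameters:
  model: org/my-diffusion-model
diffusers:
  cuda: true
  pipeline_type: stable-diffusion-xl
  scheduler_type: euler
  cfg_scale: 7.5
step: 25
```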
TTS Configuration
For text-to-speech models:
| Field | Type | Description |
|---|---|---|
tts.voice | string | Voice file path or voice ID |
tts.audio_path | string | Path to audio files (for Vall-E) |
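A minimal TTS sketch (the backend name and voice file are placeholders; use the backend that matches your model):

```yaml
name: my-tts-model
backend: piper   # placeholder TTS backend
parameters:
  model: en-us-voice.onnx
tts:
  voice: en-us-voice.onnx
```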
Roles Configuration
Map conversation roles to specific strings:
```yaml
roles:
  user: "### Instruction:"
  assistant: "### Response:"
  system: "### System Instruction:"
```
Feature Flags
Enable or disable experimental features:
```yaml
feature_flags:
  feature_name: true
  another_feature: false
```
MCP Configuration
Model Context Protocol (MCP) configuration:
| Field | Type | Description |
|---|---|---|
mcp.remote | string | YAML string defining remote MCP servers |
mcp.stdio | string | YAML string defining STDIO MCP servers |
Agent Configuration
Agent/autonomous agent configuration:
| Field | Type | Description |
|---|---|---|
agent.max_attempts | int | Maximum number of attempts |
agent.max_iterations | int | Maximum number of iterations |
agent.enable_reasoning | bool | Enable reasoning capabilities |
agent.enable_planning | bool | Enable planning capabilities |
agent.enable_mcp_prompts | bool | Enable MCP prompts |
agent.enable_plan_re_evaluator | bool | Enable plan re-evaluation |
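For example (values are illustrative):

```yaml
agent:
  max_attempts: 3
  max_iterations: 10
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: false
  enable_plan_re_evaluator: false
```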
Pipeline Configuration
Define pipelines for audio-to-audio processing:
| Field | Type | Description |
|---|---|---|
pipeline.tts | string | TTS model name |
pipeline.llm | string | LLM model name |
pipeline.transcription | string | Transcription model name |
pipeline.vad | string | Voice activity detection model name |
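A sketch wiring the stages of an audio-to-audio pipeline together; each entry must refer to another model configuration, and the names here are placeholders:

```yaml
pipeline:
  vad: my-vad-model
  transcription: my-whisper-model
  llm: my-llm-model
  tts: my-tts-model
```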
gRPC Configuration
Backend gRPC communication settings:
| Field | Type | Description |
|---|---|---|
grpc.attempts | int | Number of retry attempts |
grpc.attempts_sleep_time | int | Sleep time between retries (seconds) |
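For example, to retry a failing backend connection a few times before giving up (values are illustrative):

```yaml
grpc:
  attempts: 3
  attempts_sleep_time: 2
```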
Overrides
Override model configuration values at runtime (llama.cpp):
```yaml
overrides:
  - "qwen3moe.expert_used_count=int:10"
  - "some_key=string:value"
```
Format: KEY=TYPE:VALUE where TYPE is int, float, string, or bool.
Known Use Cases
Specify which endpoints this model supports:
```yaml
known_usecases:
  - chat
  - completion
  - embeddings
```
Available flags: chat, completion, edit, embeddings, rerank, image, transcript, tts, sound_generation, tokenize, vad, video, detection, llm (a combination of chat, completion, and edit).
Complete Example
Here’s a comprehensive example combining many options:
```yaml
name: my-llm-model
description: A high-performance LLM model
backend: llama-stable

parameters:
  model: my-model.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 2048

context_size: 4096
threads: 8
f16: true
gpu_layers: 35

system_prompt: "You are a helpful AI assistant."

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:

roles:
  user: "User:"
  assistant: "Assistant:"
  system: "System:"

stopwords:
  - "\n\nUser:"
  - "\n\nHuman:"

prompt_cache_path: "cache/my-model"
prompt_cache_all: true

function:
  grammar:
    parallel_calls: true
    mixed_mode: false

feature_flags:
  experimental_feature: true
```
Related Documentation
- See Advanced Usage for other configuration options
- See Prompt Templates for template examples
- See CLI Reference for command-line options