LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.

Overview

Model configuration files allow you to:

  • Define default parameters (temperature, top_p, etc.)
  • Configure prompt templates
  • Specify backend settings
  • Set up function calling
  • Configure GPU and memory options
  • And much more

Configuration File Locations

You can create model configuration files in several ways:

  1. Individual YAML files in the models directory (e.g., models/gpt-3.5-turbo.yaml)
  2. Single config file with multiple models using --models-config-file or LOCALAI_MODELS_CONFIG_FILE
  3. Remote URLs - specify a URL to a YAML configuration file at startup

Example: Basic Configuration

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-stable

template:
  completion: completion
  chat: chat
  

Example: Multiple Models in One File

When using --models-config-file, you can define multiple models as a list:

- name: model1
  parameters:
    model: model1.bin
  context_size: 512
  backend: llama-stable

- name: model2
  parameters:
    model: model2.bin
  context_size: 1024
  backend: llama-stable
  

Core Configuration Fields

Basic Model Settings

Field | Type | Description | Example
name | string | Model name, used to identify the model in API calls | gpt-3.5-turbo
backend | string | Backend to use (e.g. llama-cpp, vllm, diffusers, whisper) | llama-cpp
description | string | Human-readable description of the model | A conversational AI model
usage | string | Usage instructions or notes | Best for general conversation

Model File and Downloads

Field | Type | Description
parameters.model | string | Path to the model file (relative to the models directory) or URL
download_files | array | List of files to download. Each entry has filename, uri, and optional sha256

Example:

parameters:
  model: my-model.gguf

download_files:
  - filename: my-model.gguf
    uri: https://example.com/model.gguf
    sha256: abc123...
  

Parameters Section

The parameters section contains all OpenAI-compatible request parameters and model-specific options.

OpenAI-Compatible Parameters

These settings are used as defaults for all API calls to the model.

Field | Type | Default | Description
temperature | float | 0.9 | Sampling temperature (0.0-2.0). Higher values make output more random
top_p | float | 0.95 | Nucleus sampling: consider tokens with top_p probability mass
top_k | int | 40 | Consider only the top K most likely tokens
max_tokens | int | 0 | Maximum number of tokens to generate (0 = unlimited)
frequency_penalty | float | 0.0 | Penalty for token frequency (-2.0 to 2.0)
presence_penalty | float | 0.0 | Penalty for token presence (-2.0 to 2.0)
repeat_penalty | float | 1.1 | Penalty for repeating tokens
repeat_last_n | int | 64 | Number of previous tokens to consider for repeat penalty
seed | int | -1 | Random seed (omit for random)
echo | bool | false | Echo back the prompt in the response
n | int | 1 | Number of completions to generate
logprobs | bool/int | false | Return log probabilities of tokens
top_logprobs | int | 0 | Number of top logprobs to return per token (0-20)
logit_bias | map | {} | Map of token IDs to bias values (-100 to 100)
typical_p | float | 1.0 | Typical sampling parameter
tfz | float | 1.0 | Tail-free sampling (z) parameter
keep | int | 0 | Number of tokens to keep from the prompt
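
These fields live under the parameters section of a model config. A minimal sketch overriding a few of the defaults above (the values are illustrative):

parameters:
  temperature: 0.7
  top_p: 0.9
  max_tokens: 1024
  seed: 42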

Language and Translation

Field | Type | Description
language | string | Language code for transcription/translation
translate | bool | Whether to translate audio transcription

Custom Parameters

Field | Type | Description
batch | int | Batch size for processing
ignore_eos | bool | Ignore end-of-sequence tokens
negative_prompt | string | Negative prompt for image generation
rope_freq_base | float32 | RoPE frequency base
rope_freq_scale | float32 | RoPE frequency scale
negative_prompt_scale | float32 | Scale for negative prompt
tokenizer | string | Tokenizer to use (RWKV)

LLM Configuration

These settings apply to most LLM backends (llama.cpp, vLLM, etc.):

Performance Settings

Field | Type | Default | Description
threads | int | processor count | Number of threads for parallel computation
context_size | int | 512 | Maximum context size (number of tokens)
f16 | bool | false | Enable 16-bit floating point precision (GPU acceleration)
gpu_layers | int | 0 | Number of layers to offload to GPU (0 = CPU only)
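
For example, a setup with a larger context and partial GPU offload might look like this (values are illustrative and depend on your hardware):

context_size: 4096
threads: 8
f16: true
gpu_layers: 20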

Memory Management

Field | Type | Default | Description
mmap | bool | true | Use memory mapping for model loading (faster, less RAM)
mmlock | bool | false | Lock model in memory (prevents swapping)
low_vram | bool | false | Use minimal VRAM mode
no_kv_offloading | bool | false | Disable KV cache offloading

GPU Configuration

Field | Type | Description
tensor_split | string | Comma-separated GPU memory allocation (e.g., "0.8,0.2" for 80%/20%)
main_gpu | string | Main GPU identifier for multi-GPU setups
cuda | bool | Explicitly enable/disable CUDA

Sampling and Generation

Field | Type | Default | Description
mirostat | int | 0 | Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
mirostat_tau | float | 5.0 | Mirostat target entropy
mirostat_eta | float | 0.1 | Mirostat learning rate
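
The memory, GPU, and sampling fields above are all top-level settings and can be combined in one config. A sketch with illustrative values for a two-GPU machine:

mmap: true
low_vram: false
tensor_split: "0.8,0.2"
main_gpu: "0"
mirostat: 2
mirostat_tau: 5.0
mirostat_eta: 0.1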

LoRA Configuration

Field | Type | Description
lora_adapter | string | Path to LoRA adapter file
lora_base | string | Base model for LoRA
lora_scale | float32 | LoRA scale factor
lora_adapters | array | Multiple LoRA adapters
lora_scales | array | Scales for multiple LoRA adapters
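
A sketch of a single-adapter setup, assuming the adapter file sits in the models directory (file names and scales are placeholders):

lora_adapter: my-adapter.bin
lora_base: base-model.gguf
lora_scale: 1.0

# Or several adapters with individual scales:
lora_adapters:
  - adapter-a.bin
  - adapter-b.bin
lora_scales:
  - 1.0
  - 0.5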

Advanced Options

Field | Type | Description
no_mulmatq | bool | Disable matrix multiplication queuing
draft_model | string | Draft model for speculative decoding
n_draft | int32 | Number of draft tokens
quantization | string | Quantization format
load_format | string | Model load format
numa | bool | Enable NUMA (Non-Uniform Memory Access)
rms_norm_eps | float32 | RMS normalization epsilon
ngqa | int32 | Grouped-query attention (GQA) factor
rope_scaling | string | RoPE scaling configuration
type | string | Model type/architecture
grammar | string | Grammar file path for constrained generation
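
For instance, the speculative-decoding fields above can be sketched like this (the model file names are placeholders):

parameters:
  model: my-large-model.gguf

draft_model: my-small-draft-model.gguf
n_draft: 16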

YARN Configuration

YaRN (Yet another RoPE extensioN) settings for context extension:

Field | Type | Description
yarn_ext_factor | float32 | YaRN extension factor
yarn_attn_factor | float32 | YaRN attention factor
yarn_beta_fast | float32 | YaRN beta fast parameter
yarn_beta_slow | float32 | YaRN beta slow parameter

Prompt Caching

Field | Type | Description
prompt_cache_path | string | Path to store the prompt cache (relative to the models directory)
prompt_cache_all | bool | Cache all prompts automatically
prompt_cache_ro | bool | Read-only prompt cache

Text Processing

Field | Type | Description
stopwords | array | Words or phrases that stop generation
cutstrings | array | Strings to cut from responses
trimspace | array | Strings to trim whitespace from
trimsuffix | array | Suffixes to trim from responses
extract_regex | array | Regular expressions to extract content
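
For example, to stop generation at a role marker and strip unwanted fragments from the output (the patterns are illustrative):

stopwords:
  - "\n\nUser:"
  - "<|endoftext|>"
cutstrings:
  - "As an AI language model,"
trimsuffix:
  - "\n"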

System Prompt

Field | Type | Description
system_prompt | string | Default system prompt for the model

vLLM-Specific Configuration

These options apply when using the vllm backend:

Field | Type | Description
gpu_memory_utilization | float32 | GPU memory utilization (0.0-1.0, default 0.9)
trust_remote_code | bool | Trust and execute remote code
enforce_eager | bool | Force eager execution mode
swap_space | int | Swap space in GB
max_model_len | int | Maximum model length
tensor_parallel_size | int | Tensor parallelism size
disable_log_stats | bool | Disable logging statistics
dtype | string | Data type (e.g., float16, bfloat16)
flash_attention | string | Flash attention configuration
cache_type_k | string | Key cache type
cache_type_v | string | Value cache type
limit_mm_per_prompt | object | Limit multimodal content per prompt: {image: int, video: int, audio: int}
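
A sketch of a vLLM model entry using several of these fields (the model reference and values are illustrative; adjust for your hardware):

name: my-vllm-model
backend: vllm
parameters:
  model: organization/model-name   # placeholder model reference
gpu_memory_utilization: 0.9
max_model_len: 8192
tensor_parallel_size: 2
dtype: bfloat16
limit_mm_per_prompt:
  image: 2
  video: 0
  audio: 1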

Template Configuration

Templates use Go templates with Sprig functions.

Field | Type | Description
template.chat | string | Template for the chat completion endpoint
template.chat_message | string | Template for individual chat messages
template.completion | string | Template for text completion
template.edit | string | Template for edit operations
template.function | string | Template for function/tool calls
template.multimodal | string | Template for multimodal interactions
template.reply_prefix | string | Prefix to add to model replies
template.use_tokenizer_template | bool | Use the tokenizer's built-in template (vLLM/transformers)
template.join_chat_messages_by_character | string | Character used to join chat messages (default: \n)

Template Variables

The following variables are commonly available in templates:

  • {{.Input}} - User input
  • {{.Instruction}} - Instruction for edit operations
  • {{.System}} - System message
  • {{.Prompt}} - Full prompt
  • {{.Functions}} - Function definitions (for function calling)
  • {{.FunctionCall}} - Function call result

Example Template

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}{{end}}
    {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
    {{end}}
    Assistant:
  

Function Calling Configuration

Configure how the model handles function/tool calls:

Field | Type | Default | Description
function.disable_no_action | bool | false | Disable the no-action behavior
function.no_action_function_name | string | answer | Name of the no-action function
function.no_action_description_name | string | | Description for the no-action function
function.function_name_key | string | name | JSON key for the function name
function.function_arguments_key | string | arguments | JSON key for the function arguments
function.response_regex | array | | Named regex patterns to extract function calls
function.argument_regex | array | | Named regex to extract function arguments
function.argument_regex_key_name | string | key | Named regex capture for the argument key
function.argument_regex_value_name | string | value | Named regex capture for the argument value
function.json_regex_match | array | | Regex patterns to match JSON in tool mode
function.replace_function_results | array | | Replace function call results with patterns
function.replace_llm_results | array | | Replace LLM results with patterns
function.capture_llm_results | array | | Capture LLM results as text (e.g., for "thinking" blocks)

Grammar Configuration

Field | Type | Default | Description
function.grammar.disable | bool | false | Completely disable grammar enforcement
function.grammar.parallel_calls | bool | false | Allow parallel function calls
function.grammar.mixed_mode | bool | false | Allow mixed-mode grammar enforcing
function.grammar.no_mixed_free_string | bool | false | Disallow free strings in mixed mode
function.grammar.disable_parallel_new_lines | bool | false | Disable parallel processing for new lines
function.grammar.prefix | string | | Prefix to add before grammar rules
function.grammar.expect_strings_after_json | bool | false | Expect strings after JSON data
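
A sketch combining the function-calling and grammar options from the two tables above (all fields are documented above; the chosen values are illustrative):

function:
  disable_no_action: false
  no_action_function_name: answer
  function_name_key: name
  function_arguments_key: arguments
  grammar:
    disable: false
    parallel_calls: true
    mixed_mode: false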

Diffusers Configuration

For image generation models using the diffusers backend:

Field | Type | Description
diffusers.cuda | bool | Enable CUDA for diffusers
diffusers.pipeline_type | string | Pipeline type (e.g., stable-diffusion, stable-diffusion-xl)
diffusers.scheduler_type | string | Scheduler type (e.g., euler, ddpm)
diffusers.enable_parameters | string | Comma-separated parameters to enable
diffusers.cfg_scale | float32 | Classifier-free guidance scale
diffusers.img2img | bool | Enable image-to-image transformation
diffusers.clip_skip | int | Number of CLIP layers to skip
diffusers.clip_model | string | CLIP model to use
diffusers.clip_subfolder | string | CLIP model subfolder
diffusers.control_net | string | ControlNet model to use
step | int | Number of diffusion steps
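
A sketch of an image-generation entry using the diffusers backend (the model name, pipeline, and values are illustrative):

name: my-image-model
backend: diffusers
parameters:
  model: my-sdxl-model   # placeholder model reference

step: 25
diffusers:
  cuda: true
  pipeline_type: stable-diffusion-xl
  scheduler_type: euler
  cfg_scale: 7.0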

TTS Configuration

For text-to-speech models:

Field | Type | Description
tts.voice | string | Voice file path or voice ID
tts.audio_path | string | Path to audio files (for Vall-E)
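
A minimal sketch of a TTS model entry; the backend name and voice file below are placeholders, so substitute the TTS backend and voice you actually use:

name: my-tts-voice
backend: piper            # placeholder TTS backend
tts:
  voice: voice-file.onnx  # placeholder voice file in the models directory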

Roles Configuration

Map conversation roles to specific strings:

roles:
  user: "### Instruction:"
  assistant: "### Response:"
  system: "### System Instruction:"
  

Feature Flags

Enable or disable experimental features:

feature_flags:
  feature_name: true
  another_feature: false
  

MCP Configuration

Model Context Protocol (MCP) configuration:

Field | Type | Description
mcp.remote | string | YAML string defining remote MCP servers
mcp.stdio | string | YAML string defining STDIO MCP servers

Agent Configuration

Agent/autonomous agent configuration:

Field | Type | Description
agent.max_attempts | int | Maximum number of attempts
agent.max_iterations | int | Maximum number of iterations
agent.enable_reasoning | bool | Enable reasoning capabilities
agent.enable_planning | bool | Enable planning capabilities
agent.enable_mcp_prompts | bool | Enable MCP prompts
agent.enable_plan_re_evaluator | bool | Enable plan re-evaluation
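
For example (the limits below are illustrative):

agent:
  max_attempts: 3
  max_iterations: 10
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: false
  enable_plan_re_evaluator: true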

Pipeline Configuration

Define pipelines for audio-to-audio processing:

Field | Type | Description
pipeline.tts | string | TTS model name
pipeline.llm | string | LLM model name
pipeline.transcription | string | Transcription model name
pipeline.vad | string | Voice activity detection model name
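
For example, wiring a voice-assistant style pipeline out of other models defined in the same LocalAI instance (the model names are placeholders):

pipeline:
  vad: my-vad-model
  transcription: my-whisper-model
  llm: gpt-3.5-turbo
  tts: my-tts-voice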

gRPC Configuration

Backend gRPC communication settings:

Field | Type | Description
grpc.attempts | int | Number of retry attempts
grpc.attempts_sleep_time | int | Sleep time between retries (seconds)
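
For example, to retry a failing backend a few times before giving up (values are illustrative):

grpc:
  attempts: 3
  attempts_sleep_time: 2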

Overrides

Override model configuration values at runtime (llama.cpp):

overrides:
  - "qwen3moe.expert_used_count=int:10"
  - "some_key=string:value"
  

Format: KEY=TYPE:VALUE where TYPE is int, float, string, or bool.

Known Use Cases

Specify which endpoints this model supports:

known_usecases:
  - chat
  - completion
  - embeddings
  

Available flags: chat, completion, edit, embeddings, rerank, image, transcript, tts, sound_generation, tokenize, vad, video, detection, and llm (a combination of chat, completion, and edit).

Complete Example

Here’s a comprehensive example combining many options:

name: my-llm-model
description: A high-performance LLM model
backend: llama-stable

parameters:
  model: my-model.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 2048

context_size: 4096
threads: 8
f16: true
gpu_layers: 35

system_prompt: "You are a helpful AI assistant."

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:

roles:
  user: "User:"
  assistant: "Assistant:"
  system: "System:"

stopwords:
  - "\n\nUser:"
  - "\n\nHuman:"

prompt_cache_path: "cache/my-model"
prompt_cache_all: true

function:
  grammar:
    parallel_calls: true
    mixed_mode: false

feature_flags:
  experimental_feature: true
  
