Troubleshooting
This guide covers common issues you may encounter when using LocalAI, organized by category. For each issue, diagnostic steps and solutions are provided.
Quick Diagnostics
Before diving into specific issues, run these commands to gather diagnostic information:
For Docker deployments:
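For a binary install and a Docker deployment respectively, the checks might look like the following sketch (the container name `local-ai` is an assumption — substitute your own):

```shell
# Binary install: version and verbose logs
local-ai --version
DEBUG=true local-ai run

# Docker deployment (assumes the container is named "local-ai")
docker ps --filter name=local-ai     # is the container running?
docker logs --tail 100 local-ai      # recent startup and error logs
```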
Installation Issues
Binary Won’t Execute on Linux
Symptoms: Permission denied or “cannot execute binary file” errors.
Solution:
If you see “cannot execute binary file: Exec format error”, you downloaded the wrong architecture. Verify with:
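As a sketch, assuming the binary was downloaded to the current directory as `local-ai`:

```shell
# Fix "Permission denied": the binary needs the execute bit
chmod +x ./local-ai 2>/dev/null || echo "local-ai not found - check your download path"

# Check for an architecture mismatch ("Exec format error")
uname -m    # x86_64 needs the amd64 build; aarch64/arm64 needs the arm64 build
```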
macOS: Application Is Quarantined
Symptoms: macOS blocks LocalAI from running because the DMG is not signed by Apple.
Solution: See GitHub issue #6268 for quarantine bypass instructions. This is tracked for resolution in issue #6244.
Model Loading Problems
Model Not Found
Symptoms: API returns 404 or "model not found" error.
Diagnostic steps:
Check the model exists in your models directory:
Verify your models path is correct:
Confirm the model name matches your request:
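The three checks above might look like this, assuming the default models directory `./models` and port 8080:

```shell
ls -lh ./models/                          # 1. does the model file exist?
grep -r "name:" ./models/*.yaml           # 2. names defined in model configs (if any)
curl -s http://localhost:8080/v1/models   # 3. names the API will accept in "model"
```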
Model Fails to Load (Backend Error)
Symptoms: Model is found but fails to load, with backend errors in the logs.
Common causes and fixes:
- Wrong backend: Ensure the backend in your model YAML matches the model format. GGUF models use llama-cpp, diffusion models use diffusers, etc. See the compatibility table for details.
- Backend not installed: Check installed backends:
- Corrupt model file: Re-download the model. Partial downloads or disk errors can corrupt files.
- Wrong model format: LocalAI uses GGUF format for llama.cpp models. Older GGML format is deprecated.
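Newer LocalAI versions expose a backends subcommand for this; verify it exists in your version:

```shell
local-ai backends list
```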
Model Configuration Issues
Symptoms: Model loads but produces unexpected results or errors during inference.
Check your model YAML configuration:
Common mistakes:
- The model path must be relative to the models directory, not an absolute path
- threads set higher than the number of physical CPU cores causes contention
- context_size too large for available RAM causes OOM errors
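A minimal llama-cpp model definition illustrating these points (the model name and file name are placeholders):

```yaml
name: my-model
backend: llama-cpp
parameters:
  model: my-model.Q4_K_M.gguf   # relative to the models directory, not absolute
context_size: 4096              # keep within available RAM
threads: 8                      # no higher than your physical core count
```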
GPU and Memory Issues
GPU Not Detected
NVIDIA (CUDA):
When working correctly, LocalAI logs should show: ggml_init_cublas: found X CUDA devices.
Ensure you are using a CUDA-enabled container image (tags containing cuda11, cuda12, or cuda13). CPU-only images cannot use NVIDIA GPUs.
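A quick way to confirm the container runtime can see the GPU (the image tag is illustrative — pick a cuda11/cuda12/cuda13 tag matching your driver):

```shell
nvidia-smi   # is the GPU visible on the host?
docker run --rm --gpus all \
  localai/localai:latest-gpu-nvidia-cuda-12 nvidia-smi   # and inside the container?
```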
AMD (ROCm):
If your GPU is not in the default target list, open an issue on GitHub. Supported targets include: gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942, gfx1030, gfx1031, gfx1100, gfx1101.
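A sketch for verifying ROCm visibility (the image tag is an assumption — check the LocalAI registry for the current hipblas tag):

```shell
rocminfo | grep gfx                     # which gfx target is the host GPU?
docker run --rm --device /dev/kfd --device /dev/dri \
  localai/localai:latest-gpu-hipblas rocminfo
```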
Intel (SYCL):
Use container images with gpu-intel in the tag. Known issue: SYCL hangs when mmap: true is set — disable it in your model config:
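In the model YAML:

```yaml
mmap: false   # work around the SYCL hang seen with memory mapping enabled
```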
Overriding backend auto-detection:
If LocalAI picks the wrong GPU backend, override it:
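One option is to pin the backend explicitly in the model YAML instead of relying on auto-detection (the backend name shown is an example):

```yaml
backend: llama-cpp
```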
Out of Memory (OOM)
Symptoms: Model loading fails or the process is killed by the OS.
Solutions:
- Use smaller quantizations: Q4_K_S or Q2_K use significantly less memory than Q8_0 or Q6_K
- Reduce context size: Lower context_size in your model YAML
- Enable low VRAM mode: Add low_vram: true to your model config
- Limit active models: Only keep one model loaded at a time
- Enable idle watchdog: Automatically unload unused models
- Manually unload a model when you are done with it
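For example — the /backend/shutdown endpoint is taken from LocalAI's backend API; verify it against your version:

```shell
# Keep only one model in memory
local-ai run --max-active-backends=1

# Unload a specific model without restarting the server
curl -X POST http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model"}'
```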
Models Stay Loaded and Consume Memory
By default, models remain loaded in memory after first use. This can exhaust VRAM when switching between models.
Configure LRU eviction:
Configure watchdog auto-unload:
These can also be set via environment variables (LOCALAI_WATCHDOG_IDLE=true, LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m) or in the Web UI under Settings → Watchdog Settings.
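For a Docker deployment that would be, for example:

```shell
docker run -d --name local-ai -p 8080:8080 \
  -e LOCALAI_WATCHDOG_IDLE=true \
  -e LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
  localai/localai:latest
```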
See the VRAM Management guide for more details.
API Connection Problems
Connection Refused
Symptoms: curl: (7) Failed to connect to localhost port 8080: Connection refused
Diagnostic steps:
Verify LocalAI is running:
Check the bind address and port:
Check for port conflicts:
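A sketch of these checks, assuming the default localhost:8080 (/readyz is LocalAI's readiness probe):

```shell
# Is LocalAI answering?
curl -sf http://localhost:8080/readyz && echo "LocalAI is up" || echo "no response on port 8080"

# Is something else already bound to the port? (Linux)
ss -ltn 2>/dev/null | grep 8080 || echo "nothing listening on 8080"
```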
Authentication Errors (401)
Symptoms: 401 Unauthorized response.
If API key authentication is enabled (LOCALAI_API_KEY or --api-keys), include the key in your requests:
Keys can also be passed via x-api-key or xi-api-key headers.
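For example, with the key in an environment variable:

```shell
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer $LOCALAI_API_KEY"
```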
Request Errors (400/422)
Symptoms: 400 Bad Request or 422 Unprocessable Entity.
Common causes:
- Malformed JSON in request body
- Missing required fields (e.g., model or messages)
- Invalid parameter values (e.g., a negative top_n for reranking)
Enable debug logging to see the full request/response:
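Either of these enables debug output:

```shell
DEBUG=true local-ai run
# or
local-ai run --log-level=debug
```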
See the API Errors reference for a complete list of error codes and their meanings.
Performance Issues
Slow Inference
Diagnostic steps:
Enable debug mode to see inference timing:
Use streaming to measure time-to-first-token:
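A streaming request looks like this (the model name is a placeholder); the gap before the first chunk arrives is your time-to-first-token:

```shell
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```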
Common causes and fixes:
- Model on HDD: Move models to an SSD. If stuck with an HDD, disable memory mapping (mmap: false) to load the model entirely into RAM.
- Thread overbooking: Set --threads to match your physical CPU core count (not the logical/hyperthreaded count).
- Default sampling: LocalAI uses mirostat sampling by default, which produces better-quality output but is slower. Disable it for benchmarking.
- No GPU offloading: Ensure gpu_layers is set in your model config to offload layers to the GPU.
- Context size too large: Larger context sizes require more memory and slow down inference. Use the smallest context size that meets your needs.
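These tips map to model YAML settings like the following (values are illustrative — tune them for your model and hardware):

```yaml
mirostat: 0       # disable mirostat sampling for benchmarking
gpu_layers: 35    # number of layers to offload to the GPU
mmap: false       # only if the model file sits on an HDD
```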
High Memory Usage
- Use quantized models (Q4_K_M is a good balance of quality and size)
- Reduce context_size
- Enable low_vram: true in the model config
- Disable mmlock (memory locking) if it’s enabled
- Set --max-active-backends=1 to keep only one model in memory
Docker-Specific Problems
Container Fails to Start
Diagnostic steps:
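Assuming the container is named local-ai:

```shell
docker ps -a --filter name=local-ai   # status and exit code
docker logs local-ai                  # startup errors
docker inspect local-ai --format '{{.State.Status}}: {{.State.Error}}'
```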
GPU Not Available Inside Container
NVIDIA:
AMD:
Intel:
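Minimal pass-through checks per vendor (image tags are illustrative):

```shell
# NVIDIA: requires the NVIDIA Container Toolkit on the host
docker run --rm --gpus all localai/localai:latest-gpu-nvidia-cuda-12 nvidia-smi

# AMD: pass the ROCm device nodes through
docker run --rm --device /dev/kfd --device /dev/dri \
  localai/localai:latest-gpu-hipblas rocminfo

# Intel: pass the render devices through
docker run --rm --device /dev/dri localai/localai:latest-gpu-intel ls -l /dev/dri
```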
Health Checks Failing
Add a health check to your Docker Compose configuration:
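For example — this assumes curl is available inside the image and uses LocalAI's /readyz probe:

```yaml
services:
  localai:
    image: localai/localai:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 30s
      timeout: 10s
      retries: 5
```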
Models Not Persisted Between Restarts
Mount a volume for your models directory:
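For example (adjust the container-side path to wherever your image expects models):

```yaml
services:
  localai:
    image: localai/localai:latest
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
```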
Network and P2P Issues
P2P Workers Not Discovered
Symptoms: Distributed inference setup but workers are not found.
Key requirements:
- Use --net host or network_mode: host in Docker
- Share the same P2P token across all nodes
Debug P2P connectivity:
If DHT is causing issues, try disabling it to use local mDNS discovery instead:
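As a sketch — the LOCALAI_P2P_DISABLE_DHT variable follows LocalAI's P2P environment-variable naming; confirm it against your version:

```shell
DEBUG=true LOCALAI_P2P_DISABLE_DHT=true local-ai run --p2p
```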
P2P Limitations
- Only a single model is currently supported for distributed inference
- Workers must be detected before inference starts — you cannot add workers mid-inference
- Workers mode supports llama-cpp compatible models only
See the Distributed Inferencing guide for full setup instructions.
Still Having Issues?
If your issue isn’t covered here:
- Search existing issues: Check the GitHub Issues for similar problems
- Enable debug logging: Run with DEBUG=true or --log-level=debug and include the logs when reporting
- Open a new issue: Include your OS, hardware (CPU/GPU), LocalAI version, the model being used, full error logs, and steps to reproduce
- Community help: Join the LocalAI Discord for community support