Model compatibility table

Besides llama-based models, LocalAI is also compatible with other architectures. The tables below list the available backends, the model families and capabilities each one supports, and the acceleration variants it is published for.

Note

LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.

All backends listed here can be installed on demand from the Backend Gallery. The exact set of acceleration variants published for each backend is defined in backend/index.yaml.
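
To make the note above concrete, here is a minimal sketch of pinning a model to a backend through a YAML configuration file. It only renders the YAML text; the model name `my-model` and the model reference `some-org/some-model` are hypothetical placeholders, and the field layout (`name`, `backend`, `parameters.model`) is an assumption about a typical minimal configuration, so check the advanced section for the authoritative schema.

```python
def render_model_config(name: str, backend: str, model_ref: str) -> str:
    """Return a minimal model configuration as YAML text.

    The field layout is an assumption for illustration; consult the
    advanced section for the full set of supported options.
    """
    return (
        f"name: {name}\n"
        f"backend: {backend}\n"
        "parameters:\n"
        f"  model: {model_ref}\n"
    )


config_text = render_model_config(
    name="my-model",                  # hypothetical model name
    backend="transformers",           # a backend name from the tables below
    model_ref="some-org/some-model",  # hypothetical model reference
)
print(config_text)
```

Saving the printed text as a `.yaml` file alongside your models is one way to skip automatic backend detection for that model.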

Text Generation & Language Models

| Backend | Description | Capability | Embeddings | Streaming | Acceleration |
|---|---|---|---|---|---|
| llama.cpp | LLM inference in C/C++. Supports LLaMA, Mamba, RWKV, Falcon, Starcoder, GPT-2, and many others | GPT, Functions | yes | yes | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| ik_llama.cpp | Hard fork of llama.cpp optimized for CPU/hybrid CPU+GPU with IQK quants, custom quant mixes, and MLA for DeepSeek | GPT | yes | yes | CPU (AVX2+) |
| turboquant | llama.cpp fork adding the TurboQuant KV-cache quantization scheme | GPT | yes | yes | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Jetson L4T |
| ds4 | DeepSeek V4 Flash single-model inference engine, optimized for Metal and CUDA | GPT | no | yes | CPU, CUDA 12/13, Metal, Jetson L4T |
| vLLM | Fast LLM serving with PagedAttention; GPTQ/AWQ/FP8 quantization | GPT, Functions, Multimodal | no | yes | CUDA 12/13, ROCm, Intel SYCL, Jetson L4T |
| vLLM Omni | Unified multimodal generation (text, image, video, audio) on top of vLLM | Multimodal GPT, Functions | no | yes | CUDA 12/13, ROCm, Jetson L4T |
| SGLang | Fast serving framework for LLMs and vision-language models with speculative decoding | GPT, Functions, Multimodal | no | yes | CUDA 12/13, ROCm, Intel SYCL, Jetson L4T |
| transformers | HuggingFace Transformers framework | GPT, Embeddings, Multimodal | yes | yes* | CUDA 12/13, ROCm, Intel SYCL, Metal |
| MLX | Apple Silicon LLM inference | GPT, Functions | no | yes | CPU, CUDA 12/13, Metal, Jetson L4T |
| MLX-VLM | Vision-Language Models on Apple Silicon | Multimodal GPT, Functions | no | yes | CPU, CUDA 12/13, Metal, Jetson L4T |
| MLX Distributed | Distributed LLM inference across multiple Apple Silicon Macs | GPT | no | no | CPU, CUDA 12/13, Metal, Jetson L4T |
| tinygrad | Minimalist deep-learning framework with zero runtime dependencies | GPT, Embeddings, Multimodal | yes | yes | CPU |
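
The table above can also be read as data when narrowing down a backend for a given machine. The sketch below copies four rows from it (backend name, embeddings, streaming, acceleration) and filters them by requirement. The `Backend` class and `candidates` helper are hypothetical names introduced for illustration and are not part of LocalAI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    embeddings: bool
    streaming: bool
    acceleration: frozenset


# A few rows copied from the text-generation table above.
BACKENDS = [
    Backend("llama.cpp", True, True, frozenset({
        "CPU", "CUDA 12/13", "ROCm", "Intel SYCL", "Vulkan", "Metal", "Jetson L4T"})),
    Backend("vLLM", False, True, frozenset({
        "CUDA 12/13", "ROCm", "Intel SYCL", "Jetson L4T"})),
    Backend("MLX Distributed", False, False, frozenset({
        "CPU", "CUDA 12/13", "Metal", "Jetson L4T"})),
    Backend("tinygrad", True, True, frozenset({"CPU"})),
]


def candidates(acceleration, need_embeddings=False, need_streaming=False):
    """Return names of backends that offer the given acceleration and features."""
    return [
        b.name
        for b in BACKENDS
        if acceleration in b.acceleration
        and (b.embeddings or not need_embeddings)
        and (b.streaming or not need_streaming)
    ]


print(candidates("CPU", need_embeddings=True))   # ['llama.cpp', 'tinygrad']
print(candidates("Metal", need_streaming=True))  # ['llama.cpp']
```

Extending `BACKENDS` with the remaining rows follows the same pattern.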

Speech-to-Text

| Backend | Description | Acceleration |
|---|---|---|
| whisper.cpp | OpenAI Whisper in C/C++ | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| faster-whisper | Fast Whisper with CTranslate2 | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal, Jetson L4T |
| WhisperX | Word-level timestamps and speaker diarization | CPU, CUDA 12/13, Metal, Jetson L4T |
| moonshine | Ultra-fast transcription for low-end devices (ONNX) | CPU, CUDA 12/13, Metal |
| parakeet.cpp | C++/GGML port of NVIDIA NeMo Parakeet (tdt/ctc/rnnt/hybrid), with cache-aware streaming | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| CrispASR | Unified speech engine (whisper.cpp fork) supporting Parakeet, Canary, and many ASR architectures, plus TTS | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| voxtral | Voxtral Realtime 4B speech-to-text in pure C | CPU, Metal |
| Qwen3-ASR | Qwen3 automatic speech recognition | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal, Jetson L4T |
| NeMo | NVIDIA NeMo ASR toolkit | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal |
| sherpa-onnx | Sherpa-ONNX ASR (Whisper, Paraformer, SenseVoice) and TTS | CPU, CUDA 12, Metal |

Text-to-Speech

| Backend | Description | Acceleration |
|---|---|---|
| piper | Fast neural TTS | CPU, Metal |
| Coqui TTS | TTS with 1100+ languages and voice cloning | CUDA 12, ROCm, Intel SYCL, Metal |
| Kokoro | Lightweight TTS (82M params) | CUDA 12/13, ROCm, Intel SYCL, Metal, Jetson L4T |
| Kokoros | Pure Rust Kokoro TTS via ONNX | CPU |
| Chatterbox | Production-grade TTS with emotion control | CPU, CUDA 12/13, Metal, Jetson L4T |
| VibeVoice | Real-time TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal, Jetson L4T |
| vibevoice.cpp | Native C++/GGML port of VibeVoice for TTS (voice cloning) and long-form ASR with diarization | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| Qwen3-TTS | TTS with custom voice, voice design, and voice cloning | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal, Jetson L4T |
| qwentts.cpp | Native C++/GGML Qwen3-TTS with streaming, named speakers, and voice design | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| OmniVoice | Native C++/GGML TTS with voice cloning, voice design, and streaming | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| fish-speech | High-quality TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal, Jetson L4T |
| Pocket TTS | Lightweight CPU-efficient TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal, Jetson L4T |
| OuteTTS | TTS with custom speaker voices | CPU, CUDA 12 |
| faster-qwen3-tts | Real-time Qwen3-TTS with CUDA graph capture | CPU, CUDA 12/13, Jetson L4T |
| NeuTTS Air | Instant voice cloning, on-device TTS | CPU, CUDA 12, ROCm |
| VoxCPM | Expressive end-to-end TTS | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal |
| Kitten TTS | Kitten TTS model | CPU, Metal |
| Supertonic | Lightning-fast on-device multilingual TTS via ONNX | CPU |
| MLX-Audio | Audio models on Apple Silicon | CPU, CUDA 12/13, Metal, Jetson L4T |
| liquid-audio | LFM2 end-to-end speech-to-speech, ASR, and TTS | CPU, CUDA 12/13, ROCm, Intel SYCL, Jetson L4T |

Music & Sound Generation

| Backend | Description | Acceleration |
|---|---|---|
| ACE-Step | Music generation from text descriptions, lyrics, or audio | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal |
| acestep.cpp | ACE-Step 1.5 C++ backend using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |

Image & Video Generation

| Backend | Description | Acceleration |
|---|---|---|
| stable-diffusion.cpp | Stable Diffusion, Flux, PhotoMaker, Ideogram in C/C++ | CPU, CUDA 12/13, Intel SYCL, Vulkan, Metal, Jetson L4T |
| diffusers | HuggingFace diffusion models (image and video generation) | CPU, CUDA 12/13, ROCm, Intel SYCL, Metal, Jetson L4T |
| vLLM Omni | Multimodal generation including text-to-image and text-to-video | CUDA 12/13, ROCm, Jetson L4T |

Vision, Detection & Recognition

| Backend | Description | Acceleration |
|---|---|---|
| RF-DETR | Real-time transformer-based object detection (Python) | CPU, CUDA 12/13, Intel SYCL, Metal, Jetson L4T |
| rf-detr.cpp | Native RF-DETR object detection and instance segmentation in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
| locate-anything.cpp | Open-vocabulary object detection and visual grounding (LocateAnything-3B) in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
| depth-anything.cpp | Depth Anything 3 monocular metric depth + camera pose in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
| sam3.cpp | Segment Anything (SAM 3/2/EdgeTAM) with text/point/box prompts in C/C++ using GGML | CPU, CUDA 12/13, Intel SYCL, Vulkan, Jetson L4T |
| face-detect.cpp | Native face detection, recognition, embedding, demographics and anti-spoofing (SCRFD/ArcFace, YuNet/SFace) in C/C++ using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| voice-detect.cpp | Native speaker (voice) recognition and voice analysis (ECAPA-TDNN, WeSpeaker, ERes2Net, CAM++, wav2vec2) in C/C++ using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| insightface | Face verification, embedding, and anti-spoofing liveness (ONNX Runtime) | CPU, CUDA 12 |
| speaker-recognition | Speaker (voice) recognition via SpeechBrain ECAPA-TDNN | CPU, CUDA 12, Metal |

Audio Processing

| Backend | Description | Acceleration |
|---|---|---|
| Silero VAD | Voice Activity Detection | CPU, Metal |
| LocalVQE | Joint acoustic echo cancellation, noise suppression, and dereverberation in C/C++ using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Jetson L4T |
| Opus | Audio codec for WebRTC / Realtime API | CPU, Metal |

Utilities & Other

| Backend | Description | Acceleration |
|---|---|---|
| rerankers | Document reranking for RAG | CUDA 12, ROCm, Intel SYCL, Metal |
| privacy-filter.cpp | Standalone GGML engine for the openai-privacy-filter PII/NER token-classification model family (powers LocalAI’s PII redaction tier) | CPU, CUDA 13, Vulkan |
| local-store | Local-first vector database for embeddings | CPU, Metal |
| TRL | Fine-tuning (SFT, DPO, GRPO, RLOO, KTO, ORPO) | CPU, CUDA 12/13 |
| llama.cpp quantization | HuggingFace → GGUF model conversion and quantization | CPU, Metal |

Acceleration Support Summary

GPU Acceleration

  • NVIDIA CUDA: CUDA 12.0, CUDA 13.0 support across most backends
  • AMD ROCm: HIP-based acceleration for AMD GPUs
  • Intel oneAPI: SYCL-based acceleration for Intel GPUs (F16/F32 precision)
  • Vulkan: Cross-platform GPU acceleration
  • Metal: Apple Silicon GPU acceleration (M1/M2/M3+)

Specialized Hardware

  • NVIDIA Jetson (L4T CUDA 12): ARM64 support for embedded AI (AGX Orin, Jetson Nano, Jetson Xavier NX, Jetson AGX Xavier)
  • NVIDIA Jetson (L4T CUDA 13): ARM64 support for embedded AI (DGX Spark)
  • Apple Silicon: Native Metal acceleration for Mac M1/M2/M3+
  • Darwin x86: Intel Mac support

CPU Optimization

  • AVX/AVX2/AVX512: Advanced vector extensions for x86
  • Quantization: 4-bit, 5-bit, 8-bit integer quantization support
  • Mixed Precision: F16/F32 mixed precision support
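
As a rough illustration of why the quantization widths listed above matter, the sketch below estimates the storage needed for model weights at different bit widths. The 7-billion-parameter model is a hypothetical example, and the estimate assumes every weight uses exactly the stated number of bits; real quantized formats also store per-block scale data, so actual files are somewhat larger.

```python
def approx_weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Estimate weight storage in bytes, ignoring per-block scale overhead."""
    return n_params * bits_per_weight / 8


N_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

for bits in (16, 8, 5, 4):
    gigabytes = approx_weight_bytes(N_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit: ~{gigabytes:.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the estimate to a quarter, which is often the difference between fitting a model in CPU or GPU memory and not fitting it at all.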

Note: any backend name listed above can be used in the backend field of the model configuration file (see the advanced section).

* Only for CUDA and OpenVINO CPU/XPU acceleration.