Model compatibility table

Besides llama-based models, LocalAI is also compatible with other model architectures. The table below lists all the backends, the compatible model families, and the associated repositories.

Note

LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
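For example, a model can be pinned to a specific backend with a minimal YAML definition placed in the models directory. This is only a sketch: the model name, file name, and backend identifier below are illustrative, so check the tables below and the advanced section for the exact values to use.

```yaml
# models/my-model.yaml — hypothetical example
name: my-model                     # name the model is exposed as via the API
backend: llama-cpp                 # backend to load the model with
parameters:
  model: my-model.Q4_K_M.gguf      # model file to load from the models directory
```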

Text Generation & Language Models

| Backend | Description | Capability | Embeddings | Streaming | Acceleration |
|---|---|---|---|---|---|
| llama.cpp | LLM inference in C/C++. Supports LLaMA, Mamba, RWKV, Falcon, Starcoder, GPT-2, and many others | GPT, Functions | yes | yes | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| vLLM | Fast LLM serving with PagedAttention | GPT | no | no | CUDA 12, ROCm, Intel |
| vLLM Omni | Unified multimodal generation (text, image, video, audio) | Multimodal GPT | no | no | CUDA 12, ROCm |
| transformers | HuggingFace Transformers framework | GPT, Embeddings, Multimodal | yes | yes* | CPU, CUDA 12/13, ROCm, Intel, Metal |
| MLX | Apple Silicon LLM inference | GPT | no | no | Metal |
| MLX-VLM | Vision-Language Models on Apple Silicon | Multimodal GPT | no | no | Metal |
| MLX Distributed | Distributed LLM inference across multiple Apple Silicon Macs | GPT | no | no | Metal |

Speech-to-Text

| Backend | Description | Acceleration |
|---|---|---|
| whisper.cpp | OpenAI Whisper in C/C++ | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| faster-whisper | Fast Whisper with CTranslate2 | CUDA 12/13, ROCm, Intel, Metal |
| WhisperX | Word-level timestamps and speaker diarization | CPU, CUDA 12/13, ROCm, Metal |
| moonshine | Ultra-fast transcription for low-end devices | CPU, CUDA 12/13, Metal |
| voxtral | Voxtral Realtime 4B speech-to-text in C | CPU, Metal |
| Qwen3-ASR | Qwen3 automatic speech recognition | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| NeMo | NVIDIA NeMo ASR toolkit | CPU, CUDA 12/13, ROCm, Intel, Metal |

Text-to-Speech

| Backend | Description | Acceleration |
|---|---|---|
| piper | Fast neural TTS | CPU |
| Coqui TTS | TTS with 1100+ languages and voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal |
| Kokoro | Lightweight TTS (82M params) | CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| Chatterbox | Production-grade TTS with emotion control | CPU, CUDA 12/13, Metal, Jetson L4T |
| VibeVoice | Real-time TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| Qwen3-TTS | TTS with custom voice, voice design, and voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| fish-speech | High-quality TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| Pocket TTS | Lightweight CPU-efficient TTS with voice cloning | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |
| OuteTTS | TTS with custom speaker voices | CPU, CUDA 12 |
| faster-qwen3-tts | Real-time Qwen3-TTS with CUDA graph capture | CUDA 12/13, Jetson L4T |
| NeuTTS Air | Instant voice cloning TTS | CPU, CUDA 12, ROCm |
| VoxCPM | Expressive end-to-end TTS | CPU, CUDA 12/13, ROCm, Intel, Metal |
| Kitten TTS | Kitten TTS model | CPU, Metal |
| MLX-Audio | Audio models on Apple Silicon | Metal, CPU, CUDA 12/13, Jetson L4T |

Music Generation

| Backend | Description | Acceleration |
|---|---|---|
| ACE-Step | Music generation from text descriptions, lyrics, or audio | CPU, CUDA 12/13, ROCm, Intel, Metal |
| acestep.cpp | ACE-Step 1.5 C++ backend using GGML | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |

Image & Video Generation

| Backend | Description | Acceleration |
|---|---|---|
| stable-diffusion.cpp | Stable Diffusion, Flux, PhotoMaker in C/C++ | CPU, CUDA 12/13, Intel SYCL, Vulkan, Metal, Jetson L4T |
| diffusers | HuggingFace diffusion models (image and video generation) | CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T |

Specialized Tasks

| Backend | Description | Acceleration |
|---|---|---|
| RF-DETR | Real-time transformer-based object detection | CPU, CUDA 12/13, Intel, Metal, Jetson L4T |
| rerankers | Document reranking for RAG | CUDA 12/13, ROCm, Intel, Metal |
| local-store | Local vector database for embeddings | CPU, Metal |
| Silero VAD | Voice Activity Detection | CPU |
| TRL | Fine-tuning (SFT, DPO, GRPO, RLOO, KTO, ORPO) | CPU, CUDA 12/13 |
| llama.cpp quantization | HuggingFace → GGUF model conversion and quantization | CPU, Metal |
| Opus | Audio codec for WebRTC / Realtime API | CPU, Metal |

Acceleration Support Summary

GPU Acceleration

  • NVIDIA CUDA: CUDA 12.0, CUDA 13.0 support across most backends
  • AMD ROCm: HIP-based acceleration for AMD GPUs
  • Intel oneAPI: SYCL-based acceleration for Intel GPUs (F16/F32 precision)
  • Vulkan: Cross-platform GPU acceleration
  • Metal: Apple Silicon GPU acceleration (M1/M2/M3+)

Specialized Hardware

  • NVIDIA Jetson (L4T CUDA 12): ARM64 support for embedded AI (AGX Orin, Jetson Nano, Jetson Xavier NX, Jetson AGX Xavier)
  • NVIDIA Jetson (L4T CUDA 13): ARM64 support for embedded AI (DGX Spark)
  • Apple Silicon: Native Metal acceleration for Mac M1/M2/M3+
  • Darwin x86: Intel Mac support

CPU Optimization

  • AVX/AVX2/AVX512: Advanced vector extensions for x86
  • Quantization: 4-bit, 5-bit, 8-bit integer quantization support
  • Mixed Precision: F16/F32 mixed precision support

Note: any backend name listed above can be used in the `backend` field of the model configuration file (see the advanced section).

  • *: streaming is supported only with CUDA and OpenVINO CPU/XPU acceleration.