# Model Quantization

LocalAI supports model quantization directly through the API and Web UI. Quantization converts HuggingFace models to GGUF format and compresses them for efficient inference with llama.cpp.

> **Note:** This feature is experimental and may change in future releases.

## Supported Backends

| Backend | Description | Quantization Types | Platforms |
|---|---|---|---|
| `llama-cpp-quantization` | Downloads HF models, converts to GGUF, and quantizes using llama.cpp | `q2_k`, `q3_k_s`, `q3_k_m`, `q3_k_l`, `q4_0`, `q4_k_s`, `q4_k_m`, `q5_0`, `q5_k_s`, `q5_k_m`, `q6_k`, `q8_0`, `f16` | CPU (Linux, macOS) |

## Quick Start

### 1. Start a quantization job

```bash
curl -X POST http://localhost:8080/api/quantization/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/functiongemma-270m-it",
    "quantization_type": "q4_k_m"
  }'
```

### 2. Monitor progress (SSE stream)

```bash
curl -N http://localhost:8080/api/quantization/jobs/{job_id}/progress
```

### 3. Download the quantized model

```bash
curl -o model.gguf http://localhost:8080/api/quantization/jobs/{job_id}/download
```

### 4. Or import it directly into LocalAI

```bash
curl -X POST http://localhost:8080/api/quantization/jobs/{job_id}/import \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-quantized-model"
  }'
```
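The same workflow can be scripted. A minimal Python sketch using only the standard library; `build_job_request` and `start_job` are illustrative helper names (not part of LocalAI), and the defaults mirror the job request fields documented below:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"

def build_job_request(model, quantization_type="q4_k_m", backend=None, hf_token=None):
    """Return (url, payload) for POST /api/quantization/jobs.

    Only non-default fields are included; an hf_token goes into
    extra_options for gated HuggingFace models.
    """
    payload = {"model": model, "quantization_type": quantization_type}
    if backend:
        payload["backend"] = backend
    if hf_token:
        payload["extra_options"] = {"hf_token": hf_token}
    return f"{BASE_URL}/api/quantization/jobs", payload

def start_job(model, **kwargs):
    """Submit the job and return the parsed JSON response (requires a running LocalAI)."""
    url, payload = build_job_request(model, **kwargs)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Separating request construction from the network call keeps the payload logic testable without a server.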

## API Reference

### Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/api/quantization/jobs` | Start a quantization job |
| GET | `/api/quantization/jobs` | List all jobs |
| GET | `/api/quantization/jobs/:id` | Get job details |
| POST | `/api/quantization/jobs/:id/stop` | Stop a running job |
| DELETE | `/api/quantization/jobs/:id` | Delete a job and its data |
| GET | `/api/quantization/jobs/:id/progress` | SSE progress stream |
| POST | `/api/quantization/jobs/:id/import` | Import quantized model into LocalAI |
| GET | `/api/quantization/jobs/:id/download` | Download quantized GGUF file |
| GET | `/api/quantization/backends` | List available quantization backends |
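Since every per-job endpoint shares the same prefix, a tiny URL builder (a hypothetical helper, not part of LocalAI) keeps the paths from the table in one place when scripting:

```python
BASE = "http://localhost:8080/api/quantization"

def job_url(job_id=None, action=None):
    """Build a quantization API URL from the endpoint table.

    job_url()                   -> jobs collection (POST to create, GET to list)
    job_url("abc")              -> single job (GET details, DELETE to remove)
    job_url("abc", "progress")  -> per-job action (progress/stop/import/download)
    """
    parts = [BASE, "jobs"]
    if job_id:
        parts.append(job_id)
    if action:
        parts.append(action)
    return "/".join(parts)
```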

### Job Request Fields

| Field | Type | Description |
|---|---|---|
| `model` | string | HuggingFace model ID or local path (required) |
| `backend` | string | Backend name (default: `llama-cpp-quantization`) |
| `quantization_type` | string | Quantization format (default: `q4_k_m`) |
| `extra_options` | map | Backend-specific options (see below) |

### Extra Options

| Key | Description |
|---|---|
| `hf_token` | HuggingFace token for gated models |

### Import Request Fields

| Field | Type | Description |
|---|---|---|
| `name` | string | Model name for LocalAI (auto-generated if empty) |

### Job Status Values

| Status | Description |
|---|---|
| `queued` | Job created, waiting to start |
| `downloading` | Downloading model from HuggingFace |
| `converting` | Converting model to f16 GGUF |
| `quantizing` | Running quantization |
| `completed` | Quantization finished successfully |
| `failed` | Job failed (check message for details) |
| `stopped` | Job stopped by user |
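The status taxonomy splits naturally into in-flight and terminal states, which is what a polling client needs to decide when to stop. A minimal sketch (`wait_for` takes the fetch function as a parameter so the loop works with any HTTP client; the helper names are illustrative, not part of LocalAI):

```python
import time

# "completed", "failed", and "stopped" are terminal per the table above;
# queued/downloading/converting/quantizing mean the job is still in flight.
TERMINAL_STATUSES = {"completed", "failed", "stopped"}

def is_terminal(status):
    return status in TERMINAL_STATUSES

def wait_for(get_job, job_id, interval=2.0):
    """Poll a job until it reaches a terminal status.

    get_job is any callable returning the job's JSON dict, e.g. a wrapper
    around GET /api/quantization/jobs/:id.
    """
    while True:
        job = get_job(job_id)
        if is_terminal(job["status"]):
            return job
        time.sleep(interval)
```

For long-running jobs the SSE progress stream described below is cheaper than polling; this loop is a fallback for clients without SSE support.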

## Progress Stream

The `GET /api/quantization/jobs/:id/progress` endpoint returns Server-Sent Events (SSE) with JSON payloads:

```json
{
  "job_id": "abc-123",
  "progress_percent": 65.0,
  "status": "quantizing",
  "message": "[ 234/ 567] quantizing blk.5.attn_k.weight ...",
  "output_file": "",
  "extra_metrics": {}
}
```

When the job completes, `output_file` contains the path to the quantized GGUF file and `extra_metrics` includes `file_size_mb`.
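When consuming the stream from code rather than `curl -N`, the SSE framing must be stripped before parsing the JSON. A minimal standard-library sketch (`iter_progress_events` is an illustrative name; SSE frames each payload as a `data: <json>` line, with blank-line separators and `:`-prefixed comment lines):

```python
import json

def iter_progress_events(lines):
    """Yield progress dicts from raw SSE lines.

    Accepts str or bytes lines; skips blank keep-alive lines and
    comment lines, parsing only `data:` fields.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())
```

This can be fed directly from an HTTP response iterated line by line.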

## Quantization Types

| Type | Size | Quality | Description |
|---|---|---|---|
| `q2_k` | Smallest | Lowest | 2-bit quantization |
| `q3_k_s` | Very small | Low | 3-bit small |
| `q3_k_m` | Very small | Low | 3-bit medium |
| `q3_k_l` | Small | Low-medium | 3-bit large |
| `q4_0` | Small | Medium | 4-bit legacy |
| `q4_k_s` | Small | Medium | 4-bit small |
| `q4_k_m` | Small | Good | 4-bit medium (recommended) |
| `q5_0` | Medium | Good | 5-bit legacy |
| `q5_k_s` | Medium | Good | 5-bit small |
| `q5_k_m` | Medium | Very good | 5-bit medium |
| `q6_k` | Large | Excellent | 6-bit |
| `q8_0` | Large | Near-lossless | 8-bit |
| `f16` | Largest | Lossless | 16-bit (no quantization, GGUF conversion only) |

The UI also supports entering a custom quantization type string for any format supported by `llama-quantize`.
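A rough back-of-the-envelope way to compare these formats is bytes per weight. The bits-per-weight figures below are approximate averages (an assumption for illustration, not exact llama.cpp numbers: k-quants store scales and minimums alongside the packed weights, so the effective bit width exceeds the nominal one):

```python
# Approximate effective bits per weight (assumption, not exact values).
APPROX_BITS_PER_WEIGHT = {
    "q2_k": 2.6, "q3_k_m": 3.9, "q4_k_m": 4.8,
    "q5_k_m": 5.7, "q6_k": 6.6, "q8_0": 8.5, "f16": 16.0,
}

def estimate_size_mb(param_count, quant_type):
    """Rough quantized-file size estimate, ignoring metadata overhead."""
    bits = APPROX_BITS_PER_WEIGHT[quant_type]
    return param_count * bits / 8 / 1e6
```

For the 270M-parameter model from the Quick Start, this puts `q4_k_m` at well under a quarter of the `f16` size, which matches the Size column's ordering above.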

## Web UI

A “Quantize” page appears in the sidebar under the Tools section. The UI provides:

  1. Job Configuration — Select model, quantization type (dropdown with presets or custom input), backend, and HuggingFace token
  2. Progress Monitor — Real-time progress bar and log output via SSE
  3. Jobs List — View all quantization jobs with status, stop/delete actions
  4. Output — Download the quantized GGUF file or import it directly into LocalAI for immediate use

## Architecture

Quantization uses the same gRPC backend architecture as fine-tuning:

  1. Proto layer: `QuantizationRequest`, `QuantizationProgress` (streaming), `StopQuantization`
  2. Python backend: Downloads model, runs `convert_hf_to_gguf.py` and `llama-quantize`
  3. Go service: Manages job lifecycle, state persistence, async import
  4. REST API: HTTP endpoints with SSE progress streaming
  5. React UI: Configuration form, real-time progress monitor, download/import panel