Audio to Text
Audio-to-text models generate text from an audio file.
The transcription endpoint converts audio files to text. It supports multiple backends:
- whisper.cpp: A C++ library for audio transcription (default)
- moonshine: Ultra-fast transcription engine optimized for low-end devices
- faster-whisper: Fast Whisper implementation with CTranslate2
- parakeet-cpp: A C++/ggml port of NVIDIA NeMo Parakeet (FastConformer TDT/CTC/RNNT/hybrid). Runs quantized GGUFs on CPU or GPU, emits word-level timestamps, and supports cache-aware streaming (the realtime_eou model surfaces end-of-utterance events).
- llama-cpp: Route transcription to any multimodal-audio GGUF model served by the llama-cpp backend (e.g. Qwen3-ASR, Voxtral, Qwen2-Audio). Under the hood the request is converted into a chat completion with the audio attached via the model’s audio encoder, which is the same path the upstream llama.cpp server uses. Set backend: llama-cpp in the model YAML and point mmproj at the matching audio encoder.
- voxtral: Voxtral-family models served by a dedicated backend
The endpoint input supports all the audio formats supported by ffmpeg.
Looking for “who spoke when” instead of a flat transcript? See Speaker Diarization: /v1/audio/diarization returns time-stamped speaker segments and supports the rttm format used by pyannote.metrics.
Usage
Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.
For instance, with cURL:
Example
Download one of the models from here into the models folder,
and create a YAML file for your model:
The transcriptions endpoint can then be tested like so:
Result:
You can also specify the response_format parameter to be one of lrc, srt, vtt, text, json or verbose_json (default):
Result (first few lines):
Supported request parameters
In addition to file and model, the endpoint accepts the following multipart form fields, matching the OpenAI audio transcription API:
| Field | Description |
|---|---|
| language | ISO-639-1 language hint (e.g. en). Passed through to the backend. |
| prompt | Optional context hint to bias the decoder. |
| temperature | Sampling temperature (float). Honored by backends that support it. |
| timestamp_granularities[] | Multi-value form field: word and/or segment. Honored when the backend produces the requested granularity. |
| response_format | One of json (default for backwards-compat), verbose_json, text, srt, vtt, lrc. |
| stream | When true, the endpoint emits an SSE stream of transcript.text.delta events followed by a final transcript.text.done event. |
| diarize | LocalAI extension: speaker diarization (whisper.cpp only). |
The response body for verbose_json includes text, language, duration, and segments[] (with speaker populated when diarization is enabled).
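As a rough illustration of how the form fields above fit together, the following Python sketch builds the multipart body by hand with the standard library only. The base URL http://localhost:8080, the model name whisper-1, the file name, and the helper names build_multipart and transcription_request are assumptions made for this example, not values documented here.

```python
import uuid
import urllib.request


def build_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode (name, value) pairs plus one file part as multipart/form-data.

    `fields` is a list of tuples so that repeated names such as
    timestamp_granularities[] can appear more than once.
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcription_request(base_url, fields, file_name, file_bytes):
    """Return an unsent urllib Request for /v1/audio/transcriptions."""
    body, content_type = build_multipart(fields, file_name, file_bytes)
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Assumed values: adjust the URL and model name to your installation.
FIELDS = [
    ("model", "whisper-1"),
    ("language", "en"),
    ("response_format", "verbose_json"),
    ("timestamp_granularities[]", "word"),
    ("timestamp_granularities[]", "segment"),
]
request = transcription_request("http://localhost:8080", FIELDS, "audio.wav", b"RIFF")
# To send it: urllib.request.urlopen(request).read()
```

Passing the fields as a list of tuples rather than a dict keeps both timestamp_granularities[] values, since a multi-value form field repeats the same name.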
Streaming transcriptions
Set -F stream=true to receive token-by-token SSE events as the backend produces them. The event shape matches the OpenAI streaming transcription format:
Backends that do not natively stream tokens fall back to emitting one delta plus a done event with the full text — the SSE contract is identical either way.
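A client can treat both cases the same way. The sketch below is one possible way to consume the stream in Python. It assumes each SSE data line carries a JSON object with a type field, a delta field on transcript.text.delta events, and a text field on transcript.text.done, as in the OpenAI format; only the event names come from this page. The function names are hypothetical.

```python
import json


def iter_transcript_events(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines.

    Only `data:` lines are considered. Blank lines, comments, and the
    `[DONE]` sentinel that some servers send are skipped.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        yield json.loads(payload)


def collect_transcript(lines):
    """Accumulate deltas, preferring the full text of the done event."""
    pieces = []
    for event in iter_transcript_events(lines):
        kind = event.get("type")
        if kind == "transcript.text.delta":
            pieces.append(event.get("delta", ""))
        elif kind == "transcript.text.done":
            # The done event carries the complete transcript.
            return event.get("text", "".join(pieces))
    return "".join(pieces)
```

Because the done event is authoritative, the same function works whether the backend streams many deltas or falls back to a single delta followed by done.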
Using the llama-cpp backend with an audio-capable model
Any GGUF model whose mmproj contains an audio encoder can be used for transcription via the llama-cpp backend. This reuses the model’s own audio front-end rather than shelling out to whisper.cpp, which is useful when you want a single backend serving both chat-with-audio and transcription.
Example using ggml-org/Qwen3-ASR-0.6B-GGUF:
Then call /v1/audio/transcriptions as usual:
Using the parakeet-cpp backend
parakeet.cpp is a C++/ggml port of NVIDIA NeMo Parakeet that matches the upstream PyTorch models on CPU. GGUF weights for every model and quant are published in a single repo, mudler/parakeet-cpp-gguf. F16 is the recommended default, and Q4_K stays near-lossless on the small models. The easiest path is to import directly (the GGUFs auto-detect to this backend):
Or write a model YAML:
Then call /v1/audio/transcriptions as usual. Pass timestamp_granularities[]=word for per-word timings:
For real-time use, load a cache-aware streaming model (e.g. realtime_eou_120m-v1-*.gguf) and pass -F stream=true. Deltas are emitted as the audio is decoded, with end-of-utterance events closing each segment.
Segment timestamps
Transcriptions are split into segments the same way NVIDIA NeMo does: a new segment starts after sentence-ending punctuation (., ?, !), and each segment carries start/end times. This is the default (NeMo’s punctuation-only segmentation) and needs no configuration. While streaming, each end-of-utterance event closes a segment, and that segment also carries timestamps.
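To make the punctuation rule concrete, here is a minimal Python sketch of that kind of segmentation over word-level timestamps. It illustrates the rule as described, not the backend's actual code; the Word type and split_on_punctuation function are hypothetical names.

```python
from dataclasses import dataclass

SENTENCE_END = (".", "?", "!")


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def split_on_punctuation(words):
    """Group words into segments, closing one after '.', '?' or '!'."""
    segments, current = [], []
    for word in words:
        current.append(word)
        if word.text.endswith(SENTENCE_END):
            segments.append(current)
            current = []
    if current:  # trailing words without closing punctuation
        segments.append(current)
    return [
        {
            "text": " ".join(w.text for w in seg),
            "start": seg[0].start,  # start of the first word
            "end": seg[-1].end,     # end of the last word
        }
        for seg in segments
    ]
```

Each segment takes its start from its first word and its end from its last word, which is how a segment ends up with its own start/end times.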
You can additionally split on silence by setting segment_gap_threshold (NeMo’s segment_gap_threshold, in encoder frames; off by default). When set, a gap between two words wider than the threshold also starts a new segment. The value is in frames to match NeMo exactly; the backend converts it to seconds using the model’s frame stride (frame_sec, reported by the engine):
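The frame-to-seconds conversion and the gap rule can be sketched in Python as follows. This is an illustration under stated assumptions: the function name split_on_gaps is hypothetical, the frame_sec value of 0.08 in the usage lines is a placeholder rather than a documented figure, and only the silence rule is shown (punctuation splitting is left out for brevity).

```python
def split_on_gaps(words, segment_gap_threshold, frame_sec):
    """Split (text, start, end) words where the silence exceeds the threshold.

    segment_gap_threshold is expressed in encoder frames; multiplying by
    frame_sec (seconds per frame) gives the maximum allowed gap in seconds.
    """
    max_gap = segment_gap_threshold * frame_sec
    segments = []
    for word in words:
        _text, start, _end = word
        if segments:
            previous_end = segments[-1][-1][2]
            if start - previous_end <= max_gap:
                segments[-1].append(word)  # gap small enough: same segment
                continue
        segments.append([word])  # first word, or gap wider than threshold
    return segments


# Placeholder numbers: 10 frames at an assumed 0.08 s per frame = 0.8 s.
words = [("a", 0.0, 0.3), ("b", 0.5, 0.8), ("c", 2.0, 2.3)]
segments = split_on_gaps(words, segment_gap_threshold=10, frame_sec=0.08)
```

With these placeholder values the 0.2 s pause between the first two words keeps them together, while the 1.2 s pause before the third word starts a new segment.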
Dynamic batching
The backend can coalesce concurrent transcription requests into a single batched engine call, which improves throughput on GPU when many requests arrive at once. Batching is off by default (batch_max_size: 1, one request at a time); raise it to opt in. Two options control it:
By default each request runs on its own. Raise batch_max_size (for example 4 to 16) to enable batching; it pays off on GPU under concurrent load, where coalescing the per-step decode GEMMs across requests is a large throughput win. Leave it at 1 on CPU and for low-concurrency setups, where batching only adds latency. Batching only affects concurrent unary requests; streaming sessions always run on their own.
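The effect of batch_max_size can be pictured with a small Python sketch. It is a conceptual illustration only and does not reflect how the backend actually schedules work; the coalesce function and the request labels are hypothetical.

```python
def coalesce(pending, batch_max_size):
    """Group queued requests into batches of at most batch_max_size."""
    if batch_max_size < 1:
        raise ValueError("batch_max_size must be at least 1")
    return [
        pending[i:i + batch_max_size]
        for i in range(0, len(pending), batch_max_size)
    ]


queued = ["req-1", "req-2", "req-3", "req-4", "req-5"]
one_by_one = coalesce(queued, 1)  # default: every request is its own call
batched = coalesce(queued, 4)     # up to four requests share one call
```

With batch_max_size set to 1, five queued requests lead to five engine calls; with 4, the same five requests need only two.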
See also
- Audio Transform — clean up the audio (echo cancellation, noise suppression, dereverberation) before passing it to a transcription model.