Sound Classification
Sound-event classification (audio tagging) answers the question “what am I hearing?”: given an audio clip, it returns a list of scored AudioSet labels (e.g. “Baby cry, infant cry”, “Glass breaking”, “Dog bark”, “Alarm”).
LocalAI exposes this through the /v1/audio/classification endpoint, modelled after /v1/audio/transcriptions. The reference backend is ced.cpp, a ggml port of CED: a 527-class AudioSet tagger built as a small ViT over a log-mel spectrogram, with full parity against the PyTorch implementation. The weights are Apache-2.0 licensed and redistributable as GGUF.
Because classification is exposed as a regular OpenAI-style endpoint, any HTTP client works; there is no Python dependency on the consumer side.
Endpoint
| Field | Type | Description |
|---|---|---|
| file | file (required) | audio file in any format ffmpeg accepts |
| model | string (required) | name of the sound-classification-capable model (e.g. ced-base) |
| top_k | int | number of top tags to return (0 = backend default) |
| threshold | float | drop tags scoring below this value |
Response
Detections are returned in score-descending order. Scores are per-class probabilities (multi-label, independent), so they do not sum to 1.
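The ordering and the independence of the scores can be illustrated with a small client-side sketch. The `(label, score)` pairs, the `select_tags` helper, the made-up scores, and the order in which the threshold and top-k are applied are all assumptions for illustration; none of them is taken from the ced.cpp backend or from the actual response schema.

```python
from typing import List, Tuple

Tag = Tuple[str, float]


def select_tags(tags: List[Tag], top_k: int = 0, threshold: float = 0.0) -> List[Tag]:
    """Drop low scores, sort by score (descending), keep at most top_k.

    Hypothetical helper: top_k == 0 means "no limit" here, whereas the
    real endpoint falls back to a backend-defined default.
    """
    kept = [(label, score) for label, score in tags if score >= threshold]
    kept.sort(key=lambda t: t[1], reverse=True)
    return kept[:top_k] if top_k > 0 else kept


# Made-up scores: each one is an independent per-class probability.
example = [("Dog bark", 0.91), ("Alarm", 0.07), ("Glass breaking", 0.64)]
selected = select_tags(example, top_k=2, threshold=0.1)
total = sum(score for _, score in example)
```

Here `selected` keeps the two strongest tags above the threshold, and `total` exceeds 1, which is expected for multi-label scores: several sounds can be present in the same clip at once.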
Example
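A minimal sketch of a client, using only the Python standard library to show that no SDK is needed. The base URL, the helper names `build_request` and `classify`, and the assumption that the response body is JSON are illustrative; the form fields (file, model, top_k, threshold) are the ones listed in the table above.

```python
import json
import os
import urllib.request
import uuid

# Assumption: LocalAI is reachable at this address; adjust to your deployment.
BASE_URL = "http://localhost:8080"


def build_request(audio: bytes, filename: str, model: str,
                  top_k: int = 0, threshold: float = 0.0) -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/classification."""
    boundary = uuid.uuid4().hex
    fields = {"model": model, "top_k": str(top_k), "threshold": str(threshold)}
    parts = []
    # Plain text form fields.
    for name, value in fields.items():
        parts.append(
            (f'--{boundary}\r\nContent-Disposition: form-data; '
             f'name="{name}"\r\n\r\n{value}\r\n').encode()
        )
    # The audio file itself, sent as raw bytes.
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode()
        + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/classification",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def classify(path: str, model: str = "ced-base", **kwargs) -> dict:
    """Send the clip and return the decoded response (assumed to be JSON)."""
    with open(path, "rb") as f:
        req = build_request(f.read(), os.path.basename(path), model, **kwargs)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running and a model named ced-base installed, a call such as `classify("clip.wav", top_k=5, threshold=0.2)` would send the clip and return the parsed response; `clip.wav` is a placeholder file name.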
See also
- Audio to Text - speech transcription
- Speaker Diarization - who spoke when