Sound Classification

Sound-event classification (audio tagging) answers the question “what am I hearing?”: given an audio clip, it returns a list of scored AudioSet labels (e.g. “Baby cry, infant cry”; “Glass breaking”; “Dog bark”; “Alarm”). Note that a single label may itself contain commas.

LocalAI exposes this through the /v1/audio/classification endpoint, modelled after /v1/audio/transcriptions. The reference backend is ced.cpp, a ggml port of CED (a 527-class AudioSet tagger): a small ViT over a log-mel spectrogram, with full PyTorch parity. The Apache-2.0 weights are redistributable as GGUF.

Because classification is exposed as a regular OpenAI-style endpoint, any HTTP client works - there is no Python dependency on the consumer side.

Endpoint

POST /v1/audio/classification
Content-Type: multipart/form-data
| Field | Type | Description |
|---|---|---|
| file | file (required) | audio file in any format ffmpeg accepts |
| model | string (required) | name of the sound-classification-capable model (e.g. ced-base) |
| top_k | int | number of top tags to return (0 = backend default) |
| threshold | float | drop tags scoring below this value |
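
To make the top_k and threshold semantics concrete, here is a minimal Python sketch of how the two parameters could shape a result list. It is an illustration of the documented behaviour, not the backend's actual code: the function name `select_tags` and the `default_k` value are hypothetical, since this page does not state what the backend default is.

```python
def select_tags(scores, labels, top_k=0, threshold=0.0, default_k=5):
    """Illustrative selection: rank by score, drop low scores, keep the top k.

    default_k is a made-up stand-in for the (undocumented here) backend default
    that applies when top_k is 0.
    """
    k = top_k if top_k > 0 else default_k
    # Rank class indices by score, highest first.
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    # Drop tags scoring below the threshold, then truncate to k entries.
    kept = [(i, s) for i, s in ranked if s >= threshold][:k]
    return [{"index": i, "label": labels[i], "score": s} for i, s in kept]


# Toy three-class example (real models such as CED have 527 classes).
toy_scores = [0.10, 0.87, 0.41]
toy_labels = ["Alarm", "Baby cry, infant cry", "Crying, sobbing"]
print(select_tags(toy_scores, toy_labels, top_k=2, threshold=0.2))
```

Because the list is sorted by score, applying the threshold before or after the top-k cut yields the same result in this sketch.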

Response

{
  "model": "ced-base",
  "detections": [
    {"index": 23, "label": "Baby cry, infant cry", "score": 0.87},
    {"index": 22, "label": "Crying, sobbing", "score": 0.41}
  ]
}

Detections are returned in score-descending order. Scores are per-class probabilities (multi-label, independent), so they do not sum to 1.
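
As a rough sketch of consuming this response, the Python snippet below parses the sample payload shown above, looks up one label of interest, and checks the two properties just described (descending order, scores not summing to 1). The helper name `score_for` is hypothetical and only the standard library is used.

```python
import json

# Sample response copied from this page.
RESPONSE = """{
  "model": "ced-base",
  "detections": [
    {"index": 23, "label": "Baby cry, infant cry", "score": 0.87},
    {"index": 22, "label": "Crying, sobbing", "score": 0.41}
  ]
}"""


def score_for(response_text, label):
    """Return the score reported for `label`, or None if it was not returned."""
    payload = json.loads(response_text)
    for detection in payload["detections"]:
        if detection["label"] == label:
            return detection["score"]
    return None


scores = [d["score"] for d in json.loads(RESPONSE)["detections"]]
is_descending = scores == sorted(scores, reverse=True)
total = sum(scores)  # independent per-class probabilities, so this can exceed 1

print(score_for(RESPONSE, "Baby cry, infant cry"), is_descending, total > 1)
```

A label missing from the list only means it was not among the returned tags (for example because of top_k or threshold), not that its score is zero.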

Example

curl http://localhost:8080/v1/audio/classification \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/clip.wav" \
  -F model="ced-base" \
  -F top_k=10
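
For clients that cannot shell out to curl, the same request can be sketched with Python's standard library alone. This is a minimal example under assumptions, not an official client: `build_multipart` and `classify` are hypothetical helper names, the base URL mirrors the curl example above, and the file part is sent as application/octet-stream.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        f"\r\n\r\n".encode() + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def classify(path, model="ced-base", top_k=10, base_url="http://localhost:8080"):
    """POST an audio file to /v1/audio/classification and return the parsed JSON."""
    with open(path, "rb") as fh:
        audio = fh.read()
    body, content_type = build_multipart(
        {"model": model, "top_k": top_k}, "file", os.path.basename(path), audio
    )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/classification",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `classify("/path/to/clip.wav")` against a running server should return a dictionary shaped like the response shown earlier; a threshold field could be added to the `fields` mapping in the same way as top_k.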

See also