Voice Activity Detection (VAD)

Voice Activity Detection (VAD) identifies segments of speech in audio data. LocalAI provides a /v1/vad endpoint powered by the Silero VAD backend.

API

  • Method: POST
  • Endpoints: /v1/vad, /vad

Request

The request body is JSON with the following fields:

ParameterTypeRequiredDescription
modelstringYesModel name (e.g. silero-vad)
audiofloat32[]YesArray of audio samples (16kHz PCM float)

Response

Returns a JSON object with detected speech segments:

FieldTypeDescription
segmentsarrayList of detected speech segments
segments[].startfloatStart time in seconds
segments[].endfloatEnd time in seconds

Usage

Example request

curl http://localhost:8080/v1/vad \
  -H "Content-Type: application/json" \
  -d '{
    "model": "silero-vad",
    "audio": [0.0012, -0.0045, 0.0053, -0.0021, ...]
  }'

Example response

{
  "segments": [
    {
      "start": 0.5,
      "end": 2.3
    },
    {
      "start": 3.1,
      "end": 5.8
    }
  ]
}

Model Configuration

Create a YAML configuration file for the VAD model:

name: silero-vad
backend: silero-vad

Detection Parameters

The Silero VAD backend uses the following internal defaults:

  • Sample rate: 16kHz
  • Threshold: 0.5
  • Min silence duration: 100ms
  • Speech pad duration: 30ms

Error Responses

Status CodeDescription
400Missing or invalid model or audio field
500Backend error during VAD processing