Voice Activity Detection (VAD)
Voice Activity Detection (VAD) identifies segments of speech in audio data. LocalAI provides a /v1/vad endpoint powered by the Silero VAD backend.
API
- Method: POST
- Endpoints: /v1/vad, /vad
Request
The request body is JSON with the following fields:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (e.g. silero-vad) |
| audio | float32[] | Yes | Array of audio samples (16kHz PCM float) |
Response
Returns a JSON object with detected speech segments:
| Field | Type | Description |
|---|---|---|
| segments | array | List of detected speech segments |
| segments[].start | float | Start time in seconds |
| segments[].end | float | End time in seconds |
Usage
Example request
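A minimal Python sketch of a request, using only the standard library. The base URL `http://localhost:8080` and the helper names `build_vad_request` and `detect_speech` are assumptions made for illustration; the `model` and `audio` fields follow the request table above.

```python
import json
import urllib.request

# Assumption: LocalAI is reachable at this address; adjust for your deployment.
BASE_URL = "http://localhost:8080"


def build_vad_request(samples, model="silero-vad", base_url=BASE_URL):
    """Build a POST request for /v1/vad from 16kHz float PCM samples."""
    body = json.dumps({"model": model, "audio": [float(s) for s in samples]})
    return urllib.request.Request(
        url=f"{base_url}/v1/vad",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect_speech(samples, model="silero-vad", base_url=BASE_URL):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_vad_request(samples, model=model, base_url=base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `detect_speech(samples)` with a list of floats sends the audio to the server. Building the request in a separate function makes it possible to inspect the payload without a running instance.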
Example response
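The sketch below shows one way to read a response in Python. The payload in `SAMPLE_RESPONSE` is illustrative only and does not come from a real run; the field names `segments`, `start`, and `end` follow the response table above, and the helper names are hypothetical.

```python
import json

# Illustrative payload only; real timestamps depend on the audio submitted.
SAMPLE_RESPONSE = (
    '{"segments": [{"start": 0.5, "end": 1.75}, {"start": 2.0, "end": 3.25}]}'
)


def speech_intervals(response_text):
    """Return (start, end) tuples in seconds from a /v1/vad response body."""
    data = json.loads(response_text)
    # Treat a missing or null "segments" field as "no speech detected".
    segments = data.get("segments") or []
    return [(float(s["start"]), float(s["end"])) for s in segments]


def total_speech_seconds(intervals):
    """Sum the durations of all detected segments."""
    return sum(end - start for start, end in intervals)
```

For the sample payload, `speech_intervals(SAMPLE_RESPONSE)` yields two intervals of 1.25 seconds each.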
Model Configuration
Create a YAML configuration file for the VAD model:
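As a rough sketch, a minimal configuration could look like the YAML embedded in the Python snippet below, which also writes it to disk. The keys `name`, `backend`, and `parameters.model` follow LocalAI's general model configuration layout, but the backend identifier, the model file name `silero-vad.onnx`, and the output file name are assumptions to verify against your installation.

```python
from pathlib import Path

# Assumed values: the backend identifier and the model file name are
# placeholders and should match what is installed in your LocalAI instance.
VAD_CONFIG = """\
name: silero-vad
backend: silero-vad
parameters:
  model: silero-vad.onnx
"""


def write_vad_config(models_dir):
    """Write the sketch configuration into the given models directory."""
    path = Path(models_dir) / "silero-vad.yaml"
    path.write_text(VAD_CONFIG, encoding="utf-8")
    return path
```

The `name` value is what a client would pass in the `model` field of the request.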
Detection Parameters
The Silero VAD backend uses the following internal defaults:
- Sample rate: 16kHz
- Threshold: 0.5
- Min silence duration: 100ms
- Speech pad duration: 30ms
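To make these defaults concrete, the following sketch converts the durations into sample counts at 16kHz and maps a returned segment back onto the input array. It only performs unit conversion and does not reproduce the backend's detection logic; the constant and function names are hypothetical.

```python
# Defaults as listed above; names are illustrative, not backend identifiers.
SAMPLE_RATE_HZ = 16_000
THRESHOLD = 0.5
MIN_SILENCE_MS = 100
SPEECH_PAD_MS = 30


def ms_to_samples(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return duration_ms * sample_rate_hz // 1000


def seconds_to_sample_range(start_s, end_s, total_samples,
                            sample_rate_hz=SAMPLE_RATE_HZ):
    """Map a segment's start/end times to indices clamped to the input length."""
    start = max(0, int(round(start_s * sample_rate_hz)))
    end = min(total_samples, int(round(end_s * sample_rate_hz)))
    return start, end
```

At 16kHz, the 100ms minimum silence corresponds to 1600 samples and the 30ms speech pad to 480 samples.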
Error Responses
| Status Code | Description |
|---|---|
| 400 | Missing or invalid model or audio field |
| 500 | Backend error during VAD processing |
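A client can map these status codes to actionable messages. The sketch below assumes the request is made with Python's `urllib`, which raises `urllib.error.HTTPError` for 4xx and 5xx responses; the helper names are hypothetical and the message wording is illustrative.

```python
import urllib.error


def describe_vad_error(status_code):
    """Translate a documented status code into a short diagnostic message."""
    if status_code == 400:
        return "bad request: check that 'model' and 'audio' are present and valid"
    if status_code == 500:
        return "backend error: VAD processing failed on the server"
    return f"unexpected HTTP status {status_code}"


def call_with_error_handling(send):
    """Run a zero-argument callable that performs the request.

    Returns (result, None) on success or (None, message) on an HTTP error.
    """
    try:
        return send(), None
    except urllib.error.HTTPError as err:
        return None, describe_vad_error(err.code)
```

Here `send` is any callable that performs the HTTP request and returns the decoded response.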