🗣️ Text to audio (TTS)
API Compatibility
The LocalAI TTS API is compatible with the OpenAI TTS API and the Elevenlabs API.
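Clients that speak the OpenAI TTS API can therefore be pointed at LocalAI. As a minimal sketch (assuming the OpenAI-compatible /v1/audio/speech route is exposed by your LocalAI version; the model name matches the examples below):
curl http://localhost:8080/v1/audio/speech -H "Content-Type: application/json" -d '{
"model": "tts",
"input": "Hello world"
}' --output speech.wav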
LocalAI API
The /tts endpoint can also be used to generate speech from text.
Usage
Input: input, model
For example, to generate an audio file, you can send a POST request to the /tts endpoint with the instruction as the request body:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"input": "Hello world",
"model": "tts"
}'
Returns an audio/wav file.
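Since the response body is the audio itself, you can save it directly to disk with curl's --output flag, for example:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"input": "Hello world",
"model": "tts"
}' --output hello.wav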
Backends
🐸 Coqui
Required: Don’t use LocalAI images ending with the -core tag. Python dependencies are required in order to use this backend.
Coqui works without any configuration; to test it, you can run the following curl command:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "coqui",
"model": "tts_models/en/ljspeech/glow-tts",
"input":"Hello, this is a test!"
}'
You can use the env variable COQUI_LANGUAGE to set the language used by the coqui backend.
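For example, a minimal sketch of setting the language when starting the container (the image name and tag here are only illustrative; use the non-core image you normally run):
docker run -p 8080:8080 -e COQUI_LANGUAGE=en localai/localai:latest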
You can also use config files to configure tts models (see section below on how to use config files).
Bark
Bark allows you to generate audio from text prompts.
This is an extra backend: it is already available in the container and there is nothing to do for the setup.
Model setup
There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.
Usage
Use the tts endpoint by specifying the bark backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!"
}' | aplay
To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the model parameter:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!",
"model": "v2/en_speaker_4"
}' | aplay
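If you prefer to pin a Bark voice preset behind a friendly model name, a config file (following the config-file mechanism described later in this page; the names here are only a sketch) could look like:
name: bark-en-speaker-4
backend: bark
parameters:
  model: v2/en_speaker_4
You could then request it with "model": "bark-en-speaker-4" in the /tts call.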
Piper
To install the piper audio models manually:
- Download Voices from https://github.com/rhasspy/piper/releases/tag/v0.0.2
- Extract the .tar.tgz files (.onnx, .json) inside the models directory (see the download sketch after this list)
- Run the following command to test the model is working
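A minimal download sketch (the voice archive name below is only an example; check the releases page for the exact asset names):
curl -L https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-it-riccardo_fasol-x-low.tar.gz | tar xz -C models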
To use the tts endpoint, run the following command. You can specify a backend with the backend parameter. For example, to use the piper backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"model":"it-riccardo_fasol-x-low.onnx",
"backend": "piper",
"input": "Ciao, sono Ettore"
}' | aplay
Note:
- aplay is a Linux command. You can use other tools to play the audio file.
- The model name is the filename with the extension.
- The model name is case sensitive.
- LocalAI must be compiled with the GO_TAGS=tts flag.
Transformers-musicgen
LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
curl --request POST \
--url http://localhost:8080/tts \
--header 'Content-Type: application/json' \
--data '{
"backend": "transformers-musicgen",
"model": "facebook/musicgen-medium",
"input": "Cello Rave"
}' | aplay
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
Vall-E-X
VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.
Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend: it is already available in the container and there is nothing to do for the setup. If you are building LocalAI manually, you need to install Vall-E-X first.
Usage
Use the tts endpoint by specifying the vall-e-x backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"input":"Hello!"
}' | aplay
Voice cloning
In order to use voice cloning capabilities you must create a YAML configuration file to set up a model:
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
  vall-e:
    # The path to the audio file to be cloned
    # relative to the models directory
    # Max 15s
    audio_path: "audio-sample.wav"
Then you can specify the model name in the requests:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"model": "cloned-voice",
"input":"Hello!"
}' | aplay
Parler-tts
LocalAI also supports the parler-tts backend. It is possible to install and configure the model directly from the gallery. See https://github.com/huggingface/parler-tts for more details.
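A sketch of a gallery install via the models-apply endpoint (assuming the /models/apply route is available; the gallery id shown is only illustrative, check the gallery for the exact id):
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
"id": "parler-tts-mini-v0.1"
}'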
Using config files
You can also use a config-file to specify TTS models and their parameters.
In the following example we define a custom config to load the xtts_v2 model, and specify a voice and language.
name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2
tts:
  voice: Ana Florence
With this config, you can now use the following curl command to generate a text-to-speech audio file:
curl -L http://localhost:8080/tts \
-H "Content-Type: application/json" \
-d '{
"model": "xtts_v2",
"input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
}' | aplay
Response format
To provide some compatibility with the OpenAI API regarding response_format, ffmpeg must be installed (or a Docker image that includes ffmpeg must be used) so the generated wav file can be converted before the API returns its response.
Warning regarding a change in behaviour: before this addition, the parameter was ignored and a wav file was always returned, which could cause codec errors later in the integration (for example, trying to decode a wav file as mp3, the default format used by OpenAI).
Supported formats (thanks to ffmpeg) are wav, mp3, aac, flac, and opus, defaulting to wav if an unknown format or no format is provided.
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"input": "Hello world",
"model": "tts",
"response_format": "mp3"
}'
If a response_format other than wav is specified in the request and ffmpeg is not available, the call will fail.
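To check whether ffmpeg is available on the host or inside the container, you can run:
ffmpeg -version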