Meta's Voicebox can generate high-quality sound clips and edit pre-recorded audio while preserving the style. It is a multilingual AI model, capable of producing speech in six different languages.
Scenario
Voicebox can either generate output from scratch or modify a sample it is given. It can help with speech synthesis, audio editing, noise removal, diverse sample generation, and style conversion.
Approach
Voicebox employs a novel approach to learning that relies solely on raw audio and transcription. It is based on a technique known as Flow Matching, which has been shown to outperform diffusion models.
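To make the idea of Flow Matching more concrete, here is a minimal NumPy sketch of the training target it uses: a model is asked to predict the velocity that moves a noise sample toward a data sample along a straight-line path. This is an illustration only, not Meta's implementation. The function names, the toy array shapes, and the `sigma_min` default are assumptions made for the example, and the "audio features" are random numbers standing in for real data.

```python
import numpy as np

def ot_path_sample(x0, x1, t, sigma_min=1e-5):
    """Return (x_t, target_velocity) on the straight-line path from noise x0 to data x1.

    Hypothetical helper for illustration; sigma_min is an assumed small constant.
    """
    t = np.asarray(t, dtype=float).reshape(-1, 1)  # one time value per example
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0           # velocity the model should predict
    return x_t, target

def flow_matching_loss(vector_field, x0, x1, t, sigma_min=1e-5):
    """Mean squared error between the predicted and target velocities."""
    x_t, target = ot_path_sample(x0, x1, t, sigma_min)
    pred = vector_field(x_t, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))   # stand-in for audio features (toy shape, assumed)
x0 = rng.normal(size=(4, 8))   # Gaussian noise
t = rng.uniform(size=4)        # random times in [0, 1)

# An untrained "model" that always predicts zero velocity has a positive loss.
zero_field = lambda x_t, t: np.zeros_like(x_t)
loss_zero = flow_matching_loss(zero_field, x0, x1, t)

# An oracle that knows the true velocity has (numerically) zero loss.
oracle_field = lambda x_t, t: x1 - (1.0 - 1e-5) * x0
loss_oracle = flow_matching_loss(oracle_field, x0, x1, t)
```

In a real system the vector field would be a neural network conditioned on the transcript and surrounding audio, and the loss would be minimized with gradient descent; the sketch only shows what that network is trained to match.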
Training
According to Meta, Voicebox is trained on more than 50,000 hours of recorded speech and transcripts from public-domain audiobooks in English, French, Spanish, German, Polish, and Portuguese.
Applications
Voicebox can take an audio sample and replicate its style for text-to-speech generation. It can also restore a section of speech interrupted by noise or replace mispronounced words.
Availability
Despite its many intriguing applications, neither the Voicebox model nor its code is publicly available at the moment, due to the potential risks of misuse.