Microsoft's VALL-E can imitate any voice in just 3 seconds
Three seconds: that's all it takes for Microsoft's newly developed text-to-speech AI model to mimic a person's voice. Dubbed VALL-E, it can generate audio of a person saying anything once it has learned their voice from a short sample. The model's ability to mimic voices has taken many by surprise. It is trained on over 60,000 hours of English speech, far more than comparable text-to-speech models.
Why does this story matter?
We're at the beginning of the golden age of AI. ChatGPT, DALL-E, Stable Diffusion, and now VALL-E. These AI models have been stunning us with their abilities, and they are only going to get better. The pertinent question is: how long can we play with them before they threaten our livelihoods?
It can replicate the emotion and tone of the original speaker
What makes VALL-E stand out from other AI tools is its ability to replicate a speaker's emotion and tone, and it can do so even when generating recordings of words the original speaker never said. In addition to imitating the speaker's vocal timbre and emotional tone, VALL-E can also replicate the acoustic environment of the sample recording.
How does VALL-E work?
Researchers at Microsoft describe VALL-E as a "neural codec language model." It is built on EnCodec, an AI-powered audio compression codec from Meta. Whereas conventional text-to-speech tools synthesize speech by manipulating waveforms directly, VALL-E analyzes how a person sounds and, using EnCodec, breaks that information into discrete components called "tokens." Drawing on what it learned during training, it then predicts the acoustic tokens for the new text and decodes them into the final waveform.
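To make the tokenization step concrete, here is a minimal sketch of how a short voice sample can be turned into discrete acoustic tokens with Meta's open-source EnCodec package (the encodec library). The file name and bandwidth setting are illustrative assumptions, not details from Microsoft's paper.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load Meta's pretrained 24 kHz EnCodec model and pick a target bandwidth.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; illustrative choice

# "speaker_prompt.wav" is a placeholder for a short (e.g. 3-second) voice sample.
wav, sr = torchaudio.load("speaker_prompt.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Encode the waveform into discrete acoustic tokens (codebook indices).
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
acoustic_tokens = torch.cat([codes for codes, _ in encoded_frames], dim=-1)

# acoustic_tokens has shape [batch, n_codebooks, time]; these integer codes
# are the kind of discrete representation a model like VALL-E operates on.
print(acoustic_tokens.shape)
```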
VALL-E uses 'acoustic tokens' to synthesize personalized speech
In the paper titled "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers," Microsoft describes how VALL-E synthesizes personalized speech: "VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform."
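Read together with the tokenization sketch above, the quoted description suggests the following generation flow. This is a hypothetical illustration only: text_to_phonemes, vall_e_generate, and codec_decoder are assumed placeholder names, since Microsoft has not released the actual model or its API.

```python
# Hypothetical sketch of the pipeline described in the VALL-E paper.
# None of these functions come from a released Microsoft codebase.

def synthesize(text, enrolled_tokens, codec_decoder):
    """Generate personalized speech from text plus a 3-second voice prompt."""
    # 1. Content conditioning: convert the input text into a phoneme prompt.
    phoneme_prompt = text_to_phonemes(text)          # assumed helper

    # 2. Speaker conditioning: the acoustic tokens of the 3-second enrolled
    #    recording (e.g. produced by EnCodec, as in the sketch above).
    # 3. The language model predicts new acoustic tokens conditioned on both
    #    the phoneme prompt (content) and the acoustic prompt (speaker).
    generated_tokens = vall_e_generate(               # assumed model call
        phoneme_prompt=phoneme_prompt,
        acoustic_prompt=enrolled_tokens,
    )

    # 4. Decode the generated acoustic tokens back into the final waveform.
    return codec_decoder(generated_tokens)
```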
Quality of synthesized speech depends on how closely the sample matches training data
VALL-E is trained on 60,000 hours of English speech from over 7,000 speakers. Microsoft trained the AI's speech synthesis capabilities on Libri-light, an audio library assembled by Meta. The quality of audio generated by VALL-E depends on how closely the three-second sample matches the training data. The samples shared by Microsoft vary in quality. While some sound convincing, others sound quite robotic.
There are multiple security concerns associated with VALL-E
Microsoft has not made VALL-E's code available for others to experiment with, mainly because of the potential harm it could cause. Imagine getting spam calls in the voice of someone you know. The model could also be used to create scandalous fake recordings of politicians or other public figures. Security concerns related to VALL-E extend to anything that can be activated or authenticated by voice.