Microsoft's AI tool creates 'deepfake voices' so real they're banned
Microsoft has developed an AI speech generator, VALL-E 2, that mimics human voices so convincingly that it is not being released to the public. According to a paper published on arXiv, the text-to-speech (TTS) generator can reproduce a person's voice from just a few seconds of audio. The researchers describe VALL-E 2 as "the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time."
Key reasons behind VALL-E 2's performance
VALL-E 2's high-quality speech synthesis is attributed to two key features: "Repetition Aware Sampling" and "Grouped Code Modeling." The former improves the AI's conversion of text into speech by curbing repetitions of language units, which prevents infinite loops of sounds and phrases. The latter improves efficiency by shortening the sequence the model has to process, which speeds up speech generation and eases the difficulties of handling long strings of sounds.
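To make the two ideas concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Microsoft's implementation: the function names, the nucleus-sampling first step, and the window, threshold and group-size values are all hypothetical choices made for this example. The first part draws a token from the most likely candidates and, if that token already dominates the recent history, redraws from the full distribution to break the loop. The second part packs consecutive codes into groups so the sequence has fewer steps.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest set whose probability mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng, top_p=0.8, window=10, threshold=0.5):
    """Pick the next token; fall back to plain random sampling if it repeats too often.

    window and threshold are illustrative values, not figures from the paper.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # The candidate dominates recent output: redraw from the full distribution.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Pack consecutive codes into groups so the model sees a shorter sequence."""
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.05, 0.05]   # toy distribution over three tokens
    stuck = [0] * 10            # history in which token 0 keeps repeating
    print(repetition_aware_sample(probs, stuck, rng))
    print(len(group_codes(list(range(8)), 2)))  # 8 codes become 4 steps
```

In this toy setup the first draw alone would always return token 0, because it holds 90% of the probability; the fallback is what gives the other tokens a chance once token 0 has taken over the recent history. Grouping by two halves the number of steps, which is the sense in which a shorter sequence speeds up generation.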
VALL-E 2 surpasses previous AI systems in speech synthesis
Researchers used audio samples from the LibriSpeech and VCTK speech libraries, along with ELLA-V, an evaluation framework, to assess VALL-E 2's performance. They concluded that "VALL-E 2 surpasses previous zero-shot TTS systems in speech robustness, naturalness, and speaker similarity," making it the first to reach human parity on these benchmarks. However, the quality of VALL-E 2's output depends on the length and quality of the speech prompts, as well as on environmental factors such as background noise.
Microsoft withholds VALL-E 2 over misuse concerns
Despite its capabilities, Microsoft has decided not to roll out VALL-E 2 to the public due to potential misuse risks. This decision echoes rising concerns around voice cloning and deepfake technology. The researchers stated in a blog post that "VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public."
Possible applications for AI speech tech
The researchers suggested potential applications for AI speech technology like VALL-E 2 in education, entertainment, journalism, accessibility features, translation, interactive voice response systems, and chatbots. They stated that if the model is generalized to unseen speakers in the real world, it should include both a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.