
Amazon's new AI voice model can talk like a human
What's the story
Amazon has unveiled its latest generative AI model, Nova Sonic. The model takes voice input and generates natural-sounding speech.
According to Amazon, Nova Sonic competes with top-tier voice models from OpenAI and Google in terms of speed, speech recognition, and conversational quality.
The new model can be accessed via Bedrock, Amazon's developer platform for building enterprise AI applications.
Advanced technology
It outshines legacy AI voice models
Nova Sonic marks a major leap over older AI voice models such as those powering Amazon Alexa and Apple's Siri. Rohit Prasad, Amazon's Senior Vice President and Head Scientist of AGI, said that elements of Nova Sonic are in use in Alexa+, Amazon's enhanced digital voice assistant.
Technical prowess
How does Nova Sonic work?
Prasad explained that Nova Sonic builds on Amazon's expertise in "large orchestration systems," the technical foundation of Alexa.
He emphasized that, unlike other AI voice models, Nova Sonic excels at routing user requests to different APIs.
This way, it can determine when it needs to fetch real-time information from the internet, query a proprietary data source, or take action in an external application, and then use the appropriate tool for each task.
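The routing idea can be illustrated with a minimal Python sketch. It is not Amazon's implementation: the tool names, the keyword rules, and the functions `choose_tool` and `route` are all hypothetical, and the keyword matching merely stands in for the decision that the model itself would make.

```python
from typing import Callable, Dict


# Hypothetical tools. None of these names come from Amazon's API.
def search_web(request: str) -> str:
    return f"[web] results for: {request}"


def query_internal_data(request: str) -> str:
    return f"[internal] records matching: {request}"


def act_in_app(request: str) -> str:
    return f"[action] executed: {request}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": search_web,
    "internal_data": query_internal_data,
    "app_action": act_in_app,
}


def choose_tool(request: str) -> str:
    """Pick a tool name using toy keyword rules (an assumption for illustration)."""
    text = request.lower()
    if any(word in text for word in ("book", "schedule", "order")):
        return "app_action"
    if any(word in text for word in ("internal", "account", "invoice")):
        return "internal_data"
    # Anything else is treated as a request for fresh public information.
    return "web_search"


def route(request: str) -> str:
    """Dispatch the request to whichever tool choose_tool selected."""
    return TOOLS[choose_tool(request)](request)


if __name__ == "__main__":
    for request in (
        "What is the weather today?",
        "Look up my account balance",
        "Book a table for two",
    ):
        print(route(request))
```

In a real system the selection step would be made by the model rather than by keyword rules; the sketch only shows the shape of the dispatch.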
Benchmarking results
Performance in real-world scenarios
Amazon claims Nova Sonic is less susceptible to speech recognition problems than other AI voice models.
This means it can comprehend a user's intent even when they mumble, misspeak, or speak in a noisy environment.
In the Multilingual LibriSpeech benchmark measuring speech recognition across languages and dialects, Nova Sonic had a word error rate (WER) of only 4.2% when averaged across English, French, Italian, German, and Spanish.
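For readers unfamiliar with the metric, WER is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The Python sketch below computes it with a standard edit-distance table. The function name `word_error_rate` and the sample sentences are illustrative assumptions, and real benchmark scoring usually normalizes case and punctuation first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one word dropped and one word substituted.
    wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
    print(f"WER = {wer:.1%}")  # prints "WER = 40.0%"
```

On this scale, a 4.2% WER corresponds to roughly four word errors per hundred reference words.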
Comparison
Nova Sonic outperforms competitors in speed
In another benchmark that measured noisy interactions with multiple participants (Augmented Multi-Party Interaction), Amazon claims that Nova Sonic was 46.7% more accurate in terms of WER than OpenAI's GPT-4o-transcribe model.
Amazon also claims industry-leading speed for the new voice model, with an average perceived latency of 1.09 seconds.
This makes it faster than the GPT-4o model powering OpenAI's Realtime API, which responds in 1.18 seconds according to benchmarks by Artificial Analysis (an independent analyst of AI models and providers).
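To put the two latency figures in proportion, the short Python calculation below works out the gap from the numbers quoted above. The variable names are illustrative, and the comparison assumes the two figures were measured under comparable conditions, which the reported benchmarks do not confirm.

```python
# Latency figures as quoted in the article, in seconds.
nova_sonic_latency_s = 1.09
gpt4o_realtime_latency_s = 1.18

# Absolute gap, and the gap relative to the slower model.
absolute_gap_s = gpt4o_realtime_latency_s - nova_sonic_latency_s
relative_gap = absolute_gap_s / gpt4o_realtime_latency_s

print(f"{absolute_gap_s * 1000:.0f} ms faster, {relative_gap:.1%} lower latency")
# prints "90 ms faster, 7.6% lower latency"
```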
AGI strategy
Amazon's future AI plans
Prasad revealed Nova Sonic is part of Amazon's broader strategy to build artificial general intelligence (AGI) - "AI systems that can do anything a human can do on a computer."
He said Amazon wants to release more AI models capable of understanding different modalities, including image, video, and voice data.