How Apple's new AI project could supercharge Siri
Apple is making significant progress in artificial intelligence (AI) and machine learning with its latest project, Ferret-UI. This multimodal large language model (MLLM) could transform how Siri understands iOS apps: by parsing the layout of what is shown on an iPhone's display, it could meaningfully extend the capabilities of Apple's virtual assistant. A research paper describing the model, "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs," was recently published on arXiv.
A collaboration between Apple and Columbia University
Ferret-UI builds on Ferret, an open-source MLLM that Apple researchers developed with collaborators at Columbia University and released in October 2023. The original model was designed to identify and interpret specific regions of an image in response to detailed queries, such as recognizing the species of an animal within a selected area of a photograph.
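To make that idea concrete, here is a minimal, hypothetical sketch of a region-referring query of the kind Ferret handles. The coordinate convention and prompt format are assumptions for illustration, not Ferret's actual interface.

```python
# Hypothetical sketch of a region-referring query (not Ferret's real API):
# the question is tied to a specific box within the image, so the model
# answers about that area rather than the whole photograph.
from dataclasses import dataclass


@dataclass
class Region:
    # Normalized [0, 1] box coordinates; this convention is assumed here.
    x0: float
    y0: float
    x1: float
    y1: float


def referring_prompt(question: str, region: Region) -> str:
    """Embed the region of interest in the text prompt sent with the image."""
    box = f"[{region.x0:.2f}, {region.y0:.2f}, {region.x1:.2f}, {region.y1:.2f}]"
    return f"{question} Answer only for the region {box} of the image."


# e.g. "What species is the animal in the lower-left quarter of the photo?"
prompt = referring_prompt(
    "What species is the animal shown here?",
    Region(x0=0.0, y0=0.5, x1=0.5, y1=1.0),
)
print(prompt)  # This text, plus the image itself, would be passed to the MLLM.
```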
A unique approach to overcoming MLLM challenges
Despite noteworthy advancements elsewhere, existing MLLMs often struggle to comprehend and interact with user interface (UI) screens, whose compact elements and elongated portrait orientation differ markedly from natural images. Ferret-UI tackles this with a magnification technique, referred to as "any resolution," that renders screen content at higher resolution so icons and text stay readable. Rather than analyzing only a single low-resolution view of the whole screen, as other MLLMs do, Ferret-UI splits the display into two smaller sub-images that are processed separately during training and inference.
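As a rough illustration of that splitting step, the sketch below divides a portrait screenshot into a low-resolution global view plus two higher-resolution sub-images. The encoder input size, the top/bottom split, and the file name are assumptions for illustration, not the paper's exact configuration.

```python
# Rough sketch of the split-and-magnify idea (sizes and split are assumed).
from PIL import Image

ENCODER_SIZE = (336, 336)  # assumed input resolution of the image encoder


def prepare_views(screenshot_path: str):
    img = Image.open(screenshot_path)
    w, h = img.size

    # A single low-resolution view of the whole screen keeps global context,
    # but small icons and text become hard to read at this size.
    global_view = img.resize(ENCODER_SIZE)

    # For a portrait screen, split into top and bottom halves. Each half is
    # then encoded at the full input resolution, so fine detail is preserved.
    top = img.crop((0, 0, w, h // 2)).resize(ENCODER_SIZE)
    bottom = img.crop((0, h // 2, w, h)).resize(ENCODER_SIZE)

    # The global view and both sub-images would be encoded and fed to the
    # language model together.
    return global_view, [top, bottom]


views = prepare_views("home_screen.png")
```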
Potential impact on Siri and user experience
Integrating Ferret-UI into Siri could give users more control over their devices. Because the model understands user interface elements, Siri could independently select graphical elements within apps and perform actions on a user's behalf. The model also holds promise for visually impaired users: it can describe in detail what is on screen and execute actions in response to their commands.
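How that hand-off from understanding the screen to acting on it might look is sketched below. The element format and matching logic are purely hypothetical, not Apple's or Ferret-UI's actual output.

```python
# Hypothetical sketch: turning grounded UI output (labels plus boxes) into a
# tap target an assistant could act on. The data format is assumed here.
from dataclasses import dataclass
from typing import Optional


@dataclass
class UIElement:
    label: str                      # e.g. "Settings icon", "Search field"
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) in screen pixels


def tap_point_for(request: str, elements: list[UIElement]) -> Optional[tuple[float, float]]:
    """Return the center of the first element whose label matches the request."""
    for element in elements:
        if request.lower() in element.label.lower():
            x0, y0, x1, y1 = element.box
            return ((x0 + x1) / 2, (y0 + y1) / 2)
    return None  # nothing matched; the assistant would ask for clarification


detected = [
    UIElement("Search field", (40, 120, 1130, 200)),
    UIElement("Settings icon", (980, 60, 1060, 140)),
]
print(tap_point_for("settings", detected))  # -> (1020.0, 100.0)
```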
Impressive performance in app interface interaction
Through rigorous training, Ferret-UI has shown remarkable results in understanding and interacting with app interfaces. If integrated into Siri, it could enable the digital assistant to carry out complex, multi-step tasks within apps: a user could ask Siri to book a flight or make a reservation, and Siri would interact with the relevant app to complete the task.
Ferret-UI outperforms OpenAI's MLLM in public benchmarks
In public benchmarks and advanced tasks, Ferret-UI has outperformed GPT-4V, OpenAI's MLLM. Additionally, it surpassed GPT-4V in nearly all elementary category tasks, such as icon recognition, OCR, widget classification, and locating icons and widgets on iPhone and Android. The only task where GPT-4V had a slight edge over the Ferret models was the "find text" task on the iPhone.