Alibaba unveils new AI models capable of controlling PCs, smartphones
What's the story
Alibaba's Qwen team has launched a new series of artificial intelligence (AI) models, called Qwen2.5-VL.
The advanced models can handle a range of text and image analysis tasks, from parsing files and understanding videos to counting objects in images and even controlling a PC or smartphone.
The device-control capability resembles that of OpenAI's recently launched Operator agent.
Benchmark success
Outperforming rival AI models in benchmark tests
According to internal benchmark tests conducted by the Qwen team, Qwen2.5-VL outperforms several competitors across a range of evaluations.
These include video understanding, math problems, document analysis, and question-answering tasks.
The models Qwen2.5-VL outperformed include OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 2.0 Flash.
Model features
Diverse capabilities and availability
Qwen2.5-VL can analyze charts and graphics, extract data from scanned invoices and forms, comprehend long videos, and recognize intellectual property from films, TV series, and other products.
Its ability to recognize such intellectual property suggests that the training data may have included copyrighted works.
The models are available for testing in Alibaba's Qwen Chat app, and developers can download them from the AI development platform Hugging Face.
Regulatory compliance
It adheres to Chinese internet regulations
Because it is developed by a Chinese firm, Qwen2.5-VL restricts the topics it will discuss, particularly in the Qwen Chat app.
For example, when asked to talk about "Xi Jinping's mistakes," the app showed an error message.
This is consistent with the practice of China's internet regulator, which benchmarks domestically developed models to ensure their responses reflect "core socialist values" and avoid politically sensitive topics.
Software interaction
Qwen2.5-VL can interact with software on various devices
One of the standout features of Qwen2.5-VL is its ability to interact with software on PCs and mobile devices.
In a clip shared by Philipp Schmid, a technical lead at Hugging Face, the AI model was seen launching the Booking.com app for Android and booking a flight from Chongqing to Beijing.
In another demo, it controlled apps on a Linux desktop, though it did not get beyond switching tabs.
Licensing details
Licensing terms for Qwen2.5-VL models
While the two less sophisticated models in the Qwen2.5-VL series, Qwen2.5-VL-3B and Qwen2.5-VL-7B, are available under a permissive license, the most advanced model, Qwen2.5-VL-72B, is subject to Alibaba's custom license.
This license requires companies and developers with more than 100 million monthly active users to obtain permission from Alibaba before deploying the model commercially.
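For developers weighing which model to deploy, the licensing terms reported above can be expressed as a simple check. The Python sketch below is illustrative only: the function name, the constant, and the tier labels are hypothetical, and it encodes just the terms as described in this article, not the text of Alibaba's actual license.

```python
# Illustrative sketch only. All names here are hypothetical and reflect the
# licensing terms as reported in this article, not the license text itself.

# Threshold reported in the article: over 100 million monthly active users.
MAU_THRESHOLD = 100_000_000

# License tier for each model, as described in the article.
LICENSE_TIERS = {
    "Qwen2.5-VL-3B": "permissive",
    "Qwen2.5-VL-7B": "permissive",
    "Qwen2.5-VL-72B": "custom",
}


def needs_alibaba_permission(model: str, monthly_active_users: int,
                             commercial: bool = True) -> bool:
    """Return True if, per the reported terms, permission must be sought."""
    tier = LICENSE_TIERS[model]  # raises KeyError for models not listed above
    return (
        tier == "custom"
        and commercial
        and monthly_active_users > MAU_THRESHOLD
    )


print(needs_alibaba_permission("Qwen2.5-VL-72B", 250_000_000))  # True
print(needs_alibaba_permission("Qwen2.5-VL-72B", 5_000_000))    # False
print(needs_alibaba_permission("Qwen2.5-VL-7B", 250_000_000))   # False
```

Under these assumptions, only a commercial deployment of the 72B model by an organization above the user threshold triggers the permission requirement.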