OpenAI's new o1 model can reason and correct its mistakes
What's the story
OpenAI, the organization behind ChatGPT, has launched its latest product: a generative artificial intelligence (AI) model named o1.
Internally referred to as 'Strawberry,' it is capable of performing its own fact-checking.
o1 is a family of models, two of which are available today through ChatGPT and OpenAI's API: o1-preview and o1-mini.
These versions are currently accessible to subscribers of ChatGPT Plus or Team in the ChatGPT client. Enterprise and Edu users will gain access early next week.
AI vs humans
OpenAI o1 exceeds human PhD-level accuracy in physics, biology, chemistry
OpenAI o1 is designed to avoid common reasoning errors often encountered by generative AI models.
It can effectively fact-check itself by spending more time considering all aspects of a query or command.
The model originated from an internal project known as Q*, and is especially skilled at solving math and programming-related challenges.
OpenAI o1 "exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA)," as per the company's blog post.
AI performance
OpenAI o1's 'thinking' process and performance
What sets OpenAI o1 apart from other generative AI models is its ability to "think" before responding to queries.
Given additional time, it can reason through a task holistically, planning ahead and performing a series of actions over an extended period that help it arrive at answers.
"It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working," said OpenAI.
AI training
OpenAI o1's training and optimization
"o1 is trained with reinforcement learning," which teaches the system through rewards and penalties, "to 'think' before responding via a private chain of thought," said Noam Brown, a research scientist at OpenAI.
He added that OpenAI used a new optimization algorithm and training data set specifically tailored for the o1 models.
"The longer [o1] thinks, the better it does on reasoning tasks," Brown said.
Real-world application
OpenAI o1's performance in real-world scenarios
Pablo Arredondo, VP at Thomson Reuters, who had access to the model before its launch, stated that o1 is superior to OpenAI's previous models (e.g., GPT-4o) at tasks like analyzing legal briefs and identifying solutions to problems in LSAT logic games.
"We saw it tackling more substantive, multi-faceted analysis," Arredondo told TechCrunch.
However, he also noted that depending on the query, o1 can be slower than other models and may take over 10 seconds to answer some questions.
Competitive edge
OpenAI o1's performance in competitive scenarios
In a qualifying exam for the International Mathematical Olympiad, a high school math competition, o1 correctly solved 83% of problems while GPT-4o solved only 13%, according to OpenAI.
The company also claims that the model reached the 89th percentile of participants in online programming contests known as Codeforces competitions.
Although o1 responds more slowly than earlier models, these results suggest it may be stronger at complex problem-solving tasks.
Model features
OpenAI o1's capabilities and limitations
The current version of o1 is somewhat basic and does not have the ability to browse the internet or analyze files.
It also has usage restrictions with weekly limits set at 30 messages for o1-preview and 50 for o1-mini.
Notably, o1 models are expensive. In the API, o1-preview costs $15 per one million input tokens (3x the cost of GPT-4o) and $60 per one million output tokens. For reference, one million tokens is roughly equivalent to 750,000 words.
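To put those prices in perspective, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration using the figures quoted above; the function names and the example word counts are hypothetical, and the 750,000-words-per-million-tokens figure is a rule of thumb rather than an exact conversion.

```python
# Prices quoted in the article (USD per one million tokens, o1-preview API).
INPUT_PRICE_PER_MILLION = 15.0
OUTPUT_PRICE_PER_MILLION = 60.0

# Rule of thumb from the article: one million tokens is about 750,000 words.
WORDS_PER_TOKEN = 0.75


def words_to_tokens(words: int) -> int:
    """Rough token estimate from a word count (hypothetical helper)."""
    return round(words / WORDS_PER_TOKEN)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the quoted prices."""
    total = (input_tokens * INPUT_PRICE_PER_MILLION
             + output_tokens * OUTPUT_PRICE_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 1,500-word prompt and a 750-word answer (made-up sizes).
    prompt_tokens = words_to_tokens(1_500)
    answer_tokens = words_to_tokens(750)
    print(f"{prompt_tokens} input tokens, {answer_tokens} output tokens")
    print(f"Estimated cost: ${estimate_cost(prompt_tokens, answer_tokens):.2f}")
```

Under these assumptions, a 1,500-word prompt with a 750-word reply works out to about nine cents, with two-thirds of that coming from the output side. The sketch counts only the tokens passed in, so actual bills may differ.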