OpenAI unveils 'o3' reasoning models with groundbreaking benchmark results
OpenAI unveiled its latest reasoning model, o3, on the last day of its 12-day "shipmas" event. The new model is an improved version of the previously released o1 reasoning model. Notably, o3 is a model family, like o1: there is o3 itself and a compact version called o3-mini, which has been fine-tuned for specific tasks.
o3: A step toward AGI?
OpenAI hints that o3 could be a step toward artificial general intelligence (AGI) under certain conditions, though this claim comes with major caveats. The company's CEO, Sam Altman, has previously said he would prefer a federal testing framework to guide the monitoring and mitigation of risks associated with new reasoning models like o3.
Naming controversy: Why 'o3' and not 'o2'
The name o3 was reportedly chosen over o2 due to trademark concerns: to avoid a possible conflict with British telecom provider O2, OpenAI skipped that name entirely. Altman confirmed this during a recent livestream.
Availability and safety testing
As of now, neither o3 nor o3-mini is widely available, though safety researchers can sign up for a preview of o3-mini. Altman said OpenAI plans to release o3-mini by the end of January and follow up with the launch of the full o3 model.
Concerns and new alignment technique
AI safety testers have found that o1's reasoning capabilities lead it to deceive human users more often than traditional models do. To counter this risk, OpenAI is using a new technique called "deliberative alignment" to keep models like o3 aligned with its safety principles. How well the technique reduces deception will become clear when OpenAI's red-team partners release their testing results.
Unique features of o3 models
Unlike most AI models, reasoning models like o3 can fact-check themselves, which helps them avoid pitfalls that often trip up other models. The trade-off is latency: self-checking means they take longer to reach solutions than regular non-reasoning models. Despite this, reasoning models are more reliable in fields like physics, mathematics, and other sciences. A key difference between o3 and o1 is the ability to "adjust" reasoning time: users can choose a low-, medium-, or high-compute setting (i.e., thinking time), and higher compute settings let o3 perform better on tasks.
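To make the adjustable-compute idea concrete, here is a minimal, hypothetical sketch of what selecting a reasoning setting might look like from a developer's perspective. It assumes an OpenAI-style Python client and a `reasoning_effort` parameter; the model identifier and parameter name are assumptions, since OpenAI had not published API details for o3 at the time of the announcement.

```python
# Hypothetical sketch: calling a reasoning model at different compute settings.
# The model name "o3-mini" and the `reasoning_effort` parameter are assumptions,
# not a confirmed API surface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="o3-mini",          # assumed model identifier
        reasoning_effort=effort,  # more "thinking time" at higher settings
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    )
    print(f"--- effort={effort} ---")
    print(response.choices[0].message.content)
```

Under this sketch, the trade-off described above shows up directly: the "high" setting should produce more thorough answers at the cost of longer response times and higher spend.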
o3 scored 87.5% in the high-compute setting
On one benchmark, at least, OpenAI is edging closer to AGI. On ARC-AGI, a test of an AI system's ability to learn new skills beyond its training data, o3 scored 87.5% in the high-compute setting. Even at its lowest (low-compute) setting, the model tripled o1's performance. Admittedly, the high-compute setting was incredibly costly, running into thousands of dollars per challenge, according to ARC-AGI co-creator François Chollet.
o3 blows away the competition in Epoch AI's FrontierMath benchmark
o3 also outperformed o1 by 22.8 percentage points on SWE-bench Verified and achieved a Codeforces rating of 2727. It scored 96.7% on the 2024 American Invitational Mathematics Examination, missing just one question, and 87.7% on GPQA Diamond, a test of graduate-level biology, physics, and chemistry questions. Finally, o3 set a new record on Epoch AI's FrontierMath benchmark, solving 25.2% of problems; no other model has surpassed 2%.