
OpenAI unveils GPT-4.1 AI models focused on coding tasks
What's the story
OpenAI has unveiled a new series of models, called GPT-4.1. The lineup includes three variants: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.
According to the company, these multimodal models are optimized for coding and instruction-following tasks.
They can handle up to one million tokens (roughly 750,000 words) at a time, and are available via OpenAI's API but not through ChatGPT, the company's popular chatbot.
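For developers, that means access goes through the same API surface used for earlier models. As a rough illustration, a call to the new model with OpenAI's official Python SDK might look like the following (the prompt is purely illustrative):

```python
# A minimal sketch of calling GPT-4.1 via OpenAI's official Python SDK.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set
# in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

response = client.chat.completions.create(
    model="gpt-4.1",  # the smaller tiers use "gpt-4.1-mini" / "gpt-4.1-nano"
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
)

print(response.choices[0].message.content)
```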
AI evolution
Aiming for complex software engineering tasks
The launch of GPT-4.1 comes as tech giants such as Google and Anthropic double down on the race to build advanced programming models.
OpenAI's ambitious goal is to create an "agentic software engineer" capable of complex software engineering tasks such as programming entire apps end to end, handling quality assurance, testing for bugs, and writing documentation.
Model efficiency
Enhanced performance and cost-effectiveness
OpenAI claims GPT-4.1 beats its predecessors, GPT-4o and GPT-4o mini, on coding benchmarks like SWE-bench.
The mini and nano variants of GPT-4.1 are said to be more efficient and faster, with a slight compromise on accuracy.
The nano variant is OpenAI's fastest—and most economical—model yet.
GPT-4.1 costs $2 per million input tokens and $8 per million output tokens; GPT-4.1 mini costs $0.40 per million input tokens and $1.60 per million output tokens; and GPT-4.1 nano costs just $0.10 per million input tokens and $0.40 per million output tokens.
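To put those figures in perspective, here is a back-of-the-envelope sketch of what a single request might cost at each tier, assuming a hypothetical workload of 10,000 input tokens and 2,000 output tokens:

```python
# A back-of-the-envelope sketch of per-request cost at each tier, using
# the published per-million-token prices (USD). The request size below
# (10,000 input tokens, 2,000 output tokens) is a hypothetical example.
PRICES = {
    "gpt-4.1":      {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}

input_tokens, output_tokens = 10_000, 2_000

for model, price in PRICES.items():
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    print(f"{model}: ${cost:.4f}")
# gpt-4.1: $0.0360 | gpt-4.1-mini: $0.0072 | gpt-4.1-nano: $0.0018
```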
AI challenges
Performance and limitations of GPT-4.1 models
In internal tests, GPT-4.1 scored between 52% and 54.6% on SWE-bench Verified, a human-validated subset of SWE-bench.
This was lower than Google's Gemini 2.5 Pro (63.8%) and Anthropic's Claude 3.7 Sonnet (62.3%).
OpenAI also tested the model on Video-MME, achieving a top score of 72% accuracy on the "long, no subtitles" video category.
According to experts, even top AI models struggle to fix security vulnerabilities and bugs in generated code, an issue OpenAI itself has acknowledged.
Model behavior
Reliability and prompt requirements of GPT-4.1
According to OpenAI's own tests, GPT-4.1's reliability diminishes as the number of input tokens grows.
In one such test, the model's accuracy fell from roughly 84% with 8,000 tokens to 50% with one million tokens.
OpenAI also observed that GPT-4.1 is more "literal" than its predecessor GPT-4o, sometimes requiring more specific and explicit prompts for optimal performance.
Its "knowledge cutoff" is more recent, extending up to June 2024. This provides it with an improved understanding of current events.
Scenario
OpenAI to phase out GPT-4.5
OpenAI has announced plans to phase out GPT-4.5, its most powerful AI model to date, from its API.
Released in late February, GPT-4.5 will remain accessible via the API until July 14. After that, developers will need to switch to another model; OpenAI recommends GPT-4.1 as the preferred alternative.
Importantly, GPT-4.5 will continue to be available within ChatGPT for paying users as part of its research preview. The change only affects API access.
Cost concerns
One of the priciest models in the company's lineup
GPT-4.5, internally code-named Orion, was trained with more data and computing power than any of OpenAI's earlier models.
It offers improvements over GPT-4o in areas like writing quality and persuasiveness.
OpenAI has acknowledged that GPT-4.5 is extremely expensive to operate; back in February, the company even hinted it might not keep the model available via its API long term.
The model costs $75 per million input tokens (about 750,000 words) and $150 per million output tokens.