OpenAI Releases GPT-5.5 With Agentic Computing Focus and Its Strongest Safety Safeguards Yet
OpenAI's GPT-5.5 lands with an 82.7% Terminal-Bench 2.0 score, a 400K-token Codex window, and the company's strictest safety review yet.
AnIntent Editorial
OpenAI shipped GPT-5.5 on April 23, 2026, less than two months after GPT-5.4, and the company is positioning it as the most agentic model it has ever released. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, GPT-5.5 hits 82.7%, a state-of-the-art score. The headline is not raw IQ. It is autonomy: how far the model gets without a human nudging it back on track.
What OpenAI actually shipped on April 23
The rollout pattern matters. OpenAI said GPT-5.5 was rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex on launch day, with API access on the Responses and Chat Completions endpoints arriving April 24. The official model snapshot ID is gpt-5.5-2026-04-23.
Developers get five reasoning effort levels: none, low, medium (the default), high, and xhigh. That last tier is new and aimed squarely at long-horizon agentic runs where the model is expected to plan, act, and self-correct across hours of work.
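In request terms, the effort tier is just one field. The sketch below shows what such a request payload might look like; the model snapshot ID comes from the announcement, but the exact payload shape and the prompt are illustrative assumptions, not confirmed SDK details.

```python
# Sketch of a Responses API request using the new top effort tier.
# Payload shape is illustrative; the snapshot ID is from the announcement.
payload = {
    "model": "gpt-5.5-2026-04-23",
    "input": "Migrate the billing service to the new schema and run the tests.",
    # Effort tiers: none | low | medium (default) | high | xhigh.
    # xhigh targets long-horizon agentic runs that plan, act, and self-correct.
    "reasoning": {"effort": "xhigh"},
}
```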
GPT-5.5 is also now the default model in Codex, which operates with a 400K-token context window. Codex itself picked up a quietly significant capability earlier this spring: support for multiple concurrent AI agents working in parallel on a single codebase via isolated git worktrees. Pair that with GPT-5.5 and you have an agent harness that can fan out across a repository instead of marching through it serially.
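The worktree pattern itself is plain git and easy to reproduce at the command line. The sketch below fans two hypothetical agent branches out from one repository; all paths and branch names are illustrative.

```shell
# Minimal sketch of the fan-out pattern: one repo, two isolated checkouts.
set -e
base=$(mktemp -d)
git init -q "$base/repo"
cd "$base/repo"
git -c user.name=agent -c user.email=agent@example.com \
    commit -q --allow-empty -m "initial commit"
# Each agent works on its own branch in its own directory,
# so parallel edits never collide in a shared checkout.
git worktree add -b agent-a "$base/agent-a" >/dev/null
git worktree add -b agent-b "$base/agent-b" >/dev/null
git worktree list   # main checkout plus the two agent worktrees
```

Each worktree shares the same object database, so agents see one another's commits the moment they land, without stepping on one another's working files.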
The OpenAI GPT-5.5 release pitch: less guidance, more execution
During the press call, OpenAI president Greg Brockman said GPT-5.5 will let Codex produce polished code and approach projects with the judgment of a senior software engineer. Brockman framed the launch as a step toward what he called a "super app" and described the model as a meaningful jump toward more agentic and intuitive computing.
His other quote is the one worth pinning to the wall. According to CNBC, Brockman said what makes the model special is how much more it can do with less guidance. Translation: the prompt engineering tax is supposed to drop.
Chief Research Officer Mark Chen pushed the science angle, telling reporters the model shows meaningful gains on scientific and technical research workflows and flagging drug discovery as a plausible near-term application. That is a notable shift in emphasis. Coding has been the marquee benchmark for two years; biology is now in the foreground.
GPT-5.5 vs GPT-5.4: where the gap actually shows up
Benchmarks first, because the numbers are unusually clean.
- Terminal-Bench 2.0: GPT-5.5 hits 82.7%, ahead of GPT-5.4 at 75.1%, Anthropic's Claude Opus 4.7 at 69.4%, and Google's Gemini 3.1 Pro at 68.5%.
- SWE-Bench Pro: 58.6% of real-world GitHub issues resolved end-to-end in a single pass.
- OSWorld-Verified (computer use): 78.7%, versus 75% for GPT-5.4 and 78% for Opus 4.7.
- GDPval (knowledge work across 44 occupations): 84.9%.
- Tau2-bench Telecom: 98.0%, run without prompt tuning.
- FrontierMath: 51.7% on difficulty 1-3 problems and 35.4% on difficulty 4.
The efficiency story is the more interesting one. GPT-5.5 uses roughly 40% fewer output tokens than GPT-5.4 to complete equivalent Codex tasks. Larger, more capable models are usually slower to serve, but GPT-5.5 matches GPT-5.4's per-token latency in real-world serving while running at a much higher level of intelligence. OpenAI credits custom heuristic GPU partitioning algorithms for a token-generation speedup of more than 20%.
That efficiency claim is doing real work in the pricing math. API output token pricing comes in at $30 per million for GPT-5.5, double the $15 GPT-5.4 charged. If the 40% token-reduction figure holds in production, the effective cost increase per finished task is closer to 20% than 100%. If it does not hold, enterprise bills are about to look very different.
There is also a long-context surcharge nobody is talking about loudly enough: prompts exceeding 272K input tokens are billed at 2x input and 1.5x output for the entire session. Regional data residency endpoints carry a further 10% uplift. Anyone running 400K-token Codex sessions in an EU data residency tier is paying for it.
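The per-task arithmetic is worth checking explicitly. The rates below come from the release pricing; the 100K-token task size is a hypothetical chosen to make the numbers round.

```python
# Rates from the announcement; the task size is hypothetical.
GPT54_OUTPUT_RATE = 15.0   # $ per million output tokens
GPT55_OUTPUT_RATE = 30.0   # $ per million output tokens
TOKEN_REDUCTION = 0.40     # GPT-5.5 reportedly emits ~40% fewer output tokens

def task_cost(rate_per_million, tokens):
    return rate_per_million * tokens / 1_000_000

baseline_tokens = 100_000                               # GPT-5.4 task
gpt55_tokens = baseline_tokens * (1 - TOKEN_REDUCTION)  # same task on GPT-5.5

old_cost = task_cost(GPT54_OUTPUT_RATE, baseline_tokens)  # $1.50
new_cost = task_cost(GPT55_OUTPUT_RATE, gpt55_tokens)     # $1.80
print(f"effective per-task increase: {new_cost / old_cost - 1:.0%}")  # → 20%
```

Doubled rate times 60% of the tokens nets out to 1.2x per task, which is the whole basis for the "closer to 20% than 100%" framing.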
GPT-5.5 Codex agentic AI and the senior-engineer claim
The "senior engineer" line is marketing, but there is signal underneath it. CodeRabbit's early testing on its curated code-review benchmark found GPT-5.5 raised expected-issue detection from 58.3% to 79.2% and lifted precision from 27.9% to 40.6%, while only nudging comment volume from 67 to 75. More useful issues, not just more issues.
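Those percentages compound in a way worth making explicit. The estimate below multiplies precision by comment volume to approximate the count of useful comments; that product is our simplification for illustration, not CodeRabbit's actual methodology.

```python
# Back-of-envelope check on CodeRabbit's figures (from the article).
# "Useful comments" ≈ precision × total comments is a simplification.
before = {"comments": 67, "precision": 0.279}
after = {"comments": 75, "precision": 0.406}

useful_before = before["comments"] * before["precision"]  # ≈ 18.7
useful_after = after["comments"] * after["precision"]     # ≈ 30.5

volume_growth = after["comments"] / before["comments"] - 1  # ≈ 12%
useful_growth = useful_after / useful_before - 1            # ≈ 63%
print(f"comment volume up {volume_growth:.0%}, useful comments up {useful_growth:.0%}")
```

A 12% rise in volume yielding a roughly 63% rise in useful findings is the signal-over-noise improvement the raw numbers are pointing at.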
CodeRabbit's qualitative read is the part most reviews missed. Its engineers reported that GPT-5.5 favored the smallest workable change instead of sweeping rewrites - a behavioral shift toward conservatism that matters more in production codebases than any benchmark score. Junior models love the rewrite. Seniors leave the rewrite for next quarter.
Where the model still wobbles: open-ended UI work where execution quality is high but originality lags, and any task where the prompt itself is genuinely ambiguous about scope. xhigh reasoning helps. It is not a substitute for a clear spec.
For readers tracking how these capabilities reshape underlying compute demand, our piece on why AI infrastructure is now more strategic than AI models covers the supply-side context.
OpenAI's strongest safeguards yet - and what "High" cybersecurity means
This is the part of the announcement that deserves the most scrutiny. OpenAI released GPT-5.5 alongside what it described as its strongest set of safeguards to date, including targeted cybersecurity and biology capability testing and feedback from nearly 200 early-access partners.
According to CNBC, GPT-5.5 meets OpenAI's internal "High" cybersecurity risk classification but does not cross the "Critical" threshold. VP of Research Mia Glaese confirmed extensive third-party safeguard testing and red teaming for cyber and bio risks, and OpenAI has been iterating on cyber safeguards for months ahead of this release.
"High" is not a marketing word here. It is a tier in OpenAI's preparedness framework that triggers specific deployment mitigations. Crossing into "Critical" would block release entirely. The fact that GPT-5.5 sits one rung below the cliff is the most concrete admission yet that frontier coding ability and frontier offensive-security ability are now the same capability viewed from two different angles.
Anthropic's recently announced Mythos cybersecurity model came up repeatedly during the press briefing as a comparison point - a reminder that safety positioning is now part of competitive positioning. For broader context on where this is heading, see our coverage of the AI peer-preservation alignment problem and our other AI Safety articles.
The competitive picture: why the timing is not a coincidence
GPT-5.5 arrived under pressure. Anthropic now holds roughly 32% of the enterprise LLM API market against OpenAI's 25%, a reversal from a year prior. Codex usage is the counter-narrative OpenAI wants in the headlines: about 4 million developers now use the Codex coding agent weekly, according to OpenAI.
A higher-accuracy variant, GPT-5.5 Pro, is available to Pro, Business, and Enterprise users. The Pro tier is where OpenAI is quietly testing whether enterprises will pay a premium for marginal accuracy gains on long-horizon tasks where the cost of a wrong answer dwarfs the cost of inference.
Google is the other variable. The Terminal-Bench gap with Gemini 3.1 Pro is wide, but Google's models continue to lead on raw context length and multimodal grounding. Our breakdown of the Gemini 3.1 Ultra launch covers that side of the board.
What to watch next
Three things will tell us whether GPT-5.5 is a generational shift or a checkpoint release.
First, whether the 40% token-reduction figure holds at enterprise scale once developers stop hand-tuning prompts and start letting the model run with default reasoning. Token efficiency is the entire economic argument for the doubled output price.
Second, whether independent red teams confirm OpenAI's "High but not Critical" cybersecurity classification, or whether community jailbreaks push the practical capability past the line OpenAI drew. Frameworks are only as honest as their auditors.
Third, whether Anthropic releases Mythos publicly and how its safety posture compares. The model is reportedly finished but unreleased. The next move is theirs.
For more breaking coverage, the News and AI Tools sections track each frontier release as it lands.
Frequently Asked Questions
When did OpenAI release GPT-5.5?
OpenAI released GPT-5.5 on April 23, 2026, with rollout to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex on launch day. API access via the Responses and Chat Completions endpoints became available the following day, April 24, 2026.
How much does the GPT-5.5 API cost?
GPT-5.5 API output pricing is $30 per million tokens, double the $15 per million GPT-5.4 charged. Prompts exceeding 272K input tokens are billed at 2x input and 1.5x output for the full session, and regional data residency endpoints add a 10% uplift.
What context window does GPT-5.5 have in Codex?
GPT-5.5 is now the default model in Codex, which operates with a 400K-token context window. Codex also supports multiple concurrent AI agents working in parallel on a single codebase via isolated git worktrees as of April 2026.
What is GPT-5.5's cybersecurity risk classification?
According to CNBC, GPT-5.5 meets OpenAI's internal "High" cybersecurity risk classification but does not cross the "Critical" threshold, which would block deployment. VP of Research Mia Glaese confirmed extensive third-party safeguard testing and red teaming for cyber and bio risks before release.
How does GPT-5.5 compare with Claude Opus 4.7 and Gemini 3.1 Pro?
On Terminal-Bench 2.0, GPT-5.5 scores 82.7% versus Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%, according to Fast Company. On OSWorld-Verified computer-use tests, GPT-5.5 reaches 78.7% compared with 78% for Opus 4.7.