OpenAI's GPT-5.5 Is Here, And It's No Potato: Narrowly Beats Anthropic's Claude Mythos Preview On Terminal-Bench 2.0

After months of rumors and reports that OpenAI was developing a new, more powerful large language model for use in ChatGPT and through its application programming interface (API), reportedly codenamed "Spud" internally, the company today unveiled its latest offering under the more formal name GPT-5.5.

And, probably to no one’s surprise, it’s hardly a "potato" in the pejorative sense of the word: GPT-5.5 has retaken the lead for OpenAI among generally available LLMs, ahead of the latest public offerings from rivals Anthropic and Google, and even beats the private Anthropic Claude Mythos Preview model on one benchmark (essentially a statistical tie).

"This is certainly our strongest model yet on coding, both as measured by benchmarks and based on feedback from trusted partners as well as our own experience," Amelia "Mia" Glaese, vice president of research at OpenAI, said in a video call with reporters today ahead of the launch.

OpenAI positions GPT-5.5 as a fundamental redesign of how intelligence interacts with computers’ operating systems and professional software stacks.

"What’s really special about this model is how much it can do with less guidance," Greg Brockman, co-founder and president of OpenAI, said on the same call. "It is much more intuitive to use. It can look at an obscure problem and figure out what needs to happen next."

Brockman went on to emphasize the areas in which users can expect to see benefits from GPT-5.5 compared to OpenAI’s previous state-of-the-art model, GPT-5.4, which remains available (for now) to users and enterprises at half the API cost of its new successor.

"It is very good at coding," Brockman said of GPT-5.5. "It’s also great for broader computer work, computer use, scientific research – those kinds of applications that are very intelligence-constrained."

OpenAI CEO and co-founder Sam Altman also highlighted the launch and the company’s philosophy in a post on X, writing in part: "We want our users to have access to the best technology and a level playing field for all."

Focus on agency

At the core of GPT-5.5 is a focus on "agentic" performance, especially in coding, computer use, and scientific research.

Unlike its predecessors, which often required nuanced, step-by-step prompts to avoid "hallucinating" the way forward, GPT-5.5 is designed to handle messy, multi-part tasks autonomously.

It excels at conducting online research, debugging complex codebases, and moving between documents and spreadsheets without human intervention.

One of the most significant technological leaps is the efficiency of the model. While larger models typically suffer from increased latency, GPT-5.5 matches the per-token latency of the previous GPT-5.4 while providing a higher level of intelligence.

This was achieved through intensive hardware-software co-design. OpenAI serves GPT-5.5 on NVIDIA GB200 and GB300 NVL72 systems, using custom heuristic algorithms written by AI to divide and balance work across GPU cores.

This optimization has reportedly increased token generation speed by more than 20%. For high-stakes reasoning, the "GPT-5.5 Thinking" mode in ChatGPT provides smarter, more concise answers by allowing the model more internal "compute time" to verify its own assumptions before responding.

This capability is particularly visible in the model’s performance on "Expert-SWE," an internal OpenAI benchmark for long-horizon coding tasks with an average human completion time of 20 hours. GPT-5.5 outperformed GPT-5.4 on this metric while using significantly fewer tokens.

Benchmarks show that OpenAI has again taken the lead over Claude Opus 4.7, the most powerful publicly available LLM (but the unreleased Mythos still outperforms it)

The market for major US-made frontier models has become an increasingly fierce race between OpenAI, Anthropic and Google.

Just a week ago today, OpenAI rival Anthropic released its most powerful generally available model, Opus 4.7, which topped the leaderboard by number of third-party benchmarks led.

Yet today, GPT-5.5 outperforms it and even Anthropic’s highly restricted, more powerful Claude Mythos Preview model, if only on one benchmark: Terminal-Bench 2.0, which tests "a model’s ability to navigate and complete tasks in a sandboxed terminal environment."

GPT-5.5 achieved 82.7% accuracy on Terminal-Bench 2.0, easily exceeding Opus 4.7 (69.4%) and narrowly edging out Mythos Preview (82.0%).

However, in multidisciplinary reasoning without tools, the picture is more competitive. On Humanity’s Last Exam without tools, GPT-5.5 Pro scored 43.1%, trailing Opus 4.7 (46.9%) and Mythos Preview (56.8%).

Benchmark	GPT-5.5	Claude Opus 4.7	Gemini 3.1 Pro	Mythos Preview*
Terminal-Bench 2.0	82.7	69.4	68.5	82.0
Expert-SWE (Internal)	73.1	—	—	—
GDPval (win or tie)	84.9	80.3	67.3	—
OSWorld-Verified	78.7	78.0	—	79.6
Toolathlon	55.6	—	48.8	—
BrowseComp	84.4	79.3	85.9	86.9
FrontierMath Tier 1-3	51.7	43.8	36.9	—
FrontierMath Tier 4	35.4	22.9	16.7	—
CyberGym	81.8	73.1	—	83.1
Tau2-Bench Telecom (Original Prompts)	98.0	—	—	—
OfficeQA Pro	54.1	43.6	18.1	—
Investment Banking Modeling Tasks (Internal)	88.5	—	—	—
MMMU Pro (No Tools)	81.2	—	80.5	—
MMMU Pro (With Tools)	83.2	—	—	—
GeneBench	25.0	—	—	—
BixBench	80.5	—	—	—
Capture-the-Flag Challenge Tasks (Internal)	88.1	—	—	—
ARC-AGI-2 (Verified)	85.0	75.8	77.1	—
SWE-Bench Pro (Public)	58.6	64.3	54.2	77.8

This shows that while OpenAI is winning at "computer use" and "agency," other models may still maintain an edge in pure, zero-shot academic knowledge.

It’s important to clarify that Mythos Preview is not a generally available product; Anthropic has classified it as a strategic defensive asset due to its high cybersecurity risks, limiting access to a small audience of trusted partners and government agencies.

Because Mythos is being withheld from widespread commercial use, the primary market competition remains between GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7.

So when it comes to models that the general public can access, GPT-5.5 again takes the crown for OpenAI, achieving state-of-the-art results on 14 benchmarks compared to 4 for Claude Opus 4.7 and 2 for Google Gemini 3.1 Pro.

It dominates in agentic computer use, economic knowledge work (GDPval), specialized cybersecurity (CyberGym), and complex mathematics (FrontierMath).

In comparison, Claude Opus 4.7 leads in software engineering and tool-free reasoning, while Gemini 3.1 Pro leads in three categories, particularly excelling in academic reasoning and financial analysis.

Increased costs for users

The jump in intelligence comes with a significant price increase for API developers. OpenAI has effectively doubled the entry price for its flagship model compared to the previous generation, and GPT-5.5 Pro, the most cutting-edge version of the model, costs six times as much again:

Model	Input price (per 1M tokens)	Output price (per 1M tokens)
GPT-5.4	$2.50	$15.00
GPT-5.5	$5.00	$30.00
GPT-5.5 Pro	$30.00	$180.00

To offset these costs, OpenAI emphasizes that GPT-5.5 is more "token efficient," meaning it uses fewer tokens than GPT-5.4 to accomplish the same task.
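To make the token-efficiency claim concrete, here is a minimal Python sketch that uses only the list prices in the table above. The workload size (200,000 input tokens and 50,000 output tokens) is hypothetical, and the assumption that efficiency gains shrink input and output tokens by the same factor is ours, not OpenAI's.

```python
# List prices from the pricing table above, in USD per 1M tokens.
PRICES = {
    "GPT-5.4": {"input": 2.50, "output": 15.00},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "GPT-5.5 Pro": {"input": 30.00, "output": 180.00},
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job at list price (ignores any discounts)."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def break_even_fraction(old: str, new: str, input_tokens: int, output_tokens: int) -> float:
    """Fraction of the old model's token count the new model may use at equal cost.

    Assumption (hypothetical): input and output tokens shrink by the same factor.
    """
    return job_cost(old, input_tokens, output_tokens) / job_cost(new, input_tokens, output_tokens)


if __name__ == "__main__":
    # Hypothetical workload, chosen only for illustration.
    tokens_in, tokens_out = 200_000, 50_000
    for model in PRICES:
        print(f"{model}: ${job_cost(model, tokens_in, tokens_out):.2f}")
    frac = break_even_fraction("GPT-5.4", "GPT-5.5", tokens_in, tokens_out)
    print(f"GPT-5.5 breaks even at {frac:.0%} of GPT-5.4's token count")
```

Because both the input and output prices double from GPT-5.4 to GPT-5.5, the break-even point under these assumptions is half the tokens, whatever the mix of input and output.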

For users who need speed over depth, OpenAI also introduced a fast mode in Codex, which generates tokens 1.5x faster but at a 2.5x price premium.

There is currently no GPT-5.5 equivalent of the "mini" and "nano" tiers seen in the GPT-5.4 era (priced at $0.75 and $0.20 per 1M input tokens, respectively), although the company notes that GPT-5.5 is becoming available across all subscription tiers, including Plus, Pro, and Enterprise.
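The Codex fast-mode tradeoff mentioned above (1.5x faster generation at 2.5x the price) can be framed as a price per second saved. The sketch below is a rough illustration only: the baseline cost and duration are hypothetical inputs, and it assumes the multipliers apply to the entire job, which ignores any time not spent generating tokens.

```python
from dataclasses import dataclass

# Multipliers reported for Codex fast mode.
SPEED_MULTIPLIER = 1.5  # tokens are generated 1.5x faster
PRICE_MULTIPLIER = 2.5  # at 2.5x the price


@dataclass
class Tradeoff:
    fast_cost: float      # USD for the job in fast mode
    fast_seconds: float   # wall-clock seconds in fast mode
    extra_cost: float     # USD paid on top of the baseline
    seconds_saved: float  # wall-clock seconds saved versus the baseline

    @property
    def cost_per_second_saved(self) -> float:
        return self.extra_cost / self.seconds_saved


def fast_mode_tradeoff(base_cost: float, base_seconds: float) -> Tradeoff:
    """Compare a baseline job against the same job run in fast mode.

    Assumption (hypothetical): both multipliers apply to the whole job.
    """
    fast_cost = base_cost * PRICE_MULTIPLIER
    fast_seconds = base_seconds / SPEED_MULTIPLIER
    return Tradeoff(fast_cost, fast_seconds, fast_cost - base_cost, base_seconds - fast_seconds)


if __name__ == "__main__":
    # Hypothetical baseline: a $3.00 job that takes 90 seconds.
    t = fast_mode_tradeoff(base_cost=3.00, base_seconds=90.0)
    print(f"fast mode: ${t.fast_cost:.2f} in {t.fast_seconds:.0f}s")
    print(f"premium: ${t.cost_per_second_saved:.2f} per second saved")
```

Under these assumptions, fast mode cuts wall-clock time by a third while adding 150% to the bill, so it suits interactive work where waiting is the bottleneck more than bulk batch jobs.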

Licensing and ‘cyber-permissive’ limits

OpenAI’s approach to security and licensing for GPT-5.5 introduces a new concept: Trusted Access for Cyber. Because the model is now able to identify and fix advanced security vulnerabilities, OpenAI has deployed more rigorous "cyber-risk classifiers" for general users.

However, for legitimate security professionals, OpenAI is offering a special "cyber-permissive" license. The program allows verified defenders, such as those responsible for critical infrastructure like power grids or water supplies, to use models such as gpt-5.4-cyber or unrestricted versions of GPT-5.5 with fewer refusals for security-related prompts.

This dual-use framework recognizes that while AI can enhance cyber defense, it can also be weaponized. Under OpenAI’s Preparedness Framework, GPT-5.5 is classified as "High" risk for biological and cybersecurity capabilities.

To manage this, API deployments currently require different security measures than consumer-facing ChatGPT, and OpenAI is working with government partners to ensure that these tools are used to strengthen, not weaken, digital resilience.

Initial reactions: Losing access feels like ‘having a limb amputated’

Initial feedback from power users and engineers suggests that GPT-5.5 has crossed a psychological threshold in AI usefulness. For developers, the model’s ability to maintain "conceptual clarity" in massive codebases is its standout feature.

"The first coding model I’ve used that has serious conceptual clarity," noted Dan Shipper, CEO of Every.

Shipper tested the model by asking it to debug a complex system failure that previously required a team of human engineers to rewrite; GPT-5.5 produced a similar fix autonomously. Similarly, Pietro Schirano, CEO of MagicPath, described a "phase change" in performance when the model successfully merged a branch with hundreds of refactored changes into the main branch in a single 20-minute run. Perhaps the sharpest response came from an anonymous NVIDIA engineer who had early access to the model:

"Losing access to GPT-5.5 feels like having a limb amputated."

This sentiment is echoed in the scientific community. Derya Unutmaz, a professor at the Jackson Laboratory for Genomic Medicine, used GPT-5.5 Pro to analyze a dataset of 28,000 genes, producing a report in minutes that would normally have taken his team months.

Brandon White, CEO of Axiom Bio, added that if OpenAI continues this momentum, "the foundation of drug discovery will change by the end of the year."

GPT-5.5 is more than an incremental update; it’s a tool designed for a world where humans delegate entire workflows rather than single prompts. While costs are high and security guardrails are tight, the performance gains in agentic work show that AI is finally moving from chat boxes to operating systems.

According to the company’s researchers, perhaps the most surprising thing is that the end of scaling, in which models are trained on more and more GPUs, is not even in sight.

"In fact we still have room to train smarter models than this," said Jakub Pachocki, chief scientist at OpenAI.
