OpenAI and Anthropic dropped their flagship coding models fifteen minutes apart on Wednesday. If you watched the benchmarks roll in, you saw what looked like a clear winner. GPT-5.3-Codex scored 77.3% on Terminal-Bench 2.0. Claude Opus 4.6 reportedly hit 65.4% on the same test. A 12-point gap. In AI, that is supposed to be a blowout.
Then you tried to install the thing. The Codex desktop app runs on macOS only. No Windows. No Linux. No API access at launch. If you wanted to wire it into your CI pipeline or IDE on a Thursday afternoon, tough luck. Claude Opus 4.6? Full API access, same $5/$25 per million tokens as before. A 1-million-token context window in beta. Agent teams that split work across parallel processes you can take over mid-run. All of it live on day one.
The two companies had scheduled matching 10 a.m. Pacific announcements. Anthropic jumped the gun by 15 minutes. A small move, but revealing. Sam Altman fired back on X, calling Anthropic "authoritarian." Anthropic bought Super Bowl airtime to make fun of ChatGPT's ad plans. The gloves came off a while ago.
But here is the thing. Benchmark charts make for great Twitter posts. Lousy purchasing decisions, though. Sit down and compare what each tool does for someone with a production deadline, and the picture gets messy in a hurry. The model that wins on paper may not be the one that ships your code.
The Argument
- GPT-5.3-Codex wins on terminal and coding benchmarks, but ships without API access or Windows support.
- Claude Opus 4.6 trails on raw scores but delivers agent teams, 1M context, and a unified install across every platform.
- Enterprise data shows OpenAI's wallet share is shrinking while Anthropic's production adoption rate is higher.
- At the frontier, model quality is converging. Distribution and developer experience decide the next round.
Where Codex genuinely pulls ahead
Give credit where it is earned. GPT-5.3-Codex posted real gains in spots that matter to developers who live in the terminal.
On SWE-Bench Pro, which tests real-world software engineering across four programming languages, Codex hit 56.8%. Its predecessor scored 56.4%. GPT-5.2 managed 55.6%. Marginal gains, all of them. Terminal-Bench 2.0 told a different story. That benchmark measures command-line skills, the bread and butter of anyone who ships from a terminal, and Codex vaulted from 64.0% to 77.3%. Thirteen points in a single generation. On OSWorld, where models have to complete actual desktop productivity tasks, the jump was even wilder. Previous model: 38.2%. This one: 64.7%. A 26-point leap in one release.
OpenAI also claims Codex uses fewer tokens for equivalent tasks. Fewer tokens means lower cost per interaction and longer effective context. For shops running hundreds of agent sessions daily, the arithmetic adds up.
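To put rough numbers on it, with illustrative figures rather than anything either company has published: at Claude-style input pricing of $5 per million tokens, a team running 500 agent sessions a day at around 200,000 input tokens each spends about 500 × 0.2M × $5 = $500 a day. Cut token usage by 20% and that is $100 a day back, roughly $3,000 a month, before output tokens even enter the picture.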
And then there is the self-improvement angle. OpenAI says GPT-5.3-Codex helped debug its own training, manage its own deployment, and diagnose test results. "Our first model that was instrumental in creating itself." Ars Technica reality-checked that claim and called it an overstatement, noting that the tasks described, managing deployments and handling test results, are what many enterprise engineering teams already automate. Fair enough. But the speed-up was real enough that OpenAI shipped the model faster than internal timelines predicted.
OpenAI also classified GPT-5.3-Codex as "high capability" for cybersecurity tasks under its Preparedness Framework, a first for any of its models. The company committed $10 million in API credits for cyber defense research and is expanding its Aardvark security agent beta. That framing turns Codex into a security story, too. Smart positioning for enterprise buyers who need to justify spend to a CISO, not only a VP of engineering.
Where Claude quietly wins the workflow war
Opus 4.6 does not dominate the same benchmarks. On Terminal-Bench 2.0, it trails by double digits. On SWE-Bench Pro, Anthropic did not even publish a direct comparison number for Opus 4.6, which suggests the result was not flattering.
But Anthropic played a different game entirely. OpenAI tuned the engine for horsepower. Anthropic built the car around it.
The biggest gap shows up in how each tool handles scale. Agent teams let you split a large codebase review across multiple Claude instances running in parallel, coordinated automatically, with the ability to jump into any sub-agent mid-run using keyboard shortcuts or tmux. That solves the "staring at a spinner for 40 minutes" problem that plagues single-agent workflows. Context compaction keeps long sessions alive by summarizing older context before it crashes into window limits. And four effort levels give developers a throttle they can adjust, from low for quick lookups to max for the kind of gnarly debugging that used to eat an afternoon.
Anthropic also bet on something OpenAI hasn't matched. Opus 4.6 scores 76% on the 8-needle variant of MRCR v2, a test that hides information deep in massive contexts. Sonnet 4.5 scored 18.5% on the same test. That is a qualitative shift in how much context a model can actually use without drifting. On GDPval-AA, the independent evaluation of knowledge work run by Artificial Analysis, Opus 4.6 outperformed GPT-5.2 by roughly 144 Elo points. That translates to winning about 70% of head-to-head comparisons on real work tasks like building presentations, analyzing financials, and structuring spreadsheets.
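For anyone who wants to check that conversion: under the standard Elo expectation formula, a 144-point edge implies a win probability of 1 / (1 + 10^(-144/400)) ≈ 0.70, which is where the roughly 70% head-to-head figure comes from.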
And then there is the context window. One million tokens, still in beta, but anyone working in a monorepo with hundreds of files knows what that means. At 200k tokens, the model is guessing at your codebase. At a million, it can actually see the thing.
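As a rough rule of thumb rather than a vendor figure, code tends to tokenize at around 10 to 15 tokens per line, so a million-token window is on the order of 65,000 to 100,000 lines of code in context at once. That is the difference between sampling a monorepo and holding most of it.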
Weighing the two right now? Here is what the scorecards look like side by side.
| | GPT-5.3-Codex | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench Pro | 56.8% | Not published |
| Terminal-Bench 2.0 | 77.3% | ~65.4% (reported) |
| OSWorld | 64.7% | N/A (different eval) |
| GDPval (knowledge work) | 70.9% | Leads by ~144 Elo over GPT-5.2 |
| Humanity's Last Exam | Not published | Leads all frontier models |
| Context window | 128k | 1M (beta) |
| API access at launch | No | Yes ($5/$25 per MTok) |
| Desktop app | Mac only | Mac + Windows |
| Agent teams / parallel | No | Yes |
| Effort controls | No | Four levels (low to max) |
And here is the pro/con breakdown for each.
| | Advantages | Disadvantages |
|---|---|---|
| GPT-5.3-Codex | Highest Terminal-Bench score (77.3%). 25% faster inference. Uses fewer tokens per task. Strong OSWorld score (64.7%). Cybersecurity CTF score of 77.6%. | No API yet. Mac-only app. No agent teams. Smaller context window. Separate from ChatGPT app. No effort controls. |
| Claude Opus 4.6 | Full API on day one. 1M context beta. Agent teams for parallel work. Integrated app on Mac and Windows. Adaptive thinking. Excel and PowerPoint integration. Lower over-refusal rates. | Weaker Terminal-Bench score. SWE-Bench Pro number not disclosed. Higher cost at extended context. Adaptive thinking can overthink simple tasks. |
The installation gap nobody talks about
You want to try Codex right now? Here is what it actually takes.
Download the Codex app from openai.com/codex. Mac only, so Windows and Linux folks are shut out of the desktop app from the start. You will need a paid ChatGPT plan. That is $20 a month minimum for Plus. Log in, and the desktop app works fine. But if you want CLI access, that is a separate npm install. And the IDE extension is yet another installation. That is three separate onboarding headaches before anyone on your team writes a line of code.
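Sketched out, and assuming the CLI still ships as the @openai/codex npm package, the Codex path looks something like this:

```bash
# Step 1: desktop app. Download from openai.com/codex (macOS only) and sign in with a paid ChatGPT plan.
# Step 2: CLI. A separate install; the package name below is an assumption based on OpenAI's published CLI.
npm install -g @openai/codex
codex   # first run walks you through sign-in with your ChatGPT account
# Step 3: IDE extension. Installed separately from your editor's extension marketplace.
```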
Claude Code is one npm command: `npm install -g @anthropic-ai/claude-code`. Authenticate with your API key. Done. Same model in the desktop app on Mac or Windows, same model through the API, same model on the web. One credential, every surface.
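In script form, assuming you go the API-key route rather than the interactive sign-in:

```bash
# One install, one credential, every surface.
npm install -g @anthropic-ai/claude-code
export ANTHROPIC_API_KEY="sk-ant-..."   # or omit this and authenticate on first run
claude   # start a session in the current repo
```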
This is not a minor usability quibble. When a junior developer on your team needs to get set up on Friday afternoon, the tool with fewer friction points wins. OpenAI has a distribution advantage with 500,000 Codex app downloads in the first few days. Anthropic has an integration advantage with a unified experience across every surface.
What the enterprise numbers actually say
Neither company is fighting for hobbyists. The real money sits in enterprise contracts, and the Andreessen Horowitz survey data that dropped this week should keep a few people in San Francisco up at night.
The a16z numbers are striking. Enterprises spent an average of $7 million on LLMs last year, nearly triple the $2.5 million they spent in 2024. Nobody had forecast that kind of acceleration. Next year's projection, $11.6 million per enterprise, looks aggressive until you remember that last year's actual spend already blew past its own forecast by 56%. Cash is flooding in. But OpenAI's slice of that pie keeps shrinking. Its share of the enterprise AI wallet dropped from 62% in 2024 to a projected 53% this year. Anthropic picked up ground, climbing from 14% to 18%.
The more telling number: only 46% of surveyed OpenAI customers are running the company's most capable models in production. For Anthropic, that figure is 75%. For Google, 76%. OpenAI has the biggest install base. Anthropic has the highest conversion rate. If you are a CTO deciding where to place your bet this quarter, that second number should matter more to you than any benchmark score.
OpenAI looks emboldened by the Codex app's half-million downloads in three days. But Anthropic looks confident in a different way, the kind that comes from knowing your paying customers actually use the product they are paying for. The anxiety at OpenAI is not about model quality. The engine is fine. It is about whether the rest of the vehicle can keep up before the enterprise wallet splits further.
Benchmarks measure models. Developers choose tools.
Simon Willison, one of the most closely watched independent voices in AI tooling, wrote on Wednesday that he had preview access to both models and was "finding it hard to find a good angle to write about them." Both are good. Both were preceded by models that were also good. The previous generation already handled most tasks people threw at them.
That is the real story nobody wants to hear. At the frontier, model quality is converging. The differences that determined market share in 2024, raw intelligence and coding accuracy, are compressing into margins that most developers cannot feel in daily use. What they notice instead is friction. Does the tool plug into the stack they already have? Can they get it running on their machine before the standup ends? Those are the questions that win procurement fights now.
OpenAI shipped the faster engine. Anthropic shipped the better car. Both companies are burning billions to get here, OpenAI carrying over $1 trillion in financial obligations to compute backers, Anthropic raising at a $350 billion valuation. Neither can afford to be second choice for long. But the scoreboard that matters is not Terminal-Bench. It is the procurement dashboard at every Fortune 500 company that just tripled its AI budget and now has to decide which tool gets the next purchase order.
Check which one fits your garage.
Frequently Asked Questions
Q: Is GPT-5.3-Codex available through the API?
A: Not yet. OpenAI launched GPT-5.3-Codex through the Codex desktop app, CLI, IDE extension, and web only. API access is planned but has no public date. Claude Opus 4.6, by contrast, shipped with full API access on launch day at $5/$25 per million tokens.
Q: Which model is better for long coding sessions?
A: Claude Opus 4.6 has a 1-million-token context window in beta and context compaction that summarizes older messages automatically. GPT-5.3-Codex has a 128k context window. For multi-hour sessions on large codebases, the context difference is significant.
Q: What does GPT-5.3-Codex's 'high capability' cybersecurity rating mean?
A: OpenAI classified GPT-5.3-Codex as the first model to reach 'high capability' for cybersecurity tasks under its Preparedness Framework. It scored 77.6% on capture-the-flag challenges. OpenAI is gating advanced cybersecurity uses behind a Trusted Access program and committed $10 million in API credits for defensive research.
Q: Can I use Codex on Windows or Linux?
A: The Codex desktop app is macOS only as of launch. The CLI tool and IDE extensions work cross-platform, but require separate installations. Claude Code runs on Mac, Windows, and Linux through a single CLI install and also offers a desktop app on Mac and Windows.
Q: How do enterprise adoption rates compare between OpenAI and Anthropic?
A: According to Andreessen Horowitz survey data, 75% of Anthropic customers use its most capable models in production, compared to 46% for OpenAI. OpenAI holds a larger overall market share (projected 53% in 2026) but that share has dropped from 62% in 2024.