The number landed quietly inside a 423-page PDF that Stanford released Monday morning. Top American AI model, 1,503 points on the Arena Leaderboard. Top Chinese AI model, 1,464. The gap between Anthropic's Claude Opus 4.6 and ByteDance's Dola-Seed-2.0 Preview is 39 Elo points. In percentage terms, 2.7%.
That is the US lead in artificial intelligence as of March 2026, according to the 2026 edition of Stanford HAI's AI Index, the most-cited scorecard on the state of the technology. Fourteen months ago, when DeepSeek-R1 shipped from a Hangzhou lab nobody had heard of, the gap was already just five points. US and Chinese models have traded the top spot multiple times since. The performance race is over. Stanford's own framing calls it "effectively closed."
Then comes the part nobody is putting in their headline. The Chinese labs that closed the gap are walking away from the playbook that got them there. DeepSeek's long-awaited V4 model, the release that was supposed to extend their run, has gone missing. Alibaba and Zhipu have started locking down the source of their flagship models. The scoreboard just certified Chinese parity in the same week Chinese labs quit playing the game that produced it.
Key Takeaways
- Stanford's 2026 AI Index puts the US lead over China at 2.7%. Stanford's own framing calls the performance race "effectively closed."
- The top six Arena labs cluster inside a 79-Elo-point band. Four are American, two are Chinese.
- Alibaba and Zhipu are pivoting flagship models (Qwen3.6-Plus, GLM-5-Turbo) to closed hosted offerings, ending the open-weight playbook that got them here.
- DeepSeek V4 has slipped twice. Reuters reports it is now targeting late April on Huawei's Ascend 950PR chips instead of Nvidia silicon.
What Stanford actually measured
Start with the Arena Leaderboard. Users pit two anonymous models against each other, pick the winner, and the rankings emerge from millions of these votes. It is imperfect. The Index itself flags the method as gameable, noting researchers have shown that additional Arena-style interaction data can train a model's way up the ladder. You can study for the test.
Set that caveat aside for a moment. On the other benchmarks Stanford tracks, the convergence is even sharper. On GPQA Diamond, the graduate-level science exam where an expert human scores 81.2%, the top 2025 model hit 93% mean accuracy. On SWE-bench Verified, the canonical real-world coding test, top models went from meeting 60% of the human baseline to nearly 100% in a single year. On OSWorld, a desktop-agent benchmark where top scores sat near 12% a year ago, models jumped to 66.3% and now sit within six points of human performance.
The point is not which lab leads on which specific test. The point is that at the frontier, everyone is bunched up. The top cluster on the Arena Leaderboard holds four American labs and two Chinese ones inside a 79-point spread. Anthropic, xAI, Google, and OpenAI are the top four, inside 25 points of each other. ByteDance's Dola-Seed and Alibaba's Qwen sit just below. On any given benchmark, the order flips.
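For intuition about what those spreads mean head-to-head, the standard logistic Elo model converts a rating gap into an expected win rate. A minimal sketch in Python, using the gaps above (Arena's actual rating method is a close statistical cousin of this formula):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that A beats B under the standard
    logistic Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# The 39-point US-China gap: Claude Opus 4.6 vs Dola-Seed-2.0 Preview.
print(f"{elo_win_probability(1503, 1464):.1%}")  # 55.6%

# The 79-point spread across the entire top-six cluster.
print(f"{elo_win_probability(1503, 1424):.1%}")  # 61.2%
```

A 39-point lead means the top American model wins a blind matchup about 56 times in 100. Even across the full top-six band, the leader takes barely six in ten. That is what "bunched up" looks like in the raw numbers.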
The commodity frontier
Here is what that means for anyone trying to bet on one lab over another. Capability is no longer the differentiator. The Stanford authors state this directly in the opening of the Technical Performance chapter. With capability no longer a clear differentiator, they write, competitive pressure is shifting toward cost, reliability, and real-world usefulness.
For a decade, "frontier lab" meant three firms in California and one in London. The 2026 Index has six, and the scoreboard rotates. Anthropic leads the Arena this week. Last week it was Google. A month ago OpenAI. Before that DeepSeek held the top for 48 hours. The scoreboard still runs. It just stopped sorting anybody.
Washington is quietly anxious about this. The export-control apparatus built during the first Trump term and hardened under Biden assumed a durable US technical lead. The premise of the chip ban was that locking down Nvidia silicon would keep China a generation behind at the model layer. Stanford just certified that China is not a generation behind. China is less than three percent behind, and only on evaluations that Stanford itself says are saturating so fast they have stopped meaning what they were designed to mean.
Caught up with open source. Quietly walking away from it.
Now the complication. The reason China closed the gap was not stolen secrets or a breakthrough nobody saw coming. It was DeepSeek, and DeepSeek was open.
When R1 shipped in January 2025, it matched OpenAI's top reasoning model, used a fraction of the compute to train, and released weights and training recipes into the open. Stanford's Index credits the release with triggering a temporary one-trillion-dollar drop in US tech stocks. That was the moment the investment thesis of the frontier labs came into real question for the first time.
The rest of the Chinese ecosystem spent 2025 trying to replicate DeepSeek's impact with open-weight releases of their own. Qwen. Zhipu's GLM series. MiniMax. Moonshot. August 2025 was the high-water mark: the open-closed gap on Arena hit half a percentage point, the closest it has ever come.
That was the peak. The gap reopened through 2025 as frontier closed labs pushed ahead, and Stanford now reports it at 3.4%. Six of the top ten Arena models are closed-weight. Here is what the Index could not print because it happened last week: Alibaba's Qwen team just launched Qwen3.6-Plus and Qwen3.5-Omni as closed hosted offerings on Alibaba Cloud. Zhipu rolled out GLM-5-Turbo as closed source. ChinaTalk reported that the two leading Chinese labs are abandoning the open-source strategy that made them globally famous fifteen months ago.
Zhipu's pricing tells you why. The company now charges $1.40 per million input tokens and $4.40 per million output tokens, according to South China Morning Post. Anthropic's Claude Opus 4.6 costs $5 and $25 for the same tier. That is a pricing ladder, not an idealist manifesto. Chinese labs caught up on capability, and now they want to get paid for it.
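For scale, here is what those list prices imply for a month of usage. The workload mix below is an illustrative assumption, not a figure from the Index or either vendor:

```python
# Hypothetical monthly workload: 200M input tokens, 50M output tokens.
INPUT_M, OUTPUT_M = 200, 50

# List prices per million tokens (input, output), as reported above.
prices = {
    "Zhipu GLM-5-Turbo": (1.40, 4.40),
    "Claude Opus 4.6": (5.00, 25.00),
}

for model, (p_in, p_out) in prices.items():
    print(f"{model}: ${INPUT_M * p_in + OUTPUT_M * p_out:,.0f}/month")
# Zhipu GLM-5-Turbo: $500/month
# Claude Opus 4.6: $2,250/month
```

Same workload, roughly 4.5 times apart. Cheap enough to win volume, expensive enough to be a business.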
The DeepSeek silence
Which brings us to the company that started all of this, and the fact that it has gone quiet. DeepSeek's V4 model has been expected for weeks. Reuters reported on April 4, citing The Information, that V4 has slipped twice and is now targeting the back half of April on Huawei's Ascend 950PR. TrendForce confirmed that Alibaba, ByteDance, and Tencent have placed bulk orders for hundreds of thousands of those chips. Nothing has shipped.
The straightforward read is that V4 is being trained on domestic silicon rather than Nvidia, and the engineering is harder than DeepSeek expected. The subtler read is that the lab that punched above its weight by giving everything away has hit the point where the next move is either stay open and fall behind, or go closed and join the rest of the pack. Whichever it chooses, the DeepSeek of January 2025 is not coming back.
The lab that gave away its playbook fifteen months ago has gone silent at the exact moment the scoreboard certified it worked. This is not a lab that is usually quiet.
What's left of the US lead
The honest answer, if you are reading the Index for investment signal rather than for bragging rights, is that the US lead now lives outside the model layer. Stanford's own chapter on research and development, which sits separately from the Technical Performance data, shows the US still leads on capital, on infrastructure buildout, and on access to top-tier Nvidia silicon. Anthropic closed a $30 billion Series G in February at a $380 billion post-money valuation, led by GIC and Coatue with Founders Fund, ICONIQ, and MGX co-leading. Chinese frontier labs have no comparable check to cash. Gulf money has pushed roughly $100 million into MiniMax and Zhipu combined, according to ChinaTalk, against roughly $15 billion into Anthropic and OpenAI.
That is a real gap, and it will outlast any one benchmark number. But it is a gap about who can afford to pay for the next training run, not about which country can build the better model. The top-of-stack race is over. The commercialization race just started.
For executives trying to operationalize the 2026 Index next quarter, the procurement question has to change. It is no longer "which model tops benchmark X." The top six labs cluster inside a narrow Elo band on every leaderboard that matters, and two of those labs are Chinese, and one of those two is sovereign-risk-coded for half the world's buyers. The new question is which model has the best cost curve, the best reliability in your specific workflow, the best fit for your domain, and the best vendor story for your legal team. The benchmark has stopped doing the procurement work for you.
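One way to make that concrete is to replace the single benchmark number with a weighted scorecard. A minimal sketch; the criteria are the ones named above, but the weights and the two vendors' scores are placeholder assumptions a buyer would replace with their own:

```python
# Criteria from the paragraph above; weights are placeholder assumptions.
WEIGHTS = {"cost_curve": 0.30, "reliability": 0.30,
           "domain_fit": 0.25, "vendor_risk": 0.15}

def procurement_score(vendor: dict) -> float:
    """Weighted sum of normalized (0-1) criterion scores."""
    return sum(WEIGHTS[criterion] * vendor[criterion] for criterion in WEIGHTS)

# Entirely hypothetical vendors: a pricier incumbent vs. a cheaper
# challenger whose legal story is harder to clear.
incumbent  = {"cost_curve": 0.6, "reliability": 0.9, "domain_fit": 0.8, "vendor_risk": 0.9}
challenger = {"cost_curve": 0.9, "reliability": 0.7, "domain_fit": 0.7, "vendor_risk": 0.4}

print(procurement_score(incumbent), procurement_score(challenger))  # 0.785 vs 0.715
```

The point is not the specific numbers. It is that once capability clusters, the decision lives in the weights, and the weights are yours to set.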
What the scoreboard cannot tell you
DeepSeek V4 was supposed to ship this month. It has not. The lab that made Washington flinch fifteen months ago is just sitting there. Alibaba and Zhipu are quietly taking their flagship models closed. American incumbents stretch their Series-G runways. Whatever DeepSeek does next is the story. If the model lands on Huawei silicon, the 2026 AI Index becomes the year China caught up. If the silence holds, it becomes the year China plateaued.
Stanford handed you the scoreboard. It cannot tell you what the next quarter looks like. Check back for the 2027 edition. The interesting numbers will not be on the leaderboard.
Frequently Asked Questions
How big is the US-China AI performance gap in the 2026 AI Index?
The top American model (Anthropic's Claude Opus 4.6) leads the top Chinese model (ByteDance's Dola-Seed-2.0 Preview) by 39 Elo points on the Arena Leaderboard, or 2.7%. Fourteen months ago the gap was five points. Stanford's own framing says the performance race is "effectively closed."
Why are Chinese AI labs going closed source?
Monetization. ChinaTalk reported that Alibaba launched Qwen3.6-Plus and Qwen3.5-Omni as hosted offerings on Alibaba Cloud while Zhipu rolled out GLM-5-Turbo as closed source. Zhipu now charges $1.40 per million input tokens and $4.40 per million output tokens, still well below Anthropic's Claude Opus 4.6 at $5 and $25, but a long way from free.
What happened to DeepSeek V4?
It has slipped twice. Reuters reported on April 4, citing The Information, that V4 is now targeting the back half of April and will run on Huawei's Ascend 950PR chips rather than Nvidia silicon. TrendForce reports Alibaba, ByteDance, and Tencent have placed bulk orders for hundreds of thousands of those chips in preparation.
Is the Arena Leaderboard even a reliable measure?
The Index itself flags it as gameable. Cited research shows that additional Arena-style interaction data can improve performance on Arena-derived evaluations, meaning leaderboard standing may partly reflect platform adaptation rather than general capability. Benchmarks are also saturating fast: Humanity's Last Exam went from under 10% to 38.3% in a single year.
Where does the US still lead?
Capital, infrastructure, and access to top-tier Nvidia silicon. Anthropic closed a $30 billion Series G in February at a $380 billion post-money valuation. Per ChinaTalk, Gulf money has put roughly $100 million into MiniMax and Zhipu combined against roughly $15 billion into Anthropic and OpenAI. The remaining gap is financial, not technical.