Even Top AI Models Fall Short on Physics Problems

Even the most advanced AI models stumble when faced with basic physics problems. A new benchmark called PHYBench reveals these supposedly intelligent systems solve physics problems about as well as a struggling high school student.

The research comes from Professor Wei Chen's team at Peking University. The test puts AI through its paces with 500 carefully crafted physics problems. These range from simple mechanics to head-scratching quantum physics puzzles. The results? Not great. Gemini 2.5 Pro, Google's latest AI powerhouse, managed only 37% accuracy. For comparison, human experts hit nearly 62%.

PHYBench doesn't just check whether answers are right or wrong. It uses a scoring system called Expression Edit Distance (EED) to measure how close a model's final expression is to the correct one. Think of it as partial credit for an answer that is nearly right. Even here, the gap between human and machine remains stark: humans scored 70.4 on the EED scale, while Gemini 2.5 Pro managed 49.5.
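To make the partial-credit idea concrete, here is a minimal sketch in Python. It is not PHYBench's actual EED implementation: it compares flat token lists with a plain Levenshtein distance, and the function names (`tokenize`, `edit_distance`, `similarity_score`) and the 0-to-100 scaling are assumptions chosen for illustration only.

```python
import re


def tokenize(expr: str) -> list[str]:
    """Split an expression into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+\.?\d*|\S", expr)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two token lists (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity_score(answer: str, reference: str) -> float:
    """Hypothetical 0-100 score: 100 for an exact match, less as edits pile up."""
    ref_tokens = tokenize(reference)
    if not ref_tokens:
        raise ValueError("reference expression must not be empty")
    relative = edit_distance(tokenize(answer), ref_tokens) / len(ref_tokens)
    return 100.0 * max(0.0, 1.0 - relative)


if __name__ == "__main__":
    reference = "m*g*h"
    for answer in ("m*g*h", "2*m*g*h", "x"):
        print(answer, similarity_score(answer, reference))
```

Under right/wrong scoring, `2*m*g*h` and `x` would both count as zero. A distance-based score separates the near miss from the answer that shares nothing with the reference, which is the behavior the article describes.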

How the Test Works

The problems in PHYBench are purely text-based. No diagrams, no graphs – just words describing physical scenarios. AI must figure out the forces at play and translate them into mathematical expressions. It's like asking someone to picture a game of pool and predict where the balls will go without seeing the table.

The benchmark emerged from a rigorous development process. A team of 178 physics students helped refine the problems, while 109 human experts validated the final set. This ensures the test measures real physics understanding, not just pattern matching.

Where AI Falls Short

The results expose two major weaknesses in AI. First, physical perception – the ability to understand how objects interact in the real world. Second, robust reasoning – the capacity to turn that understanding into correct mathematical expressions. AI often identifies the right physics principles but applies them incorrectly, like knowing the rules of chess but making illegal moves.

These shortcomings show up across all physics domains, but some areas prove particularly challenging. Thermodynamics and advanced physics concepts give AI the most trouble. It's as if the models hit a wall when physics gets more abstract.

The findings carry weight beyond physics. They suggest current AI systems, despite their impressive abilities in language and pattern recognition, lack fundamental reasoning capabilities we take for granted in humans. This gap matters for any field requiring precise logical thinking.

Traditional AI tests often use simplified problems with yes/no answers. PHYBench raises the bar by demanding exact symbolic solutions. This approach reveals subtle differences between models that might look equally capable on simpler tests.

A More Efficient Way to Test

The benchmark's scoring system proves remarkably efficient. The EED score can distinguish between AI models using far fewer test problems than traditional right/wrong scoring. This efficiency makes PHYBench a powerful tool for measuring progress in AI reasoning.
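One generic way to read "distinguish between AI models" is to compare the gap between two models' mean per-problem scores with the statistical uncertainty of that gap. The article does not describe the statistical procedure used in the paper, so the sketch below is only an assumption-laden illustration: `separation` is a hypothetical helper, and the score lists are made-up toy values, not benchmark results.

```python
from math import sqrt
from statistics import mean, variance


def separation(scores_a: list[float], scores_b: list[float]) -> tuple[float, float]:
    """Return (difference of mean scores, standard error of that difference)."""
    diff = mean(scores_a) - mean(scores_b)
    se = sqrt(variance(scores_a) / len(scores_a)
              + variance(scores_b) / len(scores_b))
    return diff, se


if __name__ == "__main__":
    # Toy, invented per-problem scores for two hypothetical models.
    binary_a, binary_b = [100, 0, 100, 0], [0, 0, 100, 0]
    graded_a, graded_b = [100, 40, 100, 50], [60, 30, 100, 20]
    print("binary:", separation(binary_a, binary_b))
    print("graded:", separation(graded_a, graded_b))
```

The standard error shrinks as more problems are scored, so a scoring scheme whose per-problem scores vary less needs fewer problems to show a gap of a given size.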

The Road Ahead

Looking ahead, PHYBench sets clear goals for AI development. Future models need better ways to represent physical concepts internally. They must learn to derive relationships from first principles rather than memorizing patterns from training data.

Why this matters:

The gap between AI and human physics understanding remains massive, suggesting current AI systems lack true reasoning capabilities.
This benchmark gives us a clear way to measure progress in AI's ability to understand the physical world – a crucial step toward more capable and reliable systems.

Read on, my dear:

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Maria Garcia
