💡 TL;DR - The 30 Seconds Version
🧠 Apple researchers found that AI "thinking" models like OpenAI's o3-mini and Claude collapse completely beyond certain complexity levels, reducing their reasoning effort just as problems get harder.
📊 Tested on four puzzle types with 25 samples per complexity level, models showed three zones: simple tasks favor standard models, medium complexity favors thinking models, and high complexity destroys both.
⚙️ Even with complete algorithms provided, models still failed at identical thresholds, revealing problems with step execution rather than solution discovery.
🎯 Claude 3.7 Sonnet Thinking drops from around 20,000 reasoning tokens on simple puzzles to around 5,000 on harder ones, despite a 64,000-token budget.
🔍 Traditional math benchmarks mislead because models memorized training data—they perform worse on AIME 2025 than 2024 despite humans finding it easier.
🚫 Current AI reasoning hits architectural limits that more compute can't fix, requiring new approaches for reliable logical consistency at scale.
Apple researchers found that AI's latest "thinking" models—like OpenAI's o3-mini and Claude's reasoning variants—collapse completely when problems cross certain complexity thresholds. More surprising: these models actually think less as problems get harder, despite having computational power to spare.
The study tested models like DeepSeek-R1, Claude 3.7 Sonnet Thinking, and OpenAI's o3-mini on controlled puzzles instead of traditional math tests. Unlike established benchmarks that suffer from data contamination, these puzzle environments let researchers manipulate complexity precisely while tracking each step of the AI's reasoning process.
The results reveal troubling gaps in current AI reasoning capabilities. These models, trained with reinforcement learning to generate detailed thought processes before answering, still fail to develop true problem-solving skills. When complexity crosses a threshold, accuracy drops to zero across all tested models.
The researchers identified three distinct performance zones that challenge assumptions about AI reasoning capabilities.
On simple problems, standard language models often outperform their "thinking" counterparts. The extra reasoning steps waste computational resources without improving results. At moderate complexity, thinking models gain an edge by working through problems step-by-step. But at high complexity, both model types fail completely.
This pattern breaks conventional wisdom about scaling AI capabilities. More thinking doesn't always mean better results—sometimes it makes things worse.
The algorithm paradox
Even when researchers gave complete solution algorithms, the models still failed at identical complexity thresholds. This suggests the problem isn't finding solutions but executing logical steps consistently.
Consider the Tower of Hanoi puzzle. When given the exact recursive algorithm to solve it, models performed no better than when working from scratch. They couldn't follow their own prescribed steps, revealing gaps in logical execution rather than creative problem-solving.
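For context, the standard recursive solution is only a few lines long. The sketch below is a generic Python version of that algorithm, not the exact prompt text the researchers supplied to the models.

```python
def hanoi(n, source, target, spare, moves):
    """Append the minimal move sequence for n disks onto `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # move the top n-1 disks out of the way
    moves.append((source, target))              # move the largest disk directly
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top

moves = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 255 moves for 8 disks, i.e. 2**8 - 1
```

Following this procedure requires no creativity at all, which is what makes the failure notable: the models break down on execution, not on discovery.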
Reasoning effort drops when needed most
The most counterintuitive finding: models reduce their reasoning tokens as problems become more complex, exactly when more thinking should help. This happens well before hitting computational limits.
For simple Tower of Hanoi puzzles, Claude 3.7 Sonnet Thinking might use 20,000 tokens. But for harder versions requiring more moves, it drops to 5,000 tokens despite having 64,000 available. The model seems to give up rather than persist through difficulty.
The overthinking trap
Analysis of the models' internal reasoning reveals another problem: inefficient exploration patterns. On simple problems, models often find correct solutions early but continue exploring wrong paths—a phenomenon researchers call "overthinking."
For medium complexity problems, the pattern reverses. Models explore many incorrect solutions before stumbling on correct ones late in their reasoning process. Beyond a certain threshold, they find no correct solutions at all.
This suggests these models lack effective self-correction mechanisms. They can't distinguish promising reasoning paths from dead ends, wasting computational resources on fruitless exploration.
Real-world implications
These limitations have serious implications for deploying reasoning models in critical applications. If models consistently fail beyond certain complexity thresholds and can't follow explicit algorithms reliably, they're unsuitable for tasks requiring guaranteed logical consistency.
The findings also question whether current training approaches can produce genuinely generalizable reasoning. Despite sophisticated reinforcement learning techniques, these models appear to rely on pattern matching rather than developing robust logical reasoning capabilities.
The contamination problem
Traditional AI benchmarks suffer from data contamination—models perform well because they've seen similar problems during training. The researchers' puzzle approach sidesteps this issue by using controllable environments with systematic complexity scaling.
Mathematical benchmarks like AIME show suspicious patterns. Models perform worse on AIME 2025 than AIME 2024, despite humans finding the newer test easier. This suggests training data leakage rather than genuine reasoning improvement.
Beyond current limits
The study exposes gaps between current AI capabilities and human-like reasoning. While these models excel at pattern recognition and can handle moderate complexity tasks, they lack the persistent, systematic thinking required for complex problem-solving.
Current approaches may have hit architectural limits. Simply scaling model size or training data won't solve the logical consistency problems revealed by this research. New training methods or architectures may be needed to achieve reliable reasoning at scale.
The models' inability to benefit from explicit algorithms particularly concerns researchers. If an AI can't follow step-by-step instructions reliably, it cannot serve as a dependable reasoning partner for complex tasks.
Why this matters:
- AI reasoning models hit hard limits that more compute can't solve—they think less, not more, when problems get complex
- These failures happen even when given explicit solution steps, revealing gaps in logical execution rather than creative problem-solving
Read on, my dear:
Apple Research: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
❓ Frequently Asked Questions
Q: Which AI models did this study test?
A: Apple researchers tested Claude 3.7 Sonnet Thinking, DeepSeek-R1, DeepSeek-R1-Qwen-32B, and OpenAI's o3-mini in medium and high configurations. They compared these "thinking" models against their standard counterparts like Claude 3.7 Sonnet (non-thinking) and DeepSeek-V3 to measure the actual impact of reasoning mechanisms.
Q: What puzzles did researchers use instead of math tests?
A: Four puzzle types: Tower of Hanoi (moving disks between pegs), Checker Jumping (swapping red and blue pieces), River Crossing (transporting people safely across a river), and Blocks World (rearranging block stacks). These puzzles scale complexity precisely: Tower of Hanoi with 8 disks requires 255 moves, while 15 disks need 32,767 moves.
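Those figures come from the standard minimum-move formula for Tower of Hanoi, 2^n − 1 for n disks, so complexity can be dialed up exponentially just by adding disks. A quick check, assuming nothing beyond that formula:

```python
# Minimum moves for an n-disk Tower of Hanoi: 2**n - 1
for n in (5, 8, 15):
    print(f"{n} disks -> {2**n - 1} moves")  # 31, 255, 32767
```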
Q: How much computational power do these models waste on simple problems?
A: On simple puzzles, thinking models often use 10-20x more tokens than standard models while performing worse. For basic Tower of Hanoi problems, Claude 3.7 Sonnet Thinking might use 15,000 tokens where the non-thinking version uses 1,500 tokens and gets the right answer more often.
Q: Why do traditional AI benchmarks give misleading results?
A: Data contamination. Models perform better on AIME 2024 than AIME 2025, despite humans finding the 2025 test easier. This suggests these models memorized solutions during training rather than developing real reasoning skills. Controlled puzzles avoid this problem by using novel configurations not in training data.
Q: What happens when AI models are given the exact algorithm to solve problems?
A: They still fail at the same complexity thresholds. Researchers provided complete step-by-step algorithms for Tower of Hanoi puzzles, but models couldn't execute them reliably. This reveals the core problem isn't finding solutions—it's following logical steps consistently, even when explicitly told what to do.
Q: How many attempts did researchers test for each puzzle?
A: Researchers generated 25 samples per puzzle at each complexity level and reported average performance. They tested across multiple difficulty levels—Tower of Hanoi from 1 to 20 disks, Blocks World up to 50 blocks. This extensive testing ruled out random chance and showed consistent failure patterns across all reasoning models.
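As a rough sketch of that protocol (not the paper's actual harness; `solve_attempt` is a hypothetical stand-in for a single model run that returns whether the puzzle was solved), accuracy at each complexity level is simply the fraction of the 25 samples solved:

```python
# Hypothetical sketch: average solve rate over repeated samples per complexity level.
def evaluate(solve_attempt, complexity_levels, samples=25):
    results = {}
    for n in complexity_levels:
        solved = sum(bool(solve_attempt(n)) for _ in range(samples))
        results[n] = solved / samples  # accuracy at this complexity level
    return results

# Example: Tower of Hanoi from 1 to 20 disks, as described above.
# accuracies = evaluate(my_model_attempt, range(1, 21))
```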
Q: Can these AI models solve any complex reasoning tasks reliably?
A: Only up to moderate complexity levels. Claude 3.7 Sonnet Thinking achieves near-perfect accuracy on Tower of Hanoi with 5 disks (31 moves) but fails completely at 10+ disks. For River Crossing puzzles, it handles 2-3 pairs well but collapses with more participants, despite shorter solution lengths.
Q: Do different puzzle types reveal different AI weaknesses?
A: Yes, dramatically. Models can handle 100+ correct moves in Tower of Hanoi but fail after just 4 moves in River Crossing. This suggests some puzzle types appear more frequently in training data than others, revealing that these models rely on memorized patterns rather than general reasoning skills.