A surprising new study reveals that when AI models stop to think, they actually get worse at following basic instructions. This finding challenges the common belief that more reasoning always leads to better results.
Researchers tested 15 different language models, including GPT-4o-mini, Claude 3.5 Haiku, Claude 3.5 Sonnet, Claude 3.7 Sonnet, and DeepSeek-V3. They found that adding chain-of-thought reasoning - where models explain their thinking step by step - made them less likely to follow simple rules.
The performance drops were often substantial. Llama-3-8B-Instruct's accuracy fell from 75.2% to 59.0% when using reasoning. Even Claude 3.7 Sonnet, an advanced model that typically excels at complex tasks, saw its accuracy slip from 90.6% to 90.2%.
The Scale of the Problem
"It's like asking someone to plan out a task so carefully that they forget the basic requirements," explains the research team at Harvard University and Amazon. "The models focus so much on high-level thinking that they miss straightforward instructions."
The study used two benchmarks: IFEval and ComplexBench. IFEval checks simple, verifiable rules like "don't use commas" or "write exactly 400 words." ComplexBench combines more elaborate instructions that build on each other, closer to real-world tasks.
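IFEval's rules are designed to be automatically verifiable, so checking them takes nothing more than simple string logic. Below is a minimal sketch of what such checks could look like for the two rules mentioned above; the function names are illustrative and not taken from the benchmark's actual code.

```python
import re

def follows_no_comma_rule(text: str) -> bool:
    """Rule: the response must not contain any commas."""
    return "," not in text

def follows_word_count_rule(text: str, target: int = 400) -> bool:
    """Rule: the response must contain exactly `target` words."""
    return len(re.findall(r"\S+", text)) == target

response = "Un veliero bianco scivola sul mare calmo."
print(follows_no_comma_rule(response))         # True
print(follows_word_count_rule(response, 400))  # False: the reply has 7 words
```

Because the rules are this mechanical, any extra commentary or padding in a response can break them immediately.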
The results showed consistent patterns across model sizes. Small models like Llama-3.2-1B-Instruct dropped from 49.0% to 40.7% accuracy. Medium-sized models like Qwen2.5-7B-Instruct fell from 63.6% to 57.7%. Even large models like Llama-3.1-70B-Instruct declined from 85.6% to 77.3%.
Reasoning-focused models showed similar issues. DeepSeek-R1, specifically designed for reasoning tasks, performed worse than its base version DeepSeek-V3 on instruction following. The reasoning version scored 83.3% compared to the base model's 85.2% on IFEval.
The researchers found two main problems. First, models thinking through complex tasks often forgot basic rules. Second, they added unnecessary explanations that broke the original instructions.
A Real-World Example
Take a simple task: Write a haiku in Italian about a yacht. When models just wrote the poem, they stuck to Italian. But when they reasoned through it, they often added English explanations, breaking the "Italian only" rule.
Neural Network Evidence
The attention patterns in the models' neural networks backed this up. When reasoning, models paid less attention to specific instructions and more to their own thought process. The researchers could track this shift by measuring "constraint attention" - how much the model focused on key instruction words.
"We can actually see the models shifting their focus away from the constraints they need to follow," the researchers note. "It's like they get lost in their own thoughts."
This matters because many AI companies push their models to explain their reasoning, assuming it leads to better results. But this research suggests that's not always true.
Solutions Tested
The team tested four ways to fix this problem:
- Showing models examples of good reasoning (few-shot learning)
- Having models check their own work (self-reflection)
- Letting models choose when to use reasoning (self-selective)
- Using a separate AI to decide when reasoning helps (classifier-selective)
The classifier-selective approach worked best. For example, it improved GPT-4o-mini's performance by 5.2 percentage points and Llama-3-8B-Instruct's by 10.7 points. The self-reflection method worked particularly well for larger models, helping Claude 3.7 Sonnet achieve 92.1% accuracy.
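To make the classifier-selective idea concrete, here is a minimal sketch under loose assumptions: a lightweight classifier, represented here by a crude stand-in heuristic rather than the paper's trained model, decides per request whether to append a chain-of-thought cue. All function names and prompt wording are hypothetical.

```python
def reasoning_likely_to_help(task: str) -> bool:
    """Stand-in for a trained classifier that scores whether
    chain-of-thought reasoning will improve the response."""
    multi_step_markers = ["explain", "compare", "analyze", "plan", "derive"]
    return any(marker in task.lower() for marker in multi_step_markers)

def build_prompt(task: str) -> str:
    """Add a reasoning cue only when the classifier predicts it helps;
    otherwise keep the instruction untouched so simple rules stay intact."""
    if reasoning_likely_to_help(task):
        return task + "\n\nThink step by step before answering."
    return task

print(build_prompt("Write a haiku in Italian about a yacht. Italian only."))
print(build_prompt("Compare the two marketing plans and explain the trade-offs."))
```

The point is the routing itself: reasoning becomes an opt-in tool applied where it is predicted to help, rather than a default applied to every request.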
Mixtral-8x7B-Instruct was an outlier. It was one of the few models that performed better with reasoning on IFEval (improving from 53.0% to 56.4%), but it still declined on ComplexBench (dropping from 60.4% to 58.3%).
The findings could change how companies develop and use AI. Instead of always pushing for more reasoning capability, they might need to build systems that know when to think deeply and when to act directly.
Human Parallels
The findings also echo a familiar pattern in human thinking. We often assume that carefully reasoning through every step of a task will lead to better results. But sometimes, overthinking makes us worse at following simple instructions.
For AI companies, the message is clear: Build systems that can match their approach to the task at hand. Sometimes, that means thinking less and doing more.
For users working with AI, the takeaway is equally important: Don't always ask the AI to explain its thinking. For simple tasks, direct instructions might work better.
Future Research
The researchers plan to explore whether this pattern appears in other types of AI tasks. They also want to develop better ways for AI systems to balance reasoning with direct action.
This work opens up new questions about how AI systems should approach different tasks. It challenges the assumption that more thinking always leads to better results, suggesting instead that knowing when to think might matter more than thinking itself.
Why this matters:
- This research flips conventional wisdom about AI reasoning on its head, showing that more sophisticated reasoning isn't always better
- The findings point to a new direction in AI development: building systems that know when to think and when to act directly