AI Coding Tools Make Experts Slower, Not Faster, New Study Finds
Experienced developers work 19% slower with AI coding tools but think they're 20% faster. New study challenges AI's flagship use case and shows why self-reported productivity gains can't be trusted.
🚨 Experienced developers work 19% slower with AI coding tools like Cursor Pro, despite believing they work 20% faster - a 39-point perception gap.
📊 METR studied 16 developers across 246 real coding tasks on mature repositories averaging 23,000 stars and 1 million lines of code.
🎯 Developers accepted fewer than 44% of AI-generated code suggestions and spent 9% of their time just reviewing and cleaning AI output.
🔍 The slowdown hit experienced programmers on large, complex projects they knew well - AI's weakest scenario rather than its strongest.
💡 Results challenge AI's flagship use case and suggest coding benchmarks may overestimate real-world productivity gains.
🌍 Findings question reliability of self-reported AI productivity surveys used to measure workplace AI impact across industries.
Experienced software developers using AI coding tools work 19% slower than those coding without AI assistance, according to a new randomized controlled trial from AI research nonprofit METR. The twist? The developers believed they were working 20% faster.
METR recruited 16 experienced open-source developers to work on 246 real tasks across mature repositories averaging 23,000 stars and over 1 million lines of code. These weren't junior programmers learning the ropes - the developers had an average of five years' experience working on their specific repositories and had made roughly 1,500 commits each.
The Perception Problem
The disconnect between reality and perception proved striking. Before starting work, developers forecasted AI would speed them up by 24%. After completing their tasks, they estimated AI had accelerated their work by 20%. The actual measurement showed a 19% slowdown.
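The arithmetic behind the "39-point perception gap" cited in the summary is straightforward. The snippet below is a back-of-the-envelope restatement of the reported figures, not METR's analysis code:

```python
# Study figures: positive values mean "faster with AI", negative means slower.
forecast_speedup = 0.24    # developers' prediction before starting
perceived_speedup = 0.20   # developers' estimate after finishing
measured_speedup = -0.19   # the actual result: a 19% slowdown

# Perception gap in percentage points between felt and measured impact.
gap = (perceived_speedup - measured_speedup) * 100
print(f"Perception gap: {gap:.0f} percentage points")  # -> 39
```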
This wasn't a small-sample study with artificial tasks. Researchers randomly assigned each real-world coding task to either allow AI tools (primarily Cursor Pro with Claude 3.5/3.7 Sonnet) or prohibit them entirely. They recorded developers' screens and tracked time spent with granular precision.
The methodology addressed common criticisms of productivity studies. Tasks were defined before randomization, preventing developers from gaming the system by choosing easier work for the AI-restricted sessions. Multiple outcome measures and statistical approaches all pointed to the same conclusion: experienced developers moved slower when AI was available.
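For readers who want the design in concrete terms, here is a minimal sketch of how per-task randomization and a time-based slowdown estimate fit together. The function names and toy numbers are hypothetical, and METR's actual estimator works on detailed time logs rather than a simple comparison of means:

```python
import random
from statistics import mean

def assign_conditions(task_ids, seed=0):
    """Randomly assign each pre-defined task to an AI-allowed or AI-disallowed
    condition before any work starts, so developers can't steer easier tasks
    toward one arm of the trial."""
    rng = random.Random(seed)
    return {task_id: rng.choice(["ai_allowed", "ai_disallowed"]) for task_id in task_ids}

def estimated_slowdown(hours_with_ai, hours_without_ai):
    """Relative change in mean completion time when AI tools are allowed.
    Positive values mean AI-allowed tasks took longer."""
    return mean(hours_with_ai) / mean(hours_without_ai) - 1

# Toy completion times (hours), chosen only to mirror the reported ~19% slowdown.
hours_with_ai = [1.1, 1.2, 1.27]
hours_without_ai = [0.9, 1.0, 1.1]
print(f"Estimated slowdown: {estimated_slowdown(hours_with_ai, hours_without_ai):+.0%}")
```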
Why Smart Developers Got Fooled
The study revealed how developers actually spent their time. With AI tools available, they spent less time actively coding and searching for information. Instead, they devoted significant time to prompting AI systems, waiting for responses, and reviewing AI-generated output. Screen recordings showed developers spent roughly 9% of their AI-assisted time just reviewing and cleaning up AI suggestions.
The quality issue proved significant. Developers accepted fewer than 44% of AI-generated code suggestions. Even accepted code often required substantial cleanup - 75% of developers reported reading every line of AI output, and 56% said they frequently made major changes to AI-generated code.
One developer noted the AI "made some weird changes in other parts of the code that cost me time to find and remove." Another observed that AI "doesn't pick the right location to make the edits" and lacks crucial context about backward compatibility requirements.
The Expertise Trap
The study identified several factors contributing to the slowdown. Experienced developers working on familiar, complex codebases represent AI's weak spot. These programmers already work efficiently and possess deep knowledge about their systems' quirks and requirements. AI tools, by contrast, approach each task like "a new contributor to the repository," lacking institutional memory and contextual understanding.
The size and complexity of the repositories mattered. Projects averaging 10 years old with over 1 million lines of code present challenges that simple coding benchmarks don't capture. AI models trained on smaller, more isolated examples struggle with the messy reality of large, interconnected systems.
Implications for the AI Hype Cycle
These findings carry implications beyond coding productivity. The study suggests that widely adopted survey methods for measuring AI impact may be fundamentally flawed. As METR researchers note, "It is sometimes proposed that we should monitor AI R&D acceleration inside of frontier AI labs via simple employee surveys. We're now more pessimistic about these."
The research arrives as AI companies tout impressive benchmark scores and widespread adoption of coding tools. But the gap between controlled benchmarks and real-world productivity mirrors broader questions about AI capabilities. Benchmarks often use self-contained tasks with algorithmic scoring - a far cry from the messy, context-dependent work of maintaining large software systems.
The timing is particularly relevant as AI companies increasingly focus on coding as a path to artificial general intelligence. If AI struggles to speed up experienced programmers on familiar tasks, the timeline for AI systems revolutionizing software development may be longer than anticipated.
The Bigger Picture
The study doesn't demolish the case for AI coding tools entirely. The researchers emphasize their findings apply specifically to experienced developers working on large, mature projects. Different scenarios - junior developers, smaller projects, or unfamiliar codebases - might yield different results.
The perception gap remains the most troubling finding. If experienced developers consistently misjudge AI's impact on their productivity, broader surveys about AI's workplace effects become questionable. The phenomenon resembles other productivity paradoxes where new technologies feel helpful while delivering minimal measurable benefits.
The research team plans additional studies across different settings. But for now, the results suggest caution about extrapolating from impressive AI demonstrations to real-world productivity gains.
Why this matters:
• The flagship AI use case may be oversold - if experienced developers move slower with AI tools, claims about AI revolutionizing programming deserve skepticism.
• Self-reporting productivity gains is unreliable - the 39-percentage-point gap between perceived and actual impact suggests people can't accurately judge AI's effect on their work, undermining survey-based assessments of AI adoption.
❓ Frequently Asked Questions
Q: How much AI experience did these developers have before the study?
A: 93% had used web-based LLMs like ChatGPT, but only 44% had experience with Cursor (the main tool tested). Most had tens to hundreds of hours using AI tools generally. All developers received basic Cursor training at the start, but researchers noted this might not represent optimal usage.
Q: What AI tools did the developers actually use?
A: Primarily Cursor Pro with Claude 3.5 and 3.7 Sonnet models. Developers could choose any AI tools when allowed, including ChatGPT, but Cursor Pro was most common. The study ran February-June 2025, using state-of-the-art models available then.
Q: How long did this study actually take?
A: The study ran from February to June 2025, with developers completing an average of 15 tasks each. Individual tasks averaged 2 hours to complete. Developers worked on about 20-40 hours of coding total, spread over several months at their own pace.
Q: Were these developers paid, and could that affect the results?
A: Yes, developers received $150 per hour. Researchers chose hourly pay specifically to avoid incentivizing rushed work or gaming the system. Bonus payments based on task completion could have encouraged breaking tasks into smaller pieces or lowering quality standards.
Q: What kinds of coding tasks did developers work on?
A: Real issues from their own repositories - bug fixes, new features, refactoring, and exploration tasks. Tasks ranged from minutes to hours, with most under 2 hours. Developers selected issues they would normally work on, not artificial benchmark problems.
Q: Did any individual developers actually get faster with AI?
A: Yes, but only 25% of developers experienced any speedup. The one developer with over 50 hours of Cursor experience showed positive results, suggesting significant learning curves may exist. However, 75% of developers were slowed down by AI tools.
Q: How does this compare to other studies showing AI speeds up coding?
A: Previous studies found 21-56% speedups, but used artificial tasks, junior developers, or measured output quantity rather than completion time. This study used experienced developers on real work with fixed outcomes, which may explain the different results.
Q: Would results be different for junior developers or smaller projects?
A: Likely yes. Researchers expect AI tools help more on smaller projects, with less experienced developers, or in unfamiliar codebases. This study focused specifically on experienced developers working on large, mature projects they knew well - AI's hardest use case.