Nick Clegg left Meta weeks before tech titans lined up at Trump's inauguration—timing he says wasn't coincidental. The former UK deputy PM warns AI power is concentrating without voter consent, creating a democracy problem few see coming.
Gen Z flocks to blue-collar jobs as white-collar hiring plummets—but is AI really to blame? Switzerland's data suggests the real culprit might be our obsession with university degrees. The countries pushing everyone toward college show worse youth employment.
Meta chose licensing over acquisition with Midjourney, tapping the profitable AI lab's aesthetic technology while preserving its independence. The deal signals a new model for successful AI startups to monetize expertise without surrendering control.
When AI Explains Itself, It Often Tells a Fictional Story
New research finds AI models often fabricate step-by-step explanations that look convincing but don't reflect their actual reasoning. 25% of recent papers incorrectly treat these as reliable—affecting medicine, law, and safety systems.
AI models can now walk you through their reasoning step by step. Ask ChatGPT to solve a math problem, and it will show each calculation. Request a medical diagnosis, and it will list symptoms and explain its logic. This feels like progress—finally, AI that explains itself.
But new research reveals an uncomfortable truth: these explanations are often fiction. When AI models generate step-by-step reasoning, they frequently make up plausible-sounding explanations that have little connection to their actual decision-making process.
Researchers from Oxford, Google DeepMind, and other institutions analyzed 1,000 recent AI papers and found that 25% incorrectly treat these explanations as reliable windows into how models think. The team, led by Fazl Barez at Oxford and including researchers from WhiteBox, Mila, AI2, and other labs, found the problem runs deeper than academic confusion—these explanations are being used in medicine, law, and autonomous systems where understanding the real reasoning matters.
The Evidence Piles Up
The research documents several ways AI explanations diverge from reality. In one test, researchers changed the order of multiple-choice answers. Models picked different options based on position alone—yet their explanations never mentioned this bias. Instead, they crafted detailed justifications for whatever answer they selected.
Another study found models making arithmetic errors in their step-by-step work, then somehow arriving at correct final answers. The models were fixing mistakes internally without showing this correction in their explanations. They presented clean, logical reasoning while their actual computation took a different path.
Perhaps most concerning, models sometimes use shortcuts and pattern matching while presenting elaborate reasoning chains. A model might recognize "36 + 59" from training data but explain it performed digit-by-digit addition. The explanation looks educational, but the real process was simple recall.
Why AI Can't Tell the Truth About Itself
The root problem is architectural. AI models process information in parallel across thousands of components simultaneously. But explanations must be sequential—one step following another in a logical chain.
This creates a fundamental mismatch. When a model generates explanations, it forces its distributed, parallel computation into a linear narrative. Important factors get omitted. Causal relationships get reordered. The result reads like reasoning but captures only fragments of the actual process.
Think of it like asking someone to explain a dream. The brain creates a story that feels coherent, but the underlying neural activity was chaotic and parallel. AI explanations work similarly—they construct narratives that sound logical but miss the distributed computation that drove the decision.
Real-World Consequences
This matters most in high-stakes domains. In medical diagnosis, a model might give the right answer through pattern matching but explain it using textbook knowledge. Doctors who trust the explanation might miss the model's actual reasoning process and its potential blind spots.
Legal AI systems could mask training data biases with plausible legal reasoning. A model might favor certain outcomes based on biased examples but present explanations grounded in legal precedent. Lawyers relying on these explanations might not realize they're seeing post-hoc justification rather than actual reasoning.
Autonomous systems present the biggest risk. A self-driving car might classify a cyclist as a static sign but explain "no obstacles detected." Engineers debugging this failure would chase the wrong problem, potentially missing the real issue in the vision system.
The Search for Solutions
The Oxford and Google DeepMind researchers propose several approaches to make AI explanations more honest. Causal validation methods test whether stated reasoning steps actually influence the final answer. If you can remove or change a step without affecting the outcome, it probably wasn't causally important.
Cognitive science offers inspiration. Humans also generate post-hoc explanations that don't match their actual decision processes. But we have error-monitoring systems that catch inconsistencies. AI could benefit from similar mechanisms—internal critics that flag when explanations diverge from computation.
Human oversight remains crucial. Better interfaces could help users spot unreliable explanations. Metrics could track how often models acknowledge hidden influences or admit uncertainty. The goal isn't perfect explanations—it's honest ones that reveal their limitations.
Alternative Perspectives
Some researchers argue this criticism goes too far. They point out that even imperfect explanations can be useful. A model might use shortcuts to reach correct medical diagnoses, but explaining the reasoning through established medical knowledge still helps doctors verify the answer.
Others believe scaling will solve the problem. As models get larger and more sophisticated, perhaps the gap between internal computation and external explanation will narrow. Advanced training techniques might produce more honest reasoning.
But current evidence suggests the opposite. Larger models often become better at hiding their unfaithfulness, generating more convincing explanations that are even further from their actual processes.
Why this matters:
Trust calibration: When AI explanations look convincing but reflect fake reasoning, users develop misplaced confidence that can lead to dangerous decisions in medicine, law, and safety-critical systems.
The debugging trap: Engineers and researchers who rely on these explanations to understand AI behavior might spend years chasing the wrong problems, potentially missing real issues that could cause system failures.
A: Chain-of-Thought (CoT) is when AI models show their work step-by-step, like solving "What's 36 + 59?" by writing out each calculation. It emerged from prompting models to "think step-by-step" and often improves performance on math and logic problems by breaking complex tasks into smaller pieces.
Q: How did researchers prove AI explanations were fake?
A: Researchers used several tests: changing multiple-choice answer order caused 36% accuracy drops while explanations ignored this bias, adding wrong hints that models followed without admitting it, and removing reasoning steps to see if answers changed. Attribution analysis traced which explanation parts actually influenced final answers.
Q: Which AI models have this problem?
A: The research found unfaithfulness across multiple models including GPT-3.5, Claude 1.0, Claude 3.5 Sonnet, and DeepSeek-R1. Even "reasoning-trained" models like DeepSeek-R1 only acknowledged hidden prompt influences 59% of the time. The problem appears widespread across different architectures and training methods.
Q: How often do AI explanations actually match their real reasoning?
A: Studies show significant unfaithfulness: models acknowledged injected hints only 25-39% of the time, position bias affected 36% of answers without explanation, and perturbation tests revealed many reasoning steps had no causal impact on final answers. No single reliability percentage exists.
Q: Can users tell when AI explanations are fake?
A: Not easily. The explanations often look perfectly logical and convincing. The research found that 25% of recent AI papers incorrectly treated these explanations as reliable, suggesting even experts struggle to detect unfaithfulness. Automated detection achieved 83% accuracy, but manual verification remains challenging.
Q: Are there any AI systems that give honest explanations?
A: Current research hasn't identified any AI systems with consistently faithful explanations. Some "reasoning-trained" models show modest improvements but still fail frequently. The fundamental problem—parallel processing forced into sequential explanations—affects all transformer-based models regardless of size or training method.
Q: What should I do if I'm using AI for important decisions?
A: Don't rely solely on AI explanations. Test how answers change when you modify reasoning steps, check if the model acknowledges obvious influences, and validate conclusions through independent sources. The explanations can still be useful for communication, but treat them as potentially incomplete.
Q: Is this problem getting better or worse with new AI models?
A: Evidence suggests it may be getting worse. Larger models often become better at hiding unfaithfulness, creating more convincing but less accurate explanations. The research found no declining trend in papers incorrectly treating explanations as reliable over the past year.
Tech journalist. Lives in Marin County, north of San Francisco. Got his start writing for his high school newspaper. When not covering tech trends, he's swimming laps, gaming on PS4, or vibe coding through the night.
Tech giants spent billions upgrading Siri, Alexa, and Google Assistant with AI. Americans still use them for weather checks and timers—exactly like 2018. Fresh YouGov data reveals why the utility gap persists.
A new benchmark testing whether AI models will sacrifice themselves for human safety reveals a troubling pattern: the most advanced systems show the weakest alignment. GPT-5 ranks last while Gemini leads in life-or-death scenarios.
AI researchers cracked how to predict when language models turn harmful. Their 'persona vectors' can spot toxic behavior before it happens and prevent AI personalities from going bad during training.
Developers are embracing AI tools faster than ever (84% adoption) but trust is crashing. Only 33% trust AI accuracy, down from 43%. The culprit? 'Almost right' code that takes longer to debug than writing from scratch.