AI Reliance Tests How Students Learn to Think

Experienced endoscopists' rate of detecting precancerous growths without software help fell from 28.4% to 22.4% after they grew used to AI assistance, a 2025 multicentre study in The Lancet Gastroenterology & Hepatology found. The dataset covered 1,443 colonoscopies across four centres, making it the first large clinical measure of what happens to a trained skill once the AI is switched off.

Researchers working on the question increasingly sort the effect by how the tool is used, not whether it is used. "It is not so much that the use of gen AI leads to reduced critical thinking, but it's rather how we use it," said Michael Gerlich, a professor at SBS Swiss Business School who surveyed 666 people in 2025 and found the heaviest AI users scored lowest on a critical-thinking test, with the gap widest among 17-to-25-year-olds. The studies treat substitutive use, where the system hands over an answer before the user has tried, as the costly case, and guided tutoring as the beneficial one.

A field experiment with nearly 1,000 Turkish high-school students put the split to a direct test. Students given a GPT-4 interface built to mimic standard ChatGPT raised their practice-problem scores 48%, then scored 17% lower than a control group on an exam they sat without it, Hamsa Bastani and co-authors at Wharton found. A version rebuilt as a guarded tutor, which withheld answers and gave hints, lifted practice scores 127% with no penalty on the unaided exam. The pre-registered working paper has not been peer reviewed.

Key Takeaways

The strongest evidence points to substitutive AI use, not general cognitive decline.
A Lancet colonoscopy study found unassisted detection fell from 28.4% to 22.4% after AI exposure.
A Wharton working paper found a ChatGPT-style GPT-4 interface lifted practice scores but hurt unaided exam scores.
Scaffolded AI tutors and forced verification can preserve learning while reducing over-reliance.

AI-generated summary, reviewed by an editor. More on our AI guidelines.

The 28.4% colonoscopy baseline

Budzyń and co-authors studied 19 experienced endoscopists, each with more than 2,000 prior procedures, before and after routine exposure to AI-assisted colonoscopy. Their unassisted adenoma detection rate fell 6.0 percentage points, from 28.4% before exposure to 22.4% after it, with the decline appearing across all four centres.
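The figures in this article mix percentage points and percent changes, and the two are easy to confuse. A minimal sketch of the difference, using the detection rates reported above. The function names are illustrative, and the relative figure is simple arithmetic on the two published rates, not a number the study's authors are quoted on here.

```python
def absolute_change_points(before_pct, after_pct):
    """Change in percentage points (after minus before)."""
    return after_pct - before_pct


def relative_change_percent(before_pct, after_pct):
    """Change expressed as a percentage of the starting rate."""
    return (after_pct - before_pct) / before_pct * 100


# Unassisted adenoma detection rates quoted in the article.
before, after = 28.4, 22.4

print(round(absolute_change_points(before, after), 1))   # -6.0 percentage points
print(round(relative_change_percent(before, after), 1))  # -21.1 percent of the baseline
```

Read this way, the 6.0-point drop amounts to roughly a fifth of the original detection rate.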

The study is observational and cannot prove AI exposure alone caused the drop. Its authors treated the finding as a warning anyway. "To our knowledge, this is the first study to suggest a negative impact of regular AI use on health-care professionals' ability to complete a patient-relevant task in medicine of any kind," said Marcin Romańczyk of the Academy of Silesia, a co-author, in a statement. Yuichi Mori of the University of Oslo, the senior author, said the result casts the earlier AI-assisted trials in a new light: the endoscopists in those trials "may have been negatively affected by continuous AI exposure." Omer Ahmad, in a linked Lancet commentary, told clinicians to guard against the "quiet erosion of fundamental skills required for high-quality endoscopy."

The 48% practice gain

Bastani put the worry in terms of skill formation. "We're really worried that if … they start using these tools as a crutch and rely on it, then they won't actually build those fundamental skills," she told Knowledge@Wharton. She drew the line by how the tool gets used: as an assistant whose "outputs" a person checks, it "can be a huge benefit," but used "lazily" to "completely trust the machine learning model, then that's when we could be in trouble."

The cognitive science behind the worry is among the most replicated in the field. A generation-effect meta-analysis found people remember self-produced answers better than ones they read, at d=0.40 across 86 studies; a separate review put the practice-testing effect at g=0.61 to 0.70. Reading the model's answer before attempting recall skips the step that does the encoding.
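For readers unfamiliar with the notation, d is a standardized mean difference: the gap between two group averages divided by their pooled standard deviation. A rough sketch of that calculation follows. The scores are invented for illustration and do not come from any study cited here; g (Hedges' g) is a closely related measure with a small-sample correction that this sketch does not apply.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical recall scores, made up for this example only.
generated_answers = [12, 14, 15, 16, 18]  # learners who produced the answer themselves
read_answers = [11, 13, 14, 15, 17]       # learners who only read the answer

print(round(cohens_d(generated_answers, read_answers), 2))  # 0.45
```

A d of 0.40 therefore means the self-generating group's average sits about four-tenths of a standard deviation above the reading group's.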

Nataliya Kosmyna, who led an MIT Media Lab study on writing essays with ChatGPT, calls the shortfall that builds up "cognitive debt." "There is no cognitive credit card," she said. "You cannot pay this debt off." The education researcher Carl Hendrick drew the harder limit: "The most advanced AI can simulate intelligence, but it cannot think for you."

The 11.3-point diagnostic penalty

A second failure mode is misplaced trust. In a JAMA randomized clinical vignette study, 457 clinicians shown a deliberately biased AI recommendation were 11.3 percentage points less accurate in their diagnoses. An explanation that exposed the model's faulty reasoning lifted accuracy by 2.3 points, a gain that was not statistically significant.

Ethan Mollick, the Wharton professor who studies workplace AI, calls the trap a "jagged frontier." "That wall is the capability of AI, and the further from the center, the harder the task. Everything inside the wall can be done by the AI, everything outside is hard for the AI to do," he has written. "The problem is that the wall is invisible." The passivity that follows easy answers he calls "falling asleep at the wheel": "when the AI is very good humans have no reason to work hard and pay attention."

The pattern showed up at work, too. In a study Mollick co-authored, consultants using GPT-4 did better on tasks inside the frontier but were 19 percentage points less likely than consultants working without it to get a task outside the frontier right. The same experiment cut the other way for the weakest performers, whose scores jumped 43% with the tool. Kosmyna, whose MIT essay-writing study the Implicator has covered, has since pushed back on the panic her own work set off: "We didn't find any brain rot. … we didn't measure IQ."

The 0.73-standard-deviation tutor case

A Harvard physics tutor built with pedagogical scaffolding produced learning gains of 0.73 to 1.3 standard deviations over a well-run active-learning class for 194 students, Gregory Kestin and Kelly Miller reported in Scientific Reports. The tutor helped students "learn significantly more in less time," they wrote, with higher engagement.

Workplace data points the same way for novices. Noy and Zhang found ChatGPT cut writing time and raised quality, with the weakest writers gaining most. Brynjolfsson, Li and Raymond measured a 14% productivity gain among support agents that rose to 34% for the least experienced.

Evan Risko, who co-wrote a 2016 review of cognitive offloading at the University of Waterloo, drew the boundary. "When you offload, you free up some mental resources," he said. "Now, if you devote that mental effort to some productive task there should be a net benefit." Whether that benefit shows up depends on the freed effort going back into checking, retrieval or practice.

How much of this carries over to people who grow up offloading from the start is unmeasured. Every causal study in the literature so far runs a single semester or less. None has tracked habitual users, whether students, coders or clinicians, across years and then tested them once the tool is gone. That is the evidence the field still lacks.

Frequently Asked Questions

Does AI use make people worse thinkers?

The evidence does not support a broad population claim. It points to a narrower risk: substitutive AI use can weaken unaided performance, while scaffolded AI tutoring can improve learning outcomes.

What is substitutive AI use?

Substitutive use means asking the system to supply the answer before the user has tried retrieval, reasoning or verification. The learning risk is highest when the tool replaces the struggle that builds skill.

What did the colonoscopy study find?

A 2025 study in The Lancet Gastroenterology & Hepatology found experienced endoscopists' unassisted adenoma detection rate fell from 28.4% to 22.4% after routine AI exposure. The study was observational.

What did the Bastani education trial find?

The pre-registered Wharton working paper found a GPT-4 interface modeled on standard ChatGPT, labeled GPT Base, raised practice performance by 48% but reduced unaided exam scores by 17%. A guarded tutor version avoided the exam penalty.

What evidence pushes against AI panic?

Harvard, workplace and tutoring studies show scaffolded AI can raise performance, especially for novices. The policy question is how to protect tool-free practice and verification while using the systems.

Marcus Schuler

San Francisco

Editor-in-Chief and founder of Implicator.ai. Former ARD correspondent and senior broadcast journalist with 10+ years covering tech. Writes daily briefings on policy and market developments. Based in San Francisco. E-mail: editor@implicator.ai