Did DeepSeek Train Its AI on Google’s Gemini Without Permission?

Researchers suspect Chinese AI company DeepSeek trained its latest model on Google's Gemini outputs, even as the model censors Chinese political topics more heavily than any previous version. How distillation lets companies sidestep tens of millions of dollars in training costs.


💡 TL;DR - The 30-Second Version

🔍 AI researchers say Chinese company DeepSeek's latest R1-0528 model shows signs of being trained on outputs from Google's Gemini AI system.

📊 Microsoft detected large volumes of data being pulled through OpenAI developer accounts in late 2024, accounts OpenAI believes are linked to DeepSeek.

🚫 DeepSeek's new model is the most censored version yet on criticism of the Chinese government; earlier testing found the original R1 refused 85% of state-designated taboo questions.

💰 Training a frontier model from scratch costs an estimated $63-200 million, while distillation can cost as little as $1-2 million.

🌐 The internet now overflows with AI-generated content, creating a feedback loop that degrades future AI training quality.

🚀 AI companies race to protect their data while the industry splits into Chinese and Western systems with different capabilities and restrictions.

Chinese AI company DeepSeek released an updated reasoning model last week that performs well on math and coding tests. The problem? Researchers suspect the company trained it using data stolen from Google's Gemini AI.

Sam Paech, a Melbourne developer who tests AI systems, published what he calls evidence that DeepSeek's R1-0528 model learned from Gemini outputs. The model favors words and expressions similar to those Google's Gemini 2.5 Pro uses, Paech noted on X.

Another developer who goes by a pseudonym found that DeepSeek's model traces read like Gemini traces. These traces are the "thoughts" AI models generate while working toward answers.

The Pattern Repeats

This isn't DeepSeek's first time facing such accusations. In December, developers noticed the company's V3 model often identified itself as ChatGPT, OpenAI's chatbot. This suggested it learned from ChatGPT conversation logs.

OpenAI told the Financial Times earlier this year it found evidence linking DeepSeek to distillation. That's a technique where companies extract data from bigger, more capable AI models to train their own systems.
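
For a concrete picture of what distillation-style data collection can look like, here is a minimal sketch. It assumes the official OpenAI Python client; the teacher model name, prompts, and output file are illustrative placeholders, not DeepSeek's actual pipeline.

```python
# Minimal distillation-style data collection sketch (illustrative only).
# Assumes the official OpenAI Python client (openai >= 1.0). The teacher
# model name, prompts, and output path are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a Python function that reverses a linked list.",
]

records = []
for prompt in prompts:
    # Ask the larger "teacher" model and keep its answer as a training target.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
    )
    records.append({
        "prompt": prompt,
        "completion": response.choices[0].message.content,
    })

# Write prompt/completion pairs as JSONL, the format most fine-tuning
# toolchains accept. A smaller "student" model is then fine-tuned on this file.
with open("distilled_pairs.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

At real scale this means millions of prompts, which is why providers watch for unusually heavy activity on individual accounts.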

Microsoft detected large amounts of data being pulled through OpenAI developer accounts in late 2024. OpenAI believes these accounts connect to DeepSeek.

The practice violates OpenAI's terms of service, which ban customers from using model outputs to build competing AI systems.

Why This Happens

Training AI models requires massive amounts of text data. Companies initially used books, articles, and websites written by humans. But AI-generated content now floods the internet. Content farms use AI to create clickbait. Bots spam Reddit and X with AI-written posts.

This contamination makes it hard to filter AI outputs from training datasets. Many models now misidentify themselves and use similar phrases because they learned from the same polluted data sources.

Still, AI researcher Nathan Lambert from nonprofit AI2 thinks DeepSeek probably did train on Gemini data. "If I was DeepSeek, I would definitely create synthetic data from the best API model out there," Lambert wrote on X. "They're short on GPUs and flush with cash. It's more compute for them."

The Arms Race Begins

AI companies now race to prevent data theft. OpenAI requires organizations to verify their identity before accessing advanced models. The process needs government-issued ID from supported countries. China isn't on the list.

Google recently started summarizing traces from models on its AI Studio platform. This makes it harder for rivals to train performant models on Gemini traces. Anthropic announced similar protections in May, citing competitive concerns.

Double Standards and Dirty Data

The accusations highlight a double standard in AI development. Western companies trained their models on copyrighted books, news articles, and creative works without permission. Publishers and authors sued over this practice.

Now those same companies cry foul when competitors might be using their AI outputs as training data. The hypocrisy is striking.

Meanwhile, the internet becomes less useful for training AI systems. As models generate more content, they risk learning from their own outputs. This creates a feedback loop that degrades quality over time.

Researchers call this "model collapse." AI trained on AI starts producing incoherent results. Accuracy drops. Models generate content that ranges from wrong to offensive.
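
A toy statistical illustration of the idea, not a simulation of any real training pipeline: fit a simple distribution to data, then refit each "generation" only to samples drawn from the previous generation's fit. The numbers are arbitrary.

```python
# Toy illustration of model collapse: each generation is fit only to samples
# produced by the previous generation's model. Purely a statistical cartoon.
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0       # generation 0: the "human data" distribution
samples_per_generation = 20

for generation in range(30):
    samples = [random.gauss(mu, sigma) for _ in range(samples_per_generation)]
    mu = statistics.mean(samples)       # refit the model to its own outputs
    sigma = statistics.stdev(samples)
    print(f"generation {generation:2d}: mean={mu:+.3f} stdev={sigma:.3f}")

# The fitted distribution drifts away from the original, and the spread tends
# to shrink over generations: the model gradually forgets the tails of the
# data humans actually produced.
```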

The Censorship Factor

DeepSeek's latest model faces criticism beyond alleged data theft. The R1-0528 version shows increased censorship around Chinese political topics.

A developer using the handle "xlr8harder" tested the model with a tool called SpeechMap. The results show this version as "the most censored DeepSeek model yet for criticism of the Chinese government."

The model refuses to discuss internment camps in Xinjiang, even when prompted with documented human rights cases. It sometimes acknowledges violations occurred but stops short of assigning responsibility.

This fits China's 2023 AI regulations. Systems must not produce content that challenges government narratives or undermines state unity. Previous research found the first DeepSeek R1 model refused to answer 85% of questions on state-designated taboo topics.

What Companies Can Do

The situation forces AI companies to get creative about data sources. Some pay publishers for access to clean, human-written content. Others develop partnerships with news organizations and book publishers.

Retrieval-augmented generation offers another approach. This lets AI models search the internet in real-time instead of relying only on pre-trained data. But tests show these systems produce more unsafe responses, from privacy violations to outright misinformation.
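
A minimal sketch of the retrieval-augmented pattern, again assuming the OpenAI Python client; the document store, question, scoring, and model name are placeholders, and production systems use vector search and live web retrieval rather than keyword overlap.

```python
# Minimal retrieval-augmented generation sketch. The documents, question, and
# model name are placeholders; real systems use vector search, not this
# naive keyword-overlap scoring.
from openai import OpenAI

client = OpenAI()

documents = [
    "DeepSeek's R1-0528 is an updated reasoning model.",
    "Gemini 2.5 Pro is Google's flagship reasoning model.",
    "Distillation trains a small model on a larger model's outputs.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    # Score each document by how many question words it shares.
    question_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(question_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

question = "What is distillation in AI training?"
context = "\n".join(retrieve(question, documents))

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```

The trade-off mentioned above is visible here: whatever the retriever returns, accurate or not, goes straight into the prompt.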

The fundamental problem remains: AI models need human creativity to evolve. As one researcher put it, these systems cannot replace humans because they depend entirely on human input to improve.

Companies may need to pay people to create content specifically for AI training. This reverses the current model where AI companies take human work without compensation.

Why this matters:

  • The AI industry built its foundation on free human content but now faces a data shortage as the internet fills with AI-generated material, creating a feedback loop that could collapse the quality of future models.
  • Chinese AI companies operate under different rules than Western rivals, leading to models that excel technically but face severe political restrictions, showing how geopolitics shapes artificial intelligence development.

❓ Frequently Asked Questions

Q: What exactly is "distillation" in AI training?

A: Distillation extracts knowledge from a powerful AI model to train a smaller, cheaper one. Think of it like copying homework from the smart kid in class. The student model learns to mimic the teacher model's responses without needing the same massive computing power.

Q: How much does it cost to train an AI model from scratch?

A: Training a model like GPT-4 costs between $63 million and $200 million. That includes GPU rental, electricity, and engineering time. Distillation costs a fraction of this - maybe $1-2 million - which explains why companies like DeepSeek might prefer stealing data over starting fresh.

Q: Can you really tell if one AI was trained on another AI's outputs?

A: Yes, but it's like forensic detective work. Researchers look for telltale signs: similar word choices, identical mistakes, or models that identify themselves as their competitors. When DeepSeek's model called itself "ChatGPT," that was a dead giveaway it learned from OpenAI's data.
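
A simplified stand-in for that forensic work: compare word-choice frequencies between two sets of outputs. Real analyses, like Paech's, use far more careful statistics; the sample texts below are invented.

```python
# Crude stylistic fingerprint comparison: how similar are the word-choice
# distributions of two models' outputs? The sample texts are invented
# placeholders, not real transcripts; real analyses are far more rigorous.
from collections import Counter
import math

def word_frequencies(texts: list[str]) -> dict[str, float]:
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def cosine_similarity(p: dict[str, float], q: dict[str, float]) -> float:
    dot = sum(p[word] * q.get(word, 0.0) for word in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

model_a_outputs = ["Let us delve into the problem step by step and verify each case."]
model_b_outputs = ["Let us delve into this carefully, checking each step as we go."]

score = cosine_similarity(word_frequencies(model_a_outputs),
                          word_frequencies(model_b_outputs))
print(f"stylistic similarity: {score:.3f}")  # closer to 1.0 = more similar
```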

Q: Why don't AI companies just pay for clean human-written content?

A: Some do. Reddit sold its content to Google for $60 million per year. News Corp signed a deal with OpenAI worth $250 million over five years. But most companies still scrape free content because paying for quality data at scale would cost billions annually.

Q: What happens if China and the West develop separate AI systems?

A: We get digital iron curtains. Chinese models excel at math and coding but refuse to discuss Xinjiang. Western models discuss politics freely but face different biases. This split could create incompatible AI ecosystems, forcing companies to choose sides like the old iOS vs Android wars.
