Did DeepSeek Train Its R1 Model on Gemini Outputs? AI Experts Think So

By Farhan Ali • June 23, 2025

The AI world is abuzz with fresh allegations that Chinese startup DeepSeek may have used outputs from Google’s Gemini model as part of its training data for its latest flagship model, R1. The claim, while still under investigation, has sparked a renewed debate around LLM provenance, synthetic data ethics, and the murky grey zone of modern AI development.

The Allegations

  • Researchers analyzing DeepSeek-R1’s responses have identified stylistic and semantic similarities to Gemini-generated outputs.
  • Phrases like “as a large language model developed by…” and structured citations reminiscent of Gemini Ultra’s output surfaced during blind testing (a toy version of such checks is sketched after this list).
  • AI2’s Nathan Lambert suggested that even if direct copying didn’t occur, it’s highly plausible DeepSeek used Gemini to generate synthetic datasets, effectively “training on a model’s brain without touching its code.”
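
To make the flavor of that blind testing concrete, here is a minimal sketch of two crude provenance checks one could run over paired model responses: scanning for telltale self-identification phrases, and scoring word n-gram overlap between two models’ answers to the same prompt. The phrase bank and sample strings below are hypothetical placeholders; real provenance research uses far larger curated phrase lists and proper statistical analysis.

```python
import re

# Hypothetical phrase bank; actual studies curate much larger lists
# of telltale boilerplate strings.
TELLTALE_PATTERNS = [
    r"as a large language model developed by \w+",
    r"i was trained by google",
]

def find_telltales(text: str) -> list[str]:
    """Return any telltale self-identification phrases found in a response."""
    lowered = text.lower()
    hits: list[str] = []
    for pattern in TELLTALE_PATTERNS:
        hits.extend(re.findall(pattern, lowered))
    return hits

def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over word n-grams: a crude stylistic-overlap score."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    set_a, set_b = ngrams(a), ngrams(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Usage with placeholder strings standing in for real model outputs.
r1_reply = "As a large language model developed by Google, I can explain that..."
gemini_reply = "As a large language model developed by Google, here is an explanation..."
print(find_telltales(r1_reply))   # -> ['as a large language model developed by google']
print(round(ngram_jaccard(r1_reply, gemini_reply), 2))
```

Even this toy check shows why leaked boilerplate phrases are treated as a red flag: a model has little reason to claim another company’s provenance unless that claim appeared, verbatim, in its training data.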

Why This Matters

The use of model outputs—particularly those generated by commercial systems like Gemini or ChatGPT—for downstream model training raises several legal and ethical concerns:

  • IP Ownership: Do generative outputs fall under copyright or proprietary protection?
  • Data Contamination: Models trained on other models may inherit biases, errors, or hallucination patterns.
  • Unfair Advantage: Leveraging outputs from billion-dollar models without paying for API access may constitute unauthorized usage.

DeepSeek’s Position

To date, DeepSeek has declined to specify which datasets were used to train R1. The company is known to rely on open-source corpora, but its public documentation omits synthetic data sources, raising questions about internal practices.

The Precedent

This isn’t the first instance of such controversy:

  • In 2023, researchers found evidence of GPT-style responses in multiple open-source Chinese models.
  • Meta previously warned against training on LLaMA outputs due to model mimicry risks.
  • Stability AI was also criticized for training on partially scraped commercial data in early versions of StableLM.

The Bigger Picture

As LLMs become increasingly commoditized, model makers are turning to synthetic data to enhance quality without spiraling training costs. But when that synthetic data originates from another company’s model, especially a closed-source one, the industry risks turning LLM development into an arms race of “who copied whom better.”
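
For readers unfamiliar with the mechanics, what the paragraph above describes is essentially model distillation via synthetic data: prompt a “teacher” model, store the prompt/response pairs, and fine-tune a “student” on the result. The sketch below shows only that generic shape; the stubbed ask_teacher function, the prompts, and the file name are all illustrative assumptions, not DeepSeek’s actual pipeline.

```python
import json

def ask_teacher(prompt: str) -> str:
    """Placeholder for a call to a hosted teacher model's API.

    In a real pipeline this would invoke a provider SDK; it is stubbed
    here so the sketch runs without credentials.
    """
    return f"[teacher model's answer to: {prompt}]"

# Illustrative prompts; real pipelines use millions of diverse prompts.
PROMPTS = [
    "Explain gradient descent to a beginner.",
    "Write a short proof that sqrt(2) is irrational.",
]

def build_synthetic_dataset(path: str = "synthetic.jsonl") -> None:
    """Store teacher prompt/response pairs as JSONL for student fine-tuning."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in PROMPTS:
            record = {"prompt": prompt, "response": ask_teacher(prompt)}
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    build_synthetic_dataset()
```

The provenance question in the DeepSeek allegations turns on that “response” field: once a student is fine-tuned on it, the teacher’s knowledge and stylistic tics, including telltale boilerplate, can resurface in the student’s outputs.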

Conclusion

If DeepSeek did train its model using Gemini outputs, it may not be the first—and it certainly won’t be the last. As generative AI accelerates, transparency in training data is becoming not just a technical concern, but a legal and competitive imperative.


Additional References:

  • TechCrunch AI Desk (@techcrunch)
  • Allen Institute for AI (@ai2_alleninstitute)
  • Hugging Face LLM Tracker (@huggingface)
  • The Information AI Insider (@theinformation)
  • Google DeepMind (@googledeepmind)
