
By Farhan Ali • June 23, 2025
The AI world is abuzz with fresh allegations that Chinese startup DeepSeek may have used outputs from Google’s Gemini model as part of its training data for its latest flagship model, R1. The claim, while still under investigation, has sparked a renewed debate around LLM provenance, synthetic data ethics, and the murky grey zone of modern AI development.
The Allegations
- Researchers analyzing DeepSeek-R1’s responses report stylistic and semantic similarities to Gemini-generated outputs.
- Telltale phrases like “as a large language model developed by…” and structured citations reminiscent of Gemini Ultra’s output surfaced during blind testing (a minimal sketch of this kind of check follows this list).
- AI2’s Nathan Lambert suggested that even if direct copying didn’t occur, it’s highly plausible DeepSeek used Gemini to generate synthetic datasets, effectively “training on a model’s brain without touching its code.”
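How would an analyst surface that kind of overlap? The sketch below is a minimal, hypothetical illustration, not the researchers’ actual methodology: it flags known self-identification phrases and scores crude lexical overlap between two models’ answers to the same prompt. The phrase list and sample strings are made up for the example.

```python
# Hypothetical sketch of a provenance check: flag telltale
# self-identification phrases and score lexical overlap between two
# models' answers. Phrases and sample texts are illustrative only.

TELLTALE_PHRASES = [
    "as a large language model developed by",
    "i was trained by google",
]

def has_telltale_phrase(text: str) -> bool:
    """Return True if the output contains a known provenance tell."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in TELLTALE_PHRASES)

def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Crude lexical overlap: Jaccard index over word n-grams."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

candidate = "As a large language model developed by Google, I can explain..."
reference = "As a large language model developed by Google, I can help explain..."

print(has_telltale_phrase(candidate))                    # True
print(round(jaccard_similarity(candidate, reference), 2))  # high overlap
```

Real provenance analysis would go further, with embedding-based semantic similarity or statistical tests, but even checks this simple can surface the kinds of tells reported above.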
Why This Matters
The use of model outputs—particularly those generated by commercial systems like Gemini or ChatGPT—for downstream model training raises several legal and ethical concerns:
- IP Ownership: Do generative outputs fall under copyright or proprietary protection?
- Data Contamination: Models trained on other models may inherit biases, errors, or hallucination patterns.
- Unfair Advantage: Using a billion-dollar model’s outputs to train a competitor may violate the provider’s terms of service, which typically prohibit exactly this use, even when API access is paid for.
DeepSeek’s Position
To date, DeepSeek has declined to specify which datasets were used to train R1. The company is known to rely on open-source corpora, but its public documentation omits synthetic data sources, which raises questions about its internal practices.

The Precedent
This isn’t the first instance of such controversy:
- In 2023, researchers found evidence of GPT-style responses in multiple open-source Chinese models.
- Meta previously warned against training on LLaMA outputs due to model mimicry risks.
- Stability AI was also criticized for training on partially scraped commercial data in early versions of StableLM.
The Bigger Picture
As LLMs become increasingly commoditized, model makers are turning to synthetic data to enhance quality without spiraling training costs. But when that synthetic data originates from another company’s model, especially a closed-source one, the industry runs the risk of turning LLM development into an arms race of “who copied whom better.”
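To make the pattern concrete, here is a minimal sketch of API-based synthetic data generation, assuming the publicly available google-generativeai Python client. The model name, seed prompts, and output file are placeholders for illustration; nothing here is drawn from DeepSeek’s actual pipeline.

```python
# Illustrative only: the generic "distill via API" pattern, in which a
# commercial model's answers are collected as training pairs for a
# student model. Prompts, model name, and file layout are placeholders.
import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
teacher = genai.GenerativeModel("gemini-1.5-pro")  # hypothetical teacher choice

seed_prompts = [
    "Explain gradient checkpointing in two paragraphs.",
    "Summarize the tradeoffs of mixture-of-experts routing.",
]

with open("synthetic_corpus.jsonl", "w") as f:
    for prompt in seed_prompts:
        reply = teacher.generate_content(prompt)
        # Each record pairs a prompt with the teacher's answer -- the
        # kind of (instruction, response) data a student model is tuned on.
        f.write(json.dumps({"prompt": prompt, "response": reply.text}) + "\n")
```

The code is trivial; the controversy is over the data it yields, which encodes another company’s model behavior. Whether collecting it this way is legitimate is exactly what the terms-of-service debate above turns on.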

Conclusion
If DeepSeek did train its model using Gemini outputs, it may not be the first—and it certainly won’t be the last. As generative AI accelerates, transparency in training data is becoming not just a technical concern, but a legal and competitive imperative.
Additional References:
- TechCrunch AI Desk (@techcrunch)
- Allen Institute for AI (@ai2_alleninstitute)
- Hugging Face LLM Tracker (@huggingface)
- The Information AI Insider (@theinformation)
- Google DeepMind (@googledeepmind)