
By Farhan Ali • June 23, 2025
The AI world is abuzz with fresh allegations that Chinese startup DeepSeek may have used outputs from Google’s Gemini model as part of its training data for its latest flagship model, R1. The claim, while still under investigation, has sparked a renewed debate around LLM provenance, synthetic data ethics, and the murky grey zone of modern AI development.
The Allegations
- Researchers analyzing DeepSeek-R1’s responses report stylistic and semantic similarities to Gemini-generated outputs.
- Telltale phrases like “as a large language model developed by…” and structured citations reminiscent of Gemini Ultra’s output surfaced during blind testing (a minimal sketch of this kind of check follows this list).
- AI2’s Nathan Lambert suggested that even if direct copying didn’t occur, it’s highly plausible DeepSeek used Gemini to generate synthetic datasets, effectively “training on a model’s brain without touching its code.”
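How would an analyst surface that kind of overlap? The sketch below is a minimal, hypothetical illustration, not the researchers’ actual methodology: it flags known self-identification phrases and scores crude lexical overlap between two models’ answers to the same prompt. The phrase list and sample strings are made up for the example.

```python
# Hypothetical sketch of a provenance check: flag telltale
# self-identification phrases and score lexical overlap between two
# models' answers. Phrases and sample texts are illustrative only.

TELLTALE_PHRASES = [
    "as a large language model developed by",
    "i was trained by google",
]

def has_telltale_phrase(text: str) -> bool:
    """Return True if the output contains a known provenance tell."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in TELLTALE_PHRASES)

def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Crude lexical overlap: Jaccard index over word n-grams."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

candidate = "As a large language model developed by Google, I can explain..."
reference = "As a large language model developed by Google, I can help explain..."

print(has_telltale_phrase(candidate))                    # True
print(round(jaccard_similarity(candidate, reference), 2))  # high overlap
```

Real provenance analysis would go further, with embedding-based semantic similarity or statistical tests, but even checks this simple can surface the kinds of tells reported above.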
Why This Matters
The use of model outputs—particularly those generated by commercial systems like Gemini or ChatGPT—for downstream model training raises several legal and ethical concerns:
- IP Ownership: Do generative outputs fall under copyright or proprietary protection?
- Data Contamination: Models trained on other models may inherit biases, errors, or hallucination patterns.
- Unfair Advantage: Using a billion-dollar model’s outputs to train a competitor may violate the provider’s terms of service, which typically prohibit exactly this use, even when API access is paid for.
DeepSeek’s Position
To date, DeepSeek has declined to specify which datasets were used to train R1. The company is known to rely on open-source corpora, but its public documentation omits synthetic data sources, which raises questions about its internal practices.

The Precedent
This isn’t the first instance of such controversy:
- In 2023, researchers found evidence of GPT-style responses in multiple open-source Chinese models.
- Meta previously warned against training on LLaMA outputs due to model mimicry risks.
- Stability AI was also criticized for training on partially scraped commercial data in early versions of StableLM.
The Bigger Picture
As LLMs become increasingly commoditized, model makers are turning to synthetic data to enhance quality without spiraling training costs. But when that synthetic data originates from another company’s model, especially a closed-source one, the industry runs the risk of turning LLM development into an arms race of “who copied whom better.”
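To make the pattern concrete, here is a minimal sketch of API-based synthetic data generation, assuming the publicly available google-generativeai Python client. The model name, seed prompts, and output file are placeholders for illustration; nothing here is drawn from DeepSeek’s actual pipeline.

```python
# Illustrative only: the generic "distill via API" pattern, in which a
# commercial model's answers are collected as training pairs for a
# student model. Prompts, model name, and file layout are placeholders.
import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
teacher = genai.GenerativeModel("gemini-1.5-pro")  # hypothetical teacher choice

seed_prompts = [
    "Explain gradient checkpointing in two paragraphs.",
    "Summarize the tradeoffs of mixture-of-experts routing.",
]

with open("synthetic_corpus.jsonl", "w") as f:
    for prompt in seed_prompts:
        reply = teacher.generate_content(prompt)
        # Each record pairs a prompt with the teacher's answer -- the
        # kind of (instruction, response) data a student model is tuned on.
        f.write(json.dumps({"prompt": prompt, "response": reply.text}) + "\n")
```

The code is trivial; the controversy is over the data it yields, which encodes another company’s model behavior. Whether collecting it this way is legitimate is exactly what the terms-of-service debate above turns on.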

Conclusion
If DeepSeek did train its model using Gemini outputs, it may not be the first—and it certainly won’t be the last. As generative AI accelerates, transparency in training data is becoming not just a technical concern, but a legal and competitive imperative.
Additional References:
- TechCrunch AI Desk (@techcrunch)
- Allen Institute for AI (@ai2_alleninstitute)
- Hugging Face LLM Tracker (@huggingface)
- The Information AI Insider (@theinformation)
- Google DeepMind (@googledeepmind)