Can I Train My AI Model on Public Data? The Legal Reality in 2026

Scraping the web and training on public data sounds straightforward. It is not. Copyright law, GDPR, terms of service, and emerging AI-specific law create a complex landscape that has already generated billion-dollar litigation. Here is what founders and ML engineers need to know.

Key Takeaways

Publicly accessible data is not the same as data you have the right to use for AI training. Copyright law, GDPR, database rights, and terms of service create multiple independent legal constraints on web-scraped training data.
Copyright: training a model on copyrighted text creates potential infringement liability. The 'fair use' and 'text and data mining' exceptions are narrower than most founders assume — they vary significantly by jurisdiction and have not been fully settled by courts.
GDPR and Australia's Privacy Act: if your training data includes personal information about identifiable individuals (which most web-scraped data does), you need a lawful basis for processing that data for training. 'It was publicly available' is not a lawful basis.
Terms of Service: most major platforms (LinkedIn, Twitter/X, Reddit, news sites) explicitly prohibit scraping for AI training. Violating ToS creates breach of contract exposure and, in some jurisdictions, liability under computer fraud laws.
The safest training data strategy: synthetic data, licensed datasets, user-consented data, or data from providers who have taken on the legal risk. Document your data sources from day one — retroactive documentation is often impossible.

"For informational purposes only. This article does not constitute legal, regulatory, financial, or professional advice. Consult a qualified specialist for specific guidance."

The three legal frameworks that constrain AI training data

Copyright law is the framework generating the most litigation. The core question, whether training an AI model on copyrighted material constitutes copyright infringement, has not been finally resolved in any major jurisdiction. In the US, ongoing cases (Getty Images v. Stability AI, NYT v. OpenAI, and others) are working through the courts. In the EU, the Copyright Directive's text and data mining exceptions provide some protection, most clearly for scientific research and otherwise only where rightsholders have not reserved their rights, but their application to commercial AI training is contested. In Australia, there is no equivalent exception, and the legal analysis is even less settled.

What founders need to understand: the absence of settled law does not mean the absence of risk. Using copyrighted material for AI training creates potential liability — even if the legal outcome is uncertain, the litigation risk is real and the discovery and defence costs can be material for early-stage companies.

The GDPR problem most founders miss

Web-scraped data almost always contains personal information: names, email addresses, biographical details, professional information, opinions, and potentially sensitive data. If your training data includes personal information about individuals in the EU, UK, or Australia, data protection law applies to your use of that data for training. The specific problem: the individuals whose data is in your training set did not consent to that use, and "legitimate interests" as a legal basis for AI training is not straightforward to establish given the scale of the processing and its purpose.

The Italian DPA's enforcement action against ChatGPT, the EDPB's work on AI training data, and the OAIC's guidance on AI and privacy all point in the same direction: using personal data scraped from the web for AI training requires careful legal analysis and cannot simply be assumed to be lawful. If your training data includes personal information, you need to have worked through the lawful basis question before you start training.
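
One practical input to that analysis is knowing how much obvious personal data a scraped corpus contains. The Python sketch below is illustrative only: the function names are hypothetical, the email pattern is deliberately simple, and a regular expression will miss names, biographical details, and most other kinds of personal information. Setting flagged records aside does not, by itself, make the remaining data lawful to train on.

```python
from __future__ import annotations

import re

# Deliberately simple pattern (an assumption for illustration): it catches
# common email shapes, not every valid address, and nothing else personal.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text: str) -> list[str]:
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_RE.findall(text)


def split_by_personal_data(records: list[str]) -> tuple[list[str], list[str]]:
    """Split records into (clean, flagged); flagged ones contain an email-like string."""
    clean: list[str] = []
    flagged: list[str] = []
    for record in records:
        if find_emails(record):
            flagged.append(record)
        else:
            clean.append(record)
    return clean, flagged


if __name__ == "__main__":
    sample = [
        "Contact jane.doe@example.com for details",
        "The weather was mild",
    ]
    clean, flagged = split_by_personal_data(sample)
    print(f"{len(flagged)} of {len(sample)} records flagged for review")
```

A count like this is a starting point for the conversation with counsel, not a substitute for it.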

What the safest training data strategies look like

Synthetic data: AI-generated training data that does not represent real individuals or real copyrighted content avoids the copyright and privacy problems entirely. It is increasingly viable for many use cases.
Licensed datasets: data from providers who have taken on the legal risk and hold appropriate rights transfers the legal responsibility and provides documentary evidence of lawful use.
User-consented data: data that users of your product have explicitly agreed may be used for training provides a clean lawful basis but requires careful consent design.
Public domain and permissively licensed data: Creative Commons material, government datasets, and academic datasets provide the most legally clear foundation.
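
The advice to document data sources from day one can be sketched as a small provenance manifest. This Python example is an assumption-laden illustration, not a compliance tool: `SourceRecord`, `review`, `approved_sources`, `to_manifest`, and the contents of `ALLOWED_LICENSES` are all hypothetical, and which licences and lawful bases are acceptable is a legal decision to make with counsel.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical allowlist for illustration; the real list is a legal decision.
ALLOWED_LICENSES = {
    "public-domain",
    "CC0-1.0",
    "CC-BY-4.0",
    "commercial-license",
    "user-consent",
    "synthetic",
}


@dataclass(frozen=True)
class SourceRecord:
    name: str
    origin: str                      # URL or vendor name
    license: str                     # licence identifier recorded at acquisition
    acquired_on: str                 # ISO 8601 date
    contains_personal_data: bool
    lawful_basis: Optional[str] = None  # e.g. "consent"; None if undocumented


def review(record: SourceRecord) -> list[str]:
    """Return human-readable issues; an empty list means none were found."""
    issues: list[str] = []
    if record.license not in ALLOWED_LICENSES:
        issues.append(f"license not on allowlist: {record.license}")
    if record.contains_personal_data and not record.lawful_basis:
        issues.append("personal data without a documented lawful basis")
    return issues


def approved_sources(records: list[SourceRecord]) -> list[SourceRecord]:
    """Keep only the records that pass review with no issues."""
    return [r for r in records if not review(r)]


def to_manifest(records: list[SourceRecord]) -> str:
    """Serialize records to JSON so the manifest can be stored with the dataset."""
    return json.dumps([asdict(r) for r in records], indent=2)
```

Writing the manifest at acquisition time, rather than reconstructing it later, is the point: the record of where each source came from and under what terms exists before training starts.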
