Can I Train My AI Model on Public Data? The Legal Reality in 2026

Scraping the web and training on public data sounds straightforward. It is not. Copyright law, GDPR, terms of service, and emerging AI-specific law create a complex landscape that has already generated billion-dollar litigation. Here is what founders and ML engineers need to know.

AIRiskAware

Specialist AI Risk Governance & Compliance

Can I train AI on public data? The legal answer is complicated

The question seems straightforward: if data is publicly available, can you use it to train AI? The answer across every major jurisdiction is that it depends, and "publicly available" is not the same as "legally available for any purpose."

Copyright law

Public availability does not waive copyright. Publicly posted articles, images, code, music, and other creative works are typically protected by copyright. Using them for AI training may or may not qualify as fair use (US), fair dealing (UK, Australia, Canada), or another applicable exception. This is the central question in NYT v OpenAI (filed December 2023, ongoing), the suits brought by Sarah Silverman et al., and dozens of similar cases against foundation model providers. The US Copyright Office addressed adjacent questions in its 2025 reports: the January 2025 report confirmed that fully AI-generated works are not copyrightable, and the pre-publication report released in May 2025 dealt with training itself, concluding that whether training is fair use depends on the facts of each case. Japan's copyright framework has been relatively permissive for AI training but is being tested (Yomiuri Shimbun v Perplexity, 2025). The EU's text and data mining exceptions (Directive 2019/790) work in two tiers: Article 3 permits mining for scientific research by research organisations and cultural heritage institutions, and Article 4 permits mining for any purpose, including commercial use, unless rights holders have expressly reserved their rights (opted out).
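As a rough illustration of what checking for an opt-out can look like in practice, the sketch below reads a site's robots.txt with Python's standard library and asks whether a given crawler is allowed to fetch a URL. It rests on assumptions: robots.txt is only one of several signals a rights holder might use, whether any particular signal counts as a valid reservation of rights is a legal question this code cannot answer, and the crawler name `ExampleTrainingBot` and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url.

    This reports a crawling preference only. It does not establish that
    training on the fetched content is lawful.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: one named crawler is blocked, all others are allowed.
SAMPLE_ROBOTS = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(crawler_allowed(SAMPLE_ROBOTS, "ExampleTrainingBot", page))
    print(crawler_allowed(SAMPLE_ROBOTS, "OtherBot", page))
```

A disallow result is a strong reason to stop and take advice. An allow result is not clearance: reservations can also be expressed in page metadata or in the site's terms and conditions.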

Data protection law

If public data includes personal data (names, faces, social media profiles, public records), data protection law applies regardless of public availability. GDPR, UK GDPR, PDPA, the DPDP Act, and the Australian Privacy Act all regulate the processing of personal data. Public availability may support a lawful basis (for example, legitimate interests under GDPR), but it doesn't eliminate obligations for transparency, data minimisation, and individual rights. The Data (Use and Access) Act 2025 (DUAA) reforms to purpose limitation under UK GDPR Article 5(1)(b) give UK-based organisations more latitude to repurpose personal data for AI training. This is the most material UK-EU divergence since Brexit, and the added latitude doesn't apply in EU jurisdictions.

Scraping publicly available personal data for AI training has attracted enforcement: Clearview AI (multiple jurisdictions, GDPR/PDPA fines), CNIL enforcement against training on French personal data, ICO investigations.

Platform terms of service

Many public data sources prohibit scraping or commercial use in their terms of service. Social media platforms, news sites, forums, and databases often restrict automated data collection. Breach of terms of service creates contractual liability even where the underlying data might otherwise be legally available. hiQ Labs v LinkedIn (Ninth Circuit, 2022, on remand from the US Supreme Court) demonstrated the complexity: the court held that scraping publicly accessible profiles likely did not violate the Computer Fraud and Abuse Act, yet hiQ was later found to have breached LinkedIn's user agreement. Access isn't the same as permission for all uses.

Practical guidance

Don't assume public means free to use for AI training. For each data source: check copyright status; check whether personal data is involved; check platform terms; check applicable jurisdiction's text and data mining exceptions; document your legal basis. For most organisations, the safest approach to AI training data is: use properly licensed datasets; use synthetic data; use data you've generated or collected with appropriate consent; and keep detailed records of training data provenance for audit and litigation defence.
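The per-source checks above can be captured as a simple provenance record. The following is a minimal sketch, assuming one record per data source. The class name `SourceReview`, its field names, and the example values are all hypothetical, and an empty list of open items means only that each check was recorded, not that the source is safe to use.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class SourceReview:
    """Provenance record for one training data source (hypothetical schema)."""

    source: str
    copyright_status: Optional[str] = None    # e.g. "licensed", "unknown"
    has_personal_data: Optional[bool] = None  # True, False, or not yet checked
    terms_allow_use: Optional[bool] = None    # platform terms of service
    tdm_exception_note: Optional[str] = None  # jurisdiction and exception relied on
    legal_basis: Optional[str] = None         # documented basis for the use

    def open_items(self) -> List[str]:
        """List the checks that have not been recorded yet."""
        checks = {
            "copyright status": self.copyright_status,
            "personal data": self.has_personal_data,
            "platform terms": self.terms_allow_use,
            "text and data mining exception": self.tdm_exception_note,
            "legal basis": self.legal_basis,
        }
        return [label for label, value in checks.items() if value is None]

    def to_json(self) -> str:
        """Serialise the record so it can be stored for audit."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    review = SourceReview(source="example-forum-dump")
    review.copyright_status = "unknown"
    print(review.open_items())
    print(review.to_json())
```

Storing the JSON output alongside each dataset version gives a dated trail of what was checked and what was still open when the data was used.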

Primary sources: US Copyright Office · ICO · CNIL
