Can I train AI on public data? The legal answer is complicated
The question seems straightforward: if data is publicly available, can you use it to train AI? The answer across every major jurisdiction is that it depends, and "publicly available" is not the same as "legally available for any purpose."
Copyright law
Public availability does not waive copyright. Publicly posted articles, images, code, music, and other creative works are typically protected by copyright. Using them for AI training may or may not qualify as fair use (US), fair dealing (UK, Australia, Canada), or another applicable exception. This is the central question in NYT v OpenAI (filed December 2023, ongoing), the suits brought by Sarah Silverman and other authors, and dozens of similar cases against foundation model providers. The US Copyright Office's January 2025 report confirmed that fully AI-generated works are not copyrightable, and its May 2025 pre-publication report on generative AI training treated fair use as a question that turns on the facts of each case. Japan's copyright framework has been relatively permissive for AI training but is being tested (Yomiuri Shimbun v Perplexity, 2025). The EU's text and data mining exceptions (Directive 2019/790) permit mining for scientific research by research organisations and cultural heritage institutions (Article 3), and mining for other purposes, including commercial ones, only where rights holders have not opted out (Article 4).
Data protection law
If public data includes personal data (names, faces, social media profiles, public records), data protection law generally still applies even though the data is publicly available. GDPR, UK GDPR, PDPA, the DPDP Act, and the Australian Privacy Act all regulate the processing of personal data. Public availability may help support a lawful basis (such as legitimate interests under GDPR), but it doesn't eliminate obligations for transparency, data minimisation, and individual rights. The Data (Use and Access) Act 2025 (DUAA) reforms to purpose limitation under UK GDPR Article 5(1)(b) give UK-based organisations more latitude to repurpose personal data for AI training. This is the most material UK-EU divergence since Brexit, and the added latitude doesn't apply in EU jurisdictions.
Scraping publicly available personal data for AI training has attracted enforcement, including fines against Clearview AI in multiple jurisdictions (GDPR/PDPA), CNIL enforcement against training on French personal data, and ICO investigations.
Platform terms of service
Many public data sources prohibit scraping or commercial use in their terms of service. Social media platforms, news sites, forums, and databases often restrict automated data collection. Breach of terms of service can create contractual liability even where the underlying data might otherwise be legally available. hiQ Labs v LinkedIn (Ninth Circuit, 2022, on remand from the US Supreme Court) demonstrated the complexity: the court held that scraping public profiles likely did not violate the Computer Fraud and Abuse Act, yet LinkedIn's breach-of-contract claims remained live. Access isn't the same as permission for all uses.
Practical guidance
Don't assume public means free to use for AI training. For each data source: check copyright status; check whether personal data is involved; check platform terms; check the applicable jurisdiction's text and data mining exceptions; and document your legal basis. For most organisations, the safest approach to AI training data is to use properly licensed datasets, use synthetic data, use data you've generated or collected with appropriate consent, and keep detailed records of training data provenance for audit and litigation defence.
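The per-source checks and the provenance record described above can be kept in a simple structured log. The following is a minimal sketch only, assuming a Python-based data pipeline; the class name SourceRecord, its field names, and the helper to_json_line are hypothetical illustrations, not a standard schema. An empty list of open items means the checklist has been filled in, not that the source has been legally cleared.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import List, Optional


@dataclass
class SourceRecord:
    """One provenance entry per training data source (hypothetical schema)."""
    name: str
    url: str
    jurisdiction: str
    copyright_status: str = "unknown"          # e.g. "licensed", "public domain"
    contains_personal_data: Optional[bool] = None  # None = not yet checked
    terms_permit_use: Optional[bool] = None        # None = not yet checked
    tdm_exception_checked: bool = False
    legal_basis: str = ""                      # free-text note from legal review
    reviewed_on: str = field(default_factory=lambda: date.today().isoformat())

    def open_items(self) -> List[str]:
        """Return the checklist steps that have not been recorded yet."""
        items = []
        if self.copyright_status == "unknown":
            items.append("copyright status")
        if self.contains_personal_data is None:
            items.append("personal data check")
        if self.terms_permit_use is None:
            items.append("platform terms")
        if not self.tdm_exception_checked:
            items.append("text and data mining exceptions")
        if not self.legal_basis:
            items.append("documented legal basis")
        return items


def to_json_line(record: SourceRecord) -> str:
    """Serialise a record as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    rec = SourceRecord(name="example-forum", url="https://example.com",
                       jurisdiction="EU")
    print(rec.open_items())   # every step is still outstanding
    print(to_json_line(rec))
```

Appending one such line per source, each time a source is reviewed, gives an auditable history of what was checked and when. The legal conclusions themselves still have to come from counsel.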
Primary sources: US Copyright Office · ICO · CNIL