Where AI Training Data Comes From, and the Fights Over Who Said Yes

An illustrator finds her style in a model she never licensed her work to. A novelist learns her book was in a training set. Here's how the data gets collected, and who's fighting about it.

A working illustrator types her own name into an image generator and watches it spit out pictures in her style, the look she spent fifteen years building. She never signed anything. A novelist searches a leaked list of training files and finds the title she wrote in a graduate apartment. No email, no check, no heads-up. That jolt, the feeling of recognizing your own work inside a machine you never agreed to feed, is what most of the current AI fights are actually about. The legal briefs come later. The gut-punch comes first.

This is a plain-language guide to where the data behind AI models actually comes from, and why so many people are now arguing in court over whether anyone said yes.

The three ways data gets in

Almost every large AI model is trained on text, images, audio, or code that came from one of three places.

Scraping. Bots crawl the open web and copy what they find: public pages, forums, news sites, blogs, image galleries, and code repositories. Much of this flows through large public datasets. Common Crawl, a nonprofit, has spent years archiving billions of web pages, and many language models were trained on filtered slices of it. Image models leaned on collections like LAION, which indexed billions of image-and-caption pairs scraped from the web, distributing links and captions rather than the image files themselves. The key thing to understand: "publicly visible" is not the same as "free to copy." A page you can read for free can still be under copyright.
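To make "image-and-caption pairs" concrete, here is a minimal sketch of how such pairs can fall out of ordinary public HTML. It assumes the caption is simply the image's alt text; the class name, function name, and sample page are hypothetical, and this is an illustration of the general idea, not the pipeline any particular dataset used.

```python
from html.parser import HTMLParser


class ImageCaptionExtractor(HTMLParser):
    """Collects (image URL, alt text) pairs from one HTML page.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # Keep only images that come with a non-empty description.
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html: str):
    """Return every (src, alt) pair found in the given HTML string."""
    parser = ImageCaptionExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Made-up sample page: one captioned artwork, one decorative spacer.
SAMPLE_PAGE = (
    '<p>My portfolio</p>'
    '<img src="https://example.com/fox.jpg" alt="Watercolor fox in snow">'
    '<img src="/spacer.gif" alt="">'
)

if __name__ == "__main__":
    print(extract_pairs(SAMPLE_PAGE))
```

Nothing in that sketch checks who owns the image or whether anyone agreed to its use, which is the point: the pairing step is mechanical, and the consent question sits entirely outside it.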

Licensing. Increasingly, AI companies pay for data instead of just taking it. There have been publicized deals between AI developers and news organizations, stock-image libraries, social platforms, and forums to license their archives for training. Some companies also pay people directly to write or label data. Licensing is the cleanest path on consent, and it is growing, but it covers only a fraction of what big models have already absorbed.

User and contributed data. Some of what trains a model comes from the people using it: your chats, prompts, uploads, and feedback, depending on the product's settings. This is its own consent question, and it is usually governed by a terms-of-service checkbox most people never read.

What's actually being fought over

The disputes break into a few clean buckets.

Copyright. This is the big one. Multiple lawsuits, brought by authors, news publishers, visual artists, and others, argue that copying their work to train a model without permission is infringement. AI companies generally respond that training is "fair use" in the United States, an established legal doctrine that permits some unlicensed copying, weighing factors such as whether the use is transformative and how it affects the market for the original. Whether training qualifies is genuinely unsettled. Courts have not delivered a single, final answer, and reasonable lawyers disagree. Treat anyone claiming total certainty in either direction with suspicion.

Consent and notice. Even setting copyright aside, many creators say the real injury is that no one asked. A photo posted to share with friends ends up in a dataset of billions. The image is "public," but the person never imagined it becoming training fuel. The law and people's expectations are not aligned here, and that gap is where a lot of the anger lives.

Personal data and privacy. Scraped web pages contain real people's names, faces, medical posts, and old mistakes. Privacy regulators, especially in Europe, have scrutinized whether vacuuming up personal data to train models is lawful, separate from any copyright question.

Attribution and likeness. Artists worry about style mimicry. Performers and ordinary people worry about voice and face cloning. These often fall outside classic copyright and into newer questions about a person's right to their own likeness.

The opt-out tools, and their limits

A patchwork of "don't train on me" mechanisms now exists. It is worth being honest about how partial they are.

robots.txt is a decades-old file websites use to tell crawlers what not to visit. Some AI crawlers now publish names you can block in it. But robots.txt is voluntary; a bot can ignore it, and it does nothing about copies already taken.
AI-specific crawler controls let site owners block named AI bots. Again, voluntary, and only as good as the company's compliance.
Dataset opt-outs let creators ask to be excluded from some future training sets. Useful going forward, useless for models already trained.
Regulatory opt-outs. The European Union's copyright framework includes a text-and-data-mining exception that lets rightsholders reserve their works from mining, and the EU AI Act adds transparency duties around training data. The exact reach is still being worked out.
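For the robots.txt route, here is a minimal sketch of what a blocking file looks like and how a well-behaved crawler would read it, using Python's standard-library urllib.robotparser. GPTBot and CCBot are crawler names their operators have published, but names change, so treat them as examples and confirm the current ones in each operator's documentation. The helper name and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: turn away two AI-related crawlers, allow everyone else.
# The user-agent names are examples; check each operator's current docs.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "CCBot", "SomeOtherBot"):
        print(bot, is_allowed(bot, "https://example.com/portfolio/"))
    # Prints:
    # GPTBot False
    # CCBot False
    # SomeOtherBot True
```

The check runs on the crawler's side. The file only states a preference; a bot that never performs this check is not stopped by it.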

The honest summary: opt-outs are real, but they are mostly forward-looking, mostly voluntary, and put the burden on the creator to find and use them.

The even-handed part

There is a genuine case on both sides, and skipping that would be spin.

The builders' argument: models learn patterns rather than storing and replaying copies; training on broad swaths of human knowledge is closer to how a person learns by reading; and a world where every data point needs a signed license could lock model-building to whoever already owns the biggest archives.

The creators' argument: a commercial product is being built directly on their labor, sometimes competing with them, and "the machine learned from it" should not erase the fact that someone's work was copied without pay or permission.

Both can be partly right. That tension is exactly why this is being settled case by case rather than by slogan.

What you can actually do

If you create. Know that anything public can be scraped. Add AI-crawler rules to sites you control, use available dataset opt-outs, and keep records of your work and where you post it. None of this is bulletproof, but it builds a paper trail.
If you use AI tools. Check the data settings. Many products let you turn off training on your inputs; some leave it on by default. Read that one checkbox.
If you follow the news. Watch the court rulings, not the headlines. The fair-use question is the hinge, and it is being decided slowly. Until a higher court or legislature settles it, anyone selling certainty is selling something.

The record is still being written, and neither side's case has been proven yet.