Generative-AI companies have established extraordinary influence over how people seek and access information. Chatbots, which confidently promise to answer any question and can generate images and videos remarkably quickly, are replacing traditional search engines and human experts as go-to sources of knowledge. Yet their inputs—the data that determine how chatbots respond to their users—are secrets closely guarded by powerful companies that are fighting intensely with one another for AI dominance.
The question of how, exactly, AI models are trained is hugely consequential—and not only because AI companies have trained their machines on an enormous number of copyrighted works without the consent of writers, musicians, podcasters, filmmakers, and others. (Many tech companies have been sued for doing this, and the legality of the practice remains an open question.) The works undergirding an AI’s behavior may also include misinformation, conspiracy theories, and material that some people may find objectionable: racist text, pornographic media, step-by-step instructions for committing acts of violence, and so on.
The Atlantic’s goal in creating AI Watchdog is to open machine learning’s black box. Understanding the future of technology—and the wild imagination, hubris, and upheaval that accompany every technological revolution—has been a preoccupation among Atlantic writers for generations. Vannevar Bush anticipated the hyperlink in our pages. And we were trying to train machines to write like humans long before ChatGPT existed. More recently, we published a groundbreaking investigation of Books3, a data set of nearly 200,000 copyrighted books that were used to train large language models. Since then, we’ve covered a much larger pirated book collection and shown that writing from movies and TV shows has also been used by AI companies without consent from writers.
AI Watchdog expands these efforts with a search tool that allows you to see what material is included in various data sets—and which tech companies use that material to train their AI products. At launch, the tool includes more than 7.5 million books, 81 million research articles, 15 million YouTube videos, and writing from tens of thousands of movies and television shows. We will continue to add more data sets as we verify them. Most of the data sets in our collection were created by AI companies or research organizations and shared publicly on AI-developer forums.
If my work shows up in the search tool, was it definitely used to train AI?
It’s likely, but a work’s appearance in a data set is not definitive proof that a given company actually used that work. A company may have chosen to exclude particular works when training its models.
How do AI companies acquire content without paying for it?
Sometimes AI companies pay to license content for training, but they also use a number of techniques to avoid paying:
- Books are usually acquired through pirated libraries on the web or via BitTorrent.
- Other media may be downloaded by broadly scraping the web, or by downloading existing scrapes of the web, such as Common Crawl (a sketch of querying its public index follows this list).
- Search engines such as Bing, Brave, and Google maintain indexes that make full-text articles available for use by AI companies.
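Common Crawl is the most prominent of those existing scrapes, and its index of captured web pages is publicly searchable. Below is a minimal sketch, in Python, of querying that index to see whether pages from a given site were captured in a crawl. The crawl name ("CC-MAIN-2024-33") and the domain ("example.com") are placeholders; the list of current crawls lives at index.commoncrawl.org.

```python
import json

import requests

# A minimal sketch, assuming the public Common Crawl URL index (CDX API).
# "CC-MAIN-2024-33" and "example.com" are placeholders.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"

resp = requests.get(
    INDEX,
    params={"url": "example.com/*", "output": "json", "limit": "5"},
    timeout=30,
)
resp.raise_for_status()

# The index returns one JSON object per line; each record points to the
# raw bytes of a captured page inside Common Crawl's WARC archive files.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["url"], record["status"], record["filename"])
```

Each record locates a captured page inside Common Crawl’s openly hosted archive files, which is how a scrape of the web becomes raw material for training.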
How can I prevent AI companies from using my work?
The training-data frenzy has been happening for several years, and many companies may have already used your work. However, tech firms are still constantly scraping the web for new material. There are things you can do that may help protect your work.
If your work is visual, putting a watermark or logo on your images or videos will make them less attractive for AI training. Companies generally don’t want their models’ output to include identifiable marks from the original creators; Stability AI, for example, was sued by Getty Images after its Stable Diffusion image generator produced synthetic photos containing a Getty Images watermark.
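To make the idea concrete, here is a minimal sketch of visible watermarking using the Pillow imaging library in Python. The file names and watermark text are placeholders, and a visible mark is a deterrent rather than a guarantee.

```python
from PIL import Image, ImageDraw, ImageFont

# Placeholder file names; any image Pillow can open will work.
img = Image.open("photo.jpg").convert("RGBA")
overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)

text = "© Your Name"  # placeholder watermark text
font = ImageFont.load_default()

# Measure the text and place it in the lower-right corner,
# drawn in white at partial opacity.
left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
w, h = right - left, bottom - top
draw.text((img.width - w - 10, img.height - h - 10), text,
          font=font, fill=(255, 255, 255, 160))

Image.alpha_composite(img, overlay).convert("RGB").save("photo_marked.jpg")
```

Ordinary image editors can do the same thing without code; the point is simply that the mark travels with the image wherever it is scraped.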
There are also AI-poisoning systems, such as Nightshade and Glaze, which change images in ways that are invisible to humans but that may interfere with AI models’ ability to learn from those images. Poisoned AI models may generate incoherent content. At least one poisoning system has also been developed for music.
I believe my work was used by a specific company. What can I do about it?
Individuals and institutions have brought dozens of lawsuits against AI companies for training their products with copyrighted books, articles, songs, videos, and art. Some of these lawsuits are class actions, meaning you might be entitled to damages if the plaintiffs win. (The Atlantic is a plaintiff in one such suit, against the AI start-up Cohere.) The likelihood of receiving damages will be greater if your work is registered with the U.S. Copyright Office.