But if you're not intimately familiar with the AI industry and copyright, you might wonder: Why would a company spend millions of dollars on
books to destroy them? Behind these odd legal maneuvers lies a more fundamental driver: the AI industry's insatiable hunger for high-quality
text.

To understand why Anthropic would want to scan millions of books, it's important to know that AI researchers build large language models (LLMs) like those that power ChatGPT and Claude by feeding billions of words into a neural network. During training, the AI system processes the text repeatedly, building statistical relationships between words and concepts in the process.
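To make "statistical relationships" a little more concrete, here is a deliberately tiny sketch, nothing like the neural-network training Anthropic actually does: it just counts which words tend to follow which in a sample of text. Real LLMs learn far subtler patterns from billions of words, but the basic idea of extracting word-to-word statistics from example text is the same.

```python
from collections import Counter, defaultdict

# Toy "training data": in reality this would be billions of words.
training_text = "the cat sat on the mat the cat slept on the sofa"
words = training_text.split()

# Count how often each word is followed by each other word.
follows = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    follows[current_word][next_word] += 1

# The learned "statistical relationship": which word most often follows "the"?
most_likely = follows["the"].most_common(1)[0][0]
print(f"After 'the', the most likely next word is: {most_likely}")  # prints 'cat'
```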
The quality of training data fed into the neural network directly impacts the resulting AI model's capabilities. Models trained on well-edited books and articles tend to produce more coherent, accurate responses than those trained on lower-quality text like random YouTube comments.

Publishers legally control content that AI companies desperately want, but AI companies don't always want to negotiate licensing deals for it. That meant buying physical books outright, rather than licensing the text, offered a legal workaround. And yet buying things is expensive, even if it is legal.
So like many AI companies before it, Anthropic initially chose the quick and easy path. In the quest for high-quality training data, the court filing states, Anthropic first chose to amass digitized versions of pirated books. But by 2024, Anthropic had become "not so gung ho about" using pirated ebooks "for legal reasons" and needed a safer source.