Anthropic Settles for $1.5 Billion Over Pirated Books in AI Training

Serge Bulaev

Serge Bulaev

Courts in the U.S. generally find that it may be fair use for AI companies to train models on books if the books were lawfully bought or licensed. However, using pirated or illegally obtained books has led to legal problems. Anthropic agreed to pay $1.5 billion to settle claims that it stored and used pirated books for AI training, but the settlement does not allow future use of those books. Experts suggest that future court cases might focus more on whether AI companies got their data legally. There is no single rule yet, but courts seem to accept fair use more often if the data was legally acquired and the AI does not directly replace the original books.

Anthropic Settles for $1.5 Billion Over Pirated Books in AI Training

Anthropic agreed to a $1.5 billion settlement resolving authors' claims that it used pirated books for AI training. Recent U.S. court rulings on AI training data highlight a critical distinction: using lawfully acquired books may be considered fair use, but using pirated copies is not. This legal pattern is central to why Anthropic settled a landmark case for $1.5 billion over its use of pirated books for AI training. Federal opinions consistently focus on two key questions: the legal origin of the training data and whether the AI's purpose is "transformative."

The legal reasoning is becoming clear: courts often find that training AI on purchased or licensed materials is "highly transformative" and permissible under fair use. However, using texts from illicit "shadow libraries" can create liability for copyright infringement, regardless of the training's transformative nature. This distinction was pivotal in a mid-2025 ruling where Judge William Alsup noted Anthropic's AI training could be transformative, but its possession of pirated files was a separate violation, as detailed in a Variety report.

US rulings on training LLMs: courts generally find fair use when works were lawfully acquired

U.S. courts are establishing a pattern where AI training is considered fair use if the data was legally acquired and the resulting AI model is 'transformative' - meaning it doesn't directly substitute the original work. Liability arises not from the training itself, but from the illegal acquisition of copyrighted materials.

Two key cases highlight this legal landscape:

  • Bartz v. Anthropic: The Northern District of California ruled that while lawful copies could be used for training, damages could still arise from using unlawfully acquired data.
  • Kadrey v. Meta: A separate California decision reinforced the transformative-use argument, finding minimal harm to the original book market.

These rulings hinge on two of the four Fair Use factors:

  • Purpose and Character: Is the text repurposed for statistical analysis rather than for human reading?
  • Effect on the Market: Does the AI system directly compete with or replace the original books?

The $1.5 billion Anthropic settlement

The $1.5 billion settlement for about 500,000 books was reported in September 2025 and was still under court review in May 2026; it had not newly gained preliminary approval in September 2025 as stated here. This agreement specifically resolves claims about past possession of pirated works. It compensates authors with approximately $3,000 per book and mandates that Anthropic destroy the disputed files, but it does not grant a license for future AI training.

What the split decisions may indicate

Legal experts note that AI developers who properly source their data through purchases or publisher deals have stronger legal footing. In contrast, datasets built from illicit sources consistently face infringement claims, shifting the legal battleground from fair use arguments to data provenance audits.

This has led to a split in compensation. While some individual authors report one-time licensing offers around $2,500 per title, large media corporations have secured significant deals for their entire catalogs. This disparity is expected to continue, with publishers commanding far larger sums than individual creators.

Ultimately, while no single rule governs AI training, a clear formula is emerging from U.S. courts: fair use is defensible when data is lawfully acquired and the AI output is non-substitutive. The legal focus is increasingly on the 'how' of data acquisition, not just the 'what' of its use.


Why did Anthropic agree to pay $1.5 billion if the judge ruled AI training could be fair use?

Judge Alsup in Bartz v. Anthropic held that training itself may be fair use because the models transform the works into statistical patterns rather than competing book markets. The liability arose solely from Anthropic's storage and use of pirated copies, not from the act of training. The company opted to settle without admitting fault, agreeing to destroy the allegedly pirated files and pay roughly $3,000 per affected book to a class covering approximately 500,000 titles.

How did the courts distinguish between "fair use training" and "pirated acquisition"?

Courts apply the four fair-use factors differently:

  • Purpose and character - high transformation weighs in favor when the output is non-competitive with the original works.
  • Market harm - training is not seen as a substitute for buying the book, so this factor favored AI companies in Kadrey v. Meta and Bartz v. Anthropic.

However, the same rulings stress that unlawful acquisition (piracy or unlicensed downloading) is a separate tort, so the fair-use defense does not immunize misuse of the source copies themselves.

Could every author now demand $3,000 for allowing their books into training data?

The $3,000 figure comes from the class-action settlement, not a market rate or court-imposed licensing fee. Data from 2024-2026 show selective, one-time payments instead:

Example: HarperCollins negotiated $2,500 per title for certain nonfiction works. According to industry reports, major publishers have obtained significant sums for their entire catalogs, dwarfing individual author checks. Industry observers expect bigger deals for publishers, modest token sums for individual authors.

Are there cases where AI training was ruled not fair use?

Yes. In Thomson Reuters v. Ross Intelligence, the Delaware court found no fair use because Ross trained a competing legal research product using Westlaw headnotes. The use was commercial and non-transformative, and it directly competed with Thomson Reuters' market, satisfying the market-harm factor.

What practical steps should AI companies take to avoid billion-dollar settlements?

  • Verify dataset provenance: Courts are receptive to transformative training only when the source works are lawfully acquired.
  • Negotiate publisher-level deals: News Corp, Reddit, and HarperCollins have shown large-scale agreements can insulate against class-action suits.
  • Document transformation and non-substitution: Detailed evidence that the model does not reproduce or compete with original works strengthens the purpose-and-character factor.