---
id: "concept-shadow-libraries"
type: "concept"
source_timestamps: ["§ Lessons for Gen AI Companies", "¶14"]
tags: ["training-data", "piracy", "litigation"]
related: ["concept-piracy-caveat", "claim-piracy-financial-risk", "entity-anthropic", "entity-meta"]
definition: "Massive illegal online repositories of pirated books and documents (e.g., LibGen, Books3) scraped by AI companies to train large language models."
sources: ["tail2"]
sourceVaultSlug: "hbr-seg-tail2"
originDay: 2
articleStem: "hbr-tail-126-genai-copyright"
sourceUrl: "https://hbr.org/2025/07/can-gen-ai-and-copyright-coexist"
sourceTitle: "Can Gen AI and Copyright Coexist?"
---
# Shadow Libraries

Shadow libraries are large, illicit online repositories holding millions of pirated books, academic papers, and articles; the best-known examples are **LibGen** and **Books3**. The article stresses that major generative-AI companies relied heavily on these repositories to reach the volume of text required for LLM training.

Figures reported from discovery: *Bartz v. Anthropic* surfaced the use of **7 million pirated books** (attributed to [[entity-anthropic-d2]]), while *Kadrey v. Meta* indicated that [[entity-meta-d2]] used at least **82 terabytes** of pirated book data. Other suits, including *Tremblay v. OpenAI* and *O'Nan v. Databricks*, center on similar datasets. Reliance on these datasets is what triggers the [[concept-piracy-caveat]] in a fair-use defense and drives the exposure quantified in [[claim-piracy-financial-risk]].

**Enrichment flag on the numbers:** The *core* claim (that AI firms used shadow-library datasets such as LibGen and Books3, and that this use is at the center of multiple lawsuits) is strongly supported by legal and industry commentary. However, the *specific quantities* ("7 million books," "82 TB") appear in secondary reporting and complaint allegations rather than as adjudicated court findings, so they should be labeled as **allegation/secondary-reporting estimates**, not established facts. Some commentary on the Anthropic class certification puts the pirated set in the hundreds of thousands of works, with theoretical statutory exposure in the tens of billions of dollars.
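
The step from hundreds of thousands of works to tens of billions of dollars follows from per-work statutory damages. Below is a minimal sketch of that arithmetic. It assumes the per-work ranges in 17 U.S.C. § 504(c) ($750 to $30,000 per work, rising to a maximum of $150,000 per work for willful infringement) and a purely hypothetical count of 500,000 works. Neither that count nor the helper name `exposure_range` comes from the article, and actual awards are set by a court, not by multiplication.

```python
# Per-work statutory damages under 17 U.S.C. § 504(c), in US dollars.
STATUTORY_MIN = 750        # ordinary minimum per infringed work
STATUTORY_MAX = 30_000     # ordinary maximum per infringed work
WILLFUL_MAX = 150_000      # maximum per work if infringement is willful


def exposure_range(num_works: int) -> dict[str, int]:
    """Return theoretical statutory exposure for a given number of works.

    Illustrative helper only: it multiplies the work count by each
    per-work bound and ignores everything a court would actually weigh.
    """
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    return {
        "ordinary_min": num_works * STATUTORY_MIN,
        "ordinary_max": num_works * STATUTORY_MAX,
        "willful_max": num_works * WILLFUL_MAX,
    }


# Hypothetical count chosen only to illustrate "hundreds of thousands".
HYPOTHETICAL_WORKS = 500_000

for label, total in exposure_range(HYPOTHETICAL_WORKS).items():
    print(f"{label}: ${total:,}")
```

Under these assumptions the ordinary maximum is $15 billion and the willful maximum is $75 billion, which is how a set of this size reaches "tens of billions" in theoretical exposure.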
