---
id: "contrarian-unlicensed-data-unnecessary"
type: "contrarian-insight"
source_timestamps: ["§ Lessons for Gen AI Companies", "¶17"]
tags: ["contrarian-insight", "ai-research", "industry-assumptions"]
related: ["claim-unlicensed-data-performance", "entity-eleuther-ai", "question-unlicensed-data-necessity"]
challenges: "The Gen AI industry's core defense that indiscriminate scraping of copyrighted web data and shadow libraries is a strict technical necessity for state-of-the-art performance."
sources: ["tail2"]
sourceVaultSlug: "hbr-seg-tail2"
originDay: 2
articleStem: "hbr-tail-126-genai-copyright"
sourceUrl: "https://hbr.org/2025/07/can-gen-ai-and-copyright-coexist"
sourceTitle: "Can Gen AI and Copyright Coexist?"
---
# Contrarian: Unlicensed Scraping May Be Technically Unnecessary

**Contrarian insight.** This note challenges the generative-AI industry's core defense: that indiscriminate scraping of copyrighted web data and shadow libraries is a *strict technical necessity* for frontier LLM performance.

The evidence comes from [[entity-eleuther-ai]], which released **Common Pile v0.1**, an 8 TB dataset composed entirely of openly licensed or public-domain text, and reported that models trained on it performed comparably to models trained on unlicensed copyrighted data (see [[claim-unlicensed-data-performance]] and [[quote-eleuther-performance]]). If that result holds, the marginal performance benefit of scraping unlicensed data does not justify its legal and financial risk.

**Balancing view (from enrichment):** Many practitioners argue that the *scale and diversity* of training data are critical to frontier performance, and that restricting corpora to openly licensed text may degrade quality in specialized domains (literature, academic research, news). EleutherAI's Common Pile results are promising but preliminary and largely self-reported. Without broad, independent benchmarks across tasks and model sizes, this remains a **hypothesis that challenges industry assumptions, not settled fact**; see the open question [[question-unlicensed-data-necessity]].
