---
id: "claim-data-value-percentage"
type: "claim"
source_timestamps: ["§ Ask the Bot"]
tags: ["scaling-laws", "value-attribution", "ai-economics"]
related: ["concept-scaling-laws-valuation", "entity-dario-amodei", "entity-chris-olah", "framework-cmo-compensation"]
confidence: "high"
testable: true
sources: ["tail1"]
sourceVaultSlug: "hbr-seg-tail1"
originDay: 1
articleStem: "hbr-tail-109-ai-pay-fair-rates-content"
sourceUrl: "https://hbr.org/2026/06/how-ai-companies-can-pay-fair-rates-for-the-content-they-need"
sourceTitle: "How AI Companies Can Pay Fair Rates for the Content They Need"
---
# Training data accounts for 20% to 50% of a model's pre-training value

## Claim

Training data accounts for roughly **20% to 50%** of a model's **pre-training** value. The authors suggest **one-third (about 33%)**, near the middle of that range, as a working number for compensation frameworks.

## The bounds

- **Upper bound (~40–50%):** derived from standard industry estimates of [[concept-scaling-laws-valuation|scaling laws]], before crediting algorithmic innovation.
- **Lower bound (~20%):** from a 2021 memo attributed to Anthropic's [[entity-dario-amodei|Dario Amodei]] and [[entity-chris-olah|Chris Olah]].
- **Midpoint (~33%, one-third):** the suggested working number; it sits near the middle of the 20–50% range.

According to the source article, the range provides evidence-based bounds that **automatically update** as AI technology shifts. It feeds **Step 1** of the [[framework-cmo-compensation]].
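
To make the arithmetic concrete, here is a minimal sketch of how the three shares could be applied to a pre-training value. The names `DataShare` and `data_value_range`, and the 90,000,000 figure, are hypothetical and chosen only for illustration; the note does not specify how Step 1 of the framework performs this calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataShare:
    """Share of pre-training value attributed to training data (from this note)."""
    low: float = 0.20        # lower bound (~20%)
    working: float = 1 / 3   # suggested working number (one-third)
    high: float = 0.50       # upper bound (~50%)


def data_value_range(pretraining_value: float, share: DataShare = DataShare()) -> dict:
    """Return the low, working, and high data-attributable value.

    `pretraining_value` is an assumed input in any currency unit; how it is
    estimated is outside the scope of this sketch.
    """
    if pretraining_value < 0:
        raise ValueError("pretraining_value must be non-negative")
    return {
        "low": pretraining_value * share.low,
        "working": pretraining_value * share.working,
        "high": pretraining_value * share.high,
    }


# Hypothetical example: a model whose pre-training value is put at 90,000,000.
estimate = data_value_range(90_000_000.0)
for label, amount in estimate.items():
    print(f"{label}: {amount:,.0f}")
```

If the bounds are revised, passing a different `DataShare` instance changes every downstream figure without touching the calculation itself.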

## Confidence: HIGH · Testable: yes

## Enrichment caveat

**Not verified** by the sources reviewed. The provided evidence supports only that marginal data value *can be studied*; it does not corroborate the specific 20–50% interval or the 33% midpoint. Treat the numbers as the authors' estimates, not settled findings.
