---
tags: ["synthesis", "training-data", "copyright", "data-provenance"]
articles: ["a126", "a123", "a128", "a129"]
synthesis: true
id: "cross-training-data-economy"
sources: ["tail2"]
type: "synthesis"
sourceVaultSlug: "hbr-seg-tail2"
originDay: 2
articleStem: "hbr-seg-tail2"
sourceUrl: "(unified vault: 14 sources)"
sourceTitle: "HBR — Tail Ⅱ · Founders, PE, 2025 items, industry/security/ops (#118–131)"
---
## Data quality and provenance are becoming the battleground

Four articles, spanning law, strategy, legal tech, and security, converge on one idea: *what data a model is trained on* (its provenance, curation, and legality) is now a first-order strategic and risk variable.

- **A126 (legal):** unlicensed or pirated training data creates existential legal and financial exposure ([[claim-piracy-financial-risk]], the [[concept-piracy-caveat]], [[concept-shadow-libraries]]). The market response is [[concept-curated-training-datasets]] and licensing, alongside the contrarian claim that unlicensed data may be unnecessary ([[contrarian-unlicensed-data-unnecessary]]).
- **A123 (strategy):** Chinese vertical dominance rests on [[concept-domain-specific-small-models]] trained on an 80/20 industry-specific data split, which the article presents as evidence that curated, domain-heavy data beats generic scale for business tasks.
- **A129 (legal-tech):** argues for [[concept-domain-specific-legal-training]] and 'better data over more data' ([[claim-precision-non-negotiable]]): jurisdiction-specific data, not corpus size, produces enforceable contracts.
- **A128 (security):** the enrichment explicitly broadens the AI supply chain to include *model and data provenance* (backdoored pre-trained models, poisoned repositories) — a SolarWinds-style risk.

## The unifying shift

The field is moving from 'scrape everything' to 'curate deliberately.' A126 supplies the legal pressure; A123 and A129 supply the performance argument (targeted data outperforms indiscriminate scale for verticals); A128 supplies the security argument (provenance is an attack surface). Together they suggest a future where clean, licensed, domain-specific, provenance-verified data is both a competitive asset and a compliance necessity, a natural extension of [[cross-governance-transparency-gate]].