---
id: "concept-controlled-experimentation-ai"
type: "concept"
source_timestamps: ["§ Controlled Experimentation"]
tags: ["a-b-testing", "roi-measurement", "data-science"]
related: ["concept-business-value-measurement", "action-run-ai-experiments"]
part_of: "framework-6-disciplines-gen-ai"
definition: "The practice of using A/B testing and statistical analysis to empirically measure the productivity and quality impacts of generative AI on specific business tasks."
sources: ["spine"]
sourceVaultSlug: "hbr-seg-spine"
originDay: 1
articleStem: "hbr-cl-95-6-disciplines-genai"
sourceUrl: "https://hbr.org/2024/07/the-6-disciplines-companies-need-to-get-the-most-out-of-gen-ai"
sourceTitle: "The 6 Disciplines Companies Need to Get the Most Out of Gen AI"
---
# Controlled Experimentation for AI Value

Discipline #2 of the [[framework-6-disciplines-gen-ai|six disciplines]]. Leaders **cannot assume** generative AI will universally boost productivity or quality; its efficacy varies widely by task and application. The only reliable way to establish its value in a specific business domain is **controlled experimentation**.

The method:
- Create experimental groups (one using Gen AI, one not) and **measure and compare** their productivity or effectiveness.
- Test different **modalities** — e.g., AI as a solo generator versus a collaborative *co-pilot*.
- Running these experiments in practice requires the background covered in [[prereq-ab-testing-stats]] (control vs. treatment groups, statistical significance).
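
A minimal sketch of the method above, using only the Python standard library. The article does not prescribe a specific statistical test; a permutation test on the difference in means is used here as one simple option. The function names (`assign_groups`, `permutation_test`), the participant IDs, and the task-time figures are all hypothetical and exist only to illustrate the mechanics.

```python
import random


def assign_groups(participants, seed=0):
    """Randomly split participants into control and treatment halves."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_test(control, treatment, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed difference, approximate p-value).
    """
    rng = random.Random(seed)

    def avg(values):
        return sum(values) / len(values)

    observed = avg(treatment) - avg(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Reshuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = avg(pooled[n_control:]) - avg(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical participants, randomly assigned before the task starts.
control_ids, treatment_ids = assign_groups([f"emp-{i}" for i in range(16)])

# Invented data: minutes to complete the same task in each group.
control_minutes = [42, 39, 45, 41, 44, 40, 43, 46]    # no Gen AI
treatment_minutes = [35, 33, 38, 31, 36, 34, 37, 32]  # with Gen AI

effect, p = permutation_test(control_minutes, treatment_minutes)
print(f"Mean difference (treatment - control): {effect:.1f} minutes")
print(f"Approximate two-sided p-value: {p:.4f}")
```

The same comparison can be repeated per modality (for example, solo generator vs. co-pilot) by treating each modality as its own treatment group. A real study would also need an adequate sample size and a quality measure alongside speed.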

The authors note that the statistical analysis is straightforward for data scientists, yet experimentation is a discipline **most organizations currently leave to academics and vendors**. Building this capability **in-house** is vital for ongoing AI assessment. The concrete step is [[action-run-ai-experiments]]. Experimentation feeds directly into the next discipline, [[concept-business-value-measurement]].

Enrichment nuance: RCTs of Gen AI in writing, programming, and customer support show large productivity/quality impacts, validating the approach; Microsoft, Google, and major platforms routinely test AI features this way. **Counter-perspective:** in small organizations or rare, high-stakes tasks, rigorous randomized experiments may be impractical — observational studies, expert judgment, and simulation play larger roles. Narrow A/B tests can also miss second-order and system-level effects (skill development, error propagation, customer trust).


## Related across articles
- [[concept-ai-learning-journeys]]
- [[action-run-half-day-prototype]]
- [[concept-minimum-viable-ai]]
