---
id: "action-establish-metrics"
type: "action-item"
source_timestamps: ["§ The Road Ahead"]
tags: ["measurement", "academic-rigor"]
related: ["action-setup-poc", "open-question-modality-vs-content"]
action: "Establish key metrics such as test-retest reliability and external validity for AI-moderated research."
outcome: "Ensures the data collected by AI moderators is scientifically rigorous and trustworthy for strategic decision-making."
sources: ["commercial"]
sourceVaultSlug: "hbr-seg-commercial"
originDay: 5
articleStem: "hbr-new-30-ai-scale-customer-research"
sourceUrl: "https://hbr.org/2026/04/how-ai-helps-scale-qualitative-customer-research"
sourceTitle: "How AI Helps Scale Qualitative Customer Research"
---
# Establish Reliability and Validity Metrics

**Action.** AI moderation is still in its infancy as a *measurement tool*, so evaluate its outputs rigorously. Establish core psychometric and research metrics, specifically **test–retest reliability** (consistency of AI probing over time) and **external validity** (how well the AI's findings generalize to the real world).

**Expected outcome.** Data that is scientifically rigorous and trustworthy for strategic decisions.
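As a rough illustration of the test–retest reliability named in the action, the sketch below correlates scores from two waves of the same AI-moderated study. It assumes each participant's session has already been reduced to one numeric score (for example, a coded sentiment rating); the function name `pearson_r` and the sample scores are hypothetical and do not come from the source article.

```python
from math import sqrt


def pearson_r(wave1, wave2):
    """Pearson correlation between paired scores from two study waves.

    A common (not the only) way to summarize test-retest reliability:
    values near 1 mean participants are ranked similarly in both waves.
    """
    if len(wave1) != len(wave2) or len(wave1) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(wave1)
    mean1 = sum(wave1) / n
    mean2 = sum(wave2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(wave1, wave2))
    var1 = sum((a - mean1) ** 2 for a in wave1)
    var2 = sum((b - mean2) ** 2 for b in wave2)
    if var1 == 0 or var2 == 0:
        raise ValueError("correlation is undefined when a wave has no variance")
    return cov / sqrt(var1 * var2)


# Hypothetical scores for the same five participants, interviewed twice.
first_wave = [3.0, 4.5, 2.0, 5.0, 3.5]
second_wave = [3.2, 4.4, 2.3, 4.8, 3.6]
print(f"test-retest r = {pearson_r(first_wave, second_wave):.3f}")
```

What counts as an acceptable coefficient depends on the decision the data will support, so any threshold should be set by the research team rather than taken from this sketch.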

This extends [[action-setup-poc]] and connects to the causal uncertainty in [[open-question-modality-vs-content]]. Enrichment situates this in standard measurement theory (Nunnally et al.: reliability, construct validity, external validity) and adds a fairness dimension: LLM and emotion models can encode demographic bias, so probes and interpretations "may systematically differ by race, gender, dialect." A complete rigor program therefore also audits for **bias/fairness**, in addition to reliability and validity.
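One simple starting point for the bias/fairness audit is to check whether the AI moderator probes some groups more than others. The sketch below assumes each session is logged as a `(group_label, n_probes)` pair; that record shape, the function names, and the sample data are all hypothetical. A gap in probe counts is only a flag for closer review, not proof of bias.

```python
from collections import defaultdict


def probe_rate_by_group(sessions):
    """Mean number of follow-up probes per session, for each group.

    `sessions` is an iterable of (group_label, n_probes) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [probe sum, session count]
    for group, n_probes in sessions:
        totals[group][0] += n_probes
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def max_gap(rates):
    """Largest difference in mean probe count between any two groups."""
    values = list(rates.values())
    return max(values) - min(values)


# Hypothetical session log: group label and number of probes asked.
sessions = [("A", 4), ("A", 6), ("B", 2), ("B", 4)]
rates = probe_rate_by_group(sessions)
print(rates)           # {'A': 5.0, 'B': 3.0}
print(max_gap(rates))  # 2.0
```

The same pattern can be applied to interpretation outputs (for example, coded sentiment per group), provided the group labels are collected with participant consent.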
