---
id: "action-build-test-suite"
type: "action-item"
source_timestamps: ["13:50:00", "14:25:00"]
tags: ["testing", "mlops"]
related: ["concept-quantitative-skill-testing", "claim-agents-lack-recovery"]
speakers: ["Nate B. Jones"]
outcome: "Ensures updates to a skill actually improve performance and catches regressions before they reach autonomous agents."
sources: ["s43-file-format-agreement"]
sourceVaultSlug: "s43-file-format-agreement"
originDay: 43
---
# Build a quantitative test suite for skills

## Action

Create a **basket of tests** to quantitatively measure a skill's performance across different versions and wording changes.

## Why

Agents lack the human recovery loop (see [[claim-agents-lack-recovery]]). If a regression slips into production, the agent may propagate flawed outputs through hundreds of downstream invocations.

## Outcome

Ensures updates to a skill **actually improve** performance and catches regressions before they reach autonomous agents.

## How

- Curate a fixed set of representative inputs (happy path + known edge cases).
- For each skill version, run the suite and score outputs against expected results.
- Use an evaluation framework such as LangSmith or Arize Phoenix, or a lightweight custom harness.
- Treat passing the suite as a **deploy gate** before any skill update goes to production agents (see the sketch after this list).
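
A minimal sketch of such a harness is below. The `run_skill` entry point, the inline demo cases, and the exact-match metric are placeholder assumptions; a real suite would call the actual skill, load a curated case file, and use a task-appropriate scoring rule (rubric, similarity, or an eval framework's scorers).

```python
# Minimal sketch of a skill test suite used as a deploy gate.
# `run_skill`, the demo cases, and the exact-match metric are placeholders --
# swap in the real skill invocation, the curated case set, and a metric
# that fits the task.

from dataclasses import dataclass


@dataclass
class Case:
    name: str       # e.g. "happy-path-summary" or "edge-empty-input"
    input: str      # prompt / payload sent to the skill
    expected: str   # reference output used for scoring


def run_skill(skill_version: str, case_input: str) -> str:
    """Stand-in for invoking the skill under test (replace with the real call)."""
    return case_input.upper()  # toy behaviour so the sketch runs end to end


def score(output: str, expected: str) -> float:
    """Toy metric: exact match. Real suites often use rubric or similarity scores."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_suite(skill_version: str, cases: list[Case]) -> float:
    """Run every curated case against one skill version; return the mean score."""
    scores = [score(run_skill(skill_version, c.input), c.expected) for c in cases]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Fixed basket of representative inputs (happy path + known edge cases).
    cases = [
        Case("happy-path", "summarize q3 revenue", "SUMMARIZE Q3 REVENUE"),
        Case("edge-empty", "", ""),
    ]
    # Deploy gate: block the update if the new version scores below threshold.
    THRESHOLD = 0.90
    result = run_suite("skill-v2", cases)
    print(f"suite score: {result:.2f}")
    if result < THRESHOLD:
        raise SystemExit("FAIL: skill update blocked from deployment")
```

Running this in CI on every skill change turns the suite into the gate described above: the pipeline fails, and the update never reaches production agents, whenever the score drops below the threshold or below the currently deployed version.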

See [[concept-quantitative-skill-testing]].
