---
id: "prereq-evals"
type: "prereq"
source_timestamps: ["00:04:33", "00:10:56"]
tags: ["machine-learning", "testing"]
related: ["concept-spec-driven-development", "entity-factory-ai", "quote-spec-becomes-eval"]
reason: "Necessary to understand how AI code is validated and how Spec-Driven Development integrates into AI workflows."
sources: ["s23-amazon-16k-engineers"]
sourceVaultSlug: "s23-amazon-16k-engineers"
originDay: 23
---
# Understanding AI Evals

## What You Need to Know

**Evals** (evaluations) are the modern AI development discipline of running automated benchmarks against AI outputs to validate that the model produces correct results on a defined set of inputs. They are the AI equivalent of a test suite, but specifically designed to score model behavior across many examples.
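In code, this "test suite for model behavior" idea can be sketched as a minimal eval loop (all names here are hypothetical illustrations, with a placeholder standing in for the model under test):

```python
# Minimal eval loop: run a model over a defined set of inputs,
# score each output, and aggregate into a single metric.

def model(prompt: str) -> str:
    # Placeholder for the real system under evaluation.
    return prompt.upper()

# A defined set of inputs with expected results.
cases = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "world", "expected": "WORLD"},
]

def run_eval(cases) -> float:
    # Score each case, then aggregate into a pass rate
    # rather than a single pass/fail verdict.
    scores = [1.0 if model(c["input"]) == c["expected"] else 0.0 for c in cases]
    return sum(scores) / len(scores)

print(run_eval(cases))  # → 1.0 (pass rate across all cases)
```

The key shape is the aggregation step: an eval reports how often the model is right across many examples, not whether one call returned one exact value.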

## Why It's a Prerequisite Here

The speaker's argument that "the spec becomes the eval" (see [[quote-spec-becomes-eval]]) only makes sense if you understand:

- Evals, not just unit tests, are how AI-generated code is validated.
- A well-written specification can be translated into eval criteria the AI is scored against.
- This is what makes [[concept-spec-driven-development]] structurally different from traditional spec writing.
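The second bullet, translating a spec into eval criteria, can be sketched concretely (the clauses and checks below are invented for illustration, not from the source):

```python
# Each spec clause becomes a programmatic check that an AI output
# is scored against, turning the spec itself into the eval.
spec_criteria = [
    ("output is a JSON object", lambda out: out.strip().startswith("{")),
    ("output references the user id", lambda out: "user_id" in out),
]

def score_against_spec(output: str) -> float:
    # Fraction of spec clauses the output satisfies.
    passed = [check(output) for _, check in spec_criteria]
    return sum(passed) / len(passed)

print(score_against_spec('{"user_id": 42}'))  # → 1.0
print(score_against_spec("plain text"))       # → 0.0
```

Under this framing, writing a sharper spec directly produces a sharper eval, which is the structural difference the note points at.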

Similarly, [[entity-factory-ai]]'s "evals layer" strategy is incomprehensible without grasping what evals are — and the speaker's critique in [[claim-pipeline-layers-insufficiency]] depends on that grasp.

## Quick Mental Model

```
Unit test  : single function, deterministic
Eval       : model behavior, statistical, scored across many cases
```
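The contrast above can be sketched side by side (hypothetical names; the judge here is a trivial exact-match scorer standing in for whatever grading function an eval actually uses):

```python
import statistics

# Unit test: one function, one deterministic case, binary pass/fail.
def add(a: int, b: int) -> int:
    return a + b

assert add(2, 3) == 5  # either holds or the suite fails outright

# Eval: model behavior scored across many cases, summarized statistically.
def judge(output: str, reference: str) -> float:
    # Stand-in scorer returning a value in [0, 1].
    return 1.0 if output.strip().lower() == reference.lower() else 0.0

outputs = ["Paris", "paris ", "Lyon"]          # hypothetical model outputs
references = ["Paris", "Paris", "Paris"]       # expected answers
scores = [judge(o, r) for o, r in zip(outputs, references)]

print(statistics.mean(scores))  # ≈ 0.667 — a score, not a binary verdict
```

The eval's result is a statistic over a distribution of cases, which is why a model can "mostly pass" an eval in a way a unit test never can.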

Evals measure whether a system's behavior matches what is *claimed* of it, which is exactly what makes them suitable as the operational definition of a spec.
