---
id: "action-build-eval-infrastructure"
type: "action-item"
source_timestamps: ["00:21:50", "00:22:00"]
tags: ["evaluation", "infrastructure"]
related: ["claim-cannot-automate-unmeasurable", "prereq-evaluation-infrastructure", "concept-metric-gaming", "concept-silent-degradation"]
action: "Invest heavily in programmatic evaluation suites and sandboxes before attempting autonomous agent optimization."
outcome: "Prevention of metric gaming and silent degradation during autonomous optimization."
sources: ["s04-karpathy-agent-700"]
sourceVaultSlug: "s04-karpathy-agent-700"
originDay: 4
---
# Build robust evaluation infrastructure first

## Action
Invest heavily in programmatic evaluation suites and sandboxes before attempting autonomous agent optimization.

## Outcome
Prevention of [[concept-metric-gaming|metric gaming]] and [[concept-silent-degradation|silent degradation]] during autonomous optimization.

## Detail
Shift engineering resources **away from** building agents and **toward** building the **evals** — the test suites, sandboxes, and programmatic scoring functions that accurately reflect business value. An auto-agent is only as good as the metric it is optimizing against.
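
As a concrete illustration of what "programmatic scoring" might look like, here is a minimal hypothetical harness; the `EvalCase` fixtures, the `check` functions, and the agent callable are illustrative assumptions, not anything specified in the source:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One sandboxed task with a machine-checkable scoring function."""
    name: str
    task_input: str
    check: Callable[[str], float]  # maps agent output to a score in [0.0, 1.0]

def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Score an agent on every case, with no human in the loop."""
    return {case.name: case.check(agent(case.task_input)) for case in cases}
```

The point of the sketch is the return type: a dict of per-dimension scores a loop can act on automatically, rather than a report a human has to read.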

## Foundational Logic
Driven by [[claim-cannot-automate-unmeasurable]] and crystallized in [[quote-cannot-automate-score|"You cannot automate what you cannot score."]] Without a reliable, programmatic scoring function, the optimization loop will either thrash aimlessly or aggressively optimize for the wrong proxy metric.
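
To make that failure mode concrete: in a hill-climbing loop like the hypothetical sketch below, every accept/reject decision is delegated entirely to the eval suite, so a bad suite does not slow the loop down, it steers it. `propose_variant` and `eval_suite` are illustrative stand-ins for whatever variant-generation and scoring machinery is in play:

```python
from statistics import mean

def optimize(agent, propose_variant, eval_suite, rounds: int = 10):
    """Hill-climb over agent variants, delegating every accept/reject
    decision to the eval suite. If the suite is noisy, the loop thrashes;
    if it measures the wrong proxy, the loop optimizes that proxy while
    real quality silently degrades."""
    best = agent
    best_score = mean(eval_suite(best).values())
    for _ in range(rounds):
        candidate = propose_variant(best)             # hypothetical variant generator
        score = mean(eval_suite(candidate).values())  # dict of per-case scores
        if score > best_score:
            best, best_score = candidate, score
    return best
```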

## Required Properties of Evals
- **Programmatic** (no manual scoring)
- **Objective** (no reliance on subjective human review, which cannot run at scale)
- **Multi-dimensional** (catches secondary regressions)
- **Un-gameable** (resistant to Goodhart-style exploitation; see the sketch after this list)
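
One common pattern for the last two properties (my assumption about how to operationalize them, not something prescribed in the source) is to gate improvement on a primary metric behind no-regression checks on every secondary dimension, so a candidate that games the target by sacrificing everything else gets rejected:

```python
def passes_gates(scores: dict[str, float],
                 baseline: dict[str, float],
                 primary: str,
                 tolerance: float = 0.02) -> bool:
    """Accept a candidate only if the primary metric improves AND no
    secondary dimension regresses beyond tolerance. A single scalar
    target is easy to game; a vector of gated dimensions is harder."""
    if scores[primary] <= baseline[primary]:
        return False
    return all(
        scores[dim] >= baseline[dim] - tolerance
        for dim in scores
        if dim != primary
    )
```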

## Where It Fits
This is the gating prerequisite — see [[prereq-evaluation-infrastructure]]. Do this *before* building the agent itself.
