---
id: "prereq-evaluation-infrastructure"
type: "prereq"
source_timestamps: ["00:14:52", "00:15:20"]
tags: ["infrastructure", "testing"]
related: ["claim-cannot-automate-unmeasurable", "action-build-eval-infrastructure", "question-evaluating-subjective-domains"]
reason: "An auto-agent requires a fast, objective, programmatic metric to evaluate its experiments; without it, the loop cannot function."
sources: ["s04-karpathy-agent-700"]
sourceVaultSlug: "s04-karpathy-agent-700"
originDay: 4
---
# Programmatic Evaluation Infrastructure

## Prerequisite
Programmatic Evaluation Infrastructure.

## Reason
An auto-agent requires a fast, objective, programmatic metric to evaluate its experiments; without it, the loop cannot function.

## Detail
Before an organization can benefit from auto-optimizing agents, it must possess the ability to **programmatically and objectively score** the outcomes of a business process. If evaluation relies on subjective human review or manual data pulls, the optimization loop cannot run autonomously at scale.

## What "Programmatic" Means
- Runs without human intervention
- Returns a number (or vector of numbers)
- Reproducible across runs
- Scales to hundreds of evaluations per night
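
A minimal sketch of what such a scorer could look like, assuming a Python harness and a JSONL file of cases; `score_case`, `run_eval`, and `eval_cases.jsonl` are illustrative stand-ins, not anything prescribed in the source.

```python
import json
import statistics
from pathlib import Path


def score_case(expected: dict, actual: dict) -> float:
    """Return a deterministic 0-1 score for one case.

    Illustrative check only: fraction of expected fields reproduced
    exactly. A real metric would encode the actual business outcome.
    """
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


def run_eval(cases_path: Path) -> dict:
    """Score every case in a JSONL file with no human in the loop.

    Each line holds {"expected": ..., "actual": ...}. The output is a
    single aggregate number plus case count, so repeated nightly runs
    over the same inputs are reproducible and comparable.
    """
    scores = []
    with cases_path.open() as fh:
        for line in fh:
            case = json.loads(line)
            scores.append(score_case(case["expected"], case["actual"]))
    return {
        "n_cases": len(scores),
        "mean_score": statistics.mean(scores) if scores else 0.0,
        "min_score": min(scores) if scores else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical nightly invocation; an agent can compare mean_score
    # across experiment runs without any manual review step.
    print(run_eval(Path("eval_cases.jsonl")))
```

Returning a small dict of numbers rather than a pass/fail flag is a deliberate choice in this sketch: an optimization loop needs a scalar it can compare across hundreds of runs, not a binary verdict.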

## Foundational Claim
Driven by [[claim-cannot-automate-unmeasurable]] and crystallized in [[quote-cannot-automate-score|"You cannot automate what you cannot score."]]

## Operator Action
[[action-build-eval-infrastructure]] — invest in evals before agents.

## Open Problem
[[question-evaluating-subjective-domains]] — how to score subjective domains like empathy or brand voice. The enrichment overlay points to **LLM-as-Judge (Zheng et al., 2023)** as a promising proxy, with ~85% agreement with humans on subjective evals.
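
A hedged sketch of that LLM-as-Judge proxy for a subjective quality such as empathy; the rubric, prompt, and `call_llm` placeholder are assumptions for illustration, not a fixed recipe from Zheng et al. or the talk.

```python
import re

JUDGE_PROMPT = """You are grading a customer-support reply for empathy.
Rubric: 1 = dismissive, 3 = neutral, 5 = acknowledges feelings and offers concrete help.
Reply to grade:
---
{reply}
---
Respond with only the integer score."""


def judge_empathy(reply: str, call_llm) -> int:
    """Score a subjective quality with an LLM judge.

    `call_llm` is a placeholder for whatever completion client is in
    use; it takes a prompt string and returns the model's text reply.
    Parsing a single digit keeps the output programmatic, even though
    the underlying judgment is subjective.
    """
    raw = call_llm(JUDGE_PROMPT.format(reply=reply))
    match = re.search(r"[1-5]", raw)
    if match is None:
        raise ValueError(f"Judge returned no parsable score: {raw!r}")
    return int(match.group())
```

The point of the sketch is that a judge model turns a subjective criterion into a number the loop can consume, at the cost of the ~15% disagreement with humans noted above.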


## Related across days
- [[concept-scenario-testing]]
- [[concept-private-bench]]
- [[claim-cannot-automate-unmeasurable]]
- [[framework-safety-pillars]]
- [[arc-evaluation-frontier]]
