---
id: "claim-cannot-automate-unmeasurable"
type: "claim"
source_timestamps: ["00:14:52", "00:15:20", "00:21:50", "00:22:00"]
tags: ["evaluation", "deployment-strategy"]
related: ["concept-metric-gaming", "prereq-evaluation-infrastructure", "quote-cannot-automate-score", "question-evaluating-subjective-domains", "action-build-eval-infrastructure"]
confidence: "high"
testable: true
speakers: ["Nate B. Jones"]
sources: ["s04-karpathy-agent-700"]
sourceVaultSlug: "s04-karpathy-agent-700"
originDay: 4
---
# You cannot automate what you cannot score

## Claim
The foundational law of deploying auto-improving agents is that **automation is strictly bounded by measurability**: an agent can only improve along dimensions it can score.

## Statement
> [[quote-cannot-automate-score|"You cannot automate what you cannot score."]]

## Reasoning
If an organization cannot clearly define what "better" looks like in a **programmatic, objectively testable** way, a Meta-Agent (see [[concept-meta-task-agent-split]]) cannot optimize for it.
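The dependency can be made concrete with a minimal sketch (all names here are hypothetical, not from the source): in any auto-improvement loop, the selection step is undefined without a programmatic `score` function, so the loop cannot prefer one candidate over another.

```python
def improvement_loop(generate_variant, score, baseline, iterations=100):
    """Generic auto-improvement loop, strictly bounded by `score`.

    `generate_variant` proposes a candidate; `score` must return a
    comparable number. Without a programmatic `score`, the comparison
    below is undefined and no optimization can happen.
    """
    best, best_score = baseline, score(baseline)
    for _ in range(iterations):
        candidate = generate_variant(best)
        s = score(candidate)  # the measurability bottleneck
        if s > best_score:    # selection requires a comparable score
            best, best_score = candidate, s
    return best, best_score
```

Note that everything about the loop's behavior is downstream of `score`: the generator can be arbitrarily creative, but "better" is whatever `score` says it is.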

## Common Failure Modes
Many businesses rely on:
- **Subjective human reviews** — cannot scale to evaluate hundreds of autonomous experiments overnight
- **Activity metrics** rather than outcome metrics — measure motion, not value

Neither can support an auto-loop. Without a reliable, programmatic scoring function, the optimization loop will either:
- **Thrash aimlessly**, or
- Aggressively optimize for **the wrong proxy metric** (see [[concept-metric-gaming]])
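The proxy-metric failure mode can be shown with a toy example (the metric and responses are illustrative, not from the source): if the loop scores an activity metric such as output length instead of an outcome metric, it reliably selects motion over value.

```python
# Toy illustration of proxy-metric gaming (hypothetical data and metric).
responses = [
    "Refund issued; confirmation sent.",       # actually resolves the case
    "Thank you for contacting us! " * 20,      # pure filler, resolves nothing
]

def activity_proxy(text):
    # Measures motion, not value: longer output looks "better".
    return len(text)

# An auto-loop scoring on this proxy picks the filler response.
best_by_proxy = max(responses, key=activity_proxy)
```

This is the mechanical form of [[concept-metric-gaming]]: the loop is working exactly as designed, but the design scores the wrong thing.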

## Operational Implication
Building [[prereq-evaluation-infrastructure|robust programmatic evaluation infrastructure]] is the **non-negotiable prerequisite** for autonomous improvement. Operators must follow [[action-build-eval-infrastructure]] *before* deploying agents.

## Confidence and Testability
- **Confidence**: high
- **Testable**: yes — directly demonstrable by attempting to run a loop without programmatic metrics.

## Open Question
This claim raises [[question-evaluating-subjective-domains]] — how to score subjective domains like empathy, brand voice, or creative writing, where programmatic metrics are difficult.


## Related across days
- [[concept-scenario-testing]]
- [[concept-private-bench]]
- [[claim-klarna-intent-failure]]
- [[concept-silent-failure]]
- [[arc-evaluation-frontier]]
