---
type: "synthesis"
spans: ["s25", "s26", "s28"]
tags: ["arc", "evaluation", "frameworks"]
id: "arc-evaluation-paradigm-shift"
sources: ["cross-day"]
---
# The Evaluation Paradigm Shift — From Answers to Carry to 10× Litmus

Across S25, S26, and S28, the *question used to evaluate* AI systems and AI-era businesses changes shape. Each video proposes a different evaluation question; read together, they describe a paradigm shift in how the AI ecosystem judges value.

## The three evaluation questions

| Day | The old question | The new question | Anchors |
|---|---|---|---|
| S25 | "Can the human prompt this well?" | "Can the human *manage agent teams* well?" | [[claim-bottleneck-shift]], [[concept-engineering-manager-mindset]] |
| S26 | "Can the model *answer* this?" | "Can the model *carry* this?" | [[concept-can-it-carry]], [[quote-can-it-carry]] |
| S28 | "Is the product good?" | "What do I own that still matters if AI gets 10× better?" | [[framework-strategic-litmus-test]], [[quote-strategic-litmus-test]] |

## The shared structure

The three replacement questions share three properties:

1. **It is *temporal***. Old questions ask about a single moment (this prompt, this answer, this product). New questions ask about durability: across time, across the length of a workflow, or against capability headroom.
2. **It is *systemic***. The new questions all evaluate the system around the unit, not the unit alone — see [[arc-system-greater-than-component]].
3. **It is *adversarial***. New questions are framed against decay, against AI improvement, against future failure. They are stress tests, not snapshots.

## Why all three matter together

- S26's *carry* test is the **model-level** stress test.
- S25's *manage-agents* test is the **builder-level** stress test.
- S28's *10× litmus* is the **business-level** stress test.

An AI strategy that passes only one test is incomplete. A model that can carry is wasted on a builder who can't manage agents, and a builder who manages agents well inside a business that fails the 10× litmus is rowing on a sinking ship.

## Practical use for the agent

When a downstream user asks an evaluative question, *triangulate* the answer through the three tests:

1. *Is the model capable of carrying this workflow?* → S26 / [[framework-private-bench-suite]]
2. *Can the user manage agents at this scale of work?* → S25 / [[framework-2026-builder-practices]]
3. *Will the resulting business survive 10× AI?* → S28 / [[framework-strategic-litmus-test]]

Three-axis evaluation is the operational form of this arc.
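
A minimal sketch of how an agent might encode this triangulation, assuming a simple boolean verdict per axis; the class name, field names, and verdict wording below are illustrative assumptions, not taken from the videos:

```python
from dataclasses import dataclass

@dataclass
class ThreeAxisEvaluation:
    """One evaluation record, one field per stress test."""
    model_can_carry: bool        # S26: can the model carry the full workflow?
    builder_can_manage: bool     # S25: can the user manage agents at this scale?
    business_survives_10x: bool  # S28: does anything owned survive 10x-better AI?

    def verdict(self) -> str:
        passed = [self.model_can_carry, self.builder_can_manage,
                  self.business_survives_10x]
        if all(passed):
            return "pass: durable across all three levels"
        # A single failure undermines the others: carry is wasted without
        # management, and management is wasted on a sinking business.
        failing = [name for name, ok in zip(
            ("carry (S26)", "manage-agents (S25)", "10x litmus (S28)"),
            passed) if not ok]
        return "incomplete: fails " + ", ".join(failing)

# Example: a capable model and a capable builder inside a business with no
# durable ownership still reads as incomplete.
print(ThreeAxisEvaluation(True, True, False).verdict())
# -> incomplete: fails 10x litmus (S28)
```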

See also [[arc-private-judgment-thread]] (the private side of the new evaluations), [[arc-bottleneck-relocation]] (why the questions shift), [[arc-the-meta-stack-org-builder-tool-moat]] (where each test sits in the stack).