---
id: "exhibit-icm-paper-figures"
type: "exhibit"
canonicalName: "ICM Paper — Figures, Tables & Visual Exhibits"
source_url: "https://arxiv.org/html/2603.16021v2"
tags: ["exhibit", "diagram", "figures", "tables", "supporting-source", "visual"]
related: ["entity-icm-paper-arxiv", "concept-icm", "concept-dialogue-structure", "claim-icm-superiority", "question-icm-scaling"]
sources: ["paper"]
sourceVaultSlug: "icm-paper-folder-architecture-2026Jun02"
originDay: 2
---
# ICM Paper — Visual Exhibits (Figures & Tables)

> **Source & provenance.** All exhibits below are extracted from the companion paper [[entity-icm-paper-arxiv]] (*Interpretable Context Methodology: Folder Structure as Agent Architecture*, Van Clief & McDermott, [arXiv:2603.16021v2](https://arxiv.org/html/2603.16021v2)). Figures were rendered from the paper's inline SVG and captured as PNG; tables were transcribed from the paper's HTML. They were **not** in the YouTube primary source — the talk shows none of this structure on screen. This is the single richest layer of detail the companion source adds. Each exhibit below pairs the rendered image with synthesized insight for a downstream agent.

---

## Figure 1 — The Five-Layer Context Hierarchy

![Five-layer context hierarchy](fig1-five-layer-hierarchy.png)

The load-bearing diagram of the whole paper. Two things the prose and the talk both omit, visible only here:

**(a) Each layer carries an explicit token budget** — the hierarchy is a *budget*, not just a taxonomy:

| Layer | File | Token budget | Diagnostic question | Class |
|-------|------|--------------|---------------------|-------|
| **0** | `CLAUDE.md` | ~800 tok | *"Where am I?"* (global identity) | Structural (routing) |
| **1** | `CONTEXT.md` | ~300 tok | *"Where do I go?"* (workspace routing) | Structural (routing) |
| **2** | Stage `CONTEXT.md` | 200–500 tok | *"What do I do?"* (stage contract) | Structural (routing) |
| **3** | Reference material | 500–2k tok | *"What rules apply?"* (the **factory**, stable across runs) | Content |
| **4** | Working artifacts | varies | *"What am I working with?"* (the **product**, per-run) | Content |

**(b) The colour split is the architecture.** Layers 0–2 (blue, *structural / routing*) total only ~1.3–1.6k tokens — they tell the agent **where it is and what role to play**. Layers 3–4 (orange, *content*) carry the actual substance. The factory/product metaphor (L3 = factory/recipe, L4 = product/ingredients) is the paper's mnemonic for *what should change between runs* (only L4) vs. *what stays fixed* (L3). See [[concept-icm-d2]].

> **Agent takeaway:** total structural overhead is ~1.5k tokens. Everything else in a well-scoped stage is task content. That is *why* a stage lands at 2–8k tokens instead of 40k.
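
The ~1.3–1.6k figure follows directly from the per-layer budgets in the table. A minimal sketch of that arithmetic, assuming Layers 0 and 1 are treated as fixed approximate values and Layer 2 as a range (the names `STRUCTURAL_BUDGETS` and `structural_overhead` are illustrative, not from the paper):

```python
# Per-layer budgets from the table above, as (low, high) token ranges.
# Layers 0 and 1 are single approximate figures, so low == high.
STRUCTURAL_BUDGETS = {
    0: (800, 800),  # CLAUDE.md
    1: (300, 300),  # CONTEXT.md
    2: (200, 500),  # stage CONTEXT.md
}


def structural_overhead(budgets=STRUCTURAL_BUDGETS):
    """Return the (low, high) total of the structural layers, in tokens."""
    low = sum(lo for lo, _ in budgets.values())
    high = sum(hi for _, hi in budgets.values())
    return low, high


low, high = structural_overhead()
print(f"structural overhead: {low}-{high} tokens")
```

The midpoint of that range is roughly the ~1.5k quoted in the takeaway.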

---

## Figure 2 — ICM Workspace Folder Structure (layer-annotated)

![Folder structure of an ICM workspace](fig2-folder-structure.png)

The canonical on-disk layout, every node tagged by layer:

```
workspace/
├── CLAUDE.md                 ← Layer 0  (global identity)
├── CONTEXT.md                ← Layer 1  (workspace routing)
├── stages/
│   ├── 01_research/
│   │   ├── CONTEXT.md        ← Layer 2  (stage contract)
│   │   ├── references/       ← Layer 3  (reference, persists)
│   │   └── output/           ← Layer 4  (working, per-run)
│   ├── 02_script/            … same triad (L2 / L3 / L4)
│   └── 03_production/        … same triad (L2 / L3 / L4)
├── _config/                  ← Layer 3  (shared reference)
├── shared/                   ← Layer 3  (shared reference)
└── setup/
    └── questionnaire.md      ← (setup-time only; unannotated)
```

**Synthesized insight (not stated explicitly in prose):**
- Every stage folder is the *same triad* — `CONTEXT.md` (L2) + `references/` (L3) + `output/` (L4). The repeating triad is what makes "add/remove a stage" a filesystem op (see Table 1).
- `_config/` and `shared/` are **top-level Layer 3** — cross-stage reference that escapes the per-stage `references/`. This is how ICM shares stable material (voice, design system, conventions) without duplicating it into every stage.
- `setup/questionnaire.md` is *un-layered* — it runs once at workspace creation and is not part of any run's context. This is the seam where the **non-coder onboarding** happens (the three non-coders in the study filled a questionnaire, not code).
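
Because every stage is the same triad, the layout in Figure 2 can be scaffolded with a short loop. The following is only a sketch under that assumption: `scaffold_workspace` is a hypothetical helper, not tooling described in the paper, and it creates empty files where a real workspace would have authored content.

```python
import tempfile
from pathlib import Path

# Stage names as shown in Figure 2.
STAGES = ["01_research", "02_script", "03_production"]


def scaffold_workspace(root: Path, stages=STAGES) -> Path:
    """Create the empty Figure 2 layout under root and return root."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "CLAUDE.md").touch()   # Layer 0
    (root / "CONTEXT.md").touch()  # Layer 1
    for name in stages:
        stage = root / "stages" / name
        (stage / "references").mkdir(parents=True, exist_ok=True)  # Layer 3
        (stage / "output").mkdir(parents=True, exist_ok=True)      # Layer 4
        (stage / "CONTEXT.md").touch()                             # Layer 2
    for shared in ("_config", "shared"):  # top-level Layer 3
        (root / shared).mkdir(exist_ok=True)
    (root / "setup").mkdir(exist_ok=True)
    (root / "setup" / "questionnaire.md").touch()  # setup-time only
    return root


# Demo in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    ws = scaffold_workspace(Path(tmp) / "workspace")
    for path in sorted(ws.rglob("*")):
        print(path.relative_to(ws))
```

Adding a stage is then one more entry in `STAGES`, which mirrors the "add or delete a folder" row in Table 1.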

---

## Figure 3 — Context Window Composition by Stage (the efficiency claim, visualized)

![Context window composition by stage](fig3-token-composition.png)

Representative token counts from the paper's *script-to-animation* workspace. Stacked by source: **blue** = Layers 0–2 (structural), **orange** = Layer 3 (reference), **tan** = Layer 4 (working), **grey** = unused / irrelevant context.

| Stage | Total tokens | Composition |
|-------|-------------|-------------|
| Research | **~4.9k** | almost entirely useful (structural + reference + working) |
| Script | **~5.5k** | almost entirely useful |
| Production | **~5.6k** | almost entirely useful |
| **Monolithic** | **~42k** | **mostly grey — the irrelevant band dwarfs the useful content** |

**The visual is the argument.** In the three ICM bars there is almost no grey: nearly every token in context is relevant to the current stage. In the monolithic bar, the grey *"unused/irrelevant"* band is larger than the entire useful payload of any single stage. The agent is carrying all three stages' instructions, all reference material, and all prior outputs simultaneously, and more than ~80% of that is irrelevant to whatever it is doing right now. This is the concrete mechanism behind [[claim-icm-superiority]] and Liu et al.'s *"lost in the middle"*: ICM doesn't just use fewer tokens; it also keeps the **relevant** tokens out of the degraded middle band.

> **Caveat (per the paper):** these are *representative* counts from one workspace, not a measured benchmark across many. The shape is illustrative.
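
The "~80%+" estimate can be sanity-checked from the table's totals. This sketch assumes (it is not stated in the paper) that the portion of the monolithic context relevant to a given stage is about the size of that stage's own ICM context; `irrelevant_share` is an illustrative name.

```python
# Representative totals from the table above, in tokens.
STAGE_TOKENS = {"research": 4900, "script": 5500, "production": 5600}
MONOLITHIC_TOKENS = 42000


def irrelevant_share(monolithic: int, relevant: int) -> float:
    """Fraction of the monolithic context not relevant to the current stage."""
    return 1 - relevant / monolithic


for stage, tokens in STAGE_TOKENS.items():
    share = irrelevant_share(MONOLITHIC_TOKENS, tokens)
    print(f"{stage}: {share:.0%} of the monolithic context is irrelevant")
```

Under that assumption every stage comes out above 85%, consistent with the "~80%+" reading of the figure, and subject to the same caveat that the inputs are representative counts.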

---

## Figure 4 — Pipeline Flow with Human Review Gates

![Pipeline flow through three stages with review gates](fig4-pipeline-review-gates.png)

`Stage 1 (Research) → [Review gate / Human] → Stage 2 (Script) → [Review gate / Human] → Stage 3 (Production)`. Each stage receives its own context (Layers 0–4) and writes to its `output/` folder. A **human review gate** (red diamond) sits on every stage boundary, where the output can be edited before the next stage reads it.

**The single most important sentence in the figure:** *"The same model executes every stage; the folder structure controls what context it receives."* This is the thesis in one line — there is **no second agent, no router model, no orchestration code**. The *only* thing that differs between stages is which files the one agent reads. The "multi-agent" behaviour is an illusion produced entirely by folder scoping + human gates. Connects to [[concept-dialogue-structure]] (each stage contract is a persisted decision tree) and the talk's [[contrarian-frameworks]] stance.
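
One way to picture "the folder structure controls what context it receives" is a function that lists the files a stage would read. This is a hedged sketch, not the paper's implementation: `context_files` is hypothetical, and it assumes a stage's Layer 4 input is the (human-reviewed) `output/` folder of the previous stage.

```python
from pathlib import Path
from typing import Optional


def _files_in(folder: Path) -> list[Path]:
    """All files under folder, sorted; empty if the folder is missing."""
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.rglob("*") if p.is_file())


def context_files(workspace: Path, stage: str,
                  previous: Optional[str] = None) -> list[Path]:
    """List the files one stage reads, ordered Layer 0 through Layer 4."""
    stage_dir = workspace / "stages" / stage
    files = [
        workspace / "CLAUDE.md",   # Layer 0
        workspace / "CONTEXT.md",  # Layer 1
        stage_dir / "CONTEXT.md",  # Layer 2
    ]
    # Layer 3: per-stage references plus the shared top-level folders.
    for ref_dir in (stage_dir / "references",
                    workspace / "_config",
                    workspace / "shared"):
        files += _files_in(ref_dir)
    # Layer 4 (assumption): the previous stage's reviewed output.
    if previous is not None:
        files += _files_in(workspace / "stages" / previous / "output")
    return [p for p in files if p.is_file()]
```

Calling `context_files(ws, "02_script", previous="01_research")` never touches `03_production/`, which is the whole point: the scoping is done by paths, not by a second agent or a router.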

---

## Figure 5 — U-Shaped Human Intervention (N=33 practitioners)

![U-shaped frequency of human edits per stage](fig5-ushaped-edits.png)

Y-axis is ordinal (Never → Rarely → Sometimes → Often → Almost always). Self-reported edit frequency at each stage boundary:

| Stage boundary | Edit frequency | Ordinal band | Character of the edit |
|----------------|----------------|--------------|------------------------|
| **Stage 1 output (Research)** | **92%** | Almost always | **Creative judgment** — direction-setting |
| **Stage 2 output (Script)** | **30%** | Rarely | constrained execution, little to fix |
| **Stage 3 output (Production)** | **78%** | Often | **closer to debugging** — aligning output with earlier decisions |

The **U-shape** is the headline empirical pattern: humans intervene heavily where they set direction (stage 1) and where they reconcile final output against intent (stage 3), but largely leave the constrained middle alone. The paper is careful: *"Values are approximate and based on practitioner self-report through conversation, not instrumented measurement"* (N=33, invite-only community). Use as a **directional** finding, not a metric. Extends [[question-icm-scaling]]'s methodology caveats.

---

## Table 1 — Control-Surface Comparison: Framework vs. ICM

The paper's most honest exhibit: the first six rows favour ICM, and the **last four rows favour frameworks** (the "what ICM gives up" section). Reproduced verbatim:

| Dimension | Framework approach | ICM approach |
|-----------|--------------------|--------------|
| Change stage order | Edit orchestration code, redeploy | Rename or reorder folders |
| Modify a prompt | Edit agent configuration in code | Edit a markdown file |
| Add or remove a stage | Write new agent class, update orchestrator | Add or delete a folder |
| Inspect intermediate state | Add logging, build dashboard | Open the folder, read the files |
| Hand off to another person | Document environment, dependencies, setup | Copy the folder |
| Who can make changes | Developer | Anyone with a text editor |
| **Error recovery mid-pipeline** | Built-in retry, fallback, exception handling | Manual re-run of failed stage |
| **Conditional branching** | Programmatic routing based on agent output | Human decides between stages |
| **Concurrent execution** | Native parallel agent coordination | Sequential by design |
| **External service integration** | Programmatic API calls, auth management | Local scripts or MCP connections |

**Synthesis:** rows 1–6 are ICM's pitch (everything is a filesystem/text-editor op, *anyone* can change it, handoff = copy the folder). Rows 7–10 are the **concession lines** — frameworks win on automated error recovery, programmatic branching, true concurrency, and managed integrations. This table is the precise boundary of where the talk's *"multi-agent harnesses are absurdities"* over-reaches: ICM trades those four capabilities away **on purpose**, in exchange for interpretability and zero orchestration code. It is the right trade *only* for sequential, human-reviewed workflows. Directly grounds the counter-perspective in [[claim-icm-superiority]] and [[contrarian-frameworks]].
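
The "rename or reorder folders" row can be made concrete. The sketch below assumes stage order is read from the zero-padded numeric prefixes shown in Figure 2 (`01_`, `02_`, `03_`); the paper's table does not specify the mechanism, and `stage_order` is a hypothetical helper.

```python
from pathlib import Path


def stage_order(workspace: Path) -> list[str]:
    """Return stage folder names in execution order.

    Assumes zero-padded numeric prefixes, so a plain lexicographic
    sort of the folder names matches the intended sequence.
    """
    stages_dir = workspace / "stages"
    return sorted(p.name for p in stages_dir.iterdir() if p.is_dir())
```

Under this assumption, renaming a folder's prefix is the only step needed to move a stage, with no orchestration code to edit or redeploy.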

---

## Table 2 — Layer 3 (Reference) vs. Layer 4 (Working)

| | Layer 3: Reference | Layer 4: Working |
|---|---|---|
| Changes between runs | **No** | **Yes** |
| Example files | `voice.md`, `design-system.md`, `conventions.md` | `research-output.md`, `script-draft.md` |
| Model should | **Internalize as constraints** | **Process as input** |
| Configured during | Workspace setup (once) | Pipeline execution (each run) |
| Folder location | `references/`, `_config/`, `shared/` | `output/` |
| Analogy | **The recipe** | **The ingredients** |

**Why this matters for an agent consuming this vault:** the L3/L4 distinction tells the agent *how to treat each file it reads*. L3 content (`voice.md`, conventions) is a **constraint to obey**; L4 content (prior `output/`) is **material to transform**. Misclassifying the two is the core failure mode the layering prevents — e.g., treating a style guide as editable working text, or treating a prior draft as an immutable rule. The recipe/ingredients metaphor is the paper's compression of this rule.
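
Since Table 2 ties each layer to a folder location, the treatment of a file can be inferred from its path. A minimal sketch, assuming that folder names alone decide the class; `treatment` and the example paths are hypothetical, and the returned strings echo the "Model should" row.

```python
from pathlib import PurePosixPath

REFERENCE_DIRS = {"references", "_config", "shared"}  # Layer 3 per Table 2
WORKING_DIRS = {"output"}                             # Layer 4 per Table 2


def treatment(path: str) -> str:
    """Map a workspace-relative path to how the model should treat it."""
    folders = set(PurePosixPath(path).parts[:-1])  # drop the file name
    if folders & WORKING_DIRS:
        return "process as input"
    if folders & REFERENCE_DIRS:
        return "internalize as constraints"
    return "unclassified"  # e.g. the structural Layer 0-2 files


print(treatment("_config/voice.md"))
print(treatment("stages/01_research/output/research-output.md"))
```

Files outside both sets, such as `CLAUDE.md`, fall through to `"unclassified"` here because Table 2 only covers Layers 3 and 4.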

---

## Cross-References

- Paper entity: [[entity-icm-paper-arxiv]]
- Core concept: [[concept-icm-d2]] · Stage contracts as persisted dialogue: [[concept-dialogue-structure]]
- Efficiency claim these figures ground: [[claim-icm-superiority]]
- Limitations these figures inherit: [[question-icm-scaling]]
- Authors: [[entity-jake-van-clief]] · [[entity-david-mcdermott]]


## Related across days
- [[entity-icm-paper]]
- [[entity-icm-paper-arxiv]]
- [[concept-five-layer-hierarchy]]
- [[framework-icm-architecture]]