---
id: "framework-hex-eval"
type: "framework"
source_timestamps: ["00:00:00"]
tags: ["evaluations", "benchmarking", "stress-testing"]
related: ["claim-hallucinates-audit"]
sources: ["s12-opus-47"]
sourceVaultSlug: "s12-opus-47"
originDay: 12
---
# Hex's Zero-Guidance Eval Method

## Purpose

A rigorous, single-shot evaluation methodology for testing frontier models on **complex, real-world data tasks** without intermediate scaffolding or hints.

## The Five Steps

### Step 1 — Data Preparation
Assemble a massive dataset of **hundreds of messy, real-world files** in diverse formats:
- CSV, JSON, PDF, VCF.
- Include planted errors and duplicate records.
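The planted-error step can be sketched as a small seeding utility. This is an illustrative assumption, not Hex's actual tooling: `plant_errors` injects duplicates and corrupted fields into already-parsed records and keeps an answer key so Step 4 has ground truth to verify against.

```python
import random

def plant_errors(records, dup_rate=0.05, error_rate=0.03, seed=42):
    """Return a copy of `records` with planted duplicates and corrupted
    fields, plus an answer key for later audit verification."""
    rng = random.Random(seed)  # fixed seed: the planted errors are reproducible
    out, answer_key = [], {"duplicates": [], "corrupted": []}
    for i, rec in enumerate(records):
        out.append(dict(rec))
        if rng.random() < dup_rate:      # plant an exact duplicate record
            out.append(dict(rec))
            answer_key["duplicates"].append(i)
    for j, rec in enumerate(out):
        if rec and rng.random() < error_rate:  # corrupt one random field
            field = rng.choice(sorted(rec))
            rec[field] = "???"
            answer_key["corrupted"].append((j, field))
    return out, answer_key
```

Keeping the answer key separate from the dataset matters: the model under test never sees it, but the auditor in Step 4 scores against it.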

### Step 2 — Single-Shot Prompting
Provide the model with **a single complex prompt** requiring it to:
- Inventory files.
- Design a schema.
- Extract data.
- Resolve conflicts.
- Build a UI.
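One way to assemble that single prompt is to enumerate the dataset and fold all five requirements into one message. The wording below is a hypothetical template, not the actual prompt used:

```python
from pathlib import Path

# Hypothetical single-shot task template covering all five requirements.
TASK = """You are given {n} files ({formats}).
In a single pass, without asking questions:
1. Inventory every file (name, format, record count).
2. Design a unified schema covering all records.
3. Extract the data into that schema.
4. Resolve conflicts and duplicates, logging each decision.
5. Build a minimal UI to browse the result.
Report an audit trail listing every file you processed."""

def build_prompt(data_dir):
    """Fill the template from the files actually present on disk."""
    files = [f for f in sorted(Path(data_dir).glob("**/*")) if f.is_file()]
    formats = sorted({f.suffix.lstrip(".") for f in files if f.suffix})
    return TASK.format(n=len(files), formats=", ".join(formats))
```

Deriving the file count and format list from disk, rather than hard-coding them, keeps the prompt honest when the dataset changes between runs.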

### Step 3 — Zero Iteration
Do **not** provide any:
- Intermediate guidance.
- Error correction.
- Multi-turn prompting.

The model must execute the entire pipeline autonomously.
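The zero-iteration constraint is easy to violate by accident (a retry loop, a clarifying follow-up). A minimal way to enforce it mechanically, assuming the model is exposed as any callable `prompt -> response`:

```python
class SingleShot:
    """Wrap a model callable and enforce the zero-iteration rule:
    any second call (retry, follow-up, correction) raises."""

    def __init__(self, model_call):
        self._call = model_call
        self._used = False

    def __call__(self, prompt):
        if self._used:
            raise RuntimeError("zero-iteration eval: only one turn allowed")
        self._used = True
        # Whatever comes back is final; no error correction happens here.
        return self._call(prompt)
```

Wrapping the client this way turns an easy-to-break convention into a hard invariant of the harness.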

### Step 4 — Audit Verification
**Manually verify** the model's self-reported audit trail against the data it actually processed, in order to detect [[concept-trust-failure-hallucination|hallucinated successes]] or missed files.

This is the step that surfaced [[claim-hallucinates-audit|the TSV-file fabrication]] in [[entity-claude-opus-4-7-d12|Opus 4.7]].
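The core of the audit check is a set difference between the model's claimed file list and ground truth. A minimal sketch (function name is illustrative):

```python
def verify_audit(reported_files, actual_files):
    """Diff the model's self-reported audit trail against ground truth.

    hallucinated: files the model claims to have processed that never existed
    missed:       files that existed but never appear in the audit trail
    """
    reported, actual = set(reported_files), set(actual_files)
    return {
        "hallucinated": sorted(reported - actual),
        "missed": sorted(actual - reported),
        "verified": sorted(reported & actual),
    }
```

A non-empty `hallucinated` list is exactly the kind of fabricated-success signal this step is designed to surface.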

### Step 5 — Peer Review
Have **competing models** (e.g., [[entity-chatgpt-5-4|GPT-5.4]]) review the output using a strict rubric to identify errors the executing model missed.

Note: account for [[concept-model-self-review-bias]] when interpreting peer-review grades.
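The rubric-based grading, including the self-review flag, can be sketched as follows. The rubric items and weights are made-up examples, not the actual rubric:

```python
# Hypothetical weighted rubric: each check maps to a point value.
RUBRIC = {
    "inventory_complete": 2,
    "schema_covers_all_fields": 2,
    "conflicts_resolved_correctly": 3,
    "audit_trail_truthful": 3,
}

def grade(checks, rubric=RUBRIC, reviewer=None, executor=None):
    """Score pass/fail checks against a weighted rubric.

    Flags self-review (reviewer == executor) so that self-review bias
    can be accounted for when interpreting the grade.
    """
    score = sum(rubric[k] for k, passed in checks.items() if passed)
    return {
        "score": score,
        "max": sum(rubric.values()),
        "self_review": reviewer is not None and reviewer == executor,
    }
```

Surfacing `self_review` in the result, rather than silently trusting the grade, is one way to operationalize the bias caveat above.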

## Why This Methodology Matters

It directly counters the failure mode highlighted by [[contrarian-benchmarks-vs-business]]: standardized benchmarks are gameable; messy real-world tasks expose true reliability.

## Operator Application

If you are evaluating a model for production agentic deployment, run this method against your own workload before trusting any leaderboard.

## Cross-References

- Claim: [[claim-hallucinates-audit]]
- Concept: [[concept-trust-failure-hallucination]], [[concept-model-self-review-bias]]
- Action: [[action-build-deterministic-evals]]
- Contrarian: [[contrarian-benchmarks-vs-business]]


## Related across days
- [[framework-private-bench-suite]]
- [[concept-scenario-testing]]
- [[concept-trust-failure-hallucination]]
- [[framework-agent-evaluation]]
