---
id: "framework-private-bench-suite"
type: "framework"
source_timestamps: ["00:07:32", "00:08:01"]
tags: ["benchmarking", "testing-methodology"]
related: ["concept-private-bench", "claim-gpt-5-5-superiority", "claim-public-benchmarks-flatten"]
steps: ["Dingo (Executive Judgment & Production Discipline)", "Splash Brothers (Backend Correctness & Data Hygiene)", "Artemis (Research, Interactivity & Visual Taste)"]
sources: ["s26-gpt55-claude-gemini"]
sourceVaultSlug: "s26-gpt55-claude-gemini"
originDay: 26
---
# Private Bench Evaluation Suite

## Purpose
A three-part testing framework designed to evaluate frontier models on messy, real-world tasks where public benchmarks fail. See [[concept-private-bench]] for motivation and [[contrarian-public-benchmarks]] for the broader argument.

## The Three Tests

### 1. Dingo — Executive Judgment + Production Discipline
Generate a **23-deliverable launch packet** for an absurd fictional startup. Tests whether the model can:
- Manage **legal and ethical risk** without smoothing over dangerous parts.
- **Separate real buyers from curiosity traffic** (executive judgment).
- Carry **all 23 artifacts** to completion without hallucinating file extensions or losing thread context.

Result cited: **GPT-5.5 scored 87.3 vs. Opus 67.0**, a 20.3-point gap (see [[claim-gpt-5-5-superiority]]).
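
The "carry 23 artifacts" criterion can be checked mechanically. The following is a minimal sketch, not the suite's actual grader: the function name `check_manifest`, the allowed-extension set, and the idea of grading from a flat list of filenames are all assumptions made for illustration. Only the count of 23 comes from the test description.

```python
from pathlib import PurePosixPath

EXPECTED_COUNT = 23  # from the test description
# Assumption: the real suite's accepted file types are not documented here.
ALLOWED_EXTENSIONS = {".md", ".csv", ".json", ".pdf"}


def check_manifest(filenames, expected_count=EXPECTED_COUNT, allowed=ALLOWED_EXTENSIONS):
    """Return a list of problems; an empty list means the manifest passes."""
    problems = []
    if len(filenames) != expected_count:
        problems.append(f"expected {expected_count} deliverables, got {len(filenames)}")
    seen = set()
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix not in allowed:
            # Catches invented or misspelled extensions such as ".docxx".
            problems.append(f"unexpected extension: {name}")
        if name in seen:
            problems.append(f"duplicate deliverable: {name}")
        seen.add(name)
    return problems
```

A check like this covers only the production-discipline half of the test. Whether the packet handles legal risk honestly or identifies real buyers still needs human or rubric-based review.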

### 2. Splash Brothers — Backend Correctness + Data Hygiene
Migrate **465 messy, corrupted files** (CSVs, PDFs, JSONs) into a clean database. Tests whether the model can:
- **Catch planted traps** (fake records, test accounts, fake payments).
- **Normalize schemas** across heterogeneous source formats.
- Preserve service codes, enum values, and source provenance.

Result cited: GPT-5.5 caught the planted traps ([[claim-gpt-5-5-caught-traps]]) but still failed boring backend hygiene ([[concept-production-trust]], [[question-backend-hygiene]]). Operational pipeline detailed in [[framework-data-migration-pipeline]].
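
To make the three criteria above concrete, here is a hedged sketch of a triage pass that normalizes keys, keeps values and provenance intact, and quarantines suspected traps. The field names (`email`, `customer_name`, `amount`), the marker list, and the `_source_file` key are hypothetical. The actual trap design of the suite is not described in the source.

```python
# Assumption: substrings that hint at planted test data. Substring matching
# is deliberately crude and will produce false positives on real names.
TEST_MARKERS = ("test", "demo", "example")


def normalize_record(raw, source_file):
    """Normalize keys to snake_case, leave values untouched, attach provenance."""
    record = {str(k).strip().lower().replace(" ", "_"): v for k, v in raw.items()}
    record["_source_file"] = source_file  # preserve where the row came from
    return record


def looks_like_trap(record):
    """Heuristic flag for fake records, test accounts, and fake payments."""
    email = str(record.get("email", "")).lower()
    name = str(record.get("customer_name", "")).lower()
    if any(marker in email or marker in name for marker in TEST_MARKERS):
        return True
    amount = record.get("amount")
    # Assumption: a non-positive payment amount is suspicious.
    return isinstance(amount, (int, float)) and amount <= 0


def triage(rows):
    """Split (raw_record, source_file) pairs into clean and quarantined lists."""
    clean, quarantined = [], []
    for raw, source_file in rows:
        record = normalize_record(raw, source_file)
        (quarantined if looks_like_trap(record) else clean).append(record)
    return clean, quarantined
```

Quarantining rather than deleting suspected traps keeps the decision reviewable, which matters because the heuristics here are guesses rather than ground truth.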

### 3. Artemis — Research + Interactivity + Visual Taste
Build an **interactive 3D visualization of a NASA lunar flyby** from scratch, **without provided facts**. Tests:
- Independent research and citation.
- Interactive build (3D, scrubbing, hover states).
- The tradeoff between information density and visual composition (see [[concept-visual-taste-vs-density]]).
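
The "scrubbing" requirement boils down to mapping a timeline position to a spacecraft position. A minimal sketch, assuming the trajectory is stored as time-sorted keyframes and that linear interpolation is acceptable: the function name `position_at` and the keyframe layout are illustrative, and any coordinates passed in are placeholders rather than real mission data.

```python
from bisect import bisect_right


def position_at(keyframes, t):
    """Linearly interpolate an (x, y, z) position at time t.

    keyframes: list of (time, (x, y, z)) tuples sorted by strictly
    increasing time. Times outside the range clamp to the endpoints.
    """
    times = [time for time, _ in keyframes]
    if t <= times[0]:
        return keyframes[0][1]
    if t >= times[-1]:
        return keyframes[-1][1]
    i = bisect_right(times, t)  # index of the first keyframe after t
    (t0, p0), (t1, p1) = keyframes[i - 1], keyframes[i]
    w = (t - t0) / (t1 - t0)  # fraction of the way from p0 to p1
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))
```

A real build would likely run in the browser and use a smoother interpolation over sourced ephemeris data. The point here is only the shape of the scrubber logic.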

## Steps Summary
1. Dingo (Executive Judgment & Production Discipline)
2. Splash Brothers (Backend Correctness & Data Hygiene)
3. Artemis (Research, Interactivity & Visual Taste)

## Important Limitation
The suite is **proprietary and has not been replicated externally**. Per BetterBench critiques, private suites need independent construct validation before their results carry weight outside the author's context.


## Related across days
- [[concept-scenario-testing]]
- [[concept-private-bench]]
- [[contrarian-public-benchmarks]]
- [[arc-evaluation-frontier]]
