---
id: "entity-terminalbench"
type: "entity"
entityType: "product"
canonicalName: "TerminalBench"
aliases: ["Terminal Bench", "Terminal-bench"]
source_timestamps: ["00:01:26"]
tags: ["benchmark", "software-engineering"]
related: ["claim-public-benchmarks-flatten", "contrarian-public-benchmarks"]
sources: ["s26-gpt55-claude-gemini"]
sourceVaultSlug: "s26-gpt55-claude-gemini"
originDay: 26
---
# TerminalBench

## Profile
A public benchmark for software-engineering tasks. OpenAI reportedly cited [[entity-gpt-5-5|GPT-5.5]] scoring **82%** on TerminalBench in its release materials.

## Role in the Vault
The speaker uses TerminalBench as the **canonical example** of a public benchmark that **flattens differences** between frontier models (see [[claim-public-benchmarks-flatten]] and [[contrarian-public-benchmarks]]). Even the 82% score is presented as uninformative: frontier models all cluster in a narrow band on tasks like these, so the headline number carries little comparative signal.

## Canonical Reference
No exact match. The closest related public benchmarks are Terminal-Bench (github.com/terminal-bench) and SWE-bench (swe-bench.com), both of which evaluate software-engineering tasks. The video's "TerminalBench" label may correspond to one of these or to an OpenAI-internal variant.
