---
id: "framework-multi-llm-evaluation"
type: "framework"
source_timestamps: ["00:06:12", "00:06:45"]
tags: ["quality-assurance", "workflow"]
related: ["concept-multi-llm-refinement"]
steps: ["Generate the initial skill file with Claude", "Download the file locally", "\"Open a new chat in a different LLM (e.g.", "ChatGPT)\"", "Upload the file and ask the second LLM to crack it open and critique it", "Capture the critique and recommendations", "\"Return to Claude", "paste the critique", "and ask Claude to improve the skill\""]
sources: ["s40-super-prompts"]
sourceVaultSlug: "s40-super-prompts"
originDay: 40
---
# Multi-LLM Skill Refinement Loop

## Purpose

Improve the quality of an AI-generated skill by leveraging the critique of a *competing* model. The conceptual framing is captured in [[concept-multi-llm-refinement]].

## Steps

1. **Generate** the initial skill file (`.zip` or `.md`) using [[entity-claude-d40]] — typically the output of [[framework-skill-creation]].
2. **Download** the skill file to your local machine.
3. **Open a new chat in a different LLM** — usually [[entity-chatgpt-d40]], but [[entity-gemini-d40]] works too.
4. **Upload** the skill file and prompt the second LLM:
   > *"Crack open this file, assess whether it is high quality, and make specific recommendations for how to improve it."*
5. **Capture** the critique and recommendations.
6. **Return to Claude.** Paste the critique and instruct Claude to revise the skill accordingly.

Repeat as needed until the skill stabilizes; a scripted version of the loop is sketched below.
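
For repeated rounds, the manual loop can be scripted. The following is a minimal sketch using the official Anthropic and OpenAI Python SDKs; the model ids, prompts, and round count are illustrative assumptions, not details from the source.

```python
# Hypothetical automation of the refinement loop. Requires
# ANTHROPIC_API_KEY and OPENAI_API_KEY in the environment.
import anthropic
import openai

claude = anthropic.Anthropic()
chatgpt = openai.OpenAI()


def claude_generate(prompt: str) -> str:
    """Step 1 (and step 6): generate or revise the skill file with Claude."""
    resp = claude.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text


def chatgpt_critique(skill_markdown: str) -> str:
    """Steps 3-5: ask the competing model for a critique."""
    resp = chatgpt.chat.completions.create(
        model="gpt-4o",  # assumed model id
        messages=[{
            "role": "user",
            "content": (
                "Crack open this file, assess whether it is high quality, "
                "and make specific recommendations for how to improve it.\n\n"
                + skill_markdown
            ),
        }],
    )
    return resp.choices[0].message.content


skill = claude_generate("Create a skill file for X.")  # placeholder task
for _ in range(2):  # fixed round count stands in for "until it stabilizes"
    critique = chatgpt_critique(skill)
    # Step 6: feed the critique back to Claude for revision.
    skill = claude_generate(
        "Here is a skill file:\n\n" + skill
        + "\n\nA reviewer made these recommendations:\n\n" + critique
        + "\n\nRevise the skill accordingly and return only the revised file."
    )
```

In practice the manual version above keeps a human in the loop to filter the critique before pasting it back; the script only shows the data flow between the two models.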

## Why It Works

Different models have different reasoning fingerprints: ChatGPT will catch ambiguities that Claude wrote past, and vice versa. This is essentially the LLM-as-a-Judge pattern from recent evaluation research, applied to skill artifacts rather than to model outputs.
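
One way to make "until the skill stabilizes" concrete, assuming you are willing to trust the judge's own scoring: ask the second model for a structured verdict and stop once the score clears a threshold. The rubric fields and threshold below are illustrative assumptions, not from the source.

```python
# Minimal LLM-as-a-Judge variant: a structured verdict gives the
# refinement loop a stopping criterion.
import json

JUDGE_PROMPT = """Review the skill file below. Respond with JSON only:
{{"score": <1-10>, "recommendations": ["..."]}}

{skill}"""


def is_stable(judge_reply: str, threshold: int = 9) -> bool:
    """Stop iterating once the judge scores the skill at or above threshold."""
    verdict = json.loads(judge_reply)
    return verdict["score"] >= threshold


# Usage: send JUDGE_PROMPT.format(skill=skill_markdown) to the second
# LLM, then call is_stable() on its reply to decide whether to loop again.
```

Note that `json.loads` will raise if the judge wraps its reply in prose, so real use would need a more tolerant parser; the sketch only shows the stopping rule.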

## Prerequisite

This loop is only possible because of [[claim-skills-are-platform-agnostic]] — the Markdown format makes the file readable across ecosystems.

## Action

The operational version of this framework is [[action-multi-llm-critique]].
