---
id: "question-evaluating-subjective-domains"
type: "open-question"
source_timestamps: ["00:15:03", "00:15:15"]
tags: ["evaluation", "metrics"]
related: ["claim-cannot-automate-unmeasurable", "prereq-evaluation-infrastructure"]
resolutionPath: "Advances in LLM-as-a-Judge techniques that can reliably and consistently evaluate subjective criteria at scale without human intervention."
sources: ["s04-karpathy-agent-700"]
sourceVaultSlug: "s04-karpathy-agent-700"
originDay: 4
---
# How do we build reliable evals for subjective business processes?

## Question
How do we build reliable evals for highly subjective business processes?

## Detail
While it is easy to programmatically score:
- Code execution (pass/fail tests)
- Latency (numeric)
- Resolution time (numeric)

...it is far harder to build **un-gameable programmatic metrics** for subjective domains:
- Customer empathy
- Brand voice
- Creative writing
- Tone calibration

Until these can be reliably scored, autonomous agents cannot safely optimize for them (see [[claim-cannot-automate-unmeasurable]]).
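
A minimal sketch of this asymmetry, with hypothetical scorer names: the objective metrics above reduce to a few lines of deterministic code, while the subjective ones have no obvious programmatic definition at all.

```python
import subprocess
import time

def score_code_execution(test_command: list[str]) -> bool:
    """Pass/fail: did the test suite exit cleanly?"""
    result = subprocess.run(test_command, capture_output=True)
    return result.returncode == 0

def score_latency(fn, *args) -> float:
    """Numeric: wall-clock seconds for a single call."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def score_brand_voice(text: str) -> float:
    """Subjective: any hand-rolled heuristic (keyword lists, readability
    scores) is trivially gameable by an optimizer, so there is no reliable
    deterministic implementation to write here."""
    raise NotImplementedError("no un-gameable programmatic metric exists")
```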

## Resolution Path
Advances in **LLM-as-a-Judge** techniques that can reliably and consistently evaluate subjective criteria at scale without human intervention.
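
One possible shape of such a judge, sketched below. The rubric wording, the 1-5 scale, and `complete()` are illustrative assumptions; `complete()` stands in for whatever chat-completion API is actually available.

```python
import re

# Hypothetical rubric; a real deployment would calibrate it against human raters.
JUDGE_PROMPT = """You are evaluating a customer-support reply for empathy.
Rubric: 1 = dismissive, 3 = polite but generic, 5 = acknowledges the
customer's specific frustration and offers a concrete next step.

Reply to evaluate:
{reply}

Respond with only an integer score from 1 to 5."""

def complete(prompt: str) -> str:
    """Stand-in so the sketch runs; swap in a real model call."""
    return "4"

def judge_empathy(reply: str, n_samples: int = 3) -> float:
    """Sample the judge several times and average, since single judgments
    of subjective criteria are noisy."""
    scores = []
    for _ in range(n_samples):
        raw = complete(JUDGE_PROMPT.format(reply=reply))
        match = re.search(r"[1-5]", raw)
        if match:
            scores.append(int(match.group()))
    return sum(scores) / len(scores) if scores else float("nan")
```

The open question is whether such judges stay reliable and consistent under the adversarial pressure of an agent optimizing against them.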

## Status
Partially addressed by:
- **LLM-as-Judge (Zheng et al., 2023)** — ~85% agreement with humans on subjective evals.
- **AgentEval Benchmarks (Zhong et al., 2024)** — standardized multi-dimensional metrics with flags for metric gaming.

Still unresolved at production scale for high-stakes subjective domains.
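
For context on how a figure like the ~85% above is typically computed: judge and human each pick a winner on a set of pairwise comparisons, and agreement is the fraction of matches. A sketch over made-up labels, not the paper's code:

```python
def pairwise_agreement(judge_prefs: list[str], human_prefs: list[str]) -> float:
    """Fraction of comparisons where the judge and the human pick the same winner."""
    assert len(judge_prefs) == len(human_prefs)
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)

# "A"/"B" labels for which of two candidate replies was preferred.
judge = ["A", "A", "B", "B", "A"]
human = ["A", "B", "B", "B", "A"]
print(f"agreement: {pairwise_agreement(judge, human):.0%}")  # -> 80%
```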
