---
id: "framework-data-migration-pipeline"
type: "framework"
source_timestamps: ["00:12:40", "00:12:47"]
tags: ["data-engineering", "workflow"]
related: ["concept-production-trust", "claim-gpt-5-5-caught-traps", "action-implement-human-validation", "prereq-database-normalization"]
steps: ["Inventory sources", "Normalize schemas", "Merge duplicates and reject fakes", "Reconcile conflicts and preserve provenance", "Build Audit UI for human review"]
sources: ["s26-gpt55-claude-gemini"]
sourceVaultSlug: "s26-gpt55-claude-gemini"
originDay: 26
---
# AI Data Migration Pipeline

## Purpose
The workflow an AI model must follow to migrate a messy, unstructured set of business files into a clean, canonical database. Operationalizes the lessons from the **Splash Brothers** test in [[framework-private-bench-suite]].

## The Five Steps

### 1. Inventory
Catalog all incoming sources (CSVs, JSONs, handwritten PDFs, scanned receipts). Establish a complete map of what exists before any transformation.
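A minimal inventory pass can be sketched as a directory walk that buckets files by extension before anything is parsed. The category map and directory layout here are assumptions for illustration, not part of the source:

```python
from collections import defaultdict
from pathlib import Path

# Coarse source categories keyed by extension (illustrative, not exhaustive).
CATEGORIES = {
    ".csv": "tabular",
    ".json": "tabular",
    ".pdf": "document",
    ".jpg": "scan",
    ".png": "scan",
}

def inventory(root: str) -> dict[str, list[Path]]:
    """Catalog every file under `root` before any transformation."""
    catalog: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            catalog[CATEGORIES.get(path.suffix.lower(), "unknown")].append(path)
    return dict(catalog)
```

The point of the `"unknown"` bucket is that nothing is silently dropped: the map of what exists must be complete before step 2 begins.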

### 2. Normalize
Parse multiple schemas and standardize formats:
- Date formats.
- Capitalization.
- Phone/address formats.
- Currency representations.

This is the step where GPT-5.5 still struggles with **enum normalization and service code preservation** (see [[concept-production-trust]]).
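A sketch of two such normalizers, assuming a fixed set of input date formats and US-style phone numbers (both assumptions; real inputs will need a broader table):

```python
import re
from datetime import datetime

# Input date formats we expect to encounter (assumed for this sketch).
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d %b %Y")

def normalize_date(raw: str) -> str:
    """Coerce any recognized date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def normalize_phone(raw: str) -> str:
    """Strip punctuation; keep the last 10 digits of US-style numbers."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else digits
```

Raising on an unrecognized date, rather than guessing, is deliberate: unparseable values should surface in the Audit UI, not be silently coerced.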

### 3. Merge
Identify and merge duplicate records while **rejecting fake/test data**:
- Detect 'Mickey Mouse'-style fake customers.
- Reject ASDF test accounts.
- Flag implausible payments (e.g., the fake $25,000 in [[claim-gpt-5-5-caught-traps]]).
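The rejection rules above can be sketched as a simple predicate. The blocklist, keyboard-run patterns, and payment ceiling are placeholder values, not thresholds from the source:

```python
FAKE_NAMES = {"mickey mouse", "test test", "john doe"}  # illustrative blocklist
KEYBOARD_RUNS = ("asdf", "qwer", "zxcv")                # common test-account typing
PAYMENT_CEILING = 10_000.00                             # assumed implausibility threshold

def is_suspect(record: dict) -> bool:
    """Flag records that look like test data or implausible payments."""
    name = str(record.get("name", "")).strip().lower()
    if name in FAKE_NAMES:
        return True
    if any(run in name for run in KEYBOARD_RUNS):
        return True
    if float(record.get("payment", 0)) > PAYMENT_CEILING:
        return True
    return False
```

Suspect records should be routed to the Audit UI (step 5) rather than deleted outright, so a human can confirm the rejection.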

### 4. Reconcile
Resolve conflicting information across sources while preserving **source provenance**, so any canonical record can be traced back to its origin:
- Pricing discrepancies.
- Service code mismatches.
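One way to reconcile without losing provenance is a trust-ordered merge that keeps the losing readings alongside the winner. The source names and precedence order here are assumptions for the sketch:

```python
# Source precedence, most trusted first (assumed ordering).
PRECEDENCE = ["billing_db", "crm_export", "scanned_receipt"]

def reconcile(field: str, candidates: list[dict]) -> dict:
    """Pick the value from the most trusted source, keeping full provenance.

    Each candidate is {"value": ..., "source": ...}. The result records both
    the winning reading and every conflicting one, so the canonical record
    stays traceable and auditable.
    """
    ranked = sorted(candidates, key=lambda c: PRECEDENCE.index(c["source"]))
    winner = ranked[0]
    return {
        "field": field,
        "value": winner["value"],
        "source": winner["source"],
        "conflicts": ranked[1:],  # losing readings are preserved, not discarded
    }
```

Keeping `conflicts` instead of discarding them is what makes the trace-back guarantee possible: the canonical value is a decision, not an erasure.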

### 5. Audit UI
Build a **human-facing review interface** to check edge cases before final canonical staging. This is the step that operationalizes [[action-implement-human-validation]] and ensures [[concept-production-trust]].
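The gating logic behind such an interface can be sketched independently of the UI itself. Here `decide` is a stand-in for the human reviewer (the callback and its "accept"/"reject" protocol are assumptions of this sketch):

```python
def review_queue(flagged: list[dict], decide) -> list[dict]:
    """Route every flagged record through a human decision callback.

    `decide` stands in for the real review interface: it receives one
    record and returns "accept" or "reject". Only accepted records move
    on to canonical staging; nothing flagged bypasses a human.
    """
    accepted = []
    for record in flagged:
        if decide(record) == "accept":
            accepted.append(record)
    return accepted
```

The invariant is that flagged edge cases cannot reach the canonical database without an explicit human "accept", which is exactly the guarantee [[action-implement-human-validation]] asks for.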

## Required Background
See [[prereq-database-normalization]] for assumed knowledge of schema normalization, enum mapping, canonical records, and source provenance.


## Related across days
- [[concept-production-trust]]
- [[claim-gpt-5-5-caught-traps]]
- [[framework-private-bench-suite]]
