---
id: "prereq-gpu-memory-hierarchy"
type: "prereq"
source_timestamps: ["01:44:00", "18:10:00"]
tags: ["hardware", "infrastructure"]
related: ["entity-hbm", "framework-memory-optimization-landscape"]
reason: "Necessary to understand the physical constraints driving the need for software compression and tiering strategies."
sources: ["s49-killed-ram-limits"]
sourceVaultSlug: "s49-killed-ram-limits"
originDay: 49
---
# Understanding of GPU Memory Hierarchy

**Prerequisite**: Understanding of GPU Memory Hierarchy.

**Why**: Comprehending the difference between [[entity-hbm]] (on-package, stacked DRAM co-located with the GPU die), standard CPU RAM, and disk storage is necessary to understand:

1. Why **'offloading and tiering' strategies** (bucket #4 of [[framework-memory-optimization-landscape]]) are used — they trade latency for capacity by moving cold KV pairs to slower, cheaper substrates.
2. Why **HBM scarcity** specifically is a critical industry bottleneck — HBM bandwidth is irreplaceable for hot inference workloads even when CPU RAM is plentiful.
3. Why **software compression** like [[concept-turboquant]] matters most for the HBM tier — it directly multiplies effective HBM capacity.

Without this hierarchy in mind, the framing of the [[concept-ai-memory-crisis]] as specifically an **HBM** problem (rather than a generic 'memory' problem) is opaque.
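The 'offloading and tiering' idea from point 1 can be sketched as a small cache policy. This is a hypothetical illustration, not any vendor's implementation: the tier names, capacities, and the LRU demotion rule are all assumptions chosen only to mirror the hierarchy's ordering (HBM small and fast, CPU RAM larger and slower, disk largest and slowest).

```python
from collections import OrderedDict

# Illustrative per-tier capacities (number of KV entries).
# Hypothetical numbers; real tiers differ by orders of magnitude.
TIERS = ["hbm", "ram", "disk"]
CAPACITY = {"hbm": 2, "ram": 4, "disk": 1_000_000}

class TieredKVCache:
    """Minimal sketch of offloading/tiering: hot entries live in the
    fast tier; on overflow, the least-recently-used entry is demoted."""

    def __init__(self):
        # One LRU structure (insertion-ordered dict) per tier.
        self.tiers = {t: OrderedDict() for t in TIERS}

    def put(self, key, value):
        # New or re-touched entries are hot, so they go to HBM.
        self.tiers["hbm"][key] = value
        self.tiers["hbm"].move_to_end(key)
        self._demote_overflow()

    def get(self, key):
        # A hit in any tier promotes the entry back to HBM (hot again),
        # trading a one-time slow access for fast future accesses.
        for t in TIERS:
            if key in self.tiers[t]:
                value = self.tiers[t].pop(key)
                self.put(key, value)
                return value
        raise KeyError(key)

    def _demote_overflow(self):
        # Cascade the coldest (least recently used) entries downward.
        for upper, lower in zip(TIERS, TIERS[1:]):
            while len(self.tiers[upper]) > CAPACITY[upper]:
                cold_key, cold_val = self.tiers[upper].popitem(last=False)
                self.tiers[lower][cold_key] = cold_val

cache = TieredKVCache()
for i in range(5):
    cache.put(f"k{i}", i)  # earliest keys overflow HBM into RAM
print(sorted(cache.tiers["hbm"]))  # only the hottest keys remain in HBM
```

The capacity/latency trade is visible in `get`: a demoted key is found in a slower tier, paid for once, then promoted back to HBM.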
