---
id: "framework-memory-optimization-landscape"
type: "framework"
source_timestamps: ["16:22:00", "19:50:00"]
tags: ["industry-landscape", "taxonomy", "optimization"]
related: ["concept-turboquant", "concept-multi-head-latent-attention", "concept-kv-cache", "entity-h2o", "entity-deepseek-v2"]
steps_count: 5
sources: ["s49-killed-ram-limits"]
sourceVaultSlug: "s49-killed-ram-limits"
originDay: 49
---
# The 5 Approaches to AI Memory Optimization

[[concept-turboquant]] is just **one part** of a broader industry-wide attack on the memory bottleneck. There are five distinct vectors of innovation currently being pursued. Production systems can stack multiple approaches simultaneously.

## 1. Quantization

Compressing the data representation itself. **Examples**: [[concept-turboquant]], ZipCache.

The goal: pack the same information into fewer bits. Turboquant is the most aggressive published example, compressing the KV cache to as few as 3 bits per value with near-lossless quality.
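
As a rough illustration of the idea (not Turboquant's actual algorithm), a per-token asymmetric quantizer maps each 16-bit value to a small integer code plus a shared scale and zero point:

```python
import torch

def quantize_kv(x: torch.Tensor, bits: int = 3):
    """Per-token asymmetric quantization of a KV tensor (illustrative only)."""
    levels = 2 ** bits - 1
    # One scale/zero-point per token row so outlier tokens don't wreck the rest.
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / levels
    q = torch.round((x - x_min) / scale).clamp(0, levels).to(torch.uint8)
    return q, scale, x_min

def dequantize_kv(q, scale, zero):
    return q.to(torch.float32) * scale + zero

# 16-bit floats -> 3-bit codes: ~5.3x smaller cache (ignoring scale/zero overhead).
k = torch.randn(4096, 128, dtype=torch.float16)
q, scale, zero = quantize_kv(k.float(), bits=3)
k_hat = dequantize_kv(q, scale, zero)
print((k.float() - k_hat).abs().mean())  # reconstruction error stays small
```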

## 2. Eviction and Sparsity

Evicting cached tokens that receive little attention and keeping only the high-attention "heavy hitters". **Examples**: [[entity-h2o]]'s approach, SnapKV, StreamingLLM.

The goal: reduce the number of tokens stored, not the bits per token.
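
A minimal sketch of the heavy-hitter idea (the selection rule here is illustrative, not H2O's exact policy): keep a recent window plus the older tokens that have accumulated the most attention, and drop everything else.

```python
import torch

def h2o_style_eviction(attn_scores: torch.Tensor, budget: int, recent: int):
    """
    Decide which cached tokens to retain (illustrative sketch).
    attn_scores: [num_queries, num_cached_tokens] attention weights seen so far.
    Keeps a recent window plus the highest-cumulative-attention tokens
    from the older history, up to `budget` tokens total.
    """
    num_tokens = attn_scores.shape[-1]
    if num_tokens <= budget:
        return torch.arange(num_tokens)
    # Always keep the most recent tokens.
    recent_idx = torch.arange(num_tokens - recent, num_tokens)
    # Score older tokens by how much attention they have accumulated.
    cumulative = attn_scores[:, : num_tokens - recent].sum(dim=0)
    heavy_idx = cumulative.topk(budget - recent).indices
    return torch.cat([heavy_idx.sort().values, recent_idx])

scores = torch.rand(64, 10_000)                 # 64 queries over 10k cached tokens
keep = h2o_style_eviction(scores, budget=512, recent=128)
print(keep.shape)                               # 512 tokens kept, ~95% of the cache evicted
```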

## 3. Architectural Redesign

Changing the model structure to require less memory **by design** rather than by post-hoc compression. **Examples**: [[concept-multi-head-latent-attention]] in [[entity-deepseek-v2]], IBM Granite 4.0.

The goal: train models from scratch with smaller-footprint attention.
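
The shape of the trick, in toy form (dimensions and layer names here are illustrative, not DeepSeek-V2's actual configuration): project hidden states down to a small latent, cache only the latent, and re-expand to keys and values at attention time.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy low-rank KV compression in the spirit of multi-head latent attention."""
    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # output of this is cached
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # re-expanded at attention time
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def cache_entry(self, hidden):           # [batch, seq, d_model]
        return self.down(hidden)             # [batch, seq, d_latent]  <- what actually gets stored

    def expand(self, latent):
        return self.up_k(latent), self.up_v(latent)

m = LatentKVCache()
h = torch.randn(1, 2048, 4096)
latent = m.cache_entry(h)
# Cache holds 512 floats/token instead of 2 * 32 * 128 = 8192: a 16x reduction.
print(latent.shape)
```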

## 4. Offloading and Tiering

Shifting the KV cache from expensive GPU [[entity-hbm]] to cheaper CPU RAM or disk for high-throughput workloads. **Examples**: ShadowKV, FlexGen.

The goal: trade latency for capacity by moving cold KV pairs out of HBM.
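
A minimal sketch of the tiering idea, assuming a block-structured cache and a naive FIFO demotion policy (real systems such as ShadowKV and FlexGen are far more sophisticated about what stays hot and how transfers overlap with compute):

```python
import torch

class TieredKVCache:
    """
    Two-tier KV cache sketch: hot blocks stay on the GPU, cold blocks are
    parked in CPU RAM and copied back only when needed. Block granularity,
    FIFO demotion, and all names here are illustrative choices.
    """
    def __init__(self, hot_blocks: int = 8):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.hot_blocks = hot_blocks
        self.hot: dict[int, torch.Tensor] = {}    # resident in HBM
        self.cold: dict[int, torch.Tensor] = {}   # offloaded to host memory

    def put(self, block_id: int, kv: torch.Tensor) -> None:
        if len(self.hot) >= self.hot_blocks:
            victim = next(iter(self.hot))                      # naive FIFO demotion
            self.cold[victim] = self.hot.pop(victim).to("cpu")
        self.hot[block_id] = kv.to(self.device)

    def get(self, block_id: int) -> torch.Tensor:
        if block_id not in self.hot:
            # The latency cost of tiering: a host-to-device copy over the interconnect.
            self.put(block_id, self.cold.pop(block_id))
        return self.hot[block_id]
```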

## 5. Attention Optimization

Restructuring how the GPU **reads and writes** memory to minimize transfers and make computation cheaper. **Example**: Flash Attention.

The goal: reduce the bandwidth cost of attention by reordering memory access patterns.
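
The core trick can be sketched as an online softmax over key/value tiles; this shows the algorithmic idea only, not the fused kernel that makes Flash Attention fast in practice.

```python
import torch

def tiled_attention(q, k, v, block=256):
    """
    Online-softmax attention over KV tiles: the full [seq_q, seq_k] score
    matrix is never materialized, so the working set stays small.
    Single head, no masking; a sketch, not the kernel.
    q: [seq_q, d], k/v: [seq_k, d]
    """
    seq_q, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_q, 1), float("-inf"))
    row_sum = torch.zeros(seq_q, 1)

    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                    # only one [seq_q, block] tile at a time
        new_max = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
        correction = torch.exp(row_max - new_max)      # rescale what was already accumulated
        p = torch.exp(scores - new_max)
        out = out * correction + p @ vb
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))  # matches dense attention
```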

## How They Compose

A production stack can combine: MLA (architecture) + Turboquant (quantization) + H2O-style eviction + ShadowKV tiering + Flash Attention. The five vectors are largely orthogonal and stack multiplicatively.
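
A back-of-envelope sketch of how the footprint reductions multiply; the individual factors below are illustrative assumptions, not measured numbers, and only the first three vectors shrink the resident cache (tiering and attention optimization change where the bytes live and how they are read, not how many there are).

```python
# Illustrative arithmetic only; every factor here is an assumption.
baseline_bits_per_token = 2 * 32 * 128 * 16    # K+V, 32 heads, d_head=128, fp16
architectural = 8          # e.g. latent attention shrinks per-token state ~8x
quantization  = 16 / 3     # 16-bit floats -> 3-bit codes
eviction      = 4          # keep ~25% of tokens
combined = architectural * quantization * eviction
print(f"baseline: {baseline_bits_per_token} bits/token, combined reduction: ~{combined:.0f}x")
```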
